Variable Selection in Bioinformatics: Methods and Algorithms

Bhushan M.; Kulkarni S.; Modak S.; Valadi Krishnamoorthy Jayaraman

Machine learning algorithms have proved very useful in chemoinformatics and bioinformatics problems, where both supervised and unsupervised techniques are frequently employed. In protein function annotation, for example, many types of features can be extracted, including sequence-based, structure-based, expression profile-based, and composition-based features. In such large compendia of features, many may be noisy or carry no information relevant to the annotation problem at hand. In the presence of these irrelevant and noisy features, a classifier or regressor struggles to separate signal from noise, so it may fail to build an optimal model and its performance may suffer. Selecting a subset of informative features raises the information content of the data set, which tends to improve prediction accuracy and generalization and to reduce overfitting. It also lowers computational cost and speeds up training and testing. In bioinformatics, a model built on a small subset of relevant features is also easier to interpret and may yield important domain information in the form of identifiable biomarkers. Because feature selection is so important, several methodologies have been developed for it. In this chapter, we discuss in detail the different classes of feature selection methods and algorithms and their applications in chemoinformatics and bioinformatics. Feature selection techniques can be broadly classified as filter, wrapper, and embedded techniques. © 2021 River Publishers.
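
The difference between the filter and wrapper classes named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the chapter: the toy data set (feature 0 is informative, features 1 and 2 are noise), the function names, the use of absolute Pearson correlation as the filter score, and the use of a nearest-centroid classifier with leave-one-out accuracy as the wrapper criterion. Embedded techniques, which select features as part of model training itself, are not shown.

```python
import math

# Hypothetical toy data: 8 samples, 3 features. Feature 0 separates the
# two classes; features 1 and 2 are noise. Not taken from the chapter.
X = [
    [0.1, 5.0, 1.0], [0.2, 3.0, 2.0], [0.0, 4.0, 1.5], [0.3, 6.0, 2.5],
    [0.9, 5.5, 1.2], [1.0, 3.5, 2.2], [0.8, 4.5, 1.4], [1.1, 5.8, 2.4],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def pearson_abs(xs, ys):
    """Absolute Pearson correlation between one feature and the labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return abs(cov / (sx * sy)) if sx > 0 and sy > 0 else 0.0


def filter_rank(X, y):
    """Filter approach: score each feature on its own, without any model."""
    scores = [pearson_abs([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
        point = [X[i][j] for j in features]
        pred = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += pred == y[i]
    return correct / len(X)


def forward_select(X, y):
    """Wrapper approach: greedily add the feature that most improves the model."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scored = [(loo_accuracy(X, y, selected + [j]), j) for j in remaining]
        score, j = max(scored, key=lambda t: t[0])
        if score <= best:  # stop when no candidate improves accuracy
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected


print("filter ranking:", filter_rank(X, y))
print("wrapper selection:", forward_select(X, y))
```

In this sketch the filter scores each feature independently of any classifier, so it is cheap but blind to interactions between features, whereas the wrapper retrains and re-evaluates the classifier for every candidate subset, which is costlier but tuned to the model actually used.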