Feature Select: Applying Optimization Algorithms To Improve Model Accuracy
When you start out with machine learning, you are usually introduced to small, preprocessed datasets with only a handful of attributes (features) — but that is rarely the case in real life. More complex datasets available over the internet, and real-world ones, come with large numbers of data points and many attributes attached to each. This raises a question: why collect so many attributes rather than just the most relevant ones? One key reason is that it is rarely obvious up front which features are relevant to the task at hand; another is that different subsets of the collected features can serve different tasks. Hence the term feature selection, which means choosing the subset of most relevant features from the complete set given in a dataset. One way to achieve this is to analyze the effect of each feature on the target variable with statistical methods, but when the feature set comprises, say, 500 features, that becomes a very tedious task.
Various methods have been proposed in the past, famously "all but X": for each feature, evaluate the model on every feature except that one and measure the effect of its removal. This method is costly in computation time and also ineffective, since it considers only each feature's relation to the target variable, not how the features relate to one another.
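The "all but X" idea can be sketched in a few lines: score once with the full feature set, then once with each single feature removed, and rank features by how much the score drops without them. The scoring function below is a toy stand-in (fixed per-feature weights), not a real trained model.

```python
def toy_score(feature_indices):
    # Hypothetical stand-in for "train a model on these features and
    # return its accuracy": here each feature contributes a fixed weight.
    weights = {0: 0.30, 1: 0.05, 2: 0.25, 3: 0.01}
    return sum(weights[i] for i in feature_indices)

def all_but_x(n_features, score):
    # Score the full set, then each set with one feature left out.
    baseline = score(list(range(n_features)))
    drops = {}
    for x in range(n_features):
        subset = [i for i in range(n_features) if i != x]
        drops[x] = baseline - score(subset)  # accuracy lost without feature x
    return drops

importance = all_but_x(4, toy_score)
```

Note the two weaknesses the text describes: the method needs one extra model evaluation per feature, and because each feature is judged in isolation it cannot detect interactions between features.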
Optimization algorithms select the best possible solution, or move toward it, by comparing the results of candidate solutions. They have long been used in mathematics to find optimal results for a wide range of problems. By moving in the direction of better solutions, they examine only a small subset of the search space, which both reduces computation time and is effective at finding optimal, or at least near-optimal, solutions.
This idea of searching for an optimal solution applies directly to the problem at hand: using this concept, and the various algorithms built around it, we can formulate the search for the best subset of features of a given dataset as a mathematical optimization problem. That is the idea behind Feature Select.
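The formulation is straightforward to write down: a candidate solution is a binary mask over the features, and the objective maps a mask to a score — normally cross-validated model accuracy, here replaced by a toy surrogate so the sketch runs on its own.

```python
import itertools

def objective(mask, weights=(0.4, 0.1, 0.3, 0.2), penalty=0.12):
    # Toy surrogate for model accuracy: useful features add their weight,
    # and every kept feature pays a small complexity penalty.
    gain = sum(w for keep, w in zip(mask, weights) if keep)
    return gain - penalty * sum(mask)

# Exhaustive search is only feasible for tiny feature sets (2**n masks);
# optimization algorithms explore this space without enumerating it.
best = max(itertools.product([0, 1], repeat=4), key=objective)
# Only features whose contribution outweighs the penalty are kept.
```

With 500 features the mask space has 2^500 candidates, which is exactly why the exhaustive `max` above must be replaced by an optimizer.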
Feature Select is a python package that can be used to perform the task of selecting the best features based on the results they tend to produce. It works currently for a numerical set of features only.
Under the hood, it uses optimization algorithms to select the best set of features for a given dataset and machine learning algorithm. As of now, it offers Differential Evolution, a Genetic Algorithm, Particle Swarm Optimization, and Simulated Annealing to perform the task.
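To give a flavor of how one of these optimizers behaves on feature masks, here is a minimal simulated annealing sketch (my own illustration, not the package's implementation): flip one feature in or out per step, and accept worse masks with a probability that shrinks as the temperature cools.

```python
import math
import random

def simulated_annealing(n_features, objective, epochs=200, seed=0):
    # Start from a random mask and flip one bit per step.
    rng = random.Random(seed)
    mask = [rng.randint(0, 1) for _ in range(n_features)]
    score = objective(mask)
    best, best_score = mask[:], score
    for step in range(epochs):
        temp = max(1e-3, 1.0 - step / epochs)   # linear cooling schedule
        cand = mask[:]
        cand[rng.randrange(n_features)] ^= 1    # flip one feature in/out
        cand_score = objective(cand)
        # Always accept improvements; accept worse masks with
        # probability exp(delta / temp), which vanishes as temp cools.
        if cand_score >= score or rng.random() < math.exp((cand_score - score) / temp):
            mask, score = cand, cand_score
            if score > best_score:
                best, best_score = mask[:], score
    return best, best_score

# Toy objective: features 0 and 2 help, every kept feature costs a little.
obj = lambda m: 0.5 * m[0] + 0.4 * m[2] - 0.05 * sum(m)
mask, score = simulated_annealing(6, obj)
```

The early high-temperature phase lets the search escape poor masks; the late phase behaves like hill climbing, which is what makes these methods cheaper than enumerating the whole mask space.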
To try it yourself, you need to install it through the pip command:
pip install featureselect
Using an optimizer is as simple as importing the optimizer function from the library and calling it. Each optimizer function must be given a subset of examples with the entire feature set, along with the intended target variable; the class of the machine learning algorithm to use; the names of the methods used to train and evaluate that algorithm; and, lastly, the number of epochs for which the algorithm should run.
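The snippet below is a runnable stand-in that mirrors that call shape — it is not the package's actual API, and the function, parameter, and class names are assumptions based on the description above. It does a simple random search over feature masks where a real optimizer would run GA/DE/PSO/SA, and uses a trivial model so the sketch is self-contained.

```python
import random

def optimize_features(X, y, model_class, fit_method, eval_method, epochs=10, seed=0):
    # Hypothetical stand-in matching the described interface: examples and
    # target, the model class, the names of its train/evaluate methods,
    # and an epoch count. Random search stands in for the real optimizers.
    rng = random.Random(seed)
    n = len(X[0])
    best_mask, best_score = None, float("-inf")
    for _ in range(epochs):
        mask = [rng.randint(0, 1) for _ in range(n)]
        if not any(mask):
            continue  # skip the empty feature subset
        cols = [i for i, keep in enumerate(mask) if keep]
        Xs = [[row[i] for i in cols] for row in X]
        model = model_class()
        getattr(model, fit_method)(Xs, y)        # train on the subset
        score = getattr(model, eval_method)(Xs, y)
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score

class MajorityClassifier:
    # Trivial model so the sketch runs end to end.
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
    def score(self, X, y):
        return sum(lbl == self.label for lbl in y) / len(y)

X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = [0, 0, 0, 1]
mask, acc = optimize_features(X, y, MajorityClassifier, "fit", "score", epochs=10)
```

Passing method names as strings, as the description suggests, lets the optimizer work with any model class that exposes train and evaluate methods, whatever they are called.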
In an example run, within just 10 iterations the model's accuracy improved by nearly 10 percentage points, while the number of features was cut in half, from 34 to 17.
Hence Feature Select proves to be a good choice for the task of, well, selecting features — for large and small datasets alike.
Check out the code repository of the library at: