Practical 2


Feature Selection/Elimination using Scikit-learn in Python

What is feature selection?

Feature selection is one of the most significant steps in machine learning. It is the process of narrowing the features used in predictive modelling down to a subset, without sacrificing the information they carry.

Why feature selection?

1. It improves model performance: irrelevant features in your data act as noise, which makes machine learning models perform poorly.

2. It leads to faster machine learning models.

3. It prevents overfitting: if the data has more columns than rows, a model can fit the training data perfectly yet fail to generalize to new samples, so it learns nothing useful.

4. Removing garbage: much of the time we have non-informative features, such as name or ID variables. Poor-quality input will generate poor-quality output.


Different Methods of Feature Selection/Elimination:

1. Variance threshold

This method removes features whose variance falls below a certain cutoff. The idea is that when a feature doesn't vary much within itself, it generally has very little predictive power. Note that Variance Threshold does not consider the relationship of a feature with the target variable.
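As a quick illustration (my own toy example, not part of the original walkthrough), scikit-learn's VarianceThreshold can be applied like this; the DataFrame and the 0.2 cutoff are made up for demonstration:

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({'almost_constant': [0, 0, 0, 0, 1],  # variance 0.16, below the cutoff
                    'informative': [3, 7, 1, 9, 5]})      # variance 8.0, kept

selector = VarianceThreshold(threshold=0.2)  # the 0.2 cutoff is an arbitrary choice
selector.fit(toy)
toy.columns[selector.get_support()]  # Index(['informative'], dtype='object')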

2. Univariate feature selection

Univariate feature selection works by choosing the best features on the basis of univariate statistical tests such as chi-square. It tests each feature independently to assess the strength of its relationship with the response variable. SelectKBest is one such method: it eliminates all but the specified number of highest-scoring features.

3. Recursive feature elimination

RFE begins by fitting a model on the entire set of features and computing an importance score for each predictor. The weakest features are then removed, the model is re-fitted, and importance scores are computed again, repeating until the specified number of features remains. Feature importances are taken from the model's coef_ or feature_importances_ attribute, and a small number of features is eliminated in each loop.


4. PCA (Principal Component Analysis)

Principal Component Analysis, or PCA, is a dimensionality-reduction method often used on large data sets: it transforms a large set of variables into a smaller one that still contains most of the information in the original set.


5. Correlation

Correlation is a statistical term which in common usage refers to how close two variables are to having a linear relationship with each other.
Features with high correlation are more linearly dependent and hence have almost the same effect on the dependent variable. So, when two features have high correlation, we can drop one of the two features.
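To make the "drop one of each highly correlated pair" idea concrete, here is a rough sketch (my own helper, not from the original post); the drop_correlated name and the 0.9 threshold are arbitrary choices for illustration:

import numpy as np
import pandas as pd

def drop_correlated(frame, threshold=0.9):  # hypothetical helper; 0.9 is an arbitrary cutoff
    corr = frame.corr().abs()
    # keep only the upper triangle so each feature pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)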

Let's start digging!!

1. First of all, download a dataset with many numerical features spanning different ranges so you can see the effect of feature selection. Here, I am using this dataset.

2. Importing required packages.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
3. Importing dataset

df=pd.read_csv('data.csv')
df

4. Data Exploration

df.dtypes

5. Data Cleaning

df['Gender'].unique()
array(['Male', 'Female'], dtype=object)

df.loc[df['Gender'] == 'Male', 'Gender'] = 1
df.loc[df['Gender'] == 'Female', 'Gender'] = 0
df['Vehicle_Age'].unique()
array(['> 2 Years', '1-2 Year', '< 1 Year'], dtype=object)

df.loc[df['Vehicle_Age'] == '> 2 Years', 'Vehicle_Age'] = 2
df.loc[df['Vehicle_Age'] == '1-2 Year', 'Vehicle_Age'] = 1
df.loc[df['Vehicle_Age'] == '< 1 Year', 'Vehicle_Age'] = 0
df['Vehicle_Damage'].unique()
array(['Yes', 'No'], dtype=object)
df.loc[df['Vehicle_Damage'] == 'Yes', 'Vehicle_Damage'] = 1
df.loc[df['Vehicle_Damage'] == 'No', 'Vehicle_Damage'] = 0
for col in df.columns:
    df[col] = df[col].astype(np.int32)
df.dtypes
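As a side note (not part of the original code), the same encoding can be written more compactly with pandas' map. Run either this or the .loc assignments above, not both, since map applied to the already-converted integer values would produce NaN:

df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
df['Vehicle_Age'] = df['Vehicle_Age'].map({'> 2 Years': 2, '1-2 Year': 1, '< 1 Year': 0})
df['Vehicle_Damage'] = df['Vehicle_Damage'].map({'Yes': 1, 'No': 0})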

6. Creating training and test dataset

x = df[['Gender', 'Age', 'Driving_License', 'Region_Code','Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium','Policy_Sales_Channel', 'Vintage']]
y = df['Response']
X_train,X_test,Y_train,Y_test = train_test_split(x,y,test_size=0.2)
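A small caveat (my addition, not from the original post): the split above is random, so the accuracy scores reported below will vary from run to run. Passing a fixed random_state makes the split, and therefore the scores, reproducible:

X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=42)  # 42 is an arbitrary seed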
7. Applying KNN model before Feature Selection.

knn=KNeighborsClassifier()
knn.fit(X_train,Y_train)
accuracy_score(Y_test,knn.predict(X_test))
    accuracy_score = 0.8568523523392196

8. Applying Univariate feature selection

X_best = SelectKBest(chi2, k=5).fit(X_train, Y_train)
mask = X_best.get_support()  # boolean mask of the selected features
new_feat = []
for selected, feature in zip(mask, X_train.columns):
    if selected:
        new_feat.append(feature)
new_feat
The best 5 features are
['Age',
 'Previously_Insured',
 'Vehicle_Damage',
 'Annual_Premium',
 'Policy_Sales_Channel']

x_train = X_train[new_feat]
x_test = X_test[new_feat]
del knn
knn=KNeighborsClassifier()
knn.fit(x_train,Y_train)
accuracy_score(Y_test,knn.predict(x_test))
    accuracy_score = 0.8611555718821338

9. Applying Recursive feature elimination

estimator = RandomForestClassifier()
selector = RFE(estimator, n_features_to_select = 5)
selector = selector.fit(X_train, Y_train)
rfe_mask = selector.get_support()  # boolean mask of the selected features
new_feat = []
for selected, feature in zip(rfe_mask, X_train.columns):
    if selected:
        new_feat.append(feature)
new_feat  # the list of the 5 best features
The best 5 features are    
    ['Age', 'Region_Code', 'Vehicle_Damage', 'Annual_Premium', 'Vintage']

del x_train
del x_test
x_train = X_train[new_feat]
x_test = X_test[new_feat]
del knn
knn=KNeighborsClassifier()
knn.fit(x_train,Y_train)
accuracy_score(Y_test,knn.predict(x_test))
    accuracy_score = 0.8593057122615518
 
10. Applying PCA (Principal Component Analysis)

pca = PCA(0.95)
pca.fit(X_train)
del x_train
del x_test
x_train = pca.transform(X_train)
x_test = pca.transform(X_test)
del knn
knn=KNeighborsClassifier()
knn.fit(x_train,Y_train)
accuracy_score(Y_test,knn.predict(x_test))
    accuracy_score = 0.8628873553567211
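
If you are curious how many components PCA(0.95) actually retained (this is not reported in the original run), the fitted PCA object exposes it directly:

print(pca.n_components_)              # how many components were kept to reach 95% explained variance
print(pca.explained_variance_ratio_)  # variance explained by each retained component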

11. Using Correlation

corr = df.corr()
plt.figure(figsize=(15,15))
sns.heatmap(df.corr(),annot=True)
abs(corr['Response']).sort_values(ascending = False)


Selecting the top 5 features by absolute correlation with Response
x_train = X_train[['Vehicle_Damage','Previously_Insured','Vehicle_Age','Policy_Sales_Channel','Age']]
x_test = X_test[['Vehicle_Damage','Previously_Insured','Vehicle_Age','Policy_Sales_Channel','Age']]
del knn
knn=KNeighborsClassifier()
knn.fit(x_train,Y_train)
accuracy_score(Y_test,knn.predict(x_test))
    accuracy_score = 0.8558159061688226


From the above results we can see that applying different feature selection techniques yields different feature subsets, and selecting different features results in different accuracy scores.


Please like, share, and comment.

