Data Preprocessing in Python using Scikit Learn
What is Data Preprocessing?
Data Preprocessing is a technique used to convert raw data into a clean data set. In other words, data gathered from different sources is collected in a raw format that is not feasible for analysis.
Therefore, certain steps are executed to convert the data into a small, clean data set. This technique is performed before the execution of iterative analysis. The set of steps is known as Data Preprocessing. It includes:
- Data Cleaning
- Data Integration
- Data Transformation
- Data Reduction
Need for Data Preprocessing
The data must be in a proper format to obtain better outcomes from the models implemented in Machine Learning and Deep Learning projects; this is where data preparation comes in.
Some Machine Learning and Deep Learning models need information in a specific format. For example, the Random Forest algorithm does not support null values, so null values have to be handled in the original raw data set before the algorithm can be executed, as in the sketch below.
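For instance, here is a minimal sketch (on a made-up toy array, not a real dataset) of handling null values with scikit-learn's SimpleImputer before fitting a Random Forest:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# hypothetical toy data with missing entries (NaN)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([0, 1, 0, 1])

# replace each NaN with its column's median before fitting
X_imputed = SimpleImputer(strategy='median').fit_transform(X)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_imputed, y)  # fitting on the raw X raises an error in most scikit-learn versions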
Various data pre-processing techniques:
Standardization:
Data standardization is the method by which one or more attributes are rescaled such that they have a mean value of 0 and a standard deviation of 1.
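A quick sketch of this with scikit-learn's StandardScaler, on toy numbers:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # a toy attribute
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())  # ~0.0 and 1.0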
Normalization:
The aim of normalization is to adjust the numeric column values to a standard scale in the dataset, without distorting the variations in the value ranges.
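scikit-learn's MinMaxScaler does this, rescaling each column to a fixed range (default [0, 1]); a small sketch on toy numbers:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [40.0]])
X_norm = MinMaxScaler().fit_transform(X)
print(X_norm.ravel())  # [0.  0.333...  1.]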
One-hot Encoding:
One-hot encoding is a process that transforms categorical data into a form that can be given to ML algorithms to do a better prediction job, since most algorithms accept only numerical input. A Label Encoder first converts the categories into integer codes, and the one-hot encoder then expands each code into its own binary column.
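A small sketch of both steps on a made-up column (on scikit-learn older than 1.2, use sparse=False instead of sparse_output=False):
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(['red', 'green', 'blue', 'green'])  # made-up categorical column

labels = LabelEncoder().fit_transform(colors)  # strings -> integer codes, e.g. [2 1 0 1]

# integer codes -> one binary column per category
onehot = OneHotEncoder(sparse_output=False).fit_transform(labels.reshape(-1, 1))
print(onehot)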
Discretization:
Discretization refers to the method of converting or partitioning continuous attributes, features, or variables into discretized or nominal intervals.
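scikit-learn offers KBinsDiscretizer for this; a minimal sketch that bins a toy continuous column into three equal-width intervals:
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[1.0], [5.0], [9.0], [13.0], [17.0], [21.0]])
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
print(disc.fit_transform(X).ravel())  # [0. 0. 1. 1. 2. 2.]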
Imputation:
For missing data, the imputation technique makes reasonable estimates. It is most beneficial when the amount of missing data is small; if the proportion of missing information is too large, the imputed values erase the natural variance needed to produce an effective model.
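A minimal sketch with scikit-learn's SimpleImputer, filling each NaN with its column's mean on toy data:
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 6.0]])
imp = SimpleImputer(strategy='mean')
print(imp.fit_transform(X))
# NaNs become the column means: 2.0 in column 0, 5.0 in column 1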
What is Scikit Learn?
Scikit-learn is a Python library that provides a broad range of algorithms for supervised and unsupervised learning.
Scikit Learn is built on top of many common Python data and math libraries. Such a design makes the integration between them all very simple. You can pass NumPy arrays and pandas data frames straight to scikit-learn's ML algorithms. It uses libraries such as NumPy, SciPy, and matplotlib.
Let's start digging!!
1. First of all, download your own dataset with many numerical features of different value ranges to see the effect of data preprocessing. Here, I am using this dataset.
2. Importing required packages.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
3. Importing dataset
df = pd.read_csv('data.csv', na_values=['?'])  # '?' marks missing values in the raw file
df
4. Data Exploration
df.hist(figsize=(18, 18))
plt.show()
5. Data Cleaning
df.fillna(df.median(), inplace=True)  # fill missing values with each column's median
df = df.drop_duplicates()
6. Creating training and test dataset
x = df[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
        'exang', 'oldpeak', 'slope', 'thal']]
y = df['num']
# split on the feature matrix x, not the full df (which still contains the target)
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2)
7. Applying KNN model before preprocessing.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, Y_train)
accuracy_score(Y_test, knn.predict(X_test))
accuracy_score= 0.7288135593220338
8. Now applying feature scaling.
min_max = MinMaxScaler()
X_train_minmax = min_max.fit_transform(X_train[['trestbps', 'chol', 'thalach']])
# use transform (not fit_transform) on the test set, so it is rescaled
# with the statistics learned from the training set
X_test_minmax = min_max.transform(X_test[['trestbps', 'chol', 'thalach']])
9. Applying KNN model after feature scaling.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_minmax, Y_train)
accuracy_score(Y_test, knn.predict(X_test_minmax))
accuracy_score= 0.7457627118644068 (2% increase)
10. Applying Feature Standardization
# fit the scaler on the training set only and reuse it on the test set
scaler = StandardScaler()
X_train_scale = scaler.fit_transform(X_train[['trestbps', 'chol', 'thalach']])
X_test_scale = scaler.transform(X_test[['trestbps', 'chol', 'thalach']])
log = LogisticRegression()
log.fit(X_train_scale, Y_train)
accuracy_score(Y_test, log.predict(X_test_scale))
accuracy_score= 0.6949152542372882 (3% decrease)
11. Applying One-Hot Encoding
enc = OneHotEncoder(sparse=False)  # use sparse_output=False on scikit-learn >= 1.2
X_train_1 = X_train.copy()
X_test_1 = X_test.copy()
columns = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'oldpeak', 'slope', 'thal']
for col in columns:
    # fit on train and test values together so every category is seen
    data = pd.concat([X_train[[col]], X_test[[col]]])
    enc.fit(data)
    # Applying One Hot Encoding on train data
    temp = enc.transform(X_train[[col]])
    # Changing the encoded features into a data frame, with new column names
    # taken from the encoder's categories so names match the column order
    temp = pd.DataFrame(temp, columns=[col + "_" + str(i) for i in enc.categories_[0]])
    # In side-by-side concatenation the index values should be the same,
    # so set the index values to match the X_train data frame
    temp = temp.set_index(X_train.index.values)
    # adding the new One Hot Encoded variables to the train data frame
    X_train_1 = pd.concat([X_train_1, temp], axis=1)
    # applying One Hot Encoding on test data
    temp = enc.transform(X_test[[col]])
    # changing it into a data frame and adding column names
    temp = pd.DataFrame(temp, columns=[col + "_" + str(i) for i in enc.categories_[0]])
    # setting the index for proper concatenation
    temp = temp.set_index(X_test.index.values)
    # adding the new One Hot Encoded variables to the test data frame
    X_test_1 = pd.concat([X_test_1, temp], axis=1)
X_train_1.columns
scaler = StandardScaler()
X_train_scale = scaler.fit_transform(X_train_1)
X_test_scale = scaler.transform(X_test_1)
log.fit(X_train_scale, Y_train)
accuracy_score(Y_test, log.predict(X_test_scale))
accuracy_score = 1.0 (28% increase)
From the above results, we can see that applying different preprocessing techniques produces different results.
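As a follow-up, a cleaner way to wire such steps together is scikit-learn's Pipeline with a ColumnTransformer, which fits every transformer on the training data only and reapplies it to the test data. Here is a minimal sketch assuming the same column names as above (note that it treats oldpeak as numeric, whereas the walkthrough one-hot encoded it):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
categorical = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']

preprocess = ColumnTransformer([
    ('num', MinMaxScaler(), numeric),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
])

model = Pipeline([('prep', preprocess), ('clf', LogisticRegression(max_iter=1000))])
model.fit(X_train, Y_train)         # preprocessing is fit on the training fold only
print(model.score(X_test, Y_test))  # and reapplied, unchanged, to the test fold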
Questions and Answers
How do you decide the variance threshold in data reduction?
The estimation of the variance threshold depends on the probability density function of the specific feature's distribution; features whose variance falls below the chosen threshold carry little information and are dropped. See the sketch below.
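scikit-learn exposes this as VarianceThreshold in sklearn.feature_selection; a minimal sketch that drops a near-constant toy feature:
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0], [0.1, 4.0]])
selector = VarianceThreshold(threshold=0.05)  # drop features with variance <= 0.05
print(selector.fit_transform(X))  # only the second column survives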
Is the output the same when the model is applied to encoded data vs. the original data?
No. Most machine learning algorithms require numerical input and output variables, so applying one-hot encoding typically increases the accuracy of the ML algorithm.
Please like, share, and comment.