Data Preparation For Machine Learning Model Building
In the analytics process, building an ML model for prescriptive or predictive analysis is very common. We need to understand the business objectives and align them with our data.
Understanding the data and preparing it for the model can largely be automated nowadays, thanks to the many libraries available in Python. Having said that, there are many moving parts, so we should know what each step does and what its output will be.
Open your Jupyter Notebook or Google Colab for reading the Python files:
Reading in Colab:
import io
import pandas as pd
from google.colab import files

# Upload the CSV through the Colab file picker, then read it from memory
uploaded = files.upload()
data = pd.read_csv(io.BytesIO(uploaded['BlackFriday.csv']))
print(data.head(5))
Reading in Jupyter:
import pandas as pd
data = pd.read_csv('BlackFriday.csv')
print(data.head(5))

Output:

   User_ID Product_ID Gender  ... Product_Category_2  Product_Category_3  Purchase
0  1000001  P00069042      F  ...                NaN                 NaN      8370
1  1000001  P00248942      F  ...                6.0                14.0     15200
2  1000001  P00087842      F  ...                NaN                 NaN      1422
3  1000001  P00085442      F  ...               14.0                 NaN      1057
4  1000002  P00285442      M  ...                NaN                 NaN      7969
Understanding the data type of each column:
data.info()

Output:

RangeIndex: 537577 entries, 0 to 537576
Data columns (total 12 columns):
User_ID 537577 non-null int64
Product_ID 537577 non-null object
Gender 537577 non-null object
Age 537577 non-null object
Occupation 537577 non-null int64
City_Category 537577 non-null object
Stay_In_Current_City_Years 537577 non-null object
Marital_Status 537577 non-null int64
Product_Category_1 537577 non-null int64
Product_Category_2 370591 non-null float64
Product_Category_3 164278 non-null float64
Purchase 537577 non-null int64
dtypes: float64(2), int64(5), object(5)
As we can see, we have 5 categorical columns and 7 continuous variables. This means that if we want to feed this data into a model, we have to convert the categorical columns into numeric form, since ML models are inherently mathematical.
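The categorical/numeric split above was read off the `info()` output by eye; it can also be derived programmatically with `select_dtypes`. A small sketch on a toy frame (the column names mirror the dataset, but the values are made up for illustration):

```python
import pandas as pd

# Toy frame mixing categorical (object) and numeric columns
df = pd.DataFrame({
    'Gender': ['M', 'F', 'M'],        # categorical
    'Age': ['0-17', '26-35', '55+'],  # categorical (stored as strings)
    'Occupation': [10, 16, 15],       # numeric
    'Purchase': [8370, 15200, 7969],  # numeric
})

categorical = df.select_dtypes(include='object').columns.tolist()
numeric = df.select_dtypes(include='number').columns.tolist()
print(categorical)  # ['Gender', 'Age']
print(numeric)      # ['Occupation', 'Purchase']
```

On the real dataset, `data.select_dtypes(include='object')` would pick out the 5 categorical columns that need encoding.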
Now let’s understand the data column by column for each type:
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of Purchase; the spike comes from a particular product sold at a low price
sns.distplot(data['Purchase'])

# Purchase histograms split by gender
fig, ax = plt.subplots(2, 1, figsize=(12, 5), sharex=True)
ax[0].hist(data['Purchase'][data['Gender'] == 'M'])
ax[1].hist(data['Purchase'][data['Gender'] == 'F'], color='Red')

# Counts per age group as bar and pie charts
data['Age'].value_counts().plot.bar()
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
data['Age'].value_counts().plot.pie(explode=[0.1, 0, 0, 0, 0, 0, 0], ax=ax[0])
ax[0].set_title('Age Group Pie')
ax[1].set_title('Age Group Bar')
sns.countplot(x='Age', data=data, ax=ax[1])
Now that we have seen that there are multiple categories present in different columns, and that the distribution of the various continuous variables is roughly normal, let's move to the next step.
Convert categorical variables to numeric form:
By using n-1 dummies
data2 = data['Age'].to_frame()
data1 = pd.get_dummies(data2, columns=['Age'], drop_first=True)
data1.head()
In this step we have converted each category into a column, with 0 or 1 representing the absence or presence of that value. In a similar way we can convert the other categorical columns, so that the data is represented in mathematical form and can be fed to the model.
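To see why `drop_first=True` gives n-1 dummies without losing information, here is a small sketch on a made-up `City_Category` column (values illustrative): with three categories, only two dummy columns are produced, and the dropped category is encoded implicitly as all zeros.

```python
import pandas as pd

# Three categories A/B/C -> only two dummy columns when drop_first=True
cities = pd.DataFrame({'City_Category': ['A', 'B', 'C', 'A']})
dummies = pd.get_dummies(cities, columns=['City_Category'], drop_first=True).astype(int)

print(dummies.columns.tolist())  # ['City_Category_B', 'City_Category_C']
print(dummies.iloc[0].tolist())  # [0, 0] -> row 0 is category 'A', the dropped one
```

Dropping one column also avoids perfectly collinear features (the "dummy variable trap") in linear models.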
By using label encoding
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# LabelEncoder expects a 1-D array, so pass the column as a Series
data['Age_code'] = le.fit_transform(data['Age'])
data['Age_code'].head()

Output:

0    0
1    0
2    0
3    0
4    6
Name: Age_code, dtype: int64
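The codes above are not arbitrary: `LabelEncoder` sorts the distinct labels and numbers them in that order, and the mapping stays available on the fitted encoder. A small sketch on toy age labels (values illustrative):

```python
from sklearn.preprocessing import LabelEncoder

ages = ['0-17', '26-35', '0-17', '55+']
le = LabelEncoder()
codes = le.fit_transform(ages)

print(list(codes))        # [0, 1, 0, 2] -- codes follow sorted label order
print(list(le.classes_))  # ['0-17', '26-35', '55+']
print(list(le.inverse_transform(codes)))  # back to the original labels
```

One caveat: label encoding imposes an ordering on the categories, which is fine for an ordinal column like `Age` but can mislead some models on nominal columns like `City_Category`, where dummies are usually safer.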
Next step: standardise the data. Since the columns can be on different scales, we should standardise or normalise the data first:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

mm = MinMaxScaler()
sd = StandardScaler()

# Work on a copy to avoid pandas' SettingWithCopyWarning
data_p = data[['Purchase']].copy()
dfmm = mm.fit_transform(data_p['Purchase'].values.reshape(-1, 1))
dfsd = sd.fit_transform(data_p['Purchase'].values.reshape(-1, 1))
data_p['Purchase_mm'] = dfmm
data_p['Purchase_sd'] = dfsd
data_p.head()
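Under the hood these two scalers are simple formulas: StandardScaler computes z = (x - mean) / std (using the population standard deviation), and MinMaxScaler computes (x - min) / (max - min), mapping values into [0, 1]. A quick sketch on toy values verifying both against manual arithmetic:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[10.0], [20.0], [30.0], [40.0]])

z = StandardScaler().fit_transform(x)  # (x - mean) / std
m = MinMaxScaler().fit_transform(x)    # (x - min) / (max - min)

# Manual versions of the same formulas
z_manual = (x - x.mean()) / x.std()    # np.std defaults to population std, matching sklearn
m_manual = (x - x.min()) / (x.max() - x.min())

print(np.allclose(z, z_manual))  # True
print(np.allclose(m, m_manual))  # True
```

Standardisation preserves the shape of the distribution while centring it at 0; min-max scaling bounds the range but is sensitive to outliers, which matters for a skewed column like `Purchase`.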
Split the data into train and test sets:
Once all the features are in numeric form, we should split the data for training and testing purposes:
from sklearn.model_selection import train_test_split

x = data.drop('Purchase', axis=1)
y = data['Purchase']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)
print(X_train.shape, X_test.shape)
Output:
(376303, 12) (161274, 12)
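The shapes above show the 70/30 split of the 537,577 rows; `random_state=42` just fixes the shuffle so the split is reproducible. The same behaviour on a small toy array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, split 70/30 with a fixed seed
X = np.arange(200).reshape(100, 2)
y = np.arange(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

print(X_train.shape, X_test.shape)  # (70, 2) (30, 2)
```

Re-running with the same `random_state` always yields the same partition, which keeps experiments comparable.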
Now your data is ready for the model and can be used for either classification or regression, depending on your analytics problem.
Thanks !