Data Preparation For Machine Learning Model Building
In the analytics process, building an ML model for prescriptive or predictive analysis is very common. We need to understand the business objectives and align them with our data.
Understanding the data and preparing it for the model can largely be automated nowadays, thanks to the many libraries available in Python. Having said that, there are many moving parts, so we should know what each step does and what its output will be.
Open your Jupyter Notebook or Google Colab for reading the Python files:
Reading in Colab:
import io
import pandas as pd
from google.colab import files

# Upload the CSV through the Colab file picker, then read it from memory
uploaded = files.upload()
data = pd.read_csv(io.BytesIO(uploaded['BlackFriday.csv']))
print(data.head(5))
Reading in Jupyter:
import pandas as pd
data = pd.read_csv('BlackFriday.csv')
print(data.head(5))

Output:

   User_ID Product_ID Gender  ... Product_Category_2  Product_Category_3  Purchase
0  1000001  P00069042      F  ...                NaN                 NaN      8370
1  1000001  P00248942      F  ...                6.0                14.0     15200
2  1000001  P00087842      F  ...                NaN                 NaN      1422
3  1000001  P00085442      F  ...               14.0                 NaN      1057
4  1000002  P00285442      M  ...                NaN                 NaN      7969
Understanding the data type of each column:
data.info()

Output:

RangeIndex: 537577 entries, 0 to 537576
Data columns (total 12 columns):
User_ID 537577 non-null int64
Product_ID 537577 non-null object
Gender 537577 non-null object
Age 537577 non-null object
Occupation 537577 non-null int64
City_Category 537577 non-null object
Stay_In_Current_City_Years 537577 non-null object
Marital_Status 537577 non-null int64
Product_Category_1 537577 non-null int64
Product_Category_2 370591 non-null float64
Product_Category_3 164278 non-null float64
Purchase 537577 non-null int64
dtypes: float64(2), int64(5), object(5)
As we can see, we have 5 categorical columns and 7 continuous variables. This means that if we want to feed this data into a model, we have to convert the categorical columns into numeric form, since ML models are inherently mathematical.
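The categorical/numeric split above was read off the `info()` output by eye; it can also be derived programmatically with `select_dtypes`. A small sketch on a toy frame (the column names mirror the dataset, but the values are made up for illustration):

```python
import pandas as pd

# Toy frame mixing categorical (object) and numeric columns
df = pd.DataFrame({
    'Gender': ['M', 'F', 'M'],        # categorical
    'Age': ['0-17', '26-35', '55+'],  # categorical (stored as strings)
    'Occupation': [10, 16, 15],       # numeric
    'Purchase': [8370, 15200, 7969],  # numeric
})

categorical = df.select_dtypes(include='object').columns.tolist()
numeric = df.select_dtypes(include='number').columns.tolist()
print(categorical)  # ['Gender', 'Age']
print(numeric)      # ['Occupation', 'Purchase']
```

On the real dataset, `data.select_dtypes(include='object')` would pick out the 5 categorical columns that need encoding.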
Now let’s understand the data column by column for each type:
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of Purchase; the spike comes from a particular product sold at a low price
sns.distplot(data['Purchase'])

# Purchase histograms split by gender
fig, ax = plt.subplots(2, 1, figsize=(12, 5), sharex=True)
ax[0].hist(data['Purchase'][data['Gender'] == 'M'])
ax[1].hist(data['Purchase'][data['Gender'] == 'F'], color='Red')

# Counts per age group as bar and pie charts
data['Age'].value_counts().plot.bar()
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
data['Age'].value_counts().plot.pie(explode=[0.1, 0, 0, 0, 0, 0, 0], ax=ax[0])
ax[0].set_title('Age Group Pie')
ax[1].set_title('Age Group Bar')
sns.countplot(x='Age', data=data, ax=ax[1])
Now that we have seen that there are multiple categories present in different columns, and that the distribution of the various continuous variables is roughly normal, let's move to the next step.
Convert categorical variables to numeric form:
By using n-1 dummies
data2 = data['Age'].to_frame()
data1 = pd.get_dummies(data2, columns=['Age'], drop_first=True)
data1.head()
In this step we have converted each category into a column, with 0 or 1 representing the absence or presence of that value. In a similar way we can convert the other categorical columns, so that the data is represented in mathematical form and can be fed to the model.
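To see why `drop_first=True` gives n-1 dummies without losing information, here is a small sketch on a made-up `City_Category` column (values illustrative): with three categories, only two dummy columns are produced, and the dropped category is encoded implicitly as all zeros.

```python
import pandas as pd

# Three categories A/B/C -> only two dummy columns when drop_first=True
cities = pd.DataFrame({'City_Category': ['A', 'B', 'C', 'A']})
dummies = pd.get_dummies(cities, columns=['City_Category'], drop_first=True).astype(int)

print(dummies.columns.tolist())  # ['City_Category_B', 'City_Category_C']
print(dummies.iloc[0].tolist())  # [0, 0] -> row 0 is category 'A', the dropped one
```

Dropping one column also avoids perfectly collinear features (the "dummy variable trap") in linear models.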
By using label encoding
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# LabelEncoder expects a 1-D array, so pass the column as a Series
data['Age_code'] = le.fit_transform(data['Age'])
data['Age_code'].head()

Output:

0    0
1    0
2    0
3    0
4    6
Name: Age_code, dtype: int64
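The codes above are not arbitrary: `LabelEncoder` sorts the distinct labels and numbers them in that order, and the mapping stays available on the fitted encoder. A small sketch on toy age labels (values illustrative):

```python
from sklearn.preprocessing import LabelEncoder

ages = ['0-17', '26-35', '0-17', '55+']
le = LabelEncoder()
codes = le.fit_transform(ages)

print(list(codes))        # [0, 1, 0, 2] -- codes follow sorted label order
print(list(le.classes_))  # ['0-17', '26-35', '55+']
print(list(le.inverse_transform(codes)))  # back to the original labels
```

One caveat: label encoding imposes an ordering on the categories, which is fine for an ordinal column like `Age` but can mislead some models on nominal columns like `City_Category`, where dummies are usually safer.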
Next step: standardise the data. Since the columns can be on different scales, we should standardise or normalise the data first:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

mm = MinMaxScaler()
sd = StandardScaler()

# Work on a copy to avoid pandas' SettingWithCopyWarning
data_p = data[['Purchase']].copy()
dfmm = mm.fit_transform(data_p['Purchase'].values.reshape(-1, 1))
dfsd = sd.fit_transform(data_p['Purchase'].values.reshape(-1, 1))
data_p['Purchase_mm'] = dfmm
data_p['Purchase_sd'] = dfsd
data_p.head()
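Under the hood these two scalers are simple formulas: StandardScaler computes z = (x - mean) / std (using the population standard deviation), and MinMaxScaler computes (x - min) / (max - min), mapping values into [0, 1]. A quick sketch on toy values verifying both against manual arithmetic:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[10.0], [20.0], [30.0], [40.0]])

z = StandardScaler().fit_transform(x)  # (x - mean) / std
m = MinMaxScaler().fit_transform(x)    # (x - min) / (max - min)

# Manual versions of the same formulas
z_manual = (x - x.mean()) / x.std()    # np.std defaults to population std, matching sklearn
m_manual = (x - x.min()) / (x.max() - x.min())

print(np.allclose(z, z_manual))  # True
print(np.allclose(m, m_manual))  # True
```

Standardisation preserves the shape of the distribution while centring it at 0; min-max scaling bounds the range but is sensitive to outliers, which matters for a skewed column like `Purchase`.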
Split the data into train and test sets:
Once all the features are in numeric form, we should split the data for training and testing purposes:
from sklearn.model_selection import train_test_split

x = data.drop('Purchase', axis=1)
y = data['Purchase']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)
print(X_train.shape, X_test.shape)
Output:
(376303, 12) (161274, 12)
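The shapes above show the 70/30 split of the 537,577 rows; `random_state=42` just fixes the shuffle so the split is reproducible. The same behaviour on a small toy array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, split 70/30 with a fixed seed
X = np.arange(200).reshape(100, 2)
y = np.arange(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

print(X_train.shape, X_test.shape)  # (70, 2) (30, 2)
```

Re-running with the same `random_state` always yields the same partition, which keeps experiments comparable.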
Now your data is ready for the model and can be used for either classification or regression, depending on your analytics problem.
Thanks !