Kaggle Project: Categorical Feature Engineering
About
This dataset is from the Kaggle Categorical Feature Encoding Challenge II, which can be found here: https://www.kaggle.com/c/cat-in-the-dat-ii
Content
Hello all, in this dataset we are given a bunch of categorical features, just like in the previous challenge, except that this one has missing values. So we are going to approach the problem in three steps: first we fill the missing values, then we encode the features, and finally we fit a model. For this dataset I chose Logistic Regression with regularization. We have binary, nominal and ordinal features. Let's begin.
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Any results you write to the current directory are saved as output.
train = pd.read_csv("../input/cat-in-the-dat-ii/train.csv")
test = pd.read_csv("../input/cat-in-the-dat-ii/test.csv")
# Subset
target = train['target']
train_id = train['id']
test_id = test['id']
train.drop(['id'], axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)
#Null values
null_df = pd.DataFrame({'Fraction':train.isnull().sum()/len(train), 'Count':train.isnull().sum()})
print(null_df)
All variables have around 3% missing values.
my_df = pd.concat([train, test])
my_df.isnull().sum()
- bin_0 29795
- bin_1 30041
- bin_3 29965
- bin_4 29998
- day 29977
- month 29972
- nom_0 30314
- nom_1 30103
- nom_2 30214
- nom_3 30297
- nom_4 30028
- nom_5 29690
- nom_6 30143
- nom_7 30006
- nom_8 29711
- nom_9 30133
- ord_0 30181
- ord_1 30208
- ord_2 30180
- ord_3 29969
- ord_4 29863
- ord_5 29760
- target 400000
- dtype: int64
Preprocessing
Start with the binary features.
my_df["bin_3"] = my_df["bin_3"].apply(lambda x: 1 if x=='T' else (0 if x=='F' else None))
my_df["bin_4"] = my_df["bin_4"].apply(lambda x: 1 if x=='Y' else (0 if x=='N' else None))
For the nominal and ordinal features, we first have to convert them to strings with .astype. But when we do this, NaN values become the literal string "nan", so we set those back to None afterwards.
for enc in ["nom_0","nom_1","nom_2","nom_3","nom_4","day","month","ord_3"]:#,"ord_4","ord_5"]:
my_df[enc] = my_df[enc].astype("str")
my_df[enc] = my_df[enc].apply(lambda x: x.lower())
my_df[enc] = my_df[enc].apply(lambda x: None if x=='nan' else x)
for enc in ["ord_4","ord_5"]:
my_df[enc] = my_df[enc].astype("str")
my_df[enc] = my_df[enc].apply(lambda x: None if x=='nan' else x)
ord_1 and ord_2 can be encoded as follows. They carry a natural ordering, e.g. Warm > Cold > Freezing.
my_df["ord_1"] = my_df["ord_1"].apply(lambda x: 1 if x=='Novice' else (2 if x=='Contributor' else (3 if x=='Expert' else (4 if x=='Master' else (5 if x=='Grandmaster' else None)))))
my_df["ord_2"] = my_df["ord_2"].apply(lambda x: 1 if x=='Freezing' else (2 if x=='Cold' else (3 if x=='Warm' else (4 if x=='Hot' else (5 if x=='Boiling Hot' else (6 if x=='Lava Hot' else None))))))
Now here comes the important part. As Bojan mentioned here, the most important features are the weird ones: nom_5, nom_6, nom_7, nom_8 and nom_9. I filled the missing values of these with the mode, then used these features to impute the missing values in the other variables with similar values. I filled them with the mode because pandas' groupby excludes rows whose grouping key is missing, so these columns have to be complete before we can group by them.
for col in ["nom_5","nom_6","nom_7","nom_8","nom_9"]:
mode = my_df[col].mode()[0]
my_df[col] = my_df[col].astype(str)
my_df[col] = my_df[col].apply(lambda x: mode if x=='nan' else x)
Get the list of the other variables and fill them with similar values.
columns = list(my_df.columns)
for col in ["target","nom_5","nom_6","nom_7","nom_8","nom_9"]:
columns.remove(col)
I grouped by nom_7 only, since it looks like the most important variable according to the feature importances in the kernel linked above.
The reason we didn't group by more variables is that grouping on high-cardinality features is not efficient; it can introduce bias and lead to overfitting, so a single grouping feature keeps us reasonably safe. After grouping, we fill each missing value with the mode of its group.
for col in columns:
    my_df[col] = my_df.groupby(["nom_7"])[col].transform(lambda x: x.fillna(x.mode()[0]))
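To make the group-wise mode imputation concrete, here is a tiny self-contained sketch on hypothetical toy data (not the competition data): each missing value picks up the mode of its nom_7 group.
import pandas as pd
# Toy example of group-wise mode imputation.
toy = pd.DataFrame({'nom_7': ['a', 'a', 'a', 'b', 'b'],
                    'ord_0': [1.0, 1.0, None, 3.0, None]})
toy['ord_0'] = toy.groupby('nom_7')['ord_0'].transform(lambda x: x.fillna(x.mode()[0]))
print(toy)
# The NaN in group 'a' becomes 1.0 (that group's mode); the NaN in group 'b' becomes 3.0.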
Now check the missing values.
my_df.isnull().sum()
- bin_0 0
- bin_1 0
- bin_2 0
- bin_3 0
- bin_4 0
- day 0
- month 0
- nom_0 0
- nom_1 0
- nom_2 0
- nom_3 0
- nom_4 0
- nom_5 0
- nom_6 0
- nom_7 0
- nom_8 0
- nom_9 0
- ord_0 0
- ord_1 0
- ord_2 0
- ord_3 0
- ord_4 0
- ord_5 0
- target 400000
No missing values left except for the test set's target. Now, let's encode the remaining variables. I will convert the ordinal ones with an ordinal encoder, which assigns 0, 1, 2, … according to the alphabetical order of the string categories. Note that ord_4 and ord_5 contain both upper and lower case letters; the case might carry information, so let's use the ordinal encoder on them too.
from sklearn.preprocessing import OrdinalEncoder
oencoder = OrdinalEncoder(dtype=np.int16)
for enc in ["ord_3","ord_4","ord_5"]:
my_df[enc] = oencoder.fit_transform(np.array(my_df[enc]).reshape(-1,1))
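As a quick sketch of why case matters here: OrdinalEncoder sorts the categories lexicographically, and uppercase letters sort before lowercase ones, so 'A' < 'B' < 'a' < 'b'. A toy example on hypothetical values:
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
toy = np.array(['a', 'A', 'B', 'b']).reshape(-1, 1)
print(OrdinalEncoder(dtype=np.int16).fit_transform(toy).ravel())
# [2 0 1 3] -> the learned category order is ['A', 'B', 'a', 'b']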
We can create dummies for the rest of the categorical variables. Among the high-cardinality ones, we will only create dummies for two of them.
for category in ["nom_5","nom_6","nom_7","nom_8","nom_9"]:
print("{} has {} unique values".format(category,len(np.unique(my_df[category]))))
- nom_5 has 1220 unique values
- nom_6 has 1520 unique values
- nom_7 has 222 unique values
- nom_8 has 222 unique values
- nom_9 has 2218 unique values
nom_7 and nom_8 have lower cardinality, so we can use pandas' dummy encoder on them. If we used it on all of them, it would widen our train set far too much and cause problems when fitting a model directly. This way still performs well; we will use another encoding for the high-cardinality variables later.
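To put a rough number on that claim, here is a quick sketch using the cardinalities printed above: one-hot encoding all five of the weird nominals would add roughly 1220 + 1520 + 222 + 222 + 2218 ≈ 5,400 columns, versus about 444 for nom_7 and nom_8 alone.
# Rough width check (sketch): total dummy columns for all five vs. only nom_7/nom_8.
high_card = ["nom_5", "nom_6", "nom_7", "nom_8", "nom_9"]
print(sum(my_df[c].nunique() for c in high_card))             # ~5,400
print(sum(my_df[c].nunique() for c in ["nom_7", "nom_8"]))    # ~444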
for enc in ["nom_0","nom_1","nom_2","nom_3","nom_4","day","month","nom_7","nom_8"]:
enc1 = pd.get_dummies(my_df[enc], prefix=enc)
my_df.drop(columns=enc, inplace=True)
my_df = pd.concat([my_df,enc1], axis=1)
Scale the ordinal and binary variables to the range (-1, 1). This can help linear models.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler((-1,1))
for feat in ["ord_0","ord_1","ord_2","ord_3","ord_4","ord_5"]:
my_df[feat] = scaler.fit_transform(np.array(my_df[feat]).reshape(-1,1))
for feat in ["bin_0","bin_1","bin_2","bin_3","bin_4"]:
my_df[feat] = scaler.fit_transform(np.array(my_df[feat]).reshape(-1,1))
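For reference, MinMaxScaler with feature_range=(-1, 1) is just a linear rescaling; a minimal sketch of the equivalent arithmetic on toy values:
import numpy as np
# x_std = (x - min) / (max - min) maps to [0, 1]; 2 * x_std - 1 maps to [-1, 1].
x = np.array([1, 2, 3, 4, 5], dtype=float)
x_scaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1
print(x_scaled)  # [-1.  -0.5  0.   0.5  1. ]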
Create train and test.
test = my_df[my_df["target"].isnull()]
test.drop(columns='target', inplace=True)
train = my_df[my_df["target"].isnull()==False]
target = train["target"]
Before dropping the target column from the train set, I will use LeaveOneOutEncoder on the high-cardinality nominal variables.
from category_encoders import LeaveOneOutEncoder
leaveOneOut_encoder = LeaveOneOutEncoder()
for nom in ["nom_5","nom_6","nom_9"]:
train[nom] = leaveOneOut_encoder.fit_transform(train[nom], train["target"])
test[nom] = leaveOneOut_encoder.transform(test[nom])
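For intuition: leave-one-out encoding replaces each training row's category with the mean of the target over all other rows sharing that category, while the test set simply gets the per-category target mean learned during fit. A sketch on hypothetical toy data:
from category_encoders import LeaveOneOutEncoder
import pandas as pd
# Toy example: for row 0, the encoding of 'a' is the mean target of the other 'a' rows, (1 + 0) / 2 = 0.5.
toy_X = pd.Series(['a', 'a', 'a', 'b', 'b'], name='cat')
toy_y = pd.Series([0, 1, 0, 1, 1])
print(LeaveOneOutEncoder().fit_transform(toy_X, toy_y))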
Now let's drop the target and reset the indexes (the cross-validation loop below indexes rows with .loc, so a clean 0..n-1 index matters).
train.drop(columns='target', inplace=True)
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
target.reset_index(drop=True, inplace=True)
Modelling
Import the libraries we are going to need.
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score as auc
from sklearn.linear_model import LogisticRegression
def run_cv_model(train, test, target, model_fn, params={}, eval_fn=None, label='model'):
    kf = KFold(n_splits=5)
    fold_splits = kf.split(train, target)
    cv_scores = []
    pred_full_test = 0
    pred_train = np.zeros((train.shape[0]))
    i = 1
    for dev_index, val_index in fold_splits:
        print('Started ' + label + ' fold ' + str(i) + '/5')
        dev_X, val_X = train.loc[dev_index], train.loc[val_index]
        dev_y, val_y = target.loc[dev_index], target.loc[val_index]
        params2 = params.copy()
        pred_val_y, pred_test_y = model_fn(dev_X, dev_y, val_X, val_y, test, params2)
        pred_full_test = pred_full_test + pred_test_y
        pred_train[val_index] = pred_val_y
        if eval_fn is not None:
            cv_score = eval_fn(val_y, pred_val_y)
            cv_scores.append(cv_score)
            print(label + ' cv score {}: {}'.format(i, cv_score))
        i += 1
    print('{} cv scores : {}'.format(label, cv_scores))
    print('{} cv mean score : {}'.format(label, np.mean(cv_scores)))
    print('{} cv std score : {}'.format(label, np.std(cv_scores)))
    pred_full_test = pred_full_test / 5.0
    results = {'label': label,
               'train': pred_train, 'test': pred_full_test,
               'cv': cv_scores}
    return results
def runLR(train_X, train_y, test_X, test_y, test_X2, params):
    print('Train LR')
    model = LogisticRegression(**params)
    model.fit(train_X, train_y)
    print('Predict 1/2')
    pred_test_y = model.predict_proba(test_X)[:, 1]
    print('Predict 2/2')
    pred_test_y2 = model.predict_proba(test_X2)[:, 1]
    return pred_test_y, pred_test_y2
Set the inverse regularization strength C=0.1 and run the cross-validation.
lr_params = {'solver': 'lbfgs', 'C':0.1, 'max_iter':500}
results = run_cv_model(train, test, target, runLR, lr_params, auc, 'lr')
- lr cv score 5: 0.7847225367977033
- lr cv scores : [0.7844204265213018, 0.7848292810796071, 0.7853701511296699, 0.7853865447776116, 0.7847225367977033]
- lr cv mean score : 0.7849457880611789
- lr cv std score : 0.0003778280688249727
Our mean CV AUC is about 0.785. For this amount of feature engineering, I suppose that is sufficient :) Let's submit the results.
submission = pd.DataFrame({'id': test_id, 'target': results['test']})
submission.to_csv('submission.csv', index=False)