Detailed feature engineering for Kaggle's Elo merchant competition - Top 3% solution šŸ”„

ā€” 217 minute read

TL;DR: The complete notebook is available here.

I'm writing up an exhaustive list of the feature engineering steps, with code, that I did for Kaggle's Elo merchant recommendation competition.

Elo Merchant Competition

Elo is one of the largest payment brands in Brazil. It has built partnerships with merchants in order to offer promotions or discounts to cardholders. But do these promotions work for either the consumer or the merchant? Do customers enjoy their experience? Do merchants see repeat business? Personalization is key.

In this competition our challenge is to develop algorithms to identify and serve the most relevant opportunities to individuals, by uncovering signal in customer loyalty. Our input will help Elo reduce unwanted campaigns, to create the right experience for customers.

What files do we need?

All the data can be downloaded from here. We will need, at a minimum, the train.csv and test.csv files. These contain the card_ids that we'll be using for training and prediction.

What should we expect the data format to be?

The data is formatted as follows:

train.csv and test.csv contain card_ids and information about the card itself - the first month the card was active, etc. train.csv also contains the target.

historical_transactions.csv and new_merchant_transactions.csv are designed to be joined with train.csv, test.csv, and merchants.csv. They contain information about transactions for each card, as described above.

merchants.csv can be joined with the transaction sets to provide additional merchant-level information.

What are we predicting?

We are predicting a loyalty score for each card_id represented in test.csv and sample_submission.csv.

File descriptions

train.csv - the training set

test.csv - the test set

sample_submission.csv - a sample submission file in the correct format - contains all card_ids we are expected to predict for.

historical_transactions.csv - up to 3 months' worth of historical transactions for each card_id

merchants.csv - additional information about all merchants / merchant_ids in the dataset.

new_merchant_transactions.csv - two months' worth of data for each card_id containing ALL purchases that card_id made at merchant_ids that were not visited in the historical data.

Data fields

Data field descriptions are provided in Data Dictionary.xlsx.

I downloaded the files to my local directory. Let's get started!

Understanding the data

Though Elo mentioned that all data is simulated and fictitious and is not real customer data, it was later hinted that the data had been heavily transformed & modified using statistical techniques to make data leaks hard to exploit.

import datetime
import numpy as np
import pandas as pd
PATH = '../data/elo/'
files = ['historical_transactions', 'new_merchant_transactions']
hist_trans, new_hist_trans = [pd.read_csv(f'{PATH}{c}.csv') for c in files]

Let's have a look at the transactions data available.

hist_trans.head().T
                      0                    1                    2                    3                    4
authorized_flag       Y                    Y                    Y                    Y                    Y
card_id               C_ID_4e6213e9bc      C_ID_4e6213e9bc      C_ID_4e6213e9bc      C_ID_4e6213e9bc      C_ID_4e6213e9bc
city_id               88                   88                   88                   88                   88
category_1            N                    N                    N                    N                    N
installments          0                    0                    0                    0                    0
category_3            A                    A                    A                    A                    A
merchant_category_id  80                   367                  80                   560                  80
merchant_id           M_ID_e020e9b302      M_ID_86ec983688      M_ID_979ed661fc      M_ID_e6d5ae8ea6      M_ID_e020e9b302
month_lag             -8                   -7                   -6                   -5                   -11
purchase_amount       -0.703331            -0.733128            -0.720386            -0.735352            -0.722865
purchase_date         2017-06-25 15:33:07  2017-07-15 12:10:45  2017-08-09 22:04:29  2017-09-02 10:06:26  2017-03-10 01:14:19
category_2            1                    1                    1                    1                    1
state_id              16                   16                   16                   16                   16
subsector_id          37                   16                   37                   34                   37
new_hist_trans.head().T
                      0                    1                    2                    3                    4
authorized_flag       Y                    Y                    Y                    Y                    Y
card_id               C_ID_415bb3a509      C_ID_415bb3a509      C_ID_415bb3a509      C_ID_415bb3a509      C_ID_ef55cf8d4b
city_id               107                  140                  330                  -1                   -1
category_1            N                    N                    N                    Y                    Y
installments          1                    1                    1                    1                    1
category_3            B                    B                    B                    B                    B
merchant_category_id  307                  307                  507                  661                  166
merchant_id           M_ID_b0c793002c      M_ID_88920c89e8      M_ID_ad5237ef6b      M_ID_9e84cda3b1      M_ID_3c86fa3831
month_lag             1                    1                    2                    1                    1
purchase_amount       -0.557574            -0.56958             -0.551037            -0.671925            -0.659904
purchase_date         2018-03-11 14:57:36  2018-03-19 18:53:37  2018-04-26 14:08:44  2018-03-07 09:43:21  2018-03-22 21:07:53
category_2            1                    1                    1                    NaN                  NaN
state_id              9                    9                    9                    -1                   -1
subsector_id          19                   19                   14                   8                    29

Let's look at the data dictionary to find out what each of these fields means.

train.csv
  card_id               Unique card identifier
  first_active_month    'YYYY-MM', month of first purchase
  feature_1             Anonymized card categorical feature
  feature_2             Anonymized card categorical feature
  feature_3             Anonymized card categorical feature
  target                Loyalty numerical score calculated 2 months after historical and evaluation period

historical_transactions.csv
  card_id               Card identifier
  month_lag             Month lag to reference date
  purchase_date         Purchase date
  authorized_flag       'Y' if approved, 'N' if denied
  category_3            Anonymized category
  installments          Number of installments of purchase
  category_1            Anonymized category
  merchant_category_id  Merchant category identifier (anonymized)
  subsector_id          Merchant category group identifier (anonymized)
  merchant_id           Merchant identifier (anonymized)
  purchase_amount       Normalized purchase amount
  city_id               City identifier (anonymized)
  state_id              State identifier (anonymized)
  category_2            Anonymized category

new_merchant_period.csv
  card_id               Card identifier
  month_lag             Month lag to reference date
  purchase_date         Purchase date
  authorized_flag       'Y' if approved, 'N' if denied
  category_3            Anonymized category
  installments          Number of installments of purchase
  category_1            Anonymized category
  merchant_category_id  Merchant category identifier (anonymized)
  subsector_id          Merchant category group identifier (anonymized)
  merchant_id           Merchant identifier (anonymized)
  purchase_amount       Normalized purchase amount
  city_id               City identifier (anonymized)
  state_id              State identifier (anonymized)
  category_2            Anonymized category

merchants.csv
  merchant_id                  Unique merchant identifier
  merchant_group_id            Merchant group (anonymized)
  merchant_category_id         Unique identifier for merchant category (anonymized)
  subsector_id                 Merchant category group (anonymized)
  numerical_1                  Anonymized measure
  numerical_2                  Anonymized measure
  category_1                   Anonymized category
  most_recent_sales_range      Range of revenue (monetary units) in last active month --> A > B > C > D > E
  most_recent_purchases_range  Range of quantity of transactions in last active month --> A > B > C > D > E
  avg_sales_lag3               Monthly average of revenue in last 3 months divided by revenue in last active month
  avg_purchases_lag3           Monthly average of transactions in last 3 months divided by transactions in last active month
  active_months_lag3           Quantity of active months within last 3 months
  avg_sales_lag6               Monthly average of revenue in last 6 months divided by revenue in last active month
  avg_purchases_lag6           Monthly average of transactions in last 6 months divided by transactions in last active month
  active_months_lag6           Quantity of active months within last 6 months
  avg_sales_lag12              Monthly average of revenue in last 12 months divided by revenue in last active month
  avg_purchases_lag12          Monthly average of transactions in last 12 months divided by transactions in last active month
  active_months_lag12          Quantity of active months within last 12 months
  category_4                   Anonymized category
  city_id                      City identifier (anonymized)
  state_id                     State identifier (anonymized)
  category_2                   Anonymized category

Let's start with the transactions data; we will come back to the merchants & cards data later. We have ~1.9M new and ~29M historical transactions. These can be joined with the cards & merchants data on card_id and merchant_id respectively. We will join all the tables after doing feature engineering on each of them separately.

new_hist_trans.shape, hist_trans.shape
((1963031, 14), (29112361, 14))

Data preprocessing consists of:

  1. Treatment of missing values
  2. Mapping of categorical columns

Starting with treatment of missing values for each of the tables in the transactions data:

hist_trans.isnull().sum()/len(hist_trans)
    authorized_flag         0.000000
    card_id                 0.000000
    city_id                 0.000000
    category_1              0.000000
    installments            0.000000
    category_3              0.006120
    merchant_category_id    0.000000
    merchant_id             0.004757
    month_lag               0.000000
    purchase_amount         0.000000
    purchase_date           0.000000
    category_2              0.091125
    state_id                0.000000
    subsector_id            0.000000
    dtype: float64
new_hist_trans.isnull().sum()/len(new_hist_trans)
    authorized_flag         0.000000
    card_id                 0.000000
    city_id                 0.000000
    category_1              0.000000
    installments            0.000000
    category_3              0.028488
    merchant_category_id    0.000000
    merchant_id             0.013355
    month_lag               0.000000
    purchase_amount         0.000000
    purchase_date           0.000000
    category_2              0.056925
    state_id                0.000000
    subsector_id            0.000000
    dtype: float64

It's standard practice to remove rows with missing data, but doing that hurt my leaderboard score in this competition. So I decided to fill in the missing values selectively: the mode for the categorical columns, a placeholder merchant_id for missing merchants, and treating the -1 and 999 sentinel values in installments as missing.

Now for the second step of data preprocessing: mapping the categorical columns.

Let's map category_1 & category_3 from their letter codes to integer values.

def fill_nas_for_transactions_df(df):
    # Fill NAs for category_2 & category_3 with their most frequent values,
    # and missing merchant_ids with a placeholder merchant
    df['category_2'].fillna(1.0, inplace=True)
    df['category_3'].fillna('A', inplace=True)
    df['merchant_id'].fillna('M_ID_00a6ca8a8a', inplace=True)
    # Treat the -1 / 999 sentinel values in installments as missing
    df['installments'].replace(-1, np.nan, inplace=True)
    df['installments'].replace(999, np.nan, inplace=True)
    # Map the letter codes to integers
    df['category_1'] = df['category_1'].map({'Y':1, 'N':0})
    df['category_3'] = df['category_3'].map({'A':0, 'B':1, 'C':2})
    return df

We are also skipping another step, which reduces the memory footprint of the dataframes by manually downcasting columns to smaller data types wherever possible without losing information (see the sketch below).
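For completeness, here's a minimal sketch of what that memory-reduction step typically looks like (the column list is illustrative only, not the exact mapping used in the notebook):

def reduce_mem_usage(df):
    # Downcast wide integer columns to the smallest integer dtype that holds them (lossless)
    for col in ['city_id', 'state_id', 'merchant_category_id', 'subsector_id', 'month_lag']:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], downcast='integer')
    return df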

dfs = [hist_trans, new_hist_trans]

hist_trans, new_hist_trans = [fill_nas_for_transactions_df(df) for df in dfs]

Feature engineering

Feature engineering consists of various logical steps, and it played a very big role in this competition in particular. I can broadly classify the work into the following categories:

  1. Interpreting dates and adding new features dependent on date & time, like time since first/latest occurrence, difference between occurrences, etc.
  2. Statistical aggregates on numerical columns like avg, percentile, max, min, peak-to-peak (ptp), etc.
  3. Statistical aggregates grouped by categorical columns.
  4. Feature interactions amongst core features like dates, transaction amounts, etc.
  5. Feature interactions on the aggregated features of different dataframes like merchants, cards & transactions.
  6. Feature interactions between new & old transactions.
  7. Data mining to reverse engineer the normalized features like purchase_amount & the observed date.
  8. Advanced aggregates grouped by month & later re-aggregated; these turned out to be futile, but I'm including them here nevertheless.

Feature: Add date features meta info

The fast.ai library provides excellent utility functions, and we will be leveraging them throughout for data manipulation & feature engineering, starting with the date field. The add_datepart function reads a date column and generates additional interpreted fields which are very useful for feature engineering.
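For anyone not using fast.ai, a rough plain-pandas equivalent of what add_datepart generates for purchase_date looks like this (a sketch based on the columns shown below, not fast.ai's exact implementation; Series.dt.isocalendar() needs pandas >= 1.1):

# rough equivalent of add_datepart(df, 'purchase_date', drop=False, time=True)
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
d = df['purchase_date'].dt
df['purchase_Year'], df['purchase_Month'], df['purchase_Week'] = d.year, d.month, d.isocalendar().week
df['purchase_Day'], df['purchase_Dayofweek'], df['purchase_Dayofyear'] = d.day, d.dayofweek, d.dayofyear
df['purchase_Is_month_start'], df['purchase_Is_month_end'] = d.is_month_start, d.is_month_end
df['purchase_Hour'], df['purchase_Minute'], df['purchase_Second'] = d.hour, d.minute, d.second
df['purchase_Elapsed'] = df['purchase_date'].astype('int64') // 10**9  # seconds since the Unix epoch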

add_datepart(hist_trans, 'purchase_date', drop=False, time=True)
add_datepart(new_hist_trans, 'purchase_date', drop=False, time=True)
new_hist_trans.head().T

After adding the date features our dataframe looks like this:

01234
authorized_flagYYYYY
card_idC_ID_415bb3a509C_ID_415bb3a509C_ID_415bb3a509C_ID_415bb3a509C_ID_ef55cf8d4b
city_id107140330-1-1
category_100011
installments11111
category_311111
merchant_category_id307307507661166
merchant_idM_ID_b0c793002cM_ID_88920c89e8M_ID_ad5237ef6bM_ID_9e84cda3b1M_ID_3c86fa3831
month_lag11211
purchase_amount-0.557574-0.56958-0.551037-0.671925-0.659904
purchase_date2018-03-11 14:57:362018-03-19 18:53:372018-04-26 14:08:442018-03-07 09:43:212018-03-22 21:07:53
category_211111
state_id999-1-1
subsector_id191914829
purchase_Year20182018201820182018
purchase_Month33433
purchase_Week1012171012
purchase_Day111926722
purchase_Dayofweek60323
purchase_Dayofyear70781166681
purchase_Is_month_endFalseFalseFalseFalseFalse
purchase_Is_month_startFalseFalseFalseFalseFalse
purchase_Is_quarter_endFalseFalseFalseFalseFalse
purchase_Is_quarter_startFalseFalseFalseFalseFalse
purchase_Is_year_endFalseFalseFalseFalseFalse
purchase_Is_year_startFalseFalseFalseFalseFalse
purchase_Hour141814921
purchase_Minute57538437
purchase_Second3637442153
purchase_Elapsed15207802561521485617152475172415204158011521752873

Feature: Add derived date features

add_datepart has already added columns like 'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start' to our dataframe. Let's add some more, like whether the transaction happened on a weekday or a weekend, and month_diff, the number of months since the card's reference date. We will find out later that these turn out to be pretty strong features.

sns.kdeplot(new_hist_trans['purchase_amount'])
# sns.kdeplot(new_hist_trans['purchase_amount'], bw=.2, label="bw: 0.2")
# sns.kdeplot(new_hist_trans['purchase_amount'])
plt.legend();
plt.xlim(-20,20)
sns.distplot(new_hist_trans['purchase_amount'], bins=20, kde=True, rug=False);

[Figure: distribution of purchase_amount]

We will add flags for whether the purchase was made on a weekend or a weekday and whether the transaction was authorized. We will also define month_diff as the number of months since the transaction happened, offset by month_lag (which makes it the number of months since the card's reference date).

def add_extra_cols(df):
    df['purchased_on_weekend'] = (df.purchase_Dayofweek >= 5).astype(int)
    df['purchased_on_weekday'] = (df.purchase_Dayofweek < 5).astype(int)
    df['month_diff'] = ((datetime.datetime.today() - df['purchase_date']).dt.days) // 30
    df['month_diff'] += df['month_lag']
    df['authorized_flag'] = df['authorized_flag'].map({'Y':1, 'N':0})
    # Trim the purchase_amount
    # df['purchase_amount'] = df['purchase_amount'].apply(lambda x: min(x, 0.8))
    return df
hist_trans, new_hist_trans = [add_extra_cols(df) for df in dfs]
hist_trans.shape, new_hist_trans.shape
    ((29112361, 33), (1963031, 33))

The new dataframe after adding features are:

new_hist_trans.head().T
01234
authorized_flag11111
card_idC_ID_415bb3a509C_ID_415bb3a509C_ID_415bb3a509C_ID_415bb3a509C_ID_ef55cf8d4b
city_id107140330-1-1
category_100011
installments11111
category_311111
merchant_category_id307307507661166
merchant_idM_ID_b0c793002cM_ID_88920c89e8M_ID_ad5237ef6bM_ID_9e84cda3b1M_ID_3c86fa3831
month_lag11211
purchase_amount-0.557574-0.56958-0.551037-0.671925-0.659904
purchase_date2018-03-11 14:57:362018-03-19 18:53:372018-04-26 14:08:442018-03-07 09:43:212018-03-22 21:07:53
category_211111
state_id999-1-1
subsector_id191914829
purchase_Year20182018201820182018
purchase_Month33433
purchase_Week1012171012
purchase_Day111926722
purchase_Dayofweek60323
purchase_Dayofyear70781166681
purchase_Is_month_endFalseFalseFalseFalseFalse
purchase_Is_month_startFalseFalseFalseFalseFalse
purchase_Is_quarter_endFalseFalseFalseFalseFalse
purchase_Is_quarter_startFalseFalseFalseFalseFalse
purchase_Is_year_endFalseFalseFalseFalseFalse
purchase_Is_year_startFalseFalseFalseFalseFalse
purchase_Hour141814921
purchase_Minute57538437
purchase_Second3637442153
purchase_Elapsed15207802561521485617152475172415204158011521752873
purchased_on_weekend10000
purchased_on_weekday01111
month_diff1918181918

Feature: Reverse engineering purchase amount column

The purchase_amount column is normalized and heavily transformed. Let's try to reverse engineer it to get the actual amounts. One Kaggler in the competition solved it by using a simple optimisation function that makes the least purchase amount non-negative and tunes the regression coefficients until purchase_amount rounds nicely to 2 decimals. Let's look at a different approach to achieve the same.

We start with the basic assumption that the purchase amount was normalized using a simple scale-and-shift transformation.

data = pd.concat((hist_trans, new_hist_trans))
data.purchase_amount.mean()
-0.0007032266623490891

The mean is indeed (almost) zero, so the amounts have been normalized! Purchase amounts can never be negative, so let's shift them to start at zero.

data['purchase_amount_new'] = (data.purchase_amount - data.purchase_amount.min())

Let's sort them (the successive-difference trick below relies on the sorted order) and look at the head:

index  purchase_amount_new
0      0
1      0.000015
2      0.00003
3      0.000045
4      0.00006
5      0.000075
6      0.00009
7      0.000105
8      0.00012
9      0.000135

Let's compute the successive diff between the purchase amounts:

data['delta'] = data.purchase_amount_new.diff(1)
data.head(10)
index  purchase_amount_new  delta
0      0                    NaN
1      0.000015             0.000015
2      0.00003              0.000015
3      0.000045             0.000015
4      0.00006              0.000015
5      0.000075             0.000015
6      0.00009              0.000015
7      0.000105             0.000015
8      0.00012              0.000015
9      0.000135             0.000015

Hmm, all the deltas look the same. That's an interesting find. Does it hold for all rows?

data[data.delta > 2e-5].head()
index  purchase_amount_new  delta
52623  0.790755             0.00003
54532  0.819456             0.00003
57407  0.862672             0.00003
60592  0.910547             0.00003
60757  0.913041             0.00003

The few deltas above 2e-5 are all 0.00003, an exact multiple of 0.000015. Let's take the mean of all the deltas.

data.delta.mean()
1.5026511915168561e-05

This should be the smallest step between two prices. Let's assume this step corresponds to one cent, i.e., one cent ā‰ˆ 1.5026511915168561e-05 in normalized units; dividing purchase_amount_new by 100 * delta_mean should then give the approximate price in whole currency units.

data['purchase_amount_new'] = data.purchase_amount_new / (100 * data.delta.mean())

Now let's look at the most frequent values of our new purchase amount:

data.purchase_amount_new.value_counts().head(10)

50.000000 735619
20.000004 640964
30.000003 547680
10.000005 444249
100.000001 418773
15.000001 379041
40.000002 271846
12.000004 233231
25.000000 232732
5.000003 208044
Name: purchase_amount_new, dtype: int64

Woohoo! They appear close to nice round price points. Let's round them to 2 decimals:

data['two_decimal_amount'] = np.round(data.purchase_amount_new, 2)
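To recap, the whole de-normalization fits in one small helper (my own consolidation of the steps above, assuming the "smallest step equals one cent" interpretation holds):

def reverse_purchase_amount(amounts):
    shifted = amounts - amounts.min()              # smallest purchase becomes 0
    delta = shifted.sort_values().diff(1).mean()   # average step between successive sorted amounts
    return np.round(shifted / (100 * delta), 2)    # treat one step as one cent, convert to currency units

# e.g. data['two_decimal_amount'] = reverse_purchase_amount(data['purchase_amount'])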

Feature: Aggregates grouped by month & card_id permalink

Now that we have our purchase amounts, let's calculate mean, sum, max and peak-to-peak aggregates per month, grouped by card_id.

def aggregate_per_month(history):
    grouped = history.groupby(['card_id', 'month_diff'])

    agg_func = {
        'purchase_amount': ['count', 'sum', 'max', 'mean'],
    }

    intermediate_group = grouped.agg(agg_func)
    intermediate_group.columns = ['_'.join(col).strip() for col in intermediate_group.columns.values]
    intermediate_group.reset_index(inplace=True)

    # Aggregate the per-month aggregates once more, per card
    final_group = intermediate_group.groupby('card_id').agg(['mean', 'sum', np.ptp, 'max'])
    final_group.columns = ['_'.join(col).strip() for col in final_group.columns.values]
    final_group.reset_index(inplace=True)

    return final_group

Feature: Time between successive transactions

Adding another new feature: time between successive transactions, i.e., how long a customer waits between purchases. For this we group the transactions by card_id and take the time difference between successive purchase dates.

def time_diff(df):
    df['purchase_date_successive_diff'] = (
        df.sort_values('purchase_date').groupby('card_id')['purchase_date'].diff().dt.total_seconds()
    )
    df['purchase_date_successive_diff'].fillna(0, inplace=True)
    return df

hist_trans, new_hist_trans = [time_diff(df) for df in dfs]

Save the data frame to disk in feather format. This is our first milestone.

hist_trans.to_feather('hist_trans_beta')
new_hist_trans.to_feather('new_hist_trans_beta')
# hist_trans = feather.read_dataframe('hist_trans_beta')
new_hist_trans = feather.read_dataframe('new_hist_trans_beta')
dfs = [hist_trans, new_hist_trans]

Feature: Holiday features

Adding more features, like holidays in Brazil: merchants typically see a spike in sales on holidays and run very good offers. We will also add price, the EMI paid per installment, which is purchase_amount divided by installments. Another feature, duration, is a feature interaction between purchase_amount and month_diff (months from the reference date). We also have amount_month_ratio, purchase_amount divided by month_diff, which signifies the amount spent by a customer relative to the reference date and adds more weight to spending close to the reference date.

def additional_feats(hist_df):
    hist_df['price'] = hist_df['purchase_amount'] / hist_df['installments']
    # Christmas: December 25 2017
    hist_df['Christmas_Day_2017'] = (pd.to_datetime('2017-12-25') - hist_df['purchase_date']).dt.days.apply(lambda x: x if x > 0 and x < 100 else 0)
    # Mothers Day: May 14 2017
    hist_df['Mothers_Day_2017'] = (pd.to_datetime('2017-06-04') - hist_df['purchase_date']).dt.days.apply(lambda x: x if x > 0 and x < 100 else 0)
    # Fathers Day: August 13 2017
    hist_df['fathers_day_2017'] = (pd.to_datetime('2017-08-13') - hist_df['purchase_date']).dt.days.apply(lambda x: x if x > 0 and x < 100 else 0)
    # Childrens Day: October 12 2017
    hist_df['Children_day_2017'] = (pd.to_datetime('2017-10-12') - hist_df['purchase_date']).dt.days.apply(lambda x: x if x > 0 and x < 100 else 0)
    # Valentine's Day (Brazil): 12th June 2017
    hist_df['Valentine_Day_2017'] = (pd.to_datetime('2017-06-12') - hist_df['purchase_date']).dt.days.apply(lambda x: x if x > 0 and x < 100 else 0)
    # Black Friday: 24th November 2017
    hist_df['Black_Friday_2017'] = (pd.to_datetime('2017-11-24') - hist_df['purchase_date']).dt.days.apply(lambda x: x if x > 0 and x < 100 else 0)
    # 2018
    # Mothers Day: May 13 2018
    hist_df['Mothers_Day_2018'] = (pd.to_datetime('2018-05-13') - hist_df['purchase_date']).dt.days.apply(lambda x: x if x > 0 and x < 100 else 0)
    hist_df['duration'] = hist_df['purchase_amount'] * hist_df['month_diff']
    hist_df['amount_month_ratio'] = hist_df['purchase_amount'] / hist_df['month_diff']
    return hist_df

After adding the additional features we assign the results back to the existing dataframes to save memory.

hist_trans, new_hist_trans = [additional_feats(df) for df in dfs]
hist_trans.shape, new_hist_trans.shape
((29112361, 52), (1963031, 52))

Feature: Latest & First 5 feature aggregates

In the same spirit as amount_month_ratio, which adds more weight to purchases close to the reference date, let's add features like the sum/max of the first and last 5 purchases of each card.

def head_sum(x):
    return x.head().sum()

def head_max(x):
    return x.head().max()

def tail_sum(x):
    return x.tail().sum()

def tail_max(x):
    return x.tail().max()

We will sort all our transactions by purchase date chronologically.

%time new_hist_trans = new_hist_trans.sort_values('purchase_date')


CPU times: user 864 ms, sys: 681 ms, total: 1.54 s
Wall time: 358 ms
%time hist_trans = hist_trans.sort_values('purchase_date')


CPU times: user 4.94 s, sys: 3.11 s, total: 8.05 s
Wall time: 6.06 s
# hist_trans.head()
new_hist_trans = new_hist_trans.reset_index().drop('index', axis=1)
hist_trans.reset_index(inplace=True)
hist_trans.drop('index', axis=1, inplace=True)
hist_trans.head()
hist_trans.drop('level_0', axis=1, inplace=True)

Feature: Merchants features

merchants = pd.read_csv('data/elo/merchants.csv')
merchants.head().T
                             0                 1                 2                 3                 4
merchant_id                  M_ID_838061e48c   M_ID_9339d880ad   M_ID_e726bbae1e   M_ID_a70e9c5f81   M_ID_64456c37ce
merchant_group_id            8353              3184              447               5026              2228
merchant_category_id         792               840               690               792               222
subsector_id                 9                 20                1                 9                 21
numerical_1                  -0.0574706        -0.0574706        -0.0574706        -0.0574706        -0.0574706
numerical_2                  -0.0574706        -0.0574706        -0.0574706        -0.0574706        -0.0574706
category_1                   N                 N                 N                 Y                 Y
most_recent_sales_range      E                 E                 E                 E                 E
most_recent_purchases_range  E                 E                 E                 E                 E
avg_sales_lag3               -0.4              -0.72             -82.13            NaN               NaN
avg_purchases_lag3           9.66667           1.75              260               1.66667           0.5
active_months_lag3           3                 3                 2                 3                 3
avg_sales_lag6               -2.25             -0.74             -82.13            NaN               NaN
avg_purchases_lag6           18.6667           1.29167           260               4.66667           0.361111
active_months_lag6           6                 6                 2                 6                 6
avg_sales_lag12              -2.32             -0.57             -82.13            NaN               NaN
avg_purchases_lag12          13.9167           1.6875            260               3.83333           0.347222
active_months_lag12          12                12                2                 12                12
category_4                   N                 N                 N                 Y                 Y
city_id                      242               22                -1                -1                -1
state_id                     9                 16                5                 -1                -1
category_2                   1                 1                 5                 NaN               NaN

Let's have a look at the dataset summary

DataFrameSummary(merchants).summary().T
countmeanstdmin25%50%75%maxcountsuniquesmissingmissing_perctypes
merchant_idNaNNaNNaNNaNNaNNaNNaNNaN33469633463300%categorical
merchant_group_id33469631028.731623136121990051707.211258633469610939100%numeric
merchant_category_id334696423.132252.898-122237368389133469632400%numeric
subsector_id33469625.11649.80737-1192733413346964100%numeric
numerical_13346960.01147641.09815-0.0574706-0.0574706-0.0574706-0.0475558183.73533469695400%numeric
numerical_23346960.008103111.0705-0.0574706-0.0574706-0.0574706-0.0475558182.07933469694700%numeric
category_1NaNNaNNaNNaNNaNNaNNaNNaN334696200%bool
most_recent_sales_rangeNaNNaNNaNNaNNaNNaNNaNNaN334696500%categorical
most_recent_purchases_rangeNaNNaNNaNNaNNaNNaNNaNNaN334696500%categorical
avg_sales_lag333468313.8332395.49-82.130.8811.168518453346833372130.00%numeric
avg_purchases_lag3334696infNaN0.3334950.923651.016671.14652inf33469610000300%numeric
active_months_lag33346962.994110.095247513333334696300%numeric
avg_sales_lag633468321.65083947.11-82.130.851.011.231.51396e+063346834507130.00%numeric
avg_purchases_lag6334696infNaN0.1670450.9022471.026961.21558inf33469613520200%numeric
active_months_lag63346965.94740.39493616666334696600%numeric
avg_sales_lag1233468325.22775251.84-82.130.851.021.292.56741e+063346835009130.00%numeric
avg_purchases_lag12334696infNaN0.09832950.8983331.043361.26648inf33469617291700%numeric
active_months_lag1233469611.59931.520141121212123346961200%numeric
category_4NaNNaNNaNNaNNaNNaNNaNNaN334696200%bool
city_id334696102.918107.091-1-16918234733469627100%numeric
state_id33469611.86096.17689-19916243346962500%numeric
category_23228092.381.56266112453228095118873.55%numeric

We see that we have 334696 rows but only 334633 merchants. Maybe some of them are duplicated. Let's drop duplicates and check again.

merchant_details_cols = ['merchant_id', 'merchant_group_id', 'merchant_category_id', 'subsector_id', 'category_1',
'category_4', 'city_id', 'state_id', 'category_2']
merchant_details = merchants[merchant_details_cols]

# Delete duplicates
merchant_details = merchant_details.drop_duplicates()
DataFrameSummary(merchant_details).summary().T
countmeanstdmin25%50%75%maxcountsuniquesmissingmissing_perctypes
merchant_idNaNNaNNaNNaNNaNNaNNaNNaN33464533463300%categorical
merchant_group_id33464531032.531623.213625199085171611258633464510939100%numeric
merchant_category_id334645423.123252.905-122237368389133464532400%numeric
subsector_id33464525.11719.80706-1192733413346454100%numeric
category_1NaNNaNNaNNaNNaNNaNNaNNaN334645200%bool
category_4NaNNaNNaNNaNNaNNaNNaNNaN334645200%bool
city_id334645102.925107.093-1-16918234733464527100%numeric
state_id33464511.86166.17629-19916243346452500%numeric
category_23227782.380051.56268112453227785118673.55%numeric

We still see that out of 334,645 rows we have only 334,633 unique merchant_ids. drop_duplicates compares every column before marking a row as a duplicate, so these rows aren't exact duplicates after all; the same merchant_id can appear with a different city, region or another feature (franchises, for example). Let's dig deeper.

merchant_details.loc[merchant_details['merchant_id'].duplicated()]
merchant_idmerchant_group_idmerchant_category_idsubsector_idcategory_1category_4city_idstate_idcategory_2
3393M_ID_bd49e37dda417069221NN51161.0
4182M_ID_ef233cff262879956034NY6991.0
7585M_ID_dbbf07ebf03527837NY17224.0
29465M_ID_30340088f23554429NY6991.0
47804M_ID_645a6af169191408727NN29151.0
166813M_ID_ebbdb42da6353832YY-1-1NaN
210654M_ID_c2b9ac2ea43555425YY-1-1NaN
221181M_ID_992a180b15856855425NY17224.0
228275M_ID_d123532c724909438517YY-1-1NaN
330958M_ID_42697d5d44356901NN27191.0
333904M_ID_6464db3b453521035YY-1-1NaN
334071M_ID_1802942aaf7296330222NN9691.0

Let's fill in the missing values before digging further

merchants['category_2'] = merchants['category_2'].fillna(0).astype(int)
merchants.loc[merchants['city_id'] == -1, 'city_id'] = 0
merchants.loc[merchants['state_id'] == -1, 'state_id'] = 0

We will create a unique address string, which is just a concatenation of the columns that together should identify a merchant location. We will not use merchant_group_id, as it is not present in the transactions table.

merchant_address_id = merchants['merchant_id'].map(str)+ '_' + \
merchants['merchant_category_id'].map(str) + '_'+ \
merchants['subsector_id'].map(str) + '_'+ \
merchants['city_id'].map(str) + '_'+ \
merchants['state_id'].map(str) + '_'+ \
merchants['category_2'].map(str)

merchants['merchant_address_id'] = merchant_address_id
# merchants.drop(['merchant_id', 'merchant_category_id', 'subsector_id',
# 'city_id', 'state_id', 'category_2'], axis=1, inplace=True)

merchants.head().T
01234
merchant_idM_ID_838061e48cM_ID_9339d880adM_ID_e726bbae1eM_ID_a70e9c5f81M_ID_64456c37ce
merchant_group_id8353318444750262228
merchant_category_id792840690792222
subsector_id9201921
numerical_1-0.0574706-0.0574706-0.0574706-0.0574706-0.0574706
numerical_2-0.0574706-0.0574706-0.0574706-0.0574706-0.0574706
category_1NNNYY
most_recent_sales_rangeEEEEE
most_recent_purchases_rangeEEEEE
avg_sales_lag3-0.4-0.72-82.13NaNNaN
avg_purchases_lag39.666671.752601.666670.5
active_months_lag333233
avg_sales_lag6-2.25-0.74-82.13NaNNaN
avg_purchases_lag618.66671.291672604.666670.361111
active_months_lag666266
avg_sales_lag12-2.32-0.57-82.13NaNNaN
avg_purchases_lag1213.91671.68752603.833330.347222
active_months_lag12121221212
category_4NNNYY
city_id24222-1-1-1
state_id9165-1-1
category_2115NaNNaN
merchant_address_idM_ID_838061e48c_792_9_242_9_1.0M_ID_9339d880ad_840_20_22_16_1.0M_ID_e726bbae1e_690_1_-1_5_5.0M_ID_a70e9c5f81_792_9_-1_-1_nanM_ID_64456c37ce_222_21_-1_-1_nan

Now back to the merchants data. We have already pre-processed the merchants dataframe and added some features to it (saved to feather as merchants_df). Let's load it in and merge it with our transactions data.

merchants = feather.read_dataframe('merchants_df')
new_hist_trans = new_hist_trans.merge(merchants, on='merchant_id', how='left')
hist_trans = hist_trans.merge(merchants, on='merchant_id', how='left')
hist_trans.shape, new_hist_trans.shape
hist_trans.to_feather('hist_trans_beta')
new_hist_trans.to_feather('new_hist_trans_beta')
hist_trans = feather.read_dataframe('hist_trans_beta')
new_hist_trans = feather.read_dataframe('new_hist_trans_beta')

After merging the merchants data to our transactions data, let's see the final list of columns.

new_hist_trans.head().T
01234
authorized_flag11111
card_idC_ID_7c628841cbC_ID_25d399500cC_ID_e3542c52f1C_ID_fabd47ca44C_ID_6f9a771d17
city_id69191996996
category_100000
installments11001
category_311001
merchant_category_id8030730780178
merchant_idM_ID_c03b62d83dM_ID_2445d76702M_ID_b16ae63c45M_ID_b35d0757d1M_ID_b6b9b8ed67
month_lag11111
purchase_amount46.4174.01505.5550
purchase_date2017-03-01 03:24:512017-03-01 11:01:062017-03-01 11:27:392017-03-01 11:55:112017-03-01 12:37:26
category_211412
state_id9914924
subsector_id3719193729
purchase_Year20172017201720172017
purchase_Month33333
purchase_Week99999
purchase_Day11111
purchase_Dayofweek22222
purchase_Dayofyear6060606060
purchase_Is_month_endFalseFalseFalseFalseFalse
purchase_Is_month_startTrueTrueTrueTrueTrue
purchase_Is_quarter_endFalseFalseFalseFalseFalse
purchase_Is_quarter_startFalseFalseFalseFalseFalse
purchase_Is_year_endFalseFalseFalseFalseFalse
purchase_Is_year_startFalseFalseFalseFalseFalse
purchase_Hour311111112
purchase_Minute241275537
purchase_Second516391126
purchase_Elapsed14883386911488366066148836765914883693111488371846
purchased_on_weekend00000
purchased_on_weekday11111
month_diff2424242424
purchase_date_successive_diff00000
price46.4174.01infinf550
Christmas_Day_201700000
Mothers_Day_20179494949494
fathers_day_201700000
Children_day_201700000
Valentine_Day_201700000
Black_Friday_201700000
Mothers_Day_201800000
duration1113.64176.24120013213200
amount_month_ratio1.933337.250422.083330.22916722.9167
category_2_mean130.934130.934125.997130.934127.287
category_2_min0.010.010.010.010.01
category_2_max1756261756264000017562622610
category_2_sum1.53191e+081.53191e+082.25018e+071.53191e+088.35804e+06
category_3_mean93.449393.4493102.429102.42993.4493
category_3_min0.010.010.010.010.01
category_3_max48713.648713.6650566505648713.6
category_3_sum7.81402e+077.81402e+071.00192e+081.00192e+087.81402e+07
category_410010
merchant_address_idM_ID_c03b62d83d_80_37_69_9_1M_ID_2445d76702_307_19_0_9_1M_ID_b16ae63c45_307_19_0_14_4M_ID_b35d0757d1_80_37_69_9_1M_ID_b6b9b8ed67_178_29_0_24_2
numerical_range1732200737
merchant_rating1012161512

You can see the features we computed earlier for merchants, like merchant_address_id and numerical_range (a feature interaction between numerical_1 & numerical_2), along with merchant_rating. The other merchant features didn't turn out to be useful when appended to our transactions data. Our transactions data now has 56 features in total.

hist_trans.shape, new_hist_trans.shape

((29112368, 56), (1963031, 56))

Aggregate: Aggregate by card_id

Let's replace the 0s in the transactions' purchase_amount with 0.0001 to avoid numerical issues (the geometric & harmonic mean aggregates below require strictly positive values).

hist_trans.loc[hist_trans['purchase_amount'] == 0, 'purchase_amount'] = 0.0001

Now we aggregate the transactions data grouped by card_id, as we have to calculate a loyalty score per card_id. The following code was pruned after some experimentation, removing aggregates that didn't carry much feature importance. Most of the aggregates are self-explanatory. We also have some special aggregates: purchase_amount statistics computed on data grouped by category.

def aggregate_hist_trans(df):
unique_cols = ['subsector_id', 'merchant_id', 'merchant_category_id', 'merchant_address_id']

col_seas = ['purchase_Month','purchase_Week', 'purchase_Dayofweek', 'purchase_Day', 'purchase_Hour',
'merchant_rating']

aggs = {}
for c in unique_cols:
aggs[c] = ['nunique']
for c in col_seas:
aggs[c] = ['nunique', 'mean', 'min', 'max']

aggs['purchase_amount'] = [('sum', 'sum'), ('pct_75', lambda x: np.percentile(x, q = 75)),
('pct_25', lambda x: np.percentile(x, q = 25)), ('mean', 'mean'),
('median', 'median'), ('max', 'max'), ('min', 'min'), ('var', 'var'),
('skew', 'skew'), ('head_sum', head_sum), ('head_max', head_max),
('tail_sum', tail_sum), ('tail_max', tail_max), ('gmean', scipy.stats.gmean ),
('hmean', scipy.stats.hmean)]
aggs['installments'] = [('sum', 'sum'), ('pct_75', lambda x: np.percentile(x, q = 75)),
('pct_25', lambda x: np.percentile(x, q = 25)), ('mean', 'mean'),
('median', 'median'), ('max', 'max'), ('min', 'min'), ('var', 'var'),
('skew', 'skew'), ('head_sum', head_sum), ('head_max', head_max),
('tail_sum', tail_sum), ('tail_max', tail_max)]
aggs['purchase_date'] = ['max','min']
aggs['month_lag'] = ['max','min','mean','var','skew']
aggs['month_diff'] = ['max','min','mean','var','skew']
aggs['authorized_flag'] = ['mean']
aggs['purchased_on_weekend'] = ['mean'] # overwrite
aggs['purchase_Dayofweek'] = ['mean'] # overwrite
aggs['purchase_Day'] = ['nunique', 'mean', 'min'] # overwrite
aggs['category_1'] = ['mean']
aggs['category_2'] = ['mean']
aggs['category_3'] = ['mean']
aggs['category_4'] = ['mean']
aggs['numerical_range'] = ['mean', 'min', 'max', 'skew']
aggs['card_id'] = ['size','count']
aggs['price'] = ['sum','mean','max','min','var', 'skew']
aggs['Christmas_Day_2017'] = ['mean']
aggs['Mothers_Day_2017'] = ['mean']
aggs['fathers_day_2017'] = ['mean']
aggs['Children_day_2017'] = ['mean']
aggs['Valentine_Day_2017'] = ['mean']
aggs['Black_Friday_2017'] = ['mean']
aggs['Mothers_Day_2018'] = ['mean']
aggs['duration']=['mean','min','max','var','skew']
aggs['amount_month_ratio']=['mean','min','max','var','skew']

#exta
aggs['purchase_date_successive_diff'] = ['mean']

# aggs['purchase_date_successive_diff'] = ['mean', 'median', 'max', 'min', 'var', 'skew']
for col in ['category_2','category_3']:
df[col+'_mean'] = df.groupby([col])['purchase_amount'].transform('mean')
df[col+'_min'] = df.groupby([col])['purchase_amount'].transform('min')
df[col+'_max'] = df.groupby([col])['purchase_amount'].transform('max')
df[col+'_sum'] = df.groupby([col])['purchase_amount'].transform('sum')
aggs[col+'_mean'] = ['mean']

new_df = df.groupby(['card_id']).agg(aggs)
new_df.columns = ['_'.join(col).strip() for col in new_df.columns.values]
new_df.reset_index(inplace=True)
other_df = (df.groupby('card_id')
.size()
.reset_index(name='transactions_count'))

new_df = pd.merge(other_df, new_df, on='card_id', how='left')

new_df['purchase_date_diff'] = (new_df['purchase_date_max'] - new_df['purchase_date_min']).dt.days
new_df['purchase_date_average'] = new_df['purchase_date_diff']/new_df['card_id_size']
new_df['purchase_date_uptonow'] = (datetime.datetime.today() - new_df['purchase_date_max']).dt.days
new_df['purchase_date_uptomin'] = (datetime.datetime.today() - new_df['purchase_date_min']).dt.days
return new_df

We are not calculating the same aggregates for both historical & new transactional data, as some of the features add little value and will be removed during feature selection.

def aggregate_new_trans(df):
unique_cols = ['subsector_id', 'merchant_id', 'merchant_category_id', 'merchant_address_id']

col_seas = ['purchase_Month', 'purchase_Week', 'purchase_Dayofweek','purchase_Day', 'purchase_Hour',
'merchant_rating']

aggs = {}
for c in unique_cols:
aggs[c] = ['nunique']
for c in col_seas:
aggs[c] = ['nunique', 'mean', 'min', 'max']

aggs['purchase_amount'] = [('sum', 'sum'), ('pct_75', lambda x: np.percentile(x, q = 75)),
('pct_25', lambda x: np.percentile(x, q = 25)), ('mean', 'mean'),
('median', 'median'), ('max', 'max'), ('min', 'min'), ('var', 'var'),
('skew', 'skew'), ('head_sum', head_sum), ('head_max', head_max),
('tail_sum', tail_sum), ('tail_max', tail_max), ('gmean', scipy.stats.gmean ),
('hmean', scipy.stats.hmean)]
aggs['installments'] = [('sum', 'sum'), ('pct_75', lambda x: np.percentile(x, q = 75)),
('pct_25', lambda x: np.percentile(x, q = 25)), ('mean', 'mean'),
('median', 'median'), ('max', 'max'), ('min', 'min'), ('var', 'var'),
('skew', 'skew'), ('head_sum', head_sum), ('head_max', head_max),
('tail_sum', tail_sum), ('tail_max', tail_max)]
aggs['purchase_date'] = ['max','min']
aggs['month_lag'] = ['max','min','mean','var','skew']
aggs['month_diff'] = ['mean','var','skew']
aggs['purchased_on_weekend'] = ['mean']
aggs['purchase_Month'] = ['mean', 'min', 'max']
aggs['purchase_Dayofweek'] = ['mean', 'min', 'max']
aggs['category_1'] = ['mean']
aggs['category_2'] = ['mean']
aggs['category_3'] = ['mean']
aggs['category_4'] = ['mean']
aggs['numerical_range'] = ['mean', 'min', 'max', 'skew']
aggs['card_id'] = ['size','count']
aggs['price'] = ['mean','max','min','var']
aggs['Christmas_Day_2017'] = ['mean']
aggs['Children_day_2017'] = ['mean']
aggs['Black_Friday_2017'] = ['mean']
aggs['Mothers_Day_2018'] = ['mean']
aggs['duration']=['mean','min','max','var','skew']
aggs['amount_month_ratio']=['mean','min','max','var','skew']

#extra
aggs['purchase_date_successive_diff'] = ['mean']

# aggs['purchase_date_successive_diff'] = ['mean', 'median', 'max', 'min', 'var', 'skew']
for col in ['category_2','category_3']:
df[col+'_mean'] = df.groupby([col])['purchase_amount'].transform('mean')
df[col+'_min'] = df.groupby([col])['purchase_amount'].transform('min')
df[col+'_max'] = df.groupby([col])['purchase_amount'].transform('max')
df[col+'_sum'] = df.groupby([col])['purchase_amount'].transform('sum')
aggs[col+'_mean'] = ['mean']

new_df = df.groupby(['card_id']).agg(aggs)
new_df.columns = ['_'.join(col).strip() for col in new_df.columns.values]
new_df.reset_index(inplace=True)
other_df = (df.groupby('card_id')
.size()
.reset_index(name='transactions_count'))

new_df = pd.merge(other_df, new_df, on='card_id', how='left')

new_df['purchase_date_diff'] = (new_df['purchase_date_max'] - new_df['purchase_date_min']).dt.days
new_df['purchase_date_average'] = new_df['purchase_date_diff']/new_df['card_id_size']
new_df['purchase_date_uptonow'] = (datetime.datetime.today() - new_df['purchase_date_max']).dt.days
new_df['purchase_date_uptomin'] = (datetime.datetime.today() - new_df['purchase_date_min']).dt.days
return new_df
%time hist_trans_agg = aggregate_hist_trans(hist_trans)

CPU times: user 14min 28s, sys: 53.6 s, total: 15min 22s
Wall time: 14min 13s

%time new_hist_trans_agg = aggregate_new_trans(new_hist_trans)

CPU times: user 8min 44s, sys: 1.25 s, total: 8min 46s
Wall time: 8min 16s

Feature: Add extra interpreted columns on aggregates

Adding extra features on top of the aggregates. I know it's overwhelming but these are important. card_id_size is the number of transactions done by the card. Most of the features are self-explanatory.

def add_extra_cols_on_agg(df):
    df['inverse_avg_transactions_per_day'] = df['purchase_date_diff'] / df['card_id_size']
    df['repurchase_merchant_rate'] = df['transactions_count'] / df['merchant_id_nunique']
    df['merchant_category_repurchase'] = df['merchant_category_id_nunique'] / df['merchant_id_nunique']
    df['avg_spend_per_merchant'] = df['purchase_amount_sum'] / df['merchant_id_nunique']
    df['avg_trans_per_merchant'] = df['transactions_count'] / df['merchant_id_nunique']
    df['avg_spend_per_transaction'] = df['purchase_amount_sum'] / df['transactions_count']
    return df

hist_trans_agg, new_hist_trans_agg = [add_extra_cols_on_agg(df) for df in [hist_trans_agg, new_hist_trans_agg]]
hist_trans_agg.to_feather('hist_trans_agg_beta')
new_hist_trans_agg.to_feather('new_hist_trans_agg_beta')
hist_trans_agg = feather.read_dataframe('hist_trans_agg_beta')
new_hist_trans_agg = feather.read_dataframe('new_hist_trans_agg_beta')

We now have 114 and 108 features for the old & new transaction aggregates respectively. We will be adding some more features on top of these.

hist_trans_agg.shape, new_hist_trans_agg.shape


((325540, 114), (290001, 108))

Aggregate: Aggregate on categories

Some more aggregates on the categories. Pivot tables are a common approach for calculating aggregates grouped by more than one condition. Here we delve deeper into the finer per-category spend aggregates: category_1 has a cardinality of 2 whereas category_2 has a cardinality of 5, and we calculate aggregates grouped by each possible value of the category. We will later see that these aggregates don't add much feature importance, but this was a good exercise for me to flex my pandas skills :D

def agg_on_cat(df, category, feature):
    temp_df = df.pivot_table(index='card_id', columns=category, aggfunc={feature: ['sum', 'mean']})
    cols = [category + '_{0[2]}_{0[0]}_{0[1]}'.format(col) for col in temp_df.columns.tolist()]
    temp_df.columns = cols
    return temp_df

def get_cat_agg(df):
    agg_df = agg_on_cat(df, 'category_1', 'purchase_amount')
    agg_df = pd.merge(agg_df, agg_on_cat(df, 'category_2', 'purchase_amount'), on='card_id', how='left')
    agg_df = pd.merge(agg_df, agg_on_cat(df, 'category_3', 'purchase_amount'), on='card_id', how='left')
    agg_df = pd.merge(agg_df, agg_on_cat(df, 'authorized_flag', 'purchase_amount'), on='card_id', how='left')
    return agg_df
%time hist_trans_agg_cat, new_hist_trans_agg_cat = [get_cat_agg(df) for df in [hist_trans, new_hist_trans]]
hist_trans_agg_cat.shape, new_hist_trans_agg_cat.shape

Let's have a look at our newly calculated aggregates

new_hist_trans_agg_cat.head().T
card_id                                  C_ID_00007093c1  C_ID_0001238066  C_ID_0001506ef0  C_ID_0001793786  C_ID_000183fdda
category_1_0_purchase_amount_mean        -0.664262        -0.564558        -0.723677        -0.007407        -0.599162
category_1_1_purchase_amount_mean        NaN              -0.650332        NaN              NaN              NaN
category_1_0_purchase_amount_sum         -1.328524        -13.549391       -1.447354        -0.229620        -6.590778
category_1_1_purchase_amount_sum         NaN              -1.300665        NaN              NaN              NaN
category_2_1.0_purchase_amount_mean      -0.656749        -0.580966        NaN              0.139747         NaN
category_2_2.0_purchase_amount_mean      NaN              NaN              NaN              -0.344766        NaN
category_2_3.0_purchase_amount_mean      -0.671775        NaN              -0.723677        0.102887         -0.599162
category_2_4.0_purchase_amount_mean      NaN              NaN              NaN              NaN              NaN
category_2_5.0_purchase_amount_mean      NaN              -0.495945        NaN              -0.361628        NaN
category_2_1.0_purchase_amount_sum       -0.656749        -13.362220       NaN              2.375707         NaN
category_2_2.0_purchase_amount_sum       NaN              NaN              NaN              -2.758131        NaN
category_2_3.0_purchase_amount_sum       -0.671775        NaN              -1.447354        0.514433         -6.590778
category_2_4.0_purchase_amount_sum       NaN              NaN              NaN              NaN              NaN
category_2_5.0_purchase_amount_sum       NaN              -1.487835        NaN              -0.361628        NaN
category_3_0_purchase_amount_mean        NaN              -0.152008        -0.723677        -0.007407        -0.107680
category_3_1_purchase_amount_mean        -0.664262        -0.625781        NaN              NaN              -0.696173
category_3_2_purchase_amount_mean        NaN              -0.389160        NaN              NaN              -0.576515
category_3_0_purchase_amount_sum         NaN              -0.152008        -1.447354        -0.229620        -0.107680
category_3_1_purchase_amount_sum         -1.328524        -13.141406       NaN              NaN              -4.177040
category_3_2_purchase_amount_sum         NaN              -1.556641        NaN              NaN              -2.306059
authorized_flag_1_purchase_amount_mean   -0.664262        -0.571156        -0.723677        -0.007407        -0.599162
authorized_flag_1_purchase_amount_sum    -1.328524        -14.850055       -1.447354        -0.229620        -6.590778
hist_trans_agg_cat.reset_index().to_feather('hist_trans_agg_cat')
new_hist_trans_agg_cat.reset_index().to_feather('new_hist_trans_agg_cat')
hist_trans_agg_cat = feather.read_dataframe('hist_trans_agg_cat')
new_hist_trans_agg_cat = feather.read_dataframe('new_hist_trans_agg_cat')

Aggregate: Aggregate on month

More aggregates, this time grouped by month: month_diff is the number of months since the reference date, and since we want to add more weight to recent transactions (i.e., new customers), we calculate aggregates over customer spending per month.

def aggregate_per_month(history):
    grouped = history.groupby(['card_id', 'month_diff'])

    agg_func = {
        'purchase_amount': ['count', 'sum', 'max', 'mean'],
    }

    intermediate_group = grouped.agg(agg_func)
    intermediate_group.columns = ['_month_diff_'.join(col).strip() for col in intermediate_group.columns.values]
    intermediate_group.reset_index(inplace=True)

    return intermediate_group
# aggregate_per_month(new_hist_trans)
%time hist_trans_agg_month, new_hist_trans_agg_month = [aggregate_per_month(df) for df in [hist_trans, new_hist_trans]]

CPU times: user 3min 36s, sys: 12.5 s, total: 3min 49s
Wall time: 3min 29s
hist_trans_agg_month.shape, new_hist_trans_agg_month.shape
((618851, 6), (431268, 6))

The new aggregates grouped by month & purchase amount for each card are:

new_hist_trans_agg_month.head().T
                                  0                1                2                3                4
card_id                           C_ID_00007093c1  C_ID_0001238066  C_ID_0001238066  C_ID_0001506ef0  C_ID_0001793786
month_diff                        12               11               12               12               15
purchase_amount_month_diff_count  2                5                21               2                18
purchase_amount_month_diff_sum    110              857.42           2183.57          30.92            12340.8
purchase_amount_month_diff_max    60               250              444.94           21               2580
purchase_amount_month_diff_mean   55               171.484          103.98           15.46            685.599

Feature: Reverse engineering observed date aka reference date

Let's load the train & test datasets to dataframes.

PATH = 'data/elo/'
train, test = [pd.read_csv(f'{PATH}{c}') for c in ['train.csv', 'test.csv']]
train.head().T
                    0                1                2                3                4
first_active_month  2017-06          2017-01          2016-08          2017-09          2017-11
card_id             C_ID_92a2005557  C_ID_3d0044924f  C_ID_d639edf6cd  C_ID_186d6a6901  C_ID_cdbd2c0db2
feature_1           5                4                2                4                1
feature_2           2                1                2                3                3
feature_3           1                0                0                0                0
target              -0.820283        0.392913         0.688056         0.142495         -0.159749
test.head().T
                    0                1                2                3                4
first_active_month  2017-04          2017-01          2017-08          2017-12          2015-12
card_id             C_ID_0ab67a22ab  C_ID_130fd0cbdd  C_ID_b709037bc5  C_ID_d27d835a9f  C_ID_2b5e3df5c2
feature_1           3                2                5                2                5
feature_2           3                3                1                1                1
feature_3           1                0                1                0                1

Most of the historical transactions tail off at month end & the new transactions dataset starts at month start. Let's start with the assumption that the observed date falls at the start of a month.

We will calculate each customer's latest spending in the historical transactions, & the earliest (first month's) spending in the new transactions.

last_hist_transaction = hist_trans.groupby('card_id').agg({'month_lag' : 'max', 'purchase_date' : 'max'}).reset_index()
last_hist_transaction.columns = ['card_id', 'hist_month_lag', 'hist_purchase_date']
first_new_transaction = new_hist_trans.groupby('card_id').agg({'month_lag' : 'min', 'purchase_date' : 'min'}).reset_index()
first_new_transaction.columns = ['card_id', 'new_month_lag', 'new_purchase_date']
last_hist_transaction['hist_purchase_date'] = pd.to_datetime(last_hist_transaction['hist_purchase_date'])
first_new_transaction['new_purchase_date'] = pd.to_datetime(first_new_transaction['new_purchase_date'])
last_hist_transaction.head()
   card_id          hist_month_lag  hist_purchase_date
0  C_ID_00007093c1  0               2018-02-27 05:14:57
1  C_ID_0001238066  0               2018-02-27 16:18:59
2  C_ID_0001506ef0  0               2018-02-17 12:33:56
3  C_ID_0001793786  0               2017-10-31 20:20:18
4  C_ID_000183fdda  0               2018-02-25 20:57:08
first_new_transaction.head()
   card_id          new_month_lag  new_purchase_date
0  C_ID_00007093c1  2              2018-04-03 11:13:35
1  C_ID_0001238066  1              2018-03-01 16:48:27
2  C_ID_0001506ef0  1              2018-03-16 22:21:58
3  C_ID_0001793786  1              2017-11-15 15:44:20
4  C_ID_000183fdda  1              2018-03-02 12:26:26

Our suspicion might be right. We already know that month_lag is the number of months from the reference date, so let's add observation_date by offsetting each card's last historical (and first new) transaction by its month_lag.

last_hist_transaction['observation_date'] = \
last_hist_transaction.apply(lambda x: x['hist_purchase_date'] - pd.DateOffset(months=x['hist_month_lag']), axis=1)

first_new_transaction['observation_date'] = \
first_new_transaction.apply(lambda x: x['new_purchase_date'] - pd.DateOffset(months=x['new_month_lag']-1), axis=1)
last_hist_transaction.head(5)
   card_id          hist_month_lag  hist_purchase_date   observation_date
0  C_ID_00007093c1  0               2018-02-27 05:14:57  2018-02-27 05:14:57
1  C_ID_0001238066  0               2018-02-27 16:18:59  2018-02-27 16:18:59
2  C_ID_0001506ef0  0               2018-02-17 12:33:56  2018-02-17 12:33:56
3  C_ID_0001793786  0               2017-10-31 20:20:18  2017-10-31 20:20:18
4  C_ID_000183fdda  0               2018-02-25 20:57:08  2018-02-25 20:57:08
first_new_transaction.head()
   card_id          new_month_lag  new_purchase_date    observation_date
0  C_ID_00007093c1  2              2018-04-03 11:13:35  2018-03-03 11:13:35
1  C_ID_0001238066  1              2018-03-01 16:48:27  2018-03-01 16:48:27
2  C_ID_0001506ef0  1              2018-03-16 22:21:58  2018-03-16 22:21:58
3  C_ID_0001793786  1              2017-11-15 15:44:20  2017-11-15 15:44:20
4  C_ID_000183fdda  1              2018-03-02 12:26:26  2018-03-02 12:26:26

Let's round off the observation_date to month and remove the time info.

last_hist_transaction['observation_date'] = last_hist_transaction['observation_date'].dt.to_period('M').dt.to_timestamp() + pd.DateOffset(months=1)
first_new_transaction['observation_date'] = first_new_transaction['observation_date'].dt.to_period('M').dt.to_timestamp()
last_hist_transaction.head()

After rounding off our data looks like this:

   card_id          hist_month_lag  hist_purchase_date   observation_date
0  C_ID_00007093c1  0               2018-02-27 05:14:57  2018-03-01
1  C_ID_0001238066  0               2018-02-27 16:18:59  2018-03-01
2  C_ID_0001506ef0  0               2018-02-17 12:33:56  2018-03-01
3  C_ID_0001793786  0               2017-10-31 20:20:18  2017-11-01
4  C_ID_000183fdda  0               2018-02-25 20:57:08  2018-03-01
first_new_transaction.head()
   card_id          new_month_lag  new_purchase_date    observation_date
0  C_ID_00007093c1  2              2018-04-03 11:13:35  2018-03-01
1  C_ID_0001238066  1              2018-03-01 16:48:27  2018-03-01
2  C_ID_0001506ef0  1              2018-03-16 22:21:58  2018-03-01
3  C_ID_0001793786  1              2017-11-15 15:44:20  2017-11-01
4  C_ID_000183fdda  1              2018-03-02 12:26:26  2018-03-01

Let's validate our calculation by comparing the observation_date derived from the historical transactions with the one derived from the new transactions; they should match for every card_id.

first_new_transaction.drop(['new_month_lag', 'new_purchase_date'], axis=1, inplace=True)
last_hist_transaction.drop(['hist_month_lag', 'hist_purchase_date'], axis=1, inplace=True)
validate = last_hist_transaction.merge(first_new_transaction, on = 'card_id')
all(validate['observation_date_x'] == validate['observation_date_y'])

True

They are indeed the same! We will redo some of our features based on observation_date later.
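To give a flavour of that redo, here is a minimal sketch (my own illustration, not the notebook's exact code) of recomputing the recency features against each card's observation_date instead of datetime.datetime.today(), which otherwise changes every time the notebook is run:

# measure recency against each card's own reference date rather than "today"
hist_trans_agg = hist_trans_agg.merge(last_hist_transaction, on='card_id', how='left')
hist_trans_agg['purchase_date_uptonow'] = (hist_trans_agg['observation_date'] - hist_trans_agg['purchase_date_max']).dt.days
hist_trans_agg['purchase_date_uptomin'] = (hist_trans_agg['observation_date'] - hist_trans_agg['purchase_date_min']).dt.days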

Aggregates: Merge train & test with new & old transactions history

hist_trans_agg.shape, hist_trans_agg_cat.shape, last_hist_transaction.shape, hist_trans_agg_month.shape

((325540, 114), (325540, 25), (325540, 2), (618851, 6))

Merging all our aggregates & features with train & test data frames

def join_dfs(left, right, left_on, right_on=None, suffix='_old'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on, suffixes=("", suffix))

Our train & test datasets shape:

train.shape, test.shape
((201917, 6), (123623, 5))

Let's go ahead and join all our aggregates on the train & test datasets.

train_df = join_dfs(train, new_hist_trans_agg, left_on='card_id')
train_df = join_dfs(train_df, hist_trans_agg, left_on='card_id', suffix='_old')
# train_df = join_dfs(train_df, hist_trans_agg_month, left_on='card_id', suffix='_old')
train_df = join_dfs(train_df, hist_trans_agg_cat, left_on='card_id', suffix='_old')
train_df = join_dfs(train_df, last_hist_transaction, left_on='card_id', suffix='_old')
# train_df = join_dfs(train_df, new_hist_trans_agg_month, left_on='card_id')
train_df = join_dfs(train_df, new_hist_trans_agg_cat, left_on='card_id')
train_df = join_dfs(train_df, first_new_transaction, left_on='card_id')
test_df = join_dfs(test, new_hist_trans_agg, left_on='card_id')
test_df = join_dfs(test_df, first_new_transaction, left_on='card_id')
# test_df = join_dfs(test_df, new_hist_trans_agg_month, left_on='card_id')
test_df = join_dfs(test_df, new_hist_trans_agg_cat, left_on='card_id')
# test_df = join_dfs(test_df, hist_trans_agg_month, left_on='card_id', suffix='_old')
test_df = join_dfs(test_df, hist_trans_agg_cat, left_on='card_id', suffix='_old')
test_df = join_dfs(test_df, hist_trans_agg, left_on='card_id', suffix='_old')
test_df = join_dfs(test_df, last_hist_transaction, left_on='card_id', suffix='_old')
test_df.shape, train_df.shape

((123623, 273), (201917, 274))

Feature: Adding features based on observed_date

Adding feature interactions between the card's feature_1, feature_2 & feature_3 and the time elapsed between first_active_month and the observed date aka reference date.

def add_days_feature_interaction(df):
    # to datetime
    df['first_active_month'] = pd.to_datetime(df['first_active_month'])
    df['observation_date_old'] = pd.to_datetime(df['observation_date_old'])
    # datetime features
    df['quarter'] = df['first_active_month'].dt.quarter
    # df['elapsed_time'] = (datetime.datetime.today() - df['first_active_month']).dt.days
    df['observed_elapsed_time'] = (df['observation_date_old'] - df['first_active_month']).dt.days
    df['days_feature1'] = df['observed_elapsed_time'] * df['feature_1']
    df['days_feature2'] = df['observed_elapsed_time'] * df['feature_2']
    df['days_feature3'] = df['observed_elapsed_time'] * df['feature_3']

    df['days_feature1_ratio'] = df['feature_1'] / df['observed_elapsed_time']
    df['days_feature2_ratio'] = df['feature_2'] / df['observed_elapsed_time']
    df['days_feature3_ratio'] = df['feature_3'] / df['observed_elapsed_time']

    df['feature_sum'] = df['feature_1'] + df['feature_2'] + df['feature_3']
    df['feature_mean'] = df['feature_sum'] / 3
    df['feature_max'] = df[['feature_1', 'feature_2', 'feature_3']].max(axis=1)
    df['feature_min'] = df[['feature_1', 'feature_2', 'feature_3']].min(axis=1)
    df['feature_var'] = df[['feature_1', 'feature_2', 'feature_3']].std(axis=1)
    return df
train_df, test_df = [add_days_feature_interaction(d) for d in [train_df, test_df]]

Feature: Add features based on old & new transactions

train_df.shape, test_df.shape
((201917, 286), (123623, 285))

We now have around 285 features after adding all our aggregates. Can we squeeze in more? Let's add some feature interactions between the old & new transactions data. I also added new features like CLV (my own Customer Lifetime Value formula), which is basically a feature interaction between the number of transactions, the customer's age on the books & the sum of their transactions.

for df in [train_df, test_df]:
    df['card_id_total'] = df['card_id_size']+df['card_id_size_old']
    df['card_id_count_total'] = df['card_id_count']+df['card_id_count_old']
    df['card_id_count_ratio'] = df['card_id_count'] / df['card_id_count_old']
    df['purchase_amount_total'] = df['purchase_amount_sum_old']+df['purchase_amount_sum']
    df['purchase_amount_total_mean'] = df['purchase_amount_mean']+df['purchase_amount_mean_old']
    df['purchase_amount_total_max'] = df['purchase_amount_max']+df['purchase_amount_max_old']
    df['purchase_amount_total_min'] = df['purchase_amount_min']+df['purchase_amount_min_old']
    df['purchase_amount_sum_ratio'] = df['purchase_amount_sum'] / df['purchase_amount_sum_old']
    df['hist_first_buy'] = (df['purchase_date_min_old'] - df['first_active_month']).dt.days
    df['new_first_buy'] = (df['purchase_date_min'] - df['first_active_month']).dt.days
    df['hist_last_buy'] = (df['purchase_date_max_old'] - df['first_active_month']).dt.days
    df['new_last_buy'] = (df['purchase_date_max'] - df['first_active_month']).dt.days
    df['avg_spend_per_transaction'] = df['purchase_amount_total']/df['card_id_total']
    df['purchased_before_issue'] = df['hist_first_buy'] < 0
    df['month_diff_mean_total'] = df['month_diff_mean']+df['month_diff_mean_old']
    df['month_diff_ratio'] = df['month_diff_mean']/df['month_diff_mean_old']
    df['month_lag_mean_total'] = df['month_lag_mean']+df['month_lag_mean_old']
    df['month_lag_max_total'] = df['month_lag_max']+df['month_lag_max_old']
    df['month_lag_min_total'] = df['month_lag_min']+df['month_lag_min_old']
    df['category_1_mean_total'] = df['category_1_mean']+df['category_1_mean_old']
    df['category_4_mean_total'] = df['category_4_mean']+df['category_4_mean_old']
    df['category_4_mean_ratio'] = df['category_4_mean']/df['category_4_mean_old']
    df['category_1_mean_ratio'] = df['category_1_mean']/df['category_1_mean_old']
    df['numerical_range_mean_total'] = df['numerical_range_mean']+df['numerical_range_mean_old']
    df['numerical_range_mean_ratio'] = df['numerical_range_mean']/df['numerical_range_mean_old']
    df['merchant_rating_mean_ratio'] = df['merchant_rating_mean']/df['merchant_rating_mean_old']
    df['installments_total'] = df['installments_sum']+df['installments_sum_old']
    df['installments_mean_total'] = df['installments_mean']+df['installments_mean_old']
    df['installments_max_total'] = df['installments_max']+df['installments_max_old']
    df['installments_ratio'] = df['installments_sum']/df['installments_sum_old']
    df['price_total'] = df['purchase_amount_total'] / df['installments_total']
    df['price_mean'] = df['purchase_amount_mean'] / df['installments_mean']
    df['price_max'] = df['purchase_amount_max'] / df['installments_max']
    df['duration_mean_total'] = df['duration_mean']+df['duration_mean_old']
    df['duration_min_total'] = df['duration_min']+df['duration_min_old']
    df['duration_max_total'] = df['duration_max']+df['duration_max_old']
    df['amount_month_ratio_mean_total'] = df['amount_month_ratio_mean']+df['amount_month_ratio_mean_old']
    df['amount_month_ratio_min_total'] = df['amount_month_ratio_min']+df['amount_month_ratio_min_old']
    df['amount_month_ratio_max_total'] = df['amount_month_ratio_max']+df['amount_month_ratio_max_old']
    df['CLV'] = df['card_id_count'] * df['purchase_amount_sum'] / df['month_diff_mean']
    df['CLV_old'] = df['card_id_count_old'] * df['purchase_amount_sum_old'] / df['month_diff_mean_old']
    df['CLV_ratio'] = df['CLV'] / df['CLV_old']
    df['category_1_0_purchase_amount_mean_total'] = df['category_1_0_purchase_amount_mean'] + df['category_1_0_purchase_amount_mean_old']
    df['category_1_1_purchase_amount_mean_total'] = df['category_1_1_purchase_amount_mean'] + df['category_1_1_purchase_amount_mean_old']
    df['category_1_0_purchase_amount_sum_total'] = df['category_1_0_purchase_amount_sum'] + df['category_1_0_purchase_amount_sum_old']
    df['category_1_1_purchase_amount_sum_total'] = df['category_1_1_purchase_amount_sum'] + df['category_1_1_purchase_amount_sum_old']
    df['category_2_1.0_purchase_amount_mean_total'] = df['category_2_1.0_purchase_amount_mean'] + df['category_2_1.0_purchase_amount_mean_old']
    df['category_2_2.0_purchase_amount_mean_total'] = df['category_2_2.0_purchase_amount_mean'] + df['category_2_2.0_purchase_amount_mean_old']
    df['category_2_3.0_purchase_amount_mean_total'] = df['category_2_3.0_purchase_amount_mean'] + df['category_2_3.0_purchase_amount_mean_old']
    df['category_2_4.0_purchase_amount_mean_total'] = df['category_2_4.0_purchase_amount_mean'] + df['category_2_4.0_purchase_amount_mean_old']
    df['category_2_5.0_purchase_amount_mean_total'] = df['category_2_5.0_purchase_amount_mean'] + df['category_2_5.0_purchase_amount_mean_old']
    df['category_2_1.0_purchase_amount_sum_total'] = df['category_2_1.0_purchase_amount_sum'] + df['category_2_1.0_purchase_amount_sum_old']
    df['category_2_2.0_purchase_amount_sum_total'] = df['category_2_2.0_purchase_amount_sum'] + df['category_2_2.0_purchase_amount_sum_old']
    df['category_2_3.0_purchase_amount_sum_total'] = df['category_2_3.0_purchase_amount_sum'] + df['category_2_3.0_purchase_amount_sum_old']
    df['category_2_4.0_purchase_amount_sum_total'] = df['category_2_4.0_purchase_amount_sum'] + df['category_2_4.0_purchase_amount_sum_old']
    df['category_2_5.0_purchase_amount_sum_total'] = df['category_2_5.0_purchase_amount_sum'] + df['category_2_5.0_purchase_amount_sum_old']
    df['category_3_0_purchase_amount_mean_total'] = df['category_3_0_purchase_amount_mean'] + df['category_3_0_purchase_amount_mean_old']
    df['category_3_1_purchase_amount_mean_total'] = df['category_3_1_purchase_amount_mean'] + df['category_3_1_purchase_amount_mean_old']
    df['category_3_2_purchase_amount_mean_total'] = df['category_3_2_purchase_amount_mean'] + df['category_3_2_purchase_amount_mean_old']
    df['category_3_0_purchase_amount_sum_total'] = df['category_3_0_purchase_amount_sum'] + df['category_3_0_purchase_amount_sum_old']
    df['category_3_1_purchase_amount_sum_total'] = df['category_3_1_purchase_amount_sum'] + df['category_3_1_purchase_amount_sum_old']
    df['category_3_2_purchase_amount_sum_total'] = df['category_3_2_purchase_amount_sum'] + df['category_3_2_purchase_amount_sum_old']
    df['category_1_0_purchase_amount_mean_ratio'] = df['category_1_0_purchase_amount_mean'] / df['category_1_0_purchase_amount_mean_old']
    df['category_1_1_purchase_amount_mean_ratio'] = df['category_1_1_purchase_amount_mean'] / df['category_1_1_purchase_amount_mean_old']
    df['category_1_0_purchase_amount_sum_ratio'] = df['category_1_0_purchase_amount_sum'] / df['category_1_0_purchase_amount_sum_old']
    df['category_1_1_purchase_amount_sum_ratio'] = df['category_1_1_purchase_amount_sum'] / df['category_1_1_purchase_amount_sum_old']
    df['category_2_1.0_purchase_amount_mean_ratio'] = df['category_2_1.0_purchase_amount_mean'] / df['category_2_1.0_purchase_amount_mean_old']
    df['category_2_2.0_purchase_amount_mean_ratio'] = df['category_2_2.0_purchase_amount_mean'] / df['category_2_2.0_purchase_amount_mean_old']
    df['category_2_3.0_purchase_amount_mean_ratio'] = df['category_2_3.0_purchase_amount_mean'] / df['category_2_3.0_purchase_amount_mean_old']
    df['category_2_4.0_purchase_amount_mean_ratio'] = df['category_2_4.0_purchase_amount_mean'] / df['category_2_4.0_purchase_amount_mean_old']
    df['category_2_5.0_purchase_amount_mean_ratio'] = df['category_2_5.0_purchase_amount_mean'] / df['category_2_5.0_purchase_amount_mean_old']
    df['category_2_1.0_purchase_amount_sum_ratio'] = df['category_2_1.0_purchase_amount_sum'] / df['category_2_1.0_purchase_amount_sum_old']
    df['category_2_2.0_purchase_amount_sum_ratio'] = df['category_2_2.0_purchase_amount_sum'] / df['category_2_2.0_purchase_amount_sum_old']
    df['category_2_3.0_purchase_amount_sum_ratio'] = df['category_2_3.0_purchase_amount_sum'] / df['category_2_3.0_purchase_amount_sum_old']
    df['category_2_4.0_purchase_amount_sum_ratio'] = df['category_2_4.0_purchase_amount_sum'] / df['category_2_4.0_purchase_amount_sum_old']
    df['category_2_5.0_purchase_amount_sum_ratio'] = df['category_2_5.0_purchase_amount_sum'] / df['category_2_5.0_purchase_amount_sum_old']
    df['category_3_0_purchase_amount_mean_ratio'] = df['category_3_0_purchase_amount_mean'] / df['category_3_0_purchase_amount_mean_old']
    df['category_3_1_purchase_amount_mean_ratio'] = df['category_3_1_purchase_amount_mean'] / df['category_3_1_purchase_amount_mean_old']
    df['category_3_2_purchase_amount_mean_ratio'] = df['category_3_2_purchase_amount_mean'] / df['category_3_2_purchase_amount_mean_old']
    df['category_3_0_purchase_amount_sum_ratio'] = df['category_3_0_purchase_amount_sum'] / df['category_3_0_purchase_amount_sum_old']
    df['category_3_1_purchase_amount_sum_ratio'] = df['category_3_1_purchase_amount_sum'] / df['category_3_1_purchase_amount_sum_old']
    df['category_3_2_purchase_amount_sum_ratio'] = df['category_3_2_purchase_amount_sum'] / df['category_3_2_purchase_amount_sum_old']
    df['purchase_amount_sum_total'] = df['purchase_amount_sum'] + df['purchase_amount_sum_old']
    df['purchase_amount_sum_ratio'] = df['purchase_amount_sum'] / df['purchase_amount_sum_old']
    df['purchase_amount_pct_75_ratio'] = df['purchase_amount_pct_75'] / df['purchase_amount_pct_75_old']
    df['purchase_amount_pct_25_ratio'] = df['purchase_amount_pct_25'] / df['purchase_amount_pct_25_old']
    df['purchase_amount_inter_quartile'] = df['purchase_amount_pct_75'] - df['purchase_amount_pct_25']
    df['purchase_amount_inter_quartile_old'] = df['purchase_amount_pct_75_old'] - df['purchase_amount_pct_25_old']
    df['purchase_amount_inter_quartile_ratio'] = df['purchase_amount_inter_quartile'] / df['purchase_amount_inter_quartile_old']
    df['purchase_amount_mean_total'] = df['purchase_amount_mean'] + df['purchase_amount_mean_old']
    df['purchase_amount_mean_ratio'] = df['purchase_amount_mean'] / df['purchase_amount_mean_old']
    df['purchase_amount_median_ratio'] = df['purchase_amount_median'] / df['purchase_amount_median_old']
    df['purchase_amount_max_total'] = df['purchase_amount_max'] + df['purchase_amount_max_old']
    df['purchase_amount_min_total'] = df['purchase_amount_min'] + df['purchase_amount_min_old']
    df['purchase_amount_skew_ratio'] = df['purchase_amount_skew'] / df['purchase_amount_skew_old']
    df['purchase_amount_before_after_ratio'] = df['purchase_amount_head_sum'] / df['purchase_amount_tail_sum_old']
    df['purchase_amount_first_last_diff'] = df['purchase_amount_tail_sum'] - df['purchase_amount_head_max_old']
    df['purchase_amount_fi_last_old_total'] = df['purchase_amount_tail_sum'] - df['purchase_amount_head_sum']
    df['purchase_amount_fi_last_new_total'] = df['purchase_amount_tail_sum_old'] - df['purchase_amount_head_max_old']
    df['purchase_amount_gmean_total'] = df['purchase_amount_gmean'] + df['purchase_amount_gmean_old']
    df['purchase_amount_hmean_total'] = df['purchase_amount_hmean'] + df['purchase_amount_hmean_old']
    df['purchase_amount_gmean_ratio'] = df['purchase_amount_gmean'] / df['purchase_amount_gmean_old']
    df['purchase_amount_hmean_ratio'] = df['purchase_amount_hmean'] / df['purchase_amount_hmean_old']

Feature: Redo some date features with observed time permalink

Initially we calculated purchase_date_uptonow and purchase_date_uptomin as the age of the customer's latest and earliest transactions relative to today. That is a moving target: the values change as time passes. Let's calculate them against a fixed date instead, namely our observation_date, a.k.a. the reference date. This gave me a small boost on the leaderboard.

for df in [train_df, test_df]:
    df['purchase_date_uptonow'] = (df['observation_date_old'] - df['purchase_date_max']).dt.days
    df['purchase_date_uptomin'] = (df['observation_date_old'] - df['purchase_date_min']).dt.days
    df['purchase_date_uptonow_old'] = (df['observation_date_old'] - df['purchase_date_max_old']).dt.days
    df['purchase_date_uptomin_old'] = (df['observation_date_old'] - df['purchase_date_min_old']).dt.days
    # df.drop(['days_since_last_transaction', 'days_since_last_transaction_old'], inplace=True, axis=1)
train_df.shape, test_df.shape
((201917, 386), (123623, 384))

Feature: Mark the outliers permalink

Let's add an additional feature which flags the outliers. A histogram of the target distribution shows a small cluster of extreme values far below the rest; we flag everything below -30 as an outlier.
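The histogram can be reproduced with a few lines of matplotlib; a minimal sketch (the original figure is not reproduced here):

# Histogram of the loyalty target; the cluster of extreme negative values
# (the outliers) shows up as a small bar far to the left.
import matplotlib.pyplot as plt

train_df['target'].hist(bins=100)
plt.xlabel('target (loyalty score)')
plt.ylabel('number of card_ids')
plt.show()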

train_df['outliers'] = 0
train_df.loc[train_df['target'] < -30, 'outliers'] = 1
train_df['outliers'].value_counts()

0 199710
1 2207
Name: outliers, dtype: int64

We have 2,207 outliers in our dataset, around 1.1% of it. The target distribution is very skewed, and the outliers penalise our RMSE metric heavily because they are so far away from the non-outliers. It's important to retain them in our dataset rather than delete them. Later we will see how to build models with and without outliers and optimise both to improve our prediction accuracy.

Feature: Redo features based on new purchase amount permalink


new_hist_trans_purchase_aggs = new_hist_trans.groupby('card_id').agg({
    'purchase_amount_new': [('sum', 'sum'), ('pct_75', lambda x: np.percentile(x, q=75)),
                            ('pct_25', lambda x: np.percentile(x, q=25)), ('mean', 'mean'),
                            ('median', 'median'), ('max', 'max'), ('min', 'min'), ('var', 'var'),
                            ('skew', 'skew'), ('head_sum', head_sum), ('head_max', head_max),
                            ('tail_sum', tail_sum), ('tail_max', tail_max), ('gmean', scipy.stats.gmean),
                            ('hmean', scipy.stats.hmean)]
})
new_hist_trans_purchase_aggs.columns = ['_'.join(c).strip() for c in new_hist_trans_purchase_aggs.columns.values]
# replace zero amounts with a tiny positive value so the geometric & harmonic means stay defined
hist_trans.loc[hist_trans['purchase_amount_new'] == 0, 'purchase_amount_new'] = 0.0001
hist_trans_purchase_aggs = hist_trans.groupby('card_id').agg({
    'purchase_amount_new': [('sum', 'sum'), ('pct_75', lambda x: np.percentile(x, q=75)),
                            ('pct_25', lambda x: np.percentile(x, q=25)), ('mean', 'mean'),
                            ('median', 'median'), ('max', 'max'), ('min', 'min'), ('var', 'var'),
                            ('skew', 'skew'), ('head_sum', head_sum), ('head_max', head_max),
                            ('tail_sum', tail_sum), ('tail_max', tail_max), ('gmean', scipy.stats.gmean),
                            ('hmean', scipy.stats.hmean)]
})
hist_trans_purchase_aggs.columns = ['_'.join(c).strip() for c in hist_trans_purchase_aggs.columns.values]
hist_trans_purchase_aggs.reset_index().to_feather('hist_trans_purchase_aggs_alpha')
new_hist_trans_purchase_aggs.reset_index().to_feather('new_hist_trans_purchase_aggs_alpha')
train_df.shape, test_df.shape, hist_trans_purchase_aggs.shape, new_hist_trans_purchase_aggs.shape
((201917, 357), (123623, 355), (325540, 15), (290001, 15))

We now have around 355 features. Let's export the dataframes to feather and store them to disk:

train_df.to_feather('train_df_alpha')
test_df.to_feather('test_df_alpha')

Feature Selection permalink

We have 355 features in the train & test datasets. That's quite a huge number. Feeding all of them to the model actually made the LB score worse, because we are either adding too much noise to the data or giving weight to the wrong features.

Feature selection is a very important step in data science. There exist lots of approaches for feature selection:

  1. Prune features that sit in the tail of the feature-importance ranking, below a certain percentile.

  2. Remove near-constant features with VarianceThreshold and drop features which are highly correlated with each other (see the sketch after this list).

  3. Use hierarchical clustering techniques like constructing a Dendrogram to eliminate correlated features.

  4. Boruta-style permutation tests: shuffle one column randomly and check whether the score gets worse; if it doesn't, the feature carries little signal.

  5. Forward Feature Selection (FFS): add features one after the other, keeping each only if it improves the RMSE score.
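To make approach 2 concrete, here is a minimal sketch of correlation-based pruning. It assumes feature_cols is a list of the numeric feature column names, and the 0.95 cut-off is purely illustrative:

# Drop one feature from every pair whose absolute correlation exceeds 0.95.
# `feature_cols` is a placeholder for the list of numeric feature names.
import numpy as np

corr = train_df[feature_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
reduced_cols = [c for c in feature_cols if c not in to_drop]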

After trying all of the above I stuck with FFS, as it was more reliable than the others even though it took a lot of compute time & power. The other methods struggled here because the outliers are barely represented in the data yet carry a very high RMSE penalty.

What I've done:

  1. Established a base line score by training a LGBM regressor on 3 base columns.

  2. Added features one by one to the regressor and computed the score. Drop the feature if it deteriorates the base score; if it improves it, add the feature to the base columns and update the base score to the current score.

Here is our training function:

def lgb_train_fn(train_df, target, trn_cols, n_fold):
    folds = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=4590)
    # initialise out-of-fold preds to 0s
    oof = np.zeros(len(train_df))

    for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_df, train_df['outliers'].values)):
        trn_data = lgb.Dataset(train_df.iloc[trn_idx][trn_cols], label=target.iloc[trn_idx])
        val_data = lgb.Dataset(train_df.iloc[val_idx][trn_cols], label=target.iloc[val_idx])

        num_round = 10000
        clf = lgb.train(lgb_param, trn_data, num_round, valid_sets=[trn_data, val_data], verbose_eval=0, early_stopping_rounds=200)
        oof[val_idx] = clf.predict(train_df.iloc[val_idx][trn_cols], num_iteration=clf.best_iteration)

    print(np.sqrt(mean_squared_error(oof, target)), 'CV score')
    return np.sqrt(mean_squared_error(oof, target))

Here is our selection loop:

for c in cols_to_add:
    lgb_cols = final_cols + [c]
    print(len(lgb_cols), 'lg_cols', c)
    score = lgb_train_fn(x, y, lgb_cols, 5)
    delta = base_score - score
    fe_d[c] = delta
    if delta > 0:
        base_score = score
        selected_cols.append(c)
        print('Selected cols', c)
        print('Selected col delta', delta)
        print(' score with col', score)
        np.save('selecte_cols_extra', selected_cols)
        final_cols = final_cols + [c]

After running the above pieces of code for about 1.5 hours we are left with 180 features out of our initial 355 features.
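The saved column list can be reloaded when training the final models. A small sketch: the file name matches the np.save call above, and feeding the result into df_train_columns (used in the training code below) is my assumption about how the pieces connect.

import numpy as np

# np.save appends the .npy extension automatically
# assumption: the selected columns become df_train_columns for the models below
df_train_columns = np.load('selecte_cols_extra.npy').tolist()
len(df_train_columns)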

Model training permalink

Cross validation data set permalink

Cross-validation is a statistical method used to estimate the skill of machine learning models.

It is commonly used in applied machine learning to compare and select a model for a given predictive modeling problem because it is easy to understand, easy to implement, and results in skill estimates that generally have a lower bias than other methods.

We will be using Stratified k fold for sampling our data. We have seen previously that outliers are about ~1% of our dataset and they are heavily skewing the score. While training our model, it is important to keep the outlier distribution even so that we get the best predictions. Stratified sampling makes sure that data from all strata is represented proportionately.

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
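A quick sanity check (a sketch): stratifying on the outliers column should keep their ~1.1% share roughly constant in every validation fold.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=4590)
for fold_, (trn_idx, val_idx) in enumerate(skf.split(train_df, train_df['outliers'])):
    print(fold_, round(train_df['outliers'].iloc[val_idx].mean() * 100, 3), '% outliers')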

LGBM model permalink

I used the optimised & finetuned hyperparameters from one of the public kernels. We could arrive at similar numbers ourselves with grid search, or with Bayesian optimisation techniques such as Hyperopt (a sketch follows).
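For completeness, here is a minimal sketch of what that Bayesian search could look like with Hyperopt. cv_rmse is a hypothetical helper that runs the k-fold LightGBM training for a given parameter set and returns the CV RMSE, and the search ranges are illustrative only:

import numpy as np
from hyperopt import fmin, tpe, hp, Trials

space = {
    'num_leaves': hp.quniform('num_leaves', 31, 255, 1),
    'max_depth': hp.quniform('max_depth', 4, 12, 1),
    'min_data_in_leaf': hp.quniform('min_data_in_leaf', 10, 100, 1),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.005), np.log(0.05)),
}

def objective(p):
    # hp.quniform returns floats, so cast the integer-valued parameters
    p = {**p, 'num_leaves': int(p['num_leaves']),
         'max_depth': int(p['max_depth']),
         'min_data_in_leaf': int(p['min_data_in_leaf'])}
    return cv_rmse(p)   # hypothetical helper: k-fold LightGBM CV RMSE

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)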

%%time
n_fold = 5
param = {
    'task': 'train',
    'boosting': 'goss',
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.01,
    'subsample': 0.9855232997390695,
    'max_depth': 7,
    'top_rate': 0.9064148448434349,
    'num_leaves': 123,
    'min_child_weight': 41.9612869171337,
    'other_rate': 0.0721768246018207,
    'reg_alpha': 9.677537745007898,
    'colsample_bytree': 0.5665320670155495,
    'min_split_gain': 9.820197773625843,
    'reg_lambda': 8.2532317400459,
    'min_data_in_leaf': 21,
    'verbose': -1,
    'seed': int(2**n_fold),
    'bagging_seed': int(2**n_fold),
    'drop_seed': int(2**n_fold)
}
folds = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=4590)
oof = np.zeros(len(train_df2))
predictions = np.zeros(len(test_df2))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_df2, train_df2['outliers'].values)):
    print("fold {}".format(fold_))
    trn_data = lgb.Dataset(train_df2.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(train_df2.iloc[val_idx][df_train_columns], label=target.iloc[val_idx])

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets=[trn_data, val_data], verbose_eval=100, early_stopping_rounds=200)
    oof[val_idx] = clf.predict(train_df2.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    predictions += clf.predict(test_df2[df_train_columns], num_iteration=clf.best_iteration) / folds.n_splits

print(np.sqrt(mean_squared_error(oof, target)), 'CV score')
np.save('oof_lgbm', oof)
np.save('predictions_lgbm', predictions)
3.644042133990217 CV score

Let's store the out-of-fold and test-set predictions to disk (np.save writes them as .npy files) for stacking & blending with other models later.

XGBoost model permalink


xgb_params = {
    'objective': 'reg:linear',
    'eval_metric': 'rmse',
    'silent': True,
    'booster': 'gbtree',
    'n_jobs': 4,
    'n_estimators': 20000,
    'grow_policy': 'lossguide',
    'max_depth': 12,
    'seed': 538,
    'colsample_bylevel': 0.9,
    'colsample_bytree': 0.8,
    'gamma': 0.0001,
    'learning_rate': 0.006150886706231842,
    'max_bin': 128,
    'max_leaves': 47,
    'min_child_weight': 40,
    'reg_alpha': 10.0,
    'reg_lambda': 10.0,
    'eta': 0.005,
    'subsample': 0.9
}
# initialise out-of-fold & test predictions; `folds` is the StratifiedKFold defined above
oof_xgb_3 = np.zeros(len(train_df2))
predictions_xgb_3 = np.zeros(len(test_df2))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_df2, train_df2['outliers'].values)):
    print('-')
    print("Fold {}".format(fold_ + 1))
    trn_data = xgb.DMatrix(data=train_df2.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx])
    val_data = xgb.DMatrix(data=train_df2.iloc[val_idx][df_train_columns], label=target.iloc[val_idx])
    watchlist = [(trn_data, 'train'), (val_data, 'valid')]
    print("xgb " + str(fold_) + "-" * 50)
    num_round = 10000
    xgb_model = xgb.train(xgb_params, trn_data, num_round, watchlist, early_stopping_rounds=500, verbose_eval=1000)
    oof_xgb_3[val_idx] = xgb_model.predict(xgb.DMatrix(train_df2.iloc[val_idx][df_train_columns]), ntree_limit=xgb_model.best_ntree_limit+50)

    predictions_xgb_3 += xgb_model.predict(xgb.DMatrix(test_df2[df_train_columns]), ntree_limit=xgb_model.best_ntree_limit+50) / folds.n_splits

np.save('oof_xgb', oof_xgb_3)
np.save('predictions_xgb', predictions_xgb_3)
print(np.sqrt(mean_squared_error(target.values, oof_xgb_3)))

3.649844028552147

Post processing permalink

Stacking the model predictions permalink

There are various techniques for improving the score based on the predictions of multiple models. A simple average of the models' predictions often gives a significant boost on the LB. Here we will look at another such technique.

We will construct a meta-model that takes the training-data predictions from the various models as features. The dependent variable is still the target from the training dataframe. We feed these features to a simple Ridge regression model, then make test predictions by repeating the same procedure on the aforesaid models' test predictions.
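The out-of-fold and test predictions saved earlier can be reloaded first; a sketch, assuming the deep learning model's predictions were saved the same way as the LGBM and XGBoost ones:

import numpy as np

oof_lgbm = np.load('oof_lgbm.npy')
predictions_lgbm = np.load('predictions_lgbm.npy')
oof_xgb = np.load('oof_xgb.npy')
predictions_xgb = np.load('predictions_xgb.npy')
# assumption: the deep learning model's predictions were stored the same way
# oof_dl = np.load('oof_dl.npy'); predictions_dl = np.load('predictions_dl.npy')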

train_stack = np.vstack([oof_lgbm, oof_dl, oof_xgb]).transpose()
test_stack = np.vstack([predictions_lgbm, predictions_dl, predictions_xgb]).transpose()

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
oof_stacked = np.zeros(train_stack.shape[0])
predictions_stacked = np.zeros(test_stack.shape[0])

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_stack, train_df2['outliers'].values)):
    print("fold n°{}".format(fold_))
    trn_data, trn_y = train_stack[trn_idx], target.iloc[trn_idx].values
    val_data, val_y = train_stack[val_idx], target.iloc[val_idx].values

    clf = Ridge(alpha=1)
    clf.fit(trn_data, trn_y)

    oof_stacked[val_idx] = clf.predict(val_data)
    predictions_stacked += clf.predict(test_stack) / folds.n_splits

np.sqrt(mean_squared_error(target.values, oof_stacked))

3.641238686022574

Our meta-model gave a solid boost of about 0.003 in local CV (from 3.644 to 3.641) by stacking the LGBM, XGBoost & deep learning models. Let's make a submission to Kaggle at this point.

Combining model without outliers data permalink

Outliers greatly skew the predictions for the other card_ids. If the outliers didn't penalise our model through RMSE so much, we could make better predictions for the rest of the data. So how do we know which card_ids are outliers in the test set? We will build a classifier to detect them!

So the total steps are:

  1. Train a model without outliers - model A.
  2. Train a model to classify outliers in the data - model B.
  3. Use model B to classify outlier card_ids in the test data using a threshold.
  4. Use model A to predict the target for non-outliers, use our previous predictions from older models for the outliers.

This way we aren't changing the predictions for the outlier data, but we use the better predictions from model A for the non-outliers.
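A minimal sketch of steps 2-4, under these assumptions: model B is a LightGBM binary classifier trained on the outliers flag with a hypothetical parameter dict clf_param, outlier_prob collects its test-set probabilities, pred_no_outlier holds model A's test predictions, pred_all holds the earlier stacked predictions, and the 0.5 threshold is arbitrary and should really be tuned on out-of-fold probabilities:

import numpy as np
import lightgbm as lgb

# Step 2: model B - classify outliers (clf_param is a hypothetical LightGBM
# parameter dict with objective='binary', metric='auc')
outlier_prob = np.zeros(len(test_df2))
for trn_idx, val_idx in folds.split(train_df2, train_df2['outliers'].values):
    trn_data = lgb.Dataset(train_df2.iloc[trn_idx][df_train_columns],
                           label=train_df2['outliers'].iloc[trn_idx])
    val_data = lgb.Dataset(train_df2.iloc[val_idx][df_train_columns],
                           label=train_df2['outliers'].iloc[val_idx])
    clf_b = lgb.train(clf_param, trn_data, 10000, valid_sets=[trn_data, val_data],
                      verbose_eval=0, early_stopping_rounds=200)
    outlier_prob += clf_b.predict(test_df2[df_train_columns],
                                  num_iteration=clf_b.best_iteration) / folds.n_splits

# Steps 3-4: use model A (pred_no_outlier) for likely non-outliers, keep the
# earlier predictions (pred_all) for likely outliers
is_outlier = outlier_prob > 0.5
final_pred = np.where(is_outlier, pred_all, pred_no_outlier)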

The above approach gave me a boost of +0.003 on the LB.

The 1st prize winner had a simple trick built on top of the above approach. Instead of picking and blending the outliers by a threshold, he did a linear interpolation based on the probabilities from the classifier.

train['final'] = train['binpredict']*(-33.21928) + (1-train['binpredict'])*train['no_outlier']

Obviously the accuracy depends on the performance of the classifier. It was so simple and yet very effective for him: it gave about a 0.015 boost in local CV with the same features. See more discussion here.

Kaggle submission permalink

submission_name = pd.to_datetime("today")
sub_df = pd.DataFrame({"card_id":test_df2["card_id"].values})
sub_df["target"] = predictions_stacked
sub_df.to_csv(f'dl_xgb_submission_stacked_lgb-{submission_name}.csv', index=False)
filename = f'dl_xgb_submission_stacked_lgb-{submission_name}.csv'
kaggle competitions submit elo-merchant-category-recommendation -f 'dl_xgb_submission_stacked_lgb-2019-02-26 22:45:24.597929.csv' -m "dl & xgb & lgbm stacked with 178 cols - 3.641238686022574"

Our submission scored 3.61001 on the private leaderboard and 3.68463 on the public leaderboard.