Instead of waking to overlooked “Do not disturb” signs, Airbnb travelers find themselves rising with the birds in a whimsical treehouse, sipping their morning coffee on the deck of a houseboat, or cooking a shared regional breakfast with their hosts.
New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the average time to first booking, and better forecast demand.
- Business Problem
- Use of ML
- Source of Data
- Existing Approaches
- My Improvements
- First Cut Solution
- Comparison of Models
- Kaggle Screenshot
- Future Work
- Github Repo
Explanation of the business problem
The problem is to predict the country in which a new Airbnb user will book his or her first travel experience. Since we have to predict among multiple countries, it is a multi-class classification problem.
By accurately predicting where a new user will book, Airbnb can create more personalized content for its users and decrease the average time to first booking, which supports the growth of the company and helps it forecast demand.
Use of Machine Learning
To solve this problem we are going to use machine learning techniques to predict where a user will book his or her first travel experience with Airbnb. We will use the NDCG (Normalized Discounted Cumulative Gain) score as the evaluation metric.
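For this competition each user gets up to five ranked country predictions, and there is exactly one correct answer per user, so NDCG@5 simplifies nicely. A minimal sketch of the metric (function name is my own, not from the competition kit):

```python
import numpy as np

def ndcg_at_5(predicted, actual):
    """NDCG@5 for one user: `predicted` is a ranked list of up to five
    country codes, `actual` is the true first-booking country.
    With a single relevant item the ideal DCG is 1, so the score is
    1/log2(rank + 1) if the true country appears at position `rank`
    (1-indexed) in the top 5, and 0 otherwise."""
    for rank, country in enumerate(predicted[:5], start=1):
        if country == actual:
            return 1.0 / np.log2(rank + 1)
    return 0.0

print(ndcg_at_5(['NDF', 'US', 'FR', 'IT', 'GB'], 'NDF'))  # hit at rank 1 -> 1.0
print(ndcg_at_5(['NDF', 'US', 'FR', 'IT', 'GB'], 'US'))   # hit at rank 2 -> 1/log2(3)
```

This is why submitting five ordered guesses per user pays off: a correct country at rank 2 still earns about 0.63.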
Source of Data
The dataset we are using for this problem is taken from the Kaggle competition https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings
In this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user’s first booking destination will be. All the users in this dataset are from the USA.
There are 12 possible outcomes for the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’. Here ‘NDF’ is different from ‘other’: ‘other’ means there was a booking, but to a country not included in the list, while ‘NDF’ means there was no booking at all.
The data contains 5 CSV files, described below.
train_users.csv — our training set, containing 213,451 data points and features like id, date_account_created, timestamp_first_active, age, gender, …, country_destination. Here country_destination is the target value we have to predict for each user.
test_users.csv — our test set, containing 62,096 data points with the same features as the training data, which we use at testing time.
sessions.csv — the web session logs of the users, with features like user_id, action, action_type, etc., containing 10,567,737 data points. It only contains data for users from 2014 onwards.
countries.csv — summary statistics of the destination countries in this dataset and their locations.
age_gender_bkts.csv — summary statistics of users’ age group, gender, and country of destination, with features like age_bucket, country_destination, gender, etc.
Existing approaches to this problem
There are a few existing approaches to this problem:
- Some people have used only the train data and discarded the sessions data, but this way we lose a lot of the information contained in the sessions. The biggest problem is that the only users left in the training set are then users created during the winter to early summer months, while the users we want to predict come only from the late summer months; realistically these two groups should have different distributions of travel destinations, since the attractiveness of some destinations depends heavily on the season.
- Another approach is to use both the train and sessions data. The problem is that only 35% of the train users have session data, compared with 99% of the users in the test data. Also, the sessions data contains multiple entries per user.
My improvements to the existing approaches
To tackle this problem I am going to use both the train and sessions datasets. Since about half of the train users have session data, I perform a natural join on the two tables to combine them into one, derive some new features from both the train and sessions data, and do feature engineering on the combined data. Because the sessions data has multiple entries per user, I perform a groupby operation on it to obtain a single entry per user.
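The groupby-then-join step can be sketched with pandas on miniature stand-in frames (the toy rows and the aggregate names `actions`, `total_secs`, `n_sessions` are my own; the column names `id`, `user_id`, `action`, `secs_elapsed`, `country_destination` follow the Kaggle files):

```python
import pandas as pd

# Tiny stand-ins for train_users.csv and sessions.csv.
train = pd.DataFrame({
    'id': ['u1', 'u2', 'u3'],
    'country_destination': ['US', 'NDF', 'FR'],
})
sessions = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'action': ['search', 'lookup', 'search'],
    'secs_elapsed': [30.0, 120.0, 45.0],
})

# Collapse the many session rows per user into one row: concatenate
# the action sequence and aggregate the elapsed time.
agg = sessions.groupby('user_id').agg(
    actions=('action', ' '.join),
    total_secs=('secs_elapsed', 'sum'),
    n_sessions=('action', 'size'),
).reset_index()

# An inner (natural) join keeps only users present in both tables,
# matching the observation that about half the train users have sessions.
combined = train.merge(agg, left_on='id', right_on='user_id', how='inner')
print(combined[['id', 'actions', 'total_secs', 'n_sessions']])
```

Keeping the concatenated action string (rather than just counts) is what later allows n-gram featurization over the action sequence.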
Exploratory Data Analysis
- From the above plot we can see that there are many missing values in the date_first_booking, age, and first_affiliate_tracked columns.
- 58% of the values for date_first_booking and 42% of the values for age are missing.
- We have to use some mechanism to handle all the missing values.
- The class labels are highly imbalanced, as the majority of users belong to NDF (no destination found) and US.
- Most users registered but did not book any travel.
- Other than NDF, most users prefer to travel to the US.
- A small percentage of people travel to other countries like Italy, Spain, Great Britain, etc.
- From the above we can see that among people travelling to the US, the majority are female, followed by unknown and male.
- The same pattern can be observed for France and the other countries.
- In conclusion, for the US and FR females travel more than males; for the rest of the countries there is not much difference between the genders.
- After removing the outliers from the age variable we get the above plot, which is heavily right-skewed.
- Most users are aged between 25 and 40.
- Very few people are older than 50, and the counts decrease as age increases.
- We can conclude that most users who book travel are aged between 25 and 40, while older people rarely travel.
- Users travelling to the US, ‘other’, and NL have almost the same age distribution.
- Users who wish to travel to FR, DE, GB, and AU are usually older than other users.
- Users who wish to travel to ES and PT are usually younger than other users.
- From the above plot we can observe that people who wish to travel to FR and IT make their bookings early in the year, i.e. most of them book in May.
- Some users who wish to travel to the US, ‘other’, and AU also make their first booking in September.
- From the above we can observe that younger female users prefer to travel to countries like the US, FR, ES, and PT.
- Some of the older female users have travelled to GB.
- People whose gender is ‘other’ and who are older usually book travel to DE.
- People whose gender is ‘other’ and who are younger usually book travel to NL.
- Most of the older users whose gender is unknown or male book travel to France.
- In the first plot we can see the distribution of first_device_type: most users are on Mac Desktop, followed by Windows Desktop, and very few use an Android device.
- In the second plot we can observe that older users who use an iPad as their first device usually book travel to GB and PT.
- Younger iPhone users mostly book travel to countries like the US, FR, CA, ES, IT, and PT.
- Younger users on an Android tablet travel to PT.
- Younger users on an Android phone travel to GB.
- Older users whose device type is Unknown/Other travel to PT.
My first cut approach for this problem
As I am using both the train and sessions data for the predictions, I followed the approach below for my first cut solution to this problem.
For the train data, since there are many NaN and missing values, I replaced missing values in categorical features with the mode of that feature and missing numerical values with the mean of that feature, and performed data preprocessing to make the data more meaningful.
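The imputation step can be sketched as follows on a toy frame (the example rows and values are illustrative, not from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_users: `age` is numeric,
# `first_affiliate_tracked` is categorical.
df = pd.DataFrame({
    'age': [28.0, np.nan, 35.0, 41.0],
    'first_affiliate_tracked': ['linked', None, 'untracked', 'linked'],
})

# Numeric gaps -> column mean; categorical gaps -> column mode.
df['age'] = df['age'].fillna(df['age'].mean())
mode = df['first_affiliate_tracked'].mode()[0]
df['first_affiliate_tracked'] = df['first_affiliate_tracked'].fillna(mode)

print(df)
```

For date_first_booking, where 58% of values are missing, mean/mode filling is less appropriate; dropping the column or deriving a "has booking" indicator from it are common alternatives.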
I engineered some new features like first_booking_day, first_booking_month, etc. from the train data and dropped the features that were not adding much value.
For the sessions data, I performed data cleaning to make it more meaningful, and since there are multiple entries per user I grouped them all into a single entry per user and created a new sessions dataset.
I combined the train and sessions data frames into one so that we can build our model and train it on this data.
Data before cleaning,
Data after cleaning,
Featurization of data
For all the categorical variables in the data, I chose to perform one-hot encoding using CountVectorizer.
For features like action, action_type, and action_detail, I encoded them using a TF-IDF vectorizer with an n-gram range of (1, 4) to capture the sequence information.
For the numerical features I applied StandardScaler, which standardizes each feature to zero mean and unit variance.
I combined all the featurized data into one matrix using hstack, which can then be fed to the model for training and testing.
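The featurization pipeline can be sketched on toy per-user columns (the example values are made up; the vectorizer and scaler choices mirror the steps described above):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical per-user columns after the train/sessions merge.
gender = ['male', 'female', 'unknown', 'female']
actions = ['search lookup search', 'lookup', 'search booking_request', 'search']
ages = np.array([[28.0], [34.7], [35.0], [41.0]])

# One-hot encode a categorical column via CountVectorizer (binary counts).
cat_vec = CountVectorizer(binary=True)
X_gender = cat_vec.fit_transform(gender)

# TF-IDF with word n-grams (1, 4) over the action sequence keeps some
# information about runs of consecutive actions.
tfidf = TfidfVectorizer(ngram_range=(1, 4))
X_actions = tfidf.fit_transform(actions)

# Standardize the numeric column to zero mean and unit variance.
X_age = StandardScaler().fit_transform(ages)

# Stack everything into one sparse feature matrix for the model.
X = hstack([X_gender, X_actions, csr_matrix(X_age)]).tocsr()
print(X.shape)
```

Keeping the matrix sparse matters here: one-hot plus (1, 4)-gram TF-IDF columns are wide, and XGBoost accepts CSR input directly.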
Building the model
I built the model using XGBoost and performed hyperparameter tuning on its parameters like n_estimators and max_depth. Finally I trained the model on our data and obtained an NDCG score of 0.9373 on the test data.
Experiments with other models
I also experimented with other machine learning models, some of which are mentioned below.
K Nearest Neighbours
I built a model with KNN along with hyperparameter tuning of the k value and got a decent score, but because of its high time complexity it took longer to train and produce results.
Random Forest
I experimented with a Random Forest model with full hyperparameter tuning, but it did not give good results compared to the other models.
Decision Tree
After building a decision tree model with some hyperparameter tuning on its parameters, we were able to get a better score than the Random Forest model. This model gives an NDCG score of 0.9360. The best hyperparameters were a tree depth of 4 and a min_samples_split of 100.
Custom Stacking Classifier
I also built a custom model in which I implemented a custom stacking classifier as follows.
I split my training data 50–50. From the first 50% I randomly generated n samples of data with replacement, where n is the number of estimators in the custom stacking model, and used these n samples to train n base models. For the second 50% I took the predictions from each base model, created a new dataset from them, and trained the meta-model on it. After building and training the model on the (split) train data, I took predictions on the test data. This model gave pretty good results, with a score of 0.9367.
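The stacking procedure above can be sketched as follows; the base/meta model choices and synthetic data are my own stand-ins for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=800, n_features=15, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# 50-50 split of the training data: D1 trains the base models,
# D2 trains the meta-model on their predictions.
X1, X2, y1, y2 = train_test_split(X_tr, y_tr, test_size=0.5, random_state=0)

n_estimators = 5
base_models = []
for _ in range(n_estimators):
    # Bootstrap sample (with replacement) from D1 for each base model.
    idx = rng.randint(0, len(X1), size=len(X1))
    base_models.append(DecisionTreeClassifier(max_depth=4).fit(X1[idx], y1[idx]))

def meta_features(models, X):
    # Each base model contributes its predicted label as one meta-feature.
    return np.column_stack([m.predict(X) for m in models])

# Train the meta-model on D2 and evaluate on held-out test data.
meta = LogisticRegression(max_iter=1000).fit(meta_features(base_models, X2), y2)
acc = meta.score(meta_features(base_models, X_te), y_te)
print(round(acc, 3))
```

Training the meta-model on data the base models never saw (D2) is what keeps it from simply memorizing their training-set errors.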
Comparison of all models
I experimented with many models like Naive Bayes, KNN, logistic regression, etc. for this problem. In the end the XGBoost model performed best with a score of 0.9373, while the second best performer was the custom stacking model with a score of 0.9367.
I used XGBoost as my final model and got a Kaggle score of 0.84025.
Future work
- More features could be extracted from the sessions data as well as the train data; the other datasets, countries.csv and age_gender_bkts.csv, could also be used to build new features for training the model.
- More hyperparameter tuning on different parameters of the models can be done.
- Different techniques like TF-IDF-weighted Word2Vec can be used for featurization to capture the semantic meaning of the words.
link to github repo - https://github.com/sugam95/Airbnb-prediction
link to linkedin profile - www.linkedin.com/in/sugam-verma-946115183