How we won the Securitas Direct 2019 Data Day


Hello!

On the 3rd & 4th of October 2019, the Securitas Direct Data Day contest was held in collaboration with the University of Valladolid. In this datathon we were challenged to build models for theft prediction in spanish homes with the aim of giving some insights on how these thefts were distributed geographically around the country and what variables influenced them.

Since they handed us private data, I won't disclose the specific data we received, but I'll try to describe it in general terms.

First day: data preprocessing

On the first day, we were handed a dataset that was humongous by our standards (over 1 GB) and full of NAs (over 100 variables, most of them with more than 70% missing values).

Since we barely had 24 hours to deliver a solution, we decided to go greedy and stick to the 25 variables with the fewest missing values (<3%). Such brutal variable elimination can be very risky, since you might be losing important information. However, time was running out, and among those 25 variables we had economic factors, such as the owners' average expenditure on food, furniture and beverages, and geographical ones, such as zip code, which is really useful because, treated as a factor, it gives a direct interpretation of how theft rates differ between cities.
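For context, here's a minimal sketch of that filtering step, assuming the data lives in a pandas DataFrame (the file name is a placeholder, since the real data was private):

```python
import pandas as pd

# Placeholder path; the real dataset was private
df = pd.read_csv("securitas_data.csv")

# Fraction of missing values per column, sorted ascending
na_fraction = df.isna().mean().sort_values()

# Keep only the columns with less than 3% missing values
keep = na_fraction[na_fraction < 0.03].index
df_small = df[keep]

print(f"Kept {len(keep)} of {df.shape[1]} columns")
```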

Instead of working with the full dataset (remember that we had to work with our low-budget computers, so we had to go small), we decided to work with a subsample. We iterated, starting with 150,000 observations and decreasing the subsample steadily while checking whether the model changed. Surprisingly, we found that working with just 10,000 observations produced the same results as working with 150,000. This gave us a huge competitive advantage in computational efficiency over the other teams, allowing us to train models faster, which left us more time to study them.
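The stability check itself is simple to express. Below is a rough sketch of the idea (the target column name `theft` and the model settings are assumptions, and the features would need to be numeric at this point):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Shrink the subsample and watch whether the CV accuracy moves
for n in [150_000, 100_000, 50_000, 10_000]:
    sample = df_small.sample(n=n, random_state=42)
    X = sample.drop(columns="theft")
    y = sample["theft"]
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10).mean()
    print(f"n={n:>7}: mean CV accuracy = {acc:.4f}")
```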

Second day: model selection

Thanks to our quick decision-making, we were somewhat ahead and decided to spend the morning rapidly prototyping two candidate models. Since the response was binary (1 = a theft occurred at that home, 0 = it didn't), logistic regression was a natural first choice.

As I had already worked with boosting algorithms for my final degree project, we chose AdaBoost as our second candidate, so we were testing two approaches: complex models with high predictive power (boosting) vs. simple, highly interpretable models (logistic regression).

Through 10-fold cross-validation we found that, contrary to theory, logistic regression had the highest accuracy! (It seems Occam's razor is always at work.) The accuracy wasn't spectacular (around 0.77), but the model was very interesting and delivered some useful insights.
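In scikit-learn terms, the head-to-head comparison looks roughly like this (the hyperparameters are placeholders rather than the ones we actually tuned; `X` and `y` are as in the earlier sketch):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(n_estimators=200),
}

# 10-fold cross-validated accuracy for each candidate
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: 10-fold CV accuracy = {acc:.4f}")
```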

However, we realized that the data was very unbalanced across provinces: Madrid had hundreds of thousands of observations while Soria had very few. We retrained the models with balanced data, and the results were basically the same (such a relief).
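One way to do that rebalancing is to cap the number of observations per province before sampling, something like this (the column name and the cap are illustrative, not what we actually used):

```python
# Cap every province at the same number of rows so that
# Madrid can't dominate the fit
cap = 5000
balanced = (
    df_small.groupby("province", group_keys=False)
            .apply(lambda g: g.sample(n=min(len(g), cap), random_state=42))
)
```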

We also tried increasing the number of variables to 50, even though it meant more NAs, but the results didn't improve significantly, which was good news since it allowed us to stick to 25 variables.

Final insights

| Algorithm | CV Accuracy | Test Accuracy |
|---|---|---|
| Logistic Regression, balanced sample (25 variables) | 0.7602 | 0.7632 |
| Logistic Regression (25 variables) | 0.775 | 0.770 |
| AdaBoost (25 variables) | 0.7524 | 0.7513 |
| Logistic Regression (50 variables) | 0.7562 | 0.7509 |
| AdaBoost (50 variables) | 0.7602 | 0.7555 |

We decided to go with the logistic regression on 25 variables, with zip code as a factor; the sketch below shows roughly how such a model is specified, and the results we delivered follow.
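With statsmodels, the model can be written as a formula; `C()` is what treats the zip code as a factor, producing one coefficient per level against a reference. The column names below are made up for illustration, and only a few of the 25 variables are shown:

```python
import statsmodels.formula.api as smf

# C(zip_code) expands the zip code into dummy variables,
# one coefficient per zip code relative to a reference level
model = smf.logit(
    "theft ~ homes_per_building + built_surface"
    " + food_spend + beverage_spend + furniture_spend"
    " + C(zip_code)",
    data=balanced,
).fit()
print(model.summary())
```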

Numeric variables

Here are the most important variables:

| Variable | Coefficient |
|---|---|
| Number of homes per building | 0.074 |
| Total built surface | 0.00017 |
| Average expenditure on food | -0.0016 |
| Average expenditure on beverages | 0.00418 |
| Average expenditure on furniture | -0.00228 |

Here we can see that having more homes per building increases the odds of suffering a theft (which is quite straightforward), but it's interesting to see that, for fixed values of the other variables, families spending more money on food or furniture were less likely to suffer a theft. Maybe they have kids and/or are more wary, so they take more precautions against theft.
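To make the effect sizes concrete: a logistic regression coefficient β multiplies the odds of theft by e^β per unit increase of that variable. The expenditure units were part of the private data, so the second line is purely illustrative:

```python
import math

# One extra home per building multiplies the odds of theft by e^0.074
print(math.exp(0.074))    # ≈ 1.077, about +7.7% odds per extra home

# One extra unit of food expenditure multiplies the odds by e^-0.0016
print(math.exp(-0.0016))  # ≈ 0.998, a slight decrease per unit spent
```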

Cities

This was one of the main objectives of the contest: with this information, Securitas Direct could distribute their efforts more efficiently or offer better services in the most dangerous cities.

Safest Cities

| City | Coefficient |
|---|---|
| Salamanca | -1.68 |
| Castellón | -1.38 |
| Teruel | -2 |
| Valencia | -1.15 |
| Ciudad Real | -1.48 |

Most Dangerous Cities

| City | Coefficient |
|---|---|
| Almería | 1.14 |
| Alicante | 1.00 |
| Barcelona | 1.33 |

A higher coefficient means that, with all other variables held fixed, homes in that city are more likely to suffer a theft. Since city is a categorical variable, keep in mind that these coefficients are relative to a reference city, so a beta value is not directly interpretable on its own; for instance, Barcelona's 1.33 means the odds of theft there are about e^1.33 ≈ 3.8 times those of the reference city, all else equal.

Why did we win?

In my humble opinion, it was the model interpretation that led us to victory. No other team thought of using the zip code as a geographical variable, and those who did use it didn't treat it as a factor (which leads to misinterpretation, since zip codes don't follow any mathematical order). Furthermore, many teams spent a lot of time on variable selection and didn't have any models until the second day (in fact, I believe we were the only team with a deliverable model by the first afternoon).

However, what I enjoyed most was the camaraderie between teams. Even though it was a competition, we helped each other out, sharing hints and insights during the coffee breaks and at lunch. It was a truly enriching experience, and I certainly learnt a lot and picked up new points of view.