Data Science 2020: My personal thoughts
Hi! Long time no see. It’s been a while since I last posted, but I’ve been kinda busy finishing both of my final degree projects (I’ll publish them here shortly, once they are archived in the official UVa repository, to avoid copyright issues). It’s also been a rather difficult time for everyone.
Today I wanted to share my thoughts on the field, on the current trends people follow, and on the pitfalls we usually face in data science. This is a personal opinion post, so if you are here for the technical stuff, you may want to wait for the next post.
I wanted to write this post as a sort of checkpoint, so I can come back in a few years and see what my thoughts were and how my thinking has evolved. Keep in mind that I’m a fresh graduate and a young researcher and this is my personal opinion, so this post may be wrong and might sound like nonsense to you. Feel free to comment if you have different thoughts.
I don’t like the “bucolic” Data Science image
I’ve seen many people being lured by the mystical words “Data Science”. Public gurus promise a land of opportunities for everyone, with data science even being named the “Sexiest Job of the 21st Century”.
I don’t think this is beneficial. Don’t get me wrong, I don’t want this to be a closed field that no one can enter, but people should also know about the drawbacks of the field, and no one talks about them. Constant renewal and retraining are required (just like in any other job): even I, who have just started, had to do some individual study of the newest methods, learning the latest tools (and some of the old but hidden/forgotten ones). Data cleaning is hard, knowing which technique to apply in each situation is very challenging, and extracting knowledge out of the data should be the main objective; surprisingly enough, it’s usually left for last, with the predictive power of the process prioritized over everything else. I’ll talk about this later on.
Yes, Data Science is fun and I enjoy it, but it requires passion (like any other job). Many of the gurus I mentioned earlier profit by selling an unrealistic, almost mystical image of what Data Science is, luring lots of people who are looking for an easy change of direction in their professional life.
Again, don’t get me wrong. I’m a strong supporter of the idea that everyone should know data science as another auxiliary tool for their job. But there is a big difference between using it as an auxiliary tool and being a specialist.
The need for deep understanding of the methodology
I’ve taken some courses that tell you “just believe me this time” and never get to the root of the models. If you don’t understand the methodology you are working with, you not only have a higher chance of making mistakes (e.g., as I commented in the Data Day post, some people treated the ZIP code as a number and not as a factor), you also won’t be able to squeeze all the power out of your model. In fact, this superficial knowledge is making many people forget about Statistics, especially probability, and that’s sad because I feel it’s the most difficult yet most developed methodology we have.
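To make the ZIP code pitfall concrete, here is a minimal pandas sketch (the column names and values are invented for illustration): left as a number, the code carries an ordering that means nothing, while treating it as a categorical variable (a factor, in R terms) avoids that.

```python
import pandas as pd

# Invented example data: ZIP codes look numeric but are really labels.
df = pd.DataFrame({
    "zip_code": [47001, 47002, 28080, 8001],
    "price": [120000, 135000, 310000, 290000],
})

# Pitfall: left as int64, a model sees 47002 > 28080, an order with no meaning.
print(df["zip_code"].dtype)  # int64

# Treat it as a categorical variable instead...
df["zip_code"] = df["zip_code"].astype("category")

# ...and one-hot encode it before fitting a model.
X = pd.get_dummies(df[["zip_code"]], prefix="zip")
print(X)
```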
The abstract meaning of Data Science
What is a Data Scientist? What is a Data Analyst? A Data Engineer? An ML Engineer? Loads of different names that are not very specific about what the person actually works on, and they are often used incorrectly. In fact, 5 years ago my double degree was categorized as Data Engineering when it’s closer to Data Science. If you check my position on LinkedIn, you’ll see I describe myself simply as a Statistician, in order to be as specific as possible (give me data and I’ll immediately think about an underlying probability distribution). I feel there is no easy solution for this problem, so if you have one please feel free to comment :).
Research in Data Science
I’ve been lucky enough to collaborate with three different research groups, which gave me different perspectives. I was an ML Research Intern at the IRUSE-NUIG group in Ireland through a collaboration with the GSI-UVa group for my Final Degree Project in Computer Science, and I worked with the Inference with Restrictions UVa group for my Final Degree Project in Statistics.
From what I’ve seen, the biggest problem in research is non-reproducible results. Researchers usually work with non-public datasets (which is understandable), but also evaluating on a public repository could be a great way to compare performance between researchers. I mean, I can’t tell whether my model works better than yours if I don’t have access to your data; I can only say that my model works better on my data. This presents a big problem: results that can’t be generalized.
An avalanche of papers has flooded scientific journals in recent years, and the trend seems to be accelerating. The craze around Data Science can be detrimental: I’ve caught myself trying high-end techniques that showed no improvement on any case other than the one the paper authors were working with (usually a single specific case, while still claiming their model is the best, which I don’t think you should do). In my opinion, an architecture that only works well on one very specific problem is far less interesting than one that performs well across several datasets. Having tons of papers whose results are not as general as they could be makes reviewing them very exhausting and difficult.
Another issue is the unavailability of code. Data Science and code are not dissociable anymore (if they ever were). Code lets you understand the pipeline and the thought process the authors followed as well as the text of the paper does, sometimes even better. Interesting initiatives such as Papers with Code have emerged recently to overcome this issue, and I believe sharing code should be mandatory in this field. Having the code would make it possible to reproduce results easily and with more guarantees.
Knowledge extraction
This is key, and people forget about it. When we create a model, we create it to understand reality. In my opinion, an interpretable model with 85% accuracy is much more interesting than a 90% accuracy model that is a black box and doesn’t let you understand what it’s doing. Imagine you have to explain to a customer why you denied them a loan.
- “This Neural Network with 10 layers and 248 neurons in each layer decided that we can’t give you a loan”
This doesn’t seem like a good explanation. However, telling a customer:
- “Your wage being less than 30k a year prevents us from giving you this loan, however a smaller loan could be studied or you could access the loan if…”
Can be much more useful. Predictive power isn’t everything, and techniques to explain black-box models, such as LIME or SHAP, are just emerging to mitigate the drawbacks of non-interpretable models. I’ll talk about them in future posts because they are simple yet effective.
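As a small preview of those posts, here is a minimal SHAP sketch. The loan data and the random forest are placeholders I made up, and I’m assuming the `shap` package together with scikit-learn:

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder stand-in for a loan dataset: X are applicant features, y is approve/deny.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# SHAP attributes a single prediction to per-feature contributions, so instead of
# "the network said no" you can say "low income pushed this application towards denial".
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])  # explain one applicant

print(shap_values)  # contribution of each feature (per class) to that prediction
```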
Final thoughts
Data Science is still a newborn. It’s learning to walk (don’t get me wrong, many people worked on this decades ago, but not with the Big Data perspective we have today), and it’s fun to be along for the ride during these first steps. However, I do believe more consensus is needed, and trying to give the field a global perspective is key.
I’m excited about what’s to come and about what the Data Science industry will deliver in the years ahead.
If you stuck around to the end, thanks for reading, and I hope you enjoyed it :). I’d be glad to hear your thoughts on this post or any other topic in the comments!