A Few Useful Things to Know about Machine Learning
For this first Research Minute, I chose the paper A Few Useful Things to Know about Machine Learning by Pedro Domingos from the University of Washington. This paper has some very practical thoughts on creating your machine learning models. Below are a few of the ones I found the most helpful.
Generalizing of the model is what matters most
This is the main reason you create a model, but often times you won't have any true test data from the wild to test on. Because of this it is recommended to split your data even before you do any pre-processing on it. scikit-learn
has a very handy test_train_split
method just for this purpose.
What we're trying to avoid is overfitting your model (which we explore a bit later in this post). An overfit model won't give you accurate results if you deploy it and it predicts with real data that it hasn't seen before when the model was trained.
You need more than just data
The author puts this very nicely that doing machine learning is a lot of work. It's not magic that we just give it data and you're done. There's a lot more to it than that. A good analogy from the paper is that creating machine learning models is like farming. Nature does a good bit of the work, but a lot of up-front work has to be done for nature to do its thing. It's the same way with creating machine learning models - a lot of pre-processing and business understanding of the data is needed before starting the modeling process.
It's often said that most of the work in data science and machine learning is cleaning and getting your data in a state that you can feed it into a model and also to evaluate how your model's performance does. Fitting data into a model is easy with the current libraries.
Be aware of overfitting
Overfitting a model is where the model will score high on training data, but when it sees real data that it hasn't yet seen, it will score low. Ways in which overfitting happens is with bias and variance.
Bias is where the algorithm learns the same wrong thing, which results in the model missing relevant relationships. Recall my previous post about how bias in machine learning is a big topic that we still need to work on. Variance is where the model is sensitive to irrelevant data which can cause it to model random noise in the training data.
A popular way to battle against overfitting is to use cross validation) on your model. Cross validation will take your data and create subsets of it. In each subset it will take a random portion of the data as test data. It does this with each iteration, or fold, and it compares each score of the sampled test data of all iterations. This is a good way to test your model on different data to make sure you get consistent scores. If you don't, then your model may be sensitive to part of the data.
The image below from the article helps explain bias vs. variance through the game of darts.