5 Best Places to Read Research Papers

Since I'm starting to read more and more research papers, I thought I'd give a small rundown on where I'm finding these papers. You can find a lot available for free, and the places below are my favorite ones to go to.

Arxiv

Arxiv (I believe it is pronounced "archive") is the most popular place to find research papers. There are several subsections, but the ones to look at are machine learning and artificial intelligence. There is just so much to find here. In fact, there's so much that there's an open source tool for browsing and sorting through its papers called Arxiv Sanity.

GitHub

Yep, the main place to find code is also a great place to find collections of research papers. Here are just a few of those collections to get you started.

Papers We Love

Papers We Love is a community of people who like to read computer science papers and then talk about what they have read. It's like a book club for computer science research. While it isn't solely for machine learning or AI, there are some papers that touch on those fields.

Machine Learning Papers

While this repository isn't the most up-to-date, with its newest papers coming from NIPS 2016, it does link out to the papers' GitHub repositories so you can access the source code along with reading the paper.

Deep Learning Papers

This repository has quite a lot of papers in it, broken down by category such as natural language processing and reinforcement learning. It doesn't have just papers, either. There are some links to video lectures and other blogs you can go to as additional resources.

Quick tip: if you want to find more curated GitHub lists, there are a lot of people who maintain lists of topics under the awesome badge. Doing a search for "github awesome" and then what you're looking for will yield some interesting results.

Google Scholar

This is more of a search engine than a list of articles, but you can find a lot here. You can even create alerts on keywords or on researchers you want to follow to see what they are citing.

Machine Learning subreddit

Reddit is always a good place to find a community around topics that you're interested in. Machine learning and sharing interesting papers have a place there as well.

While I'm on the data science subreddit a lot, the machine learning one is great for research, projects, and discussions. Often, on the research posts, you'll get some extra context from the comments, which can be more valuable than the paper itself.

Tech Company Sites

Some of the big tech companies have their own research entities, such as Google's DeepMind, Microsoft Research, and Facebook Research. Often, they'll publish their papers on their sites for anyone to access. Even better, sometimes they'll put out a blog post that highlights what a paper is about and includes richer graphics than you can usually fit into a research publication, which helps you understand what's going on in the paper.

Specific Journal Sites

There are some journals that specialize in publishing this kind of work, and like Arxiv, all of their papers are free to access. There's the Journal of Machine Learning Research, which is specific to machine learning topics. The Journal of Data Science covers the huge field of data science, so you'll probably see a lot of statistics papers in there as well. And then there's the R Journal, which has papers where the code was specifically written in R, so you may find more statistics topics in there, too.


Hopefully, with these resources you'll be able to find research papers that will keep you busy for quite a long time.

Research Minute: Useful Things about Machine Learning

I've done book reviews in the past, but as I mentioned in my post on deliberate practice, I intended to get into more research papers. I don't have any academic experience in reading research papers, and I can't quite handle the very math-centric or abstract ones (at least not yet), but there are quite a few that, with some patience, anyone can read and learn from.

With that, I'll do my best to summarize the more interesting research papers I read here.

How to Read a Research Paper

When first getting started with reading research papers, keep in mind that you don't read the whole thing in one pass. Read a bit of the Abstract and Introduction sections first to make sure it's a paper that looks interesting.

For more on how to read a research paper, Siraj Raval has an interesting video:

A Few Useful Things to Know about Machine Learning

For this first Research Minute, I chose the paper A Few Useful Things to Know about Machine Learning by Pedro Domingos from the University of Washington. This paper has some very practical thoughts on creating your machine learning models. Below are a few of the ones I found the most helpful.

Generalization of the model is what matters most

This is the main reason you create a model, but often you won't have any true test data from the wild to test on. Because of this, it is recommended to split your data even before you do any pre-processing on it. scikit-learn has a very handy train_test_split method just for this purpose.
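As a quick illustration, here's a minimal sketch of that hold-out split using scikit-learn's train_test_split. The dataset is just a built-in example, not anything specific from the paper:

```python
# A minimal sketch of a hold-out split with scikit-learn's train_test_split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows up front so the test set stays truly "unseen"
# by any pre-processing or training steps.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```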

What we're trying to avoid is overfitting the model (which we explore a bit later in this post). An overfit model won't give you accurate results once it's deployed and predicting on real data it didn't see when it was trained.

You need more than just data

The author puts this very nicely: doing machine learning is a lot of work. It's not magic where we just give it data and we're done. There's a lot more to it than that. A good analogy from the paper is that creating machine learning models is like farming. Nature does a good bit of the work, but a lot of up-front work has to be done for nature to do its thing. It's the same way with creating machine learning models - a lot of pre-processing and business understanding of the data is needed before starting the modeling process.

It's often said that most of the work in data science and machine learning is cleaning your data and getting it into a state where you can feed it into a model and evaluate how the model performs. Fitting data into a model is easy with the current libraries.

Be aware of overfitting

Overfitting a model is where the model scores high on training data, but scores low on real data that it hasn't yet seen. Overfitting can be understood in terms of bias and variance.

Bias is where the algorithm consistently learns the same wrong thing, which results in the model missing relevant relationships. Recall my previous post about how bias in machine learning is a big topic that we still need to work on. Variance is where the model is sensitive to irrelevant data, which can cause it to model random noise in the training data.

A popular way to battle overfitting is to use cross validation on your model. Cross validation splits your data into subsets, or folds. In each iteration, one fold is held out as test data while the model is trained on the rest, and then the scores from all of the iterations are compared. This is a good way to test your model on different slices of the data to make sure you get consistent scores. If you don't, then your model may be sensitive to part of the data.
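Here's a small sketch of 5-fold cross validation with scikit-learn's cross_val_score. The model and dataset are just placeholders for the example:

```python
# 5-fold cross validation: each fold takes a turn as the held-out test set.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.std())    # a large spread hints the model is sensitive to parts of the data
```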

The image below from the article helps explain bias vs. variance through the game of darts.

Prioritize feature engineering

Sometimes the best way to get a more accurate model is to do feature engineering on your data. This means combining or transforming columns of your data to create new columns that may make it easier for the algorithm to detect a pattern. The author states in the paper that feature engineering is the best thing you can do to increase the accuracy of your model. It is a tough thing to learn, though, and there's a book strictly about feature engineering coming out soon.

Feature engineering is tricky. As the book mentioned above puts it, doing feature engineering on your data is more of an art than a science. You have to know a lot about your data and have the creativity to see how unrelated parts of it can come together, or how one part of the data can be represented in a different way.
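To make that concrete, here's a hedged sketch of two simple engineered features using pandas. The column names (price, square_feet, sale_date) are hypothetical, not from the paper:

```python
import pandas as pd

# A made-up housing dataset for illustration.
houses = pd.DataFrame({
    "price": [250000, 340000, 199000],
    "square_feet": [1800, 2400, 1250],
    "sale_date": pd.to_datetime(["2017-03-01", "2017-06-15", "2017-11-30"]),
})

# Combine two columns into one that may be easier for a model to pick up on.
houses["price_per_sqft"] = houses["price"] / houses["square_feet"]

# Represent a date differently - the month alone may capture a seasonal pattern.
houses["sale_month"] = houses["sale_date"].dt.month
```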

More data over a more complex algorithm

The author points out in the paper that there can be a tradeoff between three elements when training a model - the amount of data, computational power, and the time it takes to train the model. The author then goes into how the most important of those three is time. Since time is a resource we can never get back or improve upon, unless we invent a time machine, we need to choose algorithms that generalize well to our data and can also be trained quickly. The best way to get both is to use more data with the simplest algorithm possible.

Wouldn't more complex algorithms generalize better than simpler ones? Not necessarily. Take a look at the image below from the paper. Multiple algorithms, some more complex than others, can still generalize about the same as the simpler ones. However, the simpler algorithms can be much faster to train and predict on new data.
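As a rough sketch of my own (not from the paper), you can compare a simple and a more complex model on the same data with cross validation to see whether the extra complexity actually pays off:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# A simple linear model vs. a more complex ensemble.
simple_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
complex_model = RandomForestClassifier(n_estimators=200, random_state=42)

for name, model in [("logistic regression", simple_model), ("random forest", complex_model)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```

Often the scores end up close, and the simpler model is the one that trains and predicts faster.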


The paper has several more ideas, thirteen in total, but these are the ones I thought gave the most insights from the projects I've done. I'm sure I'll come back to this paper several times as I do more projects and gain more machine learning experience.

Why I'm Reading Research Papers

In my recent post on doing deliberate practice to become a better developer, I mentioned that I was going to spend some time reading and understanding research papers. This may seem like an odd thing to do in order to become better at my craft, but I figured a little experimentation couldn't hurt. At worst, I'll have a few research papers read and understood. Perhaps I'll even meet one of the co-authors and have something to engage them in discussion with. However, I believe I may get a bit more out of it than just that.

Understand Latest Research

Seeing what the latest research trends are can, I feel, be quite beneficial in a practical sense. For instance, there's a paper that suggests that simple testing can prevent most critical failures in software. From reading the paper and Adrian Colyer's post about it, one can get a lot of insight into preventing most crashes in software. Once you have that insight, you can put it to good use in all of the software that you're currently developing.

See Cutting Edge Technologies

I'm sure most of y'all have seen this graph on emerging technologies.

CC BY 2.5, https://en.wikipedia.org/w/index.php?curid=11484459

Keeping up with new research articles allows me to be part of the early adopters, whereas now I'm most likely split between the Early Majority and the Late Majority. Getting in early on new technologies gives multiple advantages, such as being among the first to submit pull requests if they have their code on GitHub, or writing the first set of blog posts on the subject.

For example, Elm, a functional web language that compiles to JavaScript, was first introduced as a research paper. While I would say it is still in the late stages of the Early Adopter phase, if I had gotten onto it soon after the paper came out, I could have become one of the go-to people for this technology and could even have helped contribute to future releases of it.

Try to Understand More Math

A lot of computer science, and especially most of the research I've seen done in the field, has a good bit of math behind it. While I took some math in my own computer science studies, a lot of it was lost just from not using it or keeping up with it.

While it's not necessary in day-to-day programming, it can be a bit helpful. Learning the math can help develop that extra bit of logic that will help in my daily programming, whether business logic or debugging.


With these benefits in mind, I plan on reading a paper a quarter this year and seeing how that goes. I'll definitely report back any benefits, or lack thereof, that I believe I receive during the process.