Completing the Microsoft Professional Program for Data Science

Microsoft has two data-related certifications in their Microsoft Professional Program - one for Data Science and another for Big Data. I recently completed the one for data science and want to share my experience with others interested in doing the same.

Program Details

The data science program consists of 10 courses, with one being an orientation and the last being a final capstone project. In order to get credit you need to receive a passing grade (70% for all courses other than the capstone, for which you need 76%) and purchase a verified certificate. The certificates cost $99 each, so the whole program will cost you $990. That's a good chunk of change if your employer won't reimburse you for it, so I would definitely check that first before committing the time.

The main course materials cover the following:

  • Statistics
  • Coding in R or Python
  • Machine Learning
  • Visualizing Data

The courses all differ in how they grade, but usually they have either a lab or quiz after each section and a final exam at the end of the course. The questions usually allow you two attempts to get the correct answer, but some may only offer one.

My Experience

Most of the courses are easy as long as you pay attention to the lectures and actually do the labs. The instructors were all knowledgeable and presented the materials well. However, I think there was an issue with the materials involving the math and statistics.

To explain a concept in the math portions, they showed slides of the formulas and just talked about what the formulas mean...and that's basically it. I don't think this was needed at all, since the concepts were explained much better once they were shown in demos. Just going over formulas isn't the best way to teach any kind of math. Showing the applications of it, though, is. Then, if you need to, you can go back to the formula and relate it to the application.

Speaking of which, the statistics course was the hardest for me. I barely made a passing grade. The course structure was slightly different

Tips

A couple of things that may be useful to you if you do decide to take this course:

  • Audit the course first. Then, once you know you've gotten a passing grade, purchase the certificate. There's nothing worse than paying for something only to find out you barely missed the passing grade and you've paid $99 for nothing.
  • The statistics course was harder than expected. Not only do they allow just one attempt at each answer instead of the usual two, but the questions also don't make it clear which part of the lectures they come from. I would definitely go back over the lectures after reading each question. The questions mostly come from the Excel portions of the lectures.

Conclusion

Would I take this course again if I needed to? I probably would, but I love to learn. I did learn a lot of things, such as tuning models in Azure Machine Learning, how awesome Power BI is and all the things you can do with it, and time series analysis concepts.

In the end, you get a certificate that you can share on LinkedIn or with your boss. However, keep in mind that there's still tons more to learn and projects to work on to further solidify the concepts from this course.


Research Minute: Useful Things about Machine Learning

I've done book reviews in the past, but as I mentioned in my post on deliberate practice, I intended to get into more research papers. I don't have any academic experience in reading research papers, so I can't quite tackle the very math-centric or abstract papers (at least not yet), but there are quite a few that, with some patience, anyone can read and learn from.

With that, I'll do my best to summarize the more interesting research papers I read here.

How to Read a Research Paper

The first thing to know when getting started with reading research papers is that you don't read the whole thing in one pass. Read a bit of the Abstract and Introduction sections first to make sure it's a paper that looks interesting.

For more on how to read a research paper, Siraj Raval has an interesting video on the topic.

A Few Useful Things to Know about Machine Learning

For this first Research Minute, I chose the paper A Few Useful Things to Know about Machine Learning by Pedro Domingos from the University of Washington. This paper has some very practical thoughts on creating your machine learning models. Below are a few of the ones I found the most helpful.

Generalization of the model is what matters most

This is the main reason you create a model, but oftentimes you won't have any true data from the wild to test on. Because of this, it is recommended to split your data before you do any pre-processing on it. scikit-learn has a very handy train_test_split function just for this purpose.

What we're trying to avoid is overfitting your model (which we explore a bit later in this post). An overfit model won't give you accurate results once you deploy it and it has to predict on real data that it didn't see during training.
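To make the idea of holding data back concrete, here is a minimal sketch of a split written with only the Python standard library. The function name `split_rows`, the 80/20 ratio, and the toy rows are assumptions for illustration; in a real project scikit-learn's `train_test_split` does this job with more options.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy data set: 10 (feature, label) pairs, made up for illustration.
rows = [(x, x % 2) for x in range(10)]
train, test = split_rows(rows)
print(len(train), len(test))
```

The test rows are set aside before anything else touches the data, so whatever pre-processing you do later can be fitted on the training rows only.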

You need more than just data

The author puts it very nicely: doing machine learning is a lot of work. It's not magic where you just give it data and you're done. There's a lot more to it than that. A good analogy from the paper is that creating machine learning models is like farming. Nature does a good bit of the work, but a lot of up-front work has to be done for nature to do its thing. It's the same way with creating machine learning models - a lot of pre-processing and business understanding of the data is needed before starting the modeling process.

It's often said that most of the work in data science and machine learning is cleaning your data, getting it into a state that you can feed into a model, and evaluating how your model performs. Fitting a model to data is easy with the current libraries.

Be aware of overfitting

Overfitting is where the model scores high on training data but scores low on real data that it hasn't seen before. A useful way to understand overfitting is in terms of bias and variance.

Bias is where the algorithm consistently learns the same wrong thing, which results in the model missing relevant relationships. Recall my previous post about how bias in machine learning is a big topic that we still need to work on. Variance is where the model is sensitive to irrelevant data, which can cause it to model random noise in the training data.

A popular way to battle overfitting is to use cross validation on your model. Cross validation splits your data into several subsets, or folds. In each iteration it holds one fold out as test data, trains on the rest, and scores the model on the held-out fold. It then compares the scores across all iterations. This is a good way to test your model on different data to make sure you get consistent scores. If you don't, then your model may be sensitive to part of the data.
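Here is a rough sketch of how the folds can be built and scored, using only the standard library. The helper `k_fold_indices`, the toy targets, and the "model" that always predicts the training mean are all assumptions made up for illustration; scikit-learn ships ready-made cross validation helpers that you would normally use instead.

```python
import random
from statistics import mean

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each row is in a test set exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Toy "model": always predict the mean of the training targets.
targets = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
fold_errors = []
for train_idx, test_idx in k_fold_indices(len(targets), k=5):
    prediction = mean(targets[j] for j in train_idx)
    error = mean(abs(targets[j] - prediction) for j in test_idx)
    fold_errors.append(error)

print(fold_errors)  # one mean absolute error per fold
```

If the errors vary wildly from fold to fold, that is a hint the model is sensitive to which part of the data it was trained on.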

The image below from the article helps explain bias vs. variance through the game of darts.

Prioritize feature engineering

Sometimes the best way to get a more accurate model is to do feature engineering on your data. This means combining or manipulating columns of your data to create new columns, or columns of a different type, in which the algorithm may more easily detect a pattern. The author states in this paper that feature engineering is the best thing you can do to increase the accuracy of your model. It is a tough thing to learn, though, and there's a book strictly about feature engineering coming out soon.

Feature engineering is tricky. According to the book mentioned above, doing feature engineering on your data is more of an art than a science. You have to know a lot about your data and have the creativity to see that seemingly unrelated parts of the data can come together, or that one part of the data can be represented in a different way.
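As a small illustration of both ideas, the sketch below represents one column in a different way (a date becomes a weekend flag) and combines two columns into a new one (total and item count become price per item). The order data and column names are hypothetical, invented only for this example.

```python
from datetime import date

# Hypothetical raw rows: an order date plus total price and item count.
orders = [
    {"order_date": "2018-02-03", "total": 60.0, "items": 3},
    {"order_date": "2018-02-05", "total": 20.0, "items": 4},
]

def add_features(row):
    """Return a copy of row with two engineered columns added."""
    day = date.fromisoformat(row["order_date"])
    enriched = dict(row)
    enriched["is_weekend"] = day.weekday() >= 5          # Saturday=5, Sunday=6
    enriched["price_per_item"] = row["total"] / row["items"]
    return enriched

features = [add_features(row) for row in orders]
print(features)
```

Whether a feature like `is_weekend` actually helps depends entirely on the data and the question, which is where the knowledge of your data comes in.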

More data over a more complex algorithm

The author points out in the paper that there can be a tradeoff among three elements when training a model - the amount of data, computational power, and the time it takes to train the model. The author then goes into how the most important of those three is time. Since time is a resource we can never get back or improve upon, unless we invent a time machine, we need to choose algorithms that generalize well to our data and can also be trained quickly. The best way to get both is to use more data with the simplest algorithm possible.

Wouldn't more complex algorithms generalize better than simpler algorithms? Not necessarily. Take a look at the image below from the paper. Multiple algorithms, some more complex than others, can still generalize about the same as the simpler algorithms. However, the simpler algorithms can be much faster to train and to predict on new data.


The paper has several more ideas, thirteen in total, but these are the ones I thought gave the most insights from the projects I've done. I'm sure I'll come back to this paper several times as I do more projects and gain more machine learning experience.

Bias in Machine Learning

You hear a lot about machine learning and how it's transforming industries, but there is something about these algorithms you may not have heard about - their biases.

A bias is favoring one group over the others, and these machine learning algorithms are supposed to prevent decisions based on any type of bias a person would have. However, machine learning models are only as good as the data that you give them. The historical data that we have has a built-in bias that we need to address.

I came across an interesting talk from the 2017 NIPS conference that goes over this very well.

There's also an interesting podcast episode of the TED Radio Hour that goes over algorithmic bias.

Researchers know this is a problem in algorithms and they are actively doing research on how to better remove these biases from machine learning models. Until then, though, we need to be diligent when testing our models to make sure these types of biases aren't built into our data.

Book Review: Becoming a More Effective Developer

A lot of programming books teach you about a new framework, language, or computer science theory, but very few teach you how to actually be effective at your day job. The book The Effective Engineer is a great one for learning all the ways in which you can be the best developer you can be.

 
 

This book goes through three main points:

  • Developing the right mindset
  • Executing effectively
  • Building long term value

Each of these has some really good and actionable insights on how you can become better as a programmer.

Developing the right mindset

As Carol Dweck mentions in her book Mindset, the most successful people have a certain mindset in terms of how they view their learning. They believe that your brain can be changed and you can learn anything once you put in the work. That's definitely true for programmers.

Continuous Learning

A large part of being a software developer is that you are continuously learning new things. Whether new frameworks, languages, or software theory, there is no shortage of things to learn.

A majority of the time this learning comes from actual work on the job. No longer being able to learn on the job is a reason a lot of developers change jobs, so it's clearly something developers care about. Sometimes, though, that learning may need to be supplemented outside of the job. This is why you hear about some developers doing projects and learning during nights and weekends.

The book goes through a few ways developers can actually do this learning, such as reading code from developers who are more experienced and having them review your own code.

Prioritize

One of my favorite authors, Tim Ferriss, often talks about being efficient vs. being effective to help boost productivity:

Focus on doing the right things (being effective) vs. doing things well (being efficient).

Doing things to be efficient may include items such as replying to emails, updating status reports more than they need to be, and other tasks that make it seem like you're busy but don't actually contribute much to the project you're working on.

That's where prioritization comes in. Once you take some time each day or week to prioritize what needs to be done, you'll have a clear way to move forward with the actual project.

This is similar to Getting Things Done in that you define clear tasks and prioritize the ones that need to be done. I would also argue for breaking up bigger tasks into much smaller ones.

Executing effectively

Now that we have the right mindset for our productivity, we also need to know how to execute on it. A few things from the book stand out as the ones I believe will help the most.

Master Your Environment

Do you use Visual Studio for all of your development? Then knowing as much as you can about it can save you so much time, whether through keyboard shortcuts or knowing when to move to the command line for certain tasks. Speaking of the command line, knowing PowerShell or bash can save you a ton of time as there are snippets and commands that can do a lot that you may not even know of.

Improve Your Estimation Skills

All too often you, as a developer, will get asked by your boss or a manager how long a certain task will take. How often have you said, "I don't know"? Improving your estimation skills is an easy way to be thought of as a senior programmer.

The main way to improve this is to just keep track of your tasks. Once you are asked to estimate, you can refer back to your tasks and get an idea of how long it took you to do something similar.

There's also the book Software Estimation, which gives a lot more detail on improving your estimation skills.
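One possible way to put that task log to work is sketched below. The log entries, the task kinds, and the choice of the median as the estimate are all assumptions made up for illustration, not something the book prescribes.

```python
from statistics import median

# Hypothetical task log: (kind of task, hours it actually took).
task_log = [
    ("bug fix", 2.0), ("bug fix", 3.5), ("bug fix", 1.5),
    ("new endpoint", 6.0), ("new endpoint", 9.0),
]

def estimate_hours(kind, log):
    """Return the median hours of past tasks of this kind, or None if none exist."""
    past = [hours for logged_kind, hours in log if logged_kind == kind]
    return median(past) if past else None

print(estimate_hours("bug fix", task_log))
```

The median is used here so that one unusually long task doesn't skew the estimate; a spreadsheet with the same columns would work just as well.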

Building long term value

Being the best programmer you can be means building long term value for your company and your clients. The book offers some great ways to help you build that long term value.

Balance Quality with Practicality

How often have you come across this scenario: you implement a new feature and the deadline is very close, but your boss wants you to go ahead and get it in to start testing. You mention how many more tests you need to write, yet your boss says not to worry about that and just check it in. Skipping that work is often referred to as taking on technical debt. We programmers can be perfectionists in that we want our code to be perfect before we send it out for all to work on and see.

Striking a balance between making your work perfect and being practical about it can be beneficial in terms of providing value to your company or clients.

Invest in Your Team

I think the biggest thing you can do for long term value is investing in the team. This can be done in many ways, such as the following:

  • Providing a nice way to onboard new team members: Having a good onboarding solution will make new team members so much more productive and have them committing code within their first week on the job. This leads to higher morale overall across the team.
  • Collaborating on all aspects of the code: This includes having efficient code reviews on all pull requests, and team members who aren't afraid to volunteer for certain bug fixes or features.
  • Building a great culture: This can be a hard one to accomplish, but if done correctly it will build higher morale within the team, which will show in team members giving everything their best.

Of course there are other ways to help invest in your team, but these are some of the ways that can have the most impact.


This book is definitely worth reading for a lot of ideas on how you can easily improve your effectiveness as a programmer. For even more you can check out the book's blog.

Learning Math for Programming

When I went to college I had to take math. My degree is actually in Math and Computer Science. Why the math part, you ask? Well, the folks at the school figured that the math will help with a lot of the logic that comes with programming.

Of course, there's also that age-old question when you first start learning programming...is there a lot of math involved? The answer is it depends. If you're planning to get a doctorate and do a lot of research, then you'll most likely use a lot of it. However, if you're just working for a company, then chances are you will hardly need any math*. If you're learning data science then some math is essential in order to gain insights from data or to understand machine learning algorithms.

In this post, we'll go over the most common types of math you may want to be familiar with to get the most out of your programming, and where you can start learning these concepts.

Discrete Math

Discrete math is used quite a bit in programming. Whether it's understanding the theory of how integers work in programming languages or using boolean logic, you've probably come across it before.
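As one small example of boolean logic showing up in code, the sketch below checks De Morgan's laws over the full truth table. These laws are what let you rewrite a condition like `not (a and b)` as `(not a) or (not b)`. The function name is made up for illustration.

```python
from itertools import product

def de_morgan_holds(a, b):
    """Check both of De Morgan's laws for one pair of boolean inputs."""
    first = (not (a and b)) == ((not a) or (not b))
    second = (not (a or b)) == ((not a) and (not b))
    return first and second

# Walk the full truth table: four combinations of True/False.
results = {pair: de_morgan_holds(*pair) for pair in product([False, True], repeat=2)}
print(all(results.values()))
```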

Where to Learn

Coursera comes to the rescue here, as there may not be too much material around on discrete math.

Algebra

Yep, even the plain old algebra you probably didn't like in high school can be helpful. I actually have an example of this: back when I was doing tax software I needed to use some basic algebra to create a function. Of course, tax software relies more heavily on math than most other software, so it's not unusual. Knowing that bit of algebra, though, really helped when creating the function; otherwise I may have had to spend a good bit of time Googling for some help.

Where to Learn

Khan Academy is well known for their math courses and they definitely have a good algebra one.

For a book, Practical Algebra looks to be a good one to brush back up on your algebra.

Statistics

Statistics is needed for data scientists, not only to help get insights from your data but also to make sure that relationships between variables that seem correlated are actually statistically significant. That's not the easiest thing to judge, even if you plot your data. Statistics will give you a big advantage in understanding your data and how it can answer any questions you throw at it.

Need to see if an A/B test has a significant difference? Then statistical hypothesis testing can be a great help.
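For a feel of what such a test looks like, here is a minimal sketch of a two-proportion z-test, one common choice for comparing conversion rates, written with only the standard library. The function name and the visitor counts are made up for illustration, and the normal approximation it relies on assumes reasonably large samples.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p_value) for the difference in conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)              # rate under the null hypothesis
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))        # standard normal CDF at |z|
    return z, 2 * (1 - normal_cdf)

# Made-up counts: 100 of 1000 visitors converted on A, 130 of 1000 on B.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A p-value below a threshold chosen in advance (0.05 is a common convention) is usually read as a significant difference between the two variants.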

Where to Learn

Coursera has a great intro to data course that has really helped me out in learning some of the basics of statistics. The book they recommend, too, is actually really good.

As for books, I recommend a couple to start with. Practical Statistics is a great introduction. If you want to focus more on the data science side of statistics then Practical Statistics for Data Scientists is a great one to get.

If you want to get a bit advanced, then I definitely recommend Introduction to Statistical Learning. This is mainly for getting deep in the machine learning algorithms, but still is an interesting read. You can also read it for free online instead of getting a physical copy. There's an even more advanced version of this too with The Elements of Statistical Learning, which is also available for free online.

Linear Algebra

Linear algebra is a bit of a niche in programming. The only place I've really seen it used is for machine learning algorithms. Linear algebra is mainly matrix manipulation.
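To show what that manipulation looks like, here is a small sketch of matrix multiplication in plain Python. The matrices are toy values and the names `features` and `weights` are assumptions chosen to hint at how a linear model computes a weighted sum; in practice a library such as NumPy would do this for you.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("columns of a must match rows of b")
    cols_b = list(zip(*b))   # transpose b so its columns are easy to walk
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

features = [[1, 2],
            [3, 4]]
weights = [[5],
           [6]]
print(mat_mul(features, weights))  # [[17], [39]]
```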

Where to Learn

Khan Academy has some linear algebra courses that you can take. This is probably the most complete of them I've seen around.

For a book, mostly what you'll see are textbooks. That's not all bad, but sometimes a more general book is helpful. In that case there's Linear Algebra for Dummies.


While math isn't necessary for programming, I believe it can certainly help with the logic, just as my university thought. Learning the extra math also made me appreciate it more for all that math can teach us about the world. Also, don't think you're not good at math. This old post from Steve Yegge explains more about the way math is taught in schools and what we can do about it as programmers. You're not bad at math, you just need a better way to learn it.

* Though, depending on the company you work for you may need it. For example, working for a financial institution may involve some knowledge of math.