How the Machine Learning Process is Like Cooking

When creating machine learning models, it's important to follow the machine learning process in order to get the best performing model you can into production and to keep it performing well.

But why cooking? First, I enjoy cooking. But also, it is something we all do. Now, we don't all make five-course meals every day or aim to be a Michelin star chef. We do follow a process to make our food, though, even if it's just heating something up in the microwave.

In this post, I'll go over the machine learning process and how it relates to cooking to give a better understanding of the process and maybe even a way to help remember the steps.

For the video version, check below:

Machine Learning Process

First, let's briefly go over the machine learning process. Here's a diagram of the cross-industry standard process for data mining, known simply as CRISP-DM.

The machine learning process is pretty straightforward when going through the diagram:

  • Business understanding - What exactly is the problem we are trying to solve with data

  • Data understanding - What exactly is in our data, such as what does each column mean and how does it relate to the business problem

  • Data prep - Data preprocessing and preparation. This can also include feature engineering

  • Modeling - Getting a model from our data

  • Evaluation - Evaluating the model for performance on generalized data

  • Deployment - Deploying the model for production use

Note that a couple of items can go back and forth. You may do multiple iterations of getting a better understanding of the business problem and the data, data prep and modeling, and even going back to the business problem when evaluating a model.

Notice that there's a circle around the whole process which means you may even have to go back to understanding the problem once a model is deployed.

There are a couple of items I would add to help improve this process, though. First, we need to think about getting our data. Also, I believe we can add a separate item in the process for improving our model.

Getting Data

I would actually add an item before or after defining the business problem, and that's getting data. Sometimes you may have the data already and define the business problem but you may have to get the data after defining the problem. Either way, we need good data. You may have heard an old saying in programming, "Garbage in, garbage out", and that applies to machine learning as well.

We can't have a good model unless we give it good data.

Improving the Model

Once we have a model, we can also spend some time improving it even further. There are techniques we can apply that tweak the algorithm so the model performs better.

Now that we understand the machine learning process a bit better, let's see how it relates to cooking.

Relating the Machine Learning Process to Cooking

At first glance, you may not see how the machine learning process relates to cooking at all. But let's go into more detail of the machine learning process and how each step relates to cooking.

Business Understanding

One of the first things to do for the machine learning process is to get a business understanding of the problem.

For cooking, we know we want to make a dish, but which one? What do we want to accomplish with our dish? Is it for breakfast, lunch, or dinner? Is it for just us or do we want to create something for a family of four?

Knowing these will help us determine what we want to cook.

Getting Data

As mentioned above, I'd add getting data as its own step. For cooking, this is getting your ingredients. You may already have them on hand, or you may need to go get them once you've decided on the dish. Either way, just as with data, we need good ingredients; "Garbage in, garbage out" applies in the kitchen too.

We can't cook a good dish unless we start with good ingredients, just like we can't have a good model unless we give it good data.

Data Processing

Data processing is perhaps the most important step after getting good data. How you process the data will determine how well your model performs.

For cooking, this is equivalent to preparing your ingredients. This includes chopping ingredients such as vegetables, and keeping a consistent size when chopping also counts. It helps the pieces cook evenly: if some pieces are smaller they can burn, and if some are bigger they may not be fully cooked.

Also, just like there are multiple ways to process your data in machine learning, there are different ways to prepare ingredients. In fact, there's a term for preparing all of your ingredients before you start cooking - mise en place - which is French for "everything in its place". You see this on cooking shows all the time, where they have everything ready before they start cooking.

This actually also makes sense for machine learning. We have to have all of our data processing done on the training data before we can give it to the machine learning algorithm.

Modeling

Now it's time for the actual modeling part of the process where we give our data to an algorithm.

In cooking, this is actually where we cook our dish. In fact, we can relate choosing a recipe to choosing a machine learning algorithm. The recipe will take the ingredients and turn out a dish, and the algorithm will take the data and turn out a model.

Different recipes will turn out different dishes, though. Take a salad, for instance. Depending on the recipe and the ingredients, the salad can turn out to be bright and citrusy like this kale citrus salad. Or, it can be warm and savory like this spinach salad with bacon dressing.

They're both salads, but they turn into different kinds of salads because of the different ingredients and recipes. Machine learning is similar: different data and algorithms will give you different kinds of models.

What if you have the same ingredients? There are definitely different ways to make the same dish. Hummus is traditionally made with chickpeas, tahini, garlic, and lemon like in this recipe. But there is also this other hummus recipe with the same ingredients where the method is just a bit different.

Optimizing the Model

Depending on the algorithm the machine learning model is using, we can give it different parameters that optimize the model for better performance. These parameters are called hyperparameters, and they are used during the learning process as the model fits to your data.

You can set these manually by choosing the hyperparameter values yourself, but that can be quite tedious, and you never quite know which values to choose. Instead, this can be automated: give a range of values, train the model multiple times with different combinations, and use the best performing model that is found.
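To make that idea concrete, here's a minimal sketch of the automated approach using scikit-learn's GridSearchCV; the model, the parameter values, and the X_train and y_train variables are just assumptions for illustration.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A small range of hyperparameter values to try instead of picking them by hand
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Train the model multiple times with different combinations and keep the best one
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)  # X_train and y_train are assumed to already exist

print(search.best_params_)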

How do we optimize a dish, though? Perhaps the best way to get the best taste out of your dish, other than using the best ingredients, is to season it. Specifically, seasoning with salt. In this video, Ethan Chlebowski suggests that

…home cooks severely under salt the food they are cooking and is often why the food doesn't taste good.

He even quotes this line from the book Ruhlman's Twenty:

How to salt food is the most important skill to know in the kitchen.

I've even experienced this in my own cooking where I don't add enough salt. Once I do, the dish tastes 100 times better.

Now, adding salt to your dish is the more manual way of optimizing it with seasoning. Is there a way that this can be automated? Actually, there is! Instead of using just salt and adding other spices yourself, you can get seasoning blends that have all the spices in them for you!

Evaluating the Model

Evaluating the model is going to be one of the most important steps because this tells you how well your model will perform on new data, or rather, data that it hasn't seen before. During training your model may have good performance, but giving it new data may reveal that it's actually performing badly.

Evaluating your cooked dish is a lot more fun, though. This is where you get to eat it! You will determine if it's a good dish by how it tastes. Is it good or bad? If you served it to others, what did they think about it?

Iterating on the Model

Iterating on the model is a part of the process that may not seem necessary, but it can be an important one. Your data may change over time, which would make your model stale. That is, the model relies on patterns in the data that used to hold but, due to some process change or something similar, no longer do. And since the underlying data changed, the model won't predict as well as it did.

Similarly, you may have more or even better data that you can use for training, so you can then retrain the model with that to make better predictions.

How can you iterate on a dish that you just prepared? First thing is if it was good or bad. If it was bad, then we can revisit the recipe and see if we did anything wrong. Did we overcook it? Did we miss an ingredient? Did we prepare an ingredient incorrectly?

If it was good, then we can still iterate on it.

A lot of chefs and home cooks like to take notes about recipes they've made. They write down tricks they've learned along the way, as well as places where they deviated from the recipe, either because an ingredient was missing or because they preferred a different approach.

Conclusion

Hopefully, this helps you better understand the machine learning process through the lens of cooking a dish. It may even help you understand the importance of each step because, in cooking, if one step is missed you probably won't be having a good dinner tonight.

And if you're wondering where AutoML fits into all of this, you can think of it as a meal delivery kit like Hello Fresh or Blue Apron: they do a lot of the work for you, and you just have to put it all together.

Perform Linear Regression in Azure Databricks with MLLib

When thinking of performing machine learning, especially in Python, a few frameworks may come to mind such as scikit-learn, Tensorflow, and PyTorch. However, if you're already doing your big data processing in Spark, then it actually comes with its own machine learning framework - MLLib.

In this post, we'll go over using MLLib to create a regression model within Azure Databricks. The data we'll be using is the Computer Hardware dataset from the UCI Machine Learning Repository. The data will be on an Azure Blob storage container, so we'll need to fetch the data from there to work with it.

What we want to predict from this data is the published performance of the machine based on its features. There is a second performance column at the end but, looking at the data description, that is what the original authors predicted using their algorithm. We can safely ignore that column.

If you would prefer a video version of this post, check below.

Connecting to Azure Storage

Within a new notebook, we can set some variables, such as the storage account name and what container the data is in. We can also get the storage account key from the Secrets utility method.

storage_account_name = "databricksdemostorage"
storage_account_key = dbutils.secrets.get("Keys", "Storage")
container = "data"

After that we can set a Spark config setting specifically within Azure Databricks that can connect the Spark APIs to our storage container.

spark.conf.set(f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net", storage_account_key)

For more details on connecting to Azure Storage from Azure Databricks, check out this video.


Using the Spark APIs, we can use the read property on the spark variable (which is the SparkSession) and set some options, such as telling it to automatically infer the schema and what the delimiter is. Then the csv method is called with the Windows Azure Storage Blob (WASB) URL, which is built on top of HDFS. With the data fetched, we then call the show method on it.
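Here's a minimal sketch of what that first read can look like, assuming the storage variables set at the top of the notebook and the machine.data file path used later in the post.

# Let Spark infer the column types and read the CSV straight from blob storage
data = spark.read \
    .option("inferSchema", "true") \
    .option("delimiter", ",") \
    .csv(f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/machine.data")

data.show()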

However, looking at the data, Spark gives the columns default names. Since we'll be referencing the columns, it'll be nice to give them proper names to make them easier to work with. To do this we can create our own schema and use it when reading the data.

To define the schema we'll use StructType and StructField, which help specify a schema for a Spark DataFrame. Don't forget to import these from pyspark.sql.types.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
  StructField("Vendor", StringType(), True),
  StructField("Model", StringType(), True),
  StructField("CycleTime", DoubleType(), True),
  StructField("MinMainMemory", DoubleType(), True),
  StructField("MaxMainMemory", DoubleType(), True),
  StructField("Cache", DoubleType(), True),
  StructField("MinChannels", DoubleType(), True),
  StructField("MaxChannels", DoubleType(), True),
  StructField("PublishedPerf", DoubleType(), True),
  StructField("RelativePerf", DoubleType(), True)
])

With this schema defined, we can pass it to the read property using the schema method. Since we pass in the schema, we no longer need the inferSchema option. Also, we can now tell Spark to use a header row.

data = spark.read.option("header", "true").option("delimiter", ",").schema(schema) \
  .csv(f"wasbs://{container}@{storage_account_name}.blob.core.windows.net/machine.data")

data.show()

Splitting the Data

With our data now set, we can start building our linear regression machine learning model with it. The first thing to do, though, is to split our data into a training and testing set. We can do this with the randomSplit method on the Spark DataFrame.

(train_data, test_data) = data.randomSplit([0.8, 0.2])

The randomSplit method takes in a list as a parameter: the first item is the fraction of data to keep for the training set and the second item is the fraction for the testing set. It returns a list of two DataFrames, which we unpack into the two variables (hence the parentheses around them).

Just because I'm curious, let's see what the count of each of these DataFrames is.

print(train_data.count())
print(test_data.count())

Creating Linear Regression Model

Before we can go further, we need to make some additional imports. We need to import the LinearRegression class, a class to help us vectorize our features, and a class that can help us evaluate our model based on the test data.

from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator

Now, we can begin building our machine learning model. The first thing we need is to create a "features" column. This column will be an array of all of our numerical columns. This is what the VectorAssembler class can do for us.

vectors = VectorAssembler(inputCols=['CycleTime', 'MinMainMemory', 'MaxMainMemory', 'Cache', 'MinChannels', 'MaxChannels'], outputCol="features")

vector_data = vectors.transform(train_data)

For the VectorAssembler we give it a list of input columns that we want to vectorize into a single column. There's also an output column parameter in which we specify what we want the name of the new column to be.

NOTE: To keep this simple, I'll exclude the text columns. Don't worry, though, we'll go over how to handle this in a future post/video.

We can call the show method on the vector data to see how that looks.

vector_data.show()

Notice the "features" column was appended on to the DataFrame. Also, you can tell that the values match each value in the other columns in that row. For instance, the first value in the "features" column is 29 and the first value in the "CycleTime" column is 29.

Let's clean up the DataFrame a bit and keep only the columns we care about, the "features" and "PublishedPerf" columns. We can do that just by using the select method.

features_data = vector_data.select(["features", "PublishedPerf"])

features_data.show()

With our data now updated to the format the algorithm wants, let's actually create the model. This is where we use the LinearRegression class from the import.

lr = LinearRegression(labelCol="PublishedPerf", featuresCol="features")

With that class, we give its constructor a couple of parameters: the label column name and the features column name.

Now we can fit the model based on our data.

model = lr.fit(features_data)

And now we have our model! We can look into it by getting the model summary and examining the R Squared on it.

summary = model.summary

print("R^2", summary.r2

From here, it looks like it performs quite well with an R Squared of around 91%. But let's evaluate the model on the test data set. This is where we use the RegressionEvaluator class we imported.

evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="PublishedPerf", metricName="r2")

This takes in a few constructor parameters, which include the label column name, the metric name we want to use for our evaluation, and the prediction column. We'll get the prediction column when we run the model on our test set.

With our evaluator defined, we can now start to pass data to it. But first, we need to get our test dataset into the same format as our training dataset, so we'll follow the same steps to vectorize the data.

vector_test = vectors.transform(test_data)

Then, select the columns we care about using.

features_test = vector_test.select(["features", "PublishedPerf"])

Now we can use the model to make predictions on our test data set and we'll show those results.

test_transform = model.transform(features_test)

test_transform.show()

The prediction column holds the predicted values. You can do a bit of a comparison between it and the "PublishedPerf" column. Do you think this model will perform well based on what you see?

With the predictions on the test dataset we can now evaluate our model based on that data.

evaluator.evaluate(test_transform)

Looks like the model doesn't perform very well with this R Squared being 56%, so there is probably some feature engineering we can do.

If you watched the video and saw that the evaluation returned 90%, then it's possible the split got different data that caused this discrepancy. This is a good reason to run cross validation on your data.
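If you want to try that here, below is a minimal sketch of how cross validation could look with Spark's CrossValidator from pyspark.ml.tuning, reusing the lr trainer and evaluator defined above; the parameter grid and fold count are just assumptions for illustration.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Try a few regularization values for the linear regression trainer
param_grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.0, 0.1, 0.5]) \
    .build()

# 5-fold cross validation scored with the R Squared evaluator defined earlier
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=5)

cv_model = cv.fit(features_data)

# Evaluate the best model found on the held-out test set
print(evaluator.evaluate(cv_model.transform(features_test)))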


Hopefully, you've learned some things with this post about using MLLib in Azure Databricks. The main thing to take away is that MLLib is built into Spark and, therefore, into Azure Databricks, so there's no need to install another library to perform machine learning on your data.

Top 5 Machine Learning Books

Machine learning is a vast subject and there is a lot to learn. Luckily, there are several books out there that can help us along the way. Below I list what I believe are the top 5 machine learning books that are currently out there.

ML.NET End-to-End: Build Model from Database Data

When doing machine learning on your own data instead of data downloaded from the internet, you'll often have it stored in a database. In this post, I'll show how to use an Azure SQL database to write and read data, then use that data to build an ML.NET machine learning model. I'll also show how to save the model to an Azure Blob Storage container so other applications can use it.

The code can be found on GitHub. For a video going over the code, check below.

The Data

The data used will be the wine quality data that's on Kaggle. This has several properties of wine, such as its pH, sugar content, and whether the wine is red or white.


The label column will be "Quality". So we will use the other characteristics of wine to predict this value, which will be from 1 to 10.

Setup

Creating Azure Resources

For the database, and for a place to store the model file that other code (such as an API) can read from, we'll be using Azure.

Azure SQL Database

To create the SQL database, in the Azure Portal, click New Resource -> Databases -> SQL Database.


On the new page, fill in the required information. If creating a new SQL server to hold the database, keep track of the username and password you use, as they will be needed to connect to the database later. Click Review + Create and, if validation passes, click Create.


Azure Blob Storage

While the SQL Server and database are being deployed, click on New Resource -> Storage -> Storage Account


Similar to when creating the SQL database, fill in the required items. For a blob container, make sure the Account kind is set to Container.


Creating Database Table

Before we can start writing to the database, the table we'll be writing to needs to be created. You can use Azure Data Studio, which is a lightweight version of SQL Server Management Studio, to connect to the database created earlier using the username and password, and then use the below script to create the table. The script just has columns corresponding to the columns in the data file, with an added ID identity column.

CREATE TABLE dbo.WineData
(
    ID int NOT NULL IDENTITY,
    Type varchar(10) not null,
    FixedAcidity FLOAT not null,
    VolatileAcidity FLOAT not null,
    CitricAcid FLOAT not null,
    ResidualSugar FLOAT not null,
    Chlorides FLOAT not null,
    FreeSulfurDioxide FLOAT not null,
    TotalSulfurDioxide FLOAT not null,
    Density FLOAT not null,
    Ph FLOAT not null,
    Sulphates FLOAT not null,
    Alcohol FLOAT not null,
    Quality FLOAT not null
)

Code

The code will be done in a .NET Core Console project in Visual Studio 2017.

NuGet Packages

Before we can really get started the following NuGet packages need to be installed:

  • Microsoft.ML
  • Microsoft.Extensions.Configuration
  • Microsoft.Extensions.Configuration.Json
  • System.Data.SqlClient
  • WindowsAzure.Storage

Config File

In order to use the database and Azure Blob Storage we just created, we will have a config JSON file instead of hard coding our connection strings. The config file will look like this:

{
  "sqlConnectionString": "<SQL Connection String>",
  "blobConnectionString": "<Blob Connection String>"
}

Also, don't forget to mark this file to copy to the output directory in its properties.


The connection strings can be obtained on the resources in the Azure Portal. For the SQL database connection string, go to the Connection strings section and you can copy the connection string from there.


You will need to update the connection string with the username and password that you used when creating the SQL server.

For the Azure Blob Storage connection string, go to the Access keys section. There will be a key that can be used to connect to the storage account, and under that will be the connection string.


Writing to Database

Since we just created the SQL server, database, and table, we need to add the data to it. Because we have the System.Data.SqlClient package, we can use SqlConnection to connect to the database. Note that an ORM like Entity Framework could be used instead of the methods from the System.Data.SqlClient package.

Real quick, though, let's set up by adding a couple of fields on the class. One to hold the SQL connection string and another to have a constant string of the file name for the model we will create.

private static string _sqlConnectionString;
private static readonly string fileName = "wine.zip";

Next, let's use the ConfigurationBuilder to build the configuration object and to allow us to read in the config file values.

var builder = new ConfigurationBuilder()
    .SetBasePath(Directory.GetCurrentDirectory())
    .AddJsonFile("config.json");

var configuration = builder.Build();

With the configuration built, we can use it to pull out the SQL connection string and assign it to the field created earlier.

_sqlConnectionString = configuration["sqlConnectionString"];

To write the data into the database we need to read it in from the file. I put the file in the project and mark it to be copied to the output directory, the same as the config file, so it gets copied over and can be read at runtime.

Using LINQ, we can read from the file and parse out each of the columns into a WineData object.

var items = File.ReadAllLines("./winequality.csv")
    .Skip(1)
    .Select(line => line.Split(","))
    .Select(i => new WineData
    {
        Type = i[0],
        FixedAcidity = Parse(i[1]),
        VolatileAcidity = Parse(i[2]),
        CitricAcid = Parse(i[3]),
        ResidualSugar = Parse(i[4]),
        Chlorides = Parse(i[5]),
        FreeSulfurDioxide = Parse(i[6]),
        TotalSulfurDioxide = Parse(i[7]),
        Density = Parse(i[8]),
        Ph = Parse(i[9]),
        Sulphates = Parse(i[10]),
        Alcohol = Parse(i[11]),
        Quality = Parse(i[12])
    });

The WineData class holds all of the fields that are in the data file.

public class WineData
{
    public string Type;
    public float FixedAcidity;
    public float VolatileAcidity;
    public float CitricAcid;
    public float ResidualSugar;
    public float Chlorides;
    public float FreeSulfurDioxide;
    public float TotalSulfurDioxide;
    public float Density;
    public float Ph;
    public float Sulphates;
    public float Alcohol;
    public float Quality;
}

There's an additional Parse method applied to all but one of the fields. That's because we get back string values from the file, but our class says those fields should be of type float. The Parse method is fairly straightforward in that it just tries to parse out the value and, if it can't, it uses the default value of float, which is 0.0.

private static float Parse(string value)
{
    return float.TryParse(value, out float parsedValue) ? parsedValue : default(float);
}

Now that we have the data, we can save it to the database. In a using statement, new up an instance of SqlConnection and pass in the connection string as the parameter. Inside here, we need to call the Open method of the connection and then create an insert statement. Since the table has an identity ID column, we list the columns explicitly in the insert statement. Then, loop over each item from the results of reading in the file and add each field from the item as a parameter for the insert statement. After that, call the ExecuteNonQuery method to execute the query on the database.

using (var connection = new SqlConnection(_sqlConnectionString))
{
    connection.Open();

    var insertCommand = @"INSERT INTO dbo.WineData
        (Type, FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides,
         FreeSulfurDioxide, TotalSulfurDioxide, Density, Ph, Sulphates, Alcohol, Quality)
        VALUES
        (@type, @fixedAcidity, @volatileAcidity, @citricAcid, @residualSugar, @chlorides,
         @freeSulfurDioxide, @totalSulfurDioxide, @density, @ph, @sulphates, @alcohol, @quality);";

    foreach (var item in items)
    {
        var command = new SqlCommand(insertCommand, connection);

        command.Parameters.AddWithValue("@type", item.Type);
        command.Parameters.AddWithValue("@fixedAcidity", item.FixedAcidity);
        command.Parameters.AddWithValue("@volatileAcidity", item.VolatileAcidity);
        command.Parameters.AddWithValue("@citricAcid", item.CitricAcid);
        command.Parameters.AddWithValue("@residualSugar", item.ResidualSugar);
        command.Parameters.AddWithValue("@chlorides", item.Chlorides);
        command.Parameters.AddWithValue("@freeSulfureDioxide", item.FreeSulfurDioxide);
        command.Parameters.AddWithValue("@totalSulfurDioxide", item.TotalSulfurDioxide);
        command.Parameters.AddWithValue("@density", item.Density);
        command.Parameters.AddWithValue("@ph", item.Ph);
        command.Parameters.AddWithValue("@sulphates", item.Sulphates);
        command.Parameters.AddWithValue("@alcohol", item.Alcohol);
        command.Parameters.AddWithValue("@quality", item.Quality);

        command.ExecuteNonQuery();
    }
}

We can run this and check the database to make sure the data got added.


Reading from Database

Now that we have data in our database, let's read from it. This code will be similar to what we used to write to the database, using the SqlConnection class again. In fact, the only differences are the query we send and how we read in the data.

We do need a variable to add each row to, though, so we can create a new List of WineData objects.

var data = new List<WineData>();

Within the SqlConnection we can create a select statement that returns the data columns and execute it with the ExecuteReader method. Note that we list the columns explicitly and skip the ID column, so the column positions line up with the WineData fields. This returns a SqlDataReader object and we can use that to extract the data.

In a while loop, which checks that the reader can read the next row, add a new instance of the WineData object to the List created earlier, mapping from the reader to the object using the reader.GetValue method. The GetValue parameter is the column position, and we call ToString on the result. Note that we need the Parse method from above again here to parse the strings into floats.

using (var conn = new SqlConnection(_sqlConnectionString))
{
    conn.Open();

    var selectCmd = @"SELECT Type, FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides,
        FreeSulfurDioxide, TotalSulfurDioxide, Density, Ph, Sulphates, Alcohol, Quality FROM dbo.WineData";

    var sqlCommand = new SqlCommand(selectCmd, conn);

    var reader = sqlCommand.ExecuteReader();

    while (reader.Read())
    {
        data.Add(new WineData
        {
            Type = reader.GetValue(0).ToString(),
            FixedAcidity = Parse(reader.GetValue(1).ToString()),
            VolatileAcidity = Parse(reader.GetValue(2).ToString()),
            CitricAcid = Parse(reader.GetValue(3).ToString()),
            ResidualSugar = Parse(reader.GetValue(4).ToString()),
            Chlorides = Parse(reader.GetValue(5).ToString()),
            FreeSulfurDioxide = Parse(reader.GetValue(6).ToString()),
            TotalSulfurDioxide = Parse(reader.GetValue(7).ToString()),
            Density = Parse(reader.GetValue(8).ToString()),
            Ph = Parse(reader.GetValue(9).ToString()),
            Sulphates = Parse(reader.GetValue(10).ToString()),
            Alcohol = Parse(reader.GetValue(11).ToString()),
            Quality = Parse(reader.GetValue(12).ToString())
        });
    }
}

Creating the Model

Now that we have our data from the database, let's use it to create an ML.NET model.

First thing, though, let's create an instance of the MLContext.

var context = new MLContext();

We can use the LoadFromEnumerable helper method to load the IEnumerable data that we have into the IDataView that ML.NET uses. In previous versions of ML.NET this used to be called ReadFromEnumerable.

var mlData = context.Data.LoadFromEnumerable(data);

Now that we have the IDataView we can use that to split the data into a training set and test set. In previous versions of ML.NET this returned a tuple and it could be deconstructed into two variables (var (trainSet, testSet) = ...), but now it returns an object.

var testTrainSplit = context.Regression.TrainTestSplit(mlData);

With the data set up, we can create the pipeline. The two main things to do here are to one-hot encode the Type feature, which denotes if the wine is red or white, and then concatenate the other features into a features array. We'll use the FastTree trainer, and since our label column isn't named "Label", we set the labelColumnName parameter to the name of the column we want to predict, which is "Quality".

var pipeline = context.Transforms.Categorical.OneHotEncoding("TypeOneHot", "Type")
                .Append(context.Transforms.Concatenate("Features", "FixedAcidity", "VolatileAcidity", "CitricAcid",
                    "ResidualSugar", "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "Density", "Ph", "Sulphates", "Alcohol"))
                .Append(context.Regression.Trainers.FastTree(labelColumnName: "Quality"));

With the pipeline created, we can now call the Fit method on it with our training data.

var model = pipeline.Fit(testTrainSplit.TrainSet);

Save Model

With our new model, let's save it to Azure Blob Storage so we can retrieve it to build an API around the model.

To start, we'll use the connection string that we put in the config earlier. We then pass that into the Parse method of the CloudStorageAccount class.

var storageAccount = CloudStorageAccount.Parse(configuration["blobConnectionString"]);

With a reference to the storage account, we can now use that to create a client and use the client to create a reference to the container that we will call "models". This container will need to be created in the storage account, as well.

var client = storageAccount.CreateCloudBlobClient();
var container = client.GetContainerReference("models");

With the container reference, we can create a blob reference to a file, which we created earlier as a field.

var blob = container.GetBlockBlobReference(fileName);

To save the model to a file, we can create a file stream using File.Create and inside the stream we can call the context.Model.Save method.

using (var stream = File.Create(fileName))
{
    context.Model.Save(model, stream);
}

And to upload the file to blob storage, just call the UploadFromFileAsync method. Note that this method is async, so we need to mark the containing method as async and add await in front of this method.

await blob.UploadFromFileAsync(fileName);

After running this, there should now be a file added to blob storage.


Hope this was helpful. In the next part of this end-to-end series, we will show how to create an API that will load the model from Azure Blob Storage and use it to make predictions.

Clustering in ML.NET

Clustering is a well-known type of unsupervised machine learning algorithm. It is unsupervised because there usually isn't a known label in the data for the algorithm to train against. Instead of learning how data points relate to a known label value, an unsupervised algorithm finds patterns among the data points themselves. In this post, I'll go over how to use the clustering trainer in ML.NET.

This example will be using ML.NET version 0.11. Sample code is on GitHub.

For a video version of this example, check out the video below.

The Data

The data I'll be using is the wheat seed data that can be found on Kaggle. This data has properties of wheat seeds such as the area, perimeter, length, and width of each seed, among others. These properties determine which variety of wheat the seed is: Kama, Rosa, or Canadian.

Project Setup

For the code, I'll create a new .NET Core Console project and bring in ML.NET as a NuGet package. For the data, I like to put it in the project itself so it's easier to work with. When doing that, don't forget to mark the file as Copy always or Copy if newer so it can be read when running the project.


Loading Data

To start off, create an instance of the MLContext.

var context = new MLContext();

To read in the data, use the CreateTextLoader method on the context.Data property. This takes in an array of TextLoader.Column objects. In each object's constructor, pass in the name of the column, the data type (all of ours will be DataKind.Single to represent a float), and the position of the column in the file. Then, as other parameters to the CreateTextLoader method, pass in that the file has a header and that the separator is a comma.

var textLoader = context.Data.CreateTextLoader(new[]
{
    new TextLoader.Column("A", DataKind.Single, 0),
    new TextLoader.Column("P", DataKind.Single, 1),
    new TextLoader.Column("C", DataKind.Single, 2),
    new TextLoader.Column("LK", DataKind.Single, 3),
    new TextLoader.Column("WK", DataKind.Single, 4),
    new TextLoader.Column("A_Coef", DataKind.Single, 5),
    new TextLoader.Column("LKG", DataKind.Single, 6),
    new TextLoader.Column("Label", DataKind.Single, 7)
},
hasHeader: true,
separatorChar: ',');

With our data schema defined we can use it to load in the data. This is done by calling the Load method on the loader we just created above and passing in the file location.

IDataView data = textLoader.Load("./Seed_Data.csv");

Now that the data is loaded let's use it to get a training and test set. We can do that with the context.Clustering.TrainTestSplit method. All this takes in is the IDataView that we got when we loaded in the data. Optionally, we can specify what fraction of the data to get for our test set.

var trainTestData = context.Clustering.TrainTestSplit(data, testFraction: 0.2);

This returns an object that has TrainSet and TestSet properties.

Building the Model

Now that the data is loaded and we have our train and test data sets, let's now create the pipeline. We can start simple by creating a features vector and then passing that into a clustering algorithm of our choosing. Since all of the data are float columns there's no need to do any other processing to it.

var pipeline = context.Transforms.Concatenate("Features", "A", "P", "C", "LK", "WK", "A_Coef", "LKG")
    .Append(context.Clustering.Trainers.KMeans(featureColumnName: "Features", clustersCount: 3));

Using the context.Transforms property we have access to several transformations we can perform on our data. The one we'll do here is the Concatenate transform. The first parameter is the name of the new column that it will create after concatenating the specified columns. The next parameter(s) are params of all the columns to be concatenated.

Appended to the transform is the trainer, or algorithm, we want to use. In this case we'll use the K-Means algorithm. The parameters here are the column name of all the features, which we specified in the Concatenate transform as "Features". This is actually defaulted to "Features" so we don't need to specify it. We can also define the number of clusters the algorithm should try to create.

To get a preview of the data so far, we can call the Preview method on any instance of IDataView.

var preview = trainTestData.TrainSet.Preview();

To create the model, we simply call the Fit method on the pipeline and pass in the training set.

var model = pipeline.Fit(trainTestData.TrainSet);

Evaluating the Model

With a model built, we can now do a quick evaluation on the model. To do this for clustering, use the context.Clustering.Evaluate method. We can pass in the test data set. However, we would need to transform that data set similar to what we did in the pipeline. To do this we can use the Transform method on the model and pass in our test data set.

var predictions = model.Transform(trainTestData.TestSet);

Now we can use the test data set to evaluate the model and give some metrics.

var metrics = context.Clustering.Evaluate(predictions);

We get a few metrics for clustering on the metrics object, but the one I care about is the average minimum score. This tells us the average distance from the examples to the center of their cluster, so the lower the number, the better the clustering.

Console.WriteLine($"Average minimum score: {metrics.AvgMinScore}");

Predicting on Model

To make a prediction with our model we first need to create a prediction engine. To do that, call the CreatePredictionEngine method on the model. This method is generic, and it specifies the input data class and the prediction class so it knows what object to read in for a new prediction and what object to return when it makes a prediction.

public class SeedData
{
    public float A;
    public float P;
    public float C;
    public float LK;
    public float WK;
    public float A_Coef;
    public float LKG;
    public float Label;
}

public class SeedPrediction
{
    [ColumnName("PredictedLabel")]
    public uint SelectedClusterId;
    [ColumnName("Score")]
    public float[] Distance;
}

The ColumnName attribute tells the prediction engine what fields to use for those columns. This is under the Microsoft.ML.Data namespace.

To create the prediction engine, which used to be done using the CreatePredictionFunction method in previous versions of ML.NET, call CreatePredictionEngine on the model and pass in the context as the parameter.

var predictionFunc = model.CreatePredictionEngine<SeedData, SeedPrediction>(context);

Now we can use the prediction engine to make predictions.

var prediction = predictionFunc.Predict(new SeedData
{
    A = 13.89F,
    P = 15.33F,
    C = 0.862F,
    LK = 5.42F,
    WK = 3.311F,
    A_Coef = 2.8F,
    LKG = 5
});

And we can get the selected cluster ID, or what cluster the model predicts the data would belong to.

Console.WriteLine($"Prediction - {prediction.SelectedClusterId}");