So you’ve heard of Machine Learning. You’ve seen the TED talks. Read the blog posts. But how does one DO it? Is it math? Is it programming? Is it super duper hard? Is brain surgery involved? No.
It’s actually pretty easy. Let me walk you through it. If at any point you find yourself asking the question of why or how then you can also refer to my previous post explaining some of the conceptual basics. This current post is more of a practical example.
SPOILER ALERT: It’s 90% boring programming. Programming is kind of like having a really big box of legos. You CAN build anything, but how do you build something useful? You have to know which pieces carry out certain functions, and most importantly how to put them together. What’s the remaining 10%? Math. Mostly basic statistics.
Whatever you want to do, there are just five steps to follow to create your own Machine Learning solution.
The example code below is in Python, one of the most popular languages for data science and machine learning. The easiest way to go about it is to install a program called Jupyter (see below). Think of it as installing Excel for the first time. Once you have it, you can use it for so many interesting things beyond this tutorial. It raises the barrier to entry a little, but enables you to data science the sh*t out of any data you find in the future.
Jupyter: If you want to follow step-by-step, and want to learn Python, you might as well set it up to run on your own computer. Plus it’s all free. Go here: https://jupyter.org/install. Then once you start Anaconda, choose Jupyter from the menu and start a new Python 3 worksheet. Trust me, that sentence will make sense once you follow the link.
DISCLAIMER: You CAN just read on from here, but you won’t learn much unless you try it. So maybe skim through first, but if you ARE still interested, I highly recommend you install Jupyter and give it a go! That’s the best way to learn this.
Step 1: The Data
While you could just play around with data to create some kind of solution, it’s pretty unlikely to create any value for your business. You need to have a problem to solve first. Do you need to figure out why your customers are leaving? Do you need to predict or forecast seasonal revenues? Do you need to identify flying squirrels with traffic cameras?
Oh, and you need data. If you don’t have data, you can’t do anything. Sometimes you can source datasets online, especially if you’re working on images, audio, or text. In certain cases, you might have to create that data yourself. Perhaps you have a large set of images in which you want to identify an object. All you need to do is have a team of people manually label those images that contain that object. Sometimes you can start with whatever database of data you already have.
- Usage data: your app, backend, and analytics
- Government statistics: housing, economy, education, weather, maps, etc…
- Open databases: research data like medicine, industry data like mobile devices and online shopping
- Create it yourself
For the purposes of learning, we’re going to get a nicely prepared dataset from Kaggle, which is an amazing site chock-full of data and resources for all things ML. This specific dataset is telco customer data, and the objective here is to identify clients that are likely to “churn”, i.e. leave for another telco. That might be a useful exercise to try on your company’s own data after this exercise, wouldn’t it?
Reading data: The easiest way to get data into your Python program is the Pandas library. The obvious way to go is to read a CSV file which you can generate from any spreadsheet or database system.
To download our telco dataset, go to this Kaggle page. You will need to register, but it’s free. Then put the file in the same folder where you created your still-empty Jupyter notebook. You can also just use the Upload button from the Jupyter home menu. Great, now let’s check out our new data!
import pandas as pd
myData = pd.read_csv('bigml_59c28831336c6604c800002a.csv')
Step 2: Data Exploration
The next step is to get a lay of the land. What’s actually in the data? What types of data? How much of the different types? What are the ranges of values? Are the values clustered together or almost random? Are there any interesting correlations to learn?
Analysis: The Pandas library has a handful of tools worth running each time you get a new dataset. These methods work on any dataset in DataFrame format. Let’s try some!
Show the first few rows of data:
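Here is a minimal sketch of that preview step. The tiny `demo` frame and its values are made up so the snippet runs on its own; in the notebook you would call the same method on `myData`.

```python
import pandas as pd

# Hypothetical stand-in for the telco data; the values are invented.
demo = pd.DataFrame({
    "account length": [128, 107, 137, 84, 75, 118],
    "international plan": ["no", "no", "no", "yes", "yes", "yes"],
    "customer service calls": [1, 1, 0, 2, 3, 0],
    "churn": [False, False, False, False, False, True],
})

preview = demo.head()  # first five rows by default; head(10) would show ten
print(preview)
```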
These are the outputs from running the above code. So head() produces a preview of the dataset.
Don’t worry about carefully examining everything here. We’re just looking for an overview at this point, like browsing the data in a few different ways.
Show what columns and data types are in your dataframe:
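One way to get that overview, sketched on a made-up `demo` frame (an assumption so the snippet is self-contained). In the notebook, the same calls on `myData` would list all of its columns.

```python
import pandas as pd

# Hypothetical stand-in for the telco data; the values are invented.
demo = pd.DataFrame({
    "state": ["KS", "OH", "NJ"],
    "total day minutes": [265.1, 161.6, 243.4],
    "customer service calls": [1, 1, 0],
    "churn": [False, False, False],
})

demo.info()         # column names, non-null counts, and data types
print(demo.shape)   # (number of rows, number of columns)
print(demo.dtypes)  # one data type per column
```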
NOTE: While most of the fields are numbers, a few are mysterious “objects”, so we’d better look into those. Algorithms work with numbers; they don’t do well with arbitrary objects.
So we can see there is data for 3,333 customers here. For each customer, we have 21 different fields. Churn is there as a field, too!
Show various statistics from your dataframe:
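A small sketch of the usual Pandas call for this, again on an invented `demo` frame rather than the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for two numeric telco columns; the values are invented.
demo = pd.DataFrame({
    "total day minutes": [265.1, 161.6, 243.4, 299.4],
    "customer service calls": [1, 1, 0, 2],
})

stats = demo.describe()  # count, mean, std, min, quartiles, and max per numeric column
print(stats)
```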
Okay, there’s some stats on what kind of numbers are in various fields.
Again, don’t stress out about thinking about what it all means. We’re poking around at this point to see what shouts out.
Explore value ranges for individual columns:
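For a single column, `unique()` lists every distinct value it contains. The sketch below uses an invented `demo` frame with two of the telco column names; in the notebook you would index `myData` the same way.

```python
import pandas as pd

# Hypothetical stand-in; the values are invented.
demo = pd.DataFrame({
    "churn": [False, False, True, False],
    "international plan": ["no", "yes", "no", "no"],
})

churn_values = demo["churn"].unique()               # distinct values, in order of appearance
plan_values = demo["international plan"].unique()
print(churn_values)
print(plan_values)
```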
This is also a binary field, but it’s just words instead of numbers. No problem.
True means this customer left us, false therefore means the customer is still around. This is the most important field for us, since we’d like to predict it later!
What about those other ones that had “object” as their type?
That makes sense, they’re just binary indicators for what plans are part of the customer’s contract.
Another thing to check is how many of those True and False examples are there for the Churn field.
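Counting the examples per value is a one-liner with `value_counts()`. This is a sketch on a made-up column, so the counts here say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the churn column; the values are invented.
demo = pd.DataFrame({"churn": [False, False, False, True, False, False]})

counts = demo["churn"].value_counts()                # rows per distinct value
shares = demo["churn"].value_counts(normalize=True)  # same thing as fractions of the total
print(counts)
print(shares)
```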
It’s worth noting there is a significant “bias” (class imbalance) in this data: there are far fewer True examples than False ones. This can make it harder to predict True than False later on.
Now we know the lay of the land, and can proceed to visualize some of this data!
Plotting: Humans work best with visual representations of data, so plotting libraries are useful to learn early. Plotting is like a detective journey, where you’re looking for correlations in the data related to things you want to predict. In this case, what is actually driving churn?
The Seaborn library contains plots like countplot, pairplot, jointplot, barplot, and heatmap. If you want to share your labor of love, Plotly offers the ability to upload your plots to their cloud service with a shareable link. There are many others too, but let’s start with the basics.
A great way to get an overview of interesting things is to plot the correlations between the columns of your dataframe:
import seaborn as se
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = [16, 9]
# numeric_only=True (pandas 1.5+) skips the text columns; pandas 2.x raises an error without it
se.heatmap(myData.corr(numeric_only=True))
Oooh, that’s pretty! Basically, the brighter the square, the more correlation there is between the values of these fields. Obviously, charges and minutes are completely correlated, so let’s ignore those. NOTE: you can only do correlations on numbers, so anything containing words isn’t here.
Before we go deeper, let’s make sure we check the non-numerical fields in case they contain juicy correlations, too. Why don’t we start by changing those plan columns into numbers? Let’s throw state in there, too. Phone number, nah.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Replace each text value with an integer code, one column at a time
myData['voice mail plan'] = le.fit_transform(myData['voice mail plan'])
myData['international plan'] = le.fit_transform(myData['international plan'])
myData['state'] = le.fit_transform(myData['state'])
Let’s see what happened.
Nice. That’s a lot better.
So let’s try that correlation matrix again, see what shows up.
Oh, snap! There’s a pretty bright square with international plan and churn. Those customers ain’t happy!
We need to dig into these bright squares now.
se.countplot(x='customer service calls', data=myData)
Okay, so most clients make zero or a few calls. But how does this relate to churn then?
Let’s add churn into the mix.
se.countplot(x='customer service calls', hue='churn', data=myData)
So most people who churn have made a customer service call. Makes sense.
How does the international plan fit in?
se.countplot(y='churn', hue='international plan', data=myData)
That tells us that the international plan basically sucks. Compared to other customers they have a huge churn rate in proportion!
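If you want an actual number instead of eyeballing bar heights, a group-by does the trick. This is a sketch on invented data: the `demo` frame is an assumption, and it keeps the plan as "yes"/"no" text, whereas in the notebook the column has already been encoded to 0/1.

```python
import pandas as pd

# Hypothetical stand-in; the values are invented.
demo = pd.DataFrame({
    "international plan": ["no", "no", "no", "no", "yes", "yes"],
    "churn": [False, False, False, True, True, False],
})

# The mean of a True/False column is the share of True values, i.e. the churn rate
churn_rate = demo.groupby("international plan")["churn"].mean()
print(churn_rate)
```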
One more thing. Let’s get a little fancy with it. Let’s look at some of the other purple squares from the correlation to see how they combine with churn.
plt.scatter(x="total day minutes", y="total eve minutes", c="churn", s=20, alpha=0.8, cmap="Accent", data=myData)
Oooh, that’s pretty. Also, the bad news is that the more people call, the more they want to leave! That sounds like a terrible business model.
So we have a hypothesis of sorts. Active clients leave, and the international plan sucks. So let’s try to predict churn as a function of all available parameters now! This is why we’re here, isn’t it?
Step 3: Feature Engineering
The most challenging part of this whole exercise isn’t actually the machine learning part. It’s preparing the data so the ready-made algorithms can do something useful with it.
Prep dataset: For any learning task, you’ll need two things: inputs and outputs. Both should already be in your dataset that you’ve been exploring. Input columns are called features. Output columns are called labels. So you’ll want to identify based on your exploration what features look like they have some correlation and therefore predictive value for your label(s).
Pandas again has several easy and useful tools to weed out what you don’t need, and format what you want left in. Before we get to predict anything, we need to do some housekeeping, and get rid of irrelevant things like state and phone number.
Removing unnecessary columns (axis=1) goes like this:
myData = myData.drop(['phone number', 'area code', 'state'], axis=1)
Let’s get those labels separate.
labels = myData['churn']
myData = myData.drop(['churn'], axis=1)
Encoding: This is often the hardest part to understand if you haven’t done much with algorithms and statistics. Learning algorithms only work with numbers. They don’t know a postal code from a telephone number, or names, or images. You have to feed them actual numbers instead of other stuff. There are many ways to do this, and some get pretty complicated quickly.
A trivial example might be to swap names with a placeholder number. If one of your features is a list of names like “Brad”, “Chad”, and “Sinbad” then you can replace them in-place with 1, 2, and 3. You can do more research on useful encoders from the Scikit-Learn framework like LabelEncoder and OneHotEncoder.
That’s exactly what we already did when we turned the voice mail plan and international plan fields into numbers!
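To make the Brad/Chad/Sinbad example concrete, here is a small sketch using the two Scikit-Learn encoders mentioned above. The list of names is just the toy example from the text. Note that LabelEncoder numbers its classes from 0, not 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

names = ["Brad", "Chad", "Sinbad", "Chad"]

# LabelEncoder: sort the distinct values, then number them starting from 0
le = LabelEncoder()
codes = le.fit_transform(names)
print(list(le.classes_), list(codes))

# OneHotEncoder: one 0/1 column per distinct value; it expects a 2-D input
column = np.array(names).reshape(-1, 1)
one_hot = OneHotEncoder().fit_transform(column).toarray()
print(one_hot)
```

One-hot encoding avoids implying that Sinbad is somehow "bigger" than Brad, which plain number codes can suggest to some algorithms.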
Training split / balance: To be able to actually run the training algorithm, you need two sets of data. Why? If you use all of your data to train the algorithm, there is no data left to test if the algorithm learned to predict/classify anything. So you want to save some data to test if the learning actually worked. Luckily there are tools available to randomly pick these subsets for you.
Split into a typical 75% training, 25% testing data using your separated datasets for features and labels:
from sklearn.model_selection import train_test_split
X_train, x_test, Y_train, y_test = train_test_split(myData, labels, test_size=0.25)
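Two optional arguments are worth knowing about here, sketched below on synthetic stand-in data (the arrays are invented, not the telco dataset). `stratify` keeps the True/False ratio the same in both subsets, which helps with lopsided labels like churn, and `random_state` makes the random split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 3 features, 15% positive labels
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 3))
target = np.array([True] * 150 + [False] * 850)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target,
    test_size=0.25,
    stratify=target,   # keep the share of True labels equal in train and test
    random_state=42,   # fixed seed, so the same rows are picked every run
)
print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```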
Wait, what if I don’t have label data already?
Actually, this is usually the case if you just start by exploring an existing dataset. Here are a few strategies you could try to create labels, and redo this section once you have them!
- Hidden somewhere in your existing data
- Redefine the problem to use existing data
- Identify what you need and add code to gather it
- Hack it together with Excel or Python
- Expert labeling: who is qualified?
- Crowdsource: wisdom of the crowd
In our example, we were lucky enough to have them all ready to go!
Step 4: Training
Most people find it funny how simple this step is. That’s because decades of hard work have gone into standardizing and tuning these algorithms, so you can just use them.
Choose algorithm: The choice of algorithm really depends on the problem you set out to solve. If you’re predicting real-estate value or forecasting revenues, you’re looking for Regression algorithms that will give you a clear number as the output. If you’re trying to make a decision, that would often fall under Classification algorithms. Classification algorithms can give you the best answer or probabilities for all possible answers, depending on what you want. There are dozens of flavors of each type of course, and some involve neural networks. Given the tuning challenges there, you’re better off starting elsewhere though.
Often the best place to start is a simple linear algorithm, as it literally draws a straight line on top of your dataset. From there, you can optimize the result by exploring other methods such as Decision Trees or Support Vector Machines.
In this case, we’re going to try a few types of classifier algorithms, since this is a classification problem (predict churn or not churn as the output).
Linear:
from sklearn.linear_model import SGDClassifier
linear = SGDClassifier()
Support Vector Machine:
from sklearn.svm import LinearSVC
vector = LinearSVC()
Decision Tree:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
Hey wait, where are the Neural Networks? Isn’t that what all the fuss is about? Okay fine, we’ll try that too.
NOTE: Unlike the others, you DO have to make some choices to use a neural network. Why did we choose an alpha (the L2 regularization strength) of exactly 0.1? Why did we choose three layers, with precisely 200, 10, and 3 neurons per layer? Well, because with standard parameters, the network literally couldn’t predict churn, like at all. So I had to fiddle around to find a combination that did something useful. Also, I wanted you to see it run for much longer than the other algorithms, so I added three layers. Deal with it.
from sklearn.neural_network import MLPClassifier
net = MLPClassifier(solver='lbfgs', alpha=0.1, hidden_layer_sizes=(200, 10, 3), random_state=1)
Extra credit: XGBoost (fancy Decision Tree). If/when you get an error with XGBoost, it’s because it isn’t included in a standard Python installation and has to be installed separately. You may want to install it though, because it’s a great all-rounder for many types of learning tasks. Yes, often better than neural networks, with less waiting and fussing around with hyperparameters to make it work. Also, it’s easier to understand why it works.
import xgboost as xgb
boost = xgb.XGBClassifier()
NOTE: To figure out what these are, how they work, and why, I refer you to my previous post on the topic.
Training: There are a few different ways to do learning besides the basic case above of using all data at once, usually depending on how much data you have, how fast your computer is, and whether this is a one-time operation or you need to add new training data in the future. You can look up further examples for mini-batch or online learning if needed.
The base case is incredibly simple. It almost couldn’t be easier, once you’ve done all this prep work. It should take like 1 second with a basic computer.
linear.fit(X_train, Y_train)
vector.fit(X_train, Y_train)
tree.fit(X_train, Y_train)
#boost.fit(X_train, Y_train)  # remove the leading # if you installed XGBoost
Now for comparison let’s train the neural net. This should take several times longer, but certainly no longer than a minute. You can tell it’s still running if there’s a star/asterisk next to your line of code, which turns into a number when it’s done.
net.fit(X_train, Y_train)
Prediction: To actually use the model you’ve just trained, you need to predict something. Again, if you’re using Regression it’ll be a number. For Classification, you either get a label, or the probability for each label.
Predict a single output for each row of inputs, for example, row 2 of our dataset:
# myData.loc[[2]] selects the row labeled 2 as a one-row DataFrame
print("Linear:", linear.predict(myData.loc[[2]]))
print("SVM:", vector.predict(myData.loc[[2]]))
print("Decision Tree:", tree.predict(myData.loc[[2]]))
print("Neural Net:", net.predict(myData.loc[[2]]))
Client number two will not leave us, then!
Predict label probabilities for a classification problem, by manually entering inputs. NOTE: This command doesn’t work for all types of algorithms. Let’s try it on the neural net for example.
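A sketch of what that can look like, assuming synthetic stand-in data and a smaller network than the one above so it trains quickly. The call that matters is `predict_proba`, which Scikit-Learn's MLPClassifier provides and LinearSVC does not.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the telco features and churn labels
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=1)
demo_net.fit(X_demo, y_demo)

row = X_demo[:1]                     # one row of inputs, kept 2-D: shape (1, 10)
proba = demo_net.predict_proba(row)  # one column per class, in the order of classes_
print(demo_net.classes_, proba)

# Not every algorithm can do this: LinearSVC has no predict_proba method
print(hasattr(LinearSVC(), "predict_proba"))
```

In the notebook, the equivalent call would be `net.predict_proba(...)` with a row of your own feature values.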
It’s interesting to think how it ended up with such specific probabilities for False and True, isn’t it?
Step 5: Evaluation
At this point, it feels like you’re done. Technically you now have a solution. But you need to find out if it’s any good. The typical judge of that is called accuracy, which just means how many of the samples in your test dataset it got right.
Accuracy: To begin with, this is really the gold standard of measuring if your algorithm works. If it gets the right result often enough to solve your problem, you’re good to proceed at least. There are a lot of exceptions to this, of course, chief among them how well your training data represents real-life data the algorithm will see in the future. Often this means training is not a one-and-done type of deal, but something you revisit if the accuracy with real data starts dropping dramatically.
First, you should check how it does on the training data, meaning whether it learned to predict the exact same samples it already saw before. Since we had several different algorithms to try, you can compare their results to see which works best. If you get a low number, it means the algorithm didn’t learn anything, so you should go back and revisit the data.
NOTE: Here we used # to “comment” out XGBoost in case you didn’t install it separately. If you did install it, just remove the # and it will be included in your results!
print("Linear:", linear.score(X_train, Y_train))
print("SVM:", vector.score(X_train, Y_train))
print("Decision Tree:", tree.score(X_train, Y_train))
#print("XGBoost:", boost.score(X_train, Y_train))
print("Neural Net:", net.score(X_train, Y_train))
Woah, look at you Decision Tree with the 100% score! Pretty good, but then it kind of already knew the answer since it had seen it in training. The others are pretty decent, too. This is why we put aside some secret data earlier, to show the algos some new data!
NOTE: Your numbers won’t match exactly! Why? Because we split the training and test data randomly among the dataset, so results will differ. If we had a million samples, it would probably average out.
Measure accuracy on the testing set. This is what really matters:
print("Linear:", linear.score(x_test, y_test))
print("SVM:", vector.score(x_test, y_test))
print("Decision Tree:", tree.score(x_test, y_test))
#print("XGBoost:", boost.score(x_test, y_test))
print("Neural Net:", net.score(x_test, y_test))
Still pretty even Steven here. Nobody totally bombed, but the tree looks the best so far.
There are a bunch of other metrics that you’ll have to read about to understand fully. What we’re doing is examining more closely how the algorithms perform on both cases, trying to predict False and True for Churn. This is the decider, then. For the sake of screen space, we’ll just focus on the top two: Decision Tree and Neural Net. You can try the others too if you want!
from sklearn.metrics import classification_report
y_tree = tree.predict(x_test)
y_net = net.predict(x_test)
print("DT", classification_report(y_test, y_tree))
print("------------------------------------------------------")
print("NN", classification_report(y_test, y_net))
There were a total of 834 customers in the test dataset we used, of which only 120 were real churns. Again due to the random splitting, you may have slightly different numbers.
So who wins? Well, it looks like the neural net (bottom numbers) is struggling with false positives (Precision) and false negatives (Recall). That’s not ideal at all, since the True case is what we really care about, i.e. customers who do churn.
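If precision and recall feel abstract, it can help to count the four outcomes by hand. The sketch below uses ten invented predictions (1 = churn, 0 = stays), not the results from this dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions; 1 = churn, 0 = stays
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For 0/1 labels the flattened matrix is ordered: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp): how many alarms were real
recall = recall_score(y_true, y_pred)        # tp / (tp + fn): how many churners were caught
print(tn, fp, fn, tp, precision, recall)
```

Here one false alarm costs precision, and two missed churners cost recall.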
So Decision Tree takes it, in this case? Yes, definitely. Is it perfect? No, nothing is. The ultimate answer depends on the type of problem you’re solving, and what the risk of false positives/negatives is. If you’re predicting cancer or something, it’s pretty important!
NOTE: Is there in existence a combination of parameters that would make the neural network win? Probably. Should we have scaled the dataset to make it easier for the neural network? Arguable. I tried, and it started overfitting like crazy. So if you’re religious about neural networks, please waste time on finding the magic formula. If not, try something else like XGBoost and enjoy your free time.
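If you do want to run the scaling experiment yourself, a pipeline is one tidy way to set it up. This is only a sketch on synthetic data, with a network size picked arbitrarily for speed. It makes no claim about how scaling would turn out on the telco dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

scaled_net = make_pipeline(
    StandardScaler(),  # rescale each feature to zero mean and unit variance
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=1),
)
scaled_net.fit(X_tr, y_tr)  # the scaler is fitted on the training rows only

train_score = scaled_net.score(X_tr, y_tr)
test_score = scaled_net.score(X_te, y_te)
print(train_score, test_score)  # a big gap between the two hints at overfitting
```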
BONUS: If you did install XGBoost, not only can you try to beat the Decision Tree (it will), but you can also do something nifty called Feature Importance. It can self-analyze which features contributed most to the prediction performance.
xgb.plot_importance(boost, max_num_features=50, height=0.8)
Pretty cool, right? You can see our hypothesis in there, with customer service calls and international plan, but interestingly total minutes takes the cake in terms of predicting churn! The machine is smarter after all…
Repeat until satisfied
At any step above, you may realize you’ve done something wrong and it just won’t work. Most often, this involves the data itself. Having good, clean data to work with will make all other steps so much easier.
Perhaps you’re worried about the number of false negatives and want to improve it. Maybe it’s the distribution of the dataset. Perhaps you could benchmark different algorithms. Perhaps there is skew or bias in the test or training dataset. Perhaps you should try more encoding, or scaling the dataset. Perhaps delete more features. This is the job of the data scientist!
Congratulations, you’re now well on your way to creating your first Machine Learning program. There are of course further considerations for saving and exporting your model to run in an actual application or server. You can easily search online to explore these topics further, with plenty of tutorials and free online courses available.
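As a starting point for the saving part, here is one common approach: writing a trained Scikit-Learn model to disk with joblib and loading it back. The file name and the synthetic data are assumptions for illustration, and this is a sketch rather than a full deployment recipe.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and a quick model to save
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Hypothetical file name, written to a temporary folder
path = os.path.join(tempfile.mkdtemp(), "churn_model.joblib")
joblib.dump(model, path)       # save the trained model to disk
restored = joblib.load(path)   # load it back, e.g. inside another program

same = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same)
```

Only load model files you trust, since loading one can run arbitrary code.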
If you’re totally lost at this point, having no idea how and why you ended up here, then you can read this for more context and then retry: