Machine Learning: Understanding the Nomenclature

I have been working on Machine Learning algorithms and APIs for the past several months. Several terms didn’t come naturally to me when I started in this domain. In this blog post, I will try to explain some of them, in the hope that this helps people starting out in this exciting area.

What is Machine Learning
A lot of people have given their thoughts on the definition of Machine Learning. Of these, I would like to quote a couple of my favorites here:

Field of study that gives computers the ability to learn without being explicitly programmed — Arthur Samuel

A more mathematical and widely quoted one is:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. — Tom M. Mitchell

This is best explained with an example. Let’s take one of the most popular use-cases of Machine Learning: the Spam Filter. When we open our Inbox, we choose to mark some of the mails as Spam. Over a period of time, the program behind the Spam Filter learns what type of emails we mark as Spam and stops similar mails from even reaching our Inbox. Behind the scenes, none of this is manually programmed. Instead, algorithms learn the behavior of different customers over time and filter Spam accordingly.

In terms of the second definition, Task T here is classifying an email as Spam or not. Experience E is the record of user behavior, that is, which mails the user marks as Spam. Finally, Performance P is how well the filter classifies a given mail message as Spam or not. In general, the performance or accuracy of a Machine Learning program is expected to improve as it gains experience.
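To make T, E and P concrete, here is a minimal sketch of a toy Spam Filter. It is only an illustration, not how a real filter works: the rule (flag any mail containing a word seen only in Spam-marked mails), the function names and the sample mails are all my own assumptions.

```python
# Toy spam filter illustrating Task T, Experience E and Performance P.
# The learning rule and all sample mails are hypothetical.

def train(experience):
    """Learn from Experience E: collect words seen only in spam-marked mails."""
    spam_words, normal_words = set(), set()
    for text, is_spam in experience:
        words = set(text.lower().split())
        (spam_words if is_spam else normal_words).update(words)
    return spam_words - normal_words


def classify(spam_only_words, text):
    """Task T: decide whether a mail is spam or not."""
    return any(word in spam_only_words for word in text.lower().split())


def accuracy(spam_only_words, labelled_mails):
    """Performance P: fraction of mails classified correctly."""
    correct = sum(classify(spam_only_words, text) == is_spam
                  for text, is_spam in labelled_mails)
    return correct / len(labelled_mails)


# Experience E: mails the user has already marked (True means spam).
experience = [
    ("win a free prize now", True),
    ("meeting agenda for monday", False),
    ("cheap loans approved instantly", True),
    ("lunch on friday", False),
]

# Mails the filter has not seen before, used to measure P.
test_mails = [
    ("free prize inside", True),
    ("cheap loans today", True),
    ("agenda for friday lunch", False),
    ("monday meeting moved", False),
]

model_small = train(experience[:2])   # less experience
model_full = train(experience)        # more experience

print(accuracy(model_small, test_mails))  # 0.75
print(accuracy(model_full, test_mails))   # 1.0
```

On this tiny hand-picked sample, the filter trained on more marked mails scores higher. Real data is noisier, so the improvement is a tendency rather than a guarantee.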

What types of Machine Learning algorithms are currently used?

Supervised Machine Learning: We teach the computer how to do something by providing the algorithm with a set of right answers for different types of questions. It then learns from these and tries to return correct answers when similar, but not identical, questions are asked in the future.

This list of question-answer pairs is called the Training Data-Set. The Data-Set consists of multiple Training Examples, or rows. Each row is a pair of an Input Object (the question) and the corresponding Output value (the answer). The Input Object is generally a vector of attributes (the question details), and in theory the number of attributes is unbounded. Each element of the Input Object is called a Feature. The output field is the desired answer.

Again, let’s use an example to understand these terms more clearly. Suppose we want to create a system which gives users an estimated price they might get for their current car. For a Machine Learning algorithm to work, we need to provide a set of data on which the algorithm can base its learning and predictions. This initial data can contain a number of information points about the vehicle:

Price Sold, Make, Model, Year of Manufacture, Distance driven, Num Services, Num Accidents
140000, Suzuki, Alto, 2006, 37000, 8, 3

In the above example, Price Sold (140000) is the output, or answer. The remaining information forms the Input Object, and each detail of the car is a Feature. Everything together forms a row, and multiple rows together constitute a Training Data-Set. In every row we state that, for a given set of features, a car was sold at this price. This is what we mean by right answers. Based on this data, the algorithm makes inferences, and when we provide a different set of features, it can predict an approximate sale price for that car. An example question can be:

Suzuki, Alto, 2010, 15000, 8, 3

And a typical response could be:

164500
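The terminology above can be sketched as a small data structure. This is just one possible representation: the class name, the field names and the use of a dictionary for the features are my own assumptions. The only values used are the ones from the example row and the example question.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    """One row of the Training Data-Set."""
    features: dict   # Input Object: one entry per Feature
    price_sold: int  # Output value: the "right answer"


# Training Data-Set: in practice this would hold many rows.
training_set = [
    TrainingExample(
        features={
            "make": "Suzuki",
            "model": "Alto",
            "year": 2006,
            "distance_driven": 37000,
            "num_services": 8,
            "num_accidents": 3,
        },
        price_sold=140000,
    ),
]

# A question has the same Features as a training row but no answer yet.
question = {
    "make": "Suzuki",
    "model": "Alto",
    "year": 2010,
    "distance_driven": 15000,
    "num_services": 8,
    "num_accidents": 3,
}

print(sorted(question) == sorted(training_set[0].features))  # True
```

Keeping the output value separate from the features makes it clear which part of a row is the question and which part is the answer.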

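Supervised Learning should not be confused with Reinforcement Learning, which is a separate type of learning where a program learns from rewards rather than from a list of right answers. Supervised Learning can further be classified into two types: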

Regression: Also referred to as continuous-valued output. Here the results are numeric values on a continuous scale. The above example of predicting the approximate value of a used car is a Regression problem. Other examples include predicting housing prices, sales figures and population figures.
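As a minimal sketch of Regression, the snippet below fits a straight line through a few points using ordinary least squares and uses it to predict a continuous value. It uses a single feature only, and the numbers are hypothetical, chosen to lie exactly on a line so the result is easy to check by hand. The function names are my own.

```python
# Simple linear regression with one feature, using ordinary least squares.
# All data below is hypothetical and chosen to lie exactly on a line.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict a continuous output value for a new input."""
    return slope * x + intercept


# Feature: distance driven (in thousands of km). Output: price sold.
distance_driven = [10, 20, 30, 40]
price_sold = [180000, 160000, 140000, 120000]

slope, intercept = fit_line(distance_driven, price_sold)
print(slope, intercept)               # -2000.0 200000.0
print(predict(slope, intercept, 25))  # 150000.0
```

The prediction for 25 is a value that never appeared in the training data, which is what makes the output continuous rather than picked from a fixed list.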

Classification: Also referred to as discrete-valued output. Results in this case take one of a fixed set of values. Examples include choosing items to recommend to a visitor on an e-commerce website, classifying an email message as Spam or not, and predicting, based on different characteristics, whether a patient is likely to suffer from a given disease.
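As a minimal sketch of Classification, the snippet below uses a nearest-neighbour rule: a new input gets the label of the most similar training example. This is just one simple technique among many. The two features (number of links and number of exclamation marks in a mail) and all the data are hypothetical.

```python
# 1-nearest-neighbour classifier: a new input gets the label of the
# closest training example. Features and data are hypothetical.

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_neighbour(training_set, query):
    """Return the label of the training example closest to the query."""
    _, label = min(training_set,
                   key=lambda row: squared_distance(row[0], query))
    return label


# Each row: ((number of links, number of exclamation marks), label)
training_set = [
    ((0, 0), "not spam"),
    ((1, 0), "not spam"),
    ((7, 5), "spam"),
    ((9, 8), "spam"),
]

print(nearest_neighbour(training_set, (8, 6)))  # spam
print(nearest_neighbour(training_set, (1, 1)))  # not spam
```

Whatever the input, the answer is always one of the labels present in the training data, which is what makes the output discrete.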

Unsupervised Machine Learning: As opposed to Supervised Learning, we provide the computer with data that carries no right answers and let it learn by itself. In other words, we let the underlying algorithm find some structure in the provided data-set. One of the most widely used families of algorithms in Unsupervised Learning is Clustering. Some typical examples of Unsupervised Learning are:

  • Google News
  • Market Segmentation
  • Social Network Analysis like Finding Facebook friends
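To illustrate Clustering, here is a minimal sketch of the k-means algorithm on one-dimensional data, in the spirit of the Market Segmentation example. The data (a spend figure per customer), the starting centroids and the function name are hypothetical. Note that no labels or right answers are provided anywhere.

```python
# k-means clustering on one-dimensional data. No labels are given;
# the algorithm groups the points purely by how close they are.
# Data and starting centroids are hypothetical.

def k_means_1d(points, centroids, max_iterations=100):
    """Return (centroids, clusters) after k-means converges."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical spend per customer, with two visibly separate groups.
spend = [1, 2, 3, 10, 11, 12]

centroids, clusters = k_means_1d(spend, centroids=[1, 2])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```

The algorithm discovers the two segments on its own. The number of clusters and the starting centroids are choices we make up front, and different choices can lead to different groupings.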

