Hadoop through Examples : Finding User Access Info from Logs

In most of Machine Learning problems we have to deal with large data-sets. This generally requires some kind of Distributed Computing platform. One of the leaders in this space is Hadoop. In brief, we can consider Hadoop as an ecosystem providing us a layer of abstraction to easily and reliably perform large-scale distributed computing. Though Hadoop infrastructure consists of several components, I would like to introduce you to couple of its core components: HDFS and MapReduce and how we can solve complex problems in an elegant manner.

Problem Context We have a social web application accessed by large number of users. We need to determine number of users and numbers of times these users accessed the system within a given time period. We have information about any user login along with the time of their access in our log files. For easy understanding, let’s take this sample log file :

user1 123
user2 123
user3 123
user4 124
user5 124
user6 125
user7 125
user8 125
....

Every line in the log file represents a successful login by a given user along with System time in milliseconds. It’s not really important, how we may be storing the information in our log files and we will see a bit later, that it’s just a matter of few lines of code changes to adjust to different formats.

Expected Output We would be providing startTime and endTime between which we want to have desired information. An expected output would be something like:

user4	2
user6	2
user7	1
user8	1
user9	1
....

Note : All the steps and code mentioned below have been tested to work on Hadoop 1.0.3 (the latest stable release) at the time of writing the blog. Changes may be required while working with other versions
Continue reading

Understanding Internals of Linear Regression

As I talked in one of my previous blogs, Regression is a type of Supervised Machine Learning where given a set of Input features, underlying algorithm returns a continuous valued output. Let’s take an example to understand it better:

In the above data-set, we are providing number of Independent features like house size, number of bedrooms, number of floors, number of bathrooms and age of the house. On the right most column is the actual price, the house was sold. This is generally called target or output. When a similar data-set is provided to an appropriate learning algorithm and after the learning phase is complete, we would expect the algorithm to predict the expected selling price of a house with similar features. One of a typical requirements is mentioned in the last row of the above data-set. As you can imagine, depending upon the values of input features like house size, bedrooms, age etc, the price of the house can be anywhere between 100K USD to 1 Million USD. That’s the reason of it being also called a Continuous Valued output.

Some of the common use-cases of Regression can be predicting population of a city, Sensex Index, Stock Inventory and many more. In-Fact Regression is one of the earliest forms of Statistical analysis and also one of the most widely used Machine Learning techniques.

In this blog, we will try to implement the simplest form of Regression referred as Linear Regression. For any Machine Learning algorithm, a typical work-flow looks like :

We provide Training Data-Set to a given Algorithm. It uses a Hypothesis which is essentially a mathematical formula to derive inferences about the data-set. Once the hypothesis has been formulated, we say that Algorithm has been trained. When we provide to algorithm a set of Input features, it return us an Output based upon its hypothesis. In the above scenario, we would provide details about House Size, Number of Bedrooms, Number of Floors, Number of Bathrooms and Age to the algorithm it in-turn would respond us with an expected price the house can fetch in the market.

Let’s see what can be the hypothesis / formula for our scenario. For easier understanding, let’s simplify the context a bit and think about having only house size as a single input feature against which we need to predict house prices. If we plot both these variables on a graph, it would look something like:

If we can draw a straight line across the points, then we should be able to predict the values of houses when different sizes are provided as inputs. Let’s take a look at different lines we can draw on the above graph:

The line which touches or is quite close to as many points as possible on the graph is most likely to provide us the best predictions while the line which touches least or is far from a majority of points is likely to provide us worst predictions. In our case line h1 appears to be the best possible scenario. The ability to draw such a straight line across a given data-set is in-essence the goal of any Linear Regression Algorithm. As we are attempting to draw a straight line, we call it a Linear Regression solution. In Mathematical terms, we can formulate our hypothesis as follows:

h_{\theta}x = \theta_{0} + \theta_{1}x

where \theta_{0} is the initial constant value we choose and \theta_{1} is the value we will multiple by a given house size. The goal of our hypothesis is to find values of \theta_{0} and \theta_{1} in such a way that difference between actual values provided in our training data-set and predicted values is minimum. Again if we have to represent it in mathematical terms, the formula would be :

J(\theta_{0}, \theta_{1}) = \dfrac {1}{2m} \sum \limits_{i=1}^{m} (h_{\theta} (x^{(i)}) - y^{(i)})^2

where \theta_{0} and \theta_{1} are the chosen parameters, m is the number of rows in our training data-set and h_{\theta} (x^{(i)} is the predicted value for ith element in our training data-set and y^{(i)} is the actual target value of ith element in our training data-set. In Machine Learning terminology, this function is popularly referred as Cost Function or Squared Error Function as well. Let’s take couple of examples to understand the functioning better:
Continue reading

Indic Threads Conference comes to Delhi

IndicThreads have been organizing Independent Technology conferences in Pune since 2006. They have conducted several very successful and high quality conferences in the areas of Java, Cloud Computing, Mobile Development and others. A complete listing of their conferences is available here.

I have spoken in IndicThreads conferences on couple of occasions in 2009 and 2010. After having thoroughly enjoyed the experience, I talked several times with Harshad Oak and Sangeeta Oak, people behind Indic Threads about the need of such a conference in Delhi area as well. At the start of 2012, the ideas finally materialized and they both decided to plan an Indic Threads conference in Gurgaon. With this followed several discussions about the suitable venue, partners, spreading the word, speakers and all the logistics to have an impact-ful Inaugural session of the conference.

With previous history of Indic Threads and our local contacts, an impressive list of Speakers covering a wide array of topics was managed. We were able to find an appropriate venue in the form of Fortune Select Hotel in Gurgaon.

Finally everything came together on Friday, 13th July when the conference started at 9:00 AM with more than 70 participants joining in. Proceedings were kick-started by an Inaugural address from Sanket Atal, CTO of MakeMyTrip.com about how to become 10x SoftwareEngineer. Considering it being a Technology conference for and by Software Developers, I felt it to be very apt way of getting started. Next 2 days were full of interesting talks on wide area of domains we as Software developers are expected to work on. This included JavaEE7 platform, NoSQL Databases, Mobile Applications, Scala, Hadoop and Node.js.

I also presented my experiences on Machine Learning. The slides for my talk are embedded below:

Continue reading

Linear Algebra Cheat Sheet for Machine Learning

Familiarity with Linear Algebra helps quite a bit while working on Machine Learning algorithms. As Software Programmers, we don’t get to do lot of serious Mathematics on regular basis tend to forget the principles. I would like to list some of the concepts here.

Matrices & Vectors

Starting with basics, A Matrix is a rectangular array of elements. Typical example is:

A Matrix is denoted by Number of Rows * Number of Columns. The above Matrix is normally called a 3 * 2 Matrix.
Similarly Vector is another often used Entity in Machine Learning algorithms. It’s special type of Matrix in the sense it always has only one column. It also called a n * 1 Matrix. A typical example of Vector would be:

Vector

Entries of Matrix or Vectors are denoted by A_{ij} where i is number of row and j is number of column. So in examples above, A_{32} (Matrix) will be 4 and a_{31} (Vector) will be 5.

Matrix Addition

We can perform addition operations over 2 matrices. In this, we add element at position i * j in Matrix M1 with element at the same position in Matrix M2. Let’s take an example :

We can perform subtraction in similar manner. One point to mention here is that we can perform these operations only on matrices of the same dimensions.

Scalar Multiplication & Division

We can also perform multiplication and divisions of Matrices and Vectors with real numbers as well. As you might have guessed, we take every number in the matrix and perform the corresponding operation with the provided number. Again an example will help:

Addition and subtraction can also be performed similarly.

Matrix-Vector Multiplication

Things get a bit interesting here. Firstly we can only perform multiplications between a Matrix and Vector when number of columns of Matrix equals number of rows of Vector. In mathematical terms m * n-dimensional Matrix can be multiplied with only n-dimensional Vector. The result of any Matrix – Vector multiplication is always a Vector. Let’s take help of an example to see how this is done.

As you can see, we are multiplying a 3 * 2 matrix with a 2 dimensional Vector and we get a 3 dimensional Vector in result.

Matrix-Matrix Multiplication

A Matrix-Matrix multiplication can be considered as an extended case of Matrix-Vector multiplication:
Continue reading

Machine Learning: Understanding the Nomenclature

I have been working on Machine Learning Algorithms and APIs since past several months. There are several terms which didn’t come quite naturally to me when I started on this domain. In this blog post, I will try to explain some of them and hope this might help people starting on this exciting area.

What is Machine Learning
Lot of people have given their thoughts on the definition of Machine Learning. Out of them, I would like to quote couple of my favorites here :

Field of Study that gives computers the ability to learn without being explicitly programmed — Arhtur Samuel

A more mathematical and widely quoted one is:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. — Tom M. Mitchell

It would be good to explain by an example. Let’s take one of very popular use-cases of Machine Learning: Spam Filter. When we open our Inbox, we choose to mark some of the mails as Spam. Over a period of time, program behind Spam Filter learns about what type of emails we mark as spam and stops similar type of mails even reaching our Inbox. Behind the scenes, all these things are not manually programmed but algorithms learn behaviors of different customers over a period of time and implement Spam filtering accordingly.

In terms of 2nd definition, Task T here would be classifying an email as Spam or not. Experience E would be recording user behavior in terms of which mails (s)he classifies as Spam and finally Performance P would be how good or bad filter is in correctly classifying a given mail message as Spam or not. In general, performance or accuracy of a Machine Learning program increases with experience or time.

What are the types of Machine Learning algorithms currently used:
Continue reading