**Problem Context** We have a social web application accessed by a large number of users. We need to determine the number of users, and the number of times these users accessed the system, within a given time period. We have information about every user login, along with the time of access, in our log files. For easy understanding, let's take this sample log file:

```
user1 123
user2 123
user3 123
user4 124
user5 124
user6 125
user7 125
user8 125
....
```

Every line in the log file represents a successful login by a given user, along with the System time in milliseconds. It's not really important how we store this information in our log files; as we will see a bit later, adjusting to a different format is just a matter of a few lines of code changes.

**Expected Output** We will provide a startTime and an endTime between which we want the desired information. The expected output would be something like:

```
user4 2
user6 2
user7 1
user8 1
user9 1
....
```

**Note: All the steps and code mentioned below have been tested to work on Hadoop 1.0.3, the latest stable release at the time of writing. Changes may be required while working with other versions.**

**Step1** Download and Install Hadoop. You can find complete information about the same here

**Step2** Verify Installation and run a simple wordcount example by following instructions available here

**Step3** Copy your log files to Hadoop HDFS file system. Let’s say you have files accesslog1 and accesslog2 in your current working directory which you want processed.

```shell
hadoop fs -ls
# Lists the current directory tree of your HDFS file system
# -rw-r--r--   1 nkumar supergroup   29 2012-07-19 17:43 /user/nkumar/README
# drwxr-xr-x   - nkumar supergroup    0 2012-07-20 10:48 /user/nkumar/id.out
# drwxr-xr-x   - nkumar supergroup    0 2012-07-19 13:46 /user/nkumar/input
# -rw-r--r--   1 nkumar supergroup   10 2012-08-09 09:53 /user/nkumar/test.txt
# drwxr-xr-x   - nkumar supergroup    0 2012-07-19 17:42 /user/nkumar/test2

hadoop fs -mkdir /user/nkumar/mapreduce
# Creates a directory under the top level folder of the HDFS file system.
# /user/nkumar will need to be changed according to your directory structure

hadoop fs -mkdir /user/nkumar/mapreduce/useraccesslog
# Creates a folder for storing information related to this program

hadoop fs -mkdir /user/nkumar/mapreduce/useraccesslog/input
# Creates a folder for storing your accesslog files to be processed later

hadoop fs -put accesslog1 /user/nkumar/mapreduce/useraccesslog/input/
# Copies file accesslog1 from the current working directory to HDFS at the specified directory

hadoop fs -put accesslog2 /user/nkumar/mapreduce/useraccesslog/input/
# Copies file accesslog2 from the current working directory to HDFS at the specified directory

hadoop fs -ls /user/nkumar/mapreduce/useraccesslog/input
# Should look something like below
# -rw-r--r--   1 nkumar supergroup  233 2012-08-01 19:53 /user/nkumar/mapreduce/useraccesslog/input/accesslog1
# -rw-r--r--   1 nkumar supergroup  100 2012-08-01 19:53 /user/nkumar/mapreduce/useraccesslog/input/accesslog2
```

**Step4: Implement Mapper** As you might be aware, the MapReduce paradigm consists of breaking down the problem into 2 broad steps: Map and Reduce. As this blog is not an explanation of MapReduce, I will just share how the different components are implemented here. For a detailed introduction to MapReduce, the best place to get started would be the tutorial here

```java
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        // Each line is "<username> <access time in millis>", separated by whitespace
        String[] inputArray = line.split("\\s+");
        word.set(inputArray[0]);
        output.collect(word, new LongWritable(Long.parseLong(inputArray[1])));
    }
}
```

The Mapper is the first step of any MapReduce program. It gets the contents to work upon from Hadoop and emits key-value pairs as output. The jobs of file distribution, reliability and fail-over are the responsibility of the platform. This is the real power of Hadoop, as it lets us concentrate on implementing the business logic without worrying about the underlying details of distribution and scalability. In our case, the Mapper gets the contents of the uploaded files one line at a time. As you might recall, each line has a user name and his/her access time separated by white space. We split the contents of the line on white space and provide them to the OutputCollector in the form of key-value pairs. So a typical output of our Map program would be:

```
user1 123
user2 123
user18 127
....
```

**Step5: Implement Reducer** The Reducer receives the output of one or multiple Map tasks as input and emits key-value pairs as its output. Here is our implementation of the Reducer:

```java
public static class Reduce extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

    public static long startTime;
    public static long endTime;

    @Override
    public void configure(JobConf jobConf) {
        // startTime and endTime are passed in through the job configuration
        startTime = Long.parseLong(jobConf.get("startTime"));
        endTime = Long.parseLong(jobConf.get("endTime"));
    }

    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            long userLoggedInTime = values.next().get();
            // Count only the logins that fall within the requested time window
            if (userLoggedInTime >= startTime && userLoggedInTime <= endTime) {
                ++sum;
            }
        }
        if (sum > 0) {
            output.collect(key, new LongWritable(sum));
        }
    }
}
```

For a given key, which in our case is the username, we can have one or more values, i.e. the time-stamps at which the user accessed the system. As the user-access information can be spread across multiple files and thus multiple map outputs, it's again the responsibility of the Hadoop platform to combine this information together and hand it to the Reducer. Imagine user1 has accessed the system 3 times, at 123, 126 and 127. The Reducer will receive the following:

user1 123, 126, 127

As you can see from the code snippet above, we reject the values which are less than the *startTime* or greater than the *endTime*. *startTime* and *endTime* are parameters passed to the program, which the Reducer retrieves through the configure method. So a typical output of our Reducer would be:

```
user1 2
user3 1
user5 7
...
```

As mentioned earlier, depending upon the way information is stored in our log files, we can easily change the implementation of our Mapper to parse the incoming values differently, e.g. a formatted Date instead of System time in milliseconds, and the entire application will behave accordingly.
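If, for instance, the log stored a formatted date such as `user1 2012-08-01 19:53:21`, the Mapper could normalize it back to epoch milliseconds with a small helper. The class name and the date pattern below are illustrative assumptions, not part of the original program:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class AccessTimeParser {

    // Hypothetical log format: "user1 2012-08-01 19:53:21".
    // The pattern is an assumption for illustration; the original program
    // stores epoch milliseconds directly.
    private static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    // Parses one log line and normalizes its timestamp to epoch milliseconds,
    // so the Reducer's startTime/endTime comparison can stay unchanged.
    public static long parseAccessTime(String line) {
        String[] parts = line.split("\\s+", 2); // [username, timestamp]
        try {
            return new SimpleDateFormat(PATTERN).parse(parts[1]).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable log line: " + line, e);
        }
    }
}
```

Only the parsing step changes; everything downstream keeps working on plain `long` timestamps.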

**Step6: Job Configuration** Everything in the MapReduce paradigm is defined in terms of Jobs and Tasks. A job can consist of one or multiple Map/Reduce tasks. We need to configure a job appropriately by passing relevant parameters. Typical configuration parameters of a Job are the Map/Reduce classes which will execute the tasks, the Input/Output Formats these classes will work upon, and the key-value pair types they will produce. Along with this, we can pass parameters to our Mappers/Reducers through the JobConf. In our case, we provide startTime and endTime, which get used in our Reducer implementation. The code snippet of our Job Configuration is:

```java
public static void main(String[] args) throws IOException {
    if (args.length != 4) {
        System.err.println("startTime and endTime parameters must be specified");
        System.exit(1);
    }

    JobConf conf = new JobConf(UserAccessCount.class);
    conf.setJobName("useraccesscount");

    // Pass startTime and endTime through the job configuration;
    // the Reducer reads them back in its configure() method
    conf.set("startTime", args[2]);
    conf.set("endTime", args[3]);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
```

**Step7: Execution and Getting Results** We now need to compile all these components together and execute the program on our Hadoop environment. After successful execution, you will get the results in a text file, again stored in the HDFS file-system. It would look something like:

```
user1 2
user10 1
user11 1
user12 1
user13 1
user14 1
user15 2
...
```

The entire code, along with sample user access log files and a README containing instructions for executing the program, is provided at the GitHub Repo here. Feel free to check it out and let me know your comments or if you face any issues.

**Conclusion** Hadoop is a very powerful platform which allows us to solve problems involving large data in a simple and elegant manner. Understanding its different components and concepts does present a steep learning curve for newcomers to Distributed Computing, but it's a powerful tool when dealing with several Machine Learning problems. There is another very useful library, Apache Mahout, which has been designed with Hadoop and distributed computing in mind. It brings together different Machine Learning algorithms and leverages Hadoop's distributed computing capabilities.


In the above data-set, we provide a number of independent features: house size, number of bedrooms, number of floors, number of bathrooms and age of the house. The rightmost column is the actual price the house was sold for. This is generally called the target or output. When a similar data-set is provided to an appropriate learning algorithm, then after the learning phase is complete we would expect the algorithm to predict the expected selling price of a house with similar features. One such typical requirement is mentioned in the last row of the above data-set. As you can imagine, depending upon the values of input features like house size, bedrooms, age etc., the price of the house can be anywhere between 100K USD and 1 Million USD. That's the reason it is also called a Continuous Valued output.

Some common use-cases of Regression are predicting the population of a city, a stock index such as the Sensex, stock inventory, and many more. In fact, Regression is one of the earliest forms of statistical analysis and also one of the most widely used Machine Learning techniques.

In this blog, we will implement the simplest form of Regression, referred to as Linear Regression. For any Machine Learning algorithm, a typical work-flow looks like this:

We provide a Training Data-Set to a given Algorithm. The algorithm uses a Hypothesis, which is essentially a mathematical formula, to derive inferences about the data-set. Once the hypothesis has been formulated, we say that the Algorithm has been trained. When we then provide the algorithm a set of input features, it returns an output based upon its hypothesis. In the above scenario, we would provide details about House Size, Number of Bedrooms, Number of Floors, Number of Bathrooms and Age to the algorithm, and it in turn would respond with the expected price the house can fetch in the market.

Let's see what the hypothesis / formula for our scenario can be. For easier understanding, let's simplify the context a bit and consider only the house size as a single input feature against which we need to predict house prices. We can plot both these variables on a graph, with house size on one axis and price on the other, and look for a pattern in the points.

If we can draw a straight line across the points, then we should be able to predict the values of houses when different sizes are provided as inputs. Let's take a look at the different lines we could draw through such a plot:

The line which touches, or is quite close to, as many points as possible on the graph is most likely to provide us the best predictions, while a line which touches few points or is far from the majority of points is likely to provide the worst predictions. In our case, **line h1** appears to be the best possible scenario. The ability to draw such a straight line across a given data-set is, in essence, the goal of any Linear Regression algorithm. As we are attempting to draw a straight line, we call it a Linear Regression solution. In mathematical terms, we can formulate our hypothesis as follows:
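Written out in standard notation, with x denoting the house size and h(x) the predicted price, the hypothesis is the equation of a straight line:

```latex
h_\theta(x) = \theta_0 + \theta_1 x
```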

where θ₀ is the initial constant value we choose and θ₁ is the value we multiply by a given house size. The goal of our hypothesis is to find values of θ₀ and θ₁ such that the difference between the actual values provided in our training data-set and the predicted values is minimal. Again, if we have to represent it in mathematical terms, the formula would be:
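With m training examples, the squared-error criterion written out is:

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
```

(The conventional 1/2m factor simply scales the average squared difference and does not change where the minimum lies.)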

where θ₀ and θ₁ are the chosen parameters, m is the number of rows in our training data-set, h_θ(x⁽ⁱ⁾) is the predicted value for the ith element of our training data-set and y⁽ⁱ⁾ is its actual target value. In Machine Learning terminology, this function is popularly referred to as the Cost Function or the Squared Error Function. Let's take a couple of examples to understand its functioning better:

For one choice of θ₀ and θ₁, we compute the value of J(θ₀, θ₁) over our training data-set:

For another choice of θ₀ and θ₁, the value of J(θ₀, θ₁) would be:

As the goal is to minimize J(θ₀, θ₁), the choice producing the smaller value is the better one. But this method of hit and trial is quite cumbersome and simply impossible while dealing with large data-sets and multiple input features. We would ideally like an automated method to figure out θ₀ and θ₁ for us. One of the most common and simplest algorithms for achieving this goal is called Gradient Descent. There are multiple variants of the Gradient Descent algorithm, but for the purpose of this blog we will implement the simplest of them, referred to as Batch Gradient Descent. The idea behind Gradient Descent is to start with initial values of θ₀ and θ₁, compute the corresponding value of J(θ₀, θ₁), and keep iterating the process until we reach a minimum. Without going into the intermediate steps, the algorithm can be expressed mathematically as:

Repeat till we achieve Minimum {

    θ₀ := θ₀ − α · (1/m) · Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
    θ₁ := θ₁ − α · (1/m) · Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾

} (both parameters updated simultaneously)

Here α is the learning rate. Care should be taken not to choose its value too small or too large. For too small values of α, Gradient Descent may take a long time to reach the optimum values, while for too large values, Gradient Descent may not converge at all. While the optimum value of α depends on the case at hand, it's generally good practice to start with a value like 0.001 and increase it in steps like 0.003, 0.01, 0.03, 0.1, 0.3 and so on.
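The update loop above can be sketched directly in code. The following Java snippet is a minimal illustration of Batch Gradient Descent for a single feature; the class name, the tiny data-set and the learning rate are made up for the example:

```java
public class BatchGradientDescent {

    // Fits theta0 + theta1 * x to the data by batch gradient descent
    // and returns {theta0, theta1}.
    static double[] fit(double[] x, double[] y, double alpha, int iterations) {
        double theta0 = 0.0, theta1 = 0.0;
        int m = x.length;
        for (int iter = 0; iter < iterations; iter++) {
            double grad0 = 0.0, grad1 = 0.0;
            // Accumulate gradients over the whole training set (hence "batch")
            for (int i = 0; i < m; i++) {
                double error = (theta0 + theta1 * x[i]) - y[i];
                grad0 += error;
                grad1 += error * x[i];
            }
            // Simultaneous update of both parameters
            theta0 -= alpha * grad0 / m;
            theta1 -= alpha * grad1 / m;
        }
        return new double[] {theta0, theta1};
    }

    public static void main(String[] args) {
        double[] sizes  = {1.0, 2.0, 3.0, 4.0};
        double[] prices = {2.0, 4.0, 6.0, 8.0}; // illustrative: price = 2 * size
        double[] theta = fit(sizes, prices, 0.1, 1000);
        // For this data the fit converges to theta0 ≈ 0, theta1 ≈ 2
        System.out.printf("theta0 = %.4f, theta1 = %.4f%n", theta[0], theta[1]);
    }
}
```

Because the sample data lies exactly on a line, the parameters converge to the true slope and intercept; on real data they settle at the minimum of the cost function instead.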

To allow easy understanding and visualization, we talked about only one input feature, but the above algorithm can easily be generalized to multiple input features as well:

Repeat till we achieve Minimum {

    θⱼ := θⱼ − α · (1/m) · Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾

} (simultaneously for every j = 0, 1, …, n, with x₀⁽ⁱ⁾ = 1)

If we take our housing data-set as an example, after running the above algorithm we will get values for 6 parameters, i.e. θ₀, θ₁, θ₂, θ₃, θ₄ and θ₅. For any set of input features, it is then pretty straightforward to come up with the price the house might sell for by applying:

Expected Price = θ₀ + θ₁ * Size + θ₂ * Number of Bedrooms + θ₃ * Number of Floors + θ₄ * Number of Bathrooms + θ₅ * Age

After implementing the algorithm and training against the above mentioned data-set, I got following values for different parameters:

The expected price of the house can now be calculated as :

Expected Price = 20.5949 + 0.288 * 1700 − 19.9074 * 3 − 57.0084 * 2 + 0 * 3 + 0 * 25 ≈ 336.456

As we can see, the predicted price of the house is quite within the range of our provided data-set. You may wonder why we got 0 as the parameter for the number of bathrooms and the age of the house. Theoretically it is very much possible, and it is an indication that we have some redundant or interdependent features in our data-set. But in our case, we get these values as 0 because of the very small size of our data-set.

We can implement these algorithms in any programming language of our choice, but one of the most popular programming languages for statistical and mathematical problem solving is Octave. You can also use gnuplot along with Octave to visualize graphs and analyze an algorithm's performance. A combination of Octave and gnuplot is quite helpful while modeling and choosing the right algorithm for your context. Once that decision has been made, there are a lot of math libraries in various popular programming languages like Java and C++ which will ease the task of implementation inside your application environment.

This is it about the broad concepts and internal workings of a typical Linear Regression algorithm. For easy comprehension, I have skipped several details and various optimization techniques. Hopefully this still gives you a good overview and will be helpful while working in these areas.


I have spoken at IndicThreads conferences on a couple of occasions, in 2009 and 2010. After having thoroughly enjoyed the experience, I talked several times with Harshad Oak and Sangeeta Oak, the people behind IndicThreads, about the need for such a conference in the Delhi area as well. At the start of 2012, the idea finally materialized, and they decided to plan an IndicThreads conference in Gurgaon. There followed several discussions about a suitable venue, partners, spreading the word, speakers and all the logistics needed for an impactful inaugural edition of the conference.

With the previous history of IndicThreads and our local contacts, an impressive list of Speakers covering a wide array of topics was put together. We were also able to find an appropriate venue in the form of the Fortune Select Hotel in Gurgaon.

Finally, everything came together on Friday, 13th July, when the conference started at 9:00 AM with more than 70 participants joining in. Proceedings were kick-started by an inaugural address from Sanket Atal, CTO of MakeMyTrip.com, on how to become a 10x Software Engineer. Considering it was a technology conference for and by Software Developers, I felt it was a very apt way of getting started. The next 2 days were full of interesting talks on the wide array of domains we as Software Developers are expected to work in. This included the JavaEE7 platform, NoSQL Databases, Mobile Applications, Scala, Hadoop and Node.js.

I also presented my experiences on Machine Learning. The slides for my talk are embedded below:

Presentation decks from all the other Speakers are available here. You can get complete information about the conference from its web-site

It was also decided during the conference to actively work towards creating a more vibrant Tech-Community in the region. A JUG-NCR (Java User Group – National Capital Region, India) was created for the purpose. Kudos to Rocky Jaiswal for taking the initiative. You can find complete information about JUG-NCR on its web-site.

All in all, a fun 2-day event with lots of learning and meetings with interesting people from the local software community. Many thanks to Harshad and Sangeeta for bringing the conference to Delhi, and I hope to see many such events in the future.


**Matrices & Vectors**

Starting with the basics, a Matrix is a rectangular array of elements. A typical example is:
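For instance (the entries below are illustrative), a Matrix with 3 rows and 2 columns:

```latex
A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}
```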

A Matrix is denoted by **Number of Rows * Number of Columns**. The above Matrix is normally called a 3 * 2 Matrix.

Similarly, a Vector is another often used entity in Machine Learning algorithms. It's a special type of Matrix in the sense that it always has only one column; it is also called an **n * 1 Matrix**. A typical example of a Vector would be:
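For instance, a 4 * 1 Vector (entries illustrative):

```latex
v = \begin{bmatrix} 2 \\ 4 \\ 6 \\ 8 \end{bmatrix}
```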

Entries of a Matrix or a Vector are denoted by A_ij, where **i** is the row number and **j** is the column number. So A_32 of a Matrix is the entry in its 3rd row and 2nd column, while v_2 of a Vector is simply its 2nd element.

**Matrix Addition**

We can perform addition operations over 2 matrices. In this, we add the element at position (i, j) in Matrix M1 to the element at the same position in Matrix M2. Let's take an example:
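For instance, with two illustrative 2 * 2 matrices:

```latex
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix}
```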

We can perform subtraction in similar manner. One point to mention here is that we can perform these operations only on matrices of the same dimensions.

**Scalar Multiplication & Division**

We can also perform multiplication and division of Matrices and Vectors with real numbers. As you might have guessed, we take every number in the matrix and perform the corresponding operation with the provided number. Again, an example will help:
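For instance, multiplying an illustrative 2 * 2 matrix by the real number 3:

```latex
3 \times \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 9 & 12 \end{bmatrix}
```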

Addition and subtraction can also be performed similarly.

**Matrix-Vector Multiplication**

Things get a bit interesting here. Firstly, we can only multiply a Matrix and a Vector when the number of columns of the Matrix equals the number of rows of the Vector. In mathematical terms, an m * n-dimensional Matrix can only be multiplied with an n-dimensional Vector. The result of any Matrix-Vector multiplication is always a Vector. Let's take the help of an example to see how this is done:
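For instance, with an illustrative 3 * 2 matrix and a 2-dimensional vector:

```latex
\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \begin{bmatrix} 7 \\ 8 \end{bmatrix} = \begin{bmatrix} 1 \cdot 7 + 2 \cdot 8 \\ 3 \cdot 7 + 4 \cdot 8 \\ 5 \cdot 7 + 6 \cdot 8 \end{bmatrix} = \begin{bmatrix} 23 \\ 53 \\ 83 \end{bmatrix}
```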

As you can see, we are multiplying a 3 * 2 matrix with a 2 dimensional Vector and we get a 3 dimensional Vector in result.

**Matrix-Matrix Multiplication**

A Matrix-Matrix multiplication can be considered as an extended case of Matrix-Vector multiplication:
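For instance (values illustrative), to multiply two 2 * 2 matrices, we multiply the first matrix with each column of the second matrix separately:

```latex
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 \\ 7 \end{bmatrix} = \begin{bmatrix} 19 \\ 43 \end{bmatrix}, \qquad \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 6 \\ 8 \end{bmatrix} = \begin{bmatrix} 22 \\ 50 \end{bmatrix}
```

The resulting vectors are then assembled column-wise:

```latex
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}
```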

Each column of the second matrix is multiplied with the first matrix exactly as explained in Matrix-Vector multiplication, and the resulting vectors, placed side by side, form the final result.

As you can see, we can only multiply an m * n matrix with an n * p matrix, i.e. the number of columns of the 1st matrix should equal the number of rows of the 2nd matrix. The resulting matrix has m * p dimensions.

A couple of points to keep in mind related to Matrix-Matrix multiplication. Contrary to real numbers, where A * B = B * A, Matrix multiplication is not commutative, i.e. in general A * B ≠ B * A.

**Identity Matrix**

An Identity Matrix is a square Matrix, i.e. an m * m matrix, where there are 1s on the diagonal from top left to bottom right and 0s elsewhere. An Identity Matrix is generally denoted by *I* or I_n. A couple of examples of Identity Matrices are:
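For example, the 2 * 2 and 3 * 3 Identity Matrices are:

```latex
I_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
```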

All these operations are quite widely used in developing algorithms for Linear Regression and Gradient Descent. There are a lot of libraries in various popular programming languages which do these things for us but it helps to know the details for a better understanding.
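As a sketch of those details, the m * n × n * p multiplication rule described above can be implemented in a few lines of Java (the class and method names are illustrative):

```java
public class MatrixOps {

    // Multiplies an m*n matrix a by an n*p matrix b, returning an m*p matrix.
    static double[][] multiply(double[][] a, double[][] b) {
        int m = a.length, n = b.length, p = b[0].length;
        // Columns of the 1st matrix must equal rows of the 2nd matrix
        if (a[0].length != n) {
            throw new IllegalArgumentException("columns of A must equal rows of B");
        }
        double[][] result = new double[m][p];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < p; j++) {
                for (int k = 0; k < n; k++) {
                    result[i][j] += a[i][k] * b[k][j]; // row i of A dot column j of B
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = multiply(a, b);
        System.out.println(java.util.Arrays.deepToString(c));
        // prints [[19.0, 22.0], [43.0, 50.0]]
    }
}
```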

There are a couple of other interesting operations, like the Transpose and the Inverse of a Matrix, but I will leave them for you to dig into further.


**What is Machine Learning**

A lot of people have given their thoughts on the definition of Machine Learning. Out of them, I would like to quote a couple of my favorites here:

Field of Study that gives computers the ability to learn without being explicitly programmed —

Arthur Samuel

A more mathematical and widely quoted one is:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. —

Tom M. Mitchell

It would be good to explain this with an example. Let's take one of the very popular use-cases of Machine Learning: the **Spam Filter**. When we open our Inbox, we choose to mark some of the mails as Spam. Over a period of time, the program behind the Spam Filter learns what type of emails we mark as spam and stops similar mails from even reaching our Inbox. Behind the scenes, none of this is manually programmed; the algorithms learn the behaviors of different customers over a period of time and implement Spam filtering accordingly.

In terms of the 2nd definition, *Task T* here would be classifying an email as Spam or not. *Experience E* would be the recorded user behavior in terms of which mails (s)he classifies as Spam, and finally *Performance P* would be how good or bad the filter is at correctly classifying a given mail message as Spam or not. In general, the performance or accuracy of a Machine Learning program increases with experience or time.

What are the types of Machine Learning algorithms currently used:

**Supervised Machine Learning** We try to teach computers how to do something by providing the algorithm with a set of right answers for different types of questions. It then infers/learns and tries to return correct answers when similar, but not the same, questions are asked of it in future.

This list of Question-Answers is called the **Training Data-Set**. This Data-Set consists of multiple Training Examples, or rows. Each row is a pair of Input objects (the Question) and a corresponding Output value (the Answer). The Input object is generally a vector of multiple attributes and in theory can scale indefinitely (the Question details). Each element of the Input object is called a **Feature**. The Output field is the desired answer.

Again, let's take the help of an example to understand these terms more clearly. Suppose we want to create a system which provides users an estimated price they might get for their current car. In order for a Machine Learning algorithm to work, we need to provide a set of data on which the algorithm will base its learnings / predictions. This initial data can contain a number of information points about the vehicle:

```
Price Sold, Make, Model, Year of Manufacture, Distance Driven, Num Services, Num Accidents
140000, Suzuki, Alto, 2006, 37000, 8, 3
```

In the above example, Price Sold, or 140000, is the Output or Answer. The remaining information forms the Input object, and each of the car's details is a Feature. Everything together forms a Row, and multiple rows together constitute a Training Data-Set. In every row, we are stating that for a given set of features, a car was sold at this price. This is what we mean by *Right Answers*. Based upon this data, the algorithm will make inferences, and when we provide a different set of features, it will be able to predict the approximate sale price for a given car. An example question could be:

Suzuki, Alto, 2010, 15000, 8, 3

And a typical response can be

164500

Supervised Learning can further be classified into two types:

**Regression** Also referred to as Continuous Valued Output. Here the results are generally numeric values. The above-mentioned example of predicting the approximate value of a used car is an example of Regression. Other examples of Regression output can be predicting housing prices, sales figures, population figures, etc.

**Classification** Also referred to as Discrete Valued Output. Results in this case are discrete values. Examples can be recommending items to a visitor on an E-Commerce web-site, classifying an email message as Spam or not, or predicting the likelihood of a patient suffering from a given disease based upon different characteristics.

**Unsupervised Machine Learning** As opposed to Supervised Learning, we provide the Computer with the data and let it learn by itself. In other words, we let the underlying algorithm find some structure in the provided data-set. One of the most widely used techniques in Unsupervised Learning is Clustering. Some typical examples of Unsupervised Learning are:

- Google News
- Market Segmentation
- Social Network Analysis like Finding Facebook friends
