In most Machine Learning problems we have to deal with large data sets, which generally calls for some kind of distributed computing platform. One of the leaders in this space is Hadoop. In brief, we can consider Hadoop an ecosystem that provides a layer of abstraction for performing large-scale distributed computing easily and reliably. Though the Hadoop infrastructure consists of several components, I would like to introduce you to two of its core components, HDFS and MapReduce, and show how we can solve complex problems with them in an elegant manner.
Problem Context

We have a social web application accessed by a large number of users. We need to determine which users accessed the system within a given time period, and how many times each of them did. Our log files record every user login along with the time of access. For easy understanding, let's take this sample log file:
user1 123
user2 123
user3 123
user4 124
user5 124
user6 125
user7 125
user8 125
....
Every line in the log file represents a successful login by a given user, along with the system time in milliseconds. How exactly the information is stored in the log files is not really important; as we will see a bit later, adjusting to a different format is just a matter of a few lines of code.
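To illustrate that point, here is a minimal sketch of the parsing step in plain Java (the class and method names are my own, not from the code that follows later). Supporting a different log layout would only mean changing this one routine:

```java
// Sketch: extract (user, timestamp) from one log line.
// Assumed format: "<user> <millis>", whitespace-separated.
public class LogLineParser {
    public static String[] parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("unexpected log format: " + line);
        }
        return parts; // parts[0] = user, parts[1] = timestamp in millis
    }
}
```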
Expected Output

We will provide a startTime and an endTime between which we want the desired information. The expected output would be something like:
user4 2
user6 2
user7 1
user8 1
user9 1
....
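To make the goal concrete, here is a single-machine sketch of the filter-and-count logic in plain Java (class and method names are my own, chosen for illustration). A MapReduce job distributes exactly this per-line work: the mapper filters by time window and emits the user, and the reducer sums the counts:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: count logins per user within [startTime, endTime].
// This is the local equivalent of the MapReduce job we will build.
public class LoginCounter {
    public static Map<String, Integer> countLogins(
            List<String> logLines, long startTime, long endTime) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by user
        for (String line : logLines) {
            String[] parts = line.trim().split("\\s+"); // "<user> <millis>"
            if (parts.length != 2) continue;            // skip malformed lines
            long ts = Long.parseLong(parts[1]);
            if (ts >= startTime && ts <= endTime) {
                counts.merge(parts[0], 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Users with no login inside the window simply do not appear in the result, matching the expected output above.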
Note: All the steps and code mentioned below have been tested on Hadoop 1.0.3, the latest stable release at the time of writing. Changes may be required when working with other versions.