Understanding Internals of Linear Regression

As I talked in one of my previous blogs, Regression is a type of Supervised Machine Learning where given a set of Input features, underlying algorithm returns a continuous valued output. Let’s take an example to understand it better:

In the above data-set, we are providing number of Independent features like house size, number of bedrooms, number of floors, number of bathrooms and age of the house. On the right most column is the actual price, the house was sold. This is generally called target or output. When a similar data-set is provided to an appropriate learning algorithm and after the learning phase is complete, we would expect the algorithm to predict the expected selling price of a house with similar features. One of a typical requirements is mentioned in the last row of the above data-set. As you can imagine, depending upon the values of input features like house size, bedrooms, age etc, the price of the house can be anywhere between 100K USD to 1 Million USD. That’s the reason of it being also called a Continuous Valued output.

Some of the common use-cases of Regression can be predicting population of a city, Sensex Index, Stock Inventory and many more. In-Fact Regression is one of the earliest forms of Statistical analysis and also one of the most widely used Machine Learning techniques.

In this blog, we will try to implement the simplest form of Regression referred as Linear Regression. For any Machine Learning algorithm, a typical work-flow looks like :

We provide Training Data-Set to a given Algorithm. It uses a Hypothesis which is essentially a mathematical formula to derive inferences about the data-set. Once the hypothesis has been formulated, we say that Algorithm has been trained. When we provide to algorithm a set of Input features, it return us an Output based upon its hypothesis. In the above scenario, we would provide details about House Size, Number of Bedrooms, Number of Floors, Number of Bathrooms and Age to the algorithm it in-turn would respond us with an expected price the house can fetch in the market.

Let’s see what can be the hypothesis / formula for our scenario. For easier understanding, let’s simplify the context a bit and think about having only house size as a single input feature against which we need to predict house prices. If we plot both these variables on a graph, it would look something like:

If we can draw a straight line across the points, then we should be able to predict the values of houses when different sizes are provided as inputs. Let’s take a look at different lines we can draw on the above graph:

The line which touches or is quite close to as many points as possible on the graph is most likely to provide us the best predictions while the line which touches least or is far from a majority of points is likely to provide us worst predictions. In our case line h1 appears to be the best possible scenario. The ability to draw such a straight line across a given data-set is in-essence the goal of any Linear Regression Algorithm. As we are attempting to draw a straight line, we call it a Linear Regression solution. In Mathematical terms, we can formulate our hypothesis as follows:

$h_{\theta}x = \theta_{0} + \theta_{1}x$

where $\theta_{0}$ is the initial constant value we choose and $\theta_{1}$ is the value we will multiple by a given house size. The goal of our hypothesis is to find values of $\theta_{0}$ and $\theta_{1}$ in such a way that difference between actual values provided in our training data-set and predicted values is minimum. Again if we have to represent it in mathematical terms, the formula would be :

$J(\theta_{0}, \theta_{1}) = \dfrac {1}{2m} \sum \limits_{i=1}^{m} (h_{\theta} (x^{(i)}) - y^{(i)})^2$

where $\theta_{0}$ and $\theta_{1}$ are the chosen parameters, m is the number of rows in our training data-set and $h_{\theta} (x^{(i)}$ is the predicted value for ith element in our training data-set and $y^{(i)}$ is the actual target value of ith element in our training data-set. In Machine Learning terminology, this function is popularly referred as Cost Function or Squared Error Function as well. Let’s take couple of examples to understand the functioning better:

For $\theta_{0} = 90$ and $\theta_{1} = 0.15$, using our training data-set the value of $J(\theta_{0}, \theta_{1})$ would be:

$\dfrac {1}{2*6} \{ (570.6 - 673)^{2} + (495.9 - 580)^{2} + (405.6 - 460)^{2} + (302.4 - 232)^{2} + (320.1 - 315)^{2} + (217.8 - 178)^{2}\} == 2257.01$

For $\theta_{0} = 100$ and $\theta_{1} = 0.20$, using our training data-set the value of $J(\theta_{0}, \theta_{1})$ would be:

$\dfrac {1}{2*6} \{ (740.8 - 673)^{2} + (641.2 - 580)^{2} + (520.8 - 460)^{2} + (383.2 - 232)^{2} + (406.8 - 315)^{2} + (270.4 - 178)^{2}\} == 4322.11$

As the goal is to minimize $J(\theta_{0}, \theta_{1})$, we seem to be getting better results while using $\theta_{0} = 90$ and $\theta_{1} = 0.15$. But this method of hit and try is quite cumbersome and simply impossible while dealing with large data-sets and multiple input features. We would ideally like an automated method to figure out $\theta_{0}$ and $\theta_{1}$ for us. One of the most common and simplest algorithms for achieving this goal is called Gradient Descent. There are multiple variants of Gradient Descent algorithm, but for the purpose of the blog, we will be implementing the simplest of them referred as Batch Gradient Descent. The idea behind Gradient Descent is to start with an initial value of $\theta_{0}$ and $\theta_{1}$, find corresponding values of $J(\theta_{0}, \theta_{1})$ and keep iterating the process till we achieve a minimum. Without going into the intermediate steps, the formula can be expressed mathematically as :

Repeat till we achieve Minimum {

$\theta_{0} := \theta_{0} - \alpha \dfrac {1}{m} \sum \limits_{i=1}^{m} (h_{\theta} (x^{(i)}) - y^{(i)})$

$\theta_{1} := \theta_{1} - \alpha \dfrac {1}{m} \sum \limits_{i=1}^{m} (h_{\theta} (x^{(i)}) - y^{(i)}) * x^{(i)}$

}

$\alpha$ here is the learning rate. Care should be taken not to take it’s value to be too small or too large. For too small values of $\alpha$, Gradient Descent may take long time to reach the optimum values while for too large values, Gradient Descent may just not converge at times. While optimum value of $\alpha$ may depend on case to case, but it’s generally a good practice to start with values like 0.001 and increase it like 0.003, 0.01, 0.03, 0.1, 0.3 etc.

To allow easy understanding and visualization, we talked about only one of the input features but the above algorithm can be easily generalized for multiple input features as well :

Repeat till we achieve Minimum {

$\theta_{j} := \theta_{j} - \alpha \dfrac {1}{m} \sum \limits_{i=1}^{m} (h_{\theta} (x^{(i)}) - y^{(i)}) * x_{j}^{(i)}$

}

If we take our housing data-set as an example, after implementing the above algorithm, we will be getting values of 6 parameters i.e. $\theta_{0}, \theta_{1}, \theta_{2}, \theta_{3}, \theta_{4}$ and $\theta_{5}$. For any set of input-features, it would be pretty straight forward to come up with the price, the house might sell by applying :

Expected Price = $\theta_{0}$ + $\theta_{1}$ * Size + $\theta_{2}$ * Number of Bedrooms + $\theta_{3}$ * Number of Floors + $\theta_{4}$ * Number of Bathrooms + $\theta_{5}$ * Age

After implementing the algorithm and training against the above mentioned data-set, I got following values for different parameters:

$\theta_{0} = 20.5949$
$\theta_{1} = 0.288$
$\theta_{2} = -19.9074$
$\theta_{3} = -57.0084$
$\theta_{4} = 0$
$\theta_{5} = 0$

The expected price of the house can now be calculated as :

Expected Price = 20.5949 + 0.288 * 1700 -19.9074 * 3 -57.0084 * 2 + 0 * 3 + 0 * 25 == 367.451

As we can see, the predicted price of the house is quite within the range of our provided data-set. If you are wondering why we got 0 as parameter for numbers of bathroom and age of house. Theoretically, it’s very much possible and it’s an indication that we have some redundancy or interdependent features in our data-set. But for our case, we get these values as 0 because of very small size of our data-set.

We can implement these algorithms in any programming language of our choice. But one of the most popular programming language for Statistical and Mathematical problem solving is Octave. You can also make use of gnuplot along with Octave to visualize different graphs and analyze algorithm’s performance. A combination of Octave and gnuplot is quite helpful during modeling and choosing right algorithm for your context. Once the decision has been made, there are a lot of Math libraries in various popular programming languages like Java and C++, which will ease you task of implementation inside your application environment.

This is it about the broad concepts and internal working of a typical Linear Regression Algorithm. For easy comprehension, I have skipped several details and various optimization techniques. Hopefully this still gives you a good overview and will be helpful while working on these areas.