Saturday, 7 June 2014

Gradient descent derivation

We will try to understand the derivation of gradient descent, starting off with the simplest case and then gradually moving on to a slightly more complex one.
Let's push aside our hypothesis function for now and consider only a cost function and its derivative.
Suppose our cost function is of the form $J(\theta)=\theta^2$
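For this simple cost function we can watch gradient descent work directly: the derivative of $\theta^2$ is $2\theta$, so each update moves $\theta$ toward 0, where the cost is minimized. A minimal sketch (the starting point, learning rate and iteration count below are illustrative choices, not values from the course):

```python
# Gradient descent on J(theta) = theta^2.
# dJ/dtheta = 2*theta, so each step moves theta toward 0,
# the minimum of this cost function.

def gradient_descent(theta=5.0, alpha=0.1, iterations=100):
    for _ in range(iterations):
        grad = 2 * theta            # derivative of theta^2
        theta = theta - alpha * grad
    return theta

final_theta = gradient_descent()
print(final_theta)  # converges very close to 0
```

Note that each step multiplies $\theta$ by $(1 - 2\alpha)$, so for any $0 < \alpha < 1$ the value shrinks toward the minimum at $\theta = 0$.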

Thursday, 5 June 2014

Latex testing page

$x \leq y + \sum_{k=1}^n y_k$

$x \\ y + \sum_{k=1}^n y_k$
$y + \sum_{k=1}^n y_k$
$y=m.X+c$ 
$x=a^2$ 
$ X_3 = \begin{cases} X_{31} & \mbox{if }Y_1 Y_2 > 0 \\ X_{32} & \mbox{if }Y_1 Y_2 \le 0 \end{cases}$ 

Linear Regression with one variable


In this post we will summarize the first week of Andrew Ng's ML class in the form of concise notes.


Linear regression with one variable is basically fitting the data with a straight line.
A simple $y = mX + c$

Where y = output, X = input, m = slope, and c = a constant value (the line's intercept).
So if we are given two-dimensional data (y vs. X), we have to identify the straight line that best fits this data. We can generate infinitely many such lines by changing m and c, because X is given to us (and is thus fixed).
The problem now is how to measure whether a certain straight line is the best fit.
A line that comes closest to the largest number of data points is the best fit. Thus we can calculate an error for each line, measuring how far the line deviates from the data points.
For example, $\theta_{0}$ (intercept) = 0, $\theta_{1}$ (slope) = 0 would mean a horizontal line with slope and intercept both zero, whereas $\theta_{0}$ (intercept) = 0, $\theta_{1}$ (slope) = 1 would mean a straight line with slope 1 (45 degrees) and intercept zero. We can thus vary the thetas through infinitely many combinations to produce infinitely many lines, but there is only one combination of $\theta_{0}$ and $\theta_{1}$ that is the best fit (i.e., gives the lowest error).
So we can write our hypothesis equation as
$h_{\theta}(x)=\theta_{0}+\theta_{1} x$

Now we have the equation of a straight line to fit our data with. Of course, a straight line will rarely fit real-world data exactly, so there will be errors due to the misfit.

The misfit is the gap between the predicted values (h) and the given values (y). To measure this error we take the square of each difference and average the sum over m, giving us our cost function (error function):

$J(\theta_{0},\theta_{1})=\frac{1}{2m} \sum\limits_{i=1}^m \left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2$

Notice that in the hypothesis equation we can only change the thetas. The value X is fixed, and h is the output. The value m is also fixed, because you will be given (or rather you will have to consider) only a fixed number of training examples. So the only values we can play with are the thetas; using them we will generate lots of lines in our quest to fit the input dataset of size m. There are infinitely many possible values of theta, but we choose only the pair that gives the minimum cost.
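The cost function above translates almost directly into code. A sketch, using a tiny made-up dataset for illustration (the data values are hypothetical, not from the course):

```python
# Cost function J(theta0, theta1) for one-variable linear regression:
# average of squared errors between hypothesis and targets, over 2m.

def compute_cost(theta0, theta1, X, y):
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = theta0 + theta1 * xi      # hypothesis h_theta(x)
        total += (h - yi) ** 2        # squared error for this example
    return total / (2 * m)

X = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]   # perfectly fit by theta0 = 0, theta1 = 2
print(compute_cost(0.0, 2.0, X, y))   # 0.0: the perfect fit
print(compute_cost(0.0, 0.0, X, y))   # larger cost for a bad fit
```

A perfect fit gives a cost of exactly zero; any other theta pair gives a strictly larger cost.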

We can try inserting such theta value pairs into the hypothesis equation manually and calculate the error cost for each pair, then hope to find a pair that gives the least error (i.e., is the best fit to the data).
The algorithm would be :-
for 1000 iterations (let's consider only 1000 iterations) :-
1.] Devise the hypothesis equation
2.] Start with a theta value
3.] For this theta value , calculate the hypothesis
4.] Calculate the error/cost for this hypothesis
5.] Choose another value of theta; go to 3

After 1000 such iterations we will have 1000 values of error/cost, one for each of the respective hypotheses (theta values).
As you can see, doing all this manually could take forever, unless you had a lucky guess of theta values. This is where writing a program comes in handy: we can leave this series of manual calculations to the computer.
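The brute-force loop above can be sketched as follows. The grid of candidate theta pairs (1000 of them, echoing the 1000 iterations above) and the dataset are hypothetical choices for illustration:

```python
# Brute-force search: try many (theta0, theta1) pairs and keep
# the pair with the lowest cost.

def compute_cost(theta0, theta1, X, y):
    m = len(X)
    return sum((theta0 + theta1 * xi - yi) ** 2
               for xi, yi in zip(X, y)) / (2 * m)

X = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]   # generated from y = 2x

best = None
for i in range(100):                  # 100 x 10 = 1000 candidate pairs
    for j in range(10):
        t0, t1 = j * 0.5, i * 0.05    # one candidate (theta0, theta1)
        cost = compute_cost(t0, t1, X, y)
        if best is None or cost < best[0]:
            best = (cost, t0, t1)

print(best)   # lowest cost found and its (theta0, theta1)
```

This works only because the grid happens to contain a good pair; in general a fixed grid may miss the best thetas entirely, which is exactly the problem gradient descent solves.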

Gradient descent
$\theta_{j} := \theta_{j} - \alpha \, \frac{\partial}{\partial \theta_{j}} J(\theta_{0},\theta_{1})$
Now the purpose of GD is to minimize a function. What the above equation achieves is this: we start with a value of theta, tweak it by a very small amount, substitute the new theta back into the cost function, and recalculate the cost.
The key point behind GD is that at a steep point the slope of the cost function is very high, whereas as we approach a flat base point the slope becomes small. Thus if you are standing at a very steep point, GD will move theta by a large amount; as you approach the base, GD takes small, cautious steps. The essence of GD lies in the second term (the gradient of the cost function). It is a simple yet very intelligent trick if you think through the intuition behind it. In general, though, this approach only gives us a local minimum; there is no guarantee of reaching the global minimum. So you may have to perform gradient descent a number of times, each time with a different initial theta, and compare the resulting local minima. (For the squared-error cost of linear regression the cost function is convex, so any local minimum is also the global one.)
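Putting the update rule together with the linear-regression cost function gives the full algorithm. A sketch, with an illustrative dataset, learning rate, and iteration count (these particular values are my own choices, not from the course):

```python
# Gradient descent for one-variable linear regression.
# The partial derivatives of J(theta0, theta1) are:
#   dJ/dtheta0 = (1/m) * sum(h - y)
#   dJ/dtheta1 = (1/m) * sum((h - y) * x)

def gradient_descent(X, y, alpha=0.05, iterations=5000):
    m = len(X)
    theta0, theta1 = 0.0, 0.0         # initial guess
    for _ in range(iterations):
        err = [theta0 + theta1 * xi - yi for xi, yi in zip(X, y)]
        grad0 = sum(err) / m
        grad1 = sum(e * xi for e, xi in zip(err, X)) / m
        # simultaneous update of both parameters
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

X = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1
t0, t1 = gradient_descent(X, y)
print(t0, t1)   # approaches theta0 = 1, theta1 = 2
```

Note the simultaneous update: both gradients are computed from the old thetas before either parameter is changed, which matches the update rule above.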