this is a short summary of the first week of the machine learning course by andrew ng.
supervised ml: giving the computer a data set with sample answers of interest and telling it “find the correlation between the dataset and the answers of interest”.
can be: regression (used for continuous answers, i.e. any number in a range) or classification (used for discrete answers from a finite set, typically called classes).
regression:
can be linear, logistic or more idk yet.
linear regression:
involves fitting a line to the spread of the data set. by fitting I mean picking the slope and position of a line so that the distance between the line itself and each point of the data set is as small as possible.
when the items of the set have a single feature, a univariate l.r. function should be adequate, viz:
\[ f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{1} \]
where: \(x^{(i)}\) is the x variable / feature of the ith item in the data set.
\(w\) is the weight of said feature. this describes how much the feature contributes to the final predicted output.
\(b\) is the bias of this function. it shifts the whole line up or down so the predictions land closer to the real answers. it is a single number shared by every item in the data set.
\(f_{w,b}(x^{(i)})\) means “here’s a function with parameters w and b that depends on the variable \(x^{(i)}\)”.
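to make equation \((1)\) concrete, here is a minimal python sketch. the function name `predict`, the toy features and the values of w and b are my own assumptions for illustration, not something from the course.

```python
def predict(x, w, b):
    """Univariate linear model from equation (1): f_wb(x) = w * x + b."""
    return w * x + b


# hypothetical toy data: one feature per item
x_train = [1.0, 2.0, 3.0]

# arbitrary guesses for the parameters, not fitted values
w, b = 2.0, 7.0

predictions = [predict(x_i, w, b) for x_i in x_train]
print(predictions)  # [9.0, 11.0, 13.0]
```

each prediction is just the feature scaled by w and then shifted by b.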
cost function:
q1: given that the point of a l.r. model is a line best fitted to the data set, how do we make it happen - especially for any data set?
a: by providing a value describing the inaccuracy of this fitting, then gradually reducing this value till it is as low as possible. the value above is termed a cost function as it describes the cost of using the equation (a.k.a m.l model) to predict the output instead of getting the real life answer of interest.
it is given by:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2\tag{2}$$
where \(y^{(i)}\) is the sample output / answer of interest for the ith item in the data set, and \(m\) is the number of items in the data set.
fun fact: if the whole data set falls on the line, the cost function \((J(w,b))\) will be 0.
to develop a good linear regression model, the software picks a model equation, say, \(f_{w,b}(x^{(i)}) = 2x^{(i)} + 7\), then plugs in the feature of each item in the dataset for \(x^{(i)}\) and gets the prediction \(f_{w,b}(x^{(i)})\). for each item it then subtracts the provided \(y^{(i)}\) from the prediction \((f_{w,b}(x^{(i)}))\), squares the difference, and sums these squared differences over the full data set, as in equation \((2)\).
q2: why the m and 2 in the denominator?
a: the m in the denominator averages the squared differences, so \((J(w,b))\) does not grow just because the data set grows. the 2 is a convenience: differentiating the square later on brings down a factor of 2, and the division by 2 cancels it. scaling \((J(w,b))\) by a constant does not move its minimum, so nothing is lost.
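a minimal python sketch of equation \((2)\), assuming toy data that lies exactly on the line 2x + 7 (the name `compute_cost` and the data are my own choices for illustration):

```python
def compute_cost(x, y, w, b):
    """Cost J(w, b) from equation (2): squared errors summed, divided by 2m."""
    m = len(x)
    total = 0.0
    for x_i, y_i in zip(x, y):
        error = (w * x_i + b) - y_i  # prediction minus the provided answer
        total += error ** 2
    return total / (2 * m)


# hypothetical toy data that falls exactly on the line 2x + 7
x_train = [1.0, 2.0, 3.0]
y_train = [9.0, 11.0, 13.0]

print(compute_cost(x_train, y_train, 2.0, 7.0))  # 0.0, every point is on the line
print(compute_cost(x_train, y_train, 2.0, 6.0))  # 0.5, every prediction is off by 1
```

the first call shows the fun fact above: when the whole data set falls on the line, the cost is 0.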
gradient descent:
next, how do we minimize this \((J(w,b))\) value?
given the cost function \((J(w,b))\) is quadratic in w and b, the plot will always be a U shaped curve. this means there is definitely a minimum \((J(w,b))\) and to get there you’ll need to find the slope of \((J(w,b))\) at the point where you currently are. with this information, you’ll know which direction will take you to the minimum. e.g. if it’s \((J(w,b))\) (y axis) against w (x axis), and you’re on the left side of the evenly split U, the slope is negative, so increasing w takes you down. on the right side the slope is positive, so decreasing w takes you down.
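before doing any calculus, the sign of the slope on each side of the U can be checked numerically. this is only a sketch: it holds b fixed at 7, uses toy data on the line 2x + 7 (so the minimum sits at w = 2), and estimates the slope with a finite difference. all names and values here are assumptions for illustration.

```python
def cost(w, b, x, y):
    """Cost J(w, b): squared errors summed, divided by 2m."""
    m = len(x)
    return sum(((w * x_i + b) - y_i) ** 2 for x_i, y_i in zip(x, y)) / (2 * m)


def numerical_slope(w, b, x, y, h=1e-6):
    """Central-difference estimate of the slope of J along the w axis."""
    return (cost(w + h, b, x, y) - cost(w - h, b, x, y)) / (2 * h)


x_train = [1.0, 2.0, 3.0]
y_train = [9.0, 11.0, 13.0]  # lies on 2x + 7, so J is lowest at w = 2 when b = 7
b_fixed = 7.0

left = numerical_slope(0.0, b_fixed, x_train, y_train)   # w = 0 is left of the minimum
right = numerical_slope(4.0, b_fixed, x_train, y_train)  # w = 4 is right of the minimum

print(left < 0)   # True: negative slope, so w must increase
print(right > 0)  # True: positive slope, so w must decrease
```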
to get this slope you differentiate \((J(w,b))\):
because \((J(w,b))\) depends on 2 variables, you’ll need to differentiate w.r.t each. By this, I mean, \(\frac{\partial J(w,b)}{\partial w}\) and \(\frac{\partial J(w,b)}{\partial b}\), which I solve below.
starting with \(\frac{\partial J(w,b)}{\partial w}\), which is:
\(=\frac{\partial}{\partial w}\left[\frac{1}{2m}\sum\limits_{i = 0}^{m-1}(f_{w,b}(x^{(i)})-y^{(i)})^2\right]\) from equation \((2)\)
\(= \frac{1}{2m}\sum\limits_{i = 0}^{m-1}\left[\frac{\partial}{\partial w}(f_{w,b}(x^{(i)})-y^{(i)})^2\right]\),
according to the chain rule, \(\frac{\partial(u^2)}{\partial w} = 2u \cdot \frac{\partial u}{\partial w}\)
therefore:
$$\frac{1}{2m}\sum\limits_{i = 0}^{m-1}\left[\frac{\partial}{\partial w}(f_{w,b}(x^{(i)})-y^{(i)})^2\right] = \frac{1}{2m}\sum\limits_{i = 0}^{m-1}\left[2(f_{w,b}(x^{(i)})-y^{(i)}) \cdot \frac{\partial(f_{w,b}(x^{(i)})-y^{(i)})}{\partial w}\right]$$
since,
$$ f_{w,b}(x^{(i)}) = wx^{(i)} + b$$
therefore,
$$f_{w,b}(x^{(i)}) - y^{(i)} = wx^{(i)} + b - y^{(i)}$$
$$\frac{\partial(wx^{(i)} + b - y^{(i)})}{\partial w} = x^{(i)} $$
thus:
$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m}\sum\limits_{i = 0}^{m-1}(f_{w,b}(x^{(i)})-y^{(i)})x^{(i)} $$
for \(\frac{\partial J(w,b)}{\partial b}\):
because \(b\) stands alone (with a coefficient of 1) in \(wx^{(i)} + b - y^{(i)}\), the derivative \(\frac{\partial(f_{w,b}(x^{(i)})-y^{(i)})}{\partial b}\) is 1. and since both derivatives share the same \(\frac{1}{2m}\sum 2(f_{w,b}(x^{(i)})-y^{(i)})\) part:
$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m}\sum\limits_{i = 0}^{m-1}(f_{w,b}(x^{(i)})-y^{(i)}) $$
while this gets us the slope of \((J(w,b))\) for one point on the plot, we have to move against this slope to get to the bottom of the plot (i.e. descend the gradient/slope of the plot). for this we have to redo the above \((J(w,b))\) differentiation and update the parameters w and b at every step. to do this we multiply the slope by a learning rate variable \((\alpha)\), which determines how fast we change w and b. this steadily reduces \((J(w,b))\), till we get to a point where \(\frac{\partial J(w,b)}{\partial w}\) and \(\frac{\partial J(w,b)}{\partial b}\) give zero (or close enough to it) and we stop updating \(w\) and \(b\).
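the two derivatives can be sketched in python as below, summing over the whole data set as in equation \((2)\). the name `compute_gradient` and the toy data are assumptions for illustration.

```python
def compute_gradient(x, y, w, b):
    """Return (dJ/dw, dJ/db) for the univariate linear model at the point (w, b)."""
    m = len(x)
    dj_dw = 0.0
    dj_db = 0.0
    for x_i, y_i in zip(x, y):
        error = (w * x_i + b) - y_i  # f_wb(x) - y for this item
        dj_dw += error * x_i         # the extra x comes from d(wx + b - y)/dw
        dj_db += error               # d(wx + b - y)/db is 1
    return dj_dw / m, dj_db / m


# hypothetical toy data that falls exactly on the line 2x + 7
x_train = [1.0, 2.0, 3.0]
y_train = [9.0, 11.0, 13.0]

print(compute_gradient(x_train, y_train, 2.0, 7.0))  # (0.0, 0.0), already at the minimum
print(compute_gradient(x_train, y_train, 2.0, 6.0))  # (-2.0, -1.0), b is too low
```

both slopes are negative in the second call, so subtracting them (scaled by the learning rate) pushes w and b upward.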
i.e. we repeat the following until convergence:
$$ w = w - \alpha \frac{\partial J(w,b)}{\partial w} $$
and,
$$ b = b - \alpha \frac{\partial J(w,b)}{\partial b} $$
this process eventually gives us the optimal \(f_{w,b}(x^{(i)}) = wx^{(i)} + b\) for the specific dataset, because at these values of \(w\) and \(b\), the cost function is the lowest.
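putting the two update rules in a loop gives a minimal gradient descent sketch. the starting point, learning rate and iteration count below are arbitrary assumptions that happen to work for this toy data, and a fixed number of steps stands in for a proper convergence check.

```python
def gradient_descent(x, y, w, b, alpha, num_iters):
    """Repeat the two update rules a fixed number of times and return (w, b)."""
    m = len(x)
    for _ in range(num_iters):
        # both slopes are computed from the current w and b ...
        errors = [(w * x_i + b) - y_i for x_i, y_i in zip(x, y)]
        dj_dw = sum(e * x_i for e, x_i in zip(errors, x)) / m
        dj_db = sum(errors) / m
        # ... and only then are w and b updated together
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b


# hypothetical toy data that falls exactly on the line 2x + 7
x_train = [1.0, 2.0, 3.0]
y_train = [9.0, 11.0, 13.0]

w_fit, b_fit = gradient_descent(x_train, y_train, w=0.0, b=0.0, alpha=0.1, num_iters=5000)
print(round(w_fit, 3), round(b_fit, 3))  # 2.0 7.0
```

the loop recovers the line the data was generated from, which is where the cost is lowest.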
notes:
- if at a point \(w\) = 2 and \(b\) = 4, and after updating \(w\) = 6, do not use this new \(w\) to update \(b\). compute both derivatives with the old \(w\) and \(b\) first, then update the two together (a simultaneous update).
- if \(\alpha\) is too low, the descent will be slow but will still get there eventually. however if it is too high, it will be ineffective because instead of simply going down, you overshoot the minimum: \(w\) flip flops from one side of the U to the other while \((J(w,b))\) increases at every step.
- near the minimum, the slope reduces, therefore the magnitudes of the changes in w and b also reduce, even with a fixed \(\alpha\).
- getting katex to work for this post was a pain until i found Tom’s guide here, making sure to modify my theme’s default files.
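the learning rate note above can be checked with a small python sketch that records \((J(w,b))\) at every step. the two \(\alpha\) values are assumptions picked for this toy data only. where "too high" starts depends on the data set.

```python
def cost(x, y, w, b):
    """Cost J(w, b): squared errors summed, divided by 2m."""
    m = len(x)
    return sum(((w * x_i + b) - y_i) ** 2 for x_i, y_i in zip(x, y)) / (2 * m)


def cost_history(x, y, alpha, num_iters):
    """Run gradient descent from w = b = 0 and record J(w, b) at every step."""
    m = len(x)
    w, b = 0.0, 0.0
    history = [cost(x, y, w, b)]
    for _ in range(num_iters):
        errors = [(w * x_i + b) - y_i for x_i, y_i in zip(x, y)]
        dj_dw = sum(e * x_i for e, x_i in zip(errors, x)) / m
        dj_db = sum(errors) / m
        w, b = w - alpha * dj_dw, b - alpha * dj_db  # simultaneous update
        history.append(cost(x, y, w, b))
    return history


# hypothetical toy data that falls exactly on the line 2x + 7
x_train = [1.0, 2.0, 3.0]
y_train = [9.0, 11.0, 13.0]

small = cost_history(x_train, y_train, alpha=0.01, num_iters=20)
large = cost_history(x_train, y_train, alpha=0.5, num_iters=20)

print(small[-1] < small[0])  # True: slow, but the cost goes down
print(large[-1] > large[0])  # True: overshooting, the cost blows up
```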