logistic regression

here the predicted outputs are binary, either a 0 or a 1, so the challenge is finding a function that maps the input features to one of those two choices.

To do this, we use the sigmoid function, viz:

$$f(x) = \frac{1}{1+e^{-z}}$$

where $z = \vec{w} \cdot \vec{x} + b$.

This means if z is large and positive, $e^{-z}$ is very small and f(x) approaches 1 - thus, it is approximated as 1. And for the reverse (z large and negative), it is approximately 0. This way all the data is squashed into the (0, 1) range.
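a quick numpy sketch of this squashing behaviour (the weights, features and bias here are just illustrative values, not from any real dataset):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued z to the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

# z = w.x + b for a single example (made-up numbers)
w = np.array([1.5, -0.5])
x = np.array([2.0, 1.0])
b = -1.0

z = np.dot(w, x) + b       # 1.5
print(sigmoid(z))          # ~0.82, leans towards class 1
print(sigmoid(6.0))        # ~0.998, effectively 1
print(sigmoid(-6.0))       # ~0.002, effectively 0
```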

when applied to a dataset, this will give a plot like this.

cost function

to measure the losses for the predictions, the log of f(x) is used. however, a single overall equation does not capture the losses for both cases: for a target of 0, a prediction near 0 has minimal loss while a prediction near 1 has pretty much infinite loss, and it is the reverse for a target of 1. therefore there are two loss functions describing this mirrored logic, viz:

when the target is 1, the loss is $-\log(f(x_i))$

and

when the target is 0, the loss is $-\log(1 - f(x_i))$

these give plots like this.
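a minimal sketch of the two branches, assuming plain numpy (the function name and the probabilities are mine, just to show how the penalties behave):

```python
import numpy as np

def example_loss(f_x, y):
    """Loss for a single prediction f_x in (0, 1) against target y in {0, 1}."""
    if y == 1:
        return -np.log(f_x)        # near 0 when f_x -> 1, blows up as f_x -> 0
    else:
        return -np.log(1.0 - f_x)  # near 0 when f_x -> 0, blows up as f_x -> 1

print(example_loss(0.99, 1))   # ~0.01, good prediction
print(example_loss(0.01, 1))   # ~4.6, heavily penalised
print(example_loss(0.01, 0))   # ~0.01, good prediction
```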

however, in reality we care about minimizing the losses for predictions with targets 0 and 1 together.

Thus we need a single loss function combining both versions of the equation.

To do this we use the previous idea of a cost function:

$$J(w,b) = \frac{1}{m} \sum_{i=1}^{m}\left[\text{loss for targets of 0 and loss for targets of 1}\right]$$

$$J(w,b) = \frac{1}{m} \sum_{i=1}^{m}\left[-\log(f(x_i)) - \log(1 - f(x_i))\right]$$

but very quickly we see that this doesn’t include the conditional effect of whether $y_i$ is 0 or 1, thus, we rewrite each part as:

$-y_i\log(f(x_i))$ and $-(1-y_i)\log(1 - f(x_i))$ respectively. thus if $y_i = 0$, the first part goes to 0 and the other part is active, and when $y_i = 1$ the first part works and the second goes to 0. this is quite simple and elegant now that I think of it. a logical tit for tat.

thus the cost function $(J)$ becomes:

$$J(w,b) = \frac{1}{m} \sum_{i=1}^{m}\left[-y_i\log(f(x_i)) - (1-y_i)\log(1 - f(x_i))\right]$$

which we can observe graphically as such.

with this, the idea of minimizing the cost function is more intuitive since it is clearly the average of all the per-example losses for that $w, b$ state.
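a vectorized sketch of this cost, assuming numpy; the small clip is just a guard against $\log(0)$ and the toy data is made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y, eps=1e-15):
    """J(w, b) averaged over all m examples; X is (m, n), y is (m,)."""
    f = sigmoid(X @ w + b)                    # predictions for every example
    f = np.clip(f, eps, 1 - eps)              # avoid log(0)
    losses = -y * np.log(f) - (1 - y) * np.log(1 - f)
    return losses.mean()

# toy data: two features, four examples
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])
print(cost(np.array([1.0, 1.0]), -3.0, X, y))
```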

notes: to converge this with gradient descent we use the same equations and principle for gradient descent as univariate and multivariate vectorized linear regression; the only difference is that f(x) is now the sigmoid of $\vec{w} \cdot \vec{x} + b$ rather than a linear function.
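a rough sketch of those same update rules applied here, assuming numpy; the learning rate, iteration count and toy data are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Same update form as vectorized linear regression, but f is the sigmoid."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iters):
        f = sigmoid(X @ w + b)           # predictions for all m examples
        err = f - y                      # same (f - y) error term as before
        w -= alpha * (X.T @ err) / m     # dJ/dw
        b -= alpha * err.sum() / m       # dJ/db
    return w, b

X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])
w, b = gradient_descent(X, y)
print(sigmoid(X @ w + b).round(2))       # predictions move towards the targets
```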