logistic regression
here the predicted outputs are binary, either a 0 or a 1, so the challenge is finding a function that maps the input features to one of the two choices.
To do this, we use the sigmoid function, viz:

$$ f(x) = \frac{1}{1 + e^{-z}} $$

where $z = \vec{w} \cdot \vec{x} + b$.
This means if z is large and positive, $e^{-z}$ is very small and f(x) approaches 1 - thus, it is approximated as 1. And for the reverse, when z is large and negative, f(x) is approximately 0. This way all the data is squashed into the range between 0 and 1.
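As a quick sketch of that squashing in numpy (the function name sigmoid and the sample z values below are just mine for illustration):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued z into the (0, 1) range."""
    return 1 / (1 + np.exp(-z))

# large negative z -> output near 0; large positive z -> output near 1
print(sigmoid(np.array([-10, -1, 0, 1, 10])))
# approx [4.5e-05, 0.27, 0.5, 0.73, 0.99995]
```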
when applied to a dataset, this gives the characteristic S-shaped curve, with every prediction squeezed between 0 and 1.
cost function
to measure the loss for a prediction, the log of f(x) is used. however, a single equation cannot describe the loss for both targets: for a target of 0, a prediction near 0 has minimal loss and a prediction near 1 has essentially infinite loss, while it is the reverse for a target of 1. therefore there are two loss functions describing this mirrored logic, viz:
when the target $y^{(i)}$ is 1, the loss is

$$ L\!\left(f(x^{(i)}), y^{(i)}\right) = -\log\!\left(f(x^{(i)})\right) $$

and

when the target is 0, the loss is

$$ L\!\left(f(x^{(i)}), y^{(i)}\right) = -\log\!\left(1 - f(x^{(i)})\right) $$
plotted, these give two mirrored curves: each loss stays near 0 when the prediction matches the target and blows up as the prediction approaches the wrong extreme.
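A small sketch of the two branches (the helper name logistic_loss and the example predictions are mine, just to make the mirrored behaviour concrete):

```python
import numpy as np

def logistic_loss(f_x, y):
    """Loss for one example: -log(f) when target is 1, -log(1 - f) when target is 0."""
    if y == 1:
        return -np.log(f_x)
    return -np.log(1 - f_x)

# target 1: confident correct prediction -> tiny loss, confident wrong -> huge loss
print(logistic_loss(0.99, 1))   # ~0.01
print(logistic_loss(0.01, 1))   # ~4.6
# target 0: the logic flips
print(logistic_loss(0.01, 0))   # ~0.01
print(logistic_loss(0.99, 0))   # ~4.6
```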
however, in reality we care about minimizing the loss for predictions with targets 0 and 1 equally well.
Thus we need a single loss function that combines both versions of the equation.
To do this we use the previous idea of a cost function:

$$ J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} L\!\left(f(x^{(i)}), y^{(i)}\right) $$
but very quickly we see that this doesn't include the conditional effect of whether $y^{(i)}$ is 0 or 1, thus, we rewrite each part as:

$$ -y^{(i)} \log\!\left(f(x^{(i)})\right) $$

and

$$ -(1 - y^{(i)}) \log\!\left(1 - f(x^{(i)})\right) $$

respectively, giving the single expression

$$ L\!\left(f(x^{(i)}), y^{(i)}\right) = -y^{(i)} \log\!\left(f(x^{(i)})\right) - (1 - y^{(i)}) \log\!\left(1 - f(x^{(i)})\right) $$

thus if $y^{(i)} = 0$, the first part goes to 0 and the other part is active, and when $y^{(i)} = 1$ the first part works and the second goes to 0. this is quite simple and elegant now that I think of it, a logical tit for tat.
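A tiny sketch to sanity-check that the single combined expression really does reproduce the two branches (both helper names here are hypothetical, mine only):

```python
import numpy as np

def loss_piecewise(f_x, y):
    return -np.log(f_x) if y == 1 else -np.log(1 - f_x)

def loss_combined(f_x, y):
    """-y*log(f) - (1-y)*log(1-f): one term vanishes depending on y."""
    return -y * np.log(f_x) - (1 - y) * np.log(1 - f_x)

for f_x in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert np.isclose(loss_piecewise(f_x, y), loss_combined(f_x, y))
print("combined expression matches the piecewise one")
```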
thus the cost function becomes:

$$ J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\!\left(f(x^{(i)})\right) + (1 - y^{(i)}) \log\!\left(1 - f(x^{(i)})\right) \right] $$
which, plotted over w and b, forms a smooth, bowl-shaped (convex) surface with a single global minimum.
with this, the idea of minimizing the cost function is more intuitive since it is clearly the average of the individual losses over all examples for that w, b state.
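A minimal sketch of that cost over a made-up dataset; compute_cost, X_train and y_train are just names I'm using for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost(X, y, w, b):
    """Average logistic loss over all m examples for the given w, b."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)                              # predictions for every example
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)     # combined loss, per example
    return loss.sum() / m

# tiny made-up dataset: 4 examples, 2 features each
X_train = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y_train = np.array([0, 0, 1, 1])
print(compute_cost(X_train, y_train, w=np.array([1.0, -1.0]), b=0.0))  # ~0.35 for this w, b
```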
notes: to minimize this with gradient descent we use the same update equations and principle as for univariate and multivariate vectorized linear regression, only with f(x) now being the sigmoid of $\vec{w} \cdot \vec{x} + b$.
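A minimal gradient-descent sketch under that idea, with the (f(x) - y) error term driving the same-shaped updates; the learning rate, iteration count, function name, and made-up data are arbitrary choices of mine:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Fit w, b by repeatedly stepping opposite the gradient of the cost."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        f = sigmoid(X @ w + b)           # current predictions
        err = f - y                      # same error term as in linear regression
        w -= alpha * (X.T @ err) / m     # dJ/dw
        b -= alpha * err.sum() / m       # dJ/db
    return w, b

X_train = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y_train = np.array([0, 0, 1, 1])
w, b = gradient_descent(X_train, y_train)
print(w, b, sigmoid(X_train @ w + b))    # predictions move toward the 0/1 targets
```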