F1 Score

Definition of the F1 Score

The F1 score is the harmonic mean of precision and recall, that is, $F_1 = \left( \frac{recall^{-1} + precision^{-1}}{2} \right)^{-1} = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

The F-beta score $F_\beta$ is the general form of the F-measure, where $\beta$ is a positive real number. Setting $\beta = 1$ gives the F1 score described above. $F_\beta$ is defined as follows: $F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{(\beta^2 \cdot precision) + recall}$, or equivalently $F_\beta = \frac{precision \cdot recall}{(\frac{\beta^2}{1 + \beta^2} \cdot precision) + \frac{recall}{1 + \beta^2}}$
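As a quick check of the two formulas, here is a minimal Python sketch. The function name `f_beta`, the zero-division convention, and the sample precision and recall values are assumptions made for illustration only.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score from precision and recall; beta=1 gives the F1 score."""
    if precision == 0 and recall == 0:
        # Assumed convention: return 0 instead of dividing by zero.
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values, chosen only for illustration.
p, r = 0.5, 1.0
f1 = f_beta(p, r)          # harmonic mean of p and r
f2 = f_beta(p, r, beta=2)  # beta > 1 weights recall more heavily
print(f1, f2)
```

With recall higher than precision, as in this example, the $\beta = 2$ score comes out higher than the F1 score, because a larger $\beta$ shifts the weight towards recall.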

Backpropagation

In a nutshell, backpropagation will consist of:

• Doing a feedforward operation.
• Comparing the output of the model with the desired output.
• Calculating the error.
• Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
• Using this to update the weights, and get a better model.
• Continuing this until we have a model that is good.

For a network with two inputs, two hidden units, and one output, the feedforward operation is:

$h_1 = W_{11}^{(1)}x_1 + W_{21}^{(1)}x_2 + W_{31}^{(1)}$

$h_2 = W_{12}^{(1)}x_1 + W_{22}^{(1)}x_2 + W_{32}^{(1)}$

$h = W_{11}^{(2)}\sigma(h_1) + W_{21}^{(2)}\sigma(h_2) + W_{31}^{(2)}$

$\hat y = \sigma(h)$

$\hat y = \sigma \,\circ W^{(2)} \,\circ \sigma \,\circ W^{(1)}(x)$

The error function and its gradient are:

$E(W) = -\frac{1}{m}\sum_{i=1}^{m}\left( y_i\ln(\hat y_i) + (1 - y_i)\ln(1 - \hat y_i) \right)$

$E(W) = E(W_{11}^{(1)}, W_{12}^{(1)}, \cdots, W_{31}^{(2)})$

$\nabla E = \left(\frac{\partial E}{\partial W_{11}^{(1)}}, \cdots, \frac{\partial E}{\partial W_{31}^{(2)}}\right)$

$\frac{\partial E}{\partial W_{11}^{(1)}} = \frac{\partial E}{\partial \hat y} \frac{\partial \hat y}{\partial h} \frac{\partial h}{\partial h_1} \frac{\partial h_1}{\partial W_{11}^{(1)}}$
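The chain-rule product for $\frac{\partial E}{\partial W_{11}^{(1)}}$ can be sketched in plain Python for a single data point (so $m = 1$). The weight layout, the function names, and the numeric values below are assumptions chosen for illustration, not part of the original material.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, W2):
    """Feedforward pass. W1[i][j] stands for W_{(i+1)(j+1)}^{(1)} and
    W2[i] for W_{(i+1)1}^{(2)}; the last row/entry holds the bias."""
    x1, x2 = x
    h1 = W1[0][0] * x1 + W1[1][0] * x2 + W1[2][0]
    h2 = W1[0][1] * x1 + W1[1][1] * x2 + W1[2][1]
    h = W2[0] * sigmoid(h1) + W2[1] * sigmoid(h2) + W2[2]
    return h1, h2, sigmoid(h)


def point_error(y, y_hat):
    """Cross-entropy error of a single point."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


def grad_w11_layer1(x, y, W1, W2):
    """dE/dW_11^(1) as the product of the four chain-rule factors."""
    h1, _, y_hat = forward(x, W1, W2)
    dE_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)
    dyhat_dh = y_hat * (1 - y_hat)
    dh_dh1 = W2[0] * sigmoid(h1) * (1 - sigmoid(h1))
    dh1_dw11 = x[0]
    return dE_dyhat * dyhat_dh * dh_dh1 * dh1_dw11


# Hypothetical weights and data point.
W1 = [[0.5, -0.3], [0.2, 0.8], [0.1, -0.1]]
W2 = [0.7, -0.4, 0.2]
x, y = (1.0, 2.0), 1
print(grad_w11_layer1(x, y, W1, W2))
```

One way to gain confidence in such a derivative is to compare it with a finite-difference estimate: nudge $W_{11}^{(1)}$ by a small amount in both directions and divide the change in error by the size of the nudge.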

In order to minimize the error function, we need to take some derivatives. So let’s get our hands dirty and actually compute the derivative of the error function. The first thing to notice is that the sigmoid function has a really nice derivative. Namely, $\sigma'(x) = \sigma(x) (1-\sigma(x))$
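The reason for this is the following calculation, which uses the quotient rule: $\sigma'(x) = \frac{\partial}{\partial x} \frac{1}{1+e^{-x}} = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x)(1-\sigma(x))$

And now, let’s recall that if we have $m$ points labelled $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$, the error formula is: $E = -\frac{1}{m} \sum_{i=1}^m \left( y_i \ln(\hat{y_i}) + (1-y_i) \ln (1-\hat{y_i}) \right)$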
where the prediction is given by $\hat{y_i} = \sigma(Wx^{(i)} + b)$.
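Both facts can be checked numerically. The sketch below, in plain Python, compares $\sigma(x)(1-\sigma(x))$ with a finite-difference estimate of $\sigma'(x)$, and evaluates the error formula on a tiny made-up dataset. The function names and the data are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Closed form of the derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1 - s)


def predict(W, b, x):
    """y_hat = sigma(Wx + b) for one point x."""
    return sigmoid(sum(w * xi for w, xi in zip(W, x)) + b)


def error(W, b, X, Y):
    """Average cross-entropy error over the m points in X."""
    m = len(X)
    total = 0.0
    for x, y in zip(X, Y):
        y_hat = predict(W, b, x)
        total += y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return -total / m


# Finite-difference estimate of the derivative at an arbitrary point.
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(numeric, sigmoid_prime(z))

# Hypothetical two-point dataset; with all-zero weights every prediction is 0.5.
X, Y = [[1.0, 2.0], [-1.0, 0.5]], [1, 0]
print(error([0.0, 0.0], 0.0, X, Y))
```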

Our goal is to calculate the gradient of $E$, at a point $x = (x_1, \ldots, x_n)$, given by the partial derivatives $\nabla E =\left(\frac{\partial}{\partial w_1}E, \cdots, \frac{\partial}{\partial w_n}E, \frac{\partial}{\partial b}E \right)$
To simplify our calculations, we’ll actually think of the error that each point produces, and calculate the derivative of this error. The total error, then, is the average of the errors at all the points. The error produced by each point is, simply, $E = - y \ln(\hat{y}) - (1-y) \ln (1-\hat{y})$
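In order to calculate the derivative of this error with respect to the weights, we’ll first calculate $\frac{\partial}{\partial w_j} \hat{y}$. Recall that $\hat{y} = \sigma(Wx+b)$, so: $\frac{\partial}{\partial w_j} \hat{y} = \frac{\partial}{\partial w_j} \sigma(Wx+b) = \sigma(Wx+b)(1-\sigma(Wx+b)) \cdot \frac{\partial}{\partial w_j}(Wx+b) = \hat{y}(1-\hat{y}) \cdot \frac{\partial}{\partial w_j}(w_1 x_1 + \cdots + w_j x_j + \cdots + w_n x_n + b) = \hat{y}(1-\hat{y}) \cdot x_j$

The last equality holds because the only term in the sum which is not a constant with respect to $w_j$ is precisely $w_j x_j$, which clearly has derivative $x_j$.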

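Now, we can go ahead and calculate the derivative of the error $E$ at a point $x$, with respect to the weight $w_j$: $\frac{\partial}{\partial w_j} E = \frac{\partial}{\partial w_j}\left[-y\ln(\hat{y}) - (1-y)\ln(1-\hat{y})\right] = -y \cdot \frac{1}{\hat{y}} \cdot \frac{\partial}{\partial w_j}\hat{y} - (1-y) \cdot \frac{1}{1-\hat{y}} \cdot \frac{\partial}{\partial w_j}(1-\hat{y}) = -y(1-\hat{y})x_j + (1-y)\hat{y}x_j = -(y-\hat{y})x_j$

A similar calculation will show us that $\frac{\partial}{\partial b} E = -(y - \hat{y})$

This actually tells us something very important. For a point with coordinates $(x_1, \ldots, x_n)$, label $y$, and prediction $\hat{y}$, the gradient of the error function at that point is $\left(-(y - \hat{y})x_1, \cdots, -(y - \hat{y})x_n, -(y - \hat{y}) \right)$. In summary, the gradient is $\nabla E = -(y - \hat{y}) (x_1, \ldots, x_n, 1)$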
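The closed form $\nabla E = -(y - \hat{y})(x_1, \ldots, x_n, 1)$ can be compared against finite differences of the single-point error. This is a small sketch in plain Python; the helper names and the sample values are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(W, b, x):
    return sigmoid(sum(w * xi for w, xi in zip(W, x)) + b)


def point_error(W, b, x, y):
    """Error produced by a single point with label y."""
    y_hat = predict(W, b, x)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


def gradient(W, b, x, y):
    """Analytic gradient: -(y - y_hat) * (x_1, ..., x_n, 1)."""
    scale = -(y - predict(W, b, x))
    return [scale * xi for xi in x] + [scale]


def numeric_gradient(W, b, x, y, eps=1e-6):
    """Central finite differences with respect to each w_j, then b."""
    grads = []
    for j in range(len(W)):
        Wp, Wm = W[:], W[:]
        Wp[j] += eps
        Wm[j] -= eps
        grads.append((point_error(Wp, b, x, y) - point_error(Wm, b, x, y)) / (2 * eps))
    grads.append((point_error(W, b + eps, x, y) - point_error(W, b - eps, x, y)) / (2 * eps))
    return grads


# Hypothetical weights, bias, point, and label.
W, b, x, y = [0.3, -0.2], 0.1, [1.0, 2.0], 1
print(gradient(W, b, x, y))
print(numeric_gradient(W, b, x, y))
```

The two printed lists should agree to several decimal places, which is a useful sanity check before relying on a hand-derived gradient.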
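If you think about it, this is fascinating. The gradient is actually a scalar times the coordinates of the point! And what is the scalar? Nothing less than a multiple of the difference between the label and the prediction. What significance does this have? The closer the prediction is to the label, the smaller the gradient; the farther the prediction is from the label, the larger the gradient.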

So, a small gradient means we’ll change our coordinates by a little bit, and a large gradient means we’ll change our coordinates by a lot.

If this sounds anything like the perceptron algorithm, this is no coincidence! We’ll see it in a bit.

Since the gradient descent step simply consists of subtracting a multiple of the gradient of the error function at every point, it updates the weights in the following way: $w_i' \leftarrow w_i -\alpha [-(y - \hat{y}) x_i]$,

which is equivalent to $w_i' \leftarrow w_i + \alpha (y - \hat{y}) x_i$.

Similarly, it updates the bias in the following way: $b' \leftarrow b + \alpha (y - \hat{y})$.

Note: Since we’ve taken the average of the errors, the term we are adding should be $\frac{1}{m} \cdot \alpha$ instead of $\alpha$, but as $\alpha$ is a constant, then in order to simplify calculations, we’ll just take $\frac{1}{m} \cdot \alpha$ to be our learning rate, and abuse the notation by just calling it $\alpha$.
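A single gradient descent step with these update rules can be sketched as follows, in plain Python. The function name `gradient_step`, the learning rate, and the sample point are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(W, b, x):
    return sigmoid(sum(w * xi for w, xi in zip(W, x)) + b)


def gradient_step(W, b, x, y, alpha):
    """Apply w_i <- w_i + alpha * (y - y_hat) * x_i and
    b <- b + alpha * (y - y_hat) for one point."""
    diff = y - predict(W, b, x)
    new_W = [w + alpha * diff * xi for w, xi in zip(W, x)]
    new_b = b + alpha * diff
    return new_W, new_b


# Hypothetical starting weights, point, label, and learning rate.
W, b = [0.0, 0.0], 0.0
x, y, alpha = [1.0, 2.0], 1, 0.1

before = predict(W, b, x)
W, b = gradient_step(W, b, x, y, alpha)
after = predict(W, b, x)
print(before, after)
```

Starting from all-zero weights the prediction is 0.5. Since the label is 1, the step moves the weights so that the prediction on this point increases.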

At this point, it seems that we’ve seen two ways of doing linear regression.

• By applying the squared (or absolute) trick at every point in our data one by one, and repeating this process many times.
• By applying the squared (or absolute) trick at every point in our data all at the same time, and repeating this process many times.

More specifically, the squared (or absolute) trick, when applied to a point, gives us some values to add to the weights of the model. We can add these values, update our weights, and then apply the squared (or absolute) trick on the next point. Or we can calculate these values for all the points, add them, and then update the weights with the sum of these values.

The latter is called batch gradient descent. The former is called stochastic gradient descent. The question is, which one is used in practice?

Actually, in most cases, neither. Think about this: if your data is huge, both are a bit slow, computationally. The best way to do linear regression is to split your data into many small batches, each with roughly the same number of points. Then, use each batch to update your weights. This is called mini-batch gradient descent.
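The idea can be sketched in plain Python for a line $\hat{y} = wx + b$. This sketch assumes the squared trick means the update $w \leftarrow w + \alpha (y - \hat{y}) x$, $b \leftarrow b + \alpha (y - \hat{y})$, averaged over each batch. The function name, batch size, learning rate, and data are all hypothetical.

```python
import random


def minibatch_gd(points, batch_size=4, alpha=0.5, epochs=1000, seed=0):
    """Fit y = w*x + b by mini-batch gradient descent on (x, y) pairs."""
    rng = random.Random(seed)
    data = list(points)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new random batches every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Collect the squared-trick updates for the whole batch first,
            # then apply their average in a single step.
            dw = sum((y - (w * x + b)) * x for x, y in batch) / len(batch)
            db = sum((y - (w * x + b)) for x, y in batch) / len(batch)
            w += alpha * dw
            b += alpha * db
    return w, b


# Hypothetical noiseless data on the line y = 2x + 1.
points = [(i / 10, 2 * (i / 10) + 1) for i in range(10)]
w, b = minibatch_gd(points)
print(w, b)
```

Setting `batch_size=1` in this sketch gives stochastic gradient descent, and setting it to `len(points)` gives batch gradient descent, so the two earlier approaches are the two extremes of the same loop.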