## F1 Score

### Definition of the F1 score

The F1 score is the harmonic mean of precision and recall, that is,

$F_1 = \left( \frac{recall^{-1} + precision^{-1}}{2} \right)^{-1} = 2 \cdot \frac{precision \cdot recall}{precision + recall}$

### F-beta score

$F_\beta$ is the general form of the F-measure, where $\beta$ is a positive real number. Setting $\beta = 1$ gives the F1 score defined above. $F_\beta$ is defined as follows:

$F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{(\beta^2 \cdot precision) + recall}$

$F_\beta = \frac{precision \cdot recall}{(\frac{\beta^2}{1 + \beta^2} \cdot precision) + \frac{recall}{1 + \beta^2}}$
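The formulas above can be checked with a few lines of Python. This is a minimal sketch: the function name `f_beta`, the sample precision and recall values, and the choice to return 0 when the denominator is zero are assumptions made for illustration, not part of the definition.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score from precision and recall; beta = 1 gives the F1 score."""
    if beta <= 0:
        raise ValueError("beta must be a positive real number")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Assumed convention for this sketch: score is 0 when both inputs are 0.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical values chosen only to exercise the formula.
precision, recall = 0.5, 1.0

f1 = f_beta(precision, recall)               # beta = 1
f2 = f_beta(precision, recall, beta=2.0)     # beta > 1 leans towards recall
f_half = f_beta(precision, recall, beta=0.5) # beta < 1 leans towards precision

# F1 should agree with the harmonic mean of precision and recall.
harmonic_mean = 2 / (1 / precision + 1 / recall)

print(f1, f2, f_half, harmonic_mean)
```

With recall higher than precision, as in this sample, a larger $\beta$ gives a higher score, because more weight is placed on recall.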

## Backpropagation

In a nutshell, backpropagation will consist of:

- Doing a feedforward operation.
- Comparing the output of the model with the desired output.
- Calculating the error.
- Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
- Using this to update the weights and get a better model.
- Continuing this until we have a model that is good.
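
As an example, take a network with two inputs $x_1, x_2$, two hidden units $h_1, h_2$, and a single output $\hat y$. The feedforward operation is:

$h_1 = W_{11}^{(1)}x_1 + W_{21}^{(1)}x_2 + W_{31}^{(1)}$

$h_2 = W_{12}^{(1)}x_1 + W_{22}^{(1)}x_2 + W_{32}^{(1)}$

$h = W_{11}^{(2)}\sigma(h_1) + W_{21}^{(2)}\sigma(h_2) + W_{31}^{(2)}$

$\hat y = \sigma(h)$

or, written as a composition,

$\hat y = \sigma \circ W^{(2)} \circ \sigma \circ W^{(1)}(x)$

The error is a function of all the weights, and its gradient collects the partial derivatives with respect to each of them:

$E(W) = -\frac{1}{m}\sum_{i=1}^{m} \left( y_i\ln(\hat y_i) + (1 - y_i)\ln(1 - \hat y_i) \right)$

$E(W) = E(W_{11}^{(1)}, W_{12}^{(1)}, \cdots, W_{31}^{(2)})$

$\nabla E = \left( \frac{\partial E}{\partial W_{11}^{(1)}}, \cdots, \frac{\partial E}{\partial W_{31}^{(2)}} \right)$

Each partial derivative follows from the chain rule, for example:

$\frac{\partial E}{\partial W_{11}^{(1)}} = \frac{\partial E}{\partial \hat y} \frac{\partial \hat y}{\partial h} \frac{\partial h}{\partial h_1} \frac{\partial h_1}{\partial W_{11}^{(1)}}$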
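To make the chain rule concrete, here is a small sketch that evaluates the four factors for $W_{11}^{(1)}$ and compares their product with a finite-difference estimate. The weight values, the input point, and the helper names (`forward`, `point_error`) are made up for illustration, and the error is taken for a single point rather than averaged over $m$ points.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(W1, W2, x):
    """Feedforward pass. W1[i][j] stands for W_{(i+1)(j+1)}^{(1)}; the last row holds the biases."""
    x1, x2 = x
    h1 = W1[0][0] * x1 + W1[1][0] * x2 + W1[2][0]
    h2 = W1[0][1] * x1 + W1[1][1] * x2 + W1[2][1]
    h = W2[0] * sigmoid(h1) + W2[1] * sigmoid(h2) + W2[2]
    return h1, h2, h, sigmoid(h)


def point_error(y, y_hat):
    """Cross-entropy error of a single point."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


# Arbitrary illustrative weights and one labelled point.
W1 = [[0.1, -0.2], [0.4, 0.3], [0.0, 0.1]]
W2 = [0.5, -0.6, 0.2]
x, y = (1.0, 2.0), 1.0

h1, h2, h, y_hat = forward(W1, W2, x)

# The four chain-rule factors for W_11^(1).
dE_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)
dyhat_dh = y_hat * (1 - y_hat)
dh_dh1 = W2[0] * sigmoid(h1) * (1 - sigmoid(h1))
dh1_dW11 = x[0]
grad_chain = dE_dyhat * dyhat_dh * dh_dh1 * dh1_dW11

# Central finite difference on the same weight, for comparison.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][0] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][0] -= eps
grad_numeric = (
    point_error(y, forward(W1_plus, W2, x)[3])
    - point_error(y, forward(W1_minus, W2, x)[3])
) / (2 * eps)

print(grad_chain, grad_numeric)
```

The two printed values should agree to several decimal places, which is a quick sanity check on the chain-rule expression.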

In order to minimize the error function, we need to take some derivatives. So let’s get our hands dirty and actually compute the derivative of the error function. The first thing to notice is that the sigmoid function has a really nice derivative. Namely,
$\sigma'(x) = \sigma(x) (1-\sigma(x))$
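This follows from the quotient rule:

$\sigma'(x) = \frac{\partial}{\partial x} \frac{1}{1+e^{-x}} = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x) (1-\sigma(x))$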

Now, let’s recall that if we have $m$ points labelled $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$, the error formula is:
$E = -\frac{1}{m} \sum_{i=1}^m \left( y_i \ln(\hat{y_i}) + (1-y_i) \ln (1-\hat{y_i}) \right)$
where the prediction is given by $\hat{y_i} = \sigma(Wx^{(i)} + b)$.

Our goal is to calculate the gradient of $E$, at a point $x = (x_1, \ldots, x_n)$, given by the partial derivatives
$\nabla E =\left(\frac{\partial}{\partial w_1}E, \cdots, \frac{\partial}{\partial w_n}E, \frac{\partial}{\partial b}E \right)$
To simplify our calculations, we’ll actually think of the error that each point produces, and calculate the derivative of this error. The total error, then, is the average of the errors at all the points. The error produced by each point is, simply,
$E = - y \ln(\hat{y}) - (1-y) \ln (1-\hat{y})$
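In order to calculate the derivative of this error with respect to the weights, we’ll first calculate $\frac{\partial}{\partial w_j} \hat{y}$. Recall that $\hat{y} = \sigma(Wx+b)$, so:

$\frac{\partial}{\partial w_j} \hat{y} = \frac{\partial}{\partial w_j} \sigma(Wx+b)$

$= \sigma(Wx+b) (1-\sigma(Wx+b)) \cdot \frac{\partial}{\partial w_j} (Wx+b)$

$= \hat{y} (1-\hat{y}) \cdot \frac{\partial}{\partial w_j} (w_1 x_1 + \cdots + w_j x_j + \cdots + w_n x_n + b)$

$= \hat{y} (1-\hat{y}) \cdot x_j$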

The last equality is because the only term in the sum which is not a constant with respect to $w_j$ is precisely $w_j x_j$, which clearly has derivative $x_j$.

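Now, we can go ahead and calculate the derivative of the error $E$ at a point $x$, with respect to the weight $w_j$, substituting $\frac{\partial}{\partial w_j} \hat{y} = \hat{y} (1-\hat{y}) x_j$:

$\frac{\partial}{\partial w_j} E = \frac{\partial}{\partial w_j} \left[ -y \ln(\hat{y}) - (1-y) \ln(1-\hat{y}) \right]$

$= -y \cdot \frac{1}{\hat{y}} \cdot \frac{\partial}{\partial w_j} \hat{y} + (1-y) \cdot \frac{1}{1-\hat{y}} \cdot \frac{\partial}{\partial w_j} \hat{y}$

$= -y (1-\hat{y}) x_j + (1-y) \hat{y} x_j$

$= -(y - \hat{y}) x_j$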

A similar calculation will show us that

$\frac{\partial}{\partial b} E = -(y - \hat{y})$

This actually tells us something very important. For a point with coordinates $(x_1, \ldots, x_n)$, label $y$, and prediction $\hat{y}$, the gradient of the error function at that point is $\left(-(y - \hat{y})x_1, \cdots, -(y - \hat{y})x_n, -(y - \hat{y}) \right)$. In summary, the gradient is
$\nabla E = -(y - \hat{y}) (x_1, \ldots, x_n, 1)$
If you think about it, this is fascinating. The gradient is actually a scalar times the coordinates of the point! And what is the scalar? Nothing less than a multiple of the difference between the label and the prediction. What significance does this have?

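The closer the prediction is to the label, the smaller the gradient, and the farther apart they are, the larger it is. So, a small gradient means we’ll change our coordinates by a little bit, and a large gradient means we’ll change our coordinates by a lot.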
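The closed form $\nabla E = -(y - \hat{y}) (x_1, \ldots, x_n, 1)$ can be sanity-checked numerically. The sketch below compares it with central finite differences for one point. The weights, the point, and the function names are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Prediction y_hat = sigma(Wx + b) for a single point."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def point_error(w, b, x, y):
    """Error produced by a single point."""
    y_hat = predict(w, b, x)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


def analytic_gradient(w, b, x, y):
    """Closed form: -(y - y_hat) * (x_1, ..., x_n, 1)."""
    scale = -(y - predict(w, b, x))
    return [scale * xi for xi in x] + [scale]


def numeric_gradient(w, b, x, y, eps=1e-6):
    """Central finite differences with respect to each weight, then the bias."""
    grads = []
    for j in range(len(w)):
        w_plus = list(w)
        w_plus[j] += eps
        w_minus = list(w)
        w_minus[j] -= eps
        grads.append(
            (point_error(w_plus, b, x, y) - point_error(w_minus, b, x, y)) / (2 * eps)
        )
    grads.append(
        (point_error(w, b + eps, x, y) - point_error(w, b - eps, x, y)) / (2 * eps)
    )
    return grads


# Arbitrary illustrative values.
w, b = [0.3, -0.7, 0.2], 0.1
x, y = [1.5, 0.5, -2.0], 1

analytic = analytic_gradient(w, b, x, y)
numeric = numeric_gradient(w, b, x, y)
print(analytic)
print(numeric)
```

Both lists should match closely, and every entry shares the same scalar factor $-(y - \hat{y})$.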

If this sounds anything like the perceptron algorithm, this is no coincidence! We’ll see it in a bit.

The gradient descent step simply subtracts a multiple of the gradient of the error function at each point, so it updates the weights in the following way:

$w_i' \leftarrow w_i -\alpha [-(y - \hat{y}) x_i]$,

which is equivalent to

$w_i' \leftarrow w_i + \alpha (y - \hat{y}) x_i$.

Similarly, it updates the bias in the following way:

$b' \leftarrow b + \alpha (y - \hat{y})$.

Note: Since we’ve taken the average of the errors, the term we are adding should be $\frac{1}{m} \cdot \alpha$ instead of $\alpha$. But since $\frac{1}{m}$ is a constant, we simplify the calculations by taking $\frac{1}{m} \cdot \alpha$ to be our learning rate and, abusing notation, just calling it $\alpha$.
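Put together, the update rules above give a very short training loop. The following is only a sketch: the toy data set, the learning rate of 0.5, the 200 epochs, and the zero initialization are all assumptions picked for illustration, and the update is applied point by point with the $\frac{1}{m}$ factor absorbed into $\alpha$ as described in the note.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def mean_error(w, b, points):
    """Average cross-entropy error over all points."""
    total = 0.0
    for x, y in points:
        y_hat = predict(w, b, x)
        total += -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)
    return total / len(points)


def train(points, alpha=0.5, epochs=200):
    """Apply w_i <- w_i + alpha (y - y_hat) x_i and b <- b + alpha (y - y_hat) per point."""
    w = [0.0] * len(points[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in points:
            y_hat = predict(w, b, x)
            w = [wi + alpha * (y - y_hat) * xi for wi, xi in zip(w, x)]
            b = b + alpha * (y - y_hat)
    return w, b


# Hypothetical, linearly separable toy data: (point, label).
points = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((0.9, 0.8), 1), ((1.0, 1.0), 1)]

initial_error = mean_error([0.0, 0.0], 0.0, points)
w, b = train(points)
final_error = mean_error(w, b, points)

print("error before training:", initial_error)
print("error after training:", final_error)
```

On this toy data the error after training should be lower than the starting value of $\ln 2$, which is what zero weights give for every point.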

## Remove punctuation from string in Python

Say we have the string “Hello, are you still there?” and want to transform it into “Hello are you still there”. How do we do it?

Before going ahead, we first need to define what counts as punctuation. An easy way is to use `string.punctuation`. Let’s check its value:

    import string
    print(string.punctuation)


The code above prints a string containing all ASCII punctuation characters:

    !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Now let’s do the transformation with `str.translate`:

    import string

    # Create a string to operate on
    s = "Hello, are you still there?"

    # Create a translation table that deletes every punctuation character
    translator = str.maketrans('', '', string.punctuation)

    # Apply the translation
    s = s.translate(translator)

    # Check the result
    print(s)  # prints "Hello are you still there"
