Mount disks of extfat format on Ubuntu

I encountered a problem when mounting a flash disk of format extfat on Ubuntu, I finally solved this by installing two packages:

sudo apt install exfat-fuse exfat-utils

The above solution is found at site



阅读别人的博文,尤其是优秀的博文,可以让一个人获益很多。对于机器学习,存在一些优秀的博文是需要认真阅读学习的,它们主要来自Kaggle Competition Winners,发布在Kaggle的官方博客上。在这些博文中,作者站在自己的视角上解释了怎么做特征工程和怎么进行模型选择,这是我们在做机器学习时都要解决的问题,这些博文给我们提供了一个很好的可借鉴的经验,并且其相比官方文档而言在模型解释方面可读性更高,更易理解。



F1 得分

在这篇帖子中我谈下对F1 得分(F1 score)的理解,如果你想了解更多,请移步这里

F1 得分的定义

F1 得分是精度和召回率的调和平均数,即

F_1 = \left ( \frac{recall^{-1} + precision^{-1}}{2} \right ) = 2 \cdot \frac{ precision * recall}{precision + recall}

其中,precisionrecall 分别表示精度和召回率。这两个概念不再在这里进行解释,如果你不确定自己是否已掌握或清楚,请参考这里这里


F-beta 得分

F_\beta 是一般形式的F度量,其中\beta是一个正实数,当\beta为1时,我们即获得前面提到的F1得分。F_\beta 的定义如下:

F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{(\beta^2 \cdot precision) + recall}

可以看到,\beta在这里扮演着十分重要的角色,那么当\beta 变化时F_\beta 如何改变呢?


F_\beta = \frac{precision \cdot recall}{(\frac{\beta^2}{1 + \beta^2} \cdot precision) + \frac{recall}{1 + \beta^2}}

从以上公式出发,当\beta不断变小接近零时,在分母中\frac{\beta^2}{1 + \beta^2}不断接近0,1 + \beta^2不断接近1,整个F_\beta接近precision。类似的分析可以得到,当\beta不断变大时,整个F_\beta接近recall





In a nutshell, backpropagation will consist of:

  • Doing a feedforward operation.
  • Comparing the output of the model with the desired output.
  • Calculating the error.
  • Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
  • Use this to update the weights, and get a better model.
  • Continue this until we have a model that is good.
h_1 = W_{11}^{(1)}x_1 +  W_{21}^{(1)}x_2 +  W_{31}^{(1)} \\ h_2 = W_{12}^{(1)}x_1 +  W_{22}^{(1)}x_2 +  W_{32}^{(1)} \\ h = W_{11}^{(2)}\sigma(h_1) +  W_{21}^{(2)}\sigma(h_2) +  W_{31}^{(2)} \\ \hat y = \sigma(h) \\ \hat y = \sigma \,\circ W^{(2)} \,\circ \sigma \,\circ W^{(1)}(x)
E(W) = -\frac{1}{m}\sum_{i=1}^{m}y_i\ln(\hat y_i) + (1 - y_i)\ln(1 - \hat y_i) \\ E(W) = E(W_{11}^{(1)}, W_{12}^{(1)}, \cdots, W_{31}^{(2)}) \\ \nabla E = (\frac{\partial E}{\partial W_{11}^{(1)}}, \cdots, \frac{\partial E}{\partial W_{31}^{(2)}}) \\ \frac{\partial E}{\partial W_{11}^{(1)}} = \frac{\partial E}{\partial \hat y} \frac{\partial \hat y}{\partial h} \frac{\partial h}{\partial h_1} \frac{\partial h_1}{\partial W_{11}^{(1)}}

Gradient Calculation

In order to minimize the error function, we need to take some derivatives. So let’s get our hands dirty and actually compute the derivative of the error function. The first thing to notice is that the sigmoid function has a really nice derivative. Namely,
\sigma'(x) = \sigma(x) (1-\sigma(x))
The reason for this is the following, we can calculate it using the quotient formula:


And now, let’s recall that if we have m points labelled x^{(1)}, x^{(2)}, \ldots, x^{(m)},the error formula is:
E = -\frac{1}{m} \sum_{i=1}^m \left( y_i \ln(\hat{y_i}) + (1-y_i) \ln (1-\hat{y_i}) \right)
where the prediction is given by \hat{y_i} = \sigma(Wx^{(i)} + b).

Our goal is to calculate the gradient of E, at a point x = (x_1, \ldots, x_n), given by the partial derivatives
\nabla E =\left(\frac{\partial}{\partial w_1}E, \cdots, \frac{\partial}{\partial w_n}E, \frac{\partial}{\partial b}E \right)
To simplify our calculations, we’ll actually think of the error that each point produces, and calculate the derivative of this error. The total error, then, is the average of the errors at all the points. The error produced by each point is, simply,
E = - y \ln(\hat{y}) - (1-y) \ln (1-\hat{y})
In order to calculate the derivative of this error with respect to the weights, we’ll first calculate \frac{\partial}{\partial w_j} \hat{y}. Recall that \hat{y} = \sigma(Wx+b), so:


The last equality is because the only term in the sum which is not a constant with respect to w_j is precisely w_j x_j, which clearly has derivative x_j.

Now, we can go ahead and calculate the derivative of the error E at a point x, with respect to the weight w_j.


A similar calculation will show us that


This actually tells us something very important. For a point with coordinates (x_1, \ldots, x_n), label y, and prediction \hat{y}, the gradient of the error function at that point is \left(-(y - \hat{y})x_1, \cdots, -(y - \hat{y})x_n, -(y - \hat{y}) \right). In summary, the gradient is
\nabla E = -(y - \hat{y}) (x_1, \ldots, x_n, 1)
If you think about it, this is fascinating. The gradient is actually a scalar times the coordinates of the point! And what is the scalar? Nothing less than a multiple of the difference between the label and the prediction. What significance does this have?

So, a small gradient means we’ll change our coordinates by a little bit, and a large gradient means we’ll change our coordinates by a lot.

If this sounds anything like the perceptron algorithm, this is no coincidence! We’ll see it in a bit.

Gradient Descent Step

Therefore, since the gradient descent step simply consists in subtracting a multiple of the gradient of the error function at every point, then this updates the weights in the following way:

w_i' \leftarrow w_i -\alpha [-(y - \hat{y}) x_i],

which is equivalent to

w_i' \leftarrow w_i + \alpha (y - \hat{y}) x_i.

Similarly, it updates the bias in the following way:

b' \leftarrow b + \alpha (y - \hat{y}),

Note: Since we’ve taken the average of the errors, the term we are adding should be \frac{1}{m} \cdot \alpha instead of \alpha, but as \alpha is a constant, then in order to simplify calculations, we’ll just take \frac{1}{m} \cdot \alpha to be our learning rate, and abuse the notation by just calling it \alpha.