A List of Machine Learning Blog Posts

Reading other people's blog posts, especially good ones, can teach you a great deal. For machine learning, there is a set of excellent posts worth studying carefully, most of them written by Kaggle competition winners and published on Kaggle's official blog. In these posts the authors explain, from their own perspective, how they approached feature engineering and model selection, two problems everyone doing machine learning has to solve. They offer valuable experience to borrow from, and compared with official documentation they tend to be more readable and easier to understand when it comes to explaining models.

Without further ado, let's see which posts there are:

This list may grow quite long over time. I will come back and read through it from time to time.


The F1 Score

In this post I will talk about my understanding of the F1 score; if you want to learn more, see here.

Definition of the F1 score

The F1 score is the harmonic mean of precision and recall, i.e.

F_1 = \left ( \frac{recall^{-1} + precision^{-1}}{2} \right )^{-1} = 2 \cdot \frac{ precision \cdot recall}{precision + recall}

Here precision and recall denote the precision and the recall. These two concepts are not explained again here; if you are not sure you have fully grasped them, see here and here.

As the definition shows, when we use the F1 score as a metric we treat precision and recall equally; "equally" here means that precision and recall receive the same weight. Later we will see examples with different weights.
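As a minimal sketch, the definition translates directly into a small function (in practice you would usually let a library such as scikit-learn compute this from raw labels; the function below just applies the formula):

```python
def f1_score(precision, recall):
    """F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Equal weighting: one poor component drags the score down,
# even when the other component is perfect.
print(f1_score(0.9, 0.9))   # 0.9 (harmonic mean of equal values is that value)
print(f1_score(1.0, 0.1))   # about 0.18, far below the arithmetic mean 0.55
```

Note how the harmonic mean punishes imbalance: a model with perfect precision but 0.1 recall scores far worse than one that is merely decent at both.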

The F-beta Score

F_\beta is the general form of the F measure, where \beta is a positive real number; when \beta is 1 we recover the F1 score above. F_\beta is defined as follows:

F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{(\beta^2 \cdot precision) + recall}

As you can see, \beta plays a very important role here. So how does F_\beta change as \beta varies?

Let us rearrange the definition of F_\beta slightly by dividing both the numerator and the denominator by 1+\beta^2, which gives:

F_\beta = \frac{precision \cdot recall}{(\frac{\beta^2}{1 + \beta^2} \cdot precision) + \frac{recall}{1 + \beta^2}}

Starting from this formula: as \beta shrinks toward zero, \frac{\beta^2}{1 + \beta^2} in the denominator approaches 0 and \frac{1}{1 + \beta^2} approaches 1, so F_\beta approaches precision. A similar analysis shows that as \beta grows large, F_\beta approaches recall.

Conclusion:

When we choose a small \beta we care more about precision, and when we choose a large \beta we care more about recall. Correspondingly, two common metrics are F_{0.5} and F_2.
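To see the weighting in action, here is a small sketch of the general formula, evaluated with made-up precision and recall values:

```python
def fbeta_score(precision, recall, beta):
    """General F-beta: beta < 1 leans toward precision, beta > 1 toward recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom

# A model with high precision but low recall: F_0.5 rewards it
# much more than F_2 does.
p, r = 0.9, 0.3
print(fbeta_score(p, r, 0.5))  # pulled toward precision (0.9)
print(fbeta_score(p, r, 1.0))  # the plain F1 score
print(fbeta_score(p, r, 2.0))  # pulled toward recall (0.3)
```

With these numbers the three scores decrease monotonically from F_{0.5} to F_2, exactly as the analysis above predicts.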

Comments welcome!

Backpropagation

In a nutshell, backpropagation will consist of:

  • Doing a feedforward operation.
  • Comparing the output of the model with the desired output.
  • Calculating the error.
  • Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
  • Using this to update the weights, and get a better model.
  • Continuing this process until we have a good model.
For example, for a network with two inputs, two hidden units, and one output (the W_{3j} terms multiply a constant input of 1 and play the role of biases), the feedforward operation is:

h_1 = W_{11}^{(1)}x_1 +  W_{21}^{(1)}x_2 +  W_{31}^{(1)} \\ h_2 = W_{12}^{(1)}x_1 +  W_{22}^{(1)}x_2 +  W_{32}^{(1)} \\ h = W_{11}^{(2)}\sigma(h_1) +  W_{21}^{(2)}\sigma(h_2) +  W_{31}^{(2)} \\ \hat y = \sigma(h)

or, written as a composition of functions:

\hat y = \sigma \,\circ W^{(2)} \,\circ \sigma \,\circ W^{(1)}(x)

The error function, viewed as a function of all the weights, and its gradient are:

E(W) = -\frac{1}{m}\sum_{i=1}^{m}y_i\ln(\hat y_i) + (1 - y_i)\ln(1 - \hat y_i) \\ E(W) = E(W_{11}^{(1)}, W_{12}^{(1)}, \cdots, W_{31}^{(2)}) \\ \nabla E = \left(\frac{\partial E}{\partial W_{11}^{(1)}}, \cdots, \frac{\partial E}{\partial W_{31}^{(2)}}\right)

and each entry of the gradient is computed with the chain rule, e.g.

\frac{\partial E}{\partial W_{11}^{(1)}} = \frac{\partial E}{\partial \hat y} \frac{\partial \hat y}{\partial h} \frac{\partial h}{\partial h_1} \frac{\partial h_1}{\partial W_{11}^{(1)}}
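The feedforward operation above can be sketched in NumPy as follows. The weight values here are made up for illustration; `W1[i][j]` corresponds to W_{ij}^{(1)}, with the third row of each matrix holding the bias terms (the W_{3j} entries, which multiply a constant input of 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for the 2-2-1 network in the equations above.
W1 = np.array([[0.5, -0.2],
               [0.3,  0.8],
               [0.1, -0.1]])   # 2 inputs + bias -> 2 hidden units
W2 = np.array([[0.7],
               [-0.4],
               [0.2]])          # 2 hidden units + bias -> 1 output

def feedforward(x):
    x1 = np.append(x, 1.0)        # append the constant 1 for the bias term
    hidden = sigmoid(x1 @ W1)     # sigma(h_1), sigma(h_2)
    h = np.append(hidden, 1.0) @ W2
    return sigmoid(h)[0]          # y_hat

print(feedforward(np.array([0.1, 0.4])))
```

Because the final activation is a sigmoid, the output is always a value between 0 and 1, which we interpret as a probability.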

Gradient Calculation

In order to minimize the error function, we need to take some derivatives. So let’s get our hands dirty and actually compute the derivative of the error function. The first thing to notice is that the sigmoid function has a really nice derivative. Namely,
\sigma'(x) = \sigma(x) (1-\sigma(x))
To see why, we can compute the derivative directly using the quotient rule:

\sigma'(x) = \frac{d}{dx} \frac{1}{1+e^{-x}} = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x)(1 - \sigma(x))
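A quick numerical sanity check of this identity, comparing the closed form against a central finite-difference approximation of the derivative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Compare sigma(x) * (1 - sigma(x)) with a central finite difference
# at an arbitrary point.
x, eps = 0.7, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid(x) * (1 - sigmoid(x))
print(numeric, analytic)  # the two values agree to many decimal places
```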

And now, let’s recall that if we have m points labelled x^{(1)}, x^{(2)}, \ldots, x^{(m)}, the error formula is:
E = -\frac{1}{m} \sum_{i=1}^m \left( y_i \ln(\hat{y_i}) + (1-y_i) \ln (1-\hat{y_i}) \right)
where the prediction is given by \hat{y_i} = \sigma(Wx^{(i)} + b).

Our goal is to calculate the gradient of E, at a point x = (x_1, \ldots, x_n), given by the partial derivatives
\nabla E =\left(\frac{\partial}{\partial w_1}E, \cdots, \frac{\partial}{\partial w_n}E, \frac{\partial}{\partial b}E \right)
To simplify our calculations, we’ll actually think of the error that each point produces, and calculate the derivative of this error. The total error, then, is the average of the errors at all the points. The error produced by each point is, simply,
E = - y \ln(\hat{y}) - (1-y) \ln (1-\hat{y})
In order to calculate the derivative of this error with respect to the weights, we’ll first calculate \frac{\partial}{\partial w_j} \hat{y}. Recall that \hat{y} = \sigma(Wx+b), so:

\frac{\partial}{\partial w_j}\hat{y} = \frac{\partial}{\partial w_j}\sigma(Wx+b) = \sigma(Wx+b)(1-\sigma(Wx+b)) \cdot \frac{\partial}{\partial w_j}(Wx+b) = \hat{y}(1-\hat{y}) \cdot \frac{\partial}{\partial w_j}(w_1 x_1 + \cdots + w_j x_j + \cdots + w_n x_n + b) = \hat{y}(1-\hat{y}) x_j

The last equality is because the only term in the sum which is not a constant with respect to w_j is precisely w_j x_j, which clearly has derivative x_j.

Now, we can go ahead and calculate the derivative of the error E at a point x, with respect to the weight w_j.

\frac{\partial}{\partial w_j} E = -y \frac{1}{\hat{y}} \frac{\partial \hat{y}}{\partial w_j} - (1-y) \frac{1}{1-\hat{y}} \left(-\frac{\partial \hat{y}}{\partial w_j}\right) = -y(1-\hat{y})x_j + (1-y)\hat{y}x_j = -(y - \hat{y})x_j

A similar calculation will show us that

\frac{\partial}{\partial b} E = -(y - \hat{y})

This actually tells us something very important. For a point with coordinates (x_1, \ldots, x_n), label y, and prediction \hat{y}, the gradient of the error function at that point is \left(-(y - \hat{y})x_1, \cdots, -(y - \hat{y})x_n, -(y - \hat{y}) \right). In summary, the gradient is
\nabla E = -(y - \hat{y}) (x_1, \ldots, x_n, 1)
If you think about it, this is fascinating. The gradient is actually a scalar times the coordinates of the point! And what is the scalar? Nothing less than a multiple of the difference between the label and the prediction. What significance does this have?

If the label is close to the prediction, the point is well classified and the gradient is small; if they are far apart, the gradient is large. So, a small gradient means we’ll change our coordinates by a little bit, and a large gradient means we’ll change our coordinates by a lot.

If this sounds anything like the perceptron algorithm, this is no coincidence! We’ll see it in a bit.
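As a sanity check, the closed-form gradient -(y - \hat{y})(x_1, \ldots, x_n, 1) can be compared against finite differences of the error at a single point (the weights, bias, and point below are made up):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def error(w, b, x, y):
    """Cross-entropy error of a single point: -y ln(y_hat) - (1-y) ln(1-y_hat)."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

# A single made-up point.
w, b = [0.2, -0.5], 0.1
x, y = [1.5, 0.3], 1

y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
closed_form = [-(y - y_hat) * xi for xi in x] + [-(y - y_hat)]

# Central finite-difference check of each partial derivative.
eps = 1e-6
numeric = []
for j in range(len(w)):
    wp, wm = w[:], w[:]
    wp[j] += eps
    wm[j] -= eps
    numeric.append((error(wp, b, x, y) - error(wm, b, x, y)) / (2 * eps))
numeric.append((error(w, b + eps, x, y) - error(w, b - eps, x, y)) / (2 * eps))

print(closed_form)
print(numeric)  # matches the closed form closely
```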

Gradient Descent Step

Therefore, since the gradient descent step simply consists of subtracting a multiple of the gradient of the error function at every point, it updates the weights in the following way:

w_i' \leftarrow w_i -\alpha [-(y - \hat{y}) x_i],

which is equivalent to

w_i' \leftarrow w_i + \alpha (y - \hat{y}) x_i.

Similarly, it updates the bias in the following way:

b' \leftarrow b + \alpha (y - \hat{y}).

Note: Since we’ve taken the average of the errors, the term we are adding should be \frac{1}{m} \cdot \alpha instead of \alpha, but as \alpha is a constant, then in order to simplify calculations, we’ll just take \frac{1}{m} \cdot \alpha to be our learning rate, and abuse the notation by just calling it \alpha.
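The update rule can be sketched as a small function (the names and example values below are made up for illustration):

```python
def gradient_descent_step(w, b, x, y, y_hat, alpha=0.1):
    """One update: w_i <- w_i + alpha * (y - y_hat) * x_i,  b <- b + alpha * (y - y_hat)."""
    step = alpha * (y - y_hat)
    w_new = [wi + step * xi for wi, xi in zip(w, x)]
    b_new = b + step
    return w_new, b_new

# A positive point (y = 1) currently predicted at 0.5: the weights move
# in the direction of x so the prediction increases on the next pass.
w_new, b_new = gradient_descent_step([0.0, 0.0], 0.0, [1.0, 2.0], 1, 0.5)
print(w_new, b_new)  # [0.05, 0.1] 0.05
```

Note that the sign of (y - \hat{y}) flips automatically for a negative point (y = 0), pushing the prediction down instead.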

Batch vs Stochastic Gradient Descent

At this point, it seems that we’ve seen two ways of doing linear regression.

  • By applying the squared (or absolute) trick at every point in our data one by one, and repeating this process many times.
  • By applying the squared (or absolute) trick at every point in our data all at the same time, and repeating this process many times.

More specifically, the squared (or absolute) trick, when applied to a point, gives us some values to add to the weights of the model. We can add these values, update our weights, and then apply the squared (or absolute) trick on the next point. Or we can calculate these values for all the points, add them, and then update the weights with the sum of these values.

The latter is called batch gradient descent. The former is called stochastic gradient descent.


The question is, which one is used in practice?

Actually, in most cases, neither. Think about this: if your data is huge, both are a bit slow, computationally. The best way is to split your data into many small batches, each with roughly the same number of points, and then use each batch to update your weights. This is called mini-batch gradient descent.
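One way to sketch the batching itself is to shuffle the point indices once per epoch and split them into chunks of roughly equal size (the function name and parameters below are made up for illustration):

```python
import random

def minibatches(n_points, batch_size, seed=0):
    """Shuffle the point indices, then split them into batches of roughly
    equal size; the last batch may be smaller."""
    idx = list(range(n_points))
    random.Random(seed).shuffle(idx)
    return [idx[i:i + batch_size] for i in range(0, n_points, batch_size)]

batches = minibatches(10, 3)
print([len(b) for b in batches])  # [3, 3, 3, 1]
```

Each batch would then be used for one weight update, exactly like the full-data update but with the sum taken over the batch only.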
