Gradient Calculation

In order to minimize the error function, we need to take some derivatives. So let’s get our hands dirty and actually compute the derivative of the error function. The first thing to notice is that the sigmoid function has a really nice derivative. Namely,
\sigma'(x) = \sigma(x) (1-\sigma(x))
The reason for this is the following, we can calculate it using the quotient formula:


And now, let’s recall that if we have m points labelled x^{(1)}, x^{(2)}, \ldots, x^{(m)},the error formula is:
E = -\frac{1}{m} \sum_{i=1}^m \left( y_i \ln(\hat{y_i}) + (1-y_i) \ln (1-\hat{y_i}) \right)
where the prediction is given by \hat{y_i} = \sigma(Wx^{(i)} + b).

Our goal is to calculate the gradient of E, at a point x = (x_1, \ldots, x_n), given by the partial derivatives
\nabla E =\left(\frac{\partial}{\partial w_1}E, \cdots, \frac{\partial}{\partial w_n}E, \frac{\partial}{\partial b}E \right)
To simplify our calculations, we’ll actually think of the error that each point produces, and calculate the derivative of this error. The total error, then, is the average of the errors at all the points. The error produced by each point is, simply,
E = - y \ln(\hat{y}) - (1-y) \ln (1-\hat{y})
In order to calculate the derivative of this error with respect to the weights, we’ll first calculate \frac{\partial}{\partial w_j} \hat{y}. Recall that \hat{y} = \sigma(Wx+b), so:


The last equality is because the only term in the sum which is not a constant with respect to w_j is precisely w_j x_j, which clearly has derivative x_j.

Now, we can go ahead and calculate the derivative of the error E at a point x, with respect to the weight w_j.


A similar calculation will show us that


This actually tells us something very important. For a point with coordinates (x_1, \ldots, x_n), label y, and prediction \hat{y}, the gradient of the error function at that point is \left(-(y - \hat{y})x_1, \cdots, -(y - \hat{y})x_n, -(y - \hat{y}) \right). In summary, the gradient is
\nabla E = -(y - \hat{y}) (x_1, \ldots, x_n, 1)
If you think about it, this is fascinating. The gradient is actually a scalar times the coordinates of the point! And what is the scalar? Nothing less than a multiple of the difference between the label and the prediction. What significance does this have?

So, a small gradient means we’ll change our coordinates by a little bit, and a large gradient means we’ll change our coordinates by a lot.

If this sounds anything like the perceptron algorithm, this is no coincidence! We’ll see it in a bit.

Gradient Descent Step

Therefore, since the gradient descent step simply consists in subtracting a multiple of the gradient of the error function at every point, then this updates the weights in the following way:

w_i' \leftarrow w_i -\alpha [-(y - \hat{y}) x_i],

which is equivalent to

w_i' \leftarrow w_i + \alpha (y - \hat{y}) x_i.

Similarly, it updates the bias in the following way:

b' \leftarrow b + \alpha (y - \hat{y}),

Note: Since we’ve taken the average of the errors, the term we are adding should be \frac{1}{m} \cdot \alpha instead of \alpha, but as \alpha is a constant, then in order to simplify calculations, we’ll just take \frac{1}{m} \cdot \alpha to be our learning rate, and abuse the notation by just calling it \alpha.


Batch vs Stochastic Gradient Descent

At this point, it seems that we’ve seen two ways of doing linear regression.

  • By applying the squared (or absolute) trick at every point in our data one by one, and repeating this process many times.
  • By applying the squared (or absolute) trick at every point in our data all at the same time, and repeating this process many times.

More specifically, the squared (or absolute) trick, when applied to a point, gives us some values to add to the weights of the model. We can add these values, update our weights, and then apply the squared (or absolute) trick on the next point. Or we can calculate these values for all the points, add them, and then update the weights with the sum of these values.

The latter is called batch gradient descent. The former is called stochastic gradient descent.


The question is, which one is used in practice?

Actually, in most cases, neither. Think about this: If your data is huge, both are a bit slow, computationally. The best way to do linear regression, is to split your data into many small batches. Each batch, with roughly the same number of points. Then, use each batch to update your weights. This is still called mini-batch gradient descent.


Mean vs Total Squared (or Absolute) Error

A potential confusion is the following: How do we know if we should use the mean or the total squared (or absolute) error?

The total squared error is the sum of errors at each point, given by the following equation:

M = \sum_{i=1}^m \frac{1}{2} (y - \hat{y})^2,

whereas the mean squared error is the average of these errors, given by the equation, where m is the number of points:

T = \sum_{i=1}^m \frac{1}{2m}(y - \hat{y})^2.

The good news is, it doesn’t really matter. As we can see, the total squared error is just a multiple of the mean squared error, since

M = mT.

Therefore, since derivatives are linear functions, the gradient of T is also mm times the gradient of M.

However, the gradient descent step consists of subtracting the gradient of the error times the learning rate \alpha. Therefore, choosing between the mean squared error and the total squared error really just amounts to picking a different learning rate.

In real life, we’ll have algorithms that will help us determine a good learning rate to work with. Therefore, if we use the mean error or the total error, the algorithm will just end up picking a different learning rate.





维护四个子数组:bottom(元素小于pivot),middle (元素等于pivot),unclassifiedtop(元素大于pivot)。一开始所有的元素都是unclassified,遍历这个组的元素,根据当前元素与pivot的相对关系将元素移动到bottommiddletop组。


  • nums[5] < pivot,例如nums[5] = -5。我们将其与第一个1进行交换,得到新数组为[-3,0,-1,-5,1,1,?,?,4,2]

  • nums[5] == pivot,例如nums[5] = 1。我们无需任何移动,只需继续考察下一个未分类的元素,新数组为[-3,0,-1,1,1,1,?,?,4,2]

  • nums[5] > pivot,例如nums[5] = 3。我们将其与未分类的最后一个元素进行交换,得到新数组为[-3,0,-1,1,1,?,?,3,4,2]

C++ 实现

void partition(vector<int>& nums, int k) {
    int pivot = nums[k];
    // invairants:
    // bottom: nums[0, smaller - 1]
    // middle: nums[smaller, equal - 1]
    // unclassified: nums[equal, larger - 1]
    // top: nums[larger, nums.size() - 1]
    int smaller = 0, equal = 0, larger = nums.size();
    // keep iterating as long as there is an unclassified element
    while (equal < larger) {
        if (nums[equal] < pivot) {
            swap(nums[smaller++], nums[equal++]);
        } else if (nums[equal] == pivot) {
        } else {
            swap(nums[equal], nums[--larger]);


Isolate the lowest set bit

If we want to isolate the lowest bit that is 1 in an unsigned integer x, we can do this by calculating x & ~(x - 1).

It can be easily to derive. First, suppose x is not 0, it must have a bit that is 1. The lowest set bit is the first bit that is one from least significant bit to most significant bit. Subtracting one from x changes the lowest bit to 0 and sets all the lower bits to 1. All the higher bits remains unchanged. Therefore x & ~(x - 1) has a single bit set to 1, namely the lowest set bit. Now, suppose x is 0, subtracting one from x underflows, resulting a word in which all bits are set to one. Again, x & ~(x - 1) is 0.

A similar derivation shows that x & (x - 1) replaces the lowest bit that is one with zero. For example, if x = (00101100), then x - 1 = (00101011), so x & (x - 1) = (0010100). This fact is very useful.

Parity calculation

The parity of a binary word is 1 if the number of 1s in the word is odd, otherwise, it is 0.

Solution #1

We iteratively test the value of each bit while tracking the number of 1s so far seen. Since we only care if the number of 1s is odd or even, we can store the number modulo 2.

int parity(unsigned int x) {
    int res = 0;

    while (x) {
        result ^= (x & 1);
        x >>= 1;
    return res;

Above method is a brute-force algorithm, the time complexity is O(n), where n is the word size.

Solution #2

Use aforementioned x & (x - 1) clears the lowest bit that is one, we can improve solution #1, let the time complexity O(k), where k is the number of 1s in word.

int parity(unsigned int x) {
    int res = 0;

    while (x) {
        result ^= 1;
        x &= (x - 1);
    return res;