Gradient Calculation

In order to minimize the error function, we need to take some derivatives. So let’s get our hands dirty and actually compute the derivative of the error function. The first thing to notice is that the sigmoid function has a really nice derivative. Namely,
\sigma'(x) = \sigma(x) (1-\sigma(x))
The reason for this is the following; we can calculate it using the quotient rule:

\sigma'(x) = \frac{d}{dx}\left(\frac{1}{1+e^{-x}}\right) = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x)(1-\sigma(x))
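As a quick sanity check (a throwaway sketch, not part of the derivation), we can verify this identity numerically with a centered finite difference:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Compare the closed form sigma(x)(1 - sigma(x)) against a
# centered finite-difference estimate of the derivative.
x = np.linspace(-5.0, 5.0, 11)
h = 1e-6
closed_form = sigmoid(x) * (1.0 - sigmoid(x))
finite_diff = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
print(np.max(np.abs(closed_form - finite_diff)))  # tiny: the two agree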

And now, let’s recall that if we have m points labelled x^{(1)}, x^{(2)}, \ldots, x^{(m)}, the error formula is:
E = -\frac{1}{m} \sum_{i=1}^m \left( y_i \ln(\hat{y_i}) + (1-y_i) \ln (1-\hat{y_i}) \right)
where the prediction is given by \hat{y_i} = \sigma(Wx^{(i)} + b).
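In code, this error is essentially a one-liner; here is a minimal NumPy sketch (the function and variable names are my own):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_error(W, b, X, y):
    # X has shape (m, n), y has shape (m,), W has shape (n,)
    y_hat = sigmoid(X @ W + b)  # predictions for all m points
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))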

Our goal is to calculate the gradient of E with respect to the weights and bias, at a point x = (x_1, \ldots, x_n), given by the partial derivatives
\nabla E =\left(\frac{\partial}{\partial w_1}E, \cdots, \frac{\partial}{\partial w_n}E, \frac{\partial}{\partial b}E \right)
To simplify our calculations, we’ll actually think of the error that each point produces, and calculate the derivative of this error. The total error, then, is the average of the errors at all the points. The error produced by each point is, simply,
E = - y \ln(\hat{y}) - (1-y) \ln (1-\hat{y})
In order to calculate the derivative of this error with respect to the weights, we’ll first calculate \frac{\partial}{\partial w_j} \hat{y}. Recall that \hat{y} = \sigma(Wx+b), so:

\frac{\partial}{\partial w_j}\hat{y} = \frac{\partial}{\partial w_j}\sigma(Wx+b)
= \sigma(Wx+b)(1-\sigma(Wx+b)) \cdot \frac{\partial}{\partial w_j}(Wx+b)
= \hat{y}(1-\hat{y}) \cdot \frac{\partial}{\partial w_j}(w_1 x_1 + \cdots + w_j x_j + \cdots + w_n x_n + b)
= \hat{y}(1-\hat{y}) x_j

The last equality is because the only term in the sum which is not a constant with respect to w_j is precisely w_j x_j, which clearly has derivative x_j.

Now, we can go ahead and calculate the derivative of the error E at a point x, with respect to the weight w_j.

\frac{\partial}{\partial w_j}E = \frac{\partial}{\partial w_j}\left[-y\ln(\hat{y}) - (1-y)\ln(1-\hat{y})\right]
= -y\,\frac{1}{\hat{y}}\frac{\partial \hat{y}}{\partial w_j} + (1-y)\,\frac{1}{1-\hat{y}}\frac{\partial \hat{y}}{\partial w_j}
= -y\,\frac{1}{\hat{y}}\,\hat{y}(1-\hat{y})x_j + (1-y)\,\frac{1}{1-\hat{y}}\,\hat{y}(1-\hat{y})x_j
= -y(1-\hat{y})x_j + (1-y)\hat{y}x_j
= -(y - \hat{y})x_j

A similar calculation will show us that

\frac{\partial}{\partial b}E = -(y - \hat{y})

This actually tells us something very important. For a point with coordinates (x_1, \ldots, x_n), label y, and prediction \hat{y}, the gradient of the error function at that point is \left(-(y - \hat{y})x_1, \cdots, -(y - \hat{y})x_n, -(y - \hat{y}) \right). In summary, the gradient is
\nabla E = -(y - \hat{y}) (x_1, \ldots, x_n, 1)
If you think about it, this is fascinating. The gradient is actually a scalar times the coordinates of the point! And what is the scalar? Nothing less than a multiple of the difference between the label and the prediction. What significance does this have?

Notice that the closer the prediction is to the label, the smaller the gradient, and the further apart they are, the larger the gradient. So, a small gradient means we’ll change our weights by a little bit, and a large gradient means we’ll change them by a lot.

If this sounds anything like the perceptron algorithm, this is no coincidence! We’ll see it in a bit.
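Before moving on, we can sanity-check the gradient formula numerically against a finite-difference approximation, for a single point (a quick sketch; the names and values here are made-up placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def point_error(W, b, x, y):
    # Error produced by one point: -y ln(y_hat) - (1 - y) ln(1 - y_hat)
    y_hat = sigmoid(W @ x + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0   # one made-up point with label 1
W, b = rng.normal(size=3), 0.5

# Closed form: -(y - y_hat) * (x_1, ..., x_n, 1)
y_hat = sigmoid(W @ x + b)
closed = -(y - y_hat) * np.append(x, 1.0)

# Finite-difference gradient in (w_1, ..., w_n, b)
h = 1e-6
numeric = np.zeros(4)
for j in range(3):
    e = np.zeros(3)
    e[j] = h
    numeric[j] = (point_error(W + e, b, x, y) - point_error(W - e, b, x, y)) / (2 * h)
numeric[3] = (point_error(W, b + h, x, y) - point_error(W, b - h, x, y)) / (2 * h)

print(np.max(np.abs(closed - numeric)))  # tiny, so the formula checks out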

Gradient Descent Step

Since the gradient descent step simply consists of subtracting a multiple of the gradient of the error function at every point, the weights update in the following way:

w_i' \leftarrow w_i -\alpha [-(y - \hat{y}) x_i],

which is equivalent to

w_i' \leftarrow w_i + \alpha (y - \hat{y}) x_i.

Similarly, it updates the bias in the following way:

b' \leftarrow b + \alpha (y - \hat{y}).

Note: Since we’ve taken the average of the errors, the term we are adding should really be \frac{1}{m} \cdot \alpha instead of \alpha. But since \alpha is a constant, in order to simplify calculations we’ll just take \frac{1}{m} \cdot \alpha to be our learning rate, abusing notation by calling it \alpha.
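Putting both update rules together, one gradient descent step might look as follows in NumPy (a minimal sketch under the conventions above; the names are mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(W, b, X, y, alpha=0.1):
    # X: (m, n) inputs, y: (m,) labels in {0, 1}
    y_hat = sigmoid(X @ W + b)              # predictions for all m points
    error = y - y_hat                       # (y - y_hat) for each point
    W = W + alpha * (X.T @ error) / len(y)  # w_i' = w_i + alpha (y - y_hat) x_i, averaged
    b = b + alpha * error.mean()            # b' = b + alpha (y - y_hat), averaged
    return W, b

Here I keep the explicit 1/m rather than absorbing it into alpha; as the note says, the two choices differ only by a constant factor in the learning rate.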


Remove punctuation from string in Python

Say we have the string “Hello, are you still there?” and we want to transform it into “Hello are you still there”. How do we do it?

Before we go ahead, we first need to define what counts as punctuation; the easy way is to use string.punctuation. Let’s check its value with the following commands:

import string
print(string.punctuation)

The above code prints a string containing all the punctuation characters:

'!"#$%&\'()*+,-./:;?@[\\]^_`{|}~'

Now let’s make the transformation using str.translate; see the following code for details:

import string

# Create a string to operate on
s = "Hello, are you still there?"

# Create a translation table
translator = str.maketrans('', '', string.punctuation)

# Apply the translation
s = s.translate(translator)

# Check the result
print(s)  # prints "Hello are you still there"
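As an aside (my addition, not part of the original recipe), a regular expression gets the same result:

import re
import string

s = "Hello, are you still there?"
# re.escape makes every punctuation character safe to use inside
# the character class; re.sub then deletes each match.
print(re.sub(f"[{re.escape(string.punctuation)}]", "", s))  # prints "Hello are you still there"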


Python List Sorting Techniques

  1. How do I sort a list?

You can sort a list with the built-in list.sort() method, which modifies the list in place (and is stable). The other way is the built-in function sorted(), which builds a new sorted list. Let me give you two examples.

Example using list.sort():

import random
# a list with 10 integers
nums = [num for num in range(10)]
print(nums)
# shuffle nums to see the effect of sorting; note that this is an in-place operation
random.shuffle(nums)
print(nums)
# sort it
nums.sort()
print(nums)

The above piece of code produces output like the following:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 7, 5, 3, 6, 8, 4, 1, 9, 2]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Note: you may get different output on each run because of the shuffle operation.

Example using sorted():

import random
# a list with 10 integers
nums = [num for num in range(10)]
print(nums)
# shuffle nums to see the effect of sorting; note that this is an in-place operation
random.shuffle(nums)
print(nums)
# sort it using sorted
nums_sorted = sorted(nums)  # a new list returned
print(nums_sorted)

Here, we’ll get output similar to the previous example’s:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[4, 5, 3, 7, 6, 2, 9, 8, 1, 0]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

  2. How do I sort a list with multiple keys?

Suppose you have a list of records, each record representing a student. A student consists of a name, a gender, an age, etc. For simplicity, we can assume each record consists of just the three attributes mentioned above, so the list looks like this:

students = [('john', 'male', 23), ('alice', 'female', 18)]

In this case, you would like to sort the students first by name, then by age. How can you do it? The answer is that you can use either list.sort() or sorted(), but this time you need to specify the key to sort by (in Python 3, old-style comparison functions are only supported through functools.cmp_to_key). Before we go ahead, let’s check the signatures:

The signature of list.sort() is list.sort(*, key=None, reverse=False), while the signature of sorted() is sorted(iterable, *, key=None, reverse=False).
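If you do have an old-style comparison function around, functools.cmp_to_key from the standard library adapts it into a key function; here is a quick sketch (the compare function is made up for illustration):

from functools import cmp_to_key

def compare(a, b):
    # Old-style comparison: returns negative, zero, or positive, like strcmp
    if a[0] != b[0]:
        return -1 if a[0] < b[0] else 1  # primary key: name
    return a[2] - b[2]                   # secondary key: age

students = [('dave', 'male', 10), ('jane', 'female', 12), ('john', 'male', 15)]
students.sort(key=cmp_to_key(compare))  # same ordering as a (name, age) key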

Now let’s sort a list with multiple keys. Here I’ll only give examples using list.sort(); see the following:

Example 1: specify key using a lambda

from pprint import pprint

# create a list to sort
students = [('dave', 'male', 10), ('jane', 'female', 12), ('john', 'male', 15)]
# specify key using a lambda
students.sort(key=lambda student: (student[0], student[2]))
pprint(students)

The above code gives us the following output:

[('dave', 'male', 10), ('jane', 'female', 12), ('john', 'male', 15)]

You can see that it just works as we expected. Let’s check another example:

Example 2: specify key using itemgetter

from operator import itemgetter
from pprint import pprint

# create a list to sort
students = [('dave', 'male', 10), ('jane', 'female', 12), ('john', 'male', 15)]
# specify key using itemgetter
students.sort(key=itemgetter(0, 2))
pprint(students)

This example produces the same output as example 1. There is a third way: we can explicitly sort the students first by name, then by age. Let’s have a look at it:

Example 3: explicitly do the sorting

from operator import itemgetter
from pprint import pprint

# create a list to sort
students = [('dave', 'male', 10), ('jane', 'female', 12), ('john', 'male', 15)]
students.sort(key=itemgetter(0))  # sort by name
students.sort(key=itemgetter(2), reverse=True) # sort by age in descending order
pprint(students)

The output of the above code:

[('john', 'male', 15), ('jane', 'female', 12), ('dave', 'male', 10)]

This time we get a different output from the previous two examples: we first sort by name, then sort by age in descending order. Because Python’s sorts are stable, the result is ordered primarily by age (descending) and secondarily by name (ascending). Use this method when you want fine control over the direction of each key.
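For numeric keys there is also a single-pass alternative (my addition): negate the numeric key inside the key tuple to reverse just that key’s direction:

from pprint import pprint

students = [('dave', 'male', 10), ('jane', 'female', 12), ('john', 'male', 15)]
# Age descending (via negation), then name ascending, in one pass
students.sort(key=lambda s: (-s[2], s[0]))
pprint(students)  # [('john', 'male', 15), ('jane', 'female', 12), ('dave', 'male', 10)]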

Your comments are welcome!

Batch vs Stochastic Gradient Descent

At this point, it seems that we’ve seen two ways of doing linear regression.

  • By applying the squared (or absolute) trick at every point in our data one by one, and repeating this process many times.
  • By applying the squared (or absolute) trick at every point in our data all at the same time, and repeating this process many times.

More specifically, the squared (or absolute) trick, when applied to a point, gives us some values to add to the weights of the model. We can add these values, update our weights, and then apply the squared (or absolute) trick on the next point. Or we can calculate these values for all the points, add them, and then update the weights with the sum of these values.

The former is called stochastic gradient descent. The latter is called batch gradient descent.


The question is, which one is used in practice?

Actually, in most cases, neither. Think about it: if your data is huge, both are computationally slow. The best way to do linear regression is to split your data into many small batches, each with roughly the same number of points, and then use each batch in turn to update your weights. This is called mini-batch gradient descent.
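To make the idea concrete, here is a minimal sketch of mini-batch gradient descent for linear regression (the function name and hyperparameter values are arbitrary placeholders):

import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, batch_size=32, epochs=100, seed=0):
    # X: (m, n) inputs, y: (m,) targets
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W, b = np.zeros(n), 0.0
    for _ in range(epochs):
        order = rng.permutation(m)  # shuffle, then slice into batches
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            error = yb - (Xb @ W + b)                 # (y - y_hat) for this batch
            W += alpha * (Xb.T @ error) / len(batch)  # update from this batch only
            b += alpha * error.mean()
    return W, b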


Mean vs Total Squared (or Absolute) Error

A potential confusion is the following: How do we know if we should use the mean or the total squared (or absolute) error?

The total squared error is the sum of errors at each point, given by the following equation:

M = \sum_{i=1}^m \frac{1}{2} (y_i - \hat{y_i})^2,

whereas the mean squared error is the average of these errors, given by the following equation (where m is the number of points):

T = \sum_{i=1}^m \frac{1}{2m}(y_i - \hat{y_i})^2.

The good news is, it doesn’t really matter. As we can see, the total squared error is just a multiple of the mean squared error, since

M = mT.

Therefore, since differentiation is linear, the gradient of M is also m times the gradient of T.

However, the gradient descent step consists of subtracting the gradient of the error times the learning rate \alpha. Therefore, choosing between the mean squared error and the total squared error really just amounts to picking a different learning rate.

In real life, we’ll have algorithms that help us determine a good learning rate to work with. Therefore, whether we use the mean error or the total error, the algorithm will just end up picking a different learning rate.
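To see this numerically (a throwaway sketch with made-up numbers), note that the gradient of M is exactly m times the gradient of T, so scaling the learning rate by 1/m makes the two updates identical:

import numpy as np

rng = np.random.default_rng(0)
m = 5
y = rng.normal(size=m)       # made-up labels
y_hat = rng.normal(size=m)   # made-up predictions

# d/dy_hat of (1/2)(y - y_hat)^2 is -(y - y_hat); sum vs. mean only rescales it
grad_M = -(y - y_hat)        # gradient of the total error M
grad_T = -(y - y_hat) / m    # gradient of the mean error T
print(np.allclose(grad_M, m * grad_T))  # True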