The Perceptron is a supervised linear classifier, combining weights with a feature vector to classify an item. Given a data set of items, we train the Perceptron (by adjusting the weights) so that we can classify new items in the future.

### Classifying

In the general case, with multiple classes, we need a matrix of the weight vectors (each for each class). Then, given an item with a feature vector of x, we calculate its dot product with each weight vector and we pick the maximum. The item is classified in that class. In a formula:

$Class_{j} = \max\limits_{1 \leq i \leq c}{\{w_{i}x_{j} + b_{i}\}}$

Note: The feature vector is a vector of numbers, where each element corresponds to the value the item has for the feature.

Where c is the number of classes and j is the index of the item. We also have the number bj which signifies the bias. Essentially it moves the function away from the point of origin (in a two dimensional plane, it would move a function up/down). The bias does not depend on the input vector. Because it is not elegant to have to work with another variable, we will augment the vectors to include the bias. That means we will add to the weight and feature vectors another element. For the weight vectors, we will “stick” the bias value at the end of the vectors. So, a weight vector for class i and f features now becomes:

$[w_{i,1}, w_{i,2}, ... , w_{i,f}, b_{i}]$

For the feature vector, we augment it with a 1 (because when we multiply the weight and feature vector, we get bi times 1, for the bias).

$[x_{i,1}, x_{i,2}, ... , x_{i,f}, 1]$

Back on the dot product. In this tutorial, we can call it confidence, as it is an indicator of how confident we are that we picked the right class over other dot products/classes.

The confidence of a weight and a feature vector can be calculated by breaking down the two vectors into simpler multiplications and then adding those together.

$\sum\limits_{1 \leq k \leq f+1} {w_{j,k}*x_{i,k}}$

Where f is the number of features (therefore f+1 is the length of the augmented vectors). The indices j and i are the indices of the weight vector and the feature vector respectively.

We are looping through all the elements (using k as the index) in the weight vector (with index j) and the feature vector (with index i). is the number of features, and we add 1 to it because we have augmented the vectors. k takes values between 1 and f+1.

### Training

We train the Perceptron so it can “learn” to classify items by adjusting the augmented weight vectors.

When we make a prediction that is not correct, we need to learn from that mistake and correct our weight matrix as best as possible. Say the correct classification for an item was g but we classified it in h. We need to adjust the weight vectors where the mistake happened.

If we imagine those two vectors as lines that “cut” the space into parts, we want the item to fall under the part that was “cut” by the vector g. Currently though it falls under the h one instead. To fix this, we will move the g one towards the item and the h one away from it. We do that by adding the item feature vector to the first one and subtracting it from the second.

To minimize the radicality of this action, we will implement a rate to these arithmetics to have better control on the outcome.

More formally, when we make an error in the categorization, we do the following:

For the correct class –

$w_{g} = w_{g} + r*x$

For the guessed/wrong class –

$w_{h} = w_{h} - r*x$

With the above we fixed the weight functions for the given item, so if we were to re-classify the item, we would probably classify it under the correct class.

What about the other items in the data set? We changed the weight vectors, so there is a high chance we messed it up for some other items. So we need to go back and fix that for them too. Then we might break it for some other items and the cycle continues.

In fact, if the classes aren’t linearly separable (which means they can’t be separated by a line), the cycle can continue forever.

To combat this infinite loop, we have many options. One is to stop after the amount of corrections becomes small and the weight vectors don’t change much. Another, is to stop once you hit a certain threshold of accuracy.

The most popular solution though is to loop through the items a fixed amount of times. Each loop is called an epoch. After the epochs are finished, we take the weight matrix that was produced.

The training of a weight vector now becomes:

For the correct class –

$w_{g}(t) = w_{g}(t-1) + r*x$

For the guessed/wrong class –

$w_{h}(t) = w_{h}(t-1) - r*x$

Where t is the current epoch.

### Closing Notes

In this post I touched on the multi-class Perceptron, which can also be referenced as the Kesler’s Construction Perceptron. In the binary class case (where there are only two classes) the algorithm is simplified, by using only a single weight vector instead of a matrix of weight vectors. In that case, we do not pick the maximum dot product of the weight and the feature vector, but instead we check if the dot product is greater or less than 0. Above 0 it is classified in class A, below 0 it is classified in class B.

Another variation is the Pocket Algorithm. As we go through the epochs, we keep (in a “pocket”) the best solution we have found yet (by best I mean the most accurate). If we find a solution with better accuracy, we throw out the one in our pocket and add the new one. We continue until we finish our epochs.

With that, this little introduction is finished.

Perceptron Implementation