“a computer model or computerized machine devised to represent or simulate the ability of the brain to recognize and discriminate”

This definition holds pretty true even when we talk in the context of Machine Learning. A Perceptron is a Supervised Classification algorithm.

The algorithm is very true to its dictionary meaning, and its origins are rather interesting as well. They date back to the 1950s, when it became one of the first ever Neural Networks to be implemented. In a sense, the Perceptron learning algorithm is the forefather of modern-day Neural Networks and Deep Learning.

A Perceptron is a very simple yet elegant algorithm which is proven to work provided the data meets its required constraint of linear separability. Invented and developed by Frank Rosenblatt in 1957, Perceptrons showcased the ability to classify images. The algorithm raised many eyebrows and much interest alike. Further research and experimentation led to its proof of convergence, along with the identification of its limitations (see Minsky and Papert’s book Perceptrons).

This algorithm carries a special meaning for me personally as well. The Perceptron happens to be the very first learning algorithm we discussed and implemented as part of our course Machine Learning 101 at IIIT-Bangalore. The following is a snapshot of my class notes (please overlook the handwriting). #nostalgia

The Classic Perceptron

A Perceptron is a Supervised Classification Algorithm. In its most basic and classic sense, it is a binary classifier which is proven to converge and find a classifier for a linearly separable dataset. Before we discuss the actual algorithm, let us build up a story around it. Say we have a dataset consisting of two kinds of data points which are known to be linearly separable. Though it might be difficult to find linearly separable datasets in real life, that is the constraint under which a Perceptron is guaranteed to work. Even though this may sound like a very limiting constraint, Perceptrons are very powerful, and they help us understand other, more complex learning algorithms like Neural Networks.

Sample Data

For the sake of this post, let us assume the following data points have been generated by some unknown function \(f(X)\). These data points can be visualized in a 2-D space as follows:

As we can see in the above dataset, there are two different classes, one represented as red and the other as green. Our job here is to train a perceptron which can classify the data points into two classes.
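The actual generating function \(f(X)\) is not shown here, but the following is a minimal sketch of how a comparable linearly separable dataset could be produced (the separating line used for labeling is entirely made up):

import numpy as np
import pandas as pd

# hypothetical generator: the actual f(X) behind the sample data is not shown
np.random.seed(42)
points = np.random.uniform(-1, 1, size=(100, 2))
# label each point by which side of an arbitrary line it falls on
labels = np.where(points[:, 1] > 0.5 * points[:, 0] + 0.1, 1, -1)
input_df = pd.DataFrame({'X': points[:, 0], 'Y': points[:, 1], 'label': labels})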

What is a Perceptron

A perceptron can be thought of as a computational equivalent of a single neuron. Very much like a neuron, a perceptron can be explained as:

  • It receives data from input nodes (say vector \(X\)).
  • All input nodes are connected to a node in the next layer.
  • This node in the next layer takes the weighted sum of the input data (using weight vector \(W\)).
  • The output of this network is passed through an activation function (say \(f\)), which is a type of a binary step function.
  • The output of this function determines the class of the current input.

The following is a visual representation of a perceptron:
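In code, the same computation can be expressed as a minimal sketch (the function and variable names here are purely illustrative):

import numpy as np

def perceptron_output(x, w, theta=0.0):
    # weighted sum: Y = w_0 + sum(x_i * w_i)
    weighted_sum = w[0] + np.dot(x, w[1:])
    # binary step activation with decision threshold theta
    return 1 if weighted_sum > theta else -1

# a 2-D input against some weight vector [w_0, w_1, w_2]
print(perceptron_output(np.array([0.4, -0.2]), np.array([0.1, 0.5, 0.3])))  # prints 1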

The Algorithm

Now that we understand and can visualize how a perceptron looks, let us get to know the algorithm used to train it. The ultimate aim of a binary classifier is to identify a decision boundary which separates the dataset into different classes. Since a simple perceptron is a binary classifier, it attempts to find a decision boundary which separates the data into two classes. As the input data can be an n-dimensional vector, the decision boundary is in general a hyperplane splitting our data in two. Our sample dataset consists of points in a 2-D space, so our decision boundary is a simple straight line separating the green and red data points.
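Concretely, the decision boundary is the set of points where the weighted sum equals the decision threshold, which in our 2-D case is

\[w_0 + w_1 x + w_2 y = \theta\]

i.e., the equation of a straight line.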

The algorithm works as follows:

  • Initialization:

    • Assume we have an activation function \(f(Y)\) which generates an output of \(+1\) or \(-1\) based on a given decision threshold \(\theta\).

    • Here \(Y\) is the weighted sum of the input vector \(X\) and the weights vector \(W\): \(Y = w_0 + \sum_{i=1}^n x_i \cdot w_i\)

    • We initialize the weights vector \(W\) to some random values.
    • Let \(O\) be the output label vector for the input vector \(X\).
  • Update Step:

    • For each sample data point \(x_j\) in the training data set with labeled outputs \(o_j\):
      • Calculate the output label with current weight vector as:

        \[\begin{align*} & y_j = w_0 + \sum_{i=1}^n x_{j,i} \cdot w_i \\ & \implies output\_label : o' = f(y_j) \end{align*}\]
      • Update the weights vector \(W\) as: \(W' = W + learning\_rate \cdot (o_j - o') \cdot x_j\) (a worked example follows this list)

    • The algorithm utilizes the updated weights for each subsequent data point in the training set, until all data points are correctly classified or the error drops below a certain limit.
    • This trick of utilizing the updated weights immediately allows perceptrons to be used for online learning, rather than waiting for a complete pass over the whole dataset.
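To make the update step concrete, here is a single hand-worked iteration as a sketch (all the numbers are made up for illustration):

import numpy as np

# illustrative single update step with made-up numbers
w = np.array([0.5, -0.1, 0.2])   # current weights [w_0, w_1, w_2]
x_j = np.array([0.4, 0.7])       # one training point
o_j = -1                         # its true label
lr = 0.001                       # learning rate

y_j = w[0] + np.dot(x_j, w[1:])  # weighted sum: 0.5 - 0.04 + 0.14 = 0.6
o_pred = 1 if y_j > 0 else -1    # step activation (theta = 0) -> +1, a misclassification

# W' = W + learning_rate * (o_j - o') * x_j, where the bias sees a constant input of 1
w = w + lr * (o_j - o_pred) * np.concatenate(([1.0], x_j))
print(w)                         # [0.498, -0.1008, 0.1986]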

The Perceptron algorithm has been mathematically proven to converge for linearly separable data. To mitigate the risk of overfitting and the dependence on the order of input samples, we run the perceptron over a shuffled input dataset for multiple epochs.
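With pandas, one simple way to shuffle the dataset before each epoch (a sketch, not part of the implementation shown below) is:

# reshuffle the training data at the start of every epoch
shuffled_df = input_df.sample(frac=1).reset_index(drop=True)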

Things to Remember

While discussing the perceptron algorithm, we introduced quite a few vectors and other symbols, each of which carries a specific meaning. Some of the important ones are:

  • \(w_0\) : This is called the bias of the weight vector \(W\). It allows the decision boundary to be shifted away from the origin, so the separating line need not pass through \((0, 0)\).
  • Decision Threshold (\(\theta\)) : This is the cutoff used by the activation function \(f\); a weighted sum above \(\theta\) is assigned one class, and everything else the other.

Implementing the Perceptron

The Perceptron is one of the most elegant yet easy-to-implement learning algorithms. Though most packages/libraries provide an implementation of this algorithm, let us try our hand at it as well. The following Python function helps in updating the weight vector \(W\):

# update weights of the perceptron
# process_input() computes the weighted sum and get_output_label() applies
# the step activation; both are defined in the complete code linked below
def update_weights(row, weights_array=[1, 0, 0], lr=0.001):
    # get predicted label using current weight vector
    predicted_label = get_output_label(process_input(row[['X', 'Y']],
                                       weights_array=weights_array))

    # calculate the error as:
    # learning_rate * (actual_label - predicted_label)
    error = lr * (row['label'] - predicted_label)

    # adjust the weights (bias first, then one weight per feature)
    weights_array[0] += error
    weights_array[1] += row['X'] * error
    weights_array[2] += row['Y'] * error
    return weights_array
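The helpers get_output_label() and process_input() used above compute the step activation and the weighted sum respectively; their actual implementations live in the complete code linked below, but plausible minimal versions could look like this sketch:

# plausible minimal versions of the helpers; see the complete code for the originals
def process_input(data, weights_array=(1, 0, 0)):
    # weighted sum: w_0 + X*w_1 + Y*w_2
    return weights_array[0] + data['X'] * weights_array[1] + data['Y'] * weights_array[2]

def get_output_label(weighted_sum, theta=0.0):
    # binary step activation with decision threshold theta
    return 1 if weighted_sum > theta else -1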

Now that we have a function to update the weights, the final piece of the puzzle is to write a function which emulates the steps of the training algorithm:

# train the perceptron
from numpy.random import rand

def train_perceptron(input_df, learning_rate=0.001, epochs=10):
    # start with an arbitrary weight vector
    weights_array = [rand(1)[0], 0, 0]

    # iterate epoch number of times
    for itr in range(epochs):
        print("Weights for epoch:" + str(itr + 1) + "=>", weights_array)
        # apply() updates the weights row by row; every row returns the same
        # (mutated) list, so we read it back from the first entry
        weights_array = list(input_df.apply(update_weights,
                                            axis=1,
                                            weights_array=weights_array,
                                            lr=learning_rate).iloc[0])
    return weights_array

To test our perceptron, we pass our input dataset (with training labels) and the learning rate to our training function. We set the number of epochs to 10 and get the final output:

The final weight vector \(W\) learnt by our perceptron is: [0.59227970265095897, -0.15017879177525117, 0.18604559224437517]

As we can see from the above visualization, our code seems to work as it has classified our data perfectly into green and red classes!
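As a quick sanity check, we can also classify a new point directly with the learned weights (a sketch reusing the helper functions above; mapping \(+1\) to green is an assumption based on the plot):

# classify a new, made-up point with the learned weights
w = [0.59227970265095897, -0.15017879177525117, 0.18604559224437517]
point = {'X': 0.5, 'Y': 2.0}  # illustrative point
label = get_output_label(process_input(point, weights_array=w))
print('green' if label == 1 else 'red')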

The complete code is available on GitHub here: github.com/raghavbali/python_notebooks

Enhancements

As we have discussed already, the Perceptron requires its input dataset to be linearly separable, but we know that in the real world this is usually not the case. Also, the classic perceptron is a binary classifier, while real-world scenarios may involve data consisting of multiple classes. Over the years, the Perceptron has gone through a lot of rigor and evolution. With these enhancements, Perceptrons have not only been tweaked to reach a solution faster, but have also been improved to work on non-linearly separable datasets (see the Pocket Perceptron) and multi-class datasets (see the Multi-class Perceptron).
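As a flavour of one such enhancement, the following is a minimal sketch of the Pocket idea (keep whichever weights have misclassified the fewest training points so far), reusing the functions defined earlier in this post:

# minimal sketch of the Pocket trick: remember the best weights seen so far
def count_errors(input_df, weights_array):
    # number of training points the given weights misclassify
    predictions = input_df.apply(lambda row: get_output_label(
        process_input(row[['X', 'Y']], weights_array=weights_array)), axis=1)
    return (predictions != input_df['label']).sum()

def train_pocket(input_df, learning_rate=0.001, epochs=10):
    weights_array = [rand(1)[0], 0, 0]
    best_weights = list(weights_array)
    best_errors = count_errors(input_df, weights_array)
    for itr in range(epochs):
        weights_array = list(input_df.apply(update_weights,
                                            axis=1,
                                            weights_array=weights_array,
                                            lr=learning_rate).iloc[0])
        errors = count_errors(input_df, weights_array)
        if errors < best_errors:  # pocket the better-performing weights
            best_weights, best_errors = list(weights_array), errors
    return best_weights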

Perceptrons have recently been utilized in scenarios with class imbalance and in the field of NLP.

Resources

For further reading, you may consult these resources

Do share your comments/feedback with me on Twitter: @rghv_bali