I finally got around to building a neural network (NN), specifically a multilayer perceptron (MLP), to recognize handwritten digits. This is a classic beginner’s NN problem. In this post, I will focus on comparing the NN’s predictions under different hyper-parameters.
I used the MNIST database for this problem. MNIST contains 70,000 images of handwritten digits. The data is well prepared: every image is grayscale, 28x28 pixels, and has the digit centered in the frame, which makes this a “relatively” easy problem to tackle.
As you can see from the serif on the 1, handwritten versions of the same digit differ slightly from one another, and classifying them correctly despite that is the gist of the problem. The NN sees each image as a grayscale bitmap, i.e. a grid of pixel values from 0 to 255.
The NN Architecture
I built a 2-layer NN with 784 input nodes, one per pixel (28x28 = 784). Each 28x28 image is flattened into a single row of 784 values because an MLP expects a one-dimensional input vector rather than a 2D grid. The NN has 10 output nodes, one for each digit (0-9). The final activation function is a softmax, so the outputs can be read as the probability of the image being a 0, 1, …, 9. The architecture looks like the following:
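The two steps described above, flattening and softmax, can be sketched in a few lines of plain Python. This is an illustration only, not the Keras model from the experiments: the image, the `scores` list, and the helper names `flatten` and `softmax` are all made up for the example.

```python
import math

def flatten(image):
    """Turn a 28x28 list of rows into one flat list of 784 values."""
    return [pixel for row in image for pixel in row]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# A made-up 28x28 "image": all black (0) except one white pixel (255).
image = [[0] * 28 for _ in range(28)]
image[14][14] = 255

flat = flatten(image)  # 784 values, one per input node

# Made-up raw outputs of the 10 output nodes (one per digit 0-9).
scores = [0.1, 2.0, 0.3, 0.0, -1.0, 0.5, 0.2, 0.0, 1.0, -0.5]
probs = softmax(scores)

# The predicted digit is the output node with the highest probability.
predicted_digit = probs.index(max(probs))
```

With these made-up scores the largest value sits at index 1, so the predicted digit is 1. In a real network the scores come from the trained weights, not from a hand-written list.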
The Impact of the Hyper-Parameters
I ran about 10 experiments to see the impact of the hyper-parameters on the prediction capabilities of the NN.
The boxes in green show the parameters that were tuned.
The accuracy w/o training is the accuracy of the network before any training. As you can see, most of the untrained NNs scored close to 10%; that is, the NN might as well be guessing the answer (with 10 digits, 0-9, a random guess has a 1/10 chance of being right).
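The 10% baseline is easy to check with a small simulation. The sketch below does not touch MNIST or a network at all; it just compares random guesses against random labels, and the function name `guess_accuracy` and the sample count are assumptions chosen for the example.

```python
import random

def guess_accuracy(num_samples, num_classes=10, seed=0):
    """Accuracy of uniformly random guesses against random labels."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    labels = [rng.randrange(num_classes) for _ in range(num_samples)]
    guesses = [rng.randrange(num_classes) for _ in range(num_samples)]
    correct = sum(1 for label, guess in zip(labels, guesses) if label == guess)
    return correct / num_samples

# With 10 classes the result should land close to 0.10.
accuracy = guess_accuracy(10_000)
print(f"accuracy of random guessing: {accuracy:.3f}")
```

An untrained NN starts with random weights, so seeing roughly the same number from it is a sign that it has not learned anything yet, not that something is broken.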
The prediction on test data is how the NN performed on the test data after being trained on the training data. This is what we are after.
The validation error measures how far the NN’s predictions are from the actual answers on the validation data after every pass through the training data (an epoch). This is a key number to watch if you want to make sure that the NN is not overfitting. I think of overfitting as the student who gets the paper before the exam: he is extremely well prepared for that exam, but when he goes out into the real world, he is in trouble.
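One common way to spot overfitting is to look for the epoch where the validation error stops falling while the training error keeps going down. Here is a minimal sketch of that check. The two error lists are invented numbers for illustration, not results from my experiments, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_errors):
    """Return the 0-based epoch with the lowest validation error."""
    return min(range(len(val_errors)), key=lambda i: val_errors[i])

# Invented per-epoch errors: training keeps improving, validation turns around.
train_error = [0.90, 0.45, 0.30, 0.22, 0.17, 0.13, 0.10, 0.08]
val_error = [0.95, 0.50, 0.36, 0.31, 0.30, 0.32, 0.35, 0.39]

stop_at = best_epoch(val_error)

# Epochs after the best one where training still improves but validation
# gets worse: the student is memorizing the leaked exam paper.
overfitting_epochs = [
    epoch
    for epoch in range(stop_at + 1, len(val_error))
    if val_error[epoch] > val_error[stop_at]
    and train_error[epoch] < train_error[stop_at]
]
```

For these invented numbers the validation error bottoms out at epoch 4, and epochs 5 to 7 show the overfitting pattern.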
- Increasing the batch size, i.e. the number of rows of data the NN digests at a time, increases the speed of training.
- Changing the activation function has a sizable impact on the accuracy of the NN. Sigmoids have fallen out of favor, and we can see that the NN lost about 1 percent accuracy with them.
- My hypothesis that I could increase the accuracy of the NN by increasing the number of nodes or the depth of the NN did not hold up in these experiments. I am not quite sure why.
- Changing the optimizer has the biggest impact on the performance of the NN. This isn’t surprising at all. Optimizers are the functions that carry out the gradient descent that converges on a solution, and gradient descent is what makes an NN work. Choosing an inefficient optimizer is going to nuke your results.
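To get a feel for why the optimizer matters so much, here is a toy comparison of two ways of doing gradient descent on a one-variable loss, (w - 3)^2. These are not the Keras optimizers from my experiments. The loss, the learning rate, the step count, and the function names are all assumptions picked to keep the sketch small.

```python
def gradient(w, target=3.0):
    """Gradient of the toy loss (w - target)**2."""
    return 2.0 * (w - target)

def plain_descent(steps, learning_rate):
    """Plain gradient descent: step directly against the gradient."""
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

def momentum_descent(steps, learning_rate, momentum=0.9):
    """Gradient descent with momentum: keep a running velocity."""
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * gradient(w)
        w += velocity
    return w

# Same starting point, learning rate, and number of steps for both.
w_plain = plain_descent(steps=100, learning_rate=0.01)
w_momentum = momentum_descent(steps=100, learning_rate=0.01)

error_plain = abs(w_plain - 3.0)        # distance from the true minimum
error_momentum = abs(w_momentum - 3.0)
print(f"plain: {error_plain:.4f}  momentum: {error_momentum:.4f}")
```

With the same budget of 100 steps, the momentum version ends up much closer to the minimum at w = 3 than the plain version. A real NN has far more weights and a far bumpier loss, but the idea is the same: how you descend decides how far you get in the time you have.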
Finding the right hyper-parameters for an NN is key to its performance, and finding them comes down to experiments rather than theory.
I am starting to enjoy NNs with libraries such as Keras. Building a simpler NN in plain Python was like getting a root canal :-) and I don’t think I could have tackled MNIST that way.
I am beginning to be dazzled by NNs. If you had told me a week earlier that I could sit down and write a program to read handwritten numbers and predict what they were, I wouldn’t have believed you. And here I started off by saying this problem was “relatively” easy to solve. Amazing!
Disclaimer: Most of the source code was provided by Udacity; I filled in the key NN architecture. Here is the complete source from Udacity.