# Neural Networks

## Recognizing Digits in the MNIST Dataset

***** This material is difficult but not impossible.
Be patient. Try more than once. Understanding deep learning is the reward. *****

Neural networks are computational devices whose structure is inspired by the way neurons work in the brain (although they do not necessarily work in exactly the same way).

A neuron processes and transmits information. In the human brain, there are about 85 billion neurons. A typical neuron consists of a cell body, dendrites, and an axon. The dendrites take input from other neurons in the form of electrical impulses. The cell body processes these inputs, and the axon terminals transmit outputs in the form of an electrical impulse.

A *perceptron* is an artificial neuron. It takes binary inputs and computes a binary output.
The computation involves weights and a threshold value.
If the weighted sum **Σ_{j} w_{j}x_{j}** is greater than the threshold value, then the output is **1**. Otherwise, it is **0**. The computation is typically expressed in terms of the dot product **w · x** (= **w^{T}x** = **Σ_{j} w_{j}x_{j}**), where the weights form a **1×n** matrix (**w^{T}**) and the inputs form an **n×1** matrix (**x**). Further, the threshold is moved to the other side of the inequality and expressed as the perceptron's **bias**, **b** (= **-threshold**). In these terms, the value of the output activation function is **1** if **w · x + b > 0** and is **0** otherwise.

Perceptrons can implement logic functions. Conjunction (φ ∧ ψ) is an example. Let the perceptron have two inputs, a weight of 0.6 each, and a threshold value of 1. If both inputs are 1, the sum exceeds the threshold and thus the output is 1. Otherwise, the output is 0. These conditions for activating the perceptron match the truth-table for conjunction (∧).
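
The conjunction example can be checked with a few lines of code. The following is a minimal sketch in plain Python 3 (the sessions in these notes use Python 2.7). The function name `perceptron` and the constants are illustrative names, not part of the program discussed later.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# Weights and threshold taken from the conjunction example in the text.
AND_WEIGHTS = [0.6, 0.6]
AND_THRESHOLD = 1.0

# Build the truth table for every combination of binary inputs.
truth_table = {
    (x1, x2): perceptron([x1, x2], AND_WEIGHTS, AND_THRESHOLD)
    for x1 in (0, 1)
    for x2 in (0, 1)
}

for inputs, output in sorted(truth_table.items()):
    print(inputs, "->", output)
```

Only the input (1, 1) gives a weighted sum (1.2) above the threshold, so only that row of the table is 1.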

Artificial neurons may be linked together in a *feedforward* network in which the output from one layer
is the input to the next layer. The first layer is the input layer of neurons. The last layer is the output layer. The hidden layers are
the neurons that are neither input nor output neurons.

A feedforward network of artificial neurons may be understood as a device that makes "decisions about decisions." The first layer of neurons makes a "decision" by weighing the input, the next layer makes a "decision about the decision" of the prior layer, and so on.

# Sigmoid Neurons

A *sigmoid neuron* has an important feature a perceptron lacks: small changes
in the weights and bias cause only small changes in the output. This allows sigmoid neurons to "learn."

A sigmoid neuron has the same parts as a perceptron (inputs, weights, and a bias), but the inputs are not binary. In a sigmoid neuron,
the inputs may take on any value between **0** and **1**. The output is not binary either.
Instead, it is **f(w · x + b)**,
where the *activation function* **f** is the sigmoid function.

The *sigmoid function* is σ(*x*) =
` 1/(1 + e^-x)`

As the activation function, the sigmoid function maps **w · x + b** to a smooth curve that
also preserves desirable features of the activation function for perceptrons. When **w · x + b** is a large positive
number, the output is close to **1** because the term `e^-x` in the denominator is close to **0**.
When **w · x + b** is a large negative
number, the output is close to **0** because `e^-x` grows without bound.

(The sigmoid function is sometimes called the *logistic* function, and
sigmoid neurons are sometimes called *logistic neurons*.)
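
The behavior described above can be seen numerically. The following is a minimal sketch in plain Python 3. The input values, weights, and bias are illustrative assumptions, not values from these notes.

```python
import math

def sigmoid(z):
    """The logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(inputs, weights, bias):
    """Output of a sigmoid neuron: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Illustrative values (assumptions chosen for this sketch).
inputs = [0.5, 0.9]
weights = [0.6, 0.6]
bias = -1.0

before = sigmoid_neuron(inputs, weights, bias)
# Nudge the first weight by 0.001; the output should move only slightly.
after = sigmoid_neuron(inputs, [0.601, 0.6], bias)

print("sigmoid(-10) =", sigmoid(-10.0))  # close to 0
print("sigmoid(0)   =", sigmoid(0.0))    # exactly 0.5
print("sigmoid(10)  =", sigmoid(10.0))   # close to 1
print("change in output:", after - before)
```

A perceptron with the same weights could flip from 0 to 1 on such a nudge. The sigmoid neuron's output changes by a small fraction of the change in the weight.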

# A Network to Classify Digits

The MNIST data set contains scanned images of handwritten digits. (MNIST is a modified subset of two data sets collected by the National Institute of Standards and Technology (NIST).) The images are greyscale and 28 by 28 pixels in size. They are split into 60,000 training images and 10,000 test images.

The input to each neuron in the input layer is one pixel from the input image. Since each image is 28 x 28 pixels, the input layer has 784 neurons (28 x 28). In the original MNIST data set, the images are in greyscale (where 0 is black, 255 is white, and values in between are shades of gray). To make the data set convenient to use in a Python program, each image takes the form of a one-dimensional NumPy array of 784 values between 0 and 1 (where 0 is black, 1 is white, and values in between are shades of gray). (NumPy is the fundamental package for scientific computing with Python.)

The output layer has 10 neurons. The first neuron indicates whether the image is a **0**, the second
whether the image is a **1**, and so on.

**Minimizing the Error Function**

This network needs to be "trained" to classify the digits correctly. The error in a network is a function of its weights and biases. Training a network is a matter of finding weights and biases that minimize the value of this function. Finding these weights and biases is a matter of descending along the gradient of the function.

To get some insight into the general idea, consider the function `f(x,y) = x^2y`.

The gradient (`gradf`) is the vector of partial derivatives

`[(delf)/(delx)(x,y) = 2xy, (delf)/(dely)(x,y) = x^2]`.

This vector points in the direction the function increases most rapidly. If the starting-point is `(2,2)`, the direction of steepest ascent is toward

`gradf(2,2) = ((delf)/(delx)(2,2) = 8, (delf)/(dely)(2,2) = 4) = (8,4) `

In training a neural network, the goal is to reduce the value of the function. If we step down from `(2,2)` with step size `eta = 0.5`, we arrive at

`(2 - eta(delf)/(delx)(2,2), 2 - eta(delf)/(dely)(2,2)) = (-2, 0)`

The code to plot the function is in the computer language Python (2.7).

    % python2
    Python 2.7.12 (default, Nov 7 2016, 11:55:55)
    [GCC 6.2.1 20160830] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np  # NumPy (the fundamental package for scientific computing with Python)
    >>> import matplotlib.pyplot as plt  # Matplotlib: Python plotting
    >>> from mpl_toolkits.mplot3d import Axes3D  # The mplot3d Toolkit
    >>>
    >>> def fun(x, y): return x**2 * y
    ...
    >>> fig = plt.figure()
    >>> ax = fig.add_subplot(111, projection='3d')
    >>> x = y = np.arange(-10.0, 10.0, 0.05)
    >>> X, Y = np.meshgrid(x, y)
    >>> zs = np.array([fun(x,y) for x,y in zip(np.ravel(X), np.ravel(Y))])
    >>> Z = zs.reshape(X.shape)
    >>> ax.plot_surface(X, Y, Z, cmap="hot")
    <mpl_toolkits.mplot3d.art3d.Poly3DCollection object at 0x7fad580d5fd0>
    >>> ax.set_xlabel('X Label')
    Text(0.5,0,'X Label')
    >>> ax.set_ylabel('Y Label')
    Text(0.5,0,'Y Label')
    >>> ax.set_zlabel('Z Label')
    Text(0.5,0,'Z Label')
    >>> plt.show()
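
The descent step worked out above for `f(x,y) = x^2y` can also be checked numerically. The following is a minimal sketch in plain Python 3. The helper names `grad_f` and `descend` are illustrative, not part of the program discussed below.

```python
def f(x, y):
    return x ** 2 * y

def grad_f(x, y):
    """Partial derivatives of f: (2xy, x^2)."""
    return (2 * x * y, x ** 2)

def descend(point, eta):
    """Take one step of size eta against the gradient."""
    x, y = point
    gx, gy = grad_f(x, y)
    return (x - eta * gx, y - eta * gy)

start = (2.0, 2.0)
step = descend(start, eta=0.5)

print("gradient at start:", grad_f(*start))  # (8.0, 4.0)
print("after one step:   ", step)            # (-2.0, 0.0)
print("f before and after:", f(*start), f(*step))
```

The value of the function drops from 8 at the starting point to 0 after the step.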

**An Example Image from the MNIST Data Set**

The image (of the handwritten numeral "5") is in **training_data**.

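**training_data** is a list of 50,000
2-tuples **(x, y)**. (The loader sets aside the other 10,000 of the 60,000 MNIST training images as **validation_data**.)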

**x** is a 784-dimensional array
containing the input image.

**y** is a 10-dimensional
array corresponding to the
label for the image.

**training_data[0][0]** is the **x** in the first tuple.

**training_data[0][1]** is the **y** in the first tuple.

    tom:arch [~/git/neural-networks-and-deep-learning/src] % python2
    Python 2.7.12 (default, Jun 28 2016, 08:31:05)
    [GCC 6.1.1 20160602] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import mnist_loader
    >>> training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
    >>> training_data[0][1].shape
    (10, 1)
    >>> training_data[0][1]
    array([[ 0.],
           [ 0.],
           [ 0.],
           [ 0.],
           [ 0.],
           [ 1.],
           [ 0.],
           [ 0.],
           [ 0.],
           [ 0.]])
    >>> training_data[0][0].shape
    (784, 1)
    >>> import numpy as np
    >>> image_array = np.reshape(training_data[0][0], (28, 28))
    >>> import matplotlib.pyplot as plt
    >>> image = plt.imshow(image_array, cmap ='gray')
    >>> plt.show()

# mnist_loader.py

The data set is from a tutorial on the website Deep Learning. The file is a "pickled" tuple of three lists. Each of the three lists is formed from a list of images and a list of labels. An image is represented as a one-dimensional NumPy array of 784 (28 x 28) float values between 0 and 1 (0 stands for black, 1 for white). The labels are numbers between 0 and 9 indicating which digit the image represents.

The function **load_data_wrapper()** returns **training_data**, **validation_data**, **test_data**.

**validation_data** and **test_data** are lists containing 10,000
2-tuples **(x, y)**. **x** is a 784-dimensional array
containing the input image. **y** is the label for the image.

    import cPickle  # Python object serialization
    import gzip     # gzip
    import numpy as np

    def load_data():
        f = gzip.open('../data/mnist.pkl.gz', 'rb')
        training_data, validation_data, test_data = cPickle.load(f)
        f.close()
        return (training_data, validation_data, test_data)

    def load_data_wrapper():
        tr_d, va_d, te_d = load_data()
        training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
        training_results = [vectorized_result(y) for y in tr_d[1]]
        training_data = zip(training_inputs, training_results)
        validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
        validation_data = zip(validation_inputs, va_d[1])
        test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
        test_data = zip(test_inputs, te_d[1])
        return (training_data, validation_data, test_data)

    def vectorized_result(j):
        e = np.zeros((10, 1))
        e[j] = 1.0
        return e

# The Rest of the Python Program

We will not try to understand the code (which belongs to Michael Nielsen) or the underlying algorithm in complete detail.

**The Network Class**

    class Network(object):

        def __init__(self, sizes):
            self.num_layers = len(sizes)
            self.sizes = sizes
            self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
            self.weights = [np.random.randn(y, x)
                            for x, y in zip(sizes[:-1], sizes[1:])]

We can use this class to create a neural network. (Python is an object-oriented programming language.) The instruction

    net = network.Network([2, 3, 1])

creates a neural network (net) whose input layer has two neurons, whose middle layer has three neurons, and whose output layer has one neuron.

    tom:arch [~/git/neural-networks-and-deep-learning/src] % python2
    Python 2.7.12 (default, Jun 28 2016, 08:31:05)
    [GCC 6.1.1 20160602] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import network  # import module
    >>> net = network.Network([2, 3, 1])  # create instance of class
    >>>

The biases and weights are set as random numbers. The input layer has no bias. Biases are only used in
computing the output from later layers.

For the [2, 3, 1] network,
the biases are in a 3 x 1 array and a 1 x 1 array.

    tom:arch [~/git/neural-networks-and-deep-learning/src] % python2
    Python 2.7.12 (default, Jun 28 2016, 08:31:05)
    [GCC 6.1.1 20160602] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import network
    >>> net = network.Network([2, 3, 1])
    >>> net.biases[0].shape
    (3, 1)
    >>> net.biases[0]
    array([[ 1.36630966],
           [ 1.05788544],
           [ 0.80606255]])
    >>> net.biases[1].shape
    (1, 1)
    >>> net.biases[1]
    array([[ 1.54813682]])
    >>>

For the [2, 3, 1] network, the weights are in a 3 x 2 array and a 1 x 3 array.

The first row in **net.weights[0]** contains the weights the first neuron in the hidden layer assigns to
the outputs of the first and second neurons in the input layer, respectively.

    >>> net.weights[0].shape
    (3, 2)
    >>> net.weights[0]
    array([[-0.27640848,  0.13942239],
           [ 1.13350606,  1.51767629],
           [-0.03836741,  0.06409297]])
    >>> net.weights[1].shape
    (1, 3)
    >>> net.weights[1]
    array([[-0.72105625,  1.76366748,  1.49408987]])
    >>>

**Stochastic (Mini-Batch) Gradient Descent**

For each epoch of training, the training data is randomly shuffled and partitioned into mini-batches. Once the last mini-batch has been processed, the network is evaluated against the test data.

    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [training_data[k:k+mini_batch_size]
                            for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)
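
The shuffle-and-partition step can be tried on its own with toy data. The following is a minimal sketch in plain Python 3 (so `range` rather than `xrange`). The helper `make_mini_batches` and its `rng` parameter are illustrative names introduced here, and the sketch shuffles a copy rather than shuffling in place as **SGD** does.

```python
import random

def make_mini_batches(training_data, mini_batch_size, rng=random):
    """Shuffle a copy of the data and cut it into consecutive mini-batches."""
    shuffled = list(training_data)
    rng.shuffle(shuffled)
    return [shuffled[k:k + mini_batch_size]
            for k in range(0, len(shuffled), mini_batch_size)]

# Toy "training data": the integers 0..24 stand in for (x, y) tuples.
batches = make_mini_batches(range(25), mini_batch_size=10,
                            rng=random.Random(0))

for batch in batches:
    print(len(batch), batch)
```

When the number of examples is not a multiple of the mini-batch size, the last mini-batch is simply shorter. Every example still appears exactly once per epoch.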

The method **update_mini_batch** updates the weights and biases in the network.
It calculates
the gradient for each input in the mini-batch.
(The "nabla" is the inverted Greek capital delta. `gradf` is the *gradient* of the function `f`.)
Given the learning rate and the average of these gradients over the inputs in the mini-batch, it updates the weights and biases in the network.

    def update_mini_batch(self, mini_batch, eta):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

The method **update_mini_batch** uses the method **backprop** to compute the gradient.

The **backprop** method has two parts.

In **#feedforward**,
it feeds the training input (**x**) forward through the network. It stores the **zs** (the weighted inputs) and **activations**
layer by layer.

In **#backward pass**, it uses the **zs** and **activations**
to compute the *gradient* (`grad`) of the error function
at the current weights and biases.

    def backprop(self, x, y):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        #
        # feedforward
        #
        activation = x
        activations = [x]
        zs = []
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        #
        # The first time through the loop the activation is the input to the network
        # and w and b are the weights and biases the second layer imposes on this
        # input. In the [784, 30, 10] network, the input image is a 784x1 array and w
        # and b are a 30x784 array and a 30x1 array. The weighted sum input to the second
        # layer is stored in the array zs. The output of the second layer is stored in
        # the array activations.
        #
        # backward pass
        #
        delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])  # fundamental equation 1
        nabla_b[-1] = delta  # fundamental equation 3
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())  # fundamental equation 4
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp  # fundamental equation 2
            nabla_b[-l] = delta  # fundamental equation 3
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())  # fundamental equation 4
        return (nabla_b, nabla_w)

    def cost_derivative(self, output_activations, y):
        # derivative of the total error in the network
        return (output_activations-y)

When the pattern `x_i` from the training set is fed through the network, it produces
an output `o_i` different in general from the target `t_i`. The ideal is to make `o_i = t_i`.
That is to say, we want to minimize the
total error in the network

`E = 1/2sum_(i=1) norm(o_i - t_i)^2`

Q: Why the constant `1/2` in `E`?

A: To cancel the exponent when differentiating.

Since the learning rate is arbitrary, the introduction of a constant does not matter.
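
The role of the constant `1/2` can be checked numerically: with it, the derivative of `E` with respect to an output component is simply `o - t`, which is what **cost_derivative** returns. The following is a minimal sketch in plain Python 3. The output and target values are illustrative assumptions, and `quadratic_error` is a name introduced here.

```python
def quadratic_error(outputs, targets):
    """E = 1/2 * sum_i ||o_i - t_i||^2 over lists of output and target vectors."""
    return 0.5 * sum(
        (o - t) ** 2
        for o_vec, t_vec in zip(outputs, targets)
        for o, t in zip(o_vec, t_vec)
    )

def cost_derivative(output, target):
    """Derivative of E with respect to each output component: o - t."""
    return [o - t for o, t in zip(output, target)]

# Illustrative output and target for a single training pattern.
output = [0.8, 0.1]
target = [1.0, 0.0]

# Central finite differences as an independent check on the derivative.
h = 1e-6
numeric = []
for i in range(len(output)):
    up = list(output)
    up[i] += h
    down = list(output)
    down[i] -= h
    numeric.append(
        (quadratic_error([up], [target]) - quadratic_error([down], [target]))
        / (2 * h)
    )

analytic = cost_derivative(output, target)
print("analytic:", analytic)
print("numeric: ", numeric)
```

Without the `1/2`, each derivative would carry an extra factor of 2, which could equally be absorbed into the learning rate.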

    def sigmoid(z):
        return 1.0/(1.0+np.exp(-z))

    def sigmoid_prime(z):
        # derivative of the sigmoid function
        return sigmoid(z)*(1-sigmoid(z))

**The Four Fundamental Equations (** under construction **)**

`delta_l^j` (the error in the `j^(th)` neuron in the `l^(th)` layer) is defined as `(delE)/(delz_l^j)`, where

`z_l^j = sum_k w_l^(jk) a_(l-1)^k + b_l^j`, for neuron `k` in the `(l-1)^(th)` layer

`w_l^(jk)` is the weight for the connection from the `k^(th)` neuron in the `(l - 1)^(th)` layer
to the `j^(th)` neuron in the `l^(th)` layer

`a_l^j` (the activation of the `j^(th)` neuron in the `l^(th)` layer) `= sigma(sum_k w_l^(jk) a_(l-1)^k + b_l^j)`, for neuron `k` in the `(l-1)^(th)` layer

`b_l^j` is the bias of the `j^(th)` neuron in the `l^(th)` layer

**1.** `delta_L^j`(the error in the `j^(th)` neuron in the output layer, `L`) = `(delE)/(dela_L^j)sigma'(z_L^j)`. The proof
is as follows:

By definition, `delta_L^j = (delE)/(delz_L^j)`.

Since `z_L^j` can affect the error only through the activations `a_L^k` of the output layer (where `k in L`), it follows by the chain rule that

`(delE)/(delz_L^j) = sum_k^(k in L)(delE)/(dela_L^k) (dela_L^k)/(delz_L^j)`

Since `a_L^k` depends on `z_L^j` only when `k=j`, it follows that

`sum_k^(k in L)(delE)/(dela_L^k) (dela_L^k)/(delz_L^j) = (delE)/(dela_L^j) (dela_L^j)/(delz_L^j)`

Finally, because `a_L^j = sigma(z_L^j)`, it follows that

`(delE)/(dela_L^j) (dela_L^j)/(delz_L^j) = (delE)/(dela_L^j)sigma'(z_L^j)`

**2.** `delta_l^j = sum_k w_(l+1)^(kj) delta_(l+1)^k sigma'(z_l^j)`, where `w_(l+1)^(kj)` is the weight for the connection from the `j^(th)` neuron in the `l^(th)` layer to the `k^(th)` neuron in the `(l+1)^(th)` layer. The proof is as follows:

By definition, `delta_l^j = (delE)/(delz_l^j)`.

By the chain rule,

`delta_l^j = (delE)/(delz_l^j) = sum_k (delz_(l+1)^k)/(delz_l^j) (delE)/(delz_(l+1)^k)`

By the definition of `delta`,

`delta_l^j = (delE)/(delz_l^j) = sum_k (delz_(l+1)^k)/(delz_l^j) (delE)/(delz_(l+1)^k) = sum_k (delz_(l+1)^k)/(delz_l^j) delta_(l+1)^k`

Since

`z_(l+1)^k = sum_m w_(l+1)^(km) a_l^m + b_(l+1)^k = sum_m w_(l+1)^(km) sigma(z_l^m) + b_(l+1)^k`

it follows that

`(delz_(l+1)^k)/(delz_l^j) = w_(l+1)^(kj) sigma'(z_l^j)`

Hence,

`delta_l^j = (delE)/(delz_l^j) = sum_k (delz_(l+1)^k)/(delz_l^j) (delE)/(delz_(l+1)^k) = sum_k (delz_(l+1)^k)/(delz_l^j) delta_(l+1)^k = sum_k w_(l+1)^(kj) delta_(l+1)^k sigma'(z_l^j)`

**3.** `(delE)/(delb_l^j) = delta_l^j`. The proof is as follows:

By the chain rule,

`(delE)/(delb_l^j) = (delz_l^j)/(delb_l^j) (delE)/(delz_l^j)`

By the definition of `delta`,

`(delE)/(delb_l^j) = (delz_l^j)/(delb_l^j) (delE)/(delz_l^j) = (delz_l^j)/(delb_l^j)delta_l^j`

Since

`z_l^j = sum_k w_l^(jk) a_(l-1)^k + b_l^j`

it follows that

`(delz_l^j)/(delb_l^j) = 1`

Hence

`(delE)/(delb_l^j) = (delz_l^j)/(delb_l^j) (delE)/(delz_l^j) = (delz_l^j)/(delb_l^j)delta_l^j = delta_l^j`

**4.** `(delE)/(delw_l^(jk)) = a_(l-1)^kdelta_l^j`. The proof is as follows:

By the chain rule,

`(delE)/(delw_l^(jk)) = (delz_l^j)/(delw_l^(jk)) (delE)/(delz_l^j)`

By the definition of `delta`,

`(delE)/(delw_l^(jk)) = (delz_l^j)/(delw_l^(jk)) (delE)/(delz_l^j) = (delz_l^j)/(delw_l^(jk))delta_l^j`

Since

`z_l^j = sum_k w_l^(jk) a_(l-1)^k + b_l^j`

it follows that

`(delz_l^j)/(delw_l^(jk)) = a_(l-1)^k`

Hence

`(delE)/(delw_l^(jk)) = (delz_l^j)/(delw_l^(jk)) (delE)/(delz_l^j) = (delz_l^j)/(delw_l^(jk))delta_l^j = a_(l-1)^k delta_l^j`
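
Equations 1, 3, and 4 can be sanity-checked on the smallest possible case: one sigmoid neuron with one input, one weight, and one bias. The following is a minimal sketch in plain Python 3 that compares the analytic gradients with finite differences. The numeric values and function names are illustrative assumptions, not part of the program discussed in these notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def error(w, b, a_in, t):
    """Quadratic error E = 1/2 (sigmoid(w * a_in + b) - t)^2 for one neuron."""
    return 0.5 * (sigmoid(w * a_in + b) - t) ** 2

def analytic_gradients(w, b, a_in, t):
    z = w * a_in + b
    delta = (sigmoid(z) - t) * sigmoid_prime(z)  # equation 1
    grad_b = delta                               # equation 3
    grad_w = a_in * delta                        # equation 4
    return grad_w, grad_b

def numeric_gradients(w, b, a_in, t, h=1e-6):
    """Central finite differences, used only as an independent check."""
    grad_w = (error(w + h, b, a_in, t) - error(w - h, b, a_in, t)) / (2 * h)
    grad_b = (error(w, b + h, a_in, t) - error(w, b - h, a_in, t)) / (2 * h)
    return grad_w, grad_b

# Illustrative weight, bias, input activation, and target.
w, b, a_in, t = 0.5, -0.2, 0.8, 1.0
print("analytic:", analytic_gradients(w, b, a_in, t))
print("numeric: ", numeric_gradients(w, b, a_in, t))
```

The two pairs of numbers agree to many decimal places. Equation 2 does not arise here because the network has no hidden layer.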

**The evaluate method**

The **evaluate** method returns the number of test inputs for which the
network outputs the correct result.

(Note that the output is the index of whichever neuron in the final layer has the highest activation.)

    def evaluate(self, test_data):
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def feedforward(self, a):
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

The *feedforward* method returns the output of the network given the input.

Consider the initial iteration for the [2, 3, 1] network. The input array is 2x1. The weights array
for the hidden layer is 3x2. The dot product is a 3x1 array. The biases array on the hidden layer is 3x1, so the sum passed to the sigmoid function is also 3x1.
When the input is an array,
**NumPy** automatically applies the sigmoid function elementwise.
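
The shapes in this iteration can be traced without NumPy. The following is a minimal sketch in plain Python 3 that uses nested lists in place of arrays. The all-zero weights and biases are an assumption chosen so that every activation is easy to predict, and `layer_output` is a helper name introduced here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_output(w, b, a):
    """One layer: w is a list of rows, b and a are lists (column vectors)."""
    return [sigmoid(sum(w_jk * a_k for w_jk, a_k in zip(row, a)) + b_j)
            for row, b_j in zip(w, b)]

def feedforward(a, weights, biases):
    """Feed the input a through every layer in turn."""
    for w, b in zip(weights, biases):
        a = layer_output(w, b, a)
    return a

# A [2, 3, 1] network with all-zero parameters (an illustrative assumption).
weights = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # hidden layer: 3 x 2
    [[0.0, 0.0, 0.0]],                     # output layer: 1 x 3
]
biases = [[0.0, 0.0, 0.0], [0.0]]

x = [1.0, 0.0]                             # input: 2 values
hidden = layer_output(weights[0], biases[0], x)
output = feedforward(x, weights, biases)

print("hidden activations:", hidden)       # 3 values
print("network output:    ", output)       # 1 value
```

With zero weights and biases every weighted input is 0, so every activation is sigmoid(0) = 0.5.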

# The [784,30,10] Network in Action

The network has 784 neurons in the input layer, 30 in the hidden layer, and 10 in the output layer.

The code uses mini-batch, stochastic gradient descent to learn from the MNIST training_data over 30 epochs. The mini-batch size is 10. The learning rate (η) is 3.0.

After the network is trained, a random image is tested.

    tom:arch [~/git/neural-networks-and-deep-learning/src] % python2
    Python 2.7.12 (default, Nov 7 2016, 11:55:55)
    [GCC 6.2.1 20160830] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import mnist_loader
    >>> training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
    >>> import network
    >>> net = network.Network([784, 30, 10])
    >>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
    Epoch 0: 8268 / 10000
    Epoch 1: 8393 / 10000
    Epoch 2: 8422 / 10000
    Epoch 3: 8466 / 10000
    Epoch 4: 9321 / 10000
    Epoch 5: 9385 / 10000
    Epoch 6: 9383 / 10000
    Epoch 7: 9391 / 10000
    Epoch 8: 9392 / 10000
    Epoch 9: 9422 / 10000
    Epoch 10: 9423 / 10000
    Epoch 11: 9427 / 10000
    Epoch 12: 9462 / 10000
    Epoch 13: 9480 / 10000
    Epoch 14: 9453 / 10000
    Epoch 15: 9474 / 10000
    Epoch 16: 9466 / 10000
    Epoch 17: 9447 / 10000
    Epoch 18: 9488 / 10000
    Epoch 19: 9501 / 10000
    Epoch 20: 9481 / 10000
    Epoch 21: 9487 / 10000
    Epoch 22: 9493 / 10000
    Epoch 23: 9461 / 10000
    Epoch 24: 9485 / 10000
    Epoch 25: 9454 / 10000
    Epoch 26: 9503 / 10000
    Epoch 27: 9497 / 10000
    Epoch 28: 9495 / 10000
    Epoch 29: 9478 / 10000
    >>> import numpy as np
    >>> imgnr = np.random.randint(0,10000)
    >>> prediction = net.feedforward( test_data[imgnr][0] )
    >>> print("Image number {0} is a {1}, and the network predicted a {2}".format(imgnr, test_data[imgnr][1], np.argmax(prediction)))
    Image number 4709 is a 2, and the network predicted a 2
    >>> import matplotlib.pyplot as plt
    >>> fig, ax = plt.subplots(1,2,figsize=(8,4))
    >>> ax[0].matshow( np.reshape(test_data[imgnr][0], (28,28) ), cmap='gray' )
    >>> ax[1].plot( prediction, lw=3 )
    >>> ax[1].set_aspect(9)
    >>> plt.show()

# Convolutional Neural Networks

The layers in a convolutional neural network are not fully-connected. This allows them to be sensitive to spatial structure.

These images come from the explanation of convolutional neural networks in Neural Networks and Deep Learning.

Convolutional neural networks have *convolutional layers*.

Each neuron
in the first hidden layer is connected to a small region of the input image. This region in the input image is
the *local receptive field* for the hidden neuron. The next neuron in the hidden layer is connected to a
local receptive field that overlaps with the previous field. The extent of the overlap is determined
by the *stride length*.

The sizes of the input image and the local receptive field, together with the stride length, determine the size of the first hidden layer. If the input image is 28 x 28, the local receptive field is 5 x 5, and the stride length is 1, then the first hidden layer is 24 x 24 (the field can occupy 28 - 5 + 1 = 24 positions in each direction).
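
The size calculation can be written as a small function. The following is a minimal sketch in plain Python 3, assuming the receptive field must lie entirely inside the image (no padding). The function names are illustrative.

```python
def conv_output_size(image_size, field_size, stride=1):
    """Number of receptive-field positions along one dimension (no padding)."""
    return (image_size - field_size) // stride + 1

def field_origins(image_size, field_size, stride=1):
    """Starting pixel of each receptive field along one dimension."""
    return list(range(0, image_size - field_size + 1, stride))

size = conv_output_size(28, 5, stride=1)
print("hidden layer is {0} x {0}".format(size))         # 24 x 24
print("first origins:", field_origins(28, 5)[:3])       # [0, 1, 2]
print("last origin:  ", field_origins(28, 5)[-1])       # 23
```

With a stride length of 1, neighboring receptive fields share all but one column (or row) of pixels. A larger stride reduces both the overlap and the size of the hidden layer.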

Each hidden neuron has a bias and a set of weights. If the local receptive field is 5 x 5, then the hidden neuron has a 5 x 5 set of weights. Moreover, the biases and weights are the same for each neuron in the hidden layer. So the neurons in the first hidden layer detect the same input pattern or feature, no matter where it is in the image.

We can think of the hidden layer as consisting of a set of feature maps. If a feature map is 24 x 24, then a hidden layer consisting of 2 x 24 x 24 neurons consists of two maps and can detect two features.

Convolutional neural networks also have *pooling layers*.

Pooling layers summarize the information in
a region of a feature map. (There are different forms of pooling. In *max pooling*, the pooling neuron outputs
the maximum of the region in the feature map. In *L2 pooling*, the neuron outputs
the square root of the sum of the squares of the activations in the region.) If the hidden layer is 24 x 24, and the region to
summarize is 2 x 2, then the pooling layer is 12 x 12.
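
Both forms of pooling can be illustrated on a small feature map. The following is a minimal sketch in plain Python 3, assuming non-overlapping square regions. The 4 x 4 feature map is made-up data, and `pool` and `l2` are names introduced here.

```python
import math

def pool(feature_map, size, reduce_fn):
    """Summarize each non-overlapping size x size region with reduce_fn."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [reduce_fn([feature_map[r + dr][c + dc]
                    for dr in range(size) for dc in range(size)])
         for c in range(0, cols - size + 1, size)]
        for r in range(0, rows - size + 1, size)
    ]

def l2(values):
    """Square root of the sum of the squares of the activations."""
    return math.sqrt(sum(v * v for v in values))

# A made-up 4 x 4 feature map.
feature_map = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 3, 0],
    [0, 0, 0, 4],
]

max_pooled = pool(feature_map, 2, max)
l2_pooled = pool(feature_map, 2, l2)
print("max pooling:", max_pooled)
print("L2 pooling: ", l2_pooled)

# A 24 x 24 map pooled over 2 x 2 regions gives a 12 x 12 map.
big = [[0.0] * 24 for _ in range(24)]
print(len(pool(big, 2, max)), "x", len(pool(big, 2, max)[0]))
```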

In the first example (Conv architecture), the input of the convolutional neural network is 28 x 28. The next layer in the network is the convolutional layer. In the first example, it uses a 5 x 5 local receptive field and 3 feature maps. So the convolutional layer is 3 x 24 x 24. The pooling layer is next in the network. The size of the region summarized in the feature maps is 2 x 2. So the pooling layer is 3 x 12 x 12. The final layer is fully-connected. Every neuron in the pooling layer is connected to every one of the 10 output neurons.

In the second example (Conv + FC architecture), the convolutional neural network is more complicated. There are 20 feature maps. In addition, the output layer is a *softmax layer*. Further, there is
a fully-connected layer between the pooling layer and the softmax layer.

In a softmax layer, the softmax function (not the sigmoid function) is applied to get the activation. The output of the softmax function is a probability distribution. So `a_L^j` is the probability that the digit the image represents is `j`.
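
The softmax function itself is short. The following is a minimal sketch in plain Python 3. The weighted inputs are illustrative assumptions, and subtracting the maximum before exponentiating is a common numerical precaution that does not change the result.

```python
import math

def softmax(zs):
    """Map a list of weighted inputs to a probability distribution."""
    m = max(zs)                              # shift for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative weighted inputs for three output neurons.
probs = softmax([1.0, 2.0, 3.0])
print(probs)
print("sum:", sum(probs))
print("most probable index:", probs.index(max(probs)))
```

Each output lies between 0 and 1, the outputs sum to 1, and the largest weighted input receives the largest probability.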

**The Python/Theano Program** (network3_tab.py)

In the following session, the convolutional neural network (**net**) has the
"Conv + FC architecture" represented in the second example.
The input is a 28 x 28 image from the MNIST dataset. The convolutional layer is
20 x 24 x 24. The pooling layer is 20 x 12 x 12. These layers are followed by a fully-connected layer
and a softmax output layer.

    % python2
    Python 2.7.12 (default, Nov 7 2016, 11:55:55)
    [GCC 6.2.1 20160830] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import network3_tab
    >>> from network3_tab import Network
    >>> from network3_tab import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
    >>> training_data, validation_data, test_data = network3_tab.load_data_shared()
    >>> mini_batch_size = 10
    >>> net = Network([
    ...     ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28), filter_shape=(20, 1, 5, 5)),
    ...     FullyConnectedLayer(n_in=20*12*12, n_out=100),
    ...     SoftmaxLayer(n_in=100, n_out=10)],
    ...     mini_batch_size)
    >>>

**Load the MNIST data**

The MNIST data is pickled as a tuple of three lists.
Each of the three lists is formed from a list of images and a list of labels.
The images and labels are stored in Theano shared variables so that the calculations
can be processed on the GPU. In GPU memory,
the data must be stored as floating point values. The program uses the labels as integers, so
**shared_y** is cast to an integer type before it is returned.

    def load_data_shared(filename="../data/mnist.pkl.gz"):
        f = gzip.open(filename, 'rb')
        training_data, validation_data, test_data = cPickle.load(f)
        f.close()
        def shared(data):
            shared_x = theano.shared(
                np.asarray(data[0], dtype=theano.config.floatX), borrow=True)
            shared_y = theano.shared(
                np.asarray(data[1], dtype=theano.config.floatX), borrow=True)
            return shared_x, T.cast(shared_y, "int32")
        return [shared(training_data), shared(validation_data), shared(test_data)]

**The ConvPoolLayer, FullyConnectedLayer, and SoftmaxLayer**

The first layer in **net** is really two layers: a convolutional layer and a max-pooling layer.

**ConvPoolLayer** initializes the weights using a Gaussian distribution with mean 0
and standard deviation 1 over the square root of the number of
weights connecting to the same neuron. (This helps prevent saturation.) It initializes the biases
using a Gaussian distribution with mean 0 and standard
deviation 1. It loads these weights and biases into shared variables.
The method **set_inpt** defines the algorithm for symbolically calculating
the output of the layer. It uses
theano.tensor.nnet.conv2d and
theano.tensor.signal.pool.pool_2d.
(Convolution arithmetic tutorial)

    class ConvPoolLayer(object):

        def __init__(self, filter_shape, image_shape, poolsize=(2, 2),
                     activation_fn=sigmoid):
            self.filter_shape = filter_shape
            self.image_shape = image_shape
            self.poolsize = poolsize
            self.activation_fn=activation_fn
            # initialize weights and biases
            n_out = (filter_shape[0]*np.prod(filter_shape[2:])/np.prod(poolsize))
            self.w = theano.shared(
                np.asarray(
                    np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), size=filter_shape),
                    dtype=theano.config.floatX),
                borrow=True)
            self.b = theano.shared(
                np.asarray(
                    np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)),
                    dtype=theano.config.floatX),
                borrow=True)
            self.params = [self.w, self.b]

        def set_inpt(self, inpt, mini_batch_size):
            self.inpt = inpt.reshape(self.image_shape)
            conv_out = conv.conv2d(
                input=self.inpt, filters=self.w, filter_shape=self.filter_shape,
                image_shape=self.image_shape)
            pooled_out = pool.pool_2d(
                input=conv_out, ds=self.poolsize, ignore_border=True)
            self.output = self.activation_fn(
                pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))

The other two layer classes (**FullyConnectedLayer** and **SoftmaxLayer**) are
similar to **ConvPoolLayer**. The primary difference is in the **set_inpt** method.

    class FullyConnectedLayer(object):

        def __init__(self, n_in, n_out, activation_fn=sigmoid):
            self.n_in = n_in
            self.n_out = n_out
            self.activation_fn = activation_fn
            # Initialize weights and biases
            self.w = theano.shared(
                np.asarray(
                    np.random.normal(
                        loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
                    dtype=theano.config.floatX),
                name='w', borrow=True)
            self.b = theano.shared(
                np.asarray(np.random.normal(loc=0.0, scale=1.0, size=(n_out,)),
                           dtype=theano.config.floatX),
                name='b', borrow=True)
            self.params = [self.w, self.b]

        def set_inpt(self, inpt, mini_batch_size):
            self.inpt = inpt.reshape((mini_batch_size, self.n_in))
            self.output = self.activation_fn(
                T.dot(self.inpt, self.w) + self.b)
            self.y_out = T.argmax(self.output, axis=1)

        def accuracy(self, y):
            return T.mean(T.eq(y, self.y_out))

The cost function in **SoftmaxLayer** is the negative log-likelihood function.

If x is the input to the network and y is the desired output, then the log-likelihood cost of x is `-ln a_L^y`, where `a_L^y` is the probability the softmax layer assigns to y. As this probability approaches 1, the cost approaches 0. As it approaches 0, the cost approaches infinity.

    class SoftmaxLayer(object):

        def __init__(self, n_in, n_out):
            self.n_in = n_in
            self.n_out = n_out
            # Initialize weights and biases
            self.w = theano.shared(
                np.zeros((n_in, n_out), dtype=theano.config.floatX),
                name='w', borrow=True)
            self.b = theano.shared(
                np.zeros((n_out,), dtype=theano.config.floatX),
                name='b', borrow=True)
            self.params = [self.w, self.b]

        def set_inpt(self, inpt, mini_batch_size):
            self.inpt = inpt.reshape((mini_batch_size, self.n_in))
            self.output = softmax(T.dot(self.inpt, self.w) + self.b)
            self.y_out = T.argmax(self.output, axis=1)

        def cost(self, net):
            # net.y.shape[0] is the number of the training examples in the minibatch (N)
            # T.arange(net.y.shape[0]) is a symbolic vector of integers [0,1,2,...,N-1]
            # T.log(self.output) is a NxK matrix, where in this case K = 10 (the number of digits 0..9)
            # T.log(self.output)[T.arange(net.y.shape[0]), net.y] is a vector of length N with the log-likelihoods of the labels
            # The mean is the average across the all the training examples in the minibatch
            return -T.mean(T.log(self.output)[T.arange(net.y.shape[0]), net.y])

        def accuracy(self, y):
            return T.mean(T.eq(y, self.y_out))
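
The behavior of the log-likelihood cost can be seen with ordinary numbers in place of Theano's symbolic variables. The following is a minimal sketch in plain Python 3. The probability vectors are made-up softmax outputs, and the function names are illustrative. `mean_cost` mirrors the averaging over the mini-batch in the **cost** method.

```python
import math

def log_likelihood_cost(output_probs, label):
    """Negative log of the probability assigned to the desired label."""
    return -math.log(output_probs[label])

def mean_cost(batch_outputs, labels):
    """Average the cost over the examples in a mini-batch."""
    costs = [log_likelihood_cost(p, y) for p, y in zip(batch_outputs, labels)]
    return sum(costs) / len(costs)

# Made-up softmax outputs for a three-class problem; the desired label is 1.
confident = [0.05, 0.90, 0.05]
unsure = [0.40, 0.30, 0.30]

print("cost when confident and right:", log_likelihood_cost(confident, 1))
print("cost when unsure:             ", log_likelihood_cost(unsure, 1))
print("mean over the two examples:   ", mean_cost([confident, unsure], [1, 1]))
```

The cost is small when the network assigns high probability to the desired output and grows as that probability shrinks.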

**The Network Class**

The **Network** class creates a network from a list of layers and a mini-batch size. It
defines the symbolic variables for the input to (**self.x**) and the desired output from (**self.y**) the network. It sets
the input to the initial layer. It propagates **self.x** forward through the layers of the network in
order to symbolically define the output from the network.

The method **SGD** trains the network using mini-batch stochastic gradient descent. The functions
**train_mb** and **test_mb_accuracy** are called in the training.

```python
class Network(object):

    def __init__(self, layers, mini_batch_size):
        self.layers = layers
        self.mini_batch_size = mini_batch_size
        self.params = [param for layer in self.layers for param in layer.params]
        self.x = T.matrix("x")
        self.y = T.ivector("y")
        init_layer = self.layers[0]
        init_layer.set_inpt(self.x, self.mini_batch_size)
        for j in xrange(1, len(self.layers)):
            prev_layer, layer = self.layers[j-1], self.layers[j]
            layer.set_inpt(prev_layer.output, self.mini_batch_size)
        self.output = self.layers[-1].output

    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data):
        training_x, training_y = training_data
        test_x, test_y = test_data
        num_training_batches = size(training_data)/mini_batch_size
        num_test_batches = size(test_data)/mini_batch_size
        cost = self.layers[-1].cost(self)
        grads = T.grad(cost, self.params)
        updates = [(param, param-eta*grad) for param, grad in zip(self.params, grads)]

        # define functions to train a mini-batch and to compute
        # the accuracy on test mini-batches.
        i = T.lscalar() # mini-batch index
        train_mb = theano.function(
            [i], cost, updates=updates,
            givens={
                self.x:
                training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        test_mb_accuracy = theano.function(
            [i], self.layers[-1].accuracy(self.y),
            givens={
                self.x:
                test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })

        # train the network
        for epoch in xrange(epochs):
            for minibatch_index in xrange(num_training_batches):
                iteration = num_training_batches*epoch+minibatch_index
                if iteration % 1000 == 0:
                    print("Training mini-batch number {0}".format(iteration))
                train_mb(minibatch_index)
                if (iteration+1) % num_training_batches == 0:
                    if test_data:
                        test_accuracy = np.mean([test_mb_accuracy(j) for j in xrange(num_test_batches)])
                        print("The network accuracy on test data is {0:.2%}".format(test_accuracy))

def size(data):
    return data[0].get_value(borrow=True).shape[0]
```
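To see the control flow of **SGD** without Theano, here is a minimal plain-Python sketch. It is not the code in the listing above: the model is a hypothetical one-parameter fit of `y = w*x`, the cost is mean squared error rather than the network's cost, and the gradient is written out by hand instead of coming from `T.grad`. The function name `sgd` and the parameter `report_every` are made up for the illustration. What carries over is the mini-batch slice, the update `param - eta*grad`, and the end-of-epoch check.

```python
def sgd(training_data, epochs, mini_batch_size, eta, report_every=1000):
    """Plain-Python analogue of the SGD loop for the toy model y = w*x."""
    xs, ys = training_data
    num_training_batches = len(xs) // mini_batch_size
    w = 0.0      # the single trainable parameter
    log = []     # messages are collected here instead of being printed
    for epoch in range(epochs):
        for minibatch_index in range(num_training_batches):
            iteration = num_training_batches * epoch + minibatch_index
            if iteration % report_every == 0:
                log.append("Training mini-batch number {0}".format(iteration))
            # The same slice that the `givens` dictionaries select for index i.
            lo = minibatch_index * mini_batch_size
            hi = (minibatch_index + 1) * mini_batch_size
            batch_x, batch_y = xs[lo:hi], ys[lo:hi]
            # Hand-written gradient of mean((w*x - y)**2) with respect to w.
            grad = sum(2.0 * (w * x - y) * x
                       for x, y in zip(batch_x, batch_y)) / len(batch_x)
            # The update rule from the listing: param - eta*grad.
            w = w - eta * grad
            # True exactly once per epoch, after its last mini-batch.
            if (iteration + 1) % num_training_batches == 0:
                log.append("end of epoch {0}".format(epoch))
    return w, log


# Toy data (an assumption for the demo): 40 points on the line y = 2x.
xs = [0.1 * k for k in range(1, 41)]
ys = [2.0 * x for x in xs]
w, log = sgd((xs, ys), epochs=20, mini_batch_size=10, eta=0.01, report_every=10)
```

With 40 examples and a mini-batch size of 10 there are 4 mini-batches per epoch, so the end-of-epoch branch runs after iterations 3, 7, 11, and so on. In the real network that branch is where the test accuracy is computed.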

# The Convolutional Neural Network in Action

Training this network takes time: about 75 minutes on my (relatively old) Arch Linux machine with a 4x Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz (launch date Q1'11).

```
% python2
Python 2.7.12 (default, Nov 7 2016, 11:55:55)
[GCC 6.2.1 20160830] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import network3_tab
>>> from network3_tab import Network
>>> from network3_tab import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
>>> training_data, validation_data, test_data = network3_tab.load_data_shared()
>>> mini_batch_size = 10
>>> net = Network([
... ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28), filter_shape=(20, 1, 5, 5)),
... FullyConnectedLayer(n_in=20*12*12, n_out=100),
... SoftmaxLayer(n_in=100, n_out=10)],
... mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, test_data)
Training mini-batch number 0
Training mini-batch number 1000
Training mini-batch number 2000
Training mini-batch number 3000
Training mini-batch number 4000
The network accuracy on test data is 92.99%
Training mini-batch number 5000
Training mini-batch number 6000
Training mini-batch number 7000
Training mini-batch number 8000
Training mini-batch number 9000
The network accuracy on test data is 95.47%
Training mini-batch number 10000
Training mini-batch number 11000
Training mini-batch number 12000
Training mini-batch number 13000
Training mini-batch number 14000
The network accuracy on test data is 96.68%
Training mini-batch number 15000
Training mini-batch number 16000
Training mini-batch number 17000
Training mini-batch number 18000
Training mini-batch number 19000
The network accuracy on test data is 97.17%
Training mini-batch number 20000
Training mini-batch number 21000
Training mini-batch number 22000
Training mini-batch number 23000
Training mini-batch number 24000
The network accuracy on test data is 97.64%
Training mini-batch number 25000
Training mini-batch number 26000
Training mini-batch number 27000
Training mini-batch number 28000
Training mini-batch number 29000
The network accuracy on test data is 97.82%
Training mini-batch number 30000
Training mini-batch number 31000
Training mini-batch number 32000
Training mini-batch number 33000
Training mini-batch number 34000
The network accuracy on test data is 97.83%
Training mini-batch number 35000
Training mini-batch number 36000
Training mini-batch number 37000
Training mini-batch number 38000
Training mini-batch number 39000
The network accuracy on test data is 97.91%
Training mini-batch number 40000
Training mini-batch number 41000
Training mini-batch number 42000
Training mini-batch number 43000
Training mini-batch number 44000
The network accuracy on test data is 97.99%
Training mini-batch number 45000
Training mini-batch number 46000
Training mini-batch number 47000
Training mini-batch number 48000
Training mini-batch number 49000
The network accuracy on test data is 98.16%
Training mini-batch number 50000
Training mini-batch number 51000
Training mini-batch number 52000
Training mini-batch number 53000
Training mini-batch number 54000
The network accuracy on test data is 98.24%
Training mini-batch number 55000
Training mini-batch number 56000
Training mini-batch number 57000
Training mini-batch number 58000
Training mini-batch number 59000
The network accuracy on test data is 98.23%
Training mini-batch number 60000
Training mini-batch number 61000
Training mini-batch number 62000
Training mini-batch number 63000
Training mini-batch number 64000
The network accuracy on test data is 98.29%
Training mini-batch number 65000
Training mini-batch number 66000
Training mini-batch number 67000
Training mini-batch number 68000
Training mini-batch number 69000
The network accuracy on test data is 98.31%
Training mini-batch number 70000
Training mini-batch number 71000
Training mini-batch number 72000
Training mini-batch number 73000
Training mini-batch number 74000
The network accuracy on test data is 98.44%
Training mini-batch number 75000
Training mini-batch number 76000
Training mini-batch number 77000
Training mini-batch number 78000
Training mini-batch number 79000
The network accuracy on test data is 98.49%
Training mini-batch number 80000
Training mini-batch number 81000
Training mini-batch number 82000
Training mini-batch number 83000
Training mini-batch number 84000
The network accuracy on test data is 98.56%
Training mini-batch number 85000
Training mini-batch number 86000
Training mini-batch number 87000
Training mini-batch number 88000
Training mini-batch number 89000
The network accuracy on test data is 98.57%
Training mini-batch number 90000
Training mini-batch number 91000
Training mini-batch number 92000
Training mini-batch number 93000
Training mini-batch number 94000
The network accuracy on test data is 98.60%
Training mini-batch number 95000
Training mini-batch number 96000
Training mini-batch number 97000
Training mini-batch number 98000
Training mini-batch number 99000
The network accuracy on test data is 98.60%
Training mini-batch number 100000
Training mini-batch number 101000
Training mini-batch number 102000
Training mini-batch number 103000
Training mini-batch number 104000
The network accuracy on test data is 98.63%
Training mini-batch number 105000
Training mini-batch number 106000
Training mini-batch number 107000
Training mini-batch number 108000
Training mini-batch number 109000
The network accuracy on test data is 98.66%
Training mini-batch number 110000
Training mini-batch number 111000
Training mini-batch number 112000
Training mini-batch number 113000
Training mini-batch number 114000
The network accuracy on test data is 98.66%
Training mini-batch number 115000
Training mini-batch number 116000
Training mini-batch number 117000
Training mini-batch number 118000
Training mini-batch number 119000
The network accuracy on test data is 98.69%
Training mini-batch number 120000
Training mini-batch number 121000
Training mini-batch number 122000
Training mini-batch number 123000
Training mini-batch number 124000
The network accuracy on test data is 98.72%
Training mini-batch number 125000
Training mini-batch number 126000
Training mini-batch number 127000
Training mini-batch number 128000
Training mini-batch number 129000
The network accuracy on test data is 98.71%
Training mini-batch number 130000
Training mini-batch number 131000
Training mini-batch number 132000
Training mini-batch number 133000
Training mini-batch number 134000
The network accuracy on test data is 98.71%
Training mini-batch number 135000
Training mini-batch number 136000
Training mini-batch number 137000
Training mini-batch number 138000
Training mini-batch number 139000
The network accuracy on test data is 98.71%
Training mini-batch number 140000
Training mini-batch number 141000
Training mini-batch number 142000
Training mini-batch number 143000
Training mini-batch number 144000
The network accuracy on test data is 98.72%
Training mini-batch number 145000
Training mini-batch number 146000
Training mini-batch number 147000
Training mini-batch number 148000
Training mini-batch number 149000
The network accuracy on test data is 98.72%
Training mini-batch number 150000
Training mini-batch number 151000
Training mini-batch number 152000
Training mini-batch number 153000
Training mini-batch number 154000
The network accuracy on test data is 98.72%
Training mini-batch number 155000
Training mini-batch number 156000
Training mini-batch number 157000
Training mini-batch number 158000
Training mini-batch number 159000
The network accuracy on test data is 98.71%
Training mini-batch number 160000
Training mini-batch number 161000
Training mini-batch number 162000
Training mini-batch number 163000
Training mini-batch number 164000
The network accuracy on test data is 98.70%
Training mini-batch number 165000
Training mini-batch number 166000
Training mini-batch number 167000
Training mini-batch number 168000
Training mini-batch number 169000
The network accuracy on test data is 98.68%
Training mini-batch number 170000
Training mini-batch number 171000
Training mini-batch number 172000
Training mini-batch number 173000
Training mini-batch number 174000
The network accuracy on test data is 98.68%
Training mini-batch number 175000
Training mini-batch number 176000
Training mini-batch number 177000
Training mini-batch number 178000
Training mini-batch number 179000
The network accuracy on test data is 98.69%
Training mini-batch number 180000
Training mini-batch number 181000
Training mini-batch number 182000
Training mini-batch number 183000
Training mini-batch number 184000
The network accuracy on test data is 98.68%
Training mini-batch number 185000
Training mini-batch number 186000
Training mini-batch number 187000
Training mini-batch number 188000
Training mini-batch number 189000
The network accuracy on test data is 98.69%
Training mini-batch number 190000
Training mini-batch number 191000
Training mini-batch number 192000
Training mini-batch number 193000
Training mini-batch number 194000
The network accuracy on test data is 98.69%
Training mini-batch number 195000
Training mini-batch number 196000
Training mini-batch number 197000
Training mini-batch number 198000
Training mini-batch number 199000
The network accuracy on test data is 98.69%
Training mini-batch number 200000
Training mini-batch number 201000
Training mini-batch number 202000
Training mini-batch number 203000
Training mini-batch number 204000
The network accuracy on test data is 98.71%
Training mini-batch number 205000
Training mini-batch number 206000
Training mini-batch number 207000
Training mini-batch number 208000
Training mini-batch number 209000
The network accuracy on test data is 98.72%
Training mini-batch number 210000
Training mini-batch number 211000
Training mini-batch number 212000
Training mini-batch number 213000
Training mini-batch number 214000
The network accuracy on test data is 98.73%
Training mini-batch number 215000
Training mini-batch number 216000
Training mini-batch number 217000
Training mini-batch number 218000
Training mini-batch number 219000
The network accuracy on test data is 98.73%
Training mini-batch number 220000
Training mini-batch number 221000
Training mini-batch number 222000
Training mini-batch number 223000
Training mini-batch number 224000
The network accuracy on test data is 98.74%
Training mini-batch number 225000
Training mini-batch number 226000
Training mini-batch number 227000
Training mini-batch number 228000
Training mini-batch number 229000
The network accuracy on test data is 98.74%
Training mini-batch number 230000
Training mini-batch number 231000
Training mini-batch number 232000
Training mini-batch number 233000
Training mini-batch number 234000
The network accuracy on test data is 98.74%
Training mini-batch number 235000
Training mini-batch number 236000
Training mini-batch number 237000
Training mini-batch number 238000
Training mini-batch number 239000
The network accuracy on test data is 98.73%
Training mini-batch number 240000
Training mini-batch number 241000
Training mini-batch number 242000
Training mini-batch number 243000
Training mini-batch number 244000
The network accuracy on test data is 98.73%
Training mini-batch number 245000
Training mini-batch number 246000
Training mini-batch number 247000
Training mini-batch number 248000
Training mini-batch number 249000
The network accuracy on test data is 98.74%
Training mini-batch number 250000
Training mini-batch number 251000
Training mini-batch number 252000
Training mini-batch number 253000
Training mini-batch number 254000
The network accuracy on test data is 98.75%
Training mini-batch number 255000
Training mini-batch number 256000
Training mini-batch number 257000
Training mini-batch number 258000
Training mini-batch number 259000
The network accuracy on test data is 98.76%
Training mini-batch number 260000
Training mini-batch number 261000
Training mini-batch number 262000
Training mini-batch number 263000
Training mini-batch number 264000
The network accuracy on test data is 98.78%
Training mini-batch number 265000
Training mini-batch number 266000
Training mini-batch number 267000
Training mini-batch number 268000
Training mini-batch number 269000
The network accuracy on test data is 98.79%
Training mini-batch number 270000
Training mini-batch number 271000
Training mini-batch number 272000
Training mini-batch number 273000
Training mini-batch number 274000
The network accuracy on test data is 98.80%
Training mini-batch number 275000
Training mini-batch number 276000
Training mini-batch number 277000
Training mini-batch number 278000
Training mini-batch number 279000
The network accuracy on test data is 98.80%
Training mini-batch number 280000
Training mini-batch number 281000
Training mini-batch number 282000
Training mini-batch number 283000
Training mini-batch number 284000
The network accuracy on test data is 98.80%
Training mini-batch number 285000
Training mini-batch number 286000
Training mini-batch number 287000
Training mini-batch number 288000
Training mini-batch number 289000
The network accuracy on test data is 98.80%
Training mini-batch number 290000
Training mini-batch number 291000
Training mini-batch number 292000
Training mini-batch number 293000
Training mini-batch number 294000
The network accuracy on test data is 98.80%
Training mini-batch number 295000
Training mini-batch number 296000
Training mini-batch number 297000
Training mini-batch number 298000
Training mini-batch number 299000
The network accuracy on test data is 98.80%
>>>
```
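The value `n_in=20*12*12` passed to **FullyConnectedLayer** can be checked with a little arithmetic. The sketch below assumes that **ConvPoolLayer** applies its 5×5 filters with stride 1 and no padding and then max-pools over non-overlapping 2×2 regions. The layer's definition is not shown in this section, so the pooling size is an assumption, and `conv_pool_output_shape` is a hypothetical helper, not part of `network3_tab`.

```python
def conv_pool_output_shape(image_shape, filter_shape, poolsize=(2, 2)):
    """Return (feature_maps, height, width) after convolution and pooling.

    Assumes a stride-1 convolution without padding followed by
    non-overlapping max-pooling of size `poolsize` (assumed to be 2x2).
    """
    _, _, image_h, image_w = image_shape                 # (batch, channels, h, w)
    num_filters, _, filter_h, filter_w = filter_shape    # (filters, channels, h, w)
    conv_h = image_h - filter_h + 1                      # 28 - 5 + 1 = 24
    conv_w = image_w - filter_w + 1
    return (num_filters, conv_h // poolsize[0], conv_w // poolsize[1])


mini_batch_size = 10
shape = conv_pool_output_shape((mini_batch_size, 1, 28, 28), (20, 1, 5, 5))
n_in = shape[0] * shape[1] * shape[2]   # inputs to the fully connected layer
```

Under these assumptions `shape` is `(20, 12, 12)` and `n_in` is 2880, which agrees with the `20*12*12` in the session above. The log is consistent in a similar way: an accuracy line appears after every 5,000 mini-batches of size 10, so 60 epochs give 300,000 mini-batch updates in total.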

# The Street View House Numbers (SVHN) Dataset

The SVHN dataset is obtained from images of house numbers in Google Street View. Recognizing digits in this "real world" dataset is considerably more challenging than recognizing the MNIST digits.

```
% python2
Python 2.7.12 (default, Nov 7 2016, 11:55:55)
[GCC 6.2.1 20160830] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy.io as sio
>>> import matplotlib.pyplot as plt
>>>
>>> train_data = sio.loadmat('train_32x32.mat')
>>>
>>> x_train = train_data['X']
>>> y_train = train_data['y']
>>>
>>> image_index = 109
>>> image=plt.imshow(x_train[:,:,:,image_index])
>>> print y_train[image_index]
[3]
>>> plt.show()
```
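Two details of this session are easy to miss. The image index is the last axis of `X`, which is why the image is selected with `x_train[:,:,:,image_index]`, and each label is stored as a one-element row, which is why the label prints as `[3]`. In addition, the SVHN `.mat` files use the label 10 for the digit 0. The sketch below mimics both points with small nested lists in place of the real arrays. The helper names `svhn_label_to_digit` and `image_at` are hypothetical, and the toy data is made up.

```python
def svhn_label_to_digit(label):
    """Map an SVHN label (1..10) to the digit it denotes; 10 stands for 0."""
    return label % 10


def image_at(x, n):
    """Pure-Python analogue of x[:, :, :, n] for nested lists.

    `x` is indexed as x[row][col][channel][image], mirroring the axis
    order of the array loaded from train_32x32.mat.
    """
    return [[[channel[n] for channel in pixel] for pixel in row] for row in x]


# Toy stand-in for X: 2 rows, 2 columns, 3 colour channels, 2 images.
# Each value encodes its own position so the result is easy to read.
rows, cols, channels, num_images = 2, 2, 3, 2
x = [[[[1000 * n + 100 * r + 10 * c + ch for n in range(num_images)]
       for ch in range(channels)]
      for c in range(cols)]
     for r in range(rows)]

# Toy stand-in for y: one label per image, each stored as a one-element row.
labels = [[3], [10]]

second_image = image_at(x, 1)                            # indexed [row][col][channel]
digits = [svhn_label_to_digit(row[0]) for row in labels]
```

Here `second_image[0][1]` is `[1010, 1011, 1012]`, the three channel values of the pixel in row 0, column 1 of image 1, and `digits` is `[3, 0]`.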