Neural Networks

Neural networks are computational devices whose structure is inspired by the way neurons work in the brain.

A neuron processes and transmits information. The human brain contains roughly 85 billion neurons. A typical neuron consists of a cell body, dendrites, and an axon. The dendrites take input from other neurons in the form of electrical impulses. The cell body processes these inputs, and the axon terminals transmit the output, again in the form of electrical impulses.

A perceptron is an artificial neuron. It takes binary inputs and computes a binary output. The computation involves weights (w1, w2, ...) and a threshold value. If the weighted sum Σj wj xj is greater than the threshold value, then the output is 1. Otherwise, it is 0. It is typical to express the computation in terms of the dot product w · x (= Σj wj xj), where w is a 1×n row vector of weights and x is an n×1 column vector of inputs. Further, the negative of the threshold is called the perceptron's bias, b (= −threshold). In these terms, the output is 1 if w · x + b > 0. Otherwise, it is 0.

Perceptrons can implement logic functions. Conjunction (φ ∧ ψ) is an example. Let the perceptron have two inputs, a weight of 0.6 for each, and a threshold value of 1. If both inputs are 1, the weighted sum (1.2) exceeds the threshold and thus the output is 1. Otherwise, the output is 0. This matches the truth table for conjunction.
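
Here is a minimal sketch of that conjunction perceptron in Python (the weights 0.6 and the bias b = -1 come from the example above; the helper name perceptron is just for illustration).

def perceptron(x, w, b):
    # output 1 if w . x + b > 0, otherwise 0
    weighted_sum = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if weighted_sum > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print("{0} AND {1} = {2}".format(x1, x2, perceptron((x1, x2), (0.6, 0.6), -1)))
# 0 AND 0 = 0
# 0 AND 1 = 0
# 1 AND 0 = 0
# 1 AND 1 = 1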

Artificial neurons may be linked in a feedforward network. This is a network in which the output from one layer is the input to the next layer. The first layer is the input layer. The last layer is the output layer. The hidden layers consist of the neurons that are neither input nor output neurons.

Networks of artificial neurons may be understood as devices to make "decisions about decisions." The first layer of neurons makes a "decision" by weighing the input evidence, the next layer of neurons makes a "decision about the decisions" of the prior layer, and so on.

Sigmoid Neurons

A sigmoid neuron has an important feature that a perceptron lacks: small changes in the weights and bias cause only small changes in the output. This allows sigmoid neurons to "learn."


A sigmoid neuron has the same parts as a perceptron (inputs, weights, and a bias), but the inputs are not binary. In a sigmoid neuron, they may take on any value between 0 and 1. The output is not binary either. Instead, it is f(w · x + b), where the activation function f is the sigmoid function.

The sigmoid function is σ(x) = `1/(1 + e^-x)`

As the activation function, the sigmoid function maps w · x + b to a smooth curve that preserves the desirable features of the perceptron's behavior. Writing z = w · x + b: when z is a large positive number, the output is close to 1 because `e^-z` is close to 0. When z is a large negative number, the output is close to 0 because `e^-z` is very large.
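
A quick NumPy check of this behaviour at the extremes (a minimal sketch, assuming only that NumPy is installed):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10.0))    # about 0.99995, close to 1
print(sigmoid(-10.0))   # about 4.5e-05, close to 0
print(sigmoid(0.0))     # 0.5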

(The sigmoid function is sometimes called the logistic function, and sigmoid neurons are sometimes called logistic neurons.)

A Network to Classify Digits

The MNIST data set contains scanned images of handwritten digits. (MNIST is a modified subset of two data sets collected by the National Institute of Standards and Technology (NIST).) The images are greyscale and 28 by 28 pixels in size. They are split into 60,000 training images and 10,000 test images.

The input to each neuron in the input layer is one pixel from a given image. Since each image is 28 x 28 pixels, the input layer has 784 neurons (28 x 28). In the original MNIST data set, the pixel values are greyscale integers from 0 to 255 (where 0 is black, 255 is white, and values in between are shades of gray). To make the data set convenient to use in a Python program, each image is stored as a 1-dimensional array of 784 values between 0 and 1 (where 0 is black, 1 is white, and values in between are shades of gray).
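
A minimal sketch of that kind of conversion (here raw is a stand-in 28 x 28 array of integers between 0 and 255; in the actual program the pickled data already stores the scaled values, as described below):

import numpy as np

raw = np.random.randint(0, 256, size=(28, 28))   # stand-in for a 28 x 28 greyscale image
x = np.reshape(raw / 255.0, (784, 1))            # scale to [0, 1] and flatten
print(x.shape)                                   # (784, 1)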

The output layer has 10 neurons. The first neuron indicates whether the image is a 0, the second whether the image is a 1, and so on.


Minimizing the Error Function

This network needs to be "trained" to classify the digits correctly. The error of the network is a function of its weights and biases. Training the network is a matter of finding weights and biases that minimize the value of this function. Finding them is a matter of gradient descent: repeatedly stepping in the direction opposite to the gradient of the function.

To get some insight into the general idea, consider the function `f(x,y) = x^2y`.


The gradient (`gradf`) is the vector of partial derivatives

`[(delf)/(delx)(x,y) = 2xy, (delf)/(dely)(x,y) = x^2]`.

This vector points in the direction in which the function increases most rapidly. If the starting point is `(2,2)`, the direction of steepest ascent is

`gradf(2,2) = ((delf)/(delx)(2,2), (delf)/(dely)(2,2)) = (8,4)`

In training a neural network, the goal is to reduce the value of the function. If we step down from `(2,2)` with step size `eta = 0.5`, we arrive at

`(2 - eta(delf)/(delx)(2,2), 2 - eta(delf)/(dely)(2,2)) = (-2, 0)`
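
The same step can be checked numerically (a minimal sketch using the values above):

import numpy as np

def grad_f(x, y):
    # gradient of f(x, y) = x^2 * y
    return np.array([2.0 * x * y, x ** 2])

point = np.array([2.0, 2.0])
eta = 0.5
step = point - eta * grad_f(point[0], point[1])
print(grad_f(point[0], point[1]))   # gradient at (2, 2): (8, 4)
print(step)                         # one step later: (-2, 0)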



The code to plot the function is written in Python (2.7).


% python2
Python 2.7.12 (default, Nov  7 2016, 11:55:55) 
[GCC 6.2.1 20160830] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> from mpl_toolkits.mplot3d import Axes3D
>>> import matplotlib.pyplot as plt
>>> 
>>> def fun(x, y): return x**2 * y
... 
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111, projection='3d')
>>> x = y = np.arange(-3.0, 3.0, 0.05)
>>> X, Y = np.meshgrid(x, y)
>>> zs = np.array([fun(x,y) for x,y in zip(np.ravel(X), np.ravel(Y))])
>>> Z = zs.reshape(X.shape)
>>> ax.plot_surface(X, Y, Z, cmap="hot")
>>> ax.set_xlabel('X Label')
>>> ax.set_ylabel('Y Label')
>>> ax.set_zlabel('Z Label')
>>> plt.show()

     

An Example Image from the MNIST Data Set

This is the first image in the training data. The image is in training_data[0][0]. The label is in training_data[0][1].


	
% python2
Python 2.7.12 (default, Jun 28 2016, 08:31:05) 
[GCC 6.1.1 20160602] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mnist_loader
>>> training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
>>> training_data[0][1].shape
(10, 1)
>>> training_data[0][1]
array([[ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 1.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.]])
>>> training_data[0][0].shape
(784, 1)
>>> import numpy as np
>>> image_array = np.reshape(training_data[0][0], (28, 28))       
>>> import matplotlib.pyplot as plt
>>> image = plt.imshow(image_array, cmap ='gray')
>>> plt.show()	

	

mnist_loader.py

The data set comes from a tutorial on the Deep Learning website. The data has been "pickled" to make it easier to use in Python. The file contains a tuple of three elements (training, validation, and test). Each element pairs a list of images with a list of labels. An image is represented as a 1-dimensional NumPy array of 784 (28 x 28) float values between 0 and 1 (0 stands for black, 1 for white). The labels are numbers between 0 and 9 indicating which digit the image represents. The function load_data_wrapper() returns a tuple containing training_data, validation_data, and test_data.

training_data is a list of 50,000 2-tuples (x, y). x is a 784-dimensional array containing the input image. y is a 10-dimensional array corresponding to the label for the image: it has a 1.0 in the position of the correct digit and 0.0 elsewhere.

validation_data and test_data are lists containing 10,000 2-tuples (x, y). x is a 784-dimensional array containing the input image. y is the label for the image (an integer between 0 and 9).

	
import cPickle
import gzip
import numpy as np

def load_data():
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
		
	

The Rest of the Python Program

We will not try to understand the code (which belongs to Michael Nielsen) or the underlying algorithm in complete detail.


The Network Class

	
class Network(object):

    def __init__(self, sizes):      
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
	

We can use this class to create a neural network. The instruction

net = network.Network([2, 3, 1])

creates a neural network whose input layer has two neurons, whose middle layer has three neurons, and whose output layer has one neuron.

	
% python2
Python 2.7.12 (default, Jun 28 2016, 08:31:05) 
[GCC 6.1.1 20160602] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import network
>>> net = network.Network([2, 3, 1])
>>> 
	

The biases and weights are initialized to random numbers. The input layer has no biases. (Biases are only used in computing the output from later layers.) randn() generates an array of samples drawn from the standard normal distribution. For the [2, 3, 1] network, the biases are in a 3 x 1 array and a 1 x 1 array. (randn is part of NumPy, the fundamental package for scientific computing with Python.)

	
% python2
Python 2.7.12 (default, Jun 28 2016, 08:31:05) 
[GCC 6.1.1 20160602] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import network
>>> net = network.Network([2, 3, 1])
>>> net.biases[0].shape
(3, 1)
>>> net.biases[0]
array([[ 1.36630966],
       [ 1.05788544],
       [ 0.80606255]])
>>> net.biases[1].shape
(1, 1)
>>> net.biases[1]
array([[ 1.54813682]])
>>> 
	

The function zip() makes a list of tuples out of the lists it zips together.

	
% python2
Python 2.7.12 (default, Jun 28 2016, 08:31:05) 
[GCC 6.1.1 20160602] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> zipped = zip(x, y)
>>> zipped
[(1, 4), (2, 5), (3, 6)]
>>> 
	
	

For the [2, 3, 1] network, the weights are in a 3 x 2 array and a 1 x 3 array.

The first row of net.weights[0] contains the weights that the first neuron in the hidden layer applies to the outputs of the first and second neurons in the input layer.

	
>>> net.weights[0].shape
(3, 2)
>>> net.weights[0]
array([[-0.27640848,  0.13942239],
       [ 1.13350606,  1.51767629],
       [-0.03836741,  0.06409297]])
>>> net.weights[1].shape
(1, 3)       
>>> net.weights[1]
array([[-0.72105625,  1.76366748,  1.49408987]])
>>> 
	
	

Mini-Batch Stochastic Gradient Descent

For each epoch of training, the training data is randomly shuffled and partitioned into mini-batches. The weights and biases in the network are updated once per mini-batch. The argument eta passed to the method update_mini_batch is the learning rate. At the end of each epoch, if test data was supplied, the program evaluates the network against it.

	
def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):
    if test_data: n_test = len(test_data)
    n = len(training_data)
    for j in xrange(epochs):
        random.shuffle(training_data)
        mini_batches = [training_data[k:k+mini_batch_size] for k in xrange(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        if test_data:
            print "Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data), n_test)
        else:
            print "Epoch {0} complete".format(j)	
            	
	

The method update_mini_batch updates the weights and biases in the network. It calculates the gradients for the inputs in a mini-batch of training data. (The "nabla" is the inverted Greek delta, ∇. `gradf` is the gradient of the function `f`.) Given the average of these gradients and the learning rate, it updates the weights and biases in the network.

	
def update_mini_batch(self, mini_batch, eta):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb for b, nb in zip(self.biases, nabla_b)]	
	
	

update_mini_batch in turn uses the method backprop to compute the gradient.

backprop has two parts. In #feedforward, it feeds the training input (x) forward through the network, storing the activations and the zs layer by layer. In #backward pass, it uses the activations and zs to compute the gradient of the error function at the current weights and biases, in terms of the intermediate value delta.

`delta_l^j` is defined as `(delE)/(delz_l^j)`, where

`delta_l^j` is the error in the `j^(th)` neuron in the `l^(th)` layer

`w_l^(jk)` is the weight for the connection from the `k^(th)` neuron in the `(l - 1)^(th)` layer to the `j^(th)` neuron in the `l^(th)` layer

`b_l^j` is the bias of the `j^(th)` neuron in the `l^(th)` layer

`a_l^j` is the activation of the `j^(th)` neuron in the `l^(th)` layer

`a_l^j = sigma(sum_k w_l^(jk) a_(l-1)^k + b_l^j)`, for neurons `k` in the `(l-1)^(th)` layer

`z_l^j = sum_k w_l^(jk) a_(l-1)^k + b_l^j`, for `k` in the `(l-1)^(th)` layer


The Four Fundamental Equations

1. The proof of `delta_L^j = (delE)/(dela_L^j)sigma'(z_L^j)` is as follows. By definition, `delta_L^j = (delE)/(delz_L^j)`. The change `z_L^j` makes to the total error is a function of the change `z_L^j` makes to `a_L^j`. So by the chain rule,

`delta_L^j = (delE)/(delz_L^j) = sum_k(delE)/(dela_L^k) (dela_L^k)/(delz_L^j)` for neurons `k` in the output layer (`L`), where

`a_l^j` is the activation of the `j^(th)` neuron in the `l^(th)` layer, `a_l^j = sigma (sum_k w_l^(jk) a_(l-1)^k + b_l^j)`, for `k` in the `(l-1)^(th)` layer

Since `a_L^k` depends on `z_L^j` only when `k=j`, it follows that

`delta_L^j = (delE)/(delz_L^j) = sum_k^(k in L)(delE)/(dela_L^k) (dela_L^k)/(delz_L^j) = (delE)/(dela_L^j) (dela_L^j)/(delz_L^j)`

Finally, because `a_L^j = sigma(z_L^j)`, it follows that

`delta_L^j = (delE)/(delz_L^j) = sum_k^(k in L)(delE)/(dela_L^k) (dela_L^k)/(delz_L^j) = (delE)/(dela_L^j) (dela_L^j)/(delz_L^j) = (delE)/(dela_L^j)sigma'(z_L^j)`


2. The proof of `delta_l^j = sum_k w_(l+1)^(kj)delta_(l+1)^k sigma'(z_l^j)` is as follows. By definition, `delta_l^j = (delE)/(delz_l^j)`. By the chain rule,

`delta_l^j = (delE)/(delz_l^j) = sum_k (delz_(l+1)^k)/(delz_l^j) (delE)/(delz_(l+1)^k)`

By the definition of `delta`,

`delta_l^j = (delE)/(delz_l^j) = sum_k (delz_(l+1)^k)/(delz_l^j) (delE)/(delz_(l+1)^k) = sum_k (delz_(l+1)^k)/(delz_l^j) delta_(l+1)^k`

Since

`z_(l+1)^k = sum_m w_(l+1)^(km)a_l^m + b_(l+1)^k = sum_m w_(l+1)^(km)sigma(z_l^m) + b_(l+1)^k`

it follows that

`(delz_(l+1)^k)/(delz_l^j) = w_(l+1)^(kj)sigma'(z_l^j)`

Hence,

`delta_l^j = (delE)/(delz_l^j) = sum_k (delz_(l+1)^k)/(delz_l^j) (delE)/(delz_(l+1)^k) = sum_k (delz_(l+1)^k)/(delz_l^j) delta_(l+1)^k = sum_k w_(l+1)^(kj)delta_(l+1)^k sigma'(z_l^j)`


3. The proof of `(delE)/(delb_l^j) = delta_l^j` is as follows. By the chain rule,

`(delE)/(delb_l^j) = (delz_l^j)/(delb_l^j) (delE)/(delz_l^j)`

By the definition of `delta`,

`(delE)/(delb_l^j) = (delz_l^j)/(delb_l^j) (delE)/(delz_l^j) = (delz_l^j)/(delb_l^j)delta_l^j`

Since

`z_l^j = sum_k w_l^(jk)a_(l-1)^k + b_l^j`

it follows that

`(delz_l^j)/(delb_l^j) = 1`

Hence

`(delE)/(delb_l^j) = (delz_l^j)/(delb_l^j) (delE)/(delz_l^j) = (delz_l^j)/(delb_l^j)delta_l^j = delta_l^j`


4. The proof of `(delE)/(delw_l^(jk)) = a_(l-1)^k delta_l^j` is as follows. By the chain rule,

`(delE)/(delw_l^(jk)) = (delz_l^j)/(delw_l^(jk)) (delE)/(delz_l^j)`

By the definition of `delta`,

`(delE)/(delw_l^(jk)) = (delz_l^j)/(delw_l^(jk)) (delE)/(delz_l^j) = (delz_l^j)/(delw_l^(jk))delta_l^j`

Since

`z_l^j = sum_k w_l^(jk)a_(l-1)^k + b_l^j`

it follows that

`(delz_l^j)/(delw_l^(jk)) = a_(l-1)^k`

Hence

`(delE)/(delw_l^(jk)) = (delz_l^j)/(delw_l^(jk)) (delE)/(delz_l^j) = (delz_l^j)/(delw_l^(jk))delta_l^j = a_(l-1)^k delta_l^j`
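
The four equations can be spot-checked numerically. The sketch below uses a toy network (a single input feeding one sigmoid output neuron) and the quadratic error E = 0.5 (a - y)^2, which is the cost implied by cost_derivative further down; the values of x, y, w, and b are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

x, y, w, b = 0.7, 1.0, 0.4, -0.2     # arbitrary example values

def error(w, b):
    return 0.5 * (sigmoid(w * x + b) - y) ** 2

z = w * x + b
delta = (sigmoid(z) - y) * sigmoid_prime(z)   # equation 1
grad_b = delta                                # equation 3
grad_w = x * delta                            # equation 4

eps = 1e-6
print(grad_b)
print((error(w, b + eps) - error(w, b - eps)) / (2 * eps))   # matches grad_b
print(grad_w)
print((error(w + eps, b) - error(w - eps, b)) / (2 * eps))   # matches grad_w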



The backprop method


def backprop(self, x, y):
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]    
    # feedforward
    activation = x
    activations = [x]                            
    zs = []                                      
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation)+b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)       
    # backward pass
    delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    for l in xrange(2, self.num_layers):
        z = zs[-l]
        sp = sigmoid_prime(z)
        delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())        
    return (nabla_b, nabla_w)	
    
    
def cost_derivative(self, output_activations, y):
    return (output_activations-y)	
    
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))     
	 
def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))


The evaluate method

The evaluate method returns the number of test inputs for which the network outputs the correct result.

(Note that the output is the index of whichever neuron in the final layer has the highest activation.)

	
def evaluate(self, test_data):
        test_results = [(np.argmax(self.feedforward(x)), y) for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)
	
def feedforward(self, a):
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a	  
	

The feedforward method returns the output of the network given the input.

Consider the first layer-to-layer step for the [2, 3, 1] network. The input array is 2x1. The weights array for the hidden layer is 3x2. The dot product is a 3x1 array. The biases array for the hidden layer is 3x1. When the argument is an array, NumPy applies the sigmoid function elementwise.
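
A minimal sketch of that first step, with random values standing in for the network's actual weights and biases:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(2, 1)   # 2x1 input
w = np.random.randn(3, 2)   # 3x2 weights of the hidden layer
b = np.random.randn(3, 1)   # 3x1 biases of the hidden layer

a = sigmoid(np.dot(w, x) + b)
print(np.dot(w, x).shape)   # (3, 1)
print(a.shape)              # (3, 1); sigmoid is applied elementwise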




The [784,30,10] Network in Action

The network has 784 neurons in the input layer, 30 in the hidden layer, and 10 in the output layer.

The code uses stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of η = 3.0.

After the network is trained, a random image is tested.

	
	
% python2
Python 2.7.12 (default, Nov  7 2016, 11:55:55) 
[GCC 6.2.1 20160830] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mnist_loader
>>> training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
>>> import network
>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
Epoch 0: 8268 / 10000
Epoch 1: 8393 / 10000
Epoch 2: 8422 / 10000
Epoch 3: 8466 / 10000
Epoch 4: 9321 / 10000
Epoch 5: 9385 / 10000
Epoch 6: 9383 / 10000
Epoch 7: 9391 / 10000
Epoch 8: 9392 / 10000
Epoch 9: 9422 / 10000
Epoch 10: 9423 / 10000
Epoch 11: 9427 / 10000
Epoch 12: 9462 / 10000
Epoch 13: 9480 / 10000
Epoch 14: 9453 / 10000
Epoch 15: 9474 / 10000
Epoch 16: 9466 / 10000
Epoch 17: 9447 / 10000
Epoch 18: 9488 / 10000
Epoch 19: 9501 / 10000
Epoch 20: 9481 / 10000
Epoch 21: 9487 / 10000
Epoch 22: 9493 / 10000
Epoch 23: 9461 / 10000
Epoch 24: 9485 / 10000
Epoch 25: 9454 / 10000
Epoch 26: 9503 / 10000
Epoch 27: 9497 / 10000
Epoch 28: 9495 / 10000
Epoch 29: 9478 / 10000
>>> import numpy as np
>>> imgnr = np.random.randint(0,10000)
>>> prediction = net.feedforward( test_data[imgnr][0] )
>>> print("Image number {0} is a {1}, and the network predicted a {2}".format(imgnr, test_data[imgnr][1], np.argmax(prediction)))
Image number 4709 is a 2, and the network predicted a 2
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(1,2,figsize=(8,4))
>>> ax[0].matshow( np.reshape(test_data[imgnr][0], (28,28) ), cmap='gray' )
>>> ax[1].plot( prediction, lw=3 )
>>> ax[1].set_aspect(9)
>>> plt.show()	
	




Convolutional Neural Networks

The layers in a convolutional neural network are not fully-connected. This allows the network to be sensitive to the spatial structure of the image.





These images come from the explanation of convolutional neural networks
in Neural Networks and Deep Learning.


Convolutional neural networks have convolutional layers.

Each neuron in the first hidden layer is connected to a small region of the input image. This region in the input image is the local receptive field for the hidden neuron. The next neuron in the hidden layer is connected to a local receptive field that overlaps with the previous field. The extent of the overlap is determined by the stride length.

The size of the input image, the local receptive field, and the stride length determine the size of the first hidden layer. If the input image is 28 x 28, the local receptive field is 5 x 5, and the stride length is 1, then the first hidden layer is 24 x 24.

Each hidden neuron has a bias and a set of weights. If the local receptive field is 5 x 5, then the hidden neuron has a 5 x 5 set of weights. Moreover, the biases and weights are the same for each neuron in the hidden layer. So the neurons in the first hidden layer detect the same input pattern or feature, no matter where it is in the image.

We can think of the hidden layer as consisting of a set of feature maps. If a feature map is 24 x 24, then a hidden layer of 2 x 24 x 24 neurons consists of two maps and can detect two features.

Convolutional neural networks also have pooling layers.

Pooling layers summarize the information in a region of a feature map. (There are different forms of pooling. In max pooling, the pooling neuron outputs the maximum of the region in the feature map. In L2 pooling, the neuron outputs the square root of the sum of the squares of the activations in the region.) If the hidden layer is 24 x 24, and the region to summarize is 2 x 2, then the pooling layer is 12 x 12.
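
The size arithmetic from the last few paragraphs is easy to check with a short sketch (assumed values: a 28 x 28 image, a 5 x 5 local receptive field, stride 1, and 2 x 2 pooling regions).

image_size = 28
field_size = 5
stride = 1
pool_size = 2

conv_size = (image_size - field_size) // stride + 1
pool_out = conv_size // pool_size
print(conv_size)   # 24: each feature map is 24 x 24
print(pool_out)    # 12: each pooled map is 12 x 12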

In the first example (Conv architecture), the input to the convolutional neural network is 28 x 28. The next layer in the network is the convolutional layer. In the first example, it uses a 5 x 5 local receptive field and 3 feature maps, so the convolutional layer is 3 x 24 x 24. The pooling layer comes next. The size of the region summarized in the feature maps is 2 x 2, so the pooling layer is 3 x 12 x 12. The final layer is fully-connected: every neuron in the pooling layer is connected to every one of the 10 output neurons.

In the second example (Conv + FC architecture), the convolutional neural network is more complicated. There are 20 feature maps. In addition, the output layer is a softmax layer. Further, there is a fully-connected layer between the pooling layer and the softmax layer.

In a softmax layer, the softmax function (not the sigmoid function) is applied to get the activations. The output of the softmax function is a probability distribution, so `a_L^j` is the probability that the digit the image represents is `j`.
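
A minimal NumPy sketch of the softmax function (the values in z are arbitrary):

import numpy as np

def softmax(z):
    # subtracting the maximum does not change the result but avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 5.0, 0.5, 1.5, 0.0, 0.2, 0.3, 1.1, 0.7])
a = softmax(z)
print(a.sum())        # 1.0: the activations form a probability distribution
print(np.argmax(a))   # 2: the most probable digit under these activations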



The Python/Theano Program (network3_tab.py)

In the following session, the convolutional neural network (net) has the "Conv + FC architecture" represented in the second example. The input is a 28 x 28 image from the MNIST dataset. The convolutional layer is 20 x 24 x 24, and the pooling layer is 20 x 12 x 12. These layers are followed by a fully-connected layer and a softmax output layer.

% python2
Python 2.7.12 (default, Nov  7 2016, 11:55:55) 
[GCC 6.2.1 20160830] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import network3_tab
>>> from network3_tab import Network
>>> from network3_tab import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
>>> training_data, validation_data, test_data = network3_tab.load_data_shared()     
>>> mini_batch_size = 10
>>> net = Network([
...      ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28), filter_shape=(20, 1, 5, 5)),
...      FullyConnectedLayer(n_in=20*12*12, n_out=100),
...      SoftmaxLayer(n_in=100, n_out=10)], 
...      mini_batch_size)
>>> 
		
	

Load the MNIST data

The MNIST data is pickled as a tuple of three elements, each pairing a list of images with a list of labels. The images and labels are stored in Theano shared variables so that the calculations can be processed on the GPU. In GPU memory, the data must be stored as floating-point values. The program uses the labels as integers, so shared_y is cast back to int32 when returned.

	
def load_data_shared(filename="../data/mnist.pkl.gz"):
    f = gzip.open(filename, 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    def shared(data):
        shared_x = theano.shared(
            np.asarray(data[0], dtype=theano.config.floatX), borrow=True)
        shared_y = theano.shared(
            np.asarray(data[1], dtype=theano.config.floatX), borrow=True)
        return shared_x, T.cast(shared_y, "int32")
    return [shared(training_data), shared(validation_data), shared(test_data)]	
	
	

The ConvPoolLayer, FullyConnectedLayer, and SoftmaxLayer

The first layer in net is really two layers: a convolutional layer and a max-pooling layer.

ConvPoolLayer initializes the weights using a Gaussian distribution with mean 0 and standard deviation 1 over the square root of the number of weights connecting to the same neuron. (This helps prevent saturation.) It initializes the biases using a Gaussian distribution with mean 0 and standard deviation 1. It loads these weights and biases into shared variables. The method set_inpt defines the algorithm for symbolically calculating the output of the layer. It uses theano.tensor.nnet.conv2d and theano.tensor.signal.pool.pool_2d. (Convolution arithmetic tutorial)

		
class ConvPoolLayer(object):
    def __init__(self, filter_shape, image_shape, poolsize=(2, 2), activation_fn=sigmoid):       
        self.filter_shape = filter_shape
        self.image_shape = image_shape
        self.poolsize = poolsize
        self.activation_fn=activation_fn
        # initialize weights and biases
        n_out = (filter_shape[0]*np.prod(filter_shape[2:])/np.prod(poolsize))
        self.w = theano.shared(
            np.asarray(
                np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), size=filter_shape),
                dtype=theano.config.floatX),
            borrow=True)
        self.b = theano.shared(
            np.asarray(
                np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)),
                dtype=theano.config.floatX),
            borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, mini_batch_size):
        self.inpt = inpt.reshape(self.image_shape)
        conv_out = conv.conv2d(
            input=self.inpt, filters=self.w, filter_shape=self.filter_shape,
            image_shape=self.image_shape)
        pooled_out = pool.pool_2d(
            input=conv_out, ds=self.poolsize, ignore_border=True)
        self.output = self.activation_fn(
            pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))
		
	

The other two layer classes (FullyConnectedLayer and SoftmaxLayer) are similar to ConvPoolLayer. The primary difference is in the set_inpt method.



class FullyConnectedLayer(object):
    def __init__(self, n_in, n_out, activation_fn=sigmoid):
        self.n_in = n_in
        self.n_out = n_out
        self.activation_fn = activation_fn
        # Initialize weights and biases
        self.w = theano.shared(
            np.asarray(
                np.random.normal(
                    loc=0.0, scale=np.sqrt(1.0/n_out), size=(n_in, n_out)),
                dtype=theano.config.floatX),
            name='w', borrow=True)
        self.b = theano.shared(
            np.asarray(np.random.normal(loc=0.0, scale=1.0, size=(n_out,)),
                       dtype=theano.config.floatX),
            name='b', borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, mini_batch_size):
        self.inpt = inpt.reshape((mini_batch_size, self.n_in))
        self.output = self.activation_fn(
            T.dot(self.inpt, self.w) + self.b)
        self.y_out = T.argmax(self.output, axis=1)

    def accuracy(self, y):
        return T.mean(T.eq(y, self.y_out))
 
 


The cost function in SoftmaxLayer is the negative log-likelihood function.

If x is the input to the network and y is the desired output (the correct digit), then the log-likelihood cost of x is `-ln a_L^y`, the negative log of the output activation corresponding to y. As the probability assigned to the correct output approaches 1, the cost approaches 0. As that probability approaches 0, the cost approaches infinity.
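
A short sketch of this cost for a single example (the activations in a are made up for illustration; in the program, the cost method below returns the mean of these per-example costs over a mini-batch):

import numpy as np

a = np.array([0.05, 0.02, 0.80, 0.01, 0.03, 0.02, 0.02, 0.02, 0.01, 0.02])   # softmax output
y = 2                                                                         # correct digit
print(-np.log(a[y]))    # ~0.22: small cost, since a[y] is close to 1
print(-np.log(0.01))    # ~4.61: the cost grows as the probability of the correct digit falls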

       
        

class SoftmaxLayer(object):
    def __init__(self, n_in, n_out):
        self.n_in = n_in
        self.n_out = n_out
        # Initialize weights and biases
        self.w = theano.shared(
            np.zeros((n_in, n_out), dtype=theano.config.floatX),
            name='w', borrow=True)
        self.b = theano.shared(
            np.zeros((n_out,), dtype=theano.config.floatX),
            name='b', borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, mini_batch_size):
        self.inpt = inpt.reshape((mini_batch_size, self.n_in))
        self.output = softmax(T.dot(self.inpt, self.w) + self.b)
        self.y_out = T.argmax(self.output, axis=1)
        
    def cost(self, net):  
        # net.y.shape[0] is the number of the training examples in the minibatch (N)    
        # T.arange(net.y.shape[0]) is a symbolic vector of integers [0,1,2,...,N-1]
        # T.log(self.output) is a NxK matrix, where in this case K = 10 (the number of digits 0..9)
        # T.log(self.output)[T.arange(net.y.shape[0]), net.y] is a vector of length N with the log-likelihoods of the labels
        # The mean is the average across the all the training examples in the minibatch
        return -T.mean(T.log(self.output)[T.arange(net.y.shape[0]), net.y])
        
    def accuracy(self, y):
        return T.mean(T.eq(y, self.y_out))



The Network Class

The Network class creates a network from a list of layers and a mini-batch size. It defines the symbolic variables for the input to the network (self.x) and the desired output (self.y). It sets the input to the initial layer, then propagates self.x forward through the layers of the network to symbolically define the network's output.

The method SGD trains the network using mini-batch stochastic gradient descent. The functions train_mb and test_mb_accuracy are called in the training.


class Network(object):

    def __init__(self, layers, mini_batch_size):
        self.layers = layers
        self.mini_batch_size = mini_batch_size
        self.params = [param for layer in self.layers for param in layer.params]
        self.x = T.matrix("x")
        self.y = T.ivector("y")
        init_layer = self.layers[0]
        init_layer.set_inpt(self.x, self.mini_batch_size)
        for j in xrange(1, len(self.layers)):
            prev_layer, layer  = self.layers[j-1], self.layers[j]
            layer.set_inpt(prev_layer.output, self.mini_batch_size)
        self.output = self.layers[-1].output

    def SGD(self, training_data, epochs, mini_batch_size, eta, test_data):
        training_x, training_y = training_data
        test_x, test_y = test_data
        num_training_batches = size(training_data)/mini_batch_size
        num_test_batches = size(test_data)/mini_batch_size 
        cost = self.layers[-1].cost(self)
        grads = T.grad(cost, self.params)
        updates = [(param, param-eta*grad) for param, grad in zip(self.params, grads)]
        # define functions to train a mini-batch compute the accuracy in test mini-batches.
        i = T.lscalar() # mini-batch index
        train_mb = theano.function(
            [i], cost, updates=updates,
            givens={
                self.x:
                training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        test_mb_accuracy = theano.function(
            [i], self.layers[-1].accuracy(self.y),
            givens={
                self.x:
                test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        # train the network
        for epoch in xrange(epochs):
            for minibatch_index in xrange(num_training_batches):
                iteration = num_training_batches*epoch+minibatch_index
                if iteration % 1000 == 0:
                    print("Training mini-batch number {0}".format(iteration))
                train_mb(minibatch_index)
                if (iteration+1) % num_training_batches == 0:
                    if test_data:
                        test_accuracy = np.mean([test_mb_accuracy(j) for j in xrange(num_test_batches)])
                        print("The network accuracy on test data is {0:.2%}".format(test_accuracy))


def size(data):
    return data[0].get_value(borrow=True).shape[0]
     

The Convolutional Neural Network in Action

Training this network takes time: about 75 minutes on my (relatively old) Arch Linux machine with a 4x Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz (launched Q1 2011).


% python2
Python 2.7.12 (default, Nov  7 2016, 11:55:55) 
[GCC 6.2.1 20160830] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import network3_tab
>>> from network3_tab import Network
>>> from network3_tab import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
>>> training_data, validation_data, test_data = network3_tab.load_data_shared()     
>>> mini_batch_size = 10
>>> net = Network([
...      ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28), filter_shape=(20, 1, 5, 5)),
...      FullyConnectedLayer(n_in=20*12*12, n_out=100),
...      SoftmaxLayer(n_in=100, n_out=10)], 
...      mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1, test_data)
Training mini-batch number 0
Training mini-batch number 1000
Training mini-batch number 2000
Training mini-batch number 3000
Training mini-batch number 4000
The network accuracy on test data is 92.99%
Training mini-batch number 5000
Training mini-batch number 6000
Training mini-batch number 7000
Training mini-batch number 8000
Training mini-batch number 9000
The network accuracy on test data is 95.47%
Training mini-batch number 10000
Training mini-batch number 11000
Training mini-batch number 12000
Training mini-batch number 13000
Training mini-batch number 14000
The network accuracy on test data is 96.68%
Training mini-batch number 15000
Training mini-batch number 16000
Training mini-batch number 17000
Training mini-batch number 18000
Training mini-batch number 19000
The network accuracy on test data is 97.17%
Training mini-batch number 20000
Training mini-batch number 21000
Training mini-batch number 22000
Training mini-batch number 23000
Training mini-batch number 24000
The network accuracy on test data is 97.64%
Training mini-batch number 25000
Training mini-batch number 26000
Training mini-batch number 27000
Training mini-batch number 28000
Training mini-batch number 29000
The network accuracy on test data is 97.82%
Training mini-batch number 30000
Training mini-batch number 31000
Training mini-batch number 32000
Training mini-batch number 33000
Training mini-batch number 34000
The network accuracy on test data is 97.83%
Training mini-batch number 35000
Training mini-batch number 36000
Training mini-batch number 37000
Training mini-batch number 38000
Training mini-batch number 39000
The network accuracy on test data is 97.91%
Training mini-batch number 40000
Training mini-batch number 41000
Training mini-batch number 42000
Training mini-batch number 43000
Training mini-batch number 44000
The network accuracy on test data is 97.99%
Training mini-batch number 45000
Training mini-batch number 46000
Training mini-batch number 47000
Training mini-batch number 48000
Training mini-batch number 49000
The network accuracy on test data is 98.16%
Training mini-batch number 50000
Training mini-batch number 51000
Training mini-batch number 52000
Training mini-batch number 53000
Training mini-batch number 54000
The network accuracy on test data is 98.24%
Training mini-batch number 55000
Training mini-batch number 56000
Training mini-batch number 57000
Training mini-batch number 58000
Training mini-batch number 59000
The network accuracy on test data is 98.23%
Training mini-batch number 60000
Training mini-batch number 61000
Training mini-batch number 62000
Training mini-batch number 63000
Training mini-batch number 64000
The network accuracy on test data is 98.29%
Training mini-batch number 65000
Training mini-batch number 66000
Training mini-batch number 67000
Training mini-batch number 68000
Training mini-batch number 69000
The network accuracy on test data is 98.31%
Training mini-batch number 70000
Training mini-batch number 71000
Training mini-batch number 72000
Training mini-batch number 73000
Training mini-batch number 74000
The network accuracy on test data is 98.44%
Training mini-batch number 75000
Training mini-batch number 76000
Training mini-batch number 77000
Training mini-batch number 78000
Training mini-batch number 79000
The network accuracy on test data is 98.49%
Training mini-batch number 80000
Training mini-batch number 81000
Training mini-batch number 82000
Training mini-batch number 83000
Training mini-batch number 84000
The network accuracy on test data is 98.56%
Training mini-batch number 85000
Training mini-batch number 86000
Training mini-batch number 87000
Training mini-batch number 88000
Training mini-batch number 89000
The network accuracy on test data is 98.57%
Training mini-batch number 90000
Training mini-batch number 91000
Training mini-batch number 92000
Training mini-batch number 93000
Training mini-batch number 94000
The network accuracy on test data is 98.60%
Training mini-batch number 95000
Training mini-batch number 96000
Training mini-batch number 97000
Training mini-batch number 98000
Training mini-batch number 99000
The network accuracy on test data is 98.60%
Training mini-batch number 100000
Training mini-batch number 101000
Training mini-batch number 102000
Training mini-batch number 103000
Training mini-batch number 104000
The network accuracy on test data is 98.63%
Training mini-batch number 105000
Training mini-batch number 106000
Training mini-batch number 107000
Training mini-batch number 108000
Training mini-batch number 109000
The network accuracy on test data is 98.66%
Training mini-batch number 110000
Training mini-batch number 111000
Training mini-batch number 112000
Training mini-batch number 113000
Training mini-batch number 114000
The network accuracy on test data is 98.66%
Training mini-batch number 115000
Training mini-batch number 116000
Training mini-batch number 117000
Training mini-batch number 118000
Training mini-batch number 119000
The network accuracy on test data is 98.69%
Training mini-batch number 120000
Training mini-batch number 121000
Training mini-batch number 122000
Training mini-batch number 123000
Training mini-batch number 124000
The network accuracy on test data is 98.72%
Training mini-batch number 125000
Training mini-batch number 126000
Training mini-batch number 127000
Training mini-batch number 128000
Training mini-batch number 129000
The network accuracy on test data is 98.71%
Training mini-batch number 130000
Training mini-batch number 131000
Training mini-batch number 132000
Training mini-batch number 133000
Training mini-batch number 134000
The network accuracy on test data is 98.71%
Training mini-batch number 135000
Training mini-batch number 136000
Training mini-batch number 137000
Training mini-batch number 138000
Training mini-batch number 139000
The network accuracy on test data is 98.71%
Training mini-batch number 140000
Training mini-batch number 141000
Training mini-batch number 142000
Training mini-batch number 143000
Training mini-batch number 144000
The network accuracy on test data is 98.72%
Training mini-batch number 145000
Training mini-batch number 146000
Training mini-batch number 147000
Training mini-batch number 148000
Training mini-batch number 149000
The network accuracy on test data is 98.72%
Training mini-batch number 150000
Training mini-batch number 151000
Training mini-batch number 152000
Training mini-batch number 153000
Training mini-batch number 154000
The network accuracy on test data is 98.72%
Training mini-batch number 155000
Training mini-batch number 156000
Training mini-batch number 157000
Training mini-batch number 158000
Training mini-batch number 159000
The network accuracy on test data is 98.71%
Training mini-batch number 160000
Training mini-batch number 161000
Training mini-batch number 162000
Training mini-batch number 163000
Training mini-batch number 164000
The network accuracy on test data is 98.70%
Training mini-batch number 165000
Training mini-batch number 166000
Training mini-batch number 167000
Training mini-batch number 168000
Training mini-batch number 169000
The network accuracy on test data is 98.68%
Training mini-batch number 170000
Training mini-batch number 171000
Training mini-batch number 172000
Training mini-batch number 173000
Training mini-batch number 174000
The network accuracy on test data is 98.68%
Training mini-batch number 175000
Training mini-batch number 176000
Training mini-batch number 177000
Training mini-batch number 178000
Training mini-batch number 179000
The network accuracy on test data is 98.69%
Training mini-batch number 180000
Training mini-batch number 181000
Training mini-batch number 182000
Training mini-batch number 183000
Training mini-batch number 184000
The network accuracy on test data is 98.68%
Training mini-batch number 185000
Training mini-batch number 186000
Training mini-batch number 187000
Training mini-batch number 188000
Training mini-batch number 189000
The network accuracy on test data is 98.69%
Training mini-batch number 190000
Training mini-batch number 191000
Training mini-batch number 192000
Training mini-batch number 193000
Training mini-batch number 194000
The network accuracy on test data is 98.69%
Training mini-batch number 195000
Training mini-batch number 196000
Training mini-batch number 197000
Training mini-batch number 198000
Training mini-batch number 199000
The network accuracy on test data is 98.69%
Training mini-batch number 200000
Training mini-batch number 201000
Training mini-batch number 202000
Training mini-batch number 203000
Training mini-batch number 204000
The network accuracy on test data is 98.71%
Training mini-batch number 205000
Training mini-batch number 206000
Training mini-batch number 207000
Training mini-batch number 208000
Training mini-batch number 209000
The network accuracy on test data is 98.72%
Training mini-batch number 210000
Training mini-batch number 211000
Training mini-batch number 212000
Training mini-batch number 213000
Training mini-batch number 214000
The network accuracy on test data is 98.73%
Training mini-batch number 215000
Training mini-batch number 216000
Training mini-batch number 217000
Training mini-batch number 218000
Training mini-batch number 219000
The network accuracy on test data is 98.73%
Training mini-batch number 220000
Training mini-batch number 221000
Training mini-batch number 222000
Training mini-batch number 223000
Training mini-batch number 224000
The network accuracy on test data is 98.74%
Training mini-batch number 225000
Training mini-batch number 226000
Training mini-batch number 227000
Training mini-batch number 228000
Training mini-batch number 229000
The network accuracy on test data is 98.74%
Training mini-batch number 230000
Training mini-batch number 231000
Training mini-batch number 232000
Training mini-batch number 233000
Training mini-batch number 234000
The network accuracy on test data is 98.74%
Training mini-batch number 235000
Training mini-batch number 236000
Training mini-batch number 237000
Training mini-batch number 238000
Training mini-batch number 239000
The network accuracy on test data is 98.73%
Training mini-batch number 240000
Training mini-batch number 241000
Training mini-batch number 242000
Training mini-batch number 243000
Training mini-batch number 244000
The network accuracy on test data is 98.73%
Training mini-batch number 245000
Training mini-batch number 246000
Training mini-batch number 247000
Training mini-batch number 248000
Training mini-batch number 249000
The network accuracy on test data is 98.74%
Training mini-batch number 250000
Training mini-batch number 251000
Training mini-batch number 252000
Training mini-batch number 253000
Training mini-batch number 254000
The network accuracy on test data is 98.75%
Training mini-batch number 255000
Training mini-batch number 256000
Training mini-batch number 257000
Training mini-batch number 258000
Training mini-batch number 259000
The network accuracy on test data is 98.76%
Training mini-batch number 260000
Training mini-batch number 261000
Training mini-batch number 262000
Training mini-batch number 263000
Training mini-batch number 264000
The network accuracy on test data is 98.78%
Training mini-batch number 265000
Training mini-batch number 266000
Training mini-batch number 267000
Training mini-batch number 268000
Training mini-batch number 269000
The network accuracy on test data is 98.79%
Training mini-batch number 270000
Training mini-batch number 271000
Training mini-batch number 272000
Training mini-batch number 273000
Training mini-batch number 274000
The network accuracy on test data is 98.80%
Training mini-batch number 275000
Training mini-batch number 276000
Training mini-batch number 277000
Training mini-batch number 278000
Training mini-batch number 279000
The network accuracy on test data is 98.80%
Training mini-batch number 280000
Training mini-batch number 281000
Training mini-batch number 282000
Training mini-batch number 283000
Training mini-batch number 284000
The network accuracy on test data is 98.80%
Training mini-batch number 285000
Training mini-batch number 286000
Training mini-batch number 287000
Training mini-batch number 288000
Training mini-batch number 289000
The network accuracy on test data is 98.80%
Training mini-batch number 290000
Training mini-batch number 291000
Training mini-batch number 292000
Training mini-batch number 293000
Training mini-batch number 294000
The network accuracy on test data is 98.80%
Training mini-batch number 295000
Training mini-batch number 296000
Training mini-batch number 297000
Training mini-batch number 298000
Training mini-batch number 299000
The network accuracy on test data is 98.80%
>>> 





The Street View House Numbers (SVHN) Dataset

The SVHN dataset is obtained from images of house numbers in Google Street View. Recognizing digits in this "real world" data set is considerably more challenging.





% python2
Python 2.7.12 (default, Nov  7 2016, 11:55:55) 
[GCC 6.2.1 20160830] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy.io as sio
>>> import matplotlib.pyplot as plt
>>> 
>>> train_data = sio.loadmat('train_32x32.mat')
>>> 
>>> x_train = train_data['X']
>>> y_train = train_data['y']
>>> 
>>> image_index = 109
>>> image=plt.imshow(x_train[:,:,:,image_index])
>>> print y_train[image_index]
[3]
>>> plt.show()
	
	



