# Philosophy, Computing, and Artificial Intelligence

PHI 319. Recognizing Digits in the MNIST Data Set.

******** UNDER CONSTRUCTION ********

## Artificial Neurons


An artificial neuron is a computational model of a neuron.

A typical neuron has dendrites, a cell body, and an axon. The dendrites (from the Greek δενδρίτης) take input from other neurons in the form of electrical impulses. The cell body processes these impulses, and the output goes from axon terminals to other neurons.

According to one recent estimate, the average male human brain contains about 86 billion neurons.

## Perceptron Neurons

A *perceptron* is an artificial neuron. It takes binary (**0** or **1**) inputs (**x _{1}**, ..., **x _{m}**) and computes a binary output. The computation is a function of weights (**w _{1}**, ..., **w _{m}**) and a threshold value. If the sum **w _{1}x _{1} + ... + w _{m}x _{m}** is greater than the threshold, the output is **1**. Otherwise, it is **0**.

Perceptrons can implement truth-functions. Conjunction (φ ∧ ψ) is an example. Let the perceptron have two inputs, each with a weight of 0.6, and a threshold value of 1. If both inputs are 1, the sum exceeds the threshold value of the perceptron and thus the output is 1. Otherwise, the output is 0. With these conditions for activating the perceptron, the output matches the truth-table for conjunction.
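The conjunction example can be checked with a few lines of code. This is an illustrative sketch; the function name `perceptron` is mine, and the 0.6/0.6 weights and the threshold of 1 come from the example above.

```python
def perceptron(inputs, weights, threshold):
    # Output 1 if the weighted sum of the inputs exceeds the threshold.
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Truth-table for conjunction: only (1, 1) activates the perceptron.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron((x1, x2), (0.6, 0.6), 1))
```

Only when both inputs are 1 does the sum (1.2) exceed the threshold, so the output matches the truth-table for conjunction.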

A perceptron is an instance of the *integrate-and-fire* model.
A neuron receives inputs through
synapses. The weights correspond to the relative efficiency with which a synapse communicates
inputs to the cell
body, so some inputs weigh more heavily than others in the computation. Since it takes resources for
the neuron to fire, the neuron is quiet unless the threshold is crossed.

The computation in a perceptron is typically expressed mathematically as the dot product **w · x**, where **w** is an **m**-vector of weights and **x** is an **m**-vector of inputs.

The negative of the
threshold value is the perceptron's **bias**, **b**. In these terms,
the value of the output activation function for a given set of inputs is **1** if
**w · x + b > 0** and is **0** otherwise.

## Sigmoid Neurons

"Suppose we arrange for some automatic means of testing the effectiveness of
any current weight [and bias] assignment [in the neuron] in terms of actual performance and
provide a mechanism for altering the weight [and bias] assignment so as to maximize
the performance. We need not go into the details of such a procedure to see
that it could be made entirely automatic and to see that a machine so programed
would 'learn' from its experience"
(Arthur L. Samuel, "Artificial Intelligence: A Frontier of Automation," 17.
*The Annals of the American Academy of Political and Social Science*. Vol.
340, *Automation*, 10-20, 1962).
A *sigmoid neuron* has an important feature a perceptron lacks: small changes
in the weights and bias cause small changes in the output. This allows sigmoid neurons to "learn."

We can make a neuron "learn" by changing its weights and biases. We know what the output should be. So if it is not what it should be, we change the weights and biases so that the output is closer to what it should be. In this way, the neuron "learns" what its output should be.

This talk about the neuron "learning" makes sense if we pretend that the neuron itself is changing its weights and biases in an effort to correct its mistakes. It sees that its output is not what it should be, so it adjusts its weights and biases in an effort to do better. Through many iterations of "learning," the neuron's output approaches the correct output. The neuron, in this way, "learns" like an archer who tries to get closer to hitting the target by slightly adjusting the angle of the arrow and how far he pulls back the string after each shot.

A sigmoid neuron has the same mathematical parts as a perceptron (inputs, weights, and a bias),
but there are two important differences. The inputs and outputs are not binary; they may take any value from **0** to **1**.
The activation function is also different: it is the sigmoid function.

The *sigmoid function* is σ(*z*) = `1/(1 + e^-z)`, where *z* = **w · x + b**.

As the activation function, the sigmoid function maps **w · x + b** to a smooth curve that
preserves the desirable features of the activation function for perceptrons. When **w · x + b** is
a large positive
number, the output of the function is close to **1** because `e^-z` is close to **0**.
When **w · x + b** is a large negative
number, the output is close to **0** because `e^-z` is extremely large.
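A quick numerical check of these limiting cases (a sketch; the inputs 10, -10, and 0 stand in for "large positive," "large negative," and the midpoint):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
print(sigmoid(0))    # exactly 0.5 at the midpoint
```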

## Artificial Neural Networks

Artificial neurons may be linked together in a *feedforward* network in which the output from one layer
is the input to the next layer. The first layer is the input layer of neurons. The last layer is the output layer. The hidden layers are
the neurons that are neither input nor output neurons.

A feedforward network of artificial neurons may be understood as a device that makes "decisions about decisions." The first layer of neurons makes a "decision" about the input, the next layer makes a "decision about the decision" of the prior layer, and so on.

## A Feedforward Network to Classify Digits

For a good video introduction to MNIST, see the series *Neural Networks* by 3Blue1Brown on YouTube.

The MNIST dataset contains scanned images of handwritten digits. In grayscale, the intensity of light for each pixel is represented as a number from 0 to 255: 0 represents "black" (no light), 255 represents "white" (all light), and values in between represent shades of "gray."

MNIST (the "M" stands for "modified") is a subset of two datasets (Special Database 1 and Special Database 3) of images of handwritten digits that the National Institute of Standards and Technology (NIST) collected. Special Database 1 was collected from high school students. Special Database 3 was collected from employees of the US Census Bureau. The MNIST data selects from both datasets and normalizes the images so that each is 28 x 28 pixels in grayscale.

The images in the dataset are split into 60,000 training images and 10,000 test images.

The input to each neuron in the input layer in the network is one pixel from the input image. Since each image is 28 x 28 pixels, the input layer has 784 (or 28 x 28) neurons.

The input layer is part of a neural network of sigmoid neurons. Because the 28 x 28 images in the MNIST dataset are in grayscale, each is represented as a NumPy (the package for scientific computing with Python) array of 784 values between 0 and 1.
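The scaling from raw pixel intensities to network inputs can be sketched as follows (the random image here is a stand-in for an actual MNIST scan):

```python
import numpy as np

# A hypothetical 28 x 28 grayscale image with intensities 0-255.
raw = np.random.randint(0, 256, size=(28, 28))

# Flatten to a 784-value column vector and scale to [0, 1].
x = raw.reshape(784, 1) / 255.0
print(x.shape)  # (784, 1)
```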

The output layer in the network has 10 neurons. The first neuron in this layer indicates whether the image is a **0**, the second
whether it is a **1**, the third whether it is a **2**, and so on.
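The network's classification is read off as the index of the output neuron with the largest activation. A sketch with made-up activations:

```python
import numpy as np

# Hypothetical activations of the 10 output neurons (column vector).
output = np.array([[0.02], [0.01], [0.88], [0.04], [0.03],
                   [0.01], [0.02], [0.05], [0.10], [0.06]])

digit = int(np.argmax(output))  # index of the most active neuron
print(digit)  # the network classifies the image as a 2
```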

### Minimizing the Error Function

This network needs to "learn" or be "trained" to classify the digits correctly.

The amount of error in a network is sensitive to its weights and biases. Training or learning in a network is a matter of finding a set of weights and biases that minimize the error.

To get some insight into the general idea, consider the function `f(x,y) = x^2 + y^2` and how we might move in small steps to values that minimize this function.

`gradf(x,y)` is the gradient of `f(x,y)`. The gradient is a function. It takes two coordinates as a position and returns two coordinates as the direction of steepest ascent.

`gradf(x,y) = [[(delf)/(delx)(x,y)], [(delf)/(dely)(x,y)]] = [[2x],[2y]] `.

So, for example, if the starting-point is `(1,3)`, the direction of steepest ascent is toward

`[[2*1],[2*3]] = [[2],[6]] `

We want to move in the direction of steepest descent, so we take negative steps.

If (with a step parameter of `eta = 0.01`) we take a negative step from `(1,3)` in the direction `(2,6)`, we reach `(0.98,2.94)`. In the `x` direction, we step by `-0.01*2`. In the `y` direction, we step by `-0.01*6`.

The value of `f(x,y)` at `(1,3)` is 10. The value at `(0.98,2.94)` is 9.604.

From the new position, the steepest ascent is toward

`[[2*0.98],[2*2.94]] = [[1.96],[5.88]] `

If we take a negative step from `(0.98,2.94)` with step parameter `eta = 0.01`, we step to `(0.9604,2.8812)`. The value of `f(x,y)` at this position is 9.2236816. With each step from `(1,3)`, the value of the function `f(x,y) = x^2 + y^2` decreases.
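The worked example above can be reproduced in a short loop (the variable names are mine):

```python
def f(x, y):
    return x**2 + y**2

def grad_f(x, y):
    # Gradient of f: the direction of steepest ascent.
    return (2*x, 2*y)

x, y, eta = 1.0, 3.0, 0.01
for step in range(3):
    print(x, y, f(x, y))
    gx, gy = grad_f(x, y)
    # Step in the direction of steepest descent.
    x, y = x - eta * gx, y - eta * gy
```

This prints `f` values of 10.0, 9.604, and (approximately) 9.2236816, matching the steps in the text.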

We want to minimize the error function for the network we are training.

### An Example Image from the MNIST Data Set

The image below of the digit **5** is an example of the images
in **training_data**.

**training_data** is a list of 60,000
2-tuples **(x, y)**.

**x** is a 784-dimensional array that represents the image.

**y** is a 10-dimensional array that represents the label for the image. It indicates the digit in the image.

**training_data[0]** is the first tuple.

**training_data[0][0]** is the **x** in the first tuple.

**training_data[0][1]** is the **y** in the first tuple.
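The 10-dimensional **y** arrays are one-hot encodings of the digit labels. A minimal sketch of the encoding (it mirrors the `vectorized_result` helper in Nielsen's loader):

```python
import numpy as np

def vectorized_result(j):
    # A 10-dimensional column vector with 1.0 in the j-th position.
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

y = vectorized_result(5)  # the label for an image of a "5"
print(y.ravel())
```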

I use the source code ((c) 2012-2018 Michael Nielsen) from Michael Nielsen's *Neural Networks and Deep Learning*, which he has made available in a GitHub repository.

To get a copy of this source code, I installed
Git onto my Linux distribution (Arch Linux), made a directory I named "git," changed the
current directory to the one I made, and created a copy (or "clone") of the repository:

```
sudo pacman -S git
mkdir git
cd git
git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
```

Nielsen's Python code is in Python 2.6 or 2.7. Michal Daniel Dobrzanski has a
repository
with code in Python 3.5.2.

```
tom:arch [~/git/neural-networks-and-deep-learning/src]
% python2
Python 2.7.12 (default, Jun 28 2016, 08:31:05)
[GCC 6.1.1 20160602] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mnist_loader
>>> training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
>>> training_data[0][1].shape
(10, 1)
>>> training_data[0][1]
array([[ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 1.],   # the image shows a "5"
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.]])
>>> training_data[0][0].shape
(784, 1)
>>> import numpy as np
>>> image_array = np.reshape(training_data[0][0], (28, 28))
>>> import matplotlib.pyplot as plt
>>> image = plt.imshow(image_array, cmap='gray')
>>> plt.show()
```

## Making the MNIST Dataset Ready

**mnist.pkl.gz** is a "pickled" tuple of 3 lists:
the training set (**training_data**), the validation set (**validation_data**),
and the testing set (**test_data**).

The function **load_data_wrapper()** returns **training_data**, **validation_data**, **test_data**.

**validation_data** and **test_data** are lists containing 10,000
2-tuples **(x, y)**.

**x** is a 784-dimensional array that represents the image.

**y** is a 10-dimensional array that represents the label for the image. It indicates the digit in the image.

We will not use the **validation_data** in this lecture.

```
import cPickle
import gzip

import numpy as np

def load_data():
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
```

## The Rest of the Python Program

We will not try to understand the source code (Copyright (c) 2012-2018 Michael Nielsen) or the underlying algorithm in detail.

### The Network Class

```
class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
```

In the program, we use this class to create a 784x30x10 neural network (784 neurons in the input layer, 30 in the hidden layer, and 10 in the output layer).

With `sizes = [784, 30, 10]`, `sizes[1:]` is `[30, 10]` and `sizes[:-1]` is `[784, 30]`.
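The list comprehensions in `__init__` can be checked directly. A sketch, run outside the class:

```python
import numpy as np

sizes = [784, 30, 10]

# One bias column vector per non-input layer.
biases = [np.random.randn(y, 1) for y in sizes[1:]]
# One (y, x) weight matrix per pair of adjacent layers.
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

print([b.shape for b in biases])   # [(30, 1), (10, 1)]
print([w.shape for w in weights])  # [(30, 784), (10, 30)]
```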

### An Example Network

The following code creates a neural network (**net**) whose input layer has two neurons, whose middle layer has three neurons, and whose
output layer has one neuron.

```
tom:arch [~/git/neural-networks-and-deep-learning/src]
% python2
Python 2.7.12 (default, Jun 28 2016, 08:31:05)
[GCC 6.1.1 20160602] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import network                   # import module network.py
>>> net = network.Network([2, 3, 1]) # create an instance of the class
>>>
```

The biases and weights are set as random numbers. The input layer has no bias. Biases are only used in computing the output from later layers.

For the [2, 3, 1] network, the biases are in a 3 x 1 array and a 1 x 1 array.

```
>>> import network
>>> net = network.Network([2, 3, 1])
>>> net.biases[0].shape
(3, 1)
>>> net.biases[0]
array([[ 1.36630966],
       [ 1.05788544],
       [ 0.80606255]])
>>> net.biases[1].shape
(1, 1)
>>> net.biases[1]
array([[ 1.54813682]])
>>>
```

For the [2, 3, 1] network,
the weights are in a 3 x 2 array and a 1 x 3 array.

The first row in **net.weights[0]** contains the weights that the first neuron in the hidden layer attaches to
the outputs of the first and second neurons in the input layer.

```
>>> net.weights[0].shape
(3, 2)
>>> net.weights[0]
array([[-0.27640848,  0.13942239],
       [ 1.13350606,  1.51767629],
       [-0.03836741,  0.06409297]])
>>> net.weights[1].shape
(1, 3)
>>> net.weights[1]
array([[-0.72105625,  1.76366748,  1.49408987]])
>>>
```

### Stochastic (Mini-Batch) Gradient Descent

For each "epoch" of training, the training data is randomly shuffled and partitioned into "mini-batches." Once the last mini-batch is processed, the network is evaluated against the test data.
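The shuffling and partitioning can be sketched on a toy "training set" of 12 items with mini-batches of size 4 (the data here is hypothetical):

```python
import random

training_data = list(range(12))  # stand-in for 12 training images
mini_batch_size = 4

random.shuffle(training_data)
mini_batches = [training_data[k:k + mini_batch_size]
                for k in range(0, len(training_data), mini_batch_size)]
print(len(mini_batches))  # 3 mini-batches, each with 4 items
```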

```
def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):
    if test_data:
        n_test = len(test_data)
    n = len(training_data)
    for j in xrange(epochs):
        random.shuffle(training_data)
        mini_batches = [
            training_data[k:k+mini_batch_size]
            for k in xrange(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        if test_data:
            print "Epoch {0}: {1} / {2}".format(
                j, self.evaluate(test_data), n_test)
        else:
            print "Epoch {0} complete".format(j)
```

The method **update_mini_batch** updates the weights and biases.

The mini-batch is a random sample of images. So if the sample size is large enough,
the new weights and biases learned from the mini-batch approximate the weights and biases that would
be learned from training with all the images in the training data.

new_`w` `\Leftarrow w - eta/m sum_j (delE_(X_j))/(delw)`

new_`b` `\Leftarrow b - eta/m sum_j (delE_(X_j))/(delb)`

`m` is `len(mini_batch)`, the number of images in the mini-batch.

`eta` is the step parameter.
For each input in the mini-batch, the method calculates and saves
an adjustment to the weights and biases that reduces the value of the error function for the network.
Next, given the step parameter and the average of the adjustments to the weights and biases for the inputs in the mini-batch,
the method updates the weights and biases in the network.

```
def update_mini_batch(self, mini_batch, eta):
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # accumulate the adjustments computed for each input in the mini-batch
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    # update the weights and biases
    self.weights = [w-(eta/len(mini_batch))*nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(self.biases, nabla_b)]
```

The **update_mini_batch** method uses the **backprop** method to compute the adjustments.

The **backprop** method has two parts.

In the "feedforward" section of the method,
**backprop** forward feeds the training input (**x**) through the network
and stores the **zs** and **activations**
layer by layer.

The **zs** are the input to the activation function.

The **activations** are the outputs of the activation function.

In the "backward pass" section,
**backprop** uses the **zs** and **activations**
to calculate the adjustments. This calculation is the most important (and difficult) part of the algorithm.

```
def backprop(self, x, y):
    # x is the input to the network, y is the label for x
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # feedforward: feed x forward through the network, storing the zs
    # and activations layer by layer. The first time through the loop,
    # the activation is the input to the network.
    activation = x
    activations = [x]  # activations is a list
    zs = []
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation)+b  # np.dot is the numpy dot product
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # backward pass
    # fundamental equation 1 ([-1] is the last item in the list;
    # * here is the Hadamard product, ⊙)
    delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
    # fundamental equation 3
    nabla_b[-1] = delta
    # fundamental equation 4
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    for l in xrange(2, self.num_layers):
        z = zs[-l]
        sp = sigmoid_prime(z)
        # fundamental equation 2
        delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
        # fundamental equation 3
        nabla_b[-l] = delta
        # fundamental equation 4
        nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
    # return the adjustments
    return (nabla_b, nabla_w)

def cost_derivative(self, output_activations, y):
    return (output_activations-y)

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
```

When an image `x` from the training set of `n` images is fed through the network, it produces a vector of outputs, `a_L`, different in general from the desired vector of outputs `y`. The ideal is to make `y = a_L` for each image `x` in the training set.

To approach this ideal, we minimize the average error in the network

`E = 1/n sum_x^n E_x`

where

`E_x = 1/2 norm(y - a_L)^2 = 1/2 sum_j (y^j - a_L^j)^2`

`y^j` is the `j^(th)` entry in the vector of desired activations
in the last layer of the network when image `x` is the input.

`a_L^j` is the `j^(th)` entry in the vector of actual activations in the last layer
of the network when image `x` is the input.

Q: Why the constant `1/2`?

A: To cancel the exponent when differentiating.
We can think of it as part of the step parameter, which we set.
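For a single image, the error is easy to compute by hand. A sketch with made-up 3-dimensional vectors (the real network uses 10-dimensional ones):

```python
import numpy as np

y   = np.array([0.0, 1.0, 0.0])  # desired activations
a_L = np.array([0.2, 0.7, 0.1])  # actual activations

# E_x = (1/2) * ||y - a_L||^2
E_x = 0.5 * np.sum((y - a_L) ** 2)
print(E_x)  # 0.5 * (0.04 + 0.09 + 0.01) = 0.07
```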

In a 784x30x10 network (the network trained in the example below to classify the MNIST images), there are 23,820 weights (784x30 + 30x10) and 40 biases (30 + 10). The activations in the output vector `a_L(x)` are a function of these weights and biases. So minimizing the error function steps to new values for the weights and biases that make the network more accurate.

(The 784x30x10 network is small. OpenAI's GPT-3 has 175 billion parameters!)

The `L^2` (Euclidean) norm corresponds to the length of the vector from the origin to the point. Suppose, for example, the point is

`u = [[x],[y]] = [[3],[4]] `

The `L^2` norm of `u` is `norm u = sqrt(3^2 + 4^2) = 5`.
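A quick check of the norm computation with NumPy:

```python
import numpy as np

u = np.array([3.0, 4.0])
print(np.linalg.norm(u))  # 5.0, the L2 norm
print(np.sum(u ** 2))     # 25.0, the squared norm the error function uses
```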

The error function uses the squared `L^2` norm to make the computation simpler. Because squaring eliminates the square root, the computation is just the sum of the squared entries of the vector.

The partial derivative of `E_x` with respect to `a_L^j` is

`(delE_x)/(dela_L^j) = del/(dela_L^j)[1/2(y^j-a_L^j)^2]`

`= 1/2 * del/(dela_L^j)[(y^j-a_L^j)^2]`

`= 1/2 * 2(y^j-a_L^j) * del/(dela_L^j)[y^j-a_L^j]`

`= (y^j-a_L^j) * (del/(dela_L^j)y^j - del/(dela_L^j)a_L^j) `

` = (y^j-a_L^j) * (0 - 1)`

` = (y^j-a_L^j) * -1`

` = a_L^j-y^j`

So `(delE_x)/(dela_L^j) = a_L^j-y^j`.

The **evaluate** method returns the number of test inputs for which the
network output is correct.

```
def evaluate(self, test_data):
    # the output is the index of the first neuron in the final
    # layer with a maximum activation
    test_results = [(np.argmax(self.feedforward(x)), y)
                    for (x, y) in test_data]
    return sum(int(x == y) for (x, y) in test_results)

def feedforward(self, a):
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a
```

## Four Fundamental Equations for Training the Network

`delta_l^j` is the error in the `j^(th)` neuron in the `l^(th)` layer

`z_l^j` is the input to the activation function for the `j^(th)` neuron in the `l^(th)` layer

`z_l^j = sum_k w_l^(jk) a_(l-1)^k + b_l^j`

`w_l^(jk)` is the weight on the connection into the `j^(th)` neuron in the `l^(th)` layer
from the `k^(th)` neuron in the `(l - 1)^(th)` layer

`a_(l-1)^k` is the activation of the `k^(th)` neuron in the `(l - 1)^(th)` layer

`a_(l-1)^k = sigma(z_(l-1)^k)`

`b_l^j` is the bias of the `j^(th)` neuron in the `l^(th)` layer

Chain Rule:

`(delz)/(dely) (dely)/(delx) = (delz)/(delx)`, if `z = f(y)` and `y = g(x)`.

The four equations are stated in terms of `delta_l^j`, which is defined as `(delE_x)/(delz_l^j)`.

**Fundamental Equation 1. Error in the Last Layer**

`delta_L^j` = `(delE_x)/(delz_L^j)` = `(delE_x)/(dela_L^j)sigma'(z_L^j)`.

`(delE_x)/(delz_L^j) = sum_k (delE_x)/(dela_L^k) (dela_L^k)/(delz_L^j) `, by the chain rule.

`sum_k (delE_x)/(dela_L^k) (dela_L^k)/(delz_L^j) = (delE_x)/(dela_L^j) (dela_L^j)/(delz_L^j)`, since `(delE_x)/(dela_L^k) (dela_L^k)/(delz_L^j) = 0` if `j \ne k`.

`(delE_x)/(dela_L^j) (dela_L^j)/(delz_L^j) = (delE_x)/(dela_L^j)sigma'(z_L^j)`, since `a_L^j = sigma(z_L^j)`.

Since the graph of the sigmoid function `sigma` is flat when its value is close to 0 or 1, `sigma'` is approximately 0 at these values. So, given fundamental equation #1, `delta_L^j` (the error in the `j^(th)` neuron in the last layer) is approximately 0 at these values. In this situation, given fundamental equations #3 and #4, the neuron stops "learning" new weights and biases.
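The saturation effect is easy to see numerically. A sketch mirroring the `sigmoid_prime` helper in Nielsen's code:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid function.
    return sigmoid(z) * (1 - sigmoid(z))

print(sigmoid_prime(0.0))    # 0.25, the maximum: learning is fastest here
print(sigmoid_prime(10.0))   # ~0.000045: the neuron has saturated near 1
print(sigmoid_prime(-10.0))  # ~0.000045: the neuron has saturated near 0
```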

**Fundamental Equation 2. Error in the Hidden Layers**

`delta_l^j = (delE_x)/(delz_l^j) = sum_k w_(l+1)^(kj)delta_(l+1)^k sigma'(z_l^j)`.

`(delE_x)/(delz_l^j) = sum_k (delE_x)/(delz_(l+1)^k) (delz_(l+1)^k)/(delz_l^j)`, by the chain rule.

`sum_k (delE_x)/(delz_(l+1)^k) (delz_(l+1)^k)/(delz_l^j) = sum_k (delz_(l+1)^k)/(delz_l^j) (delE_x)/(delz_(l+1)^k) = sum_k (delz_(l+1)^k)/(delz_l^j) delta_(l+1)^k`.

`(delz_(l+1)^k)/(delz_l^j) = w_(l+1)^(kj)sigma'(z_l^j)`, since `z_(l+1)^k = sum_m w_(l+1)^(km)a_l^m + b_(l+1)^k = sum_m w_(l+1)^(km)sigma(z_l^m) + b_(l+1)^k`.

So `sum_k (delz_(l+1)^k)/(delz_l^j) delta_(l+1)^k = sum_k w_(l+1)^(kj)sigma'(z_l^j) delta_(l+1)^k = sum_k w_(l+1)^(kj) delta_(l+1)^k sigma'(z_l^j)`.

**Fundamental Equation 3. New Bias**

`(delE_x)/(delb_l^j) = delta_l^j`.

`(delE_x)/(delb_l^j) = (delE_x)/(delz_l^j) (delz_l^j)/(delb_l^j)`, by the chain rule.

`(delE_x)/(delz_l^j) (delz_l^j)/(delb_l^j) = delta_l^j(delz_l^j)/(delb_l^j)`.

`(delz_l^j)/(delb_l^j) = 1`, since `z_l^j = sum_k w_l^(jk)a_(l-1)^k + b_l^j`.

**Fundamental Equation 4. New Weights**

`(delE_x)/(delw_l^(jk)) = a_(l-1)^kdelta_l^j`.

`(delE_x)/(delw_l^(jk)) = (delz_l^j)/(delw_l^(jk)) (delE_x)/(delz_l^j)`, by the chain rule.

`(delz_l^j)/(delw_l^(jk)) (delE_x)/(delz_l^j)= (delz_l^j)/(delw_l^(jk))delta_l^j`.

`(delz_l^j)/(delw_l^(jk)) = a_(l-1)^k`, since `z_l^j = sum_k w_l^(jk)a_(l-1)^k + b_l^j`.

## The [784,30,10] Network in Action

The network has 784 neurons in the input layer, 30 in the hidden layer, and 10 in the output layer. The code to train the network uses mini-batch, stochastic gradient descent to learn from the MNIST training_data over 30 epochs. The mini-batch size is 10. The step parameter (η) is 3.0. After the network has been trained, the code tests the network against a random image.

```
tom:arch [~/git/neural-networks-and-deep-learning/src]
% python2
Python 2.7.12 (default, Nov 7 2016, 11:55:55)
[GCC 6.2.1 20160830] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mnist_loader
>>> training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
>>> import network
>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
Epoch 0: 8268 / 10000
Epoch 1: 8393 / 10000
Epoch 2: 8422 / 10000
Epoch 3: 8466 / 10000
.
.
.
Epoch 27: 9497 / 10000
Epoch 28: 9495 / 10000
Epoch 29: 9478 / 10000
>>> import numpy as np
>>> imgnr = np.random.randint(0,10000)
>>> prediction = net.feedforward( test_data[imgnr][0] )
>>> print("Image number {0} is a {1}, and the network predicted a {2}".format(imgnr, test_data[imgnr][1], np.argmax(prediction)))
Image number 4709 is a 2, and the network predicted a 2
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(1,2,figsize=(8,4))
>>> ax[0].matshow( np.reshape(test_data[imgnr][0], (28,28) ), cmap='gray' )
>>> ax[1].plot( prediction, lw=3 )
>>> ax[1].set_aspect(9)
>>> plt.show()
```

Another way to link neurons together forms a *convolutional* neural network. The layers in such a network are not fully connected. An example convolutional neural network for the MNIST dataset reaches 98.80% accuracy. The trained network predicts that the randomly chosen image is a "2."