# Philosophy, Computing, and Artificial Intelligence

PHI 319. Recognizing Digits in the MNIST Data Set.

******** UNDER CONSTRUCTION ********

## Artificial Neurons

Perceptron

An artificial neuron is a computational model of a neuron.

A typical neuron has dendrites, a cell body, and an axon. The dendrites (from the Greek δενδρίτης) take input from other neurons in the form of electrical impulses. The cell body processes these impulses, and the output goes from axon terminals to other neurons.

According to one recent estimate, in the average male human brain there are 86 billion neurons.

## Perceptron Neurons

A perceptron is an artificial neuron. It takes binary (0 or 1) inputs (x1 ... xm) and computes a binary output. The computation is a function of weights (w1 ... wm) and a threshold value. If the sum w1x1 + ... + wmxm is greater than the threshold, the output is 1. Otherwise, it is 0.

Perceptrons can implement truth-functions. Conjunction (φ ∧ ψ) is an example. Let the perceptron have two inputs, each with a weight of 0.6, and a threshold value of 1. If both inputs are 1, the sum exceeds the threshold value of the perceptron and thus the output is 1. Otherwise, the output is 0. With these conditions for activating the perceptron, the output matches the truth-table for conjunction.
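
The conjunction example can be checked with a few lines of Python. This is only a minimal sketch: the function name perceptron and its argument names are chosen here for illustration and do not come from the source code used later in the lecture.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Conjunction: two inputs, each with a weight of 0.6, and a threshold of 1.
truth_table = {}
for p in (0, 1):
    for q in (0, 1):
        truth_table[(p, q)] = perceptron([p, q], [0.6, 0.6], 1)
        print("{0} and {1} -> {2}".format(p, q, truth_table[(p, q)]))
```

Only the input (1, 1) produces a weighted sum (1.2) above the threshold, so only that row of the table has the output 1.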

A perceptron is an instance of the integrate-and-fire model. A neuron receives inputs through synapses. The weights correspond to the relative efficiency with which a synapse communicates inputs to the cell body, so some inputs weigh more heavily than others in the computation. Since it takes resources for the neuron to fire, the neuron is quiet unless the threshold is crossed.

The computation in a perceptron is typically expressed mathematically as the dot product

w · x, where w is an m-vector of weights and x is an m-vector of inputs.

The negative of the threshold value is the perceptron's bias, b. In these terms, the value of the output activation function for a given set of inputs is 1 if w · x + b > 0 and is 0 otherwise.
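
As a rough illustration, the sketch below compares the threshold form and the bias form of the conjunction perceptron on every input. The names perceptron_bias, w, b, and threshold are assumptions made for this example only.

```python
import numpy as np

def perceptron_bias(x, w, b):
    """Bias form of the perceptron: output 1 if w . x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([0.6, 0.6])   # weights from the conjunction example
threshold = 1.0
b = -threshold             # the bias is the negative of the threshold

agreement = []
for inputs in ([0, 0], [0, 1], [1, 0], [1, 1]):
    x = np.array(inputs)
    by_threshold = 1 if np.dot(w, x) > threshold else 0
    agreement.append(perceptron_bias(x, w, b) == by_threshold)
```

Every entry in agreement is True, which is what we expect, since w · x > threshold and w · x + b > 0 are the same condition when b is the negative of the threshold.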

## Sigmoid Neurons

"Suppose we arrange for some automatic means of testing the effectiveness of any current weight [and bias] assignment [in the neuron] in terms of actual performance and provide a mechanism for altering the weight [and bias] assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programed would 'learn' from its experience" (Arthur L. Samuel, "Artificial Intelligence: A Frontier of Automation," 17. The Annals of the American Academy of Political and Social Science. Vol. 340, Automation, 10-20, 1962).

A sigmoid neuron has an important feature a perceptron lacks: small changes in the weights and bias cause small changes in the output. This allows sigmoid neurons to "learn."

We can make a neuron "learn" by changing its weights and biases. We know what the output should be. So if it is not what it should be, we change the weights and biases so that the output is closer to what it should be. In this way, the neuron "learns" what its output should be.

This talk about the neuron "learning" makes sense if we pretend that the neuron itself is changing its weights and biases in an effort to correct its mistakes. It sees that its output is not what it should be, so it adjusts its weights and biases in an effort to do better. Through many iterations of "learning," the neuron's output approaches the correct output. The neuron, in this way, "learns" like an archer who tries to get closer to hitting the target by slightly adjusting the angle of the arrow and how far he pulls back the string after each shot.

A sigmoid neuron has the mathematical parts in a perceptron (inputs, weights, and a bias), but there are two important differences. The inputs and outputs are not binary. The inputs may have any value from 0 to 1. The activation function is also different. It is the sigmoid function.

The sigmoid function is σ(z) = 1/(1 + e^-z), where z = w · x + b

As the activation function, the sigmoid function maps w · x + b to a smooth curve that preserves desirable features of the activation function for perceptrons. When w · x + b is a large positive number, the output of the function is close to 1 because e^-z is close to 0. When w · x + b is a large negative number, the output is close to 0 because e^-z is extremely large.
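
The behavior of the sigmoid function at the extremes, and the way a small change in a weight produces a small change in the output, can be seen in a short sketch. The particular weights, inputs, and bias below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function 1/(1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large positive and large negative inputs push the output toward 1 and 0.
high = sigmoid(20.0)
low = sigmoid(-20.0)

# A small change in one weight causes only a small change in the output.
w = np.array([0.6, 0.6])
x = np.array([1.0, 1.0])
b = -1.0
before = sigmoid(np.dot(w, x) + b)
after = sigmoid(np.dot(w + np.array([0.01, 0.0]), x) + b)
print("change in output: {0}".format(after - before))
```

A perceptron with the same weights would either not change at all or jump from 0 to 1. The sigmoid neuron's output moves by a small amount instead.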

## Artificial Neural Networks

Artificial neurons may be linked together in a feedforward network in which the output from one layer is input for the next layer. The first layer is the input layer of neurons. The last layer is the output layer. The hidden layers are the layers of neurons between the input layer and the output layer.

A feedforward network of artificial neurons may be understood as a device that makes "decisions about decisions." The first layer of neurons makes a "decision" about the input, the next layer makes a "decision about the decision" of the prior layer, and so on.

## A Feedforward Network to Classify Digits

For a good video (YouTube) introduction to MNIST, see the series Neural Networks, by 3Blue1Brown.

In grayscale, the intensity of light for each pixel is represented as a number from 0 to 255. 0 represents "black" (no light), 255 represents "white" (all light), and values between 0 and 255 represent shades of "gray" that get lighter as the value increases.

The MNIST dataset contains scanned images of handwritten digits.

MNIST is a (M) modified subset of two datasets (Special Database 1 and Special Database 3) of images of handwritten digits that the National Institute of Standards and Technology (NIST) collected. Special Database 1 was collected from high school students. Special Database 3 was collected from employees of the US Census Bureau. The MNIST data draws on both datasets and normalizes the images so that each is 28 x 28 pixels in grayscale.

The images in the dataset are split into 60,000 training images and 10,000 test images.

The input to each neuron in the input layer in the network is one pixel from the input image. Since each image is 28 x 28 pixels, the input layer has 784 (or 28 x 28) neurons.

The input layer is part of a neural network of sigmoid neurons. Because the 28 x 28 images in the MNIST dataset are in grayscale, each is represented as a NumPy array of 784 values between 0 and 1. (NumPy is the package for scientific computing with Python.)
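
To make the representation concrete, here is a minimal sketch of how a 28 x 28 grayscale image could be turned into 784 values between 0 and 1. The image is random stand-in data, and the assumption that we start from raw intensities from 0 to 255 is made only for this example.

```python
import numpy as np

# Hypothetical 28 x 28 image with integer intensities from 0 to 255.
rng = np.random.RandomState(0)
image = rng.randint(0, 256, size=(28, 28))

# Scale the intensities to the range 0 to 1 and flatten to a 784 x 1 column.
x = np.reshape(image / 255.0, (784, 1))
print("shape: {0}".format(x.shape))
```

Each of the 784 entries in x is the input to one neuron in the input layer.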

The output layer in the network has 10 neurons. The first neuron in this layer indicates whether the image is a 0, the second whether it is a 1, the third whether it is a 2, and so on.

### Minimizing the Error Function

This network needs to "learn" or be "trained" to classify the digits correctly.

The amount of error in a network is sensitive to its weights and biases. Training or learning in a network is a matter of finding a set of weights and biases that minimize the error.

To get some insight into the general idea, consider the function f(x,y) = x^2 + y^2 and how we might move in small steps to values that minimize this function.

gradf(x,y) is the gradient of f(x,y). The gradient is a function. It takes the two coordinates of a position and returns a vector that points from that position in the direction of steepest ascent.

gradf(x,y) = [[(delf)/(delx)(x,y)], [(delf)/(dely)(x,y)]] = [[2x],[2y]] .

So, for example, if the starting-point is (1,3), the direction of steepest ascent is toward

[[2*1],[2*3]] = [[2],[6]]

We want to move in the direction of steepest descent, so we take negative steps.

With a step parameter of eta = 0.01, a negative step from (1,3) along the gradient (2,6) takes us to (0.98,2.94). In the x direction, the step is -0.01*2 = -0.02. In the y direction, the step is -0.01*6 = -0.06.

The value of f(x,y) at (1,3) is 10. The value at (0.98,2.94) is 9.604.

From the new position, the steepest ascent is toward

[[2*0.98],[2*2.94]] = [[1.96],[5.88]]

If we take a negative step from (0.98,2.94) with step parameter eta = 0.01, we step to (0.9604,2.8812). The value of f(x,y) at this position is 9.2236816. With each step from (1,3), the value of the function f(x,y) = x^2 + y^2 decreases.
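
The two steps worked out above can be reproduced in a short sketch. The function names f and grad_f and the list values are chosen here for illustration.

```python
def f(x, y):
    return x**2 + y**2

def grad_f(x, y):
    """Gradient of f: the vector (2x, 2y)."""
    return (2 * x, 2 * y)

eta = 0.01          # step parameter
x, y = 1.0, 3.0     # starting point
values = [f(x, y)]  # value of f at each position visited

for step in range(2):
    gx, gy = grad_f(x, y)
    # step against the gradient, in the direction of steepest descent
    x, y = x - eta * gx, y - eta * gy
    values.append(f(x, y))
    print("step {0}: position ({1}, {2}), f = {3}".format(step + 1, x, y, values[-1]))
```

The values in the list are 10, then approximately 9.604 and 9.2236816, matching the hand calculation. Running the loop for more steps keeps reducing the value toward the minimum at (0,0).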

### An Example Image from the MNIST Data Set

The image below of the digit 5 is an example of the images in training_data.

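training_data is a list of 50,000 2-tuples (x, y). (The loader sets aside the other 10,000 of the 60,000 MNIST training images as validation_data.)
x is a 784-dimensional array that represents the image.
y is a 10-dimensional array that represents the label for the image. It indicates the digit in the image.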

training_data[0] is the first tuple.
training_data[0][0] is the x in the first tuple.
training_data[0][1] is the y in the first tuple.

I use the source code ((c) 2012-2018 Michael Nielsen) from Michael Nielsen's Neural Networks and Deep Learning. He has made it available in a GitHub repository.

To get a copy of this source code, I installed Git onto my Linux distribution (Arch Linux), made a directory I named "git," changed the current directory to the one I made, and created a copy (or "clone") of the repository:

sudo pacman -S git
mkdir git
cd git
git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

Nielsen's Python code is in Python 2.6 or 2.7. Michal Daniel Dobrzanski has a repository with code in Python 3.5.2.


tom:arch [~/git/neural-networks-and-deep-learning/src]
% python2
Python 2.7.12 (default, Jun 28 2016, 08:31:05)
[GCC 6.1.1 20160602] on linux2
>>> import mnist_loader
>>> training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
>>> training_data[0][1].shape
(10, 1)
>>> training_data[0][1]
array([[ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 1.], # the image shows a "5"
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.]])
>>> training_data[0][0].shape
(784, 1)
>>> import numpy as np
>>> image_array = np.reshape(training_data[0][0], (28, 28))
>>> import matplotlib.pyplot as plt
>>> image = plt.imshow(image_array, cmap='gray')
>>> plt.show()

## Making the MNIST Dataset Ready

mnist.pkl.gz is a "pickled" tuple of 3 lists: the training set (training_data), the validation set (validation_data), and the testing set (test_data).

The function load_data_wrapper() returns training_data, validation_data, test_data.

validation_data and test_data are each lists containing 10,000 2-tuples (x, y).
x is a 784-dimensional array that represents the image.
y is the label for the image. Unlike the y in training_data, it is not a 10-dimensional array. It is the digit itself (an integer from 0 to 9).

We will not use the validation_data in this lecture.


import cPickle
import gzip
import numpy as np

def load_data():
    # unpickle the three data sets from the compressed file
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    # reshape each image to a 784 x 1 column; vectorize only the training labels
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    # 10 x 1 column of zeros with 1.0 in position j
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
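
As a quick usage sketch, the function vectorized_result turns a digit into the 10-dimensional label used in training_data, and np.argmax recovers the digit from the label. The variable names y and digit are chosen here for illustration.

```python
import numpy as np

def vectorized_result(j):
    """Return a 10 x 1 column of zeros with 1.0 in position j."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

y = vectorized_result(5)    # label for an image of the digit 5
digit = int(np.argmax(y))   # position of the 1.0, which is the digit
print("label shape {0}, digit {1}".format(y.shape, digit))
```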

## The Rest of the Python Program

We will not try to understand in detail either the source code (Copyright (c) 2012-2018 Michael Nielsen) or the underlying algorithm.

### The Network Class

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

In the program, we use this class to create a 784x30x10 neural network (784 neurons in the input layer, 30 in the hidden layer, and 10 in the output layer).

sizes = [784,30,10], sizes[1:] = [30,10], sizes[:-1] = [784,30]
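
To see what these slices produce, the sketch below repeats the two list comprehensions from __init__ outside the class and inspects the shapes. The names bias_shapes, weight_shapes, n_weights, and n_biases are introduced here for illustration.

```python
import numpy as np

sizes = [784, 30, 10]

# Same comprehensions as in Network.__init__, outside the class.
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

bias_shapes = [b.shape for b in biases]
weight_shapes = [w.shape for w in weights]
n_weights = sum(w.size for w in weights)
n_biases = sum(b.size for b in biases)
print("weights {0}, biases {1}".format(weight_shapes, bias_shapes))
```

There is one bias array and one weight array for each layer after the input layer. The weight array for a layer has one row per neuron in that layer and one column per neuron in the previous layer.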

### An Example Network

The following code creates a neural network (net) whose input layer has two neurons, whose middle layer has three neurons, and whose output layer has one neuron.

tom:arch [~/git/neural-networks-and-deep-learning/src]
% python2
Python 2.7.12 (default, Jun 28 2016, 08:31:05)
[GCC 6.1.1 20160602] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import network                   # import module network.py
>>> net = network.Network([2, 3, 1]) # create instance of class
>>>

The biases and weights are set as random numbers. The input layer has no bias. Biases are only used in computing the output from later layers.

For the [2, 3, 1] network, the biases are in a 3 x 1 array and a 1 x 1 array.

tom:arch [~/git/neural-networks-and-deep-learning/src]
% python2
Python 2.7.12 (default, Jun 28 2016, 08:31:05)
[GCC 6.1.1 20160602] on linux2
>>> import network
>>> net = network.Network([2, 3, 1])
>>> net.biases[0].shape
(3, 1)
>>> net.biases[0]
array([[ 1.36630966],
       [ 1.05788544],
       [ 0.80606255]])
>>> net.biases[1].shape
(1, 1)
>>> net.biases[1]
array([[ 1.54813682]])
>>>

For the [2, 3, 1] network, the weights are in a 3 x 2 array and a 1 x 3 array.

The first row in net.weights[0] contains the weights that the first neuron in the hidden layer gives to the outputs of the first and second neurons in the input layer.

>>> net.weights[0].shape
(3, 2)
>>> net.weights[0]
array([[-0.27640848,  0.13942239],
       [ 1.13350606,  1.51767629],
       [-0.03836741,  0.06409297]])
>>> net.weights[1].shape
(1, 3)
>>> net.weights[1]
array([[-0.72105625,  1.76366748,  1.49408987]])
>>> 

For each "epoch" of training, the training data is randomly shuffled and partitioned into "mini-batches." Once the last mini-batch is processed, the network is evaluated against the test data.

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            # shuffle, then partition into mini-batches
            # (the module needs "import random" at the top)
            random.shuffle(training_data)
            mini_batches = [training_data[k:k+mini_batch_size] for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)
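
The shuffling and partitioning step can be tried on its own. In the sketch below, a list of 25 integers stands in for the list of (x, y) tuples. This stand-in data and the name batch_sizes are assumptions made for the example.

```python
import random

training_data = list(range(25))   # stand-in for a list of (x, y) tuples
mini_batch_size = 10

random.shuffle(training_data)     # shuffles the list in place
n = len(training_data)
mini_batches = [training_data[k:k + mini_batch_size]
                for k in range(0, n, mini_batch_size)]

batch_sizes = [len(mini_batch) for mini_batch in mini_batches]
print("mini-batch sizes: {0}".format(batch_sizes))
```

Every item lands in exactly one mini-batch. When the batch size does not divide the number of items evenly, the last mini-batch is smaller than the others.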

The method update_mini_batch updates the weights and biases.

A mini-batch is a random sample of images. So if the sample size is large enough, the new weights and biases learned from the mini-batch approximate the weights and biases that would be learned from training with all the images in the training data.

new_w \Leftarrow w - eta/m sum_j (delE_(X_j))/(delw)

new_b \Leftarrow b - eta/m sum_j (delE_(X_j))/(delb)

m is len(mini_batch), the number of images in the mini-batch

eta is the step parameter (eta in the code)

For each input in the mini-batch, the method calculates and saves an adjustment to the weights and biases that reduces the value of the error function for the network. Next, given the step parameter and the average of the adjustments to the weights and biases for the inputs in the mini-batch, the method updates the weights and biases in the network.
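
Here is a minimal numeric sketch of the update rule for a single 1 x 2 weight array. The per-image gradients are made-up numbers, and the names grads, nabla_w, and new_w are chosen for illustration. In the real method the gradients come from backprop.

```python
import numpy as np

eta = 3.0                          # step parameter
w = np.array([[0.5, -0.2]])        # current weights

# Hypothetical gradients of the error, one for each image in the mini-batch.
grads = [np.array([[0.1, 0.0]]), np.array([[0.3, -0.2]])]
m = len(grads)                     # number of images in the mini-batch

# Sum the per-image gradients, then take one averaged step against them.
nabla_w = np.zeros(w.shape)
for g in grads:
    nabla_w = nabla_w + g
new_w = w - (eta / m) * nabla_w
print("new weights: {0}".format(new_w))
```

The summed gradient is [[0.4, -0.2]] and eta/m is 1.5, so the new weights are approximately [[-0.1, 0.1]].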

    def update_mini_batch(self, mini_batch, eta):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        #
        # sum the adjustments for each input in the mini-batch
        #
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        #
        # update weights and biases
        #
        self.weights = [w-(eta/len(mini_batch))*nw for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb for b, nb in zip(self.biases, nabla_b)]

The update_mini_batch method uses the method backprop to compute the adjustments.

The backprop method has two parts.

In the "feedforward" section of the method, backprop feeds the training input (x) forward through the network and stores the zs and activations layer by layer.

The zs are the input to the activation function.

The activations are the outputs of the activation function.

In the "backward pass" section, backprop uses the zs and activations to calculate the adjustments. This calculation is the most important (and difficult) part of the algorithm.

    def backprop(self, x, y):
        #
        # x is the input to the network, y is the label for x
        #
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        #
        # feedforward
        #
        # feed x forward through the network
        # The first time through the loop the activation is the input to the network
        #
        activation = x
        activations = [x]                      # activations is a list
        zs = []
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b        # np.dot is the numpy dot product
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        #
        # backward pass
        #
        # fundamental equation 1
        # ([-1] is the last item in the list; * is the Hadamard product, ⊙)
        delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
        #
        # fundamental equation 3
        nabla_b[-1] = delta
        #
        # fundamental equation 4
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        #
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp    # fundamental equation 2
            nabla_b[-l] = delta                                           # fundamental equation 3
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())    # fundamental equation 4
        return (nabla_b, nabla_w)

    def cost_derivative(self, output_activations, y):
        return (output_activations-y)

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
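
One way to convince ourselves that sigmoid_prime really is the derivative of sigmoid is to compare it with a numerical slope. The sketch below does this with a central difference. The test points in z and the step h are arbitrary choices made for the example.

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z) * (1 - sigmoid(z))

z = np.array([-2.0, 0.0, 0.5, 3.0])   # arbitrary test points
h = 1e-6                              # small step for the numerical slope

# Central difference: slope of sigmoid between z - h and z + h.
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_gap = np.max(np.abs(numeric - sigmoid_prime(z)))
print("largest difference: {0}".format(max_gap))
```

The largest difference is tiny, which is what we expect if the formula for the derivative is right.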


When an image x from the training set of n images is fed through the network, it produces a vector of outputs, a_L, different in general from the desired vector of outputs y. The ideal is to make y = a_L for each image x in the training set.

To approach this ideal, we minimize the average error in the network

E = 1/n sum_x^n E_x

where

E_x = 1/2 norm(y - a_L)^2 = 1/2 sum_j (y^j - a_L^j)^2

y^j is the desired activation of the j^(th) neuron in the last layer of the network when image x is the input

a_L^j is the activation of the j^(th) neuron in the last layer of the network when image x is the input

Q: Why the constant 1/2?
A: To cancel the exponent when differentiating. We can think of it as part of the step parameter, which we set.

In a 784x30x10 network (the network trained in the example below to classify the MNIST images), there are 23,820 weights (784x30 + 30x10) and 40 biases (30 + 10). The activations in the vector a_L are a function of these weights and biases. So minimizing the error function means stepping to new values for the weights and biases that make the network more accurate.

(The 784x30x10 network is small. OpenAI's GPT-3 has 175 billion parameters!)

The L^2 (Euclidean) norm corresponds to the length of the vector from the origin to the point. Suppose, for example, the point is

u = [[x],[y]] = [[3],[4]]

The L^2 norm of  u = norm u = \sqrt{3^2 + 4^2} = 5.

The error function uses the squared L^2 norm because it makes the computation simpler. Squaring eliminates the square root, so the computation is just the sum of the squared values of the vector.
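
The norm in the example, and the error E_x for one image, can be computed directly with NumPy. The desired output y and the network output a below are made-up two-neuron vectors used only for illustration.

```python
import numpy as np

u = np.array([[3.0], [4.0]])
length = np.linalg.norm(u)          # L^2 norm of u

# Hypothetical desired output y and actual output a for a two-neuron last layer.
y = np.array([[0.0], [1.0]])
a = np.array([[0.2], [0.7]])

E_norm = 0.5 * np.linalg.norm(y - a) ** 2   # half the squared norm
E_sum = 0.5 * np.sum((y - a) ** 2)          # half the sum of squared differences
print("norm {0}, error {1}".format(length, E_sum))
```

Both ways of computing the error give approximately 0.065, that is, 1/2 (0.2^2 + 0.3^2). The second way never takes a square root.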

The partial derivative of E_x with respect to a_L^j is

(delE_x)/(dela_L^j) = del/(dela_L^j)[1/2(y^j-a_L^j)^2]

= 1/2 * del/(dela_L^j)[(y^j-a_L^j)^2]

= 1/2 * 2(y^j-a_L^j) * del/(dela_L^j)[y^j-a_L^j]

= (y^j-a_L^j) * (del/(dela_L^j)y^j - del/(dela_L^j)a_L^j)

 = (y^j-a_L^j) * (0 - 1)

 = (y^j-a_L^j) * -1

 = a_L^j-y^j

So (delE_x)/(dela_L^j) = a_L^j-y^j. This is the value the cost_derivative method returns.

The evaluate method returns the number of test inputs for which the network output is correct.

    def evaluate(self, test_data):
        # the output is the index of the first neuron in the final layer with a maximum activation
        test_results = [(np.argmax(self.feedforward(x)), y) for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def feedforward(self, a):
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a
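
The counting in evaluate can be tried without a trained network. In the sketch below, three made-up output vectors for a three-neuron last layer stand in for the results of feedforward. The names outputs, labels, and correct are chosen for illustration.

```python
import numpy as np

# Hypothetical last-layer activations for three inputs.
outputs = [np.array([[0.1], [0.8], [0.3]]),
           np.array([[0.9], [0.2], [0.1]]),
           np.array([[0.2], [0.3], [0.7]])]
labels = [1, 0, 1]   # the correct answer for each input

# The prediction is the index of the neuron with the largest activation.
results = [(np.argmax(a), y) for a, y in zip(outputs, labels)]
correct = sum(int(p == y) for p, y in results)
print("{0} / {1}".format(correct, len(labels)))
```

The predictions are 1, 0, and 2, so two of the three match the labels.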

## Four Fundamental Equations for Training the Network

delta_l^j is the error in the j^(th) neuron in the l^(th) layer

z_l^j is the input to the activation function for the j^(th) neuron in the l^(th) layer

z_l^j = sum_k w_l^(jk) a_(l-1)^k + b_l^j

w_l^(jk) is the weight on the connection into the j^(th) neuron in the l^(th) layer from the k^(th) neuron in the (l - 1)^(th) layer

a_(l-1)^k is the activation of the k^(th) neuron in the (l - 1)^(th) layer

a_(l-1)^k = sigma(z_(l-1)^k)

b_l^j is the bias of the j^(th) neuron in the l^(th) layer

Chain Rule:

(delz)/(dely) (dely)/(delx) = (delz)/(delx), if z = f(y) and y = g(x)

The four equations are stated in terms of delta_l^j, which is defined as (delE_x)/(delz_l^j).

Fundamental Equation 1. Error in the Last Layer

delta_L^j = (delE_x)/(delz_L^j) = (delE_x)/(dela_L^j)sigma'(z_L^j).

(delE_x)/(delz_L^j) = sum_k (delE_x)/(dela_L^k) (dela_L^k)/(delz_L^j) , by the chain rule.

sum_k (delE_x)/(dela_L^k) (dela_L^k)/(delz_L^j) = (delE_x)/(dela_L^j) (dela_L^j)/(delz_L^j), since (delE_x)/(dela_L^k) (dela_L^k)/(delz_L^j) = 0 if j \ne k.

(delE_x)/(dela_L^j) (dela_L^j)/(delz_L^j) = (delE_x)/(dela_L^j)sigma'(z_L^j), since a_L^j = sigma(z_L^j).

Since the graph of the sigmoid function sigma is flat when its value is close to 0 or 1, sigma' is approximately 0 at these values. So, given fundamental equation #1, delta_L^j (the error in the j^(th) neuron in the last layer) is approximately 0 at these values. In this situation, given fundamental equations #3 and #4, the neuron stops "learning" new weights and biases.
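
How flat the sigmoid function becomes can be seen by evaluating its derivative at a few points. This is a small sketch, and the sample points 0, 10, and -10 are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

middle = sigmoid_prime(0.0)      # the slope is largest at z = 0
right = sigmoid_prime(10.0)      # sigmoid(10) is close to 1
left = sigmoid_prime(-10.0)      # sigmoid(-10) is close to 0
print("{0} {1} {2}".format(middle, right, left))
```

The derivative is 0.25 in the middle and less than 0.0001 at either end. A neuron whose output is that close to 0 or 1 gets almost no adjustment, even if its output is wrong.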

Fundamental Equation 2. Error in the Hidden Layers

delta_l^j = (delE_x)/(delz_l^j) = sum_k w_(l+1)^(kj)delta_(l+1)^ksigma'(z_l^j).

(delE_x)/(delz_l^j) = sum_k (delE_x)/(delz_(l+1)^k) (delz_(l+1)^k)/(delz_l^j), by the chain rule.

sum_k (delE_x)/(delz_(l+1)^k) (delz_(l+1)^k)/(delz_l^j) = sum_k (delz_(l+1)^k)/(delz_l^j) (delE_x)/(delz_(l+1)^k) = sum_k (delz_(l+1)^k)/(delz_l^j) delta_(l+1)^k.

(delz_(l+1)^k)/(delz_l^j) = w_(l+1)^(kj)sigma'(z_l^j), since z_(l+1)^k = sum_m w_(l+1)^(km)a_l^m + b_(l+1)^k = sum_m w_(l+1)^(km)sigma(z_l^m) + b_(l+1)^k.

So sum_k (delz_(l+1)^k)/(delz_l^j) delta_(l+1)^k = sum_k w_(l+1)^(kj)sigma'(z_l^j) delta_(l+1)^k = sum_k w_(l+1)^(kj) delta_(l+1)^k sigma'(z_l^j).
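
The sum over k in this equation is what the line with transpose() in backprop computes for a whole layer at once. The sketch below compares the two on random stand-in values for a layer of three neurons feeding a layer of two. The names W_next, delta_next, delta_sum, and delta_vec are assumptions made for the example, with W_next[k, j] holding the weight into neuron k of the later layer from neuron j of the earlier layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

rng = np.random.RandomState(0)
W_next = rng.randn(2, 3)        # weights into layer l+1 (2 neurons) from layer l (3 neurons)
delta_next = rng.randn(2, 1)    # errors in layer l+1
z = rng.randn(3, 1)             # inputs to the activation function in layer l
sp = sigmoid_prime(z)

# Neuron by neuron: sum over the neurons k in layer l+1.
delta_sum = np.zeros((3, 1))
for j in range(3):
    for k in range(2):
        delta_sum[j, 0] += W_next[k, j] * delta_next[k, 0] * sp[j, 0]

# Whole layer at once, as in backprop.
delta_vec = np.dot(W_next.transpose(), delta_next) * sp
```

The two results agree. The transpose is needed because the error in neuron j of the earlier layer collects the weights on the connections leaving j, which sit in column j of the weight array.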

Fundamental Equation 3. New Bias

(delE_x)/(delb_l^j) = delta_l^j.

(delE_x)/(delb_l^j) = (delE_x)/(delz_l^j) (delz_l^j)/(delb_l^j), by the chain rule.

(delE_x)/(delz_l^j) (delz_l^j)/(delb_l^j) = delta_l^j(delz_l^j)/(delb_l^j).

(delz_l^j)/(delb_l^j) = 1, since z_l^j = sum_k w_l^(jk)a_(l-1)^k + b_l^j.

Fundamental Equation 4. New Weights

(delE_x)/(delw_l^(jk)) = a_(l-1)^kdelta_l^j.

(delE_x)/(delw_l^(jk)) = (delz_l^j)/(delw_l^(jk)) (delE_x)/(delz_l^j), by the chain rule.

(delz_l^j)/(delw_l^(jk)) (delE_x)/(delz_l^j) = (delz_l^j)/(delw_l^(jk))delta_l^j.

(delz_l^j)/(delw_l^(jk)) = a_(l-1)^k, since z_l^j = sum_k w_l^(jk)a_(l-1)^k + b_l^j.
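
A useful sanity check on the four equations is to compare the derivatives they give with numerical slopes of the error function. The sketch below does this for a small [2, 3, 1] network. It is a simplified, stand-alone rewrite for illustration, not the lecture's source code: the functions take the weights and biases as arguments instead of using a class, and the input, label, and step h are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def error(weights, biases, x, y):
    """E_x: half the squared distance between y and the network output."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(np.dot(w, a) + b)
    return 0.5 * np.sum((y - a) ** 2)

def backprop(weights, biases, x, y):
    """Derivatives of E_x for every bias and weight, by the four equations."""
    activation, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    nabla_b = [None] * len(biases)
    nabla_w = [None] * len(weights)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])                # equation 1
    nabla_b[-1] = delta                                                  # equation 3
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())             # equation 4
    for l in range(2, len(weights) + 1):
        delta = np.dot(weights[-l + 1].transpose(), delta) * sigmoid_prime(zs[-l])  # equation 2
        nabla_b[-l] = delta                                              # equation 3
        nabla_w[-l] = np.dot(delta, activations[-l - 1].transpose())     # equation 4
    return nabla_b, nabla_w

rng = np.random.RandomState(0)
sizes = [2, 3, 1]
biases = [rng.randn(n, 1) for n in sizes[1:]]
weights = [rng.randn(n, m) for m, n in zip(sizes[:-1], sizes[1:])]
x = np.array([[0.5], [0.9]])    # made-up input
y = np.array([[1.0]])           # made-up label

nabla_b, nabla_w = backprop(weights, biases, x, y)
h = 1e-6

# Numerical slope of E_x with respect to one weight in the first weight array.
w_plus = [w.copy() for w in weights]
w_minus = [w.copy() for w in weights]
w_plus[0][1, 0] += h
w_minus[0][1, 0] -= h
numeric_w = (error(w_plus, biases, x, y) - error(w_minus, biases, x, y)) / (2 * h)
gap_w = abs(numeric_w - nabla_w[0][1, 0])

# Numerical slope of E_x with respect to one bias in the hidden layer.
b_plus = [b.copy() for b in biases]
b_minus = [b.copy() for b in biases]
b_plus[0][2, 0] += h
b_minus[0][2, 0] -= h
numeric_b = (error(weights, b_plus, x, y) - error(weights, b_minus, x, y)) / (2 * h)
gap_b = abs(numeric_b - nabla_b[0][2, 0])
print("gaps: {0} {1}".format(gap_w, gap_b))
```

Both gaps are very small, so the derivatives from the equations match the slopes measured by nudging a weight or a bias and watching the error change.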

## The [784,30,10] Network in Action

The network has 784 neurons in the input layer, 30 in the hidden layer, and 10 in the output layer. The code to train the network uses mini-batch stochastic gradient descent to learn from the MNIST training_data over 30 epochs. The mini-batch size is 10. The step parameter (η) is 3.0. After the network has been trained, the code tests the network against a randomly chosen image from test_data.

tom:arch [~/git/neural-networks-and-deep-learning/src]
% python2
Python 2.7.12 (default, Nov  7 2016, 11:55:55)
[GCC 6.2.1 20160830] on linux2
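>>> import mnist_loader
>>> training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
>>> import network
>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)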
Epoch 0: 8268 / 10000
Epoch 1: 8393 / 10000
Epoch 2: 8422 / 10000
Epoch 3: 8466 / 10000

.
.
.

Epoch 27: 9497 / 10000
Epoch 28: 9495 / 10000
Epoch 29: 9478 / 10000
>>> import numpy as np
>>> imgnr = np.random.randint(0,10000)
>>> prediction = net.feedforward( test_data[imgnr][0] )
>>> print("Image number {0} is a {1}, and the network predicted a {2}".format(imgnr, test_data[imgnr][1], np.argmax(prediction)))
Image number 4709 is a 2, and the network predicted a 2
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(1,2,figsize=(8,4))
>>> ax[0].matshow( np.reshape(test_data[imgnr][0], (28,28) ), cmap='gray' )
>>> ax[1].plot( prediction, lw=3 )
>>> ax[1].set_aspect(9)
>>> plt.show()


Another way to link neurons together forms a convolutional neural network. The layers in such a network are not fully connected. Here is an example convolutional neural network for the MNIST Data Set. It reaches 98.80% accuracy. The trained network predicts that the image (chosen randomly) is a "2."