Chapter 10: Neural Network Foundations

      In this chapter, we will study the foundations of neural networks. We'll start by discussing artificial neural networks and how they are inspired by the real biological neural networks in our own bodies. After that, we will review the classic Perceptron algorithm and its role in the history of neural networks.

      After constructing a Perceptron, we will study the backpropagation (BP) algorithm, the cornerstone of modern neural networks. We will implement the BP algorithm from scratch in Python to make sure we understand this important algorithm.

      Of course, modern neural network libraries such as Keras already ship with highly optimized, built-in BP implementations. Manually implementing backpropagation every time we want to train a neural network is as unrealistic as coding a linked list or hash table from scratch every time we face a general programming problem; it wastes time and resources. To simplify the process, I will demonstrate how to use the Keras library to create standard feedforward neural networks.

Finally, we will conclude this chapter by discussing the four elements needed to build any neural network.

1         Neural network foundation

Before moving on to convolutional neural networks, we first need to understand the basics of neural networks. In this chapter, we will review:

(1)Artificial neural networks and their relationship to biology.

(2)The seminal Perceptron algorithm.

(3)The BP algorithm and how it can be used to train multilayer neural networks efficiently.

(4)How to train neural networks using the Keras library.

By the end of this chapter, you will have a solid understanding of neural networks and be ready to move on to the more advanced convolutional neural networks.

1.1  Introduction to neural networks

Neural networks are an integral part of deep learning systems. To succeed in deep learning, we first need to review the basics of neural networks, including architectures, node types, and the algorithms used to train networks.

In this section, we review neural networks and their relationship to the biology of the human brain. We then discuss the most common architecture type, the feedforward neural network. The concept of "neural learning" and its relationship to the training algorithms covered later will also be discussed briefly.

(1)What is a neural network?

Many tasks involving intelligence, pattern recognition, and object detection are extremely difficult to automate, yet come easily and naturally to animals and young children. For example, how does a pet dog react differently to family members and strangers? How does a child tell a school bus from a city bus? Our brains continuously perform such complex tasks without us even noticing. The answer lies within our bodies: each of us contains a real biological neural network connected to our nervous system, a network of interconnected neurons. The word "neural" is the adjective form of "neuron", and "network" denotes a graph-like structure; an "artificial neural network" (ANN) is therefore a computational system meant to simulate the neural connections in our biological nervous system. ANNs are usually abbreviated as "ANN" or "NN"; this book uses both.

For a system to be considered a NN, it must contain a labeled, directed graph structure in which each node performs some simple computation. Figure 1 shows an example of a NN.

 

Figure 1 a simple NN architecture.

Each node performs a simple computation. Each connection then carries a signal (i.e., the output of a computation) from one node to another, and is labeled by a weight indicating how much the signal is amplified or diminished. Some connections have large, positive weights that amplify the signal, indicating the signal is important when making a classification. Others have negative weights that diminish the signal strength, indicating the node's output is less important to the final classification. We call such a system an ANN if it consists of a graph structure (as in Figure 1) whose connection weights can be adjusted by a learning algorithm.

(2)Relationship with biology

Our brains are made up of about 10 billion neurons, each connected to about 10,000 other neurons. The structure of biological neurons is shown in Figure 2.

 

Fig. 2 biological neuron structure

      The cell body of a neuron is called the soma; its inputs (dendrites) and outputs (axons) connect it to other somas.

      Each neuron receives electrochemical signals from other neurons at its dendrites. If these signals are strong enough to activate the neuron, the activated neuron transmits a signal along its axon to the dendrites of other neurons. These connected neurons may also fire, continuing to pass the message along.

      Whether a neuron fires is a binary operation: the neuron either fires or it does not; there are no degrees of firing. Simply put, a neuron fires when the total signal strength it receives exceeds a given threshold.

      However, keep in mind that ANNs are only inspired by our understanding of the brain and how it works. The goal of deep learning is not to mimic how our brains function, but to take what we do understand and use it as an analogy in our own work. At present we do not know enough about neuroscience and the deeper functions of the brain to correctly model how it works; instead, we take our inspiration from it and move on.

      (3)Artificial model

      Let's look at a basic NN that performs a simple weighted summation of its inputs, as shown in Figure 3.

 

Figure 3 simple NN

The simple NN shown in Figure 3 mimics the structure of a neuron: it takes the weighted sum of the inputs x and the weights w and passes the result to an activation function that determines whether the neuron fires. The values x1, x2, x3 are the inputs to the NN and typically correspond to a single row (i.e., a data point) from our design matrix. The constant 1 represents the bias term, which we assume is embedded in the design matrix. We can think of these inputs as the feature vector fed to the NN.

In practice, these inputs may be vectors used to quantify the contents of an image in a systematic, predefined way (e.g., color histograms, histograms of oriented gradients, local binary patterns, etc.). In the context of deep learning, these inputs are the raw pixel intensities of the image itself.

Each x is connected to a weight from the weight vector w consisting of w1, w2, ..., wn; the weighted sum is then passed through an activation function f, which determines whether the neuron fires. Mathematically, this weighted sum is usually written in one of three equivalent forms:

f(w1 x1 + w2 x2 + ... + wn xn) = f(Σi wi xi) = f(w · x)

(4)Activation functions

      The simplest activation function is the step function, used by the Perceptron algorithm:

f(t) = 1 if t > 0, and 0 otherwise

      This function, along with the other activation functions discussed below, is plotted in Figure 4.

 

Fig. 4 examples of different activation functions

      The step function outputs 1 when the weighted sum is greater than 0 and 0 otherwise, resembling the shape of a stair step. However, while intuitive and easy to use, the step function is not differentiable, which causes problems when applying gradient descent to train our network.

      A more common choice in the history of the NN literature is the sigmoid activation function:

s(t) = 1 / (1 + e^(-t))

      The sigmoid function is a better choice for learning than the step function because it: (1) is continuous and differentiable everywhere; (2) is symmetric around the y-axis; (3) asymptotically approaches its saturation values.

      The primary advantage of the sigmoid is its smoothness, which makes it easier to derive learning algorithms. However, it has two big problems: (1) its outputs are not zero-centered; (2) saturated neurons essentially kill the gradient, since the gradient delta becomes extremely small (once the weighted sum is large enough, the output of f is nearly flat and saturated).

      After the 1990s, the tanh activation function became the more common choice:

tanh(t) = (e^t − e^(−t)) / (e^t + e^(−t))

      The tanh function is zero-centered, but the gradient is still killed when the neuron saturates.

    We now know there are better choices of activation function than sigmoid and tanh, namely the ReLU (Rectified Linear Unit):

f(t) = max(0, t)

      The ReLU function is not saturating and is computationally efficient. Empirically, the ReLU activation function outperforms sigmoid and tanh in nearly all applications. However, one problem is that the gradient cannot be taken when the input value is exactly 0.

      A variant of ReLU, the Leaky ReLU, allows a small, non-zero gradient when the unit is not active:

f(t) = t if t > 0, and αt otherwise (where α is a small constant)

      As the plot shows, this function permits negative outputs, unlike the traditional ReLU, which clamps negative inputs to zero.

      Parametric ReLUs, or PReLUs, build on Leaky ReLUs by allowing the parameter alpha to be learned, meaning each node in the network can learn a different alpha than the other nodes.

      Finally, ELUs (Exponential Linear Units) were introduced:

f(t) = t if t > 0, and α(e^t − 1) otherwise

      Unlike the parameter in PReLUs, the alpha here is a constant set when the network architecture is instantiated; it is commonly set to 1. Both the ELU authors and the author of this book find that ELUs often obtain better classification accuracy than ReLUs.
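To make these definitions concrete, here is a small NumPy sketch of the activation functions discussed above (the alpha defaults are illustrative values, not ones prescribed by the text):

```python
import numpy as np

def step(t):
    # Step function: fires (1) only when the weighted sum is positive.
    return np.where(t > 0, 1, 0)

def sigmoid(t):
    # Smooth and differentiable, but saturates and is not zero-centered.
    return 1.0 / (1.0 + np.exp(-t))

def tanh(t):
    # Zero-centered, but still saturates for large |t|.
    return np.tanh(t)

def relu(t):
    # Non-saturating for t > 0 and cheap to compute.
    return np.maximum(0, t)

def leaky_relu(t, alpha=0.01):
    # Allows a small, non-zero gradient when the unit is not active.
    return np.where(t > 0, t, alpha * t)

def elu(t, alpha=1.0):
    # Exponential Linear Unit; alpha is a constant (commonly 1.0).
    return np.where(t > 0, t, alpha * (np.exp(t) - 1))

if __name__ == "__main__":
    t = np.linspace(-3, 3, 7)
    for fn in (step, sigmoid, tanh, relu, leaky_relu, elu):
        print(fn.__name__, np.round(fn(t), 3))
```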

(5)Which activation function should I use?

Given the choice between modern activation functions (ReLU, Leaky ReLU, ELU, etc.) and classical ones (step, sigmoid, tanh, etc.), the author strongly recommends starting with a ReLU and only then experimenting with Leaky ReLU variants.

The author's personal preference is to start with a standard ReLU, tune the network and optimizer parameters (architecture, learning rate, regularization strength, etc.), and note the accuracy. Once reasonable accuracy is achieved, switching to an ELU can, depending on the dataset, often squeeze out another 1-5% of classification accuracy.

(6)Feedforward network architecture

      Although there are many different NN architectures, the most common one is the feed forward network, as shown in Figure 5.

 

Fig. 5 example of feedforward neural network

      In the feedforward neural network example shown in Figure 5, there are three input nodes, two hidden layers with two and three nodes respectively, and an output layer with two nodes. In this type of architecture, connections between nodes are only allowed from nodes in layer i to nodes in layer i + 1 (hence the term feedforward); backward and intra-layer connections are not allowed. When feedforward networks include feedback connections (output connections that feed back into the inputs), they are called recurrent neural networks (RNNs).

      This book focuses on feedforward neural networks because they are the cornerstone of modern deep learning applied to computer vision. As we will see in Chapter 11, CNNs are a special case of feedforward neural networks.

      To describe a feedforward network, we usually use integer sequences to quickly and accurately identify the number of nodes in each layer. For example, the network in Figure 5 can be represented as a 3-2-3-2 feedforward network.

      Layer 0: contains 3 inputs.

      Layers 1 and 2: hidden layers containing 2 and 3 nodes, respectively.

      Layer 3: the output layer, also known as the visible layer. It holds the output classifications produced by the network; the output layer typically has one node per class label. For example, if we built a NN to classify handwritten digits, the output layer would contain 10 nodes, one for each digit 0-9.

      (7)Neural learning

      Neural learning refers to the method of modifying the weights and connections between the nodes of a network.

      (8)What are neural networks used for?

      Neural networks can be used for supervised, semi-supervised, and unsupervised learning tasks, provided an appropriate architecture is chosen; see [1] for a complete review of NNs. Typical NN applications include classification, regression, clustering, vector quantization, pattern association, and function approximation, to name just a few. In this book, we use NNs for computer vision and image classification.

      (9)Neural network basics: a summary

      In this section we reviewed the fundamentals of ANNs; next we will study them in more detail through concrete architectures and related implementations. In the following section, we discuss the classic Perceptron algorithm, the first ANN ever created.

 

1.2  The Perceptron algorithm

In short, the Perceptron is the oldest and simplest ANN algorithm. Its history spans from its original proposal, through the severe "AI winter" it helped trigger, to multilayer perceptrons and the BP algorithm. Nevertheless, the Perceptron remains a very important algorithm for understanding more advanced multilayer networks. We first review the Perceptron architecture and explain the training procedure (also known as the delta rule) used to train it. We then review the termination criterion (i.e., when the Perceptron should stop training). Finally, we implement the Perceptron algorithm in pure Python and explain why it cannot learn nonlinearly separable datasets.

(1)The AND, OR, and XOR datasets

Before discussing the Perceptron, let's first look at bitwise operations, including AND, OR, and XOR. A bitwise operation takes two input bits and produces a single output bit after applying the operation. Given two input bits, each either 0 or 1, Table 1 lists the four possible outcomes for AND, OR, and XOR:

x0  x1 | AND | OR | XOR
 0   0 |  0  |  0 |  0
 0   1 |  0  |  1 |  1
 1   0 |  0  |  1 |  1
 1   1 |  1  |  1 |  0

Table 1 Bitwise operation results

      We often use these simple bitwise operations to test and debug machine learning algorithms. Figure 6 visualizes the bitwise operation values:

Fig. 6 visualization diagram of bit operation

      AND and OR are linearly separable: we can clearly draw a single line separating the 0 and 1 classes. No such line can be drawn for XOR; XOR is thus an example of a nonlinearly separable dataset.

      Ideally, we want our machine learning algorithms to be able to separate nonlinear data, because most of the data we encounter in the real world is nonlinear. Therefore, when building, debugging, and evaluating a given machine learning algorithm, we may use the bit values x0 and x1 as our design matrix and try to predict the corresponding y values.

      Unlike the standard procedure of splitting data into training and test sets, when using bitwise datasets we train and evaluate the network on the same data. The goal is simply to determine whether the learning algorithm can learn the patterns in the data. As we will see, the Perceptron algorithm can correctly classify AND and OR, but fails on the XOR data.

      (2)Perceptron architecture

      Rosenblatt defined the Perceptron as a system that learns from labeled examples of feature vectors (or raw pixel intensities), i.e., via supervised learning, to map these inputs to their corresponding output class labels.

      In its simplest form, a Perceptron contains N input nodes, one for each entry in a row of the design matrix, followed by a single layer in the network with just one node, as shown in Figure 7:

 

Fig. 7 Perceptron network architecture

      Connections exist between the inputs xi and the single output node of the network; each connection carries a corresponding weight w1, w2, ..., wn. The node takes the weighted sum of the inputs and feeds it to a step function to determine the output class label. The Perceptron outputs either 0 or 1; in its original form, it is therefore only a binary (two-class) classifier.

      (3)Perceptron training process and delta rule

      Training a Perceptron is a fairly intuitive process. Our goal is to obtain weights w that correctly classify every example in the training set. To train the Perceptron, we feed the training data through the network multiple times; each full pass over the training set is called an epoch. It typically takes multiple epochs before a weight vector w is learned that linearly separates the two classes of data.

      The pseudocode for the Perceptron training algorithm can be described as follows:

      1. Initialize the weight vector w with small random values.

      2. Until the Perceptron converges:

      (a) Loop over each feature vector xj and true class label dj in the training set D.

      (b) Pass xj through the network and compute the output value: yj = f(w(t) · xj).

      (c) Update the weight vector: wi(t + 1) = wi(t) + α(dj − yj)xj,i for all features 0 <= i <= n.

      The actual "learning" takes place in steps 2b and 2c. First, we pass the feature vector xj through the network, take its dot product with the weights w, and obtain the output yj; this value is then passed through the step function, which returns 1 if the input is greater than 0 and 0 otherwise.

      Now we need to update the weight vector w so that it steps closer toward a correct classification. This update is handled by the delta rule in step 2c.

      The expression (dj − yj) determines whether the output classification is correct. If the classification is correct, the difference is zero; otherwise, the difference is either positive or negative, giving the direction in which the weights should be updated (ultimately bringing us closer to a correct classification). Multiplying (dj − yj) by xj then moves us toward a more correct classification.

The parameter α here is the learning rate, which controls the size of each step; it should be neither too large nor too small. If this is confusing, don't worry: we will walk through it in detail with Python code shortly.
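As a quick worked example of the delta rule (with made-up numbers, purely for illustration):

```python
import numpy as np

alpha = 0.1                          # learning rate
x = np.array([1.0, 0.0, 1.0])        # feature vector (bias trick: last entry is 1)
w = np.array([0.2, -0.4, 0.1])       # current weights
d, y = 1, 0                          # target label vs. (incorrect) prediction

# Delta rule: w <- w + alpha * (d - y) * x
w = w + alpha * (d - y) * x
print(w)                             # [ 0.3 -0.4  0.2]
```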

(4)Termination of perceptron training

      The Perceptron training procedure runs until all training samples are classified correctly or a preset maximum number of epochs is reached. Termination is only guaranteed if the learning rate is set sensibly and the training data is linearly separable.

      So what happens if the data is not linearly separable, or if we make a poor choice of parameters? Will training run forever? In practice, no: we typically stop training after a fixed number of epochs, or based on the number of misclassifications.

      (5)Implementing the Perceptron in Python

      Now that we have studied the Perceptron algorithm, let's implement it in Python. The directory structure of the package containing the Perceptron implementation is as follows:

      We create a perceptron.py file inside the pyimagesearch.nn package; this file contains the actual Perceptron implementation. After creating the file, enter the following code:

 

      The Perceptron class constructor accepts one required parameter and one default parameter: N is the number of columns in our feature vectors (for the bitwise datasets, N = 2 since there are two inputs), and alpha is the learning rate of the Perceptron algorithm; common choices are 0.1, 0.01, and 0.001.

The weight vector has N + 1 entries: one for each of the N input features, plus one for the bias. Dividing by np.sqrt(N) is a common technique for scaling the weight matrix and leads to faster convergence. We will use this weight initialization technique again later in this chapter.

      Next we define the step function:

     

      This method implements the step activation function described earlier.

      To actually train the Perceptron, we define a method named fit. Following the convention of machine learning libraries in Python such as scikit-learn, the training procedure is defined as a fit method, as in "fitting a model to the data":

 

      The fit method requires two parameters: X is the actual training data, and y holds the target output class labels. It also accepts an epochs parameter specifying the number of epochs to train for. Line 18 applies the bias trick, inserting a column of ones into the training data so the bias can be treated as a trainable weight.

      Next, let’s look at the actual training process:

 

      Line 21 loops over the desired number of epochs. Within each epoch, we loop over every data point and its output label in the dataset (line 23). On line 27, the Perceptron's prediction is obtained by taking the dot product of the input features x with the weight vector W and passing the result through the step function. The weights are then updated only when the prediction disagrees with the target.

      The final method to define is predict(), which predicts the class labels for a given set of input data.

 

      The predict() method requires the input dataset X to be classified and adds the bias column by default. Prediction works just like the forward step of training: the weighted sum is passed through the step function. The full class, condensed, is sketched below.
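The code listings themselves appear as figures in the original; as a stand-in, here is a minimal sketch of a Perceptron class that matches the description above (variable names and details may differ from the actual pyimagesearch.nn implementation):

```python
# perceptron.py -- minimal sketch of the Perceptron described above
import numpy as np

class Perceptron:
    def __init__(self, N, alpha=0.1):
        # N: number of columns in the feature vectors; alpha: learning rate.
        # N + 1 weights account for the bias term; dividing by sqrt(N) scales
        # the random weights, which tends to speed up convergence.
        self.W = np.random.randn(N + 1) / np.sqrt(N)
        self.alpha = alpha

    def step(self, x):
        # Step activation function: 1 if x > 0, otherwise 0.
        return np.where(x > 0, 1, 0)

    def fit(self, X, y, epochs=10):
        # Bias trick: append a column of 1s to the training data so the bias
        # can be treated as a trainable weight.
        X = np.c_[X, np.ones((X.shape[0]))]

        for epoch in np.arange(0, epochs):
            for (x, target) in zip(X, y):
                # Forward step: dot product followed by the step function.
                p = self.step(np.dot(x, self.W))

                # Delta rule: update the weights only if the prediction is wrong.
                if p != target:
                    error = p - target
                    self.W += -self.alpha * error * x

    def predict(self, X, addBias=True):
        # Ensure the input is a matrix and (by default) append the bias column.
        X = np.atleast_2d(X)
        if addBias:
            X = np.c_[X, np.ones((X.shape[0]))]

        # Prediction is just a forward pass: weighted sum + step function.
        return self.step(np.dot(X, self.W))
```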

      Now that the Perceptron class is implemented, let's apply it to the bitwise datasets and see how the network performs.

      (6)Evaluating the Perceptron on the bitwise datasets

      First, create a file named perceptron_or.py in the same directory as perceptron.py to train the Perceptron model on the bitwise dataset:

 

      We first import the Perceptron class we just created, construct the OR dataset, and train the model. We then evaluate it on the same dataset:

 

      During evaluation we predict each data point in the dataset and print the data point, its ground-truth label, and the predicted label. A sketch of the full script is shown below.
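Assuming the Perceptron sketch above lives in the pyimagesearch.nn package, perceptron_or.py might look roughly like this:

```python
# perceptron_or.py -- train and evaluate the Perceptron on the OR dataset
import numpy as np
from pyimagesearch.nn import Perceptron  # adjust the import to your package layout

# construct the OR dataset (change y to obtain AND or XOR)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [1]])

# train the Perceptron: 2 input columns, learning rate alpha = 0.1
print("[INFO] training perceptron...")
p = Perceptron(X.shape[1], alpha=0.1)
p.fit(X, y, epochs=20)

# evaluate on the same data: print data point, ground truth, and prediction
print("[INFO] testing perceptron...")
for (x, target) in zip(X, y):
    pred = p.predict(x)
    print("[INFO] data={}, ground-truth={}, pred={}".format(x, target[0], pred))
```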

      Finally, we run the program and show the result:

 

      By changing the target labels, we can turn the dataset into the AND or XOR dataset. Testing shows that no matter how many times we run it, our single-layer Perceptron cannot correctly predict XOR. For that we need multiple layers, and with multiple layers we enter the field of deep learning.

 

1.3  Backpropagation and multilayer networks

The BP algorithm is arguably the most important algorithm in the history of neural networks and can be considered the cornerstone of modern neural networks and deep learning. Much has been written about BP.

 

As you can see, BP has been discussed extensively. The author's preferred way to learn it is to build an intuitive, easy-to-follow implementation of the BP algorithm in Python. In this implementation we construct a real neural network and train it with BP. By the end of this section you will understand how BP works and, perhaps more importantly, how the algorithm trains neural networks from scratch.

(1)BPalgorithm

      The BP algorithm consists of two phases:

Forward pass: the input passes through the network and produces the predicted output (also known as the propagation phase).

Backward pass: the gradient of the loss function is computed at the final layer of the network (i.e., the prediction layer) and is then used, applying the chain rule recursively, to update the weights of the network (also known as the weight update phase).

We first review these phases at a high level and then implement the BP algorithm in Python. Once the algorithm is implemented, making a prediction only requires the forward-pass phase, along with a few minor code adjustments that make prediction more efficient.

Finally, I'll show how to train a custom neural network on the XOR and MNIST datasets using BP and Python.

(2)Forward propagation stage

The purpose of the forward pass is to propagate our input through the network, applying a series of dot products and activation functions, until it reaches the output layer of the network (i.e., the prediction). To visualize this process, consider the XOR dataset shown on the left of Table 2:

x0  x1 | y        x0  x1  1 | y
 0   0 | 0         0   0  1 | 0
 0   1 | 1         0   1  1 | 1
 1   0 | 1         1   0  1 | 1
 1   1 | 0         1   1  1 | 0

Table 2 Left: the XOR dataset (including class labels y). Right: the design matrix with a bias column of 1s added.

      Table 2 shows that each entry x in the design matrix is two-dimensional, i.e., each data point is represented by two bits. For example, the feature vector of the first data point is (0, 0) and that of the second is (0, 1). The corresponding XOR output label is placed in each row; our goal is to correctly predict the target output y.

      As noted earlier, correctly classifying this nonlinear problem requires a feedforward neural network with at least one hidden layer, so we start with a 2-2-1 architecture (Figure 9, without the bias term). Once the bias is inserted, our feature vectors correspond to the right side of Table 2. The bias can be inserted in any column, but it is usually placed either first or last in the feature vector. Because inserting the bias changes the size of the input feature vectors (this is usually handled inside the neural network implementation itself, so we do not need to explicitly modify the design matrix), the architecture changes from 2-2-1 to 3-3-1, as shown in Figure 9.

 

Fig. 9 Network architecture with and without the bias term

      To see forward propagation in action, we first initialize the weights in the network, as shown in Figure 10. These weight values will be updated during the backward pass.

 

Fig. 10 example of forward propagation

      On the left of Figure 10, we present the feature vector (0, 1, 1) to the network (the corresponding target output is 1). Each value 0, 1, 1 of the feature vector is assigned to one of the three input nodes. To propagate these values through the network and obtain the final classification, we take the dot product of the inputs and the weights, then apply the activation function (in this example, the sigmoid function, σ(t) = 1 / (1 + e^(−t))).

      First, we compute the values of the three hidden-layer nodes:

      With this computation, the input values have been propagated to the three hidden-layer nodes. We then continue the process, using the hidden-layer outputs and their weights to compute the output of the network.

 

      As you can see, the output of the network is 0.506. We then apply a step function (with a threshold of 0.5) to determine the predicted class.

 

      Since net = 0.506, our network predicts the value 1, which is the correct label. However, the network is not very confident in this prediction: 0.506 is very close to the step threshold. Ideally, the prediction should be closer to 0.98-0.99, indicating that the network has truly learned the underlying pattern in the dataset. For the network to actually learn, we need to apply backpropagation.
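The exact weights of Figure 10 are not reproduced here, but the mechanics of the forward pass are easy to sketch with hypothetical weights on the 3-3-1 network (sigmoid activations, threshold of 0.5 on the output):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# hypothetical weights, for illustration only
W1 = np.array([[ 0.1,  0.4, -0.3],   # input layer -> hidden layer (3x3)
               [ 0.2, -0.5,  0.6],
               [-0.4,  0.3,  0.2]])
W2 = np.array([0.3, -0.2, 0.5])      # hidden layer -> output node (3,)

x = np.array([0, 1, 1])              # feature vector (with bias entry)

# forward pass: weighted sum + sigmoid at each layer
hidden = sigmoid(x.dot(W1))          # activations of the 3 hidden nodes
net = sigmoid(hidden.dot(W2))        # output of the network

# thresholding the sigmoid output at 0.5 gives the predicted class label
print("net={:.3f}, prediction={}".format(net, 1 if net > 0.5 else 0))
```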

(3)Backpropagation

To apply the backpropagation algorithm, our activation function must be differentiable so that we can compute the partial derivative of the error (loss) E with respect to a given weight w_(i,j), in terms of the node output o_j and the net input net_j:

∂E/∂w_(i,j) = (∂E/∂o_j) · (∂o_j/∂net_j) · (∂net_j/∂w_(i,j))

      The full derivation is not given here; the reader can easily find it elsewhere, as there are many references. In brief, the partial derivative of the loss E with respect to the weight w_(i,j) equals the partial derivative of E with respect to the j-th output o_j, multiplied by the partial derivative of o_j with respect to the net input net_j, multiplied by the partial derivative of net_j with respect to the weight w_(i,j), i.e., the chain rule. The author illustrates this through the Python code that follows.

      (4)Implementing the BP algorithm in Python

      First, in the directory where perceptron.py is located, create a new file named neuralnetwork.py:

 

      We first import the necessary numerical library (NumPy). Line 5 then defines the NeuralNetwork class, which requires one mandatory parameter and accepts one default parameter:

layers: a list of integers representing the feedforward network architecture. For example, [2, 2, 1] describes a network with 2 input nodes, 2 hidden nodes, and 1 output node.

alpha: the learning rate of the neural network.

Next, we initialize W, the list of per-layer weight matrices, as an empty list [], and store the layers and alpha values. Since the weight list starts empty, we now need to initialize the weights:

 

Here we loop over the layers of the network but stop before the last two (the reason is explained in a moment). Each weight matrix connecting two layers is an M x N matrix of values sampled from a standard normal distribution: we need M x N values because every node in the current layer connects to every node in the next layer. For example, if layers[i] = 2 and layers[i + 1] = 2, the weights connecting these nodes would form a 2 x 2 matrix, except that we must not forget an important part: the bias term. After adding the bias, both the current layer and the next layer gain an extra node, so the matrix becomes 3 x 3, i.e., 2 + 1 nodes in the current layer connected to 2 + 1 nodes in the next. Dividing W by the square root of the number of nodes in the current layer normalizes the variance of each neuron's output.

The final code block constructs the weight matrix connecting the last two layers as a special case: the input side of this connection still needs a bias term, but the output side does not:

 

Next, we define the magic method __repr__, useful for debugging, which displays the network architecture:

 

      For example, the following implementation will show the network architecture:

 

Note: when importing the class this way, remember to add it to the package's __init__.py, as mentioned earlier.

Next, we define the sigmoid activation function:

 

As well as the derivative of the sigmoid, which is used during the backward pass:

 

Note that the input x to the sigmoid derivative is assumed to have already been passed through the sigmoid activation function. Again, whenever backpropagation is used, the activation function must be differentiable.

Because the listing is long, the remaining code is not reproduced here; the full implementation can be found in the accompanying GitHub repository.

Following the usual training convention, a fit() method is defined to train on X and y, where X is the training data and y holds the corresponding class labels. Within the epoch loop, for each epoch we loop over every data point in X, make a prediction, and update the weights based on the result. For every data point, both a forward pass and a backward pass are performed.

During the forward pass, the output (activation) of each layer is stored in a list A, indexed as A[layer]; the output of the next layer is obtained by taking the dot product of A[layer] with the current layer's weights and applying the activation function. The final network output ends up stored in A[-1].

During the backward pass, we first compute the output error as error = A[-1] - y, then apply the chain rule layer by layer in reverse order, storing the deltas in a list D. Once the deltas for all layers have been computed, D is flipped so it is ordered front to back, and we loop over the layers one more time to update each weight matrix. This completes the forward and backward computation for a single data point; repeating the procedure over all data points in X for every epoch completes training on the dataset.

To make predictions on a new dataset X, we simply take the dot product of X with the trained weights W and apply the activation function at each layer; in other words, prediction is just a plain forward pass. A condensed sketch of the whole class follows.
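Since the full listing is deferred to the GitHub repository, the following condensed sketch shows what a NeuralNetwork class along these lines can look like; it follows the structure described above, though the repository code may differ in details:

```python
# neuralnetwork.py -- condensed sketch of the multilayer network described above
import numpy as np

class NeuralNetwork:
    def __init__(self, layers, alpha=0.1):
        # layers: e.g. [2, 2, 1] for a 2-2-1 architecture; alpha: learning rate
        self.W = []
        self.layers = layers
        self.alpha = alpha

        # weights between all layers except the last two get an extra
        # row/column for the bias term (hence the "+ 1")
        for i in np.arange(0, len(layers) - 2):
            w = np.random.randn(layers[i] + 1, layers[i + 1] + 1)
            self.W.append(w / np.sqrt(layers[i]))

        # the connection into the output layer needs a bias on the input
        # side only
        w = np.random.randn(layers[-2] + 1, layers[-1])
        self.W.append(w / np.sqrt(layers[-2]))

    def __repr__(self):
        return "NeuralNetwork: {}".format("-".join(str(l) for l in self.layers))

    def sigmoid(self, x):
        return 1.0 / (1 + np.exp(-x))

    def sigmoid_deriv(self, x):
        # x is assumed to have already been passed through the sigmoid
        return x * (1 - x)

    def fit(self, X, y, epochs=1000):
        # bias trick: append a column of 1s to the training data
        X = np.c_[X, np.ones((X.shape[0]))]
        for epoch in np.arange(0, epochs):
            for (x, target) in zip(X, y):
                self.fit_partial(x, target)

    def fit_partial(self, x, y):
        # FORWARD PASS: store the output (activation) of every layer in A
        A = [np.atleast_2d(x)]
        for layer in np.arange(0, len(self.W)):
            net = A[layer].dot(self.W[layer])
            A.append(self.sigmoid(net))

        # BACKWARD PASS: start from the output error and apply the chain
        # rule layer by layer, collecting the deltas in D
        error = A[-1] - y
        D = [error * self.sigmoid_deriv(A[-1])]
        for layer in np.arange(len(A) - 2, 0, -1):
            delta = D[-1].dot(self.W[layer].T)
            D.append(delta * self.sigmoid_deriv(A[layer]))

        # the deltas were collected back-to-front, so flip them, then
        # update each weight matrix (gradient descent step)
        D = D[::-1]
        for layer in np.arange(0, len(self.W)):
            self.W[layer] += -self.alpha * A[layer].T.dot(D[layer])

    def predict(self, X, addBias=True):
        # prediction is just a forward pass through the trained weights
        p = np.atleast_2d(X)
        if addBias:
            p = np.c_[p, np.ones((p.shape[0]))]
        for layer in np.arange(0, len(self.W)):
            p = self.sigmoid(np.dot(p, self.W[layer]))
        return p
```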

By comparing a 2-2-1 feedforward network with a 2-1 network (no hidden layer) on XOR, we can verify that the 2-1 network cannot learn this nonlinear dataset.
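A brief usage sketch on XOR (assuming the NeuralNetwork sketch above; the import path is illustrative):

```python
# nn_xor.py -- sketch of training the NeuralNetwork above on XOR
import numpy as np
from pyimagesearch.nn import NeuralNetwork  # adjust the import to your layout

# the XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# a 2-2-1 network (one hidden layer) can learn XOR; a 2-1 network cannot
nn = NeuralNetwork([2, 2, 1], alpha=0.5)
nn.fit(X, y, epochs=20000)

# show the raw sigmoid output and the thresholded prediction for each point
for (x, target) in zip(X, y):
    pred = nn.predict(x)[0][0]
    step = 1 if pred > 0.5 else 0
    print("[INFO] data={}, ground-truth={}, pred={:.4f}, step={}".format(
        x, target[0], pred, step))
```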

(5)BP in Python on a sample of MNIST

A sample of the MNIST dataset is bundled with the scikit-learn library; it consists of 1,797 example digits, each an 8x8 grayscale image (the original MNIST images are 28x28).

The pixel values are converted to floats and min/max scaled to the range [0, 1]:

 

We then construct the network: the dataset contains 1,797 samples, the input layer has 64 nodes (one per pixel of the 8x8 image), and the output layer has 10 nodes (one per digit). A sketch of the full script follows.
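A sketch of what nn_mnist.py might look like, using scikit-learn's bundled 8x8 digits sample and the NeuralNetwork class from above (the 64-32-16-10 architecture is one reasonable choice; the book's script may differ):

```python
# nn_mnist.py -- train the toy network on scikit-learn's 8x8 digits dataset
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import datasets
from pyimagesearch.nn import NeuralNetwork  # adjust to your package layout

# load the 1,797 8x8 digit images and min/max scale the pixels to [0, 1]
digits = datasets.load_digits()
data = digits.data.astype("float")
data = (data - data.min()) / (data.max() - data.min())
print("[INFO] samples: {}, dim: {}".format(data.shape[0], data.shape[1]))

# split into training (75%) and testing (25%) sets, one-hot encode the labels
(trainX, testX, trainY, testY) = train_test_split(
    data, digits.target, test_size=0.25)
trainY = LabelBinarizer().fit_transform(trainY)
testY = LabelBinarizer().fit_transform(testY)

# train a 64-32-16-10 feedforward network
print("[INFO] training network...")
nn = NeuralNetwork([trainX.shape[1], 32, 16, 10])
nn.fit(trainX, trainY, epochs=1000)

# evaluate: the argmax of the output vector is the predicted label
print("[INFO] evaluating network...")
predictions = nn.predict(testX).argmax(axis=1)
print(classification_report(testY.argmax(axis=1), predictions))
```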

 

Running nn_mnist.py under chapter10/ shows that some digits are misclassified; keep in mind that the built-in MNIST sample leaves only about 450 instances for testing. Later we will use CNNs to evaluate on the full dataset.

(6)BP algorithm summary

The BP algorithm belongs to the family of gradient descent algorithms, generalized to train multilayer networks. In practice, BP is not only challenging to implement (it is easy to introduce bugs when computing gradients) but also hard to make efficient without specialized optimization libraries, which is why we usually rely on libraries such as Keras, TensorFlow, and mxnet that already implement BP correctly and with optimized strategies.

1.4  Multilayer networks with Keras

Now that we have implemented neural networks in pure Python, let's use a specialized, highly optimized neural network library such as Keras. Next we will build feedforward networks on MNIST and CIFAR-10 for two main purposes: first, to demonstrate how to implement simple neural networks with the Keras library; second, to obtain baselines that we can later compare against CNNs and other standard networks.

(1)MNIST

      In the previous section we used a sample of MNIST; now we use the full MNIST dataset, which contains 70,000 data points in total, each with 784 dimensions corresponding to a flattened 28x28 image. We create a keras_mnist.py file under chapter10/:

 

      The import from sklearn.preprocessing import LabelBinarizer is used to one-hot encode the integer labels in the dataset into label vectors (one-hot encoding turns a single integer label into a vector). We then parse command-line arguments with the argparse package; the output directory for the output/keras_mnist.png argument must exist in advance, otherwise an error will be raised. After that, we load the dataset and scale the pixel intensities to [0, 1] (line 27), and then split the data into training and testing sets.

 

      Next, the labels are transformed from integers into vectors. Since the labels are the digits 0-9, the one-hot encoding looks like this (for example, the label 3 becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]):

 

 

      This encoding scheme is available in many machine learning libraries. We then construct the feedforward architecture: the input is the 784-dimensional feature vector (the flattened 28x28 image) and the output is the 10-dimensional label vector. The network is built as follows:

 

      Here model.add() stacks the network layer by layer, passing each layer's configuration to Dense(): the first layer specifies its input dimension, the number of nodes feeding the next layer, and its activation function; the middle layers only specify their number of nodes and activation function; and the final layer, since it produces the class outputs, uses the softmax activation.

      After the model is built, we train the network using an optimizer:

 

That is, we construct the SGD optimizer with a given learning rate and then compile the model with its core settings: the loss, the optimizer, and the metrics. We then fit (train) the model. Notice that we pass the test set as the validation set of model.fit; this is not something you would do in practice and is done here only to demonstrate the Keras training procedure. In the real world, hyperparameters and network architecture should be tuned against a separate validation set, never the test set.

 

After training, we can use the predict() method to obtain predictions on the test set and evaluate the network's performance. For the evaluation report, we need the label with the highest probability, which is obtained via .argmax(axis=1).

Finally, we plot the loss and accuracy curves with matplotlib and save the figure; a condensed sketch of the full script follows.
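Pulling the steps above together, a condensed sketch of keras_mnist.py might look like the following (the 784-256-128-10 sigmoid architecture, SGD learning rate of 0.01, 100 epochs, and batch size of 128 are illustrative choices; consult the book's repository for the exact script):

```python
# keras_mnist.py -- condensed sketch of the feedforward network on full MNIST
import argparse
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import fetch_openml
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

ap = argparse.ArgumentParser()
ap.add_argument("-o", "--output", required=True, help="path to the output loss/accuracy plot")
args = vars(ap.parse_args())

# download the full 70,000-sample MNIST dataset and scale pixels to [0, 1]
print("[INFO] loading MNIST (full) dataset...")
dataset = fetch_openml("mnist_784", version=1, as_frame=False)
data = dataset.data.astype("float") / 255.0
(trainX, testX, trainY, testY) = train_test_split(data, dataset.target, test_size=0.25)

# one-hot encode the labels 0-9 into 10-dim vectors
lb = LabelBinarizer()
trainY = lb.fit_transform(trainY)
testY = lb.transform(testY)

# 784-256-128-10 feedforward architecture
model = Sequential()
model.add(Dense(256, input_shape=(784,), activation="sigmoid"))
model.add(Dense(128, activation="sigmoid"))
model.add(Dense(10, activation="softmax"))

# train with SGD and categorical cross-entropy (the test set doubles as the
# validation set here purely for demonstration purposes)
sgd = SGD(0.01)
model.compile(loss="categorical_crossentropy", optimizer=sgd, metrics=["accuracy"])
H = model.fit(trainX, trainY, validation_data=(testX, testY),
              epochs=100, batch_size=128)

# evaluate: argmax over the 10 output probabilities gives the predicted digit
predictions = model.predict(testX, batch_size=128)
print(classification_report(testY.argmax(axis=1), predictions.argmax(axis=1),
                            target_names=[str(x) for x in lb.classes_]))

# plot training/validation loss and accuracy, then save the figure
# (older Keras versions use the history keys "acc"/"val_acc" instead)
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, 100), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, 100), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, 100), H.history["accuracy"], label="train_acc")
plt.plot(np.arange(0, 100), H.history["val_accuracy"], label="val_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()
plt.savefig(args["output"])
```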

 

Start training by running python keras_mnist.py --output output/keras_mnist.png. The first run will take some time to download the MNIST dataset; after that, training proceeds quickly. When training finishes, the training curves can be displayed.

 

The plot shows that the training loss and validation loss track each other closely, indicating no overfitting or other problems with the training process. As we will see next, however, accuracy does not hold up nearly as well once the dataset consists of real-world images.

(2)CIFAR-10

MNIST is too easy to score well on and does not reflect real-world images, so the CIFAR-10 dataset is often used instead. It consists of 60,000 32x32 RGB images, i.e., each data point is represented by 32 x 32 x 3 = 3,072 integers. The dataset has 10 classes, so the output layer again has 10 nodes. Training on this dataset is similar to MNIST; only the network construction differs while the basic flow stays the same, so the code is not listed here (see chapter10/keras_cifar10.py on GitHub).

The training images of shape (50000, 32, 32, 3) are reshaped to (50000, 3072), and the test images to (10000, 3072), as in the sketch below. Once training completes, the results are displayed.
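For reference, the CIFAR-10 loading and flattening step might look like this sketch (see chapter10/keras_cifar10.py in the repository for the full script):

```python
from tensorflow.keras.datasets import cifar10

# load CIFAR-10: 50,000 training and 10,000 testing 32x32x3 RGB images
((trainX, trainY), (testX, testY)) = cifar10.load_data()

# scale to [0, 1] and flatten each 32x32x3 image into a 3,072-dim vector
trainX = trainX.astype("float") / 255.0
testX = testX.astype("float") / 255.0
trainX = trainX.reshape((trainX.shape[0], 3072))
testX = testX.reshape((testX.shape[0], 3072))

print(trainX.shape, testX.shape)   # (50000, 3072) (10000, 3072)
```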

 

The plot shows that the validation accuracy reaches only about 50%, and the pattern of sharply decreasing training loss paired with increasing validation loss indicates severe overfitting. The reality is that a basic feedforward network with strictly fully connected layers is not suited to challenging image datasets; for that we need a more advanced approach: CNNs. By the end of the Starter Bundle we will reach 79% accuracy on CIFAR-10, and the Practitioner Bundle demonstrates how to push accuracy above 93%.

 

1.5  The four components of a neural network recipe

As you may have noticed from the Python code for training neural networks, four main components come together in our neural network and deep learning algorithms: a dataset, a model/architecture, a loss function, and an optimization method.

(1)dataset

The dataset is the first ingredient in training a neural network; the data itself, together with the problem we ultimately need to solve, defines the task. In the context of this book we focus on image classification, but the combination of your dataset and the problem you are trying to solve influences your choice of loss function, network architecture, and optimization method. Usually, we are given a dataset along with some expectation of the result, and our job is to train a machine learning model on that dataset that performs well on the given task.

(2)loss function

Given the dataset and the target goal, we need to define a loss function that matches the problem we are trying to solve. In nearly all image classification problems tackled with deep learning, we use the cross-entropy loss: for more than two classes it is called categorical cross-entropy, and for two-class problems it is called binary cross-entropy.

(3)Model / Architecture

Your network architecture can be considered the first ingredient where you have a genuinely free choice. Your dataset was likely handed to you for a specific task, and when performing classification you will most likely use cross-entropy as the loss function.

However, your network architecture can vary dramatically, as can the optimization method you choose to train it with. Take the time to explore your dataset: know how many data points you have, how many classes there are, how similar or dissimilar those classes are, and how much intra-class variance exists.

From there you should start to develop a feel for the network architecture you are going to use. This takes practice, because deep learning is part science and part art; in fact, the rest of this book is dedicated to helping you develop both skills.

Note that the number of layers and nodes in your network architecture (along with any regularization) may well change as you gain experience. The more results you collect, the better equipped you will be to make informed decisions about which technique to try next.

(4)optimization method

The final ingredient is the optimization method. SGD is used most often; more advanced optimizers such as RMSprop, Adagrad, Adadelta, and Adam are introduced in the Practitioner Bundle.

Even with all these newer optimization methods, SGD remains the workhorse of deep learning: most neural networks are trained with SGD, including the latest state-of-the-art ImageNet models. SGD should be your optimizer of choice when training deep neural networks, especially when you are just getting started. You then need to set an appropriate learning rate and regularization strength, the number of epochs to train for, and whether to use momentum and Nesterov acceleration. Take the time to become familiar with SGD and with tuning its parameters.

Becoming familiar with a given optimizer is similar to learning to drive a car: you drive your own car better than someone else's simply because you have spent so much time behind its wheel; you know your car and its quirks. More often than not, a particular optimizer is chosen to train a network on a dataset not because the optimizer itself is better, but because the "driver" (the deep learning practitioner) is more familiar with it and understands the art of tuning its parameters.

Keep in mind that obtaining a reasonably performing neural network on even a small or medium dataset can take 10 to 100 experiments, even for advanced deep learning practitioners, so don't get discouraged if your network performs poorly at first. Becoming proficient at deep learning takes time and effort, but once you grasp how these ingredients fit together, it is well worth it.

1.6  Weight initialization

Before concluding this chapter, we briefly discuss weight initialization, i.e., how to initialize the weight matrices and bias vectors. This section is not meant to be a comprehensive survey of initialization techniques, but it does cover the popular methods from the deep learning literature along with some rules of thumb, illustrated with a bit of Python code.

(1)Constant initialization

With constant initialization, all weights in the neural network are initialized to a constant C, usually 0 or 1.

To visualize this, assume a layer with 64 inputs and 32 outputs (ignoring the bias). Zero initialization is W = np.zeros((64, 32)), one initialization is W = np.ones((64, 32)), and initialization with an arbitrary constant C is W = np.ones((64, 32)) * C. Although simple, this method makes it nearly impossible to break the symmetry between neurons [2], so it is rarely used to initialize deep learning weights.

(2)Uniform and normal distributions

A uniform distribution draws random values from a range [lower, upper], where every value in the range has an equal probability of being chosen. Again, suppose a layer of our neural network has 64 inputs and 32 outputs, and we wish to initialize the weights uniformly between lower = -0.05 and upper = 0.05. We then draw 64 x 32 random values: W = np.random.uniform(low=-0.05, high=0.05, size=(64, 32)).

 

A normal (Gaussian) distribution is defined by its probability density function:

p(x) = (1 / √(2πσ²)) · exp(−(x − µ)² / (2σ²))

The most important parameters are the mean µ and the standard deviation σ; the square of the standard deviation is called the variance. Using NumPy's np.random.normal to draw values from a normal distribution with µ = 0.0 and σ = 0.5 can be written as W = np.random.normal(0.0, 0.5, size=(64, 32)).

Both uniform and normal distributions can be used to initialize neural network weights, but we usually apply additional heuristics on top of them to create better initialization schemes.

 

(3)LeCun uniform and normal

If you use the Torch7 or PyTorch frameworks, you may notice that the default weight initialization method is derived from LeCun's "Efficient BackProp" paper. It defines the parameters F_in (the fan-in, or number of inputs to the layer) and F_out (the fan-out, or number of outputs of the layer); the uniform initialization draws weights from [-limit, limit] with limit = sqrt(3 / F_in):

 

For the normal variant, the Keras library uses a truncated normal distribution (with standard deviation sqrt(1 / F_in)):

 

(4)Glorot/Xavier Uniform and Normal

The default weight initialization method in the Keras library is "Glorot initialization" (also called "Xavier initialization"), named after Xavier Glorot.

For the normal variant, the weights are drawn from a zero-mean distribution with standard deviation sqrt(2 / (F_in + F_out)):

 

Glorot/Xavier initialization can also be done with a uniform distribution over [-limit, limit], where limit = sqrt(6 / (F_in + F_out)):

 

In practice this initialization method works quite well, and I recommend it for most neural networks.

(5)He et al./Kaiming/MSRA Uniform and Normal

There is also an initialization method referred to as "He et al. initialization", "Kaiming initialization", or simply "MSRA initialization". It is typically used when training very deep neural networks that use ReLU-like activation functions (in particular PReLU).

The uniform distribution is initialized as follows:

 

      The normal variant draws weights from a zero-mean distribution with standard deviation sqrt(2 / F_in):
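The formulas in this section appear as images in the original, so as a reference the heuristics can be written out in NumPy as follows; the limits and standard deviations below follow the usual LeCun, Glorot/Xavier, and He conventions (matching the Keras initializers), with F_in and F_out denoting the layer's fan-in and fan-out:

```python
import numpy as np

F_in, F_out = 64, 32   # fan-in and fan-out of the layer being initialized

# LeCun uniform: limit = sqrt(3 / F_in)
limit = np.sqrt(3.0 / F_in)
W_lecun_uniform = np.random.uniform(low=-limit, high=limit, size=(F_in, F_out))

# Glorot/Xavier normal: stddev = sqrt(2 / (F_in + F_out))
W_glorot_normal = np.random.normal(0.0, np.sqrt(2.0 / (F_in + F_out)),
                                   size=(F_in, F_out))

# Glorot/Xavier uniform: limit = sqrt(6 / (F_in + F_out))
limit = np.sqrt(6.0 / (F_in + F_out))
W_glorot_uniform = np.random.uniform(low=-limit, high=limit, size=(F_in, F_out))

# He/Kaiming/MSRA uniform: limit = sqrt(6 / F_in)
limit = np.sqrt(6.0 / F_in)
W_he_uniform = np.random.uniform(low=-limit, high=limit, size=(F_in, F_out))

# He/Kaiming/MSRA normal: stddev = sqrt(2 / F_in)
W_he_normal = np.random.normal(0.0, np.sqrt(2.0 / F_in), size=(F_in, F_out))
```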

 

We will discuss this initialization method in Practitioner Bundle and ImageNet Bundle when training very deep neural networks on large image sets.

(6)Differences in initialization implementations

The exact limits and scaling factors may differ slightly from one paper or library to another, so consult the specific literature or implementation you are using.

2         summary

Unfortunately, as some of our results (e.g., on CIFAR-10) show, standard neural networks fail to reach high classification accuracy on challenging image datasets that exhibit variations in translation, rotation, viewpoint, and so on. To obtain reasonable accuracy on these datasets, we need a special type of feedforward neural network called CNNs (Convolutional Neural Networks).

3         References

[1] Jürgen Schmidhuber. "Deep Learning in Neural Networks: An Overview". In: CoRR abs/1404.7828 (2014). URL: http://arxiv.org/abs/1404.7828

[2] Greg Heinrich. NVIDIA DIGITS: Weight Initialization. https://github.com/NVIDIA/DIGITS/blob/master/examples/weight-init/README.md. 2015.
