Deep learning in Matlab II

2019-01-07

neural network theory

Interpreting Network Behavior

Extracting and Visualizing Activations:

Use the activations function to extract features from an input image:
- It accepts three inputs: the network, the input image, and the layer to extract features from.
- features = activations(net,img,layerName)
The output of each convolution layer consists of many 2-D arrays called channels.
Most CNNs learn to detect features like color and edges in the first convolution layer. In deeper layers, the network learns more complicated features.
Use the function mat2gray to normalize the activations to the range [0,1].
Use montage to show the channel images side by side.
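
As a rough sketch of the normalize-and-display steps, the snippet below uses a random array as a stand-in for the output of activations. The variable names, the array size, and the reshape step are assumptions made for illustration, and the Image Processing Toolbox is assumed to be available for mat2gray and montage.

```matlab
% Stand-in for: act = activations(net,img,layerName);
% A real call returns an H-by-W-by-C array with one channel per filter.
rng(0);
act = randn(8,8,4);              % hypothetical 8x8 activations with 4 channels

% Rescale all values to the range [0,1] so they can be shown as images
actNorm = mat2gray(act);

% montage treats an H-by-W-by-1-by-N array as N grayscale images
sz = size(actNorm);
actImgs = reshape(actNorm,[sz(1) sz(2) 1 sz(3)]);
montage(actImgs)
```

Because mat2gray is applied to the whole array here, bright and dark channels stay comparable with each other; normalizing each channel separately would be a different design choice.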

Representing Signal Data as Images

CNNs take images as input, so any 2-D array can be fed to a CNN. A signal can therefore be used with a CNN once it is represented as a 2-D array.
There are two ways to convert a 1-D signal to an image: compute a spectrogram with spectrogram, or compute a continuous wavelet transform with cwt.
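
A minimal sketch of the spectrogram route, assuming the Signal Processing Toolbox and Image Processing Toolbox are installed. The synthetic two-tone signal, the sampling rate, and the window settings are illustrative assumptions, not values from the course.

```matlab
fs = 1000;                               % sampling rate in Hz (assumed)
t = 0:1/fs:1-1/fs;                       % one second of samples
x = sin(2*pi*50*t) + sin(2*pi*120*t);    % synthetic two-tone signal

% Short-time Fourier transform: 128-sample window, 64-sample overlap, 128-point FFT
s = spectrogram(x,128,64,128,fs);

% Use the log-magnitude as pixel intensities, rescaled to [0,1]
img = mat2gray(log(abs(s) + eps));
imshow(img)
```

In practice the resulting image would usually still need to be resized to the input size the network expects.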

Feature Extraction for Machine Learning:

It is difficult to perform deep learning on a computer without a GPU because of the long training time.
An alternative is to use pretrained networks for feature extraction. Then you can use traditional machine learning methods to classify these features.
CNNs learn to extract useful features while learning how to classify image data. As you’ve seen, the early layers read an input image and extract features. Then, fully connected layers use these features to classify the image.
With deep learning, you can use the activations function to extract features. These features can be used as the predictor variables for machine learning.
We can set the OutputAs option to store the features from activation as rows for the machine learning model: testFeatures = activations(net,testImgs,'fc7','OutputAs','rows')

% extract the features from a pretrained net
trainingFeatures = activations(net,trainImgs,'fc7','OutputAs','rows');

% train the machine learning classifier, e.g. naive Bayes
classifier = fitcnb(trainingFeatures,trainImgs.Labels)

% extract the features for the test data and use the trained classifier above to predict
testFeatures = activations(net,testImgs,'fc7','OutputAs','rows');
predictedLabels = predict(classifier,testFeatures)

% compute accuracy and plot the confusion matrix
accuracy = nnz(predictedLabels==testImgs.Labels)/numel(predictedLabels)
confusionchart(testImgs.Labels, predictedLabels)

Create Networks

Many pretrained networks require 3-D input data and restrict the size of the input images. If the data you want to train on does not satisfy these requirements (for example, it has more than 3 dimensions), you need to build your own model.

Create Network Architectures

The first layer of any convolutional neural network is an image input layer. Function imageInputLayer(inputSize) can do this. The input size is a three-element vector corresponding to the height, width, and number of channels of that image: inLayer = imageInputLayer([28 28 3])
Convolution layers learn features in the input image by applying different filters to the image. To create a convolution layer, you need to specify the filter size and the number of filters: convolution2dLayer([h w],n), convLayer = convolution2dLayer([5 5],20)
Convolution layers are generally followed by rectified linear unit (ReLU) and max pooling layers.
A ReLU layer sets all negative values to zero: reluLayer(). The function does not require any inputs.
Max pooling layers perform down-sampling by “pooling” rectangular regions together and computing the maximum of each region. Use the pool size as input: maxPooling2dLayer([h w])
The last three layers of a convolutional neural network are
- fullyConnectedLayer: it requires the output size as input. This is the number of classes that the network can predict.
- softmaxLayer: does not require any inputs.
- classificationLayer: does not require any inputs.
To train a network, you need a single array that contains the entire network architecture, so the last step is to stack all the layers you have created into one array.

% Create an architecture that can be used to classify 28-by-28 color images into two classes.

% Create an input layer that expects RGB images sized 28-by-28.
inLayer = imageInputLayer([28 28 3])

% Create a convolution layer with 20 filters sized 5-by-5.
convLayer = convolution2dLayer([5 5],20)

% Create middle layers with a ReLU layer and a max pooling layer with a pool size of 2-by-2.
midLayers = [reluLayer(); maxPooling2dLayer([2 2])]

% Create output layers with a fully connected layer for two classes, a softmax layer, and a classification layer.
outLayers = [fullyConnectedLayer(2); softmaxLayer(); classificationLayer()]

% Combine layers
layers = [inLayer; convLayer; midLayers; outLayers]

% Set the training options and train
options = trainingOptions('sgdm','MaxEpochs',5,'InitialLearnRate',0.0001)
net = trainNetwork(XTrain,YTrain,layers,options);

% Prediction
testpreds = classify(net,XTest)

Understanding Neural Networks

Each layer in the network performs some operation on its inputs and outputs a new value.
The first layer is an image input layer. This layer defines the input size of the network and normalizes the input images. By default, an image input layer subtracts the mean image of the training data set. This centers the images around zero.
2-D convolution layers apply sliding filters to the input image. Convolution layers are a key part of the CNN architecture. They rely on the spatial structure of the input image.
Convolution layers are usually followed by a nonlinear activation layer such as a rectified linear unit (ReLU). A ReLU layer performs a threshold operation to each element of the input. Any value less than zero is set to zero.
A maximum pooling layer performs down-sampling by dividing the input into rectangular pooling regions and computing the maximum of each region. Pooling reduces the network complexity and creates a more general network.
Features passing through the network are stored in a collection of matrices until they reach the fully connected layer. At the fully connected layer, the input is “flattened” (that is, the matrices are transformed into a vector) so that it can be mapped to the output classes. This layer is a classical neural network.
The output size for this layer is the number of classes for your classification problem. For example, if you were classifying cats and dogs, the output size would be two.
The softmax layer converts the values for each output class into normalized scores using a normalized exponential function. You can interpret each value as the probability that the input image belongs to each class.
The classification output layer returns the name of the most likely class.
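
The per-layer operations described above can be sketched in plain MATLAB on a small matrix. This is only an illustration of the arithmetic, not how the toolbox layers are implemented; the input values, the fully connected scores z, and the choice of non-overlapping 2-by-2 pooling regions (a stride of 2) are assumptions.

```matlab
x = [-1  2  0  3;
      4 -5  6 -2;
      1  1 -3  0;
     -4  2  5  7];               % hypothetical 4x4 feature map

% ReLU: any value less than zero is set to zero
r = max(x,0);

% Max pooling over non-overlapping 2x2 regions
p = zeros(2,2);
for i = 1:2
    for j = 1:2
        block = r(2*i-1:2*i, 2*j-1:2*j);
        p(i,j) = max(block(:));
    end
end

% Flatten the pooled matrix into a vector, as a fully connected layer would
v = p(:);

% Softmax (normalized exponential) on hypothetical class scores
z = [2; 1];
s = exp(z - max(z)) ./ sum(exp(z - max(z)));

% Classification output: index of the most likely class
[~,predictedClass] = max(s);
```

Subtracting max(z) before exponentiating does not change the result; it only guards against overflow for large scores.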

Convolutional Layers

Convolution layers in CNNs perform convolution using learned filters.
A matrix called a kernel is used to filter the image.
Pass a kernel and an image to the conv2 function to apply a filter to the image: filteredim = conv2(kernel,im)
Using empty brackets as the second input in imshow will scale the display based on the minimum and maximum values present in the image: imshow(im,[])
Alternatively, use the imfilter function to apply the same kernel to an entire RGB image (3-D data); conv2 works only on 2-D arrays. Note that imfilter performs correlation by default; pass the 'conv' option to get convolution.
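
A small worked example of filtering with conv2, using a made-up 3-by-4 image with a vertical step edge and a two-element difference kernel (both are assumptions for illustration). The image is passed first here so that the 'valid' option sizes the output relative to the image.

```matlab
im = [0 0 1 1;
      0 0 1 1;
      0 0 1 1];                  % dark left half, bright right half

kernel = [1 -1];                 % horizontal difference kernel

% 'valid' keeps only positions where the kernel fully overlaps the image
edges = conv2(im,kernel,'valid');
```

The result is nonzero only in the column where the intensity changes, which is the kind of edge response early convolution layers tend to learn.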

Summary
List of layer functions

Train Networks

Understand the network training:

Mini-Batch
Learning rate
Learning algorithm

Monitor Training Progress

Use the Plots option to monitor network training: options = trainingOptions('sgdm','MaxEpochs',2,'InitialLearnRate',0.0001,'Plots','training-progress');

Validation

The three kinds of data sets:
- Training data: used during training to update weights.
- Validation data: used during training to evaluate performance.
- Testing data: used after training to evaluate performance.
Validation data is useful to detect if your network is overfitting. Even if the training loss is decreasing, if the validation loss is increasing, you should stop training because the network is learning details about the training data that aren’t relevant to new images.
There are three training options related to validation.
- ValidationData: Validation data and labels.
- ValidationFrequency: Number of iterations between each evaluation of the validation data.
- ValidationPatience: The number of validations to check before stopping training. Fluctuations in the loss from one iteration to another are normal, so you generally don’t want to stop training as soon as the validation loss increases. Instead, perform several validations. If the loss has not reached a new minimum in that time, then stop the training.
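
A sketch of how the three validation options might be set together. XVal and YVal are hypothetical validation data generated here only so the snippet is self-contained, and the numeric values are arbitrary choices.

```matlab
XVal = rand(28,28,3,10);                     % 10 hypothetical 28x28 RGB images
YVal = categorical(repmat([1;2],5,1));       % hypothetical labels for two classes

options = trainingOptions('sgdm', ...
    'MaxEpochs',5, ...
    'InitialLearnRate',0.0001, ...
    'ValidationData',{XVal,YVal}, ...        % data and labels as a cell array
    'ValidationFrequency',30, ...            % validate every 30 iterations
    'ValidationPatience',5);                 % stop after 5 validations without a new minimum loss
```

A smaller patience stops training sooner but is more easily triggered by normal fluctuations in the validation loss.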

Improve Performance

If the accuracy on the test data set is not adequate, you need to improve the network. Any of the inputs to trainNetwork can be modified to train a different network that may perform better.

Training algorithm options: Modifying the training options is generally the first place to begin improving a network.
Training Data: If there is not enough training data, the network may not generalize to new data. If you cannot get more training data, augmentation is a good alternative.
Architecture: If you are performing transfer learning, you often do not need to modify the network architecture to train an effective network. One alternative is to try using a pretrained network with a directed acyclic architecture like GoogLeNet or ResNet-50.

Training Options:

Decrease learning rate: If there is a large spike in loss, or loss values are no longer being plotted, your initial learning rate is probably too high. Decrease the learning rate by a power of ten until your loss decreases.
When you train a network, there is always a trade-off between accuracy and training time.
The following chart shows some general guidelines when training a convolutional neural network.

Augmented Datastores:

The imageDataAugmenter function can be used to choose your augmentation. Possible augmentations include transformations like reflection, translation, and scaling.
Generally, you should choose an augmentation that is relevant to your data set: imageDataAugmenter('RandRotation',[min max])
When you create an augmented image datastore, you need to specify the output image size, the source of the files, and the augmenter using augmentedImageDatastore: augmentedImageDatastore(size,ds,'DataAugmentation',augmenter)
You can read data from an augmented datastore with the read function. Instead of returning one image, read will return a batch of data. Each returned image has a different random augmentation: data = read(augImds)
The variable containing the augmented images is named input: im = data.input{n}
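
A minimal sketch of the augmentation workflow, using a random numeric array in place of an image datastore so that it runs without image files. The array X, the rotation range, the reflection option, and the output size are assumptions for illustration.

```matlab
X = rand(40,40,3,8);                         % 8 hypothetical 40x40 RGB images

% Random rotation between -20 and 20 degrees plus random left-right reflection
augmenter = imageDataAugmenter('RandRotation',[-20 20],'RandXReflection',true);

% Resize every image to 32-by-32 and apply the augmenter when data is read
augImds = augmentedImageDatastore([32 32],X,'DataAugmentation',augmenter);

% read returns a table holding a batch of augmented images
data = read(augImds);
im = data.input{1};                          % first augmented image in the batch
```

The augmented datastore can be passed to trainNetwork in place of the original data, so the network sees a differently transformed copy of each image every epoch.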

Directed Acyclic Graphs:

All the networks we have used or created are represented in MATLAB as a column vector of layers. This is called a series architecture.
An alternate way to organize layers in a network is called a directed acyclic graph (DAG). DAGs have a more complex architecture where layers can have inputs from, or outputs to, multiple layers.
A DAG architecture is defined with layers and connections between these layers. In MATLAB, these are represented in separate network properties. Some pretrained networks – e.g., GoogLeNet, ResNet-50, and SqueezeNet – are DAG networks.
Transfer learning from a DAG:
- To modify the architecture of a DAG network, first get a graph of its layers. A layer graph contains both the layers and connections of a DAG: lgraph = layerGraph(net)
- View the architecture by using the layer graph as input to the plot function: plot(lgraph)
- Connections between layers in a DAG are defined by each layer's name. When creating a new layer for a DAG network, you should name it by setting the 'Name' option: newfc = fullyConnectedLayer(n,'Name','layerName')
- Replace a layer using the replaceLayer function. The three inputs are the layer graph, the name of the layer to replace, and the variable containing the new layer: newgraph = replaceLayer(lgraph,'oldLayerName',newfc)
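
The replace step can be tried on a small hand-built layer graph, which avoids downloading a pretrained network. The layer names and sizes below are made up for illustration; with a real DAG network, the graph would come from layerGraph(net) and the layer names would be read from the network itself.

```matlab
% Small named layer array standing in for a pretrained network
layers = [
    imageInputLayer([28 28 3],'Name','in')
    convolution2dLayer([5 5],20,'Name','conv')
    reluLayer('Name','relu')
    fullyConnectedLayer(10,'Name','fc')
    softmaxLayer('Name','sm')
    classificationLayer('Name','out')];

lgraph = layerGraph(layers);

% New fully connected layer for a hypothetical 3-class problem
newFc = fullyConnectedLayer(3,'Name','new_fc');

% Swap the old layer for the new one; connections are rewired by name
lgraph = replaceLayer(lgraph,'fc',newFc);

names = {lgraph.Layers.Name};
```

When the number of classes changes, the final classification layer typically needs to be replaced in the same way so that it does not keep the old class names.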

sequence classification and regression.

Perform Regression

Transfer learning for regression

Use AlexNet to perform transfer learning.
The last three layers of the pretrained network are for classification, so for a regression problem you need to delete them before adding the correct layers: layers(end-n+1:end) = []
For regression problems, the last two layers must be a fully connected layer and a regression layer. The corresponding functions are fullyConnectedLayer(outputSize) and regressionLayer(). Regression networks do not need a softmax layer.
When a regression network is trained, root-mean-square error (RMSE) is calculated instead of accuracy.

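% build and train a network
net = alexnet;
ly = net.Layers;
ly(end-2:end) = [];    % remove the three classification layers
newlayers = [fullyConnectedLayer(1);regressionLayer()];
ly = [ly;newlayers];
% mynet is assumed to be the network trained on these layers, for example:
% mynet = trainNetwork(trainImgs,trainValues,ly,options);

% evaluate regression network
pred = predict(mynet,testImage);
err = trueValue - pred;
% RMSE: square root of the mean squared error
rmse = sqrt(mean(err.^2));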

Detect Objects in images

Use insertObjectAnnotation to add a bounding box to an image: alteredImg = insertObjectAnnotation(image,'rectangle',boxposition,label)

Regions with Convolutional Neural Networks (R-CNN)

R-CNN workflow:
- Find regions likely to contain objects
- Extract and resize each region to CNN input size
- Use CNN to predict class of each region
Training and Using an R-CNN:
- There are three different types of object detectors in MATLAB: R-CNN, Fast R-CNN, and Faster R-CNN. The corresponding functions are trainRCNNObjectDetector, trainFastRCNNObjectDetector, and trainFasterRCNNObjectDetector.
- These detectors differ in training time and detection time. For example, an R-CNN can be trained quickly, but it detects objects in a new image more slowly than a Faster R-CNN. You should choose between them depending on your application.
- All of these functions take the same inputs (data, network, options) and return a trained detector.
- data is the ground truth stored as a table. The first variable is the path and filename of each image. The remaining variables are labels and the corresponding bounding boxes.
- Use the detect function to detect objects in new images: [bboxes,scores,labels] = detect(detector,image)
Evaluating an Object Detector:
- Precision: the function evaluateDetectionPrecision calculates a precision metric using an overlap threshold between the predicted and true bounding boxes. Precision is the ratio of true positive detections to all detections that the detector reports as positive.
- Miss rate: we also need to consider the case when the detector fails to find an object. This is called the miss rate. You can calculate a miss rate metric using evaluateDetectionMissRate.
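
To make the overlap threshold concrete, here is a plain-MATLAB sketch of the overlap ratio (intersection over union) between one true box and one predicted box. The box coordinates and the 0.5 threshold are assumptions for illustration; this shows the idea only and is not the code used inside evaluateDetectionPrecision.

```matlab
boxTrue = [10 10 20 20];         % [x y width height], hypothetical ground truth
boxPred = [20 20 20 20];         % hypothetical detection

% Width and height of the intersection rectangle (zero if the boxes do not overlap)
xOverlap = max(0, min(boxTrue(1)+boxTrue(3), boxPred(1)+boxPred(3)) - max(boxTrue(1), boxPred(1)));
yOverlap = max(0, min(boxTrue(2)+boxTrue(4), boxPred(2)+boxPred(4)) - max(boxTrue(2), boxPred(2)));

interArea = xOverlap * yOverlap;
unionArea = boxTrue(3)*boxTrue(4) + boxPred(3)*boxPred(4) - interArea;
iou = interArea / unionArea;

% A detection counts as a true positive only if the overlap reaches the threshold
threshold = 0.5;
isMatch = iou >= threshold;
```

Here the boxes share a 10-by-10 region out of a combined area of 700, so the overlap ratio is 1/7 and the detection would not count as a match.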

% train an R-CNN
net = alexnet;
options = trainingOptions('sgdm','InitialLearnRate',0.0001,'MaxEpochs',2);
rcnn = trainRCNNObjectDetector(petGroundTruth,net,options);

% test on an image
[dbox,dscore,dlabel] = detect(rcnn,dogim)
detectedDogs = insertObjectAnnotation(dogim,'rectangle',dbox,cellstr(dlabel))
imshow(detectedDogs)

Classify Sequence Data with Recurrent Networks

Long Short-term memory network (LSTM)

Sequence classification
Bidirectional LSTMs

Structuring Sequence Data

Training an LSTM requires the data to be stored in a particular format:
- The input data is a cell array with one column.
- Each element in the cell array is one sample, or sequence. This sample is a numeric matrix.
- The columns in each sample are the time steps. Every sample can have a different number of time steps.
- The rows correspond to the feature dimension of the sample. This could be signal data from different sensors, or different letters in a vocabulary. All samples must have the same number of rows.
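
The required layout can be sketched with a few random sequences. The number of features, the sequence lengths, and the variable names are assumptions chosen for illustration.

```matlab
% Three samples, each with 2 features (rows) and a different number of time steps (columns)
X = cell(3,1);                   % cell array with one column
X{1} = rand(2,10);
X{2} = rand(2,15);
X{3} = rand(2,7);

% Check the layout: the row count must match across samples, the column count may differ
numFeatures = cellfun(@(s) size(s,1), X);
numSteps = cellfun(@(s) size(s,2), X);
```

The shared row count is the value that would be passed to sequenceInputLayer.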

Sequence Classification:

Create LSTM architecture
- The network begins with an input layer, follows with a BiLSTM layer, and ends with the same output layers as a CNN.
- The first layer of an LSTM is a sequence input layer: sequenceInputLayer(inputSize). The input to this function is the number of features, or the number of rows in a sample.
- Next is a bidirectional LSTM layer. You should set the number of nodes and the output mode when creating this layer: bilstmLayer(numNodes,'OutputMode','last')
- The last three layers in the LSTM are the same layers as a CNN for classification: fullyConnectedLayer, softmaxLayer and classificationLayer
Train an LSTM
Use LSTM to classify sequences
- The classify function can be used with an LSTM: predictedLabel = classify(net,testdata)

inLayer = sequenceInputLayer(1)
lstm = bilstmLayer(100,'OutputMode','last')
outLayers = [fullyConnectedLayer(3); softmaxLayer(); classificationLayer()]
% combine all layers
layers = [inLayer; lstm; outLayers]

% train the network
options = trainingOptions('adam', ...
'MaxEpochs',250, ...
'InitialLearnRate',0.005, ...
'GradientThreshold',1, ...
'Shuffle','every-epoch', ...
'Plots','training-progress', ...
'LearnRateDropPeriod',200,...
'LearnRateSchedule','piecewise');

net = trainNetwork(XTrain,YTrain,layers,options);

testPred = classify(net,XTest)
confusionchart(YTest,testPred)

Improving LSTM Performance

Sequences can be normalized using a variety of methods.
Sequence length and padding are considerations specific to LSTMs.
Sequence length:
- Sequences can contain any number of time steps. This is convenient, but you should be cautious if your sequences have different lengths.
- During training, the sequences in each mini-batch are padded with a number, usually zero, to equalize the lengths. A network cannot distinguish between values created for padding and values that are part of the sequence.
- You should minimize the amount of padding by sorting your data by sequence length and carefully choosing the mini-batch size.
- You can also set the 'SequenceLength' training option to 'shortest' to trim longer sequences to the same length as the shortest sequence in each mini-batch. This option adds no padding, but can remove important data from your sequences.
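
Sorting by sequence length can be sketched in plain MATLAB as follows. The sequences and labels are made up for illustration.

```matlab
% Hypothetical data: three sequences with 3 features and different lengths
X = {rand(3,12); rand(3,5); rand(3,9)};
Y = categorical({'a'; 'b'; 'c'});

% Number of time steps (columns) in each sequence
seqLengths = cellfun(@(s) size(s,2), X);

% Sort the sequences by length and reorder the labels the same way
[sortedLengths,idx] = sort(seqLengths);
XSorted = X(idx);
YSorted = Y(idx);
```

Sorting only reduces padding if the order is preserved during training, for example by setting the 'Shuffle' training option to 'never' instead of 'every-epoch'.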

Classify Categorical Sequences

Training an LSTM requires the sequences to be numeric. If your sequences are categorical, how can you train a deep network?
Categorical sequences could be a sequence of the weather, DNA, or music notes.
One option is to assign a number to each category. However, this results in imposing a false numerical structure on the observations. For example, if you assign the numbers 1 through 4 to four categories in a predictor, it implies that the distance between the categories 1 and 4 is longer than the distance between the categories 3 and 4.
Instead of assigning a number to each category, create dummy predictors for the categories. Each dummy predictor can have only two values – 0 or 1. For any given observation, only one of the dummy predictors can have the value equal to 1. You can create a matrix of dummy variables using the function dummyvar: d = dummyvar(c)
You can train a network on text data by creating dummy predictors from a categorical representation of your text. The rows of the dummy predictor matrix correspond to each letter in the vocabulary.
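
A small example of dummy predictors for a categorical weather sequence, assuming the Statistics and Machine Learning Toolbox is available for dummyvar. The sequence itself is made up.

```matlab
% Hypothetical categorical sequence with four time steps
weather = categorical({'sun'; 'rain'; 'sun'; 'snow'});

% Categories are ordered alphabetically: rain, snow, sun
cats = categories(weather);

% One row per observation, one column per category, exactly one 1 per row
d = dummyvar(weather);

% Transpose so rows are features and columns are time steps, as an LSTM expects
X = d';
```

Each column of X now holds a single 1 marking the category at that time step, so no category is numerically closer to another.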

Classify Text Data:

% suppose we have some sequences of text: tDickens, etc.

% convert to lowercase
tDickens = lower(tDickens)
% create the vocabulary that contains all the character types (lowercase only)
vocab = uint8(' !"&''()*,-.0123456789:;?_abcdefghijklmnopqrstuvwxyz');
% create dummy variables; transpose so that rows are characters and columns are time steps
mDickens = dummyvar(categorical(uint8(tDickens),vocab)')';
% classify using a trained network
[author,score] = classify(net,mDickens)

Generate Sequences of Output

Sequence-to-Sequence Classification:
- Rather than classifying a recording as a single label, sometimes we also need to classify a sequence with multiple labels, e.g. multiple instruments in one recording.
- In this case, the 'OutputMode' property needs to be set to 'sequence'.
Sequence Forecasting
- Long short-term memory networks can be used to forecast future time steps of a sequence. Forecasting is often performed with time series data.
- The data used with the network is the sequence you want to forecast. You will use a subset of the sequence to train, and the rest to test.
- The input data is the training sequence, except the last value. The response is the sequence shifted by one time step.
- To predict with this network, use function predictAndUpdateState with the training data as input. This function predicts and updates the state of the network so it will remember this sequence during its next prediction.
- The output is a prediction for the next value in the sequence. You can evaluate the network by comparing the actual and predicted value.
- The training data for a text-generating network is a sequence of text where each label is the next letter in the sequence. The input data should be everything except the last letter in the sequence.
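
The shifted input and response for forecasting can be built with simple indexing. The sine-wave data and the 90/10 split below are assumptions for illustration.

```matlab
% Hypothetical time series with 100 time steps
data = sin(0.1*(1:100));

% Use the first part of the sequence to train and the rest to test
numTrain = 90;
trainSeq = data(1:numTrain);
testSeq = data(numTrain+1:end);

% Input: the training sequence except the last value
XTrain = trainSeq(1:end-1);

% Response: the same sequence shifted by one time step
YTrain = trainSeq(2:end);
```

At every time step the response is the value the network should predict next, so XTrain(k+1) equals YTrain(k). The same indexing applies to text generation, with characters in place of numeric values.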

Some resources:

MathWorks blog for deep learning: https://blogs.mathworks.com/deep-learning/
Deep learning toolbox: https://au.mathworks.com/help/deeplearning/index.html