An example of MNIST classification to understand neural networks (NNs).
Key Steps
network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%
  layer_dense(units = 10, activation = "softmax")
loss function: how the network measures its performance on the training data (e.g., mean squared error)
optimizer: mechanism through which the network updates itself based on the data it sees and its loss function (e.g., gradient descent)
metrics to monitor: how to quantify the quality of the classification/prediction during training and testing (e.g., accuracy)
network %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)
Notice that the compile() function modifies the network in place.
For a classification problem, use the to_categorical() function to create the categorical (one-hot) encoding, i.e., a vector of 0/1 indicating the class, as illustrated below.
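A minimal illustration of the encoding (the labels here are my own toy example; digits are 0-based):
library(keras)
to_categorical(c(0, 2, 1), num_classes = 3)
## label 0 -> row (1, 0, 0)
## label 2 -> row (0, 0, 1)
## label 1 -> row (0, 1, 0)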
network %>% fit(train_images, train_labels, epochs = 5, batch_size = 128)
network %>% evaluate(test_images, test_labels)
A Full Example
## 1) Loading the data
library(keras)
mnist = dataset_mnist()
train_images = mnist$train$x
train_labels = mnist$train$y
test_images = mnist$test$x
test_labels = mnist$test$y
## 2) Network architecture
network = keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%
  layer_dense(units = 10, activation = "softmax")
summary(network)
## 3) Compile
network %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)
## 4) Prepare the data (vectorize the images, since we are using a dense ANN)
train_images = array_reshape(train_images, c(60000, 28 * 28))
train_images = train_images / 255
test_images = array_reshape(test_images, c(10000, 28 * 28))
test_images = test_images / 255
## 5) Prepare the labels
## to_categorical() creates the 0/1 vector indicating the class
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
## 6) Fit
model_fit = network %>% fit(train_images, train_labels,
  epochs = 5, batch_size = 128,
  validation_split = 0.2)
## 7) Check the performance
network %>% evaluate(test_images, test_labels)
plot(model_fit)
## 8) Predict
network %>% predict_classes(test_images[1:10, ])  # removed in newer keras versions; see below
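predict_classes() was removed from newer versions of the keras R package; a sketch of an equivalent using predict() and base R's max.col():
probs = network %>% predict(test_images[1:10, ])
max.col(probs) - 1  # column index of the largest probability, shifted to 0-9 digit labels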
Scalars (0D tensors) -> Vectors (1D tensors) -> Matrices (2D tensors) -> 3D and higher-dimensional tensors
rank: number of axes
shape: dimension of each axis (dim())
data type: integer or double
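These can be inspected in R; e.g., for the raw MNIST training images from step 1 of the full example (before the reshaping in step 4):
dim(train_images)          # shape: 60000 28 28
length(dim(train_images))  # rank: 3
typeof(train_images)       # data type ("integer" or "double")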
In DL, the first axis is typically the sample axis. E.g., in MNIST, the samples are images of digits.
DL models process data in small batches. batch_size = 128 specifies that the first batch will be train_images[1:128,,], and the next batch will be train_images[129:256,,].
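A hypothetical helper (the name is my own) that extracts the n-th batch along the sample axis of a 3D tensor:
nth_batch = function(x, n, batch_size = 128) {
  start = (n - 1) * batch_size + 1
  x[start:(start + batch_size - 1), , , drop = FALSE]  # keep the array structure
}
## nth_batch(train_images, 2) corresponds to train_images[129:256, , ]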
Vector data: 2D tensors of shape (samples, features)
Time series: 3D tensors of shape (samples, timesteps, features)
Images: 4D tensors of shape (samples, height, width, channels) or (samples, channels, height, width)
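For concreteness, R arrays with these shapes can be created with array(); the dimensions below are illustrative:
vector_data = array(0, dim = c(1000, 20))         # (samples, features)
time_series = array(0, dim = c(1000, 50, 20))     # (samples, timesteps, features)
images      = array(0, dim = c(1000, 28, 28, 1))  # (samples, height, width, channels)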
The example's first hidden layer is a relu layer, which computes output = relu(dot(W, input) + b), where relu(x) = max(x, 0).
naive_relu = function(x) {
  # x is a 2D tensor (matrix); apply max(., 0) element-wise
  for (i in 1:nrow(x)) {
    for (j in 1:ncol(x)) {
      x[i, j] = max(x[i, j], 0)
    }
  }
  x
}
The vectorized equivalent in base R is pmax(x, 0).
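A quick toy check (my own example) that the loop and the vectorized version agree:
m = matrix(c(-1, 2, -3, 4), nrow = 2)
naive_relu(m)                     # negative entries become 0
all(naive_relu(m) == pmax(m, 0))  # TRUE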
E.g., a similar naive loop can perform matrix-plus-vector addition, where the vector is broadcast across every row of the matrix; see the sketch below.
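A sketch in the same naive-loop style (the function name is my own); the vector y is added to every row of the matrix x:
naive_add_matrix_and_vector = function(x, y) {
  for (i in 1:nrow(x)) {
    for (j in 1:ncol(x)) {
      x[i, j] = x[i, j] + y[j]  # y is broadcast across rows
    }
  }
  x
}
## Vectorized equivalent in base R: sweep(x, 2, y, "+")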
In output = relu(dot(W, input) + b), W is the weight matrix and b is the bias vector; both are trainable. A training loop consists of the following steps:
1) Draw a batch of training samples x and corresponding targets y
2) Run the network on x (forward pass) to obtain predictions y_pred
3) Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y
4) Update all weights of the network in a way that slightly reduces the loss on this batch
How to update the weights in 4)? Move them against the direction of the derivatives, i.e., toward a point where the derivative goes to zero (a minimum of the loss).
Then step 4) from above becomes:
4.1) Compute the gradient of the loss with regard to the network’s parameters (backward pass)
4.2) Move the parameters a little in the opposite direction of the gradient, W = W - (step * gradient)
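A toy one-parameter illustration of the update rule in 4.2 (the loss, step size, and iteration count are my own choices):
w = 0       # initial parameter
step = 0.1  # learning rate
for (i in 1:100) {
  gradient = 2 * (w - 3)   # derivative of the toy loss (w - 3)^2
  w = w - step * gradient  # move opposite the gradient
}
w  # approaches 3, where the derivative is zero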
This is called mini-batch stochastic gradient descent (SGD): mini-batch because it works on a small batch of data at a time, and stochastic because each batch is drawn at random.
There are variants of SGD, such as SGD with momentum, as well as Adagrad and RMSProp.
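In the keras R package these variants are exposed as optimizer objects that can be passed to compile(); a sketch (in older package versions the argument may be lr instead of learning_rate):
network %>% compile(
  optimizer = optimizer_sgd(learning_rate = 0.01, momentum = 0.9),
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)
## optimizer_rmsprop() and optimizer_adagrad() can be used the same way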