An example of MNIST classification to understand neural networks (NNs).
Key Steps
network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%
  layer_dense(units = 10, activation = "softmax")
loss function: how the network measures its performance on the training data (e.g., mean squared error)
optimizer: mechanism through which the network updates itself based on the data it sees and its loss function (e.g., gradient descent)
metrics to monitor: how to quantify the quality of the classification/prediction during training and testing (e.g., accuracy)
network %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)
Notice that the compile() function modifies the network in place.
For a classification problem, use the to_categorical() function to create the categorical (one-hot) encoding, i.e., a vector of 0/1 indicating the class, as illustrated below.
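A minimal illustration of the encoding (the labels here are my own toy example; digits are 0-based):
library(keras)
to_categorical(c(0, 2, 1), num_classes = 3)
## label 0 -> row (1, 0, 0)
## label 2 -> row (0, 0, 1)
## label 1 -> row (0, 1, 0)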
network %>% fit(train_images, train_labels, epochs = 5, batch_size = 128)
network %>% evaluate(test_images, test_labels)
A Full Example
## 1) Loading the data
library(keras)
mnist = dataset_mnist()
train_images = mnist$train$x
train_labels = mnist$train$y
test_images = mnist$test$x
test_labels = mnist$test$y
## 2) Network architecture
network = keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%
  layer_dense(units = 10, activation = "softmax")
summary(network)
## 3) Compile
network %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)
## 4) Prepare the data (vectorize the images, since we are using a dense ANN)
train_images = array_reshape(train_images, c(60000, 28 * 28))
train_images = train_images / 255
test_images = array_reshape(test_images, c(10000, 28 * 28))
test_images = test_images / 255
## 5) Prepare the labels
## to_categorical() creates the 0/1 vector indicating the class
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
## 6) Fit
model_fit = network %>% fit(train_images, train_labels,
  epochs = 5, batch_size = 128,
  validation_split = 0.2)
## 7) Check the performance
network %>% evaluate(test_images, test_labels)
plot(model_fit)
## 8) Predict
network %>% predict_classes(test_images[1:10, ])  # removed in newer keras versions; see below
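predict_classes() was removed from newer versions of the keras R package; a sketch of an equivalent using predict() and base R's max.col():
probs = network %>% predict(test_images[1:10, ])
max.col(probs) - 1  # column index of the largest probability, shifted to 0-9 digit labels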
Scalars (0D tensors) -> Vectors (1D tensors) -> Matrices (2D tensors) -> 3D and higher-dimensional tensors
rank: number of axes
shape: dimension of each axis (dim())
data type: integer or double
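These can be inspected in R; e.g., for the raw MNIST training images from step 1 of the full example (before the reshaping in step 4):
dim(train_images)          # shape: 60000 28 28
length(dim(train_images))  # rank: 3
typeof(train_images)       # data type ("integer" or "double")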
In DL, the first axis is typically the sample axis. E.g., in MNIST, the samples are images of digits.
DL models process data in small batches. batch_size = 128 specifies that the first batch will be train_images[1:128,,], and the next batch will be train_images[129:256,,].
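A hypothetical helper (the name is my own) that extracts the n-th batch along the sample axis of a 3D tensor:
nth_batch = function(x, n, batch_size = 128) {
  start = (n - 1) * batch_size + 1
  x[start:(start + batch_size - 1), , , drop = FALSE]  # keep the array structure
}
## nth_batch(train_images, 2) corresponds to train_images[129:256, , ]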
Vector data: 2D tensors of shape (samples, features)
Time series: 3D tensors of shape (samples, timesteps, features)
Images: 4D tensors of shape (samples, height, width, channels) or (samples, channels, height, width)
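For concreteness, R arrays with these shapes can be created with array(); the dimensions below are illustrative:
vector_data = array(0, dim = c(1000, 20))         # (samples, features)
time_series = array(0, dim = c(1000, 50, 20))     # (samples, timesteps, features)
images      = array(0, dim = c(1000, 28, 28, 1))  # (samples, height, width, channels)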
The example's first hidden layer is a relu layer, which computes output = relu(dot(W, input) + b), where relu(x) = max(x, 0).
naive_relu = function(x) {
  # x is a 2D tensor (matrix); apply max(., 0) element-wise
  for (i in 1:nrow(x)) {
    for (j in 1:ncol(x)) {
      x[i, j] = max(x[i, j], 0)
    }
  }
  x
}
The vectorized equivalent in base R is pmax(x, 0).
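A quick toy check (my own example) that the loop and the vectorized version agree:
m = matrix(c(-1, 2, -3, 4), nrow = 2)
naive_relu(m)                     # negative entries become 0
all(naive_relu(m) == pmax(m, 0))  # TRUE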
E.g., a similar naive loop can perform matrix-plus-vector addition, where the vector is broadcast across every row of the matrix; see the sketch below.
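A sketch in the same naive-loop style (the function name is my own); the vector y is added to every row of the matrix x:
naive_add_matrix_and_vector = function(x, y) {
  for (i in 1:nrow(x)) {
    for (j in 1:ncol(x)) {
      x[i, j] = x[i, j] + y[j]  # y is broadcast across rows
    }
  }
  x
}
## Vectorized equivalent in base R: sweep(x, 2, y, "+")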
In output = relu(dot(W, input) + b), W is the weight matrix and b is the bias vector; both are trainable. A training loop consists of the following steps:
1) Draw a batch of training samples x and corresponding targets y
2) Run the network on x (forward pass) to obtain predictions y_pred
3) Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y
4) Update all weights of the network in a way that slightly reduces the loss on this batch
How to update the weights in 4)? Move them against the direction of the derivatives, i.e., toward a point where the derivative goes to zero (a minimum of the loss).
Then step 4) from above becomes:
4.1) Compute the gradient of the loss with regard to the network’s parameters (backward pass)
4.2) Move the parameters a little in the opposite direction of the gradient, W = W - (step * gradient)
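A toy one-parameter illustration of the update rule in 4.2 (the loss, step size, and iteration count are my own choices):
w = 0       # initial parameter
step = 0.1  # learning rate
for (i in 1:100) {
  gradient = 2 * (w - 3)   # derivative of the toy loss (w - 3)^2
  w = w - step * gradient  # move opposite the gradient
}
w  # approaches 3, where the derivative is zero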
This is called mini-batch stochastic gradient descent (SGD): mini-batch because it works on a small batch of data at a time, and stochastic because each batch is drawn at random.
There are variants of SGD, such as SGD with momentum, as well as Adagrad and RMSProp.
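In the keras R package these variants are exposed as optimizer objects that can be passed to compile(); a sketch (in older package versions the argument may be lr instead of learning_rate):
network %>% compile(
  optimizer = optimizer_sgd(learning_rate = 0.01, momentum = 0.9),
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)
## optimizer_rmsprop() and optimizer_adagrad() can be used the same way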