Example: with an input tensor of shape (image_height, image_width, image_channels); for the MNIST data this is (28, 28, 1).
Convnet:
library(keras)

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu")
The 3D output of the last convolution layer is then fed into a densely connected classifier, which expects 1D input, so the output is first flattened with layer_flatten():
model <- model %>%
  layer_flatten() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")
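To see how the feature-map shapes evolve through the network, print the architecture:

summary(model)

The output shapes go from (26, 26, 32) after the first convolution down to (3, 3, 64) after the last one, which is then flattened into a vector of length 576 before the dense layers.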
Applying a convnet to the MNIST data
library(keras)

# Convolutional base
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu")

# Densely connected classifier on top of the convolutional base
model <- model %>%
  layer_flatten() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

# Load and preprocess the MNIST images
mnist <- dataset_mnist()
c(c(train_images, train_labels), c(test_images, test_labels)) %<-% mnist

train_images <- array_reshape(train_images, c(60000, 28, 28, 1))
train_images <- train_images / 255
test_images <- array_reshape(test_images, c(10000, 28, 28, 1))
test_images <- test_images / 255

train_labels <- to_categorical(train_labels)
test_labels <- to_categorical(test_labels)

model %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)

model %>% fit(
  train_images, train_labels,
  epochs = 5, batch_size = 64
)

model %>% evaluate(test_images, test_labels)
Densely connected layers learn global patterns, whereas convolution layers learn local patterns. In terms of images, dense layers learn patterns involving all pixels, while convolution layers learn patterns found within small windows.
Learned patterns are translation invariant: once a convnet has learned a pattern in one location, it can recognize it in any other location.
Convnets can learn spatial hierarchies of patterns: a first layer learns small local patterns such as edges, and later layers learn larger patterns made of the features of the earlier layers.
Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis). For an RGB image, the dimension of the depth axis is 3, because the image has three color channels: red, green, and blue. For a black-and-white picture, like the MNIST digits, the depth is 1 (levels of gray). The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map.
The output feature map is still a 3D tensor of shape (width, height, depth), but the output depth is arbitrary: it is a parameter of the layer, and its channels no longer stand for RGB colors; rather, they stand for filters. Example of a filter: a single filter could encode the concept "presence of a face in the input".
Input: 3D tensor (width, height, channels) \(\rightarrow\) Output: 3D tensor (width*, height*, output_depth)
Two parameters define a convolution:
1. Size of the patches extracted from the input: typically 3 * 3 or 5 * 5.
2. Depth of the output feature map, i.e. the number of filters.
In Keras these are the first arguments of the layer: layer_conv_2d(filters = output_depth, kernel_size = c(window_height, window_width))
A convolution works by sliding these windows of size 3 * 3 or 5 * 5 over the 3D input feature map, stopping at every possible location, and extracting the 3D patch of surrounding features (shape (window_height, window_width, input_depth)
). Each such 3D patch is then transformed (via a tensor product with the same learned weight matrix, called the convolution kernel) into a 1D vector of shape (output_depth). All of these vectors are then spatially reassembled into a 3D output map of shape (height, width, output_depth). Every spatial location in the output feature map corresponds to the same location in the input feature map (for example, the lower-right corner of the output contains information about the lower-right corner of the input). For instance, with 3 * 3 windows, the vector output[i, j, ] comes from the 3D patch input[i-1:i+1, j-1:j+1, ]
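The following is a minimal illustrative sketch of this sliding-window computation in plain R (a "valid" convolution with stride 1); it is not how Keras computes convolutions, and the function name naive_conv_2d is made up for illustration.

naive_conv_2d <- function(input, kernel) {
  # input:  array of shape (height, width, input_depth)
  # kernel: array of shape (window_height, window_width, input_depth, output_depth)
  h <- dim(input)[1]; w <- dim(input)[2]
  wh <- dim(kernel)[1]; ww <- dim(kernel)[2]
  output_depth <- dim(kernel)[4]
  output <- array(0, dim = c(h - wh + 1, w - ww + 1, output_depth))
  for (i in 1:(h - wh + 1)) {
    for (j in 1:(w - ww + 1)) {
      # Extract the 3D patch of shape (window_height, window_width, input_depth)
      patch <- input[i:(i + wh - 1), j:(j + ww - 1), , drop = FALSE]
      for (d in 1:output_depth) {
        # Tensor product of the patch with the d-th filter of the kernel
        output[i, j, d] <- sum(as.vector(patch) * as.vector(kernel[, , , d]))
      }
    }
  }
  output
}

# Example: a 5 x 5 single-channel input with one 3 x 3 filter gives a 3 x 3 output
x <- array(rnorm(5 * 5 * 1), dim = c(5, 5, 1))
kern <- array(rnorm(3 * 3 * 1 * 1), dim = c(3, 3, 1, 1))
dim(naive_conv_2d(x, kern))   # 3 3 1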
Padding consists of adding an appropriate number of rows and columns on each side of the input feature map so that a convolution window can be centered around every input tile.
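In Keras this is controlled by the padding argument of layer_conv_2d ("valid", the default, means no padding; "same" pads so that the output has the same width and height as the input). A small sketch:

model_same <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                padding = "same", input_shape = c(28, 28, 1))
# With padding = "same" the output feature map is (28, 28, 32) instead of (26, 26, 32)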
Max pooling consists of extracting windows from the input feature maps and outputting the max value of each channel. It's conceptually similar to convolution, except that instead of transforming local patches via a learned linear transformation (the convolution kernel), they're transformed via a hardcoded max tensor operation. A big difference from convolution is that max pooling is usually done with 2 × 2 windows and stride 2, in order to downsample the feature maps by a factor of 2. On the other hand, convolution is typically done with 3 × 3 windows and no stride (stride 1).
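For comparison, here is a sketch of the same convolutional base without the max-pooling layers, which shows why downsampling matters: the feature maps stay large, and the stack of three 3 × 3 convolutions only ever sees 7 × 7 windows of the original input.

model_no_max_pool <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu")
summary(model_no_max_pool)   # final feature map is (22, 22, 64), i.e. 30,976 values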
To process a JPEG file:
1. Read the picture files.
2. Decode the JPEG content to RGB grids of pixels.
3. Convert these into floating-point tensors.
4. Rescale the pixel values to [0, 1].
image_data_generator
\(\rightarrow\) flow_images_from_directory
train_datagen <- image_data_generator(rescale = 1/255)

train_generator <- flow_images_from_directory(
  train_dir,                  # directory with the training images
  train_datagen,              # the rescaling generator defined above
  target_size = c(150, 150),  # resize all images to 150 x 150
  batch_size = 20,
  class_mode = "binary"       # binary labels, for binary_crossentropy
)
Use the fit_generator function to fit the model to a data generator. Because the data is being generated endlessly, the fitting process needs to know how many samples to draw from the generator before declaring an epoch over. This is the role of the steps_per_epoch argument: after drawing steps_per_epoch batches from the generator (i.e. after running steps_per_epoch gradient descent steps), the fitting process moves on to the next epoch. In this case batches contain 20 samples, so it takes 100 batches to reach the target of 2,000 samples.
history <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 100,
  epochs = 30,
  validation_data = validation_generator,  # built the same way from validation_dir
  validation_steps = 50
)
Given infinite data, your model would be exposed to every possible aspect of the data distribution at hand: you would never overfit. Data augmentation takes the approach of generating more training data from existing training samples, by augmenting the samples via a number of random transformations that yield believable-looking images. The goal is that at training time, your model will never see the exact same picture twice. This helps expose the model to more aspects of the data and generalize better.
datagen <- image_data_generator(
  rescale = 1/255,
  rotation_range = 40,       # randomly rotate images by up to 40 degrees
  width_shift_range = 0.2,   # randomly translate horizontally (fraction of width)
  height_shift_range = 0.2,  # randomly translate vertically (fraction of height)
  shear_range = 0.2,         # random shearing transformations
  zoom_range = 0.2,          # random zooming inside pictures
  horizontal_flip = TRUE,    # randomly flip half the images horizontally
  fill_mode = "nearest"      # how to fill in newly created pixels
)
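A sketch of feeding this augmenting generator to the directory iterator (train_dir and the 150 × 150 target size are taken from the earlier example):

train_generator <- flow_images_from_directory(
  train_dir,
  datagen,                    # the augmenting generator defined above
  target_size = c(150, 150),
  batch_size = 32,
  class_mode = "binary"
)

Note that the validation data should not be augmented; use a plain rescaling generator (rescale = 1/255 only) for it.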
There are two ways to use a pretrained network (e.g. VGG, ResNet, Inception, Inception-ResNet, Xception, etc.): feature extraction and fine-tuning.
Feature extraction consists of using the representations learned by a previous network to extract interesting features from new samples. These features are then run through a new classifier, which is trained from scratch.
Convnets used for image classification comprise two parts: they start with a series of pooling and convolution layers, and they end with a densely connected classifier. The first part is called the convolutional base of the model. In the case of convnets, feature extraction consists of taking the convolutional base of a previously trained network, running the new data through it, and training a new classifier on top of the output.
The feature maps of a convnet are presence maps of generic concepts over a picture, which are likely to be useful regardless of the computer-vision problem at hand.
conv_base <- application_vgg16(
  weights = "imagenet",
  include_top = FALSE,          # exclude the densely connected classifier on top
  input_shape = c(150, 150, 3)
)
### Feature Extraction ###
datagen <- image_data_generator(rescale = 1/255)
batch_size <- 20

extract_features <- function(directory, sample_count) {
  features <- array(0, dim = c(sample_count, 4, 4, 512))
  labels <- array(0, dim = c(sample_count))
  generator <- flow_images_from_directory(
    directory = directory,
    generator = datagen,
    target_size = c(150, 150),
    batch_size = batch_size,
    class_mode = "binary"
  )
  i <- 0
  while (TRUE) {
    batch <- generator_next(generator)
    inputs_batch <- batch[[1]]
    labels_batch <- batch[[2]]
    # Run the batch through the pretrained convolutional base
    features_batch <- conv_base %>% predict(inputs_batch)
    index_range <- ((i * batch_size) + 1):((i + 1) * batch_size)
    features[index_range,,,] <- features_batch
    labels[index_range] <- labels_batch
    i <- i + 1
    # Stop once every image in the directory has been seen once
    if (i * batch_size >= sample_count)
      break
  }
  list(
    features = features,
    labels = labels
  )
}
train <- extract_features(train_dir, 2000)
validation <- extract_features(validation_dir, 1000)
test <- extract_features(test_dir, 1000)
reshape_features <- function(features) {
  array_reshape(features, dim = c(nrow(features), 4 * 4 * 512))
}
train$features <- reshape_features(train$features)
validation$features <- reshape_features(validation$features)
test$features <- reshape_features(test$features)
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu",
              input_shape = 4 * 4 * 512) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = optimizer_rmsprop(lr = 2e-5),
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

history <- model %>% fit(
  train$features, train$labels,
  epochs = 30,
  batch_size = 20,
  validation_data = list(validation$features, validation$labels)
)
Alternatively, feature extraction can be done by extending conv_base with a densely connected classifier on top and running the whole model end to end on the inputs, which also allows data augmentation:

model <- keras_model_sequential() %>%
  conv_base %>%
  layer_flatten() %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
We can also freeze the convolutional base, i.e. prevent its weights from being updated during training, with freeze_weights(conv_base). This matters when extending conv_base directly: otherwise its pretrained representations would be destroyed by the large initial gradient updates from the randomly initialized dense layers.
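A short sketch of freezing the base before compiling (the optimizer settings mirror the feature-extraction example above):

cat("Trainable weight tensors before freezing:", length(model$trainable_weights), "\n")
freeze_weights(conv_base)
cat("Trainable weight tensors after freezing:", length(model$trainable_weights), "\n")
# Recompile the model so that the change takes effect
model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(lr = 2e-5),
  metrics = c("accuracy")
)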
Fine-tuning consists of unfreezing a few of the top layers of a frozen model base used for feature extraction, and jointly training both the newly added part of the model (in this case, the fully connected classifier) and these top layers.
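A sketch of the fine-tuning step, assuming the VGG16 base from above has been fully frozen first (the layer name "block5_conv1" is specific to VGG16):

# Unfreeze only the top layers of the convolutional base
unfreeze_weights(conv_base, from = "block5_conv1")
# Recompile with a very low learning rate to keep the updates small
model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(lr = 1e-5),
  metrics = c("accuracy")
)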
Visualizing intermediate activations is useful for understanding how successive convnet layers transform their input, and for getting a first idea of the meaning of individual convnet filters.
The features extracted by a layer become increasingly abstract with the depth of the layer. The activations of higher layers carry less and less information about the specific input being seen, and more and more information about the target.
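A minimal sketch of extracting intermediate activations, assuming a trained convnet model and a preprocessed image tensor img_tensor of shape (1, 150, 150, 3):

# Build a model that maps the input to the outputs of the first eight layers
layer_outputs <- lapply(model$layers[1:8], function(layer) layer$output)
activation_model <- keras_model(inputs = model$input, outputs = layer_outputs)
# Returns one activation array per layer, e.g. (1, 148, 148, 32) for the first conv layer
activations <- activation_model %>% predict(img_tensor)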
This general category of techniques is called class activation map (CAM) visualization, and it consists of producing heatmaps of class activation over input images. A class activation heatmap is a 2D grid of scores associated with a specific output class, computed for every location in any input image, indicating how important each location is with respect to the class under consideration.