FashionMNIST with FluxML
[Part 1] Solving FashionMNIST for Google Code-In for Julia
Task : Add the Fashion MNIST model to the Flux model-zoo
Task Description
Fashion MNIST is a good way to introduce the concept of autoencoders and classification tasks. Write an efficient Fashion MNIST implementation using Flux and benchmark it against equivalent implementations in TensorFlow and PyTorch. A good extension might be to have it run smoothly on GPUs too. The FashionMNIST dataset can be easily obtained and unpacked into ready-to-use Julia data types with the help of MLDatasets.jl. A working example of using Flux for classification of handwritten digits from the MNIST dataset can be found here, for students who are already familiar with basic image detection techniques and want to hit the ground running. Flux's documentation can be found here. Another reference that may interest you is this notebook that handles the FashionMNIST dataset using Keras and TensorFlow.
If you need GPUs for your task, I'd recommend NextJournal. If you need help setting up a Julia GPU session, just use the interactive chat icon at the bottom of the screen or contact one of the mentors on Slack.
I must say explicitly that the code from the model zoo and the other sources mentioned is solely for reference. Please do not copy-paste code without writing proper comments or understanding it.
Introduction
In this notebook I'll be using the Flux machine learning package for Julia to build a simple multilayer perceptron that classifies Zalando's article images. Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Sample here:
Let's start by importing the required packages.
using Flux
using Statistics
using Flux: onehotbatch, onecold, crossentropy, throttle
using Base.Iterators: repeated
Flux already has the FashionMNIST built in. Source.
#getting the data from Flux itself
images = Flux.Data.FashionMNIST.images()
labels = Flux.Data.FashionMNIST.labels();
# Example of an image and a label below.
r = rand(1:60000)
The corresponding image and label for this random data point are:
images[r]
labels[r] #It's a dress.
Image Pre-Processing
We'll use reshape to flatten each image into a vector, hcat (similar to numpy's hstack) to stack those vectors as the columns of a single matrix, and onehotbatch to encode the categorical labels as true/false values.
More about onehotbatch:
One-hot encoding allows us to convert our categorical labels ("Trouser", "Dress", "Coat", etc.) into machine-readable values.
That is, if our label is "Pullover", it would be one-hot encoded against the categories ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
as [false, false, true, false, false, false, false, false, false, false].
In the resulting matrix, each column is the one-hot encoding of the label at the corresponding position in the label vector.
X = hcat(float.(reshape.(images, :))...)
Y = onehotbatch(labels, 0:9)
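To make the two preprocessing steps concrete, here is a minimal sketch in plain Julia, without Flux. The helper onehot_matrix and the tiny 2x2 stand-in images are hypothetical and only illustrate the shapes involved; Flux's own onehotbatch returns a specialised one-hot type rather than a plain Bool matrix.

```julia
# Hypothetical helper written with Base Julia only; it is not Flux's onehotbatch.
function onehot_matrix(labels, classes)
    # One row per class, one column per label.
    return [c == l for c in classes, l in labels]
end

# Three fake 2x2 "images" standing in for the 28x28 FashionMNIST ones.
fake_images = [fill(Float32(i), 2, 2) for i in 1:3]

# vec flattens each image to a column vector; hcat stacks the columns.
X_small = hcat(vec.(fake_images)...)

# Labels 2, 0 and 9 correspond to Pullover, T-shirt/top and Ankle boot.
Y_small = onehot_matrix([2, 0, 9], 0:9)

println(size(X_small))   # (4, 3): 4 pixels per image, 3 images
println(size(Y_small))   # (10, 3): 10 classes, 3 labels
println(Y_small[:, 1])   # true only in row 3, i.e. class 2
```

With the real data the same layout gives a 784x60000 feature matrix and a 10x60000 label matrix.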
Creating our ML model (more a Neural Network with 2 layers)
Read more on Flux Layers.
We'll be chaining two dense layers to go from a 28^2 dimensional space (28x28 images) to our predictions, which are 10-dimensional (one entry for each of the article types 0-9).
model = Chain(Dense(28^2, 32, relu), Dense(32, 10), softmax)
softmax : Softmax is an activation function that normalizes an input vector into a vector of values that form a probability distribution, i.e. non-negative values that sum to 1. In other words, softmax just converts the raw outputs into a probability distribution.
Relu is another non-linear activation function widely used in neural networks. It simply sets all negative values it receives to 0. We use it because we want to introduce a non-linearity into the network; without one, the two dense layers would collapse into a single linear map. Other functions such as sigmoid or tanh do similar things, and both are bounded, but relu is cheaper to compute and gives good results.
For relu, see this link to a Why/How answer on Stack.
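As an illustration of the two activation functions, here is a small hand-written sketch. The names my_relu and my_softmax are hypothetical; Flux ships its own relu and softmax, and this is not their implementation.

```julia
# Hand-written versions for illustration only.
my_relu(x) = max(zero(x), x)

function my_softmax(v)
    # Subtracting the maximum avoids overflow in exp without changing the result.
    e = exp.(v .- maximum(v))
    return e ./ sum(e)
end

scores = [-1.0, 0.0, 2.0]
println(my_relu.(scores))   # negative entries become 0.0
p = my_softmax(scores)
println(p)                  # all entries positive, largest for the largest score
println(sum(p))             # approximately 1.0
```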
In order to train our model in Flux, we need three functions:
- a loss/objective function - tells the network how far its predictions are from the targets and is what gradient descent minimizes
- an optimizer - a function that updates the weight parameters of the network to decrease the loss, i.e. it carries out the gradient descent steps
- an evaluation function to show the progress of the training
Awesome documentation on Training from the Flux team here
Now we write our loss function (we'll be using the log loss, better known as cross-entropy). It measures the performance of a binary or multiclass classification model (ours is multiclass).
Math behind it here
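For a rough feel of what the cross-entropy computes, here is a sketch in plain Julia. The helper my_crossentropy is hypothetical and assumes the loss is averaged over the columns of the batch; the predicted probabilities are made up.

```julia
# Hypothetical helper: mean cross-entropy over the columns of a batch.
# yhat holds predicted probabilities, y holds one-hot targets of the same shape.
function my_crossentropy(yhat, y)
    return -sum(y .* log.(yhat)) / size(y, 2)
end

# Two samples (columns), three classes (rows).
yhat = [0.7 0.1;
        0.2 0.8;
        0.1 0.1]
y = [1 0;
     0 1;
     0 0]

println(my_crossentropy(yhat, y))   # equals (-log(0.7) - log(0.8)) / 2
```

Only the probability assigned to the true class of each sample contributes, so the loss shrinks as those probabilities approach 1.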
ADAM will be our optimizer (it can be thought of as a more advanced form of gradient descent). It is an adaptive learning rate method, which means it computes individual learning rates for different parameters.
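The following is a single-parameter sketch of the Adam update rule, written from the published formulas rather than taken from Flux. The function name adam_minimize, the toy objective and the hyperparameter values are assumptions chosen for illustration.

```julia
# Sketch of Adam for one scalar parameter (not Flux's implementation).
function adam_minimize(grad, w; eta=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, steps=1000)
    m = 0.0   # running mean of gradients
    v = 0.0   # running mean of squared gradients
    for t in 1:steps
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g^2
        m_hat = m / (1 - beta1^t)   # bias correction for the zero initialisation
        v_hat = v / (1 - beta2^t)
        # The step is scaled by the gradient history, giving a per-parameter rate.
        w -= eta * m_hat / (sqrt(v_hat) + epsilon)
    end
    return w
end

# Minimise f(w) = (w - 3)^2, whose gradient is 2(w - 3).
w_final = adam_minimize(w -> 2 * (w - 3), 0.0; eta=0.1, steps=500)
println(w_final)   # close to 3
```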
#1
loss(x, y) = crossentropy(model(x), y)
#2
optim = ADAM();
#3
accuracy(x, y) = mean(onecold(model(x)) .== onecold(y)) # fraction of correct predictions
dataset = repeated((X,Y),200) # repeat the full dataset 200 times, i.e. 200 full-batch training steps
evalcb = () -> @show(loss(X, Y)) # callback to show loss
Flux.train!(loss, params(model), dataset, optim, cb = throttle(evalcb, 10));
This took around 13 minutes to run on my potato PC. To make it faster we could do smarter things like batching, which I still have to learn how to implement.
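As a rough idea of what batching could look like, the sketch below splits a column-per-sample matrix into minibatches with Base.Iterators.partition. The stand-in data and the names X_demo and y_demo are made up; with the real data, a vector of (features, labels) tuples like this could be passed to Flux.train! in place of the repeated dataset.

```julia
using Base.Iterators: partition

# Stand-in data: 4 features, 10 samples stored as columns, with integer labels.
X_demo = reshape(collect(1.0:40.0), 4, 10)
y_demo = collect(0:9)

batch_size = 4
# Each batch is a tuple (features, labels) covering a contiguous block of columns.
batches = [(X_demo[:, idx], y_demo[idx]) for idx in partition(1:size(X_demo, 2), batch_size)]

println(length(batches))         # 3 batches: 4 + 4 + 2 samples
println(size(batches[end][1]))   # (4, 2): the last batch is smaller
```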
r = rand(1:10000) # a random data point in the 10,000-image test set.
Flux.Data.FashionMNIST.images(:test)[r] #visualising
Same image pre-processing as before, but on the test set this time.
X_test = hcat(float.(reshape.(Flux.Data.FashionMNIST.images(:test), :))...)
Y_test = onehotbatch(Flux.Data.FashionMNIST.labels(:test), 0:9);
Running our model on X_test.
A 10-element array is returned with values between 0 and 1 (the predicted probability of each category). Here the model is saying that the input image is 99.94% likely to be a trouser (0.9994) and 0.013% likely to be a dress (0.00013).
model(X_test[:,r])
# We can use onecold to get the predicted class and compare it with the true one:
onecold(model(X_test[:,r])) - 1 # the minus one accounts for Julia's 1-based indexing:
# our labels start from 0 but Julia's indices start at 1.
println("Train Accuracy : ", accuracy(X,Y)*100, " %")
println("Test Accuracy : ", accuracy(X_test,Y_test)*100," %")
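To show what onecold and the accuracy function are doing, here is a plain Julia sketch on made-up scores. The helper my_onecold is a hypothetical stand-in that returns the index of the largest entry in each column.

```julia
using Statistics: mean

# Hypothetical Base-only stand-in for onecold: index of the largest entry per column.
my_onecold(m) = [argmax(col) for col in eachcol(m)]

# Made-up scores for 3 samples over 4 classes (one column per sample).
probs = [0.1 0.7 0.2;
         0.6 0.1 0.1;
         0.2 0.1 0.3;
         0.1 0.1 0.4]
truth = [2, 1, 3]   # 1-based class indices

pred = my_onecold(probs)
println(pred)                   # [2, 1, 4]
println(pred .- 1)              # the same predictions as 0-based labels
println(mean(pred .== truth))   # fraction of correct predictions, here 2 out of 3
```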
Based on the benchmarks provided by Zalando Research, our simple neural network's accuracy ranks in the top 20% of the listed results!
-PseudoCodeNerd
P.S: 85.65% on the test set was achieved via the resnet34 pre-trained CNN by fastAI. Notebook attached.
Update
Adding a ConvNet
This part is heavily influenced by https://github.com/FluxML/model-zoo/blob/master/vision/mnist/conv.jl.
using Flux, Statistics
using Flux: onehotbatch, onecold, crossentropy, throttle
using Base.Iterators: repeated, partition
using Printf, BSON
images = Flux.Data.FashionMNIST.images();
labels = Flux.Data.FashionMNIST.labels();
#Bundle images together with labels and group into minibatches.
function make_minibatch(X, Y, idxs)
    X_batch = Array{Float32}(undef, size(X[1])..., 1, length(idxs))
    for i in 1:length(idxs)
        X_batch[:, :, :, i] = Float32.(X[idxs[i]])
    end
    Y_batch = onehotbatch(Y[idxs], 0:9)
    return (X_batch, Y_batch)
end
batch_size = 128
mb_idxs = partition(1:length(images), batch_size)
train_set = [make_minibatch(images, labels, i) for i in mb_idxs]
test_imgs = Flux.Data.FashionMNIST.images(:test)
test_labels = Flux.Data.FashionMNIST.labels(:test)
test_set = make_minibatch(test_imgs, test_labels, 1:length(test_imgs))
Now that we have examined and prepared the data, we can build our CNN to learn the mapping from article images to their labels.
There are 3 convolutional layers, 3 MaxPool layers, a dense layer and finally a softmax that produces outputs in the range 0 to 1.
Convolution : The convolutional layer convolves the article image with a filter whose values are learned during training. Convolution is best explained as a small matrix traveling along a larger matrix, doing an element-wise multiplication against each window and summing the products to get a resulting convolved feature matrix. For further reading, check out this article that explains the convolution step very well with an animation of the convolution occurring between matrices. The Conv function below takes the filter dimensions as its first parameter. In other words, the first layer will use a 3x3 matrix and slide it over the image pixels to learn a feature detection filter. The first layer will produce 16 output channels from 1 input channel, as indicated by 1=>16 in the second parameter. The relu function, passed as the last argument, simply sets all negative values computed by the layer to 0.
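The sliding-window idea can be sketched in a few lines of plain Julia. The function slide_filter is hypothetical and deliberately naive: it uses no padding and does not flip the kernel, so strictly speaking it computes a cross-correlation, and it is not how Flux's Conv layer is implemented.

```julia
# Naive sliding-window filter without padding (kernel is not flipped).
function slide_filter(img, kernel)
    kh, kw = size(kernel)
    oh, ow = size(img, 1) - kh + 1, size(img, 2) - kw + 1
    out = zeros(eltype(img), oh, ow)
    for i in 1:oh, j in 1:ow
        # Element-wise multiply the window with the kernel and sum the products.
        out[i, j] = sum(img[i:i+kh-1, j:j+kw-1] .* kernel)
    end
    return out
end

img = Float64[1 2 3 4;
              5 6 7 8;
              9 10 11 12;
              13 14 15 16]
kernel = ones(3, 3)   # an all-ones kernel simply sums each 3x3 window

println(slide_filter(img, kernel))   # a 2x2 matrix of window sums
```

Without padding a 4x4 input shrinks to 2x2; the pad=(1,1) used in the model keeps the output the same size as the input.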
MaxPool : After the convolution layer, we apply another function called max pooling, which reduces the dimensionality of the features that the network is learning while keeping the important feature information. The pooling function in our network slides a 2x2 window over the 2D input and pools each window into a single value. So a 28x28 matrix pooled with a 2x2 window would be reduced to a 14x14 matrix. The pooling function can be the max value, the average, or the sum. Max is the most commonly used, since it seems to be very effective.
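Here is a minimal sketch of 2x2 max pooling on a single 2D matrix. The function maxpool2d is hypothetical and assumes non-overlapping windows and dimensions divisible by the window size; Flux's MaxPool works on 4D batches instead.

```julia
# Non-overlapping k x k max pooling on a 2D matrix.
# Assumes both dimensions are divisible by k.
function maxpool2d(x, k)
    oh, ow = div(size(x, 1), k), div(size(x, 2), k)
    return [maximum(x[(i-1)*k+1:i*k, (j-1)*k+1:j*k]) for i in 1:oh, j in 1:ow]
end

x = [1 3 2 4;
     5 6 1 2;
     7 2 9 1;
     3 4 5 6]

println(maxpool2d(x, 2))                    # [6 4; 7 9]
println(size(maxpool2d(rand(28, 28), 2)))   # (14, 14)
```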
Role of the above layers : The first 6 layers (3 Conv, 3 MaxPool) are used for feature extraction. These layers take a 28x28 image as input and aggregate it into a simpler set of features representing the image. As more images are passed through these 6 layers, the feature extractor learns a simplified set of features from the images fed through it. This makes it much easier for the remaining part of the network to classify the articles.
Dense : The reshape function's job is to get the pooled data into a form that the Dense layer can process. The dense layer takes 288 inputs (3 x 3 x 32) from the last MaxPool layer and produces 10 outputs. The outputs are then passed through the softmax function, which squashes them to values between 0 and 1. All values produced by softmax for one image total 1.
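The 288 can be checked with a little shape arithmetic. The helpers below are hypothetical and assume 3x3 convolutions with padding 1 and stride 1, followed by 2x2 pooling that drops any leftover row or column, which matches the layer settings used in this model.

```julia
# Output side length of a convolution: (n + 2*pad - k) / stride + 1, rounded down.
conv_out(n; k=3, pad=1, stride=1) = div(n + 2 * pad - k, stride) + 1
# Output side length of non-overlapping pooling (leftover rows are dropped).
pool_out(n; k=2) = div(n, k)

# Apply `blocks` Conv + MaxPool pairs and count the flattened features.
function feature_count(side, channels; blocks=3)
    for _ in 1:blocks
        side = pool_out(conv_out(side))
    end
    return side, side * side * channels
end

side, n_features = feature_count(28, 32)
println(side)         # 3, after 28 -> 14 -> 7 -> 3
println(n_features)   # 288 = 3 * 3 * 32
```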
The error between the output and the actual label is fed back through the network to train its weights. This is called backpropagation. Deep learning uses a technique called gradient descent to adjust the weights of the network based on the error. As the weights are adjusted with each new training batch of (image, label) pairs, the model gets better and better at predicting the article fed through it.
model = Chain(
    Conv((3, 3), 1=>16, pad=(1,1), relu),
    MaxPool((2,2)),
    Conv((3, 3), 16=>32, pad=(1,1), relu),
    MaxPool((2,2)),
    Conv((3, 3), 32=>32, pad=(1,1), relu),
    MaxPool((2,2)),
    # Reshape the 4d tensor into a 2d one, at this point it should be (3, 3, 32, N)
    # which is where we get the 288 in the `Dense` layer below:
    x -> reshape(x, :, size(x, 4)),
    Dense(288, 10),
    softmax,
)
# precompiling model before starting our training loop
model(train_set[1][1])
# `loss()` calculates the crossentropy loss between our prediction `y_hat`
# (calculated from `model(x)`) and the ground truth `y`. We augment the data
# a bit, adding gaussian random noise to our image to make it more robust.
function loss(x, y)
    # We augment `x` a little bit here, adding in random noise
    x_aug = x .+ 0.1f0*gpu(randn(eltype(x), size(x)))
    y_hat = model(x_aug)
    return crossentropy(y_hat, y)
end
accuracy(x, y) = mean(onecold(model(x)) .== onecold(y))
opt = ADAM(0.001)
best_acc = 0.0
last_improvement = 0
for curr_epoch in 1:100
    global best_acc, last_improvement
    # Train for a single epoch
    Flux.train!(loss, params(model), train_set, opt)
    # Calculate accuracy.
    acc = accuracy(test_set...)
    @info(@sprintf("[%d]: Test accuracy: %.4f", curr_epoch, acc))
    # If this is the best accuracy we've seen so far, save the model out
    if acc >= best_acc
        @info(" New best accuracy! Saving model out to mnist_conv.bson")
        BSON.@save "mnist_conv.bson" model curr_epoch acc
        best_acc = acc
        last_improvement = curr_epoch
    end
    # If we haven't seen improvement in 5 epochs, drop our learning rate:
    if curr_epoch - last_improvement >= 5 && opt.eta > 1e-6
        opt.eta /= 10.0
        @warn(" Haven't improved in a while, dropping learning rate to $(opt.eta)!")
        # After dropping learning rate, give it a few epochs to improve
        last_improvement = curr_epoch
    end
    # Stop once we have gone 10 epochs without improvement
    if curr_epoch - last_improvement >= 10
        @warn(" converged.")
        break
    end
end