Just here for code? Git repo.
This is a follow-up to my last post about enhancing images using a generative adversarial autoencoder structure. This post explains how it was done, and provides code so that the reader can hopefully replicate the results.
This project was done in Theano, and closely follows the code given for the DCGAN paper of Alec Radford et al. I refactored some things to make them easier to change around, and I had to change the architecture a bit. I originally tried porting the code over to Lasagne, a library built on top of Theano, but decided that doing so was only slowing me down. After this project I have come to think that, sadly, for small experimental projects using novel techniques, working with small simple modules on top of Theano is quicker than trying to twist your code to fit a given library.
The data is the CelebA image set, which is roughly 200k images of celebrities. I scaled and cropped these images to 128×160 (because these dimensions are divisible by large powers of 2) and saved them in HDF5 format (there is a Python script in the git repo that does this for you). This is the target set. The input data is each of these images scaled down by a factor of 4 using PIL’s resize tool, giving a size of 32×40. As with the original DCGAN code, we feed the images to the neural network using the fuel library.
In retrospect, I possibly should have trained the neural network on an easier dataset to start with, to give it more of a chance to excel. For example, using dlib I could have cropped and resized each image to just the bounding box around the face, à la OpenFace. This would have meant the model spent less time trying to learn the background, though as someone pointed out in the Reddit thread, this problem is already well solved by PCA methods (I found this quite surprising).
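As a rough illustration of how each (input, target) pair relates, here is a minimal NumPy sketch. It is not the script from the repo: the helper names are made up for this example, the scaling step before the crop is omitted, the portrait orientation (160 high, 128 wide) is an assumption, and block averaging stands in for PIL’s resize, whose filter the post does not specify.

```python
import numpy as np

TARGET_H, TARGET_W = 160, 128  # assumed orientation: 160 high, 128 wide
FACTOR = 4                     # downscaling factor stated in the post


def center_crop(img, out_h, out_w):
    """Cut the central out_h x out_w window out of an (H, W, C) array."""
    h, w = img.shape[:2]
    if h < out_h or w < out_w:
        raise ValueError("image is smaller than the requested crop")
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return img[top:top + out_h, left:left + out_w]


def downscale(img, factor):
    """Block-average downscale (a stand-in for PIL's resize)."""
    h, w, c = img.shape
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by the factor")
    blocks = img.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))


def make_pair(img):
    """Return (low-res input, full-res target) for one source image."""
    target = center_crop(img, TARGET_H, TARGET_W).astype(np.float32)
    return downscale(target, FACTOR), target


# Random data standing in for a photo that is larger than the crop.
image = np.random.RandomState(0).rand(218, 178, 3)
small, target = make_pair(image)
print(small.shape, target.shape)
```

The low-resolution array is what the autoencoder sees; the full-resolution array is what it is asked to reproduce.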
A generative adversarial autoencoder has the following structure.
For my experiment, the autoencoding step consisted of 2 convolutional/pooling steps followed by 4 deconvolutional/depooling steps (the fattest middle layer consisted of 1024 feature maps of size 8×10). The discriminator consisted of 4 convolutional/pooling steps followed by a straight sigmoid logistic regression on all of the outputs. As in Alec Radford’s code, we apply batch normalisation at each layer and use learned pooling/unpooling steps rather than plain max pooling.
We train in batches, alternating between updating the weights of the autoencoder (generator) and those of the discriminator.
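The layer counts above can be sanity-checked with a little arithmetic. The sketch below assumes every pooling or depooling step changes the resolution by a factor of 2, which is what the 32×40 input, the 8×10 middle layer and the 128×160 output imply; `trace_shapes` is a hypothetical helper, not something from the repo.

```python
def trace_shapes(size, n_down, n_up, factor=2):
    """List the spatial size after each down- and up-sampling step."""
    a, b = size
    shapes = [(a, b)]
    for _ in range(n_down):
        if a % factor or b % factor:
            raise ValueError("size is not divisible by the factor")
        a, b = a // factor, b // factor
        shapes.append((a, b))
    for _ in range(n_up):
        a, b = a * factor, b * factor
        shapes.append((a, b))
    return shapes


# Autoencoder: 2 steps down from the 32x40 input, then 4 steps up.
generator_shapes = trace_shapes((32, 40), n_down=2, n_up=4)
print(generator_shapes)
# [(32, 40), (16, 20), (8, 10), (16, 20), (32, 40), (64, 80), (128, 160)]

# Discriminator: 4 steps down from a full-size 128x160 image.
discriminator_shapes = trace_shapes((128, 160), n_down=4, n_up=0)
print(discriminator_shapes[-1])
# (8, 10)
```

This also shows why the target size needs to be divisible by a large power of 2: every halving has to land on a whole number of pixels.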
To update the discriminator, we take a generated fake image and its real target, and then punish the discriminator for thinking that the real image was fake and that the fake image was real (binary crossentropy). We only update the weights in the discriminator on this step.
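In plain NumPy, the discriminator cost described above might look like the following sketch. The function names are invented for this example, and simply summing the real and fake terms is an assumption; in the actual Theano code this would be a symbolic expression whose gradient is taken with respect to the discriminator weights only.

```python
import numpy as np


def binary_crossentropy(pred, target, eps=1e-7):
    """Mean binary crossentropy between probabilities and 0/1 labels."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    losses = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return float(np.mean(losses))


def discriminator_loss(p_real, p_fake):
    """p_real / p_fake: discriminator outputs (probability of 'real')
    on target images and on generated images respectively."""
    real_term = binary_crossentropy(p_real, np.ones_like(p_real))   # real called fake
    fake_term = binary_crossentropy(p_fake, np.zeros_like(p_fake))  # fake called real
    return real_term + fake_term


# A discriminator that separates the two well is punished less
# than one that gets them the wrong way round.
good = discriminator_loss(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
bad = discriminator_loss(np.array([0.1, 0.05]), np.array([0.9, 0.95]))
print(good < bad)  # True
```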
To update the generator, we have two loss functions. The first is the reconstruction cost: the MSE distance between the generated image and the target image. N.B. we actually run MSE on downsized versions of the target and generated images, first applying an average pooling layer, because the loss function wants the grainy images to look the same, not necessarily the upscaled images. If you try to use the MSE of the full-size images, you get the standard ‘blurred’ effect that MSE-trained autoencoders suffer from (I talked about this in the last post). The second loss function comes from the discriminator: we punish the generator for generating images that the discriminator correctly classifies as fake (binary crossentropy). We only update the weights in the autoencoder here, using an (ad hoc) linear combination of the two loss functions.
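The combined generator cost can be sketched in NumPy as below. This is an illustration under assumptions, not the repo's code: the pooling factor of 4 (chosen to match the downscaling factor), the default weights and all function names are made up here, and arrays are laid out as (batch, channels, rows, columns).

```python
import numpy as np


def avg_pool(x, k):
    """Average-pool a (N, C, H, W) batch over non-overlapping k x k blocks."""
    n, c, h, w = x.shape
    return x.reshape(n, c, h // k, k, w // k, k).mean(axis=(3, 5))


def binary_crossentropy(pred, target, eps=1e-7):
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    losses = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return float(np.mean(losses))


def generator_loss(generated, target, p_fake,
                   mse_weight=1.0, adv_weight=1.0, pool=4):
    """Linear combination of pooled MSE and the adversarial cost.

    p_fake is the discriminator's probability that each generated
    image is real; the generator wants it close to 1.
    """
    diff = avg_pool(generated, pool) - avg_pool(target, pool)
    mse = float(np.mean(diff ** 2))
    adv = binary_crossentropy(p_fake, np.ones_like(p_fake))
    return mse_weight * mse + adv_weight * adv, mse, adv


# Fine detail that averages out within each 4x4 block costs nothing
# under the pooled MSE, although the full-size MSE is large.
rows, cols = np.indices((160, 128))
checker = ((rows + cols) % 2) * 2.0 - 1.0      # +1 / -1 checkerboard
target = np.zeros((1, 3, 160, 128))
generated = target + checker                   # broadcasts over N and C
total, mse, adv = generator_loss(generated, target, np.array([0.5]))
full_size_mse = float(np.mean((generated - target) ** 2))
print(mse, full_size_mse)
```

The checkerboard case shows the point of pooling first: the reconstruction term only constrains the grainy version of the image, leaving the discriminator term free to decide what the fine detail should look like.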
I trained this model on my Titan X GPU. It took 24 hours to do something like 10 epochs.
Readers interested in training one of these models themselves would probably benefit from these notes:
- Use an IPython notebook (as in the git repo). I don’t know if Alec Radford et al. really trained their models in Python from the command line, as it would appear from their code, but training that way would be very tedious. It is invaluable to be able to monitor training, stop, adjust and restart, while not throwing away the weights already trained.
- Keep checking the numbers printed out by the logging (the code in the repo calculates costs on a holdout set every 100 batches).
- I found it was important to make sure that the model put emphasis on the MSE cost at the start by weighting the loss function towards it, otherwise that cost would never converge. Once it gets low you can dampen its weight a lot (scaled down by a factor of 100) without it diverging.
- The adversarial cost takes a looooong time to converge. Your faces will look completely crazy for the first few hours.
- If the generative adversarial cost is much higher than the discriminative cost, try doing two or three batches of generative updates for every one discriminative update. I sometimes found that the generator would get stuck, never being able to catch up to the discriminator, and this helped a lot. By the end though, you should probably be doing 50/50 for the best results. Remember that although we are competing against the discriminator, we also need it to get as good as possible.
- I think the discriminator I trained was not so great. It would probably be worth sticking a fully connected layer at the end before the final sigmoid function.
- It’s my belief that if you want to run this on a harder dataset, the fattest layer of the generator and discriminator needs to be a lot bigger than the one I used, so that it can realistically learn the different features needed for all the textures and shapes encountered in the wild. I tried running the model on small patches of images from ImageNet (examples in the last post), and the results were pretty poor. It also seems to me that for a dataset like ImageNet, one would have to take in a lot of the context data to be able to realistically fill in the textures, i.e. use global-level features of the data. Perhaps something like residual networks could help here.
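I made the adjustments in the notes above by hand in the notebook, but the rules of thumb about loss weighting, update ratios and holdout logging can be written down as small helpers. The sketch below is purely illustrative: every function name and every threshold in it is a made-up placeholder, not a value from the repo.

```python
def mse_weight(holdout_mse, low_enough, initial=1.0, damping=100.0):
    """Emphasise the MSE cost until it is low, then scale it down 100x."""
    return initial / damping if holdout_mse < low_enough else initial


def generator_steps(gen_adv_cost, disc_cost, lag_ratio=2.0, catch_up_steps=3):
    """Generator batches to run per discriminator batch.

    If the generator's adversarial cost is much higher than the
    discriminator's cost, give it a few extra updates; otherwise 50/50.
    """
    return catch_up_steps if gen_adv_cost > lag_ratio * disc_cost else 1


def should_evaluate(batch_index, every=100):
    """Check costs on the holdout set every `every` batches."""
    return batch_index % every == 0


# Early in training: MSE still high, generator lagging behind.
print(mse_weight(0.2, low_enough=0.01), generator_steps(3.0, 0.4))
# Later: MSE has converged and the two costs are comparable.
print(mse_weight(0.005, low_enough=0.01), generator_steps(0.8, 0.7))
```

What counts as "low enough" or "much higher" depends on the dataset and the loss scaling, so treat the numbers as knobs to watch rather than settings to copy.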