I was training a ConvNet and everything was working fine during training. But when I evaluated the model on the validation data I was getting NaN for the cross entropy. I thought it was the cross entropy attempting to take the log of 0 and added a small epsilon value of 1e-10 to the logits to address that. I thought that would fix the problem but it did not.

Further investigation indicated that the NaNs were being introduced somewhere early in the network, in one of the convolutional layers. I checked the validation and training data to make sure there wasn't some fundamental difference between the two, thinking that maybe one was being pre-processed differently than the other, but that was not the case.

In my graph I am using tf.metrics and gathering all of the update ops into one op to be executed during training with:

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

Also gathered into this op was the updates to the batch norms. I had done this many times before with no problems at all so never thought this could be a factor. But when I removed the extra update op from the evaluation code the problem went away. Including the ops generated to update my metrics individually caused no problems. 

I am not sure what the issue actually was, but I assume it has something to do with the batch normalization, or maybe there is another op created somewhere in my graph that caused this issue.

Update - I had been restoring the weights from a pre-trained model and I think the restored batch norms caused the problem. NOT restoring the batch norms when loading the weights seems to solve this problem completely. Otherwise the issue still occurred sporadically.

Labels: python, machine_learning, tensorflow
No comments

TensorFlow GPU Bottlenecks

Wednesday 16 May 2018

I was training a model on a Google Cloud instance with a Tesla K80 GPU. This particular model had more data pre-processing required than normal. The model was training very slowly, the GPU usage was oscillating between 0% and 75-100%. I thought the CPU was the bottleneck and was trying to put as much pre-processing on the GPU as possible.

I read TensorFlow's optimization guide, which suggested forcing the pre-processing to be on the CPU by enclosing it with:

with tf.device('/cpu:0'):

Since I thought the CPU was the bottleneck I didn't think that would help, but I tried it anyway because I had no other good ideas and was surprised that it worked like magic! The GPU usage now stays constant around 95-100% while the CPU usage stays at about the same levels as before.

Labels: machine_learning, tensorflow, google_cloud
No comments

I have been working on a project to detect abnormalities in mammograms. I have been training it on Google Cloud instances with Nvidia Tesla K80 GPUs, which allow a model to be trained in days rather than weeks or months. However when I tried to do online data augmentation it became a huge bottleneck because it did the data augmentation on the CPU.

I had been using tf.image.random_flip_left_right and tf.image.random_flip_up_down but since those operations were run on the CPU the training slowed down to a crawl as the GPU sat idle waiting for the queue to be filled.

I found this post on Medium, Data Augmentation on GPU in Tensorflow, which uses tf.contrib.image instead of tf.image. tf.contrib.image is written to run on the GPU, so using this code allows the data augmentation to be performed on the GPU instead of the CPU and thus eliminates the bottleneck.

This has been a life saver for me. Adding it to my graph allows me to train for longer without overfitting and this get better results.

Labels: python, machine_learning, tensorflow
No comments

Machine Learning

Monday 16 April 2018

While we tend to think of ourselves as being at the pinnacle of evolution, in reality humans are barely a step up from monkeys. Our only real differentiation from them is that we have language, which allows us to communicate knowledge to others and to preserve that knowledge through time. Language gives us a huge advantage, allowing us to progressively accumulate new knowledge by building up on previous discoveries, but in the end we are just animals who have evolved to survive, like all other animals. We are not adapted to having civilizations and technology, we evolved to find food and procreate and the results of this can be seen all over - from how tech companies use simple tricks like noises and bright colors and intermittent rewards to keep us hooked, to how food companies load their food with salt, fat and sugar to keep us eating unhealthy food, to the cognitive biases and heuristics we use to make decisions under uncertainty. The point of all of this is that humans evolved to find food and avoid predators, and our brains are incredibly ill-suited to processing the large amounts of data that are required to make evaluations about the types of issues that we face everyday in today's complex world.

Computers on the other hand are designed for processing large amounts of data - they can do this very efficiently if programmed correctly. However they lack our creativity - the ability to combine seemingly unrelated ideas into new ideas, and to come up with novel solutions to problems. Machine learning combines our creativity with the ability of computers to handle large amounts of data, specifically the ability to find patterns in data. In 2015 a journalist ran a study on chocolate, the results of which were that chocolate helps you lose weight. The study was commissioned as an example of "junk science" and only had 15 participants with 18 measurements for each participant. The author said “here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a ‘statistically significant’ result.” Unfortunately much of science is conducted like this - the authors start a study to proof a hypothesis and maybe they'll ignore some results which contradict the hypothesis if other results confirm it. No one likes to be wrong and if the results are a bit ambiguous you can maybe just cherry pick the numbers you like. In Darrell Huff's 1954 book "How to Lie with Statistics" he says "if you torture the data long enough it will confess to anything" and this is in fact the case. 

Machine learning is about letting the data speak for itself. With machine learning you set up a system of symbolic equations for transforming data into predictions and then feed the data into that system and see what happens. If you don't like the results you can change the system or the data, but the process is far too complex to be able to cherry pick the data you like and discard the rest. This combines the strengths of humans with the strengths of computers - the humans use their creativity and domain knowledge to create the system which they hope will find patterns in the data and the computers run the data through the system. While technically possible to do so, the process of analyzing the data is far more complex than could ever be done without a computer, and the computers can only do what they are told to do - they can not create novel ideas from nothing.

In my opinion, machine learning is the most important scientific technology in recent history. Just like electricity allowed energy to become uncoupled from the previous sources - fire and animal energy - machine learning uncouples the ability to process data from the constraints of the human brain. Properly used, I think machine learning will be as revolutionary as electricity was.

Labels: machine_learning
No comments

Archives