
Variable Scoping in Python

Thursday 16 May 2019

There is one thing that absolutely drives me crazy in Python, and that is the fact that you can access a variable defined outside of a function from within the function without passing it as an argument. I'm not going to lie, that does come in handy at times - especially when you are working with APIs - but it is still a terrible way to do things.

I like the way scoping is done in C++ - each variable is only valid within the block in which it is declared - but obviously that doesn't translate directly to Python since we don't declare variables. Even so, I think the only variables available in a function should be the ones created in it or passed to it as arguments. In Python, a function can't rebind a variable declared outside of it (without the global keyword), only read it, but if the variable is a mutable object you can still modify it through its methods. So if we have a list and we append to it inside a function with list.append(), it will actually modify the outer list, which is crazy.
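As a short sketch of the behavior I mean (the names here are just for illustration):

items = []

def add_item():
    # no assignment to items here, so Python resolves the name in the
    # enclosing scope - append() mutates the outer list even though it
    # was never passed in
    items.append('surprise')

add_item()
print(items)  # ['surprise']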

Being able to access variables declared outside of a function makes it easy to write bugs and hard to find them. Several times I have declared a variable outside of a function, passed it into the function under a different name, and then accidentally used the external variable instead of the parameter inside the function. The code runs, just not as expected, and bugs like that are difficult to track down.
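Here is a contrived sketch of that kind of bug (the names are hypothetical):

rate = 0.5

def scale(factor):
    # bug: uses the outer rate instead of the factor parameter -
    # the code runs fine, it just doesn't do what was intended
    return 100 * rate

print(scale(0.25))  # prints 50.0, not 25.0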

This makes me see much of the merit in functional programming languages like Scala.

Labels: coding, python

AWS Lambda

Thursday 07 March 2019

I've been working with AWS Lambda recently and I am very impressed. Usually if I need to set up a microservice or a recurring task I'll just set something up on one of my virtual servers, so I didn't think Lambda would be all that useful. But it makes it really, really easy to set up small tasks, and it is much cheaper than running a whole virtual server.

You can write functions in a number of different languages and set up a variety of triggers, ranging from HTTP requests to scheduled tasks; when the Lambda is triggered, AWS spins it up, executes it, and then shuts it down. Since it is so ephemeral it is completely stateless, but you can load files from S3 buckets if you need data of any sort. I assume you can probably also connect to a variety of AWS databases, although I haven't tried this yet. If you need additional libraries or packages that are not included by default, you can create a layer containing them.
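As a rough sketch of what one of these looks like, here is a minimal Python handler that pulls a file from S3 on each invocation (the bucket and key names are hypothetical):

import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # the function is stateless, so load any needed data from S3
    obj = s3.get_object(Bucket='my-data-bucket', Key='config.json')
    config = json.loads(obj['Body'].read())
    return {'statusCode': 200, 'body': json.dumps(config)}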

Lambda is not going to replace servers for most use cases, but I think serverless technology is going to make quite a dent in the near future.

Labels: coding, aws, lambda

CatBoost

Thursday 10 January 2019

Usually when you think of a gradient boosted decision tree you think of XGBoost or LightGBM. I'd heard of CatBoost but I'd never tried it and it didn't seem too popular. I was looking at a Kaggle competition which had a lot of categorical data and I had squeezed just about every drop of performance I could out of LGBM so I decided to give CatBoost a try. I was extremely impressed.

Out of the box, with all default parameters, CatBoost scored better than the LGBM I had spent about a week tuning. CatBoost trains significantly slower than LGBM on a CPU, but it will run on a GPU, and doing so makes it only slightly slower than LGBM. Unlike XGBoost, it can handle categorical data natively, which is nice because in this case there are far too many categories for one-hot encoding. I've read the documentation several times and I am still unclear on exactly how it encodes the categorical data, but whatever it does works very well.
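Using it is about as simple as it gets. Here is a minimal sketch, assuming a DataFrame X with some categorical columns and a target y (the column names are hypothetical):

from catboost import CatBoostClassifier

# columns to treat as categorical - no one-hot encoding needed
cat_features = ['category_a', 'category_b']

model = CatBoostClassifier(task_type='GPU')  # everything else at defaults
model.fit(X, y, cat_features=cat_features)
preds = model.predict_proba(X)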

I am just beginning to try to tune the hyperparameters so it is unclear how much (if any) extra performance I'll be able to squeeze out of it, but I am very, very impressed with CatBoost and I highly recommend it for any datasets which contain categorical data. Thank you Yandex! 

Labels: coding, data_science, machine_learning, kaggle, catboost

CoLab TPUs

Tuesday 09 October 2018

The other day I was having problems with a CoLab notebook and while trying to debug it I noticed that TPU is now an option for the runtime type. I found no references to this in the CoLab documentation, but apparently it was quietly introduced recently. For anyone who doesn't know, TPUs are chips designed by Google specifically for matrix multiplication, and they are supposedly incredibly fast. Last I checked, the cost to rent one through GCP was about $6 per hour, so having access to one for free could be a huge benefit.

As TPUs are specialized chips you can't just run the same code as on a CPU or a GPU. TPUs do not support all TensorFlow operations and you need to create a special optimizer to be able to take advantage of the TPU at all. The model I was working with at the time was created using TensorFlow's Keras API so I decided to try to convert that to be TPU compatible in order to test it.

Normally you would have to use a cross shard optimizer, but there is a shortcut for Keras models:

import os
import tensorflow as tf

# get the address of the TPU cluster that CoLab has allocated
TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']

# convert the existing Keras model to a TPU-compatible model
tpu_model = tf.contrib.tpu.keras_to_tpu_model(
    keras_model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))

The TPU_WORKER line finds an available TPU and gets its address. The keras_to_tpu_model call takes your Keras model as input and converts it to a TPU-compatible model. Then you train using tpu_model.fit() instead of keras_model.fit(). This was the easy part.
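For example (the data names, batch size, and epoch count here are just illustrative):

# train on the TPU exactly as you would a normal Keras model
tpu_model.fit(train_x, train_y, batch_size=64, epochs=10)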

For this particular model I am using a lot of custom functions for loss and metrics, and many of them turned out not to be compatible with TPUs, so they had to be rewritten. While this was annoying at the time, it turned out to be worth it regardless of the TPU, because making the functions compatible forced me to optimize them. The specific operations that were not compatible were non-matrix ops - logical operations and boolean masks specifically. Some of the code was downright hideous, and this forced me to sit down, think it through, and rewrite it in a much cleaner manner, vectorizing as much as possible.
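As a sketch of the kind of rewrite involved, a boolean mask can usually be replaced with a cast and a multiply, which keeps everything as static-shaped matrix math (the tensors here are just placeholders):

import tensorflow as tf

values = tf.constant([1.0, 2.0, 3.0])
mask = tf.constant([True, False, True])

# TPU-unfriendly: boolean_mask produces a dynamically shaped tensor
# masked_sum = tf.reduce_sum(tf.boolean_mask(values, mask))

# TPU-friendly: cast the mask and multiply instead
masked_sum = tf.reduce_sum(values * tf.cast(mask, values.dtype))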

After all that effort, so far my experience with the TPUs hasn't been all that great. I can train my model with a significantly larger batch size - whereas the maximum batch size on an Nvidia K80 was 16, I am currently training with batches of 64 on the TPU and may be able to push that even higher. However, the time per epoch hasn't really improved much - about 1750 seconds on the TPU versus 1850 seconds on the K80. I have read that code may need to be altered further to take full advantage of TPUs, and I have not yet experimented with the batch size to see how that changes the performance.

I suspect that if I did some more research about TPUs and coded the model to be optimized for a TPU from scratch, there might be a more noticeable performance gain, but that is based solely on hearing other people talk about how fast they are, not on my own experience.

Update - I have realized that the data augmentation is the bottleneck which is limiting the speed of training. I am training with a Keras generator which performs the augmentation on the CPU and if this is removed or reduced the TPUs do, in fact, train significantly faster than a GPU and also yield better results.

Labels: coding, machine_learning, google_cloud

Google Cloud Storage with CoLab

I have previously written about Google CoLab, which is a way to access Nvidia K80 GPUs for free, but only for 12 hours at a time. After a few months of using Google Cloud instances with GPUs I have run up a substantial bill and have reverted to using CoLab whenever possible. The main problem with CoLab is that the instance is terminated after 12 hours, taking all files with it, so in order to reuse files you need to save them somewhere.

Until recently I had been saving my files to Google Drive with this method, but while it is easy to save files to Drive it is much more difficult to read them back. As far as I can tell, in order to do this with the API you need to get the file id from Drive, and even then it is not so straightforward to get the files into CoLab. To deal with this I had been uploading files that needed to be accessed often to an AWS S3 bucket and then downloading them to CoLab with wget, which works fine, but there is a much simpler way to do the same thing using Google Cloud Storage instead of S3.

First you need to authenticate CoLab to your Google account with:

from google.colab import auth

auth.authenticate_user()

Once this is done you need to set your project and bucket name and then update the gcloud config:

project_id = [project_name]
bucket_name = [bucket_name]
!gcloud config set project {project_id}

After this has been done, files can be quickly uploaded to or downloaded from the bucket with the following commands:

# download
!gsutil cp gs://{bucket_name}/foo.bar ./foo.bar

# upload
!gsutil cp ./foo.bar gs://{bucket_name}/foo.bar

I have actually been adding the upload line to my training code so that the weights are automatically backed up to GCS every couple of epochs, which removes the need for me to manually back them up throughout the day.
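For anyone who wants to do the same, here is a minimal sketch of such a callback, assuming a Keras model and the bucket_name variable from above (the callback name and weights file name are hypothetical):

import subprocess
from tensorflow import keras

class GCSBackup(keras.callbacks.Callback):
    def __init__(self, bucket_name, every=2):
        super().__init__()
        self.bucket_name = bucket_name
        self.every = every

    def on_epoch_end(self, epoch, logs=None):
        # every few epochs, save the weights and copy them to the bucket
        if (epoch + 1) % self.every == 0:
            self.model.save_weights('weights.h5')
            subprocess.run(['gsutil', 'cp', 'weights.h5',
                            'gs://{}/weights.h5'.format(self.bucket_name)],
                           check=True)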

Labels: coding, python, machine_learning, google, google_cloud