EN-FR Machine Translation with Various RNN Models in Google CoLab (1)

Park Chansung
8 min read · Mar 29, 2018
Photo by Daniel Hjalmarsson on Unsplash

Across two articles, I am going to implement deep learning models for English-to-French machine translation. The training data-set is borrowed from the Udacity AI-ND (AI Nanodegree) program’s repository.

Along the journey, I am going to use Google CoLab, which provides a cloud-based, Jupyter-notebook-like environment. The main reasons to choose it over Amazon SageMaker or Microsoft Azure Notebooks are (1) easier accessibility and (2) a free GPU accelerator. Even though many users share a limited number of GPUs, it should be enough for simple cases. As far as I know, the other platforms require knowing more than just a plain Jupyter notebook.

About myself

My background in deep learning is the Udacity Deep Learning ND and AI-ND with concentrations (CV, NLP, VUI), plus the Coursera deeplearning.ai specialization (the AI-ND has since been split into four separate programs, all of which I finished together with the previous version of the ND). I am also currently taking the Udacity Data Analyst ND and am about 80% done.

Table of contents

  1. Setup for Google CoLab — 1st article
  2. Data pre-processing — 1st article
  3. Experiment with various models — 2nd article
  4. Conclusion — 2nd article

Setup for Google CoLab

Notebook and google drive

Fig 1. Notebook Creation

First, you need to go to CoLab. You will see a screen similar to the one on the left-hand side. CoLab comes with two kinds of kernels, Python 2 and Python 3. For this tutorial, I am going to use Python 3 exclusively. After clicking the “NEW PYTHON3 NOTEBOOK” button, you will get a Jupyter-notebook-like screen.

The second step is to get access to Google Drive so you can load the training data-set into the notebook. There are a couple of ways to achieve this. One option is to bring up a UI form to upload files directly in the notebook; the other is to load files stored in Google Drive. I am going to use the second method in this article.

Fig 2. Google Drive setup
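
Below is a minimal sketch of what the snippet in Fig 2 does, based on the Drive-authentication snippet Google provides in CoLab’s code-snippet sidebar; the exact code in the figure may differ slightly.

```python
# Sketch of the Google Drive authentication snippet (from CoLab's snippet sidebar).
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate the CoLab runtime, then build a PyDrive client for Drive access.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
```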

Paste the code example above into a cell in the notebook. This code snippet is actually provided by Google, and you can find more useful ones by browsing the left sidebar menu.

Once you run the code cell, you will notice that it shows a URL and a text form asking you to enter a “verification code”. If you follow the URL, you will get the verification code that you have to paste into the form. This step has to be repeated every 12 hours, because the CoLab session is re-initialized every 12 hours. Now you are all set to access Google Drive.

GPU support

As of this writing, Google CoLab gives everyone a free GPU runtime environment, and the GPU is known to be an NVIDIA Tesla K80.

Fig 3. GPU Configuration

The GPU accelerator is not enabled by default; you have to choose it. Go to “Runtime → Change runtime type”, and you will see the configuration window in Fig 3. Under the “Hardware accelerator” drop-down menu, simply choose “GPU”, and it is free!

After enabling the GPU environment, you may want to check whether it is really available. Two lines of code are enough to find out, as the snippet below shows.
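
A minimal sketch of that check, assuming TensorFlow’s device-listing utility (the figure may use a slightly different call):

```python
# List the devices visible to TensorFlow; a GPU entry (e.g. a Tesla K80) should appear.
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
```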

Fig 4. GPU support check

If you run the code cell above in the notebook, you should see the result in Fig 4. As you can see, and as I mentioned, a Tesla K80 with about 11 GB of memory is used.

note* As far as I know, all the GPU environments in Google CoLab are shared among users (please correct me if I am wrong). That means it could under-perform a standalone environment, so keep this in mind if its performance does not meet your expectations.

Load the data-set

I am not going to explain how to upload the data-set into Google Drive since it is straightforward. I will assume the files are stored somewhere in Google Drive. It does not really matter where, because every file/directory has its unique “id”, and we can reference it explicitly.

Fig 5. Load files via google drive
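
A hedged sketch of the loading code in Fig 5; the directory id is the redacted one from the article, and the load_data helper and its file-reading details are my assumptions:

```python
# Directory id copied (redacted) from the article; replace with your own.
DIR_ID = '1vBIsfiF.......Ll9dJQ1uIsJ'

def load_data(filename, dir_id=DIR_ID):
    """Download `filename` from the Drive directory `dir_id` and
    return its contents split into sentences (one per line)."""
    # 'q' query: list every file whose parent is the given directory id.
    file_list = drive.ListFile({'q': "'{}' in parents".format(dir_id)}).GetList()

    for f in file_list:
        # print(f)  # uncomment to see every file attribute as a dictionary
        if f['title'] == filename:
            f.GetContentFile(filename)           # download to the local disk
            with open(filename, 'r') as handle:
                return handle.read().split('\n')
```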

I have stored the two given data files in a directory whose id is ‘1vBIsfiF…….Ll9dJQ1uIsJ’ (I have omitted the middle part so as not to expose it publicly). The GoogleDrive ListFile API accepts a number of parameters, but I am going to use only one, ‘q’. In the example code above, “drive.ListFile(….).GetList()” simply asks to load every file that meets the query specified under ‘q’. The query selects every file whose parent is the id specified before the keyword ‘in’, and it has to be an ‘id’. As I said, the id of my directory is ‘1vBIsfiF…….Ll9dJQ1uIsJ’, which is why that string appears in the code example. Fig 6 shows how to find the ‘id’.

Fig 6. How to find the id?

The rest of the code makes sense if you spend some time with it: while iterating through the files in the directory, it downloads only the one specified (the one I am interested in) as the argument of the load_data function. If you are curious about other file attributes, just print the file itself; it will give all the information in dictionary form.

Fig 7. load sentences

As the final step for this section, the English and French sentences are loaded using the function previously defined, as sketched below.
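
A minimal usage sketch, assuming the file names from the Udacity AI-ND data-set (‘small_vocab_en’ and ‘small_vocab_fr’); the names in the figure may differ:

```python
# Load the raw sentence files; these file names are assumptions.
english_sentences = load_data('small_vocab_en')
french_sentences = load_data('small_vocab_fr')

print(english_sentences[0])
print(french_sentences[0])
```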

Data Pre-processing

As with almost every machine learning problem, it is very common to have a look at how the data looks and to manipulate it to fit the problem. Even though there are tons of ways to enhance a data-set when it is very noisy, they are not covered in this article. Instead, I am going to show how to make the data play well with a deep learning architecture through tokenizing and padding processes using Keras.

First Look in the Data-set

Fig 8. show word information
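
A hedged reconstruction of the counting code in Fig 8, using collections.Counter; the variable names are my assumptions:

```python
import collections

# Count words per language and report the total, the vocabulary size, and the top-10 words.
for language, sentences in [('English', english_sentences),
                            ('French', french_sentences)]:
    words = [word for sentence in sentences for word in sentence.split()]
    counter = collections.Counter(words)

    print('{}: {} words, {} unique words'.format(language, len(words), len(counter)))
    print('10 most common words:', [word for word, _ in counter.most_common(10)])
```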

The code snippet above simply counts the words and prints out the total number of words, the number of unique words, and the 10 most common words in each language. The results are shown in Fig 9.

Fig 9. result of the word information display

Tokenizing

The raw data is stored at sentence granularity. There are two problems with this. (1) If I attempt translation on a one-English-sentence-to-one-French-sentence basis, there is an infinite number of cases. To make the translation process more general, word-by-word translation is much more appropriate. (2) Neural networks take only numeric values, but here I have text. It would be nice to assign a unique numeric value to each unique word so that whole training sentences can be represented as numbers.

Fig 10. fit_on_texts() method

If you want to practice your coding skills, please go ahead and solve the problems mentioned above yourself; it is a nice exercise in working with the dictionary data structure and strings. In this article, however, I am going to use the handy Tokenizer class provided by Keras.

Fig 11. Tokenize
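
A minimal sketch of the tokenizing step in Fig 11, assuming Keras’s Tokenizer; the helper function and the sample slice are my assumptions:

```python
from keras.preprocessing.text import Tokenizer

def tokenize(sentences):
    """Fit a Tokenizer on the sentences and return (sequences, tokenizer)."""
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentences)              # build the word -> index mapping
    sequences = tokenizer.texts_to_sequences(sentences)
    return sequences, tokenizer

# Tokenize a few sample sentences to see the mapping and the resulting sequences.
sample_sequences, sample_tokenizer = tokenize(english_sentences[:3])
print(sample_tokenizer.word_index)
print(sample_sequences)
```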

The first step is to create a Tokenizer instance. A newly created Tokenizer does not have any information, so it has to be fed with data. What I essentially want to do is change each word in every sentence into a numeric value, which can be done by calling the texts_to_sequences() method. However, for that to work, the Tokenizer needs a 1:1 internal mapping between words and numeric values, which is built by calling the fit_on_texts() method. The code example above does exactly what I have described, in that order, and the result is shown in Fig 12.

Fig 12. Tokenize sample sentences
Fig 13. Tokenizing + padding processes in short

The length of the sentences varies a lot. That means the model I am going to build would have to change its input size dynamically, and the output size would have to change dynamically too, because one English sentence could translate into a shorter or longer French sentence. Instead, it is much simpler to fix the size at the maximum length among all sentences, and that works just fine. One question naturally follows: what should I do with sentences shorter than the longest one? That is where the idea of ‘padding’ comes in.

For the shorter sentences, 0, which is mapped to the “<PAD>” token, is added. It can be placed before or after the sentence. Just like the previous step, I could practice my coding skills by implementing this functionality myself, but Keras comes with a handy function for it.

Fig 14. How to pad?

The pad_sequences() method takes a number of arguments; let’s look at some of them. The first argument is the sequence data, like the one built in the previous step, which will eventually be reformed and returned. The ‘maxlen’ argument specifies the maximum length of the sequences to keep. Sequences (sentences) longer than maxlen will have some of their elements dropped, and which part to drop can be specified with the ‘truncating’ argument. The ‘padding’ argument determines where to put the padding value when a sequence is shorter than maxlen.
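
A minimal sketch of the padding step, assuming Keras’s pad_sequences and the sample sequences from the tokenizing sketch above:

```python
from keras.preprocessing.sequence import pad_sequences

# Use the longest sample sequence as the fixed length.
max_length = max(len(sequence) for sequence in sample_sequences)

# Pad shorter sequences with 0 at the end; truncate longer ones from the end.
padded = pad_sequences(sample_sequences,
                       maxlen=max_length,
                       padding='post',
                       truncating='post')
print(padded)
```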

Fig 15. Padding result

Next

Now that data pre-processing is done, the next step is to build RNN models and train them by feeding in the data. Here is a brief overview of the models I am going to build.

  • Simple RNN
  • RNN with Embedding Layer
  • Bi-directional RNN
  • RNN with Encoder-Decoder architecture
  • Combined model with the above choices of design
  • Experiment with LSTM and GRU
  • Adjustment in Hyper-parameters
Fig 16. Basic RNN architectural concept
