These are some thoughts I wrote up as a retrospective on https://github.com/ddhackiisc/code/blob/master/smiles-pipeline/SMILES%20VAE.ipynb, a project I worked on a few months ago. In future posts I hope to go into the tricky details of setting up batch processing in PyTorch, and also to explain how exactly variational autoencoders work.
In the summer of 2020 I spent some time participating in the Government of India Drug Discovery Hackathon as part of a six-member IISc team. One of the projects I worked on involved trying to find molecules that might mess up the functioning of the SARS-CoV-2 main protease, based on data from known inhibitors of SARS-CoV-1 (aka the original SARS). The problem statement suggested using a variational autoencoder trained on SMILES strings. Let’s go into what that means.
SMILES is a standard format for representing complex organic molecules as strings of ASCII characters. Here’s an example string:
C1CC2=C3C(=CC=C2)C(=CN3C1)[C@H]4[C@@H](C(=O)NC4=O)C5=CNC6=CC=CC=C65,
and what the molecule looks like:

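If you want to play with SMILES strings yourself, RDKit is the standard open-source toolkit. Here is a minimal sketch (assuming RDKit is installed; it is illustrative only and not part of the project pipeline) that parses the example string above and checks that it is valid:

```python
# Minimal sketch of handling a SMILES string with RDKit (assumed installed,
# e.g. via `pip install rdkit`); not part of the project code.
from rdkit import Chem

smiles = "C1CC2=C3C(=CC=C2)C(=CN3C1)[C@H]4[C@@H](C(=O)NC4=O)C5=CNC6=CC=CC=C65"
mol = Chem.MolFromSmiles(smiles)   # returns None if the string is not valid SMILES

if mol is not None:
    print("valid SMILES with", mol.GetNumAtoms(), "heavy atoms")
    print("canonical form:", Chem.MolToSmiles(mol))
```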
The goal of this project was to explore the chemical space around molecules that inhibit the original SARS virus. Unfortunately, since molecules are fundamentally discrete objects, it is difficult to smoothly interpolate between them. One possible way around this is to use variational autoencoders.
Autoencoders are a type of generative machine learning model that can be used to generate new samples from a distribution defined by some training examples. A standard autoencoder simply compresses an input (represented as a vector) by encoding it into a lower-dimensional space and then tries to decode it back into the original. The idea is that the resulting information bottleneck forces the autoencoder to learn a better representation (in the lower-dimensional latent space) that keeps only the important information in the data set. One can then generate new samples by supplying random inputs to the decoder.
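To make that concrete, a toy autoencoder in PyTorch might look like the sketch below. The dimensions here are invented for illustration and have nothing to do with the project’s actual model:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Toy dense autoencoder: a 256-dim input squeezed through a 32-dim bottleneck."""
    def __init__(self, input_dim=256, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        z = self.encoder(x)      # compress into the latent space
        return self.decoder(z)   # try to reconstruct the original input

model = Autoencoder()
x = torch.randn(8, 256)                       # a fake batch of 8 input vectors
loss = nn.functional.mse_loss(model(x), x)    # reconstruction error to minimise
```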
A variational autoencoder (VAE) is a type of autoencoder that lets you explore the effects of small changes in the latent space. It does this by encoding the data as the mean and variance of a multivariate normal distribution, from which samples are drawn with the aid of a pseudorandom number generator (in other words, the randomness is disentangled from the rest of the encoding). Altering the mean and variance of the distribution lets you explore a slightly different space of samples, in a way that is not possible with a standard autoencoder.
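The difference is easiest to see in code. Here is a hedged sketch of a VAE bottleneck, again with made-up sizes rather than the project’s exact implementation:

```python
import torch
import torch.nn as nn

class VAEBottleneck(nn.Module):
    """Toy VAE bottleneck: maps a feature vector to (mu, logvar) and samples z."""
    def __init__(self, feature_dim=128, latent_dim=32):
        super().__init__()
        self.to_mu = nn.Linear(feature_dim, latent_dim)
        self.to_logvar = nn.Linear(feature_dim, latent_dim)

    def forward(self, h):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        eps = torch.randn_like(mu)               # the pseudorandom part, kept separate
        z = mu + torch.exp(0.5 * logvar) * eps   # the "reparameterisation trick"
        # KL term that nudges the latent distribution towards a standard normal
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
        return z, kl
```

Because the noise eps enters through that one line only, you can hold it fixed and nudge mu or the variance to move smoothly through the space of samples.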
At a high level, the code is a pipeline that takes a set of training examples that are the SMILES representations of known SARS-CoV-1 inhibitors, and generates SMILES strings that represent new molecules drawn from the same distribution, which will hopefully have similar properties.
The largest conceptual hurdle in the project was figuring out how to apply a VAE to strings. Normally, especially with images, the standard procedure is to convert everything into a fixed-length vector of real numbers (which fits easily into the VAE framework), using one-hot encoding for categorical data. Unfortunately that would not work here, for two reasons: different SMILES strings have different lengths, and SMILES is designed in such a way that the order of characters is very important, with long-range dependencies between them. Only a tiny minority of character sequences form valid SMILES strings.
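For concreteness, the character-level indexing involved looks roughly like the sketch below; the vocabulary and special tokens here are hypothetical, not the project’s actual preprocessing:

```python
# Build a character vocabulary from a few toy SMILES strings and encode them as
# integer ids; 0 is reserved for padding and 1 for end-of-string (both invented here).
smiles_data = ["CCO", "c1ccccc1", "CC(=O)O"]

chars = sorted({ch for s in smiles_data for ch in s})
stoi = {ch: i + 2 for i, ch in enumerate(chars)}
PAD, EOS = 0, 1

def encode(s):
    return [stoi[ch] for ch in s] + [EOS]

print(encode("CCO"))   # [6, 6, 7, 1] with this toy vocabulary
```

Strings of different lengths then still have to be padded or packed before they can share a batch, which is where most of the engineering effort went (more on that below).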
The solution, which was surprisingly simple in retrospect, was to separate the model into three pieces: first, an LSTM-based encoder that embeds the string into a vector space; then a VAE; and finally another LSTM-based decoder that turns the output of the VAE back into a SMILES string.
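Schematically (and only schematically; the sizes and wiring below are invented, not copied from the notebook), the three pieces fit together like this:

```python
import torch
import torch.nn as nn

class SmilesVAE(nn.Module):
    """Rough shape of the model: LSTM encoder -> VAE bottleneck -> LSTM decoder."""
    def __init__(self, vocab_size=40, embed_dim=64, hidden_dim=256, latent_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.latent_to_hidden = nn.Linear(latent_dim, hidden_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)                      # (batch, seq_len, embed_dim)
        _, (h, _) = self.encoder(x)                 # final hidden state summarises the string
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        h0 = self.latent_to_hidden(z).unsqueeze(0)  # seed the decoder with the latent code
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(x, (h0, c0))      # teacher forcing on the input tokens
        return self.out(dec_out), mu, logvar        # per-character logits + VAE statistics
```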
This worked quite well, with nearly all output being valid SMILES strings. The main technical issue was getting this multi-stage model to batch process multiple strings at the same time. For this, the PyTorch tutorial on sequence-to-sequence natural language processing ended up being invaluable. I ended up using almost exactly the same process as machine translation, just one level down (so sentences became strings and word tokens became characters), and with a VAE stuck in the middle.
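The usual way to batch variable-length sequences in PyTorch is padding plus pack_padded_sequence. A minimal sketch of the idea (not the exact notebook code, and not necessarily how the tutorial presents it):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three token-id sequences of different lengths, standing in for encoded SMILES strings.
seqs = [torch.tensor([6, 6, 7, 1]),
        torch.tensor([8, 4, 8, 8, 8, 8, 8, 8, 4, 1]),
        torch.tensor([6, 6, 2, 5, 7, 3, 7, 1])]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True, padding_value=0)   # shape (3, 10)
embed = nn.Embedding(40, 64, padding_idx=0)
lstm = nn.LSTM(64, 256, batch_first=True)

packed = pack_padded_sequence(embed(padded), lengths, batch_first=True,
                              enforce_sorted=False)
packed_out, (h, c) = lstm(packed)                 # padding never reaches the LSTM cells
out, _ = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)   # torch.Size([3, 10, 256])
```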
One major design decision I had to make was whether to use an attention mechanism in the decoder, as this might have made it a lot easier for the VAE to escape the constraints of the SMILES format and pay more attention to molecular similarities.
Three considerations influenced my decision: model complexity, data availability and compute constraints. I wasn’t confident I would be able to debug problems with the attention mechanism, and given the time constraints I thought it would be better to get a simpler version working sooner. I also felt that adding attention would significantly increase the size of the model, which would severely slow down my iteration cycle given my compute constraints, and that the amount of data (less than a thousand samples) wasn’t enough to really see a gain from attention. For these reasons I eventually decided not to use attention.

Looking back
In hindsight, it was quite difficult to evaluate the quality of the final generated samples. While it was of course easy to eyeball the molecules and check that they made chemical sense and weren’t simply memorized from the training set, the goal of the project was to find new inhibitors, and the only way we could evaluate that was by actually running molecular dynamics simulations and checking the binding energy between the molecule and the SARS-CoV-2 protease. This is computationally very intensive, but we did have the resources and expertise to do it. My mistake was delaying the project so long that there wasn’t time to iterate after actual evaluation of the molecules. I should have figured out a minimum viable model as quickly as possible, so we could iron out the inevitable bugs and difficulties in the MD evaluation process and have a truly end-to-end pipeline. More importantly, I should have realised that the constraints affecting my teammates with simulation expertise (which I was perfectly aware of) would significantly influence my development timeline. In retrospect, I should have figured out early on when my teammates would be free to run simulations and adjusted my schedule in that light. All days are not equal.
Another way I would do things differently is to spend a lot more time looking for data and to try out much larger models. Recent scaling experiments and papers on phenomena like double descent make it clear that bigger models simply work better, and that at scale the overfitting problem goes away. I might also try including attention, because it really does seem that large transformers can do anything.
There isn’t much data on SARS-CoV inhibitors, so I would need some sort of fine-tuning approach, where I initially train the LSTMs on ordinary drugs and then fine-tune the VAE on inhibitors.
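In code, the idea would be something along the lines of the sketch below, which reuses the SmilesVAE sketch from earlier in this post; the train(...) calls and data set names are stand-ins, not real code from the project:

```python
import torch

# Hypothetical pretrain-then-fine-tune outline; `SmilesVAE` is the earlier sketch,
# and the train(...) calls stand in for a real training loop.
model = SmilesVAE()
pretrain_optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
# train(model, ordinary_drug_smiles, pretrain_optimiser)       # stage 1: ordinary drugs

# Stage 2: freeze the embedding and the LSTMs, and fine-tune only the VAE bottleneck
# on the small set of SARS-CoV inhibitors.
for module in (model.embed, model.encoder, model.decoder):
    for p in module.parameters():
        p.requires_grad = False
finetune_optimiser = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
# train(model, sars_cov_inhibitor_smiles, finetune_optimiser)  # stage 2: inhibitors
```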
Finally, I wasted a lot of time confused over whether my vague ideas about the LSTM+VAE approach made sense, without actually writing anything down. Perhaps the most important lesson I took from this project is that the fastest way to crystallise an idea is to write it down as code.
***
The IPython notebook is available at https://github.com/ddhackiisc/code/blob/master/smiles-pipeline/SMILES%20VAE.ipynb.