Summary of RobustFillNeural Program Learning under noisy I/O

By Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, Pushmeet Kohli

Summarized by Pankaj Kumar


A comparison of a few methods of automatically generating a computer program from some specification: a) Rule based program synthesis. b)neural program synthesis where a neural network learns to generate a program from given I/O examples. c) neural program induction where a neural network genrates new outputs directly using a latent program representation


Their synthesis model uses RNN with attention to achieve 92% accuracy on a real-world test set, compared to a 34% accuracy of the previous best neural synthesis approach. The synthesis model also outperforms a comparable induction model on this task


  • The strength of each model is dependent on the evaluation metric and end-user application
  • Noise is handled well by the neural methods.


Data used: Programming by example system for string transformations similar to FlashFill. Flashfill allows end users to write regex based string transformations using examples and not complex macros. A user gives a small number of example output strings from input strings to convey desired intent. Flashfill uses this to automatically generate the corresponding outputs for the remaining input strings. Either synthesis or induction can be used.

Neural Model details

Authors develop novel variants of the attention RNN architecture to encode a variable length unordered set of input/output examples. - Method 1: They deploy a domain specific language that can define an expressive class of regex based string transformations. The neural network is then used to generate a program in the DSL. This is trained with programs uniformly sampled from the DSL. - Method 2: The neural network is trained to directly output the strings without a DSL. This is trained on a large set of input-output examples.

The authors show that the manual methods work good for well-formed I/O examples, its performance degrades in the presence of small amounts of noise. The neural models are much more robust to noise.

Problem Overview

InStr is input string
OutStr is the corresponding output string
Also is the training set
is the test set

In the program synthesis case, the examples generate a program $P$ as output token by token. It is trained fully supervised on a large corpus of synthetic I/O examples. In this case, the program can also execute $P$ on $I_j$ to produce $O_j$ and see if it matches. But generalization is more important than just a consistent outputoutput token by token. It is trained fully supervised on a large corpus of synthetic I/O examples. In this case, the program can also execute $P$ on $I_j$ to produce $O_j$ and see if it matches. But generalization is more important than just a consistent output..

In the program induction approach, the authors train the neural network on the training set and the neural network produces an $O^y$ directly when given an $I^y$.

These 2 systems can be compared by comparing the number of InStr get the correct OutStr.

Data overview

There are only a few hundred real world flash fill instances and the data used to train the model was synthesized automatically. First some programs in DSL were randomly sampled. Given the sampled program, we compute a simple set of heuristic requirements on the InStr such that the program can be executed without throwing an exception. Then the program, inStr and outStr become part of the dataset.

Program Synthesis model Architecture

RNNs are being used to train. But there is a small difference. In typical RNN, the input is one long sequence. Here, authors are dealing with a set of smaller input output pairs. Four variants are considered: - Basic Seq to Seq Each sequqence is encoded with a non-attentional LSTM, and the final hidden state is used as the initial state of the next LSTM. - Attention-A: O and P are attentioanal LSTMs, with O attending to I and P attending to O. - Attention-B: Same as Attention-A, but P uses a double attention architecture, attending to both O and I simultaneously. - Attention-C: Same as Attention-B but I and O are bidirectional LSTMs.
InStr and OutStr are character embeddings of 95 printable ascii characters. P is the source-code-order linearization of the 430 total program tokens which includes function names and parameter values as well as special tokens for concatentation and end-of-sequence. Numerical parameters are represented with enbedding tokens. The model is trained to maximize the log-likelihood of the reference program P.

Double Attention

Double attention is a straightforward extension to the standard attention architecture. A typical attentional layer looks like this:

S is the set of vectors being attended to, is the previous recurrent state, and is the current input. Double attention takes the form:

Multi-Example pooling

Attention and LSTM work on only one input output example. But we need a classifier that given a set of I/O pairs produces a program. That is why a full-connected layer and maxpool gets used.