The Natural Language Group


Our GitHub organization hosts the latest list of tools.

utoken. Universal tokenizer, i.e., a word segmenter for a wide variety of scripts and languages. Built by Ulf Hermjakob.

MTData. A tool capable of retrieving thousands of parallel datasets for machine translation research. Built by Thamme Gowda.

Reader Translator Generator (RTG). A feature-rich neural machine translation toolkit based on PyTorch, with a focus on reproducible experiments. Built by Thamme Gowda.

NLCodec and NLDb. A scalable tool for mapping words, characters, and BPE subwords to integer sequences, and a storage layer for efficiently storing and retrieving large-scale datasets. Built by Thamme Gowda.

English-to-AMR parser. Converts English sentences into Abstract Meaning Representations (AMRs). Built by Michael Pust, Ulf Hermjakob, Kevin Knight, Daniel Marcu, and Jonathan May. (Download size: 719 MB.)

AMR-to-English generator. Converts Abstract Meaning Representations (AMRs) into English sentences. Built by Nima Pourdamghani.

uroman. Converts text in any script to the Latin alphabet. Online interface. Built by Ulf Hermjakob.

Monogiza. Extracts a word-for-word translation table from non-parallel corpora. Built by Qing Dou.

Carmel. Finite-state transducer package for strings, built by Jonathan Graehl. Latest version on GitHub.

Tiburon. Finite-state transducer package for trees, built by Jonathan May.

Bolinas. Hyperedge replacement transducer package for graphs, built by Jacob Andreas, Daniel Bauer, David Chiang, Karl Moritz Hermann, Bevan Jones, and Kevin Knight.

Zoph_RNN. GPU-based neural LSTM sequence-to-sequence modeling toolkit, built by Barret Zoph.

EUREKA. CPU-based neural LSTM sequence-to-sequence modeling toolkit, built by Ashish Vaswani.

NPLM. Neural probabilistic language model toolkit, built by Ashish Vaswani, with contributions from David Chiang and Victoria Fossum.

SPADE. Sentence-level Discourse Parser. Built by Radu Soricut.

ReWrite Decoder. Greedy Decoder for IBM SMT Models. Built by Daniel Marcu and Ulrich Germann.


Software demos:

Many-English NMT. A multilingual NMT model that can translate from 500 source languages to English. Built by Thamme Gowda.

Portmanteau generator. Creates a new word (neologism) from two existing words. Built by Aliya Deri.

Poetry password demo and assigner. Shows poems created from randomly generated 60-bit passwords. Built by Marjan Ghazvininejad.

Smatch. Evaluates output of semantic parsing. Built by Shu Cai.

Poetry generator.  Creates a poem on any topic.  Built by Marjan Ghazvininejad, Xing Shi, Yejin Choi, and Kevin Knight.

Spolin Bot.  Chat with our improvisation bot!


HowToSpeak. Allows users to speak a language they don't understand, by phonetic rendering. Built by Xing Shi.


Annotation and data collection interfaces:

Shannon Game. Collects character-level text predictions from people, in order to estimate the entropy of translation. Built by Marjan Ghazvininejad.

AMR Editor. Allows human annotators to type in the meanings of English sentences, using the Abstract Meaning Representation framework. Built by Ulf Hermjakob. Video.

RST Annotation Tool. Enables annotators to build Rhetorical Structure Representations for texts. Built by Benjamin Liberman.
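
As a rough illustration of how predictions collected in a Shannon Game can be turned into an entropy estimate, the sketch below applies Shannon's classic upper bound: the entropy of the distribution over how many guesses a player needed for each character. The function name and the sample data are hypothetical, and this is not the code behind the interface listed above.

```python
import math
from collections import Counter

def entropy_upper_bound(guess_counts):
    """Upper bound on entropy (bits per character), computed as the entropy
    of the distribution of 'number of guesses needed' over all characters."""
    total = len(guess_counts)
    freq = Counter(guess_counts)
    return -sum((n / total) * math.log2(n / total) for n in freq.values())

# Hypothetical data: guesses a player needed for each of 8 characters.
guesses = [1, 1, 1, 1, 2, 2, 3, 5]
print(round(entropy_upper_bound(guesses), 3))  # 1.75
```

A player who always guesses correctly on the first try yields a bound of 0 bits per character; the more spread out the guess counts, the higher the estimate.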


Shared tasks:

Bilingual compression challenge. If we exploit the high redundancy of human-translated texts, what is the best compression rate we can achieve for bilingual texts? Run by Barret Zoph, Kevin Knight, and Marjan Ghazvininejad.

AMR parsing. This 2016 SemEval challenge asks participants to write software to convert English into Abstract Meaning Representations. Run by Jonathan May.
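
For the bilingual compression challenge above, a simple baseline helps make "compression rate" concrete. The sketch below assumes the rate is compressed size divided by original size and uses Python's general-purpose zlib compressor, which knows nothing about translation; the toy sentences and the helper name are hypothetical, and the challenge's own scoring may be defined differently.

```python
import zlib

def compression_rate(text: str) -> float:
    """Compressed size divided by original size, both in bytes (lower is better)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot compute a rate for empty text")
    return len(zlib.compress(raw, 9)) / len(raw)

# Hypothetical toy bitext; a real experiment would use a large parallel corpus.
source = "the cat sat on the mat . " * 50
target = "le chat est assis sur le tapis . " * 50

# Compare compressing each side on its own with compressing both in one stream.
separate = (len(zlib.compress(source.encode("utf-8"), 9))
            + len(zlib.compress(target.encode("utf-8"), 9)))
joint = len(zlib.compress((source + target).encode("utf-8"), 9))

print(f"rate (both sides, one stream): {compression_rate(source + target):.3f}")
print(f"bytes, compressed separately: {separate}; jointly: {joint}")
```

A generic compressor treats the two languages as unrelated byte streams, so any gain from modeling the target side given the source side would show up as an improvement over this kind of baseline.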