Bilingual Compression Challenge (BCC)

Barret Zoph, Marjan Ghazvininejad, and Kevin Knight


This site introduces a simple bilingual compression challenge. The task is to build the smallest piece of executable code that prints out a large (but specific) bilingual text, byte for byte.

Our overall aim is to use compression to drive new research into translation technologies, using the metric of file size. We are inspired by this observation (Nevill and Bell, 1992):

"From an information theoretic point of view, accurately translated copies of the original text would be expected to contain almost no extra information if the original text is available, so in principle it should be possible to store and transmit these texts with very little extra cost."

Nevill and Bell are perhaps optimistic, but bilingual texts clearly contain huge amounts of redundancy. The more translation patterns we can automatically extract from bilingual text, the more we can compress it, and vice versa. As a by-product, we are interested in the question "How much information does a human translator add to the original text?"

In the challenge, we adopt the same rules used in monolingual text compression benchmark evaluations, such as the Hutter Prize. A valid entry is an executable, or self-extracting archive, that prints out a particular Spanish/English bilingual text. Decompression code, dictionaries, and/or other resources must be embedded in the executable -- we cannot assume that the recipient of the compressed file has access to those resources.
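To make the entry format concrete, here is a minimal sketch of a self-extracting entry. It is an illustration only, not the method described in the paper: it uses the general-purpose zlib compressor, the names `make_self_extractor`, `run_entry`, and `sample` are hypothetical, and the sample sentences are a toy stand-in for the real data. It also assumes the recipient already has a Python interpreter with zlib, which a real entry could not take for granted under the rules above.

```python
import base64
import subprocess
import sys
import zlib


def make_self_extractor(text: bytes) -> str:
    """Return Python source that prints `text` byte for byte when run."""
    # Compress at the highest zlib level, then base64-encode so the
    # payload can be embedded in the source file as an ASCII literal.
    payload = base64.b64encode(zlib.compress(text, 9)).decode("ascii")
    return (
        "import base64, sys, zlib\n"
        f"sys.stdout.buffer.write(zlib.decompress(base64.b64decode('{payload}')))\n"
    )


def run_entry(source: str) -> bytes:
    """Run an entry in a fresh interpreter and capture exactly what it prints."""
    result = subprocess.run(
        [sys.executable, "-c", source], capture_output=True, check=True
    )
    return result.stdout


# Toy stand-in for the bilingual text (hypothetical sample lines).
sample = "La casa es azul.\nThe house is blue.\n".encode("utf-8") * 1000

entry = make_self_extractor(sample)
entry_size = len(entry.encode("ascii"))  # the score is the size of the whole entry
```

Checking an entry then amounts to comparing `run_entry(entry)` with the original bytes and reporting `entry_size`. A generic compressor like this one knows nothing about translation; the challenge is to beat it by modeling how one side of the text predicts the other.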

Any executable is permitted. For example, if our job were to compress the first million digits of π, then we could submit a very short piece of code that prints those digits. The brevity of the compression demonstrates our understanding of the sequence.
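As a hedged illustration of the π example, the sketch below generates digits with Gibbons' unbounded spigot algorithm. The choice of algorithm and the names `pi_digits` and `first_pi_digits` are ours, picked for brevity rather than speed; this particular method would be slow at a million digits, but the point is that the program's size does not grow with the number of digits requested.

```python
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi one at a time: 3, 1, 4, 1, 5, ..."""
    # Gibbons' unbounded spigot: exact integer arithmetic, no floating point.
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled; emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough information yet; absorb one more term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )


def first_pi_digits(count: int) -> str:
    """Return the first `count` digits of pi as a string, starting with 3."""
    return "".join(str(d) for d in islice(pi_digits(), count))
```

For instance, `first_pi_digits(10)` returns `"3141592653"`. The whole program is a few hundred bytes, far smaller than the digits it can print.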

The bilingual data is here (large version) and here (small version). The text is already aligned at the segment (sentence) level, but not tokenized or re-cased. Here is a summary of the data:

The uncompressed large text is 619.4 MB. In this paper, we describe a compression to 86.4 MB, about 14% of the original size.

It should be possible to do better! Can you?