Deciphering Ancient Documents

Archaeologists sometimes encounter an ancient document that is written in an unfamiliar script. If they are fortunate enough to guess the language behind it, they can often decipher the script. Their literal goal is to "make the text speak," after which it can be interpreted, translated, and so on. The Mayan and Linear B syllabic scripts were deciphered in this way during the twentieth century.

Here is the situation:

observed character sequences <-- unknown connection --> observable sound sequences

Computer algorithms can simulate, in mindless but thorough fashion, the cryptographic process of uncovering the unknown connection. In computational linguistic terms, the goal is to do text-to-speech conversion without a dictionary, spelling rules, or a literate informant.
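To make the idea of "uncovering the unknown connection" concrete, here is a minimal sketch that treats the sounds as hidden and the written characters as observed, and fits a table P(character | sound) with expectation-maximization. This is an illustration under stated assumptions, not necessarily the exact method used in the experiments: the function names (`train_bigram`, `em_channel`), the toy word list, and the use of a sound-pair (bigram) model are all choices made here for brevity, and plain letters stand in for sounds.

```python
import math
import random

def train_bigram(sound_seqs, sounds, smooth=0.1):
    """Smoothed model P(next sound | previous sound); '<s>' marks the start."""
    model = {}
    for prev in ["<s>"] + sounds:
        row = {s: smooth for s in sounds}
        for seq in sound_seqs:
            for a, b in zip(["<s>"] + list(seq), seq):
                if a == prev:
                    row[b] += 1
        total = sum(row.values())
        model[prev] = {s: v / total for s, v in row.items()}
    return model

def em_channel(texts, sounds, lm, iterations=15, seed=0):
    """Fit channel[s][c] = P(character c | sound s) by EM (forward-backward)."""
    rng = random.Random(seed)
    chars = sorted({c for t in texts for c in t})
    channel = {}
    for s in sounds:  # random, normalized starting point
        weights = [rng.random() + 0.5 for _ in chars]
        channel[s] = {c: w / sum(weights) for c, w in zip(chars, weights)}
    history = []  # log-likelihood of the text at the start of each iteration
    for _ in range(iterations):
        counts = {s: {c: 0.0 for c in chars} for s in sounds}
        loglik = 0.0
        for text in texts:
            if not text:
                continue
            # Forward pass, rescaled at every position to avoid underflow.
            alpha, scales = [], []
            for i, c in enumerate(text):
                row = {}
                for s in sounds:
                    if i == 0:
                        prior = lm["<s>"][s]
                    else:
                        prior = sum(alpha[i - 1][p] * lm[p][s] for p in sounds)
                    row[s] = prior * channel[s][c]
                z = sum(row.values())
                alpha.append({s: v / z for s, v in row.items()})
                scales.append(z)
                loglik += math.log(z)
            # Backward pass; alpha * beta is the posterior over sounds.
            beta = {s: 1.0 for s in sounds}
            for i in range(len(text) - 1, -1, -1):
                for s in sounds:
                    counts[s][text[i]] += alpha[i][s] * beta[s]
                if i > 0:
                    beta = {p: sum(lm[p][s] * channel[s][text[i]] * beta[s]
                                   for s in sounds) / scales[i]
                            for p in sounds}
        # M-step: renormalize the expected counts for each sound.
        for s in sounds:
            total = sum(counts[s].values())
            if total > 0:
                channel[s] = {c: v / total for c, v in counts[s].items()}
        history.append(loglik)
    return channel, history

# Toy data (hypothetical): letters stand in for sounds.
sound_data = ["lamancha", "unlugar", "hidalgo", "lanza", "galgo", "nombre", "mucho"]
sounds = sorted(set("".join(sound_data)))
# The secret mapping only manufactures an "unfamiliar script"; EM never sees it.
secret = {s: chr(0x3B1 + i) for i, s in enumerate(sounds)}
texts = ["".join(secret[s] for s in w) for w in ["lamancha", "hidalgo", "galgo", "lugar"]]

lm = train_bigram(sound_data, sounds)
channel, history = em_channel(texts, sounds, lm)
```

The values in `history` should never decrease from one iteration to the next, which is the sense in which the rendering gets steadily better. With so little text the learned table may well settle on a wrong mapping, so this is only a toy.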

If the Mayans had invaded Spain, rather than the other way around, we might today be deciphering ancient Spanish documents. Following this idea, we can illustrate archaeological decipherment by pretending that the modern scripts of Spanish, Japanese (kana), and Chinese documents are all unfamiliar. To decipher them we first collect phonetic data. We could do this by tape-recording native speakers, but we can also fake it by turning online documents into phonetic versions, using dictionaries and spelling rules handed down by an "oracle."
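One simple form the collected phonetic data can take is a table of sound-triple frequencies. The sketch below shows how such a table might be built from phonetic transcriptions. The function names, the `#` boundary symbol, and the two toy transcriptions are assumptions made for illustration, not part of the original experiments.

```python
from collections import Counter

def sound_triple_counts(utterances, pad="#"):
    """Count every run of three consecutive sounds, padding utterance edges."""
    counts = Counter()
    for phonemes in utterances:
        padded = [pad, pad] + list(phonemes) + [pad]
        for i in range(len(padded) - 2):
            counts[tuple(padded[i:i + 3])] += 1
    return counts

def triple_probability(counts, a, b, c):
    """Relative frequency of sound c following the sound pair (a, b)."""
    context_total = sum(n for (x, y, _), n in counts.items() if (x, y) == (a, b))
    return counts[(a, b, c)] / context_total if context_total else 0.0

# Toy phonetic transcriptions (hypothetical), one list of sounds per utterance.
utterances = [["l", "a"], ["m", "a", "n", "tʃ", "a"]]
counts = sound_triple_counts(utterances)
p_start_l = triple_probability(counts, "#", "#", "l")  # half the utterances start with "l"
```

Real phonetic data would of course cover far more than two utterances; the point is only that the table needs nothing beyond sound sequences, with no spelling information at all.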

In a Spanish experiment, we collected phonetic data from a modern medical domain (simple sound-triple frequencies in IPA format), and we used this data to decipher a written page of the book Don Quixote:

"Primera parte del ingenioso hidalgo don Quijote de la Mancha. Capítulo primero. Que trata de la condición y ejercicio del famoso hidalgo don Quijote de la Mancha. En un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho tiempo que vivía un hidalgo de los de lanza en astillero, adarga antigua, rocín flaco y galgo corredor..."

This page might as well be written in a different alphabet, as the computer is unaware of how the characters sound. As the decipherment of this document proceeds, we can "listen in" and hear how the rendering becomes more and more accurate:

After five minutes ... After fifteen minutes ... After three hours.

The last decipherment is imperfect, but understandable enough.

Here are results for 200 sentences of Japanese.

After five minutes --- After ten hours --- After thirty hours

These are a bit more difficult to understand, in part because we use the Spanish synthesizer (e.g., it has no "sh" sound), and in part because small Japanese kana are pronounced as though they were full size.

Interesting research questions:

  • can these techniques decipher a "logographic" writing system (e.g., Chinese)? how accurately?
  • is there a generic, computationally-decipherable "mother of all writing systems" that encompasses all known systems?
  • how much ancient text is needed? (this necessarily comes in limited quantities)
  • how much phonetic data is needed?
  • what kind of phonetic data is needed? are phoneme triple frequencies enough? are words or grammar needed?
  • are word separators necessary or helpful?
  • are the techniques robust against language evolution (i.e., suppose the modern phonetic data is only distantly related to the ancient written data)?
  • can the techniques decipher actual ancient writing?
  • can the techniques automatically identify the language behind a script as a precursor to deciphering it?

    For details, see "A Computational Approach to Deciphering Unknown Scripts" (K. Knight and K. Yamada), Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing, 1999.