ISI ReWrite Decoder User's Manual (Release 1.0.0a)

ISI ReWrite Decoder
User's Manual
Version 1.0.0a

for Public Release 1.0.0a

Related Papers

What's New?

What you need

Notes on Language Model Training

The Translation Model Config File

The Translation Model Config file describes the Translation Model. It contains one parameter-value pair per line. Parameters and values are separated by ‘=’. Comments are introduced by ‘#’ and run to the end of the line.

The following example also explains the meaning of the various (mandatory!) parameters.

# This is a sample config file for the translation model for a French (fra) to English (eng) system

TM_RawDataDir = /the/path/to/the/translation/model/files
# TM_RawDataDir points to the location where Giza put its output. 
# If missing, the decoder assumes that the files are in the same
# directory as the config file.
TTable = # translation table
InverseTTable = # inverse translation table
NTable = # fertility table
D3Table = # distortion table for model 3
D4Table = # distortion table for model 4
PZero = # contains p0 for the model
Source.vcb = fra-eng.src.vcb # maps from words to word IDs
Target.vcb = fra-eng.trg.vcb # maps from words to word IDs
Source.classes = /location/of/the/tokenized/corpus/eng.vcb.classes
# a path starting with '/' overrides the default path specified with TM_RawDataDir
Target.classes = /location/of/the/tokenized/corpus/fra.vcb.classes
FZeroWords = ZeroFert # file with (English) zero fertility words
# file format: one word per line (anything after the
# first word (space separated) on each line is ignored)
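As a rough illustration of the format described above, the following Python sketch reads the parameter-value pairs and resolves file names against TM_RawDataDir. The function names parse_tm_config and resolve_path are hypothetical and not part of the decoder. The sketch assumes that '#' never occurs inside a value.

```python
import posixpath


def parse_tm_config(text):
    """Parse 'parameter = value' lines; '#' starts a comment."""
    params = {}
    for raw_line in text.splitlines():
        # Drop everything from the first '#' onward, then trim whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def resolve_path(params, key, config_dir):
    """Resolve a file parameter as described in the sample config comments."""
    value = params[key]
    if value.startswith("/"):
        # A path starting with '/' overrides the default directory.
        return value
    # If TM_RawDataDir is missing, fall back to the config file's directory.
    base = params.get("TM_RawDataDir", config_dir)
    return posixpath.join(base, value)


sample = """\
# sample translation model config
TM_RawDataDir = /the/path/to/the/translation/model/files
Source.vcb = fra-eng.src.vcb # maps from words to word IDs
Source.classes = /location/of/the/tokenized/corpus/eng.vcb.classes
FZeroWords = ZeroFert
"""

config = parse_tm_config(sample)
print(resolve_path(config, "Source.vcb", "/dir/of/config"))
print(resolve_path(config, "Source.classes", "/dir/of/config"))
```

The first call yields a path below TM_RawDataDir, while the second returns the absolute path unchanged.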


Input Format

The decoder expects the input in XML format. Minimally, an input file looks like this:

<?xml version="1.0" encoding="latin-1"?>
<s id="1">this is input sentence one .</s>
<s id="2">this is input sentence two .</s>

Note that the input must be proper XML! Something like <s id=1> is not a proper XML tag --- attribute values must be quoted.

For your convenience, the distribution contains a small xml-wrapper for one-sentence-per-line input.
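The wrapper shipped with the distribution is not reproduced here. As a sketch of what such a wrapper might do, the following Python function (wrap_sentences is a hypothetical name) numbers the sentences, quotes the id attribute, and escapes characters that have a special meaning in XML.

```python
from xml.sax.saxutils import escape


def wrap_sentences(lines, encoding="latin-1"):
    """Turn one-sentence-per-line input into the minimal <s id="..."> format."""
    out = ['<?xml version="1.0" encoding="%s"?>' % encoding]
    sentence_id = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip empty lines instead of emitting empty sentences
        sentence_id += 1
        # escape() replaces &, < and > so that the result stays proper XML.
        out.append('<s id="%d">%s</s>' % (sentence_id, escape(text)))
    return "\n".join(out)


wrapped = wrap_sentences([
    "this is input sentence one .",
    "this is input sentence two .",
])
print(wrapped)
```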

Additional Markup

Between the <s> tags, the sentence can be marked up with other XML tags. If the markup amounts to a tree structure (e.g. <a><b>one two</b><c>three four</c></a>), only siblings within the tree will be swapped. In the example, these are "one" and "two", "three" and "four", and "one two" and "three four", but not, for example, "two" and "three".

With the attribute english you can tell the decoder how to translate certain items. <ENAMEX english="Kohl">Kohl</ENAMEX>, for example, forces the decoder to translate Kohl (the former German Chancellor) as Kohl (instead of cabbage) in English; the actual tag name is irrelevant. Multiple options can be provided by separating them with a vertical bar, e.g. 'english="April 7|1 Apr|04/07"'. The decoder then picks the option that best fits the given language model context.
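To make the attribute syntax concrete, here is a small Python sketch that lists the translation options attached to each marked-up item. It only shows how the vertical-bar notation can be read; it does not model how the decoder chooses among the options. The function name forced_translations, the tag name date, and the source string "7.4." are made up for illustration.

```python
import xml.etree.ElementTree as ET


def forced_translations(sentence_xml):
    """Return (source text, list of English options) for every tagged item."""
    root = ET.fromstring(sentence_xml)
    result = []
    for elem in root.iter():
        options = elem.get("english")
        if options is not None:
            # Alternatives are separated by a vertical bar.
            result.append((elem.text or "", options.split("|")))
    return result


sentence = (
    '<s id="1"><ENAMEX english="Kohl">Kohl</ENAMEX> '
    '<date english="April 7|1 Apr|04/07">7.4.</date></s>'
)
for source, options in forced_translations(sentence):
    print(source, "->", options)
```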

The attribute allow_swapping="no" prevents the swapping of the immediate children of the respective node in the XML structure. The children of those children can be swapped with one another. This feature can be used to preserve parentheses: <paren allow_swapping="no"><lp english="(">(</lp><middle> this is a parenthesis</middle><rp english=")">)</rp></paren> will keep "(" and ")" as brackets around the content of the parenthesis, while the words of the parenthesis can be swapped around within the scope of the parenthesis.
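The swapping rules can be pictured with a short Python sketch that lists, for a marked-up sentence, the groups of siblings whose members may trade places. This is one reading of the description above and not the decoder's actual search code; the helper names units, surface, and swappable_groups are hypothetical.

```python
import xml.etree.ElementTree as ET


def units(elem):
    """Immediate children of elem in order: loose words and child elements."""
    items = list((elem.text or "").split())
    for child in elem:
        items.append(child)
        items.extend((child.tail or "").split())
    return items


def surface(item):
    """Text covered by a unit, with whitespace normalized."""
    if isinstance(item, str):
        return item
    return " ".join("".join(item.itertext()).split())


def swappable_groups(elem):
    """Collect the sibling groups whose members may be swapped with one another."""
    groups = []
    children = units(elem)
    # allow_swapping="no" only protects the immediate children of this node.
    if elem.get("allow_swapping") != "no" and len(children) > 1:
        groups.append([surface(c) for c in children])
    for c in children:
        if not isinstance(c, str):
            groups.extend(swappable_groups(c))
    return groups


tree = ET.fromstring("<a><b>one two</b><c>three four</c></a>")
print(swappable_groups(tree))

paren = ET.fromstring(
    '<paren allow_swapping="no"><lp english="(">(</lp>'
    "<middle> this is a parenthesis</middle>"
    '<rp english=")">)</rp></paren>'
)
print(swappable_groups(paren))
```

For the first example the groups are "one two"/"three four", "one"/"two", and "three"/"four"; for the parenthesis example only the words inside middle form a group, so the brackets stay in place.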

Decoder Config File (optional)

In addition to the (mandatory) Translation Model Config file, an optional Decoder Config File can be used, so that parameters do not have to be specified on the command line. The default name for the Decoder Config file is decoder.cfg. The decoder, by default, searches first in the current working directory, and then in the user's home directory. The command line parameter --config can be used to specify a different name and/or location. Settings in the Decoder Config File override built-in defaults. Switches on the command line override built-in defaults and settings in the Decoder Config File.
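The precedence order described above (built-in defaults, then the Decoder Config File, then command line switches) can be sketched in Python as a sequence of dictionary updates. The default values shown are placeholders, since the manual does not list the built-in defaults, and effective_settings is a hypothetical name.

```python
def effective_settings(builtin_defaults, config_file, command_line):
    """Later sources override earlier ones: defaults < config file < command line."""
    settings = dict(builtin_defaults)
    settings.update(config_file)
    settings.update(command_line)
    return settings


# Placeholder defaults; the real built-in values are not documented here.
builtin_defaults = {"PrintAlignment": "false", "PrintProbabilities": "false"}
config_file = {"PrintAlignment": "true", "PrintProbabilities": "true"}
command_line = {"PrintProbabilities": "false"}  # e.g. --print-probabilities=false

settings = effective_settings(builtin_defaults, config_file, command_line)
print(settings)
```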

Suppose, for example, that you want to run the decoder always with the same translation model and print out the three best translations for each input sentence together with various probability measures (Language Model Probability, Translation Model Probability, and various combinations of these two probabilities). In this case, the file decoder.cfg might look like this:

# This is a sample decoder config file (~/decoder.cfg)

LanguageModelFile = ~/language-models/english.binlm
# this saves us the --lmfile switch

TranslationModelConfigFile = ~/decoders/
# this saves us the --tmfile switch

PrintAlignment     = true
PrintProbabilities = true

Running a Translation Server

You can run the decoder as a translation server, e.g. with --server --port=<port-number> --server-lock-file="/tmp/tserver"

This starts the server in the background (it forks into a child process and then exits). At the same time, a server lock file /tmp/tserver.<PID>.lock is created. This file contains the port number, the process ID of the "master" translation server (which forks for every request), and the translation and language model used. To stop the server, kill <PID>. Using kill -9 to stop the server is not recommended. The input format for the translation server is again XML. A simple client written in Perl is available.
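Since the process ID is part of the lock file name, a script can recover it and stop the server with a regular termination signal. The following Python sketch assumes the naming pattern <name>.<PID>.lock shown above; pid_from_lock_file and stop_server are hypothetical helpers, and the contents of the lock file are not parsed because their exact layout is not specified here.

```python
import os
import re
import signal


def pid_from_lock_file(path):
    """Extract <PID> from a lock file name of the form <name>.<PID>.lock."""
    match = re.search(r"\.(\d+)\.lock$", os.path.basename(path))
    if match is None:
        raise ValueError("not a server lock file name: %s" % path)
    return int(match.group(1))


def stop_server(lock_path):
    """Send SIGTERM (what a plain 'kill <PID>' sends), not SIGKILL (kill -9)."""
    os.kill(pid_from_lock_file(lock_path), signal.SIGTERM)


print(pid_from_lock_file("/tmp/tserver.12345.lock"))
```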

Decoder Options

in decoder.cfg  on the command line       Comments
                --config=<file>           specify the location of the decoder config file
                --debug[=<0-3>], -d<0-3>  debugging mode; print diagnostic messages at various levels of detail
                --quiet, -q               don't print the start screen, processing time or anything like that

in decoder.cfg  on the command line  Comments
                --about              print info about the decoder (authors etc.)
                --copyright          print copyright details
                --help, -h           print help screen
                --version, -v        print version number

in decoder.cfg                     on the command line                      Comments
EvalAlignments=<true|false>        --eval[=<true|false>]                    evaluate given alignments
    input format:
      • first line: translation source (=NCM target) sentence (space-separated)
      • second line: translation target (=NCM source) sentence (space-separated)
      • third line: alignment mapping each word of the translation source to the corresponding position in the translation target
    example:
      je ne comprend pas .
      i do not understand .
      1 3 4 3 5
Gloss1=<true|false>                --gloss1[=<true|false>]                  gloss maximizing P(translation source|translation target)
Gloss2=<true|false>                --gloss2[=<true|false>]                  gloss maximizing P(translation source|translation target)*P(translation target|translation source)
Gloss3=<true|false>                --gloss3[=<true|false>]                  gloss maximizing P(translation target|translation source)
Greedy1=<true|false>               --greedy1[=<true|false>], -g1            fast greedy decoding
Greedy2=<true|false>               --greedy2[=<true|false>], -g2            thorough greedy decoding
TMWeight=<number>                  --tmweight=<number>                      specify the translation model weight (value must be greater than zero and less than 1)
LanguageModelFile=<file>           --lmfile=<file>                          specify language model (.binlm) file
LmBeginSentenceMarker=<string>     --lmbsm=<string>                         specify begin sentence marker used in the language model (default is <s>)
LmEndSentenceMarker=<string>       --lmesm=<string>                         specify end sentence marker used in the language model (default is </s>)
NgramSize                          --ngram-size=<n>                         specify the ngram size for the language model; by default, the largest ngram-size that the LM can handle is used
TranslationModelConfigFile=<file>  --tmfile=<file>                          specify the location of the Translation Model Config File
TM_BinDataDir=<directory>          --tm-bindatadir=<directory>              specify the directory for the binary version of the data files
                                   --zero-fert-file=<file>                  specify the location of the list of words of fertility zero; overrides the parameter FZeroWords in the translation model config file.
MaxSwapDistance                    --max-swap-distance=<n>, --msd=<n>       maximum distance between words or segments being swapped in greedy decoding
MaxSwapSegmentSize                 --max-swap-segment-size=<n>, --msss=<n>  maximum size of segments being swapped in greedy decoding
MaxTimePerSentence                 --max-time-per-sentence=<n>, --mtps=<n>  don't spend more than n seconds on decoding any particular sentence
MaxZeroFertWords                   --zfmax=<n>                              don't consider more than the top n zero fertility words for insertion during decoding. Long lists of zero fertility words drive up decoding times significantly without really improving the overall performance. Empirically, n = 128 is a pretty good bet.
RandomSearches                     --random-searches=<n>                    specify the number of additional searches from randomized starting points
SkipUnknownWords                   --skip-unknown-words=<true|false>        Skip unknown words in the input (that is, align them with NULL). The default behavior is to pass them through as-is.
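The three-line input format for evaluating alignments (EvalAlignments / --eval) can be illustrated with a short Python sketch that pairs each word of the translation source with the word at the given position in the translation target. The function name read_alignment is hypothetical. Positions are taken to be 1-based, as the example suggests; treating 0 as "aligned with NULL" is an assumption not stated in this manual.

```python
def read_alignment(source_line, target_line, alignment_line):
    """Pair each translation-source word with its aligned translation-target word."""
    source = source_line.split()
    target = target_line.split()
    positions = [int(p) for p in alignment_line.split()]
    if len(positions) != len(source):
        raise ValueError("need exactly one position per source word")
    pairs = []
    for word, pos in zip(source, positions):
        # Assumption: positions are 1-based and 0 stands for NULL (no target word).
        pairs.append((word, target[pos - 1] if pos > 0 else None))
    return pairs


pairs = read_alignment(
    "je ne comprend pas .",
    "i do not understand .",
    "1 3 4 3 5",
)
for source_word, target_word in pairs:
    print(source_word, "->", target_word)
```

In the example, both "ne" and "pas" point to position 3 ("not"), and no source word points to position 2 ("do").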

in decoder.cfg                            on the command line                 Comments
PrintAll=<true|false>                     --print-all=<true|false>            set all print options below to true unless specified otherwise on the command line
PrintAlignment=<true|false>               --print-alignment=<true|false>      print alignment information with the translation (implies --print-original=true)
PrintOriginal=<true|false>                --print-original=<true|false>       print the input sentence together with the translation
PrintProbabilities=<true|false>           --print-probabilities=<true|false>  print probabilities
RandSearchOutputMode=<all|best|shortest>  --rsprint=<all|best|shortest>       when using randomized searches, print all / only the best / only the shortest translation
TraceGreedy=<true|false>                  --trace-greedy=<true|false>         trace search history (improvements only) of the final alignments
XmlCharEscapes                            --xml-char-escapes=<true|false>     escape characters with special function in XML in the output (default is 'no')

LOGGING OPTIONS (produce information about the progress of the decoding)
in decoder.cfg      on the command line   Comments
                    --log=<true|false>    turn logging on or off. Note: In the decoder config file, set LogFile to the name of the destination file (or '-' for STDOUT) to turn logging on, leave it empty to turn logging off.
LogFile=<file>      --logfile=<file>      specify destination for the logging output; '-' means: write to STDOUT
LogLM=<true|false>  --logLM=<true|false>  include language model look-ups in the log
LogTM=<true|false>  --logTM=<true|false>  include translation model look-ups in the log
                    --nologLM             equivalent to --logLM=false
                    --nologTM             equivalent to --logTM=false

in decoder.cfg         on the command line        Comments
Port=<number>          --port=<number>            specify the port number that the server should use
                       --server                   run the decoder as a TCP server
ServerLockFile=<file>  --server-lock-file=<file>  name of the server lock file (default: <config-file>.lock)