ISI ReWrite Decoder User's Manual Version 1.0.0a for Public Release 1.0.0a
Related Papers
What's New?
What you need
Notes on Language Model Training
The Translation Model Config File |
The Translation Model Config File contains information about the translation model. It holds one parameter-value pair per line. Parameters and values are separated by =. Comments are introduced by # and run to the end of the line.
The following example also explains the meaning of the various (mandatory!) parameters.
# This is a sample config file for the translation model for a French (fra) to English (eng) system
TM_RawDataDir = /the/path/to/the/translation/model/files
# TM_RawDataDir points to the location where Giza put its output.
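The file format described above (one parameter = value pair per line, # for comments) can be read with a few lines of Python. This is only a sketch of the format, not the decoder's own parser; the function name parse_config is hypothetical, and it assumes that # never occurs inside a value.

```python
def parse_config(text):
    """Parse 'Parameter = value' lines; '#' starts a comment (assumed never to occur inside a value)."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop the comment part, trim whitespace
        if not line:
            continue  # blank line or comment-only line
        if "=" not in line:
            raise ValueError(f"expected 'parameter = value', got: {raw_line!r}")
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


sample = """\
# This is a sample config file for the translation model
TM_RawDataDir = /the/path/to/the/translation/model/files
# TM_RawDataDir points to the location where Giza put its output.
"""

config = parse_config(sample)
print(config["TM_RawDataDir"])
```

The same function would also read the decoder config file, since that file uses the same parameter = value layout.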
Notes:
Input Format
The decoder expects the input in XML format. Minimally, an input file looks like this:
<?xml version="1.0" encoding="latin-1"?>
Note that the input must be proper XML! Something like <s id=1> is not a proper XML tag: attribute values must be quoted, as in <s id="1">.
For your convenience, the distribution contains a small xml-wrapper for one-sentence-per-line input.
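The wrapper shipped with the distribution is not reproduced here. As a rough illustration of what such a wrapper has to do, the following Python sketch turns one-sentence-per-line text into XML with quoted id attributes and escaped special characters. The function name wrap_sentences and the root element name doc are assumptions made for this example; the root element the decoder actually expects is not shown in this manual.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def wrap_sentences(lines, root="doc"):
    """Wrap each non-empty line in <s id="N">...</s>; the root element name is an assumption."""
    out = ['<?xml version="1.0" encoding="latin-1"?>', f"<{root}>"]
    sentence_id = 0
    for line in lines:
        sentence = line.strip()
        if not sentence:
            continue  # skip blank lines
        sentence_id += 1
        # quoteattr() adds the quotes that XML requires; escape() handles &, < and >
        out.append(f"<s id={quoteattr(str(sentence_id))}>{escape(sentence)}</s>")
    out.append(f"</{root}>")
    return "\n".join(out)


xml_text = wrap_sentences(["je ne comprend pas .", "", "A & B"])
print(xml_text)

# Round-trip check: the result must be parseable as XML.
parsed = ET.fromstring(xml_text.encode("latin-1"))
sentences = [s.text for s in parsed]
```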
Between the <s> tags, the sentence can be marked up with other XML tags; if the markup amounts to a tree structure (e.g. <a><b>one two</b><c>three four</c></a>), only siblings within the tree will be swapped, that is "one" and "two", "three" and "four", and "one two" and "three four" in the example, but not, for example, "two" and "three".
With the attribute english you can tell the decoder how to translate certain items. <ENAMEX english="Kohl">Kohl</ENAMEX>, for example, will force the decoder to translate Kohl (the former German Chancellor) as Kohl (instead of cabbage) in English; the actual tag name is irrelevant. Multiple options can be provided by separating them with a vertical bar, e.g. 'english="April 7|1 Apr|04/07"'. The decoder then picks the option that best fits the given language model context.
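To check which translations a marked-up sentence forces before handing it to the decoder, the english attributes can be listed with Python's standard XML parser. This is a sketch for inspecting input, not part of the decoder; the function name forced_translations, the tag name x and the source phrase "7 avril" are made up for the example.

```python
import xml.etree.ElementTree as ET


def forced_translations(sentence_xml):
    """Return (source text, [translation options]) for every element with an english attribute."""
    root = ET.fromstring(sentence_xml)
    forced = []
    for node in root.iter():  # document order; the tag name itself is ignored
        options = node.get("english")
        if options is not None:
            source = " ".join(" ".join(node.itertext()).split())
            forced.append((source, options.split("|")))  # '|' separates alternatives
    return forced


sentence = (
    '<s id="1"><ENAMEX english="Kohl">Kohl</ENAMEX> '
    '<x english="April 7|1 Apr|04/07">7 avril</x></s>'
)
result = forced_translations(sentence)
for source, options in result:
    print(source, "->", options)
```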
The attribute allow_swapping="no" prevents the swapping of the immediate children of the respective node in the XML structure. The children of those children can be swapped with one another. This feature can be used to preserve parentheses: <paren allow_swapping="no"><lp english="(">(</lp><middle> this is a parenthesis</middle><rp english=")">)</rp></paren> will keep "(" and ")" as brackets around the content of the parenthesis, while the words of the parenthesis can be swapped around within the scope of the parenthesis.
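The sibling rule and the effect of allow_swapping="no" can be made concrete by listing which pairs of units are candidates for swapping in a given markup tree. The Python sketch below only enumerates pairs according to the rule as stated above; it says nothing about how the decoder searches or about limits such as the maximum swap distance. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def span_text(node):
    """All words covered by a node, joined by single spaces."""
    return " ".join(" ".join(node.itertext()).split())


def child_units(node):
    """Immediate children of a node in order: plain words and child elements."""
    units = [(word, None) for word in (node.text or "").split()]
    for child in node:
        units.append((span_text(child), child))
        units.extend((word, None) for word in (child.tail or "").split())
    return units


def swappable_pairs(node):
    """Pairs of siblings that may be swapped, honoring allow_swapping="no"."""
    units = child_units(node)
    pairs = []
    if node.get("allow_swapping", "yes") != "no":
        pairs.extend((a[0], b[0]) for a, b in combinations(units, 2))
    for _, child in units:
        if child is not None:  # children of children are handled recursively
            pairs.extend(swappable_pairs(child))
    return pairs


tree_pairs = swappable_pairs(
    ET.fromstring("<a><b>one two</b><c>three four</c></a>")
)
paren_pairs = swappable_pairs(ET.fromstring(
    '<paren allow_swapping="no"><lp english="(">(</lp>'
    '<middle> this is a parenthesis</middle><rp english=")">)</rp></paren>'
))
print(tree_pairs)
print(paren_pairs)
```

For the first tree this yields the three pairs named in the text ("one two" with "three four", "one" with "two", "three" with "four"). For the parenthesis example, no pair involves "(" or ")", while the four words inside middle can still be paired with one another.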
Decoder Config File (optional)
Suppose, for example, that you always want to run the decoder with the same translation model and print out the three best translations for each input sentence together with various probability measures (Language Model Probability, Translation Model Probability, and various combinations of these two probabilities). In this case, the file decoder.cfg might look like this:
# This is a sample decoder config file (~/decoder.cfg)
LanguageModelFile = ~/language-models/english.binlm # this saves us the --lmfile switch
TranslationModelConfigFile = ~/decoders/fra-eng.1.tm-config # this saves us the --tmfile switch
PrintAlignment = true
PrintProbabilities = true
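Each setting in the sample file stands in for a command-line switch. The following Python sketch spells out that correspondence for the four settings used above, with the switch names taken from the Decoder Options tables in this manual. The names SWITCH_FOR and to_switches are hypothetical, and the sketch deliberately does not guess the name of the decoder executable.

```python
# Config parameter -> command-line switch, as listed in the Decoder Options tables.
SWITCH_FOR = {
    "LanguageModelFile": "--lmfile",
    "TranslationModelConfigFile": "--tmfile",
    "PrintAlignment": "--print-alignment",
    "PrintProbabilities": "--print-probabilities",
}


def to_switches(settings):
    """Translate config settings into switches; raises KeyError for parameters not in SWITCH_FOR."""
    return [f"{SWITCH_FOR[name]}={value}" for name, value in settings.items()]


settings = {
    "LanguageModelFile": "~/language-models/english.binlm",
    "TranslationModelConfigFile": "~/decoders/fra-eng.1.tm-config",
    "PrintAlignment": "true",
    "PrintProbabilities": "true",
}
switches = to_switches(settings)
print(" ".join(switches))
```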
Running a Translation Server
Decoder Options
GENERAL OPTIONS | ||
in decoder.cfg | on the command line | Comments |
---|---|---|
| --config=<file> | specify the location of the decoder config file |
| --debug[=<0-3>], -d<0-3> | debugging mode; print diagnostic messages at various levels of detail |
| --quiet, -q | don't print the start screen, processing time or anything like that |
HELP AND INFO OPTIONS | ||
in decoder.cfg | on the command line | Comments |
| --about | print info about the decoder (authors etc.) |
| --copyright | print copyright details |
| --help, -h | print help screen |
| --version, -v | print version number |
DECODING PARAMETERS -- DECODING ALGORITHMS | ||
in decoder.cfg | on the command line | Comments |
EvalAlignments=<true|false> | --eval[=<true|false>] | evaluate given alignments; input format: je ne comprend pas . i do not understand . 1 3 4 3 5 |
Gloss1=<true|false> | --gloss1[=<true|false>] | gloss maximizing P(translation source|translation target) |
Gloss2=<true|false> | --gloss2[=<true|false>] | gloss maximizing P(translation source|translation target)*P(translation target|translation source) |
Gloss3=<true|false> | --gloss3[=<true|false>] | gloss maximizing P(translation target|translation source) |
Greedy1=<true|false> | --greedy1[=<true|false>], -g1 | fast greedy decoding |
Greedy2=<true|false> | --greedy2[=<true|false>], -g2 | thorough greedy decoding |
TMWeight=<number> | --tmweight=<number> | specify the translation model weight (value must be greater than zero and less than 1) |
DECODING PARAMETERS -- LANGUAGE MODEL OPTIONS | ||
LanguageModelFile=<file> | --lmfile=<file> | specify language model (.binlm) file |
LmBeginSentenceMarker=<string> | --lmbsm=<string> | specify begin sentence marker used in the language model (default is <s>) |
LmEndSentenceMarker=<string> | --lmesm=<string> | specify end sentence marker used in the language model (default is </s>) |
NgramSize | --ngram-size=<n> | specify the ngram size for the language model; by default, the largest ngram-size that the LM can handle is used |
DECODING PARAMETERS -- TRANSLATION MODEL OPTIONS | ||
TranslationModelConfigFile=<file> | --tmfile=<file> | specify the location of the Translation Model Config File |
TM_BinDataDir=<directory> | --tm-bindatadir=<directory> | specify the directory for the binary version of the data files |
| --zero-fert-file=<file> | specify the location of the list of words of fertility zero; overrides the parameter FZeroWords in the translation model config file. |
DECODING PARAMETERS -- SEARCH SPACE OPTIONS | ||
MaxSwapDistance | --max-swap-distance=<n>, --msd=<n> | maximum distance between words or segments being swapped in greedy decoding |
MaxSwapSegmentSize | --max-swap-segment-size=<n>, --msss=<n> | maximum size of segments being swapped in greedy decoding |
MaxTimePerSentence | --max-time-per-sentence=<n>, --mtps=<n> | don't spend more than n seconds on decoding any particular sentence |
MaxZeroFertWords | --zfmax=<n> | don't consider more than the top n zero fertility words for insertion during decoding. Long lists of zero fertility words drive up decoding times significantly without really improving the overall performance. Empirically, n = 128 is a pretty good bet. |
RandomSearches | --random-searches=<n> | specify the number of additional searches from randomized starting points |
SkipUnknownWords | --skip-unknown-words=<true|false> | Skip unknown words in the input (that is, align them with NULL). The default behavior is to pass them through as-is. |
OUTPUT OPTIONS | ||
in decoder.cfg | on the command line | Comments |
PrintAll=<true|false> | --print-all=<true|false> | set all print options below to true unless specified otherwise on the command line |
PrintAlignment=<true|false> | --print-alignment=<true|false> | print alignment information with the translation (implies --print-original=true) |
PrintOriginal=<true|false> | --print-original=<true|false> | print the input sentence together with the translation |
PrintProbabilities=<true|false> | --print-probabilities=<true|false> | print probabilities |
RandSearchOutputMode=<all|best|shortest> | --rsprint=<all|best|shortest> | when using randomized searches, print all / only the best / only the shortest translation |
TraceGreedy=<true|false> | --trace-greedy=<true|false> | trace search history (improvements only) of the final alignments |
XmlCharEscapes | --xml-char-escapes=<true|false> | escape characters with special function in XML in the output (default is 'no') |
LOGGING OPTIONS (produce information about the progress of the decoding) | ||
in decoder.cfg | on the command line | Comments |
| --log=<true|false> | turn logging on or off. Note: In the decoder config file, set LogFile to the name of the destination file (or '-' for STDOUT) to turn logging on; leave it empty to turn logging off. |
LogFile=<file> | --logfile=<file> | specify destination for the logging output; '-' means: write to STDOUT |
LogLM=<true|false> | --logLM=<true|false> | include language model look-ups in the log |
LogTM=<true|false> | --logTM=<true|false> | include translation model look-ups in the log |
| --nologLM | equivalent to --logLM=false |
| --nologTM | equivalent to --logTM=false |
TRANSLATION SERVER OPTIONS | ||
in decoder.cfg | on the command line | Comments |
Port=<number> | --port=<number> | specify the port number that the server should use |
| --server | run the decoder as a TCP server |
ServerLockFile=<file> | --server-lock-file | name of the server lock file (default: <config-file>.lock) |