USC Information Sciences Institute
Natural Language Group

Alignments of the LDC Hong Kong News Corpus

Release 8/2001

Ulrich Germann, Editor

Description

This package contains alignment data for the LDC Hong Kong News Corpus. It essentially consists of a data file and a Perl script that allows you to restore the aligned data provided that you have the original LDC data. The package does not provide any actual text; it is completely useless without the data from the LDC.

The alignment was performed in the following manner:

  1. Both sides of the corpus were split into sentences; the English side with Adwait Ratnaparkhi's MxTerminator, the Chinese side with a rule-based sentence-splitter for Chinese written by Ulrich Germann.

  2. The sentence-split text pairs were then aligned with an implementation of the Gale and Church Algorithm (however, using the number of words on the English side as a length measure rather than the number of characters). The alignment utility was written by Yaser Al-Onaizan.

  3. The alignments were split into one training set (roughly 80% of the data), two sets for development testing (5% each), and two sets for final evaluation (5% each).
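
To make step 2 more concrete, here is a minimal sketch of length-based sentence alignment by dynamic programming, in the spirit of Gale and Church. It is not the alignment utility used for this release. The bead penalties, the cost function (a normalized length difference rather than the probabilistic model of the original algorithm), and the names `BEADS`, `bead_cost`, and `align` are all illustrative assumptions. The default ratio of Chinese characters per English word is taken from the approximate totals under Data Volume (10,000,000 / 6,000,000).

```python
import math

# (English sentences, Chinese sentences, fixed penalty) for each bead type.
# The penalty values are illustrative assumptions, not tuned parameters.
BEADS = [(1, 1, 0.0), (2, 1, 1.0), (1, 2, 1.0), (2, 2, 2.0), (1, 0, 4.0), (0, 1, 4.0)]


def bead_cost(en_words, zh_chars, ratio):
    """Normalized mismatch between expected and observed Chinese length."""
    expected = en_words * ratio
    return abs(expected - zh_chars) / math.sqrt(max(expected + zh_chars, 1.0))


def align(en_lens, zh_lens, ratio=10 / 6):
    """Align sentences given English lengths in words and Chinese lengths in characters.

    Returns a list of beads; each bead is a pair of index lists
    (English sentence indices, Chinese sentence indices).
    """
    n, m = len(en_lens), len(zh_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward relaxation: cost[i][j] is final by the time cell (i, j) is visited.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + penalty + bead_cost(
                    sum(en_lens[i:ni]), sum(zh_lens[j:nj]), ratio
                )
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Trace the cheapest path back from the end of both texts.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

For example, `align([10, 5, 20], [25, 33])` groups the first two English sentences with the first Chinese sentence, because their combined length matches far better than either one alone.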

The alignment of the LDC Hong Kong News Corpus was performed as part of the ReWrite project under funding from the DARPA TIDES program.

Data Volume

                                               English          Chinese
Number of Words/Tokens (Chinese: Characters)   ca. 6,000,000    ca. 10,000,000
Number of Sentences                            ca. 345,000      ca. 293,000
Number of Parallel Chunks                      ca. 292,000

Disclaimer

The alignments are far from perfect and come without any express or implied warranty whatsoever. Use them at your own risk. There may or may not be updated versions in the future.

Contents of the Package

alignments.dat            data file with the alignment data
restore-alignments        Perl script to restore the alignments
repair-hc-table           Perl script to “repair” the hc conversion tables (see below)
big5hkscs2gb2312-preproc  pre-processor for code conversion
extract-pairs             extracts alignment pairs
README.html               this file

Installation Instructions

Below, lines that start with % or foreach? show commands that you type in; the % and foreach? are the shell prompt and are not part of the command. It is assumed that you are in some variant of csh (csh, tcsh, ntcsh).

  1. Make the LDC data available to the data restoration script

    In order to restore the data, you need a copy of the data exactly as it is on the LDC disk. For example:
    [Put CD with Hong Kong News Corpus into CD-ROM drive]
    % mount /mnt/cdrom
    
    In the following, we assume that the root of the LDC distribution is accessible as /mnt/cdrom. If you have a verbatim copy of the disk somewhere on your network, replace /mnt/cdrom with the path to that copy.

  2. Restore the alignments

    % gzip -dc ldc-hknc-alignments.8.2001.tar.gz | tar xf -
    % cd hk-news-alignments
    % perl restore-alignments alignments.dat /mnt/cdrom
    
    Note: If you run restore-alignments with the option -t (test mode), it does not recreate the aligned corpus but writes the output to STDOUT instead.

  3. Convert Big5-HKSCS to GB2312 (if desired)

    The script restores the Chinese text in the original Big5-HKSCS encoding. If you want GB encoding, you will have to convert the text yourself. The tool that I found to work best (though not perfectly) is hc-30. However, since the Hong Kong Supplementary Character Set (HKSCS) contains two-byte codes whose first byte is below the regular Big5 range, certain characters will not be converted correctly. You will get better conversion results if you replace such HKSCS characters with an “unknown” character. This package includes a pre-processor that performs these replacements.

    If you use the default conversion tables that come with hc-30 (hc.tab and hc-supp.tab) and the option -map all, a few conversions will not be unique. hc indicates this in its output by placing the alternatives in double angle brackets, e.g. ...<<ÈýÈþ>>.... I do not recommend using the option -map one, which always returns the first mapping found in the tables, because this will actually remove some line breaks, and line breaks are what indicate alignments in the parallel files. Instead, “repair” the mapping table(s) with the script repair-hc-table provided in this package and run hc with -map all.

    In the following it is assumed that hc is in your path.
    % cat /wherever/it/is/hc.tab /wherever/that/is/hc-supp.tab | perl repair-hc-table > my-hc.tab
    % foreach file (`find c-g-aligned -name "*.c-big5-hkscs"`)
    foreach? set gbfile = `echo $file | sed 's/big5-hkscs/gb2312/'`
    foreach? cat $file | perl big5hkscs2gb2312-preproc | hc -map all -m b2g -t my-hc.tab > $gbfile
    foreach? end
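
As a rough illustration of what the pre-processing step and a post-conversion sanity check might look like, here is a minimal sketch. It is not the code of big5hkscs2gb2312-preproc. The lead-byte threshold 0xA1, the choice of the white square as the “unknown” character, and the names `replace_low_hkscs` and `check_conversion` are assumptions made for illustration only.

```python
import re

# Assumption: regular Big5 lead bytes start at 0xA1; anything in 0x80-0xA0
# is treated as the start of an HKSCS extension character.
BIG5_FIRST_LEAD = 0xA1
# Hypothetical placeholder for "unknown": the white square, encoded in Big5.
UNKNOWN = "□".encode("big5")
# hc marks non-unique conversions as <<...>> when run with -map all.
AMBIGUOUS = re.compile(rb"<<.*?>>")


def replace_low_hkscs(data: bytes) -> bytes:
    """Replace two-byte codes whose lead byte is below the Big5 range."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80 or i + 1 >= len(data):
            # ASCII (including line breaks), or a dangling final byte: copy as is.
            out.append(b)
            i += 1
        elif b < BIG5_FIRST_LEAD:
            # Lead byte below the regular Big5 range: substitute the placeholder.
            out += UNKNOWN
            i += 2
        else:
            # Regular two-byte Big5 character: copy both bytes unchanged.
            out += data[i:i + 2]
            i += 2
    return bytes(out)


def check_conversion(big5_data: bytes, gb_data: bytes):
    """Return (line counts match, number of <<...>> markers left in the GB text)."""
    same_lines = big5_data.count(b"\n") == gb_data.count(b"\n")
    return same_lines, len(AMBIGUOUS.findall(gb_data))
```

Since line breaks carry the alignment, comparing the line counts of each .c-big5-hkscs file and its converted counterpart is a cheap way to notice a conversion that silently merged segments.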
    
  4. Extract alignment pairs

    The files restored by restore-alignments include segments that have no counterpart in the corresponding track of the bitext or that were not properly aligned (zero-to-one-or-many, one-or-many-to-zero alignments). The script extract-pairs produces a set of parallel files from which such segments have been removed.

    Usage:

    % perl extract-pairs c-g-aligned .e .c-gb2312 alignment-pairs
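
The filtering idea can be sketched as follows. This is not the code of extract-pairs, and the layout of the restored files is an assumption here: the sketch supposes that the two files of a pair are line-parallel and that a segment without a counterpart shows up as an empty line on the other side. The function name `extract_pairs` is likewise hypothetical.

```python
def extract_pairs(english_lines, chinese_lines):
    """Keep only line pairs in which both sides are non-empty.

    Assumes line i of the English file corresponds to line i of the
    Chinese file, and that an unaligned segment faces an empty line.
    """
    if len(english_lines) != len(chinese_lines):
        raise ValueError("files are not line-parallel")
    kept = [
        (e, c)
        for e, c in zip(english_lines, chinese_lines)
        if e.strip() and c.strip()
    ]
    # Split the surviving pairs back into two parallel lists.
    return [e for e, _ in kept], [c for _, c in kept]
```

The two returned lists have the same length by construction, so they can be written out as a new pair of parallel files.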
    

Download

You can download the tarball here.