| USC Information Sciences Institute | Natural Language Group |
This package contains alignment data for the LDC Hong Kong News Corpus. It essentially consists of a data file and a Perl script that allows you to restore the aligned data provided that you have the original LDC data. The package does not provide any actual text; it is completely useless without the data from the LDC.
The alignment was performed in the following manner:
The alignment of the LDC Hong Kong News Corpus was performed as part of the ReWrite project under funding from the DARPA TIDES program.
| | English | Chinese |
|---|---|---|
| Number of Words/Tokens (Chinese: Characters) | ca. 6,000,000 | ca. 10,000,000 |
| Number of Sentences | ca. 345,000 | ca. 293,000 |
| Number of Parallel Chunks | ca. 292,000 | |
The alignments are far from perfect and come without any expressed or implied warranty whatsoever. Use them at your own risk. There may or may not be updated versions in the future.
| alignments.dat | data file with the alignment data |
| restore-alignments | Perl script to restore the alignments |
| repair-hc-table | Perl script to repair the hc conversion tables (see below) |
| big5hkscs2gb2312-preproc | pre-processor for code conversion |
| extract-pairs | extracts alignment pairs |
| README.html | this file |
[Put the CD with the Hong Kong News Corpus into the CD-ROM drive]
% mount /mnt/cdrom

In the following, we assume that the root of the LDC distribution is accessible as /mnt/cdrom. If you have a verbatim copy of the disk somewhere on your network, replace /mnt/cdrom with the path to that copy.
% gzip -dc ldc-hknc-alignments.8.2001.tar.gz | tar xf -
% cd hk-news-alignments
% perl restore-alignments alignments.dat /mnt/cdrom

Note: If you run restore-alignments with the option -t (test mode), it does not recreate the aligned corpus but writes the output to STDOUT instead.
The script restores the Chinese text in the original Big5-HKSCS encoding. If you want GB encoding, you will have to convert the text yourself. The tool that I found to work best (though not perfectly) is hc-30. However, since the Hong Kong Supplementary Character Set (HKSCS) contains two-byte codes whose first byte is below the regular Big5 range, certain characters will not be converted correctly. You will get better conversion results if you first replace such HKSCS characters with an unknown character. This package includes a pre-processor that performs these replacements.
If you use the default conversion tables that come with hc-30 (hc.tab and hc-supp.tab) together with the option -map all, a few conversions will not be unique. hc indicates this in its output by placing the alternatives in double angle brackets, e.g. ...<<三叁>>.... I do not recommend using the option -map one, which always returns the first mapping found in the tables, because it will actually remove some line breaks, and line breaks are what indicate the alignments in the parallel files. Instead, repair the mapping table(s) with the script repair-hc-table provided in this package and run hc with -map all.
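Before relying on a converted file, it can be useful to check whether hc left any unresolved alternatives in it. The following is a minimal sketch in POSIX sh (not csh). It assumes that hc marks alternatives with `<<` as described above and that this sequence does not otherwise occur in the text. The function name count_ambiguous is made up for this illustration and is not part of the package.

```shell
# count_ambiguous FILE
# Print the number of lines in FILE that contain the "<<" marker which hc
# uses for non-unique conversions. (Hypothetical helper, not part of the package.)
count_ambiguous() {
    # LC_ALL=C makes grep work on raw bytes, so GB-encoded text is not
    # misread under a UTF-8 locale. grep -c prints the count of matching
    # lines; it exits with status 1 when the count is 0, so that status
    # is neutralised with "|| :".
    LC_ALL=C grep -c '<<' "$1" || :
}
```

Called as `count_ambiguous some-file.c-gb2312` (the file name is only an example), it prints 0 when no marked alternatives remain in the file.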
In the following it is assumed that hc is in your path.

% cat /wherever/it/is/hc.tab /wherever/that/is/hc-supp.tab | perl repair-hc-table > my-hc.tab
% foreach file (`find c-g-aligned -name "*.c-big5-hkscs"`)
foreach? set gbfile = `echo $file | sed 's/big5-hkscs/gb2312/'`
foreach? cat $file | perl big5hkscs2gb2312-preproc | hc -map all -m b2g -t my-hc.tab > $gbfile
foreach? end
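Because line breaks mark the alignments, a converted file should have exactly as many lines as its Big5-HKSCS source. The following sketch in POSIX sh (not csh) lists the files for which this does not hold. It assumes the file naming used in the conversion loop above. The function names gb_name, same_line_count and check_converted are hypothetical helpers, not part of the package.

```shell
# gb_name FILE
# Derive the name of the GB2312 file with the same sed substitution that
# the conversion loop uses.
gb_name() {
    printf '%s\n' "$1" | sed 's/big5-hkscs/gb2312/'
}

# same_line_count A B
# Succeed if files A and B have the same number of lines.
# Return 2 if either file is missing or unreadable.
same_line_count() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # Some wc implementations pad their output with blanks; strip them.
    a=$(wc -l < "$1" | tr -d '[:space:]')
    b=$(wc -l < "$2" | tr -d '[:space:]')
    [ "$a" -eq "$b" ]
}

# check_converted DIR
# Print every *.c-big5-hkscs file under DIR whose GB2312 counterpart is
# missing or has a different number of lines. No output means no mismatch.
check_converted() {
    find "$1" -name '*.c-big5-hkscs' | while IFS= read -r big5; do
        same_line_count "$big5" "$(gb_name "$big5")" || printf '%s\n' "$big5"
    done
}
```

Running `check_converted c-g-aligned` after the conversion should print nothing if every file kept its line breaks. Any file it lists is worth inspecting before extracting pairs from it.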
% perl extract-pairs c-g-aligned .e .c-gb2312 alignment-pairs