Aligned Hansards of the 36th Parliament of Canada

Release 2001-1a

– provided by the Natural Language Group of the USC Information Sciences Institute –

Ulrich Germann, Editor


What it is

This release contains 1.3 million pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament. Samples are available online.

Terms of Use

Copyright of the Hansards

The copyright on the Hansards is held by the Canadian Parliament. The FAQ on the web site of the Canadian Parliament states:
The publishing of the official publications of the House of Commons is governed by the law of parliamentary privilege, by which the House of Commons has the right to control the publication of its proceedings. It may be used without seeking the permission of the Speaker of the House of Commons provided that it is accurately reproduced and that it does not offend the dignity of the House of Commons or one of its Members. Reproduction of the material is permitted in whole or in part, and by any means.

How The Alignment Was Performed

  1. Raw Text Extraction: The raw text was extracted from the HTML files available at http://www.parl.gc.ca with a Perl script that uses the HTML::Parser module by Gisle Aas / Michael A. Chase.
  2. Sentence Splitting: The raw text was split into sentences with Adwait Ratnaparkhi's MXTERMINATOR. For details on MXTERMINATOR, see http://www.cis.upenn.edu/~adwait/statnlp.html.
  3. Text Alignment: The sentences were aligned with I. Dan Melamed's GSA tool. See http://www.cis.upenn.edu/~melamed/.GSA/README.htm for details on GSA.
  4. One-to-one correspondences were then extracted from the aligned data.
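
Step 4 can be illustrated with a minimal Python sketch. The representation of an alignment unit used here (a pair of sentence lists, called a "bead") is a hypothetical one chosen for illustration; it is not the actual output format of GSA, and the function name is likewise an assumption.

```python
from typing import List, Tuple

# Hypothetical alignment unit: the English and the French sentences
# that the aligner grouped together (NOT the real GSA output format).
Bead = Tuple[List[str], List[str]]


def one_to_one(beads: List[Bead]) -> List[Tuple[str, str]]:
    """Keep only beads that pair exactly one English with one French sentence."""
    return [(en[0], fr[0]) for en, fr in beads if len(en) == 1 and len(fr) == 1]


beads = [
    (["I thank the hon. member."], ["Je remercie le député."]),   # 1-to-1: kept
    (["Order, please.", "Order."], ["À l'ordre."]),               # 2-to-1: dropped
    (["The House resumed."], []),                                 # 1-to-0: dropped
]
pairs = one_to_one(beads)
```

Only the first bead survives, which is why the released files can have gaps relative to the original transcripts.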

Coverage

The complete Hansards of the debates in the House and Senate of the 36th Canadian Parliament, as far as available, were aligned. The corpus was then split into five sets of sentence pairs: one training set (80% of the sentence pairs), two testing sets (5% each), and two sets for final evaluation (5% each). The current release consists of the training and testing sets. The evaluation sets are reserved for future MT evaluation and are not currently available.

Table 1 shows the size of the various subcorpora:

 
Table 1: Corpus Size

                   House Debates                  Senate Debates
            Sentence   English    French    Sentence   English    French
               Pairs     Words     Words       Pairs     Words     Words
training       948 K  14,614 K  15,657 K       182 K   3,030 K   3,249 K
testing 1       62 K     995 K   1,067 K        13 K     222 K     237 K
testing 2       60 K     758 K     806 K        12 K     211 K     226 K
eval 1*         63 K     975 K   1,048 K        13 K     204 K     216 K
eval 2*         62 K     969 K   1,039 K        12 K     211 K     225 K
released     1,070 K  16,366 K  17,530 K       208 K   3,464 K   3,712 K
Total*       1,195 K  18,311 K  19,618 K       233 K   3,879 K   4,153 K

* not released

File Format

The release consists of sets of corresponding files (*.e and *.f) that contain one sentence per line, so that sentence No. n in the file xyz.e corresponds to sentence No. n in the file xyz.f. To save space, each file is gzipped individually.

For each day of parliamentary sessions, there is usually one file per language. Occasionally, there are several.
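
As an illustration of the line-by-line correspondence, the following Python sketch reads one pair of gzipped files in parallel. The file names in the usage note (xyz.e.gz, xyz.f.gz), the function name, and the latin-1 encoding are assumptions, not facts stated by this release; adjust them to the files you actually have.

```python
import gzip
from itertools import zip_longest
from typing import Iterator, Tuple


def read_pairs(english_path: str, french_path: str,
               encoding: str = "latin-1") -> Iterator[Tuple[str, str]]:
    """Yield (English, French) sentence pairs from two parallel gzipped files.

    The encoding default is an assumption; change it if decoding fails.
    """
    with gzip.open(english_path, "rt", encoding=encoding) as e_file, \
         gzip.open(french_path, "rt", encoding=encoding) as f_file:
        # zip_longest pads the shorter file with None, so a length
        # mismatch is reported instead of being silently truncated.
        for n, (en, fr) in enumerate(zip_longest(e_file, f_file), start=1):
            if en is None or fr is None:
                raise ValueError(f"files differ in length at line {n}")
            yield en.rstrip("\n"), fr.rstrip("\n")
```

A call such as read_pairs("xyz.e.gz", "xyz.f.gz") then yields sentence No. n of the English file together with sentence No. n of the French file.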

Caveats

  1. This release contains only sentence pairs. Even though the order of the sentences is the same as in the original, there may be gaps resulting from many-to-one, many-to-many, or one-to-many alignments that were filtered out. Therefore, this release may not be suitable for discourse-related research. Table 2 shows the total number of sentences on the English and French sides and the number of sentence pairs extracted.

  2. Neither the sentence splitting nor the alignments are perfect. In particular, watch out for pairs that differ considerably in length. You may want to filter these out before you do any statistical training.
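
One simple way to filter out pairs that differ considerably in length is a token-count ratio test, sketched below in Python. The whitespace tokenization, the function name, and the threshold of 2.0 are illustrative assumptions, not values recommended by this release; tune them on your own data.

```python
from typing import List, Tuple


def plausible_length(en: str, fr: str, max_ratio: float = 2.0) -> bool:
    """Return True if neither side is empty and the longer side has at
    most max_ratio times as many whitespace-separated tokens as the shorter."""
    n_en, n_fr = len(en.split()), len(fr.split())
    if n_en == 0 or n_fr == 0:
        return False
    return max(n_en, n_fr) / min(n_en, n_fr) <= max_ratio


pairs: List[Tuple[str, str]] = [
    # 8 vs. 7 tokens: kept
    ("I thank the hon. member for his question.",
     "Je remercie le député de sa question."),
    # 1 vs. 12 tokens: dropped as a likely misalignment
    ("Agreed.",
     "Le vote porte sur la motion. Plaît-il à la Chambre de l'adopter?"),
]
kept = [p for p in pairs if plausible_length(*p)]
```

A pure ratio test is harsh on very short sentences, where a difference of one or two tokens already exceeds the threshold, so a looser limit for short pairs may be worth considering.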

Table 2: Total Number of Sentences in the Hansards

                 English    French     Pairs     Pairs
                                    released     total
House Debates    1,925 K   1,894 K   1,070 K   1,195 K
Senate Debates     281 K     276 K     208 K     233 K

Credits

We are very grateful to the Canadian people in general and to the Canadian Parliament in particular for making the Hansards available to the general public free of charge and without any restrictions that would hamper research. Their generosity is gratefully acknowledged.

We would also like to thank the authors of the aforementioned software tools for making them available to the research community. Without them the alignment of the Hansards would have been much more difficult, if not impossible. Their generosity, too, is gratefully acknowledged.

The alignment of the Hansards was performed as part of the ReWrite project under funding from the DARPA TIDES program.

Download The Corpus

"Installation"

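Unpack the tar balls one archive at a time (tar reads only one archive per invocation, so a bare *.tar glob does not work when several archives are present). In a Bourne-compatible shell:

% for f in *.tar; do tar xvf "$f"; done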