Aligned Hansards of the 36th Parliament of Canada
What it is
This release contains 1.3 million pairs of aligned text chunks
(sentences or smaller fragments) from the official records
(Hansards) of the 36th
Canadian Parliament.
Terms of Use
Copyright of the Hansards
The copyright on the Hansards is held by the Canadian Parliament.
The copyright statement
on the web site of the Canadian Parliament allows the use and reproduction of its content.
How The Alignment Was Performed
- Raw Text Extraction: The raw text was extracted from HTML
files (as available via http://www.parl.gc.ca) with
a Perl script that uses the
HTML::Parser module by Gisle Aas / Michael A. Chase.
- Sentence Splitting:
The raw text was split into sentences with Adwait Ratnaparkhi's
MXTERMINATOR. For details on MXTERMINATOR, see
http://www.cis.upenn.edu/~adwait/statnlp.html.
- Text Alignment:
The sentences were aligned with I. Dan Melamed's
GSA tool. See
http://www.cis.upenn.edu/~melamed/.GSA/README.htm
for details on GSA.
- One-to-one correspondences were then extracted from the aligned
data.
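The last step above can be sketched as follows. This is a minimal illustration, not the script used for this release: the representation of an alignment as a list of (English indices, French indices) tuples is an assumption made for the example and is not the GSA output format, and the sentences are made up.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation of one alignment link:
# (indices of English sentences, indices of French sentences).
Bead = Tuple[Sequence[int], Sequence[int]]


def extract_one_to_one(
    english: Sequence[str],
    french: Sequence[str],
    beads: Sequence[Bead],
) -> List[Tuple[str, str]]:
    """Keep only links that join exactly one English and one French sentence."""
    pairs = []
    for en_idx, fr_idx in beads:
        if len(en_idx) == 1 and len(fr_idx) == 1:
            pairs.append((english[en_idx[0]], french[fr_idx[0]]))
    return pairs


# Made-up example: the middle link is two-to-one and is dropped.
english = ["Hello.", "I agree.", "Thank you.", "Order."]
french = ["Bonjour.", "Je suis d'accord, merci.", "À l'ordre."]
beads = [([0], [0]), ([1, 2], [1]), ([3], [2])]
pairs = extract_one_to_one(english, french, beads)
```

Dropping the two-to-one link preserves sentence order but leaves a gap in the text, which is why the released pairs are not contiguous.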
Coverage
The complete Hansards of the debates in the House and Senate of the
36th Canadian Parliament, as far as available, were aligned.
The corpus was then split into five sets of sentence pairs:
training (80% of the sentence pairs), two sets of sentence pairs for
testing (5% each), and two sets of sentence
pairs for final evaluation (5% each). The current release consists
of the training and testing sets. The evaluation sets are reserved
for future MT evaluation purposes and are not currently available.
Table 1 shows the size of the various subcorpora:
Table 1: Corpus Size
| | House: Sentence Pairs | House: English Words | House: French Words | Senate: Sentence Pairs | Senate: English Words | Senate: French Words |
|---|---|---|---|---|---|---|
| training | 948 K | 14,614 K | 15,657 K | 182 K | 3,030 K | 3,249 K |
| testing 1 | 62 K | 995 K | 1,067 K | 13 K | 222 K | 237 K |
| testing 2 | 60 K | 758 K | 806 K | 12 K | 211 K | 226 K |
| eval 1* | 63 K | 975 K | 1,048 K | 13 K | 204 K | 216 K |
| eval 2* | 62 K | 969 K | 1,039 K | 12 K | 211 K | 225 K |
| released | 1,070 K | 16,366 K | 17,530 K | 208 K | 3,464 K | 3,712 K |
| Total* | 1,195 K | 18,311 K | 19,618 K | 233 K | 3,879 K | 4,153 K |
* not released
File Format
The release consists of sets of corresponding files
(*.e and *.f) that contain one sentence per line, so that
sentence No. n in the file xyz.e corresponds
to sentence No. n in the file xyz.f.
To save space, each file is gzipped individually.
For each day of parliamentary sessions, there is
usually one file per language. Occasionally, there are several.
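A minimal sketch of reading one file pair in parallel is given here. Two details are assumptions, not facts stated in this release: the `.gz` suffix on the file names and the Latin-1 text encoding. Adjust both to match the files you actually unpack. The demo writes a tiny made-up pair to a temporary directory so that the example runs without the corpus.

```python
import gzip
import tempfile
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, Tuple


def read_pairs(
    english_path: Path, french_path: Path, encoding: str = "latin-1"
) -> Iterator[Tuple[str, str]]:
    """Yield (English, French) sentences; line n of one file pairs with line n of the other."""
    with gzip.open(english_path, "rt", encoding=encoding) as en, \
            gzip.open(french_path, "rt", encoding=encoding) as fr:
        for en_line, fr_line in zip_longest(en, fr):
            if en_line is None or fr_line is None:
                raise ValueError("the two files differ in line count")
            yield en_line.rstrip("\n"), fr_line.rstrip("\n")


# Self-contained demo on made-up data (not taken from the corpus).
with tempfile.TemporaryDirectory() as tmp:
    en_path = Path(tmp) / "xyz.e.gz"  # assumed naming: xyz.e, gzipped
    fr_path = Path(tmp) / "xyz.f.gz"  # assumed naming: xyz.f, gzipped
    with gzip.open(en_path, "wt", encoding="latin-1") as f:
        f.write("Hello.\nThank you.\n")
    with gzip.open(fr_path, "wt", encoding="latin-1") as f:
        f.write("Bonjour.\nMerci.\n")
    demo_pairs = list(read_pairs(en_path, fr_path))
```

Raising an error on a line-count mismatch is a deliberate choice: a silent truncation would shift every later pair out of alignment.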
Caveats
- This release contains only sentence pairs. Even though the order
of the sentences is the same as in the original, there may be
gaps resulting from many-to-one, many-to-many, or one-to-many
alignments that were filtered out. Therefore, this release
may not be suitable for discourse-related research.
Table 2 shows the total number of sentences on the English and
French sides and the number of sentence pairs extracted.
- Neither the sentence splitting nor the alignments are perfect.
In particular, watch out for pairs that differ considerably
in length. You may want to filter these out before you do any
statistical training.
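One simple way to apply the length filter suggested above is a word-count ratio test. The sketch below is only an illustration: whitespace tokenization and the threshold of 2.0 are arbitrary assumptions, not values recommended by this release, and the example sentences are made up.

```python
from typing import Iterable, Iterator, Tuple


def filter_by_length_ratio(
    pairs: Iterable[Tuple[str, str]], max_ratio: float = 2.0
) -> Iterator[Tuple[str, str]]:
    """Yield pairs whose longer side has at most max_ratio times the words of the shorter."""
    for english, french in pairs:
        en_len = len(english.split())
        fr_len = len(french.split())
        if en_len == 0 or fr_len == 0:
            continue  # drop pairs with an empty side
        if max(en_len, fr_len) / min(en_len, fr_len) <= max_ratio:
            yield english, french


# Made-up example pairs.
candidates = [
    ("I agree .", "Je suis d'accord ."),
    ("Yes .", "Monsieur le Président , je voudrais répondre à la question ."),
    ("", "Merci ."),
]
kept = list(filter_by_length_ratio(candidates))
```

A ratio test is crude and will also discard some correct pairs, so it is worth inspecting a sample of the rejected pairs before settling on a threshold.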
Table 2: Total Number of Sentences in the Hansards
| | English | French | pairs released | pairs total |
|---|---|---|---|---|
| House Debates | 1,925 K | 1,894 K | 1,070 K | 1,195 K |
| Senate Debates | 281 K | 276 K | 208 K | 233 K |
Credits
We are very grateful to the Canadian people in general and to the
Canadian Parliament in particular for making the Hansards
available to the general public free of charge and without any restrictions
that would hamper research. Their generosity is gratefully acknowledged.
We would also like to thank the authors of the aforementioned software
tools for making them available to the research community.
Without them the alignment of the Hansards would have been
much more difficult, if not impossible. Their generosity, too, is
gratefully acknowledged.
The alignment of the Hansards was performed as part of the
ReWrite project under funding from the
DARPA TIDES program.
Download The Corpus
"Installation"
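Unpack the tarballs. Because tar reads only one archive per invocation,
loop over the files (Bourne-style shell syntax):
% for f in *.tar; do tar xvf "$f"; done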