Aligned Hansards of the 36th Parliament of Canada
Release 2001-1a
provided by the Natural Language Group of the USC Information Sciences Institute
Ulrich Germann, Editor
This release contains 1.3 million pairs of aligned text chunks (sentences or smaller fragments) from the official records (Hansards) of the 36th Canadian Parliament.
Samples are available online HERE.
Table 1: Corpus Size

| | House Debates: Sentence Pairs | House Debates: English Words | House Debates: French Words | Senate Debates: Sentence Pairs | Senate Debates: English Words | Senate Debates: French Words |
|---|---|---|---|---|---|---|
| training | 948 K | 14,614 K | 15,657 K | 182 K | 3,030 K | 3,249 K |
| testing 1 | 62 K | 995 K | 1,067 K | 13 K | 222 K | 237 K |
| testing 2 | 60 K | 758 K | 806 K | 12 K | 211 K | 226 K |
| eval 1* | 63 K | 975 K | 1,048 K | 13 K | 204 K | 216 K |
| eval 2* | 62 K | 969 K | 1,039 K | 12 K | 211 K | 225 K |
| released | 1,070 K | 16,366 K | 17,530 K | 208 K | 3,464 K | 3,712 K |
| Total* | 1,195 K | 18,311 K | 19,618 K | 233 K | 3,879 K | 4,153 K |

* not released
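
The rows of Table 1 appear to be related by simple sums: the released row matches training plus the two testing sets, and the total row matches released plus the two eval sets, up to rounding to the nearest thousand. The following Python sketch checks that reading against the House Debates sentence-pair column. The dictionary and function names are illustrative only and are not part of the release.

```python
# Sentence-pair counts (in thousands) for the House Debates, copied from Table 1.
HOUSE_PAIRS_K = {
    "training": 948,
    "testing 1": 62,
    "testing 2": 60,
    "eval 1": 63,
    "eval 2": 62,
    "released": 1070,
    "total": 1195,
}


def check_sums(counts, tolerance_k=2):
    """Return (released_ok, total_ok) under the assumed row relationships.

    tolerance_k allows for the table values being rounded to thousands.
    """
    released = counts["training"] + counts["testing 1"] + counts["testing 2"]
    total = released + counts["eval 1"] + counts["eval 2"]
    released_ok = abs(released - counts["released"]) <= tolerance_k
    total_ok = abs(total - counts["total"]) <= tolerance_k
    return released_ok, total_ok


print(check_sums(HOUSE_PAIRS_K))
```

The same check can be run on any other column of the table by substituting its values; the small tolerance absorbs rounding differences such as those in the word-count columns.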
This release contains only sentence pairs. Even though the order of the sentences is the same as in the original, there may be gaps resulting from many-to-one, many-to-many, or one-to-many alignments that were filtered out. Therefore, this release may not be suitable for discourse-related research. Table 2 shows the total number of sentences on the English and French sides and the number of sentence pairs extracted.
Table 2: Total Number of Sentences in the Hansards

| | English | French | pairs released | pairs total |
|---|---|---|---|---|
| House Debates | 1,925 K | 1,894 K | 1,070 K | 1,195 K |
| Senate Debates | 281 K | 276 K | 208 K | 233 K |
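
To give a rough sense of how much text the filtering described above removes, the sketch below divides the total number of extracted pairs by the number of sentences on each side, using the counts in Table 2. It is only an approximation, since a pair may be a fragment smaller than a sentence. The names are illustrative, and the French House Debates figure is read as 1,894 K, which is an assumption about the table entry.

```python
# Counts in thousands, from Table 2.
# Assumption: the French House Debates count is read as 1,894 K.
TABLE_2_K = {
    "House Debates": {"english": 1925, "french": 1894, "pairs_total": 1195},
    "Senate Debates": {"english": 281, "french": 276, "pairs_total": 233},
}


def pair_coverage(row):
    """Return the ratio of extracted pairs to English and to French sentences."""
    return (
        row["pairs_total"] / row["english"],
        row["pairs_total"] / row["french"],
    )


for name, row in TABLE_2_K.items():
    en, fr = pair_coverage(row)
    print(f"{name}: {en:.1%} of English, {fr:.1%} of French sentences paired")
```

A ratio well below 100% indicates that a substantial share of sentences fell into alignments that were not kept as pairs, which is why the text cautions against using the release for discourse-related research.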
We would also like to thank the authors of the aforementioned software tools for making them available to the research community. Without them the alignment of the Hansards would have been much more difficult, if not impossible. Their generosity, too, is gratefully acknowledged.
The alignment of the Hansards was performed as part of the ReWrite project under funding from the DARPA TIDES program.
% tar xvf *.tar