MTranslatability

AMTA-2000 Tutorial

A. Bernth (arendse@us.ibm.com) and C. Gdaniec (cgdaniec@us.ibm.com)

IBM T. J. Watson Research Center

P.O. Box 218

Yorktown Heights, NY 10598

USA

This document contains the tutorial text on MTranslatability, which we presented at AMTA-2000 in Cuernavaca, Mexico.
 

Contents:

Introduction

Ways to Improve MTranslatability

        Check the file characteristics

        Check the punctuation

        Check the spelling

        Update your personal dictionaries

        Check the grammar

        Reduce ambiguity

        Check the style

Tools

        Spell checkers

        Grammar and style checkers

        Controlled language checkers

        Other helpful tools

Ways to measure MTranslatability

Conclusion

Special Interest Group on MTranslatability

Resources

        Papers

        Periodicals

        Conferences




HOW TO WRITE GOOD by Frank L. Visco

http://www.ou.edu/special/owp/goodies/writegood.html

My several years in the word game have learnt me several rules:

1. Avoid alliteration. Always. 

2. Prepositions are not words to end sentences with. 

3. Avoid cliches like the plague. (They're old hat.) 

4. Employ the vernacular. 

5. Eschew ampersands & abbreviations, etc. 

6. Parenthetical remarks (however relevant) are unnecessary. 

7. It is wrong to ever split an infinitive. 

8. Contractions aren't necessary. 

9. Foreign words and phrases are not apropos. 

10. One should never generalize. 

11. Eliminate quotations. As Ralph Waldo Emerson once said:

"I hate quotations. Tell me what you know." 

12. Comparisons are as bad as cliches. 

13. Don't be redundant; don't more use words than necessary; it's highly superfluous. 

14. Profanity sucks. 

15. Be more or less specific. 

16. Understatement is always best. 

17. Exaggeration is a billion times worse than understatement. 

18. One-word sentences? Eliminate. 

19. Analogies in writing are like feathers on a snake. 

20. The passive voice is to be avoided. 

21. Go around the barn at high noon to avoid colloquialisms. 

22. Even if a mixed metaphor sings, it should be derailed. 

23. Who needs rhetorical questions?

Introduction

Current MT systems are often unable to produce high-quality output on arbitrary, unseen input. The output frequently does not meet user needs and requirements.



Ÿ         Why is MT output not better?

        Ÿ         MT systems are not good enough

                Ÿ         Statistical MT systems tend to use more simplistic language models that do not allow for several layers of abstraction.  This can result in less adequate coverage of linguistic rules and linguistic generalizations.

                Ÿ         Knowledge-based MT systems depend on large amounts of hand-coded data (lexical data and syntactic rules).  It is very time-consuming to gain enough linguistic coverage.

        Ÿ         MT input is not good enough

                Ÿ         Bad markup

                Ÿ         Incorrect punctuation

                Ÿ         Incorrect spelling

                Ÿ         Incorrect grammar

                Ÿ         Ambiguous constructions

                Ÿ         Bad style

Ÿ         What aspects can the MT user control?

        Ÿ         MT input

        Ÿ         Lexical coverage

Ÿ         Ways to change input in order to increase MTranslatability and thus improve the MT output.

Ÿ         Is it possible to predict the output quality for given input automatically?



Ways to Improve MTranslatability[1]

Ÿ        Check the file characteristics

Ÿ         Proofread and correct any scanned documents
Ÿ         OCR software is not 100% reliable

Ÿ         Avoid bitmaps when possible; these are usually not translated by MT systems

Ÿ         Use mark-up tags in a conceptual way; use header tags for headers, etc.

Ÿ         Do not abuse tags to accomplish a purely physical effect (e.g. a header tag just to achieve a bigger font), and avoid tags that accomplish formatting on their own (e.g. <br>).

Ÿ         Use mark-up to accomplish the desired layout for tables etc., rather than “manual” indentation.

Ÿ         Specify the LANG attribute for HTML documents.  Mark any parts that are in a different language from that of the main document.

Ÿ         Write hypertext links and bold-faced (italicized, etc.) text such that they can be translated as a single entity.  This way the markup will look better in the translation.  Mark strings that should not be translated.

Ÿ         Use ISO 8859 (or Unicode) characters throughout. Otherwise, use entities for characters that are not part of the ASCII character set.  For instance, in the SGML/HTML source code, your entity for ü [u-umlaut] should be: &uuml;

Ÿ         Make sure that words that are used as labels or names are properly identified.

E.g. The red button vs. The “RED” button. You can use defined tags such as <q>RED</q>.
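The entity recommendation above can be applied mechanically. The following is a minimal sketch in Python, not part of the original tutorial material; the function name to_entities is hypothetical, and the sketch assumes that the named entities of the standard HTML 4 set are acceptable to the MT system, with numeric character references as a fallback.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with an HTML entity."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            pieces.append(ch)                           # plain ASCII: keep as-is
        elif code in codepoint2name:
            pieces.append(f"&{codepoint2name[code]};")  # named entity, e.g. &uuml;
        else:
            pieces.append(f"&#{code};")                 # numeric fallback
    return "".join(pieces)


print(to_entities("Prüfen Sie die Tür."))
```

ASCII characters, including markup characters such as < and &, are passed through unchanged, so the function is meant for text content and not for escaping markup.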

Ÿ        Check the punctuation

Ÿ         Punctuation that indicates a new segment[2] is especially important.

Ÿ         Remember correct use of hyphens.

Do not write: If the user provided file is not found, an error message is issued.

Do write: If the user-provided file is not found, an error message is issued.

Do not write: He bit-off more than he can chew. 

Do write: He bit off more than he can chew.

Ÿ         Commas do make a difference.

Do not write: Since Jay always jogs a mile doesn't seem that far to him.

Do write: Since Jay always jogs, a mile doesn't seem that far to him.

Ÿ         Avoid using (s) to indicate plural. This construction may not translate well into other languages.

Ÿ         Avoid using “/” as in “and/or” and “user/system”.  It is ambiguous.
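The last two recommendations lend themselves to a simple automatic check. The sketch below is an illustration only; the pattern names and the function punctuation_warnings are hypothetical, and the patterns are deliberately crude. It assumes plain text as input (not markup or URLs) and will also flag legitimate names such as TCP/IP, which a human then has to accept or reject.

```python
import re

# Hypothetical patterns for two of the constructions discussed above.
PLURAL_S = re.compile(r"\b\w+\(s\)")            # e.g. "file(s)"
SLASH = re.compile(r"\b[A-Za-z]+/[A-Za-z]+\b")  # e.g. "and/or"


def punctuation_warnings(sentence: str) -> list[tuple[str, str]]:
    """Return (kind, matched text) pairs for constructions to review."""
    warnings = [("plural-(s)", m.group()) for m in PLURAL_S.finditer(sentence)]
    warnings += [("slash", m.group()) for m in SLASH.finditer(sentence)]
    return warnings


print(punctuation_warnings("Select the file(s) and/or the user/system profile."))
```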

Ÿ        Check the spelling

Ÿ         If a word is misspelled, it will -- at best -- produce a non-translation.  At worst, it will mess up source analysis and produce a wrong grammatical structure.



Ÿ        Update your personal dictionaries

If a word is not in any of the dictionaries that the MT system uses, there is no way the MT system will know how to translate it.  Worse still, it will not know how to analyze the sentence that the word occurs in. It is also important to make sure that all the relevant parts of speech of the words are covered in the dictionary.

E.g. Postage meter: External mail to be postage-metered.

Ÿ         Special terminology

Ÿ         You may use certain words in a nonstandard sense, but make sure you update your dictionary.

Ÿ         Multi words

Ÿ         Many noun strings cannot be translated compositionally and have to be treated as a unit.  But beware: Not all MT systems can handle coordination of premodifiers in multi words.  E.g. forward and backward compatible; side and back exits.
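A coverage check of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular MT system: it assumes that the text has already been tagged with parts of speech by some other tool, and that the user dictionary can be exported as a mapping from words to the parts of speech recorded for them. The function name missing_entries, the sample lexicon, and the sample words are all hypothetical.

```python
def missing_entries(tagged_words, lexicon):
    """Return (word, pos) pairs that the lexicon does not cover.

    tagged_words: iterable of (word, part_of_speech) pairs
    lexicon: dict mapping a lowercase word to the set of parts of speech
             recorded for it
    """
    missing = []
    for word, pos in tagged_words:
        key = word.lower()
        known_pos = lexicon.get(key, set())
        # Report each missing (word, pos) combination only once.
        if pos not in known_pos and (key, pos) not in missing:
            missing.append((key, pos))
    return missing


lexicon = {"mail": {"noun", "verb"}, "postage-meter": {"noun"}}
text = [("mail", "noun"), ("postage-meter", "verb"), ("franking", "noun")]
print(missing_entries(text, lexicon))
```

In this example the word that is known only as a noun is reported for its verb use, and the word that is absent from the lexicon is reported as well.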

                                 

Ÿ        Check the grammar

Current MT systems have to rely on syntax to a large extent; therefore, ungrammatical input is bound to produce wrong output!

Ÿ         Subject-verb agreement

Do not write: File information and data type is of utmost importance.

Do write: File information and data type are of utmost importance.

 

Ÿ         Wrong modification

Do not write: Woven of combed cotton, you will love our sweater's soft feel.

Do write: Woven of combed cotton, this sweater will delight you with its soft feel.

Do write: Our sweater is woven of combed cotton, and you'll love its soft feel.

 

Ÿ        Reduce ambiguity

Adhering to the following recommendations is useful to varying degrees depending on the MT system that is being used. Some systems are more robust vis-a-vis certain structural ambiguities. 



Ÿ        Use syntactic cues (avoid use of the telegraphic style):

Ÿ         Use articles whenever possible

Do not write: Meeting requirements.

Do write: Meeting the requirements.

 

Ÿ         In coordinated phrases:

Repeat articles

Do not write: The system reads the file or result field definition.

Do write: The system reads the file or the result field definition.

Repeat any modal/auxiliary verb

Do not write: The application can use the window to establish a dialog with the user and format text responses.

Do write: The application can use the window in order to establish a dialog with the user and can format text responses.

Repeat “to” before infinitives         

Do not write: The application can use the window to establish a dialog with the user and format text responses.

Do write: The application can use the window in order to establish a dialog with the user and to format text responses.                                          

Repeat the preposition before any prepositional objects

Do not write: The coordinates that are displayed correspond to the top of the slider in the vertical slide bar, and the top edge of the slider in the horizontal slide bar.

Do write: The coordinates that are displayed correspond to the top of the slider in the vertical slide bar, and to the top edge of the slider in the horizontal slide bar.

Use “either ... or” instead of “or” alone

Do not write: The system immediately terminates the program if a hard error or exception occurs.

Do write: The system immediately terminates the program if either a hard error or an exception occurs.

Use “both ... and” instead of “and” alone

Do not write: The system immediately terminates the program if it detects a hard error and exception.

Do write: The system immediately terminates the program if it detects both a hard error and an exception.

 

Ÿ       Avoid long noun phrases, if possible

Do not write: The uninterruptible power supply message queue system value allows you to specify where you want your messages sent when the power to the system is interrupted.




Ÿ        Do not omit relative pronouns; write “that” (“which”, “who” etc) explicitly

Do not write: The cotton shirts are made from comes from Arizona.

Do write: The cotton that shirts are made from comes from Arizona.

Do not write: In experiment 6 we were interested in the reading subjects spontaneously achieve for such a headline.

Do write: In experiment 6 we were interested in the reading that subjects spontaneously achieve for such a headline.

 

Do not write: After a process creates a resource, any process it starts inherits the resource identifiers.

Do write: After a process creates a resource, any process that it starts inherits the resource identifiers.


Ÿ        Expand postnominal modifiers into full relative clauses

Do not write: The amount of adjacent space available in storage does not restrict the size of a library, or of any other object.

Do write: The amount of adjacent space that is available in storage does not restrict the size of a library, or of any other object.

 

Do not write: Programs currently running in the system are indicated by icons in the lower part of the screen.

Do write: Programs that are currently running in the system are indicated by icons in the lower part of the screen. 

Do write: Icons in the lower part of the screen indicate programs that are currently running in the system.

 

Do not write: The horse raced past the barn fell.

Do write: The horse that was raced past the barn fell.

Ÿ        Always write the complementizer “that” explicitly

Do not write: Make sure the power is turned off.

Do write: Make sure that the power is turned off.


Ÿ        Always write “in order to” before an infinitive in a purpose clause instead of just “to”

Do not write: Use this function to copy project data to a new or existing project.

Do write: Use this function in order to copy project data to a new or existing project.


Ÿ        Avoid -ing-forms

Ÿ         Rewrite -ing verbs that post-modify a noun as a relative clause or add a suitable preposition, depending on what you mean

Do not write: You can develop an application using the TCP/IP sockets.

Do write: You can develop an application that uses the TCP/IP sockets.

Do write: You can develop an application by using the TCP/IP sockets.

Ÿ         Rewrite -ing verbs pre-modifying a noun to include an article

Do not write: DATAMAX continues processing statements after repairing the data set.

Do write: DATAMAX continues the processing statements after it repairs the data set.

If that is what you meant...

Ÿ         Rewrite -ing verbs that are complements of other verbs

Do not write: The motor starts using a gas-powered pull start or pushbutton ignition via a rechargeable battery. 

Do write: You use a gas-powered pull start or pushbutton ignition via a rechargeable battery in order to start the motor.

Ÿ         Rewrite -ing verbs that can take an infinitive complement as “to” + infinitive

Do not write: Receiving notices.

Do write: To receive notices.

Ÿ         Make sure that the implicit subject of an -ing verb in a subordinate clause that starts with a subordinating conjunction (“after”, “when”, “while” etc.) is the same as the subject of the superordinate clause

Do not write: After inserting the diskette, the system will read the file.

Do write: After you insert the diskette, the system will read the file.

Ÿ         Beware.  Kohl (1999) claims that it is not necessary to worry about the following cases:

a.       -ing verbs that are preceded by a preposition.  A slight variation of his example is “For more information about printing files, see Chapter 3.”  However, in the context of MT, this is ambiguous between the reading where “files” is the object of “print”, and the reading where “printing” pre-modifies “files”.

b.      -ing verbs that are the subject of a clause of a sentence.  His example is “Specifying the system password gives you full administrative access.”  He goes on to say: “When it’s the first word of a simple sentence, an -ING can only be a gerund.”  This is not generally true. The reason this example is not ambiguous is that there is a determiner (“the”) between the -ing verb and the following noun.

Humans often disambiguate by applying real-world knowledge, but even then there may be problems, as evidenced by the notorious example “Visiting relatives can be a nuisance.”

Or how about this real, but truly ambiguous sentence: “At XYZ Inc. we don't waste any time improving service for our customers!”
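Since deciding how to rewrite an -ing form depends on what the writer means, a tool can at most point the writer to the candidates. The following sketch does just that. It is a rough heuristic, not a parser: the function name ing_candidates and the exclusion list are assumptions made for this illustration, and the list of -ing words that are not verb forms is far from complete.

```python
import re

# Words that end in -ing but are not verb forms; this short list is an assumption.
NOT_VERB_FORMS = {
    "thing", "something", "anything", "nothing", "everything",
    "during", "string", "ring", "king", "spring", "morning", "evening",
}

ING_WORD = re.compile(r"\b[A-Za-z]+ing\b", re.IGNORECASE)


def ing_candidates(sentence):
    """Return the -ing words that a writer should review for ambiguity."""
    return [
        m.group()
        for m in ING_WORD.finditer(sentence)
        if m.group().lower() not in NOT_VERB_FORMS
    ]


print(ing_candidates("You can develop an application using the TCP/IP sockets."))
```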

Ÿ        Minimize use of pronouns

Ÿ         In many languages the pronoun has to agree in number and gender with its antecedent.  Most MT systems do not support pronoun resolution, which is a rather difficult task.



The police refused the anarchists a permit because they feared violence. 

The police refused the anarchists a permit because they advocated violence. 

La police a refusé un permis aux anarchistes parce qu'elle craint des actes de violence. 

La police a refusé un permis aux anarchistes parce qu'ils prônent la violence.

This example shows that an MT system would have to be extremely smart to “know” the reference of the pronoun “they”.  At present, there are no NLP programs that can reliably identify the reference of pronouns.  Therefore, strictly controlled languages ban the use of 3rd person pronouns altogether. Unless you are willing to adhere to the rules of a controlled language (CL), there is not much you can do about the pronoun reference issue if you want to write fluent text.  One way of avoiding pronouns is to repeat the noun phrase in a reduced form. If you think this is acceptable, write “The spool file space on the disk should not get too large, and you should reduce the space to conform to specifications” instead of “The spool file space on the disk should not get too large, and you should reduce it to conform to specifications”.
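Finding the pronouns is much easier than resolving them. As a minimal sketch, the following Python function lists the 3rd person pronouns in a sentence so that the writer can decide whether to replace them with a reduced noun phrase. The function name and the word list are assumptions for this illustration; note that the list also contains possessive forms such as “its” and “their”.

```python
import re

THIRD_PERSON = {
    "he", "him", "his", "she", "her", "hers", "it", "its",
    "they", "them", "their", "theirs",
}


def third_person_pronouns(sentence):
    """Return the 3rd person pronouns in the sentence, in order of occurrence."""
    words = re.findall(r"[A-Za-z]+", sentence)
    return [w for w in words if w.lower() in THIRD_PERSON]


print(third_person_pronouns(
    "The police refused the anarchists a permit because they feared violence."
))
```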

Ÿ        Use one-word verbs instead of verb+particle whenever possible

English verb particles represent a challenge to MT systems because of the ambiguity of particles and prepositions.  If there is a choice between two synonymous verbs, one with a particle and one without, do choose the latter.  E.g. “She ran up a bill.” vs. “She ran up a hill.”

Do not write: She ran up a bill.

Do write: She accumulated a bill.

Ÿ        Check the style

Ÿ        Avoid overly long sentences and very short sentences

Do not write: Transfer file.

Do write: Transfer the file.

Do write: The transfer file.

 

Do not write: At all levels of security, the system-supplied defaults in the user profile can be changed and authority can be specifically given or taken away from the users.

Do write: At all levels of security, the system-supplied defaults in the user profile can be changed. Authority can be specifically given to the users or taken away from the users.

Ÿ        Avoid metaphors, idioms, slang, dialect, irony

Ÿ         Do not write: He got my goat.

Ÿ         Do write: He annoyed me.

Ÿ        Avoid overly complex constructions

Do not write: Communication between programs, between jobs, between users, between users and programs and between users and the system occurs through messages.

Do write: Communication occurs through messages.  This is true for communication between programs, between jobs, and between users, as well as for communication between users and programs, and between users and the system.


Ÿ        Avoid ellipsis

Do not write: Is she suing the hospital? -- She is the doctor.

Do write: Is she suing the hospital? -- She is suing the doctor.

Ÿ        Avoid passive constructions, if possible[3]

Do not write: The size of a library, or of any other object, is not restricted by the amount of adjacent space available in storage. 

Do write: The amount of adjacent space that is available in storage does not restrict the size of a library, or of any other object.

Ÿ       Make sure each segment can stand alone, e.g. do not let individual list elements be part of the sentence leading in to the list 

Do not write: 

After you have set up your workstation, you can:

a.       Log on to the network 

b.      Work locally 

Do write:  After you have set up your workstation, you can log on to the network or work locally.

Do write: 

After you have set up your workstation, you can do the following:

a.       You can log on to the network 

b.      You can work locally 

Ÿ        Avoid footnotes in the middle of a sentence, and make footnotes independent segments

Ÿ        What makes life easier for the human reader is not always useful in the context of MT:

Ÿ         Exact repetitions make it more fruitful to use translation memory

Ÿ         Short words are often more abstract and polysemous, and hence prone to bad translation
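The recommendation to avoid both overly long and very short sentences can be checked mechanically. The sketch below is an illustration under stated assumptions: the function name length_warnings is hypothetical, the thresholds of 3 and 25 words are arbitrary defaults chosen for the example and not values from any MT system, and the sentence splitter is naive (it will split wrongly after abbreviations such as “e.g.”).

```python
import re


def length_warnings(text, min_words=3, max_words=25):
    """Return (kind, word count, sentence) for sentences outside the limits."""
    # Naive segmentation: split after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    warnings = []
    for sentence in sentences:
        n = len(sentence.split())
        if n < min_words:
            warnings.append(("too short", n, sentence))
        elif n > max_words:
            warnings.append(("too long", n, sentence))
    return warnings


for warning in length_warnings("Transfer file. Transfer the file."):
    print(warning)
```

With these defaults, “Transfer file.” is flagged as too short, whereas “Transfer the file.” passes.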





Tools 

Ÿ        Spell Checkers

Ÿ         The objective of spell checkers is to point out misspelled words and, where possible, suggest the correct spelling.

Ÿ         Most spell checkers work with a dictionary.  If a word is not found in the dictionary (including user-defined dictionaries), it will be flagged as a misspelling, and alternatives given.

Ÿ         Spell checkers generally do not detect words that are valid in themselves but incorrect in context.

Do not write: There very happy.

Do write: They’re very happy.
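The dictionary lookup described above, and its blind spot, can be illustrated with a few lines of Python. This is a toy sketch and not the algorithm of any particular product: the function name spell_check and the four-word dictionary are assumptions, and suggestions are produced with the standard library function difflib.get_close_matches.

```python
import re
from difflib import get_close_matches

WORD = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")


def spell_check(text, dictionary):
    """Return (word, suggestions) for every word not found in the dictionary."""
    flagged = []
    for word in WORD.findall(text):
        key = word.lower().replace("’", "'")  # normalize curly apostrophes
        if key not in dictionary:
            suggestions = get_close_matches(key, sorted(dictionary), n=3)
            flagged.append((word, suggestions))
    return flagged


dictionary = {"there", "they're", "very", "happy"}
print(spell_check("Ther very happy.", dictionary))   # "Ther" is flagged
print(spell_check("There very happy.", dictionary))  # nothing is flagged
```

The second call returns an empty list: every word is in the dictionary, so the wrong use of “There” goes unnoticed.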
 

THE SPELLING CHEQUER 

(or Poet Tree Without Mist Aches)

I have a spelling chequer 

It came with my pea sea 

It plainly marques four my revue 

Miss steaks eye cannot sea 

When eye strike a quay, right a word 

I weight four it two say 

Weather eye am wrong or wright 

It shows me strait away 

As soon as a mist ache is maid 

It nose bee fore two late 

And eye can put the error rite 

Its rarely, rarely grate 

I've run this poem threw it 

I'm shore your pleased to no 

It's letter perfect in it's weigh 

My chequer tolled me sew. 

                                --Sauce unknown

 

 

        WHY SPELL CHECK DOES NOT WORK--A LINGUISTIC ODYSSEY 

            Thanks to M. Zarnosky: bruin@vt.edu Thu Feb 23 08:08:05 1995 



        From IEEE Transactions on Aerospace and Electronic Systems, Vol. 26, No. 2,

        March, 1990 -- p. 209, author name n.a. -- 

Catching Misspilled Words with Spilling Checker

As an extra addled service, I am going to put this column in the Spilling Checker, where I tryst it will sale through with flying colons. In this modern ear, it is simply inexplicable to ask readers to expose themselves to misspelled swords when they have bitter things to do. And with all the other timesaving features on my new work processor, it is in realty very easy to pit together a colon like this one and get it tight. For instants, if there is a work that is wrong, I just put the curse on it, press Delete and its    Well sometimes it deletes to the end of the lion or worst yet the whole rage.  Four bigger problems, there is the Cat and Paste option.  If there is some test that is somewhere were you wish it where somewhere else you jest put the curse at both ends and wash it dissapear.  Where you want it to reappear simply bring four quarts of water to a rotting boil and throw in 112 pounds of dazed chicken.  Sometimes it brings in the Cat that was Pasted yesterday.  But usually it comes out as you planned, or better.  And if it doesn't, there are lots of other easy to lose options... 

Ÿ        Grammar and Style Checkers

Ÿ         The objective of grammar checkers is to point out ungrammatical constructions.

Ÿ         Grammatical input to MT stands a better chance of getting a good translation; however, it is not sufficient to guarantee a correct translation.

Ÿ         Grammar checking is a very difficult process because the program basically has to try to make (grammatical) sense of (grammatical) nonsense.  Consequently, the precision of grammar checkers is notoriously low.

Ÿ         Grammar checkers show a tendency to lump together different kinds of problems.  Some of these problems are more relevant for MTranslatability than others; consequently, some checks fall into more than one usefulness category, depending on which aspect you are looking at.




Ÿ        Microsoft Word 2000 checks for the following problems:

Ÿ        Useful for MTranslatability

Ÿ         Capitalization of first word in a sentence

Ÿ         Hyphenated and compound words

Ÿ         Words in split infinitives ( > 1)

Ÿ         Passive sentences

Ÿ         Commonly confused words (its/it’s, their/there/they’re)

Ÿ         Punctuation

Ÿ         Relative clauses (who, which, that)

Ÿ         Sentence structure (e.g. bad participial modification:  Having run the marathon, it was time to rest.)

Ÿ         Subject-verb agreement

Ÿ         Successive nouns ( > 3)

Ÿ         Successive prepositional phrases ( > 3)

Ÿ         Verb and noun phrases 

Ÿ         Cliches (these tend to be idiomatic)

Ÿ         Colloquialisms

Ÿ         Jargon

Ÿ         Unclear phrasing (various cases of ambiguous scope)

Ÿ         Double negation

Ÿ         Sentence length ( > 60 words) (this maximum is very high, but it’s better than nothing)

Ÿ         Wordiness (to the extent it reduces sentence length)

Ÿ         Verb contractions (‘s, which is ambiguous between is, has, and possessive; ‘d, which is ambiguous between had and would)

Ÿ         Possessives and plurals (houses vs. house’s)

Ÿ         Misused words (includes various grammatical mistakes for adjectives and adverbs; wrong case)

Ÿ        Not useful for MTranslatability

Ÿ         Gender-specific words

Ÿ         Sentences beginning with And, But, Hopefully, and Plus

Ÿ         Use of first person

Ÿ         Numbers (use of digits instead of spelled-out numbers)
 

Ÿ        Slightly harmful for MTranslatability

Ÿ         Verb contractions (‘m, n’t, ‘re, ‘ll, ‘ve; these help parsing)

Ÿ         Sentence structure (e.g. repetition of conjunctions: She ate a hot dog and a coke and an ice cream cone.)

Ÿ         Wordiness (to the extent it prevents disambiguation)
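The distinction drawn above between ambiguous contractions (‘s, ‘d) and unambiguous ones (‘m, n’t, ‘re, ‘ll, ‘ve) can be turned into a small check that flags only the ambiguous kind. The following is a sketch only; the function name ambiguous_contractions is hypothetical, and the pattern cannot tell a possessive ‘s from a contracted “is” or “has”, which is exactly the ambiguity that the writer is asked to resolve.

```python
import re

# Matches a word followed by an apostrophe (straight or curly) and "s" or "d".
AMBIGUOUS = re.compile(r"\b[A-Za-z]+['’‘][sd]\b")


def ambiguous_contractions(sentence):
    """Return the words ending in 's or 'd, which an MT parser may misread."""
    return [m.group() for m in AMBIGUOUS.finditer(sentence)]


print(ambiguous_contractions(
    "It's the user's file, and he'd left; I'm sure they'll agree."
))
```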

Ÿ        CorrecText Grammar Correction System (Word Pro 97)

Ÿ        Useful for MTranslatability

Ÿ         Verb agreement with there/here

Ÿ         Capitalization errors

Ÿ         Compounding errors (missing or superfluous hyphen.)

Ÿ         Doubled words (the the)

Ÿ         Open vs closed spelling (spelling errors that result from incorrect use of spaces. never the less instead of nevertheless.)

Ÿ         Clause errors (punctuation; incomplete sentences)

Ÿ         Double negations

Ÿ         Formatting errors 

Ÿ         format of numbers (placement of periods and commas; endings of ordinal numbers; spelling of fractions and other numbers)

Ÿ         dates (use of cardinal and ordinal numbers)

Ÿ         times (use of abbreviations and punctuation marks) 

Ÿ         currency and other symbols

Ÿ         addresses

Ÿ         Inappropriate prepositions (adhere to instead of adhere by; center on instead of center around.)

Ÿ         Mass/count noun agreement with adjectives (less vs fewer)

Ÿ         Misused words (confused words: sit vs. set)

Ÿ         Nonstandard modification (adjectives instead of adverbs; hyphenation).

Ÿ         Noun phrase consistency errors (errors of number agreement between determiners and nouns). 

Ÿ         Pronoun errors (errors in case and ordering; which instead of that in restrictive clauses.)

Ÿ         Punctuation errors

Ÿ         Subject-verb agreement errors

Ÿ         Non-standard English (seeing as how instead of since)

Ÿ         Verb group consistency errors (errors in the use of the present, the past, and the past participle, as well as errors in the choice of auxiliary verbs.)

Ÿ         Word order errors (incorrect ordering of certain words that modify nouns; my both instead of both my).

Ÿ         Commonly confused words (commonly confused words of different parts of speech that have similar though not identical pronunciations; advice vs advise.) and homonyms.

Ÿ         Clichés

Ÿ         Verb contractions (‘s, which is ambiguous between is, has, and possessive; ‘d, which is ambiguous between had and would)

Ÿ         Informal expressions

Ÿ         Jargon

Ÿ         Passive voice usage

Ÿ         Overused phrases (blissful ignorance instead of ignorance), stock phrases (fillers like in fact), and wordy expressions (vague or wordy expressions; in all probability instead of probably).

Ÿ         Redundant expressions (sufficient enough instead of sufficient or enough).

Ÿ         Weak modifiers (overused or colloquial modifiers; funny, pretty well, or nice).

Ÿ         Many consecutive prepositional phrases (limit is user-definable)

Ÿ         Many consecutive nouns (limit is user-definable)

Ÿ         Split infinitives (limit is user-definable)

Ÿ         Misspelled foreign expressions

Ÿ         Nonstandard terms

Ÿ         Archaic expressions

Ÿ         ‘A’ vs ‘An’

Ÿ        Not useful for MTranslatability

Ÿ         Gender-specific expressions

Ÿ         Sexist expressions

Ÿ         Vague, wordy, or informal quantifiers

Ÿ         Unnecessary prepositions.

This check seems incorrect, judging from the help text, which is as follows:

These rules flag expressions that include an unnecessary preposition and suggest deleting it to make the expression more concise. Example: in the sentence 'I sat down on the lawn,' the preposition 'down' is superfluous since it is implied by the word 'sat.' 

In our view, the sentence without the particle has a different meaning.
 

Ÿ        Slightly harmful for MTranslatability

Ÿ         Clause errors (repetition of conjunctions: We chopped up fruit, and we diced the potatoes, and we made a pie crust)

Ÿ         Verb contractions (‘m, n’t, ‘re, ‘ll, ‘ve)

Ÿ         Pretentious words (unnecessarily complex words; eventuate instead of take place).

Ÿ         Identical sentence openers




Ÿ        Grammatik (Corel WordPerfect, version 7)

Ÿ        Useful for MTranslatability

Ÿ         Abbreviation

Ÿ         Confused adjective or adverb 

Ÿ         Archaic

Ÿ         ‘A’ vs ’An’

Ÿ         Capitalization

Ÿ         Cliche (idiomatic)

Ÿ         Colloquial (idiomatic)

Ÿ         Commonly confused words and similar words (from vs form)

Ÿ         Wrong comparative or superlative

Ÿ         Conditional Clause (incorrect verb forms)

Ÿ         Conjunctions (neither-nor; between X and Y; parallelism)

Ÿ         Consecutive elements (number of nouns or prepositions in a row; user-definable)

Ÿ         Date and time format

Ÿ         Double negation

Ÿ         Doubled word or negation

Ÿ         End-of-sentence preposition

Ÿ         End-of-sentence punctuation

Ÿ         Foreign expressions

Ÿ         Formalisms 

Ÿ         Dangling modifiers (subjectless -ing-verb) 

Ÿ         disinterested vs. uninterested

Ÿ         Wrong use of hopefully (the value of this is questionable)

Ÿ         Latin singulars and plurals (singular of strata is stratum)

Ÿ         who vs. whom

Ÿ         Hyphenation

Ÿ         Idiomatic usage

Ÿ         Incomplete sentence, including stand-alone subordinate clauses

Ÿ         Other incorrect verb forms, including infinitive used incorrectly instead of -ing-verb and tense shifts

Ÿ         Jargon

Ÿ         Long sentence 

Ÿ         Mid-sentence adverb (position before auxiliary verb) 

Ÿ         Noun phrases (missing article before singular, countable noun; number disagreement; scrambled word order)

Ÿ         Object of verb (missing or superfluous objects; number disagreement with complement of linking verb; missing preposition for prepositional complement)

Ÿ         Overstated

Ÿ         Passive voice

Ÿ         Pronoun errors (errors in case and number agreement; which vs who)

Ÿ         Punctuation (missing commas; comma splice; apostrophe; colon; semicolon; question mark; quotation marks, unbalanced (), {}, [], “”)

Ÿ         Questionable usage

Ÿ         Redundancy

Ÿ         Spelling

Ÿ         Split infinitive

Ÿ         Subject-verb agreement

Ÿ         Trademarks (xerox vs photocopy)

Ÿ         Wordy

Ÿ        Not useful for MTranslatability

Ÿ         Conjunctions (plus vs also as sentence starter)

Ÿ         Formalisms (beginning a sentence with a conjunction)

Ÿ         Gender-specific

Ÿ         Number style

Ÿ         Offensive

Ÿ         One-sentence paragraphs

          

Ÿ        Harmful for MTranslatability

Ÿ         Sentence variety

Ÿ         Run-on sentence (many ands instead of separation by commas)

Ÿ         Second-person address (you vs one).  “One” is at least as ambiguous as “you”.

Ÿ         Ellipsis spaces (between the dots).  Better not to use ellipsis at all.



Ÿ        MULTILINT

MULTILINT is a research and development project sponsored by the German Ministry of Economy. Project partners are the Institute for Applied Information Sciences in Saarbrücken and BMW AG. The tools apply to automotive repair manuals.



MULTILINT’s German grammar checker looks for:

Ÿ        Useful for MTranslatability

Ÿ         wrong punctuation

Ÿ         wrong case 

Ÿ         incorrect word separation 

Ÿ         lack of subject-predicate agreement, etc.

 

The rule set for the grammar checker covers 55 grammatical error classes.  According to a corpus study of German automotive technical documents, the overwhelming majority of grammatical errors in technical documentation consists of punctuation errors, followed by errors of capitalization, separating or combining words, agreement, and other syntactic errors.

The style checker should result in higher clarity and readability of a processed document. It gives the following recommendations:
 

Ÿ        Useful for MTranslatability

Ÿ         Sentence is too long, contains too many information units

(Es dient bei evtl. Reklamationen mit dem numerierten Arbeitsauftrag als Nachweis der im einzelnen durchgeführten Arbeiten und schützt den ausführenden Betrieb vor unberechtigten Werkstatt-, Gewährleistungs- oder sonstigen Regreßansprüchen.) 

Ÿ         Avoid complex attributes (Darüberhinaus wird ein externer, kabelloser, über eine Infrarotverbindung am DIS angeschlossener Drucker angeboten.)

Ÿ         No more than 14 words before the verb (Die beiden vom rechten Radhauskanal kommenden Kraftstoff-Stahlleitungen an den Schlauchanschlüssen zum Kraftstoff-Filter bzw. zur fahrzeugbodenseitigen Rücklaufleitung abziehen.)

Ÿ         Avoid ambiguous structures (Anlageflächen von Schaumresten reinigen.) 

Ÿ         Rephrase groups of prepositional phrases (Undichtheit am Kraftstoff-Entlüftungswellrohr von rechter Tankkammer zu Tankeinfüllstutzen infolge Knickbeschädigung anläßlich der Tankmontage.)

Ÿ         The subject should come before the verb in the main clause (Das Gras frißt die Kuh.)

Ÿ         Separate main clauses (Kaltstartprobleme, DDD-Kontrollampe leuchtet, Motor läuft im Notprogramm.)

Ÿ         Do not insert too many elements between the parts of the verb

(Dieser stellt sich beim Beschleunigen aus ca. 1500 U/min. insbesondere im zweiten Gang unter hoher Last bzw. Vollast als inhomogenes Beschleunigungsverhalten dar.) 

Ÿ         Use a conditional conjunction for conditional clauses (Wird Korrosion festgestellt, sind die betroffenenen Bauteile auszutauschen.)

Ÿ         Write complete sentences (Wärmetauscher undicht?)

 

References: Schmidt-Wigger 1998; Reuther 1998.

http://www.iai.uni-sb.de/en/multien.html

Contact person: Ursula Reuther ursel@iai.uni-sb.de
 

Ÿ        FLAG4 German grammar checker

Based on an annotated corpus of German e-mail messages, researchers found that out of 14,492 sentences, 6,473 contained at least one error.  83% of the errors were purely orthographic; grammar errors made up 16%.  This finding motivated them to develop a “phenomenon-based approach to grammar checking” which scans a document for the occurrence of error candidates. One example that they mention is the ungrammatical construction Meines Wissens nach, which is a conflation of the formulaic expressions meines Wissens and meiner Meinung nach.

The researchers plan to develop rules for some 200 grammatical errors.

Once this grammar checker is finished, it should be useful for translatability checking.  It is expressly restricted to certain grammatical errors; catching those is necessary but not sufficient for improved translatability.
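The idea of scanning for error candidates can be illustrated with a minimal sketch. It is not the FLAG implementation; the pattern table and the function name find_error_candidates are assumptions made for illustration, and only the single pattern shown comes from the example mentioned above.

```python
import re

# Hypothetical table of error-candidate patterns; only this one
# pattern is taken from the example mentioned in the text.
ERROR_CANDIDATES = [
    (re.compile(r"\bmeines\s+Wissens\s+nach\b", re.IGNORECASE),
     "conflation of 'meines Wissens' and 'meiner Meinung nach'"),
]

def find_error_candidates(text):
    """Return (start offset, matched string, explanation) for each candidate."""
    hits = []
    for pattern, explanation in ERROR_CANDIDATES:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(0), explanation))
    return sorted(hits)

sample = "Meines Wissens nach ist der Drucker defekt. Meines Wissens ist er neu."
for start, matched, explanation in find_error_candidates(sample):
    print(start, matched, "->", explanation)
```

Only the first sentence of the sample is flagged; the correct expression in the second sentence passes. A real checker would confirm each candidate with further context rules before reporting it.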

References: Becker et al. 1999; Bredenkamp et al. 2000.

http://www.dfki.de

Grammar and Style Checkers: Conclusion

Grammar and style checkers are of limited use in preparing a document for MTranslatability.  Most of the problems that they check for are very relevant for MTranslatability because they are directly related to spelling mistakes and ungrammatical constructions (as you would expect from a grammar checker).  However, a few of the recommendations are directly opposed to MTranslatability (some cases of verb contractions, which may actually help the parser, and repetitions, etc.).  As long as the user is aware of these particular pitfalls, the checkers are useful tools, but they are not sufficient for reducing ambiguities.  Ambiguity appears not to be addressed, which is a serious drawback.



 

Ÿ        Controlled Language (CL) Checkers

Ÿ         A CL is a form of language with special restrictions on grammar, style, and vocabulary usage

Ÿ         The objective of a CL is to improve consistency, readability, translatability, and retrievability



Ÿ        KANT Controlled English

Ÿ         KANT Controlled English from Carnegie Mellon University was designed with MT in mind.  This controlled language aims at balancing the control of the vocabulary with the control of the grammar.  In this way, the writer is not forced to write very convoluted sentences in order to stay within the controlled vocabulary. 

Vocabulary constraints include the following:

Ÿ         Limit the meaning per word/part-of-speech to a single meaning.

Ÿ         Encode synonyms in the lexicon in order to flag deviations from the single, approved term.

Ÿ         State all ambiguous terms separately in the lexicon in order to support interactive disambiguation.

Ÿ         The use of determiners is encouraged, whereas the use of pronouns and conjunctions is limited.

Ÿ         The sense and use of modal verbs is clearly specified.

Ÿ         The use of -ing verbs and -ed verbs is restricted.

Ÿ         Abbreviations

Ÿ         Orthography
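The constraint about encoding synonyms in the lexicon can be sketched as a simple lookup. This is an illustration only, not the KANT lexicon format; the entries in SYNONYM_TO_APPROVED and the function name flag_synonyms are hypothetical.

```python
import re

# Hypothetical lexicon: each non-approved synonym maps to the single approved term.
SYNONYM_TO_APPROVED = {
    "automobile": "car",
    "auto": "car",
    "fix": "repair",
}

def flag_synonyms(sentence):
    """Return (word, approved term) pairs for words that deviate from the lexicon."""
    flags = []
    for word in re.findall(r"[A-Za-z]+", sentence):
        approved = SYNONYM_TO_APPROVED.get(word.lower())
        if approved is not None:
            flags.append((word, approved))
    return flags

print(flag_synonyms("Fix the automobile before you test the car."))
```

The approved term "car" passes unflagged, while "Fix" and "automobile" are reported together with the term the writer should use instead.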

                Phrase-level constraints include the following:

Ÿ         Avoid verbs with particles; use single-word verbs instead

Ÿ         Do not coordinate verb phrases

Ÿ         Repeat the preposition in coordinated prepositional phrases

Sentence-level constraints include the following:

Ÿ         Parallelism in coordination

Ÿ         Write relative pronouns explicitly

Ÿ         Avoid ellipsis

 

Ÿ         All these checks enhance MTranslatability, which is not surprising since they were designed for the express purpose of improving MTranslatability.

 

Ÿ         The KANT technology is part of the ClearCheck checker used by Caterpillar for their controlled language system.

 

References: Mitamura and Nyberg 1995; Nyberg and Mitamura 1996; Mitamura 1999; Hayes et al. 1996.

http://www.lti.cs.cmu.edu/Research/Kant/

Contact person: Teruko@cs.cmu.edu




Ÿ        MAXit Checker 

Ÿ         The MAXit AECMA Simplified English checker offers the following checks:

Ÿ        Useful for MTranslatability

Ÿ         Abbreviation

Ÿ         Adjective that does not modify a noun

Ÿ         Adverb that does not modify a verb

Ÿ         Subject-verb agreement and subject-pronoun agreement

Ÿ         Contraction or possessive

Ÿ         Awkward sentence

Ÿ         Capitalization

Ÿ         Change verb to noun

Ÿ         Change noun to verb

Ÿ         Missing, superfluous or misplaced comma

Ÿ         Superfluous word

Ÿ         Gerund

Ÿ         Missing or superfluous hyphen

Ÿ         Missing subject or object

Ÿ         Negation

Ÿ         Word not in Simplified English dictionary

Ÿ         Parallelism

Ÿ         Passive voice

Ÿ         Verb with particle

Ÿ         Non-allowed prefix or suffix

Ÿ         Wrong position of preposition

Ÿ         Wrong punctuation

Ÿ         Rephrasing required

Ÿ         Long sentence (> 21 words)

Ÿ         Spelling error

Ÿ         Missing article

Ÿ         Wrong use of terminology

Ÿ         “That” vs “which” vs “who”

Ÿ         Translation problem

Ÿ         Complex verb tense

Ÿ         Wrong word

Ÿ         Noun cluster (> 2 nouns in a row)

Ÿ         Wrong verb

Ÿ         Date format

Ÿ        Not useful for MTranslatability

Ÿ         Wrong word for Simplified English

Ÿ         Vague measurement

Ÿ         Label

Ÿ         Number style

Ÿ         Safety warnings required

Ÿ         Gender-specific pronoun

 

AECMA Simplified English was designed to make the text unambiguous and also easier to read for non-native speakers of English.  It was not designed to enhance MT.  Therefore, it is not surprising that there are some AECMA-specific checks that do not improve MTranslatability.
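Some of the checks listed above are purely mechanical. As a rough sketch (not MAXit's implementation), the long-sentence check could look as follows; the sentence segmentation is deliberately naive, and the function name long_sentences is an assumption.

```python
import re

MAX_WORDS = 21  # threshold from the "Long sentence (> 21 words)" check above

def long_sentences(text, max_words=MAX_WORDS):
    """Return (word count, sentence) for every sentence longer than max_words."""
    # Naive segmentation: split at whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [(len(s.split()), s) for s in sentences if len(s.split()) > max_words]

text = ("Remove the cover. Before you remove the cover from the housing of the "
        "pump, disconnect the power cable and wait until the motor has stopped "
        "completely and cooled down.")
for count, sentence in long_sentences(text):
    print(count, "words:", sentence[:45] + "...")
```

Checks such as "Noun cluster" or "Passive voice" need part-of-speech information or a parse and cannot be done with string operations alone.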

 

References:  http://www.smartny.com/top_maxit.htm

 

Boeing technology

Ÿ         The Boeing Simplified English Checker is the most complete and accurate checker of Simplified English requirements.  In addition to checking for SE compliance, the Boeing SE Checker also catches mistakes like lack of subject-verb agreement, repeated words, misspelled words, and punctuation problems.



Ÿ         The Boeing Technical English Checker is a modified version of the Boeing SE Checker that supports more general technical writing. 

Ÿ         The Boeing Plain English Checker checks for compliance with the U.S. Government’s Plain Language requirements. (http://www.plainlanguage.gov)

 

References: Wojcik and Hoard; Wojcik and Holmback 1996; Wojcik et al. 1998.

http://www.boeing.com/assocproducts/sechecker/se.html




Ÿ        EasyEnglishAnalyzer (EEA)

Ÿ         IBM’s EEA tool is an authoring tool that points out ambiguity and complexity, thereby helping writers produce documents that are more MTranslatable.  EEA also does some standard grammar checking.  EEA is used by information developers in IBM.  Some checks that are not directly aimed at improving MTranslatability are included in order to accommodate corporate writing guidelines.

 

Ÿ        Useful for MTranslatability

Ÿ         Ambiguous nonfinite verb phrase

Ÿ         Ambiguous conjunction 

Ÿ         Ambiguous scope in coordination 

Ÿ         Passive voice and ambiguous double passives

Ÿ         Long sentence

Ÿ         Long noun string

Ÿ         Nonparsed sentence 

Ÿ         Unknown or misspelled words

Ÿ         Punctuation (missing commas, hyphens, periods, question marks; comma splice; slash to mean "and/or"; plural with (s))

Ÿ         Wrong comparative or superlative form 

Ÿ         Lack of subject-verb agreement 

Ÿ         Nonparallel coordinated phrase

Ÿ         Double negative

Ÿ         Noun phrase with many prepositions

Ÿ         Potentially wrong subject for verb phrase

Ÿ         Potentially wrong modification

Ÿ         Pronoun problems: Pronoun case and lack of agreement for reflexives

Ÿ         Dangling preposition

Ÿ         Noncapitalization of first word in a sentence 

Ÿ         Duplicated word

Ÿ         Verb contractions (‘s, which is ambiguous between is, has, and possessive; ‘d, which is ambiguous between had and would)

Ÿ         Missing "that"

Ÿ         Word not in controlled vocabulary 

Ÿ         Incomplete sentence

Ÿ        Not useful for MTranslatability

Ÿ         Latin abbreviation 

Ÿ         First occurrence of abbreviation

Ÿ         Wrong indefinite article "a" or "an"

Ÿ         Verb contractions (‘m, n’t, ‘re, ‘ll, ‘ve)

Ÿ         Restricted word; prohibited word

 

 

Ÿ         EEA’s Clarity Index summarizes the problems that are encountered in a given document as a single number that indicates the clarity (or MTranslatability) for the whole document.  The problems are weighted according to severity (impact), context, and document size.

 

Ÿ         EEA also includes ETerms, which collects multinouns and unknown words.  These are candidates for terminology to be added to the user lexicons.
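The Clarity Index described above combines problem counts into a single number. The tutorial does not give the formula, so the following is only a sketch of the general idea: the severity weights, the problem names, and the normalization per 100 words are all assumptions, and the weighting by context is left out.

```python
# Hypothetical severity weights; the real EEA weights are not given in this tutorial.
SEVERITY = {
    "ambiguous_coordination": 3.0,
    "long_sentence": 2.0,
    "duplicated_word": 1.0,
}

def weighted_problem_score(problem_counts, word_count):
    """Weighted problems per 100 words; in this sketch a higher value means less clear."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    total = sum(SEVERITY[kind] * count for kind, count in problem_counts.items())
    return 100.0 * total / word_count

counts = {"ambiguous_coordination": 2, "long_sentence": 1}
print(weighted_problem_score(counts, word_count=400))
```

Dividing by the word count is one simple way to account for document size, so that a long document is not penalized merely for containing more text.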

References: Bernth 1997, Bernth 1998a, Bernth 1999.

Other Helpful Tools

A very different way to prepare a document for better MTranslatability is annotating (or tagging) it. This method is used for various purposes, such as markup for formatting purposes or for enriching the semantic and knowledge content of documents. It is also used to make information on the World Wide Web easier to access and process.  Two workshops were held following the recent COLING conference in August of 2000 -- one on syntactic annotation and one on semantic annotation.  Both workshops included presentations and discussions on tools and techniques for linguistic annotation (http://www.coling.org/workshops.html).

 

Ÿ        Global Document Annotation (GDA)

“The GDA initiative aims at having Internet authors annotate their electronic documents with a common standard tag set which allows machines to automatically recognize the semantic and pragmatic structures of the documents.”  The GDA tags “are designed to aid machines understand documents”; not only for the purpose of translation.  The notorious sentence Time flies like an arrow could be annotated as follows:



<su> 

<np sem=time0> time</np>

<v sem=fly1>flies</v>

<adp>like<np>an arrow</np></adp> 

</su> 

 

where “XML elements such as <np>...</np> encode parse tree bracketing, and the property sem disambiguates polysemy of words.”  The word senses here (time0 and fly1) are based on WordNet senses. The plan is that a growing population of GDA users will develop their own ontologies for all languages.  The way such an XML tagger improves MTranslatability -- assuming all MT engines are modified to recognize the tags -- is obvious:  Some of the hardest problems for the MT parser will be solved. Ambiguity on both the syntactic and the semantic levels will be resolved, and proper nouns will be identified.
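To show how a program could consume such annotation, here is a minimal sketch that reads the example with Python's standard XML parser. It is not part of any GDA tool. The attribute values are quoted here (an addition to the example as printed) because a standard XML parser requires quoted attribute values.

```python
import xml.etree.ElementTree as ET

# The example from the text, with attribute values quoted so that it is well-formed XML.
annotated = (
    '<su>'
    '<np sem="time0">time</np> '
    '<v sem="fly1">flies</v> '
    '<adp>like <np>an arrow</np></adp>'
    '</su>'
)

root = ET.fromstring(annotated)

# Collect the word sense recorded for each sense-tagged element.
senses = {el.text.strip(): el.get("sem") for el in root.iter() if el.get("sem")}
print(senses)

# The top-level bracketing of the sentence, as a sequence of tag names.
print([child.tag for child in root])

# The plain sentence is recovered by dropping the tags.
print("".join(root.itertext()))
```

With the bracketing and the sem values available, an MT system would no longer have to choose between the competing readings of the sentence.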

“The difficulties in MT (machine translation) are mostly due to various types of ambiguity, concerning polysemy of words, phrase/clause attachment, coordination, anaphoric reference, scope of logical/modal operators, and so on. Unknown words and phrases are another major source of difficulty.  Translation accuracy is expected to drastically improve if the input documents are marked up with appropriate tags which resolve such ambiguities or supply missing information.  Some GDA tagsets will be designed for this purpose.  An MT system which exploits such tags to generate very accurate translations could be developed very soon if you already have a translation dictionary. The GDA sense tag dictionary and your translation dictionary could be automatically aligned for the most part.”  (http://www.etl.go.jp/etl/nl/GDA/translation.html)

The inventory of GDA tags is very comprehensive. In addition to syntactic and semantic word disambiguation, it includes tags for scoping, tense and aspect, indicators of levels of politeness, and types of utterances. Consequently, it is enormous. Without an efficient and user-friendly interface, using the tags seems a daunting task. But doubtless, if the tags are used and MT engines can interpret them, the translation output will improve dramatically. An interactive editor for GDA has been developed.

References: Hasida 2000;  http://www.etl.go.jp/~hasida/talk/gda/IC-e/20000806saic.html
 

Ÿ        Linguistic Annotation Language (LAL)

Linguistic Annotation Language (or LAL) is an XML-compliant tag set for assisting natural language processing programs.  It consists of linguistic information tags such as tags that specify word/phrasal boundaries and dependencies, and task-dependent instruction tags such as tags that define the scope of translation for machine translation.

 

Linguistic information tags include both syntactic and semantic tags.  The syntactic tags include tags that identify sentence boundaries; tags that denote word information (including attributes such as base form, semantic type, unique word ID, part-of-speech, dependencies, and language-specific features such as number, gender, tense, etc.); and tags that denote phrase and clause boundaries.  Besides boundaries, dependencies can also be expressed by using ids and modifier attributes of the word tag.  The user-definable semantic tags include tags indicating proper names (of e.g. persons, places, organizations, and countries), acronyms (and other abbreviations), dates, times, numbers, and monetary units.

Task-dependent instruction tags include a tag that indicates whether a piece of text should be translated or not and a tag that indicates whether a piece of text should be considered for summarization purposes. 

LAL tags are usually expressed by using XML namespaces.  Their XML namespace prefix is lal. 

LAL is recognized by two types of programs:  NLP systems for generating and using the LAL annotation, and an annotation editor as explained below.

 

NLP integration: (1) The English Slot Grammar (McCord 1980; McCord 1990) parser generates and accepts LAL annotation.

(2) A post-processing routine converts the output of the Japanese KNP parser (Kurohasi and Nagao 1994) into LAL format. 

The annotations produced in (1) and (2) are used as input to the annotation editors for English and Japanese. 

(3) IBM's English to German, French, Spanish, Italian, and Japanese translation engines can utilize LAL-annotated input. This means that ambiguities can be resolved by using the annotation editor to pre-edit the source text before translation into several languages. 

 

The annotation editor allows the user to edit the LAL annotation of a text. This editor is interfaced to the LAL-generating grammar, which provides annotation for each segment.  A human editor can then use the annotation editor's graphical user interface to check over the automatically-produced annotation and change it as necessary.  The user can do this without having to see the tags by working on the graphical representation of the tree; the changes are then reflected in the internal LAL annotation. 

LAL annotation is distinguished from previous tag-defining efforts by providing a comprehensive, yet simple list of annotation tags.  Keeping things simple is crucial for user acceptance.

 

Examples of LAL-annotation: 

 

She saw a man with a telescope. 

She <lal:w id="w1" lex="see" pos="v" sense="see1">saw</lal:w> a man <lal:w mod="w1">with</lal:w> a telescope.

 

This example shows that the seeing action is done with the telescope because "with" modifies the entity having id "w1", i.e. "see".

 

IBM 

<lal:acronym expan="International Business Machines">IBM</lal:acronym>

 

In this example "IBM" is marked as an acronym with expansion "International Business Machines". 

 

The cat chased a mouse. After it caught the mouse, it ate it. 

<lal:s>The <lal:w id="w1">cat</lal:w> chased <lal:w id="w2">a mouse</lal:w>.</lal:s> <lal:s>After <lal:w ref="w1">it</lal:w> caught the <lal:w ref="w2">mouse</lal:w>, <lal:w ref="w1">it</lal:w> ate <lal:w ref="w2">it</lal:w>.</lal:s>

           

Here the id attribute is used in the context of pronoun resolution. "cat" is assigned id=w1, and "mouse" has id=w2. The human editor can mark the ref value for the pronouns appropriately.
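The pronoun example can be processed mechanically once the lal prefix is bound to a namespace. The following sketch is not IBM's code: the wrapper element text and the namespace URI http://example.org/lal are placeholders invented for illustration, since the tutorial does not give the real URI.

```python
import xml.etree.ElementTree as ET

LAL_NS = "http://example.org/lal"  # placeholder URI, not the real LAL namespace
W = "{%s}w" % LAL_NS               # fully qualified name of the lal:w element

doc = (
    '<text xmlns:lal="%s">'
    '<lal:s>The <lal:w id="w1">cat</lal:w> chased <lal:w id="w2">a mouse</lal:w>.</lal:s> '
    '<lal:s>After <lal:w ref="w1">it</lal:w> caught the <lal:w ref="w2">mouse</lal:w>, '
    '<lal:w ref="w1">it</lal:w> ate <lal:w ref="w2">it</lal:w>.</lal:s>'
    '</text>'
) % LAL_NS

root = ET.fromstring(doc)
words = list(root.iter(W))

# Map each id to the text it labels, then resolve every ref against that map.
antecedents = {w.get("id"): w.text for w in words if w.get("id")}
resolved = [(w.text, antecedents[w.get("ref")]) for w in words if w.get("ref")]
print(resolved)
```

Each pronoun is paired with the text of its antecedent, which is the information a translation engine needs in order to choose a target pronoun with the right gender.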

 

References:  Watanabe et al. 2000;   http://www.trl.ibm.co.jp/projects/langtran/lal_e.htm

Developed by IBM Research Division   

Contact person: Hideo Watanabe, hiwat@jp.ibm.com



Ÿ        OTELO Text Handling Format, OTEXT 

OTEXT is a subsystem of the OTELO project, which is a collaborative effort between the European Union and a consortium of industrial partners whose aim is to design and develop a comprehensive automated translator's environment. See http://www.otelo.lu/broctxt.htm. The project partners have developed a standard set of tags that help exchange documents across different MT and translation memory systems. There are a couple of MT-specific tags that mark strings that should not be translated by the MT system. The <pr> (protect) tag protects strings that are not part of the text flow.  These are typically parameter settings, internal control information etc.  The <l> (literal) tag protects strings that are part of the text flow such as a piece of code or an address.  In contrast to these two tags, the <sp/> (special character) tag specifies characters which have special meanings and that need to be preserved by the MT system.  Examples are soft returns, hard blanks etc.  Finally, the <tu> (text unit) tag is used to indicate segmentation.
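One way an MT front end might honor the <l> tag is to swap each literal for a placeholder before translation and restore it afterwards. The sketch below assumes simple, non-nested <l>...</l> markup; the placeholder scheme and both function names are hypothetical and are not taken from the OTEXT specification.

```python
import re

LITERAL = re.compile(r"<l>(.*?)</l>", re.DOTALL)

def protect_literals(segment):
    """Replace each <l>...</l> span by a numbered placeholder.

    Returns the masked text and the list of saved literal strings."""
    saved = []

    def stash(match):
        saved.append(match.group(1))
        return "__LIT%d__" % (len(saved) - 1)

    return LITERAL.sub(stash, segment), saved

def restore_literals(translated, saved):
    """Put the saved literal strings back in place of their placeholders."""
    for index, literal in enumerate(saved):
        translated = translated.replace("__LIT%d__" % index, literal)
    return translated

masked, saved = protect_literals("Type <l>copy a.txt b.txt</l> and press Enter.")
print(masked)
# A translation of the masked text is simulated by hand here.
print(restore_literals("Geben Sie __LIT0__ ein und drücken Sie die Eingabetaste.", saved))
```

The literal stays in the text flow, so the translated sentence is built around it, but its content never reaches the translation engine.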

Reference:  G. Thurmair 1998.



Ways to Measure MTranslatability

Ÿ        Automatic readability scoring

Ÿ         Is often provided with standard grammar checkers (Microsoft Word 2000, Lotus Word Pro 97, WordPerfect)

Ÿ         Is designed for human readability, not MTranslatability

Ÿ         Based on sentence length and word length          

Ÿ         Shorter words and shorter segments are considered easier to read.

Ÿ         But shorter words are often more ambiguous.

Ÿ         And very short segments (4 words or less) are very ambiguous in English due to the great ambiguity of part of speech in English.
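The effect described in the list above can be reproduced with any formula that uses only word length and sentence length. The sketch below uses the Automated Readability Index (ARI) as a representative; the tutorial does not say which formulas the word processors use, so the choice of ARI is an assumption. A higher ARI value means the text is rated harder to read.

```python
import re

def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    characters = sum(len(w) for w in words)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43

# "Check valve." can be read as a command or as a noun compound.
terse = "Check valve."
explicit = "Examine the valve that controls the pressure."

print(round(automated_readability_index(terse), 2))
print(round(automated_readability_index(explicit), 2))
```

The short, ambiguous segment is rated easier to read than the longer, explicit sentence, even though the explicit sentence is the safer input for an MT system.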

 

We built a short test corpus of problematic sentences and edited them according to the recommendations in the section on Ways to Improve MTranslatability. We found that the corpus showed improved clarity and translatability after pre-editing, but at the same time it achieved a reduced readability score. One would assume -- and many writers claim it -- that readability and translatability are almost synonymous, or at least that one is a prerequisite of the other. It turns out that this is not the case, at least not with the automated readability scores incorporated in the common word processors.

Carol Shehadeh and Judith Strother (1994) report on a survey they undertook on “The Use of Computerized Readability Formulas: Bane or Blessing?”  They criticize existing readability scales for not taking into account such factors as organization, clarity, syntax and structure.

Ÿ        Automatic detection of lexical inadequacies

Most MT systems have a utility that enables the user to detect words or phrases that are not listed in the dictionary. But a word or phrase may be found in the dictionary, and thus not appear on the list of “unfound” or “unknown” words, even though the dictionary does not cover it with the part of speech in which it is used. The other possible shortcoming is that the word is in the dictionary but not with the semantic sense and appropriate transfer that is required for the document to be translated. For instance, the word pig is in the lexicon as referring to an animal, with the German transfer Schwein. The document to be translated, however, deals with the domain of oil production, where a pig refers to a technical device and should be translated as Molch.  Or, the word OK will not be listed as “unfound” because the dictionary contains the adjective. The document, however, uses the word as a verb, which is not covered in the dictionary. Because of such deficiencies of a simple, context-free dictionary look-up, some MT systems come with more context-sensitive listings where one can query the coverage for a particular domain or subject area, or where one can generate a list of all content words with their anticipated translation in context. Checking such a list is time-consuming, but rewarding, if one finds uncovered entries or transfers.
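The difference between a plain look-up and a context-sensitive coverage check can be sketched with a toy dictionary. The data structure, the domain labels, the transfer given for OK, and the function name coverage_problem are assumptions made for illustration; they do not describe any particular MT system.

```python
# Hypothetical MT dictionary keyed by (word, part of speech);
# each value maps a domain to the German transfer.
MT_DICTIONARY = {
    ("pig", "noun"): {"general": "Schwein"},
    ("ok", "adjective"): {"general": "in Ordnung"},
}

def coverage_problem(word, pos, domain):
    """Return a description of the coverage gap, or None if the entry is adequate."""
    key = (word.lower(), pos)
    if not any(entry_word == key[0] for entry_word, _ in MT_DICTIONARY):
        return "unfound"  # the only case a plain look-up reports
    if key not in MT_DICTIONARY:
        return "part of speech not covered"
    if domain not in MT_DICTIONARY[key]:
        return "no transfer for this domain"
    return None

print(coverage_problem("gasket", "noun", "general"))
print(coverage_problem("OK", "verb", "general"))
print(coverage_problem("pig", "noun", "oil production"))
```

A plain unfound-word list would report only the first of the three cases; the other two are exactly the gaps described above for OK and pig.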



Ÿ        Automatic MTranslatability scoring

 

The Logos Translatability Index (TI)

In the early 1990s, researchers at Logos Corporation developed a utility prototype that automatically measures and scores the suitability of English and German documents for the Logos MT system.

 

Ÿ        Gross statistical properties of the document as a whole

This Translatability Index (TI) is based on gross statistical properties of a document rather than on parsing the sentences. This was suggested by the fact that there appeared to be a rough correlation between the quality of raw MT output and certain gross properties of the text, such as length of the sentences, degree of syntactic complexity, discourse characteristics, etc.  Although the TI score is derived on the basis of gross sentence properties, sentence-specific information cannot be provided with any degree of reliability because there’s no full-scale parsing. 

 

Ÿ        Scoring procedure

The program starts off with a score of 7 and then penalizes the sentences for negative properties. The decision as to the minimum score that a document must reach in order to be acceptable for gisting or post-editing purposes is subjective. There is no absolute, objective threshold.

 

Ÿ        Statistical data and results

“Negative” sentence properties are:  too long or too short; words not found in the MT dictionary; short parentheses; coordination; homographs; interrogatives; unmatched parentheses; embedded clauses; part of speech ambiguities; certain ambiguous words (such as -ing verbs, as, with, etc.), and so forth.
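The scoring procedure (start at 7, then penalize negative properties) can be sketched as follows. Only the starting score comes from the text; the penalty weights, the property names, and the decision to scale each penalty by the share of affected sentences are assumptions, not the actual Logos TI.

```python
START_SCORE = 7.0  # starting score stated in the text

# Hypothetical penalty weights; the real TI weights are not given in this tutorial.
PENALTIES = {
    "too_long": 3.0,
    "too_short": 1.0,
    "unfound_word": 2.0,
    "unmatched_parenthesis": 1.5,
}

def translatability_index(sentence_count, property_counts):
    """Subtract from 7 a penalty proportional to the share of sentences with each property."""
    score = START_SCORE
    for name, count in property_counts.items():
        score -= PENALTIES[name] * count / sentence_count
    return max(score, 0.0)

# A 100-sentence document with 20 overlong sentences and 10 with unfound words.
print(round(translatability_index(100, {"too_long": 20, "unfound_word": 10}), 2))
```

Because the score is computed from counts over the whole document, it says nothing reliable about any individual sentence, which matches the limitation noted above.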

 

Ÿ        Operational use and benefits

Before translation, the user can have the document scored by the TI program.  It will return with a score and a recommendation such as This document is not suitable for MT or This document is conditionally suitable for MT. The TI would also suggest why a particular document is not or only conditionally suitable.  It would tell the user, for instance, 

The sentences on the whole are too long 

Sentence # x is far too long 

The document contains many words and compounds that are not in the dictionary. Run your document through the New-Word-Search utility and update your dictionary 

The document contains many difficult words such as ... 

 

The user can make changes in the document in order to decrease complexity and ambiguity and update new words and phrases. Thus, the TI can provide users with a measure that not only correlates with the quality of the MT output but can also help them modify their source document in such a way as to improve the MT output quality.

 

Reference: Gdaniec 1994.
 

Translation Confidence Index (TCI)

Ÿ         IBM’s Translation Confidence Index automatically provides an index of the MT system’s own confidence in its translation, for a given segment.  In other words, the TCI returns a translation quality value for each segment.  This value can be used to mark segments that need special attention during post-editing. The confidence value is calculated during the various stages of the MTranslation process. It is based on such factors as parse scores, text characteristics (ambiguity, difficult constructions), lexical coverage, and success of structural generation (transformations).  These factors can be set on or off in the TCI’s language-pair-specific user profile.  Whereas the TCI was designed to give an overall picture of the expected quality of the MT output by taking all aspects of the MTranslation process into account, the parts that deal with source analysis give a picture of the general MTranslatability.  Turning all non-source language-specific factors off in the user profile in effect gives an MTranslatability score, independent of the target language. With all aspects taken into consideration, the TCI score will give the translatability for a particular language pair for a specific MT system.
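The role of the user profile can be illustrated with a small sketch. It is not the TCI algorithm: the 100-point scale, the penalty values, and the assignment of factors to source analysis are assumptions chosen only to show how switching factors off changes what the score measures.

```python
def confidence(factor_penalties, profile):
    """Start from 100 and subtract the penalty of every factor switched on in the profile."""
    score = 100.0
    for name, penalty in factor_penalties.items():
        if profile.get(name, False):
            score -= penalty
    return max(score, 0.0)

# Hypothetical penalties collected for one segment during translation.
segment = {
    "parse_score": 10.0,
    "text_characteristics": 5.0,
    "lexical_coverage": 8.0,
    "structural_generation": 12.0,
}

full_profile = dict.fromkeys(segment, True)

# Assumption: only the first two factors are treated as pure source analysis.
source_only_profile = {
    "parse_score": True,
    "text_characteristics": True,
    "lexical_coverage": False,
    "structural_generation": False,
}

print(confidence(segment, full_profile))         # language-pair-specific confidence
print(confidence(segment, source_only_profile))  # source-only (MTranslatability) view
```

The same segment data yields two different numbers: one for the expected quality of a particular language pair, and one that depends only on the source text.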

 

References: Bernth 1999; Bernth and McCord 2000.



Conclusion

Ÿ         Be careful when you create your documents:

Ÿ         Avoid ambiguity

Ÿ         Avoid bad style

Ÿ         Avoid incorrect grammar

Ÿ         Avoid incorrect spelling

Ÿ         Avoid incorrect punctuation

Ÿ         Avoid bad markup

 

Ÿ         If you expect your documents to be translated by an MT system, make sure that the MT dictionary is updated to cover adequate parts of speech and subject area senses for all your terminology.

 

Ÿ         And remember: What makes life easier for the human reader is not always useful in the context of MT!

                                                             

 

 

If you have control over the source text, there is a lot you can do to improve MTranslatability!

 

 

Discussion of a Special Interest Group on MTranslatability

This SIG is devoted to MTranslatability, i.e. translatability seen in the context of MT. Relevant topics include -- but are not limited to -- document and text characteristics such as markup, spelling, grammar, style, and ambiguity, as well as tools to check for and improve these characteristics.  Among these tools are Controlled Language checkers, grammar and spell checkers, and dictionary management, annotation, and translatability scoring tools.


If you are interested in joining this SIG please send email to the moderators, Arendse Bernth and Claudia Gdaniec, at arendse@us.ibm.com and cgdaniec@us.ibm.com.



Resources

Ÿ        Papers

Adriaens, Geert: “Simplified English Grammar and Style Correction in an MT Framework: The LRE SEEC Project”, Proceedings of the 16th Conference on Translating and the Computer, 1994.

Allen, Jeff: “Different Types of Controlled Languages”, TC-Forum 1-99, 1999, http://www.tc-forum.org  (look under “Archive”).

Almquist, Ingrid and Anna Sågvall Hein: “Defining Scania Swedish -- a Controlled Language for Truck Maintenance”, Proceedings of The First International Workshop On Controlled Language Applications, Katholieke Universiteit Leuven, Belgium, 1996, pp. 159--165.

 

Baker, K., A. Franz, P. Jordan, T. Mitamura, and E. Nyberg: “Coping with Ambiguity in a Large-Scale Machine Translation System”, Proceedings of COLING-94, 1994.

 

Barthe, Kathy, Claire Juaneda, Dominique Leseigneur, Jean-Claude Loquet, Claude Morin, Jean Escande, Annick Vayrette, “GIFAS Rationalized French: A Controlled Language for Aerospace Documentation in French.” Technical Communication, Second Quarter 1999: 220-229.

 

Becker, Markus, Andrew Bredenkamp, Berthold Crysmann, and Judith Klein, “Annotations of Error Types for German USENET news corpus”, Proceedings of the ATALA workshop on Treebanks, Paris, 1999.

 

Bernth, Arendse: “EasyEnglish: A Tool for Improving Document Quality”, Proceedings of the Fifth Conference on Applied Natural Language Processing, 1997, Association for Computational Linguistics, pp. 159--165.

 

Bernth, Arendse: “EasyEnglish:  Preprocessing for MT”, Proceedings of the Second International Workshop on Controlled Language Applications, Pittsburgh, PA, Carnegie Mellon University, 1998a, pp. 30--41.

 

Bernth, Arendse: “EasyEnglish: Addressing Structural Ambiguity”, Proceedings of AMTA-98, Association for Machine Translation in the Americas, 1998b, pp. 164--173.

 

Bernth, Arendse: “A Confidence Index for Machine Translation”, Proceedings of Theoretical and Methodological Issues in Machine Translation, Chester, England, 1999a, pp. 120--127.

 

Bernth, Arendse: “Tools for Improving E-G MT Quality”, Proceedings of Theoretical and Methodological Issues in Machine Translation, Workshop on Problems and Potential of English-to-German MT Systems, Chester, England, 1999b.

 

Bernth, Arendse: “Controlling Input and Output of MT for Greater User Acceptance”, Proceedings of the 21st Conference on Translating and the Computer, London, England, 1999c.

 

Bernth, Arendse and Michael C. McCord: “Effect of Source Analysis on Translation Confidence”, Proceedings of AMTA-2000, Association for Machine Translation in the Americas, 2000.

 

Blicq, Ron: “A Standard for International English Spelling?”, PCS Newsletter, Vol. 44, no. 2, IEEE Professional Communication Society, pp. 21--22.

 

Blicq, Ron: “Do Technical Writers Need an International Standard for English-Language Spelling?”, TC-Forum 4-99, 1999 http://www.tc-forum.org (look under “Archive”).

 

Bredenkamp, Andrew, Berthold Crysmann, and Mirela Petrea: “Looking for Errors: A Declarative Formalism for Resource-Adaptive Language Checking”, Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece, 2000.

 

Gdaniec, Claudia: “The Logos Translatability Index”, Proceedings of AMTA-94, Association for Machine Translation in the Americas, 1994, pp. 97--105.

 

Harkus, Susan: “Writing for Translation”, Proceedings of the Australasian Online Documentation Conference, Brisbane, Australia, http://www.multilingualwebmaster.com/library/writing-TR.html

 

Hasida, Koiti, “GDA: Semantically Annotated Documents as Intelligent Content”, Proceedings of the COLING 2000 post-conference workshop Semantic Annotation and Intelligent Content, August 2000.

 

Hayes, Phil, Steve Maxwell and Linda Schmandt:  “Controlled English Advantages for Translated and Original English Documents”, Proceedings of The First International Workshop On Controlled Language Applications, Katholieke Universiteit Leuven, Belgium, 1996, pp. 84--92.

 

Holmback, Heather, Lisbet Duncan and Philip Harrison:  “A Word sense Checking Application for Simplified English”, Proceedings of The Third International Workshop On Controlled Language Applications, Seattle, WA, 2000, pp. 120--133.

 

Isseroff, Ami: “Comment on Technical Writers Gain Control”, TC-Forum 3-99, http://www.tc-forum.org (look under “Archive”).

 

Janowski, Wladyslaw: “Controlled Language -- Risks and Side Effects”, TC-Forum 2-98, 1998, http://www.tc-forum.org (look under “Archive”).

 

Janssen, Gerd, Gerhard Mark and Bernd Dobbert: “Simplified German -- A Practical Approach to Documentation and Translation”, Proceedings of The First International Workshop On Controlled Language Applications, Katholieke Universiteit Leuven, Belgium, 1996, pp. 150--158.

 

Kamprath, Christine, E. Adolphson, T. Mitamura, and E. Nyberg:  “Controlled Language for Multilingual Document Production: Experience with Caterpillar Technical English”, Proceedings of the Second International Workshop on Controlled Language Applications, Pittsburgh, PA, Carnegie Mellon University, 1998.

 

Knops, Uus and B. Depoortere: “Controlled Language and Machine Translation”, Proceedings of the Second International Workshop on Controlled Language Applications, Pittsburgh, PA, Carnegie Mellon University, 1998, pp. 42--50.

 

Kohl, John R.:  “Improving Translatability and Readability with Syntactic Cues”, TechnicalCOMMUNICATION, May 1999, pp. 149--166.

 

Korpela, Jukka: “Translation-friendly authoring, especially in HTML for the WWW”, 1998,  http://www.hut.fi/u/jkorpela/transl/master.html

 

Kurohasi, S. and M. Nagao: “A Syntactic Analysis Method of Long Japanese Sentences Based on the Detection of Conjunctive Structures”, Computational Linguistics, vol. 20, no. 4, 1994.

 

Kumhyr, David, Carla Merrill, and Karin Spalink: “Internationalization and Translatability”, Proceedings of AMTA-94, Association for Machine Translation in the Americas, 1994, pp. 142--148.

 

Lucent Global Translations: “Translatability”, http://www.lucent.com/translations/translate.html

 

McCord, M. C.: “Slot Grammars”, Computational Linguistics, vol. 6, 1980, pp. 31--43.

 

McCord, M. C.: “Slot Grammar: A System for Simpler Construction of Practical Natural Language Grammars”, in R. Studer: Natural Language and Logic: International Scientific Symposium, Lecture Notes in Computer Science, Springer Verlag, Berlin, 1990, pp. 118--145.

 

Means, Linda and Kurt Godden: “The Controlled Automotive Service Language (CASL) Project”, Proceedings of The First International Workshop On Controlled Language Applications, Katholieke Universiteit Leuven, Belgium, 1996, pp. 106--114.

 

Miles, Thomas H.: “Guide to Building Sensible Phrases with Noun Strings and Unit Modifiers”, Society for Technical Communication, 1994.

 

Mitamura, Teruko: “Controlled Language for Multilingual Machine Translation”, Proceedings of Machine Translation Summit VII, Singapore, 1999.

 

Mitamura, Teruko and Eric Nyberg:  “Controlled English for Knowledge-Based MT:  Experience with the KANT System”, Proceedings of the 6th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-95), 1995.

 

Mitamura, Teruko and Eric Nyberg: “Controlled Languages”, tutorial given at AMTA-2000.

 

Muldoon, Donna: “A Writer’s View of Using a Controlled Language”, TC-Forum 3-99, 1999, http://www.tc-forum.org (look under “Archive”).

 

Nasr, A., O. Rambow and R. Kittredge: “A Linguistic Framework for Controlled Language Systems”, Proceedings of the Second International Workshop on Controlled Language Applications (CLAW-98), 1998.

 

Nyberg, Eric and Teruko Mitamura: “Controlled Language and Knowledge-Based Machine Translation: Principles and Practice”, Proceedings of The First International Workshop On Controlled Language Applications, Katholieke Universiteit Leuven, Belgium, 1996, pp. 74--83.

 

Reuther, Ursula: “Controlling Language in an Industrial Application”, Proceedings of the Second International Workshop on Controlled Language Applications, Pittsburgh, PA, Carnegie Mellon University, 1998, pp. 174--184.

 

Schmidt-Wigger, Antje: “Grammar and Style Checking for German”, Proceedings of the Second International Workshop on Controlled Language Applications (CLAW-98), 1998, pp. 76--85.

 

Shehadeh, Carol M. El. and Judith B. Strother: “The Use of Computerized Readability Formulas: Bane or Blessing?”, Proceedings of the Society for Technical Communication Annual Conference, 1994, pp. 225--227.

 

Thurmair, G.:  “OTELO Text Handling Format”, OTEXT, V32, LE1-2703-TX-SR. November 1998.

 

van der Eijck, Pim, Michiel de Koning and Gert van der Steen: “Controlled Language Correction and Translation”, Proceedings of The First International Workshop On Controlled Language Applications, Katholieke Universiteit Leuven, Belgium, 1996, pp. 64--73.

 

Waldhör, Kl.:  “The Euramis Text Handling Format”, report 1997.

 

Waldhör, Kl.:  “OTEXT Revision Proposal”, report 1998.

 

Watanabe, H., Nagao, K., McCord, M.C., and Bernth, A.:  "Improving Natural Language Processing by Linguistic Document Annotation", Proc. of Coling 2000 Workshop on Semantic Annotation and Intelligent Content, pp. 20--27.

 

Wojcik, Richard H. and James E. Hoard: “Controlled Languages in Industry”, http://cslu.cse.ogi.edu/HLTsurvey/ch7node8.html

 

Wojcik, Richard H. and Heather Holmback: “Getting a Controlled Language off the Ground at Boeing”, Proceedings of The First International Workshop On Controlled Language Applications, Katholieke Universiteit Leuven, Belgium, 1996, pp. 22--31.

 

Wojcik, Richard, Heather Holmback and James Hoard: “Boeing Technical English: An Extension of AECMA beyond the Aircraft Maintenance Domain”, Proceedings of the Second International Workshop on Controlled Language Applications, Pittsburgh, PA, Carnegie Mellon University, 1998, pp. 114--123.

             
 

Periodicals

TC-Forum

Multilingual Computing & Technology

                          

Conferences

CLAW96 http://www.ccl.kuleuven.ac.be/CLAW/programme.html
CLAW98 http://www.lti.cs.cmu.edu/CLAW98/

CLAW2000 http://www.up.univ-mrs.fr/~veronis/claw2000

Ordering information for proceedings: http://www.controlled-language.org/

 

International Technical Communication Conference (see www.stc-va.org)

Translating and the Computer http://www.aslib.co.uk/conferences/tc22.html
 

Organizations

Society for Technical Communication: http://www.stc-va.org/

International Communication Association: http://www.icahdq.org/

Communication and Technical Division: http://www.spcomm.uiuc.edu:8000/ica/commtech/                                        

IEEE Professional Communication Society: http://www.ieeepcs.org/

Association for Machine Translation in the Americas (AMTA): http://www.isi.edu/natural-language/organizations/AMTA.html

Association of Teachers of Technical Writing: http://english.ttu.edu/attwtest/default.asp

Association for Information Management: http://www.aslib.co.uk
 

Web Sites

http://www.languagepartners.com/ has a number of good articles on MT and translatability


http://www.plainlanguage.gov/

http://www.nmsu.edu/techprof/attwrsrc/ControlledLangBib.html

 

http://ourworld.compuserve.com/homepages/mpp_europe/HOMEPAGE.HTM

 

http://www.smartny.com/top_maxit.htm offers the MAXit Checker for Controlled English or Simplified English

 

http://www.linglink.lu/hlt/projects/multidoc/AR-99/AR-99.asp describes the Multilingual Technical Documentation project for the European automotive industry.

 

http://www.lant.com/efficient.htm

 

http://www.bridgeterm.com/newsite/home.asp

 

http://www.boeing.com/assocproducts/sechecker/se.html describes the Boeing products

 

http://clwww.essex.ac.uk/~doug/book/node53.html#SECTION00820000000000000000

Good chapter on ambiguity and MT from the book "Machine Translation: An Introductory Guide" by Doug Arnold, Lorna Balkan, Siety Meijer, R. Lee Humphreys, and Louisa Sadler. Full text available at: http://clwww.essex.ac.uk/~doug/book/book.html

 

http://www.world-ready.com/r_intl.htm  Nancy Hoft Consulting: "Reading on the Web about International Issues"

 

http://www.aecma.org/sengprd.htm  A list of producers of software products that support Simplified English, and of organizations that market and provide training in the use of Simplified English.

Appendix

Original sentence with translations, and modified sentence with improved translations. [9]



(a) If the user provided file is not found, an error message is issued. 

Wenn der Benutzer, der Datei geliefert wird, nicht gefunden wird, wird eine Fehlermeldung ausgegeben. 

Si no se encuentra el usuario proporcionado fichero, se publica un mensaje de error.

(b) If the user-provided file is not found, an error message is issued. 

Wenn die benutzergelieferte Datei nicht gefunden wird, wird eine Fehlermeldung ausgegeben. 

Si no se encuentra el fichero usuario proporcionado, se publica un mensaje de error.

 

(a) He bit-off more than he can chew. 

Er biß aus mehr als er kauen kann.

Él el trozo de más que él puede rumiar.

(b) He bit off more than he can chew. 

Er biß mehr ab als er kauen kann.

Él abarcó más que él puede masticar.

 

(a) This is a postage meter. 

Dies ist eine Frankiermaschine.

Esto es un contador del franqueo.

(b) External mail to be postage metered. 

Externe Post Porto sein gemessen. 

Correo externo ser franqueo medido. 

 

(a) The file information and data type is of utmost importance. 

Die Dateiangabe, und Datentyp ist von äußerster Wichtigkeit. 

Presente información y el tipo de datos está de importancia mayor. 

(b) The file information and the data type are of utmost importance. 

Die Dateiinformationen und der Datentyp sind von äußerster Wichtigkeit.

Información de fichero y tipo de datos están de importancia mayor.

 

(a) Woven of combed cotton, you will love our sweater's soft feel. 

Wenn Sie von gekämmter Baumwolle gewoben werde, werden Sie das weiche Gefühl unseres Pullovers lieben. 

Tejido de algodón peinado, le encantará la sensación suave de nuestro suéter.

(b) Woven of combed cotton, this sweater will delight you with its soft feel. 

Wenn dieser Pullover von gekämmter Baumwolle gewoben wird, wird er Sie mit seinem weichen Gefühl erfreuen. 

Tejido de algodón peinado, este suéter lo deleitará con su sensación suave.

(c) Our sweater is woven of combed cotton, and you'll love its soft feel. 

Unser Pullover wird von gekämmter Baumwolle gewoben und Sie werden sein weiches Gefühl lieben. 

Nuestro suéter es tejido de algodón peinado, y le encantará su sensación suave.

 

(a) Is John going to come? -- He was to, but he may not. 

Wird John kommen? Er war zu, aber er nicht kann.

(b) Is John going to come? -- He was going to, but he may not. 

Wird John kommen? Er ging zu, aber er nicht kann.

(c) Is John going to come? -- He was going to come, but he may not. 

Wird John kommen? -- Er war im Begriff zu kommen, aber er kann nicht.

 

(a) Meeting requirements. 

Sitzung Anforderungen.

Requisitos de Reunión.

(b) Meeting the requirements. 

Erfüllen der Bedingungen.

Conociendo los requisitos. 

 

(a) The system reads the file or result field definition. 

Das System liest die Datei oder Ergebnisfelddefinition.

El sistema lee el fichero o definición de campo de resultado. 

(b) The system reads the file or the result field definition. 

Das System liest die Datei oder die Ergebnisfelddefinition. 

El sistema lee el fichero o la definición de campo de resultado. 

 

(a) The application can use the window to establish a dialog with the user and format text responses. 

Die Anwendung kann das Window benutzen, um einen Dialog mit den Benutzer- und Formattextantworten herzustellen.

(b) The application can use the window in order to establish a dialog with the user and can format text responses. 

Die Anwendung kann das Window benutzen, um einen Dialog mit dem Benutzer herzustellen und kann Textantworten formatieren.

(c) The application can use the window in order to establish a dialog with the user and to format text responses. 

Die Anwendung kann das Window benutzen, um einen Dialog mit dem Benutzer herzustellen und Textantworten zu formatieren.

 

(a) The system immediately terminates the program if a hard error and exception occur. 

Das System beendet das Programm sofort, wenn ein schwerer Fehler und Ausnahme auftreten. 

El sistema termina el programa inmediatamente si un error duro o la excepción ocurre. 

(b) The system immediately terminates the program if both a hard error and an exception occur. 

Das System beendet das Programm sofort, wenn sowohl ein schwerer Fehler als auch eine Ausnahme auftreten. 

El sistema termina el programa inmediatamente si un error duro o una excepción ocurre. 

The uninterruptible power supply message queue system value 

 

Der Nachrichtenschlangensystemwert von unterbrechungsfreier Stromversorgung

Der uninterruptible Macht Versorgung Mitteilung Schlange System Wert

El valor de sistema de cola de mensaje de fuente de alimentación ininterrumpible

 

(a) The cotton shirts are made from comes from Arizona. 

Die Baumwollhemden werden gemacht von aus Arizona kommt. 

Se hacen las camisas de algodón de viene de Arizona.

(b) The cotton that shirts are made from comes from Arizona. 

Die Baumwolle, von der Hemden gemacht werden, kommt aus Arizona.

El algodón del que se hacen camisas viene de Arizona.

 

(a) In experiment 6 we were interested in the reading subjects spontaneously achieve for such a headline. 

In Versuch erreichen 6, die wir daran interessiert waren, daß der Versuchspersonen spontan vorliest, für solch eine Schlagzeile. 

En experimento 6 que nosotros estábamos interesados al que el lea a asuntos espontáneamente consiguen para tal titular. 

(b) In experiment 6 we were interested in the reading that subjects spontaneously achieve for such a headline. 

In Versuch 6 waren wir an dem Lesen interessiert, das Versuchspersonen spontan für solch eine Schlagzeile erreichen.

En experimento 6 nosotros estábamos interesados en la lectura que los asuntos consiguen espontáneamente para tal titular.

 

(a) The amount of adjacent space available in storage does not restrict the size of a library, or of any other object. 

Die Menge des angrenzenden Platzes vorhanden in der Speicherung schränkt nicht die Größe einer Bibliothek oder irgendeiner anderen Nachricht ein. 

(b) The amount of adjacent space that is available in storage does not restrict the size of a library, or of any other object. 

Die Menge des angrenzenden Platzes, der in der Speicherung vorhanden ist, schränkt nicht die Größe einer Bibliothek oder irgendeiner anderen Nachricht ein. 

 

(a) The horse raced past the barn fell. 

Mit dem ein Wettrennen an der Scheune vorbei gemachten Pferd fiel.

El caballo hecho correr por delante del granero cayó. 

(b) The horse that was raced past the barn fell. 

Das Pferd, mit dem ein Wettrennen an der Scheune vorbei gemacht wurde, fiel.

El caballo que se hizo correr por delante del granero cayó. 

 

(a) Make sure the power is turned off. 

Stellen Sie die Macht sicher, hat ausgemacht.

(b) Make sure that the power is turned off. 

Stellen Sie sicher, daß die Macht ausgemacht hat.

 

(a) You can develop an application using the TCP/IP sockets. 

Sie können eine Anwendung mit den TCP/IP Einfaßungen entwickeln.

(b) You can develop an application that uses the TCP/IP sockets. 

Sie können eine Anwendung entwickeln, die die TCP/IP Einfaßungen benutzt. 

(c) You can develop an application by using the TCP/IP sockets. 

Sie können eine Anwendung entwickeln, indem Sie die TCP/IP Einfaßungen verwenden. 

 

(a) DATAMAX continues processing statements after repairing the data set. 

DATAMAX setzt das Verarbeiten von Aussagen nach dem Reparieren des Daten-Satzes fort.

(b) DATAMAX continues the processing statements after it repairs the data set. 

DATAMAX setzt die Verarbeitung-Aussagen fort, nachdem es den Daten-Satz repariert hat.

 

(a) Receiving notices. 

Empfangsmitteilungen.

Avisos receptores.

(b) To receive notices. 

Um Mitteilungen zu erhalten.

Para recibir avisos.

 

(a) After inserting the diskette, the system will read the file. 

Nachdem das System die Diskette einführt, wird es die Datei lesen. 

(b) After you insert the diskette, the system will read the file. 

Nachdem Sie die Diskette einführen, wird das System die Datei lesen.

 

Visiting relatives can be a nuisance. 

Das Besuchen von Verwandten kann ein Ärgernis sein. 

Besuchsverwandte können eine Beeinträchtigung sein. 

Visitar familiares puede ser una molestia.

Los parientes que visitan pueden ser un fastidio. 

 

At XYZ Inc. we don't waste any time improving service for our customers! 

Bei XYZ Inc. vergeuden wir keine Zeit und verbessern Dienst für unsere Kunden!

Bei XYZ Inc. vergeuden wir keine Zeit damit, Dienst für unsere Kunden zu verbessern! 

 

(a) She ran up a bill. 

Sie lief eine Rechnung herauf.

Ella corrió hacia arriba una factura.

(b) She accumulated a bill. 

Sie akkumulierte eine Rechnung.

Ella acumuló una factura.

 

(a) Transfer file. 

Übertragungsdatei.

Fichero de transferencia.

(b) Transfer the file. 

Übertragen Sie die Datei.

Transfiera el fichero.

(c) The transfer file. 

Die Übertragungsdatei.

El fichero de transferencia.

 

(a) The size of a library, or of any other object, is not restricted by the amount of adjacent space available in storage.

Die Menge benachbarten Raumes verfügbar in Lagerung beschränkt die Größe einer Bibliothek nicht, oder von irgendeinem anderen Gegenstand.

(b) The amount of adjacent space that is available in storage does not restrict the size of a library, or of any other object. 

Die Menge von benachbartem Raum, der in Lagerung verfügbar ist, beschränkt die Größe einer Bibliothek nicht, oder von irgendeinem anderen Gegenstand.

 

(a) After you have set up your workstation, you can: 

- Log on to the network 

- Work locally 

Nachdem Sie Ihren Arbeitsplatzrechner aufgestellt haben, können Sie: 

- Melden Sie sich beim Netz an 

- Arbeiten Sie am Ort 

Después de que usted ha preparado su puesto de trabajo, usted puede: 

- El leño en a la red

- Trabaje localmente 

(b) After you have set up your workstation, you can do the following:

- You can log on to the network. 

- You can work locally. 

Nachdem Sie Ihren Arbeitsplatzrechner aufgestellt haben, können Sie folgendes machen: 

- Sie können sich beim Netz anmelden.

- Sie können am Ort arbeiten.

Después de que usted ha preparado su puesto de trabajo, usted puede hacer a lo siguiente:

- Usted puede anotar adelante a la red. 

- Usted puede trabajar localmente. 



[1]

The recommendations in this section are based partly on guidelines for clear writing and writing for translatability published in many places; for example, see:

Kohl, John R.: “Improving Translatability and Readability with Syntactic Cues”, TechnicalCOMMUNICATION, May 1999, pp. 149--166.

 

“Writing for Translation” (http://www.languagepartners.com//reference-center/wri4tr.htm)

 

Korpela, Jukka: “Translation-friendly authoring, especially in HTML for the WWW”, 1998, http://www.hut.fi/u/jkorpela/transl/master.html

 

Harkus, Susan: “Writing for Translation”, Proceedings of the Australasian Online Documentation Conference, Brisbane, Australia, http://www.multilingualwebmaster.com/library/writing-TR.html

 

Mitamura, Teruko: “Controlled Language for Multilingual Machine Translation”, Proceedings of Machine Translation Summit VII, Singapore, 1999.

[2]

Segments are the textual units that are processed by an MT system. The ends of segments are normally signalled by terminating punctuation such as periods or certain tags; correct automatic segmentation of a text can sometimes be difficult. Segments are most often sentences, but may be phrases of any type.
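
As a rough illustration of why automatic segmentation can be difficult, here is a minimal Python sketch. It is not the segmenter of any particular MT system; the function name split_segments and the rule it applies (a segment ends at a period, exclamation mark, or question mark followed by whitespace, or at a line break) are assumptions made only for this example.

```python
import re

# Assumed rule (hypothetical, for illustration only): a segment ends at
# '.', '!' or '?' followed by whitespace, or at one or more line breaks.
# Real MT systems use richer rules and also look at markup tags.
_SEGMENT_END = re.compile(r"(?<=[.!?])\s+|\n+")


def split_segments(text):
    """Split text into segments using the naive rule above."""
    return [part.strip() for part in _SEGMENT_END.split(text) if part.strip()]


if __name__ == "__main__":
    # Two ordinary sentences are separated as expected.
    print(split_segments("Insert the diskette. The system reads the file."))
    # A heading without terminating punctuation becomes its own segment.
    print(split_segments("Receiving notices\nTo receive notices."))
    # An abbreviation ending in a period is wrongly treated as a segment end.
    print(split_segments("See fig. 3 for details."))
```

The last call shows the kind of error that makes segmentation hard: the period after the abbreviation "fig." is taken as the end of a segment, so the sentence is cut in two.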

 

[3]

 In English, passive constructions often have an important role in the flow of a text because a passive allows the speaker to vary the word order according to given and new information. Therefore, changing passive constructions to active tends to change the meaning and may not lead to the desired result.

 

[4]

FLAG stands for Flexible Language and Grammar Checker. The project was sponsored by the German Federal Ministry for Research.

 

[5]

Meines Wissens is genitive; meiner Meinung nach is dative, but it has the same form as a genitive, which makes the conflation understandable. 

 

[6]

One instance is the Resource Description Framework (RDF), a language developed by the World Wide Web Consortium, which provides metadata for webpages. “Metadata is machine understandable information for the web”. http://www.w3.org/Metadata

 

[7]

For more information on this project, see http://www.etl.go.jp/etl/nl/GDA

 

[8]

The general description can be found here: http://www.etl.go.jp/etl/nl/GDA/

The tag set is listed here:  http://www.etl.go.jp/etl/nl/GDA/tagset.html

 

[9]

The translations are from various MT systems. We have not identified them because an evaluation is not the purpose of this tutorial. Suffice it to say that some MT systems are significantly more robust with respect to some of the ambiguity issues than others and that, obviously, some translations are significantly better than others. Thus, for example, one MT system translates both versions of “Make sure (that) the power is turned off.” as “Vergewissern Sie sich, daß der Strom ausgeschaltet ist.” For our purpose, we chose translations not for the quality of their output but for the difference in translation before and after a rewrite.