MTranslatability

AMTA-2000 Tutorial

A. Bernth (arendse@us.ibm.com) and C. Gdaniec (cgdaniec@us.ibm.com)

IBM T. J. Watson Research Center

P.O. Box 218

Yorktown Heights, NY 10598

USA

This document contains the tutorial text on MTranslatability, which we presented at AMTA-2000 in Cuernavaca, Mexico.
 

Contents:

Introduction

Ways to Improve MTranslatability

Check the file characteristics 

Check the punctuation

Check the spelling

Update your personal dictionaries

Check the grammar

Reduce ambiguity

Check the style

Tools

                Spell checkers

                Grammar and style checkers

                Controlled language checkers

                Other helpful tools

Ways to measure MTranslatability

Conclusion

Special Interest Group on MTranslatability

Resources

                Papers
                Periodicals
                Conferences




HOW TO WRITE GOOD by Frank L. Visco

http://www.ou.edu/special/owp/goodies/writegood.html

My several years in the word game have learnt me several rules:

1. Avoid alliteration. Always. 

2. Prepositions are not words to end sentences with. 

3. Avoid cliches like the plague. (They're old hat.) 

4. Employ the vernacular. 

5. Eschew ampersands & abbreviations, etc. 

6. Parenthetical remarks (however relevant) are unnecessary. 

7. It is wrong to ever split an infinitive. 

8. Contractions aren't necessary. 

9. Foreign words and phrases are not apropos. 

10. One should never generalize. 

11. Eliminate quotations. As Ralph Waldo Emerson once said:

"I hate quotations. Tell me what you know." 

12. Comparisons are as bad as cliches. 

13. Don't be redundant; don't more use words than necessary; it's highly superfluous. 

14. Profanity sucks. 

15. Be more or less specific. 

16. Understatement is always best. 

17. Exaggeration is a billion times worse than understatement. 

18. One-word sentences? Eliminate. 

19. Analogies in writing are like feathers on a snake. 

20. The passive voice is to be avoided. 

21. Go around the barn at high noon to avoid colloquialisms. 

22. Even if a mixed metaphor sings, it should be derailed. 

23. Who needs rhetorical questions?

Introduction

Current MT systems are often unable to produce high-quality output on arbitrary, unseen input. The output frequently does not meet user needs and requirements.



Ÿ         Why is MT output not better? 

        Ÿ         MT systems are not good enough

                Ÿ         Statistical MT systems tend to use more simplistic language models that do not allow for several layers of abstraction.  This can result in less adequate coverage of linguistic rules and linguistic generalizations.

                Ÿ         Knowledge-based MT systems depend on large amounts of hand-coded data (lexical data and syntactic rules).  It is very time-consuming to gain enough linguistic coverage.

        Ÿ         MT input is not good enough

                Ÿ         Bad markup

                Ÿ         Incorrect punctuation

                Ÿ         Incorrect spelling

                Ÿ         Incorrect grammar

                Ÿ         Ambiguous constructions

                Ÿ         Bad style

Ÿ         What aspects can the MT user control? 

        Ÿ         MT input 

        Ÿ         Lexical coverage

Ÿ         Ways to change input in order to increase MTranslatability and thus improve the MT output.

Ÿ         Is it possible to predict the output quality for given input automatically?



Ways to Improve MTranslatability[1]

Ÿ        Check the file characteristics

Ÿ         Proofread and correct any scanned documents
Ÿ         OCR software is not 100% reliable

Ÿ         Avoid bitmaps when possible; these are usually not translated by MT systems

Ÿ         Use mark-up tags in a conceptual way; use header tags for headers, etc.

Ÿ         Do not abuse tags to accomplish a purely physical effect (e.g. a header tag just to achieve a bigger font), and avoid tags that accomplish formatting on their own (e.g. <br>). 

Ÿ         Use mark-up to accomplish the desired layout for tables etc, rather than “manual” indentation.

Ÿ         Specify the LANG attribute for HTML documents.  Mark any parts that are in a different language from that of the main document.

Ÿ         Write hypertext links and bold-faced (italicized etc) text such that they can be translated as a single entity.  This way the markup will look better for the translation.  Mark strings that should not be translated.

Ÿ         Use ISO 8859 or Unicode characters throughout. Otherwise, use entities for characters that are not part of the ASCII character set.  For instance, in the SGML/HTML source code, the entity for ü [u-umlaut] should be: &uuml;

Ÿ         Make sure that words that are used as labels or names are properly identified.

E.g. The red button vs. The “RED” button. You can use defined tags such as <q>RED</q>.
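The entity recommendation above can be sketched in a few lines of Python. This is only an illustration, not part of any MT system: the function name `to_html_entities` is made up for this example, and it relies on the standard-library table `html.entities.codepoint2name`, which covers the HTML 4 named entities.

```python
from html.entities import codepoint2name


def to_html_entities(text):
    """Replace every non-ASCII character with an HTML entity.

    A named entity (e.g. &uuml;) is used where the HTML 4 table has one;
    otherwise a numeric character reference (e.g. &#322;) is emitted.
    """
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)  # plain ASCII passes through unchanged
        elif code in codepoint2name:
            parts.append("&" + codepoint2name[code] + ";")
        else:
            parts.append("&#" + str(code) + ";")
    return "".join(parts)


print(to_html_entities("Prüfung"))  # Pr&uuml;fung
```

Note that the terminating semicolon is part of the entity; omitting it is a common source of garbled characters.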

Ÿ        Check the punctuation

Ÿ         Punctuation that indicates a new segment[2] is especially important.

Ÿ         Remember correct use of hyphens.

Do not write: If the user provided file is not found, an error message is issued.

Do write: If the user-provided file is not found, an error message is issued.

Do not write: He bit-off more than he can chew. 

Do write: He bit off more than he can chew.

Ÿ         Commas do make a difference.

Do not write: Since Jay always jogs a mile doesn't seem that far to him.

Do write: Since Jay always jogs, a mile doesn't seem that far to him.

Ÿ         Avoid using (s) to indicate plural. This construction may not translate well into other languages.

Ÿ         Avoid using “/” as in “and/or” and “user/system”.  It is ambiguous.
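As a rough illustration of how the last two punctuation checks could be automated, the sketch below flags “(s)” plurals and slashes between words with regular expressions. The pattern table and the function name `flag_punctuation` are assumptions made for this example. The slash pattern will also flag legitimate terms such as TCP/IP, so the output is a list of candidates for review, not a list of errors.

```python
import re

# Hypothetical patterns for two constructions the checklist warns about.
PATTERNS = {
    "plural (s)": re.compile(r"\w+\(s\)"),
    "slash between words": re.compile(r"[A-Za-z]+/[A-Za-z]+"),
}


def flag_punctuation(sentence):
    """Return (problem label, matched text) pairs found in the sentence."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(sentence):
            hits.append((label, match.group(0)))
    return hits


for hit in flag_punctuation("Select the file(s) and/or the user/system profile."):
    print(hit)
```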

Ÿ        Check the spelling

Ÿ         If a word is misspelled, it will -- at best -- produce a non-translation.  At worst it will mess up source analysis and produce a wrong grammatical structure.



Ÿ        Update your personal dictionaries

If a word is not in any of the dictionaries that the MT system uses, there is no way the MT system will know how to translate it.  Worse still, it will not know how to analyze the sentence that the word occurs in. It is also important to make sure that all the relevant parts of speech of the words are covered in the dictionary.

E.g. Postage meter: External mail to be postage-metered.

Ÿ         Special terminology

Ÿ         You may use certain words in a nonstandard sense, but make sure you update your dictionary.

Ÿ         Multi words

Ÿ         Many noun strings cannot be translated compositionally and have to be treated as a unit.  But beware: Not all MT systems can handle coordination of premodifiers in multi words.  E.g. Forward and backward compatible; side and back exits.
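Before submitting a document, it can help to list the words that the MT dictionaries do not know. The sketch below assumes that the dictionary is available as a plain set of lowercase word forms, which is a simplification: real MT dictionaries also record parts of speech, and this check says nothing about whether the right part of speech is covered. The function name `unknown_words` and the toy dictionary are made up for the example.

```python
import re


def unknown_words(text, dictionary):
    """Return the words in text that are missing from the dictionary, in first-seen order."""
    seen = set()
    missing = []
    # Hyphenated forms such as "postage-metered" are kept as one token.
    for word in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text):
        key = word.lower()
        if key not in dictionary and key not in seen:
            seen.add(key)
            missing.append(key)
    return missing


# Toy stand-in for the dictionaries of an MT system.
toy_dictionary = {"external", "mail", "to", "be", "postage", "meter"}
print(unknown_words("External mail to be postage-metered.", toy_dictionary))
# ['postage-metered']
```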

                                 

Ÿ        Check the grammar

Current MT systems have to rely on syntax to a large extent; therefore, ungrammatical input is bound to produce wrong output!

Ÿ         Subject-verb agreement

Do not write: File information and data type is of utmost importance.

Do write: File information and data type are of utmost importance.

 

Ÿ         Wrong modification

Do not write: Woven of combed cotton, you will love our sweater's soft feel.

Do write: Woven of combed cotton, this sweater will delight you with its soft feel.

Do write: Our sweater is woven of combed cotton, and you'll love its soft feel.

 

Ÿ        Reduce ambiguity

Adhering to the following recommendations is useful to varying degrees depending on the MT system that is being used. Some systems are more robust vis-a-vis certain structural ambiguities. 



Ÿ        Use syntactic cues (avoid use of the telegraphic style):

Ÿ         Use articles whenever possible

Do not write: Meeting requirements.

Do write: Meeting the requirements.

 

Ÿ         In coordinated phrases:

Repeat articles

Do not write: The system reads the file or result field definition.

Do write: The system reads the file or the result field definition.

Repeat any modal/auxiliary verb

Do not write: The application can use the window to establish a dialog with the user and format text responses.

Do write: The application can use the window in order to establish a dialog with the user and can format text responses.

Repeat “to” before infinitives         

Do not write: The application can use the window to establish a dialog with the user and format text responses.

Do write: The application can use the window in order to establish a dialog with the user and to format text responses.                                          

Repeat the preposition before any prepositional objects

Do not write: The coordinates that are displayed correspond to the top of the slider in the vertical slide bar, and the top edge of the slider in the horizontal slide bar.

Do write: The coordinates that are displayed correspond to the top of the slider in the vertical slide bar, and to the top edge of the slider in the horizontal slide bar.

Use “either”-“or” instead of “or” alone

Do not write: The system immediately terminates the program if a hard error or exception occurs.

Do write: The system immediately terminates the program if either a hard error or an exception occurs.

Use “both”-“and” instead of “and” alone

Do not write: The system immediately terminates the program if it detects a hard error and exception.

Do write: The system immediately terminates the program if it detects both a hard error and an exception.

 

Ÿ       Avoid long noun phrases, if possible

Do not write: The uninterruptible power supply message queue system value allows you to specify where you want your messages sent when the power to the system is interrupted.




Ÿ        Do not omit relative pronouns; write “that” (“which”, “who” etc) explicitly

Do not write: The cotton shirts are made from comes from Arizona.

Do write: The cotton that shirts are made from comes from Arizona.

Do not write: In experiment 6 we were interested in the reading subjects spontaneously achieve for such a headline.

Do write: In experiment 6 we were interested in the reading that subjects spontaneously achieve for such a headline.

 

Do not write: After a process creates a resource, any process it starts inherits the resource identifiers.

Do write: After a process creates a resource, any process that it starts inherits the resource identifiers.


Ÿ        Expand postnominal modifiers into full relative clauses

Do not write: The amount of adjacent space available in storage does not restrict the size of a library, or of any other object.

Do write: The amount of adjacent space that is available in storage does not restrict the size of a library, or of any other object.

 

Do not write: Programs currently running in the system are indicated by icons in the lower part of the screen.

Do write: Programs that are currently running in the system are indicated by icons in the lower part of the screen. 

Do write: Icons in the lower part of the screen indicate programs that are currently running in the system.

 

Do not write: The horse raced past the barn fell.

Do write: The horse that was raced past the barn fell.

Ÿ        Always write the complementizer “that” explicitly

Do not write: Make sure the power is turned off.

Do write: Make sure that the power is turned off.
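A check for the missing complementizer can be approximated with a pattern, as in the sketch below. It only covers the two trigger phrases “make sure” and “be sure”, which is an assumption made to keep the example small; a real checker would need a lexicon of verbs that take a “that” clause and a parser to confirm that a clause follows. The function name `insert_that` is hypothetical.

```python
import re

# "make sure" / "be sure" not already followed by "that" or "to".
MISSING_THAT = re.compile(
    r"\b(make sure|be sure)(?!\s+(?:that|to)\b)\s+",
    re.IGNORECASE,
)


def insert_that(sentence):
    """Insert an explicit 'that' after the trigger phrase when it is missing."""
    return MISSING_THAT.sub(lambda m: m.group(1) + " that ", sentence)


print(insert_that("Make sure the power is turned off."))
# Make sure that the power is turned off.
```

Because the pattern cannot see syntax, the result is best treated as a suggestion for the writer to confirm.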


Ÿ        Always write “in order to” before an infinitive in a purpose clause instead of just “to”

Do not write: Use this function to copy project data to a new or existing project.

Do write: Use this function in order to copy project data to a new or existing project.


Ÿ        Avoid -ing-forms

Ÿ         Rewrite -ing verbs that post-modify a noun as a relative clause or add a suitable preposition, depending on what you mean

Do not write: You can develop an application using the TCP/IP sockets.

Do write: You can develop an application that uses the TCP/IP sockets.

Do write: You can develop an application by using the TCP/IP sockets.

Ÿ         Rewrite -ing verbs pre-modifying a noun to include an article

Do not write: DATAMAX continues processing statements after repairing the data set.

Do write: DATAMAX continues the processing statements after it repairs the data set.

If that is what you meant...

Ÿ         Rewrite -ing verbs that are complements of other verbs

Do not write: The motor starts using a gas-powered pull start or pushbutton ignition via a rechargeable battery. 

Do write: You use a gas-powered pull start or pushbutton ignition via a rechargeable battery in order to start the motor.

Ÿ         Rewrite -ing verbs that can take an infinitive complement as “to” + infinitive

Do not write: Receiving notices.

Do write: To receive notices.

Ÿ         Make sure that the implicit subject of an -ing verb in a subordinate clause that starts with a subordinating conjunction (“after”, “when”, “while” etc.) is the same as the subject of the superordinate clause

Do not write: After inserting the diskette, the system will read the file.

Do write: After you insert the diskette, the system will read the file.

Ÿ         Beware.  Kohl (1999) claims that it is not necessary to worry about the following cases:

a.       -ing verbs that are preceded by a preposition.  A slight variation of his example is “For more information about printing files, see Chapter 3.”  However, in the context of MT, this is ambiguous between the reading where “files” is the object of “print”, and the reading where “printing” pre-modifies “files”.

b.      -ing verbs that are the subject of a clause.  His example is “Specifying the system password gives you full administrative access.”  He goes on to say: “When it’s the first word of a simple sentence, an -ING can only be a gerund.”  This is not generally true. The reason this example is not ambiguous is that there is a determiner (“the”) between the -ing verb and the following noun.

Humans often disambiguate by applying real-world knowledge, but even then there may be problems as evidenced by the notorious example “Visiting relatives can be a nuisance.”

Or how about this real, but truly ambiguous sentence: “At XYZ Inc. we don't waste any time improving service for our customers!”

Ÿ        Minimize use of pronouns

Ÿ         In many languages the pronoun has to agree in number and gender with its antecedent.  Most MT systems do not support pronoun resolution, which is a rather difficult task.



The police refused the anarchists a permit because they feared violence. 

The police refused the anarchists a permit because they advocated violence. 

La police a refusé un permis aux anarchistes parce qu'elle craint des actes de violence. 

La police a refusé un permis aux anarchistes parce qu'ils prônent la violence.

This example shows that an MT system would have to be extremely smart to “know” the reference of the pronoun “they”.  At present, there are no NLP programs that can reliably identify the reference of pronouns.  Therefore, strictly controlled languages ban the use of 3rd person pronouns altogether. Unless you are willing to adhere to the rules of a controlled language (CL), there is not much you can do about the pronoun reference issue if you want to write fluent text.  One way of avoiding pronouns is to repeat the noun phrase in a reduced form. If you think this is acceptable, write “The spool file space on the disk should not get too large, and you should reduce the space to conform to specifications” instead of “The spool file space on the disk should not get too large, and you should reduce it to conform to specifications”.
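Resolving pronouns automatically is hard, but finding them is not. The following sketch lists the 3rd person pronouns in a sentence so that the writer can decide whether to repeat the noun phrase instead. The word list and the function name `find_pronouns` are assumptions made for this example; the sketch makes no attempt to identify antecedents.

```python
import re

# Hypothetical word list; extend as needed.
THIRD_PERSON = {
    "he", "she", "it", "they", "him", "her", "them",
    "his", "hers", "its", "their", "theirs",
}


def find_pronouns(sentence):
    """Return the 3rd person pronouns in the sentence, in order of occurrence."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return [w for w in words if w.lower() in THIRD_PERSON]


print(find_pronouns("The police refused the anarchists a permit because they feared violence."))
# ['they']
```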

Ÿ        Use one-word verbs instead of verb+particle whenever possible

English verb particles represent a challenge to MT systems because of the ambiguity of particles and prepositions.  If there is a choice between two synonymous verbs, one with a particle and one without, do choose the latter.  E.g. She ran up a bill vs. She ran up a hill.

Do not write: She ran up a bill.

Do write: She accumulated a bill.

Ÿ        Check the style

Ÿ        Avoid overly long sentences and very short sentences

Do not write: Transfer file.

Do write: Transfer the file.

Do write: The transfer file.

 

Do not write: At all levels of security, the system-supplied defaults in the user profile can be changed and authority can be specifically given or taken away from the users.

Do write: At all levels of security, the system-supplied defaults in the user profile can be changed. Authority can be specifically given to the users or taken away from the users.
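A sentence-length check is easy to sketch. The thresholds below (fewer than 3 words, more than 25 words) are assumptions chosen for illustration, not limits taken from this tutorial, and the segmentation is deliberately naive: it splits after a period, exclamation mark, or question mark followed by white space, so abbreviations will cause false splits.

```python
import re


def length_warnings(text, min_words=3, max_words=25):
    """Return (sentence, word_count, verdict) for sentences outside the limits."""
    warnings = []
    # Naive segmentation: split after '.', '!' or '?' followed by white space.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        count = len(sentence.split())
        if count < min_words:
            warnings.append((sentence, count, "too short"))
        elif count > max_words:
            warnings.append((sentence, count, "too long"))
    return warnings


print(length_warnings("Transfer file. Transfer the file."))
# [('Transfer file.', 2, 'too short')]
```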

Ÿ        Avoid metaphors, idioms, slang, dialect, irony

Ÿ         Do not write: He got my goat.

Ÿ         Do write: He annoyed me.

Ÿ        Avoid overly complex constructions

Do not write: Communication between programs, between jobs, between users, between users and programs and between users and the system occurs through messages.

Do write: Communication occurs through messages.  This is true for communication between programs, between jobs, between users, as well as for communication between users and programs, and between users and the system.


Ÿ        Avoid ellipsis

Do not write: Is she suing the hospital? -- She is the doctor.

Do write: Is she suing the hospital? -- She is suing the doctor.

Ÿ        Avoid passive constructions, if possible[3]

Do not write: The size of a library, or of any other object, is not restricted by the amount of adjacent space available in storage. 

Do write: The amount of adjacent space that is available in storage does not restrict the size of a library, or of any other object.

Ÿ       Make sure each segment can stand alone, e.g. do not let individual list elements be part of the sentence leading in to the list 

Do not write: 

After you have set up your workstation, you can:

a.       Log on to the network 

b.      Work locally 

Do write:  After you have set up your workstation, you can log on to the network or work locally.

Do write: 

After you have set up your workstation, you can do the following:

a.       You can log on to the network 

b.      You can work locally 

Ÿ        Avoid footnotes in the middle of a sentence, and make footnotes independent segments

Ÿ        What makes life easier for the human reader is not always useful in the context of MT:

Ÿ         Exact repetitions make it more fruitful to use translation memory

Ÿ         Short words are often more abstract and polysemous, and hence prone to bad translation
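The point about exact repetitions can be illustrated with a toy translation memory. The class below is a minimal sketch, not a description of any real TM product: it stores segments in a dictionary keyed by the white-space-normalized source text, so only exact repetitions are found. The class name, the normalization rule, and the French target segment are assumptions made for the example.

```python
def normalize(segment):
    """Collapse runs of white space so that layout differences do not block a match."""
    return " ".join(segment.split())


class TranslationMemory:
    """Toy exact-match translation memory."""

    def __init__(self):
        self._store = {}

    def add(self, source, target):
        self._store[normalize(source)] = target

    def lookup(self, source):
        """Return the stored translation, or None if the segment is new."""
        return self._store.get(normalize(source))


tm = TranslationMemory()
tm.add("Transfer the file.", "Transférez le fichier.")
print(tm.lookup("Transfer  the file."))   # exact repetition: found
print(tm.lookup("Transfer the files."))   # small variation: None
```

Even a one-letter variation misses the stored segment, which is why consistent wording pays off when a translation memory is in use.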





Tools 

Ÿ        Spell Checkers

Ÿ         The objective of spell checkers is to point out misspelled words and, where possible, suggest the correct spelling.

Ÿ         Most spell checkers work with a dictionary.  If a word is not found in the dictionary (including user-defined dictionaries), it is flagged as a misspelling, and alternatives are suggested.

Ÿ         Spell checkers generally do not catch misspellings that happen to be valid words but are incorrect in context.

Do not write: There very happy.

Do write: They’re very happy.
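The behavior described above can be reproduced with a very small dictionary-based checker. This is a sketch under simplifying assumptions: the dictionary is a plain set of lowercase word forms, and suggestions come from `difflib.get_close_matches` in the Python standard library. The function name `spell_check` and the toy word list are made up for the example.

```python
import difflib
import re


def spell_check(text, dictionary):
    """Return {unknown word: list of suggested alternatives}."""
    candidates = sorted(dictionary)  # fixed order keeps the suggestions stable
    report = {}
    for word in re.findall(r"[A-Za-z']+", text):
        key = word.lower()
        if key not in dictionary:
            report[key] = difflib.get_close_matches(key, candidates, n=3)
    return report


words = {"there", "they're", "very", "happy"}
print(spell_check("They're very hapy.", words))  # {'hapy': ['happy']}
print(spell_check("There very happy.", words))   # {} : the wrong word is a valid word
```

The second call returns nothing because “there” is in the dictionary; catching that error requires looking at the context.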
 

THE SPELLING CHEQUER 

(or Poet Tree Without Mist Aches)

I have a spelling chequer 

It came with my pea sea 

It plainly marques four my revue 

Miss steaks eye cannot sea 

When eye strike a quay, right a word 

I weight four it two say 

Weather eye am wrong or wright 

It shows me strait away 

As soon as a mist ache is maid 

It nose bee fore two late 

And eye can put the error rite 

Its rarely, rarely grate 

I've run this poem threw it 

I'm shore your pleased to no 

It's letter perfect in it's weigh 

My chequer tolled me sew. 

                                --Sauce unknown

 

 

        WHY SPELL CHECK DOES NOT WORK--A LINGUISTIC ODYSSEY 

            Thanks to M. Zarnosky: bruin@vt.edu Thu Feb 23 08:08:05 1995 



        From IEEE Transactions on Aerospace and Electronic Systems, Vol. 26, No. 2,

        March, 1990 -- p. 209, author name n.a. -- 

Catching Misspilled Words with Spilling Checker

As an extra addled service, I am going to put this column in the Spilling Checker, where I tryst it will sale through with flying colons. In this modern ear, it is simply inexplicable to ask readers to expose themselves to misspelled swords when they have bitter things to do. And with all the other timesaving features on my new work processor, it is in realty very easy to pit together a colon like this one and get it tight. For instants, if there is a work that is wrong, I just put the curse on it, press Delete and its    Well sometimes it deletes to the end of the lion or worst yet the whole rage.  Four bigger problems, there is the Cat and Paste option.  If there is some test that is somewhere were you wish it where somewhere else you jest put the curse at both ends and wash it dissapear.  Where you want it to reappear simply bring four quarts of water to a rotting boil and throw in 112 pounds of dazed chicken.  Sometimes it brings in the Cat that was Pasted yesterday.  But usually it comes out as you planned, or better.  And if it doesn't, there are lots of other easy to lose options... 

Grammar and Style Checkers

Ÿ         The objective of grammar checkers is to point out ungrammatical constructions.

Ÿ         Grammatical input to MT stands a better chance of getting a good translation; however, it is not sufficient to guarantee a correct translation.

Ÿ         Grammar checking is a very difficult process because the program basically has to try to make (grammatical) sense of (grammatical) nonsense.  Consequently, the precision of grammar checkers is notoriously low.

Ÿ         Grammar checkers show a tendency to lump together different kinds of problems.  Some of these problems are more relevant for MTranslatability than others; consequently, some checks fall into more than one usefulness category, depending on which aspect you are looking at.




Ÿ        Microsoft Word 2000 checks for the following problems:

Ÿ        Useful for MTranslatability

Ÿ         Capitalization of first word in a sentence

Ÿ         Hyphenated and compound words

Ÿ         Words in split infinitives ( > 1)

Ÿ         Passive sentences

Ÿ         Commonly confused words (its/it’s, their/there/they’re)

Ÿ         Punctuation

Ÿ         Relative clauses (who, which, that)

Ÿ         Sentence structure (e.g. bad participial modification:  Having run the marathon, it was time to rest.)

Ÿ         Subject-verb agreement

Ÿ         Successive nouns ( > 3)

Ÿ         Successive prepositional phrases ( > 3)

Ÿ         Verb and noun phrases 

Ÿ         Cliches (these tend to be idiomatic)

Ÿ         Colloquialisms

Ÿ         Jargon

Ÿ         Unclear phrasing (various cases of ambiguous scope)

Ÿ         Double negation

Ÿ         Sentence length ( > 60 words) (this maximum is very high, but it’s better than nothing)

Ÿ         Wordiness (to the extent it reduces sentence length)

Ÿ         Verb contractions (‘s, which is ambiguous between is, has, and possessive; ‘d, which is ambiguous between had and would)

Ÿ         Possessives and plurals (houses vs. house’s)

Ÿ         Misused words (includes various grammatical mistakes for adjectives and adverbs; wrong case)

Ÿ        Not useful for MTranslatability

Ÿ         Gender-specific words

Ÿ         Sentences beginning with And, But, Hopefully, and Plus

Ÿ         Use of first person

Ÿ         Numbers (use of digits instead of spelled-out numbers)
 

Ÿ        Slightly harmful for MTranslatability

Ÿ         Verb contractions (‘m, n’t, ‘re, ‘ll, ‘ve; these help parsing)

Ÿ         Sentence structure (e.g. repetition of conjunctions: She ate a hot dog and a coke and an ice cream cone.)

Ÿ         Wordiness (to the extent it prevents disambiguation)

Ÿ        CorrecText Grammar Correction System (Word Pro 97)

Ÿ        Useful for MTranslatability

Ÿ         Verb agreement with there/here

Ÿ         Capitalization errors

Ÿ         Compounding errors (missing or superfluous hyphen.)

Ÿ         Doubled words (the the)

Ÿ         Open vs closed spelling (spelling errors that result from incorrect use of spaces; never the less instead of nevertheless.)

Ÿ         Clause errors (punctuation; incomplete sentences)

Ÿ         Double negations

Ÿ         Formatting errors 

Ÿ         format of numbers (placement of periods and commas; endings of ordinal numbers; spelling of fractions and other numbers)

Ÿ         dates (use of cardinal and ordinal numbers)

Ÿ         times (use of abbreviations and punctuation marks) 

Ÿ         currency and other symbols

Ÿ         addresses

Ÿ         Inappropriate prepositions (adhere to instead of adhere by; center on instead of center around.)

Ÿ         Mass/count noun agreement with adjectives (less vs fewer)

Ÿ         Misused words (confused words: sit vs. set)

Ÿ         Nonstandard modification (adjectives instead of adverbs; hyphenation).

Ÿ         Noun phrase consistency errors (errors of number agreement between determiners and nouns). 

Ÿ         Pronoun errors (errors in case and ordering; which instead of that in restrictive clauses.)

Ÿ         Punctuation errors

Ÿ         Subject-verb agreement errors

Ÿ         Non-standard English (seeing as how instead of since)

Ÿ         Verb group consistency errors (errors in the use of the present, the past, and the past participle, as well as errors in the choice of auxiliary verbs.)

Ÿ         Word order errors (incorrect ordering of certain words that modify nouns; my both instead of both my).

Ÿ         Commonly confused words (commonly confused words of different parts of speech that have similar though not identical pronunciations; advice vs advise.) and homonyms.

Ÿ         Clichés

Ÿ         Verb contractions (‘s, which is ambiguous between is, has, and possessive; ‘d, which is ambiguous between had and would)

Ÿ         Informal expressions

Ÿ         Jargon

Ÿ         Passive voice usage

Ÿ         Overused phrases (blissful ignorance instead of ignorance), stock phrases (fillers like in fact), and wordy expressions (vague or wordy expressions; in all probability instead of probably).

Ÿ         Redundant expressions (sufficient enough instead of sufficient or enough).

Ÿ         Weak modifiers (overused or colloquial modifiers; funny, pretty well, or nice).

Ÿ         Many consecutive prepositional phrases (limit is user-definable)

Ÿ         Many consecutive nouns (limit is user-definable)

Ÿ         Split infinitives (limit is user-definable)

Ÿ         Misspelled foreign expressions

Ÿ         Nonstandard terms

Ÿ         Archaic expressions

Ÿ         ‘A’ vs ‘An’

Ÿ        Not useful for MTranslatability

Ÿ         Gender-specific expressions

Ÿ         Sexist expressions

Ÿ         Vague, wordy, or informal quantifiers

Ÿ         Unnecessary prepositions.

This check seems incorrect, judging from the help text, which is as follows:

These rules flag expressions that include an unnecessary preposition and suggest deleting it to make the expression more concise. Example: in the sentence 'I sat down on the lawn,' the preposition 'down' is superfluous since it is implied by the word 'sat.' 

In our view, the sentence without the particle has a different meaning.
 

Ÿ        Slightly harmful for MTranslatability

Ÿ         Clause errors (repetition of conjunctions: We chopped up fruit, and we diced the potatoes, and we made a pie crust)

Ÿ         Verb contractions (‘m, n’t, ‘re, ‘ll, ‘ve)

Ÿ         Pretentious words (unnecessarily complex words; eventuate instead of take place).

Ÿ         Identical sentence openers




Ÿ        Grammatik (Corel WordPerfect, version 7)

Ÿ        Useful for MTranslatability

Ÿ         Abbreviation

Ÿ         Confused adjective or adverb 

Ÿ         Archaic

Ÿ         ‘A’ vs ’An’

Ÿ         Capitalization

Ÿ         Cliche (idiomatic)

Ÿ         Colloquial (idiomatic)

Ÿ         Commonly confused words and similar words (from vs form)

Ÿ         Wrong comparative or superlative

Ÿ         Conditional Clause (incorrect verb forms)

Ÿ         Conjunctions (neither-nor; between X and Y; parallelism)

Ÿ         Consecutive elements (number of nouns or prepositions in a row; user-definable)

Ÿ         Date and time format

Ÿ         Double negation

Ÿ         Doubled word or negation

Ÿ         End-of-sentence preposition

Ÿ         End-of-sentence punctuation

Ÿ         Foreign expressions

Ÿ         Formalisms 

Ÿ         Dangling modifiers (subjectless -ing-verb) 

Ÿ         disinterested vs. uninterested

Ÿ         Wrong use of hopefully (the value of this is questionable)

Ÿ         Latin singulars and plurals (singular of strata is stratum)

Ÿ         who vs. whom

Ÿ         Hyphenation

Ÿ         Idiomatic usage

Ÿ         Incomplete sentence, including stand-alone subordinate clauses

Ÿ         Other incorrect verb forms, including infinitive used incorrectly instead of -ing-verb and tense shifts

Ÿ         Jargon

Ÿ         Long sentence 

Ÿ         Mid-sentence adverb (position before auxiliary verb) 

Ÿ         Noun phrases (missing article before singular, countable noun; number disagreement; scrambled word order)

Ÿ         Object of verb (missing or superfluous objects; number disagreement with complement of linking verb; missing preposition for prepositional complement)

Ÿ         Overstated

Ÿ         Passive voice

Ÿ         Pronoun errors (errors in case and number agreement; which vs who)

Ÿ         Punctuation (missing commas; comma splice; apostrophe; colon; semicolon; question mark; quotation marks, unbalanced (), {}, [], “”)

Ÿ         Questionable usage

Ÿ         Redundancy

Ÿ         Spelling

Ÿ         Split infinitive

Ÿ         Subject-verb agreement

Ÿ         Trademarks (xerox vs photocopy)

Ÿ         Wordy

Ÿ        Not useful for MTranslatability

Ÿ         Conjunctions (plus vs also as sentence starter)

Ÿ         Formalisms (beginning a sentence with a conjunction)

Ÿ         Gender-specific

Ÿ         Number style

Ÿ         Offensive

Ÿ         One-sentence paragraphs

          

Ÿ        Harmful for MTranslatability

Ÿ         Sentence variety

Ÿ         Run-on sentence (many ands instead of separation by commas)

Ÿ         Second-person address (you vs one). “One” is at least as ambiguous as “you”.

Ÿ         Ellipsis spaces (between the dots).  Better not to use ellipsis at all.



Ÿ        MULTILINT

MULTILINT is a research and development project sponsored by the German Ministry of Economy. Project partners are the Institute for Applied Information Sciences in Saarbrücken and BMW AG. The tools apply to automotive repair manuals.



MULTILINT’s German grammar checker looks for:

Ÿ        Useful for MTranslatability

Ÿ         wrong punctuation

Ÿ         wrong case 

Ÿ         incorrect word separation 

Ÿ         lack of subject-predicate agreement, etc.

 

The rule set for the grammar checker covers 55 grammatical error classes.  According to a corpus study of German automotive technical documents, the overwhelming majority of grammatical errors in technical documentation consists of punctuation errors, followed by errors of capitalization, separating or combining words, agreement, and other syntactic errors.

The style checker should result in higher clarity and readability of a processed document. It gives the following recommendations:
 

Ÿ        Useful for MTranslatability

Ÿ         Sentence is too long, contains too many information units

(Es dient bei evtl. Reklamationen mit dem numerierten Arbeitsauftrag als Nachweis der im einzelnen durchgeführten Arbeiten und schützt den ausführenden Betrieb vor unberechtigen Werkstatt-, Gewährleistungs- oder sonstigen Regreßansprüchen.) 

Ÿ         Avoid complex attributes (Darüberhinaus wird ein externer, kabelloser, über eine Infrarotverbindung am DIS angeschlossener Drucker angeboten.)

Ÿ         No more than 14 words before the verb (Die beiden vom rechten Radhauskanal kommenden Kraftstoff-Stahlleitungen an den Schlauchanschlüssen zum Kraftstoff-Filter bzw. zur fahrzeugbodenseitigen Rücklaufleitung abziehen.)

Ÿ         Avoid ambiguous structures (Anlageflächen von Schaumresten reinigen.) 

Ÿ         Rephrase groups of prepositional phrases (Undichtheit am Kraftstoff-Entlüftungswellrohr von rechter Tankkammer zu Tankeinfüllstutzen infolge Knickbeschädigung anläßlich der Tankmontage.)

Ÿ         The subject should come before the verb in the main clause (Das Gras frißt die Kuh.)

Ÿ         Separate main clauses (Kaltstartprobleme, DDD-Kontrollampe leuchtet, Motor läuft im Notprogramm.)

Ÿ         Do not insert too many elements between the parts of the verb

(Dieser stellt sich beim Beschleunigen aus ca. 1500 U/min. insbesondere im zweiten Gang unter hoher Last bzw. Vollast als inhomogenes Beschleunigungsverhalten dar.) 

Ÿ         Use a conditional conjunction for conditional clauses (Wird Korrosion festgestellt, sind die betroffenenen Bauteile auszutauschen.)

Ÿ         Write complete sentences (Wärmetauscher undicht?)

 

References: Schmidt-Wigger 1998; Reuther 1998.

http://www.iai.uni-sb.de/en/multien.html

Contact person: Ursula Reuther ursel@iai.uni-sb.de
 

Ÿ        FLAG4 German grammar checker

Based on an annotated corpus of German e-mail messages, researchers found that out of 14,492 sentences, 6,473 contained at least one error.  83% of the errors were purely orthographic; grammar errors made up 16%.  This finding motivated them to develop a “phenomenon-based approach to grammar checking” which scans a document for the occurrence of error candidates. One example that they mention is the ungrammatical construction Meines Wissens nach, which is a conflation of the formulaic expressions meines Wissens and meiner Meinung nach.5

The researchers plan to develop rules for some 200 grammatical errors.

Once this grammar checker is finished, it should be useful for translatability checking.  It is expressly restricted to certain grammatical errors; removing these is necessary but not sufficient for improved translatability.

References: Becker et al. 1999; Bredenkamp et al. 2000.

http://www.dfki.de
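The idea of scanning a document for error candidates can be illustrated with a minimal sketch. It is not the FLAG implementation: the pattern table, the function name find_error_candidates, and the sample text are assumptions made for illustration, and only the Meines Wissens nach pattern comes from the example above.

```python
import re

# Hypothetical table of error candidates; only this one pattern is taken
# from the example in the text, the structure itself is an assumption.
ERROR_CANDIDATES = {
    "conflation of 'meines Wissens' and 'meiner Meinung nach'": re.compile(
        r"\bmeines\s+Wissens\s+nach\b", re.IGNORECASE
    ),
}

def find_error_candidates(text):
    """Return (label, matched text, character offset) for each candidate found."""
    hits = []
    for label, pattern in ERROR_CANDIDATES.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0), match.start()))
    return hits

sample = "Meines Wissens nach ist der Filter defekt. Meines Wissens ist er neu."
hits = find_error_candidates(sample)
for label, found, offset in hits:
    print(f"{offset}: '{found}' ({label})")
```

A pattern match only marks a candidate; a real checker would still have to confirm the error in context before reporting it.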

Grammar and Style Checkers: Conclusion

Grammar and style checkers are of limited usefulness in preparing a document for MT.  Most of the problems that they check for are very relevant for MTranslatability because they are directly related to spelling mistakes and ungrammatical constructions (as you would expect from a grammar checker).  However, a few of the recommendations run directly counter to MTranslatability (for example, advice against some verb contractions, which may actually help the parser, and against repetition).  As long as the user is aware of these particular pitfalls, the checkers are useful tools, but they are not sufficient for reducing ambiguity.  Ambiguity appears not to be addressed at all, which is a serious drawback.



 

Ÿ        Controlled Language (CL) Checkers

Ÿ         A CL is a form of language with special restrictions on grammar, style, and vocabulary usage

Ÿ         The objective of a CL is to improve consistency, readability, translatability, and retrievability



Ÿ        KANT Controlled English

Ÿ         KANT Controlled English from Carnegie Mellon University was designed with MT in mind.  This controlled language aims at balancing the control of the vocabulary with the control of the grammar.  In this way, the writer is not forced to write very convoluted sentences in order to stay within the controlled vocabulary. 

Vocabulary constraints include the following:

Ÿ         Limit the meaning per word/part-of-speech to a single meaning.

Ÿ         Encode synonyms in the lexicon in order to flag deviations from the single, approved term.

Ÿ         State all ambiguous terms separately in the lexicon in order to support interactive disambiguation.

Ÿ         The use of determiners is encouraged, whereas the use of pronouns and conjunctions is limited.

Ÿ         The sense and use of modal verbs is clearly specified.

Ÿ         The use of -ing verbs and -ed verbs is restricted.

Ÿ         Abbreviations

Ÿ         Orthography

Phrase-level constraints include the following:

Ÿ         Avoid verbs with particles; use single-word verbs instead

Ÿ         Do not coordinate verb phrases

Ÿ         Repeat the preposition in coordinated prepositional phrases

Sentence-level constraints include the following:

Ÿ         Parallelism in coordination

Ÿ         Write relative pronouns explicitly

Ÿ         Avoid ellipsis

 

Ÿ         All these checks enhance MTranslatability, which is not surprising since they were designed for the express purpose of improving MTranslatability.

 

Ÿ         The KANT technology is part of the ClearCheck checker used by Caterpillar for their controlled language system.

 

References: Mitamura and Nyberg 1995; Nyberg and Mitamura 1996; Mitamura 1999; Hayes et al. 1996.

http://www.lti.cs.cmu.edu/Research/Kant/

Contact person: Teruko@cs.cmu.edu
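The KANT vocabulary constraint of encoding synonyms in the lexicon, so that deviations from the single approved term can be flagged, might look roughly like the following sketch. The lexicon entries and the function name flag_unapproved_terms are invented for illustration; the sketch also ignores part of speech, which the real constraint takes into account.

```python
import re

# Hypothetical lexicon: each non-approved synonym maps to the approved term.
SYNONYMS = {
    "automobile": "vehicle",
    "car": "vehicle",
    "fix": "repair",
}

def flag_unapproved_terms(sentence):
    """Return (word as written, approved term) for each deviating word."""
    flags = []
    for word in re.findall(r"[A-Za-z]+", sentence):
        approved = SYNONYMS.get(word.lower())
        if approved is not None:
            flags.append((word, approved))
    return flags

flags = flag_unapproved_terms("Fix the car before you start the engine.")
for word, approved in flags:
    print(f"Use '{approved}' instead of '{word}'")
```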




Ÿ        MAXit Checker 

Ÿ         The MAXit AECMA Simplified English checker offers the following checks:

Ÿ        Useful for MTranslatability

Ÿ         Abbreviation

Ÿ         Adjective that does not modify a noun

Ÿ         Adverb that does not modify a verb

Ÿ         Subject-verb agreement and subject-pronoun agreement

Ÿ         Contraction or possessive

Ÿ         Awkward sentence

Ÿ         Capitalization

Ÿ         Change verb to noun

Ÿ         Change noun to verb

Ÿ         Missing, superfluous or misplaced comma

Ÿ         Superfluous word

Ÿ         Gerund

Ÿ         Missing or superfluous hyphen

Ÿ         Missing subject or object

Ÿ         Negation

Ÿ         Word not in Simplified English dictionary

Ÿ         Parallelism

Ÿ         Passive voice

Ÿ         Verb with particle

Ÿ         Non-allowed prefix or suffix

Ÿ         Wrong position of preposition

Ÿ         Wrong punctuation

Ÿ         Rephrasing required

Ÿ         Long sentence (> 21 words)

Ÿ         Spelling error

Ÿ         Missing article

Ÿ         Wrong use of terminology

Ÿ         “That” vs “which” vs “who”

Ÿ         Translation problem

Ÿ         Complex verb tense

Ÿ         Wrong word

Ÿ         Noun cluster (> 2 nouns in a row)

Ÿ         Wrong verb

Ÿ         Date format

Ÿ        Not useful for MTranslatability

Ÿ         Wrong word for Simplified English

Ÿ         Vague measurement

Ÿ         Label

Ÿ         Number style

Ÿ         Safety warnings required

Ÿ         Gender-specific pronoun

 

AECMA Simplified English was designed to make the text unambiguous and also easier to read for non-native speakers of English.  It was not designed to enhance MT.  Therefore, it is not surprising that there are some AECMA-specific checks that do not improve MTranslatability.

 

References:  http://www.smartny.com/top_maxit.htm
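Of the checks listed for MAXit, the long-sentence check is the easiest to illustrate. The sketch below is not MAXit code: the function name long_sentences and the naive sentence splitter are assumptions, and only the threshold of 21 words is taken from the list above.

```python
import re

MAX_WORDS = 21  # threshold quoted in the check list above

def long_sentences(text, max_words=MAX_WORDS):
    """Return (word count, sentence) for each sentence longer than max_words."""
    # Naive split after ., ! or ? followed by whitespace; abbreviations
    # such as "e.g." would be split incorrectly.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    flagged = []
    for sentence in sentences:
        count = len(sentence.split())
        if count > max_words:
            flagged.append((count, sentence))
    return flagged

sample = (
    "Remove the filter. "
    "Before you remove the filter housing from the pump, make sure that the "
    "fuel supply valve is closed and that the engine has cooled down completely."
)
flagged = long_sentences(sample)
for count, sentence in flagged:
    print(f"{count} words: {sentence}")
```

Counting whitespace-separated tokens is only one possible definition of a word; a checker that tokenizes hyphenated compounds or numbers differently would flag a slightly different set of sentences.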

 

Boeing technology

Ÿ         The Boeing Simplified English Checker is the most complete and accurate checker of Simplified English requirements.  In addition to checking for SE compliance, the Boeing SE Checker also catches mistakes like lack of subject-verb agreement, repeated words, misspelled words, and punctuation problems.



Ÿ         The Boeing Technical English Checker is a modified version of the Boeing SE Checker that supports more general technical writing. 

Ÿ         The Boeing Plain English Checker checks for compliance with the U.S. Government’s Plain Language requirements. (http://www.plainlanguage.gov)

 

References: Wojcik and Hoard; Wojcik and Holmback 1996; Wojcik et al. 1998.

http://www.boeing.com/assocproducts/sechecker/se.html




Ÿ        EasyEnglishAnalyzer (EEA)

Ÿ         IBM’s EEA tool is an authoring tool that points out ambiguity and complexity, thereby helping writers produce documents that are more MTranslatable.  EEA also does some standard grammar checking.  EEA is used by information developers in IBM.  Some checks that are not directly aimed at improving MTranslatability are included in order to accommodate corporate writing guidelines.

 

Ÿ        Useful for MTranslatability

Ÿ         Ambiguous nonfinite verb phrase

Ÿ         Ambiguous conjunction 

Ÿ         Ambiguous scope in coordination 

Ÿ         Passive voice and ambiguous double passives

Ÿ         Long sentence

Ÿ         Long noun string

Ÿ         Nonparsed sentence 

Ÿ         Unknown or misspelled words

Ÿ         Punctuation (missing commas, hyphens, periods, question marks; comma splice; slash to mean "and/or"; plural with (s))

Ÿ         Wrong comparative or superlative form 

Ÿ         Lack of subject-verb agreement 

Ÿ         Nonparallel coordinated phrase

Ÿ         Double negative

Ÿ         Noun phrase with many prepositions

Ÿ         Potentially wrong subject for verb phrase

Ÿ         Potentially wrong modification

Ÿ         Pronoun problems: Pronoun case and lack of agreement for reflexives

Ÿ         Dangling preposition

Ÿ         Noncapitalization of first word in a sentence 

Ÿ         Duplicated word

Ÿ         Verb contractions (‘s, which is ambiguous between is, has, and possessive; ‘d, which is ambiguous between had and would)

Ÿ         Missing "that"

Ÿ         Word not in controlled vocabulary 

Ÿ         Incomplete sentence

Ÿ        Not useful for MTranslatability

Ÿ         Latin abbreviation 

Ÿ         First occurrence of abbreviation

Ÿ         Wrong indefinite article "a" or "an"

Ÿ         Verb contractions (‘m, n’t, ‘re, ‘ll, ‘ve)

Ÿ         Restricted word; prohibited word

 

 

Ÿ         EEA’s Clarity Index summarizes the problems that are encountered in a given document as a single number that indicates the clarity (or MTranslatability) for the whole document.  The problems are weighted according to severity (impact), context, and document size.

 

Ÿ         EEA also includes ETerms, which collects multinouns and unknown words.  These are candidates for terminology to be added to the user lexicons.

References: Bernth 1997; Bernth 1998a; Bernth 1999.

Other Helpful Tools

A very different way to prepare a document for better MTranslatability is to annotate (or tag) it. This method is used for various purposes, such as markup for formatting purposes or for enriching the semantic and knowledge content of documents. It is also used for easier accessing and processing of information on the World Wide Web.6  Two workshops were held following the recent COLING conference in August of 2000 -- one on syntactic annotation and one on semantic annotation.  Both workshops included presentations and discussions on tools and techniques for linguistic annotation (http://www.coling.org/workshops.html).

 

Ÿ        Global Document Annotation (GDA)7

“The GDA initiative aims at having Internet authors annotate their electronic documents with a common standard tag set which allows machines to automatically recognize the semantic and pragmatic structures of the documents.”8  The GDA tags “are designed to aid machines understand documents”; not only for the purpose of translation.  The notorious sentence Time flies like an arrow could be annotated as follows:



<su> 

<np sem=time0> time</np>

<v sem=fly1>flies</v>

<adp>like<np>an arrow</np></adp> 

</su> 

 

where “XML elements such as <np>...</np> encode parse tree bracketing, and the property sem disambiguates polysemy of words.”  The word senses here (time0 and fly1) are based on WordNet senses. The plan is that a growing population of GDA users will develop their own ontologies for all languages.  The way such an XML tagger improves MTranslatability -- assuming all MT engines are modified to recognize the tags -- is obvious:  Some of the hardest problems for the MT parser will be solved. Ambiguities on both the syntactic and the semantic levels will be resolved, and proper nouns will be identified.

“The difficulties in MT (machine translation) are mostly due to various types of ambiguity, concerning polysemy of words, phrase/clause attachment, coordination, anaphoric reference, scope of logical/modal operators, and so on. Unknown words and phrases are another major source of difficulty.  Translation accuracy is expected to drastically improve if the input documents are marked up with appropriate tags which resolve such ambiguities or supply missing information.  Some GDA tagsets will be designed for this purpose.  An MT system which exploits such tags to generate very accurate translations could be developed very soon if you already have a translation dictionary. The GDA sense tag dictionary and your translation dictionary could be automatically aligned for the most part.”  (http://www.etl.go.jp/etl/nl/GDA/translation.html)

The inventory of GDA tags is very comprehensive. In addition to syntactic and semantic word disambiguation, it includes tags for scoping, tense and aspect, indicators of levels of politeness, and types of utterances. Consequently, it is enormous. Without an efficient and user-friendly interface, using the tags seems a daunting task. But doubtlessly, if the tags are used and MT engines can interpret them, the translation output will improve dramatically. An interactive editor for GDA has been developed.

References: Hasida 2000;  http://www.etl.go.jp/~hasida/talk/gda/IC-e/20000806saic.html
 

Ÿ        Linguistic Annotation Language (LAL)

Linguistic Annotation Language (or LAL) is an XML-compliant tag set for assisting natural language processing programs.  It consists of linguistic information tags such as tags that specify word/phrasal boundaries and dependencies, and task-dependent instruction tags such as tags that define the scope of translation for machine translation.

 

Linguistic information tags include both syntactic and semantic tags.  The syntactic tags include tags that identify sentence boundaries; tags that denote word information (including attributes such as base form, semantic type, unique word ID, part of speech, dependencies, and language-specific features such as number, gender, and tense); and tags that denote phrase and clause boundaries.  Besides boundaries, dependencies can also be expressed by using the id and modifier attributes of the word tag.  The user-definable semantic tags include tags indicating proper names (of e.g. persons, places, organizations, and countries), acronyms (and other abbreviations), dates, times, numbers, and monetary units.

Task-dependent instruction tags include a tag that indicates whether a piece of text should be translated or not and a tag that indicates whether a piece of text should be considered for summarization purposes. 

LAL tags are usually expressed by using XML namespaces.  Their XML namespace prefix is lal. 

LAL is recognized by two types of programs:  NLP systems for generating and using the LAL annotation, and an annotation editor as explained below.

 

NLP integration: (1) The English Slot Grammar (McCord 1980; McCord 1990) parser generates and accepts LAL annotation.

(2) A post-processing routine converts the output of the Japanese KNP parser (Kurohashi and Nagao 1994) into LAL format. 

The annotations produced in (1) and (2) are used as input to the annotation editors for English and Japanese. 

(3) IBM's English to German, French, Spanish, Italian, and Japanese translation engines can utilize LAL-annotated input. This means that ambiguities can be resolved by using the annotation editor to pre-edit the source text before translation into several languages. 

 

The annotation editor allows the user to edit the LAL annotation of a text. This editor is interfaced to the LAL-generating grammar, which provides annotation for each segment.  A human editor can then use the annotation editor's graphical user interface to check over the automatically-produced annotation and change it as necessary.  The user can do this without having to see the tags by working on the graphical representation of the tree; the changes are then reflected in the internal LAL annotation. 

LAL annotation is distinguished from previous tag-defining efforts by providing a comprehensive, yet simple list of annotation tags.  Keeping things simple is crucial for user acceptance.

 

Examples of LAL-annotation: 

 

She saw a man with a telescope. 

She <lal:w id="w1" lex="see" pos="v" sense="see1">saw</lal:w> a man <lal:w mod="w1">with</lal:w> a telescope.

 

This example shows that the seeing action is done with the telescope because "with" modifies the entity having id "w1", i.e. "see".

 

IBM 

<lal:acronym expan="International Business Machines">IBM</lal:acronym>

 

In this example "IBM" is marked as an acronym with expansion "International Business Machines". 

 

The cat chased a mouse. After it caught the mouse, it ate it. 

<lal:s>The <lal:w id="w1">cat</lal:w> chased <lal:w id="w2">a mouse</lal:w>.</lal:s> <lal:s>After <lal:w ref="w1">it</lal:w> caught the <lal:w ref="w2">mouse</lal:w>, <lal:w ref="w1">it</lal:w> ate <lal:w ref="w2">it</lal:w>.</lal:s>

           

Here the id attribute is used in the context of pronoun resolution. "cat" is assigned the id "w1", and "mouse" has the id "w2". The human editor can then set the ref value of each pronoun appropriately.
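How a program might read such references back out of the annotation can be sketched with Python's standard XML parser. This is only an illustration, not part of the LAL tools: the wrapping doc element, the namespace URI, and the function name resolve_references are assumptions; the text above states only that the namespace prefix is lal.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption); only the prefix "lal" is given in the text.
LAL_NS = "urn:example:lal"

# The <doc> wrapper is hypothetical; it exists to declare the namespace.
annotated = (
    f'<doc xmlns:lal="{LAL_NS}">'
    '<lal:s>The <lal:w id="w1">cat</lal:w> chased '
    '<lal:w id="w2">a mouse</lal:w>.</lal:s>'
    '<lal:s>After <lal:w ref="w1">it</lal:w> caught the '
    '<lal:w ref="w2">mouse</lal:w>, <lal:w ref="w1">it</lal:w> ate '
    '<lal:w ref="w2">it</lal:w>.</lal:s>'
    '</doc>'
)

def resolve_references(xml_text):
    """Return (surface form, antecedent text) for every word with a ref attribute."""
    root = ET.fromstring(xml_text)
    words = list(root.iter(f"{{{LAL_NS}}}w"))
    # Map each id to the text of the element that defines it.
    antecedents = {w.get("id"): w.text for w in words if w.get("id")}
    return [(w.text, antecedents.get(w.get("ref"))) for w in words if w.get("ref")]

pairs = resolve_references(annotated)
for surface, antecedent in pairs:
    print(f"{surface} -> {antecedent}")
```

An MT engine that consumes the annotation could use such pairs to choose the correct gender for each translated pronoun.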

 

References:  Watanabe et al. 2000;   http://www.trl.ibm.co.jp/projects/langtran/lal_e.htm

Developed by IBM Research Division   

Contact person: Hideo Watanabe, hiwat@jp.ibm.com



Ÿ        OTELO Text Handling Format, OTEXT 

OTEXT is a subsystem of the OTELO project, which is a collaborative effort between the European Union and a consortium of industrial partners whose aim is to design and develop a comprehensive automated translator's environment. See http://www.otelo.lu/broctxt.htm. The project partners have developed a standard set of tags that help exchange documents across different MT and translation memory systems. There are a couple of MT-specific tags that mark strings that should not be translated by the MT system. The <pr> (protect) tag protects strings that are not part of the text flow.  These are typically parameter settings, internal control information, etc.  The <l> (literal) tag protects strings that are part of the text flow, such as a piece of code or an address.  In contrast to these two tags, the <sp/> (special character) tag specifies characters which have special meanings and that need to be preserved by the MT system.  Examples are soft returns, hard blanks, etc.  Finally, the <tu> (text unit) tag is used to indicate segmentation.

Reference:  G. Thurmair 1998.
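One way an MT front end could honor the <pr> and <l> tags is to mask the protected strings before translation and restore them afterwards. The sketch below is an assumption about how this might be done, not a description of the OTELO software: the placeholder format and the function names mask_protected and unmask are invented, and nested tags are not handled.

```python
import re

# Matches <pr>...</pr> and <l>...</l> elements (no nesting assumed).
PROTECTED = re.compile(r"<(pr|l)>(.*?)</\1>", re.DOTALL)

def mask_protected(segment):
    """Replace protected strings with numbered placeholders.

    Returns the masked text and the list of saved strings.
    """
    saved = []

    def _mask(match):
        saved.append(match.group(2))
        return f"__P{len(saved) - 1}__"

    return PROTECTED.sub(_mask, segment), saved

def unmask(translated, saved):
    """Put the protected strings back into the translated text."""
    for index, original in enumerate(saved):
        translated = translated.replace(f"__P{index}__", original)
    return translated

masked, saved = mask_protected("Type <l>dir /w</l> and press Enter.")
# A stand-in for the MT output of the masked segment:
german = "Geben Sie __P0__ ein und drücken Sie die Eingabetaste."
restored = unmask(german, saved)
print(masked)
print(restored)
```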



Ways to Measure MTranslatability

Ÿ        Automatic readability scoring

Ÿ         Is often provided with standard grammar checkers (Microsoft Word 2000, Lotus Word Pro 97, WordPerfect)

Ÿ         Is designed for human readability, not MTranslatability

Ÿ         Based on sentence length and word length          

Ÿ         Shorter words and shorter segments are considered easier to read.

Ÿ         But shorter words are often more ambiguous.

Ÿ         And very short segments (4 words or fewer) are very ambiguous in English due to the great ambiguity of part of speech in English.

 

We built a short test corpus of problematic sentences and edited them according to the recommendations in the section Ways to Improve MTranslatability. We found that the corpus showed improved clarity and translatability after pre-editing, but at the same time it achieved a reduced readability score. One would assume -- and many writers claim -- that readability and translatability are almost synonymous, or at least that one is a prerequisite of the other. It turns out that this is not the case, at least not with the automated readability scores built into the common word processors.

Carol Shehadeh and Judith Strother (1994) report on a survey they undertook on “The Use of Computerized Readability Formulas: Bane or Blessing?”  They criticize existing readability scales for not taking into account such factors as organization, clarity, syntax and structure.
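The mismatch between readability scores and MTranslatability can be made concrete with a small sketch. It uses the Flesch Reading Ease formula, a widely used score based on sentence length and word length; whether the word processors named above compute exactly this score is an assumption. The syllable counter is a crude heuristic, and the two sample texts are invented.

```python
import re

def count_syllables(word):
    """Crude estimate: number of groups of consecutive vowels, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease; higher scores are taken to mean 'easier' text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# Elliptical original versus a pre-edited version without ellipsis or pronouns.
original = "Check the oil. If low, add some."
pre_edited = "Check the oil level. If the oil level is low, add oil to the engine."

print(round(flesch_reading_ease(original), 1))
print(round(flesch_reading_ease(pre_edited), 1))
```

In this pair the pre-edited version, which spells out what the elliptical original leaves implicit, receives the lower score, because the formula rewards only short sentences and short words.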

Ÿ        Automatic detection of lexical inadequacies

Most MT systems have a utility that enables the user to detect words or phrases that are not listed in the dictionary. But a word or phrase may be found in the dictionary, and thus not appear on the list of “unfound” or “unknown” words, even though the dictionary does not cover it with the part of speech used in the document. Another possible shortcoming is that the word is in the dictionary, but not with the semantic sense and transfer required for the document to be translated. For instance, the word pig is in the lexicon as referring to an animal, with the German transfer Schwein. The document to be translated, however, deals with the domain of oil production, where a pig refers to a technical device and should be translated as Molch.  Or, the word OK will not be listed as “unfound” because the dictionary contains the adjective. The document, however, uses the word as a verb, which is not covered in the dictionary. Because of such deficiencies of a simple, context-free dictionary look-up, some MT systems come with more context-sensitive listings where one can query the coverage for a particular domain or subject area, or where one can generate a list of all content words with their anticipated translation in context. Checking such a list is time-consuming but rewarding if one finds uncovered entries or transfers.
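The difference between a simple "unfound" list and a more context-sensitive coverage check can be sketched as follows. The lexicon layout, the subject-area labels, and the function name coverage_gap are hypothetical, as is the German transfer given for the adjective OK; only the pig and OK cases themselves come from the text above.

```python
# Hypothetical MT lexicon keyed by (word, part of speech); each entry maps
# a subject area to a German transfer.
LEXICON = {
    ("pig", "noun"): {"general": "Schwein"},
    ("OK", "adjective"): {"general": "in Ordnung"},  # transfer is illustrative
}

def coverage_gap(word, pos, domain):
    """Return None if the use is covered, otherwise a description of the gap."""
    parts_of_speech = {key_pos for (key_word, key_pos) in LEXICON if key_word == word}
    if not parts_of_speech:
        return "unfound"
    if pos not in parts_of_speech:
        return "part of speech not covered"
    if domain not in LEXICON[(word, pos)]:
        return "no transfer for this subject area"
    return None

for use in [("pig", "noun", "oil production"),
            ("OK", "verb", "general"),
            ("valve", "noun", "general")]:
    print(use, "->", coverage_gap(*use))
```

A context-free look-up would report only the third case; the first two would pass unnoticed and surface later as wrong translations.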



Ÿ        Automatic MTranslatability scoring

 

The Logos Translatability Index (TI)

In the early 1990s, researchers at Logos Corporation developed a utility prototype that automatically measures and scores the suitability of English and German documents for the Logos MT system.

 

Ÿ        Gross statistical properties of the document as a whole

This Translatability Index (TI) is based on gross statistical properties of a document rather than on parsing the sentences. This was suggested by the fact that there appeared to be a rough correlation between the quality of raw MT output and certain gross properties of the text, such as length of the sentences, degree of syntactic complexity, discourse characteristics, etc.  Although the TI score is derived on the basis of gross sentence properties, sentence-specific information cannot be provided with any degree of reliability because there’s no full-scale parsing. 

 

Ÿ        Scoring procedure

The program starts off with a score of 7 and then penalizes the sentences for negative properties. The decision as to the minimum score that a document must reach in order to be acceptable for gisting or post-editing purposes is subjective. There is no absolute, objective threshold.

 

Ÿ        Statistical data and results

“Negative” sentence properties are:  too long or too short; words not found in the MT dictionary; short parentheses; coordination; homographs; interrogatives; unmatched parentheses; embedded clauses; part-of-speech ambiguities; certain ambiguous words (such as -ing verbs, as, with, etc.); and so forth.

 

Ÿ        Operational use and benefits

Before translation, the user can have the document scored by the TI program.  It will return with a score and a recommendation such as This document is not suitable for MT or This document is conditionally suitable for MT. The TI would also suggest why a particular document is not or only conditionally suitable.  It would tell the user, for instance, 

The sentences on the whole are too long 

Sentence # x is far too long 

The document contains many words and compounds that are not in the dictionary. Run your document through the New-Word-Search utility and update your dictionary 

The document contains many difficult words such as ... 

 

The user can make changes in the document in order to decrease complexity and ambiguity and update new words and phrases. Thus, the TI can provide users with a measure that not only correlates with the quality of the MT output but can also help them modify their source document in such a way as to improve the MT output quality.

 

Reference: Gdaniec 1994.
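The scoring procedure described above (start at 7, then penalize negative properties) can be illustrated with a deliberately simplified sketch. Everything except the starting score is an assumption: the four properties chosen, the length limits, the weights, and the function name translatability_index are invented and do not reproduce the Logos utility.

```python
import re

START_SCORE = 7.0  # starting score given in the text

def translatability_index(sentences, known_words, min_words=4, max_words=25):
    """Start from START_SCORE and subtract penalties for negative properties."""
    if not sentences:
        return START_SCORE
    too_long = too_short = unknown = unbalanced = 0
    for sentence in sentences:
        words = re.findall(r"[A-Za-z]+", sentence.lower())
        too_long += len(words) > max_words
        too_short += len(words) < min_words
        unknown += any(w not in known_words for w in words)
        unbalanced += sentence.count("(") != sentence.count(")")
    # Each penalty is the share of affected sentences times an invented weight.
    penalty = (2.0 * too_long + 1.0 * too_short
               + 2.0 * unknown + 1.0 * unbalanced) / len(sentences)
    return max(0.0, START_SCORE - penalty)

known = {"remove", "the", "old", "filter", "from", "pump", "replace", "see", "note"}
clean = translatability_index(["Remove the old filter from the pump."], known)
noisy = translatability_index(["Replace grommet (see note."], known)
print(clean, noisy)
```

Because the sketch works on gross properties only, it can say that a document as a whole looks unsuitable, but, as noted above, it cannot reliably say what is wrong with a particular sentence.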
 

Translation Confidence Index (TCI)

Ÿ         IBM’s Translation Confidence Index automatically provides an index of the MT system’s own confidence in its translation, for a given segment.  In other words, the TCI returns a translation quality value for each segment.  This value can be used to mark segments that need special attention during post-editing. The confidence value is calculated during the various stages of the MTranslation process. It is based on such factors as parse scores, text characteristics (ambiguity, difficult constructions), lexical coverage, and success of structural generation (transformations).  These factors can be set on or off in the TCI’s language-pair-specific user profile.  Whereas the TCI was designed to give an overall picture of the expected quality of the MT output by taking all aspects of the MTranslation process into account, the parts that deal with source analysis give a picture of the general MTranslatability.  Turning all non-source language-specific factors off in the user profile in effect gives an MTranslatability score, independent of the target language. With all aspects taken into consideration, the TCI score will give the translatability for a particular language pair for a specific MT system.

 

References: Bernth 1999; Bernth and McCord 2000.
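The effect of switching factors on and off in a user profile can be sketched as follows. This is not the TCI algorithm: the factor names, the 0-to-1 scale, the simple averaging, and the decision to treat only the first two factors as source-specific are all assumptions made for illustration.

```python
# Hypothetical factor names; treating these two as the source-only factors
# is an assumption.
SOURCE_FACTORS = {"parse_score", "text_characteristics"}

def confidence(factor_scores, profile):
    """Average the scores (0..1) of the factors that the profile switches on."""
    active = [name for name, on in profile.items() if on and name in factor_scores]
    if not active:
        raise ValueError("the profile switches every factor off")
    return sum(factor_scores[name] for name in active) / len(active)

# Invented scores for one segment.
segment = {
    "parse_score": 0.9,
    "text_characteristics": 0.7,
    "lexical_coverage": 0.4,
    "structural_generation": 0.6,
}

full_profile = {name: True for name in segment}
source_only = {name: name in SOURCE_FACTORS for name in segment}

print(round(confidence(segment, full_profile), 2))  # language-pair-specific view
print(round(confidence(segment, source_only), 2))   # MTranslatability view
```

With the invented numbers, the segment looks better from the source side alone than for the full language pair, which is the kind of difference the two profile settings are meant to expose.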



Conclusion

Ÿ         Be careful when you create your documents:

Ÿ         Avoid ambiguity

Ÿ         Avoid bad style

Ÿ         Avoid incorrect grammar

Ÿ         Avoid incorrect spelling

Ÿ         Avoid incorrect punctuation

Ÿ         Avoid bad markup

 

Ÿ         If you expect your documents to be translated by an MT system, make sure that the MT dictionary is updated to cover adequate parts of speech and subject area senses for all your terminology.

 

Ÿ         And remember: What makes life easier for the human reader is not always useful in the context of MT!