No qualities are listed at present for this taxon.
According to White (2000) a feasibility study is an evaluation of the possibility that a particular approach has any potential for success after further research and implementation. Feasibility evaluations provide results of interest to researchers and to sponsors of research. The characteristics that a feasibility evaluation typically tests for are functionality attributes such as the coverage of sub-problems particular to a specific language pair and the possibility of extending to more general phenomena (changeability).
Coverage (2.2.1.1.2.1/504)
Accuracy (2.2.1.2/177)
Requirements discovery is often an iterative process in which developers create prototypes in order to elicit reactions from potential stakeholders. In so-called "rapid prototyping" approaches to requirements discovery, developers create prototypes designed to demonstrate specific aspects of functional capabilities that might ultimately be implemented. Scenario-based observational studies are often used to assess the utility of the functions demonstrated by the prototype.
Characteristics of the intended mode of use (2.1.4/160)
Utility (2.1.1.1.5/176)
Usability (2.2.3/603)
According to White 2000, internal evaluation occurs on continual or periodic bases in the course of research and development. Internal evaluations test whether, for example, the components of an experimental prototype or pre-release system work as they are intended.
This type of evaluation mainly concerns functionality and needs to show coverage of the fundamental contrastive phenomena of the language pair, just like feasibility evaluation. However, at this point in a system's life cycle, it must also be shown that the system is actually improving as a result of development (changeability), and that improvement in one area does not make something else worse (stability). (In terms of EAGLES 1996, this is a progress evaluation.)
Translation process models (2.1.1/402)
Coverage (2.2.1.1.2.1/504)
Readability (2.2.1.1.1.1/172)
Terminology (2.2.1.2.3/175)
Accuracy (2.2.1.2/177)
Well-formedness (2.2.1.3/186)
Translation process models (2.1.1/402)
Coverage (2.2.1.1.2.1/504)
Readability (2.2.1.1.1.1/172)
Terminology (2.2.1.2.3/175)
Accuracy (2.2.1.2/177)
The property "glass/black box" does not spawn children, but is rather a property which distinguishes certain methods under more than one taxon.
According to White 2000, the purpose of declarative evaluation is to measure the ability of an MT system to handle texts representative of an actual end-user. It purports to measure the actual performance of a system external to the particulars of the feasibility of the approach or of the development process.
As with feasibility and internal evaluation, we look at coverage of linguistic phenomena and handling of samples of real text. However, declarative evaluations generally do not use constrained test patterns, and they are used to determine not the extensibility of the system but how good it is right now. Declarative evaluations generally test for the functionality attributes of intelligibility (how fluent or understandable the output appears to be) and fidelity (the accuracy and completeness of the information conveyed).
Translation process models (2.1.1/402)
Linguistic resources and utilities (2.1.2/403)
Suitability (2.2.1.1/168)
Accuracy (2.2.1.2/177)
Well-formedness (2.2.1.3/186)
According to White 2000, operational evaluations generally address the question of whether an MT system will actually serve its purpose in the context of its operational use. The primary factors include the cost-benefit of bringing the system into the overall process (costs).
Linguistic resources and utilities (2.1.2/403)
Interoperability (2.2.1.4/192)
Reliability (2.2.2/600)
Maintainability (2.2.5/620)
Portability (2.2.6/622)
Cost (2.2.7/624)
A variety of issues are considered here, including such things as software and hardware compatibility with the incumbent office automation system (interoperability). However, the more fundamental question to ask for operational use is whether the MT system enhances the effectiveness of the downstream task, or whether the end-to-end process is better off without it.
As an example, consider cross-lingual information retrieval. Evaluation of MT embedded into a cross-lingual information processing environment takes into account the measures that are germane to the downstream task. So if we want to know whether an MT system helps information extraction, we compare the recall and precision (metrics germane to extraction) of the MT-plus-extraction configuration to an expert-translation-plus-extraction process, or to extraction without any translation at all. Note that we do not measure functionality characteristics of the MT system itself, such as fidelity and intelligibility, but rather the effect of the MT (good or bad) on the downstream task in terms of that task's metrics. To a large extent, then, operational evaluation lies outside the bounds of this classification, which is concerned only with the classification and evaluation of MT systems.
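The comparison described above can be sketched as follows. This is only an illustration: the gold-standard facts, the three configurations and the function name `precision_recall` are hypothetical, and a real study would use the extraction task's own scoring procedure.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of extracted items against a gold set."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold-standard facts and the facts extracted under each configuration.
gold = {"train:IC502", "depart:08:15", "from:Geneva", "to:Zurich"}
configurations = {
    "MT + extraction": {"train:IC502", "depart:08:15", "from:Geneva", "to:Basel"},
    "expert translation + extraction": {"train:IC502", "depart:08:15", "from:Geneva", "to:Zurich"},
    "extraction without translation": {"train:IC502"},
}

# The MT system is judged only through the downstream task's own metrics.
scores = {name: precision_recall(items, gold) for name, items in configurations.items()}
for name, (precision, recall) in scores.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

Nothing in this sketch looks at the translations themselves; only the extraction results differ between configurations, which is the point of the operational view.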
Usability (2.2.3/603)
Efficiency (2.2.4/606)
It is important to determine what context of use is to be taken into consideration in the evaluation. Whilst it is impossible to give a detailed breakdown of all possible contexts of use, an example may help to clarify what is meant here.
Example. Imagine an MT system as one component in a system whose overall purpose is to retrieve and present to the user information on railway timetables, accepting voice input and producing voice output. The boundaries could be taken as:
(a) actual speakers receiving actual information;
(b) the MT system receiving artificially constructed transcribed input on one side and the spoken output on the other;
(c) the MT system receiving artificially transcribed input and producing as output the input to the speech synthesizer;
(d) the MT system receiving artificially transcribed input and producing a query to the timetable query system;
(e) the MT system receiving information from the timetable query system and generating from it the input to the speech synthesizer.
It can readily be seen that many other possibilities exist, especially if we widen the boundary even further to include different types of end users and different places where the system may be installed. This notion is closely related to the notion of set-up discussed in Sparck Jones.
As was noted by J.C.Sager for Machine Translation systems, "two types of use [are] to be considered: (a) the un-edited output; (b) the edited output. The output may be acceptable for either use or both and the evaluation should determine this. In the case of edited output the cost of revision, editing etc. has to be established and compared with the cost of manual translation. Since the type of use is related to the type of text, these types have to be established and taken into account."
In Toward Finely Differentiated Evaluation Metrics for Machine Translation, Hovy suggests dividing all the possible translation tasks into three main groups. He noted that "in order to make the taxonomization of features useful to people who do not already know much about MT and do not wish to become experts in evaluation, it is important to articulate its layers and choices in terms they can intuitively understand." This part of the present evaluation taxonomy describes three principal types of use in such a way that users can identify the particular type of work they want to have done, while developers can define in strict terms what their MT system can do.
The required translation quality is generally not high, though translation speed and wide coverage are important.
Input-to-output translation speed (2.2.4.1.2/611)
System external characteristics in general (2.2/166)
Fidelity (2.2.1.2.1/179) - important
Style (2.2.1.1.2.2/173) - not a very important factor
Well-formedness (2.2.1.3/186) - not a very important factor
From a general point of view, relevant qualities for this task are: (2.2.1/601) Functionality, (2.2.2/600) Reliability, (2.2.3/603) Usability, (2.2.4/606) Efficiency.
The most important detailed features for this type of work are:
Terminology precision (2.2.1.2.3/175) - how precisely does the system translate subject-matter terminology; can it differentiate between words that translate differently in slightly different domains.
Extensibility: changeability or ease of upgrading (2.2.5.2/213), in particular dictionary update (2.2.5.2.3/-) and improveability (2.2.5.2.2/215) - can the user dynamically add new words or phrases to the lexicon; how much effort is involved.
Adaptability (2.2.6.1/221) - can the system be tuned to recognize some types of material (based on domain or genre) and handle them differently (say, more quickly).
Fidelity (2.2.1.2.1/179) - is the translated output an accurate reflection of the input; are there even small distortions of meaning.
For summarization, coherence (2.2.1.1.1.3/182) and cohesion (2.2.1.1.1.4/503) provide useful cues.
Adaptability or customizability (2.2.6.1/221) - this customizability differs slightly from the customizability for document routing/sorting type of work: can the system be tuned to recognize some types of material (based on domain, content, or genre) and translate them with more care.
Functionality (2.2.1/601)
Usability (2.2.3/603)
Efficiency (2.2.4/606)
End users with information needs
Professional searchers assisting end users with their search
Persons writing documents that they wish to make easily found
The translation quality required is generally high, but translation speed is usually not a factor. Often, the vocabulary and genre are somewhat limited because of the nature of the organization.
Well-formedness (2.2.1.3/186) - highly important
Accuracy (2.2.1.2/177) - highly important
Style (2.2.1.1.2.2/173) - important
Input-to-output translation speed (2.2.4.1.2/611) - not important
Adaptability - how easily can the system's output be changed in response to requests from the recipients (addition of new words, use of different phrases and expressions, etc.)
Stylistic quality (2.2.1.1.2.2/173) - how closely does the style of the translation match the style of the source text, how difficult is it to have the system produce translations in the organization's characteristic style.
Reliability (2.2.2/600) - how reliable is the system; if it breaks, how easily can the repair be tested, how does it behave on encountering unexpected or erroneous input, how does it behave when it crashes, how often does it happen, how difficult is it to restart.
Cross-document consistency (2.2.1.2.2/500) - how easily can stylistic and lexical policies be implemented for particular sets of documents.
Intelligibility / comprehensibility (2.2.1.1.1.2/180) - how intelligible is the output under different conditions (e.g., are the sentence fragments translated while being entered).
Dialogue - does the system support rudimentary pronominal reference and other multi-turn phenomena
Changeability or extensibility (2.2.5.2/213) - can the user dynamically add new words and phrases to the lexicon, and with how much effort.
Non-textual pragmatic content - how well does the system handle emotive words and other marks, such as smiley faces, unusual spacing, etc.
Reliability (2.2.2/600) - how reliable is the system; if it breaks, how easily can the repair be tested, how does it behave on encountering unexpected or erroneous input, how does it behave when it crashes, how often does it happen, how difficult is it to restart.
Time behavior (2.2.4.1/206) - how fast is the translation process; are interactive conversations possible.
Translators
Post-editors
Translation consumers
Translation managers
Higher level management
Functionality.
Usability.
Maintainability:changeability.
Portability: adaptability.
Portability: installability.
Translators
Post editors
This characteristic does not include the user's level of familiarity with the particular system being evaluated.
The user's level of proficiency will in part relate to evaluation of usability.
Standardised tests of computer literacy are being developed, for example the European "computer driving licence".
Functionality.
Developers
Purchasers
Translation managers
Higher level management in the translation provider
This may refer either to intermediate consumers or end consumers; for example, a publication service may be a consumer of translation which is then subsequently transmitted to the end consumers who are the readers of what the publication service produces.
This characteristic influences acceptable levels of values for functionality measures.
Functionality
Developers
Purchasers
Translation managers
Higher level management in the translation provider
The level of proficiency may be measured, for example by local education tests, internationally recognised examination schemes or organisation-internal testing.
However, in the case of the translation consumer it is often not feasible to have more than an informal notion of the degree of his proficiency in the source language.
The degree to which this characteristic is pertinent in a specific evaluation will depend on what the translation consumer will do with the translation delivered to him. For example, whether he will in some way repair or polish it.
Functionality
Developers
Purchasers
Translation managers
Higher level management in the translation provider
The level of proficiency may be measured, for example by local education tests, internationally recognised examination schemes or organisation-internal testing. However, in the case of the translation consumer it is probably unrealistic to imagine having more than an informal idea of target language proficiency.
In a specific evaluation, the level of proficiency in the target language of a potential or intended translation consumer will strongly influence what are considered as acceptable measures of functionality.
Functionality.
Reliability.
Usability
Efficiency
Maintainability
Portability
Developers
Purchasers
Translation managers
Higher level management in the organisation
Reliability.
Efficiency.
Maintainability
High level management within the organisation
Translation managers
Developers
Purchasers
Vendors
The volume of translation work may be measured in many ways, including pages per day, week, month or year. It can also be described by how much time is used in the translation work. Usually this is measured in person-hours. Naturally the amount of work correlates with the quality of the translation, which may vary from text to text.
In addition, the amount of work typically varies with the target and source languages (EAGLES-96).
A factor which has recently become relevant in estimating the quantity of translation in some contexts is the existence or non-existence of pertinent translation memories.
Management
Efficiency.
Reliability.
Functionality.
Management.
Translation managers
Genre refers to the characteristic or definitive form and style peculiar to a type of document.
Examples of genre are: newspaper articles; scientific and technical articles; recipes and instructions; correspondence; business/commercial reports; marketing texts and advertisements; legal texts; literature: novels, poetry, etc.; and many others.
ISO 9126 distinguishes between internal characteristics which pertain to the internal static properties of the software and external characteristics which are the characteristics which can be observed when the system is in operation. There is some connection here with the notions of glass box and black box evaluation.
Characteristics of measurements in general
A measurement is the use of a metric to assign a measure (a value, which may be a number or category) from a scale to a quality/attribute of an entity (ISO/IEC 9126-1:2001(E)). Whatever the measure applied, all measures share certain characteristics. An adequate description of a measure should include the following properties:
1. Definitional characteristics
1.a Textual definition and description of the metric
1.b Input to measurement process (e.g., sentence, text, source+target text, system, etc.)
1.c Measure, i.e., output of measurement process (e.g., number on a scale, symbolic value, Y/N decision, etc.)
2. Dependencies
2.a Domain/genre dependence (lists of subject domains and/or genres to which the measure applies or doesn't)
2.b Task dependence (suitability; tasks for which the measure is or is not appropriate)
2.c Language dependence (source or target languages for which the measure holds or does not)
3. Metric Sensitivity (for numeric measurements only)
3.a Accuracy (accuracy of measurement: confidence interval, error bars, correlation with human judgments, etc.)
3.b Variance (variance of measurement across tests; inter-evaluator agreement)
4. Coverage
4.a Completeness of the metric (scope; proportion/degree/percentage of the quality that the measurement is designed to measure)
4.b Completeness of the measurement (scope; proportion/degree/percentage of the quality that the measurement has measured)
5. Costs
5.a Cost to prepare test materials (taking into account its reusability later)
5.b Cost to perform measurement (if necessary, expressed per repeatable unit)
6. Resources/knowledge required
6.a What people/equipment/data/information is required to perform the measurement?
6.b What knowledge/skill is required of the evaluators?
6.c How much time is required to perform the measurement?
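One way to make the checklist above operational is to record each measure's description in a fixed structure and report which properties are still undocumented. The following is a minimal sketch under that assumption; the class name `MeasureDescription`, its field names and the example values are hypothetical and are not part of the taxonomy.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class MeasureDescription:
    """One slot per property in the checklist; None means 'not yet documented'."""
    definition: Optional[str] = None                # 1.a
    measurement_input: Optional[str] = None         # 1.b
    measure_output: Optional[str] = None            # 1.c
    domain_dependence: Optional[str] = None         # 2.a
    task_dependence: Optional[str] = None           # 2.b
    language_dependence: Optional[str] = None       # 2.c
    accuracy: Optional[str] = None                  # 3.a (numeric measurements only)
    variance: Optional[str] = None                  # 3.b (numeric measurements only)
    metric_completeness: Optional[str] = None       # 4.a
    measurement_completeness: Optional[str] = None  # 4.b
    preparation_cost: Optional[str] = None          # 5.a
    measurement_cost: Optional[str] = None          # 5.b
    required_resources: Optional[str] = None        # 6.a
    required_skills: Optional[str] = None           # 6.b
    required_time: Optional[str] = None             # 6.c

    def missing(self) -> List[str]:
        """Names of the properties that have not been filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Hypothetical, partially documented measure.
description = MeasureDescription(
    definition="Number of test suite cases covered",
    measurement_input="system plus test suite",
    measure_output="non-negative integer",
)
print(description.missing())
```

Free-text fields are used here because the checklist asks for descriptions rather than values; a stricter schema would be a further design decision.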
Note also that there are multiple methods for measuring some of the qualities in this system. Some are more invasive, such as code review, grammar inspection, etc. Others are less so, as with the use of test suites. The level of testing granularity is applicable here, particularly when determining if the testing is glass-box (looking into the system structure / code) or black-box (seeing external behavior only).
Metric: The developer should provide a description of the theory and method of translation used by the system. -- Method: Provision of supporting documentation such as white papers. -- Scale: Percentage of conformance.
There is a variety of each type of system but especially so with respect to rule-based systems.
Generally, it is assumed that a theoretically sound system is easier to use, manage, update, etc. than one that is not theoretically based.
As important as current coverage is a system's capacity for update and improvement.
Many of the techniques used in verification of particular models tend to be glass-box, that is, isolating an element of the system, or potentially examining source code and data files.
Metric: The developer should provide a description of the theory and method of translation used by the system. -- Method: The developer should provide white-papers and supporting documentation. -- Measurement: Confirmation of method by study of documentation.
Metric: If the system uses a grammar, the form and number of the rules. -- Method: Specification of the form of the grammar rules and counting of the grammar rules. -- Measurement: Conformance to standard grammar specification; number of rules.
Metric: If the system uses a grammar, its coverage. -- Method: Two methods are available: running the system against a test suite, and examining the rules. -- Measurement: Number of grammatical functions covered by the system.
Metric: If the system uses a grammar, its relaxation capacity. -- Method: Two methods are available: running the system against a test suite, and examining the rules. -- Measurement: Number of grammatical relaxations permitted.
Metric: If the system uses a grammar, the ease of adding or changing rules. -- Method: Design and add/change grammatical rule to the system. -- Measurement: Yes or no: Can rules be added or changed? -- Measurement: See ease of update in section 2.2.5.2.4
For further metrics, see 2.1.1.2 Models below.
Metric: The developer should provide a description of the theory and method of translation used by the system. -- Method: The developer should provide white-papers and supporting documentation. -- Measurement: Confirmation of method by study of documentation.
Metric: Minimum size of the training corpus -- Method: Specification by developer of minimum training corpus size -- Measurement: Yes or no: Size specification reported
Metric: Accessibility of training corpus or techniques -- Method: Specification by the developer of interface / tools for training corpus -- Measurement: Yes or no: Training corpus is accessible
Metric: Specification for training corpus preparation -- Method: Provision by developer of training corpus preparation tools / documentation -- Measurement: Yes or no: Training corpus preparation tools / documentation is available
It is sometimes assumed that statistical MT systems constitute a new type of translation model. In fact, they implement one of the above-mentioned models in a different way, by learning the lexicons, transfer rules, etc. statistically from large collections of data. There is no new 'statistical MT program'. IBM's CANDIDE system (Della Pietra, et al.) and the EGYPT system (Knight, et al.) are examples of direct replacement systems involving some word order reorganization.
Metric: The developer should provide a description of the theory and method of translation used by the system. -- Method: The developer should provide white-papers and supporting documentation. -- Measurement: Confirmation of method by study of documentation.
Metric: Size of parallel corpus -- Method: Developer report of parallel corpus size -- Measurement: Size can be reported in terms of bytes, sentence pairs or words per language.
Metric: Accessibility of example corpus -- Method: Specification by developer of corpus accessibility -- Measurement: Yes or no: Corpus is accessible
Metric: Form of examples -- Method: Specification by developer of example formats -- Measurement: Confirmation of example format specifications
Metric: Number of examples -- Method: Counting of examples -- Measurement: Number of examples in corpus
Metric: Source language matching technique -- Method: Specification of source language matching technique and parameters -- Measurement: Yes or no: Is the specification provided? A test suite can also be used to test the coverage and flexibility of the source language matching technique.
Metric: Ease of extending / adding examples -- Method: Test corpus -- Measurement: Percentage of test items that can be added.
Somers, 2000
See also ease of update in section 2.2.5.2.4
A translation memory is a multilingual text archive containing multilingual texts, allowing storage and retrieval of aligned multilingual text segments against various search conditions.
Different translation memories differ as to the information stored along with the raw texts and the retrieval methods. This definition does not restrict translation memory to what is currently available in systems on the market.
A translation memory is a collection of multilingual correspondences with optional control information stored with each correspondence. This characterization abstracts away from the actual manner of storing the correspondences (one-one, one-many, or many-many).
The control information can include information about the source text of the correspondence, its date, author, company, subject domain. This information may be used in ranking matches.
When a translation memory is used to support a given direction of translation, we can identify one segment of each correspondence as the (stored) source segment and another one as the (stored) target segment. A given query with a current source segment may return a number of correspondences with matching stored source segments. (EAGLES).
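The abstract characterization above (correspondences with optional control information, queried with a current source segment) can be sketched as a small data structure. This is a toy illustration under stated assumptions: the class and method names are hypothetical, and the use of Python's `difflib.SequenceMatcher` with a 0.7 threshold is a stand-in for whatever matching technique a real system uses.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


@dataclass
class Correspondence:
    segments: Dict[str, str]                               # language code -> segment text
    control: Dict[str, str] = field(default_factory=dict)  # optional control information


class TranslationMemory:
    def __init__(self) -> None:
        self.entries: List[Correspondence] = []

    def store(self, entry: Correspondence) -> None:
        self.entries.append(entry)

    def query(self, text: str, source: str, target: str,
              threshold: float = 0.7) -> List[Tuple[float, str, Correspondence]]:
        """Return (score, stored target segment, correspondence), best match first."""
        matches = []
        for entry in self.entries:
            if source in entry.segments and target in entry.segments:
                stored_source = entry.segments[source]
                score = SequenceMatcher(None, text.lower(), stored_source.lower()).ratio()
                if score >= threshold:
                    matches.append((score, entry.segments[target], entry))
        matches.sort(key=lambda match: match[0], reverse=True)
        return matches


# Hypothetical content: two stored correspondences, one with control information.
tm = TranslationMemory()
tm.store(Correspondence({"en": "The train leaves tomorrow.", "fr": "Le train part demain."},
                        {"domain": "rail", "date": "2002-01-15"}))
tm.store(Correspondence({"en": "Press the red button.", "fr": "Appuyez sur le bouton rouge."}))

results = tm.query("The train leaves today.", source="en", target="fr")
for score, target_segment, entry in results:
    print(f"{score:.2f} {target_segment} {entry.control}")
```

The direction of translation is chosen at query time, so the same correspondence can serve either direction. Ranking here uses the similarity score only; the control information is returned so that a caller could also use it in ranking, as the text suggests.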
Metric: The developer should provide a description of the incorporation of translation memory and how it fits into the MT process. -- Method: Provision of supporting documentation -- Measurement: Yes or no: Does the documentation describe the role and function of translation memory?
Metric: Size of parallel corpus -- Method: Developer report of parallel corpus size -- Measurement: Size can be reported in terms of bytes, sentence pairs or words per language
Metric: Form and number of text segments -- Method: Specification by developer of form, granularity and number of text segments -- Measurement: 1) Confirmation by test or examination of form and granularity of text segments. 2) Count of number of text segments. 3) Test suites may be designed and executed in which case the measurement is percentage of test suite cases accepted.
Metric: Type of control information permitted -- Method: Specification by developer of type of control information permitted. -- Measurement: Inspection of specifications. Number of specified control settings which work.
Metric: Source language matching technique -- Method: Specification of source language matching technique and parameters. -- Measurement: 1) Yes or no: Is specification provided? 2) Use test suite to test coverage and flexibility of source language matching algorithm.
Metric: Ease of extending parallel corpus. -- Method: Test corpus / test suite application -- Measurement: Percentage of test items that can be added
EAGLES Evaluation Standard for Translation Memory
The incorporation of translation memory into traditional machine translation platforms is a relatively new and under-represented field of study, although a few examples do exist (AMTA-2002).
Every MT system embodies some theory of language and of translation. Usually most of the theoretical assumptions are implicit, possibly not even known to the developer.
The simplest MT systems perform direct replacement of terms and phrases in the source language with target language equivalents. In addition, rudimentary word order changes may often be performed. Example-based systems (EBMT) are one type of this class; they replace phrases or even whole paragraphs at a time.
More sophisticated MT systems try to improve syntactic (grammatical) quality by analyzing the source sentence into a syntax tree and then converting the tree into the form required by the target syntax (for example, moving the verb complex). At the cost of building grammars and parsers, such syntactic transfer systems produce higher quality.
One level more complex, semantic transfer systems analyse the source text into some formalism that is intended to capture meaning, not just grammatical form. The formalisms used by shallow semantic systems are not fully language-independent and hence require some transformations into target form.
The most complex systems analyse the input into a language-neutral interlingual formalism, from which many target languages can be directly generated. No wide-coverage interlingua has yet been developed.
These levels of translation have been represented by the so-called MT triangle (Vauquois).
In general, the more sophisticated the internal processing, the higher the output quality, but the more domain-specific and brittle the system. Most modern working systems include a blend of syntactic and semantic transfer.
In practice, MT systems do not operate solely at one level of processing. In fact, falling back to a less complex strategy is often indicated when errors occur.
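The idea of dropping to a less complex strategy can be sketched as a chain of translation functions tried in order. Everything below is a toy assumption for illustration (the function names, the three-word lexicon and the use of `ValueError` to signal failure); it does not describe how any particular system is built.

```python
from typing import Callable, List, Tuple

Strategy = Tuple[str, Callable[[str], str]]


def translate_with_fallback(sentence: str, strategies: List[Strategy]) -> Tuple[str, str]:
    """Try strategies from most to least complex; return (strategy name, output)."""
    for name, translate in strategies:
        try:
            return name, translate(sentence)
        except ValueError:
            continue  # this level failed; fall back to the next, simpler one
    raise RuntimeError("no strategy could translate the sentence")


# Toy stand-in for a syntactic transfer component whose parser fails (hypothetical).
def syntactic_transfer(sentence: str) -> str:
    raise ValueError("parse failure")


# Toy stand-in for direct replacement: word-for-word lookup, unknown words pass through.
LEXICON = {"the": "le", "train": "train", "leaves": "part"}


def direct_replacement(sentence: str) -> str:
    return " ".join(LEXICON.get(word, word) for word in sentence.lower().split())


result = translate_with_fallback(
    "The train leaves",
    [("syntactic transfer", syntactic_transfer), ("direct replacement", direct_replacement)],
)
print(result)
```

Returning the name of the strategy that succeeded makes it possible to record how often each level was actually used, which is relevant when classifying a system that mixes levels.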
Metric: The developer should provide a description of the theory and method of translation used by the system. -- Method: Receipt and review of the specification of method. -- Measurement: Yes or no: Is the system a direct translation system?
Metric: The form and number of substitutions -- Method: Receipt and review of the form and number of substitutions -- Measurement: 1) Yes or no: Does the description of the substitution forms exist? 2) What is the number of substitution rules?
Metric: Number of reordering operations -- Method: Receipt and review of the reordering operations list -- Measurement: Yes or no: Description of reordering operations exists?
Metric: Types of reordering operations -- Method: Receipt and review of the reordering operations list -- Measurement: Count number of reordering operations possible
Metric: Coverage of reordering operations -- Method: Test suite application -- Measurement: Number of advertised reordering operations that are carried out successfully.
Metric: Relaxation capacity or wildcard processing -- Method: Test suite application -- Measurement: Number of test suite items that are processed successfully
Metric: Ease of adding or changing substitutions -- Method: -- Measurement:
Note that the test suite methods mentioned here test whether the system processes a particular rule or combination of rules, without measuring or addressing the overall quality as defined in section 2.2.
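A test suite application of this kind can be sketched as a loop that records only whether each item was processed, not how good the output is. The suite items, the `toy_system` function and the convention that a system returns `None` (or raises) when it cannot process an item are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional


def run_test_suite(process: Callable[[str], Optional[str]],
                   suite: Dict[str, str]) -> Dict[str, object]:
    """Record which test items the system processes, without judging output quality."""
    processed: List[str] = []
    failed: List[str] = []
    for item_id, text in suite.items():
        try:
            output = process(text)
        except Exception:
            output = None  # a crash counts as 'not processed'
        (processed if output is not None else failed).append(item_id)
    percentage = 100.0 * len(processed) / len(suite) if suite else 0.0
    return {"processed": len(processed), "total": len(suite),
            "percentage": percentage, "failed": failed}


# Hypothetical system under test: it cannot handle questions.
def toy_system(text: str) -> Optional[str]:
    return None if "?" in text else text.upper()


# Hypothetical test suite: one item per advertised operation.
suite = {
    "reorder-adjective": "the red house",
    "reorder-verb-final": "he has the book read",
    "wildcard-number": "platform 12",
    "reorder-question": "does the train leave?",
}

report = run_test_suite(toy_system, suite)
print(report)
```

The report yields both forms of measurement used in the metrics above: a count of items processed successfully and the corresponding percentage.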
Metric: The developer should provide a description of the theory and method of translation used by the system. -- Method: Receipt and review of the specification of the transfer method. -- Measurement: 1) Yes or no: Is the system a transfer translation system? 2) At what level of transfer does the system operate primarily?
Metric: If the system uses a grammar, the form and number of grammatical analysis rules -- Method: Receipt and review of the grammatical rules in place -- Measurement: 1) Yes or no: Do the grammatical rules exist? 2) How many grammatical rules exist?
Metric: If the system uses a grammar, the coverage of the grammatical rules -- Method: (a) Receipt and review of the grammatical rules in place -- Measurement: (a) Analyzed coverage of grammatical rules for number of phenomena covered.
Method: (b) Use test suite to test for grammatical phenomena coverage -- Measurement: (b) Number of test suite cases covered
Metric: If the system uses a grammar, the relaxation capacity -- Method: (a) Receipt and review of the grammatical relaxation algorithm -- Measurement: (a) Analyzed coverage of relaxation algorithm for number of phenomena covered
Method: (b) Use test suite to test for grammatical relaxations -- Measurement: (b) Number of test suite cases covered
Metric: If the system uses a grammar, the ease of adding or changing the rules -- Method: Add or change a grammar rule -- Measurement: Yes or no: Can the grammar be changed?
NOTE: This is related to section 2.2.5.2.
Metric: With respect to the transfer component, the form and number of transfer rules -- Method: Receipt and review of the transfer rules -- Measurement: 1) The transfer rules are described in a standardized form. 2) Number of transfer rules
Metric: With respect to the transfer component, the coverage of transfer rules -- Method: (a) Receipt and review of the transfer rules -- Measurement: (a) Analyzed coverage of transfer rules for number of phenomena covered.
Method: (b) Use test suite to test for transfer phenomena coverage -- Measurement: (b) Number of test suite cases covered
Metric: With respect to the transfer component, the relaxation capacity of the transfer rules -- Method: (a) Receipt and review of the relaxation mechanism for the transfer rules -- Measurement: (a) Analyzed coverage of the relaxation mechanism in transfer
Method: (b) Use test suite to test for transfer rule relaxation -- Measurement: (b) Number of test suite cases covered
Metric: With respect to the transfer component, the ease of adding or changing the rules -- Method: Add or change transfer rules -- Measurement: Yes or no: Can transfer rules be added or changed?
NOTE: This is also related to section 2.2.2.5
Metric: The developer should provide a description of the theory and method of translation used by the system. -- Method: Receipt and review of the specification of the interlingual structure and methods. -- Measurement: Yes or no: Is the system an interlingual translation system?
If the system uses a grammar, then the grammar metrics of Transfer (2.1.1.2.2/412) apply.
Metric: Expressive power of the interlingual notation -- Method: Analysis of the interlingual notation scheme -- Measurement: 1) Yes or no: Does the interlingual notation exist? 2) How many levels of complexity for the notation exist?
Metric: Coverage of the interlingual notation -- Method: (a) Receipt and review of the interlingual notation, including markup instructions -- Measurement: (a) Analyzed coverage of interlingual notation (using formal methods).
Method: (b) Use test suite to test for phenomena coverage -- Measurement: (b) Number of test suite cases covered.
Metric: Representation of standard linguistic phenomena (e.g., sentence component promotion and demotion, phrasal expression, tense/time, aspect, etc.) must be given. See, for example, Ontological Semantics (Nirenburg and Raskin 2002). -- Method: Use test suite for phenomena coverage -- Measurement: Number of test suite cases covered.
Metric: Number of language pairs supported -- Method: (a) Count developer reported language pairs -- Measurement: (a) Number of language pairs reported
Method: (b) Run test document in each language reported through system. -- Measurement: (b) Number of successful runs.
Metric: Particular language pairs supported -- Method: (a) List developer reported language pairs -- Measurement: (a) List of language pairs reported
Method: (b) Run test document in each language reported through system. -- Measurement: (b) Number of language pairs successfully run.
Metric: Ease of adding new language pairs -- Method: (a) Study of documentation and developer interview -- Measurement: (a) Yes or no: New language pairs can be added?
Method: (b) Add new language pair -- Measurement: (b) Yes or no: New language pair successfully added.
NOTE: New language pairs could be created using the source / target of different language pairs, without adding an entirely new language to the mix. For instance, a system with French-English and English-German may be able to add French-German as a language pair.
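The note above can be illustrated with a minimal sketch that lists the language pairs obtainable by recombining existing source and target languages. The function name `candidate_pairs` and the two-letter language codes are illustrative assumptions, not part of any particular MT system; whether a candidate pair can actually be built still has to be tested as described in method (b).

```python
from itertools import product

def candidate_pairs(supported):
    """Return pairs that reuse an existing source language and an existing
    target language but are not yet supported (hypothetical helper)."""
    sources = {src for src, _ in supported}
    targets = {tgt for _, tgt in supported}
    return {
        (src, tgt)
        for src, tgt in product(sources, targets)
        if src != tgt and (src, tgt) not in supported
    }

# Example from the note: French-English plus English-German.
supported = {("fr", "en"), ("en", "de")}
print(sorted(candidate_pairs(supported)))
```

In this example the only candidate is French-German, since it reuses French as a source and German as a target without introducing a new language.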
Metric: Ease of adding new language pairs -- Method: Add new language pair -- Measurement: This ties to the measurements in section 2.2.5.
Metric: Kinds of dictionaries available -- Method: List of dictionaries provided by developer -- Measurement: List of dictionaries
Metric: Number of dictionaries available -- Method: Count number of dictionaries in list -- Measurement: Number of dictionaries available
Metric: Format of the dictionary entries -- Method: Examine developer documentation -- Measurement: Yes or no: Is the dictionary in a standard format? -- NOTE: A deeper discussion of dictionary formatting, getting new entries into the system, etc., occurs in the ease of update sections.
For metrics related to ease of dictionary update, see sections 2.1.3.4 / 2.2.5.2.
Metric: List of wordlists / glossaries available -- Method: Developer provides list of wordlists / glossaries -- Measurement: List of wordlists / glossaries
Metric: Number of wordlists / glossaries available -- Method: Count number of wordlists / glossaries -- Measurement: Number of wordlists / glossaries available
Metric: Format of the wordlists / glossaries -- Method: Examine developer documentation -- Measurement: Yes or no: Are the wordlists and/or glossaries in standard format? -- NOTE: A deeper discussion of wordlist and glossary formatting, putting new entries into the system, etc., occurs in the ease of update sections.
Metric: Types of corpora incorporated into the system. -- Method: Report by the developer. -- Measurement: List of described types of corpora - monolingual, comparable, parallel.
Metric: Number of each type of corpora incorporated into the system -- Method: Report by the developer. -- Measurement: List of described numbers of corpora, categorized by type.
Metric: Kinds of each type of corpora incorporated into the system. Beyond the type of corpora (monolingual, comparable, etc.), there are the kinds, which include domain, genre, dates of collection, etc. -- Method: Report by the developer. -- Measurement: List of described domains of corpora, categorized by type.
Note that the ease of update is covered in section 2.2.5.2
Metric: Type of grammar used -- Method: Specification by developer -- Measurement: The type of grammar used should specify which of the formalisms (e.g., lexical functional grammar (LFG), generalized phrase structure grammar (GPSG)) it conforms to and any adaptations needed.
Metric: Grammatical complexity -- Method: Analysis of grammar reported by developer -- Measurement: Analyzed complexity in order of magnitude measures per number of input tokens
Metric: Grammatical coverage -- Method: (a) Analysis of grammar reported by developer -- Measurement: (a) Analyzed coverage reported in terms of linguistic constructs covered
Method: (b) Test grammatical coverage using test suites -- Measurement: (b) Number and type of linguistic constructs covered in test suite
In addition to the operation of the system per se, other activities or processes must take place to enable successful MT operation.
Most translation technology products provide some facilities for customisation. This can range from machine translation systems that typically offer very little to some translator workbench products with many customisable features. The degree to which users can customize products to suit their own environment is a critical factor in selecting the most appropriate product.
This heading assesses user-definable features for areas such as project management and linguistic processing.
Pre-translation is defined as modifying the translation memory without notifying the user.
Translation preparation is related to transferring the source text into a form which the translation process can accept or which will facilitate translation.
The more the source text can be designed and created with translation in mind, the less work it will require when passing into the translation process (OVUM report).
Metric: Can edit list of terms to be ignored during translation process -- Method: 1) Determine if system has feature through reading documentation. 2) Test the operation of the feature through one or more test cases. -- Measurement: 1) Yes or no: The feature exists. 2) The number of words successfully marked as not-translate.
Metric: Ease of importing data into the system (for an example from word processors, see OVUM report) -- Method: 1) Review system documentation for list of data types accepted by the system, to include file types, code sets, data formats. 2) Import data into the system for each file type, code set and data format advertised. -- Measurement: 1) List of file types, code sets and data formats supported by the system. 2) List of file types, code sets and data formats successfully loaded into the system.
Metric: Input tolerance for typing / conversion and other errors -- Method: Design and execute characteristic error test suites -- Measurement: Percentage of ill-formed inputs successfully handled by the system.
Metric: Can mark terms to not translate -- Method: 1) Review system documentation to see if this is a feature and how it treats these terms. For instance, in Chinese-English, if a Chinese term is marked as do not translate, is it transliterated or rendered in the native font in the output text? 2) Mark words as do not translate and run them through the system -- Measurement: 1) Yes or no: Terms can be marked as do not translate. 2) Description of handling strategies for not translated words.
Metric: Maximum length of input text -- Method: 1) Review system documentation to see if there is a maximum supported input text length. 2) Run documents under and over input text length to determine handling of out of bounds text. -- Measurement: 1) Length (in words or bytes) of largest input text permitted. 2) Description of error handling for texts larger than the maximum length. That is, are they split into separate texts, does the system crash, etc.
Metric: Maximum length of input sentence -- Method: 1) Review system documentation to see if there is a maximum supported input sentence length. 2) Run suite of sentences under and over input length to determine handling of over-length text. -- Measurement: 1) Length (in words or bytes) of largest input sentence permitted. 2) Description of error handling for sentences larger than the maximum length. That is, are they split into separate texts, does the system crash, etc.
Metric: System vocabulary search -- Method: 1) Review of system architecture to determine if module does a not-found check on words before translation process begins. 2) Review of intermediate system artifacts to see if marking occurs. -- Measurement: 1) Yes or no: Does the system pre-scan document for not-found words? 2) If so, how does the marking occur?
Metric: Time required for pre-processing of a particular test text -- Method: 1) Assemble necessary software modules if not incorporated into system. 2) Enable necessary software modules if incorporated into system. 3) Measure time required for pre-processing stages for one or more test texts. -- Measurement: Amount of additional time required by pre-processing.
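The input-tolerance measurement above (percentage of ill-formed inputs successfully handled) could be computed along the following lines. This is only a sketch: `toy_translate` is a stand-in for the system under test, and treating any non-empty output as "handled" is an assumption that a real evaluation would replace with its own acceptance criterion.

```python
def input_tolerance(translate, ill_formed_inputs):
    """Percentage of ill-formed inputs the system handles without failing."""
    if not ill_formed_inputs:
        raise ValueError("the error test suite is empty")
    handled = 0
    for text in ill_formed_inputs:
        try:
            output = translate(text)
        except Exception:
            continue  # a crash counts as not handled
        if output:  # assumption: any non-empty output counts as handled
            handled += 1
    return 100.0 * handled / len(ill_formed_inputs)

def toy_translate(text):
    """Hypothetical system stub: fails on NUL bytes left by a bad conversion."""
    if "\x00" in text:
        raise ValueError("unreadable input")
    return text.upper()

suite = ["teh cat", "hello\x00world", "wrold peace"]
print(round(input_tolerance(toy_translate, suite), 1))
```

Here two of the three test inputs are handled, so the measurement is about 66.7 percent.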
This is more a part of the process flow as opposed to the translation process per se.
Post-translation activities relate to preparing the output texts to meet the requirements for final publication or delivery (OVUM report).
Revision of output translation interactively to produce a final version for printing (Trial of the Weidner Computer-Assisted Translation System, p.12, October 1985). Sometimes this is referred to as the camera-ready copy.
Metric: Correction rate defined as the ratio of the number of words corrected to the number of words in the translation (Van Slype) -- Method: Count number of words corrected, number of words in initial translation. -- Measurement: Ratio of number of words corrected to the number of words in the translation.
Metric: Correction rate defined as the amount of time required to correct a text after the translation -- Method: Time the correction of representative texts -- Measurement: Time to correct
Metric: Correction rate defined as the number of insertions, deletions and substitutions - "edit distance" - required to correct a text after translation (Ney and Niessen) -- Method: Count the number of insertions, deletions, substitutions to correct a text. Note that this metric can be automated. -- Measurement: Edit distance, which is often a linear combination of the three counts.
Metric: Availability of editing functions in system without retranslating (JEIDA report) -- Method: 1) Check the system documentation to check availability and operation of post-edit functions. 2) Test the operation of each function on test documents. -- Measurement: Description of the functions available with their operation parameters.
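The edit-distance form of the correction rate described above can be automated. The following sketch counts word-level insertions, deletions and substitutions between a raw translation and its corrected version, then combines them linearly. The function names and the default unit weights are assumptions made for illustration; the alignment is computed at unit cost and the weights are applied afterwards, which is a simplification. Using total edits over raw word count as a stand-in for the word-based correction ratio is likewise an assumption.

```python
def edit_operations(raw, corrected):
    """Return (insertions, deletions, substitutions) for one minimal
    word-level alignment that turns `raw` into `corrected`."""
    a, b = raw.split(), corrected.split()
    # table[i][j] = (cost, ins, dels, subs) for a[:i] -> b[:j]
    table = [[None] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        table[i][0] = (i, 0, i, 0)
    for j in range(len(b) + 1):
        table[0][j] = (j, j, 0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost, ins, dels, subs = table[i - 1][j - 1]
            if a[i - 1] == b[j - 1]:
                keep = (cost, ins, dels, subs)
            else:
                keep = (cost + 1, ins, dels, subs + 1)
            cost, ins, dels, subs = table[i - 1][j]
            delete = (cost + 1, ins, dels + 1, subs)
            cost, ins, dels, subs = table[i][j - 1]
            insert = (cost + 1, ins + 1, dels, subs)
            table[i][j] = min(keep, delete, insert, key=lambda t: t[0])
    return table[-1][-1][1:]

def weighted_edit_distance(raw, corrected, w_ins=1.0, w_del=1.0, w_sub=1.0):
    """Linear combination of the three counts (weights are assumptions)."""
    ins, dels, subs = edit_operations(raw, corrected)
    return w_ins * ins + w_del * dels + w_sub * subs

def correction_rate(raw, corrected):
    """Edits per word of the raw translation (an approximation of the ratio)."""
    return sum(edit_operations(raw, corrected)) / len(raw.split())

raw = "the cat sat on mat"
fixed = "the cat sat on the mat"
print(edit_operations(raw, fixed), correction_rate(raw, fixed))
```

In the example, one word is inserted into a five-word raw translation, giving one insertion and a rate of 0.2.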
Traditionally, for production-quality translations, this has been the stage in the process requiring the most time.
The designation of post-translation processing is often part of management control.
Interactive MT systems require user guidance at points when the system reaches an impasse during processing. The user's assistance (whether in the form of menu choices, parameter entry) constitutes a form of editing that can be called "inline editing" or "in-editing" (The Pangloss Mark III MT System).
Metric: Steps for translation -- Method: Count number of times system requires assistance when translating a test corpus. -- Measurement: Number of steps needed or number of steps as percentage of test corpus size.
Metric: Time for interactive translation -- Method: Measure the amount of time it takes to perform interactive translation on test corpus -- Measurement: Amount of time for interactive translation on test corpus.
This quality will not be appropriate for certain classes of MT process, such as embedded MT.
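The two interactive-translation measurements above (number of steps and elapsed time) can be collected in a single pass over a test corpus. The sketch below assumes a hypothetical system interface in which the translator calls back a function whenever it needs user guidance; `toy_translate`, the callback signature, and the use of sentences as the unit of corpus size are all assumptions for illustration.

```python
import time

def measure_interaction(translate, corpus, ask_user):
    """Count user prompts and time spent translating `corpus` interactively."""
    prompts = 0

    def counting_ask(question):
        nonlocal prompts
        prompts += 1  # one step each time the system asks for help
        return ask_user(question)

    start = time.perf_counter()
    for sentence in corpus:
        translate(sentence, counting_ask)
    seconds = time.perf_counter() - start
    return {
        "steps": prompts,
        "steps_pct": 100.0 * prompts / len(corpus),
        "seconds": seconds,
    }

def toy_translate(sentence, ask):
    """Hypothetical interactive system: asks about the ambiguous word 'bank'."""
    if "bank" in sentence.split():
        sense = ask("bank: river or money?")
        return sentence.replace("bank", sense)
    return sentence

corpus = ["the bank is closed", "the cat sleeps", "it rains", "we walk"]
report = measure_interaction(toy_translate, corpus, lambda question: "money")
print(report["steps"], report["steps_pct"])
```

With a scripted answer in place of a human, the timing reflects only system overhead; a real evaluation would time an actual user session.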
Metric: Production of a not-translated word list -- Method: 1) Read system documentation to determine if the system produces a separate not-translated word list. 2) Establish structure of list, if it exists. 3) Run test suite through system and examine not-translated word list. -- Measurement: 1) Yes or no: Not-translated word list produced. 2) Description of format of not-translated word list, to include transliteration conventions and system-specific markings.
Metric: Ease of identifying source terms / their target language equivalents and grammatical information. -- Method: Check not-translated word list to see if contextual information is provided with the not-translated words. -- Measurement: Context is/is not sufficient for update.
The stage of dictionary update consists of entering in the machine's dictionary words or expressions the translator considers useful for future translations. It is recommended that the inserted terms be ones likely to be used in 20% of future texts (Trial of the Weidner System, 1985, p.12).
For metrics related to ease of dictionary update, see Changeability (2.2.5.2/213).
Metric: Can manager set level of access? -- Method: 1) Read documentation to determine if feature exists. 2) Set multiple layers of data access. -- Measurement: 1) Yes or no: Does the feature exist? 2) Description of the layers of data access.
Metric: Can manager set up directory structures? -- Method: 1) Read documentation to determine if feature exists. 2) Set up directory structure and see if system accesses it properly. -- Measurement: 1) Yes or no: Can the directory structure be set up? 2) Yes or no: The system accesses it properly.
Metric: Can files be prepared and tracked within the framework? -- Method: 1) Read documentation to determine if feature exists. 2) Prepare and track files within the framework. -- Measurement: 1) Yes or no: File preparation and tracking exists. 2) Yes or no: Documents can be prepared and tracked within framework.
Metric: Customized printouts can be obtained. -- Method: 1) Read documentation to determine if feature exists. 2) Prepare and print customized statistics, such as usage. -- Measurement: Yes or no: The feature exists.
Quality is a complex notion that depends on the point of view of the different actors related to an MT system. It is most often related to the judgment of final users (Dostert); or defined as the composite measurement of fidelity, intelligibility and elegance (Johnson); or it is a result of the analysis of situational dimensions (House) - all in the Van Slype report.
The quality of the translation can be evaluated in two modes.
Without adjustment: this aims to evaluate the quality of translation before the dictionary and/or grammar is adjusted. This is also an absolute evaluation of the system (JEIDA report).
With adjustment: this aims to evaluate the quality of translation after the dictionary and/or grammar is adjusted. The higher quality of translation the user needs, the more severely the evaluation is made. In this respect this item shows the degree of the user's satisfaction with the system (JEIDA report).
All of the following definitions are taken from Van Slype's Critical Report, 1979. It seems very important to keep them in mind before proceeding to further discussion of the quality features.
DEFINITION OF TRANSLATION
J. HOUSE -- Translation is the replacement of a text written in a source language by a semantically and pragmatically equivalent text written in the target language. (The translation of oral texts is a different activity, namely interpretation).
TRANSLATION QUALITY
L'ASSOCIATION JEAN FAVARD distinguishes: (a) the intrinsic qualities, which are independent of the reader; (b) the extrinsic qualities, which are related to the "text-reader" couple. A text, even badly translated (and thus of low intrinsic quality) can nevertheless, for an informed reader, be as clear as if it had been well translated. However, beyond a certain deterioration in intrinsic quality, the extrinsic quality becomes very poor.
H. BRUDERER -- Quality is a relative concept, i.e. one related to a specific object. Quality can apparently be measured, at least in part, but it remains much more difficult to quantify abstract (conceptual, subjective) phenomena than concrete (perceptible, real, tangible) things. Quality can be evaluated: (a) either positively: assessment of merits, advantages; (b) or negatively: assessment of deficiencies, errors, disadvantages; (c) or totally: assessment of the positive and the negative aspects. The evaluation of the translation quality -- whether human or computerized -- has to take into account the following intralinguistic and interlinguistic factors: morphology, syntax, content, terminology, style, conformity. A faithful translation reproduces the sense of the original text, but it does not necessarily, if it is to be considered an intelligent translation, have to be identical to the original text. Given that they partially overlap, content and fidelity should be evaluated on an overall basis. Similarly, it is difficult to differentiate clearly syntax and semantics. Style, on the other hand, influences all levels (morphology, syntax, semantics, terminology).
IR.L. JOHNSON defines translation quality by three factors: fidelity, intelligibility and elegance. The importance of these three factors may vary with the type of text considered. Features can be observed: (a) superficially, via linguistic elements such as lexical and syntactic exactitude; (b) indirectly, via the reactions of the users to the translated text.
B. KUHLEN stresses that there is not a universal criterion for MT evaluation: (a) on the one hand because it does not seem that MT can ever reach the level of quality of human translation; (b) on the other hand, because the evaluation criteria have to be chosen according to the aim in view; (c) finally, because the individual parameters, which taken together permit an assessment of the quality of MT, often contradict each other, with the result that an overall rating would not be significant to the specific performance of the components.
Z.L. PANKOWICZ feels that usefulness of MT and HT has to be based on quality, speed and cost. Determination of the optimal balance between these three parameters depends on the environment of each translation activity. It is necessary to understand, in his view, that the quality of HT and MT is indefinable, at least in any absolute way. The assessment of the quality of HT is traditionally based on its completeness and on stylistic elements.
A.J. PETIT takes the view that the translation should not contain any misconstruction, but admits however a tolerance of up to 1 % of the sentences in the case of translations to be supplied raw to the final user and 2 % of the sentences in the case of texts to be revised before submission to the users. This tolerance is intended to allow for normal risks of error or accident.
Y. WILKS thinks that the purist who feels that the least translation defect nullifies the translation is often mistaken in two of his postulates: (a) he exaggerates the attention and comprehension which the average reader achieves with a technical document (consequently, errors of translation do not negate the value of the text); (b) he exaggerates the quality of the mass of human translations produced on an enormous scale and at high speed.
RELATIONSHIP BETWEEN TRANSLATION QUALITIES AND EVALUATION CRITERIA
According to G. BOURQUIN the criteria for evaluating a translation will vary according to whether it is produced by a human translator or by the machine: (a) from the human, "finesse" will be required: open to the ethnoculture and to work on linguistics, the human translates with his sensitivity, his intuition, his common sense; (b) the computer will be expected to offer regularity, precision, infallibility, speed, and encyclopedic exhaustiveness.
M. MASTERMAN notes that our ignorance of the very nature of translation leads to a discordance between the evaluation criteria used or proposed by various authors.
A.J. PETIT -- A product is acceptable only if it meets the requirements of its users. As regards texts (original texts or human or machine translations), the principal requirements are:
(1) For utility technical texts (maintenance or user manuals): (a) no errors, (b) homogeneity, (c) clarity, without ambiguity or gibberish which might obscure the sense of the message, (d) simple correct style, without extravagances or recherché elements, (e) use of the terms recognized in the relevant sector.
(2) For educational technical texts: (a) no technical errors, (b) adaptation of the terms recognized in the relevant sector.
(3) For documentary scientific texts: (a) clear exposition of theory, (b) flowing style, without errors and without excessively long sentences incorporating several different ideas, (c) use of the basic terminology of the discipline.
These requirements have however to be viewed from a different angle according to whether the translation is intended: (a) to be revised: in this case, the translation system (human or machine) has to be aware of its own shortcomings, and indicate by itself all the ambiguities which it was not able to resolve; it delivers an incomplete product, but one without serious defects; (b) to be supplied direct to the final user: the translation must then be complete (experienced human translator or a computerised system producing a complete translation, without any misconstruction) and without serious defects (human error or accident both being normal risks).
THE AUTHORS OF THE REPORT PRESENTED BY PHILIPS distinguish between evaluation of translations with and without comparison with the source text. In the first case, it is necessary to assess in what measure the translation (a) reproduces what is stated in the original (for example contractual texts), (b) reproduces what the author of the original intends to say, with the certainty that the message is properly understood (for example translation of manuals). To assess the quality of a translation, it is necessary to answer the following questions. (1) On the aim of the translation: (1.1) does the translation reproduce the content of the original? (1.2) does the translation reproduce the formulations of the original? (1.3) does the translation reproduce the intention of the author? (2) On the type of text: (2.1) is all the information presented? (2.2) can the translation achieve the desired effect? (2.3) have the necessary corrections been made in such a way that communication has the best chance of success?
In the second case, evaluation of the translation without reference to the original, the assessment of the quality of the translation has to cover: (a) the grammatical correctness, (b) style of idioms, (c) the use of current words, expressions and structures in the target language, (d) the absence of contradictions or ambiguities.
ASSESSMENT
The concept of the quality of a manufactured product is, in general, unambiguous: the product has to correspond to the specifications, and a battery of quality control tests can easily be arranged and made the responsibility of controllers who are often relatively unqualified. The concept of translation quality is much more indeterminate, and the authors' contributions can be summarized fairly briefly. (1) The quality has to be assessed, not in the absolute, but according to the aims of the writer of the texts to be translated and by those who decide how it is to be distributed. (2) The quality achieved by HT cannot be expected of MT, and the latter has thus to be used for more limited aims than the former (which does not mean that, within the scope of these limited aims, there does not exist a major potential demand). (3) The evaluation criteria have to be chosen according to these specific aims. (4) Since translation quality cannot be measured in the absolute, on the basis of a single criterion, its assessment should combine several criteria.
The extent to which a sentence reads naturally.
Ease with which a translation can be understood, i.e. its clarity to the reader (Halliday in Van Slype's Critical Report).
This has also been called fluency, intelligibility, and clarity.
Crook & Bishop (in Van Slype's Critical Report): Cloze test (every eighth word).
Crook & Bishop (in Van Slype's Critical Report): 7-point scale.
Halliday (in Van Slype's Critical Report): Clozentropy.
Sinaiko (in Van Slype's Critical Report): Multiple-choice questionnaire + Cloze test (every fifth word) + clarity measurement + time measurement.
Sinaiko (in Van Slype's Critical Report): Rating of sentences read on a 3-point scale.
Carroll (ALPAC report): rating of sentences read on a 9-point scale.
Carroll & Bishop (in Van Slype's Critical Report): rating of sentences on a 7-point scale.
Leavitt (in Van Slype's Critical Report): rating of texts read on a 9-point scale.
Van Slype (in Van Slype's Critical Report): rating of sentences read in their context on a 4-point scale.
Vauquois (in Van Slype's Critical Report): rating of sentences read on a 2-point and 3-point scale.
Pfafflin (in Van Slype's Critical Report): Rating of sentences read on a 3-point scale.
Vanni & Miller (2001, 2002): "Do you get it?" - snap judgement rating of sentences on scale from 0 to 3.
Niessen, Och, Leusch and Ney, 2000 measure syntactic errors with an automated string edit distance metric, which according to them can also be used as a measure of readability. See also Wellformedness (2.2.1.3/186).
Somers' use of cloze test (Somers and Wild, 2000).
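Several of the methods listed above rely on a cloze test in which every n-th word (every fifth or every eighth in the cited studies) is deleted and the reader tries to restore it. A minimal sketch of building and scoring such a test follows. The function names, the blank marker and the exact-match scoring rule are assumptions for illustration; some cloze protocols also accept synonyms, which this sketch does not.

```python
def make_cloze(text, n=5, blank="____"):
    """Replace every n-th word with a blank; return gapped text and answers."""
    gapped, answers = [], []
    for position, word in enumerate(text.split(), start=1):
        if position % n == 0:
            gapped.append(blank)
            answers.append(word)
        else:
            gapped.append(word)
    return " ".join(gapped), answers

def score_cloze(answers, responses):
    """Fraction of blanks restored exactly (case-insensitive)."""
    if not answers:
        return 0.0
    hits = sum(a.lower() == r.lower() for a, r in zip(answers, responses))
    return hits / len(answers)

text = ("the quick brown fox jumps over the lazy dog "
        "near the river bank today again")
gapped, answers = make_cloze(text, n=5)
print(gapped)
print(answers, score_cloze(answers, ["jumps", "by", "again"]))
```

A higher restoration score on MT output is taken as evidence of higher readability; passing `n=8` gives the every-eighth-word variant.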
READING TIME is another metric that has been used to measure readability. It is defined as the time required to read and understand a text, or to realize its unintelligibility, but not to memorize it (Van Slype). The following metrics based on reading time are from Van Slype's Final Report:
B.H.Dostert: by asking final users to state what percentage of additional time they require to read MT, as compared to an original in their own language.
J.B. Carroll: by measuring the time spent by the evaluator in reading each sentence of the sample.
G. van Slype: by measuring the time spent by the evaluator in reading each text of the sample.
Pfafflin and Orr (both quoted by T.C. Halliday): by measuring the response time to a multiple-choice questionnaire.
H.W. Sinaiko: by measuring the time necessary for the execution of the cloze test.
Readability is intended to be a metric applied at the sentence-level. This is in contrast to the opinion expressed by Battelle: "Like the MT evaluation methods based on the factor of comprehensibility, evaluation methods based on the concept of readability must consider, if not the whole, then at least a sizable segment of the translated material. This requirement is due to the fact that the method, although called readability method, measures the appropriate overall contextual cohesiveness" (Battelle report).
Readability is a quality of the output that can be measured independently of the source language.
Cloze tests can be used either at sentence-level or cross-sentence level.
This quality has been merged with clarity, which was a separate taxon in earlier versions of this taxonomy.
The extent to which the text as a whole is easy to understand. That is, the extent to which valid information and inferences can be drawn from different parts of the same document.
Comprehensibility reflects the degree to which a complete translation can be understood (whereas intelligibility is based on the general clarity of the translation, whether this is considered in its entirety or by segments out of context). (Halliday in Van Slype's Critical Report).
Subjective evaluation of the degree of comprehensibility and clarity of the translation. (Van Slype in Van Slype's Critical Report).
Halliday (in Van Slype's Critical Report): Noise test
Leavitt (in Van Slype's Critical Report): Multiple-choice questionnaire.
Orr (in Van Slype's Critical Report): Multiple-choice questionnaire.
Sinaiko (in Van Slype's Critical Report): Knowledge test.
This has also been called comprehension or intelligibility.
Metrics bearing on the readability of single sentences, as opposed to the comprehensibility of the text as a whole, have been moved to the Readability feature (2.2.1.1.1.1/172)
The coherence of a text is the degree to which the reader can describe the role of each individual sentence (or group of sentences) with respect to the text as a whole. Theories such as Rhetorical Structure Theory (Mann and Thompson, 1988) attempt to formalize coherence using a set of inter-segment relations (such as Cause, Solutionhood, Elaboration) that express the internal document structure.
Measurement of the total contextual coherence (T.C. Halliday in Van Slype's Critical Report).
Measure degree to which roles of each discourse unit can be identified with respect to a gold standard.
Vanni & Miller (2001, 2002), for example, measure this feature by counting the total number of sentences in the machine translated text to which RST labels can be assigned. See Mann & Thompson (1988).
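The two coherence measurements above (counting sentences to which an RST label can be assigned, and comparing assigned roles with a gold standard) reduce to simple counts once annotators have labelled the text. The sketch below assumes one label per sentence, with `None` where no relation could be assigned; the function name and data layout are illustrative assumptions and not the procedure of any cited study.

```python
def coherence_scores(assigned, gold=None):
    """Summarise RST labelling of a translated text.

    assigned: one relation label per sentence, or None if none could be given.
    gold: optional reference labels for the same sentences.
    """
    if not assigned:
        raise ValueError("no sentences to score")
    labelled = sum(label is not None for label in assigned)
    scores = {
        "labelled": labelled,
        "labelled_pct": 100.0 * labelled / len(assigned),
    }
    if gold is not None:
        if len(gold) != len(assigned):
            raise ValueError("gold and assigned labels differ in length")
        agree = sum(a == g for a, g in zip(assigned, gold))
        scores["gold_agreement_pct"] = 100.0 * agree / len(gold)
    return scores

assigned = ["Elaboration", None, "Cause", "Elaboration"]
gold = ["Elaboration", "Solutionhood", "Cause", "Cause"]
print(coherence_scores(assigned, gold))
```

In this example three of four sentences receive a label (75 percent), and two of the four labels match the gold standard (50 percent).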
It has been asserted that the quality of a translation can be assessed by its level of coherence without comparing it to the original text. Once a sufficiently large sample is available, the probability that the translation should be at the same time coherent and totally wrong is very weak. (Wilks in personal communication, 1992, also cited in Van Slype's Critical Report).
According to the definition the assessment of coherence can be done by a monolingual evaluator, whereas any judgement on the correctness of the translation necessarily involves making use of a bilingual evaluator. (Wilks in Van Slype's Critical Report).
Cohesion of a text refers to the elements -- for example lexical chains, anaphora, ellipsis -- that link individual units across sentences.
Does the system render cohesive units appropriately for the target language?
Special issue of MT journal on Anaphora, 2001.
Cohesion is particularly interesting for translation between languages that have different requirements for structuring and managing redundant information. For example, Asian languages make frequent use of ellipsis and zero-pronouns which often must be resolved on translation into languages where such use is not licensed.
Cohesion is also important when the translated text is intended for subsequent summarization (see Information extraction / Summarization, 1.3.1.2/115).
Qualities of the translation that must be evaluated on the basis of both the source language and the output of the system in the target language.
Suitability of source-to-target mapping to a particular task.
Coverage of cross language phenomena concerns the ability of the system to deal satisfactorily with the commonly recognized differences between the source and the target languages, with or without taking into account the presence or absence of these phenomena in any particular corpus.
By use of a set of test patterns - these should be in the form of simple source language patterns that are theory neutral, that is, descriptive in pedagogical terms rather than in terms of a particular syntactic theory whose principles could obscure the issue. For a number of European languages, such test suites are available as a product of the TSNLP project, which focused mainly on syntactic phenomena, and the DiET project. The Japanese MT research community has also produced such test suites as part of the JEIDA project.
Commercial MT companies should also all have similar test suites: Logos and Systran both have test suites of this type. IBM has relevant test suites that were presented to the research community at LREC2000 in Athens and ACL-2001 in Toulouse.
In order to arrive at a measurement, test suites of this type can be used, with either a correct/incorrect verdict for each sentence in the test suite, a percentage correct for each sentence (as long as the notion of "percentage correct" is well-defined), or a (3 to 10 point) scale of correctness for each sentence. The aggregate measure could be the percentage of sentences correct, the percentage of linguistic phenomena covered, or an aggregate measure of linguistic phenomena covered, weighted for phenomena important to the language pair and task of interest.
It is also possible to use word error rate as a measurement, along the lines of automatic scoring of insertions, deletions, and substitutions relative to a gold standard (Niessen, Och, Leusch, and Ney, 2000), or as described in Vanni & Miller (2001, 2002).
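The three aggregate measures described above for test-suite results (percentage of sentences correct, percentage of phenomena covered, and a weighted phenomenon score) can be sketched as follows. The data layout, the phenomenon names, and the rule that a phenomenon counts as covered only when all of its test sentences pass are assumptions made for illustration; an evaluator may well choose a different threshold.

```python
from collections import defaultdict

def aggregate(results, weights=None):
    """results: list of (phenomenon, correct) verdicts, one per test sentence.
    weights: optional importance per phenomenon (default weight 1.0)."""
    if not results:
        raise ValueError("empty test suite")
    sentence_pct = 100.0 * sum(ok for _, ok in results) / len(results)

    verdicts = defaultdict(list)
    for phenomenon, ok in results:
        verdicts[phenomenon].append(ok)
    # Assumption: covered means every sentence for that phenomenon passed.
    covered = {p: all(oks) for p, oks in verdicts.items()}
    phenomena_pct = 100.0 * sum(covered.values()) / len(covered)

    weights = weights or {}
    total = sum(weights.get(p, 1.0) for p in covered)
    gained = sum(weights.get(p, 1.0) for p, ok in covered.items() if ok)
    weighted_pct = 100.0 * gained / total
    return sentence_pct, phenomena_pct, weighted_pct

results = [
    ("negation", True), ("negation", True),
    ("idiom", False), ("idiom", True),
    ("passive", True),
]
print(aggregate(results, weights={"idiom": 3.0}))
```

Weighting idioms three times as heavily as the other phenomena lowers the weighted score to 40 percent here, even though 80 percent of individual sentences are correct, which shows why the choice of aggregate matters for a given language pair and task.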
Whereas TSNLP and some other test suites of this type focus mainly on syntactic phenomena, test suites for general cross-language coverage should ideally address other cross-language phenomena as well: idioms, lexical and conflational divergences, etc.
Each commercial MT company should have such a test suite, which they may use for regression testing or for testing of improvements to the system. Ideally, in order to test systems from developers A and B, a test set covering the union of the phenomena covered by the two test suites should be used.
Coverage refers to the ability of the system to deal satisfactorily with linguistic phenomena, both generally addressing known cross-language phenomena and specifically addressing phenomena in a corpus of interest.
Coverage of corpus-based problematic phenomena concerns the ability of the system to deal with the particular challenges presented by a corpus of interest.
By constituting a representative corpus and submitting it to the system in order to observe what errors occur.
Given a test suite of representative phenomena specific to the corpus of interest, low-level and aggregate measurements like those described in Cross-language phenomena (2.2.1.1.2/502) can be used.
Subjective human scoring on a 10-point scale.
This is a subjective evaluation of the correctness of the style of each sentence (Georges van Slype, Evaluation of the 1978 Version of the SYSTRAN English-French Automatic System of the Commission of the European Communities). This quality is also commonly referred to as "register" and includes degree of formality, forcefulness and bias as exhibited through both lexical and morpho-syntactic choices.
Van Slype (in Van Slype's Critical Report): Evaluation of sentences on a 4-point scale.
String edit distance (Niessen, Och, Leusch, & Ney 2000).
The capability of the software product to provide the right or agreed results or effects with the needed degree of precision (ISO 9126: 2001, 6.1.2).
Accuracy and its sub-qualities are established by reference to the source language text.
Subjective evaluation of the degree to which the information contained in the original text has been reproduced without distortion in the translation (Van Slype).
Measurement of the correctness of the information transferred from the source language to the target language (Halliday in Van Slype's Critical Report).
Carroll (in Van Slype's Critical Report): Rating of sentences read out of context on a 9-point scale.
Crook and Bishop (in Van Slype's Critical Report): Rating on a 25-point scale.
Halliday (in Van Slype's Critical Report): Assessment of the correctness of the information transferred.
Leavitt (in Van Slype's Critical Report): Rating of text units read on a 9-point scale.
Miller and Beebe-Center (in Van Slype's Critical Report): Rating of a text on a 100-point scale.
Miller and Beebe-Center (in Van Slype's Critical Report): Shannon measurement of the quality of information transferred.
Sinaiko (in Van Slype's Critical Report): Re-translation.
Van Slype (in Van Slype's Critical Report): Rating of sentences read on a 4-point scale.
White and O'Connell (in DARPA 94): Rating of 'Adequacy' on a 5-point scale.
Bleu evaluation tool kit (in Papineni et al. 2001): Automatic n-gram comparison of translated sentences with one or more human reference translations.
Rank-order evaluation of MT system: correlation of automatically computed semantic and syntactic attributes of the MT output with human scores for adequacy and informativeness, and also fluency. Hartley and Rajman 2001 and 2002.
Automated word-error-rate evaluation (in Och, Tillmann and Ney, 1999).
Automated metric using head transducers (Alshawi et al, 2000).
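The automatic n-gram comparison against one or more human reference translations, listed above for the Bleu tool kit, rests on a clipped n-gram precision. The sketch below shows only that component and is not the tool kit itself: it omits the brevity penalty and the combination over several n-gram orders, and the function names and whitespace tokenization are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams that also occur in a reference.

    Each candidate n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent (clipping).
    """
    cand_counts = ngrams(candidate.split(), n)
    if not cand_counts:
        return 0.0

    # Maximum count of each n-gram over all references.
    max_ref = Counter()
    for reference in references:
        for gram, count in ngrams(reference.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)

    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

Clipping is what prevents a degenerate output such as "the the the the the the the" from scoring well: against references containing "the" at most twice, its unigram precision is 2/7.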
loss of information (silence) - example: word not translated
interference (noise) - example: word added by the system
distortion from a combination of loss and interference - example: word badly translated
Detailed analysis of the fidelity of a translation is very difficult to carry out, since each sentence conveys not a single item of information or a series of elementary items of information, but rather a portion of message or a series of complex messages whose relative importance in the sentence is not easy to appreciate.
Some automated metrics assume a fidelity evaluation as a human ground truth, or are relevant to fidelity evaluation.
Capability of the system to produce, from a given input and at a given point in time, the same output.
Count the number of alternative translations for a given input unit.
Consistency is particularly important for developers and for the translation of technical documentation.
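Counting the alternative translations produced for a given input unit can be sketched as below. The input layout (pairs of source unit and system output, for example collected from a term list or repeated segments) and the function name are assumptions for the example; a count of 1 means the unit was always rendered the same way.

```python
from collections import defaultdict


def alternative_translations(pairs):
    """Map each source unit to the number of distinct outputs observed for it.

    pairs: iterable of (source_unit, target_output) tuples.
    """
    outputs = defaultdict(set)
    for source, target in pairs:
        outputs[source].add(target)
    return {source: len(targets) for source, targets in outputs.items()}


# Hypothetical observations for two English source terms translated into French.
observed = [
    ("valve", "soupape"),
    ("valve", "vanne"),
    ("valve", "soupape"),
    ("pump", "pompe"),
]
counts = alternative_translations(observed)
```

Here "valve" receives two distinct renderings and "pump" one, so only "valve" would be flagged as inconsistently translated.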
Filatova, 2000.
Names should be transliterated or translated (e.g. 'London' / FR: 'Londres') as appropriate.
This characteristic becomes very important in Assimilation tasks (1.3.1./113), including Information Extraction.
Percentage of phenomena correctly treated.
List of error types.
Average string edit distance per sentence or for all tokens in the text.
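The two averaging choices named above, per sentence or over all tokens in the text, can give different figures for the same output. The sketch below contrasts them under stated assumptions: word-level edit distance with uniform costs, whitespace tokenization, one reference per sentence, and hypothetical function names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token lists (uniform costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def average_edit_distance(pairs):
    """Return (mean distance per sentence, total distance per reference token).

    pairs: list of (reference_sentence, system_output) strings.
    """
    if not pairs:
        raise ValueError("at least one sentence pair is required")
    distances = []
    reference_tokens = 0
    for reference, output in pairs:
        ref_tokens = reference.split()
        distances.append(edit_distance(ref_tokens, output.split()))
        reference_tokens += len(ref_tokens)
    if reference_tokens == 0:
        raise ValueError("the references contain no tokens")
    total = sum(distances)
    return total / len(distances), total / reference_tokens
```

Averaging per sentence weights every sentence equally, whereas pooling over all tokens lets long sentences dominate; which is preferable depends on the evaluation purpose.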
Flanagan, 1994. (See also the LOGOS error list in the same AMTA proceedings).
Loffler-Laurian, 1983 (in French).
See also Arnold et al, eds., 1993 ('Machine Translation' 1993 vol. 8:1-2, special issue on evaluation).
White and O'Connell, 1994 (DARPA measures): 5-point scale of syntactic correctness.
ALPAC: 5-point scale of syntactic correctness.
Percentage of phenomena correctly treated.
List of error types.
Average string edit distance per sentence or for all tokens in the text.
Flanagan, 1994. (See also the LOGOS error list in the same AMTA proceedings).
Loffler-Laurian, 1983 (in French).
See also Arnold et al, eds., 1993 ('Machine Translation' 1993 vol. 8:1-2, special issue on evaluation).
Percentage of inflections correctly treated.
List of error types.
Average string edit distance per sentence or for all inflectable tokens in the text.
(Note 1 from ISO) Aspects of suitability, changeability, adaptability and installability may affect operability.
(Note 2 from ISO) Operability corresponds to controllability, error tolerance and conformity with user expectations as defined in ISO 9241-10.
(Note 3 from ISO) For a system which is operated by a user, the combination of functionality, reliability, usability and efficiency can be measured externally by quality in use.
(Note 1 from ISO) Resources may include other software products, the software and hardware configuration of the system, and materials (e.g., print paper, diskettes, etc.).
The importance of this characteristic varies with the type of translation work (see (1.3) Characteristics of the translation task (112)).
The capability of the software product to provide appropriate response and processing time and throughput rates when performing its function under stated conditions.
This characteristic can be divided into the following sub-characteristics: (a) production time for complete translation; (b) production time for preliminary translation, or input-to-output speed.
The translation production time, i.e. the time between a request for a translation and its receipt, has been used as an evaluation criterion.
B.H. Dostert and Z.L. Pankowicz in Van Slype's Final Report.
In the context of machine translation systems this characteristic is interpreted as referring to production time.
Note 1 from ISO: Implementation includes coding, designing and documenting changes.
Note 2 from ISO: If the software is to be modified by the end user, changeability may affect operability.
In the particular case of MT systems, this refers to ensuring that improvement in one area does not result in degradation elsewhere.
Note 1 from ISO: For example, the replaceability of a new version of a software product is important to the user when upgrading.