By: Sergio Alasia
As stated by the European Telecommunications Standards Institute (ETSI), the main purpose of standardization is to allow interoperability and freedom of choice for buyers in a multi-vendor, multi-network, multi-service context. Language Service Providers (LSPs) in particular, whose services are in growing demand, cannot afford to offer services that fail to meet at least some unified industry standards.
LSPs mainly deal with text files, so industry standards should focus on text and its digital representation. Interoperability in the language industry should make it possible to create and process documents in different environments and to transfer them by different means without losing any of their properties. So far, this has not always been possible, since market leaders’ policies have often stood in the way of interoperability. However, in the last decade there has been an increasing effort in that direction, and standards are starting to play a constructive role in the evolution of language technologies.
For clarity’s sake, this article will consider three main blocks of standards that directly concern the translation industry:
- Character encoding standards
- File formats
- Process standards
Character encoding standards
Character encoding standards determine how text is represented as an underlying digital code, in order to be transmitted from one computer system to another. There are plenty of different character sets, but few of them are universally supported. Therefore, the use of non-standard character sets may lead to unreadable texts when documents are transferred from one system to another. The typical result is a poorly displayed text, garbled with question marks, squares and other strange characters.
This is especially problematic in the case of multilingual texts, where different alphabets, or even letters and ideograms, must coexist. A Russian-English glossary, for instance, should be encoded in such a way that both the Cyrillic and the Latin alphabets are readable by any computer, regardless of the system’s locale. One solution to this problem is the Unicode standard, which provides a unique code point for every character or ideogram of almost all written languages. UTF-8 and UTF-16 are two of the most widespread encodings of Unicode. Unicode’s ability to represent and handle text written in most of the world’s writing systems has led to its extensive use in web pages and, therefore, in the language service industry as well.
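The Russian-English glossary case can be illustrated with a short Python sketch. The glossary entry and the choice of legacy code pages (cp1252, latin-1) are assumptions made for illustration only; the point is that the Unicode encodings round-trip the mixed text, while a legacy code page either garbles it or cannot store it.

```python
# A tiny Russian-English glossary entry mixing Cyrillic and Latin letters
# (illustrative data, not taken from a real glossary).
entry = "перевод = translation"

# Encode the text with two Unicode encodings; both round-trip losslessly.
utf8_bytes = entry.encode("utf-8")
utf16_bytes = entry.encode("utf-16")
assert utf8_bytes.decode("utf-8") == entry
assert utf16_bytes.decode("utf-16") == entry

# Each character has a single Unicode code point, whatever the encoding.
code_points = [f"U+{ord(ch):04X}" for ch in "пt"]

# Reading the UTF-8 bytes with a legacy single-byte code page garbles
# the Cyrillic part while leaving the ASCII part intact.
garbled = utf8_bytes.decode("cp1252", errors="replace")

# A legacy Western European code page cannot store Cyrillic at all.
try:
    entry.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(code_points)  # ['U+043F', 'U+0074']
print(garbled)
print(latin1_ok)    # False
```

The garbled string is the kind of output a reader sees when a file is opened with the wrong encoding: the English half survives, the Russian half does not.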
File formats
File formats define the inner structure of files, so that the appropriate software application can properly load, open, process and save them. Companies should use standard formats for written files containing vital information, in order to keep ownership of such information at all times. Conversely, if only one or a few commercial applications are available for a company’s documentation authoring and translation, a risky relationship of dependency arises. Besides not adhering to the standards, the software house selling that application may force buyers to update without allowing backwards compatibility. It may even abandon that application or, worse, close down its business completely, leaving users without any updates or support.
Since the days of punched cards, end users have witnessed many such lock-in practices, which major software houses use to gain or keep market share regardless of the quality of their products. In fact, it was the lack of common standards for word processors in the mid-1990s that caused a painful mass migration of WordPerfect users to Microsoft Word, while Windows was supplanting DOS as the foremost operating system.
The language niche is no exception. SDL Trados is currently the market leader and de facto standard, yet it is far from being the best translation software, at least in terms of compatibility. First, it is based on Microsoft’s .NET technology, which leaves out all Mac and Linux users. Second, older versions cannot open files created by newer versions. Third, only recently has its cumbersome and cluttered graphical interface, reminiscent of the early 1990s, been improved. Most importantly, the vendor has lately treated its paying users as beta testers, releasing a buggy version of its Studio 2011 suite in September 2011, followed by the first Service Pack (as large as 347 MB) only three months later.
In the translation industry, the standardization of file formats is especially necessary for the various tagged text files that are the intermediate and auxiliary by-products of a translation process carried out with a Computer Assisted Translation (CAT) tool. CAT tools are usually capable of processing at least three types of files: translation memories, bilingual texts and term bases. If the inner structure of such files follows shared specifications that are expressly intended for exchange among different environments, a company’s multilingual documentation will not depend on just one or a few vendors.
Some of the standards specifically designed for the localization process were maintained by the Localization Industry Standards Association (LISA) until its demise on February 28, 2011. Later the same year, the European Telecommunications Standards Institute started an Industry Specification Group for Localisation Industry Standards (LIS), which is currently working on the TBX, TMX, SRX, GMX-V and xml:tm standards.
Translation memory files
Translation memories are usually large database files which contain previously translated texts, their formatting and other properties. Some of the properties are set by default (e.g. source and target language, date, time, ID of the person or software that performed the translation), while others can be added as custom attributes. Each CAT tool has its own way of storing translation memories, but it is extremely important for all language providers to be able to share TMs in order to perform their tasks. Translation Memory eXchange (TMX) is an XML-compliant format that represents this database structure and is designed for the interchange of translation memories among different CAT tools.
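As a rough illustration of how a TMX file carries translation units and their properties, the following Python sketch reads a minimal TMX 1.4-style document with the standard library. The sample content (tool name, dates, IDs, sentences) is hypothetical, and the function name read_tmx is an assumption; real translation memories are far larger and may contain inline formatting elements that this sketch simply flattens to text.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal TMX 1.4-style document with one translation unit.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="example" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu tuid="1" creationdate="20120101T120000Z" creationid="translator1">
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="es"><seg>Guarde el archivo.</seg></tuv>
    </tu>
  </body>
</tmx>"""

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(text):
    """Return one dict per translation unit, mapping language to segment."""
    root = ET.fromstring(text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() flattens any inline markup inside the segment.
            unit[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(unit)
    return units


units = read_tmx(TMX_SAMPLE)
print(units)  # [{'en': 'Save the file.', 'es': 'Guarde el archivo.'}]
```

Because the structure is plain XML, any tool that follows the same specification can read the units back without needing the application that wrote them.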
Whatever the format of the source file, in most cases translation and related processes are carried out on text-only extractions which include tags and placeholders to maintain the original printing or displaying layout. They also support the coexistence of sentences and their translation, side by side. The XML Localisation Interchange File Format (XLIFF) provides a unified structure used for bilingual documents. XLIFF is used as a “bridge” format which gives the appropriate structure to the extracted text. Specific elements and attributes provide the means to define the properties of each pair of segments (source and target), such as source and target language, the extraction tool, etc. Unlike the above formats, the XLIFF standard is being developed by the OASIS consortium.
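The source-target pairing described above can be sketched in Python against a minimal XLIFF 1.2-style document. The file name, languages and sentences are invented for illustration, and read_xliff is a hypothetical helper; inline tags such as the g element are flattened to plain text here, whereas a real tool would preserve them.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal XLIFF 1.2-style document with one segment pair.
XLIFF_SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="manual.html" source-language="en" target-language="it"
        datatype="html">
    <body>
      <trans-unit id="1">
        <source>Press the <g id="1">Start</g> button.</source>
        <target>Premere il pulsante <g id="1">Start</g>.</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def read_xliff(text):
    """Return (id, source language, target language, source, target) tuples."""
    root = ET.fromstring(text)
    pairs = []
    for file_el in root.findall("x:file", NS):
        src_lang = file_el.get("source-language")
        tgt_lang = file_el.get("target-language")
        for unit in file_el.iterfind(".//x:trans-unit", NS):
            source = "".join(unit.find("x:source", NS).itertext())
            target_el = unit.find("x:target", NS)
            # An untranslated unit may have no target element yet.
            target = "".join(target_el.itertext()) if target_el is not None else ""
            pairs.append((unit.get("id"), src_lang, tgt_lang, source, target))
    return pairs


pairs = read_xliff(XLIFF_SAMPLE)
print(pairs)
```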
Gettext Portable Object (PO) is also a multilingual format and is designed specifically for the software localization industry. PO files contain a field for each string to translate and a field to enter the translation.
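The two-field structure of a PO entry can be shown with a deliberately simplified Python sketch. The sample strings and the read_po helper are assumptions for illustration; the regular expression only handles single-line msgid/msgstr pairs and ignores plural forms, contexts, multi-line strings and escape sequences, which a real gettext parser must support.

```python
import re

# Hypothetical PO fragment: one translated and one untranslated string.
PO_SAMPLE = '''
# Translator comment
#: src/main.c:42
msgid "Open file"
msgstr "Abrir archivo"

msgid "Quit"
msgstr ""
'''

# Matches a single-line msgid immediately followed by a single-line msgstr.
ENTRY = re.compile(r'^msgid "(.*)"\nmsgstr "(.*)"$', re.MULTILINE)


def read_po(text):
    """Map each source string to its translation (empty if untranslated)."""
    return {msgid: msgstr for msgid, msgstr in ENTRY.findall(text)}


catalog = read_po(PO_SAMPLE)
untranslated = [source for source, target in catalog.items() if not target]

print(catalog)       # {'Open file': 'Abrir archivo', 'Quit': ''}
print(untranslated)  # ['Quit']
```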
Term-Base eXchange (TBX), Universal Terminology eXchange (UTX) and Open Lexicon Interchange Format (OLIF) are three formats specifically designed for terminological and lexical data. TBX and OLIF are XML-compliant, whereas UTX is a simpler tab-delimited text format. All three support glossaries for both human translation and machine translation. They store pairs of source-target terms, as well as other terminology data, including part of speech, gender, number and more detailed lexical information.
Other file types
Other file types taking part in the computer assisted translation process have their standard formats as well:
The Segmentation Rules eXchange (SRX) format is used to define segmentation rules. Segmentation is the operation that divides a text into chunks, called segments, that can be translated one by one. When a program needs to segment a document, it needs rules to determine where one segment ends and the next one starts. In most cases a full stop marks the end of a segment, but not when it is a dot within an Internet address or part of an abbreviation or acronym, for instance.
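The idea of break rules and exceptions can be sketched in Python. This is not an SRX parser: the rule list, the abbreviations and the segment function are assumptions that only mimic the general approach of pairing a pattern before the candidate position with a pattern after it, and letting the first matching rule decide whether to break.

```python
import re

# Each rule is (is_break, before_pattern, after_pattern). Rules are checked
# in order, and the first one matching at a position decides the outcome.
RULES = [
    # Exception: a known abbreviation followed by a space is not a break.
    (False, r"\b(?:e\.g|i\.e|etc|Mr|Dr)\.", r"\s"),
    # Break: sentence-final punctuation followed by whitespace.
    (True, r"[.?!]", r"\s+"),
]


def segment(text):
    """Split text into segments using the ordered rules above."""
    decisions = {}
    for is_break, before, after in RULES:
        for match in re.finditer(f"(?:{before})(?={after})", text):
            # setdefault keeps the decision of the earliest matching rule.
            decisions.setdefault(match.end(), is_break)
    segments, start = [], 0
    for pos in sorted(decisions):
        if decisions[pos]:
            segments.append(text[start:pos].strip())
            start = pos
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


sample = "Visit www.example.com for details. Dr. Smith wrote the manual. Is it ready?"
segments = segment(sample)
print(segments)
# ['Visit www.example.com for details.', 'Dr. Smith wrote the manual.', 'Is it ready?']
```

The dots inside the Internet address never trigger a break because they are not followed by whitespace, and the dot after "Dr" is protected by the exception rule listed first.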
Global Information Management Metrics eXchange (GMX) is a collection of standards intended to provide a common means of measuring quantitative aspects of a document, such as word count and complexity. When a translation agency receives a job request, it needs to estimate the whole translation effort in order to quote the project. Translation quotes for exactly the same text can vary a lot, because every translation company measures the complexity and length of a text in a different way. With the development and integration of the GMX standards, the translation industry will benefit from verifiable and well-defined metrics applied to text documents.
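A small Python sketch shows why word counts for the same text can differ. Neither method below is the GMX counting method; both are simple, assumed counting conventions chosen only to show how much the result depends on the definition of a "word".

```python
import re

# Illustrative sentence with hyphens, an apostrophe, a number and a slash.
text = "The state-of-the-art CAT tool doesn't count 1,500 words/hour."

# Method A: split on whitespace only.
count_whitespace = len(text.split())

# Method B: count runs of ASCII letters or digits, so hyphens, apostrophes,
# commas and slashes all act as separators.
count_alnum = len(re.findall(r"[A-Za-z0-9]+", text))

print(count_whitespace)  # 8
print(count_alnum)       # 14
```

With per-word pricing, the gap between 8 and 14 words for one sentence translates directly into diverging quotes, which is the problem a shared metric is meant to remove.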
Process standards
The growth of the translation market over the past decade has created a great need for translation service quality standards. On the one hand, demand is growing with the increasing volume of written information and the number of clients, most of whom are not familiar with concepts like localization, internationalization and globalization. On the other hand, the Internet is making it easier than ever to start and run a translation business with very little investment, so much so that many new and inexperienced providers are crowding the marketplace.
As a result, governments and other institutions, such as translation associations, have promoted the introduction of quality standards to formally describe all the steps involved in delivering a satisfactory translation service. Whatever the final product (a commercial contract, documentary subtitles, a product catalogue, a multilingual web page, etc.), standards benefit end customers by providing a framework within which they can weigh their experience with their language service provider against a recognized and unbiased criterion.
Since it is very hard to agree on a single definition and measurement of written translation quality, process standards mainly focus on the overall quality of a traditional full translation workflow, from the service request to the delivery of the output. In fact, they do not provide specific criteria for translation or project quality, as these are highly subjective. Rather, they set out parameters that LSPs should consider before starting a translation project (human resources, project analysis and quoting, customer’s specifications and communication), during its execution (terminology management, translation, editing, formatting, proofreading, and quality control) and after delivery (translation memory maintenance, feedback tracking).
Whereas many manufacturers and service suppliers have ISO 9001 as their main international certification, the translation industry’s best practices are defined in several quality standards, depending on the geographical location. In Europe, the EN 15038 standard aims at unifying the terminology of the translation business as well as defining best practices for the buyer-seller relationship. In North America, Canada’s Language Industry Association (AILIA) has contributed to the development of the National Standard for Translation Services CAN/CGSB-131.10-2008, adapted from Europe’s EN 15038. In the US, the American Translators Association (ATA) endorsed the ASTM F2575-06 Standard Guide for Quality Assurance in Translation.
Standardization has many enemies in the translation industry. One piece of evidence is that a telecom-oriented body like ETSI came forward to continue developing LISA’s former standard formats. The industry’s own stakeholders did not do so through their numerous organizations (EUATC, FIT, GALA, ELIA, etc.), because they appear to lack the will to agree upon a unified set of standards. In fact, the language industry has so far been passively subjected to evolving technology, instead of growing with it and leading its development. Moreover, there is a widespread lack of consistency because the initiative is left to a small group of organizations, and vendor lock-in practices are a threat to the freedom of the market.
However, standards already play, and will continue to play, a crucial role in the documentation and translation industry, and their progressive introduction is key to the evolution of translation technology. Although they do not yet cover all aspects of translation services, standards offer an accepted and acceptable framework for implementing better quality processes at all levels. Because they help improve overall translation management, all stakeholders, including software developers, LSPs and final customers, share the responsibility for endorsing standardization and enhancing interoperability.
In fact, many software developers are adopting industry standards as a template for designing compliant programs. As a result, LSPs of all sizes can free themselves from the restrictions of commercial software. Furthermore, by applying process standards, large LSPs can select their translation vendors based on their real qualifications, since having purchased a certain software licence should not be a condition for hiring a language provider.
More importantly, buyers can benefit from improved transparency in the translation market, regardless of the role that each party plays, and from better communication with LSPs when drawing up flawless specifications. In other words, standards help customers get the best translation service because they promote competition and make it easier for customers to choose the LSP that best fits their needs. Customers should therefore commit to hiring suppliers who follow the standards.