Wednesday, January 18, 2017

Objective Assessment of Machine Translation Technologies

Here are some comments by John Tinsley, CEO of Iconic Translation Machines, in response to the Lilt evaluation, CSA's comments on that evaluation, and my last post on the variety of problems with quick competitive quality evaluations.

He makes the point that the best MT engines are tuned to a specific business purpose very carefully and deliberately, e.g. the MT systems at eBay and all the IT-domain knowledge bases translated by MT. None of them would do well in an instant competitive evaluation like the one Lilt did, but they are all very high-value systems at a business level. Conversely, I think it is likely that Lilt would not do well in the type of MT use-case scenario that Iconic specializes in, since Lilt is optimized for use cases where active and ongoing PEMT is involved (namely, typical localization).

These comments describe yet another problem with a competitive evaluation of the kind done by LiltLabs. 

John explains this very clearly below, and his statements hold true for others who provide deep, expertise-based customization like tauyou, SYSTRAN, and SDL. However, it is possible that the Lilt evaluation approach could be valid for instant Moses systems and for comparisons to raw generic systems. I thought that these statements were interesting enough to warrant a separate post.

Emphasis below is all mine.



The initiative by Lilt, the post by CSA, and the response from Kirti all serve to shine further light on a challenge we have in the industry that, despite the best efforts of the best minds, is very difficult to overcome. Similar efforts were proposed in the past at a number of TAUS events, and benchmarking continues to be a goal of the DQF (though not just of MT).

The challenge is in making an apples to apples comparison. MT systems put forward for such comparative evaluations are generally trying to cover a very broad type of content (which is what the likes of Google and Microsoft excel at). While most MT providers have such systems, they rarely represent their best offering or full technical capability.

For instance, at Iconic, we have generic engines and domain-specific engines for various language combinations and, on any given test set, these may or may not outperform another system. I certainly would not want our technology judged on this basis, though!

From our perspective, these engines are just foundations upon which we build production-quality engines.

We have a very clear picture internally of how our value-add is extracted when we customise engines for a specific client, use case, and/or content type. This is when MT technology, in general, is most effective. However, the only way these customisations actually get done is through client engagements, and the resulting systems are typically either proprietary or too specific to a particular purpose to be useful for anyone else.

Therefore, the best examples of exceptional technology performance we have are not ones we can put forward in the public domain for the purpose of openness and transparency, however desirable that may be.

I've been saying for a while now that providing MT is a mix of cutting-edge technology, and the expertise and capability to enhance performance. In an ideal world, we will automate the capability to enhance performance as much as possible (which is what Lilt are doing for the post-editing use case) but the reality is that right now, comparative benchmarking is just evaluating the former and not the whole package.

This is why you won't see companies investing in MT technology on the basis of public comparisons just yet.




Tuesday, January 17, 2017

The Trouble With Competitive MT Output Quality Evaluations

The comparative measurement and quality assessment of the output of different MT systems has always been difficult to do right. Right, in this context, means fair, reasonable, and accurate. The difficulty is closely related to the problems of measuring translation quality in general, which we discussed in this post. It is further aggravated when evaluating customized and/or adapted systems, since doing this requires special skills and real knowledge of each MT platform, in addition to time and money. The costs associated with doing this properly make it somewhat prohibitive.

BLEU is the established measurement of choice that we all use, but with it, it is easy to deceive yourself, deceive others, and paint a picture that has the patina of scientific rigor yet is completely biased and misleading. BLEU, as we know, is deeply flawed, but we don't have anything better, especially for longitudinal studies, and if you use it carefully it can provide some limited insight in a comparative evaluation.
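To make the discussion concrete, here is a minimal sketch (standard-library Python only) of what BLEU actually computes: modified n-gram precision clipped against a reference, combined with a brevity penalty. Real evaluations should use a standard implementation such as sacrebleu; this smoothed, single-reference toy version is only meant to show why the metric rewards surface overlap with the reference rather than translation quality as such.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all word n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference: geometric mean of
    modified (clipped) n-gram precisions with add-one smoothing, times a
    brevity penalty that punishes short candidates."""
    cand, ref = candidate.split(), reference.split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_grams, ref_grams = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference,
        # so repeating a matching word cannot inflate the score.
        overlap = sum(min(c, ref_grams[g]) for g, c in cand_grams.items())
        total = max(sum(cand_grams.values()), 1)
        log_precision_sum += math.log((overlap + 1) / (total + 1))  # smoothed
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_precision_sum / max_n)
```

An identical candidate scores 1.0, and any surface divergence from the one reference lowers the score, even when the divergent wording is an equally valid translation. That is the core weakness being discussed here.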

In the days of the NIST competitive evaluations, the focus was on Chinese and Arabic to English (news domain), and there were some clear and well-understood rules on how this should be done to enable fair competitive comparisons. Google was often a winner (i.e. highest BLEU score), but they sometimes "won" by using systems that took an hour to translate a single sentence, because they evaluated 1000X as many translation candidate options as their competitors to produce their best one. Kind of bullshit, right? More recently, we have the WMT16 evaluation, which attempts to go beyond the news domain, does more human evaluations, evaluates PEMT scenarios, and again controls the training data used by participants to attempt to fairly assess the competitors. Both of these structured evaluation initiatives provide useful information if you understand the data, the evaluation process, and the potential bias, but both are also flawed in many ways, especially in the quality and consistency of human evaluations.

One big problem for any MT vendor doing output quality comparisons with Google is that, for a test to be meaningful, it has to be on something that the Google MT system does not already have in its knowledge database (training set). Google crawls news sites extensively (and the internet in general) for bilingual text (TM) data, so the odds of finding data they have not seen are very low. If you give a college student all the questions and answers that are on a test before they take the test, the probability is high that they will do well on that test. This is why Google generally scores better on news-domain tests against most other MT vendors, as they likely have 10X to 1,000X the news data that anybody except Microsoft and Baidu has. I have also seen MT vendors and ignorant LSPs show off unnaturally high BLEU scores by having an overlap between the training and test set data. The excitement dies quickly once you get to actual data you want to translate, that the system has not seen before.

Thus, when an MT technology vendor tells us that they want to create a lab to address the lack of independent and objective information on quality and performance, and create a place where “research and language professionals meet,” one should be at least a little bit skeptical, because there is a conflict of interest here, as Don DePalma pointed out. But, after seeing the first "fair and balanced" evaluation from the labs, I think it might not be over-reaching to say that this effort is neither fair nor balanced, except in the way that Fox News is. At the very least, we have gross self-interest pretending to be in the public interest, just as we now have with Trump in Washington D.C. But sadly, in this case, they actually point out that, even with customization/adaptation, Google NMT outperforms all the competitive MT alternatives, including their own. This is like shooting yourself in the hand and foot at the same time with a single bullet!

A List of Specific Criticisms

Those who read my blog regularly know that I regard the Lilt technology favorably, and see it as a meaningful MT technology advance, especially for the business translation industry. The people at Lilt seem to be nice, smart, competent people, and thus this "study" is surprising. Is this deliberately disingenuous, or did they just get really bad marketing advice to do what they did here?

Here is a listing of specific problems that would be clear to any observer who did a careful review of this study and its protocol.

Seriously, This is Not Blind Data.

The probability of this specific data being truly blind data is very low. The problem with ANY publicly available data is that it has a very high likelihood of having been used as training data by Google, Microsoft, and others. This is especially true for data that has been around as long as the SwissAdmin corpus has. Many of the tests typically used to determine if the data has been used previously are unreliable, as the data may have been used partially, or only in the language model. As Lilt says: "Anything that can be scraped from the web will eventually find its way into some of the (public) systems," and any of the things I listed above happening will compromise the study. If this data, or something very similar, is being used by the big public systems, it will skew the results and lead to erroneous conclusions. How can Lilt assert with any confidence that this data was not used by others, especially Google? If Lilt was able to find this data, why would Google or Microsoft not be able to as well, especially since the SwissAdmin corpus is described in detail in this LREC 2014 paper?
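One sanity check an evaluator could run is a rough contamination test: look for long word n-grams that the candidate test set shares with any corpus the public engines could have crawled, since long n-grams rarely repeat by chance. The sketch below is a simplified illustration of the idea (the function names and example sentences are hypothetical); note that, for the reasons given above, such a check can only flag data that clearly is not blind, it cannot prove that data is.

```python
def ngram_set(sentences, n=8):
    """Collect all word n-grams of length n in a corpus. Long n-grams are
    unlikely to recur by chance, so shared ones suggest shared provenance."""
    grams = set()
    for s in sentences:
        toks = s.lower().split()
        grams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return grams

def contamination_rate(test_sentences, reference_corpus, n=8):
    """Fraction of test sentences containing at least one long n-gram that
    also appears in the (potentially crawled) reference corpus."""
    known = ngram_set(reference_corpus, n)
    def hit(sentence):
        toks = sentence.lower().split()
        return any(tuple(toks[i:i + n]) in known
                   for i in range(len(toks) - n + 1))
    return sum(hit(s) for s in test_sentences) / max(len(test_sentences), 1)
```

A high rate against any public crawl would be a strong signal that the "blind" test set has leaked; a low rate proves little, since, as noted above, the data may have been used only partially or only in a language model.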

Quality Evaluation Scoring Inconsistencies: Apples vs. Oranges

  • The SDL results and test procedure seem particularly unfair and biased. Lilt state that, "Due to the amount of manual labor required, it was infeasible for us to evaluate an “SDL Interactive” in which the system adapts incrementally to corrected translations." However, this infeasibility does not seem to prevent them from giving SDL a low BLEU score. The "adaptation" that was conducted was done in a way that SDL does not recommend for best results, thus publishing such sub-optimal results is rude and unsportsmanlike conduct. Would it not be more reasonable to say it was not possible, and leave it blank?
  • Microsoft and Google released their NMT systems on the same day, November 15, 2016. (Click on the links to see). But Lilt chose to only use the Google NMT in their evaluation.
  • SYSTRAN has been updating their PNMT engines on a very regular basis and it is quite possible that the engine tested was not the most current or best-performing one. At this point in time, they are still focusing on improving throughput performance, and this means that lower quality engines may be used for random, free, public access for fast throughput reasons. 
  • Neither SYSTRAN nor SDL seems to have benefited from the adaptation, which is very suspicious. Should they not be given an opportunity to show this adaptation improvement as well?
  • Finally, one wonders how the “Lilt Interactive” score is processed. How many sentences have been reviewed to provide feedback to the engine? I am sure Lilt took great care to put their own best systems forward, but they also seemed to have been less careful and even seem to have executed sub-optimal procedures with all the others, especially SDL. So how can we trust the scores they come up with?

Customization Irregularities

This is still basically news or very similar-to-news domain content. After making a big deal about using content that "is representative of typical paid translation work," they basically chose data that is heavily news-like. Press releases are very news-like, and my review of some of the data suggests it also looks a lot like EU data, which is also in the training sets of public systems. News content is the default domain that public systems like Google and Microsoft are optimized for, and it is also a primary focus of the WMT systems. And for those who scour the web for training data, this domain has by far the greatest amount of relevant publicly available data. However, in the business translation world, which was supposedly the focus here, most domains that are relevant for customization are exactly UNLIKE the news domain. The precise reason enterprises need to develop customized MT solutions is because their language and vocabulary are different from what public systems tend to do well (namely news). The business translation world tends to focus on areas where there is very little public data to harvest, either due to domain specificity (medical, automotive, engineering, legal, eCommerce, etc.) or due to company-specific terminology. So, basically, testing on news-like content does not say anything meaningful about the value of customization in a non-news domain. What it does say is that public generic systems do very well on news, which we already knew from years of WMT evaluations, which were done with much more experimental rigor and more equitable evaluation conditions.

Secondly, the directionality of the content matters a lot. In “real life”, a global enterprise generates content in a source language, where it is usually created from scratch by native speakers of that language, and needs it translated into one or more target languages. Therefore, this is the kind of source data we should test if we are trying to recreate the localization market scenario. Unfortunately, this study does NOT do that (and, to be fair, this problem infects WMT and pretty much the whole academic field; I don’t mean to pick on Lilt!). The test data here started out as native Swiss German, and was then translated into English and French. In the actual test conducted, it was evaluated in the English⇒French and English⇒German directions. This means that the source input text was obtained from (human) translations, NOT native text. This matters. Microsoft and others have done many evaluations to show this. Even good human translations are quite different from true native content. In the case of English⇒French, both the source and the reference are translated content.

There is also the issue of questionable procedural methodology when working with competitive products. From everything I gathered in my recent conversations with SDL, it is clear that adaptation by importing some TM into Trados is a sub-optimal way to customize an MT engine in their current product architecture. It is even worse when you try to jam a chunk of TM into their adaptive MT system, as Lilt also admitted. One should expect very different, and sub-optimal, outcomes from this kind of effort, since the technology is designed to be used in an interactive mode for best results. I am also aware that most customization efforts with phrase-based SMT involve a refinement process, sometimes called hill-climbing. Just throwing some data in, taking a BLEU snapshot, and then concluding that this is a representative outcome for that platform is just wrong and misleading. Most serious customization efforts require days of effort at least, if not weeks, to complete prior to a production release.

Another problem when using human-translated content as source or reference is that, in today’s world, many human translators start with a Google MT backbone and post-edit. Sometimes the post-edit is very light. This holds true whether you crowd-source, use a low-cost provider such as Unbabel (which explicitly specifies that they use Google as a backbone), or a full-service provider (which may not admit this, but that is what their contract translators are doing, with or without their permission). The only way to get a 100% from-scratch translation is to physically lock the translator in an internet-free room! We already know from multi-reference data sets that there are many equally valid ways to translate a text. When the “human” reference is edited based on Google, the scores naturally favor Google output.

Finally, the fact that the source data starts as Swiss German rather than standard German may also be a minor problem. The differences between these German variants appear to be most pronounced in speech rather than in writing, but Schriftsprache (written Swiss German) does seem to have some differences from Standard High German. Wikipedia states: "Swiss German is intelligible to speakers of other Alemannic dialects, but poses greater difficulty in total comprehension to speakers of Standard German. Swiss German speakers on TV or in films are thus usually dubbed or subtitled if shown in Germany."

Possible Conclusions from the Study

All this suggests that it is rather difficult for any MT vendor to conduct a competitive evaluation in a manner that would be considered satisfactory and fair to, and by, other MT vendor competitors. However, the study does provide some useful information:

  • Do NOT use News domain or news-like domain if you want to understand what the quality implications are for "typical translation work".  
  • Google has very good generic systems, which are also likely to be much better with News domain than with other specialized corporate content.
  • Comparative quality studies sponsored by an individual MT vendor are very likely to have a definite bias, especially on comparing customized systems.
  • According to this study, if these results were indeed true, there would be little point in using anything other than Google NMT. However, it would be wrong to conclude that using Google would be better than properly using any of the customized options available, since, except for Lilt, we can presume they have not been optimally tuned. Lilt responded to my comment on this point, saying, "On slightly more repetitive and terminology-heavy domains we can usually observe larger improvements of more than 10% BLEU absolute by adaptation. In those cases, we expect that all adapted systems would outperform Google’s NMT."
  • Go to an independent agent (like me or TAUS) who has no vested interest other than getting accurate and meaningful results, which also means that everybody understands and trusts the study BEFORE they engage. A referee is necessary to ensure fair play in any competitive sport, as we all know from childhood.
  • It appears to me (only my interpretation and not a statement of fact) that Lilt's treatment of SDL was particularly unfair. In the stories of warring tribes in human literature, this is usually a sign that one is particularly fearful of an adversary. This intrigued me, so I did some exploration and found this patent, which was filed and published years BEFORE Lilt came into existence. The patent summary states: "The present technology relates generally to machine translation methodologies and systems, and more specifically, but not by way of limitation, to personalized machine translation via online adaptation, where translator feedback regarding machine translations may be intelligently evaluated and incorporated back into the translation methodology utilized by a machine translation system to improve and/or personalize the translations produced by the machine translation system." This clearly shows that SDL was thinking about Adaptive MT long before Lilt. And Microsoft was thinking about dynamic MT adaptation as far back as 2003. So who really came up with the basic idea of Adaptive MT technology? Not so easy to answer, is it?
  • Lilt has terrible sales and marketing advisors if they were not able to understand the negative ramifications of this "study", and did not try to adjust it or advise against publicizing it in its current form. For some of the people I talked to in my investigation, it even raises some credibility issues for the principals at Lilt.

I am happy to offer Lilt an unedited guest post on eMpTy Pages if they care to respond to this critique in some detail, rather than just through comments. In my eyes, they attempted to do something quite difficult and failed, which should not be condemned per se, but it should be acknowledged that the rankings they produced are not valid for "typical translation work". We should also acknowledge that the basic idea behind the study is useful to many, even if this particular study is questionable in many ways. I could also be wrong on some of my specific criticisms, and am willing to be educated, to ensure that my criticism in this post is also fair. There is only value to this kind of discourse if it furthers the overall science and understanding of this technology, and my intent here is to question experiment fundamentals and get to useful results, not to bash Lilt. It is good to see this kind of discussion beginning again, as it suggests that the MT marketplace is indeed evolving and maturing.


P.S. I have added the Iconic comments as a short separate post here to provide the perspective of MT vendors who perform deep, careful, system customization for their clients and who were not included directly in the evaluation.

Thursday, January 12, 2017

The Missed Opportunity of Translation Management Systems

Recently, I had been having a discussion with a TMS vendor about potential new directions for Translation Management Systems to evolve in, when Luigi Muzii, in an independent and unrelated interaction, suggested this new post based on some of his own observations about TMS systems. Based on this unplanned synchronicity, I felt it would be good to air this. I invite others, especially from the TMS vendor community, to come forward with their perhaps more informed views.

It has seemed to me that TMS systems in general focus largely on the translation management problems of yesteryear rather than current ones, let alone emerging needs, where there is a much greater need for intelligent data technology beyond translation memory. The data in most TMS systems sits in what I would call a dumb repository. The data is tagged by project and customer, but rarely by more extensive metadata that would make it valuable to leverage in future translation projects. No TMS today can help you extract, for example, all the TM related to hybrid automobiles in your total data repository, or identify key terminology for hybrid automobiles from the total repository. Look at what is possible with Linguee. Why are these TMS systems not able to do this, at least on your own data? Data could be catalogued at various levels of granularity by PMs who are not chasing errant freelancers down, so relevant chunks can be retrieved when needed. This is what I mean by intelligent data: data that understands the relative importance of phrases, some semantics, and even the relative overlap of ALL data resources with new project work. It should be able to respond to reasonable questions that involve a combination of words/phrases, as any supportive automation should. The kinds of tools we use will confine us to certain kinds of work (software and documentation translation); if you want to solve big translation problems like eBay, Microsoft, and Facebook are doing, you are going to need better tools.
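As a sketch of what such "intelligent data" might look like, here is a minimal, hypothetical TM repository in which each segment carries domain and topical metadata beyond project and customer, so that a query like "all TM about hybrid automobiles, across all clients" becomes a one-liner. All class and field names here are illustrative assumptions, not any vendor's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """One TM unit with metadata richer than just project/customer."""
    source: str
    target: str
    client: str
    domain: str                              # e.g. "automotive", "medical"
    tags: set = field(default_factory=set)   # e.g. {"hybrid", "powertrain"}

class Repository:
    """A toy TM store that can be queried by metadata, not just by project."""
    def __init__(self):
        self.segments = []

    def add(self, seg):
        self.segments.append(seg)

    def query(self, domain=None, tag=None):
        """Retrieve all TM segments matching a domain and/or topical tag,
        across every client and project in the repository."""
        return [s for s in self.segments
                if (domain is None or s.domain == domain)
                and (tag is None or tag in s.tags)]
```

The point is not the code, which is trivial, but the cataloguing discipline it presupposes: once segments carry this kind of metadata, "give me everything about hybrid automobiles" stops being a manual archaeology project.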

MT involves the use and management of large volumes of translation memory and other linguistically relevant data, and it would be very useful for TMS systems to provide linguistic data analysis, manipulation, and transformation capabilities. Many of these corpus tools already exist in open source, as Juan Rowda has pointed out, but it would be good, and of great value to any global enterprise and large agency users, to have these kinds of tools closely integrated so that they can easily interact with large-scale TM repositories. The future TMS needs to be much more useful to, and supportive of, large-scale translation projects that involve tens of millions of words, and both MT and post-editor feedback, as these kinds of projects will become much more prevalent. No TMS tools I know of would be helpful for normalization and standardization of MT training data, even though they keep all of the TM. It would also be very useful for next-gen systems to provide a sense of the relative similarity or dissimilarity of subsets of data.
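The normalization and standardization of MT training data mentioned above need not be exotic. A hedged, standard-library sketch of the basics (Unicode normalization, control-character and whitespace cleanup, and filtering of empty, duplicate, and badly length-mismatched pairs; the ratio threshold is illustrative, not a recommendation) might look like this:

```python
import re
import unicodedata

def normalize(text):
    """Normalize one side of a TM pair: Unicode NFC, control characters
    replaced by spaces, runs of whitespace collapsed."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch if unicodedata.category(ch)[0] != "C" else " "
                   for ch in text)
    return re.sub(r"\s+", " ", text).strip()

def clean_corpus(pairs, max_ratio=3.0):
    """Filter (source, target) pairs for MT training: drop empty,
    duplicate, and badly length-mismatched pairs (likely misalignments)."""
    seen, cleaned = set(), []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt or (src, tgt) in seen:
            continue
        ls, lt = len(src.split()), len(tgt.split())
        if max(ls, lt) / max(min(ls, lt), 1) > max_ratio:
            continue  # source and target lengths too far apart
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned
```

A TMS that already holds all the TM could run exactly this kind of pass over a repository before handing data to an MT training pipeline, which is the gap being pointed out here.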

It also seems somewhat archaic that TMS systems still need to break large TEP projects into smaller translation packages to hand off to individual freelancers, and do not have a more elegant form of collaborative translation where this disassembly and re-assembly is NOT necessary. SmartCAT is refreshing and interesting because it solves this very basic problem and is also free. The translation collaboration outlook is still pretty bleak, but I think better collaboration is one area that TMS vendors should really focus on and make more easily implementable.

I don't speak German, so I am not really sure what this says, but I really like the question being asked. 😁

What are the most important areas for translation management systems to develop in the future? Here is a list I came up with as potential areas for improvement, which is easy for me to dream up since I don't have to actually do it.
  • Better database management, so that systems can handle tens of millions of segments and can slice and dice these segments every which way as needed.
  • Data anonymization and randomization capabilities, so that TM can be leveraged for new kinds of translation problems without compromising data security or privacy.
  • NLP capabilities to help understand data characteristics at a corpus level and do the kinds of things that Linguee and Google allow you to do, e.g. predictive auto-completion of words and phrases.
  • Translation collaboration that reduces the burden on project managers and actually frees them from policing roles to handle policy and overall quality-driving processes.
  • Better integration with MT, so that corpora can be modified and prepared to raise the quality of results, and so that PEMT can be better integrated into advanced learning processes.
  • More flexible workflow design capabilities, so that every client is not forced into a one-size-fits-all mode of operation.
  • More automation of basic project management functions, so human involvement is focused only on special exception situations.
  • More measurement metrics on job efficiency, and expanded Key Performance Indicators by JobID, PM, language, client, etc.
The emphasis below in Luigi's post is all mine.

I have observed that people in the translation community, no matter how high they rank in their functional area or organization, are generally reactive rather than proactive. The latest financial events, where private equity is moving in (i.e. Moravia, ULG, and LIOX), are a further confirmation of this attitude, although the recent history of the industry should be telling enough for any unbiased reader.

This reactive attitude results in a situation in which industry outsiders are forced to introduce innovation into a field other than their core business, because of the inability of their translation business partners to do it. And thus we see that these outsiders end up driving the "translation industry" forward, and eventually the industry is ruled by customers rather than the original industry players.

While basic business and production processes have remained nearly unaltered for centuries, technological development has been brought in and implemented at the request, when not actually the directive, of the most affluent and tech-knowledgeable customers. In most cases, the most significant advancements, and indeed improvements, have been pioneered by those customers; technology providers have followed suit, rarely with striking results. In almost three decades, progress in translation has mostly consisted of trivial outcomes of the amelioration frenzy of software applications exploiting long-established technologies, with the claim, every time, of revolutionizing the industry and the profession.

An examination of the translation management systems (TMSs) on the market is quite representative of this behavior. TMSs are supposed to be the equivalent of workflow management systems (WFMSs), which date back to the early 1990s if you look at the broader business market. Curiously, or possibly not, TMSs generally lack most of the features of a typical, mature, full-fledged WFMS that can be found in virtually any business of some significance.

Lionbridge’s current CEO, Rory Cowan, summarized the, at best, disappointing NASDAQ experience of the company he has been leading for quite a few years with a statement that is as simple as it is sad: “The U.S. markets really do not appreciate the translation industry.”

As a matter of fact, none of the top-ranking translation companies is an innovation champion, nor is any of them remotely comparable, financially, technologically, structurally, or in operational sophistication, to even mid-level companies in most other service industries.

Typically, a WFMS is an application framework for building, managing, and automating, as much as possible, a set of tasks forming one or more business processes. Some tasks may require human intervention, such as the development of content and/or its approval. Also, the system must allow users to introduce new tasks or describe new processes, and it typically comes with some form of process flow designer to create new workflow processes and applications.

On this basis, a WFMS must allow users to define different workflows for different types of jobs or processes, associate an individual or group with any specific task at each stage in a workflow, and establish dependencies for process management from the top level down.

WFMSs should also allow for application integration to form a middleware framework.

Finally, a WFMS should also include a project management module to help plan, organize, and manage estimation and planning, scheduling, cost control and budget management, resource allocation, collaboration, communication, decision-making, quality management, and administration. Similarly, any TMS should enable translation businesses and customers to control their projects and monitor every task according to an established and agreed-upon workflow.
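The core of what is being described, user-defined tasks with assignees and dependencies, and a scheduler that knows which tasks are dispatchable at any moment, can be sketched in a few lines. This is a toy model to illustrate the concept, not any product's design; the task names and the TEP-style stages in the usage below are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One workflow step with an assignee and upstream dependencies."""
    name: str
    assignee: str
    depends_on: list = field(default_factory=list)
    done: bool = False

class Workflow:
    """A user-defined process: the kind of thing a WFMS's graphical
    process designer would let you declare instead of hard-coding."""
    def __init__(self, tasks):
        self.tasks = {t.name: t for t in tasks}

    def ready(self):
        """Tasks whose dependencies are all complete and which can be
        dispatched now, possibly to several people in parallel."""
        return [t.name for t in self.tasks.values()
                if not t.done and all(self.tasks[d].done for d in t.depends_on)]

    def complete(self, name):
        self.tasks[name].done = True
```

With a translate → edit → proof chain plus a DTP task that also hangs off the edit step, `ready()` yields only "translate" at the start, then "edit", and then "proof" and "dtp" together, which is exactly the dependency-driven dispatching and parallelism a mature WFMS provides and most TMSs do not.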

A TMS is supposed to allow for automating all, or at least most, of the translation process, while maximizing its efficiency by having repeatable and non-essential work done by software. However, most state-of-the-art TMSs provide only some basic translation management functions. The thing is that no two people in the translation community would agree on what exactly the creative part of the process is that can’t be left to machines. And sadly, no two linguists can even come to an agreement and describe and define the special skills and abilities that define and dignify them as knowledge professionals.

Contrary to what happens in almost every other type of business, no TMS yet exists with all the features described above. Specifically, all lack a comprehensive graphical workflow designer. Only one or two provide some very basic capabilities.

So, forget about Leibniz[*].

As a matter of fact, most players in this industry are generally oblivious to, or at least uncomfortable with, simple diagrams like activity diagrams and even flowcharts, which are graphical representations of workflows or processes. Although pertaining largely to the field of software engineering, modeling languages have long established this approach as a standard way to visually describe the business and operational step-by-step activities in a system. However, most translation people are still incapable of envisaging a slightly more articulate process than the typical middleman-based agency model.

If this were not bad enough, translation project management is far afield from real project management and, not surprisingly, only a small number of translation/localization project managers hold a PMI certification.

The supposed and long-touted peculiarity of translation, together with the utter rejection of any real business focus, despite the beloved acquaintance with rates and any relevant discussion related to them, is possibly a reason why TMSs are so estranged from WFMSs and ERP systems.

To allow users to always find the best solution and resource as fast as possible, a TMS should first be a central repository for language data and the vendor database. The aim of a TMS should be to help users cut middlemen out of the procurement process while enabling project managers to plan the best schedule and budget and pick the best resources. In this way, TMSs are also expected to reduce overhead and hence costs and time.

Therefore, any current TMS with a one-size-fits-all process model is problematic, to say the least. Customers should be able to design their own workflow and control where, when, and how much human involvement is requested or needed for differing project needs.

Indeed, TMSs might have a dramatic impact when collaborative translation is effectively implemented, through parallelized processes, but, for the same reasons why graphical workflow design features are almost absent from TMSs, real collaborative translation features are yet to come too.

Finally, most TMSs have already moved, or are being moved, into the cloud; however, while reducing capital expenditure and allowing for faster software deployment, the SaaS model presents some serious drawbacks in lower flexibility and higher customization costs. Of course, the features offered by TMSs depend on their level of sophistication, but the technical intricacies of such TMSs preclude their adoption by most LSPs, and even by most customers.

TMSs might have been the means to lead customers rather than letting them dictate, but the chance has been wasted. So far.

[*] It is unworthy of excellent men to lose hours like slaves in the labor of calculation which could safely be relegated to anyone else if machines were used.



Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002, in the translation and localization industry through his firm. He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization related work.

This link provides access to his other blog posts.