How does the system prepare the text for Machine Translation processes?


Sometimes my files contain many tags, and the result of the translation generated by my machine translation provider is not great. What can I do to improve the results?


Markup handling and placement is a complex topic. Results are best when the volume of markup is small and the markup is generic and HTML-based.

In Wordbee Translator, once a document is added to a project and marked for translation, the text gets extracted from the file using the rules defined in the file format configuration. This set of rules relies on RegEx and other segmentation mechanisms. This process parses the file to get all text that requires translation and creates a structure of translation units called segments.

As a result:

  • Each segment is a machine-readable string that any engine can process. Each string contains the source-language text and its markup. The markup defined by the extraction rules can be of different types (custom, HTML-compatible), so the form of the string depends on the file format configuration.

  • Because extraction rules can define markup in many different ways, the system needs to prepare the extracted string further to make it compatible with machine-related processes that happen outside Wordbee Translator. The way the text and the markup are generated has an impact on all such processes.
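
To make the extraction step more concrete, here is a minimal Python sketch of how a RegEx-based rule could turn file content into segments with markup. It is not Wordbee Translator's implementation: the `MARKUP_RULE` and `SENTENCE_BREAK` patterns, the `Segment` class, and the `extract_segments` function are all hypothetical names and rules chosen for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical extraction rule: placeholders such as {1} or %name% count as markup.
MARKUP_RULE = re.compile(r"\{\d+\}|%\w+%")
# Hypothetical segmentation rule: break after sentence-ending punctuation.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


@dataclass
class Segment:
    text: str     # source-language text, markup included
    markup: list  # markup items found by the rule, in order of appearance


def extract_segments(content):
    """Split content into segments and record the markup each one carries."""
    segments = []
    for piece in SENTENCE_BREAK.split(content.strip()):
        if piece:
            segments.append(Segment(piece, MARKUP_RULE.findall(piece)))
    return segments


segments = extract_segments("Hello %user%. You have {1} new messages!")
for segment in segments:
    print(segment.text, segment.markup)
```

With a different rule set, the same sentence would yield different markup, which is why the string handed to an MT engine depends so strongly on the file format configuration.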

Processing of segments (before and after machine translation)

When the text parsed using a file format configuration is prepared for machine translation, the following processing is applied to the segments:

  1. The source-language text of the segment and its markup are prepared further to maximize the chance that the MT provider returns the content intact. To that end, the markup in the string is converted into generic HTML markup to make it machine compatible.

  2. The converted string is sent to the MT provider selected in the MT profile configuration.

  3. Once the MT provider returns its output, Wordbee Translator verifies that the markup in the output is valid with respect to the initial MT request. It checks that:

    1. all markup was returned

    2. the markup was placed correctly
      Wordbee Translator has several mechanisms that allow you to "roughly" fix any major markup issues. They aim to keep translations accurate and to prevent problems when the file is rebuilt with all translations, such as a file that is no longer readable.

  4. Finally, once the machine translation output is available and validated, the system converts the HTML-based markup back to the form initially parsed in Wordbee Translator.
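
The four steps above can be sketched in Python as a round trip. This is only an illustration under stated assumptions, not the platform's actual logic: the custom markup pattern, the `<span id="N"></span>` stand-in for "generic HTML markup", and the function names `to_html`, `validate`, and `from_html` are all hypothetical. The order check in `validate` is a crude stand-in for "correctly placed", since a correct translation may legitimately reorder markup.

```python
import re

# Hypothetical custom markup produced by extraction rules: {1}, %name%.
MARKUP_RULE = re.compile(r"\{\d+\}|%\w+%")
# Hypothetical generic HTML stand-in sent to the MT provider.
HTML_TAG = re.compile(r'<span id="(\d+)"></span>')


def to_html(source):
    """Step 1: replace each custom markup item with a numbered HTML tag."""
    originals = []

    def swap(match):
        originals.append(match.group(0))
        return f'<span id="{len(originals) - 1}"></span>'

    return MARKUP_RULE.sub(swap, source), originals


def validate(mt_output, originals):
    """Step 3: report (all markup returned, original order kept)."""
    ids = [int(i) for i in HTML_TAG.findall(mt_output)]
    all_returned = sorted(ids) == list(range(len(originals)))
    order_kept = ids == sorted(ids)  # crude proxy for placement
    return all_returned, order_kept


def from_html(mt_output, originals):
    """Step 4: convert the HTML tags back to the original markup."""

    def restore(match):
        index = int(match.group(1))
        # Tags the request never contained are dropped in this sketch.
        return originals[index] if index < len(originals) else ""

    return HTML_TAG.sub(restore, mt_output)


html, originals = to_html("Hello %user%, you have {1} new messages.")
# Step 2 would send `html` to the MT provider; the reply is simulated here.
reply = 'Bonjour <span id="0"></span>, vous avez <span id="1"></span> nouveaux messages.'
print(validate(reply, originals))
print(from_html(reply, originals))
```

Numbering the tags is a design choice of this sketch: it lets the return conversion restore the right markup even if the engine moves tags around.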


If the selected MT profile sends the text to Microsoft MT, for example, then once the MT engine has provided the translations, the system needs to convert the markup returned by Microsoft back to what it was when the file was marked for online translation. The translation provided by Microsoft is processed further to convert and place the markup accordingly. If things went well, there is little to no difference between the two sets of markup.

In a nutshell
Your RegEx markup goes to Microsoft and back to the Wordbee Translator platform. As a result, the platform sometimes has to "fix" the markup. Unfortunately, fixing is never perfect, so you might occasionally see incorrectly placed markup.
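
To illustrate why such fixing can only be approximate, here is a small Python sketch of one conceivable "rough fix". It is purely an assumption for illustration and does not describe Wordbee Translator's actual mechanisms: it supposes markup was sent as numbered tags of the form `<span id="N"></span>` (a hypothetical convention), drops tags that were never sent or that appear twice, and appends any tag the engine lost to the end of the segment.

```python
import re

# Hypothetical generic HTML tag used for the MT request.
HTML_TAG = re.compile(r'<span id="(\d+)"></span>')


def rough_fix(mt_output, expected_count):
    """Drop unknown or duplicated tags, then append the missing ones."""
    seen = set()

    def keep_first_valid(match):
        index = int(match.group(1))
        if index >= expected_count or index in seen:
            return ""  # tag was never sent, or is a repeat
        seen.add(index)
        return match.group(0)

    cleaned = HTML_TAG.sub(keep_first_valid, mt_output)
    missing = [i for i in range(expected_count) if i not in seen]
    # The end of the segment is only a guess at where lost markup belongs.
    return cleaned + "".join(f'<span id="{i}"></span>' for i in missing)


print(rough_fix('Bonjour <span id="0"></span> <span id="7"></span>', 2))
```

The repaired segment contains every expected tag exactly once, so the file can be rebuilt, but the appended tag may not sit where a human translator would have placed it.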



Copyright Wordbee - Buzzin' Outside the Box since 2008