Machine translation, internationalization, and the Open Review Toolkit

The Open Review Toolkit is designed to help create better books, higher sales, and increased access to knowledge. All of these goals, especially increased access to knowledge, could be advanced if all books could be published in all languages simultaneously. Unfortunately, that's not possible yet. But machine translation can help us move in the right direction, and so the Open Review Toolkit has excellent support for hosting the same manuscript in many languages. In this blog post, I'll describe our experience machine translating my new book Bit by Bit: Social Research in the Digital Age into more than 100 languages. The process took just a few days and cost about 30 US dollars per language.

Bit by Bit is for social scientists who want to do more data science, data scientists who want to do more social science, and anyone interested in the hybrid of these two fields. I wrote Bit by Bit in English because that's my native language. While it is probably true that much of the target audience for Bit by Bit reads English, the world is a really, really big place. There are almost certainly a lot of people who would like to read Bit by Bit in their native language. So, in order to increase access, we used Google Translate to machine translate Bit by Bit into more than 100 languages. They are:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bosnian, Bulgarian, Catalan, Cebuano, Chichewa, Chinese (Simplified), Chinese (Traditional), Corsican, Croatian, Czech, Danish, Dutch, Esperanto, Estonian, Filipino, Finnish, French, Frisian, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Myanmar (Burmese), Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scots Gaelic, Serbian, Sesotho, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Xhosa, Yiddish, Yoruba, Zulu

One may wonder about the quality of these translations. To be honest, as far as I can tell, they are not nearly as good as a human translation. But I expect that machine translation will continue to improve, and the cost will likely continue to decrease. Right now the cost is about 30 US dollars to translate the entire book from English to another language. So, for just 150 US dollars, an author of a book manuscript in English could translate it into the five other official languages of the UN (Arabic, Chinese, French, Russian, and Spanish), which would make it accessible to about 3 billion readers (based on estimates published in Ethnologue). Given the likely increase in quality, decrease in cost, and the large number of non-English speakers in the world, it is hard for me to imagine that machine translation will not play some role in academic publishing in the future.

The Open Review Toolkit has a lot of built-in support for doing machine translations and hosting them. Here's how the plumbing works right now, as described by Luke Baker, the lead developer of the Open Review Toolkit.

The process starts with the English version of the book in HTML format. This is one large HTML file and the output of pandoc. From there we do the following:

1. Add “notranslate” classes to particular sections in the HTML that we don’t want Google to translate (e.g., mathematical equations).
2. Next we send chunks of that large HTML file to the Google Translate API. Chunking works around the per-request size limits enforced by the Google Translate API. Typically, we're translating a paragraph at a time.
3. Finally, we merge all the chunks into a single HTML file for the new language. This file then goes through the regular Open Review Toolkit process to convert it to web pages.
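The three steps above can be sketched in Python. This is a simplified illustration, not the toolkit's actual code: the regex-based markup and chunking are stand-ins for real HTML processing, and `translate_chunk` is a placeholder for the Google Translate API call so the example runs without credentials.

```python
import re

def mark_notranslate(html):
    # Step 1: wrap content Google should leave alone (here, inline math
    # delimited by \( ... \)) in a class="notranslate" span.
    return re.sub(r'(\\\(.*?\\\))',
                  r'<span class="notranslate">\1</span>', html)

def chunk_paragraphs(html):
    # Step 2 (part one): split the one large HTML file into
    # paragraph-sized chunks that stay under per-request limits.
    return re.findall(r'<p>.*?</p>', html, flags=re.DOTALL)

def translate_chunk(chunk, target):
    # Step 2 (part two): placeholder for the API call; here we just
    # tag each chunk so the pipeline is runnable end to end.
    return f'<!-- {target} -->{chunk}'

def translate_book(html, target):
    # Step 3: translate every chunk and merge the results back into
    # a single HTML file for the new language.
    marked = mark_notranslate(html)
    chunks = chunk_paragraphs(marked)
    return '\n'.join(translate_chunk(c, target) for c in chunks)

book = '<p>Hello world.</p><p>The mean is \\(\\bar{x}\\).</p>'
print(translate_book(book, 'fr'))
```

The merged output is one HTML file per language, which then goes through the normal Open Review Toolkit build to become web pages.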

When dealing with the Google Translate API, we cache responses to the various chunks so that we can skip translating a chunk if we already have a translation for it. There are also a variety of quotas in place when using the Google Translate API. Occasionally, we see errors if we translate too many characters in a short amount of time; the script sleeps 30 seconds before trying the failed chunk again. Finally, there's a per-day quota that we deal with manually, by estimating how many languages we can translate per day and stopping after that.
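The caching and retry behavior described above can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: the in-memory `CACHE` dictionary and the `translate` callable are placeholders (the real script persists its cache and calls the Google Translate API).

```python
import hashlib
import time

# Placeholder cache: chunk hash -> translated text. The real script
# would persist this so reruns skip already-translated chunks.
CACHE = {}

def translate_with_retry(chunk, target, translate, max_retries=3):
    # Reuse a cached translation if this exact chunk was seen before.
    key = hashlib.sha256(f'{target}:{chunk}'.encode()).hexdigest()
    if key in CACHE:
        return CACHE[key]
    for attempt in range(max_retries):
        try:
            result = translate(chunk, target)
        except RuntimeError:
            # Rate-limit error (e.g. too many characters too fast):
            # sleep 30 seconds, then retry the failed chunk.
            time.sleep(30)
            continue
        CACHE[key] = result
        return result
    raise RuntimeError('chunk failed after retries')

# Hypothetical stand-in for the API, for demonstration only.
fake_api = lambda chunk, target: f'[{target}] {chunk}'
print(translate_with_retry('<p>Hello.</p>', 'es', fake_api))
```

Because the cache is keyed on the chunk's content, re-running the script after a quota failure only sends the chunks that have not yet been translated.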

This process may sound a bit clunky, but as Google Translate and the Open Review Toolkit continue to improve, I expect that more and more books will use machine translation to reach an even larger audience.
