Translating a game while keeping the dubbing in mind

Or, to be more specific, “translating a game while keeping the mouth movement of the original voice acting in mind”


Unsurprisingly, this is a topic that isn’t very well known. Bear in mind that TV shows and movies also do something similar, but since I’ve never worked in those industries, I’ll keep the focus on video games.

So, here’s the thing: no matter a game’s budget, nowadays it’s expected that the lips sync with the text whenever a game is voiced and features close-ups of characters talking. And, for the most part, it works fine in the original language or in English, because the mouth movement is animated with that language in mind, be it manually or with motion capture. But what happens when a game is translated and dubbed into other languages? Redoing all the mouth animations for every language isn’t financially feasible, so what should be done?

The answer usually is: leave it to the translators.

Let’s start by keeping in mind that no translation for dubbed content will ever be perfect; there are always trade-offs to be made. In video games, we usually have different “grades” for different contexts, signaling how much slack we have for those trade-offs. Most of them relate to the timing of the sentence, because a dub has to run the same length (or close to it) as the original.

Ok, so, the worst of all are the “sound sync” lines. They usually show up in very important scenes, with the character’s face shown up close (be it an in-engine rendered cutscene, an anime-style pre-drawn cutscene, etc.), so the duration of each sentence has to match the opening of the character’s mouth as closely as possible, including the pauses mid-sentence.

After that, we have the lines with a bit more wiggle room, usually indicated by a percentage range showing how much the sentence can vary in length. It’s something like “-/+ 15%”, for example, with some cases allowing up to 30%. These scenes, like the sound sync ones, also show the character’s face (and their moving mouth), but they either aren’t framed that close to the camera or aren’t meant to realistically simulate human mouth movement. Think of fighting games’ introductory lines, or 3D RPGs where the dialogue doesn’t zoom in on the characters’ faces (like Witcher, for example).

And then we have lines with no specific limit, which should still stay as close as possible to the original because they might be affected by environmental effects and the like, without the camera focusing on anyone’s face. These are mostly things like barks in action games.
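To make those grades a bit more concrete, here’s a minimal sketch of a length check in Python. The ±15% and up-to-30% tolerances come from the examples above; the 5% sound-sync figure, the character-count duration estimate, and all the names are my own assumptions for illustration, not how any studio actually measures this (in practice it’s mostly done by ear and with the audio tooling).

```python
from enum import Enum

class SyncGrade(Enum):
    SOUND_SYNC = 0.05   # assumption: close-ups, duration must match almost exactly
    TIMED = 0.15        # the "-/+ 15%" lines described above
    LOOSE = 0.30        # barks and off-camera lines: just stay reasonably close

def estimated_duration(text: str, seconds_per_char: float = 0.07) -> float:
    """Very rough duration estimate based on character count (an assumption)."""
    return len(text) * seconds_per_char

def fits(source: str, target: str, grade: SyncGrade) -> bool:
    """Check whether the translated line stays within the grade's slack."""
    src = estimated_duration(source)
    tgt = estimated_duration(target)
    return abs(tgt - src) / src <= grade.value

# A target line that runs noticeably longer than the source fails the -/+ 15% check
print(fits("Get behind cover!", "Protejam-se atrás da cobertura!", SyncGrade.TIMED))  # False
```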

Through all of this, we have to cut words, simplify expressions, move things around in a dialogue, and do our best to adapt jokes, all to make sure the text is intelligible and fits the time limits imposed by each scene, so that, when dubbed, it makes sense and doesn’t get cut off or sound weird.

Ok then, that seems hard enough, but doable. Now comes the worst part: in scenes with close-ups of faces, the phonemes of both languages, source and target, should match as much as possible, keeping up the illusion that the scene was made for the language it’s being translated into. It’s… a very unfun process. Sometimes these lines are called “lip syncs”, and it should be clear why.

Bear in mind: we don’t have to match every phoneme (that would be impossible), but there are a few “high-priority” ones that should match as much as possible, because otherwise “the mouth doesn’t follow the sound”. We’ve all seen some bad dubbing in B movies or shows by now, so I think it’s clear what I mean here.

So, bilabial consonants are the biggest priority, because the mouth visibly closes and opens with them. In English (and Portuguese), these are [p], [m], and [b]. Then we have the labiodental consonants, [f] and [v], which we don’t need to match perfectly, although the animation plus text feels better when we do. After that, things get hard to distinguish unless you read lips, so we can usually use almost anything. On the vowel front, we also do our best to match the vowel openings, especially in short words, but given how phonemes combine to make up words, it’s not always possible.
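Just to illustrate that priority order (and not as a real localization tool), here’s a tiny sketch that counts the high-priority consonant groups in a source line and a candidate translation. The letter-counting heuristic, the example sentences, and the function names are all my own assumptions; real checks work on phonemes (actual sounds), not spelling.

```python
# High-priority groups from the paragraph above.
BILABIAL = {"p", "m", "b"}      # the mouth visibly closes: highest priority
LABIODENTAL = {"f", "v"}        # lip touches teeth: nice to match, not mandatory

def count_group(text: str, group: set) -> int:
    """Count how many characters of the line fall in the given consonant group."""
    return sum(1 for ch in text.lower() if ch in group)

def report(source: str, target: str) -> None:
    """Print a quick source-vs-target comparison for each priority group."""
    for name, group in (("bilabial", BILABIAL), ("labiodental", LABIODENTAL)):
        print(f"{name}: source={count_group(source, group)} "
              f"target={count_group(target, group)}")

# An English line vs. a Portuguese candidate translation
report("My problem is bigger than yours.", "Meu problema é bem maior que o seu.")
```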

While it’s us, the translators, who take care of the text, the voice actors also have to do their part to match things up: using the right intonation, extending vowels when needed, and making smart use of prosody to tie everything together.

I’m not gonna lie, it’s a very tiring process, and it’s very easy to slip up and ship either a bad translation or a bad “mouth sync” between voice and animation. There is some software out there, trained with neural networks (one of the fairest uses of “AI”, I would say), that can match phonemes with mouth movement and can be used for multiple languages without consuming too much of the devs’ time or resources. It’s awesome, and it completely frees the translators to focus on the text itself, forgoing all the phoneme matching and leaving it to the machine to handle while rendering the scene. It’s not as good as people manually editing the movement, but it’s not horrible either, and plenty of games already use it. I really, really hope this tech catches on because, despite all the work and extra time it takes, translators aren’t paid extra to do the time/phoneme matching.
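For the curious, the core idea behind those tools is roughly a phoneme-to-viseme mapping: each phoneme detected in the dubbed audio gets mapped to a mouth shape the engine can blend per frame. The toy table, shape names, and function below are made up purely for illustration and don’t come from any specific product.

```python
# A toy phoneme-to-viseme table: each phoneme in the dubbed audio maps to a
# mouth shape ("viseme") that the engine can blend per frame.
VISEMES = {
    "p": "closed", "b": "closed", "m": "closed",    # bilabials: lips shut
    "f": "teeth_on_lip", "v": "teeth_on_lip",       # labiodentals
    "a": "open_wide", "o": "rounded", "i": "spread",
}

def to_visemes(phonemes: list[str]) -> list[str]:
    """Map a phoneme sequence to mouth shapes, defaulting to a neutral pose."""
    return [VISEMES.get(p, "neutral") for p in phonemes]

print(to_visemes(["m", "a", "p", "o"]))  # ['closed', 'open_wide', 'closed', 'rounded']
```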

Now, one last thing that might be relevant: all of this is done without access to the scenes themselves, since the game is usually still in production and the scenes don’t exist yet. If we’re lucky, we get some mockups, but most of the time even the English voice acting isn’t done yet, so we work blindly, just “trusting the process”.

If you’ve ever played a game dubbed into a language other than English or its native language, and the dubbing was killer, take some time to praise not only the voice actors but also the translation team.
