Faceposing in languages other than English

03-04-2008, 06:39 PM
I am trying to create lip synching using Faceposer, and I've run into an obscure problem.

From what I can see, to automatically extract phoneme information, Faceposer runs the sound file through the MS SAPI speech recognizer, which tells it exactly where each phoneme starts and ends; the text entered is mostly there to support the recognizer. From then on, the process of properly altering facial expressions is essentially a lookup table.

That obviously works fine for English, because the English speech recognizer is bundled with SAPI. It might work for Japanese and Chinese as well, for which Microsoft also offers recognizers on the same page, though I haven't tried, since I don't speak either of those languages.

However, I need to lip synch a language other than English, namely Russian. I could probably do it all manually, but with the amount of speech I have planned, the workload would be prohibitive. Yet I am certain this is somehow possible, because Episode 2 is dubbed completely into Russian, with proper lip synching, even though earlier games aren't. I doubt it was all done manually. (Portal is dubbed too, but I don't think there's a visible speaking mouth anywhere in that game...)

Unfortunately, from what I can see after hours of googling, a SAPI-compliant recognizer for Russian might not even exist. At least, I have not seen one on offer at any stated price, let alone for free.

So, how?

03-06-2008, 07:24 AM
So it screws up with Cyrillic text? Try transliterating it into phonetic English, although that isn't the most efficient method.

Out of curiosity, can you extract one of the Russian sound files and load it into Faceposer? Is the sentence text in English?

03-06-2008, 09:55 AM
Oh, you wish it would just screw it up. It first treats it as European accented characters in single-byte ISO-8859-1, which results in very cute gibberish... After studying the pieces of Faceposer-related code in the SDK, here's what actually happens when you press the redo extraction button:

1. It conjures up an instance of SAPI 5.1 speech recognizer for the language the Windows interface is written in, (:rolleyes:) though you can force it to select a specific language code with a command line switch. English is always the default and will get used if no more appropriate recognizer is found.
2. The recognizer, given the text that is assumed to be spoken as a guide, does its magic and spits out a list of phonemes and their locations. That list is then written into the wav file as an extra chunk marked VDAT, in a fairly simple plaintext format.
3. Faceposer reads this list, allows you to edit it and move the phonemes around, and records it back with emphasis data once you tell it to "commit".
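
To poke at that extra chunk outside Faceposer, something like the following might do. It is only a sketch: it walks the standard RIFF chunk layout (4-byte id, 4-byte little-endian size, payload padded to an even length) and returns the raw text of the first chunk marked VDAT. The function names are made up for illustration, and decoding the payload as Latin-1 is an assumption; nothing here interprets the plaintext format itself.

```python
import struct


def iter_riff_chunks(data: bytes):
    """Yield (chunk_id, payload) for each top-level chunk of a RIFF/WAVE file."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12  # skip "RIFF", the overall size field, and "WAVE"
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        start = pos + 8
        yield chunk_id, data[start:start + size]
        # RIFF chunks are word-aligned: odd-sized payloads get one pad byte.
        pos = start + size + (size & 1)


def read_vdat_text(data: bytes):
    """Return the VDAT payload as text, or None if the file has no such chunk.

    Assumption: the payload is treated as Latin-1 so that decoding never fails.
    """
    for chunk_id, payload in iter_riff_chunks(data):
        if chunk_id == b"VDAT":
            return payload.decode("latin-1")
    return None
```

Usage would be along the lines of read_vdat_text(open("line.wav", "rb").read()), where line.wav is whatever sound file you extracted.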

Basically, Faceposer contains no phoneme recognition technology whatsoever; it's all indeed done by Microsoft's recognizer. For a language for which you don't have a recognizer, your best bet is to transliterate it somehow and hope that the English recognizer can do its magic. Unfortunately, it doesn't even work with every English speaker, so transliteration is pretty much pointless -- I got no useful results this way at all. Russian is just too different.

There are actually two recognizers out there which claim to understand Russian; however, these are intended to be sold only to telecoms for ridiculous amounts of money, so they might as well not exist.

The Russian dub data files, surprisingly, contain no sentence text at all, just a [Textless] tag. :) In fact, it appears that they either placed all the phonemes manually or, more likely, used some external tool to write a VDAT chunk. The latter is hinted at by two things: the phonemes are not split between words like in English files, instead being bunched together in a single WORD record, and the first phoneme, instead of starting at the time when sound actually starts in the file, is always a <sil> phoneme (denoting a silence pause) that starts at 0.000 seconds from the beginning. I have a suspicion they might have made their own tool based on a SAPI 4.0 recognizer, which actually seems to exist (it's a longer story); however, I don't have the expertise required to replicate that just yet. Well, one could probably make a SAPI 4.0-based phonemeextractor.dll (or even a 5.3-based dll for those Vista users who have problems with automatic recognition), but nobody's giving me one, that's for sure...

I'm currently thinking of writing a script which would do preliminary phoneme splitting by converting text to phonemes with a large phonetic dictionary and, assuming all phonemes are of equal length and given pauses as special symbols in text, attaching them to reasonably appropriate places (and writing transliterated text into the file too). This way, I'll only have to move them around manually instead of fishing for each one all the time. Unfortunately, Valve's phoneme codes don't quite correspond to any phonetic transcription standard I'm familiar with ("ae" I can understand, and "r" is also obvious, but what does "r2" really mean?), and I'm not yet sure where they're getting them from; I'd have to study the SAPI documentation much deeper to figure it out.
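
A rough sketch of the timing half of that idea, under stated assumptions: every phoneme gets an equal share of the clip, a "|" token in the text marks a pause and becomes a <sil> slot of the same length, and the lexicon is a plain dict from word to phoneme codes. The function name, the "|" marker, and the sample lexicon entries are all placeholders, not real Valve codes or a real dictionary, and nothing here writes the result back into a wav file.

```python
def spread_phonemes(text, lexicon, duration, pause="|"):
    """Give every phoneme (and pause) an equal slice of `duration` seconds.

    Returns a list of (code, start, end) tuples. Words missing from the
    lexicon raise KeyError so that gaps in the dictionary are noticed early.
    """
    slots = []
    for token in text.split():
        if token == pause:
            slots.append("<sil>")  # silence phoneme, as seen in the dubbed files
        else:
            slots.extend(lexicon[token.lower()])
    if not slots:
        return []
    step = duration / len(slots)
    return [
        (code, round(i * step, 3), round((i + 1) * step, 3))
        for i, code in enumerate(slots)
    ]


# Hypothetical transliterated lexicon, for illustration only.
lexicon = {"da": ["d", "aa"], "net": ["n", "eh", "t"]}
timeline = spread_phonemes("da | net", lexicon, duration=3.0)
```

The output is only a starting layout; the point is that each phoneme then needs nudging rather than placing from scratch.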

In short, urgh.

03-07-2008, 12:16 AM
The numbered phonemes are variants; they look different in phonemes.txt, and thus phonemes.vfe. However, it's worth noting that "g" and "g2" look exactly the same.

These two (http://www.cloudgarden.com/JSAPI/docs/SAPI5Pronunciations.html) pages (http://www.cloudgarden.com/JSAPI/docs/SAPI4Pronunciations.html) have phoneme tables and descriptions. After an initial gander, I noticed the following differences:

Valve's "c" is SAPI's "k"
Valve/SAPI4's "hh" is SAPI5's "h"
Valve/SAPI4's "nx" is SAPI5's "ng"

There's also an American English dictionary here (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) that uses similar phonemes.
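
If that dictionary were used as a source, a conversion along these lines might be a starting point. It is a sketch only: it assumes the usual CMUdict line layout (the word followed by uppercase phoneme codes, with stress digits on vowels), strips the digits, lowercases, and applies just the differences listed above. Passing every other code through unchanged is an assumption that would need checking against phonemes.txt, and the function and table names are made up.

```python
import re

# Differences noted above, as dictionary-style code -> Valve code.
# "hh" already matches, so it is not listed. Codes not listed here are
# passed through unchanged, which is an unverified assumption.
TO_VALVE = {"k": "c", "ng": "nx"}


def cmudict_line_to_valve(line):
    """Turn one dictionary line, e.g. 'KING  K IH1 NG', into (word, codes)."""
    word, *phones = line.split()
    codes = []
    for ph in phones:
        ph = re.sub(r"\d", "", ph).lower()  # drop stress digit: "IH1" -> "ih"
        codes.append(TO_VALVE.get(ph, ph))
    return word.lower(), codes
```

For example, cmudict_line_to_valve("KING  K IH1 NG") gives ("king", ["c", "ih", "nx"]).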

03-08-2008, 02:14 AM
"After an initial gander: I noticed the following differences:"

What's interesting is that looking at the code, it doesn't appear that phonemeextractor.dll is altering phonemes in any way after getting them from SAPI, and what it's getting is SAPI 5.1's output...

Going by what they look like when pronounced won't work though -- most visemes correspond to more than one phoneme.