Deepsync has had great success cloning voices in English. Cloning native languages in high definition, in contrast, was much more challenging.
It was very important for us to get the characters, pronunciation, and punctuation right. And not merely right, but so accurate that a listener hearing cloned audio in Hindi should feel they are hearing a real human voice.
When we started our voice cloning journey with English, we made sure that the cloned audio sounded just like the original and felt natural, keeping intact the subtlety of a real human voice.
To perfect the English cloned audio, we interacted with numerous podcasters and Enterprise hosts and cloned their voices, taking feedback from them on any aspect they could illuminate to help improve our solutions.
We also studied how other companies were cloning voices so that we could understand which areas needed improvement to produce perfectly cloned audio.
Thanks to our continuous learning and the feedback we received, we were able to produce natural-sounding cloned voices in English that were almost impossible to distinguish from the original.
English cloned audio example:
Note: The cloned audio you're listening to in this video was derived from 3 hours and 10 minutes of audio data. Despite this relatively small amount of data, we were able to match the cloned audio perfectly to the original.
As soon as we successfully cloned the English language in studio quality, our attention turned to the Hindi language, but the road ahead wasn't easy.
Hindi is a much more diverse and complex language than English, and we had no good study, example, or reference to tell us whether perfect voice replication in Hindi was even possible, as no one had perfected or mastered artificial voice in Hindi before us.
But with a strong determination to achieve our vision that creators and Enterprise hosts can produce compelling audio content in Hindi using their cloned voices, our journey began with the goal of cloning the Hindi language.
And today we are delighted to announce that Deepsync has mastered and advanced voice cloning in the Hindi language, being the first to achieve high fidelity and quality. Without any examples or feedback, our team perfected cloning the third most spoken language in the world.
Hindi cloned audio example:
Trust us, mastering this language wasn’t easy at all. Now let’s dive deep into why Hindi was much more difficult to clone than English.
Hindi Is a Diverse and Complex Language
English, like many other languages written in Latin script, is not spelled phonetically, so the general trend in recent research has been to first translate the words into their phonetic equivalents, which represent how the words are pronounced, generally using ARPABET or IPA.
These phonetic tokens form the basis of the other features, such as pitch, energy, duration, etc. that are extracted from the raw audio. These features are then served to our proprietary text-to-spectrogram model.
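To make the phonetic translation step for English concrete, here is a minimal sketch of a dictionary lookup. The tiny `LEXICON` below is hand-written for illustration in CMUdict-style ARPABET and the function name `to_phonemes` is hypothetical; it is not Deepsync's actual front end, which would rely on a full pronunciation dictionary or a trained grapheme-to-phoneme model.

```python
# Illustrative mini lexicon (assumption): CMUdict-style ARPABET entries.
# Note how the shared spelling "ough" maps to three different sounds.
LEXICON = {
    "though": ["DH", "OW1"],
    "through": ["TH", "R", "UW1"],
    "tough": ["T", "AH1", "F"],
}


def to_phonemes(sentence):
    """Translate an English sentence into a flat list of ARPABET tokens."""
    phonemes = []
    for word in sentence.lower().split():
        word = word.strip(".,;:?!")  # drop punctuation around the word
        if word not in LEXICON:
            raise KeyError(f"no pronunciation known for {word!r}")
        phonemes.extend(LEXICON[word])
    return phonemes


print(to_phonemes("Tough though."))
```

Because the spelling alone does not determine the sound, the lookup table (or a learned equivalent) is unavoidable for English.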
Hindi (and other languages written in Devanagari script) is phonetic in nature, which makes it significantly different from English and many other languages: words are pronounced almost exactly as they are written.
In the Devanagari script, the orthography reflects the pronunciation of the language. There is no concept of letter case in the script, unlike in the Latin alphabet. It is written from left to right, has symmetrical rounded shapes within squared outlines, and the top of full letters is adorned with a horizontal line called a ‘shirorekhā’.
Devanagari is made up of 14 vowels and 33 consonants. You can write and speak the consonants and vowels separately and together. Vowels (called 'swar') in Hindi have a short form (called 'maatra'), which is used with consonants to give the consonants different verbal intonations.
The 'halant' character is also commonly used in Hindi and other languages of Devanagari script to denote the partial pronunciation of a consonant, that is, a consonant spoken without its inherent vowel.
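The building blocks described above can be inspected directly with Python's standard `unicodedata` module. The sketch below labels each code point of a Hindi word as a swar, consonant, maatra, or halant. The helper names `classify` and `describe` are ours, and the code point ranges cover only the core Devanagari vowels and consonants, so treat it as an illustration rather than a complete classifier.

```python
import unicodedata


def classify(ch):
    """Label one code point as swar, consonant, maatra, halant, or other."""
    name = unicodedata.name(ch, "")
    if name == "DEVANAGARI SIGN VIRAMA":  # Unicode's name for the halant
        return "halant"
    if name.startswith("DEVANAGARI VOWEL SIGN"):  # dependent vowel form
        return "maatra"
    if "\u0904" <= ch <= "\u0914":  # core independent vowels
        return "swar"
    if "\u0915" <= ch <= "\u0939":  # core consonants
        return "consonant"
    return "other"


def describe(word):
    """Return (character, label) pairs for every code point in the word."""
    return [(ch, classify(ch)) for ch in word]


# The word for "Hindi" itself: six code points, including a halant.
for ch, kind in describe("हिन्दी"):
    print(f"U+{ord(ch):04X} {kind}")
```

Each visible syllable is therefore a short sequence of symbols, and a model that consumes the script directly sees every one of them as a separate token.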
Hindi punctuation uses commas (called Alp-Viraam in Hindi), semicolons (called Ardh-Viraam in Hindi), question marks, colons, exclamation marks, and apostrophe marks quite similarly to English punctuation. Rather than a period, the full stop is represented by a standing line, called 'purn-viraam'.
Dataset Preparation for the Hindi Language
As the script is represented in Unicode, its symbols include consonants, vowels, their short forms, punctuation marks, and halants, resulting in a larger vocabulary than the English language.
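As a rough illustration of how such a symbol vocabulary could be assembled, the sketch below collects every distinct code point from a set of transcripts and assigns it an integer id. The special tokens (`<pad>`, `<bos>`, `<sep>`) and the function names are assumptions made for this example, not Deepsync's actual symbol set.

```python
def build_vocab(transcripts, specials=("<pad>", "<bos>", "<sep>")):
    """Map every distinct non-space symbol in the transcripts to an id.

    Special tokens (hypothetical names) take the lowest ids.
    """
    symbols = sorted(
        {ch for line in transcripts for ch in line if not ch.isspace()}
    )
    return {sym: idx for idx, sym in enumerate(list(specials) + symbols)}


def encode(text, vocab):
    """Turn a string into a list of symbol ids, skipping whitespace."""
    return [vocab[ch] for ch in text if not ch.isspace()]


vocab = build_vocab(["नमस्ते।"])
print(len(vocab), encode("नमस्ते।", vocab))
```

Consonants, maatras, the halant, and the purn-viraam each receive their own id, which is why the table grows larger than an English phoneme set.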
The lack of an open-source database for Hindi audio led us to use a private dataset consisting of approximately 18 hours of audio that narrated various novels in their entirety. To map the natural pauses, audio was split into variable-length segments from 1.5 seconds to 13 seconds.
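One simple way to obtain segments in that length range is to cut only at detected pauses. The sketch below assumes the pause timestamps are already known (for example from a silence detector, which is not shown) and uses a hypothetical helper, `split_at_pauses`; the actual splitting procedure used for the dataset is not described in this post.

```python
def split_at_pauses(pause_times, total_duration, min_len=1.5, max_len=13.0):
    """Cut audio at pauses so every kept segment lasts min_len..max_len seconds.

    pause_times: timestamps (seconds) of natural pauses, assumed given.
    Returns a list of (start, end) tuples.
    """
    segments = []
    start = 0.0
    for cut in sorted(pause_times) + [total_duration]:
        if cut - start < min_len:
            continue  # too short so far, extend to the next pause
        if cut - start <= max_len:
            segments.append((start, cut))
        # Spans longer than max_len are simply discarded in this sketch.
        start = cut
    # A trailing remainder shorter than min_len is dropped as well.
    return segments


print(split_at_pauses([1.0, 2.0, 6.5, 21.0, 23.0], total_duration=24.0))
```

Cutting at pauses rather than at fixed intervals keeps each clip aligned with a natural phrase boundary.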
Afterward, we used existing cloud-based ASR (automatic speech recognition) services to extract text from the audio. The cloud service provided normalized text (a lexical representation, free of special characters, numbers, and punctuation) as well as standard text (text that contains numbers and punctuation).
To model the punctuation along with the normalized text, we developed a script that mapped the punctuation marks between the normalized and non-normalized text, resulting in normalized text with all the punctuation expressed.
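The script itself is not published, but the idea can be sketched with Python's standard `difflib`: align the words of the two texts, then copy each trailing punctuation mark from the standard text onto the matching normalized word. The punctuation set, the function names, and the handling of expanded numbers below are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

PUNCT = ",;:?!\u0964"  # includes the purn-viraam (U+0964)


def split_trailing_punct(token):
    """Split 'word,' into ('word', ',')."""
    core = token.rstrip(PUNCT)
    return core, token[len(core):]


def restore_punctuation(standard_text, normalized_text):
    """Copy punctuation from the standard text onto the normalized text."""
    std = [split_trailing_punct(t) for t in standard_text.split()]
    std_words = [word for word, _ in std]
    out = normalized_text.split()
    matcher = SequenceMatcher(a=std_words, b=out, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical words: transfer punctuation one to one.
            for k in range(i2 - i1):
                out[j1 + k] += std[i1 + k][1]
        elif tag == "replace":
            # E.g. a numeral spelled out as words: keep the punctuation
            # that closed the span on the last normalized word.
            out[j2 - 1] += std[i2 - 1][1]
    return " ".join(out)


standard = "नमस्ते, मेरे पास 25 किताबें हैं।"
normalized = "नमस्ते मेरे पास पच्चीस किताबें हैं"
print(restore_punctuation(standard, normalized))
```

Aligning on words rather than positions matters because a single numeral in the standard text may expand into several words in the normalized text.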
Using Montreal Forced Aligner, we mapped the raw audio phonemes to their respective durations and then mapped them to their respective frames in the mel-spectrograms.
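A forced aligner reports each phoneme's start and end time in seconds, while the model works in mel-spectrogram frames. The sketch below shows one common way to convert between the two. The sample rate of 22050 Hz and hop length of 256 samples are assumed values for illustration, not figures from this post, and `durations_to_frames` is a hypothetical helper.

```python
def durations_to_frames(intervals, sample_rate=22050, hop_length=256):
    """Convert (label, start_sec, end_sec) intervals to per-label frame counts.

    Boundaries are rounded to frame indices first, so the frame counts
    always add up to the rounded position of the final boundary.
    """
    frames = []
    for label, start, end in intervals:
        start_frame = round(start * sample_rate / hop_length)
        end_frame = round(end * sample_rate / hop_length)
        frames.append((label, end_frame - start_frame))
    return frames


# Hypothetical aligner output: two phonemes followed by a pause.
alignment = [("n", 0.0, 0.08), ("a", 0.08, 0.20), ("sil", 0.20, 0.35)]
print(durations_to_frames(alignment))
```

Rounding the boundaries instead of the individual durations avoids drift, so the durations still cover the spectrogram frame for frame.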
We now had a dataset containing both the raw audio and the phonetic representation, as well as the durations of all these phonemes, punctuation marks, and non-punctuation natural pauses.
By using these tokens as the base, we were able to extract other features from the audio similar to the English model.
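The post does not spell out how the other features are tied to the tokens. One common approach, shown here purely as an assumption, is to average a frame-level feature (such as energy or pitch) over the frames that belong to each token, using the per-token durations. The function name `average_per_token` is hypothetical.

```python
def average_per_token(frame_values, durations):
    """Average a frame-level feature over each token's span of frames.

    frame_values: one feature value per mel-spectrogram frame.
    durations: number of frames assigned to each token, in order.
    """
    if sum(durations) != len(frame_values):
        raise ValueError("durations must cover every frame exactly once")
    averaged = []
    position = 0
    for duration in durations:
        span = frame_values[position:position + duration]
        # A zero-length token has no frames; report 0.0 for it.
        averaged.append(sum(span) / duration if duration else 0.0)
        position += duration
    return averaged


energy = [1.0, 3.0, 2.0, 2.0, 2.0, 5.0]  # made-up per-frame energy
print(average_per_token(energy, durations=[2, 3, 1]))
```

This yields exactly one feature value per token, which keeps the feature sequences the same length as the token sequence.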
As soon as we mapped the Hindi character symbols for the Encoder in the initial embedding layer, the model behaved agnostically to the language and produced well-formed mel spectrograms with pauses and punctuation that were highly accurate.
Inference
The inference process begins with preprocessing. We remove non-Unicode symbols, normalize the numerical values to their Hindi counterparts, replace spaces with word separators for the model, and add the BOS (Beginning of Sentence) token to make the output sound more natural. The processed text is passed to the text-to-spectrogram model, and to create raw waveforms, the Mel-spectrogram output is fed into the vocoder model, which gives high-quality audio in return.
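The preprocessing steps can be sketched as follows. Several simplifications are assumed here: digits are mapped one by one to Devanagari numerals instead of being spelled out as Hindi words, "unsupported symbol" is taken to mean anything outside the Devanagari block and a small punctuation set, and the `<bos>` and `<sep>` token names are hypothetical.

```python
BOS = "<bos>"  # hypothetical beginning-of-sentence token
SEP = "<sep>"  # hypothetical word-separator token
DIGITS = str.maketrans("0123456789", "०१२३४५६७८९")
EXTRA_PUNCT = ",;:?!"


def is_supported(ch):
    """Keep Devanagari block characters, spaces, and basic punctuation."""
    return "\u0900" <= ch <= "\u097F" or ch == " " or ch in EXTRA_PUNCT


def preprocess(text):
    """Turn raw input text into a list of model tokens."""
    text = text.translate(DIGITS)  # simplified numeral normalization
    cleaned = "".join(ch for ch in text if is_supported(ch))
    tokens = [BOS]
    for index, word in enumerate(cleaned.split()):
        if index:
            tokens.append(SEP)  # spaces become explicit separators
        tokens.extend(word)  # one token per code point
    return tokens


print(preprocess("मेरे पास 2 किताबें हैं। 😀"))
```

The resulting token list is what an embedding layer would consume; the spectrogram model and vocoder stages are outside the scope of this sketch.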
The Path Ahead
In our AI journey, we've come a long way, and we're proud that we reached this milestone, which has yet to be accomplished by anyone else with this high audio fidelity.
This achievement is just the tip of the iceberg. In the coming months, we are going to launch new features such as expression and translation, which will revolutionize the cloned audio production process and take how cloned audio sounds and feels to a whole new level.
Translation enables users to produce audio in another language without needing to provide their audio data in that language. Deepsync will only require users to provide audio data that could be in any language. Based on the given data, users will be able to create audio content in multiple languages.
Imagine a journalist who only knows English producing high-definition podcasts in Telugu or Malayalam.
Yes, you read that right. Many more changes and innovations are on the way. Stay tuned and experience how AI will revolutionize the audio production process.
With Deepsync, produce audio content in your regional language, audio that captures your essence, accent, and tone. Express your feelings in your native language without even speaking a word.
To schedule a live demo with us at your convenience, where we can answer your questions, click on the image below.
Glossary
Orthography: the representation of the sounds of a language by written or printed symbols.
Mapping: any prescribed way of assigning to each object in one set a particular object in another (or the same) set. Mapping applies to any set: a collection of objects, such as all whole numbers, all the points on a line, or all those inside a circle.
Mel spectrogram: A Mel spectrogram is a spectrogram whose frequency axis is converted to the Mel scale. A spectrogram is a visualization of the frequency spectrum of a signal over time, where the frequency spectrum of a signal is the range of frequencies contained in the signal.
Vocoder: A vocoder is a category of speech coding that analyzes and synthesizes the human voice signal for audio data compression, multiplexing, voice encryption, or voice transformation.
Inference: In the ML domain, `inference` refers to generating required outputs from a machine learning model or pipeline.
Consonant: A consonant is a speech sound that is not a vowel. It also refers to letters of the alphabet that represent those sounds: Z, B, T, G, and H are all consonants.
Encoder: In general an encoder is a device or process that converts data from one format to another. In position sensing, an encoder is a device that can detect and convert mechanical motion to an analog or digitally coded output signal.
Phonetic representation: It is a method of representing how a certain word is pronounced and articulated depending on various factors like accent, region, etc.