Model Versions
Background
Resemble’s artificial intelligence and machine learning research team continually improves its state-of-the-art techniques for voice cloning, audio synthesis, and voice conversion.
To ensure access to the latest and most powerful models, Resemble’s platform provides multiple generations of model versions. This document provides a high-level overview of the model versions available on the platform, their feature scope, and their availability within customer plans.
Models
Text To Speech
Version Name | Version Code | Description | Dataset Requirements | Streaming Support | Release Date |
---|---|---|---|---|---|
Resemble Legacy TTS | tts-legacy | Resemble’s initial TTS model offering a balance of speed and quality. | 1+ minutes | Yes ✅ | Q2 2021 |
Resemble Enhanced TTS V1 | tts-v1 | Resemble’s first generation of enhanced text-to-speech offering industry state-of-the-art naturalness. | 10+ minutes | No 🚫 | Q2 2023 |
Resemble Enhanced TTS V2 | tts-v2 | Resemble’s second generation of enhanced text-to-speech offering state-of-the-art naturalness and lower latency for time-to-first sound. | 30+ minutes | Yes ✅ | Q3 2023 |
Resemble Enhanced TTS v3 | tts-v3 | Resemble’s latest text-to-speech offering, balancing low latency with state-of-the-art naturalness of cloned voices. | 10+ minutes | Yes ✅ | Q4 2023 |
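
As an illustration of how the dataset and streaming columns above might be used when choosing a model, the following is a minimal sketch. It is not part of Resemble’s API: the `TtsVersion` class, the `TTS_VERSIONS` list, and the `eligible_tts_versions` function are hypothetical names, and the values are simply transcribed from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsVersion:
    code: str                  # version code from the table
    min_dataset_minutes: int   # "Dataset Requirements" column
    streaming: bool            # "Streaming Support" column


# Values transcribed by hand from the Text To Speech table (hypothetical structure).
TTS_VERSIONS = [
    TtsVersion("tts-legacy", 1, True),
    TtsVersion("tts-v1", 10, False),
    TtsVersion("tts-v2", 30, True),
    TtsVersion("tts-v3", 10, True),
]


def eligible_tts_versions(dataset_minutes, need_streaming):
    """Return version codes whose dataset and streaming requirements are met."""
    return [
        v.code
        for v in TTS_VERSIONS
        if dataset_minutes >= v.min_dataset_minutes
        and (v.streaming or not need_streaming)
    ]


if __name__ == "__main__":
    # 12 minutes of recorded audio, streaming required.
    print(eligible_tts_versions(12, need_streaming=True))
```

The sketch only filters on the two columns shown; plan availability and the limitations listed in the next table would still need to be checked separately.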
The table below provides a detailed breakdown of the performance statistics and limitations associated with each model.
Version Name | Version Code | Latency / TTFS* | Character Limits | Limitations |
---|---|---|---|---|
Resemble Legacy TTS | tts-legacy | 170ms | Maximum 3000 characters. | N/A |
Resemble Enhanced TTS V1 | tts-v1 | 2000-3000ms | Maximum 280 characters. | SSML Tags Not Supported: <prosody>, <emotion>, <phonemes>, <substitutions>, <emphasis>, <say-as>. Timestamps not supported. Resemble Fill not supported. |
Resemble Enhanced TTS V2 | tts-v2 | 580ms | Maximum 1000 characters. | SSML Tags Not Supported: <prosody>, <emotion>, <phonemes>, <substitutions>, <emphasis>, <say-as>. Timestamps not supported. Resemble Fill not supported. |
Resemble Enhanced TTS v3 | tts-v3 | 350ms | Maximum 1000 characters. | SSML Tags Not Supported: <prosody>, <emotion>, <phonemes>, <substitutions>, <emphasis>, <say-as>. Timestamps not supported. Resemble Fill not supported. |
* Time-to-first-sound (TTFS). The reported metrics are best-case figures; factors such as load times, cold boot, and network latency can increase the latency seen by end users.
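
The character limits and unsupported SSML tags above lend themselves to a client-side check before text is submitted. The sketch below is one possible way to do that and makes several assumptions: `CHAR_LIMITS`, `UNSUPPORTED_SSML`, `ENHANCED_VERSIONS`, and `check_tts_input` are hypothetical names, the values are transcribed from the table, and the limit is applied to the raw string length (the table does not say whether markup counts toward the maximum).

```python
import re

# Maximum characters per version code, transcribed from the table above.
CHAR_LIMITS = {
    "tts-legacy": 3000,
    "tts-v1": 280,
    "tts-v2": 1000,
    "tts-v3": 1000,
}

# Tag names exactly as listed in the "Limitations" column.
UNSUPPORTED_SSML = {
    "prosody", "emotion", "phonemes", "substitutions", "emphasis", "say-as",
}

# Versions for which the table lists the SSML limitations.
ENHANCED_VERSIONS = {"tts-v1", "tts-v2", "tts-v3"}


def check_tts_input(text, version_code):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    limit = CHAR_LIMITS[version_code]
    # Assumption: the limit is compared against the raw string, markup included.
    if len(text) > limit:
        problems.append(
            f"{len(text)} characters exceeds the {limit}-character maximum"
        )

    if version_code in ENHANCED_VERSIONS:
        # Collect the names of opening tags such as <prosody rate="slow">.
        used = {name.lower() for name in re.findall(r"<\s*([A-Za-z][\w-]*)", text)}
        for tag in sorted(used & UNSUPPORTED_SSML):
            problems.append(f"<{tag}> is not supported")

    return problems


if __name__ == "__main__":
    sample = '<speak><prosody rate="slow">Hello</prosody></speak>'
    for code in ("tts-legacy", "tts-v3"):
        print(code, check_tts_input(sample, code))
```

A check like this only mirrors the published table; the platform itself remains the authority on what a given request accepts.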
Speech To Speech
Version Name | Version Code | Description | Dataset Requirements | Streaming Support | Release Date |
---|---|---|---|---|---|
Resemble Legacy STS | sts-legacy | Resemble’s initial speech-to-speech model, giving users the ability to convert speaker audio from one voice to another. | 10+ minutes | Yes ✅ | Q2 2021 |
Resemble Core STS V1 | sts-v1 | Resemble’s first generation of core speech-to-speech functionality offering state-of-the-art speaker audio conversion with greater speed and accuracy. | 10+ minutes | Yes ✅ | Q2 2023 |
Resemble Core STS V2 | sts-v2 | Resemble’s second generation of core speech-to-speech functionality offering all the benefits of sts-v1 with improved pitch tracking and 48kHz support. | 10+ minutes | Yes ✅ | Q4 2023 |
Resemble Fill
Version Name | Version Code | Description | Dataset Requirements | Release Date |
---|---|---|---|---|
Resemble Fill (Audio Inpainting) | fill-v1 | Resemble’s flagship audio inpainting model, allowing users to inpaint audio recordings with novel audio. | 10+ minutes (initial TTS model training) | Q2 2021 |