Version: 2.0.0

Model Versions

Background

Resemble’s artificial intelligence and machine learning research team continually improves its voice cloning, audio synthesis, and voice conversion models with new state-of-the-art techniques.

To ensure access to the latest and most powerful models, Resemble’s platform provides multiple generations of model versions. This document provides a high-level overview of the model versions available on the platform, their feature scope, and their availability within customer plans.

Models

Text To Speech

Version Name	Version Code	Description	Dataset Requirements	Streaming Support	Release Date
Resemble Legacy TTS	`tts-legacy`	Resemble’s initial TTS model offering a balance of speed and quality.	1+ minutes	Yes ✅	Q2 2021
Resemble Enhanced TTS V1	`tts-v1`	Resemble’s first generation of enhanced text-to-speech offering industry state-of-the-art naturalness.	10+ minutes	No 🚫	Q2 2023
Resemble Enhanced TTS V2	`tts-v2`	Resemble’s second generation of enhanced text-to-speech offering state-of-the-art naturalness and lower time-to-first-sound latency.	30+ minutes	Yes ✅	Q3 2023
Resemble Enhanced TTS v3	`tts-v3`	Resemble’s latest text-to-speech offering, balancing low latency with state-of-the-art naturalness of cloned voices.	10+ minutes	Yes ✅	Q4 2023
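
As an illustration of how the table above might be used when choosing a model, the following Python sketch filters the TTS version codes by the amount of training audio available and by whether streaming is needed. The `TTSModel` class and the `eligible_tts_models` helper are hypothetical names invented for this example and are not part of any Resemble SDK or API; the values are transcribed from the table, and "N+ minutes" is assumed to mean "at least N minutes".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSModel:
    code: str                 # version code from the table
    min_dataset_minutes: int  # "Dataset Requirements" column
    streaming: bool           # "Streaming Support" column


# Values transcribed from the Text To Speech table above.
TTS_MODELS = [
    TTSModel("tts-legacy", 1, True),
    TTSModel("tts-v1", 10, False),
    TTSModel("tts-v2", 30, True),
    TTSModel("tts-v3", 10, True),
]


def eligible_tts_models(dataset_minutes: float, need_streaming: bool) -> list[str]:
    """Return the version codes whose table requirements are satisfied.

    Hypothetical helper: assumes "N+ minutes" means at least N minutes.
    """
    return [
        m.code
        for m in TTS_MODELS
        if dataset_minutes >= m.min_dataset_minutes
        and (m.streaming or not need_streaming)
    ]
```

For example, `eligible_tts_models(12, need_streaming=True)` leaves out `tts-v1` (no streaming) and `tts-v2` (needs 30+ minutes of audio).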

The table below provides a detailed breakdown of the performance statistics and limitations associated with each model.

Version Name	Version Code	Latency / TTFS*	Character Limits	Limitations
Resemble Legacy TTS	`tts-legacy`	170ms	Maximum 3000 characters.	N/A
Resemble Enhanced TTS V1	`tts-v1`	2000-3000ms	Maximum 280 characters.	SSML Tags Not Supported: `<prosody>`, `<emotion>`, `<phonemes>`, `<substitutions>`, `<emphasis>`, `<say-as>`. Timestamps not supported. Resemble Fill not supported.
Resemble Enhanced TTS V2	`tts-v2`	580ms	Maximum 1000 characters.	SSML Tags Not Supported: `<prosody>`, `<emotion>`, `<phonemes>`, `<substitutions>`, `<emphasis>`, `<say-as>`. Timestamps not supported. Resemble Fill not supported.
Resemble Enhanced TTS v3	`tts-v3`	350ms	Maximum 1000 characters.	SSML Tags Not Supported: `<prosody>`, `<emotion>`, `<phonemes>`, `<substitutions>`, `<emphasis>`, `<say-as>`. Timestamps not supported. Resemble Fill not supported.

* Time-to-first-sound (TTFS). The reported figures are best-case values; factors such as load times, cold boot, and network latency can increase the latency seen by end users.
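
Because each model caps the number of characters it accepts, longer text has to be split before synthesis. The sketch below shows one way to do that client-side with Python's standard `textwrap` module. The `CHARACTER_LIMITS` mapping and the `split_for_model` helper are hypothetical names for this example, not part of a Resemble SDK; the limits are copied from the table above, and how the platform counts characters (for instance, whether SSML markup counts toward the limit) is an assumption not covered here.

```python
import textwrap

# Character limits copied from the table above, keyed by version code.
CHARACTER_LIMITS = {
    "tts-legacy": 3000,
    "tts-v1": 280,
    "tts-v2": 1000,
    "tts-v3": 1000,
}


def split_for_model(text: str, version_code: str) -> list[str]:
    """Split text into chunks no longer than the model's character limit.

    Hypothetical helper: counts plain Python characters, which may differ
    from how the platform counts them.
    """
    limit = CHARACTER_LIMITS[version_code]
    # textwrap.wrap breaks at whitespace where it can and splits a single
    # over-long word only as a last resort. It also collapses runs of
    # whitespace (including newlines) into single spaces.
    return textwrap.wrap(text, width=limit)
```

Splitting at whitespace keeps words intact, but it can still cut a sentence in half; splitting at sentence boundaries first would likely give more natural prosody across chunks.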

Speech To Speech

Version Name	Version Code	Description	Dataset Requirements	Streaming Support	Release Date
Resemble Legacy STS	`sts-legacy`	Resemble’s initial speech-to-speech model, giving users the ability to convert speaker audio from one voice to another.	10+ Minutes	Yes ✅	Q2 2021
Resemble Core STS V1	`sts-v1`	Resemble’s first generation of core speech-to-speech functionality offering state-of-the-art speaker audio conversion with greater speed and accuracy.	10+ Minutes	Yes ✅	Q2 2023
Resemble Core STS V2	`sts-v2`	Resemble’s second generation of core speech-to-speech functionality offering all the benefits of `sts-v1` with improved pitch tracking and 48kHz support.	10+ Minutes	Yes ✅	Q4 2023

Resemble Fill

Version Name	Version Code	Description	Dataset Requirements	Release Date
Resemble Fill (Audio Inpainting)	`fill-v1`	Resemble’s flagship audio inpainting model, allowing users to inpaint audio recordings with novel audio.	10+ minutes (initial TTS model training)	Q2 2021
