Stream a clip
To view an example project using the Resemble API, see the resemble-ai/resemble-streaming-demo repository on GitHub.
This endpoint synthesizes a new clip and streams the audio data (wav format) back to you. With streaming, audio is generated sequentially and sent in chunks, so the delay before receiving the first bytes of audio is shorter than the time a full synchronous synthesis takes, regardless of the size of your query. Implementing streaming in your app will make it more responsive.
Additionally, streaming synthesis yields timestamp information (albeit in a different format than synchronous synthesis) and the exact audio duration in the first chunk of streamed bytes, before any audio bytes.
Alternatively, if you're on a Business Plan or higher, you can stream your data over a websocket using our websocket API.
Notice that the request is sent to our synthesis servers instead of app.resemble.ai.
HTTP Request
POST YOUR_STREAMING_ENDPOINT
Your streaming endpoint can be seen in the Try it out section.
JSON Body
Parameter | Type | Description |
---|---|---|
project_uuid | string | UUID of the project to which the clip should belong |
voice_uuid | string | UUID of the voice to use for synthesizing |
data | string | Content to be synthesized. At the moment, SSML is only partially supported. |
precision | (optional) string | The bit depth of the generated audio. One of the following values: PCM_32, PCM_24, PCM_16, or MULAW. Default is PCM_32. |
sample_rate | (optional) integer | The sample rate of the produced audio. Either 8000, 16000, 22050, 32000, or 44100. Default is 22050. |
HTTP Response
A successful response contains bytes which make up a single-channel PCM 16 wav file. It can be decoded and played back on the fly.
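For illustration, here is a minimal sketch of a streaming request using Python and the requests library. The endpoint URL, authentication header, and UUIDs are placeholders for your own values; the chunk handling simply writes bytes to a file as they arrive.

```python
# A minimal streaming request sketch, assuming the `requests` library.
# The endpoint, auth header, and UUIDs below are placeholders.
import requests

response = requests.post(
    "YOUR_STREAMING_ENDPOINT",
    headers={"Authorization": "YOUR_AUTH_HEADER"},  # use your account's auth scheme
    json={
        "project_uuid": "YOUR_PROJECT_UUID",
        "voice_uuid": "YOUR_VOICE_UUID",
        "data": "Hello from the streaming endpoint.",
    },
    stream=True,  # do not buffer the whole response body
)
response.raise_for_status()

with open("clip.wav", "wb") as f:
    # Chunks arrive as they are synthesized; a real app could start
    # decoding and playing them back immediately instead of writing to disk.
    for chunk in response.iter_content(chunk_size=None):
        f.write(chunk)
```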
Try it out
Wav encoding
We take advantage of the RIFF format to encode additional data in the header of our wavs, such as:
- The size (in bytes) of the entire file
- The number of audio samples
- The sample rate
- The times at which characters will be pronounced
- The times at which phonemes will be pronounced
Typically, audio libraries will only parse the first three of these values. Therefore, if you wish to obtain the audio timestamps, you will either have to use one of our SDKs or handle the decoding yourself, using our wav specification.
A typical wav file starts with a RIFF chunk, followed by a format chunk and finally a data chunk. The wav files we return additionally have a cue chunk, a list chunk and ltxt chunks. These are located between the format and data chunks, and carry the timestamp information. If you are interested in the full specification of the wav format, see here.
Below is the specification of the wav format that we use. Multi-byte values are encoded in little-endian order. Integers are always unsigned, and strings are encoded as ASCII characters (with the exception of the text in the ltxt chunks, which is UTF-8).
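Every chunk below shares the same 8-byte header layout: a 4-character ASCII ID followed by a little-endian unsigned 32-bit size. As a starting point for the parsing sketches that follow, here is a small Python helper using only the standard struct module (the function name is illustrative, not part of an SDK):

```python
import struct

def read_chunk_header(f):
    """Read a 4-byte ASCII chunk ID and its little-endian uint32 size."""
    chunk_id = f.read(4).decode("ascii")
    (size,) = struct.unpack("<I", f.read(4))
    return chunk_id, size
```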
Header & Format chunks
Size | Description | Value |
---|---|---|
4 | RIFF ID | "RIFF" |
4 | Remaining file size (in bytes) after this read* | (file size) - 8 |
4 | RIFF type | "WAVE" |
4 | Format chunk ID | "fmt " (note the space) |
4 | Chunk data size | 16 |
2 | Compression code | 1 (corresponds to PCM) |
2 | Number of channels | 1 |
4 | Sample rate | 8,000 - 48,000 |
4 | Byte rate | 16,000 - 96,000 |
2 | Block align | 2 |
2 | Bits per sample | 16 |
* If you have an older model, the file size will be 0xFFFFFFFF instead. Please contact us if this is the case for you, and we will upgrade your model.
After having parsed these chunks, you will know the sample rate and the total size of the audio file in bytes. Do not use this size to approximate the remaining audio duration; the data chunk will give you the exact length of the audio.
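As a sketch of how the header and format chunks can be parsed, assuming a file-like object positioned at the start of the wav data (names are illustrative):

```python
import struct

def parse_header(f):
    """Parse the RIFF header and "fmt " chunk laid out in the tables above."""
    assert f.read(4) == b"RIFF"
    (remaining_size,) = struct.unpack("<I", f.read(4))  # file size - 8 (0xFFFFFFFF on older models)
    assert f.read(4) == b"WAVE"

    assert f.read(4) == b"fmt "
    (fmt_size,) = struct.unpack("<I", f.read(4))  # always 16 here
    (compression, channels, sample_rate, byte_rate,
     block_align, bits_per_sample) = struct.unpack("<HHIIHH", f.read(16))
    return sample_rate
```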
Timestamps (cue, list & ltxt chunks)
Cue
Size | Description | Value |
---|---|---|
4 | Cue chunk ID | "cue " |
4 | Remaining size of the cue chunk after this read | 4 + n_cue_points * 24 |
4 | Number of remaining cue points | n_cue_points |
This chunk is then followed by n_cue_points cue points:
Cue points
Size | Description | Value |
---|---|---|
4 | Cue point ID | 0 - 0xFFFFFFFF |
4 | Unused | 0 |
4 | Unused | "data" |
8 | Unused | 0 |
4 | Sample offset | 0 - 0xFFFFFFFF |
Cue points are simply a list of time points (expressed as offsets in number of audio samples) which mark the start of graphemes or phonemes.
The graphemes and phonemes are registered in the list chunk.
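Continuing the sketch, the cue chunk can be parsed into a mapping from cue point ID to sample offset, which the ltxt chunks below refer back to:

```python
import struct

def parse_cue_chunk(f):
    """Parse the "cue " chunk into {cue_point_id: sample_offset}."""
    assert f.read(4) == b"cue "
    f.read(4)  # remaining chunk size: 4 + n_cue_points * 24
    (n_cue_points,) = struct.unpack("<I", f.read(4))

    offsets = {}
    for _ in range(n_cue_points):
        (cue_id,) = struct.unpack("<I", f.read(4))
        f.read(16)  # the three unused fields (4 + 4 + 8 bytes)
        (sample_offset,) = struct.unpack("<I", f.read(4))
        offsets[cue_id] = sample_offset
    return offsets
```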
List
Size | Description | Value |
---|---|---|
4 | List chunk ID | "list" |
4 | Remaining size of the list chunk after this read | 4 + (sum of ltxt chunk sizes) |
4 | Type ID | "adtl" |
This chunk is then followed by ltxt chunks:
LTXT
Size | Description | Value |
---|---|---|
4 | LTXT chunk ID | "ltxt" |
4 | Remaining size of this ltxt chunk after this read* | 20 + text_length |
4 | Cue point ID | 0 - 0xFFFFFFFF |
4 | Length in number of samples | 1 - 0xFFFFFFFF |
4 | Character type | "grph" OR "phon" |
8 | Unused | 0 |
text_length | The UTF-8 encoded text with a "\0" termination character | * |
* The wav specification requires that all chunks be aligned on multiples of the block align value (always 2 in our encoding). Therefore, if text_length is odd, you must skip an additional byte after having read the chunk.
Each LTXT chunk corresponds to a character or a phoneme (possibly with stress and duration characters). To get the starting position, take the sample offset of the cue point whose ID matches the one in the LTXT chunk. To get the ending position, add the LTXT chunk's length in number of samples to the starting position.
Graphemes are given first, in sequential order. They have the "grph" character type. Then, phonemes follow with the "phon" character type.
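Putting the list and ltxt layouts together, here is an illustrative parser that resolves each ltxt entry against the cue point offsets collected above. It assumes standard RIFF accounting, i.e. that the list chunk's size counts each ltxt chunk's 8-byte header and padding byte:

```python
import struct

def parse_timestamps(f, offsets):
    """Parse ltxt chunks into (text, kind, start_sample, end_sample) tuples."""
    assert f.read(4) == b"list"
    (list_size,) = struct.unpack("<I", f.read(4))
    assert f.read(4) == b"adtl"
    remaining = list_size - 4  # bytes of ltxt chunks still to read

    timestamps = []
    while remaining > 0:
        assert f.read(4) == b"ltxt"
        (ltxt_size,) = struct.unpack("<I", f.read(4))  # 20 + text_length
        cue_id, n_samples = struct.unpack("<II", f.read(8))
        kind = f.read(4).decode("ascii")  # "grph" or "phon"
        f.read(8)  # unused
        text_length = ltxt_size - 20
        text = f.read(text_length).rstrip(b"\0").decode("utf-8")
        padding = text_length % 2  # chunks are aligned on 2 bytes
        if padding:
            f.read(1)
        start = offsets[cue_id]
        timestamps.append((text, kind, start, start + n_samples))
        remaining -= 8 + ltxt_size + padding
    return timestamps
```

Dividing the start and end sample positions by the sample rate from the format chunk converts them to seconds.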
Audio data chunk
Finally, the audio data chunk follows.
Data
Size | Description | Value |
---|---|---|
4 | Data chunk ID | "data" |
4 | Size of the remaining audio data in bytes (number of samples * 2) | wav_length * 2 |
The remainder of the wav file is the PCM 16 encoded audio data.
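To finish the sketch, the data chunk header gives the exact number of samples, and therefore the exact audio duration, before any samples are decoded. In a live stream the PCM bytes arrive incrementally, so a real player would decode and play them as they come in rather than reading them all at once:

```python
import struct

def parse_data_chunk(f, sample_rate):
    """Read the data chunk header and decode the PCM 16 samples."""
    assert f.read(4) == b"data"
    (data_size,) = struct.unpack("<I", f.read(4))  # number of samples * 2
    n_samples = data_size // 2
    duration_seconds = n_samples / sample_rate  # exact audio duration
    samples = struct.unpack(f"<{n_samples}h", f.read(data_size))
    return samples, duration_seconds
```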