Version: 2.0.0

Stream a clip

To view an example project using the Resemble API, see the resemble-ai/resemble-streaming-demo repository on GitHub.

This endpoint synthesizes a new clip and returns the audio data (wav format) through a stream. With streaming, audio is generated sequentially and sent in chunks, so the delay before receiving the first bytes of audio is shorter than the time a full synchronous synthesis takes, regardless of the size of your query. Implementing streaming will make your app more responsive.

Additionally, streaming synthesis yields timestamp information (albeit in a different format than synchronous synthesis) and the exact audio duration in the first chunk of bytes streamed, prior to any audio bytes.

Alternatively, if you're on a Business Plan or higher, you can also stream audio over a WebSocket using our WebSocket API.

Careful

Notice that the request is sent to our synthesis servers instead of app.resemble.ai.

HTTP Request

POST YOUR_STREAMING_ENDPOINT

Your streaming endpoint is shown in the Try it out section of the interactive documentation.


| JSON Body | Type | Description |
| --- | --- | --- |
| project_uuid | string | UUID of the project to which the clip should belong |
| voice_uuid | string | UUID of the voice to use for synthesizing |
| data | string | Content to be synthesized. At the moment, SSML is only partially supported. |
| precision | string (optional) | The bit depth of the generated audio. One of the following values: PCM_32, PCM_24, PCM_16, or MULAW. Default is PCM_32. |
| sample_rate | integer (optional) | The sample rate of the produced audio. Either 8000, 16000, 22050, 32000, or 44100. Default is 22050. |

HTTP Response

A successful response contains bytes that make up a single-channel PCM 16 wav file, which can be decoded and played back on the fly.
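
As a rough sketch, the request and stream handling might look like the following in Python. It assumes the `requests` library; the authorization header scheme, the `stream_clip` helper name, and the output filename are illustrative placeholders, not the definitive client.

```python
import requests

def stream_clip(endpoint, api_key, project_uuid, voice_uuid, text):
    # The exact authorization scheme is an assumption; use the credentials
    # format shown in your Resemble dashboard.
    response = requests.post(
        endpoint,
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "project_uuid": project_uuid,
            "voice_uuid": voice_uuid,
            "data": text,
        },
        stream=True,  # hand chunks back as they arrive instead of buffering
    )
    response.raise_for_status()
    with open("clip.wav", "wb") as f:
        # The first chunk carries the wav header plus the timestamp chunks
        # described below; audio bytes follow.
        for chunk in response.iter_content(chunk_size=4096):
            f.write(chunk)
```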


Wav encoding

We take advantage of the RIFF format to encode additional data in the header of our wavs, such as:

  • The size (in bytes) of the entire file
  • The number of audio samples
  • The sample rate
  • The times at which characters will be pronounced
  • The times at which phonemes will be pronounced

Typically, audio libraries will only parse the first three of these values. Therefore, if you wish to obtain the audio timestamps, you will either have to use one of our SDKs or handle the decoding yourself, following the wav specification below.

A typical wav file starts with a RIFF chunk, followed by a format chunk and finally a data chunk. The wav files we return additionally contain a cue chunk, a list chunk, and ltxt chunks. These are located between the format and data chunks and carry the timestamp information. If you are interested in the full specification of the wav format, refer to the standard RIFF/WAVE documentation.

Below is the specification of the wav format that we use. Bytes are encoded in little-endian order. Integers are always unsigned, and strings are encoded as ASCII characters (with the exception of the text in the ltxt chunks, which is UTF-8).

Header & Format chunks

| Size | Description | Value |
| --- | --- | --- |
| 4 | RIFF ID | "RIFF" |
| 4 | Remaining file size (in bytes) after this read* | (file size) - 8 |
| 4 | RIFF type | "WAVE" |
| 4 | Format chunk ID | "fmt " (note the space) |
| 4 | Chunk data size | 16 |
| 2 | Compression code | 1 (corresponds to PCM) |
| 2 | Number of channels | 1 |
| 4 | Sample rate | 8,000 - 48,000 |
| 4 | Byte rate | 16,000 - 96,000 |
| 2 | Block align | 2 |
| 2 | Bits per sample | 16 |

* If you have an older model, the file size will be 0xFFFFFFFF instead. If this is the case for you, please contact us and we will upgrade your model.

After having parsed these chunks, you will know the sample rate and the total size of the audio file in bytes. Do not use this size to approximate the remaining audio duration; the data chunk will give you the exact length of the audio.
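
For illustration, here is a minimal Python sketch of this parse using only the standard library (`parse_header` is a hypothetical helper name):

```python
import struct

def parse_header(buf):
    # RIFF header: chunk ID, remaining file size, RIFF type (12 bytes).
    riff_id, file_size, riff_type = struct.unpack_from("<4sI4s", buf, 0)
    assert riff_id == b"RIFF" and riff_type == b"WAVE"
    # Format chunk: 8-byte chunk header followed by 16 bytes of fields.
    (fmt_id, fmt_size, compression, channels, sample_rate,
     byte_rate, block_align, bits_per_sample) = struct.unpack_from("<4sIHHIIHH", buf, 12)
    assert fmt_id == b"fmt " and fmt_size == 16 and compression == 1
    # The timestamp chunks (cue, list, ltxt) start right after, at offset 36.
    return sample_rate, file_size + 8, 36
```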

Timestamps (cue, list & ltxt chunks)

Cue

| Size | Description | Value |
| --- | --- | --- |
| 4 | Cue chunk ID | "cue " |
| 4 | Remaining size of the cue chunk after this read | 4 + n_cue_points * 24 |
| 4 | Number of remaining cue points | n_cue_points |

This chunk is then followed by n_cue_points cue points:

Cue points

| Size | Description | Value |
| --- | --- | --- |
| 4 | Cue point ID | 0 - 0xFFFFFFFF |
| 4 | Unused | 0 |
| 4 | Unused | "data" |
| 8 | Unused | 0 |
| 4 | Sample offset | 0 - 0xFFFFFFFF |

Cue points are simply a list of time points (expressed as offsets in number of audio samples) which mark the start of graphemes or phonemes.
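
Continuing the sketch above, the cue chunk can be decoded like this (a hypothetical helper reading the 24-byte cue points laid out in the tables):

```python
import struct

def parse_cue_chunk(buf, offset):
    # Cue chunk header: ID, remaining size, number of cue points.
    chunk_id, _size, n_cue_points = struct.unpack_from("<4sII", buf, offset)
    assert chunk_id == b"cue "
    offset += 12
    cue_points = {}
    for _ in range(n_cue_points):
        # 24 bytes per cue point; only the ID and the sample offset matter.
        cue_id, _, _, _, _, sample_offset = struct.unpack_from("<IIIIII", buf, offset)
        cue_points[cue_id] = sample_offset
        offset += 24
    return cue_points, offset
```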

The graphemes and phonemes are registered in the list chunk.

List

| Size | Description | Value |
| --- | --- | --- |
| 4 | List chunk ID | "list" |
| 4 | Remaining size of the list chunk after this read | 4 + (sum of ltxt chunk sizes) |
| 4 | Type ID | "adtl" |

This chunk is then followed by the ltxt chunks:

LTXT

| Size | Description | Value |
| --- | --- | --- |
| 4 | LTXT chunk ID | "ltxt" |
| 4 | Remaining size of this ltxt chunk after this read* | 20 + text_length |
| 4 | Cue point ID | 0 - 0xFFFFFFFF |
| 4 | Length in number of samples | 1 - 0xFFFFFFFF |
| 4 | Character type | "grph" OR "phon" |
| 8 | Unused | 0 |
| text_length | Text | The UTF-8 encoded text with a "\0" termination character* |
* Note

The wav specification requires that all chunks be aligned on multiples of the block align value (always 2 in our encoding). Therefore if text_length is odd, you must skip an additional byte after having read the chunk.

Each LTXT chunk corresponds to a character or a phoneme (possibly with stress and duration characters). To get the starting position, find the cue point whose ID matches the LTXT chunk's cue point ID and read its sample offset. To get the ending position, add the LTXT chunk's length in number of samples to the starting position.

Graphemes are given first, in sequential order. They have the "grph" character type. Then, phonemes follow with the "phon" character type.
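
To illustrate, here is a sketch of walking the list chunk and pairing each ltxt chunk with its cue point (`parse_timestamps` is a hypothetical helper; `cue_points` is the mapping returned by the cue-chunk sketch above):

```python
import struct

def parse_timestamps(buf, offset, cue_points):
    # List chunk header: ID, remaining size, "adtl" type ID.
    list_id, list_size, type_id = struct.unpack_from("<4sI4s", buf, offset)
    assert list_id == b"list" and type_id == b"adtl"
    end_of_list = offset + 8 + list_size
    offset += 12
    timestamps = []
    while offset < end_of_list:
        # ltxt header: ID, remaining size, cue point ID, length, character type.
        ltxt_id, size, cue_id, n_samples, kind = struct.unpack_from("<4sIII4s", buf, offset)
        assert ltxt_id == b"ltxt"
        text_length = size - 20
        # Text starts after the 8 unused bytes; drop the "\0" terminator.
        raw = buf[offset + 28 : offset + 28 + text_length]
        text = raw.rstrip(b"\0").decode("utf-8")
        start = cue_points[cue_id]            # sample offset of the matching cue point
        timestamps.append((text, kind.decode("ascii"), start, start + n_samples))
        offset += 8 + size + (size % 2)       # skip the pad byte if text_length is odd
    return timestamps, offset
```

Dividing the start and end sample offsets by the sample rate from the format chunk converts these timestamps to seconds.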

Audio data chunk

Finally, the audio data chunk follows.

Data

| Size | Description | Value |
| --- | --- | --- |
| 4 | Data chunk ID | "data" |
| 4 | Size of the audio data in bytes (number of samples * 2) | wav_length * 2 |

The remainder of the wav file consists of the PCM 16 encoded audio bytes.
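
As a final sketch, decoding the data chunk into signed 16-bit samples (again a hypothetical helper name):

```python
import struct

def parse_data_chunk(buf, offset):
    data_id, data_size = struct.unpack_from("<4sI", buf, offset)
    assert data_id == b"data"
    n_samples = data_size // 2
    # PCM 16: each sample is a little-endian signed 16-bit integer.
    samples = struct.unpack_from(f"<{n_samples}h", buf, offset + 8)
    return samples
```

The exact audio duration is then `n_samples / sample_rate` seconds, using the sample rate parsed from the format chunk.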