Stream a clip
To view an example project using the Resemble API, see the resemble-ai/resemble-streaming-demo repository on GitHub.
This endpoint synthesizes a new clip and streams the audio data (wav format) back to you. With streaming, audio is generated sequentially and sent in chunks, so the delay before receiving the first bytes of audio is shorter than the time a full synchronous synthesis takes, regardless of the size of your query. Implementing streaming in your app will make it more responsive.
Additionally, streaming synthesis yields timestamp information (albeit in a different format than synchronous synthesis) and the exact audio duration in the first chunk of streamed bytes, before any audio bytes.
Alternatively, if you're on a Business Plan or higher, you can stream your data over a websocket using our websocket API.
Notice that the request is sent to our synthesis servers instead of app.resemble.ai.
HTTP Request
POST YOUR_STREAMING_ENDPOINT
Your streaming endpoint can be seen in the Try it out section.
JSON Body
Parameter | Type | Description |
---|---|---|
project_uuid | string | UUID of the project to which the clip should belong |
voice_uuid | string | UUID of the voice to use for synthesizing |
data | string | Content to be synthesized. At the moment, SSML is only partially supported. |
precision | (optional) string | The bit depth of the generated audio. One of the following values: PCM_32, PCM_24, PCM_16, or MULAW. Default is PCM_32. |
sample_rate | (optional) integer | The sample rate of the produced audio. Either 8000, 16000, 22050, 32000, or 44100. Default is 22050. |
HTTP Response
A successful response contains bytes which make up a single-channel PCM 16 wav file. It can be decoded and played back on the fly.
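For illustration, here is a minimal sketch of a streaming request using Python and the requests library. The endpoint URL, authentication header, and UUIDs are placeholders for your own values; the chunk handling simply writes bytes to a file as they arrive.

```python
# A minimal streaming request sketch, assuming the `requests` library.
# The endpoint, auth header, and UUIDs below are placeholders.
import requests

response = requests.post(
    "YOUR_STREAMING_ENDPOINT",
    headers={"Authorization": "YOUR_AUTH_HEADER"},  # use your account's auth scheme
    json={
        "project_uuid": "YOUR_PROJECT_UUID",
        "voice_uuid": "YOUR_VOICE_UUID",
        "data": "Hello from the streaming endpoint.",
    },
    stream=True,  # do not buffer the whole response body
)
response.raise_for_status()

with open("clip.wav", "wb") as f:
    # Chunks arrive as they are synthesized; a real app could start
    # decoding and playing them back immediately instead of writing to disk.
    for chunk in response.iter_content(chunk_size=None):
        f.write(chunk)
```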
Try it out
Wav encoding
We take advantage of the RIFF format to encode additional data in the header of our wavs, such as:
- The size (in bytes) of the entire file
- The number of audio samples
- The sample rate
- The times at which characters will be pronounced
- The times at which phonemes will be pronounced
Typically, audio libraries will only parse the first three of these values. Therefore, if you wish to obtain the audio timestamps, you will either have to use one of our SDKs or handle the decoding yourself, using our wav specification.
A typical wav file starts with a RIFF chunk, followed by a format chunk and finally a data chunk. The wav files we return additionally have a cue chunk, a list chunk and ltxt chunks. These are located between the format and data chunks, and carry the timestamp information. If you are interested in the full specification of the wav format, see here.
Below is the specification of the wav format that we use. Multi-byte values are encoded in little-endian order. Integers are always unsigned, and strings are encoded as ASCII characters (with the exception of the text in the ltxt chunks, which is UTF-8).
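Every chunk below shares the same 8-byte header layout: a 4-character ASCII ID followed by a little-endian unsigned 32-bit size. As a starting point for the parsing sketches that follow, here is a small Python helper using only the standard struct module (the function name is illustrative, not part of an SDK):

```python
import struct

def read_chunk_header(f):
    """Read a 4-byte ASCII chunk ID and its little-endian uint32 size."""
    chunk_id = f.read(4).decode("ascii")
    (size,) = struct.unpack("<I", f.read(4))
    return chunk_id, size
```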
Header & Format chunks
Size | Description | Value |
---|---|---|
4 | RIFF ID | "RIFF" |
4 | Remaining file size (in bytes) after this read* | (file size) - 8 |
4 | RIFF type | "WAVE" |
4 | Format chunk ID | "fmt " (note the space) |
4 | Chunk data size | 16 |
2 | Compression code | 1 (corresponds to PCM) |
2 | Number of channels | 1 |
4 | Sample rate | 8,000 - 48,000 |
4 | Byte rate | 16,000 - 96,000 |
2 | Block align | 2 |
2 | Bits per sample | 16 |
* If you have an older model, the file size will be 0xFFFFFFFF instead. Please contact us if this is the case for you, and we will upgrade your model.
After having parsed these chunks, you will know the sample rate and the total size of the audio file in bytes. Do not use this size to approximate the remaining audio duration; the data chunk will give you the exact length of the audio.
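As a sketch of how the header and format chunks can be parsed, assuming a file-like object positioned at the start of the wav data (names are illustrative):

```python
import struct

def parse_header(f):
    """Parse the RIFF header and "fmt " chunk laid out in the tables above."""
    assert f.read(4) == b"RIFF"
    (remaining_size,) = struct.unpack("<I", f.read(4))  # file size - 8 (0xFFFFFFFF on older models)
    assert f.read(4) == b"WAVE"

    assert f.read(4) == b"fmt "
    (fmt_size,) = struct.unpack("<I", f.read(4))  # always 16 here
    (compression, channels, sample_rate, byte_rate,
     block_align, bits_per_sample) = struct.unpack("<HHIIHH", f.read(16))
    return sample_rate
```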
Timestamps (cue, list & ltxt chunks)
Cue
Size | Description | Value |
---|---|---|
4 | Cue chunk ID | "cue " |
4 | Remaining size of the cue chunk after this read | 4 + n_cue_points * 24 |
4 | Number of remaining cue points | n_cue_points |
This chunk is then followed by n_cue_points cue points:
Cue points
Size | Description | Value |
---|---|---|
4 | Cue point ID | 0 - 0xFFFFFFFF |
4 | Unused | 0 |
4 | Unused | "data" |
8 | Unused | 0 |
4 | Sample offset | 0 - 0xFFFFFFFF |
Cue points are simply a list of time points (expressed as offsets in number of audio samples) which mark the start of graphemes or phonemes.
The graphemes and phonemes are registered in the list chunk.
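Continuing the sketch, the cue chunk can be parsed into a mapping from cue point ID to sample offset, which the ltxt chunks below refer back to:

```python
import struct

def parse_cue_chunk(f):
    """Parse the "cue " chunk into {cue_point_id: sample_offset}."""
    assert f.read(4) == b"cue "
    f.read(4)  # remaining chunk size: 4 + n_cue_points * 24
    (n_cue_points,) = struct.unpack("<I", f.read(4))

    offsets = {}
    for _ in range(n_cue_points):
        (cue_id,) = struct.unpack("<I", f.read(4))
        f.read(16)  # the three unused fields (4 + 4 + 8 bytes)
        (sample_offset,) = struct.unpack("<I", f.read(4))
        offsets[cue_id] = sample_offset
    return offsets
```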
List
Size | Description | Value |
---|---|---|
4 | List chunk ID | "list" |
4 | Remaining size of the list chunk after this read | 4 + (sum of ltxt chunk sizes) |
4 | Type ID | "adtl" |
This chunk is then followed by ltxt chunks:
LTXT
Size | Description | Value |
---|---|---|
4 | LTXT chunk ID | "ltxt" |
4 | Remaining size of this ltxt chunk after this read* | 20 + text_length |
4 | Cue point ID | 0 - 0xFFFFFFFF |
4 | Length in number of samples | 1 - 0xFFFFFFFF |
4 | Character type | "grph" OR "phon" |
8 | Unused | 0 |
text_length | The UTF-8 encoded text with a "\0" termination character | * |
* The wav specification requires that all chunks be aligned on multiples of the block align value (always 2 in our encoding). Therefore, if text_length is odd, you must skip an additional byte after having read the chunk.
Each LTXT chunk corresponds to a character or a phoneme (possibly with stress and duration characters). To get the starting position, take the sample offset of the cue point whose ID matches the one in the LTXT chunk. To get the ending position, add the LTXT chunk's length in number of samples to the starting position.
Graphemes are given first, in sequential order. They have the "grph" character type. Then, phonemes follow with the "phon" character type.
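Putting the list and ltxt layouts together, here is an illustrative parser that resolves each ltxt entry against the cue point offsets collected above. It assumes standard RIFF accounting, i.e. that the list chunk's size counts each ltxt chunk's 8-byte header and padding byte:

```python
import struct

def parse_timestamps(f, offsets):
    """Parse ltxt chunks into (text, kind, start_sample, end_sample) tuples."""
    assert f.read(4) == b"list"
    (list_size,) = struct.unpack("<I", f.read(4))
    assert f.read(4) == b"adtl"
    remaining = list_size - 4  # bytes of ltxt chunks still to read

    timestamps = []
    while remaining > 0:
        assert f.read(4) == b"ltxt"
        (ltxt_size,) = struct.unpack("<I", f.read(4))  # 20 + text_length
        cue_id, n_samples = struct.unpack("<II", f.read(8))
        kind = f.read(4).decode("ascii")  # "grph" or "phon"
        f.read(8)  # unused
        text_length = ltxt_size - 20
        text = f.read(text_length).rstrip(b"\0").decode("utf-8")
        padding = text_length % 2  # chunks are aligned on 2 bytes
        if padding:
            f.read(1)
        start = offsets[cue_id]
        timestamps.append((text, kind, start, start + n_samples))
        remaining -= 8 + ltxt_size + padding
    return timestamps
```

Dividing the start and end sample positions by the sample rate from the format chunk converts them to seconds.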
Audio data chunk
Finally, the audio data chunk follows.
Data
Size | Description | Value |
---|---|---|
4 | Data chunk ID | "data" |
4 | Size of the remaining audio data in bytes (number of samples * 2) | wav_length * 2 |
The remainder of the wav file is the PCM 16 encoded audio data.
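To finish the sketch, the data chunk header gives the exact number of samples, and therefore the exact audio duration, before any samples are decoded. In a live stream the PCM bytes arrive incrementally, so a real player would decode and play them as they come in rather than reading them all at once:

```python
import struct

def parse_data_chunk(f, sample_rate):
    """Read the data chunk header and decode the PCM 16 samples."""
    assert f.read(4) == b"data"
    (data_size,) = struct.unpack("<I", f.read(4))  # number of samples * 2
    n_samples = data_size // 2
    duration_seconds = n_samples / sample_rate  # exact audio duration
    samples = struct.unpack(f"<{n_samples}h", f.read(data_size))
    return samples, duration_seconds
```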