Stream a clip
The Streaming API is currently in beta and is not available to all users. Please reach out to firstname.lastname@example.org for more information.
This endpoint streams a new clip and returns the audio data (wav format) through a stream. With streaming, audio is generated sequentially and sent in chunks. The delay before receiving the first bits of audio is shorter than the time it takes to perform a full synchronous synthesis, regardless of the size of your query. Implementing streaming in your app will make it more responsive.
Additionally, streaming synthesis will yield timestamp information (albeit in a different format than synchronous synthesis) and exact audio duration in the first chunk of bytes streamed, prior to any audio bytes.
Notice that the request is sent to our synthesis servers instead of
|Parameter||Type||Description|
|project_uuid||string||UUID of the project to which the clip should belong|
|voice_uuid||string||UUID of the voice to use for synthesizing|
|data||string||Content to be synthesized. At the moment, SSML is only partially supported.|
A successful response contains bytes which make up a single channel PCM 16 wav file. It can be decoded and played back on the fly.
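Because the audio is PCM 16, every sample takes two bytes, but a chunk received from the network may end in the middle of a sample. The following is a minimal Python sketch of a decoder that carries the odd byte over to the next chunk. The class name `Pcm16StreamDecoder` is hypothetical, and the sketch assumes that the header chunks of the wav file have already been consumed, so that only audio bytes are fed in.

```python
import struct


class Pcm16StreamDecoder:
    """Turns arbitrarily sized byte chunks into 16-bit PCM samples (hypothetical helper)."""

    def __init__(self):
        self._leftover = b""  # at most one byte carried over between chunks

    def feed(self, chunk: bytes) -> list:
        data = self._leftover + chunk
        usable = len(data) - (len(data) % 2)  # keep whole samples only
        self._leftover = data[usable:]
        # "<h" = little-endian signed 16-bit integer, the usual encoding
        # of 16-bit PCM samples in wav files.
        return [sample for (sample,) in struct.iter_unpack("<h", data[:usable])]
```

Each call to `feed` returns the samples that are complete so far, so they can be handed to an audio output as soon as they arrive.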
We take advantage of the RIFF format to encode additional data in the header of our wavs, such as:
- The size (in bytes) of the entire file
- The number of audio samples
- The sample rate
- The times at which characters will be pronounced
- The times at which phonemes will be pronounced
Typically, audio libraries will only parse the first three of these values. Therefore, if you wish to obtain the audio timestamps, you will have to either use one of our SDKs or handle the decoding yourself, using our wav specification.
A typical wav file starts with a RIFF chunk, followed by a format chunk and finally a data chunk. The wav files we return additionally have a cue chunk, a list chunk and ltxt chunks. These are located between the format and data chunks, and carry the timestamp information. If you are interested in the full specification of the wav format, see here.
Below is the specification of the wav format that we use. Bytes are encoded in little-endian order. Integers are always unsigned and strings are encoded as ASCII characters (with the exception of the ltxt chunks).
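As an illustration of these encoding rules, here is a small Python sketch built on the standard `struct` module. The helper names `read_u16`, `read_u32` and `read_id` are hypothetical and only serve this example.

```python
import struct


def read_u16(buf: bytes, offset: int) -> int:
    """Read a 2-byte little-endian unsigned integer."""
    return struct.unpack_from("<H", buf, offset)[0]


def read_u32(buf: bytes, offset: int) -> int:
    """Read a 4-byte little-endian unsigned integer."""
    return struct.unpack_from("<I", buf, offset)[0]


def read_id(buf: bytes, offset: int) -> str:
    """Read a 4-character ASCII chunk ID such as "fmt " or "data"."""
    return buf[offset:offset + 4].decode("ascii")
```

The `<` prefix selects little-endian byte order without padding, which matches the byte layout described in the tables of this specification.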
Header & Format chunks
|Size (bytes)||Description||Value|
|4||RIFF chunk ID||"RIFF"|
|4||Remaining file size (in bytes) after this read*||(file size) - 8|
|4||RIFF type||"WAVE"|
|4||Format chunk ID||"fmt " (note the space)|
|4||Chunk data size||16|
|2||Compression code||1 (corresponds to PCM)|
|2||Number of channels||1|
|4||Sample rate||8,000 - 48,000|
|4||Byte rate||16,000 - 96,000|
|2||Block align||2|
|2||Bits per sample||16|
* If you have an older model, the file size will be 0xFFFFFFFF instead. If this is your case, please contact us and we will upgrade your model.
After having parsed these chunks, you will know the sample rate and the total size of the audio file in bytes. Do not use this size to approximate the remaining audio duration: the data chunk will give you the exact length of the audio.
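The parsing of these two chunks could be sketched in Python as follows. The sketch assumes the standard RIFF layout: a 4-byte "RIFF" ID before the size field, a 4-byte "WAVE" type after it, and a 2-byte block align field before the bits per sample, which together give a 36-byte header. The function name `parse_header` and the keys of the returned dictionary are hypothetical.

```python
import struct

HEADER_SIZE = 36  # 12-byte RIFF header + 8-byte chunk header + 16-byte format body


def parse_header(buf: bytes) -> dict:
    """Parse the RIFF header and the format chunk at the start of `buf`."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("need at least 36 bytes to parse the header")
    riff_id, remaining, wave_id = struct.unpack_from("<4sI4s", buf, 0)
    fmt_id, fmt_size = struct.unpack_from("<4sI", buf, 12)
    if riff_id != b"RIFF" or wave_id != b"WAVE" or fmt_id != b"fmt ":
        raise ValueError("not a wav header")
    (compression, channels, sample_rate,
     byte_rate, block_align, bits_per_sample) = struct.unpack_from("<HHIIHH", buf, 20)
    return {
        # Older models report 0xFFFFFFFF, in which case the size is unknown.
        "file_size": None if remaining == 0xFFFFFFFF else remaining + 8,
        "compression": compression,
        "channels": channels,
        "sample_rate": sample_rate,
        "byte_rate": byte_rate,
        "block_align": block_align,
        "bits_per_sample": bits_per_sample,
    }
```

With a streamed response, you would buffer incoming bytes until at least `HEADER_SIZE` of them are available before calling this function.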
Timestamps (cue, list & ltxt chunks)
|Size (bytes)||Description||Value|
|4||Cue chunk ID||"cue "|
|4||Remaining size of the cue chunk after this read||4 + |
|4||Number of remaining cue points|
This chunk is then followed by n_cue_points cue points:
|Size (bytes)||Description||Value|
|4||Cue point ID||0 - 0xFFFFFFFF|
|4||Sample offset||0 - 0xFFFFFFFF|
Cue points are simply a list of time points (expressed as offsets in number of audio samples) which mark the start of graphemes or phonemes.
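A cue chunk could be read along the following lines in Python. This is a sketch under stated assumptions rather than a reference implementation: instead of hard-coding the size of a cue point, it derives it from the chunk size and the number of cue points, and it assumes that the cue point ID is the first 4-byte field of each cue point and the sample offset the last one. The function name `parse_cue_chunk` is hypothetical.

```python
import struct


def parse_cue_chunk(buf: bytes, offset: int = 0) -> dict:
    """Return {cue point ID: sample offset} for the cue chunk starting at `offset`."""
    chunk_id, size, n_cue_points = struct.unpack_from("<4sII", buf, offset)
    if chunk_id != b"cue ":
        raise ValueError("not a cue chunk")
    if n_cue_points == 0:
        return {}
    # Assumption: every cue point has the same size, derived from the chunk size.
    record_size, rest = divmod(size - 4, n_cue_points)
    if rest or record_size < 8:
        raise ValueError("unexpected cue chunk size")
    points = {}
    first = offset + 12  # skip the chunk ID, the chunk size and the cue point count
    for i in range(n_cue_points):
        rec = first + i * record_size
        cue_id = struct.unpack_from("<I", buf, rec)[0]
        # Assumption: the sample offset is the last 4-byte field of the cue point.
        sample_offset = struct.unpack_from("<I", buf, rec + record_size - 4)[0]
        points[cue_id] = sample_offset
    return points
```

Dividing a sample offset by the sample rate gives the corresponding time in seconds.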
The graphemes and phonemes are registered in the list chunk.
|Size (bytes)||Description||Value|
|4||List chunk ID||"list"|
|4||Remaining size of the list chunk after this read||4 + (sum of ltxt chunk sizes)|
This chunk is then followed by ltxt chunks:
|Size (bytes)||Description||Value|
|4||LTXT chunk ID||"ltxt"|
|4||Remaining size of this ltxt chunk after this read*||20 + |
|4||Cue point ID||0 - 0xFFFFFFFF|
|4||Length in number of samples||1 - 0xFFFFFFFF|
|4||Character type||"grph" OR "phon"|
|The UTF-8 encoded text with a "\0" termination character||*|
* The wav specification requires that all chunks be aligned on multiples of the block align value (always 2 in our encoding). Therefore, if text_length is odd, you must skip an additional byte after having read the chunk.
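The alignment rule can be illustrated with a short Python sketch that walks over consecutive chunks in a buffer and skips one padding byte after every chunk whose size is odd. The generator name `iter_chunks` is hypothetical, and the sketch assumes that the size field of a chunk does not count the padding byte.

```python
import struct


def iter_chunks(buf: bytes, offset: int = 0):
    """Yield (chunk ID, chunk body) for consecutive chunks, honoring 2-byte alignment."""
    while offset + 8 <= len(buf):
        chunk_id, size = struct.unpack_from("<4sI", buf, offset)
        body_start = offset + 8
        yield chunk_id, buf[body_start:body_start + size]
        # Move past the body, plus one padding byte when the size is odd.
        offset = body_start + size + (size % 2)
```

Forgetting the padding byte would shift every following read by one byte, so that the next chunk ID would no longer be recognized.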
Each LTXT chunk corresponds to a character or a phoneme (possibly with stress and duration characters). To get the starting position, take the cue point with the same ID as in the LTXT chunk and take its sample offset. To get the ending position, add the length in number of samples from the LTXT chunk to the starting position.
Graphemes are given first, in sequential order. They have the "grph" character type. Then, phonemes follow with the "phon" character type.
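Once the cue points and the LTXT chunks have been decoded, they can be combined into timestamps as described above. The Python sketch below assumes that the cue points are available as a mapping from cue point ID to sample offset, and the LTXT chunks as tuples of cue point ID, length in samples, character type and text. These input shapes, the `Timestamp` type and the `build_timestamps` function are hypothetical.

```python
from typing import NamedTuple


class Timestamp(NamedTuple):
    kind: str     # "grph" or "phon"
    text: str     # the character or phoneme
    start: float  # start time in seconds
    end: float    # end time in seconds


def build_timestamps(cue_points: dict, ltxt_entries: list, sample_rate: int) -> list:
    """Combine cue points and LTXT entries into timestamps expressed in seconds."""
    result = []
    for cue_id, length, kind, text in ltxt_entries:
        start = cue_points[cue_id]  # sample offset of the matching cue point
        end = start + length        # the length is given in samples
        result.append(Timestamp(kind, text, start / sample_rate, end / sample_rate))
    return result
```

Since graphemes come before phonemes, filtering the result on `kind` separates the two sequences while preserving their order.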
Audio data chunk
Finally, the audio data chunk follows.
|Size (bytes)||Description||Value|
|4||Data chunk ID||"data"|
|4||Data chunk size (in bytes)||Number of remaining audio samples * 2|
The remainder of the wav file consists of the PCM 16 encoded audio bytes.
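Since each sample takes two bytes, the exact audio duration follows directly from the size field of the data chunk and the sample rate read from the format chunk. A minimal Python sketch, in which the function name `parse_data_chunk_header` is hypothetical:

```python
import struct


def parse_data_chunk_header(buf: bytes, offset: int, sample_rate: int) -> tuple:
    """Return (number of samples, duration in seconds) for the data chunk at `offset`."""
    chunk_id, size = struct.unpack_from("<4sI", buf, offset)
    if chunk_id != b"data":
        raise ValueError("not a data chunk")
    n_samples = size // 2  # PCM 16, single channel: two bytes per sample
    return n_samples, n_samples / sample_rate
```

The audio bytes start 8 bytes after `offset`, right after the chunk ID and the size field.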