Direct Synthesis
Overview
For time-sensitive content delivery, streaming synthesis achieves the lowest time-to-first-sound. For documentation on streaming synthesis, see "Stream a clip".
For some applications, streaming is not an option. In these cases, the fastest time-to-first-sound can be achieved by sending synchronous requests directly to our synthesis servers. We call this direct synthesis. Summarized below are the steps required to use this API:
- Make a request to the direct synthesis endpoint. (Note: this is not the usual https://app.resemble.ai/api/... URL.)
- Decode the base64 “audio_content” attribute sent back in the response.
- Use the audio data.
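The decode step can be sketched as a small helper, assuming the response body shape documented under "HTTP Response" below; the sample response here is fabricated for illustration:

```python
import base64

def decode_audio(response_json):
    """Extract the raw audio bytes from a direct synthesis response.

    Assumes the documented response shape: a base64 "audio_content"
    string plus "success" and "issues" fields.
    """
    if not response_json.get("success"):
        raise RuntimeError(f"synthesis failed: {response_json.get('issues')}")
    return base64.b64decode(response_json["audio_content"])

# Simulated response -- a real one comes back from the POST request.
fake_response = {
    "audio_content": base64.b64encode(b"RIFF...wav bytes").decode("ascii"),
    "issues": [],
    "success": True,
}
audio = decode_audio(fake_response)  # b"RIFF...wav bytes"
```

The decoded bytes can then be written to a file or fed directly to an audio player.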
HTTP Request
```shell
curl --request POST "YOUR_SYNTH_ENDPOINT" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept-Encoding: gzip, deflate, br" \
  --data '{
    "voice_uuid": <Voice to synthesize in>,
    "project_uuid": <Project to save to>,
    "title": <Title of the clip>,
    "data": <Text to synthesize>,
    "precision": "MULAW|PCM_16|PCM_24|PCM_32 (default)",
    "output_format": "mp3|wav (default)"
  }'
```
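For reference, the same request can be assembled with Python's standard library. The endpoint URL, token, and UUIDs below are placeholders, not real values:

```python
import json
import urllib.request

# Placeholders -- substitute your synthesis endpoint, API token, and UUIDs.
payload = {
    "voice_uuid": "your-voice-uuid",
    "project_uuid": "your-project-uuid",
    "title": "My clip",
    "data": "Hello world.",
    "precision": "PCM_32",
    "output_format": "wav",
}
req = urllib.request.Request(
    "https://YOUR_SYNTH_ENDPOINT",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": "Bearer YOUR_API_TOKEN",
        "Content-Type": "application/json",
        "Accept-Encoding": "gzip, deflate, br",
    },
    method="POST",
)
# urllib.request.urlopen(req) sends the request; the JSON body is then
# available via json.loads(response.read()).
```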
Request Headers
Header | Value | Description |
---|---|---|
Authorization | Bearer YOUR_API_TOKEN | API token can be obtained by logging into the Resemble web application and navigating to the API section. |
Accept-Encoding | gzip, deflate, br | Either one of gzip, deflate, or br depending on the decompression algorithms your application supports. Omitting the Accept-Encoding header will disable compression. |
Request Body
Attribute | Type | Description |
---|---|---|
voice_uuid | string | The voice to synthesize the text in. |
project_uuid | string | The project to save the data to. |
title | string | The title of the clip. Optional; defaults to Direct Synthesis {some-uuid}. |
data | string | The text or SSML to synthesize. |
precision | string | The bit-depth of the generated wav file (if using wav as the response type). Either MULAW, PCM_16, PCM_24, or PCM_32 (default). |
output_format | string | The output format of the produced audio. Either wav (default) or mp3. |
sample_rate | integer | The sample rate of the produced audio. Either 8000, 16000, 22050, 32000, or 44100. |
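The accepted values in the table above can be checked client-side before making a request. This helper is illustrative, not part of the API:

```python
# Accepted values, taken from the request body table above.
PRECISIONS = {"MULAW", "PCM_16", "PCM_24", "PCM_32"}
OUTPUT_FORMATS = {"wav", "mp3"}
SAMPLE_RATES = {8000, 16000, 22050, 32000, 44100}

def validate_options(precision, output_format, sample_rate):
    """Raise ValueError for values the endpoint does not document."""
    if precision not in PRECISIONS:
        raise ValueError(f"unsupported precision: {precision!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate!r}")

validate_options("PCM_16", "mp3", 22050)  # passes silently
```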
HTTP Response
```
{
  "audio_content": <base64 encoded string of the raw audio bytes>,
  "phoneme_timestamps": {
    "phonemes": <string of phonemes>,
    "end_times": float[],
    "phoneme_chars": char[]
  },
  "issues": string[],
  "success": boolean
}
```
Response Body
Attribute | Type | Description |
---|---|---|
audio_content | string | Base64 encoded string. When decoded it will contain the byte array containing the audio. |
phoneme_timestamps | object | Object containing phoneme_timestamp information. See section below for further information. |
issues | string[] | Any issues pertaining to the synthesis response. |
success | boolean | True if the response was successful, false otherwise. |
Phoneme Timestamps Object
Attribute | Type | Description |
---|---|---|
phonemes | string | A string containing all the phonemes of the synthesized audio. |
end_times | float[] | An array of floats mapping 1 to 1 with the phoneme_chars array. Each index holds the end time in the audio of the phoneme character at the same index in phoneme_chars. |
phoneme_chars | char[] | An array of characters mapping 1 to 1 with the end_times array. |
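Since each end time marks where a phoneme finishes, start times can be derived from the preceding entry. A sketch, using the field names from the response example above and made-up timing data:

```python
def phoneme_spans(timestamps):
    """Pair each phoneme character with a (start, end) interval,
    taking each start from the previous phoneme's end time."""
    chars = timestamps["phoneme_chars"]
    ends = timestamps["end_times"]
    starts = [0.0] + ends[:-1]
    return list(zip(chars, starts, ends))

# Made-up example data.
example = {
    "phonemes": "hel",
    "phoneme_chars": ["h", "e", "l"],
    "end_times": [0.10, 0.25, 0.40],
}
print(phoneme_spans(example))
# [('h', 0.0, 0.1), ('e', 0.1, 0.25), ('l', 0.25, 0.4)]
```

Intervals like these are useful for aligning captions or lip-sync animation with the generated audio.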