Version: 2.0.0

Direct Synthesis


For time-sensitive content delivery, streaming synthesis achieves the lowest time-to-first-sound. For documentation on streaming synthesis, see "Stream a clip".

For some applications, streaming is not an option. In these cases, the fastest time-to-first-sound can be achieved by sending synchronous requests directly to our synthesis servers. We call this direct synthesis. Summarized below are the steps required to use this API:

  • Make a request to the direct synthesis endpoint. (Note: this is not the usual URL.)
  • Decode the base64 “audio_content” attribute sent back in the response.
  • Use the audio data.
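The three steps above can be sketched in Python using only the standard library. The endpoint URL, token, and UUIDs are placeholders, and the function names are illustrative, not part of the API:

```python
import base64
import json
import urllib.request

# Placeholder endpoint and token -- substitute your own values.
SYNTH_ENDPOINT = "https://YOUR_SYNTH_ENDPOINT"
API_TOKEN = "YOUR_API_TOKEN"

def build_request(text: str, voice_uuid: str, project_uuid: str) -> urllib.request.Request:
    """Step 1: build the POST request for the direct synthesis endpoint."""
    payload = json.dumps({
        "voice_uuid": voice_uuid,
        "project_uuid": project_uuid,
        "data": text,
        "output_format": "wav",
    }).encode("utf-8")
    return urllib.request.Request(
        SYNTH_ENDPOINT,
        data=payload,
        headers={
            "Authorization": "Bearer " + API_TOKEN,
            "Content-Type": "application/json",
            "Accept-Encoding": "gzip, deflate, br",
        },
        method="POST",
    )

def decode_audio(response_body: dict) -> bytes:
    """Step 2: decode the base64 "audio_content" attribute."""
    return base64.b64decode(response_body["audio_content"])

# Step 3: use the audio data, e.g. save it to disk:
# with urllib.request.urlopen(build_request("Hello!", "voice-uuid", "project-uuid")) as resp:
#     audio = decode_audio(json.load(resp))
# with open("clip.wav", "wb") as f:
#     f.write(audio)
```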

HTTP Request

curl --request POST "YOUR_SYNTH_ENDPOINT" \
     -H "Authorization: Bearer YOUR_API_TOKEN" \
     -H "Content-Type: application/json" \
     -H "Accept-Encoding: gzip, deflate, br" \
     --data '{
       "voice_uuid": <Voice to synthesize in>,
       "project_uuid": <Project to save to>,
       "title": <Title of the clip>,
       "data": <Text to synthesize>,
       "precision": "MULAW|PCM_16|PCM_24|PCM_32 (default)",
       "output_format": "mp3|wav (default)"
     }'

Request Headers

Authorization: Bearer YOUR_API_TOKEN
An API token can be obtained by logging into the Resemble web application and navigating to the API section.

Accept-Encoding: gzip, deflate, br
Any of gzip, deflate, or br, depending on the decompression algorithms your application supports. Omitting the Accept-Encoding header disables compression.
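High-level clients such as curl decompress the response transparently, but if you issue the request with a lower-level client you may need to reverse the encoding yourself. A minimal sketch for the gzip and deflate cases (br requires a third-party Brotli library and is not handled here):

```python
import gzip
import zlib

def decompress_body(raw: bytes, content_encoding: str) -> bytes:
    """Reverse the compression negotiated via the Accept-Encoding header."""
    if content_encoding == "gzip":
        return gzip.decompress(raw)
    if content_encoding == "deflate":
        return zlib.decompress(raw)
    # No Content-Encoding header (compression disabled) -- body is already raw.
    return raw
```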

Request Body

voice_uuid (string): The voice to synthesize the text in.
project_uuid (string): The project to save the clip to.
title (string): The title of the clip. Optional; defaults to Direct Synthesis {some-uuid}.
data (string): The text or SSML to synthesize.
precision (string): The bit depth of the generated wav file (if using wav as the output format). One of MULAW, PCM_16, PCM_24, or PCM_32 (default).
output_format (string): The output format of the produced audio. Either wav (default) or mp3.
sample_rate (integer): The sample rate of the produced audio. One of 8000, 16000, 22050, 32000, or 44100.
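A small helper can apply the documented defaults and reject values outside the documented sets before the request is sent. This is a sketch, not part of the API; `build_body` is a hypothetical name, and since the server-side default for sample_rate is not documented, the field is omitted unless explicitly given:

```python
ALLOWED_PRECISION = {"MULAW", "PCM_16", "PCM_24", "PCM_32"}
ALLOWED_OUTPUT_FORMAT = {"wav", "mp3"}
ALLOWED_SAMPLE_RATE = {8000, 16000, 22050, 32000, 44100}

def build_body(voice_uuid, project_uuid, data, title=None,
               precision="PCM_32", output_format="wav", sample_rate=None):
    """Assemble the request body with the documented defaults applied."""
    if precision not in ALLOWED_PRECISION:
        raise ValueError(f"precision must be one of {sorted(ALLOWED_PRECISION)}")
    if output_format not in ALLOWED_OUTPUT_FORMAT:
        raise ValueError(f"output_format must be one of {sorted(ALLOWED_OUTPUT_FORMAT)}")
    if sample_rate is not None and sample_rate not in ALLOWED_SAMPLE_RATE:
        raise ValueError(f"sample_rate must be one of {sorted(ALLOWED_SAMPLE_RATE)}")
    body = {
        "voice_uuid": voice_uuid,
        "project_uuid": project_uuid,
        "data": data,
        "precision": precision,
        "output_format": output_format,
    }
    if title is not None:        # optional; the server names the clip otherwise
        body["title"] = title
    if sample_rate is not None:  # optional; server-side default is undocumented
        body["sample_rate"] = sample_rate
    return body
```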

HTTP Response

"audio_content": <base64 encoded string of the raw audio bytes>,
"phoneme_timestamps": {
"phonemes": <string of phonemes>,
"end_times": float[],
"phoneme_chars": char[]
issues: string[],
success: boolean

Response Body

audio_content (string): Base64 encoded string. When decoded, it contains the byte array of the audio.
phoneme_timestamps (object): Object containing phoneme timestamp information. See the section below for further information.
issues (string[]): Any issues pertaining to the synthesis response.
success (boolean): True if the synthesis was successful, false otherwise.
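A typical consumer checks success, surfaces any issues, and only then decodes the audio. A minimal sketch (the function name is illustrative, not part of the API):

```python
import base64

def handle_response(body: dict) -> bytes:
    """Validate the response body, then return the decoded audio bytes."""
    if not body.get("success"):
        raise RuntimeError(f"synthesis failed: {body.get('issues')}")
    for issue in body.get("issues") or []:
        print("synthesis issue:", issue)  # non-fatal diagnostics
    return base64.b64decode(body["audio_content"])
```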

Audio Timestamps Object

graph_times (float[]): An array of floats mapping 1-to-1 with the graph_chars array. Each index represents the end time in the audio of the grapheme character at the same index in the graph_chars array.
phon_times (float[]): An array of floats mapping 1-to-1 with the phon_chars array. Each index represents the end time in the audio of the phoneme character at the same index in the phon_chars array.
phon_chars (char[]): An array of phoneme characters mapping 1-to-1 with the phon_times array.
graph_chars (char[]): An array of grapheme characters mapping 1-to-1 with the graph_times array.
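Because the chars and times arrays map 1-to-1, pairing them yields a per-phoneme timeline. A sketch, using the field names documented above and made-up example values:

```python
def phoneme_timeline(timestamps: dict):
    """Pair each phoneme character with its end time (seconds) in the audio."""
    return list(zip(timestamps["phon_chars"], timestamps["phon_times"]))

# Illustrative, made-up data:
# phoneme_timeline({"phon_chars": ["h", "a", "i"],
#                   "phon_times": [0.05, 0.12, 0.21]})
# -> [("h", 0.05), ("a", 0.12), ("i", 0.21)]
```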