Speech to Speech
Overview
Summarized below are the steps required to use this API:
- Make a request to the low latency synthesis endpoint.
- Decode the base64 “audio_content” attribute sent back in the response.
- Use the audio data.
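The three steps above might look like the following in Python. This is a minimal sketch, not an official client: it assumes the requests library is available and reuses the voice_uuid and source audio URL from the curl example in the next section.

```python
import base64
import requests

API_TOKEN = "YOUR_API_TOKEN"  # replace with your actual API token
URL = "https://f.cluster.resemble.ai/synthesize"

# 1. Make a request to the low latency synthesis endpoint.
#    requests transparently decompresses the gzip-encoded response body.
response = requests.post(
    URL,
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
        "Accept-Encoding": "gzip",
    },
    json={
        "voice_uuid": "55592656",  # example voice from the curl example below
        "data": '<resemble:convert src="https://storage.googleapis.com/resemble-ai-docs-public-files/sts-donor-example.wav"></resemble:convert>',
        "sample_rate": 48000,
        "output_format": "wav",
    },
)
body = response.json()

# 2. Decode the base64 "audio_content" attribute sent back in the response.
audio_bytes = base64.b64decode(body["audio_content"])

# 3. Use the audio data, e.g. write it to disk.
with open("output.wav", "wb") as f:
    f.write(audio_bytes)
```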
Example
Copy and paste the following into your terminal, replace YOUR_API_TOKEN with your actual API token, then press Enter:
curl --request POST 'https://f.cluster.resemble.ai/synthesize' \
-H 'Authorization: Bearer YOUR_API_TOKEN' \
-H 'Content-Type: application/json' \
-H 'Accept-Encoding: gzip' \
--data '{
"voice_uuid": "55592656",
"data": "<resemble:convert src=\"https://storage.googleapis.com/resemble-ai-docs-public-files/sts-donor-example.wav\"></resemble:convert>",
"sample_rate": 48000,
"output_format": "wav"
}'
HTTP Request
curl --request POST "https://f.cluster.resemble.ai/synthesize"
-H "Authorization: Bearer YOUR_API_TOKEN"
-H "Content-Type: application/json"
-H "Accept-Encoding: gzip, deflate, br"
--data '{
"voice_uuid": <Voice to synthesize in>,
"project_uuid": <Project to save to>,
"title": <Title of the clip>,
"data": <Text to synthesize>,
"precision": "MULAW|PCM_16|PCM_24|PCM_32 (default)"
"output_format": "mp3|wav (default)"
}'
Request Headers
Header | Value | Description |
---|---|---|
Authorization | Bearer YOUR_API_TOKEN | API token can be obtained by logging into the Resemble web application and navigating to the API section. |
Accept-Encoding | gzip, deflate, br | Either one of gzip, deflate, or br depending on the decompression algorithms your application supports. Omitting the Accept-Encoding header will disable compression. |
Request Body
Attribute | Type | Description |
---|---|---|
voice_uuid | string | The voice to synthesize the text in. |
project_uuid | string | The project to save the data to. |
title | string | The title of the clip. This is optional; by default, the clip is named Low Latency Synthesis {some-uuid}.
data | string | The text or SSML to synthesize. Maximum file size of 50 MB or maximum duration of 5 minutes.
precision | string | The bit-depth of the generated wav file (if using wav as the response type). Either MULAW, PCM_16, PCM_24, or PCM_32 (default). |
output_format | string | The output format of the produced audio. Either wav, or mp3. |
sample_rate | integer | The sample rate of the produced audio. Either 8000, 16000, 22050, 32000, or 44100.
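To illustrate how these attributes fit together, a full request body might look like the following. This is only a sketch; the UUIDs, title, and source URL are hypothetical placeholders, not real values.

```python
# Hypothetical request body illustrating the attributes above.
payload = {
    "voice_uuid": "your-voice-uuid",       # required: voice to synthesize in
    "project_uuid": "your-project-uuid",   # project to save the clip to
    "title": "My speech-to-speech clip",   # optional: defaults to "Low Latency Synthesis {some-uuid}"
    "data": '<resemble:convert src="https://example.com/source.wav"></resemble:convert>',
    "precision": "PCM_16",                 # MULAW, PCM_16, PCM_24, or PCM_32 (default); wav output only
    "output_format": "wav",                # wav (default) or mp3
    "sample_rate": 44100,                  # 8000, 16000, 22050, 32000, or 44100
}
```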
HTTP Response
{
  "audio_content": <base64 encoded string of the raw audio bytes>,
  "audio_timestamps": {
    "graph_chars": string[],
    "graph_times": float[][],
    "phon_chars": string[],
    "phon_times": float[][],
  },
  "duration": float,
  "issues": string[],
  "output_format": string,
  "sample_rate": float,
  "success": boolean,
  "synth_duration": float,
  "title": string|null
}
Response Body
Attribute | Type | Description |
---|---|---|
audio_content | string | Base64 encoded string. When decoded it will contain the byte array containing the audio. |
audio_timestamps | object | Object containing grapheme and phoneme timing information. See the Audio Timestamps Object section below for further information.
duration | float | The duration of the produced audio file. Resemble does not bill on this value. |
issues | string[] | Any issues pertaining to the synthesis response. |
output_format | string | The output format of the produced audio. Either 'wav', or 'mp3'. |
sample_rate | integer | The sample rate of the produced audio. Either 8000, 16000, 22050, 32000, or 44100. |
success | boolean | True if the response was successful, false otherwise. |
synth_duration | float | The duration of the raw audio file produced, before any post-processing effects are applied (e.g. the 'prosody' tag, which may increase or decrease the duration of the final audio file). Resemble bills on this value.
title | string | The title of the clip. If no title is provided in the request body, then the value will be null.
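A response could be handled along these lines. This is a minimal sketch assuming body is the parsed JSON response (for example, response.json() from the Python sketch in the Overview); it simply inspects the fields described above and decodes the audio.

```python
import base64

def handle_response(body: dict) -> bytes:
    """Inspect the response fields described above and return the decoded audio."""
    if not body["success"]:
        # Any problems reported by the API are listed under "issues".
        raise RuntimeError(f"Synthesis failed: {body.get('issues')}")

    for issue in body.get("issues") or []:
        print("issue:", issue)

    # synth_duration is the billed duration of the raw audio;
    # duration is the length of the final, post-processed audio.
    print(f"billed {body['synth_duration']}s, final audio {body['duration']}s "
          f"({body['output_format']}, {body['sample_rate']} Hz)")

    return base64.b64decode(body["audio_content"])
```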
Audio Timestamps Object
Attribute | Type | Description |
---|---|---|
graph_chars | string[] | An array of grapheme (character) strings for the synthesized text, mapping 1 to 1 with the graph_times array.
graph_times | float[][] | An array of timing values mapping 1 to 1 with the graph_chars array. Each entry gives the time(s) in the audio at which the grapheme at the same index occurs.
phon_chars | string[] | An array of phoneme strings, mapping 1 to 1 with the phon_times array.
phon_times | float[][] | An array of timing values mapping 1 to 1 with the phon_chars array. Each entry gives the time(s) in the audio at which the phoneme at the same index occurs.
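For example, the phoneme arrays could be paired up as follows. This is a sketch assuming the 1-to-1 mapping described above, where each entry of phon_times holds the timing value(s) for the phoneme at the same index of phon_chars.

```python
def print_phoneme_timings(audio_timestamps: dict) -> None:
    """Pair each phoneme with its timing entry (assumed 1-to-1 mapping)."""
    phon_chars = audio_timestamps["phon_chars"]
    phon_times = audio_timestamps["phon_times"]
    for phoneme, times in zip(phon_chars, phon_times):
        print(phoneme, times)

# Graphemes can be paired via graph_chars / graph_times in the same way.
```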