SSML Reference
You can use Speech Synthesis Markup Language (SSML) as input to control how Resemble generates speech. Resemble automatically handles normal punctuation, such as pausing after a period or speaking a sentence that ends with a question mark as a question. In some cases, however, you may want additional control over Resemble’s synthetic speech: for example, having certain words pronounced in a specific way, saying a word or sentence with excitement, or spelling certain words character by character.
SSML is a markup language that provides a standard way to mark up text for the generation of synthetic speech. The specific tags Resemble supports are listed in Supported SSML Tags.
Supported SSML tags
These are the SSML elements that Resemble supports. The speak element is required. All other elements are optional.
SSML Element | Required | Summary |
---|---|---|
speak | Yes | Required root element for the SSML document. |
prosody | No | Specifies the pitch, volume, and rate of a word. |
emphasis | No | Applies a pre-defined emphasis to a word. Emphasis is a pre-set combination of pitch and volume. |
say-as | No | Indicates the type of text contained in the element. For example, acronym. |
sub | No | Specifies the string of text to pronounce in place of the text contained in the element. |
break | No | Inserts a pause between words. |
language | No | Specifies the language in which to generate the content. |
audio | No | Allows recorded audio files to be inserted alongside the synthesized speech output. |
resemble:convert | No | Applies speech-to-speech given a source audio file. |
<speak> tag
The required root element of the SSML document.
Syntax
<speak version="float" xmlns="string" xml:lang="string"></speak>
Attributes
Attribute | Required | Description |
---|---|---|
version | No | Indicates the version of the SSML specification used to interpret the document markup. Defaults to v1.1. |
xml:lang | No | Specifies the language of the root document. The value may contain a lowercase, two-letter language code (for example, en), or the language code and an uppercase country/region code (for example, en-US). Defaults to en-us. |
xmlns | No | Specifies the URI to the document that defines the markup vocabulary. The current URI is http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis.xsd |
temperature | No | Controls the randomness of the generated output. The value ranges from 0.1 to 5. The default value is 0.8. |
exaggeration | No | Controls the intensity of emotion. The value ranges from 0.0 to 1.0. |
seed | No | A non-negative integer. Initializes the model for deterministic output. |
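The ranges in the table above can be checked before a request is sent. The following is a minimal Python sketch, not part of Resemble's tooling: the helper name `build_speak` is hypothetical, and it assumes the attributes are written as plain decimal strings on the `<speak>` root.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET


def build_speak(text, temperature=None, exaggeration=None, seed=None):
    """Wrap plain text in a <speak> root (hypothetical helper).

    Validates the ranges documented for temperature, exaggeration, and seed.
    """
    attrs = {}
    if temperature is not None:
        if not 0.1 <= temperature <= 5:
            raise ValueError("temperature must be between 0.1 and 5")
        attrs["temperature"] = str(temperature)
    if exaggeration is not None:
        if not 0.0 <= exaggeration <= 1.0:
            raise ValueError("exaggeration must be between 0.0 and 1.0")
        attrs["exaggeration"] = str(exaggeration)
    if seed is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(seed, bool) or not isinstance(seed, int) or seed < 0:
            raise ValueError("seed must be a non-negative integer")
        attrs["seed"] = str(seed)
    attr_text = "".join(f" {name}={quoteattr(value)}" for name, value in attrs.items())
    # escape() protects characters such as & and < in the spoken text.
    return f"<speak{attr_text}>{escape(text)}</speak>"


ssml = build_speak("Hello & welcome.", temperature=0.8, seed=42)
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Reusing the same seed is what the table describes as initializing the model for deterministic output, so keeping the seed alongside the text is one way to reproduce a generation later.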
<prosody> tag
An optional tag used to style the way synthesized speech sounds by specifying the pitch, rate, or volume.
Example
The sample has been generated using this input:
<speak>This part is normal. <prosody pitch="x-high">This part is going to sound high pitched</prosody>. <prosody rate="150%">This part is going to be spoken fast</prosody>. <prosody volume="loud">And this part is loud</prosody>!</speak>
Syntax
<prosody pitch="string" rate="string" volume="string"></prosody>
Attributes
Attribute | Required | Description |
---|---|---|
pitch | No | The baseline pitch of the synthesized speech. This must be one of the following values:
|
rate | No | The baseline speed of the synthesized speech. The rate must be a percent value. For example, 100% is normal speed, 50% is half as fast, and 200% is twice as fast. |
volume | No | Indicates the volume level of the synthesized speech. This must be one of the following values:
|
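Because rate is expressed as a percent string, it can help to derive it from a speed multiplier. The sketch below is illustrative only: `prosody` and `rate_from_multiplier` are hypothetical helper names, whole-number percentages are an assumption, and pitch and volume are passed through unchecked because their allowed values are not listed here.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def rate_from_multiplier(multiplier):
    """Convert a speed multiplier (1.5 means 1.5x) into a percent string."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return f"{round(multiplier * 100)}%"


def prosody(text, pitch=None, rate=None, volume=None):
    """Wrap text in a <prosody> element (hypothetical helper)."""
    attrs = []
    if pitch is not None:
        attrs.append(("pitch", pitch))  # not validated: value list not shown
    if rate is not None:
        # Assumption: rate is a whole-number percentage such as "150%".
        if not re.fullmatch(r"\d+%", rate):
            raise ValueError("rate must be a percent value such as '150%'")
        attrs.append(("rate", rate))
    if volume is not None:
        attrs.append(("volume", volume))  # not validated: value list not shown
    attr_text = "".join(f" {name}={quoteattr(value)}" for name, value in attrs)
    return f"<prosody{attr_text}>{escape(text)}</prosody>"


fast = prosody("This part is going to be spoken fast", rate=rate_from_multiplier(1.5))
print(f"<speak>This part is normal. {fast}.</speak>")
```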
<emphasis> tag
An optional tag that specifies the emphasis of the synthesized speech. Emphasis is a shortcut for applying a pre-defined combination of volume and pitch to the synthesized speech.
Example
The sample has been generated using this input:
<speak><emphasis level="reduced">I am more of a shy person really</emphasis>.</speak>
Syntax
<emphasis level="string"></emphasis>
Attributes
Attribute | Required | Description |
---|---|---|
level | No | Specifies the emphasis to apply to the text within the emphasis tag. The possible levels are:
|
<say-as> tag
An optional element that indicates the content type. This provides guidance to the speech synthesis AI about how to pronounce the text.
Example
The following sample has been generated using this input:
<speak>This <say-as interpret-as="characters">SSML</say-as> stuff is really cool!</speak>
Syntax
<say-as interpret-as="string"></say-as>
Attributes
Attribute | Required | Description |
---|---|---|
interpret-as | Yes | Indicates the content type of the element’s text. The only types that are currently supported are:
|
<sub> tag
An optional element that specifies a string of text that is pronounced in place of the element’s text.
Example
The following sample has been generated using this input:
<speak>Hi <sub alias="Joe">Jim</sub>, we are calling today to inform you of your account activation with Resemble.</speak>
Syntax
<sub alias="string"></sub>
Attributes
Attribute | Required | Description |
---|---|---|
alias | Yes | Specifies the substitute text to speak. |
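When the same substitutions recur across many prompts, the `<sub>` elements can be generated from a lookup table. This is a minimal sketch under stated assumptions: `sub` and `apply_aliases` are hypothetical helpers, and matching is done on whole words only.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def sub(written, spoken):
    """Build a <sub> element that speaks `spoken` in place of `written`."""
    return f"<sub alias={quoteattr(spoken)}>{escape(written)}</sub>"


def apply_aliases(text, aliases):
    """Wrap each whole-word occurrence of an alias key in a <sub> element."""
    if not aliases:
        return escape(text)
    # Longest keys first so that overlapping keys prefer the longer match.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    parts = []
    last = 0
    for match in pattern.finditer(text):
        parts.append(escape(text[last:match.start()]))  # untouched text
        parts.append(sub(match.group(1), aliases[match.group(1)]))
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


body = apply_aliases("Hi Jim, we are calling today.", {"Jim": "Joe"})
print(f"<speak>{body}</speak>")
```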
<break> tag
An optional tag used to insert pauses between words.
Example
The following sample has been generated using this input:
<speak>This is going to be a long <break time="2s"/>pause.</speak>
Syntax
<break time="string" />
Attributes
Attribute | Required | Description |
---|---|---|
time | Yes | Specifies the absolute duration of a pause in seconds. For example, 1s. |
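As an illustration of the time attribute, the following sketch joins sentences with a fixed pause between them. The helper names `pause` and `join_with_pauses` are hypothetical, and support for fractional seconds is an assumption that the table above does not confirm.

```python
from xml.sax.saxutils import escape


def pause(seconds):
    """Build a <break> element for a pause of the given length in seconds."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    # The :g format drops a trailing ".0", so 2.0 becomes "2s".
    return f'<break time="{seconds:g}s"/>'


def join_with_pauses(sentences, seconds):
    """Join sentences into one <speak> document with a pause between each."""
    separator = f" {pause(seconds)} "
    return "<speak>" + separator.join(escape(s) for s in sentences) + "</speak>"


print(join_with_pauses(["This is going to be a long", "pause."], 2))
```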
<language> tag
Switches the language of the enclosed text, if the voice supports it. In the markup, the element is written as <lang>.
Example
The following sample has been generated using this input:
<speak>Su vuelo a <lang xml:lang="en-us">Pearson International Airport</lang> partirá en 30 minutos.</speak>
Syntax
<lang xml:lang="string"></lang>
Attributes
Attribute | Required | Description |
---|---|---|
xml:lang | Yes | Specifies the language that the text should generate in. Supported languages vary by voice. |
<resemble:convert> tag
Performs speech-to-speech conversion, if the voice supports it.
Speech-to-Speech enables you to transform a recording of one speaker into a recording of another speaker. As with other Resemble features, you can use speech-to-speech in your application through SSML and our API.
⚠️ The maximum allowed file size is 50 MB, and the maximum duration is 300 seconds. If a file exceeds either limit, it will automatically be trimmed.
Example
The following sample has been generated using this input:
<speak><resemble:convert src="https://resemble-data.s3.us-east-2.amazonaws.com/source-s2s.wav"/></speak>
Syntax
<resemble:convert src="string" pitch="float"></resemble:convert>
Attributes
Attribute | Required | Description |
---|---|---|
src | Yes | A direct URL to the source audio file. Only WAV files are supported. The source audio file should be a recording of a single speaker. |
pitch | No | Adjusts the pitch of the generated audio using a float value between -10.0 and 10.0. If the pitch value is set to 0 or not provided, the audio will be generated with no pitch adjustment. |
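The constraints on src and pitch can be checked client-side before submitting the markup. This is a hedged sketch: `convert_tag` is a hypothetical helper, and checking the `.wav` file extension is only a heuristic for the WAV requirement, since it does not inspect the file itself.

```python
from urllib.parse import urlparse
from xml.sax.saxutils import quoteattr


def convert_tag(src, pitch=None):
    """Build a speech-to-speech SSML document (hypothetical helper)."""
    # Heuristic: judge the format by the extension of the URL path.
    if not urlparse(src).path.lower().endswith(".wav"):
        raise ValueError("src must point to a WAV file")
    attrs = f" src={quoteattr(src)}"
    if pitch is not None:
        if not -10.0 <= pitch <= 10.0:
            raise ValueError("pitch must be between -10.0 and 10.0")
        attrs += f' pitch="{float(pitch)}"'
    return f"<speak><resemble:convert{attrs}/></speak>"


print(convert_tag("https://resemble-data.s3.us-east-2.amazonaws.com/source-s2s.wav", pitch=2))
```

The result is built as a plain string rather than parsed, because the `resemble:` prefix is not declared as an XML namespace in the examples here and a strict XML parser would reject it.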
<resemble:fill> tag
Resemble Fill enables you to take existing recordings of speech and modify them seamlessly (audio inpainting).
⚠️ Please see Using Resemble Fill Through The API for detailed instructions.
Example
The following sample has been generated using this input:
<resemble:fill recording_uuid="06ba9935">The Finch gladly accepted the invitation and arrived in good time and with a very good appetite.</resemble:fill>
Syntax
<resemble:fill recording_uuid="<string>"></resemble:fill>
Attributes
Attribute | Required | Description |
---|---|---|
recording_uuid | Yes | The recording UUID of the recording you want to modify. See Using Resemble Fill Through The API for detailed instructions. |
<audio> tag
The <audio> tag allows you to insert recorded audio files alongside the synthesized speech output.
Example
The following sample has been generated using this input:
<audio src="angry_cow.mp3">
<desc>An angry cow</desc>
Moo!!! (The sound failed to load)
</audio>
Syntax
<audio
src="<string>"
soundLevel="<string>"
background="<boolean>">
Hello there
</audio>
Attributes
Attribute | Required | Description |
---|---|---|
src | Yes | A URI referring to an audio source. The file must be in WAV format. |
soundLevel | No | Changes the volume level of the audio, specified as a percentage. |
background | No | Play an audio file in the background of a spoken text or inline. For example, playing music in the background of a spoken text prompt. |
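The attributes above can be assembled programmatically, for example to place music behind a spoken prompt. The sketch below rests on several assumptions that the table does not confirm: that soundLevel is written as a number followed by `%`, that background is serialized as `true` or `false`, and that the element's text content is what is spoken over the audio. The helper name `audio` and the file name `music.wav` are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def audio(src, text="", sound_level=None, background=None):
    """Build an <audio> element (hypothetical helper)."""
    attrs = f" src={quoteattr(src)}"
    if sound_level is not None:
        if sound_level < 0:
            raise ValueError("sound_level must not be negative")
        # Assumption: the percentage is written as e.g. "50%".
        attrs += f' soundLevel="{sound_level:g}%"'
    if background is not None:
        # Assumption: booleans are serialized as lowercase true/false.
        flag = "true" if background else "false"
        attrs += f' background="{flag}"'
    return f"<audio{attrs}>{escape(text)}</audio>"


print(audio("music.wav", "Hello there", sound_level=50, background=True))
```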