Speech Synthesis Markup Language (SSML) Reference
You can use Speech Synthesis Markup Language (SSML) as input to control how Resemble generates speech. Resemble automatically handles normal punctuation, such as pausing after a period or speaking a sentence that ends with a question mark as a question. However, in some cases, you may want additional control over Resemble’s synthetic speech: for example, having certain words pronounced in a specific way, saying a word or sentence with excitement, or spelling certain words character by character.
SSML is a markup language that provides a standard way to mark up text for the generation of synthetic speech. The specific tags Resemble supports are listed in Supported SSML tags.
Supported SSML tags
These are the SSML elements that Resemble supports. The speak element is required. All other elements are optional.
SSML Element | Required | Summary |
---|---|---|
speak | Yes | Required root element for the SSML document. |
prosody | No | Specifies the pitch, volume, and rate of a word. |
phoneme | No | Indicates the phonetic pronunciation of the contained text, overriding the default pronunciation. |
emphasis | No | Applies a pre-defined emphasis to a word. Emphasis is a pre-set combination of pitch and volume. |
say-as | No | Indicates the type of text contained in the element. For example, acronym. |
sub | No | Specifies the string of text to pronounce in place of the text contained in the element. |
break | No | Inserts a pause in between words. |
language | No | Specifies the language in which to generate the content. |
audio | No | Allows recorded audio files to be inserted alongside the synthesized speech output. |
resemble:emotion | No | Applies an emotion to a word, or provides fine-grained control over the pitch, intensity, and pace of a word. |
resemble:convert | No | Applies speech-to-speech given a source audio file. |
Apply multiple SSML tags to the same speech
You can combine most supported tags to layer their effects on speech. For instance, this example uses both the phoneme and emphasis tags. It tells Resemble to speak the entire sentence with strong emphasis and to speak the text inside the phoneme tag with the provided pronunciation.
Example of applying multiple SSML tags to the same speech:
<speak><emphasis level="strong">Hey there! Welcome to <phoneme ph="ɹɪsɛmbəl" alphabet="ipa">Resemble</phoneme>.</emphasis></speak>
Incompatible Tags
Not all tags can be combined.
- A phoneme tag can only wrap text and cannot contain any other elements.
- Like the phoneme tag, the say-as tag can only wrap text and cannot contain other elements.
- Like the phoneme tag, the sub tag can only wrap text and cannot contain other elements.
- The resemble:emotion tag can only wrap the phoneme or the say-as tag, and text.
For example, these are invalid:
<speak>This is an <phoneme ph="ɛɡzampəl" alphabet="ipa"><emphasis level="reduced">example</emphasis></phoneme> of invalid use of the phoneme tag.</speak>
<speak>This is an example of invalid use of the <sub alias="substitute"><resemble:emotion pitch="0.5" rate="0.80">substitute</resemble:emotion></sub> tag.</speak>
On the other hand, these are valid:
<speak>This is an <emphasis level="reduced"><phoneme ph="ɛɡzampəl" alphabet="ipa">example</phoneme></emphasis> of valid use of the phoneme tag.</speak>
<speak>This is an example of valid use of the <resemble:emotion pitch="0.5" rate="0.80"><sub alias="substitute">substitute</sub></resemble:emotion> tag.</speak>
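The text-only rule for phoneme, say-as, and sub can be checked locally before any SSML is submitted. The following is a minimal sketch in Python and is not part of Resemble's tooling: `find_nesting_errors` and `local_name` are hypothetical helpers, and rewriting the `resemble:` prefix is a workaround assumed here so that the standard-library XML parser accepts snippets that do not declare that namespace. The sketch does not check the resemble:emotion rule.

```python
import xml.etree.ElementTree as ET
from typing import List

# Tags that, per the rules above, may wrap text only.
TEXT_ONLY_TAGS = {"phoneme", "say-as", "sub"}


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix ElementTree adds when xmlns is set."""
    return tag.rsplit("}", 1)[-1]


def find_nesting_errors(ssml: str) -> List[str]:
    """Return one message per text-only tag that contains child elements."""
    # Assumption: the resemble: prefix is undeclared in the snippet, which a
    # strict XML parser rejects, so it is rewritten for this local check only.
    root = ET.fromstring(ssml.replace("resemble:", "resemble-"))
    errors = []
    for element in root.iter():
        name = local_name(element.tag)
        if name in TEXT_ONLY_TAGS and len(element) > 0:
            children = ", ".join(local_name(child.tag) for child in element)
            errors.append(f"<{name}> may only contain text, found: {children}")
    return errors
```

For the first invalid example above, this returns a single message naming the emphasis element found inside phoneme. A well-formed snippet that follows the text-only rule returns an empty list.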
<speak> tag
The required root element of the SSML document.
Syntax
<speak version="float" xmlns="string" xml:lang="string"></speak>
Attributes
Attribute | Required | Description |
---|---|---|
version | No | Indicates the version of the SSML specification used to interpret the document markup. Defaults to v1.1. |
xml:lang | No | Specifies the language of the root document. The value may contain a lowercase, two-letter language code (for example, en), or the language code and uppercase country/region (for example, en-US). Defaults to en-us. |
xmlns | No | Specifies the URI to the document that defines the markup vocabulary. The current URI is http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis.xsd |
<resemble:emotion> tag
An optional tag used to style the way synthesized speech sounds when generated through Resemble’s AI. This tag works best when the voice submitted for cloning contains samples that range from low-pitch to high-pitch, quiet to loud, and slow to fast, and that cover a range of emotions. If the cloned voice does not contain samples across these ranges, the synthesized output may be undesirable; in that case, use the prosody tag to achieve your desired style.
Example
The sample has been generated using this input:
<speak><resemble:emotion pitch="0.9" intensity="0.9" pace="0.9">This is a resemble style test!</resemble:emotion> </speak>
Syntax
<resemble:emotion pitch="float" intensity="float" pace="float"></resemble:emotion>
Attributes
Attribute | Required | Description |
---|---|---|
emotions | No | A pre-set emotion. This must be a value in:
|
pitch | No | The expressiveness of the synthesized speech. This must be a value between 0 and 1 (inclusive). If pitch is provided, intensity and pace must also be provided. |
intensity | No | The aggressiveness of the synthesized speech. This must be a value between 0 and 1 (inclusive). If intensity is provided, pitch and pace must also be provided. |
pace | No | The pace/rate/speed of the synthesized speech. This must be a value between 0 and 1 (inclusive). If pace is provided, pitch and intensity must also be provided. |
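The constraints on pitch, intensity, and pace (all three supplied together, each from 0 to 1) can be enforced when the tag is assembled. This is a minimal sketch under that reading of the table; `emotion_tag` is a hypothetical helper, not a Resemble function, and it only builds the markup string.

```python
from xml.sax.saxutils import escape


def emotion_tag(text: str, pitch: float, intensity: float, pace: float) -> str:
    """Build a resemble:emotion tag; all three values are required together."""
    values = {"pitch": pitch, "intensity": intensity, "pace": pace}
    for name, value in values.items():
        # Each attribute must lie in the inclusive range 0 to 1.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1 (inclusive), got {value}")
    attributes = " ".join(f'{name}="{value}"' for name, value in values.items())
    # escape() protects characters such as & and < in the spoken text.
    return f"<resemble:emotion {attributes}>{escape(text)}</resemble:emotion>"
```

Making all three values required parameters means a caller cannot supply one without the others. The result still needs to be wrapped in the speak root element.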
Deprecated Syntax. Please note, the following syntax is deprecated and will be removed in the future. Use the new syntax, listed above, instead.
<style emotions="string"></style>
<resemble:style expressiveness="float" aggressiveness="float" pace="float"></resemble:style>
Deprecated Attributes
Attribute | Required | Description |
---|---|---|
emotions | Yes | Either the emotion or the expressiveness, aggressiveness, and pace of the synthesized speech. Attribute should either be a string in:
expressiveness:float aggressiveness:float pace:float Where expressiveness, aggressiveness, and pace are values between 0 and 1 (inclusive). For example: <style emotions="expressiveness:0.5 aggressiveness:0.6 pace:0.7"></style> |
expressiveness | No | Renamed to "pitch"; see documentation on resemble:emotion. |
aggressiveness | No | Renamed to "intensity"; see documentation on resemble:emotion. |
<prosody> tag
An optional tag used to style the way synthesized speech sounds by specifying the pitch, rate, or volume.
Example
The sample has been generated using this input:
<speak>This part is normal. <prosody pitch="x-high">This part is going to sound high pitched</prosody>. <prosody rate="150%">This part is going to be spoken fast</prosody>. <prosody volume="loud">And this part is loud</prosody>!</speak>
Syntax
<prosody pitch="string" rate="string" volume="string"></prosody>
Attributes
Attribute | Required | Description |
---|---|---|
pitch | No | The baseline pitch of the synthesized speech. This must be one of the following values:
|
rate | No | The baseline speed of the synthesized speech. The rate must be a percent value. For example, 100% is normal speed, 50% is half the normal speed, and 200% is double the normal speed. |
volume | No | Indicates the volume level of the synthesized speech. This must be one of the following values:
|
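Because rate is a percentage of normal speed, a speed multiplier converts to it by multiplying by 100. The sketch below illustrates that arithmetic; `prosody_rate` is a hypothetical helper, and rounding to a whole percent is an assumption, since the table does not say whether fractional percentages are accepted.

```python
from xml.sax.saxutils import escape


def prosody_rate(text: str, multiplier: float) -> str:
    """Wrap text in a prosody tag spoken at `multiplier` times normal speed."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    # 1.0 -> 100% (normal), 0.5 -> 50% (half speed), 2.0 -> 200% (double speed).
    percent = round(multiplier * 100)
    return f'<prosody rate="{percent}%">{escape(text)}</prosody>'
```

For instance, `prosody_rate("This part is going to be spoken fast", 1.5)` produces the same `rate="150%"` markup used in the example above.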
<phoneme> tag
An optional tag that specifies the phonetic pronunciation for the specified text using phones from a supported phonetic alphabet.
Example
The sample has been generated using this input:
<speak>This is a phoneme replacement on this <phoneme ph="laɪn">word</phoneme>.</speak>
Syntax
<phoneme alphabet="string" ph="string"></phoneme>
Attributes
Attribute | Required | Description |
---|---|---|
alphabet | No | Specifies the phonetic alphabet to use when synthesizing the pronunciation of the string in the ph attribute. The alphabet name must be written in lowercase letters. The following are the possible alphabets that you may specify:
|
ph | Yes | A string containing phones that specify the pronunciation of the word in the phoneme element. |
<emphasis> tag
An optional tag that specifies the emphasis of the synthesized speech. Emphasis makes it easier to apply a pre-defined range of volume and pitch to the synthesized speech.
Example
The sample has been generated using this input:
<speak><emphasis level="reduced">I am more of a shy person really</emphasis>.</speak>
Syntax
<emphasis level="string"></emphasis>
Attributes
Attribute | Required | Description |
---|---|---|
level | No | Specifies the emphasis to apply to the text within the emphasis tag. The following are the possible levels that you may specify:
|
<say-as> tag
An optional element that indicates the content type. This provides guidance to the speech synthesis AI about how to pronounce the text.
Example
The following sample has been generated using this input:
<speak>This <say-as interpret-as="characters">SSML</say-as> stuff is really cool!</speak>
Syntax
<say-as interpret-as="string"></say-as>
Attributes
Attribute | Required | Description |
---|---|---|
interpret-as | Yes | Indicates the content type of the element’s text. The only types that are currently supported are:
|
<sub> tag
An optional element that specifies a string of text that is pronounced in place of the element’s text.
Example
The following sample has been generated using this input:
<speak>Hi <sub alias="Joe">Jim</sub>, we are calling today to inform you of your account activation with Resemble.</speak>
Syntax
<sub alias="string"></sub>
Attributes
Attribute | Required | Description |
---|---|---|
alias | Yes | Specifies the substitute text to speak. |
<break> tag
An optional tag used to insert pauses between words.
Example
The following sample has been generated using this input:
<speak>This is going to be a long <break time="2s"/>pause.</speak>
Syntax
<break time="string" />
Attributes
Attribute | Required | Description |
---|---|---|
time | Yes | Specifies the absolute duration of a pause in seconds. For example, 1s. |
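A pause can be generated from a numeric duration by formatting it with the `s` suffix shown in the table. This is a small sketch; `break_tag` is a hypothetical helper, and support for fractional values such as `0.5s` is an assumption, since only whole-second examples appear here.

```python
def break_tag(seconds: float) -> str:
    """Return a self-closing break tag for a pause of `seconds` seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # The 'g' format drops a trailing '.0', so 2 and 2.0 both become '2s'.
    return f'<break time="{seconds:g}s"/>'
```

For example, `"This is going to be a long " + break_tag(2) + "pause."` rebuilds the body of the sample above.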
<language> tag
If supported by the voice, the lang tag switches the language of the text it wraps.
Example
The following sample has been generated using this input:
<speak>Su vuelo a <lang xml:lang="en-us">Pearson International Airport</lang> partirá en 30 minutos.</speak>
Syntax
<lang xml:lang="string"></lang>
Attributes
Attribute | Required | Description |
---|---|---|
xml:lang | Yes | Specifies the language in which the enclosed text is generated. Supported languages vary by voice. |
<resemble:convert> tag
If supported by the voice, this tag performs speech-to-speech conversion.
Speech-to-speech enables you to transform a recording of one speaker into a recording of another speaker. As with other Resemble features, you can use speech-to-speech in your application through SSML and our API.
⚠️ The maximum allowed file size is 50 MB, and the maximum duration is 300 seconds. If a file exceeds either limit, it will automatically be trimmed.
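Since oversized files are trimmed rather than rejected, it can help to check a source recording against both limits before uploading it. The following is a rough local sketch using Python's standard `wave` module; `exceeds_limits` is a hypothetical helper, reading the size limit as 50 × 1024 × 1024 bytes is an assumption, and the `wave` module only reads uncompressed PCM WAV files.

```python
import os
import wave

MAX_BYTES = 50 * 1024 * 1024  # assumption: the size limit read as 50 MiB
MAX_SECONDS = 300


def exceeds_limits(path: str) -> bool:
    """Return True if the WAV file at `path` is over either limit."""
    size = os.path.getsize(path)
    with wave.open(path, "rb") as wav:
        # Duration is the number of frames divided by frames per second.
        seconds = wav.getnframes() / wav.getframerate()
    return size > MAX_BYTES or seconds > MAX_SECONDS
```

A file that returns `True` here would be trimmed, so it may be better to split or shorten it yourself first.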
Example
The following sample has been generated using this input:
<speak><resemble:convert src="https://resemble-data.s3.us-east-2.amazonaws.com/source-s2s.wav"/></speak>
Syntax
<resemble:convert src="string" pitch="float"></resemble:convert>
Attributes
Attribute | Required | Description |
---|---|---|
src | Yes | A direct URL to the source audio file. Only WAV files are supported. The source audio file should be a recording of a single speaker. |
pitch | No | Adjust the pitch of the generated audio using a float value between -10.0 and 10.0. If the pitch value is set to 0 or not provided, the audio will be generated with no pitch adjustment. |
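The attribute constraints above (a WAV source and a pitch from -10.0 to 10.0) can be validated when the tag is built. This is a minimal sketch; `convert_tag` is a hypothetical helper, and checking the URL path for a `.wav` extension is an assumed heuristic, since a URL can serve WAV audio without that suffix.

```python
from urllib.parse import urlparse
from xml.sax.saxutils import quoteattr


def convert_tag(src: str, pitch: float = 0.0) -> str:
    """Build a self-closing resemble:convert tag after basic validation."""
    # Heuristic only: look at the URL path, ignoring any query string.
    if not urlparse(src).path.lower().endswith(".wav"):
        raise ValueError("src should point to a WAV file")
    if not -10.0 <= pitch <= 10.0:
        raise ValueError("pitch must be between -10.0 and 10.0")
    # quoteattr() adds the surrounding quotes and escapes & and < in the URL.
    return f'<resemble:convert src={quoteattr(src)} pitch="{pitch}"/>'
```

A pitch of 0, the default here, corresponds to no pitch adjustment, as the table describes. The returned string still needs to be wrapped in the speak root element.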
<resemble:fill> tag
Resemble Fill enables you to take existing recordings of speech and modify them seamlessly (audio inpainting).
⚠️ Please see Using Resemble Fill Through The API for detailed instructions.
Example
The following sample has been generated using this input:
<resemble:fill recording_uuid="06ba9935">The Finch gladly accepted the invitation and arrived in good time and with a very good appetite.</resemble:fill>
Syntax
<resemble:fill recording_uuid="<string>"></resemble:fill>
Attributes
Attribute | Required | Description |
---|---|---|
recording_uuid | Yes | The recording UUID of the recording you want to modify. See Using Resemble Fill Through The API for detailed instructions. |
<audio> tag
The <audio> tag allows you to insert recorded audio files alongside the synthesized speech output.
Example
The following sample has been generated using this input:
<audio src="angry_cow.mp3">
<desc>An angry cow</desc>
Moo!!! (The sound failed to load)
</audio>
Syntax
<audio
src="<string>"
soundLevel="<string>"
background="<boolean>">
Hello there
</audio>
Attributes
Attribute | Required | Description |
---|---|---|
src | Yes | A URI referring to an audio source. Only WAV files are supported. |
soundLevel | No | Changes the volume level of the audio, specified as a percentage. |
background | No | Play an audio file in the background of a spoken text or inline. For example, playing music in the background of a spoken text prompt. |