Version: 2.0.0

Speech Synthesis Markup Language (SSML) Reference

You can use Speech Synthesis Markup Language (SSML) as input to control how Resemble generates speech. Resemble automatically handles normal punctuation, such as pausing after a period or speaking a sentence that ends with a question mark as a question. However, in some cases you may want additional control over Resemble’s synthetic speech: for example, having certain words pronounced in a specific way, saying a word or sentence with excitement, or spelling certain words character by character.

SSML is a markup language that provides a standard way to mark up text for the generation of synthetic speech. The specific tags Resemble supports are listed in Supported SSML tags.

Supported SSML tags

These are the SSML elements that Resemble supports. The speak element is required. All other elements are optional.

SSML Element	Required	Summary
speak	Yes	Required root element for the SSML document.
prosody	No	Specifies the pitch, volume, and rate of a word.
phoneme	No	Indicates the phonetic pronunciation of the contained text, overriding the default pronunciation.
emphasis	No	Apply a pre-defined emphasis on a word. Emphasis is a pre-set combination of pitch and volume.
say-as	No	Indicates the type of text contained in the element. For example, acronym.
sub	No	Specifies the string of text to pronounce in place of the text contained in the element.
break	No	Inserts a pause in between words.
language	No	Specifies the language to generate the content.
audio	No	Allows for the insertion of recorded audio files alongside the synthesized speech output.
resemble:emotion	No	Applies an emotion to a word or provides fine-grained control over the pitch, intensity, and pace of a word.
resemble:convert	No	Applies speech-to-speech given a source audio file.

Apply multiple SSML tags to the same speech

You can combine most supported tags to layer their effects on speech. For instance, this example uses both the phoneme and emphasis tags. This tells Resemble to speak the entire sentence with strong emphasis and to speak the text inside the phoneme tag with the provided pronunciation.

Example of applying multiple SSML tags to the same speech:

<speak><emphasis level="strong">Hey there! Welcome to <phoneme ph="ɹɪsɛmbəl" alphabet="ipa">Resemble</phoneme>.</emphasis></speak>

Incompatible Tags

Not all tags can be combined.

A phoneme tag can only wrap text and cannot contain any other elements.
The say-as tag, like the phoneme tag, can only wrap text and cannot contain other elements.
The sub tag, like the phoneme tag, can only wrap text and cannot contain other elements.
The resemble:emotion tag can only wrap the phoneme or the say-as tag, and text.

For example, these are invalid:

<speak>This is an <phoneme ph="ɛɡzampəl" alphabet="ipa"><emphasis level="reduced">example</emphasis></phoneme> of invalid use of the phoneme tag.</speak>

<speak>This is an example of invalid use of the <sub alias="substitute"><resemble:emotion pitch="0.5" intensity="0.5" pace="0.8">substitute</resemble:emotion></sub> tag.</speak>

On the other hand, these are valid:

<speak>This is an <emphasis level="reduced"><phoneme ph="ɛɡzampəl" alphabet="ipa">example</phoneme></emphasis> of valid use of the phoneme tag.</speak>

<speak>This is an example of valid use of the <resemble:emotion pitch="0.5" intensity="0.5" pace="0.8"><sub alias="substitute">substitute</sub></resemble:emotion> tag.</speak>
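
The text-only rule for the phoneme, say-as, and sub tags can be checked before a request is sent. The following is a minimal sketch using Python's standard xml.etree.ElementTree parser. The function name find_nesting_errors is hypothetical and not part of Resemble's API, and the sketch covers only the text-only rule. Documents that use resemble:-prefixed tags would need that prefix declared on the root element before this parser accepts them.

```python
import xml.etree.ElementTree as ET

# Tags that, per the rules above, may wrap text only.
TEXT_ONLY_TAGS = {"phoneme", "say-as", "sub"}


def find_nesting_errors(ssml):
    """Return one message per text-only tag that contains child elements."""
    root = ET.fromstring(ssml)  # raises ET.ParseError on malformed XML
    errors = []
    for element in root.iter():
        if element.tag in TEXT_ONLY_TAGS and len(element) > 0:
            children = ", ".join(child.tag for child in element)
            errors.append(
                f"<{element.tag}> may only contain text, found: {children}"
            )
    return errors


invalid = (
    '<speak>This is an <phoneme ph="ɛɡzampəl" alphabet="ipa">'
    '<emphasis level="reduced">example</emphasis></phoneme>.</speak>'
)
valid = (
    '<speak>This is an <emphasis level="reduced">'
    '<phoneme ph="ɛɡzampəl" alphabet="ipa">example</phoneme></emphasis>.</speak>'
)

print(find_nesting_errors(invalid))  # one error for the phoneme tag
print(find_nesting_errors(valid))    # empty list
```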

`<speak>`: Speak tag

The required root element of the SSML document.

Syntax

<speak version="float" xmlns="string" xml:lang="string"></speak>

Attributes

Attribute	Required	Description
version	No	Indicates the version of the SSML specification used to interpret the document markup. Defaults to 1.1.
xml:lang	No	Specifies the language of the root document. The value may contain a lowercase, two-letter language code (for example, en), or the language code and uppercase country/region (for example, en-US). Defaults to en-us.
xmlns	No	Specifies the URI to the document that defines the markup vocabulary. The current URI is http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis.xsd

`<resemble:emotion>` tag

An optional tag used to style how synthesized speech sounds when generated through Resemble’s AI. This tag works best when the voice submitted for cloning contains samples that range from low-pitched to high-pitched, quiet to loud, and slow to fast, and that cover a range of emotions. If the cloned voice does not contain samples across these ranges, the synthesized output may be undesirable; in that case, use the prosody tag to achieve your desired style.

Example

The sample has been generated using this input:

<speak><resemble:emotion pitch="0.9" intensity="0.9" pace="0.9">This is a resemble style test!</resemble:emotion> </speak>

Syntax

<resemble:emotion pitch="float" intensity="float" pace="float"></resemble:emotion>

Attributes

Attribute	Required	Description
emotions	No	A pre-set emotion. This must be one of: neutral, angry, annoyed, question, happy
pitch	No	The expressiveness of the synthesized speech. This must be a value between 0 and 1 (inclusive). If pitch is provided, intensity and pace must also be provided.
intensity	No	The aggressiveness of the synthesized speech. This must be a value between 0 and 1 (inclusive). If intensity is provided, pitch and pace must also be provided.
pace	No	The pace/rate/speed of the synthesized speech. This must be a value between 0 and 1 (inclusive). If pace is provided, pitch and intensity must also be provided.
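
The all-or-none rule for pitch, intensity, and pace, together with the 0 to 1 range, can be enforced when the tag is built. The sketch below is one way to do that in Python. The helper name emotion is hypothetical and not part of any Resemble SDK; it only produces the tag as a string and escapes the enclosed text.

```python
from xml.sax.saxutils import escape


def emotion(text, pitch=None, intensity=None, pace=None):
    """Build a resemble:emotion tag, checking the documented constraints."""
    values = {"pitch": pitch, "intensity": intensity, "pace": pace}
    provided = {name: v for name, v in values.items() if v is not None}

    # If any one of the three is given, the other two are required as well.
    if provided and len(provided) != len(values):
        missing = sorted(set(values) - set(provided))
        raise ValueError(f"also provide: {', '.join(missing)}")

    # Each value must lie between 0 and 1 (inclusive).
    for name, value in provided.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    attrs = "".join(f' {name}="{value}"' for name, value in provided.items())
    return f"<resemble:emotion{attrs}>{escape(text)}</resemble:emotion>"


tag = emotion("This is a resemble style test!", 0.9, 0.9, 0.9)
print(f"<speak>{tag}</speak>")
```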

Deprecated Syntax. Please note that the following syntax is deprecated and will be removed in the future. Use the new syntax, listed above, instead.

<style emotions="string"></style>

<resemble:style expressiveness="float" aggressiveness="float" pace="float"></resemble:style>

Deprecated Attributes

Attribute	Required	Description
emotions	Yes	Either a pre-set emotion or the expressiveness, aggressiveness, and pace of the synthesized speech. The value must be either one of: neutral, angry, annoyed, question, happy; or a string of the form `expressiveness:float aggressiveness:float pace:float`, where expressiveness, aggressiveness, and pace are values between 0 and 1 (inclusive). For example: `<style emotions="expressiveness:0.5 aggressiveness:0.6 pace:0.7"></style>`
expressiveness	No	Renamed to "pitch"; see the documentation on resemble:emotion.
aggressiveness	No	Renamed to "intensity"; see the documentation on resemble:emotion.
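
When migrating existing markup, the deprecated emotions string can be translated into the new attribute names using the renames listed above. The following Python sketch shows one possible approach. The function name migrate_emotions is hypothetical, and passing a pre-set name through unchanged as the emotions attribute is an assumption based on the attribute table for resemble:emotion.

```python
# Old field name -> new attribute name, per the renames described above.
RENAMED = {
    "expressiveness": "pitch",
    "aggressiveness": "intensity",
    "pace": "pace",
}


def migrate_emotions(value):
    """Convert a deprecated `emotions` string into new-style attributes."""
    if ":" not in value:
        # A pre-set such as "happy" is passed through unchanged (assumption).
        return {"emotions": value.strip()}

    attributes = {}
    for pair in value.split():
        name, _, number = pair.partition(":")
        if name not in RENAMED:
            raise ValueError(f"unknown field: {name!r}")
        attributes[RENAMED[name]] = float(number)
    return attributes


print(migrate_emotions("expressiveness:0.5 aggressiveness:0.6 pace:0.7"))
```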

`<prosody>` tag

An optional tag used to style the way synthesized speech sounds by specifying the pitch, rate, or volume.

Example

The sample has been generated using this input:

<speak>This part is normal. <prosody pitch="x-high">This part is going to sound high pitched</prosody>. <prosody rate="150%">This part is going to be spoken fast</prosody>. <prosody volume="loud">And this part is loud</prosody>!</speak>

Syntax

<prosody pitch="string" rate="string" volume="string"></prosody>

Attributes

Attribute	Required	Description
pitch	No	The baseline pitch of the synthesized speech. This must be one of the following values: x-low low medium high x-high
rate	No	The baseline speed of the synthesized speech. The rate must be a percent value. For example, 100% is the normal speed, 50% is half the normal speed, and 200% is double the normal speed.
volume	No	Indicates the volume level of the synthesized speech. This must be one of the following values: silent x-soft soft medium loud x-loud
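
The allowed values in the table above lend themselves to a small validating builder. The Python sketch below is illustrative only: the helper name prosody is hypothetical, and the rate check assumes whole-number percentages such as 150%, which is stricter than the description above necessarily requires.

```python
import re
from xml.sax.saxutils import escape

PITCHES = {"x-low", "low", "medium", "high", "x-high"}
VOLUMES = {"silent", "x-soft", "soft", "medium", "loud", "x-loud"}
RATE_PATTERN = re.compile(r"\d+%")  # assumption: whole-number percentages


def prosody(text, pitch=None, rate=None, volume=None):
    """Build a prosody tag after checking each attribute that is given."""
    if pitch is not None and pitch not in PITCHES:
        raise ValueError(f"unsupported pitch: {pitch!r}")
    if rate is not None and not RATE_PATTERN.fullmatch(rate):
        raise ValueError(f"rate must be a percent value, got {rate!r}")
    if volume is not None and volume not in VOLUMES:
        raise ValueError(f"unsupported volume: {volume!r}")

    attrs = ""
    for name, value in (("pitch", pitch), ("rate", rate), ("volume", volume)):
        if value is not None:
            attrs += f' {name}="{value}"'
    return f"<prosody{attrs}>{escape(text)}</prosody>"


print(prosody("This part is going to be spoken fast", rate="150%"))
```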

`<phoneme>` tag

An optional tag that specifies the phonetic pronunciation for the specified text using phones from a supported phonetic alphabet.

Example

The sample has been generated using this input:

<speak>This is a phoneme replacement on this <phoneme ph="laɪn">word</phoneme>.</speak>

Syntax

<phoneme alphabet="string" ph="string"></phoneme>

Attributes

Attribute	Required	Description
alphabet	No	Specifies the phonetic alphabet to use when synthesizing the pronunciation of the string in the ph attribute. The alphabet name must be given in lowercase letters. The only alphabet currently supported is: ipa. Defaults to "ipa" if no alphabet attribute is provided.
ph	Yes	A string containing phones that specify the pronunciation of the word in the phoneme element.

`<emphasis>` tag

An optional tag that specifies the emphasis of the synthesized speech. Emphasis makes it easier to apply a pre-defined range of volume and pitch to the synthesized speech.

Example

The sample has been generated using this input:

<speak><emphasis level="reduced">I am more of a shy person really</emphasis>.</speak>

Syntax

<emphasis level="string"></emphasis>

Attributes

Attribute	Required	Description
level	No	Specifies the emphasis to apply to the text within the emphasis tag. The following are the possible levels that you may specify: reduced strong

`<say-as>` tag

An optional element that indicates the content type. This provides guidance to the speech synthesis AI about how to pronounce the text.

Example

The following sample has been generated using this input:

<speak>This <say-as interpret-as="characters">SSML</say-as> stuff is really cool!</speak>

Syntax

<say-as interpret-as="string"></say-as>

Attributes

Attribute	Required	Description
interpret-as	Yes	Indicates the content type of the element’s text. The only type currently supported is: characters. The characters content type spells out each character of the contained text.

`<sub>` tag

An optional element that specifies a string of text that is pronounced in place of the element’s text.

Example

The following sample has been generated using this input:

<speak>Hi <sub alias="Joe">Jim</sub>, we are calling today to inform you of your account activation with Resemble.</speak>

Syntax

<sub alias="string"></sub>

Attributes

Attribute	Required	Description
alias	Yes	Specifies the substitute text to speak.

`<break>` tag

An optional tag used to insert pauses between words.

Example

The following sample has been generated using this input:

<speak>This is going to be a long <break time="2s"/>pause.</speak>

Syntax

<break time="string" />

Attributes

Attribute	Required	Description
time	Yes	Specifies the absolute duration of a pause in seconds. For example, 1s.
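
As an illustration of where the break tag sits in a document, the Python sketch below joins text fragments with a fixed pause and wraps the result in the speak root. The function name join_with_pauses is hypothetical, and the sketch restricts the pause to whole seconds because the table above only shows that form.

```python
from xml.sax.saxutils import escape


def join_with_pauses(parts, seconds):
    """Join text fragments with a break of `seconds` between each pair."""
    if not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("seconds must be a positive whole number")
    pause = f'<break time="{seconds}s"/>'
    body = pause.join(escape(part) for part in parts)
    return f"<speak>{body}</speak>"


print(join_with_pauses(["This is going to be a long ", "pause."], 2))
```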

`<language>` tag

If supported by the voice, the lang tag switches the language used for the enclosed text.

Example

The following sample has been generated using this input:

<speak>Su vuelo a <lang xml:lang="en-us">Pearson International Airport</lang> partirá en 30 minutos.</speak>

Syntax

<lang xml:lang="string"></lang>

Attributes

Attribute	Required	Description
xml:lang	Yes	Specifies the language that the text should generate in. Supported languages vary by voice.

`<resemble:convert>` tag

If supported by the voice, this tag performs speech-to-speech conversion.

Speech-to-speech enables you to transform a recording of one speaker into a recording of another speaker. As with other Resemble features, you can use speech-to-speech in your application through SSML and our API.

⚠️ The maximum allowed file size is 50 MB, and the maximum duration is 300 seconds. A file that exceeds either limit is automatically trimmed.

Example

The following sample has been generated using this input:

<speak><resemble:convert src="https://resemble-data.s3.us-east-2.amazonaws.com/source-s2s.wav"/></speak>

Syntax

<resemble:convert src="string" pitch="float"></resemble:convert>

Attributes

Attribute	Required	Description
src	Yes	A direct URL to the source audio file. Only WAV files are supported. The source audio file should be a recording of a single speaker.
pitch	No	Adjust the pitch of the generated audio using a float value between -10.0 and 10.0. If the pitch value is set to 0 or not provided, the audio will be generated with no pitch adjustment.
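
Because files over the size or duration limit are trimmed automatically, it can be useful to check a source recording before uploading it. The Python sketch below inspects a local copy of the WAV file with the standard wave module. The function name check_source_wav is hypothetical, and treating 50 MB as 50 × 1024 × 1024 bytes is an assumption.

```python
import os
import wave

MAX_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" read as 50 MiB
MAX_SECONDS = 300


def check_source_wav(path):
    """Return a list describing each documented limit the file exceeds."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file size exceeds 50 MB")

    # Duration is the frame count divided by the sample rate.
    with wave.open(path, "rb") as wav_file:
        duration = wav_file.getnframes() / wav_file.getframerate()
    if duration > MAX_SECONDS:
        problems.append("duration exceeds 300 seconds")
    return problems
```

An empty list means the recording is within both limits; otherwise each entry names a limit that would cause trimming.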

`<resemble:fill>` tag

Resemble Fill enables you to take existing recordings of speech and modify them seamlessly (audio inpainting).

⚠️ Please see Using Resemble Fill Through The API for detailed instructions.

Example

The following sample has been generated using this input:

<resemble:fill recording_uuid="06ba9935">The Finch gladly accepted the invitation and arrived in good time and with a very good appetite.</resemble:fill>

Syntax

<resemble:fill recording_uuid="<string>"></resemble:fill>

Attributes

Attribute	Required	Description
recording_uuid	Yes	The recording UUID of the recording you want to modify. See Using Resemble Fill Through The API for detailed instructions.

`<audio>` tag

The <audio> tag allows you to insert recorded audio files alongside the synthesized speech output.

Example

The following sample has been generated using this input:

<audio src="angry_cow.wav">
  <desc>An angry cow</desc>
  Moo!!! (The sound failed to load)
</audio>

Syntax

<audio
    src="<string>"
    soundLevel="<string>"
    background="<boolean>">
    Hello there
</audio>

Attributes

Attribute	Required	Description
`src`	Yes	A URI referring to an audio source. Only `wav` files are supported.
`soundLevel`	No	Changes the volume level of the audio, specified as a percentage.
`background`	No	Play an audio file in the background of a spoken text or inline. For example, playing music in the background of a spoken text prompt.
