Version: 2.0.0

Speech Synthesis Markup Language (SSML) Reference

You can use Speech Synthesis Markup Language (SSML) as input to control how Resemble generates speech. Resemble automatically handles normal punctuation, such as pausing after a period or speaking a sentence that ends with a question mark as a question. However, in some cases you may want additional control over Resemble’s synthetic speech, for example to have certain words pronounced in a specific way, to say a word or sentence with excitement, or to spell certain words character by character.

SSML is a markup language that provides a standard way to mark up text for the generation of synthetic speech. The specific tags Resemble supports are listed in Supported SSML tags.

Supported SSML tags

These are the SSML elements that Resemble supports. The speak element is required. All other elements are optional.

| SSML Element | Required | Summary |
| --- | --- | --- |
| speak | Yes | Required root element for the SSML document. |
| prosody | No | Specifies the pitch, volume, and rate of a word. |
| phoneme | No | Indicates the phonetic pronunciation of the contained text, overriding the default pronunciation. |
| emphasis | No | Applies a pre-defined emphasis to a word. Emphasis is a pre-set combination of pitch and volume. |
| say-as | No | Indicates the type of text contained in the element. For example, acronym. |
| sub | No | Specifies the string of text to pronounce in place of the text contained in the element. |
| break | No | Inserts a pause between words. |
| language | No | Specifies the language in which to generate the content. |
| audio | No | Allows recorded audio files to be inserted alongside the synthesized speech output. |
| resemble:emotion | No | Applies an emotion to a word, or provides fine-grained control over the pitch, intensity, and pace of a word. |
| resemble:convert | No | Applies speech-to-speech given a source audio file. |

Apply multiple SSML tags to the same speech

You can combine most supported tags to layer their effects on speech. For instance, this example uses both the phoneme and emphasis tags. It tells Resemble to speak the entire sentence with strong emphasis and to speak the text inside the phoneme tag with the provided pronunciation.

Example of applying multiple SSML tags to the same speech:

<speak><emphasis level="strong">Hey there! Welcome to <phoneme ph="ɹɪsɛmbəl" alphabet="ipa">Resemble</phoneme>.</emphasis></speak>
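
If you assemble SSML programmatically, nesting is easiest to get right by building from the innermost tag outward. The following is a minimal Python sketch of that idea. The `tag` helper is hypothetical (it is not part of any Resemble SDK) and simply produces the same string as the example above.

```python
from xml.sax.saxutils import escape, quoteattr

def tag(name, content, **attrs):
    """Wrap already-escaped SSML content in an element (hypothetical helper)."""
    attr_text = "".join(f" {key}={quoteattr(value)}" for key, value in attrs.items())
    return f"<{name}{attr_text}>{content}</{name}>"

# Innermost tag first: phoneme wraps plain text only.
name = tag("phoneme", escape("Resemble"), ph="ɹɪsɛmbəl", alphabet="ipa")
# emphasis wraps the whole sentence, including the phoneme element.
sentence = tag("emphasis", escape("Hey there! Welcome to ") + name + ".", level="strong")
# speak is the required root element.
ssml = tag("speak", sentence)
print(ssml)
```

Plain text is passed through `escape` so that characters such as `&` or `<` in user-supplied text do not break the markup.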

Incompatible Tags

Not all tags can be combined.

  • A phoneme tag can wrap only text; it cannot contain any other elements.
  • Like the phoneme tag, the say-as tag can wrap only text and cannot contain other elements.
  • Like the phoneme tag, the sub tag can wrap only text and cannot contain other elements.
  • The resemble:emotion tag can wrap only text and the phoneme or say-as tags.

For example, these are invalid:

<speak>This is an <phoneme ph="ɛɡzampəl" alphabet="ipa"><emphasis level="reduced">example</emphasis></phoneme> of invalid use of the phoneme tag.</speak>
<speak>This is an example of invalid use of the <sub alias="substitute"><resemble:emotion pitch="0.5" intensity="0.5" pace="0.8">substitute</resemble:emotion></sub> tag.</speak>

On the other hand, these are valid:

<speak>This is an <emphasis level="reduced"><phoneme ph="ɛɡzampəl" alphabet="ipa">example</phoneme></emphasis> of valid use of the phoneme tag.</speak>
<speak>This is an example of valid use of the <resemble:emotion pitch="0.5" intensity="0.5" pace="0.8"><sub alias="substitute">substitute</sub></resemble:emotion> tag.</speak>
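
The text-only rule for phoneme, say-as, and sub can be checked before a request is sent. The sketch below is one possible client-side check using Python's standard XML parser, not Resemble's own validation. The function name `nesting_errors` is hypothetical, and rewriting the `resemble:` prefix is an assumption made only so the parser accepts a prefix that has no namespace declaration.

```python
import xml.etree.ElementTree as ET

# Tags that, per the rules above, may wrap text only.
TEXT_ONLY = {"phoneme", "say-as", "sub"}

def nesting_errors(ssml):
    """Return the text-only nesting violations found in an SSML string."""
    # Assumption: the "resemble:" prefix is not bound to an XML namespace,
    # so it is rewritten to "resemble_" purely so that the parser accepts it.
    parseable = ssml.replace("<resemble:", "<resemble_").replace("</resemble:", "</resemble_")
    root = ET.fromstring(parseable)
    errors = []
    for parent in root.iter():
        if parent.tag in TEXT_ONLY:
            for child in parent:
                child_name = child.tag.replace("resemble_", "resemble:")
                errors.append(f"<{parent.tag}> cannot contain <{child_name}>")
    return errors

invalid = ('<speak>This is an <phoneme ph="ɛɡzampəl" alphabet="ipa">'
           '<emphasis level="reduced">example</emphasis></phoneme>.</speak>')
valid = ('<speak>This is an <emphasis level="reduced">'
         '<phoneme ph="ɛɡzampəl" alphabet="ipa">example</phoneme></emphasis>.</speak>')
print(nesting_errors(invalid))  # ['<phoneme> cannot contain <emphasis>']
print(nesting_errors(valid))    # []
```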

<speak> tag

The required root element of the SSML document.

Syntax

<speak version="float" xmlns="string" xml:lang="string"></speak>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| version | No | Indicates the version of the SSML specification used to interpret the document markup. Defaults to 1.1. |
| xml:lang | No | Specifies the language of the root document. The value may contain a lowercase, two-letter language code (for example, en), or the language code and uppercase country/region (for example, en-US). Defaults to en-us. |
| xmlns | No | Specifies the URI to the document that defines the markup vocabulary. The current URI is http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis.xsd |

<resemble:emotion> tag

An optional tag used to style the way synthesized speech sounds when generated through Resemble’s AI. This tag works best when the voice submitted for cloning contains samples that range from low-pitch to high-pitch, quiet to loud, and slow to fast, and that cover a range of emotions. If the voice being cloned does not contain samples across these ranges, the synthesized output may be undesirable. In that case, use the prosody tag to achieve your desired style.

Example

The sample has been generated using this input:

<speak><resemble:emotion pitch="0.9" intensity="0.9" pace="0.9">This is a resemble style test!</resemble:emotion> </speak>

Syntax

<resemble:emotion pitch="float" intensity="float" pace="float"></resemble:emotion>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| emotions | No | A pre-set emotion. Must be one of: neutral, angry, annoyed, question, happy. |
| pitch | No | The expressiveness of the synthesized speech. Must be a value between 0 and 1 (inclusive). If pitch is provided, intensity and pace must also be provided. |
| intensity | No | The aggressiveness of the synthesized speech. Must be a value between 0 and 1 (inclusive). If intensity is provided, pitch and pace must also be provided. |
| pace | No | The pace (speed) of the synthesized speech. Must be a value between 0 and 1 (inclusive). If pace is provided, pitch and intensity must also be provided. |
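
The attribute rules above (each value between 0 and 1, and pitch, intensity, and pace supplied together) are easy to enforce before sending a request. The following Python sketch shows one way to do that. The `emotion_tag` function is a hypothetical helper, not part of a Resemble SDK.

```python
from xml.sax.saxutils import escape

def emotion_tag(text, pitch=None, intensity=None, pace=None):
    """Build a resemble:emotion element, enforcing the all-or-none rule."""
    values = {"pitch": pitch, "intensity": intensity, "pace": pace}
    provided = {name: value for name, value in values.items() if value is not None}
    # pitch, intensity, and pace must be given together or not at all.
    if provided and len(provided) != len(values):
        raise ValueError("pitch, intensity, and pace must be provided together")
    for name, value in provided.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1 (inclusive)")
    attrs = "".join(f' {name}="{value}"' for name, value in provided.items())
    return f"<resemble:emotion{attrs}>{escape(text)}</resemble:emotion>"

print(emotion_tag("This is a resemble style test!", pitch=0.9, intensity=0.9, pace=0.9))
```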

Deprecated Syntax. Please note that the following syntax is deprecated and will be removed in the future. Use the new syntax, listed above, instead.

<style emotions="string"></style>
<resemble:style expressiveness="float" aggressiveness="float" pace="float"></resemble:style>

Deprecated Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| emotions | Yes | Either the emotion or the expressiveness, aggressiveness, and pace of the synthesized speech. Must be either one of neutral, angry, annoyed, question, happy, or a string of the form expressiveness:float aggressiveness:float pace:float, where expressiveness, aggressiveness, and pace are values between 0 and 1 (inclusive). For example: <style emotions="expressiveness:0.5 aggressiveness:0.6 pace:0.7"></style> |
| expressiveness | No | Renamed to "pitch"; see the documentation on resemble:emotion. |
| aggressiveness | No | Renamed to "intensity"; see the documentation on resemble:emotion. |
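
When migrating existing content away from the deprecated syntax, the renames above can be applied mechanically. The sketch below converts a deprecated emotions string into attribute names for resemble:emotion. The `migrate_emotions` function is hypothetical, and it assumes the string is well formed.

```python
# Old attribute name -> new attribute name (pace keeps its name).
RENAMES = {"expressiveness": "pitch", "aggressiveness": "intensity", "pace": "pace"}

def migrate_emotions(value):
    """Convert a deprecated 'emotions' string into resemble:emotion attributes."""
    if ":" not in value:
        # A pre-set emotion such as "happy" carries over unchanged.
        return {"emotions": value}
    attrs = {}
    for pair in value.split():
        old_name, number = pair.split(":")
        attrs[RENAMES[old_name]] = float(number)
    return attrs

print(migrate_emotions("expressiveness:0.5 aggressiveness:0.6 pace:0.7"))
# {'pitch': 0.5, 'intensity': 0.6, 'pace': 0.7}
```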

<prosody> tag

An optional tag used to style the way synthesized speech sounds by specifying the pitch, rate, or volume.

Example

The sample has been generated using this input:

<speak>This part is normal. <prosody pitch="x-high">This part is going to sound high pitched</prosody>. <prosody rate="150%">This part is going to be spoken fast</prosody>. <prosody volume="loud">And this part is loud</prosody>!</speak>

Syntax

<prosody pitch="string" rate="string" volume="string"></prosody>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| pitch | No | The baseline pitch of the synthesized speech. Must be one of: x-low, low, medium, high, x-high. |
| rate | No | The baseline speed of the synthesized speech, as a percent value. For example, 100% is normal speed, 50% is half the normal speed, and 200% is double the normal speed. |
| volume | No | The volume level of the synthesized speech. Must be one of: silent, x-soft, soft, medium, loud, x-loud. |
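
Because pitch and volume accept only fixed keywords and rate is a percentage, a small builder can catch typos early. This is a minimal Python sketch. `prosody_tag` is a hypothetical helper, and rejecting non-positive rates is an assumption rather than documented behavior.

```python
from xml.sax.saxutils import escape

PITCHES = ("x-low", "low", "medium", "high", "x-high")
VOLUMES = ("silent", "x-soft", "soft", "medium", "loud", "x-loud")

def prosody_tag(text, pitch=None, rate=None, volume=None):
    """Build a prosody element; rate is a percentage of normal speed."""
    attrs = ""
    if pitch is not None:
        if pitch not in PITCHES:
            raise ValueError(f"pitch must be one of {PITCHES}")
        attrs += f' pitch="{pitch}"'
    if rate is not None:
        # Assumption: only positive percentages make sense (100 = normal speed).
        if rate <= 0:
            raise ValueError("rate must be a positive percentage")
        attrs += f' rate="{rate}%"'
    if volume is not None:
        if volume not in VOLUMES:
            raise ValueError(f"volume must be one of {VOLUMES}")
        attrs += f' volume="{volume}"'
    return f"<prosody{attrs}>{escape(text)}</prosody>"

print(prosody_tag("This part is going to be spoken fast", rate=150))
# <prosody rate="150%">This part is going to be spoken fast</prosody>
```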

<phoneme> tag

An optional tag that specifies the phonetic pronunciation for the specified text using phones from a supported phonetic alphabet.

Example

The sample has been generated using this input:

<speak>This is a phoneme replacement on this <phoneme ph="laɪn">word</phoneme>.</speak>

Syntax

<phoneme alphabet="string" ph="string"></phoneme>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| alphabet | No | Specifies the phonetic alphabet to use when synthesizing the pronunciation of the string in the ph attribute. The value must be in lowercase letters. Supported alphabets: ipa. Defaults to "ipa" if no alphabet attribute is provided. |
| ph | Yes | A string containing phones that specify the pronunciation of the word in the phoneme element. |

<emphasis> tag

An optional tag that specifies the emphasis of the synthesized speech. Emphasis makes it easier to apply a pre-defined range of volume and pitch to the synthesized speech.

Example

The sample has been generated using this input:

<speak><emphasis level="reduced">I am more of a shy person really</emphasis>.</speak>

Syntax

<emphasis level="string"></emphasis>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| level | No | Specifies the emphasis to apply to the text within the emphasis tag. Must be one of: reduced, strong. |

<say-as> tag

An optional element that indicates the content type. This provides guidance to the speech synthesis AI about how to pronounce the text.

Example

The following sample has been generated using this input:

<speak>This <say-as interpret-as="characters">SSML</say-as> stuff is really cool!</speak>

Syntax

<say-as interpret-as="string"></say-as>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| interpret-as | Yes | Indicates the content type of the element’s text. The only type currently supported is characters, which spells out each character of the contained text. |

<sub> tag

An optional element that specifies a string of text that is pronounced in place of the element’s text.

Example

The following sample has been generated using this input:

<speak>Hi <sub alias="Joe">Jim</sub>, we are calling today to inform you of your account activation with Resemble.</speak>

Syntax

<sub alias="string"></sub>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| alias | Yes | Specifies the substitute text to speak. |

<break> tag

An optional tag used to insert pauses between words.

Example

The following sample has been generated using this input:

<speak>This is going to be a long <break time="2s"/>pause.</speak>

Syntax

<break time="string" />

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| time | Yes | Specifies the absolute duration of a pause in seconds. For example, 1s. |
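
Since the time value is a number of seconds followed by "s", pauses are simple to generate from numeric durations. The sketch below uses a hypothetical `break_tag` helper. Accepting fractional seconds is an assumption, as the reference only shows whole-second values.

```python
def break_tag(seconds):
    """Build a break element for a pause of the given length in seconds."""
    if seconds <= 0:
        raise ValueError("pause duration must be positive")
    # The :g format drops a trailing ".0", so 2.0 becomes "2s".
    return f'<break time="{seconds:g}s"/>'

ssml = f"<speak>This is going to be a long {break_tag(2)}pause.</speak>"
print(ssml)
# <speak>This is going to be a long <break time="2s"/>pause.</speak>
```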

<language> tag

If the voice supports it, this tag switches the language of the contained text.

Example

The following sample has been generated using this input:

<speak>Su vuelo a <lang xml:lang="en-us">Pearson International Airport</lang> partirá en 30 minutos.</speak>

Syntax

<lang xml:lang="string"></lang>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| xml:lang | Yes | Specifies the language in which the text should be generated. Supported languages vary by voice. |

<resemble:convert> tag

If the voice supports it, this tag performs speech-to-speech conversion.

Speech-to-speech enables you to transform a recording of one speaker into a recording of another speaker. As with other Resemble features, you can take advantage of speech-to-speech in your application through SSML and our API.

⚠️ The maximum allowed file size is 50 MB, and the maximum duration is 300 seconds. If a file exceeds either of these limits, it will automatically be trimmed.
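
To avoid silent trimming, you may want to check a source file against these limits before uploading it. The following Python sketch uses only the standard library. The function names are hypothetical, and interpreting the size limit as 50 × 1024 × 1024 bytes is an assumption.

```python
import os
import wave

MAX_BYTES = 50 * 1024 * 1024  # assumption: the 50 MB limit read as 50 MiB
MAX_SECONDS = 300

def wav_stats(path):
    """Return (size in bytes, duration in seconds) for a WAV file."""
    with wave.open(path, "rb") as wav:
        seconds = wav.getnframes() / wav.getframerate()
    return os.path.getsize(path), seconds

def within_convert_limits(size_bytes, seconds):
    """Return True if a file of this size and duration would not be trimmed."""
    return size_bytes <= MAX_BYTES and seconds <= MAX_SECONDS
```

A caller could combine the two as `within_convert_limits(*wav_stats("source.wav"))`, where `source.wav` is a placeholder file name.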

Example

The following sample has been generated using this input:

<speak><resemble:convert src="https://resemble-data.s3.us-east-2.amazonaws.com/source-s2s.wav"/></speak>

Syntax

<resemble:convert src="string" pitch="float"></resemble:convert>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| src | Yes | A direct URL to the source audio file. Only WAV files are supported. The source audio file should be a recording of a single speaker. |
| pitch | No | Adjusts the pitch of the generated audio using a float value between -10.0 and 10.0. If the pitch value is set to 0 or not provided, the audio will be generated with no pitch adjustment. |
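
The pitch range can be validated client-side when building the tag. This is a minimal sketch with a hypothetical `convert_tag` helper. It omits the pitch attribute when the value is 0, which the table above describes as equivalent to no adjustment.

```python
from xml.sax.saxutils import quoteattr

def convert_tag(src, pitch=0.0):
    """Build a resemble:convert element for a source WAV URL."""
    if not -10.0 <= pitch <= 10.0:
        raise ValueError("pitch must be between -10.0 and 10.0")
    attrs = f"src={quoteattr(src)}"
    if pitch != 0:
        # A pitch of 0 means no adjustment, so the attribute is left out.
        attrs += f' pitch="{pitch}"'
    return f"<resemble:convert {attrs}/>"

url = "https://resemble-data.s3.us-east-2.amazonaws.com/source-s2s.wav"
print(f"<speak>{convert_tag(url)}</speak>")
print(f"<speak>{convert_tag(url, pitch=2.5)}</speak>")
```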

<resemble:fill> tag

Resemble Fill enables you to take existing recordings of speech and modify them seamlessly (audio inpainting).

⚠️ Please see Using Resemble Fill Through The API for detailed instructions.

Example

The following sample has been generated using this input:

<resemble:fill recording_uuid="06ba9935">The Finch gladly accepted the invitation and arrived in good time and with a very good appetite.</resemble:fill>

Syntax

<resemble:fill recording_uuid="<string>"></resemble:fill>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| recording_uuid | Yes | The recording UUID of the recording you want to modify. See Using Resemble Fill Through The API for detailed instructions. |

<audio> tag

The <audio> tag allows you to insert recorded audio files alongside the synthesized speech output.

Example

The following sample has been generated using this input:

<audio src="angry_cow.wav">
<desc>An angry cow</desc>
Moo!!! (The sound failed to load)
</audio>

Syntax

<audio
src="<string>"
soundLevel="<string>"
background="<boolean>">
Hello there
</audio>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| src | Yes | A URI referring to an audio source. Only WAV files are supported. |
| soundLevel | No | Changes the volume level of the audio, specified as a percentage. |
| background | No | Whether to play the audio file in the background of the spoken text rather than inline. For example, playing music in the background of a spoken text prompt. |