Version: 2.0.0

SSML Reference

You can use Speech Synthesis Markup Language (SSML) as input to control how Resemble generates speech. Resemble automatically handles normal punctuation, such as pausing after a period or speaking a sentence that ends with a question mark as a question. However, in some cases you may want additional control over Resemble’s synthetic speech, for example pronouncing certain words in a specific way, saying a word or sentence with excitement, or spelling certain words character by character.

SSML is a markup language that provides a standard way to mark up text for the generation of synthetic speech. The specific tags Resemble supports are listed in Supported SSML tags.

Supported SSML tags

These are the SSML elements that Resemble supports. The speak element is required. All other elements are optional.

| SSML Element | Required | Summary |
| --- | --- | --- |
| speak | Yes | Required root element for the SSML document. |
| prosody | No | Specifies the pitch, volume, and rate of the enclosed text. |
| emphasis | No | Applies a pre-defined emphasis to the enclosed text. Emphasis is a pre-set combination of pitch and volume. |
| say-as | No | Indicates the type of text contained in the element. For example, acronym. |
| sub | No | Specifies the string of text to pronounce instead of the text contained in the element. |
| break | No | Inserts a pause between words. |
| lang | No | Specifies the language in which to generate the content. |
| audio | No | Allows recorded audio files to be inserted alongside the synthesized speech output. |
| resemble:convert | No | Applies speech-to-speech given a source audio file. |
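
Because the speak element is required, plain text has to be wrapped in it before it is submitted. The Python sketch below shows one way to do that and to check locally that a string is well-formed XML with speak as its root. The helper names `wrap_in_speak` and `has_speak_root` are hypothetical and not part of any Resemble library, and passing this check does not guarantee that Resemble will accept the document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_in_speak(text: str) -> str:
    """Escape plain text and wrap it in the required <speak> root element."""
    return f"<speak>{escape(text)}</speak>"


def has_speak_root(ssml: str) -> bool:
    """Return True if ssml is well-formed XML whose root element is speak."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError:
        return False
    # Drop a "{namespace}" prefix, present when xmlns is set on the root.
    return root.tag.rsplit("}", 1)[-1] == "speak"


ssml = wrap_in_speak("Tom & Jerry are here.")
print(ssml)                                # <speak>Tom &amp; Jerry are here.</speak>
print(has_speak_root(ssml))                # True
print(has_speak_root("<speak>Hi<speak>"))  # False: the closing tag is missing its slash
```

This standard-library parser rejects prefixed tags such as resemble:convert unless the prefix is declared, so the check is only suited to documents that use unprefixed tags.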

<speak> tag

The required root element of the SSML document.

Syntax

<speak version="float" xmlns="string" xml:lang="string"></speak>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| version | No | Indicates the version of the SSML specification used to interpret the document markup. Defaults to v1.1. |
| xml:lang | No | Specifies the language of the root document. The value may be a lowercase, two-letter language code (for example, en), or the language code plus an uppercase country/region (for example, en-US). Defaults to en-us. |
| xmlns | No | Specifies the URI of the document that defines the markup vocabulary. The current URI is http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis.xsd |
| temperature | No | Controls the randomness of the generated output. The value ranges from 0.1 to 5. The default value is 0.8. |
| exaggeration | No | Controls the intensity of emotion. The value ranges from 0.0 to 1.0. |
| seed | No | A non-negative integer that initializes the model for deterministic output. |
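
The ranges documented for temperature, exaggeration, and seed can be checked before a request is sent. The following Python sketch builds a speak element from plain text and rejects out-of-range values. The `speak` helper is hypothetical (it is not a Resemble API), and it assumes the ranges are inclusive at both ends.

```python
from xml.sax.saxutils import escape, quoteattr


def speak(text, temperature=None, exaggeration=None, seed=None):
    """Wrap plain text in <speak>, adding only the attributes that were given."""
    attrs = {}
    if temperature is not None:
        if not 0.1 <= temperature <= 5:
            raise ValueError("temperature must be between 0.1 and 5")
        attrs["temperature"] = temperature
    if exaggeration is not None:
        if not 0.0 <= exaggeration <= 1.0:
            raise ValueError("exaggeration must be between 0.0 and 1.0")
        attrs["exaggeration"] = exaggeration
    if seed is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(seed, bool) or not isinstance(seed, int) or seed < 0:
            raise ValueError("seed must be a non-negative integer")
        attrs["seed"] = seed
    attr_text = "".join(
        f" {name}={quoteattr(str(value))}" for name, value in attrs.items()
    )
    return f"<speak{attr_text}>{escape(text)}</speak>"


print(speak("Hello there.", temperature=0.5, seed=42))
# <speak temperature="0.5" seed="42">Hello there.</speak>
```

The helper escapes its input, so it accepts plain text only; nested SSML tags would need to be assembled separately.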

<prosody> tag

An optional tag used to style the way synthesized speech sounds by specifying the pitch, rate, or volume.

Example

The sample has been generated using this input:

<speak>This part is normal. <prosody pitch="x-high">This part is going to sound high pitched</prosody>. <prosody rate="150%">This part is going to be spoken fast</prosody>. <prosody volume="loud">And this part is loud</prosody>!</speak>

Syntax

<prosody pitch="string" rate="string" volume="string"></prosody>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| pitch | No | The baseline pitch of the synthesized speech. This must be one of the following values: x-low, low, medium, high, x-high. |
| rate | No | The baseline speed of the synthesized speech. The rate must be a percent value. For example, 100% is normal speed, 50% is half the normal speed, and 200% is double the normal speed. |
| volume | No | Indicates the volume level of the synthesized speech. This must be one of the following values: silent, x-soft, soft, medium, loud, x-loud. |
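
Since pitch and volume accept only fixed keywords and rate must be a percent value, a typo is easy to catch before the SSML is sent. This Python sketch validates the three attributes and emits a prosody element. The `prosody` helper is hypothetical, and it assumes that keyword values are lowercase and that a rate is written as digits followed by a percent sign.

```python
import re
from xml.sax.saxutils import escape

PITCHES = {"x-low", "low", "medium", "high", "x-high"}
VOLUMES = {"silent", "x-soft", "soft", "medium", "loud", "x-loud"}
RATE_PATTERN = re.compile(r"\d+(\.\d+)?%")  # assumed format, e.g. "150%"


def prosody(text, pitch=None, rate=None, volume=None):
    """Wrap plain text in <prosody>, validating each attribute that is given."""
    attrs = []
    if pitch is not None:
        if pitch not in PITCHES:
            raise ValueError(f"unsupported pitch: {pitch!r}")
        attrs.append(f'pitch="{pitch}"')
    if rate is not None:
        if not RATE_PATTERN.fullmatch(rate):
            raise ValueError(f"rate must be a percent value: {rate!r}")
        attrs.append(f'rate="{rate}"')
    if volume is not None:
        if volume not in VOLUMES:
            raise ValueError(f"unsupported volume: {volume!r}")
        attrs.append(f'volume="{volume}"')
    attr_text = "".join(" " + attr for attr in attrs)
    return f"<prosody{attr_text}>{escape(text)}</prosody>"


print(prosody("This part is going to be spoken fast", rate="150%"))
# <prosody rate="150%">This part is going to be spoken fast</prosody>
```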

<emphasis> tag

An optional tag that specifies the emphasis of the synthesized speech. Emphasis makes it easier to apply a pre-defined range of volume and pitch to the synthesized speech.

Example

The sample has been generated using this input:

<speak><emphasis level="reduced">I am more of a shy person really</emphasis>.</speak>

Syntax

<emphasis level="string"></emphasis>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| level | No | Specifies the emphasis to apply to the text within the emphasis tag. The possible levels are: reduced, strong. |

<say-as> tag

An optional element that indicates the content type. This provides guidance to the speech synthesis AI about how to pronounce the text.

Example

The following sample has been generated using this input:

<speak>This <say-as interpret-as="characters">SSML</say-as> stuff is really cool!</speak>

Syntax

<say-as interpret-as="string"></say-as>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| interpret-as | Yes | Indicates the content type of the element’s text. The only type currently supported is characters, which spells out each character of the contained text. |

<sub> tag

An optional element that specifies a string of text that is pronounced in place of the element’s text.

Example

The following sample has been generated using this input:

<speak>Hi <sub alias="Joe">Jim</sub>, we are calling today to inform you of your account activation with Resemble.</speak>

Syntax

<sub alias="string"></sub>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| alias | Yes | Specifies the substitute text to speak. |
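
It is easy to swap the two pieces of text in a sub element: the alias attribute holds what is spoken, and the element content holds the written text it replaces. A minimal Python sketch, with a hypothetical helper name `sub`, makes that explicit and escapes both values.

```python
from xml.sax.saxutils import escape, quoteattr


def sub(written: str, spoken: str) -> str:
    """Return a <sub> element that speaks `spoken` in place of `written`."""
    return f"<sub alias={quoteattr(spoken)}>{escape(written)}</sub>"


print(sub("Jim", "Joe"))
# <sub alias="Joe">Jim</sub>
```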

<break> tag

An optional tag used to insert pauses between words.

Example

The following sample has been generated using this input:

<speak>This is going to be a long <break time="2s"/>pause.</speak>

Syntax

<break time="string" />

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| time | Yes | Specifies the absolute duration of a pause in seconds. For example, 1s. |
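
When pause lengths are computed rather than typed by hand, a small formatter keeps the time value in the seconds form shown above. This Python sketch uses a hypothetical helper, `break_tag`. Whether Resemble accepts fractional values such as 0.5s is an assumption, not something this reference states.

```python
def break_tag(seconds: float) -> str:
    """Return a self-closing <break> element for a pause of `seconds` seconds."""
    if seconds < 0:
        raise ValueError("pause duration cannot be negative")
    # The "g" format drops a trailing ".0", so 2.0 is written as "2s".
    return f'<break time="{seconds:g}s"/>'


ssml = "<speak>This is going to be a long " + break_tag(2) + "pause.</speak>"
print(ssml)
# <speak>This is going to be a long <break time="2s"/>pause.</speak>
```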

<lang> tag

If the voice supports it, this tag switches the language of the enclosed text.

Example

The following sample has been generated using this input:

<speak>Su vuelo a <lang xml:lang="en-us">Pearson International Airport</lang> partirá en 30 minutos.</speak>

Syntax

<lang xml:lang="string"></lang>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| xml:lang | Yes | Specifies the language in which the text should be generated. Supported languages vary by voice. |

<resemble:convert> tag

If the voice supports it, this tag performs speech-to-speech conversion.

Speech-to-speech enables you to transform a recording of one speaker into a recording of another speaker. Similar to other Resemble features, you can take advantage of speech-to-speech in your application through SSML and our API.

⚠️ The maximum allowed file size is 50 MB, and the maximum duration is 300 seconds. If a file exceeds either of these limits, it will automatically be trimmed.
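
To avoid silent trimming, a source recording can be checked against both limits before it is uploaded. The Python sketch below does this for a local WAV file using only the standard library. The function name `wav_within_limits` is hypothetical, and treating 50 MB as 50 × 1024 × 1024 bytes is an assumption, since the exact byte count is not stated here.

```python
import os
import wave

MAX_BYTES = 50 * 1024 * 1024  # assumption: 1 MB is counted as 1024 * 1024 bytes
MAX_SECONDS = 300


def wav_within_limits(path: str) -> bool:
    """Return True if the WAV file at `path` is within the size and duration limits."""
    size_ok = os.path.getsize(path) <= MAX_BYTES
    with wave.open(path, "rb") as wav:
        # Duration in seconds is the frame count divided by frames per second.
        duration = wav.getnframes() / wav.getframerate()
    return size_ok and duration <= MAX_SECONDS


# Example usage with a hypothetical local file:
# if not wav_within_limits("source-s2s.wav"):
#     print("Recording would be trimmed; shorten it first.")
```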

Example

The following sample has been generated using this input:

<speak><resemble:convert src="https://resemble-data.s3.us-east-2.amazonaws.com/source-s2s.wav"/></speak>

Syntax

<resemble:convert src="string" pitch="float"></resemble:convert>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| src | Yes | A direct URL to the source audio file. Only WAV files are supported. The source audio file should be a recording of a single speaker. |
| pitch | No | Adjusts the pitch of the generated audio using a float value between -10.0 and 10.0. If the pitch value is set to 0 or not provided, the audio will be generated with no pitch adjustment. |
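
The pitch range can be enforced when the element is assembled. This Python sketch builds a complete speech-to-speech document from a source URL and an optional pitch. The helper name `convert_ssml` is hypothetical, and it omits the pitch attribute when the value is 0, relying on the documented behavior that a missing pitch means no adjustment.

```python
from xml.sax.saxutils import quoteattr


def convert_ssml(src: str, pitch: float = 0.0) -> str:
    """Return a <speak> document containing one <resemble:convert> element."""
    if not -10.0 <= pitch <= 10.0:
        raise ValueError("pitch must be between -10.0 and 10.0")
    attrs = f"src={quoteattr(src)}"
    if pitch != 0:
        attrs += f' pitch="{pitch}"'
    return f"<speak><resemble:convert {attrs}/></speak>"


url = "https://resemble-data.s3.us-east-2.amazonaws.com/source-s2s.wav"
print(convert_ssml(url))
# <speak><resemble:convert src="https://resemble-data.s3.us-east-2.amazonaws.com/source-s2s.wav"/></speak>
print(convert_ssml(url, pitch=2.5))
# <speak><resemble:convert src="https://resemble-data.s3.us-east-2.amazonaws.com/source-s2s.wav" pitch="2.5"/></speak>
```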

<resemble:fill> tag

Resemble Fill enables you to take existing recordings of speech and modify them seamlessly (audio inpainting).

⚠️ Please see Using Resemble Fill Through The API for detailed instructions.

Example

The following sample has been generated using this input:

<resemble:fill recording_uuid="06ba9935">The Finch gladly accepted the invitation and arrived in good time and with a very good appetite.</resemble:fill>

Syntax

<resemble:fill recording_uuid="<string>"></resemble:fill>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| recording_uuid | Yes | The UUID of the recording you want to modify. See Using Resemble Fill Through The API for detailed instructions. |

<audio> tag

The <audio> tag allows you to insert recorded audio files alongside the synthesized speech output.

Example

The following sample has been generated using this input:

<audio src="angry_cow.mp3">
<desc>An angry cow</desc>
Moo!!! (The sound failed to load)
</audio>

Syntax

<audio
src="<string>"
soundLevel="<string>"
background="<boolean>">
Hello there
</audio>

Attributes

| Attribute | Required | Description |
| --- | --- | --- |
| src | Yes | A URI referring to an audio source. The file must be in WAV format. |
| soundLevel | No | Changes the volume level of the audio, specified as a percentage. |
| background | No | Whether to play the audio file in the background of the spoken text or inline. For example, playing music in the background of a spoken text prompt. |