Version: 2.0.0

Timestamps

The main purpose of the timestamps JSON object is to indicate the time at which a grapheme or a phoneme can be heard in the synthesized audio. All APIs pertaining to clips return a timestamps JSON object that contains the following attributes: graph_chars, graph_times, phon_chars, and phon_times. A detailed description of these attributes can be found below. Note that for streaming, this format is different.

Examples of use cases for timestamps:

  • For graphemes: a text reader can display which words are currently being read.
  • For phonemes: an animated model can synchronize mouth movements with the synthesized audio.

Sync synthesis

The timestamps Object

| Attribute | Description |
| --- | --- |
| `graph_chars` | An array of individual characters that make up the text used to synthesize the audio. This is the original text you sent in your request, minus the SSML tags, and with XML characters unescaped if you passed any (`&gt;` -> `>`). |
| `graph_times` | An array of pairs of floats corresponding to the time in the audio at which the grapheme in `graph_chars` with the same index starts and ends. |
| `phon_chars` | An array of individual phonemes as IPA characters (UTF-8 encoded) making up the sounds pronounced in the audio. Phonemes may be accompanied by diacritics and stress characters. |
| `phon_times` | An array of pairs of floats corresponding to the time in the audio at which the phoneme in `phon_chars` with the same index starts and ends. |

Example

With this input text:

"Hey there."

We get the following timestamps object:

```json
"timestamps": {
  "graph_chars": [
    "H",
    "e",
    "y",
    " ",
    "t",
    "h",
    "e",
    "r",
    "e",
    "."
  ],
  "graph_times": [
    [0.0374, 0.1247],
    [0.0873, 0.1746],
    [0.1372, 0.2245],
    [0.1746, 0.3118],
    [0.2744, 0.3866],
    [0.2744, 0.3866],
    [0.3617, 0.4864],
    [0.4615, 0.5862],
    [0.4615, 0.5862],
    [0.5488, 0.6984]
  ],
  "phon_chars": [
    "h",
    "ˈe",
    "ɪ",
    " ",
    "ð",
    "ˈɛ",
    "ɹ",
    "."
  ],
  "phon_times": [
    [0.0374, 0.1247],
    [0.0873, 0.1746],
    [0.1372, 0.2245],
    [0.1746, 0.3118],
    [0.2744, 0.3866],
    [0.3617, 0.4864],
    [0.4615, 0.5862],
    [0.5488, 0.6984]
  ]
}
```
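To make the pairing concrete, here is a minimal sketch in Python that parses the example object above and prints each grapheme next to its start and end times. It is not tied to any client library; `response_body` is simply the example JSON shown above, hand-copied for illustration.

```python
import json

# The example timestamps object from this page, wrapped as a JSON document.
response_body = """
{
  "timestamps": {
    "graph_chars": ["H", "e", "y", " ", "t", "h", "e", "r", "e", "."],
    "graph_times": [[0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245],
                    [0.1746, 0.3118], [0.2744, 0.3866], [0.2744, 0.3866],
                    [0.3617, 0.4864], [0.4615, 0.5862], [0.4615, 0.5862],
                    [0.5488, 0.6984]],
    "phon_chars": ["h", "\\u02c8e", "\\u026a", " ", "\\u00f0", "\\u02c8\\u025b", "\\u0279", "."],
    "phon_times": [[0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245],
                   [0.1746, 0.3118], [0.2744, 0.3866], [0.3617, 0.4864],
                   [0.4615, 0.5862], [0.5488, 0.6984]]
  }
}
"""

timestamps = json.loads(response_body)["timestamps"]

# Each grapheme at index i is paired with the [start, end] interval at index i.
for char, (start, end) in zip(timestamps["graph_chars"], timestamps["graph_times"]):
    print(f"{char!r}: {start}s -> {end}s")
```

Note that `graph_chars` and `graph_times` always have the same length, as do `phon_chars` and `phon_times`, so zipping them index by index is safe.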

Each index in the graph_chars array has an equivalent index in the graph_times array, which gives the times at which that grapheme starts and stops being spoken. For example, at index 0 in the graph_chars array, we have:

H

And at index 0 in the graph_times array, we have:

[0.0374, 0.1247]

Therefore, the character H started being spoken at 0.0374s and finished being spoken at 0.1247s.

The same applies for the phon_chars and phon_times attributes.
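For the text-reader use case mentioned earlier, a player can look up which characters are audible at the current playback position. The sketch below is an illustrative helper (the function name `chars_at` is not part of the API); note that in the example data adjacent intervals can overlap, so more than one character may match a given time.

```python
def chars_at(chars, times, t):
    """Return the characters whose [start, end] interval contains time t (seconds)."""
    return [c for c, (start, end) in zip(chars, times) if start <= t <= end]

# Example data from this page.
graph_chars = ["H", "e", "y", " ", "t", "h", "e", "r", "e", "."]
graph_times = [[0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245],
               [0.1746, 0.3118], [0.2744, 0.3866], [0.2744, 0.3866],
               [0.3617, 0.4864], [0.4615, 0.5862], [0.4615, 0.5862],
               [0.5488, 0.6984]]

print(chars_at(graph_chars, graph_times, 0.1))   # → ['H', 'e']
```

The same helper works unchanged for `phon_chars` and `phon_times`, e.g. to drive mouth-shape animation from the phoneme audible at the current frame.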

Streaming synthesis

See how timestamps are encoded during streaming synthesis in the specification of our WAV format.