Version: 2.0.0

Timestamps

The timestamps JSON object reports the time at which each grapheme or phoneme can be heard in the synthesized audio. All APIs pertaining to clips return a timestamps object containing the following attributes: graph_chars, graph_times, phon_chars, and phon_times. A detailed description of these attributes can be found below. Note that the format is different for streaming synthesis.

Examples of use cases for timestamps:

  • For graphemes: a text reader can display which words are currently being read.
  • For phonemes: an animated model can synchronize mouth movements with the synthesized audio.
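As an illustration of the text-reader use case, the sketch below groups grapheme-level timestamps into word-level time spans by splitting on whitespace characters. The field names follow the timestamps object described below; the words() helper itself is illustrative and not part of the API.

```python
# Sketch: derive word-level time spans from grapheme timestamps.
# The dict shape mirrors the API's graph_chars/graph_times attributes;
# the words() helper is a hypothetical client-side utility.

def words(timestamps):
    """Yield (word, start, end) tuples from grapheme-level timestamps."""
    chars = timestamps["graph_chars"]
    times = timestamps["graph_times"]
    word, start, end = "", None, None
    for ch, (t0, t1) in zip(chars, times):
        if ch.isspace():
            if word:
                yield word, start, end
            word, start, end = "", None, None
        else:
            if not word:
                start = t0  # first grapheme's start time opens the word
            word += ch
            end = t1        # last grapheme's end time closes the word
    if word:
        yield word, start, end

example = {
    "graph_chars": ["H", "e", "y", " ", "t", "h", "e", "r", "e", "."],
    "graph_times": [
        [0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245],
        [0.1746, 0.3118], [0.2744, 0.3866], [0.2744, 0.3866],
        [0.3617, 0.4864], [0.4615, 0.5862], [0.4615, 0.5862],
        [0.5488, 0.6984],
    ],
}

for word, start, end in words(example):
    print(f"{word!r}: {start}s to {end}s")
# → 'Hey': 0.0374s to 0.2245s
# → 'there.': 0.2744s to 0.6984s
```

A text reader could use these word spans to highlight each word while it is being spoken.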

Sync synthesis

The timestamps Object

graph_chars
  An array of individual characters that make up the text used to synthesize the audio (without SSML tags).
graph_times
  An array of [start, end] float pairs giving the times in the audio at which the grapheme at the same index in graph_chars starts and ends being spoken.
phon_chars
  An array of individual phonemes as IPA characters (UTF-8 encoded) making up the sounds pronounced in the audio. Phonemes may be accompanied by diacritics and stress characters.
phon_times
  An array of [start, end] float pairs giving the times in the audio at which the phoneme at the same index in phon_chars starts and ends being spoken.

Example

With this input text:

"Hey there."

We get the following timestamps object:

"timestamps": {
"graph_chars": [
"H",
"e",
"y",
" ",
"t",
"h",
"e",
"r",
"e",
"."
],
"graph_times": [
[0.0374, 0.1247],
[0.0873, 0.1746],
[0.1372, 0.2245],
[0.1746, 0.3118],
[0.2744, 0.3866],
[0.2744, 0.3866],
[0.3617, 0.4864],
[0.4615, 0.5862],
[0.4615, 0.5862],
[0.5488, 0.6984]
],
"phon_chars": [
"h",
"ˈe",
"ɪ",
" ",
"ð",
"ˈɛ",
"ɹ",
"."
],
"phon_times": [
[0.0374, 0.1247],
[0.0873, 0.1746],
[0.1372, 0.2245],
[0.1746, 0.3118],
[0.2744, 0.3866],
[0.3617, 0.4864],
[0.4615, 0.5862],
[0.5488, 0.6984]
]
}

Each index in the graph_chars array has a corresponding index in the graph_times array giving the times at which that grapheme starts and finishes being spoken. For example, at index 0 in the graph_chars array, we have:

H

And at index 0 in the graph_times array, we have:

[0.0374, 0.1247]

Therefore, the character H started being spoken at 0.0374s and finished being spoken at 0.1247s.

The same applies for the phon_chars and phon_times attributes.
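This index pairing can be read with a simple zip over the two arrays. A minimal sketch, using a shortened version of the example object above:

```python
# Sketch: pair each grapheme with its [start, end] span by shared index.
timestamps = {
    "graph_chars": ["H", "e", "y"],
    "graph_times": [[0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245]],
}

# The arrays must be the same length for the index pairing to hold.
assert len(timestamps["graph_chars"]) == len(timestamps["graph_times"])

for ch, (start, end) in zip(timestamps["graph_chars"],
                            timestamps["graph_times"]):
    print(f"{ch!r} spoken from {start}s to {end}s")
# → 'H' spoken from 0.0374s to 0.1247s
# → 'e' spoken from 0.0873s to 0.1746s
# → 'y' spoken from 0.1372s to 0.2245s
```

The same loop works unchanged for phon_chars and phon_times.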

Streaming synthesis

See the specification of our wav format for how timestamps are encoded during streaming synthesis.