Timestamps
The main purpose of the `timestamps` JSON object is to indicate the time at which a grapheme or a phoneme can be heard in the synthesized audio. All APIs pertaining to clips return a `timestamps` JSON object that contains the following attributes: `graph_chars`, `graph_times`, `phon_chars`, and `phon_times`. A detailed description of these attributes can be found below. Note that this format is different for streaming.
Examples of use cases for `timestamps`:
- For graphemes: a text reader can display which words are currently being read.
- For phonemes: an animated model can synchronize mouth movements with the synthesized audio.
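As a minimal sketch of the grapheme use case, the Python snippet below looks up which characters are audible at a given moment. The arrays are hard-coded from the "Hey there." example later on this page; in practice they would come from an API response.

```python
# Data copied from the "Hey there." example in this document.
graph_chars = ["H", "e", "y", " ", "t", "h", "e", "r", "e", "."]
graph_times = [
    [0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245], [0.1746, 0.3118],
    [0.2744, 0.3866], [0.2744, 0.3866], [0.3617, 0.4864], [0.4615, 0.5862],
    [0.4615, 0.5862], [0.5488, 0.6984],
]

def chars_at(t: float) -> list[str]:
    """Return every grapheme whose [start, end] interval contains time t (seconds)."""
    return [c for c, (start, end) in zip(graph_chars, graph_times) if start <= t <= end]

print(chars_at(0.30))  # → [' ', 't', 'h']
```

Note that intervals can overlap (several graphemes may be audible at once), so the lookup returns a list rather than a single character.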
Sync synthesis
The `timestamps` object
| Attribute | Description |
|---|---|
| `graph_chars` | An array of the individual characters that make up the text used to synthesize the audio. This is the original text you sent in your request, minus any SSML tags, and with XML characters unescaped if you passed any (e.g. `&gt;` becomes `>`). |
| `graph_times` | An array of pairs of floats giving the times in the audio at which the grapheme at the same index in `graph_chars` starts and ends. |
| `phon_chars` | An array of individual phonemes, as UTF-8-encoded IPA characters, making up the sounds pronounced in the audio. Phonemes may be accompanied by diacritics and stress marks. |
| `phon_times` | An array of pairs of floats giving the times in the audio at which the phoneme at the same index in `phon_chars` starts and ends. |
Example
With this input text:
"Hey there."
We get the following `timestamps` object:
```json
"timestamps": {
  "graph_chars": [
    "H",
    "e",
    "y",
    " ",
    "t",
    "h",
    "e",
    "r",
    "e",
    "."
  ],
  "graph_times": [
    [0.0374, 0.1247],
    [0.0873, 0.1746],
    [0.1372, 0.2245],
    [0.1746, 0.3118],
    [0.2744, 0.3866],
    [0.2744, 0.3866],
    [0.3617, 0.4864],
    [0.4615, 0.5862],
    [0.4615, 0.5862],
    [0.5488, 0.6984]
  ],
  "phon_chars": [
    "h",
    "ˈe",
    "ɪ",
    " ",
    "ð",
    "ˈɛ",
    "ɹ",
    "."
  ],
  "phon_times": [
    [0.0374, 0.1247],
    [0.0873, 0.1746],
    [0.1372, 0.2245],
    [0.1746, 0.3118],
    [0.2744, 0.3866],
    [0.3617, 0.4864],
    [0.4615, 0.5862],
    [0.5488, 0.6984]
  ]
}
```
Each index in the `graph_chars` array has a matching index in the `graph_times` array, which gives the times at which that grapheme starts and finishes being spoken.

For example, at index `0` in the `graph_chars` array we have `H`, and at index `0` in the `graph_times` array we have `[0.0374, 0.1247]`. Therefore, the character `H` started being spoken at 0.0374 s into the audio and finished being spoken at 0.1247 s.

The same applies to the `phon_chars` and `phon_times` attributes.
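Since the two arrays are index-aligned, they can be zipped directly. As an illustrative sketch (not part of the API), the Python snippet below groups `graph_chars` into whitespace-separated words and derives each word's start and end time from its first and last character, using the data from the example above:

```python
# Data copied from the "Hey there." example in this document.
graph_chars = ["H", "e", "y", " ", "t", "h", "e", "r", "e", "."]
graph_times = [
    [0.0374, 0.1247], [0.0873, 0.1746], [0.1372, 0.2245], [0.1746, 0.3118],
    [0.2744, 0.3866], [0.2744, 0.3866], [0.3617, 0.4864], [0.4615, 0.5862],
    [0.4615, 0.5862], [0.5488, 0.6984],
]

def word_timings(chars, times):
    """Group index-aligned (char, [start, end]) pairs into (word, start, end) tuples."""
    words = []
    current, start, end = "", None, None
    for c, (s, e) in zip(chars, times):
        if c == " ":
            # Word boundary: flush the word accumulated so far.
            if current:
                words.append((current, start, end))
            current, start, end = "", None, None
        else:
            if not current:
                start = s  # first character of the word sets its start time
            current += c
            end = e  # last character seen so far sets its end time
    if current:
        words.append((current, start, end))
    return words

print(word_timings(graph_chars, graph_times))
# → [('Hey', 0.0374, 0.2245), ('there.', 0.2744, 0.6984)]
```

This is the kind of grouping a text reader would use to highlight the word currently being read.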
Streaming synthesis
See the specification of our WAV format for how timestamps are encoded during streaming synthesis.