Version: 1.0.0

Audio Timestamps

The audio_timestamps JSON object tells you when each grapheme or phoneme can be heard in the synthesized audio. All APIs pertaining to clips return an audio_timestamps JSON object that contains the following attributes: graph_chars, graph_times, phon_chars, and phon_times. Each attribute is described in detail below.

A good use case for audio_timestamps is animation: an animated model can use the timings to synchronize its mouth movements with the synthesized audio.

The audio_timestamps Object

| Attribute | Description |
| --- | --- |
| graph_chars | An array of the characters (graphemes) in the text used to synthesize the audio. |
| graph_times | A 2D array of floats. Each entry holds the start and end time, in seconds, of the grapheme at the same index in the graph_chars array. |
| phon_chars | An array of Kirshenbaum IPA (ASCII IPA) characters representing the phonetic equivalent of the text used to synthesize the audio. |
| phon_times | A 2D array of floats. Each entry holds the start and end time, in seconds, of the phoneme at the same index in the phon_chars array. |
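
Because each chars array is meant to line up index by index with its times array, it can be worth checking that alignment before using the data. The following is a minimal sketch, not part of the API: the function name validate_audio_timestamps is hypothetical, and the key names are assumed to match the example response (phon_chars and phon_times).

```python
from typing import Any


def validate_audio_timestamps(ts: dict[str, Any]) -> None:
    """Raise ValueError if a chars array and its times array are not aligned."""
    # Assumption: key names follow the example response in this document.
    for chars_key, times_key in (("graph_chars", "graph_times"),
                                 ("phon_chars", "phon_times")):
        chars, times = ts[chars_key], ts[times_key]
        if len(chars) != len(times):
            raise ValueError(f"{chars_key} has {len(chars)} entries but "
                             f"{times_key} has {len(times)}")
        for i, pair in enumerate(times):
            # Each entry is expected to be a [start, end] pair with start <= end.
            if len(pair) != 2 or pair[0] > pair[1]:
                raise ValueError(f"{times_key}[{i}] is not a [start, end] pair: {pair}")


# A small excerpt of the example response ("a" followed by a space).
sample = {
    "graph_chars": ["a", " "],
    "graph_times": [[0.3367, 0.3831], [0.3831, 0.418]],
    "phon_chars": ["ɐ", " "],
    "phon_times": [[0.3367, 0.3831], [0.3831, 0.418]],
}
validate_audio_timestamps(sample)  # returns without raising
```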

How to use the audio_timestamps Object

Given the following input text:

"This is a test."

The following audio_timestamps object will be returned (times may vary depending on the voice you're using):

"audio_timestamps": {
  "graph_chars": ["T", "h", "i", "s", " ", "i", "s", " ", "a", " ", "t", "e", "s", "t", "."],
  "graph_times": [
    [0.0116, 0.0464],
    [0.0116, 0.0464],
    [0.0464, 0.0929],
    [0.0929, 0.1509],
    [0.1509, 0.1974],
    [0.1974, 0.2438],
    [0.2438, 0.2902],
    [0.2902, 0.3367],
    [0.3367, 0.3831],
    [0.3831, 0.418],
    [0.418, 0.476],
    [0.476, 0.6502],
    [0.6502, 0.7663],
    [0.7663, 0.8707],
    [0.8707, 0.952]
  ],
  "phon_chars": ["ð", "ɪ", "s", " ", "ɪ", "z", " ", "ɐ", " ", "t", "ˈ", "ɛ", "s", "t", "."],
  "phon_times": [
    [0.0116, 0.0464],
    [0.0464, 0.0929],
    [0.0929, 0.1509],
    [0.1509, 0.1974],
    [0.1974, 0.2438],
    [0.2438, 0.2902],
    [0.2902, 0.3367],
    [0.3367, 0.3831],
    [0.3831, 0.418],
    [0.418, 0.476],
    [0.476, 0.5457],
    [0.5457, 0.6502],
    [0.6502, 0.7663],
    [0.7663, 0.8707],
    [0.8707, 0.952]
  ]
}

Each index in the graph_chars array has a matching index in the graph_times array, which holds the times at which that grapheme starts and finishes being spoken. For example, at index 0 in the graph_chars array, we have:

T

And at index 0 in the graph_times array, we have:

[0.0116, 0.0464]

Therefore, the grapheme T starts being spoken at 0.0116s and finishes at 0.0464s.
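
The same index pairing can be used to derive timings for larger units. As an illustration only, the sketch below groups consecutive non-space graphemes into words and takes the start time of the first grapheme and the end time of the last. The helper name word_timings is hypothetical, and splitting on the space character is an assumption that may not suit every input text.

```python
def word_timings(graph_chars, graph_times):
    """Group consecutive non-space graphemes into (word, start, end) tuples."""
    words = []
    current, start, end = "", None, None
    for char, (t0, t1) in zip(graph_chars, graph_times):
        if char == " ":
            # A space closes the word collected so far, if any.
            if current:
                words.append((current, start, end))
            current, start, end = "", None, None
        else:
            if not current:
                start = t0  # first grapheme of a new word
            current += char
            end = t1  # keep extending the end time
    if current:
        words.append((current, start, end))
    return words


# Excerpt of the example response: "is a"
graph_chars = ["i", "s", " ", "a"]
graph_times = [[0.1974, 0.2438], [0.2438, 0.2902],
               [0.2902, 0.3367], [0.3367, 0.3831]]

print(word_timings(graph_chars, graph_times))
# [('is', 0.1974, 0.2902), ('a', 0.3367, 0.3831)]
```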

The same concept applies to the phon_chars and phon_times attributes.
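
For the animation use case, a player typically needs to know which phoneme is active at the current playback time. One possible approach is a binary search over the start times, sketched below. The function name phoneme_at is hypothetical, and the code assumes the entries in phon_times are sorted by start time, as they are in the example response.

```python
from bisect import bisect_right


def phoneme_at(t, phon_chars, phon_times):
    """Return the phoneme whose [start, end) span contains time t, or None."""
    # Assumption: phon_times is sorted by start time.
    starts = [start for start, _ in phon_times]
    i = bisect_right(starts, t) - 1  # last entry starting at or before t
    if i >= 0 and t < phon_times[i][1]:
        return phon_chars[i]
    return None


# Excerpt of the example response: the first three phonemes.
phon_chars = ["ð", "ɪ", "s"]
phon_times = [[0.0116, 0.0464], [0.0464, 0.0929], [0.0929, 0.1509]]

print(phoneme_at(0.05, phon_chars, phon_times))  # ɪ
print(phoneme_at(0.0, phon_chars, phon_times))   # None
```

An animation loop could call such a lookup once per frame with the audio player's current time and map the returned phoneme to a mouth shape.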