Audio Timestamps
The main purpose of the audio_timestamps JSON object is to tell you when each grapheme or phoneme can be heard in the synthesized audio. All APIs pertaining to clips return an audio_timestamps JSON object that contains the following attributes: graph_chars, graph_times, phon_chars, and phon_times. A detailed description of these attributes can be found below.
A good use case for audio_timestamps is animation: an animated model can synchronize its mouth movements with the synthesized audio.
The audio_timestamps Object
Attribute | Description |
---|---|
graph_chars | An array of characters containing the equivalent of the text used to synthesize the audio. |
graph_times | A 2D array of floats. Each entry is a [start, end] pair giving the times, in seconds, at which the grapheme at the same index in the graph_chars array starts and ends in the audio. |
phon_chars | An array of Kirshenbaum IPA (ASCII IPA) characters representing the phonetic equivalent of the text used to synthesize the audio. |
phon_times | A 2D array of floats. Each entry is a [start, end] pair giving the times, in seconds, at which the phoneme at the same index in the phon_chars array starts and ends in the audio. |
How to use the audio_timestamps Object
Given the following input text:
"This is a test."
The following audio_timestamps object will be returned (times may vary depending on the voice you're using):
"audio_timestamps": {
"graph_chars": [
"T",
"h",
"i",
"s",
" ",
"i",
"s",
" ",
"a",
" ",
"t",
"e",
"s",
"t",
"."
],
"graph_times": [
[
0.0116,
0.0464
],
[
0.0116,
0.0464
],
[
0.0464,
0.0929
],
[
0.0929,
0.1509
],
[
0.1509,
0.1974
],
[
0.1974,
0.2438
],
[
0.2438,
0.2902
],
[
0.2902,
0.3367
],
[
0.3367,
0.3831
],
[
0.3831,
0.418
],
[
0.418,
0.476
],
[
0.476,
0.6502
],
[
0.6502,
0.7663
],
[
0.7663,
0.8707
],
[
0.8707,
0.952
]
],
"phon_chars": [
"ð",
"ɪ",
"s",
" ",
"ɪ",
"z",
" ",
"ɐ",
" ",
"t",
"ˈ",
"ɛ",
"s",
"t",
"."
],
"phon_times": [
[
0.0116,
0.0464
],
[
0.0464,
0.0929
],
[
0.0929,
0.1509
],
[
0.1509,
0.1974
],
[
0.1974,
0.2438
],
[
0.2438,
0.2902
],
[
0.2902,
0.3367
],
[
0.3367,
0.3831
],
[
0.3831,
0.418
],
[
0.418,
0.476
],
[
0.476,
0.5457
],
[
0.5457,
0.6502
],
[
0.6502,
0.7663
],
[
0.7663,
0.8707
],
[
0.8707,
0.952
]
]
}
Each index in the graph_chars array has a matching index in the graph_times array, which holds the times at which that grapheme starts and finishes being spoken.
For example, at index 0 in the graph_chars array, we have:
T
And at index 0 in the graph_times array, we have:
[
0.0116,
0.0464
]
Therefore, the grapheme T started being spoken at 0.0116s and finished being spoken at 0.0464s.
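The index-by-index pairing described above can be sketched in Python. This is a minimal illustration, not part of the API: the helper name pair_with_times is hypothetical, and the data is a shortened copy of the example response (first four graphemes only).

```python
# Shortened copy of the example response above (first four graphemes only).
audio_timestamps = {
    "graph_chars": ["T", "h", "i", "s"],
    "graph_times": [
        [0.0116, 0.0464],
        [0.0116, 0.0464],
        [0.0464, 0.0929],
        [0.0929, 0.1509],
    ],
}


def pair_with_times(chars, times):
    """Return (char, start, end) tuples for two parallel arrays.

    Hypothetical helper: it only relies on the arrays being the same length.
    """
    if len(chars) != len(times):
        raise ValueError("chars and times must have the same length")
    return [(char, start, end) for char, (start, end) in zip(chars, times)]


graphemes = pair_with_times(
    audio_timestamps["graph_chars"], audio_timestamps["graph_times"]
)
for char, start, end in graphemes:
    print(f"{char!r}: {start}s to {end}s")
```

The length check is there because the two arrays are only meaningful when read in parallel; a mismatch would mean the response was truncated or altered.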
The same concept applies to the phon_chars and phon_times attributes.
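For the animation use case, a client typically needs to know which phoneme is being spoken at the current playback time. The sketch below shows one possible way to do that lookup with a binary search. The function name phoneme_at is hypothetical, the key names follow the example response above, and the code assumes the intervals are sorted by start time and do not overlap, as they are in the example.

```python
from bisect import bisect_right

# First four phonemes from the example response above.
phon_chars = ["ð", "ɪ", "s", " "]
phon_times = [
    [0.0116, 0.0464],
    [0.0464, 0.0929],
    [0.0929, 0.1509],
    [0.1509, 0.1974],
]


def phoneme_at(t, chars, times):
    """Return the phoneme whose [start, end) interval contains t, or None.

    Assumes `times` is sorted by start time with non-overlapping intervals.
    """
    starts = [start for start, _ in times]
    # Index of the last interval that starts at or before t.
    i = bisect_right(starts, t) - 1
    if i < 0:
        return None  # t is before the first phoneme
    start, end = times[i]
    return chars[i] if start <= t < end else None


# Example: look up the phoneme 50 ms into the audio.
current = phoneme_at(0.05, phon_chars, phon_times)
print(current)
```

An animation loop could call such a function once per frame with the audio player's current position and map the returned phoneme to a mouth shape. A time that falls exactly on a boundary is assigned to the phoneme that starts there.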