In all of the examples below we assume a scenario similar to this example from GitHub. The audio comes from an uploaded DataObject (uploaded using the /data API). We rely on the audio file having a header, so there is no need to provide audio.format settings; the system will use the header to determine, for example, whether the input audio is stereo or mono. For brevity we omit settings.asr, except for diarization settings where applicable.
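As a reminder of what that upload step might look like, here is a minimal Python sketch. The endpoint path, the form field name, and the name of the UUID field in the response are assumptions for illustration only; consult the /data API reference for the exact contract.

import requests

# Hypothetical upload sketch: the endpoint path, field names, and response
# shape are assumptions; check the /data API reference for the real contract.
API_BASE = "https://api.voicegain.ai/v1"
JWT = "<your-JWT-token>"

with open("call-recording.wav", "rb") as f:
    resp = requests.post(
        f"{API_BASE}/data/file",
        headers={"Authorization": f"Bearer {JWT}"},
        files={"file": ("call-recording.wav", f, "audio/wav")},
    )
resp.raise_for_status()
data_object_uuid = resp.json()["objectId"]  # assumed response field name
print("Uploaded DataObject UUID:", data_object_uuid)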
Transcription of a mono/stereo file as a single channel w/o diarization
In this case we rely on the default setting of sessions[].audioChannelSelector, which is mix. This means that if the source audio is mono we will treat it as mono (1 channel), and if it is stereo we will mix the left and right channels together into a single channel that will be transcribed. There is no diarization setting, so the audio will be transcribed as if there were just one speaker.
{
  "sessions": [{
    "asyncMode": "OFF-LINE",
    "poll": {
      "persist": 120000
    },
    "content": {
      "incremental": ["progress"],
      "full": ["words"]
    }
  }],
  "audio": {
    "source": {
      "dataStore": {
        "uuid": "<data-object-UUID>"
      }
    }
  },
  "settings": {
    "asr": {
      "acousticModelNonRealTime": "VoiceGain-omega"
    }
  }
}
The example above also shows how to switch to a specific model for transcription ("acousticModelNonRealTime" : "VoiceGain-omega"). Normally you do not need to specify the acoustic model, since VoiceGain-omega is our default out-of-the-box offline model. However, if you want to use a custom model for your use case, specify the model name provided by your Voicegain contact.
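A request body like the one above is submitted with a single POST to the asynchronous transcribe endpoint. Below is a minimal Python sketch; the /asr/transcribe/async path and the shape of the response (sessionId, poll URL) are assumptions based on this article, so check the Transcribe API reference for the exact details.

import requests

API_BASE = "https://api.voicegain.ai/v1"
JWT = "<your-JWT-token>"

request_body = {
    "sessions": [{
        "asyncMode": "OFF-LINE",
        "poll": {"persist": 120000},
        "content": {"incremental": ["progress"], "full": ["words"]}
    }],
    "audio": {"source": {"dataStore": {"uuid": "<data-object-UUID>"}}},
    "settings": {"asr": {"acousticModelNonRealTime": "VoiceGain-omega"}}
}

resp = requests.post(
    f"{API_BASE}/asr/transcribe/async",  # assumed endpoint path
    headers={"Authorization": f"Bearer {JWT}"},
    json=request_body,
)
resp.raise_for_status()
session = resp.json()["sessions"][0]  # assumed response shape
print("sessionId:", session.get("sessionId"))
print("poll URL:", session.get("poll", {}).get("url"))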
Transcription of the Left channel of a stereo file w/o diarization
In this case we set sessions[].audioChannelSelector to left. This means that if the source audio is stereo we will transcribe only the left audio channel of that file. If the audio happens to be mono, it will be transcribed as mono (1 channel). There is no diarization setting, so the audio will be transcribed as if there were just one speaker.
{
  "sessions": [{
    "asyncMode": "OFF-LINE",
    "audioChannelSelector": "left",
    "poll": {
      "persist": 120000
    },
    "content": {
      "incremental": ["progress"],
      "full": ["words"]
    }
  }],
  "audio": {
    "source": {
      "dataStore": {
        "uuid": "<data-object-UUID>"
      }
    }
  },
  "settings": {
    "asr": {}
  }
}
Transcription of a mono/stereo file as a single channel with diarization
In this case we again rely on the default setting of sessions[].audioChannelSelector, which is mix. This means that if the source audio is mono we will treat it as mono (1 channel), and if it is stereo we will mix the left and right channels together into a single channel that will be transcribed. Here the diarization setting is configured to expect exactly 2 speakers: both minSpeakers and maxSpeakers are set to 2. The returned transcript will have each recognized word labeled with the speaker identified by the diarization algorithm.
{
  "sessions": [{
    "asyncMode": "OFF-LINE",
    "poll": {
      "persist": 120000
    },
    "content": {
      "incremental": ["progress"],
      "full": ["words"]
    }
  }],
  "audio": {
    "source": {
      "dataStore": {
        "uuid": "<data-object-UUID>"
      }
    }
  },
  "settings": {
    "asr": {
      "diarization": {
        "minSpeakers": 2,
        "maxSpeakers": 2
      }
    }
  }
}
Here is an example of a diarized transcript with spk labels: words are grouped into sections per speaker, with the start timestamp and duration of each section also provided.
[
{"words":[{"utterance":"I","confidence":0.980556727294821,"start":80,"duration":80,"spk":2}, ... ,{"utterance":"matter.","confidence":0.9927005169638606,"start":7200,"duration":80,"spk":2}],"start":80,"duration":7200,"spk":2},
{"words":[{"utterance":"For","confidence":0.9624655850006895,"start":7700,"duration":80,"spk":1}, ... ,{"utterance":"Yes","confidence":0.9887934456864098,"start":9060,"duration":240,"spk":1}],"start":7700,"duration":1600,"spk":1} ... ]
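Given output in this format, rendering a readable speaker-attributed transcript is a simple grouping exercise. A small Python sketch (we assume that start and duration are in milliseconds, which the sample values suggest):

def render_transcript(sections):
    # `sections` is the parsed JSON array shown above: one entry per
    # speaker section, each with a word list and a spk label.
    lines = []
    for section in sections:
        text = " ".join(w["utterance"] for w in section["words"])
        start_sec = section["start"] / 1000.0  # assuming milliseconds
        lines.append(f"[{start_sec:7.2f}s] spk{section['spk']}: {text}")
    return "\n".join(lines)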
Note: the transcript output with confidences and speakers, as in the sample above, is returned in response to a request to the polling URL with the ?full=true parameter, or in response to an API request to /v1/asr/transcribe/{sessionId}/transcript?format=json
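A hedged Python sketch of that polling step follows; the poll URL comes from the session-creation response, and the completion flag checked below is an assumption, so verify the field names against an actual poll response.

import time
import requests

def wait_for_transcript(poll_url, jwt, interval_sec=5):
    # Poll with ?full=true until the result is final, then return the body.
    while True:
        resp = requests.get(
            poll_url,
            params={"full": "true"},
            headers={"Authorization": f"Bearer {jwt}"},
        )
        resp.raise_for_status()
        body = resp.json()
        if body.get("result", {}).get("final"):  # assumed completion flag
            return body
        time.sleep(interval_sec)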
Transcription of a stereo file as two channels (w/o diarization)
We assume that there is just one speaker per channel; in that case there is no need for diarization.
If this were REAL-TIME transcription, we could use a single request that processes the left and right channels in separate sessions, as in this example.
In the case of OFF-LINE transcription, we use "audioChannelSelector" : "two-channel". In the back-end, two separate transcriptions are performed: one for the left channel and one for the right channel. The results are then collected and combined using the timestamp information. Each word in the output is annotated with a spk value: spk:1 for the Left channel and spk:2 for the Right channel. If this option is chosen, all diarization settings are ignored.
An example request body for two-channel transcription is below.
{
  "sessions": [{
    "asyncMode": "OFF-LINE",
    "audioChannelSelector": "two-channel",
    "poll": {
      "persist": 120000
    },
    "content": {
      "incremental": ["progress"],
      "full": ["words"]
    }
  }],
  "audio": {
    "source": {
      "dataStore": {
        "uuid": "<data-object-UUID>"
      }
    }
  },
  "settings": {
    "asr": {}
  }
}
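Because in two-channel mode spk:1 always denotes the Left channel and spk:2 the Right channel, the speaker labels can be mapped to meaningful role names when rendering the transcript. A short Python sketch (the role names are just examples):

from itertools import groupby

# spk maps deterministically to channels in two-channel mode:
# spk 1 = Left channel, spk 2 = Right channel.
CHANNEL_LABELS = {1: "Agent (Left)", 2: "Caller (Right)"}

def render_two_channel(words):
    # Group the time-ordered word list into consecutive per-channel turns.
    lines = []
    for spk, turn in groupby(words, key=lambda w: w["spk"]):
        text = " ".join(w["utterance"] for w in turn)
        lines.append(f"{CHANNEL_LABELS[spk]}: {text}")
    return "\n".join(lines)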