Below we describe how to use the Voicegain platform together with FreeSWITCH for real-time transcription of the audio (inbound and outbound channels) of the calls handled by FreeSWITCH.
mod_vg_tap_ws
In order to stream audio from FreeSWITCH to the Voicegain Speech-to-Text API you will need mod_vg_tap_ws, which is a FreeSWITCH application module. Voicegain makes it available as a set of C/C++ source files together with a makefile. To gain access to it you need a gitlab.com account. Please let us know your gitlab.com user name at support@voicegain.ai and we will give you access to the relevant code project.
In order to install mod_vg_tap_ws on your FreeSWITCH you will need to perform the following steps.
- Install the libwebsockets library:
apt-get install -yq libwebsockets-dev
- Create a new directory on the FreeSWITCH host and copy the provided source files there:
/usr/src/freeswitch/src/mod/applications/mod_vg_tap_ws
- Add "applications/mod_vg_tap_ws" to:
/usr/src/freeswitch/build/modules.conf.in
- Add "src/mod/applications/mod_vg_tap_ws/Makefile" to:
/usr/src/freeswitch/configure.ac
You can put it under the line with "src/mod/Makefile".
- In /usr/src/freeswitch run the following commands (the first command creates a Makefile from Makefile.am):
./bootstrap.sh -j
./configure
make
make install
- Finally, add "<load module="mod_vg_tap_ws"/>" to:
/usr/local/freeswitch/conf/autoload_configs/modules.conf.xml
mod_vg_tap_ws provides the following commands:
- uuid_vg_tap_ws <uuid> info - can be used to verify that mod_vg_tap_ws is loaded and responds to commands
- uuid_vg_tap_ws <uuid> start <url> - starts streaming; the <url> can be one of:
- websocket url (wss:// or ws://) - Audio will be streamed to this websocket in binary format.
- http url (https:// or http://) - This http url will be invoked using the GET method (with an empty body). The response has to be in text/plain format and must contain the websocket url to which the audio should be streamed.
- uuid_vg_tap_ws <uuid> stop - used to stop streaming
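For example, if you control FreeSWITCH from Python over the Event Socket, the commands above can be issued as shown in this minimal sketch. It assumes the standard Python ESL bindings and the default event-socket host, port, and password; the uuid and websocket URL values are hypothetical placeholders:

import ESL  # FreeSWITCH Event Socket Library Python bindings

# connect to the FreeSWITCH event socket (default credentials assumed)
con = ESL.ESLconnection("127.0.0.1", "8021", "ClueCon")

uuid = "11111111-2222-3333-4444-555555555555"  # UUID of the call leg to tap (placeholder)
ws_url = "wss://example.com/audio"             # websocket to stream the audio to (placeholder)

# verify that mod_vg_tap_ws is loaded and responds to commands
print(con.api("uuid_vg_tap_ws", uuid + " info").getBody())

# start streaming the call audio to the websocket
con.api("uuid_vg_tap_ws", uuid + " start " + ws_url)

# ... later, stop streaming
con.api("uuid_vg_tap_ws", uuid + " stop")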
Possible scenario for using mod_vg_tap_ws
The diagram below shows one possible scenario for using mod_vg_tap_ws. In this scenario mod_vg_tap_ws is given the wss:// url of the websocket to which to stream audio. This scenario will work well if you are already using scripting within your FreeSWITCH dialplan.
In the diagram:
- Your Voicegain Adapter - an HTTP web service with the necessary back-end logic to tie together FS/mod_vg_tap_ws, the Voicegain STT API, and your application
- Your Application - the part of your application that will consume the transcription results. If it is separate from Your Voicegain Adapter, it will need to be able to accept messages from Your Voicegain Adapter. Alternatively, for simplicity, you can embed Your Voicegain Adapter into Your Application.
Here is a sequence of events as annotated on the diagram:
- A call (e.g. from a customer) is received on FreeSWITCH - we assume that the call is forwarded to an Agent
- (skipped)
- A script referenced in your dialplan will call a web method provided by Adapter code that you will need to write. You will likely want to pass some relevant call session variables to it. You will also need to pass the FS session sample rate, unless it can be assumed to be the same for all calls.
- Your Adapter code calls the Voicegain /asr/transcribe/async API and starts a real-time transcription session. A sample body of the request is provided at the end of this article, and a minimal Adapter sketch is shown after this list.
- Voicegain API returns parameters for the new transcribe session which will include:
- audio websocket URL in audio.stream.websocketUrl
- two websocket URLs (one for each channel) for the websockets that will send back the results of transcription, in sessions[0].websocket.url and sessions[1].websocket.url
- Your Adapter will pass the audio websocket URL to the Lua script (in the response to the request from step (3)) and will pass the results websocket URLs to your application
- The Lua script will launch mod_vg_tap_ws, providing the websocket URL:
api:executeString('uuid_vg_tap_ws ' .. session:getVariable("uuid") .. ' start ' .. wsUrl);
- At the same time your application will connect to the results websockets
- mod_vg_tap_ws will stream the FreeSWITCH audio (left and right channels) over the audio websocket to the Voicegain STT engine, and your app will receive the transcript in real time over the two results websockets
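To make steps 4) through 6) concrete, here is a minimal sketch of the Adapter web method for this scenario, written in Python with Flask and requests. The endpoint path, the build_request_body() and notify_application() helpers, the api.voicegain.ai base URL, and the JWT handling are all assumptions for illustration; the request body itself is the one shown at the end of this article.

import os
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
VOICEGAIN_JWT = os.environ["VOICEGAIN_JWT"]  # your Voicegain API token

@app.route("/start-transcription", methods=["POST"])
def start_transcription():
    # step 3): the dialplan script posts the call variables, e.g. uuid and sample rate
    call = request.get_json()

    # step 4): start a real-time transcription session
    # (build_request_body() is a hypothetical helper returning the sample
    #  body shown at the end of this article, with audio.rate set from the call)
    resp = requests.post(
        "https://api.voicegain.ai/v1/asr/transcribe/async",
        json=build_request_body(call["rate"]),
        headers={"Authorization": "Bearer " + VOICEGAIN_JWT},
    )
    resp.raise_for_status()
    vg = resp.json()

    # step 5): pick out the websocket URLs returned by Voicegain
    audio_ws = vg["audio"]["stream"]["websocketUrl"]
    results_ws = [s["websocket"]["url"] for s in vg["sessions"]]

    # step 6): hand the results URLs to your application (hypothetical helper)
    # and return the audio URL to the Lua script
    notify_application(call["uuid"], results_ws)
    return jsonify({"audioWsUrl": audio_ws})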
Alternative scenario for using mod_vg_tap_ws
The diagram below shows another possible scenario for using mod_vg_tap_ws. It differs mainly in the initial steps. An advantage of this scenario is that it is easier to make it part of an existing dialplan that does not use a script.
Here is a sequence of events as annotated on the diagram:
- A call (e.g. from a customer) is received on FreeSWITCH - we assume that the call is forwarded to an Agent
- A script or a dialplan command will invoke mod_vg_tap_ws. Instead of passing a websocket URL, it will pass an http URL that mod_vg_tap_ws will use to call your Adapter code in the next step. All parameters need to be in the URL. Example parameters would be the FS UUID, sample rate, ani, dnis, plus any other ids or call information that may be useful to your application.
- mod_vg_tap_ws uses the URL it was given to invoke the Adapter method using HTTP GET. The response to this request needs to be of type text/plain and needs to contain the websocket URL (see step 6 and the sketch below)
The remaining steps are identical to those in the first scenario, except that step 6) returns the websocket URL to mod_vg_tap_ws rather than to the Lua script, and only one part of step 7) applies because mod_vg_tap_ws has already been invoked.
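Here the Adapter method differs from the first scenario only in how it is invoked and what it returns: the call parameters arrive as query-string arguments of an HTTP GET, and the response must be a text/plain body containing the audio websocket URL. Below is a minimal sketch, again using Flask; the endpoint path and the start_voicegain_session() helper (which would wrap the same /asr/transcribe/async call as above) are assumptions:

from flask import Flask, request, Response

app = Flask(__name__)

@app.route("/tap-start", methods=["GET"])
def tap_start():
    # all parameters are in the URL, e.g.
    # /tap-start?uuid=...&rate=8000&ani=...&dnis=...
    uuid = request.args["uuid"]
    rate = int(request.args.get("rate", 8000))

    # steps 4)-6) are the same as in the first scenario; this hypothetical
    # helper calls /asr/transcribe/async and returns audio.stream.websocketUrl
    audio_ws = start_voicegain_session(uuid, rate)

    # mod_vg_tap_ws expects a text/plain response containing the websocket URL
    return Response(audio_ws, mimetype="text/plain")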
Sample body of the request in step 4
Here is an example body of the request to /asr/transcribe/async - we use two separate sessions to transcribe the audio of the left and right (inbound and outbound) channels:
body = {
    "sessions": [
        {
            "asyncMode": "REAL-TIME",
            "audioChannelSelector": "left",
            "websocket": {
                "adHoc": True,
                "useSTOMP": False,
                "minimumDelay": 0
            },
            "content": {
                "incremental": ["words"],
                "full": []
            },
            "metadata": [
                {"name": "ANI", "value": "+19725180012"}
            ]
        },
        {
            "asyncMode": "REAL-TIME",
            "audioChannelSelector": "right",
            "websocket": {
                "adHoc": True,
                "useSTOMP": False,
                "minimumDelay": 0
            },
            "content": {
                "incremental": ["words"],
                "full": []
            },
            "metadata": [
                {"name": "DNIS", "value": "983476"}
            ]
        }
    ],
    "audio": {
        "source": {"stream": {"protocol": "WEBSOCKET"}},
        "format": "L16",
        "channels": "stereo",
        "rate": 8000,
        "capture": True
    },
    "settings": {
        "asr": {
            "acousticModelRealTime": "VoiceGain-kappa",
            "noInputTimeout": 60000,
            "completeTimeout": 0,
            "sensitivity": 0.99
        }
    }
}
Note that audio.rate may need to be changed, based on a parameter passed from FreeSWITCH, for example depending on whether you are handling pure SIP communications or WebRTC.
Also note that you may pass any metadata into the request. This allows you to tie Voicegain sessions to specific FreeSWITCH sessions.
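For example, if the dialplan passed rate and uuid parameters to your Adapter (hypothetical names), the Adapter could adjust the body above before sending the request:

# sample rate as reported by FreeSWITCH for this call
body["audio"]["rate"] = int(params["rate"])
# tie the Voicegain sessions back to the FreeSWITCH session
for session in body["sessions"]:
    session["metadata"].append({"name": "fsUuid", "value": params["uuid"]})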
Sample code handling results in step 8
You can find sample code in the voicegain/platform repository on GitHub: platform/async-real-time-websocket-two-channel-in-and-out.py (master branch); in particular, see the function process_ws_msg at line 101.
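For a self-contained starting point, here is a sketch of a consumer for the two results websockets using the Python websockets package (an assumption; the sample code above uses its own stack). It simply prints each incremental result message as it arrives; the URLs are placeholders for the two sessions[i].websocket.url values returned in step 5):

import asyncio
import websockets  # pip install websockets

async def consume_results(ws_url, channel):
    # ws_url is one of the sessions[i].websocket.url values from step 5)
    async with websockets.connect(ws_url) as ws:
        # with adHoc=True and useSTOMP=False each websocket message is a
        # plain payload with incremental transcription results
        async for message in ws:
            print(channel, message)

async def main():
    left_url = "wss://example.voicegain.ai/left"    # placeholder
    right_url = "wss://example.voicegain.ai/right"  # placeholder
    await asyncio.gather(
        consume_results(left_url, "left"),
        consume_results(right_url, "right"),
    )

asyncio.run(main())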