The key things that make the Telephony Bot Callback API (also occasionally referred to in the documentation as the RTC API, where RTC stands for Real-Time Communications) different from a normal Web API for Speech-to-Text are:
- Establishment and teardown of the audio session are performed separately from the API requests that interact with the session.
- The audio session is generally long-lived and spans multiple API interactions with the audio.
- The actual commands for actions to be performed upon the audio session are issued in HTTP responses (not in requests) - hence the name Callback API.
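To make the last point concrete, here is a minimal sketch of the two payloads involved in one exchange. The field names (`sessionId`, `say`, etc.) are hypothetical illustrations of the shape, not the exact Voicegain schema - see the RTC API documentation for the real payloads.

```python
import json

# 1. Voicegain -> application: when a call starts, the platform POSTs to the
#    application's webhook. (All field names here are hypothetical.)
incoming_request = {
    "sessionId": "sess-0001",   # session id assigned by the platform
    "ani": "+15551230001",      # caller's phone number
    "dnis": "+15551239999",     # phone number that was called
}

# 2. Application -> Voicegain: the command travels back in the body of the
#    HTTP *response*, which is what makes this a Callback API.
callback_response = {
    "say": {"text": "Welcome"}  # ask the platform to speak a TTS prompt
}

print(json.dumps(callback_response))
```

The application never calls Voicegain directly here; it only answers the webhook request with instructions.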
The whole idea is best illustrated with an example sequence diagram showing a simple telephone call (let's assume that this is a simple call-in telephone survey):
Notes about the diagram:
- The Voicegain platform is capable of handling telephone calls. Currently, calls can only be made to phone numbers purchased from Voicegain (see more here).
- As soon as the telephone call is answered, the Voicegain Platform creates a new RTC session and a POST request is made to the webhook URL associated with the phone number that was called (see more about associating phone numbers with webhooks here). The payload of the request includes the Voicegain session id, the caller's phone number (ANI), and the phone number that was called (DNIS).
- The Application logic (separate from Voicegain) decides what to do at the start of a call. In this case it decides to play a "Welcome" prompt to the caller. In the response to the POST request it sends JSON with instructions to output the "Welcome" prompt using TTS. (For more about the available actions see the RTC API Documentation.)
- The Voicegain Platform processes the request and plays the requested prompt using TTS over the phone to the caller. Once playing is done, it reports that to the webhook using a PUT request.
- The Application logic next decides to ask the question "Are you happy?", expecting a yes/no response from the caller. In order to recognize the spoken response, it specifies that the built-in YES/NO grammar is to be used. It also specifies that the recognized semantic value of the response is to be attached to a variable "happy". (For more about available grammars see here.)
- The Voicegain Platform loads the YES/NO grammar, starts recognition, and then starts playing the "Are you happy?" prompt. Recognition is started at the same time the prompt starts playing because this allows callers to barge in with their response.
- The caller's spoken response is matched against the grammar, and based on the match the utterance and the corresponding semantic tag are assigned. The recognition values are passed to the webhook using another PUT request.
- The Application logic notes the response and then, in its HTTP response, instructs the Voicegain platform to disconnect the call after saying "Goodbye".
- The Voicegain platform plays "Goodbye" over the phone using TTS and then disconnects (hangs up) the call.
- The final request to the webhook is a DELETE, which informs the Application logic that the "Goodbye" prompt has finished playing and the call was terminated.
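The application-side logic of the call flow above can be sketched as a small webhook handler. This is a minimal illustration only: the action and field names ("say", "question", "disconnect", "variables") are hypothetical stand-ins for the real RTC API schema, and the routing on the HTTP method (POST = session start, PUT = session event, DELETE = session end) follows the sequence described above.

```python
# Minimal sketch of the application-side webhook logic for the call-in
# survey described above. All action and field names are hypothetical;
# consult the RTC API documentation for the actual schema.

def handle_webhook(method: str, payload: dict) -> dict:
    """Return the actions the platform should perform next on this session."""
    if method == "POST":
        # New RTC session: greet the caller, then ask the yes/no question
        # using the built-in YES/NO grammar, binding the result to "happy".
        return {
            "actions": [
                {"say": {"text": "Welcome"}},
                {
                    "question": {
                        "say": {"text": "Are you happy?"},
                        "grammar": "YES/NO",   # built-in grammar
                        "variable": "happy",   # attach semantic value here
                    }
                },
            ]
        }
    if method == "PUT" and "happy" in payload.get("variables", {}):
        # Recognition result arrived: say goodbye and hang up.
        return {
            "actions": [
                {"say": {"text": "Goodbye"}},
                {"disconnect": {}},
            ]
        }
    if method == "DELETE":
        # Session is over: nothing more to instruct.
        return {}
    # Other PUT events (e.g. "prompt finished playing"): no new instructions.
    return {}
```

For example, `handle_webhook("PUT", {"variables": {"happy": "yes"}})` would instruct the platform to say "Goodbye" and then disconnect the call.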