Using the V-Blaze 5.4.3 REST API

© Copyright 2017 Voci. All Rights Reserved.

The information contained in this document is the proprietary and confidential information of Voci Technologies, Inc. You may not disclose, provide or make available this document, or any information contained in this document, to any third party, without the prior written consent of Voci.

The information in this document is provided for use with V-Blaze. No license, express or implied, to any intellectual property associated with this document or such products is granted by this document.

All Voci Technologies, Inc. products described in this document, including V-Blaze and others prefaced by Voci, are owned by Voci (or those companies that have licensed technology to Voci) and are protected by patents, trade secrets, copyrights or other industrial property rights. The Voci products described in this document may still be in development. The final form of each product and release date thereof is at the sole and absolute discretion of Voci. Your purchase, license and/or use of Voci products shall be subject to Voci's then current sales terms and conditions.

Trademarks

The following terms used in this document are trademarks of Voci Technologies, Inc. in the United States and other countries:

• Voci

• V-Blaze

• V-Discovery

• V-Ferno

• V-Purify

• V-Spark

Other third party disclaimers or notices may be set forth in Voci's online or printed documentation. All other product and service names, and trademarks not owned by Voci are the property of their respective owners.

Table of Contents

1. Overview
    1.1. About this Document
2. Queries
3. Query Responses
    3.1. Languages
    3.2. Models
    3.3. Params
    3.4. Status
    3.5. Sysinfo
4. Transcribing Audio Files
    4.1. Overview
    4.2. Multiple files per request
    4.3. Passing parameters
    4.4. Receiving Zipped Results
    4.5. Receiving Results via Callback
    4.6. Receiving Purified Transcripts and Audio
    4.7. Real-Time Streaming Transcription
A. Transcription Parameter Reference
    A.1. datahdr
    A.2. diarize
    A.3. emotion
    A.4. encoding
    A.5. endian
    A.6. gender
    A.7. model
    A.8. numtrans
    A.9. output
    A.10. punctuate
    A.11. samprate
    A.12. sampwidth
    A.13. scrubaudio
    A.14. scrubmindist
    A.15. scrubsil
    A.16. scrubtext
    A.17. subst_list
    A.18. subst_words
    A.19. uttmaxsilence
    A.20. uttmaxtime
    A.21. vadtype
B. Extended Examples
    B.1. Returning Transcription in an HTTP Response
    B.2. Returning a Complete JSON Transcript
    B.3. Returning Both Scrubbed Text and Audio
    B.4. Posting Decoded Utterances to a Callback Server in Real-time

Chapter 1. Overview

V-Blaze is the most accurate and scalable Speech-to-Text (STT) solution available in the market today. V-Blaze enables you to automatically generate high quality transcripts from 100% of your speech audio assets, whether your call volume is measured in hundreds or millions of hours per month. V-Blaze can be deployed to operate on real-time audio streams or to process batches of audio files.

The V-Blaze REST API uses the HyperText Transfer Protocol (HTTP) for data transfers. This input/output (I/O) methodology adheres to the Representational State Transfer (REST) architecture which forms the basis of the World Wide Web. Every Voci solution includes a REST API to make integration with our products quick and easy in any computer language. A basic level of programming skill is required to make effective use of this API.

Voci’s V-Blaze STT solution provides a REST API that enables both batch (file-based) and real-time (stream-based) operation. The API also supports several types of queries, making it easy to programmatically inspect a V-Blaze for available options and status. By default, port 17171 is used on the V-Blaze for API interactions.

The V-Blaze REST API supports the G.711 suite of audio standards: Uncompressed Pulse Code Modulation (PCM), µ-law, and A-law. Other audio formats must be converted into one of these standards prior to submitting them to the V-Blaze REST API, using a tool such as sox or ffmpeg. These tools are freely available for Linux, Windows, and OSX operating systems.

This guide describes how to use V-Blaze’s REST Application Programming Interface (API) to automate the transcription process.

1.1. About this Document

The following naming and typographic conventions are used throughout this document:

• HTTP indicates that both HTTP and HTTPS are supported

• Boldface in examples indicates template text that must be replaced with a value suitable to your environment

• Underlining is used in examples to call attention to regions most relevant to the concept being illustrated

Chapter 2. Queries

The RESTful API can be used to query V-Blaze for information. The HTTP GET method is used for submitting queries. JSON formatted results are returned in the query response.

V-Blaze API Request                    Results Returned
http://vblaze_name:17171/languages     Available languages
http://vblaze_name:17171/models        Available language models
http://vblaze_name:17171/params        Default parameter settings
http://vblaze_name:17171/status        Transcription status information
http://vblaze_name:17171/sysinfo       Product and Server information
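The queries listed above can also be issued from a program. The following is a minimal Python sketch using only the standard library; the helper names query_url and run_query are illustrative and are not part of the V-Blaze API, and the host name is a placeholder for your own server.

```python
import json
from urllib.request import urlopen

# The five query names documented in this chapter.
QUERY_NAMES = ("languages", "models", "params", "status", "sysinfo")


def query_url(host, name, port=17171, scheme="http"):
    """Build the URL for one of the documented queries (hypothetical helper)."""
    if name not in QUERY_NAMES:
        raise ValueError(f"unknown query: {name}")
    return f"{scheme}://{host}:{port}/{name}"


def run_query(host, name, port=17171, timeout=10):
    """Send an HTTP GET and decode the JSON formatted result."""
    with urlopen(query_url(host, name, port), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, run_query("vblaze_name", "status") would return the status information as a Python dictionary, assuming the server is reachable on the default port.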


Chapter 3. Query Responses

3.1. Languages

URL: http://vblaze_name:17171/languages

Example Response:

{"languages":["eng1","eng2","eng3"]}

Explanation:

The example response indicates that languages supported by vblaze_name are eng1, eng2, and eng3. Respectively, these designations refer to North American English, Australian English, and UK English.

3.2. Models

URL: http://vblaze_name:17171/models

Example Response:

{"models":["eng1:callcenter","eng1:survey","eng2:voicemail"]}

Explanation:

The example response indicates that language models supported by vblaze_name are the general purpose Call Center and Survey models for North American English (eng1), and the general purpose Voicemail model for Australian English (eng2).
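As an illustration of how a client might consume this response, the sketch below groups model names by language. It assumes every entry follows the language:model pattern seen in the example; the function name models_by_language is hypothetical.

```python
import json
from collections import defaultdict


def models_by_language(response_text):
    """Group "language:model" entries from a /models response by language."""
    grouped = defaultdict(list)
    for name in json.loads(response_text)["models"]:
        language, _, model = name.partition(":")
        grouped[language].append(model)
    return dict(grouped)


example = '{"models":["eng1:callcenter","eng1:survey","eng2:voicemail"]}'
print(models_by_language(example))
# {'eng1': ['callcenter', 'survey'], 'eng2': ['voicemail']}
```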

3.3. Params

URL: http://vblaze_name:17171/params

Example Response:

{
  "params": {
    "raw_events": false,
    "bufmaxtime": 30,
    "numtrans": true,
    "activitylevel": 175,
    "vadparams": {},
    "uttminactivity": 500,
    "train_mode": "STREAM",
    "outputdir": "/opt/voci/ramfs",
    "uttmaxsilence": 800,
    "recvtimeout": 1000,
    "uttpadding": 300,
    "languages": [ "eng1" ],
    "models": [ "eng1:callcenter", "eng1:survey", "eng2:voicemail" ],
    "pushconntimeout": 5,
    "punctuate": true,
    "uttmaxtime": 30,
    "oovthreshold": 0.20,
    "scrubmindist": 0.30,
    "queue": "bottom",
    "endian": "LITTLE",
    "punctrailing": 12,
    "model": "eng1:callcenter"
  }
}

Explanation:

The example response is a JSON object that shows the names of audio-independent parameters that can be specified when initiating a transcription session and their default values. Most of these parameters never require modification, but are provided to enable tuning for special circumstances, such as highly time-sensitive real-time applications.

Both audio-independent and audio-dependent parameters are discussed in Chapter 4, Transcribing Audio Files.

3.4. Status

URL: http://vblaze_name:17171/status

Example Response:

{
  "status": {
    "bytesdone": 197356704,
    "speed": 271.58,
    "totalstreams": 7,
    "framesrecv": 1233477,
    "uttsdone": 914,
    "bytesproc": 197356704,
    "idlefor": 12134,
    "dectime": "0:00:45.418590",
    "framesproc": 1233474,
    "bytesrecv": 197356704,
    "framesdec": 549429,
    "speedrecv_Bps": 4345284.70,
    "#done": 7,
    "uttsproc": 914,
    "started": "2016-05-03 14:07:14.994434",
    "lag": 0.03,
    "speeddec": 120.97,
    "speedrecv": 271.58,
    "framesdone": 1233474,
    "#queued": 0,
    "#active": 0,
    "totalidle": 271040,
    "lastactive": "2016-05-06 14:05:25.843599"
  }
}

Explanation:

The example response is a JSON object providing detailed status of the Voci STT engine.

3.5. Sysinfo

URL: http://vblaze_name:17171/sysinfo

Example Response:

{
  "sysinfo": {
    "product": "V-Blaze XL 800 : 50",
    "revdate": "2015-10-16 15:44:01",
    "uptime": " 17:37:08 up 32 days, 7:12, 0 users, load average: 0.09, 0.04, 0.01\n",
    "uname": "Linux vocig8 2.6.32-573.18.1.el6.x86_64 #1 SMP Tue Feb 9 22:46:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux\n",
    "version": "5.2.1-1",
    "dbsize": "12M\t/opt/voci/ramfs\n"
  }
}

Explanation:

The example response is a JSON object providing detailed information about the V-Blaze server.


Chapter 4. Transcribing Audio Files

4.1. Overview

The V-Blaze REST API enables client-side automation of audio file submission and receipt of completed transcripts.

A transcription session is initiated by issuing an HTTP POST to the appropriate URL. A properly formed URL example is shown below:

http://blaze_name:17171/transcribe

If transmission encryption is not required, you can use HTTP rather than HTTPS. You will need to replace blaze_name with the name of your V-Blaze server. By default the REST API service monitors port 17171, but this can be changed if necessary.

The POST must be encoded as a multipart/form-data request, with the audio to be transcribed specified in the file field, which must be the last field specified in the request. The transcript is returned in the HTTP response by default. Optionally, transcripts can be returned via HTTP POST; this method is known as a “callback” and is described in later sections.

Below is a sample HTTP packet:

POST /transcribe HTTP/1.1
User-Agent: curl/7.35.0
Host: server:17171
Accept: */*
Content-Length: 104361
Expect: 100-continue
Content-Type: multipart/form-data; boundary=------------------------c4a1be8892bcdcae

--------------------------c4a1be8892bcdcae
Content-Disposition: form-data; name="file"; filename="sample1.wav"
Content-Type: application/octet-stream

<binary audio data omitted>

All examples of transcription request submissions shown in this document use curl (client URL), a command line utility that sends HTTP requests. The curl application is freely available for Linux, Windows, and OSX operating systems. As an example, the sample HTTP packet shown previously can be generated using curl in the following way:

curl -F file=@/path/to/sample1.wav \
     -X POST http://blaze_name:17171/transcribe

All major programming languages provide support for generating multipart HTTP requests; practically any programming language can be used to communicate with V-Blaze via the REST API.
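To illustrate the point about other languages, here is a minimal Python sketch that builds a multipart/form-data request with the standard library only. The helper names build_multipart and transcribe are hypothetical; the only details taken from this guide are the /transcribe URL, the default port, the field name file, and the requirement that file fields come last. The boundary is a random hex string, which is assumed not to occur in the audio data.

```python
import uuid
from urllib.request import Request, urlopen


def build_multipart(params, files):
    """Return (body, content_type) with parameters first and files last.

    params: dict of parameter name -> value
    files:  list of (filename, audio_bytes) tuples
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in params.items():
        header = (f"--{boundary}\r\n"
                  f'Content-Disposition: form-data; name="{name}"\r\n\r\n')
        parts.append(header.encode("utf-8") + str(value).encode("utf-8") + b"\r\n")
    for filename, data in files:
        header = (f"--{boundary}\r\n"
                  f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
                  "Content-Type: application/octet-stream\r\n\r\n")
        parts.append(header.encode("utf-8") + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(host, params, files, port=17171, timeout=300):
    """POST the request to /transcribe and return the raw response body."""
    body, content_type = build_multipart(params, files)
    request = Request(f"http://{host}:{port}/transcribe", data=body,
                      headers={"Content-Type": content_type}, method="POST")
    with urlopen(request, timeout=timeout) as response:
        return response.read()
```

Because build_multipart always writes the parameter fields before the file fields, a request built this way keeps the audio as the last field regardless of the order in which the caller assembles its arguments.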

4.2. Multiple files per request

Multiple audio files can be sent in the same request simply by providing multiple file fields. The following will decode three sample WAV files with a single request:

curl -F file=@sample1.wav \
     -F file=@sample2.wav \
     -F file=@sample3.wav \
     -X POST http://blaze_name:17171/transcribe

4.3. Passing parameters

Voci’s transcription service is highly configurable, enabling you to control details of how each audio file or stream is transcribed. In most cases V-Blaze default behavior will provide best results. The parameters described in Appendix A, Transcription Parameter Reference enable you to modify default behavior when needed.

Parameter specifications must appear in the post before the audio file(s) to be transcribed. Providing a parameter setting (such as “emotion=true”) after a file specification results in the following error:

curl -F file=@sample1.wav \
     -F emotion=true \
     -X POST http://blaze_name:17171/transcribe -i

HTTP/1.1 400 Bad Request
Content-Type: text/plain
Content-Length: 23

Last field must be file

4.4. Receiving Zipped Results

Set “zip=true” in order to receive results as a ZIP file. For example, the following command will download a ZIP file containing “sample1.json” and “sample2.json”. Note that the “-o results.zip” curl specification is required to save the ZIP data stream to a file. If you omit this specification, the ZIP binary data will be printed to the console.

curl -F zip=true \
     -F file=@sample1.wav \
     -F file=@sample2.wav \
     -X POST http://blaze_name:17171/transcribe \
     -o results.zip

4.5. Receiving Results via Callback

Results can be HTTP POSTed to a callback server URL instead of being returned as part of the response to the original transcription request.

Set the “callback” parameter to instruct the API to send results to the provided URL. For example, the following command will transcribe “sample1.wav” and POST the complete transcript to “receiver:5555/results”:

curl -F callback=http://receiver:5555/results \
     -F file=@sample1.wav \
     -X POST http://blaze_name:17171/transcribe
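The receiving side of a callback can be any HTTP server that accepts POST requests. The following is a minimal Python sketch of such a receiver, using only the standard library. This guide does not specify the exact layout of the POSTed body at this point, so the sketch simply stores the raw bytes together with the request path; the names CallbackHandler, make_server, and received are illustrative.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# (path, raw body) tuples in arrival order; a real receiver would parse
# and persist these instead of keeping them in memory.
received = []


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        received.append((self.path, self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # suppress per-request console logging


def make_server(port=5555):
    """Create a callback server listening on all interfaces."""
    return HTTPServer(("", port), CallbackHandler)
```

Calling make_server().serve_forever() would start a receiver on port 5555, matching the callback URL used in the curl example above.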


4.6. Receiving Purified Transcripts and Audio

Voci’s Purify option can remove sensitive numeric information from both transcripts and audio. When transcript Purification is activated by setting the “scrubtext” parameter to “true”, all instances of sensitive numeric digits will be replaced by the hash sign (#).

When audio Purification is activated by setting the “scrubaudio” parameter to “true”, all audio segments containing sensitive numbers are replaced by silence. When using audio Purification, results are automatically returned as a ZIP archive containing transcripts and Purified WAV files.

For example, to submit the file “sample1.wav” for transcription with both transcript and audio Purification activated, use a command like the following:

curl -F scrubtext=true \
     -F scrubaudio=true \
     -F file=@sample1.wav \
     -X POST http://blaze_name:17171/transcribe \
     -o result.zip

The response to this POST will be a ZIP file that contains both “sample1.json” and a Purified version of “sample1.wav”.
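A client will usually want to separate the transcripts from the Purified audio once the ZIP archive has been received. The sketch below shows one way to do this in Python with the standard zipfile module. It assumes transcripts can be recognized by a “.json” extension, as in the example file names above; the function name split_results is hypothetical.

```python
import io
import zipfile


def split_results(zip_bytes):
    """Split a results archive into ({name: transcript}, {name: audio})."""
    transcripts, audio = {}, {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # Assumption: JSON members are transcripts, everything else is audio.
            target = transcripts if name.lower().endswith(".json") else audio
            target[name] = archive.read(name)
    return transcripts, audio
```

For a file saved by curl, the archive bytes can be obtained with open("result.zip", "rb").read().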

4.7. Real-Time Streaming Transcription

The REST API supports real-time streaming transcription for use-cases such as in-call monitoring and alerting, which enable a supervisor to intervene in an active call. When operating in real-time mode, V-Blaze will return a transcript of each utterance as soon as the utterance has been transcribed. An utterance is a region of speech audio that ends with a period of silence that exceeds a threshold duration (default 800 ms), or that exceeds the maximum utterance duration threshold (default 30 seconds).

V-Blaze can be configured to transcribe live streaming audio at a rate of between 1X (i.e. real-time) and 5X (5 times faster than real-time). In most cases 1X is sufficient; however, higher speeds are offered for aggressive use-cases where milliseconds count. Note that delivering 5X real-time requires 5X more hardware than 1X, all other things being equal.

Utterance transcripts are HTTP POSTed to a client-side callback server. This works the same way as the callback mechanism previously discussed, except that rather than the entire transcript being POSTed to the callback server, the transcript of each utterance is POSTed as soon as it is ready.

The three phases of the transcription of an utterance are provided below to illustrate the precise timing of real-time V-Blaze transcription:

1. V-Blaze receives audio data packets as fast as the sender can provide them. For a live 2-channel telephone call being sampled at 8 kHz and encoded as PCM with a 2-byte sample size, each V-Blaze stream will receive (8000 Hz * 2 Bytes * 2 Channels) = 32,000 bytes per second. V-Blaze will buffer this audio data until it detects a sufficiently long silence or until the maximum utterance duration has been exceeded. For example, for an utterance of duration 15 seconds, V-Blaze will spend 15 seconds buffering audio.

2. Once V-Blaze has buffered a complete utterance, it will transcribe the utterance. If V-Blaze has been configured to transcribe at 1X, it can take up to 15 seconds to complete the transcription process. If it has been configured to transcribe at 5X, it can take up to 15/5 = 3 seconds.

3. As soon as the utterance transcription process has completed, the transcript is POSTed to the utterance callback server.
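The arithmetic in the three phases above can be sketched as follows. This is a simplified model: it adds the buffering time to the maximum transcription time and ignores network transfer and the callback POST itself. The function names are illustrative.

```python
def bytes_per_second(sample_rate_hz, sample_width_bytes, channels):
    """Data rate of an uncompressed PCM stream."""
    return sample_rate_hz * sample_width_bytes * channels


def worst_case_delay(utterance_seconds, speed_factor):
    """Seconds from the start of an utterance until its transcript is ready.

    Phase 1 buffers the whole utterance; phase 2 takes at most
    utterance_seconds / speed_factor (1 for 1X, 5 for 5X).
    """
    return utterance_seconds + utterance_seconds / speed_factor


print(bytes_per_second(8000, 2, 2))  # 32000
print(worst_case_delay(15, 1))       # 30.0
print(worst_case_delay(15, 5))       # 18.0
```

Under this model, moving from 1X to 5X shortens only the second phase; the 15 seconds spent buffering the utterance are unaffected by the transcription speed.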


For example, suppose a server (the “sender”) is configured to broadcast a telephone call on port 5555, using the WAV container format and a supported audio encoding method such as PCM. Likewise, a server (the “receiver”) is configured to receive utterance transcript data on port 5556. Note that sender and receiver can be running on the same machine, and can even be different threads of the same program, or they can be two entirely different, geographically distributed systems. The following request will initiate real-time transcription:

curl -F utterance_callback=http://receiver:5556/utterance \
     -F datahdr=WAVE \
     -F socket=sender:5555 \
     -X POST http://blaze_name:17171/transcribe

It is often the case that real-time streaming audio will not include a WAV header. This raw encoded audio is supported by explicitly providing the information normally provided by the header. This includes at a minimum the sample rate, sample width, and encoding. The byte endianness can also be specified; however, the default of “LITTLE” is usually correct, as it is rare to encounter big endian byte ordering in practice. An example is provided below:

curl -F utterance_callback=http://receiver:5556/utterance \
     -F socket=sender:5555 \
     -F samprate=8000 \
     -F sampwidth=2 \
     -F encoding=spcm \
     -F endian=LITTLE \
     -X POST http://blaze_name:17171/transcribe


Appendix A. Transcription Parameter Reference

This section describes all of the parameters that can be used in V-Blaze API calls. Parameters are specified using the form "Name=Value". Note that these settings are valid for a single HTTP request only. Settings for one request will not affect the transcription performed for any other requests.

A.1. datahdr

Values: WAVE, RAW

Description:

Set datahdr to “WAVE” when audio contains a RIFF header that specifies audio sampling rate, sampling width, encoding, and byte endianness. Files ending in the “.wav” extension typically possess such a header, although this is not guaranteed.

For headerless audio, set datahdr to “RAW”. Live streaming audio is often headerless. While it is unusual for files to be headerless, it is possible. Such files are often given the extension “.raw”, but this is purely convention. Use a tool such as “mediainfo” to ascertain details of your audio format.
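Because the file extension is not a reliable indicator, a client can inspect the first bytes of the audio before choosing a datahdr value. The sketch below checks for the standard RIFF/WAVE signature and, for PCM files, reads the header fields with Python's wave module. The function names are hypothetical, and note that the wave module may reject WAV files that are not PCM encoded (for example µ-law or A-law).

```python
import io
import wave


def guess_datahdr(data):
    """Return "WAVE" if the bytes start with a RIFF/WAVE header, else "RAW"."""
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "WAVE"
    return "RAW"


def describe_wav(data):
    """Read sampling rate, sample width and channel count from a PCM WAV."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "samprate": wav.getframerate(),
            "sampwidth": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }
```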

A.2. diarize

Values: True, False, noise

Description:

Diarization is the process of recognizing distinct speakers on a single (i.e. mono) audio channel and segmenting their speech into separate channels. Voci’s diarization capability is designed to do this for two speakers, typically a call agent engaged in a conversation with a client over the phone.

Diarize is set to False by default. You should only set it to True under the following conditions:

• You know that your audio only contains a single audio channel

• You know that 2 people are talking on the channel

• Segregation of 2 speakers in the transcripts is important for your use-case

The noise setting is typically not needed; however, if you are experiencing excessive diarization errors due to interference from music or other non-speech sources, you can apply noise reduction by setting diarize=noise.

Diarization is a licensed optional feature of V-Blaze.

Note

When the diarize and scrubtext parameters are used together, redaction accuracy is somewhat reduced. For maximum redaction accuracy do not activate diarization when scrubtext=true.

A.3. emotion

Values: True, False

Description:

Voci’s emotion detection feature uses a synthesis of acoustic features and word sentiment scores to determine if a given utterance is Positive, Negative, or Neutral. By default emotion is set to True. Setting emotion=False disables this feature.

Emotion detection is a licensed optional feature of V-Blaze.

A.4. encoding

Values: SPCM, UPCM, ULAW, ALAW

Description:

Specifies the algorithm used to encode the audio.

• SPCM: Signed Pulse Code Modulation

• UPCM: Unsigned Pulse Code Modulation

• ULAW: µ-law

• ALAW: A-law

If your audio was not encoded with one of these algorithms, it will need to be converted to one of these before you can submit it for transcription.

Encoding must be supplied when datahdr is set to “RAW”.

A.5. endian

Values: LITTLE, BIG

Description:

Specifies the byte ordering of audio samples. In a BIG endian data word the most significant byte comes first, when reading from left to right. In a LITTLE endian data word, the least significant byte comes first. By convention, LITTLE endian (the default) is the most common.

This parameter is not required unless your audio uses BIG endian byte ordering and datahdr is set to “RAW”.
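The difference between the two byte orderings can be seen by packing a single 16-bit signed sample both ways. This Python sketch uses the standard struct module; the helper swap_byte_order is illustrative and simply reverses the bytes of each sample, which is one way to convert raw BIG endian audio to LITTLE endian before submission.

```python
import struct

sample = 1000  # 0x03E8 as a 16-bit signed integer

little = struct.pack("<h", sample)  # least significant byte first
big = struct.pack(">h", sample)     # most significant byte first
print(little.hex())  # e803
print(big.hex())     # 03e8


def swap_byte_order(data, sampwidth):
    """Reverse the byte order of every sample in raw audio data."""
    if len(data) % sampwidth:
        raise ValueError("data length is not a multiple of the sample width")
    return b"".join(data[i:i + sampwidth][::-1]
                    for i in range(0, len(data), sampwidth))
```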

A.6. gender

Values: True, False

Description:

Set gender=True to activate gender detection. For each utterance Voci will attempt to determine if the speaker is male or female. By default this feature is disabled.



Gender Identification is a licensed optional feature of V-Blaze.

A.7. model

Values: installation-dependent

Description:

The model parameter enables you to specify the language model(s) that you want to use for transcription. The value that you specify for this parameter can be:

• a single language model, which is then used to transcribe all channels

• a comma-separated list of models in channel order. For example, if the client is on channel 0 and the agent is on channel 1, you could use different models for each channel by setting the model parameter to something like model=eng1:client,eng1:agent. This setting would use the eng1:client language model when transcribing channel 0 and the eng1:agent language model when transcribing channel 1. No spaces are permitted in the value that is specified for the model parameter.

When specifying multiple models, spaces cannot be present in the model=value specification, even if the model list is surrounded by quotation marks.
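A client that assembles the model value from per-channel settings can enforce the no-spaces rule before sending the request. The following Python sketch shows one way to do so; the function name model_param is hypothetical, and rejecting commas inside a single model name is an added assumption made so that the channel order stays unambiguous.

```python
def model_param(models_by_channel):
    """Join per-channel model names into a model=... value.

    models_by_channel: list of model names, index 0 for channel 0, etc.
    """
    if not models_by_channel:
        raise ValueError("at least one model is required")
    for name in models_by_channel:
        if not name or "," in name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid model name: {name!r}")
    return ",".join(models_by_channel)


print(model_param(["eng1:client", "eng1:agent"]))  # eng1:client,eng1:agent
```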

If you do not specify a value for the model parameter, the first available model on your V-Blaze installation is used. To determine which model that is, you can use the V-Blaze /models API call, as in the following example:

$ curl http://example:17171/models
{"models":["eng1:callcenter","eng1:voicemail","eng1:survey"]}

In this example, if you did not specify a model when transcribing audio on the V-Blaze host example, the language model eng1:callcenter would be used, which is a North American English (eng1) Call Center model.

Voci works with customers to ensure that their deployment delivers the best results possible, installing the language models that are associated with the types of audio that each customer is transcribing. You can retrieve a list of available language models in a V-Blaze installation by calling the /models API, as shown earlier in this section and in Chapter 2, Queries.

A.8. numtrans

Values: True, False

Description:

Controls whether or not number words in transcript text are converted into numeric digits and related conventional formats, including dollar amounts, wall-clock times, percentages, ordinals, and telephone numbers. For example, with numtrans set to True (the default), the words “forty two percent” would be transformed into the text “42%”.

In most cases it is desirable to leave numtrans turned on; however, there are special cases where it should be turned off. For example, if you are evaluating the Word Error Rate (WER) of Voci’s transcriptions, numtrans must be disabled.

WER measurements are only valid against verbatim text because there is not a 1:1 mapping between words spoken and conventional representations. For example, both word sets “four nine zero” and “four hundred and ninety” will map to the numeric representation “490”.
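The effect on WER can be demonstrated with a small calculation. The sketch below uses the common definition of WER as the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words; this guide does not state which scoring tool Voci uses, so treat the function as an illustration rather than the official metric.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("four nine zero", "four nine zero"))  # 0.0
print(word_error_rate("four nine zero", "490"))             # 1.0
```

In the second call the speech was recognized correctly, yet the converted text “490” scores as a 100% error against the verbatim reference, which is why numtrans should be disabled for WER evaluation.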



A.9. output

Values: json, text

Description:

Specifies whether the transcript will be formatted as JSON or flat text. Voci’s JSON format contains a wealth of information in addition to the text of the transcript, including confidence scores for each word, utterance, and channel, the start and stop time of each word and utterance, customer metadata (if provided), and Voci-generated metadata (if activated and licensed) including gender and emotion.

The text format only contains the words transcribed. For certain niche use-cases only the words are required; however, in most cases a transcript complete with all forms of metadata is preferable. Text mode can be useful during integration testing because it is easy to read directly.

The default output mode is JSON.

A.10. punctuate

Values: True, False

Description:

Controls whether transcript text is punctuated or not. In most cases it is desirable to leave punctuation turned on (the default); however, there are special cases where it should be turned off. For example, if you are evaluating the Word Error Rate (WER) of Voci’s transcriptions, punctuation must be disabled.

A.11. samprate

Values: Frequency (integer)

Description:

Specifies the sampling rate of the audio to be transcribed. Telephone audio is typically sampled at 8000 Hz. For best results, ensure the sampling rate is a multiple of 8000 (e.g., 8000, 16000, 24000, etc.). Values less than 8000 are not supported.

The sampling rate must be supplied when datahdr is set to “RAW”.

A.12. sampwidth

Values: Bytes (integer)

Description:

Specifies the size of each digitized audio sample. This parameter is only applicable if the “encoding” parameter is set to SPCM or UPCM.

The sample width must be supplied when datahdr is set to “RAW” and the encoding is either SPCM or UPCM.
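The rules for headerless audio stated in the encoding, samprate, and sampwidth sections can be combined into a simple client-side check. The Python sketch below reports which required parameters are missing from a request; the function name missing_raw_params is hypothetical, and the check reflects only the requirements stated in this appendix, not any additional validation the server may perform.

```python
def missing_raw_params(params):
    """Return the required parameters absent from a RAW audio request.

    params: dict of parameter name -> value, as it would be sent to the API.
    """
    if str(params.get("datahdr", "")).upper() != "RAW":
        return []  # header-carrying audio describes itself
    required = ["encoding", "samprate"]
    # sampwidth is only required for the two PCM encodings.
    if str(params.get("encoding", "")).upper() in ("SPCM", "UPCM"):
        required.append("sampwidth")
    return [name for name in required if name not in params]


print(missing_raw_params({"datahdr": "RAW", "encoding": "spcm", "samprate": 8000}))
# ['sampwidth']
```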

A.13. scrubaudio

Values: True, False

Description:

Set scrubaudio=True to apply Voci’s Purify feature to audio. When activated, V-Blaze will return audio along with the transcript, using the same method being used to return the transcript. Sections of audio where sensitive numbers are spoken are replaced with silence. Redacted audio is not preserved in any way. See “scrubtext” for more information on what constitutes a sensitive number.

The default value for scrubaudio is False. Purify is a licensed optional feature of V-Blaze.

A.14. scrubmindist

Values: Seconds (float)

Description:

Specifies the number of seconds of audio to remove between numbers, when scrubaudio is set to True. Defaults to 0.3. This is a fine-tuning parameter that rarely needs adjustment.

A.15. scrubsil

Values: True, False

Description:

Set scrubsil=True to make Purify remove audio from regions where speech was not detected. Thisparameter only applies when scrubaudio is set to True. This setting increases security by blanking out audiothat could contain sensitive information at a very low signal level, below Voci’s recognition threshold.

This helps to guard against the situation where a low amplitude copy of a speech signal from one audiochannel “bleeds over” onto another audio channel, as can happen with faulty headsets or due to errors inrecording configuration or bad wiring. This can also happen with open-mic (far-field) acoustic recordings,where a background conversation can be heard, but at such a low level as to not trigger transcription.

The default value for scrubsil is False.

A.16. scrubtext

Values: True, False

Description:

Set scrubtext=True to apply Voci’s Purify feature to transcript text. Purify recognizes and removes sensitive numeric values from V-Blaze results. By default, a sensitive number is any string of numeric digits (as produced when numtrans=True) that does not fit a “white listed” pattern defined in /opt/voci/state/scrub.conf on the V-Blaze server.

Patterns that are white listed by default are as follows:

• Ordinals (1st, 2nd, etc.)

• Percentages

• Clock times (12:57 PM, etc.)

• Monetary amounts

• Floating point numbers with 4 or fewer digits (12.47 GB, etc.)

The default value for scrubtext is False. Purify is a licensed optional feature of V-Blaze.

Note

The scrubtext and diarize options should rarely be used together. Combined, they typically produce a higher rate of missed redactions than normally expected, and are less accurate than Purify applied to a mono-only decoding.
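The dependencies among the Purify parameters described in A.13 through A.16 can be captured in a small client-side check. This is a minimal sketch under stated assumptions: `purify_fields` is a hypothetical helper, and the `diarize` argument is only a flag telling the helper that the caller plans to enable diarization; it does not set that option.

```python
import warnings


def purify_fields(scrubtext=False, scrubaudio=False, scrubsil=False,
                  scrubmindist=None, diarize=False):
    """Assemble Purify-related form fields (hypothetical helper)."""
    fields = {}
    if scrubtext:
        fields["scrubtext"] = "True"
        if diarize:
            # The note on scrubtext advises against this combination.
            warnings.warn("scrubtext with diarize tends to miss more redactions")
    if scrubaudio:
        fields["scrubaudio"] = "True"
        if scrubsil:
            fields["scrubsil"] = "True"
        if scrubmindist is not None:
            fields["scrubmindist"] = str(scrubmindist)
    elif scrubsil or scrubmindist is not None:
        # scrubsil and scrubmindist only apply when scrubaudio is True.
        raise ValueError("scrubsil/scrubmindist require scrubaudio=True")
    return fields


print(purify_fields(scrubtext=True, scrubaudio=True, scrubsil=True))
```

Raising an error for scrubsil without scrubaudio is a design choice of this sketch; the server's behavior in that case is not described here.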

A.17. subst_list

Values: Filename

Description:

Set subst_list to the name of a file that contains a substitution list to be applied in addition to those defined for the selected Language and Language Model. Normally, substitutions from the Language-level substitution list are applied first, followed by those from the Language Model-level substitution list.

When subst_list is set to a valid filename, the specified substitution list is applied last. This makes it possible to apply a different set of substitutions on a data-driven basis. For example, metadata might indicate that one call is from an insurance provider while another is from a telecom provider. You can use this metadata to decide which domain-specific substitution list to apply for best results.

Substitution lists that can be specified using the subst_list tag must be placed in /opt/voci/state/substitutions/ on each V-Blaze server.

A.18. subst_words

Values: True, False

Description:

Controls whether or not substitution lists are used. When set to True (the default), substitution lists defined for the selected Language and Language Model are applied during transcription.

Substitution lists are files that contain lists of “old_value : new_value” mappings that are useful for “field tuning”, to correct consistent and frequent transcription errors that result from out-of-vocabulary words, excess noise in the audio, poor enunciation, strong accents, or word combinations that rarely occur in general speech but occur frequently within a specific domain or company.

For example, useful substitutions in the insurance domain include “giant whale : giant hail” and “hit a beer : hit a deer”.

Typically subst_words would only be disabled for testing purposes, to ensure only the pure output of the selected Language Model is represented in returned transcripts.
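To illustrate the effect of “old_value : new_value” mappings, the following sketch applies the two insurance-domain examples to a piece of text. It only mimics the idea on the client side. The function names are hypothetical, and the matching rules used here (case-sensitive, whole words, applied in list order) are assumptions rather than a description of how the V-Blaze engine applies its lists.

```python
import re


def parse_substitutions(lines):
    """Parse 'old_value : new_value' lines into (old, new) pairs."""
    pairs = []
    for line in lines:
        if " : " not in line:
            continue  # skip blanks or anything not in the mapping form
        old, new = line.split(" : ", 1)
        pairs.append((old.strip(), new.strip()))
    return pairs


def apply_substitutions(text, pairs):
    """Replace each old phrase with its new phrase, on word boundaries."""
    for old, new in pairs:
        pattern = r"\b" + re.escape(old) + r"\b"
        # A function replacement avoids backslash handling in 'new'.
        text = re.sub(pattern, lambda _match, new=new: new, text)
    return text


pairs = parse_substitutions(["giant whale : giant hail",
                             "hit a beer : hit a deer"])
print(apply_substitutions("we had giant whale and then I hit a beer", pairs))
```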

A.19. uttmaxsilence

Values: Milliseconds (integer)

Description:

Specifies the maximum amount of silence that can occur between speech sounds without terminating the current utterance. Once a silence occurs that exceeds “uttmaxsilence”, an utterance “cut” is made within the detected silent region.

The default uttmaxsilence is 800 milliseconds. This setting will not need to be modified except in unusually aggressive real-time deployments. In most cases, shortening uttmaxsilence to less than 650 milliseconds will compromise accuracy, with accuracy getting worse as uttmaxsilence is reduced towards its minimum setting of 100 milliseconds.

Accuracy is reduced because the shorter threshold for splitting audio regions into utterances will result in shorter utterances on average. Shorter utterances mean less contextual information available for error reduction.

A.20. uttmaxtime

Values: Seconds (integer)

Description:

Specifies the maximum amount of time allotted for a spoken utterance. Normally an utterance is terminated by a period of silence that exceeds “uttmaxsilence”; however, if no such period of silence is encountered prior to reaching the “uttmaxtime”, the utterance will be terminated.

The default uttmaxtime is 30 seconds. Human utterances are typically 5-20 seconds long. This setting rarely requires modification. Examples of use-cases that can benefit from adjustment of this parameter include transcribing monologues or speeches with unusually long unbroken utterances, and real-time deployments with aggressive turn-around time requirements.

In most cases, shortening the uttmaxtime to be less than 20 seconds will compromise accuracy, getting worse as uttmaxtime is reduced towards its minimum setting of 1 second.

Accuracy is reduced when the Voci engine is forced to terminate an utterance at the uttmaxtime boundary. Such “cuts” take place while a word is being spoken. This means that a portion of one word will be in the first utterance, while the remainder of the word is located in the second. With few exceptions, word fragments do not sound like the original word, resulting in erroneous transcription. In addition, shorter utterances contain less context, further reducing achievable accuracy.
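The interaction between uttmaxsilence and uttmaxtime can be pictured with a toy model. The sketch below is not the Voci engine's algorithm. It assumes speech has already been detected as a sorted list of (start, end) intervals in seconds, ends an utterance when the gap to the next interval exceeds uttmaxsilence, and forces a cut whenever an utterance would run longer than uttmaxtime. The function name `segment_utterances` is hypothetical.

```python
def segment_utterances(speech, uttmaxsilence=800, uttmaxtime=30):
    """Group speech intervals (seconds) into utterances.

    Toy model only: uttmaxsilence is in milliseconds and
    uttmaxtime in seconds, matching the parameter units above.
    """
    max_gap = uttmaxsilence / 1000.0
    utterances = []
    cur_start = cur_end = None
    for start, end in speech:
        # A silence longer than uttmaxsilence closes the utterance.
        if cur_start is not None and start - cur_end > max_gap:
            utterances.append((cur_start, cur_end))
            cur_start = None
        if cur_start is None:
            cur_start = start
        # No qualifying silence was found in time: force a cut.
        while end - cur_start > uttmaxtime:
            cut = cur_start + uttmaxtime
            utterances.append((cur_start, cut))
            cur_start = cut
        cur_end = end
    if cur_start is not None:
        utterances.append((cur_start, cur_end))
    return utterances


speech = [(0.0, 2.0), (2.5, 4.0), (5.0, 6.0)]
print(segment_utterances(speech))                     # default thresholds
print(segment_utterances(speech, uttmaxsilence=400))  # more, shorter utterances
print(segment_utterances([(0.0, 70.0)]))              # forced cuts mid-speech
```

In this model, lowering uttmaxsilence turns the first two intervals into separate utterances, and a 70-second unbroken stretch of speech is split at 30 and 60 seconds regardless of what is being said at those points.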

A.21. vadtype

Values: energy, level

Description:

The two types of Voice Activity Detection (VAD) available during Voci transcription are “energy” and “level”. The “energy” setting instructs the engine to use the amount of energy in the audio signal to determine if speech might be present. This is the best setting to use when transcribing audio files (i.e., post-call, or “batch”, transcription).

The “level” setting instructs the engine to use the simple amplitude level of the audio signal for VAD. This is the best setting to use when transcribing live audio streams (i.e., in-call, or real-time, transcription) because it operates instantaneously, without the need for buffering.

The default vadtype setting is “energy”.


Appendix B. Extended Examples

In the following examples the hostname of the V-Blaze server is “vocig1”. All results are actual transcriptions produced by Version 5.x of Voci's V-Blaze product, except where text has been redacted to remove identifying information. Redacted text has been replaced with <redacted>.

B.1. Returning Transcription in an HTTP Response

Description:

Transcribe a short audio file and return text in the HTTP response. The response is not pure text, but is wrapped in a light-weight JSON wrapper. The first field in the response (source) is the name of the audio file. The second field (utterances) is a list of text strings, where each string is from a separate utterance.

Command:

curl -F output=text \
     -F model=eng1:callcenter \
     -F file=@sample7.wav \
     -X POST http://vocig1:17171/transcribe

Result:

[
  {"source":"sample7.wav",
   "utterances":[
     "Thank you for calling <redacted> technical support. I understand you need to report a gas leak and I have your name please",
     "my name is John. Thank you, Mr. Doe. What is your address or account number",
     "my address is <redacted> is there. Anyone inside the house?",
     "No everyone is out of the house. I noticed a strange smell when I got home and I had called you I am sending a gas technician to your home to fix the problem. Could you give me a good number to reach you at.",
     "You can call <redacted>.",
     "Thank you and please be safe and wait for the technician to arrive call us back if anything changes.",
     "Thank you, bye. Good bye and thank you for calling <redacted>."
   ]
  }
]
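A client typically needs to unwrap this JSON wrapper to get at the plain text. The following is a minimal sketch, assuming the response body has the shape shown in the result above (a list of objects with “source” and “utterances” fields). The function name and the shortened sample body are illustrative only.

```python
import json


def utterances_by_source(body):
    """Map each source file name to its list of utterance strings."""
    return {entry["source"]: entry["utterances"] for entry in json.loads(body)}


# Shortened stand-in for a real response body.
body = """
[ {"source": "sample7.wav",
   "utterances": ["You can call <redacted>.",
                  "Thank you, bye."] } ]
"""

for source, utterances in utterances_by_source(body).items():
    print(source)
    print("\n".join(utterances))
```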

B.2. Returning a Complete JSON Transcript

Description:

Transcribe a short audio file and return a full JSON transcript in the HTTP response. Due to the size of full JSON transcripts, a shorter example audio file is used for this example than was used in Example 1. The JSON data has been reformatted to make it easier to read, but is otherwise unmodified. Emotion and Gender identification have been turned on to show how these are represented in the JSON file.

Command:

curl -F output=json \
     -F model=eng1:callcenter \
     -F gender=True \
     -F emotion=True \
     -F file=@sample1.wav \
     -X POST http://vocig1:17171/transcribe

Result:

[ [ [3,0], [ ["+",1, [1,4] ], ["+",1, [6,9] ], ["+", 1, [10,14] ] ] ],
  "start": 0,
  "donedate": "2016-06-07 18:08:11.973793",
  "recvdate": "2016-06-07 18:08:11.098504",
  "events": [
    { "start": 0.55, "confidence": 0.93, "end": 0.67, "word": "The" },
    { "start": 0.67, "confidence": 0.97, "end": 1.02, "word": "matter" },
    { "wordex": "was(2)", "confidence": 0.90, "end": 1.33, "word": "was", "start": 1.02 },
    { "start": 1.33, "confidence": 0.88, "end": 1.94, "word": "resolved" },
    { "start": 2.07, "confidence": 0.95, "end": 2.24, "word": "in" },
    { "start": 2.24, "confidence": 0.95, "end": 2.30, "word": "a" },
    { "start": 2.30, "confidence": 0.98, "end": 2.70, "word": "very" },
    { "start": 2.70, "confidence": 0.99, "end": 3.39, "word": "professional" },
    { "start": 3.39, "confidence": 0.91, "end": 3.95, "word": "manner." },
    { "wordex": "Your(2)", "confidence": 0.96, "end": 4.72, "word": "Your", "start": 4.48 },
    { "wordex": "employees(2)", "confidence": 0.93, "end": 5.32, "word": "employees", "start": 4.72 },
    { "wordex": "are(2)", "confidence": 0.91, "end": 5.44, "word": "are", "start": 5.36 },
    { "start": 5.44, "confidence": 0.98, "end": 5.85, "word": "very" },
    { "start": 5.85, "confidence": 0.98, "end": 6.17, "word": "good." }
  ],
  "metadata": {
    "source": "\u000000051@vocig1",
    "model": "eng1:callcenter",
    "uttid": 0,
    "channel": 0
  }
} ],
"sentiment": "Positive",
"nchannels": null,
"gender": "female",
"source": "sample1.wav",
"sentiment_scores": [3,0],
"model": "eng1:callcenter",
"recvdate": "2016-06-07 18:08:11.098504"
}]
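The per-word entries in “events” carry the word text, start and end times, and a confidence score. As a minimal sketch of how a client might use them, the code below rebuilds the utterance text and flags low-confidence words. The function names and the 0.9 threshold are illustrative assumptions; the field names (“word”, “confidence”, “start”, “end”) are taken from the result above, and the sample list is the first four events from it.

```python
def words_and_confidence(events):
    """Return the joined word text and the mean word confidence."""
    text = " ".join(event["word"] for event in events)
    mean = sum(event["confidence"] for event in events) / len(events)
    return text, mean


def low_confidence_words(events, threshold=0.9):
    """List words whose confidence falls below the threshold."""
    return [e["word"] for e in events if e["confidence"] < threshold]


events = [
    {"start": 0.55, "confidence": 0.93, "end": 0.67, "word": "The"},
    {"start": 0.67, "confidence": 0.97, "end": 1.02, "word": "matter"},
    {"wordex": "was(2)", "confidence": 0.90, "end": 1.33, "word": "was", "start": 1.02},
    {"start": 1.33, "confidence": 0.88, "end": 1.94, "word": "resolved"},
]

text, mean = words_and_confidence(events)
print(text, round(mean, 2))
print(low_confidence_words(events))
```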

B.3. Returning Both Scrubbed Text and Audio

Description:

Transcribe a short audio file and return both scrubbed text and scrubbed audio. Files are automatically returned as a ZIP archive when audio scrubbing is active. The curl “-o” option is used to save ZIP results to a file.

Command:

curl -F scrubaudio=True \
     -F scrubtext=True -F output=text \
     -F model=eng1:callcenter \
     -F gender=True -F emotion=True \
     -F file=@sample1.wav \
     -X POST http://vocig1:17171/transcribe -o test.zip

Result:

$ ls -l
total 80
-rw-rw-r--. 1 vociadmin vociadmin 80824 Jun 7 18:25 test.zip
$ unzip test.zip
Archive: test.zip
inflating: sample1.wav
inflating: sample1.json

$ ls -l
total 188
-rw-r--r--. 1 vociadmin vociadmin 127 Jun 7 18:25 sample1.json
-rw-r--r--. 1 vociadmin vociadmin 104044 Jun 7 18:25 sample1.wav
-rw-rw-r--. 1 vociadmin vociadmin 80824 Jun 7 18:25 test.zip
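The same archive can be unpacked in code rather than with unzip. The sketch below uses Python's standard zipfile module and assumes, based on the listing above, that the archive holds one .json transcript and one audio file per input. The function name is hypothetical, and the in-memory archive built at the end is a stand-in for a real test.zip.

```python
import io
import json
import zipfile


def read_scrubbed_results(zip_bytes):
    """Return (transcripts, audio) dicts keyed by archive member name."""
    transcripts, audio = {}, {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            data = archive.read(name)
            if name.endswith(".json"):
                transcripts[name] = json.loads(data)
            else:
                audio[name] = data  # scrubbed audio, kept as raw bytes
    return transcripts, audio


# Build a small stand-in archive so the sketch runs without a server.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample1.json", '[{"source": "sample1.wav", "utterances": []}]')
    archive.writestr("sample1.wav", b"RIFF")

transcripts, audio = read_scrubbed_results(buffer.getvalue())
print(sorted(transcripts), sorted(audio))
```

With a real response, the bytes would come from the saved file, for example by opening test.zip in binary mode and passing its contents to the function.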

B.4. Posting Decoded Utterances to a Callback Server in Real-time

Description:

Transcribe a short audio file and post the decoded utterances to a callback HTTP server as soon as they are decoded. This example shows how real-time streaming transcription results are provided. The “text” output mode was selected to make example results shorter and easier to read. The “json” output mode is also supported.

Command:

curl -F utterance_callback=http://10.173.215.203:5555/utterance \
     -F datahdr=WAVE -F output=text \
     -F file=@sample7.wav \
     -X POST http://vocig1:17171/transcribe

Result:

These results were produced using a minimal HTTP server consisting of about 50 lines of Python code. Detailed timing information is also provided to illustrate the amount of time between the return of each utterance. As with Example 1, identifying information has been replaced with <redacted>. Note also that there are two people speaking on a single audio channel in this example, and diarization is turned off. For this reason you will see text from both speakers occurring in the same utterance in some cases.

started httpserver...

19:31:14.105023
{ "source":"sample7.wav",
  "utterance":"Thank you for calling <redacted> technical support. I understand you need to report a gas leak and I have your name please" }
10.173.212.231 - - [07/Jun/2016 19:31:14] "POST /utterance HTTP/1.1" 200 -
19:31:14.107844
{ "source":"sample7.wav",
  "utterance":"my name is Joe and I thank you Mr. Know what is your address or account number" }
10.173.212.231 - - [07/Jun/2016 19:31:14] "POST /utterance HTTP/1.1" 200 -
19:31:14.115829
{ "source":"sample7.wav",
  "utterance":"my address <redacted> is there. Anyone inside the house? I know everyone is out of the house. I notice the strange smell when I got home and I called you I am sending and gas technician to your home to fix the problem. Could you give me a good number to reach you at." }
10.173.212.231 - - [07/Jun/2016 19:31:14] "POST /utterance HTTP/1.1" 200 -
19:31:14.119927
{ "source":"sample7.wav",
  "utterance":"You can call <redacted>." }
10.173.212.231 - - [07/Jun/2016 19:31:14] "POST /utterance HTTP/1.1" 200 -
19:31:14.129272
{ "source":"sample7.wav",
  "utterance":"Thank you, please be safe and wait for the technician to arrive call us back if anything changes." }
10.173.212.231 - - [07/Jun/2016 19:31:14] "POST /utterance HTTP/1.1" 200 -
19:31:14.133110
{ "source":"sample7.wav",
  "utterance":"Thank you, bye. Good bye and thank you for calling <redacted>." }
10.173.212.231 - - [07/Jun/2016 19:31:14] "POST /utterance HTTP/1.1" 200 -
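The server used to produce the log above is not reproduced in this document. The following is a comparable minimal sketch built on Python's standard http.server module, not the original code. It assumes each callback arrives as an HTTP POST whose body is the JSON object shown in the log; if the body is not valid JSON, it is kept as raw text. The names `UtteranceHandler`, `start_server`, and `received` are hypothetical.

```python
import json
import threading
from datetime import datetime
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # utterance payloads in arrival order


class UtteranceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8", errors="replace")
        try:
            payload = json.loads(body)
        except ValueError:
            payload = {"raw": body}  # keep non-JSON bodies for inspection
        received.append(payload)
        # Timestamp each arrival, as in the log above.
        print(datetime.now().strftime("%H:%M:%S.%f"), payload)
        self.send_response(200)
        self.end_headers()


def start_server(host="0.0.0.0", port=5555):
    """Start the callback server in a background thread."""
    server = HTTPServer((host, port), UtteranceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling start_server() and keeping the process alive would listen on port 5555, matching the utterance_callback URL in the command above. The request log lines ("POST /utterance HTTP/1.1" 200 -) are printed by BaseHTTPRequestHandler itself.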
