Performance

Our GPU-based speech recognition platform works with models created with publicly available speech toolkits (KALDI, HTK, Sphinx, SRI LM toolkit).

The following table compares the performance of our GPU-enabled speech recognition engine with the reference implementation of the DNN-HMM hybrid in the KALDI toolkit:

The models were created in KALDI using the standard data for American English read-news transcription (the classic Wall Street Journal (WSJ) speech corpus). The reference experiment script (egs/wsj/s5/..) was used without modification. The acoustic model (AM) was trained on WSJ data only (LDC93S6B & LDC94S13B). The DNN contained 5 layers of 2000 neurons each and evaluated 3420 distributions. All language models (LMs), namely BCB05ONP, BCB05CNP and TCB20ONP, were taken from the WSJ distribution without any modification. The search network had ~82 million arcs for the tri-gram model (TCB20ONP) and ~6.5 million arcs for the bi-gram models (BCB05ONP & BCB05CNP).

- The accuracy of our GPU-enabled engine is approximately equal to that of the reference implementation. The actual Word Error Rate (WER) fluctuates slightly because of differences in the arithmetic implementation.

- For single-channel recognition, the TITAN-enabled engine is about 7 times faster than the reference implementation. This matters in tasks such as serving ASR to a Spoken Dialogue System (SDS) or media mining for specific spoken events.

- Our implementation of speech recognition on a mobile device (NVIDIA Tegra K1) processes speech twice as fast as real time without any degradation of accuracy.

- Our GPU-enabled engine delivers unprecedented energy efficiency for speech recognition. The value of 15 W per real-time (RT) channel for the i7-4930K was estimated while the CPU was fully loaded with 12 concurrent recognition jobs, which is the most power-efficient way to use the CPU. Our TITAN-enabled server does better while maintaining its processing speed. The Tegra-based solution is several times more power efficient.

- Power consumption and recognition speed of the GPU-based solution are linearly proportional to the system's load. In contrast, the CPU consumes much more energy per channel when operating at maximum speed on a single channel.
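The speed and power figures above are ratios, and a minimal sketch of the arithmetic may help when comparing them. It assumes that "per RT channel" means total power divided by the number of audio streams the machine can sustain in real time; this reading, the function names and the sample numbers are illustrative assumptions, not measurements from our tests.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration.

    A value below 1.0 means faster than real time; 0.5 corresponds to
    processing that is twice as fast as real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def watts_per_rt_channel(total_watts: float, concurrent_jobs: int, rtf: float) -> float:
    """Estimate the power needed to sustain one real-time channel.

    Assumption: a job running at real-time factor `rtf` can serve
    1 / rtf real-time channels, so the machine as a whole sustains
    concurrent_jobs / rtf channels.
    """
    if concurrent_jobs <= 0 or rtf <= 0:
        raise ValueError("concurrent_jobs and rtf must be positive")
    rt_channels = concurrent_jobs / rtf
    return total_watts / rt_channels


# Hypothetical numbers, chosen only to illustrate the arithmetic.
rtf = real_time_factor(processing_seconds=30.0, audio_seconds=60.0)
power = watts_per_rt_channel(total_watts=120.0, concurrent_jobs=12, rtf=1.0)
print(f"RTF: {rtf}, watts per RT channel: {power}")
```

Under this definition, a lower real-time factor at the same total power draw translates directly into fewer watts per channel.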

Here is another example, using WSJ models created with the HTK toolkit:

The AM was trained on WSJ data only (LDC93S6B & LDC94S13B). Keith Vertanen's model generation recipe was used without modification. The final AM consists of 13261 mixtures with 212224 Gaussians in total. All language models (LMs), namely BCB05ONP, BCB05CNP and TCB20ONP, were taken from the WSJ distribution without any modification. The search graphs had ~310 million arcs for TCB20ONP and ~16 million arcs for BCB05ONP & BCB05CNP.

- With the bi-gram LMs, accuracy is roughly the same;

- With the tri-gram LM (TCB20ONP), accuracy is significantly better;

- The Verbumware speech recognition system is massively faster (50 to 67 times);

- With the optimal parameter set, our recognition speed depends little on the total size of the search network. This indicates the optimality of the search implementation.
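The accuracy comparisons on this page are expressed as Word Error Rate (WER). For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. It is an illustration only and not the scoring tool used in our tests; the function name and the sample sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions are counted as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.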