
Model Adaptation


💎 Improving STT Quality

We often discuss model adaptation / tuning for particular use cases with our clients. Generally, our tech makes it possible to do so without losing the generalization of the core solution.

On Generalization

Let's start off with a brief discussion of generalization. There are two kinds of STT systems:

  • Ones that work relatively well out of the box on any reasonable data;
  • Ones that do not work from a "cold start", i.e. without training acoustic or language models;

Of course, it is impossible to be 100% domain / codec / noise / vocabulary agnostic. But it is important to understand that "quality" heavily depends on the noise level and domain peculiarities. We have often encountered people wrongfully comparing general domain-agnostic solutions with solutions heavily tuned for particular domains.

In a nutshell: solution A may start off with 30% WER as-is, while solution B may not start out of the box at all but show 25% WER after some investment of time and effort. This does not necessarily mean that A > B or B > A, but most likely it means that B is much more fragile than A.
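When comparing solutions like this, it helps to compute WER the same way on the same held-out data. A minimal sketch using the open-source jiwer library (our choice for this illustration, not something prescribed by the EE pipeline):

```python
# pip install jiwer
import jiwer

reference = "please send a taxi to the main railway station"
hypothesis = "please send taxi to the main railway station"

# WER = (substitutions + deletions + insertions) / words in reference
error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")  # one deletion out of nine words => ~11%
```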

When heavily researching the Russian market, we noticed that the available solutions (i) either do not work from a cold start (ii) or are prohibitively priced. It goes without saying that no solution provider even bothered to publish any decent quality measurements. Our design philosophy implies that our models should in general work at least decently on all domains.

Also, if you try our CE models, you may wrongfully assign too much importance to the fact that they produce less visually pleasing outputs from time to time. In fact, this is merely a distinction between our CE and EE tiers.

Improving Quality

Without further ado, we have four approaches to doing this with our EE models:

| Approach          | Costs             | WER Reduction                            |
|-------------------|-------------------|------------------------------------------|
| Term Dictionary   | 💵                | 1-2 percentage points (i.e. 20% => 18%)  |
| Secondary LM      | 💵                | 4-5 percentage points (i.e. 20% => 15%)  |
| Audio Annotation  | 💵 💵 💵          | 8-10 percentage points (i.e. 20% => 10%) |
| Custom Heuristics | 💵 to 💵 💵 💵 💵 | It depends                               |

Real Use Cases

Term Dictionary

Sometimes your domain vocabulary contains custom words or phrases that are very rare otherwise and have no well-established spelling, but nevertheless appear frequently in your case. We can just add such vocabularies to our EE system. This greatly improves perceptual quality, but does not affect overall quality much.

In the case of taxi calls, we could shave off 1-2 pp of WER in each region using this trick.
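The EE mechanism itself is not public, but the general idea can be illustrated with the open-source pyctcdecode library, which can boost domain terms ("hotwords") during beam-search decoding. A minimal sketch, assuming a character-level CTC acoustic model; the logits are random placeholders and the hotwords are made up for the example:

```python
# pip install pyctcdecode
import numpy as np
from pyctcdecode import build_ctcdecoder

# Alphabet of a hypothetical CTC model; "" at index 0 is the CTC blank
labels = [""] + list("abcdefghijklmnopqrstuvwxyz '")

decoder = build_ctcdecoder(labels)

# (time_steps, vocab) matrix of log-probabilities from the acoustic model;
# random values here only to keep the sketch self-contained
logits = np.log(np.random.dirichlet(np.ones(len(labels)), size=50).astype(np.float32))

# Rare domain terms get an extra score bonus during beam search
text = decoder.decode(
    logits,
    hotwords=["sheremetyevo", "domodedovo"],  # e.g. airport names in taxi calls
    hotword_weight=10.0,
)
print(text)
```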

Secondary LM

When analyzing quality on one of the domains (finance), we noticed that just by adding a dictionary and a secondary LM we could shave off an additional 4-5 pp of WER without any major investment in annotation.
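How exactly the secondary LM is plugged into the EE pipeline is internal, but a common pattern is to rescore the n-best hypotheses with a domain language model. A minimal sketch using the kenlm Python bindings; the model file and interpolation weight are illustrative assumptions:

```python
# pip install https://github.com/kpu/kenlm/archive/master.zip
import kenlm

# Hypothetical domain LM trained on in-domain (e.g. finance) text
domain_lm = kenlm.Model("finance.arpa")

def rescore(nbest, lm_weight=0.5):
    """Pick the hypothesis with the best combined acoustic + domain-LM score.

    nbest: list of (text, acoustic_score) pairs, scores in log space.
    """
    return max(
        nbest,
        key=lambda h: h[1] + lm_weight * domain_lm.score(h[0], bos=True, eos=True),
    )[0]

nbest = [
    ("the rate was two percent", -12.3),
    ("the rate was to percent", -12.1),
]
print(rescore(nbest))  # the domain LM should prefer "two percent"
```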

Audio Annotation

By annotating less than 100 hours of audio (and applying other optimizations), we could reduce WER from 20% to 12% on taxi-hailing calls.

Custom Heuristics

In general it depends, but we have commonly encountered two main types of solutions:

  • Playing with the metadata that you store;
  • Parsing the multiple hypotheses that our EE models can sometimes produce (a sketch follows this list);
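As an illustration, here is a hypothetical heuristic combining both ideas: when the stored metadata says an address is expected, prefer the hypothesis that matches a known street list. Everything here (the function, the street list) is made up for the example:

```python
# Hypothetical list of street names for the region, taken from stored metadata
KNOWN_STREETS = {"tverskaya", "arbat", "lenina"}

def pick_hypothesis(hypotheses, expect_address=False):
    """Choose among n-best hypotheses using simple domain heuristics.

    hypotheses: list of transcripts, best-first by model score.
    """
    if expect_address:
        for text in hypotheses:
            if any(street in text.lower() for street in KNOWN_STREETS):
                return text  # first hypothesis mentioning a known street
    return hypotheses[0]  # fall back to the top model hypothesis

nbest = ["take me to to ver sky a street", "take me to tverskaya street"]
print(pick_hypothesis(nbest, expect_address=True))  # => "take me to tverskaya street"
```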

Which One to Pick?

In real large-scale applications, it is really simple: you should apply all of the methods at the same time. They just have different time scales and returns on effort.
