OpenAI has announced a new automatic speech recognition (ASR) system called Whisper. It is based on an encoder-decoder Transformer and is available in five open-source model sizes on GitHub. The development team trained the system on 680,000 hours of audio material from the Internet. Two thirds of the recordings were in English; the remaining third covered various other languages. As a multitask model, Whisper is designed not only to transcribe speech but also to identify and translate languages.
Speech recognition hurdle: fine-tuning
In the research report on Whisper, the OpenAI team states that they developed the model with the aim of creating a robust speech processing system that does not require dataset-specific fine-tuning. The researchers note that pre-trained audio encoders are often trained unsupervised. As a result, the encoders become highly specialized, but human fine-tuning is required before the decoders can produce output of adequate quality. For Whisper, the team therefore mixed about 10,000 hours of supervised speech data into every 30,000 hours of noisier data, resulting in a weakly supervised model. According to the report, this process can be largely automated.
Whisper uses an end-to-end architecture implemented as a Transformer. Audio data is fed in as mel spectrograms of 30-second sound snippets. The encoder blocks contain multilayer perceptrons (MLPs) and self-attention; the decoder blocks add cross-attention to MLP and self-attention in order to predict the next text token.
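This preprocessing step can be sketched in a few lines. The 16 kHz sample rate and the 30-second window match what OpenAI describes for Whisper, but the helper below is a simplified illustration, not OpenAI's actual implementation:

```python
SAMPLE_RATE = 16_000          # Whisper resamples all audio to 16 kHz
CHUNK_SECONDS = 30            # fixed window length the encoder expects
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples

def pad_or_trim(samples: list[float]) -> list[float]:
    """Cut a waveform to exactly 30 seconds, zero-padding shorter clips."""
    if len(samples) >= CHUNK_SAMPLES:
        return samples[:CHUNK_SAMPLES]
    return samples + [0.0] * (CHUNK_SAMPLES - len(samples))

# A 5-second clip gets padded up to the full 30-second window.
clip = [0.1] * (SAMPLE_RATE * 5)
window = pad_or_trim(clip)
print(len(window))  # 480000
```

In the real pipeline, each such fixed-length window is then converted into a mel spectrogram before it reaches the encoder.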
Transformer-based speech recognition
Whisper is based on an encoder-decoder Transformer. The program reads audio in 30-second snippets, which are fed into the system as mel spectrograms. The decoder was trained to generate the text matching the sound. Whisper also uses special tokens that allow the program to perform multiple tasks. According to OpenAI, the model can perform language identification, phrase-level timestamping, multilingual speech transcription and speech translation into English.
Because of its broad training data and the lack of fine-tuning to any specific dataset, Whisper lags behind specialized models on benchmarks such as LibriSpeech. However, the OpenAI team reports better zero-shot performance on unfamiliar datasets: according to the developers, the model's robustness shows in an error rate 50 percent lower than that of specifically developed systems in tests across diverse datasets.
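The error rate usually quoted in such comparisons is the word error rate (WER): the word-level edit distance between reference and hypothesis transcript, divided by the number of reference words. A minimal sketch of the metric:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # ≈ 0.167
```

One substitution in a six-word reference yields a WER of about 16.7 percent; a 50 percent reduction would halve that figure.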
Whisper is said to be capable of multilingual speech recognition, speech translation, identifying the spoken language, and recognizing speech activity. All of these tasks are represented as a single sequence of tokens that the decoder predicts. A single model is thus intended to replace several stages of a traditional speech processing pipeline.
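This multitask format can be illustrated with the special tokens described in the Whisper paper, such as `<|startoftranscript|>`, language tags and the task tokens `<|transcribe|>` and `<|translate|>`. The helper function below is a hypothetical sketch; only the token names come from the published model:

```python
def build_decoder_prompt(language: str, task: str, timestamps: bool = False) -> list[str]:
    """Assemble the special-token prefix that tells the decoder what to do."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

# Transcribe German speech without timestamps:
print(build_decoder_prompt("de", "transcribe"))
# ['<|startoftranscript|>', '<|de|>', '<|transcribe|>', '<|notimestamps|>']
```

Because the task is encoded in the token sequence itself, one decoder can switch between transcription and translation without any change to the model.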
Five models open source
Whisper is available in five model sizes on GitHub, with parameter counts ranging from 39 million to over 1.5 billion. The smallest model needs about 1 GB of VRAM, the largest about 10 GB. All sizes except the largest are also offered as English-only variants. The choice of size is a trade-off between the speed and the accuracy of the system.
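The published checkpoints and their approximate resource needs can be summarized in code. The parameter counts and VRAM figures below are the approximate values OpenAI lists for the five sizes; the `pick_model` helper is a hypothetical convenience, not part of the library:

```python
# Approximate figures for the five Whisper checkpoints.
MODELS = {
    "tiny":   {"params_m": 39,   "vram_gb": 1},
    "base":   {"params_m": 74,   "vram_gb": 1},
    "small":  {"params_m": 244,  "vram_gb": 2},
    "medium": {"params_m": 769,  "vram_gb": 5},
    "large":  {"params_m": 1550, "vram_gb": 10},
}

def pick_model(available_vram_gb: float) -> str:
    """Return the largest checkpoint that fits into the given VRAM budget."""
    fitting = [name for name, m in MODELS.items() if m["vram_gb"] <= available_vram_gb]
    if not fitting:
        raise ValueError("not enough VRAM for any Whisper model")
    return max(fitting, key=lambda name: MODELS[name]["params_m"])

print(pick_model(6))  # 'medium'
```

On a GPU with 6 GB of VRAM, for example, the medium model would be the largest option that fits.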
Speech models and speech recognition are currently playing a major role, for example in the chat program LaMDA, which caused a sensation this year over claims of an alleged consciousness. Like Whisper, LaMDA is based on a Transformer architecture. A basic explanation of the structure and function of Transformers can be found here.
More information about Whisper can be found on the OpenAI blog and in the research report on the new speech recognition system.