whisper

last updated: Oct 20, 2023

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

In short, it's an open-source model from OpenAI that turns audio into text.

Via Dan Nguyen on Twitter.

The new release of whisper.cpp can work with quantized ggml models, which makes it much easier to work with. It's also quite a lot faster than the default whisper command: about 12x in wall-clock time on the sample below. Here's a comparison using the large-v2 model; for whisper-cpp I'm running it with q5_0 quantization:

$ time whisper-cpp samples/gb0.wav /tmp/out

real	0m40.617s
user	2m43.490s
sys	0m1.806s

$ time whisper --model large-v2 --output_format=srt samples/gb0.wav
<snip output>
real	8m19.783s
user	16m23.622s
sys	9m1.749s
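To put a number on "quite a lot faster", here is a small sketch that converts the `time` figures above into seconds and divides them. The helper name `parse_time` is made up for this note; the figures are copied from the two runs, and a single run of each is not a rigorous benchmark.

```python
import re


def parse_time(value: str) -> float:
    """Convert a shell `time` value such as '8m19.783s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Figures copied from the two `time` runs above.
whisper_cpp = {"real": "0m40.617s", "user": "2m43.490s"}
whisper_py = {"real": "8m19.783s", "user": "16m23.622s"}

# Wall-clock speedup and ratio of user CPU time.
real_speedup = parse_time(whisper_py["real"]) / parse_time(whisper_cpp["real"])
user_ratio = parse_time(whisper_py["user"]) / parse_time(whisper_cpp["user"])

print(f"wall-clock speedup: {real_speedup:.1f}x")  # 12.3x
print(f"user CPU time ratio: {user_ratio:.1f}x")   # 6.0x
```

So on this one sample the quantized whisper-cpp run finishes in roughly a twelfth of the wall-clock time and uses about a sixth of the user CPU time.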

I'm using a shell script wrapper that I wrote around whisper.cpp's main program. I'd like to open source it, but I'm not sure how to include the model file, or how to tell people all the steps they need to follow to generate it.

(I guess I could script all the downloading and conversion?)

To build the ggml-encoded models:

$ python convert-pt-to-ggml.py ~/.cache/whisper/large-v2.pt <whisper_repo_path> .

To build the quantized model, I ran:

$ make quantize
$ ./quantize whisper-large-v2-model.bin whisper-quantized-large-v2.ggml.q5_0.bin q5_0
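On the idea of scripting the conversion: a minimal sketch of what that might look like, not the wrapper I actually use. It only assembles the three commands shown above and prints them; `build_commands` and `run_all` are hypothetical names. It assumes it is run from a whisper.cpp checkout, that the `.pt` checkpoint is already in `~/.cache/whisper`, and that the converted model ends up at the path passed in as `converted`. Downloading the checkpoint is not covered.

```python
import shlex
import subprocess


def build_commands(checkpoint, whisper_repo, out_dir, converted, quantized,
                   qtype="q5_0"):
    """Return the convert, build, and quantize commands as argument lists."""
    return [
        # Convert the PyTorch checkpoint to a ggml-encoded model.
        ["python", "convert-pt-to-ggml.py",
         str(checkpoint), str(whisper_repo), str(out_dir)],
        # Build the quantize tool.
        ["make", "quantize"],
        # Quantize the converted model.
        ["./quantize", str(converted), str(quantized), qtype],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it too unless dry_run is set."""
    for command in commands:
        print("$", shlex.join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    commands = build_commands(
        checkpoint="~/.cache/whisper/large-v2.pt",  # expand ~ before a real run
        whisper_repo="<whisper_repo_path>",         # placeholder, fill in locally
        out_dir=".",
        converted="whisper-large-v2-model.bin",
        quantized="whisper-quantized-large-v2.ggml.q5_0.bin",
    )
    run_all(commands, dry_run=True)
```

Keeping the default at `dry_run=True` means the script shows exactly what it would run before anything touches a multi-gigabyte model file.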
