In education and social science research, and of course in journalism, researchers use interviews, conversations and dialogue, in short, speech, as raw data for understanding a situation or a problem. The first step of this approach, which we can broadly call qualitative research, is to record what is spoken aloud; the second is to transcribe it.

Transcribing speech was very laborious before artificial intelligence technologies came along: one minute of audio could take 5 to 10 minutes to transcribe. Now the work is relatively easy. There are online transcription services; I haven't tried them myself yet, but I plan to in the near future and will share my experiences when I do.

But I wondered whether I could transcribe the audio on my own computer, without uploading it to another platform, and thus be a little more confident that privacy is protected.

Enter whisper

My story began when I asked ChatGPT to transcribe an audio file. It said it could give me a one-minute transcription. What am I supposed to do with a one-minute transcription! But it offered an alternative: it could give me code to do the job on my own computer. The code used the whisper library, an Automatic Speech Recognition (ASR) model developed by OpenAI.

My computer's video card is a GeForce GTX 1070 with 8 GB of VRAM. While I was wondering whether that would be enough, I read that whisper runs in this much memory with every model except the large one, so I decided to try it.
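If you are not sure how much VRAM your own card has, the nvidia-smi tool that comes with the NVIDIA driver will report it, assuming the driver is installed:

nvidia-smi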

Installing whisper

I use Ubuntu 24.04 as my operating system, so what follows describes my experience on Ubuntu.

First of all, the following command is said to be sufficient to install whisper and the necessary libraries.

pip install -U openai-whisper
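One note: whisper also relies on the ffmpeg command-line tool to read audio files, so it is worth installing that first if it is not already on the system. On Ubuntu:

sudo apt install ffmpeg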

But Ubuntu told me that I could not run the pip install command system-wide and that I needed a virtual environment. So I created a Python virtual environment on a drive where I had plenty of room, with the following command.

python3 -m venv /home2/fm/python_projects

Then I changed into that folder.

cd /home2/fm/python_projects

Then I was able to install whisper with the following command, since pip sits in the environment's bin folder.

bin/pip install -U openai-whisper
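An alternative to prefixing everything with bin/ is to activate the virtual environment once; after that, pip and whisper are found on the PATH directly:

source bin/activate
pip install -U openai-whisper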

whisper at Work

I recorded a small voice memo on my computer to try it out. Then I ran whisper with the following command.

bin/whisper deneme.mp3 --model turbo --language Turkish

whisper finished in roughly the playback time of the audio file and converted the audio to text. While it runs, the terminal prints timestamps and the recognised text; when it finishes, the same information is written to disk in several formats (txt, srt, tsv, vtt, json).
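The same job can also be done from Python rather than from the command line. The sketch below assumes whisper is installed in the virtual environment and that the audio file is in the current directory; it uses the same model as the command above.

# minimal sketch: transcribe deneme.mp3 with whisper's Python API
import whisper

model = whisper.load_model("turbo")                     # downloaded on first use
result = model.transcribe("deneme.mp3", language="tr")  # "tr" = Turkish
print(result["text"])                                   # the whole transcript as one string

Here result["segments"] also holds the timestamped pieces that the command-line tool prints to the terminal.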

Where does whisper fall short?

whisper can transcribe speech, but if there is more than one speaker it cannot tell who said what and when. Identifying speakers is called diarization. There is other software that can do this, for example whisperX, but whisperX did not work on my system. I will look into it and try it again later.

whisper is also still relatively slow. faster-whisper is said to be up to four times faster, but it did not work on my computer either. I will keep checking it from time to time.

For speaker diarization I also tried pyannote.audio, with m4a and mp3 files, but the results were not satisfactory. I may pursue this later.

What is whisper good for?

It is a very good tool for voice notes you record for yourself, and in general for monologues with a single speaker. It is also useful for turning the notes in your notebook into voice notes on your phone and then getting written transcripts of those voice notes.

How Reliable is whisper?

The Word Error Rate (WER) of whisper's large model has been found to be between 9% and 13% (Shafer, 2024), and the WER of its small model to be around 18% (Oyucu, 2023).
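For context, WER is the share of words the model gets wrong compared with a human-made reference transcript: the number of substituted, deleted and inserted words divided by the number of words in the reference, WER = (S + D + I) / N. A WER of 10% therefore means roughly one error in every ten words.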

These are high error rates. So, especially when using whisper for research, it may make sense to check the transcript once while playing the audio file in the background.

References