Mar 12, 2023

Generating Subtitles in Real-Time with OpenAI Whisper and PyAudio

Last night, I started watching a recent show that includes dialogue in multiple languages, so naturally, I wondered if I could use OpenAI’s Whisper model to transcribe and translate audio to subtitles in real time. I hadn’t used Whisper before, so it took some initial research and fiddling around until it worked. Let’s check it out!

What’s Whisper, anyway?

Released in September 2022, Whisper is a model trained by OpenAI to recognize, transcribe, and translate speech in multiple languages. Interestingly, translation is not an afterthought but is embedded within the model, so you can either run a simple transcription or automatically translate the detected speech into English.

The Any-to-English speech translation case was exactly what I needed. Ideally, I would pass in the audio stream in real-time and have Whisper transcribe and translate the content to English, detecting the language without any hints.

We need some audio

To get started, I had to capture my Mac’s computer audio to pass it into Whisper. Unfortunately, Whisper isn’t designed for handling streams; instead, it accepts audio files and processes them in a sliding 30-second window. That’s why I had to record and save audio, then load it into Whisper. You might already be thinking that this fact alone hurts latency, and you’d be right; we’ll check that out at the end. Let’s not get ahead of ourselves, though, and start by recording our computer audio.

macOS has some quirks, including the fact that you can’t easily record computer audio (i.e. the combined output of all applications and the system) without installing kernel extensions or performing some magic that effectively does the same. I installed Audio Hijack and Loopback to create a virtual loopback audio device, which I could then consume with PyAudio.

import pyaudio
import wave

def record_audio():
    # capture parameters: 16 kHz mono, 16-bit samples, read in 1024-frame chunks
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 16000
    CHUNK = 1024
    RECORD_SECONDS = 3
    WAVE_OUTPUT_FILENAME = "output.wav"

    audio = pyaudio.PyAudio()

    # open the loopback device (index 2 on my machine) as an input stream
    stream = audio.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        frames_per_buffer=CHUNK,
        input_device_index=2
    )

    frames = []

    # read chunks until RECORD_SECONDS worth of audio has been captured
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)

    stream.stop_stream()
    stream.close()
    audio.terminate()

    # write the captured frames to a WAV file for Whisper to pick up
    waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    waveFile.setnchannels(CHANNELS)
    waveFile.setsampwidth(audio.get_sample_size(FORMAT))
    waveFile.setframerate(RATE)
    waveFile.writeframes(b''.join(frames))
    waveFile.close()

You might already be wondering about the parameters here, and we’ll get to them in a bit, I promise! Another important detail you have to double-check is input_device_index=2, which specifies the device to capture audio from.

To get the device index for your loopback device, you can open a Python REPL:

>>> import pyaudio
>>> audio = pyaudio.PyAudio()
>>> audio.get_device_count()
3
>>> audio.get_device_info_by_index(2)
{..., 'name': 'Loopback Audio', ...}
# use input_device_index=2
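
If you have more than a couple of devices, iterating over them is quicker than poking at indices one by one. Here’s a small helper sketch (not part of my original setup); find_input_device and the "Loopback Audio" name hint are just what my device happens to be called, so adjust as needed:

import pyaudio

# A helper sketch: find the index of the input device whose name matches the
# loopback device. "Loopback Audio" is specific to my setup.
def find_input_device(name_hint="Loopback Audio"):
    audio = pyaudio.PyAudio()
    try:
        for index in range(audio.get_device_count()):
            info = audio.get_device_info_by_index(index)
            if name_hint in info["name"] and info["maxInputChannels"] > 0:
                return index
    finally:
        audio.terminate()
    return None

The returned index can then be passed to audio.open as input_device_index instead of hard-coding the value.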

Up next, we can look into transcribing audio content!

Transcribing our audio content

Whisper has five different model sizes with speed and accuracy tradeoffs. For the purpose of my experiment, I used the base model to get acceptable results relatively quickly.

Before we can load any audio, we have to load our model. This has to be done just once during startup. Once the model is ready, we can load the captured audio and pad (or trim) it to 30s, since Whisper expects input of exactly that length.

import whisper

# load the model once during startup
model = whisper.load_model("base")

# capture audio once
record_audio()

# load the recorded clip and pad/trim it to Whisper's 30-second window
audio = whisper.pad_or_trim(whisper.load_audio("output.wav"))
print(whisper.transcribe(model, audio)["text"])

Running our Python file now outputs the transcription results. Right now, languages will be detected and transcribed as-is, without translation. Let’s figure that part out next!

Adding automatic translation

As translation is baked into the model, we can simply specify that we want the model to translate as well:

print(whisper.transcribe(model, audio, task="translate")["text"])

Running our program again should now return the English translation for audio in any other language. We’ve got all our building blocks prepared so let’s assemble everything in one final step.

So, about the real-time part

To come up with an acceptable solution, we have to think about what we want. As a viewer, I wanted to get subtitles displayed in (almost) real-time. This is a really hard task for our model to accomplish, as it receives a very short slice of computer audio with no other context attached and has to detect the language, transcribe the content, and then translate the result.

The closer we want to get to real-time transcription, the shorter our captured clips would ideally be. But if we cut off too much, we make it harder for the model. Ideally, Whisper expects the full 30s of audio, including complete sentences. We’re passing in clips and hoping it works, so take this with a grain of salt.

A naive approach to real-time transcription is setting the RECORD_SECONDS parameter of record_audio() as low as possible while receiving acceptable results from Whisper.

while True:
    # capture a short clip, then load and pad it for Whisper
    record_audio()
    audio = whisper.pad_or_trim(whisper.load_audio("output.wav"))
    # fp16=False avoids the FP16 warning when running on CPU
    print(whisper.transcribe(model, audio, task="translate", fp16=False)["text"])

This approach makes it easy to spot the limitations of our real-time transcription idea. While processing, we are not capturing the audio played in the meantime, so we’re dropping subtitles. This could be improved by continuously recording and transcribing, making sure there’s never a time slice where only the transcription task is running; a rough sketch of that follows below.
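
Here’s roughly what that could look like. This is only a sketch, and it assumes a hypothetical record_audio(filename) variant of the function above that accepts an output path, so two clips never overwrite each other:

import threading
import whisper

# A sketch of overlapping capture and transcription, assuming a hypothetical
# record_audio(filename) variant that writes to the given path. While one clip
# is being transcribed, the next one is already being captured on a background thread.
model = whisper.load_model("base")
filenames = ["clip_a.wav", "clip_b.wav"]

# prime the pipeline with a first clip
record_audio(filenames[0])

i = 0
while True:
    # start capturing the next clip in the background
    recorder = threading.Thread(target=record_audio, args=(filenames[(i + 1) % 2],))
    recorder.start()

    # transcribe the clip that just finished
    audio = whisper.pad_or_trim(whisper.load_audio(filenames[i % 2]))
    print(whisper.transcribe(model, audio, task="translate", fp16=False)["text"])

    # wait for the background recording to finish before swapping buffers
    recorder.join()
    i += 1

This keeps the capture continuous, but the baseline latency of at least one clip length remains.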

While this makes sure we’re not running into blind spots, the real issue is that the shorter our audio clips become, the harder it gets for Whisper to transcribe and output helpful results. After putting all the parts together, I tested the script on the remaining ~30 minutes of the episode and checked if it worked at all.

Surprisingly, it had some really good moments, and less surprisingly, most of the time it was rather useless, hallucinating content that definitely was not spoken (“Thanks for watching” came up repeatedly, so either someone had fun adding inaudible content to the episode or the model was a bit too creative) or not detecting the right language.

And still, in the moments it worked, the experience was magical. The authors of Whisper clearly stated that the model was not geared toward real-time transcription, and I’ve started to grasp the complexity of what that would entail (really fast processing on short inputs without any context). There are surely ways to fine-tune this setup: I could have checked out the different model sizes or worked on adding more context (like detecting the language upfront), but that’s for another time.
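
For what it’s worth, here’s a sketch of the “detecting the language upfront” idea, using Whisper’s built-in language detection: detect once on a clip, then pass the result explicitly so later calls skip detection. I haven’t tested this against the episode:

import whisper

# A sketch of the "detect the language upfront" idea: run Whisper's language
# detection once, then pin the result for subsequent transcriptions.
model = whisper.load_model("base")

audio = whisper.pad_or_trim(whisper.load_audio("output.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
language = max(probs, key=probs.get)
print(f"Detected language: {language}")

# pass the detected language explicitly so transcribe() skips detection
print(whisper.transcribe(model, audio, task="translate", language=language, fp16=False)["text"])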