Building an Automatic Japanese Subtitle Generator with faster-whisper
Most developers assume speech-to-text requires heavy infrastructure, external APIs, or a GPU.
It doesn’t.
Last weekend, I built a CLI tool that takes a Japanese video file and generates a .srt subtitle file — running entirely on CPU.
No cloud calls.
No per-request costs.
No black-box SaaS.
Just Python, faster-whisper, and a deeper understanding of how modern speech models actually work.
Local-first AI tools are more viable than most engineers think — if you understand the tradeoffs.
As I’ve been studying Japanese, I’ve been consuming more native content — especially anime and long-form videos. But not all of it comes with reliable subtitles, and for learners it’s often hard to find content subtitled in the same language as the audio.
Instead of treating that as a limitation, I saw an opportunity to combine two things I care about: language learning and backend engineering.
So I built a local CLI tool that takes a video file and outputs a properly formatted .srt subtitle file — automatically, entirely offline.

The stack is simple: Python + faster-whisper. But the real goal wasn’t just to generate subtitles. I wanted to understand what’s actually happening inside modern speech-to-text models, beyond the “just call the API” level.
The pipeline looks like this:
.mkv / .mp4 → faster-whisper → .srt file
What is faster-whisper?
If you’ve been following the AI space, you’ve probably heard of Whisper — OpenAI’s open-source speech recognition model based on the Transformer architecture. It’s trained on 680,000 hours of multilingual audio and can handle transcription, translation, and language identification.
faster-whisper is a re-implementation of Whisper using CTranslate2, an inference engine optimized for Transformer models. The result? Same accuracy, dramatically better performance.
faster-whisper gives you up to 4x faster inference with significantly lower memory usage. It supports int8 and float16 quantization, making it viable to run entirely on CPU — no expensive GPU required.
Here’s how I initialized the model:
from faster_whisper import WhisperModel
model = WhisperModel(
    "small",
    device="cpu",
    compute_type="int8"
)
The "small" model hits a sweet spot for Japanese: it’s accurate enough for conversational speech while keeping inference fast on consumer hardware. And int8 quantization cuts memory usage roughly in half compared to float32.
Architecture and Key Decisions
The core of the tool is a single transcription call. But the parameters you choose here have a big impact on output quality.
segments_generator, info = model.transcribe(
    input_audio_or_video,
    language="ja",
    vad_filter=True,
    beam_size=5
)
Let me break down each parameter and why I chose it:
- language="ja" — Explicitly setting Japanese avoids auto-detection errors and produces more stable output.
- vad_filter=True — Voice Activity Detection filters out silence, reducing hallucinated text in quiet sections.
- beam_size=5 — Beam search with 5 candidates. Balances transcription quality with processing speed.
- compute_type="int8" — Quantization that dramatically improves CPU performance with minimal quality loss.
One design detail I really liked: the transcribe() method returns a generator, not a list. This means segments are yielded one at a time — enabling streaming-style processing without loading the entire transcription into memory. This is an underrated pattern in ML APIs that more libraries should adopt.
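As a quick illustration, here is how the streamed segments can be consumed. The segment fields (.start, .end, .text) are faster-whisper's; the print formatting is just a sketch of my own:

for segment in segments_generator:
    # Nothing is decoded until the next item is pulled from the generator,
    # so memory stays flat even for hour-long videos.
    print(f"[{segment.start:7.2f}s -> {segment.end:7.2f}s] {segment.text.strip()}")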
What’s Happening Under the Hood
It’s easy to treat speech-to-text as a black box. But understanding the pipeline helps you debug issues and make better parameter choices. Here’s what Whisper actually does:
Audio Input → Preprocessing → Mel Spectrogram → Encoder → Decoder → Beam Search → Text
The audio is first converted into a log-Mel spectrogram — essentially a visual representation of sound frequencies over time. This spectrogram is then processed by a Transformer encoder, which produces a rich representation of the audio content.
The decoder is autoregressive: it generates one token at a time, using the encoder output plus all previously generated tokens as context. Beam search explores multiple candidate sequences simultaneously, keeping the top-k most promising paths at each step.
This is important to understand because it explains why parameters like beam_size affect both quality and speed — more beams means more candidates to evaluate at every single decoding step.
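To make that concrete, here is a toy beam search over a hand-written table of per-step token probabilities. This is purely illustrative and not Whisper's actual decoder; the point is that every extra beam multiplies the candidates scored at each step:

import math

def toy_beam_search(step_log_probs, beam_size=5):
    # Each beam is a (token_sequence, cumulative_log_prob) pair.
    beams = [([], 0.0)]
    for log_probs in step_log_probs:  # one dict of token -> log-prob per step
        candidates = [
            (tokens + [token], score + lp)
            for tokens, score in beams
            for token, lp in log_probs.items()
        ]
        # keep only the top-k most promising paths
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# Two decoding steps, three candidate tokens each (made-up probabilities).
steps = [
    {"日": math.log(0.6), "二": math.log(0.3), "に": math.log(0.1)},
    {"本": math.log(0.7), "曜": math.log(0.2), "ほ": math.log(0.1)},
]
print(toy_beam_search(steps, beam_size=2))

With beam_size=2 only the two best partial sequences survive each step; beam_size=5 keeps five, which is exactly the extra work you pay for at decode time.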
Manual SRT Formatting
Rather than relying on a library for subtitle formatting, I implemented the SRT timestamp conversion myself. SRT files have a very specific format: HH:MM:SS,mmm with comma-separated milliseconds.
def fmt_srt_time(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    h = ms // 3_600_000
    ms %= 3_600_000
    m = ms // 60_000
    ms %= 60_000
    s = ms // 1000
    ms %= 1000
    return f"{h:02}:{m:02}:{s:02},{ms:03}"
The full SRT entry for each segment looks like this:
1
00:00:03,240 --> 00:00:07,680
日本語の字幕が自動的に生成されます
I also added tqdm progress bars for both the transcription phase and the subtitle writing phase. It’s a small detail, but for CLI tools it makes a huge difference in user experience — especially when processing long videos where transcription can take several minutes.
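A condensed sketch of how these pieces can fit together, assuming the segments generator, info, and fmt_srt_time from above. The function name, output path, and the choice of info.duration as the progress-bar total are my own assumptions, not the tool's exact code:

from tqdm import tqdm

def write_srt(segments, info, output_path="output.srt"):
    with open(output_path, "w", encoding="utf-8") as srt, \
         tqdm(total=round(info.duration, 2), unit="s", desc="Transcribing") as pbar:
        for index, segment in enumerate(segments, start=1):
            # One SRT entry: index, timestamp range, text, blank line.
            srt.write(f"{index}\n")
            srt.write(f"{fmt_srt_time(segment.start)} --> {fmt_srt_time(segment.end)}\n")
            srt.write(f"{segment.text.strip()}\n\n")
            # Advance the bar to the end timestamp of the latest segment.
            pbar.update(segment.end - pbar.n)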
Measuring Performance
When you’re working with ML models, measuring execution time is non-negotiable. Even basic timing gives you insight into model and hardware tradeoffs.
import time
start_time = time.time()
# ... transcription and writing ...
elapsed = time.time() - start_time
minutes = int(elapsed // 60)
seconds = elapsed % 60
print(f"Total execution time: {minutes}m {seconds:.2f}s")
On my machine (CPU only, int8 quantization, small model), a 24-minute video typically takes a few minutes to transcribe (around 2–3 minutes in my runs). That’s comfortably faster than real time, which makes offline processing practical.
Switching to the medium model improves accuracy for more challenging speech, but increases processing time significantly — a classic speed vs quality tradeoff.
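Swapping models is a one-line change, so it is easy to experiment; a minimal sketch, assuming the same CPU and int8 setup as above:

# Larger model: better accuracy on harder speech, noticeably slower on CPU.
model = WhisperModel("medium", device="cpu", compute_type="int8")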
What I Learned
- Whisper isn’t magic — it’s a structured pipeline. Understanding the flow from audio preprocessing through spectrogram generation, encoder, autoregressive decoder, and beam search demystifies the entire process. Every parameter you tune maps to a specific stage.
- Quantization makes local AI viable. The jump from float32 to int8 isn’t just a minor optimization — it’s what makes running these models on consumer hardware realistic. The quality tradeoff is often negligible.
- Generator-based APIs are underrated in ML. Streaming segments instead of returning a full list changes how you can build downstream features — real-time display, early stopping, progress tracking — all become trivial.
- CPU inference is more powerful than most developers think. Not every project needs a GPU. With the right optimizations (quantization, efficient runtimes), CPU-only setups handle many real-world workloads just fine.
What’s Next
This was a weekend project, but there’s a clear path to make it more useful:
- Auto language detection — Let the model figure out the language instead of hardcoding it.
- Bilingual subtitles — Generate Japanese + English side by side using Whisper’s translation capabilities (a rough sketch of that second pass follows this list).
- CLI packaging — Turn it into a proper installable tool with argument parsing and config files.
- Backend integration — Wire it into a FastAPI + S3 pipeline for batch processing.
- Elixir port — Rebuild the core pipeline in Elixir using Nx and Bumblebee, leveraging concurrency for batch video processing.
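For the bilingual item, Whisper's built-in translation task already covers the English half. A rough sketch of what that second pass could look like; the input filename is hypothetical, and pairing the English segments with the Japanese ones by timestamp is the part the tool would still need to add:

# Second pass: Whisper translates the Japanese audio directly into English.
en_segments, en_info = model.transcribe(
    "episode_01.mkv",   # hypothetical input file
    language="ja",
    task="translate",   # Whisper only translates into English
    vad_filter=True,
    beam_size=5
)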
Weekend projects like this force you to truly understand the technology — not just consume it through an API. The gap between “I used Whisper” and “I understand how Whisper works” is where real engineering growth happens.
~Norman Argueta