Windows · MIT · Local-first · ONNX Runtime · GPU + CPU fallback

Press a hotkey, speak, paste.

Lazy to Text records audio on a global keyboard shortcut, transcribes it locally through ONNX Runtime, and pastes the result into whatever window has focus. A second tab transcribes audio / video files — WAV, MP3, FLAC, OGG, M4A, MP4, MOV, MKV, WebM and friends. No cloud round-trip, no subscription, no telemetry.


The Models tab of Lazy to Text showing model cards with family chips, an inline inference settings panel, and live resource stats in the topbar.

Why local

Audio stays on your machine

Recording, encoding, and inference all happen in-process. Model weights download once from Hugging Face and live in your local cache. The internet is never involved during transcription.

Works offline

After the first model download the app needs no network connection — useful for trains, planes, and locked-down enterprise laptops.

Lean install

Single ONNX inference path — no PyTorch, no NeMo, no CTranslate2. The exe folder lands at ~700 MB instead of the 6–8 GB the multi-engine builds used to need.

Features

One inference path, nine models

Whisper, GigaAM v3, Parakeet TDT v3, Canary, T-One, Vosk RU — all loaded by the same OnnxAsrBackend on onnxruntime-gpu. Adding a tenth model is a registry entry, not a new dependency tree.
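
To illustrate what "a registry entry" could look like, here is a minimal sketch. The names ModelEntry, MODEL_REGISTRY and register are hypothetical and are not taken from the project's source; the two sample rows reuse values from the Models table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    """One row of a hypothetical model registry (field names are assumptions)."""
    alias: str
    hf_repo: str
    size_mb: int
    languages: str


# Hypothetical registry keyed by alias.
MODEL_REGISTRY: dict[str, ModelEntry] = {}


def register(entry: ModelEntry) -> None:
    """Add a model, refusing duplicate aliases so cards stay unambiguous."""
    if entry.alias in MODEL_REGISTRY:
        raise ValueError(f"duplicate alias: {entry.alias}")
    MODEL_REGISTRY[entry.alias] = entry


# Values copied from the Models table on this page.
register(ModelEntry("gigaam-v3-ctc", "istupakov/gigaam-v3-onnx", 260, "RU only"))
register(ModelEntry("vosk-ru-small", "alphacep/vosk-model-small-ru", 30, "RU only"))
```

Under this layout a new model is one more register(...) call; nothing else in the loading path would need to change.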

Global hotkey dictation

Ctrl+F2 records, Ctrl+F3 stops + transcribes + pastes, Ctrl+F6 discards. Ctrl+1..5 jump between tabs. Every hotkey is remappable in Settings; edits persist on focus loss with no Save button.

File transcription

Separate Transcribe tab accepts drag-drop or Browse. soundfile handles WAV / FLAC / OGG / OPUS / AIFF natively; bundled static ffmpeg covers MP3 / M4A / AAC / WMA / MP4 / MOV / MKV / WebM / AVI / FLV / 3GP. No system-wide ffmpeg install required.
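
The split between the two decoders can be sketched as a lookup on the file extension. This is an illustration only: pick_decoder and the two sets are hypothetical names, and the app may detect formats differently (for example by probing file contents).

```python
from pathlib import Path

# Extension lists taken from the paragraph above.
SOUNDFILE_EXTS = {".wav", ".flac", ".ogg", ".opus", ".aiff"}
FFMPEG_EXTS = {
    ".mp3", ".m4a", ".aac", ".wma", ".mp4", ".mov",
    ".mkv", ".webm", ".avi", ".flv", ".3gp",
}


def pick_decoder(path: str) -> str:
    """Return "soundfile" or "ffmpeg" depending on the file extension."""
    ext = Path(path).suffix.lower()
    if ext in SOUNDFILE_EXTS:
        return "soundfile"
    if ext in FFMPEG_EXTS:
        return "ffmpeg"
    raise ValueError(f"unsupported file type: {ext or '(none)'}")
```

For example, pick_decoder("meeting.MKV") selects the ffmpeg path, while pick_decoder("note.flac") stays on soundfile.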

Russian-strong lineup

GigaAM v3 (CTC + RNN-T, both with built-in punctuation), T-One (Conformer trained on 80k hours of speech, beats Whisper on telephony / noisy audio), Vosk RU (30 / 50 MB Zipformers for low-spec laptops), plus Parakeet and Canary for multilingual coverage that includes Russian.

GPU with auto CPU fallback

Backend tries CUDA / TensorRT first; if no compatible NVIDIA driver is present, it silently retries on CPU and keeps running. Vosk and T-One are usable on CPU; the larger Whisper / Parakeet / Canary models benefit from a GPU but don't require one.
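
A rough sketch of the fallback logic, assuming the backend walks a preference list of ONNX Runtime execution providers. The provider strings are ONNX Runtime's own names; the preference order, order_providers, create_with_fallback and the make_session callback are assumptions made for illustration, not the app's actual code.

```python
from typing import Any, Callable, Sequence

# Assumed preference order: GPU providers first, CPU last.
PREFERRED = (
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
)


def order_providers(available: Sequence[str]) -> list[str]:
    """Keep the preferred providers that are available; always end with CPU."""
    chosen = [p for p in PREFERRED if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


def create_with_fallback(
    make_session: Callable[[str], Any], available: Sequence[str]
) -> tuple[str, Any]:
    """Try each provider in order; return the first one that initialises."""
    last_error: Exception | None = None
    for provider in order_providers(available):
        try:
            return provider, make_session(provider)
        except Exception as exc:  # e.g. missing driver or CUDA libraries
            last_error = exc
    raise RuntimeError("no execution provider could be initialised") from last_error
```

In a real program, available would typically come from onnxruntime.get_available_providers(), and make_session would wrap onnxruntime.InferenceSession(model_path, providers=[provider]).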

Per-model inference settings

Each Whisper card surfaces an inline panel — language, VAD filter, beam size, temperature, initial prompt — persisted per alias and pushed live into the running backend. No restart needed. Parakeet / GigaAM / T-One / Vosk panels match their narrower API surface.

Live resource monitor

Topbar widget tracks CPU / RAM / GPU utilisation / VRAM in real time via psutil + nvidia-ml-py. Bars colour-code green / amber / red as load climbs. The GPU bar hides gracefully on machines without an NVIDIA driver.
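
The colour coding can be pictured as a simple threshold function. The 60 % and 85 % cut-offs below are assumed values chosen for the example; the page does not state the thresholds the app uses.

```python
def load_colour(percent: float, amber_at: float = 60.0, red_at: float = 85.0) -> str:
    """Map a utilisation percentage to a bar colour (thresholds are assumptions)."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"percent out of range: {percent}")
    if percent >= red_at:
        return "red"
    if percent >= amber_at:
        return "amber"
    return "green"
```

The input would come from readings such as psutil.cpu_percent() or psutil.virtual_memory().percent, both of which report values on a 0 to 100 scale.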

Live VU meter

A STATUS · Idle / Recording / Processing chip in the sidebar's bottom-left corner mirrors the topbar cards; the VU bar under it animates while recording, so silent or muted mics show up immediately.

Cancel-load button

Mis-clicked a 1.6 GB model? A red Cancel pill appears next to the loading indicator. One click rolls back the active card / config / topbar pill — no waiting for the download to finish.

Storage card

Settings → Storage shows the resolved models folder (overridable), total disk space used by the cache (computed asynchronously), and an Open-folder shortcut to inspect / clean the cache in Explorer.
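
One way to compute the cache size without blocking the UI is to walk the folder on a worker thread. The sketch below uses only the standard library; folder_size_bytes and folder_size_async are hypothetical names, and the app may well use its UI toolkit's own threading instead.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from pathlib import Path


def folder_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root, recursively."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


# A single worker is enough: only one size computation runs at a time.
_executor = ThreadPoolExecutor(max_workers=1)


def folder_size_async(root: Path) -> Future:
    """Start the walk in the background and return a Future with the total."""
    return _executor.submit(folder_size_bytes, root)
```

The caller can attach a callback with Future.add_done_callback to update the readout once the total is known.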

Searchable history

Every transcription is persisted to logs/transcription_history.json and displayed in a filterable table, with a hover tooltip for the full text and a detail dialog on double-click. The toolbar exports the history to plain text.
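
Because the history is a plain JSON file, it can also be searched outside the app. The sketch below assumes the file holds a list of objects with a "text" field; that schema is a guess, since the page does not document it, so inspect the real file before relying on it.

```python
import json
from pathlib import Path


def load_history(path: str) -> list[dict]:
    """Read the history file; return an empty list if it does not exist yet."""
    p = Path(path)
    if not p.exists():
        return []
    return json.loads(p.read_text(encoding="utf-8"))


def filter_history(entries: list[dict], query: str) -> list[dict]:
    """Case-insensitive substring match on the assumed "text" field."""
    needle = query.casefold()
    return [e for e in entries if needle in str(e.get("text", "")).casefold()]
```

For example, filter_history(load_history("logs/transcription_history.json"), "invoice") would return every entry whose text mentions "invoice".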

Coloured logs view

Records colour-coded by level + logger source, with a "Show network logs" toggle that hides httpx / huggingface_hub noise and a search field that filters the buffer on the fly.

Refresh-rate-aware UI

Smooth-scroll animations and the live VU meter detect the primary monitor's refresh rate (60 / 144 / 240 Hz) and tick at native cadence. No more 60 Hz stutter on a 144 Hz display.
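
The arithmetic behind "native cadence" is one frame per refresh, so the timer interval is 1000 ms divided by the refresh rate. tick_interval_ms is a hypothetical helper name used only for this example.

```python
def tick_interval_ms(refresh_hz: float) -> float:
    """Milliseconds between animation ticks at one tick per display refresh."""
    if refresh_hz <= 0:
        raise ValueError(f"invalid refresh rate: {refresh_hz}")
    return 1000.0 / refresh_hz


for hz in (60, 144, 240):
    print(f"{hz} Hz -> {tick_interval_ms(hz):.2f} ms per tick")
```

A timer fixed at the 60 Hz interval (about 16.67 ms) would update on fewer than half the frames of a 144 Hz display (about 6.94 ms per frame), which is where the visible stutter comes from.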

Hide to tray

Close button minimises to the system tray instead of quitting. State-aware tray icon mirrors the recording pipeline. Right-click for Show / Quit; left-click restores the window.

Single-instance guard

A named-mutex check prevents a second copy from registering the same global hotkeys. The duplicate launch shows a branded warning and exits cleanly.
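
A named-mutex guard on Windows can be sketched with ctypes and the Win32 CreateMutexW call. The function name, the mutex name and the non-Windows fallback are assumptions made for this example; the app's own implementation is not shown on this page.

```python
import ctypes
import sys

ERROR_ALREADY_EXISTS = 183  # Win32 error code set when the named object exists

# Keep handles referenced so the mutex lives as long as the process.
_held_handles: list[int] = []


def acquire_single_instance(name: str) -> bool:
    """Return True if this process is the first to create the named mutex."""
    if sys.platform != "win32":
        return True  # assumption: no guard outside Windows in this sketch
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateMutexW.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_wchar_p]
    kernel32.CreateMutexW.restype = ctypes.c_void_p
    kernel32.CloseHandle.argtypes = [ctypes.c_void_p]
    ctypes.set_last_error(0)  # clear any stale error before the call
    handle = kernel32.CreateMutexW(None, False, name)
    if not handle:
        raise ctypes.WinError(ctypes.get_last_error())
    if ctypes.get_last_error() == ERROR_ALREADY_EXISTS:
        kernel32.CloseHandle(handle)  # another instance owns the name
        return False
    _held_handles.append(handle)
    return True
```

A launcher would call acquire_single_instance with a fixed name at startup, and show the warning and exit when it returns False.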

Screenshots

Models tab: search box, family-filter chips, cards with family-coloured chips, alias / canonical subtitle, badges, and an inline inference settings panel on the active card.

Transcribe tab: drop zone with format hint, Browse button, read-only transcript panel, and Copy / Save .txt buttons.

History tab: searchable table with sample transcriptions, alias-displayed model column, full-text tooltip, detail dialog on double-click, and Copy / Export / Clear buttons.

Logs tab: log lines colour-coded by level, a search field, and a Show network logs toggle; network noise is hidden by default.

Settings tab: auto-saving fields grouped into themed cards (Audio input, Hotkeys, Clipboard, Storage), with Test microphone and Open-folder buttons and a disk-usage readout on the Storage card.

Models

Every model is an ONNX export downloaded from Hugging Face on first use. Click Download on a card to fetch weights into the configured storage folder.

Alias	HF repo	Size	Languages	Best for
`whisper-large-v3-turbo`	`onnx-community/whisper-large-v3-turbo`	1.6 GB	multilingual	universal pick
`whisper-large-v3`	`onnx-community/whisper-large-v3`	3.1 GB	multilingual	best raw quality
`gigaam-v3-ctc`	`istupakov/gigaam-v3-onnx`	260 MB	RU only	fast Russian dictation
`gigaam-v3-rnnt`	`istupakov/gigaam-v3-onnx`	290 MB	RU only	best Russian quality
`t-one`	`t-tech/T-one`	290 MB	RU only	noisy / telephony Russian
`vosk-ru-small`	`alphacep/vosk-model-small-ru`	30 MB	RU only	low-spec laptop CPU
`vosk-ru`	`alphacep/vosk-model-ru`	50 MB	RU only	balanced CPU choice
`parakeet-tdt-v3`	`istupakov/parakeet-tdt-0.6b-v3-onnx`	1.2 GB	25 langs incl. Russian	fastest multilingual
`canary-1b-v2`	`istupakov/canary-1b-v2-onnx`	2.0 GB	25 langs incl. Russian	short-utterance accuracy

Install

Requires Windows 10 / 11, Python 3.12, and a microphone. A CUDA-capable GPU is recommended for the larger models (Whisper Large, Parakeet, Canary). Vosk and T-One run comfortably on CPU.

git clone https://github.com/aa-blinov/lazy-to-text.git
cd lazy-to-text

uv sync                       # creates the venv from uv.lock
uv run lazy-to-text-ui        # launch the app

The first launch leaves no model loaded — pick one from the Models tab and click Download. Press Ctrl+F2, speak, Ctrl+F3; the transcript pastes into the focused window. Or drop a file on the Transcribe tab to process audio / video from disk.

System

Windows 10 or 11 (64-bit)
Microphone (built-in or USB)
Optional: NVIDIA GPU with a CUDA 12-era driver for fast inference
Python 3.12 (only when building from source — bundled in the .exe build)

VRAM by model

Vosk RU small / RU — 0.5 / 1 GB
GigaAM v3 CTC / RNN-T — 2 / 2.5 GB
T-One — 1.5 GB
Parakeet TDT v3 — 2 GB
Whisper Large v3 Turbo — 4 GB
Whisper Large v3 — 6 GB
Canary 1B v2 — 4 GB
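
The figures above can be turned into a quick check of which models fit a given card. The VRAM values are copied from this list and the aliases from the Models table; models_that_fit is a hypothetical helper, and real usage will vary with audio length and settings.

```python
# VRAM guidance in GB, copied from the list above and keyed by model alias.
VRAM_GB = {
    "vosk-ru-small": 0.5,
    "vosk-ru": 1.0,
    "gigaam-v3-ctc": 2.0,
    "gigaam-v3-rnnt": 2.5,
    "t-one": 1.5,
    "parakeet-tdt-v3": 2.0,
    "whisper-large-v3-turbo": 4.0,
    "whisper-large-v3": 6.0,
    "canary-1b-v2": 4.0,
}


def models_that_fit(budget_gb: float) -> list[str]:
    """Return the aliases whose listed VRAM need is within the budget, sorted."""
    return sorted(alias for alias, need in VRAM_GB.items() if need <= budget_gb)
```

With a 4 GB card, for instance, every model except whisper-large-v3 is within the listed budget.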

Roadmap

Onboarding overlay for first-launch users
Per-model VRAM forecasting before download
Light theme + theme switcher
Word-level timestamps for the file-transcribe view
Speaker diarisation (file transcribe → labelled segments)

Done in the latest cycle:

ONNX-only refactor (single inference path, all backends collapsed onto onnx-asr)
Transcribe tab with drag-drop and bundled ffmpeg
T-One, Vosk RU small / large, Canary 1B v2 added
Storage card with disk usage + Open-folder
Refresh-rate-aware smooth-scroll and VU meter (60 / 144 / 240 Hz)
Cancel-load button
Live resource monitor
Live VU meter
Transcription confirmation toast
Searchable history with detail dialog
Coloured logs view
Smooth pixel scrolling everywhere