Danish Whisper (FUTO / ACFT) — selected components

This document covers four pieces of the pipeline: training (train_acft.py), Hugging Face → GGML conversion (convert-to-android.sh), the best small checkpoint (small_244M_danish_whisper_acft_futo_best), and Android-ready binaries (android-futo-ready).

Model Card

Fine-tuned OpenAI Whisper checkpoints for dynamic audio context robustness (ACFT): shorter encoder contexts work well on short or streaming audio. Method and upstream models: FUTO whisper-acft. This repo trains Danish variants on CoRal v3.

Developed by: AIOnTheEdge
Finetuned from: OpenAI Whisper small
Model license: Apache-2.0
Dataset license: OpenRAIL-D (commercial use permitted with restrictions, e.g. on speech synthesis and biometric identification)

Uses. Checkpoints are meant for runtimes that expose variable audio context—e.g. whisper.cpp with --audio-context, or FUTO Keyboard on Android. Under default full-context Whisper settings they behave like ordinary Whisper. Pre-converted GGML models and usage notes: whisper-acft README.


Overview

train_acft.py  →  small_244M_danish_whisper_acft_futo_best/  →  convert-to-android.sh  →  android-futo-ready/
   (PyTorch)              (HF weights + tokenizer)              (convert)           (ggml-model.bin)

Step | Artifact | Role
1 | train_acft.py | Distill Whisper small toward FUTO-style partial-context behavior on Danish CoRal audio
2 | small_244M_danish_whisper_acft_futo_best/ | Best small (244M) Hugging Face checkpoint (~1.1 GB model.safetensors)
3 | convert-to-android.sh | Convert HF folders to ggml-model.bin for whisper.cpp / FUTO Keyboard
4 | android-futo-ready/ | Renamed .bin files ready to drop into the Android voice-input stack

Prerequisites

  • uv and Python ≥ 3.13 (see pyproject.toml)
  • CUDA GPU recommended for training
  • Git submodules / clones used by conversion:
    • whisper/ — OpenAI reference repo (vocab layout for convert-h5-to-ggml.py)
    • whisper.cpp/ — contains models/convert-h5-to-ggml.py
  • Hugging Face token for the CoRal dataset (create .env):
HF_TOKEN=hf_...

Install dependencies:

uv sync

train_acft.py

FUTO-style ACFT (audio-context fine-tuning) distillation for Danish ASR. The student model is trained to match the teacher’s decoder hidden states when the encoder only sees a variable-length partial mel context (simulating short streaming chunks), while the frozen teacher always uses the full 1500-frame encoder context.

What it does

  • Loads openai/whisper-small as both student (model_train) and teacher (model_base).
  • Streams CoRal v3 read_aloud at 16 kHz; skips clips longer than 29 s.
  • For each sample, picks a random partial encoder length n_ctx (with jitter), computes the MSE loss between student and teacher decoder hidden states, then takes an AdamW step (batch size 1).
  • Logs to TensorBoard; saves checkpoints on improved EMA loss every 500 steps; early-stops after 20 evals without improvement or at 20 000 steps.
  • On exit (normal, early stop, max steps, or Ctrl+C), always writes a latest checkpoint directory.
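
The partial-context selection in the list above can be sketched as follows. This is a minimal illustration, not the code in train_acft.py: the function name pick_partial_ctx, the max_jitter default of 64, and the choice of non-negative jitter are all assumptions. The only fixed fact used is that Whisper's encoder produces 1500 frames for its full 30 s window, i.e. 50 frames per second.

```python
import math
import random

FULL_CTX = 1500         # encoder frames for Whisper's full 30 s window
FRAMES_PER_SECOND = 50  # 1500 frames / 30 s


def pick_partial_ctx(duration_s, max_jitter=64, rng=random):
    """Return a partial encoder length that covers the clip, plus random jitter.

    max_jitter is a hypothetical value; the real script may use a different
    range or allow negative jitter.
    """
    needed = math.ceil(duration_s * FRAMES_PER_SECOND)
    jitter = rng.randint(0, max_jitter)
    # Clamp so the result is always a valid encoder length.
    return max(1, min(FULL_CTX, needed + jitter))
```

Under these assumptions, a 4 s clip maps to an n_ctx between 200 and 264, while the teacher always runs at 1500.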

Usage

uv run python train_acft.py --size small

Output directories

Training writes Hugging Face–format folders (config, tokenizer, model.safetensors). Naming pattern from the script:

  • Best (lowest EMA loss): small_244M_danish_whisper_acft_futo_best
  • Latest (always saved at end): small_244M_danish_whisper_acft_futo_latest

Hyperparameters (in script)

Setting | Value
Learning rate | 1e-6
Weight decay | 0.1
Max steps | 20_000
Eval / checkpoint interval | 500
Early-stop patience | 20 evals
Processor | openai/whisper-small, language Danish, task transcribe
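
How the eval interval and early-stop patience could interact with an EMA loss is sketched below. The class name EmaEarlyStopper and the decay of 0.99 are assumptions made for illustration; the script's actual smoothing factor and bookkeeping may differ.

```python
class EmaEarlyStopper:
    """Track an exponential moving average of the loss and count stale evals."""

    def __init__(self, decay=0.99, patience=20):
        self.decay = decay        # hypothetical smoothing factor
        self.patience = patience  # evals without improvement before stopping
        self.ema = None
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, loss):
        """Fold one training-step loss into the moving average."""
        if self.ema is None:
            self.ema = loss
        else:
            self.ema = self.decay * self.ema + (1 - self.decay) * loss
        return self.ema

    def evaluate(self):
        """Call once per eval interval. Returns (improved, should_stop)."""
        if self.ema is None:
            return False, False
        if self.ema < self.best:
            self.best = self.ema
            self.bad_evals = 0
            return True, False  # caller would save the "best" checkpoint here
        self.bad_evals += 1
        return False, self.bad_evals >= self.patience
```

In a training loop, update would be called every step and evaluate every 500 steps; a True first value is the point where the best checkpoint would be written.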

Best run (reference)

From training logs for the bundled checkpoint:

[Step 24000] New best loss: 0.0313. Saved checkpoint to small_244M_danish_whisper_acft_futo_best

small_244M_danish_whisper_acft_futo_best

Hugging Face–format Whisper small checkpoint (244M parameters) after ACFT distillation. Use this folder as input to convert-to-android.sh or load directly with Transformers.

Contents

File | Description
model.safetensors | Fine-tuned weights (~1.1 GB)
config.json, generation_config.json | Whisper-small architecture + generation defaults
tokenizer.json, tokenizer_config.json, processor_config.json | Danish transcribe tokenizer/processor
vocab.json, added_tokens.json | Vocabulary (also refreshed by convert-to-android.sh from upstream whisper-small)

Load in Python (example)

from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_dir = "small_244M_danish_whisper_acft_futo_best"
model = WhisperForConditionalGeneration.from_pretrained(model_dir)
processor = WhisperProcessor.from_pretrained(model_dir)

Architecture matches OpenAI whisper-small (d_model=768, 12 encoder/decoder layers).


convert-to-android.sh

Batch-converts trained HF model directories into GGML binaries for whisper.cpp / FUTO Keyboard.

Behavior

  1. Discovers folders (default: any directory name containing danish_whisper_acft_futo in the project root).
  2. Downloads added_tokens.json and vocab.json from openai/whisper-small into each folder (ensures tokenizer compatibility with the converter).
  3. Runs uv run whisper.cpp/models/convert-h5-to-ggml.py <folder>/ ./whisper/ <folder>/
  4. Moves ggml-model.bin → android-futo-ready/<folder>.bin
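
The discovery and renaming steps above can be mirrored in a few lines of Python. This is a sketch of the described behavior, not a port of convert-to-android.sh: the function name plan_conversions is hypothetical, and the download and converter steps are left out.

```python
from pathlib import Path

PATTERN = "danish_whisper_acft_futo"  # default folder-name match from step 1
OUT_DIR = "android-futo-ready"        # destination directory from step 4


def plan_conversions(root, explicit=None):
    """Map each source folder to the .bin path its GGML output would be moved to."""
    root = Path(root)
    if explicit:
        # Folders passed on the command line take precedence over discovery.
        folders = [root / name for name in explicit]
    else:
        folders = sorted(
            p for p in root.iterdir() if p.is_dir() and PATTERN in p.name
        )
    return {src: root / OUT_DIR / f"{src.name}.bin" for src in folders}
```

Calling plan_conversions(".") from the repository root lists what a default run would produce without touching any files.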

Usage

# Convert all danish_whisper_acft_futo* folders in the current directory
./convert-to-android.sh

# Convert one or more folders explicitly (space-separated)
./convert-to-android.sh small_244M_danish_whisper_acft_futo_best

Make the script executable if needed:

chmod +x convert-to-android.sh

Requirements

  • Run from the repository root.
  • whisper/ and whisper.cpp/ must be present.
  • Each source folder must contain a full HF checkpoint (including model.safetensors or equivalent weights the converter accepts).
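
A quick preflight check for these requirements might look like the sketch below. The function name missing_requirements is hypothetical, and treating model.safetensors or pytorch_model.bin as the accepted weight files is an assumption about what the converter can load.

```python
from pathlib import Path


def missing_requirements(root, folders):
    """Return a list of problems; an empty list means conversion can start."""
    root = Path(root)
    problems = []
    if not (root / "whisper").is_dir():
        problems.append("whisper/ is missing")
    if not (root / "whisper.cpp" / "models" / "convert-h5-to-ggml.py").is_file():
        problems.append("whisper.cpp/models/convert-h5-to-ggml.py is missing")
    for name in folders:
        folder = root / name
        if not (folder / "config.json").is_file():
            problems.append(f"{name}: config.json is missing")
        # Assumed weight file names; adjust if the checkpoint uses another layout.
        has_weights = (folder / "model.safetensors").is_file() or (
            folder / "pytorch_model.bin"
        ).is_file()
        if not has_weights:
            problems.append(f"{name}: no weight file found")
    return problems
```

Running it from the repository root before ./convert-to-android.sh gives a readable list of what still needs to be cloned or copied.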

android-futo-ready

Output directory for the GGML model consumed on-device (e.g. FUTO Keyboard Android voice input).

File | Source | Approx. size
small_244M_danish_whisper_acft_futo_best.bin | small_244M_danish_whisper_acft_futo_best/ | ~488 MB
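
The size in the table is consistent with a simple back-of-the-envelope estimate. The sketch below assumes the GGML file stores nearly all weights as 16-bit floats and uses the rounded 244M parameter count; header, vocabulary, and any tensors kept at higher precision are ignored.

```python
def approx_size_mb(n_params, bytes_per_weight):
    """Estimate file size in decimal megabytes from parameter count and weight width."""
    return n_params * bytes_per_weight / 1e6


# 244M parameters at 2 bytes each (assumed fp16 storage)
print(approx_size_mb(244_000_000, 2))  # prints 488.0
```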

Regenerate after retraining or updating the HF folder:

./convert-to-android.sh small_244M_danish_whisper_acft_futo_best

End-to-end

# 1. Train (or skip if using the bundled best checkpoint)
uv run python train_acft.py --size small

# 2. Convert best small checkpoint to GGML
./convert-to-android.sh small_244M_danish_whisper_acft_futo_best

# 3. Deploy
# Copy android-futo-ready/small_244M_danish_whisper_acft_futo_best.bin into your Android / FUTO build.

Troubleshooting

Issue | Likely cause / fix
Dataset load fails | Missing or invalid HF_TOKEN in .env
CUDA OOM during training | Batch size is already 1, so use a GPU with more VRAM; clips longer than 29 s are already skipped, but small still needs substantial memory
convert-h5-to-ggml.py not found | Clone or init whisper.cpp under the repo root
Converter vocab errors | Ensure the whisper/ clone exists; the script re-fetches vocab.json / added_tokens.json
Empty android-futo-ready | Run convert-to-android.sh once a complete HF checkpoint exists