IWSLT 2026 Instruction Following - Checkpoints
Intermediate checkpoints for the IWSLT 2026 Instruction Following shared task (constrained setting).
Architecture
- Speech encoder: facebook/seamless-m4t-v2-large (frozen)
- Projector: TransformerProjector (frame averaging + 4-layer Transformer + Linear, 1024->2560)
- LLM: Qwen/Qwen3-4B-Instruct-2507 with LoRA r=16
- Training framework: ms-swift 4.0
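The projector described above can be sketched as follows. This is a rough reconstruction from the bullet list (frame averaging, 4-layer Transformer, Linear 1024->2560); the layer layout, head count, and class details are assumptions, not the actual code in `src.model.adapters`:

```python
import torch
import torch.nn as nn

class TransformerProjectorSketch(nn.Module):
    """Sketch of the speech-to-LLM projector: average every
    `downsample_factor` frames, refine with a small Transformer
    encoder, then map 1024-dim features to the 2560-dim LLM space."""

    def __init__(self, speech_dim=1024, llm_dim=2560,
                 downsample_factor=3, num_layers=4, num_heads=8):
        super().__init__()
        self.downsample_factor = downsample_factor
        layer = nn.TransformerEncoderLayer(
            d_model=speech_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, x):  # x: (batch, frames, speech_dim)
        b, t, d = x.shape
        t_trim = t - (t % self.downsample_factor)
        # Frame averaging: collapse each window of `downsample_factor` frames.
        x = x[:, :t_trim].reshape(
            b, t_trim // self.downsample_factor, self.downsample_factor, d
        ).mean(dim=2)
        return self.proj(self.encoder(x))

feats = torch.randn(2, 30, 1024)          # dummy encoder output
out = TransformerProjectorSketch()(feats)
print(out.shape)  # torch.Size([2, 10, 2560])
```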
Checkpoints
stage1_swift/projector.pt (LATEST)
ASR projector-only training via SWIFT. Trained on 3,000 English EuroParlST samples, 3 epochs (279 steps). Loss: 9.6 -> 4.4. Eval loss: 4.40. Best checkpoint at step 200.
stage0_prealign/projector.pt (deprecated)
MSE pre-alignment of projector. Trained on 37,696 alignment pairs. Loss converged to ~0.0002 but audit showed degenerate mean-pooling loss. Not useful.
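Our reading of the "degenerate mean-pooling loss" finding (an interpretation, with hypothetical shapes): if the MSE objective compares time-averaged projector output to time-averaged target embeddings, the time axis cancels out, and a projector that emits a near-constant vector can score well without learning frame-level alignment:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical shapes: per-frame projector output vs. target embeddings.
proj_out = torch.randn(4, 50, 2560)
targets = torch.randn(4, 12, 2560)

# Mean-pool both sequences over time, then MSE: frame-level
# structure never affects the loss.
pooled_loss = F.mse_loss(proj_out.mean(dim=1), targets.mean(dim=1))

# A projector that ignores its input and emits one constant vector
# (the global target mean) already beats random frame outputs.
constant_out = targets.mean(dim=(0, 1)).expand(4, 50, 2560)
degenerate_loss = F.mse_loss(constant_out.mean(dim=1), targets.mean(dim=1))
print(bool(degenerate_loss < pooled_loss))  # True
```

This is why a very low pre-alignment loss (~0.0002 here) can be meaningless on its own.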
stage1_asr/projector.pt (deprecated)
Old projector from pre-SWIFT pipeline. Trained on 1,000 samples with custom Trainer. Superseded by stage1_swift.
stage2_text_lora/lora_adapters/ (needs re-training)
Text-only LoRA pre-training on MT data. 157,976 EuroParlST EN->DE/IT translation pairs, 500 steps. Final loss ~2.08. Audit found missing source text in prompts; needs re-training with fixed prompts.
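A hypothetical corrected prompt builder for the re-training run; the wording and layout are assumptions, not the task's official template. The only point it illustrates is the audit fix: the source sentence must appear in the prompt.

```python
def build_mt_prompt(src_text, src_lang="English", tgt_lang="German"):
    """Hypothetical MT prompt that includes the source text,
    addressing the audit finding that it was missing."""
    return (f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
            f"{src_lang}: {src_text}\n"
            f"{tgt_lang}:")

prompt = build_mt_prompt("The session is resumed.")
print(prompt)
```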
Usage
import torch
from src.model.adapters import TransformerProjector

# Rebuild the projector with the architecture used in training
# (1024-dim speech features -> 2560-dim LLM embeddings).
projector = TransformerProjector(speech_dim=1024, llm_dim=2560, downsample_factor=3)

# Load the stage-1 weights (map to CPU so this works without a GPU).
state_dict = torch.load("stage1_swift/projector.pt", map_location="cpu")
projector.load_state_dict(state_dict)
projector.eval()
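The loaded projector then maps frozen SeamlessM4T-v2 encoder features into the LLM embedding space. A minimal shape check, with a plain `nn.Linear` standing in for the real `TransformerProjector` (the stand-in module and the frame count are assumptions):

```python
import torch
import torch.nn as nn

# Stand-in for the loaded projector: only the 1024 -> 2560 mapping
# matters for this shape check.
projector = nn.Linear(1024, 2560).eval()

# Hypothetical frozen SeamlessM4T-v2 encoder output: (batch, frames, 1024).
speech_feats = torch.randn(1, 150, 1024)
with torch.no_grad():
    llm_embeds = projector(speech_feats)
print(llm_embeds.shape)  # torch.Size([1, 150, 2560])
```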
Status
- Stage 1 (SWIFT): complete on Mac M4 Pro 48GB
- Stage 2: needs re-training with fixed prompts (SWIFT native Qwen3)
- Stage 3: needs cluster (A100) for multimodal merge