IWSLT 2026 Instruction Following - Checkpoints
Intermediate checkpoints for the IWSLT 2026 Instruction Following shared task (constrained setting).
Architecture
- Speech encoder: facebook/seamless-m4t-v2-large (frozen)
- Projector: TransformerProjector (frame averaging + 4-layer Transformer + Linear, 1024->2560)
- LLM: Qwen/Qwen3-4B-Instruct-2507 with LoRA r=16
- Training framework: ms-swift 4.0
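The projector described above can be sketched as follows. This is a rough reconstruction from the bullet list (frame averaging, 4-layer Transformer, Linear 1024->2560); the layer layout, head count, and class details are assumptions, not the actual code in `src.model.adapters`:

```python
import torch
import torch.nn as nn

class TransformerProjectorSketch(nn.Module):
    """Sketch of the speech-to-LLM projector: average every
    `downsample_factor` frames, refine with a small Transformer
    encoder, then map 1024-dim features to the 2560-dim LLM space."""

    def __init__(self, speech_dim=1024, llm_dim=2560,
                 downsample_factor=3, num_layers=4, num_heads=8):
        super().__init__()
        self.downsample_factor = downsample_factor
        layer = nn.TransformerEncoderLayer(
            d_model=speech_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, x):  # x: (batch, frames, speech_dim)
        b, t, d = x.shape
        t_trim = t - (t % self.downsample_factor)
        # Frame averaging: collapse each window of `downsample_factor` frames.
        x = x[:, :t_trim].reshape(
            b, t_trim // self.downsample_factor, self.downsample_factor, d
        ).mean(dim=2)
        return self.proj(self.encoder(x))

feats = torch.randn(2, 30, 1024)          # dummy encoder output
out = TransformerProjectorSketch()(feats)
print(out.shape)  # torch.Size([2, 10, 2560])
```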
Checkpoints
stage1_swift/projector.pt (LATEST)
ASR projector-only training via SWIFT. Trained on 3,000 English EuroParlST samples, 3 epochs (279 steps). Loss: 9.6 -> 4.4. Eval loss: 4.40. Best checkpoint at step 200.
stage0_prealign/projector.pt (deprecated)
MSE pre-alignment of projector. Trained on 37,696 alignment pairs. Loss converged to ~0.0002 but audit showed degenerate mean-pooling loss. Not useful.
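Our reading of the "degenerate mean-pooling loss" finding (an interpretation, with hypothetical shapes): if the MSE objective compares time-averaged projector output to time-averaged target embeddings, the time axis cancels out, and a projector that emits a near-constant vector can score well without learning frame-level alignment:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical shapes: per-frame projector output vs. target embeddings.
proj_out = torch.randn(4, 50, 2560)
targets = torch.randn(4, 12, 2560)

# Mean-pool both sequences over time, then MSE: frame-level
# structure never affects the loss.
pooled_loss = F.mse_loss(proj_out.mean(dim=1), targets.mean(dim=1))

# A projector that ignores its input and emits one constant vector
# (the global target mean) already beats random frame outputs.
constant_out = targets.mean(dim=(0, 1)).expand(4, 50, 2560)
degenerate_loss = F.mse_loss(constant_out.mean(dim=1), targets.mean(dim=1))
print(bool(degenerate_loss < pooled_loss))  # True
```

This is why a very low pre-alignment loss (~0.0002 here) can be meaningless on its own.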
stage1_asr/projector.pt (deprecated)
Old projector from pre-SWIFT pipeline. Trained on 1,000 samples with custom Trainer. Superseded by stage1_swift.
stage2_text_lora/lora_adapters/ (needs re-training)
Text-only LoRA pre-training on MT data. 157,976 EuroParlST EN->DE/IT translation pairs, 500 steps. Final loss ~2.08. Audit found missing source text in prompts; needs re-training with fixed prompts.
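A hypothetical corrected prompt builder for the re-training run; the wording and layout are assumptions, not the task's official template. The only point it illustrates is the audit fix: the source sentence must appear in the prompt.

```python
def build_mt_prompt(src_text, src_lang="English", tgt_lang="German"):
    """Hypothetical MT prompt that includes the source text,
    addressing the audit finding that it was missing."""
    return (f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
            f"{src_lang}: {src_text}\n"
            f"{tgt_lang}:")

prompt = build_mt_prompt("The session is resumed.")
print(prompt)
```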
Usage
import torch
from src.model.adapters import TransformerProjector

# Rebuild the projector with the architecture used in training
# (1024-dim speech features -> 2560-dim LLM embeddings).
projector = TransformerProjector(speech_dim=1024, llm_dim=2560, downsample_factor=3)

# Load the stage-1 weights (map to CPU so this works without a GPU).
state_dict = torch.load("stage1_swift/projector.pt", map_location="cpu")
projector.load_state_dict(state_dict)
projector.eval()
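The loaded projector then maps frozen SeamlessM4T-v2 encoder features into the LLM embedding space. A minimal shape check, with a plain `nn.Linear` standing in for the real `TransformerProjector` (the stand-in module and the frame count are assumptions):

```python
import torch
import torch.nn as nn

# Stand-in for the loaded projector: only the 1024 -> 2560 mapping
# matters for this shape check.
projector = nn.Linear(1024, 2560).eval()

# Hypothetical frozen SeamlessM4T-v2 encoder output: (batch, frames, 1024).
speech_feats = torch.randn(1, 150, 1024)
with torch.no_grad():
    llm_embeds = projector(speech_feats)
print(llm_embeds.shape)  # torch.Size([1, 150, 2560])
```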
Status
- Stage 1 (SWIFT): complete on Mac M4 Pro 48GB
- Stage 2: needs re-training with fixed prompts (SWIFT native Qwen3)
- Stage 3: needs cluster (A100) for multimodal merge