
Evoxtral Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Finetune Voxtral-Mini-3B to produce transcriptions with inline ElevenLabs v3 audio tags, with a full data pipeline, eval suite, backend API, and frontend demo.

Architecture: Reverse-pipeline synthetic data generation (tagged text -> ElevenLabs TTS -> audio) produces training pairs. LoRA finetuning on Voxtral-Mini-3B teaches the model to predict tags from audio. FastAPI backend serves the model from HuggingFace. Next.js frontend provides a demo UI.
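The reverse pipeline can be sketched in a few lines. This is an illustrative sketch only; the function and field names here are placeholders, not the project's final API:

```python
import re

def strip_tags(tagged: str) -> str:
    """Plain transcription = tagged text minus [tags]."""
    return re.sub(r"\[[^\]]+\]\s*", "", tagged).strip()

def build_training_pair(tagged_text: str, synthesize) -> dict:
    """Reverse pipeline: one tagged script yields both the TTS input and the training target."""
    return {
        "audio": synthesize(tagged_text),  # ElevenLabs v3 renders the inline tags
        "tagged_text": tagged_text,        # label the finetuned model must predict
        "plain_text": strip_tags(tagged_text),
    }

# Stubbed TTS call for illustration
pair = build_training_pair("[laughs] That was GREAT.", lambda text: b"\x00fake-audio")
```

The key property: because the tagged script is the TTS input, it is ground truth for the audio by construction, so no human tag annotation is needed.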

Tech Stack: Python 3.11+, transformers, peft, torchaudio, wandb, jiwer, elevenlabs SDK, FastAPI, Next.js, shadcn/ui


Task 1: Project Scaffolding & Dependencies

Files:

  • Create: pyproject.toml
  • Create: requirements.txt
  • Create: data/README.md
  • Create: data/scripts/
  • Create: data/audio/
  • Create: data/processed/
  • Create: src/__init__.py
  • Create: src/data/__init__.py
  • Create: src/training/__init__.py
  • Create: src/eval/__init__.py
  • Create: src/api/__init__.py
  • Create: tests/__init__.py

Step 1: Create requirements.txt

# Data pipeline
elevenlabs>=1.0.0
openai>=1.0.0

# Training
torch>=2.1.0
torchaudio>=2.1.0
transformers>=4.54.0
peft>=0.7.0
datasets>=2.14.0
accelerate>=0.25.0
bitsandbytes>=0.41.0

# Eval
jiwer>=3.0.0

# Testing
pytest>=7.4.0

# Tracking
wandb>=0.16.0

# API
fastapi>=0.104.0
uvicorn>=0.24.0
python-multipart>=0.0.6
websockets>=12.0

# Utils
python-dotenv>=1.0.0
soundfile>=0.12.0
librosa>=0.10.0
tqdm>=4.66.0

Step 2: Create directory structure

Run:

mkdir -p src/{data,training,eval,api} tests data/{scripts,audio,processed}
touch src/__init__.py src/data/__init__.py src/training/__init__.py src/eval/__init__.py src/api/__init__.py tests/__init__.py

Step 3: Create .env template

Create .env.example:

ELEVENLABS_API_KEY_1=sk_...
ELEVENLABS_API_KEY_2=sk_...
ELEVENLABS_API_KEY_3=sk_...
WANDB_API_KEY=...
WANDB_PROJECT=evoxtral
HF_TOKEN=...
MISTRAL_API_KEY=...

Step 4: Install dependencies

Run: pip install -r requirements.txt

Step 5: Commit

git add -A
git commit -m "chore: scaffold project structure and dependencies"

Task 2: Script Generator (Tagged Text Creation)

Files:

  • Create: src/data/tag_taxonomy.py
  • Create: src/data/script_generator.py
  • Create: tests/test_script_generator.py

Step 1: Write tag taxonomy

src/data/tag_taxonomy.py:

"""ElevenLabs v3 audio tag taxonomy for Evoxtral."""

EMOTION_TAGS = ["excited", "sad", "angry", "nervous", "calm", "frustrated"]
NONVERBAL_TAGS = ["laughs", "sighs", "gasps", "clears throat", "crying"]
DELIVERY_TAGS = ["whispers", "shouts", "stammers"]
PAUSE_TAGS = ["pause"]

ALL_BRACKET_TAGS = EMOTION_TAGS + NONVERBAL_TAGS + DELIVERY_TAGS + PAUSE_TAGS

# Slice definitions for balanced dataset
SLICE_CONFIG = {
    "plain": {"ratio": 0.25, "tag_density": 0, "description": "No tags, plain ASR"},
    "light": {"ratio": 0.25, "tag_density": (1, 2), "description": "1-2 tags per sample"},
    "moderate": {"ratio": 0.25, "tag_density": (3, 4), "description": "3-4 tags per sample"},
    "dense": {"ratio": 0.15, "tag_density": (5, 8), "description": "5+ tags per sample"},
    "edge": {"ratio": 0.10, "tag_density": (1, 6), "description": "Edge cases: ambiguous, boundary"},
}

DOMAINS = [
    "conversation", "monologue", "podcast", "presentation",
    "argument", "storytelling", "interview", "voicemail"
]

# Semantic groups for eval (tags within a group are considered equivalent)
TAG_SEMANTIC_GROUPS = {
    "laughter": ["laughs", "giggles", "chuckles"],
    "sadness": ["sad", "crying", "sorrowful"],
    "breathing": ["sighs", "gasps", "exhales"],
    "loud": ["shouts", "yells"],
    "quiet": ["whispers", "murmurs"],
}

Step 2: Write script generator

src/data/script_generator.py:

"""Generate diverse tagged scripts for ElevenLabs v3 TTS synthesis."""

import json
import random
import os
from pathlib import Path
from openai import OpenAI
from .tag_taxonomy import ALL_BRACKET_TAGS, SLICE_CONFIG, DOMAINS

SYSTEM_PROMPT = """You are a script writer generating realistic speech samples with inline ElevenLabs v3 audio tags.

Audio tags use square brackets: [laughs], [excited], [whispers], [pause], etc.
Emphasis uses CAPITALIZATION of stressed words.
Pauses use ellipses ...

Rules:
- Write natural, diverse dialogue/monologue snippets (15-80 words)
- Tags must feel organic, not forced
- Vary domains: conversation, podcast, storytelling, argument, etc.
- Include a mix of male/female perspectives
- Each sample should be self-contained (no context needed)

Available tags: {tags}

Output ONLY a JSON array of objects with fields:
- "tagged_text": the text with inline tags
- "plain_text": same text with all tags and CAPS emphasis removed
- "tags_used": list of tags used (without brackets)
- "domain": one of {domains}
- "tag_count": number of tags used
"""


def generate_scripts_for_slice(
    slice_name: str,
    count: int,
    client: OpenAI | None = None,
    model: str = "mistral-large-latest",
) -> list[dict]:
    """Generate tagged scripts for a given slice type."""
    if client is None:
        client = OpenAI(
            api_key=os.getenv("MISTRAL_API_KEY"),
            base_url="https://api.mistral.ai/v1",
        )

    config = SLICE_CONFIG[slice_name]
    tag_density = config["tag_density"]

    if slice_name == "plain":
        density_instruction = "Do NOT include any audio tags or CAPS emphasis. Plain speech only."
    elif isinstance(tag_density, tuple):
        density_instruction = f"Use exactly {tag_density[0]}-{tag_density[1]} audio tags per sample."
    else:
        density_instruction = f"Use exactly {tag_density} audio tags per sample."

    if slice_name == "edge":
        density_instruction += (
            " Include edge cases: tags at the very start/end, "
            "back-to-back tags like [angry][laughs], "
            "ambiguous emotions, very short utterances with tags."
        )

    scripts = []
    batch_size = 20  # generate 20 at a time

    for i in range(0, count, batch_size):
        n = min(batch_size, count - i)
        prompt = (
            f"Generate exactly {n} speech samples.\n"
            f"Slice type: {slice_name} - {config['description']}\n"
            f"Tag density: {density_instruction}\n"
            f"Distribute evenly across these domains: {', '.join(DOMAINS)}"
        )

        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "system",
                    "content": SYSTEM_PROMPT.format(
                        tags=", ".join(ALL_BRACKET_TAGS),
                        domains=", ".join(DOMAINS),
                    ),
                },
                {"role": "user", "content": prompt},
            ],
            temperature=0.9,
            response_format={"type": "json_object"},
        )

        try:
            content = response.choices[0].message.content
            parsed = json.loads(content)
            batch = parsed if isinstance(parsed, list) else parsed.get("samples", parsed.get("scripts", []))
            for item in batch:
                item["slice_type"] = slice_name
            scripts.extend(batch)
        except (json.JSONDecodeError, TypeError, KeyError) as e:
            print(f"Failed to parse batch {i // batch_size}: {e}")
            continue

    return scripts[:count]


def generate_full_dataset(total: int = 1000, output_path: str = "data/scripts/scripts.json") -> list[dict]:
    """Generate the full balanced dataset of tagged scripts."""
    all_scripts = []

    for slice_name, config in SLICE_CONFIG.items():
        count = int(total * config["ratio"])
        print(f"Generating {count} {slice_name} scripts...")
        scripts = generate_scripts_for_slice(slice_name, count)
        all_scripts.extend(scripts)
        print(f"  Got {len(scripts)} scripts")

    random.shuffle(all_scripts)

    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(all_scripts, f, indent=2)

    print(f"Total: {len(all_scripts)} scripts saved to {output_path}")
    return all_scripts


if __name__ == "__main__":
    from dotenv import load_dotenv
    load_dotenv()
    generate_full_dataset()
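For reference, one record in scripts.json should look roughly like this (a hand-written illustration of the schema, not actual model output):

```python
# Illustrative record matching the SYSTEM_PROMPT output schema
example_record = {
    "tagged_text": "[excited] We actually WON the whole thing! [laughs]",
    "plain_text": "We actually won the whole thing!",  # tags and CAPS emphasis removed
    "tags_used": ["excited", "laughs"],
    "domain": "conversation",
    "tag_count": 2,
    "slice_type": "light",  # added by generate_scripts_for_slice
}
```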

Step 3: Write tests

tests/test_script_generator.py:

"""Tests for script generator output format and balance."""

import json
import pytest
from src.data.tag_taxonomy import SLICE_CONFIG, ALL_BRACKET_TAGS


def test_slice_ratios_sum_to_one():
    total = sum(c["ratio"] for c in SLICE_CONFIG.values())
    assert abs(total - 1.0) < 0.01


def test_all_tags_present():
    assert len(ALL_BRACKET_TAGS) >= 15


def validate_script_format(script: dict):
    """Validate a single script has required fields."""
    assert "tagged_text" in script
    assert "plain_text" in script
    assert "slice_type" in script
    assert len(script["tagged_text"]) > 0
    assert len(script["plain_text"]) > 0

Step 4: Run tests

Run: python -m pytest tests/test_script_generator.py -v
Expected: PASS

Step 5: Commit

git add src/data/ tests/test_script_generator.py
git commit -m "feat: add script generator with balanced tag taxonomy"

Task 3: ElevenLabs Synthesizer (Audio Generation)

Files:

  • Create: src/data/synthesizer.py
  • Create: tests/test_synthesizer.py

Step 1: Write synthesizer with key rotation and retries

src/data/synthesizer.py:

"""ElevenLabs v3 TTS synthesizer with API key rotation and concurrency."""

import asyncio
import os
import json
import hashlib
from pathlib import Path
from itertools import cycle
from dataclasses import dataclass
from elevenlabs import ElevenLabs
from dotenv import load_dotenv

load_dotenv()

# Voice IDs to cycle through (select 6-8 diverse voices)
# These should be populated with actual ElevenLabs voice IDs
DEFAULT_VOICES = [
    # Fill with actual voice IDs from ElevenLabs library
    # Mix: 2 male adult, 2 female adult, 1 young male, 1 young female, 1 older male, 1 older female
]

MODEL_ID = "eleven_v3"


@dataclass
class SynthesisResult:
    script_index: int
    audio_path: str
    voice_id: str
    success: bool
    error: str | None = None


def get_api_keys() -> list[str]:
    """Load all ElevenLabs API keys from environment."""
    keys = []
    for key, value in os.environ.items():
        if key.startswith("ELEVENLABS_API_KEY"):
            keys.append(value)
    if not keys:
        raise ValueError("No ELEVENLABS_API_KEY* found in environment")
    return keys


def create_clients(keys: list[str]) -> list[ElevenLabs]:
    """Create ElevenLabs client per API key."""
    return [ElevenLabs(api_key=key) for key in keys]


def synthesize_one(
    client: ElevenLabs,
    text: str,
    voice_id: str,
    output_path: str,
) -> bool:
    """Synthesize a single text to audio file."""
    try:
        audio_generator = client.text_to_speech.convert(
            text=text,
            voice_id=voice_id,
            model_id=MODEL_ID,
            output_format="mp3_44100_128",
        )
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
        with open(output_path, "wb") as f:
            for chunk in audio_generator:
                f.write(chunk)
        return True
    except Exception as e:
        print(f"Synthesis failed: {e}")
        return False


def synthesize_dataset(
    scripts_path: str = "data/scripts/scripts.json",
    output_dir: str = "data/audio",
    voices: list[str] | None = None,
    max_retries: int = 2,
) -> list[SynthesisResult]:
    """Synthesize all scripts to audio, rotating keys and voices."""
    with open(scripts_path) as f:
        scripts = json.load(f)

    keys = get_api_keys()
    clients = create_clients(keys)
    client_cycle = cycle(clients)
    voice_list = voices or DEFAULT_VOICES
    if not voice_list:
        raise ValueError("No voices configured: populate DEFAULT_VOICES or pass voices=")
    voice_cycle = cycle(voice_list)

    results = []
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    for i, script in enumerate(scripts):
        text = script["tagged_text"]
        voice_id = next(voice_cycle)
        client = next(client_cycle)

        # Deterministic filename from content hash
        content_hash = hashlib.md5(text.encode()).hexdigest()[:8]
        audio_path = f"{output_dir}/{i:04d}_{content_hash}.mp3"

        # Skip if already exists
        if Path(audio_path).exists():
            results.append(SynthesisResult(i, audio_path, voice_id, True))
            continue

        success = False
        error = None
        for attempt in range(max_retries + 1):
            success = synthesize_one(client, text, voice_id, audio_path)
            if success:
                error = None  # clear any error from an earlier failed attempt
                break
            # Rotate to next client on failure
            client = next(client_cycle)
            error = f"Failed after {attempt + 1} attempts"

        results.append(SynthesisResult(i, audio_path, voice_id, success, error))

        if (i + 1) % 50 == 0:
            ok = sum(1 for r in results if r.success)
            print(f"Progress: {i + 1}/{len(scripts)} | Success: {ok}/{i + 1}")

    # Save manifest
    manifest = []
    for r, script in zip(results, scripts):
        manifest.append({
            **script,
            "audio_path": r.audio_path,
            "voice_id": r.voice_id,
            "synthesis_success": r.success,
        })

    manifest_path = f"{output_dir}/manifest.json"
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)

    ok = sum(1 for r in results if r.success)
    print(f"Done: {ok}/{len(scripts)} successful. Manifest: {manifest_path}")
    return results


if __name__ == "__main__":
    synthesize_dataset()
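The rotation mechanics above reduce to `itertools.cycle`; a toy demonstration with placeholder client names:

```python
from itertools import cycle

# Stand-ins for the ElevenLabs clients built from three API keys
clients = cycle(["client_A", "client_B", "client_C"])
picks = [next(clients) for _ in range(5)]
# Round-robin: A, B, C, then wraps back to A, B
```

The same `cycle` drives voice assignment, so audio is spread evenly across keys and voices without any bookkeeping.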

Step 2: Write test

tests/test_synthesizer.py:

"""Tests for synthesizer key rotation and output format."""

from src.data.synthesizer import get_api_keys, SynthesisResult


def test_synthesis_result_format():
    r = SynthesisResult(0, "data/audio/0000.mp3", "voice_123", True)
    assert r.success
    assert r.error is None

Step 3: Run test

Run: python -m pytest tests/test_synthesizer.py -v
Expected: PASS

Step 4: Commit

git add src/data/synthesizer.py tests/test_synthesizer.py
git commit -m "feat: add ElevenLabs synthesizer with key rotation"

Task 4: Dataset Formatter (HuggingFace Datasets)

Files:

  • Create: src/data/formatter.py
  • Create: tests/test_formatter.py

Step 1: Write dataset formatter

src/data/formatter.py:

"""Format synthesized data into HuggingFace Dataset for training."""

import json
import re
from pathlib import Path
from datasets import Dataset, Audio, Features, Value, Sequence


def extract_tags(tagged_text: str) -> list[dict]:
    """Extract tags and their positions from tagged text."""
    tags = []
    # Match [tag] patterns
    for match in re.finditer(r'\[([^\]]+)\]', tagged_text):
        tags.append({
            "tag": match.group(1),
            "start_char": match.start(),
            "end_char": match.end(),
        })
    return tags


def strip_tags(tagged_text: str) -> str:
    """Remove all [tags] from text, leaving plain transcription."""
    return re.sub(r'\[[^\]]+\]\s*', '', tagged_text).strip()


def has_emphasis(text: str) -> bool:
    """Check if text contains CAPS emphasis (a word of 2+ uppercase letters)."""
    return bool(re.search(r'\b[A-Z]{2,}\b', text))


def format_dataset(
    manifest_path: str = "data/audio/manifest.json",
    output_dir: str = "data/processed",
) -> Dataset:
    """Convert manifest + audio files into a HuggingFace Dataset."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    # Filter to successful syntheses only
    records = [m for m in manifest if m.get("synthesis_success", True)]

    rows = []
    for record in records:
        tagged_text = record["tagged_text"]
        plain_text = record.get("plain_text", strip_tags(tagged_text))
        tags = extract_tags(tagged_text)

        rows.append({
            "audio": record["audio_path"],
            "tagged_text": tagged_text,
            "plain_text": plain_text,
            "tags": json.dumps(tags),
            "tag_count": len(tags),
            "has_emphasis": has_emphasis(tagged_text),
            "slice_type": record.get("slice_type", "unknown"),
            "domain": record.get("domain", "unknown"),
            "voice_id": record.get("voice_id", "unknown"),
        })

    ds = Dataset.from_dict({k: [r[k] for r in rows] for k in rows[0]})
    ds = ds.cast_column("audio", Audio(sampling_rate=16000))

    # Train/val/test split: 80/10/10
    ds_split = ds.train_test_split(test_size=0.2, seed=42)
    val_test = ds_split["test"].train_test_split(test_size=0.5, seed=42)

    from datasets import DatasetDict
    final = DatasetDict({
        "train": ds_split["train"],
        "validation": val_test["train"],
        "test": val_test["test"],
    })

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    final.save_to_disk(output_dir)
    print(f"Dataset saved to {output_dir}")
    print(f"  Train: {len(final['train'])}, Val: {len(final['validation'])}, Test: {len(final['test'])}")

    return final


if __name__ == "__main__":
    format_dataset()

Step 2: Write tests

tests/test_formatter.py:

"""Tests for dataset formatter tag extraction and stripping."""

from src.data.formatter import extract_tags, strip_tags, has_emphasis


def test_extract_tags():
    text = "[excited] Hello WORLD! [laughs] That was great."
    tags = extract_tags(text)
    assert len(tags) == 2
    assert tags[0]["tag"] == "excited"
    assert tags[1]["tag"] == "laughs"


def test_strip_tags():
    text = "[excited] Hello WORLD! [laughs] That was great."
    plain = strip_tags(text)
    assert "[" not in plain
    assert "Hello" in plain
    assert "great" in plain


def test_strip_tags_no_tags():
    text = "Just a normal sentence."
    assert strip_tags(text) == text


def test_has_emphasis():
    assert has_emphasis("That was AMAZING")
    assert not has_emphasis("That was amazing")
    assert has_emphasis("I can't BELIEVE this HAPPENED")


def test_extract_tags_empty():
    assert extract_tags("No tags here") == []

Step 3: Run tests

Run: python -m pytest tests/test_formatter.py -v
Expected: PASS

Step 4: Commit

git add src/data/formatter.py tests/test_formatter.py
git commit -m "feat: add HuggingFace dataset formatter with tag extraction"

Task 5: LoRA Finetuning Script

Files:

  • Create: src/training/finetune.py
  • Create: src/training/config.py

Step 1: Write training config

src/training/config.py:

"""Training configuration for Evoxtral LoRA finetuning."""

from dataclasses import dataclass, field


@dataclass
class EvoxtralTrainingConfig:
    # Model
    model_name: str = "mistralai/Voxtral-Mini-3B-2507"

    # LoRA
    lora_rank: int = 64
    lora_alpha: int = 128
    lora_dropout: float = 0.05
    lora_target_modules: list[str] = field(default_factory=lambda: [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "multi_modal_projector.linear_1",
        "multi_modal_projector.linear_2",
    ])

    # Training
    num_train_epochs: int = 1
    per_device_train_batch_size: int = 4
    gradient_accumulation_steps: int = 4
    learning_rate: float = 2e-4
    warmup_steps: int = 50
    weight_decay: float = 0.01
    bf16: bool = True

    # NEFTune
    neftune_noise_alpha: float = 5.0

    # Data
    dataset_path: str = "data/processed"
    max_seq_length: int = 2048

    # Output
    output_dir: str = "model/evoxtral-lora"
    logging_steps: int = 10
    save_steps: int = 100
    eval_steps: int = 50

    # W&B
    wandb_project: str = "evoxtral"
    report_to: str = "wandb"

    # HuggingFace
    hub_model_id: str = "mistral-hackaton-2026/evoxtral-lora"
    push_to_hub: bool = True
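With these defaults, the effective batch size per optimizer step is the product of the per-device batch size and the accumulation steps:

```python
# Effective batch size implied by the config defaults above
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
# 16 samples per optimizer step (per device)
```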

Step 2: Write finetuning script

src/training/finetune.py:

"""LoRA finetuning script for Voxtral-Mini-3B with NEFTune."""

import os
import torch
from datasets import load_from_disk
from transformers import (
    AutoProcessor,
    VoxtralForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)
from peft import LoraConfig, get_peft_model, TaskType
from dotenv import load_dotenv

from .config import EvoxtralTrainingConfig

load_dotenv()


def load_dataset(config: EvoxtralTrainingConfig):
    """Load the processed dataset."""
    ds = load_from_disk(config.dataset_path)
    return ds["train"], ds["validation"]


def setup_model_and_processor(config: EvoxtralTrainingConfig):
    """Load Voxtral-Mini-3B and configure LoRA."""
    processor = AutoProcessor.from_pretrained(config.model_name)

    # Voxtral has a native class in transformers >= 4.54;
    # AutoModelForVision2Seq does not cover audio-text models.
    model = VoxtralForConditionalGeneration.from_pretrained(
        config.model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # Configure LoRA
    lora_config = LoraConfig(
        r=config.lora_rank,
        lora_alpha=config.lora_alpha,
        lora_dropout=config.lora_dropout,
        target_modules=config.lora_target_modules,
        task_type=TaskType.CAUSAL_LM,
        bias="none",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    return model, processor


def preprocess_function(examples, processor):
    """Preprocess audio + tagged text into model inputs.

    The prompt format for Voxtral is:
    [AUDIO]...[/AUDIO] <transcribe>

    We replace the standard transcription target with tagged text.
    """
    # Process audio
    audios = [ex["array"] for ex in examples["audio"]]
    sampling_rate = examples["audio"][0]["sampling_rate"]

    # The target is the tagged transcription
    texts = examples["tagged_text"]

    # Use processor to create inputs
    inputs = processor(
        audios=audios,
        text=["<transcribe>" for _ in texts],
        sampling_rate=sampling_rate,
        return_tensors="pt",
        padding=True,
    )

    # Tokenize the tagged transcription targets
    labels = processor.tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
    )

    # For causal-LM SFT the model input must contain prompt + target tokens.
    # Mask the prompt (and target padding) with -100 so loss is computed
    # only on the tagged transcription.
    prompt_ids = inputs["input_ids"]
    target_ids = labels["input_ids"]
    target_labels = target_ids.masked_fill(labels["attention_mask"] == 0, -100)

    inputs["input_ids"] = torch.cat([prompt_ids, target_ids], dim=1)
    inputs["attention_mask"] = torch.cat(
        [inputs["attention_mask"], labels["attention_mask"]], dim=1
    )
    inputs["labels"] = torch.cat(
        [torch.full_like(prompt_ids, -100), target_labels], dim=1
    )

    return inputs


def train(config: EvoxtralTrainingConfig | None = None):
    """Run the full finetuning pipeline."""
    if config is None:
        config = EvoxtralTrainingConfig()

    print("Loading dataset...")
    train_ds, val_ds = load_dataset(config)
    print(f"Train: {len(train_ds)}, Val: {len(val_ds)}")

    print("Loading model...")
    model, processor = setup_model_and_processor(config)

    training_args = Seq2SeqTrainingArguments(
        output_dir=config.output_dir,
        num_train_epochs=config.num_train_epochs,
        per_device_train_batch_size=config.per_device_train_batch_size,
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        learning_rate=config.learning_rate,
        warmup_steps=config.warmup_steps,
        weight_decay=config.weight_decay,
        bf16=config.bf16,
        logging_steps=config.logging_steps,
        save_steps=config.save_steps,
        eval_steps=config.eval_steps,
        eval_strategy="steps",
        save_total_limit=3,
        load_best_model_at_end=True,
        report_to=config.report_to,
        run_name="evoxtral-sft-lora",
        neftune_noise_alpha=config.neftune_noise_alpha,
        push_to_hub=config.push_to_hub,
        hub_model_id=config.hub_model_id,
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        processing_class=processor,
    )

    print("Starting training...")
    trainer.train()

    print("Saving model...")
    trainer.save_model(config.output_dir)
    processor.save_pretrained(config.output_dir)

    if config.push_to_hub:
        print("Pushing to HuggingFace Hub...")
        trainer.push_to_hub()

    print("Done!")


if __name__ == "__main__":
    train()

Step 3: Commit

git add src/training/
git commit -m "feat: add LoRA finetuning script with NEFTune and W&B"

Task 6: Evoxtral-Bench (Evaluation Suite)

Files:

  • Create: src/eval/bench.py
  • Create: src/eval/tag_metrics.py
  • Create: src/eval/roundtrip.py
  • Create: tests/test_eval.py

Step 1: Write tag metrics

src/eval/tag_metrics.py:

"""Tag-level evaluation metrics for Evoxtral-Bench."""

import re
from dataclasses import dataclass
from src.data.tag_taxonomy import TAG_SEMANTIC_GROUPS


@dataclass
class TagMetrics:
    precision: float
    recall: float
    f1: float
    position_accuracy: float
    total_predicted: int
    total_ground_truth: int


def extract_tag_list(text: str) -> list[str]:
    """Extract ordered list of tags from tagged text."""
    return re.findall(r'\[([^\]]+)\]', text)


def extract_tag_positions(text: str) -> list[tuple[str, int]]:
    """Extract tags with their word-position index."""
    tags_with_pos = []
    # Split on [tag] tokens so we can count the words preceding each tag
    parts = re.split(r'(\[[^\]]+\])', text)
    word_idx = 0
    for part in parts:
        if part.startswith('[') and part.endswith(']'):
            tags_with_pos.append((part[1:-1], word_idx))
        else:
            word_idx += len(part.split())
    return tags_with_pos


def normalize_tag(tag: str) -> str:
    """Normalize tag to canonical form using semantic groups."""
    tag_lower = tag.lower().strip()
    for canonical, variants in TAG_SEMANTIC_GROUPS.items():
        if tag_lower in variants:
            return canonical
    return tag_lower


def compute_tag_metrics(
    predicted: str,
    ground_truth: str,
    position_tolerance: int = 3,
) -> TagMetrics:
    """Compute tag-level precision, recall, F1, and position accuracy."""
    pred_tags = extract_tag_list(predicted)
    gt_tags = extract_tag_list(ground_truth)

    pred_normalized = [normalize_tag(t) for t in pred_tags]
    gt_normalized = [normalize_tag(t) for t in gt_tags]

    # Tag presence F1 (bag-of-tags)
    pred_set = set(pred_normalized)
    gt_set = set(gt_normalized)

    if len(pred_set) == 0 and len(gt_set) == 0:
        precision = recall = f1 = 1.0
    elif len(pred_set) == 0:
        precision = recall = f1 = 0.0
    elif len(gt_set) == 0:
        precision = 0.0
        recall = 1.0
        f1 = 0.0
    else:
        tp = len(pred_set & gt_set)
        precision = tp / len(pred_set)
        recall = tp / len(gt_set)
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    # Position accuracy
    pred_positions = extract_tag_positions(predicted)
    gt_positions = extract_tag_positions(ground_truth)

    if len(gt_positions) == 0:
        position_accuracy = 1.0 if len(pred_positions) == 0 else 0.0
    else:
        correct_positions = 0
        gt_used = [False] * len(gt_positions)
        for p_tag, p_pos in pred_positions:
            p_norm = normalize_tag(p_tag)
            for j, (g_tag, g_pos) in enumerate(gt_positions):
                if not gt_used[j] and normalize_tag(g_tag) == p_norm and abs(p_pos - g_pos) <= position_tolerance:
                    correct_positions += 1
                    gt_used[j] = True
                    break
        position_accuracy = correct_positions / len(gt_positions)

    return TagMetrics(
        precision=precision,
        recall=recall,
        f1=f1,
        position_accuracy=position_accuracy,
        total_predicted=len(pred_tags),
        total_ground_truth=len(gt_tags),
    )
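A worked example of the bag-of-tags computation above, on already-normalized tag sets:

```python
# Predicted [laughs] and [excited]; ground truth had [giggles] and [sighs].
# After normalize_tag, "laughs" and "giggles" both map to "laughter".
pred = {"laughter", "excited"}
gt = {"laughter", "sighs"}

tp = len(pred & gt)                                 # 1 shared tag ("laughter")
precision = tp / len(pred)                          # 0.5
recall = tp / len(gt)                               # 0.5
f1 = 2 * precision * recall / (precision + recall)  # 0.5
```

Normalizing before set comparison is what makes "giggles" vs "laughs" count as a hit rather than a miss.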

Step 2: Write main bench runner

src/eval/bench.py:

"""Evoxtral-Bench: 3-layer evaluation for tagged transcription."""

import json
from dataclasses import dataclass, asdict
from jiwer import wer, cer
from src.data.formatter import strip_tags
from .tag_metrics import compute_tag_metrics, TagMetrics


@dataclass
class BenchResult:
    # Layer 1: Text accuracy
    wer: float
    cer: float
    # Layer 2: Tag accuracy
    tag_precision: float
    tag_recall: float
    tag_f1: float
    tag_position_accuracy: float
    # Layer 3: Round-trip (optional)
    roundtrip_score: float | None = None
    # Meta
    num_samples: int = 0


def evaluate(
    predictions: list[str],
    ground_truths: list[str],
    roundtrip_scores: list[float] | None = None,
) -> BenchResult:
    """Run full Evoxtral-Bench evaluation.

    Args:
        predictions: list of predicted tagged transcriptions
        ground_truths: list of ground-truth tagged transcriptions
        roundtrip_scores: optional list of round-trip audio similarity scores
    """
    assert len(predictions) == len(ground_truths)

    # Layer 1: WER/CER on plain text (tags stripped)
    pred_plain = [strip_tags(p) for p in predictions]
    gt_plain = [strip_tags(g) for g in ground_truths]

    # Filter empty pairs
    valid = [(p, g) for p, g in zip(pred_plain, gt_plain) if g.strip()]
    if valid:
        vp, vg = zip(*valid)
        text_wer = wer(list(vg), list(vp))
        text_cer = cer(list(vg), list(vp))
    else:
        text_wer = text_cer = 0.0

    # Layer 2: Tag metrics
    all_tag_metrics = [
        compute_tag_metrics(p, g)
        for p, g in zip(predictions, ground_truths)
    ]

    avg_precision = sum(m.precision for m in all_tag_metrics) / len(all_tag_metrics)
    avg_recall = sum(m.recall for m in all_tag_metrics) / len(all_tag_metrics)
    avg_f1 = sum(m.f1 for m in all_tag_metrics) / len(all_tag_metrics)
    avg_pos = sum(m.position_accuracy for m in all_tag_metrics) / len(all_tag_metrics)

    # Layer 3: Round-trip
    rt_score = None
    if roundtrip_scores:
        rt_score = sum(roundtrip_scores) / len(roundtrip_scores)

    return BenchResult(
        wer=text_wer,
        cer=text_cer,
        tag_precision=avg_precision,
        tag_recall=avg_recall,
        tag_f1=avg_f1,
        tag_position_accuracy=avg_pos,
        roundtrip_score=rt_score,
        num_samples=len(predictions),
    )


def print_results(result: BenchResult):
    """Pretty-print Evoxtral-Bench results."""
    print("\n" + "=" * 50)
    print("EVOXTRAL-BENCH RESULTS")
    print("=" * 50)
    print(f"\nLayer 1 - Text Accuracy:")
    print(f"  WER:  {result.wer:.2%}")
    print(f"  CER:  {result.cer:.2%}")
    print(f"\nLayer 2 - Tag Accuracy:")
    print(f"  Precision: {result.tag_precision:.2%}")
    print(f"  Recall:    {result.tag_recall:.2%}")
    print(f"  F1:        {result.tag_f1:.2%}")
    print(f"  Position:  {result.tag_position_accuracy:.2%}")
    if result.roundtrip_score is not None:
        print(f"\nLayer 3 - Round-Trip:")
        print(f"  Score: {result.roundtrip_score:.2%}")
    print(f"\nSamples: {result.num_samples}")
    print("=" * 50)


def save_results(result: BenchResult, path: str = "eval_results.json"):
    """Save results to JSON."""
    with open(path, "w") as f:
        json.dump(asdict(result), f, indent=2)
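
Layer 1 delegates to jiwer's `wer`; for reference, a minimal from-scratch equivalent (illustrative and unoptimized, standard Levenshtein over words):

```python
def word_error_rate(ref: list[str], hyp: list[str]) -> float:
    """WER = word-level edit distance (sub + del + ins) / reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution / match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# 1 substitution (world -> word) + 1 deletion (you) over 5 reference words
score = word_error_rate("hello world how are you".split(), "hello word how are".split())
```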

Step 3: Write round-trip evaluator stub

src/eval/roundtrip.py:

"""Round-trip evaluation: tagged text -> ElevenLabs TTS -> compare to original audio."""

import os
from elevenlabs import ElevenLabs
from dotenv import load_dotenv

load_dotenv()


def roundtrip_evaluate(
    tagged_text: str,
    original_audio_path: str,
    voice_id: str,
    client: ElevenLabs | None = None,
) -> float:
    """Generate audio from tagged text and compare to original.

    Returns a similarity score between 0 and 1.
    For hackathon MVP, we use a simple duration-ratio heuristic.
    Full implementation would use audio embeddings (e.g., wav2vec2 cosine similarity).
    """
    if client is None:
        client = ElevenLabs(api_key=os.getenv("ELEVENLABS_API_KEY_1"))

    # Generate audio from tagged prediction
    try:
        audio_gen = client.text_to_speech.convert(
            text=tagged_text,
            voice_id=voice_id,
            model_id="eleven_v3",
        )
        # For MVP: return 1.0 if synthesis succeeds (validates tag format)
        # TODO: implement audio embedding comparison
        return 1.0
    except Exception:
        return 0.0
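
The duration-ratio heuristic the docstring alludes to could look like the following. This is a sketch using the stdlib `wave` module, so it assumes both files are WAV; `duration_ratio_score` is a hypothetical helper, not part of the plan's API:

```python
import wave

def duration_ratio_score(generated_path: str, original_path: str) -> float:
    """min/max duration ratio: 1.0 for equal lengths, toward 0 as they diverge."""
    def seconds(path: str) -> float:
        with wave.open(path, "rb") as w:
            return w.getnframes() / w.getframerate()

    d_gen, d_orig = seconds(generated_path), seconds(original_path)
    if max(d_gen, d_orig) == 0:
        return 0.0
    return min(d_gen, d_orig) / max(d_gen, d_orig)
```

The intuition: if the re-synthesized audio is wildly shorter or longer than the original, the predicted tags probably derailed synthesis.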

Step 4: Write tests

tests/test_eval.py:

"""Tests for Evoxtral-Bench evaluation metrics."""

from src.eval.tag_metrics import (
    extract_tag_list,
    extract_tag_positions,
    normalize_tag,
    compute_tag_metrics,
)
from src.eval.bench import evaluate


def test_extract_tag_list():
    text = "[excited] Hello! [laughs] That was fun."
    tags = extract_tag_list(text)
    assert tags == ["excited", "laughs"]


def test_extract_tag_list_empty():
    assert extract_tag_list("No tags here.") == []


def test_normalize_tag_semantic_group():
    assert normalize_tag("giggles") == "laughter"
    assert normalize_tag("laughs") == "laughter"
    assert normalize_tag("excited") == "excited"  # no group, return as-is


def test_perfect_match():
    pred = "[excited] Hello WORLD! [laughs]"
    gt = "[excited] Hello WORLD! [laughs]"
    m = compute_tag_metrics(pred, gt)
    assert m.f1 == 1.0
    assert m.position_accuracy == 1.0


def test_missing_tag():
    pred = "[excited] Hello!"
    gt = "[excited] Hello! [laughs]"
    m = compute_tag_metrics(pred, gt)
    assert m.recall < 1.0
    assert m.precision == 1.0


def test_extra_tag():
    pred = "[excited] Hello! [laughs] [sighs]"
    gt = "[excited] Hello! [laughs]"
    m = compute_tag_metrics(pred, gt)
    assert m.precision < 1.0
    assert m.recall == 1.0


def test_no_tags_both():
    m = compute_tag_metrics("Hello world", "Hello world")
    assert m.f1 == 1.0


def test_full_bench():
    preds = ["[excited] Hello!", "Goodbye [sighs]"]
    gts = ["[excited] Hello!", "Goodbye [sighs]"]
    result = evaluate(preds, gts)
    assert result.wer < 0.01
    assert result.tag_f1 == 1.0
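
The semantic-group behavior asserted in `test_normalize_tag_semantic_group` implies a mapping roughly like the following. This is a hypothetical sketch; the real table lives in `src/eval/tag_metrics.py` and likely groups more tags:

```python
# Hypothetical grouping table consistent with the tests above.
TAG_GROUPS = {
    "laughter": {"laughs", "laughing", "giggles", "giggling", "chuckles"},
    "whisper": {"whispers", "whispering"},
}

def normalize_tag(tag: str) -> str:
    """Map a raw tag to its semantic group; pass through ungrouped tags."""
    t = tag.lower().strip()
    for group, members in TAG_GROUPS.items():
        if t in members:
            return group
    return t

print(normalize_tag("giggles"))  # laughter
```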

Step 5: Run tests

Run: `python -m pytest tests/test_eval.py -v`

Expected: PASS

Step 6: Commit

git add src/eval/ tests/test_eval.py
git commit -m "feat: add Evoxtral-Bench evaluation suite (WER + Tag F1 + round-trip)"

Task 7: Backend API (FastAPI)

Files:

  • Create: src/api/main.py
  • Create: src/api/model_service.py
  • Create: src/api/schemas.py

Step 1: Write schemas

src/api/schemas.py:

"""API request/response schemas."""

from pydantic import BaseModel


class TranscribeResponse(BaseModel):
    tagged_text: str
    plain_text: str
    tags: list[dict]
    processing_time_ms: float


class HealthResponse(BaseModel):
    status: str
    model_loaded: bool

Step 2: Write model service

src/api/model_service.py:

"""Model loading and inference service."""

import torch
import torchaudio
import time
from pathlib import Path
from transformers import AutoProcessor, VoxtralForConditionalGeneration
from peft import PeftModel
from src.data.formatter import extract_tags, strip_tags


class ModelService:
    def __init__(
        self,
        base_model: str = "mistralai/Voxtral-Mini-3B-2507",
        adapter_path: str | None = None,
    ):
        self.base_model = base_model
        self.adapter_path = adapter_path
        self.model = None
        self.processor = None

    def load(self):
        """Load model and processor."""
        self.processor = AutoProcessor.from_pretrained(self.base_model)
        # Voxtral is an audio-text model; Vision2Seq is the wrong auto-class.
        # Native support landed in transformers >= 4.54, so no remote code needed.
        model = VoxtralForConditionalGeneration.from_pretrained(
            self.base_model,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        if self.adapter_path:
            model = PeftModel.from_pretrained(model, self.adapter_path)
        self.model = model
        self.model.eval()

    @property
    def is_loaded(self) -> bool:
        return self.model is not None

    def transcribe(self, audio_path: str) -> dict:
        """Transcribe audio file to tagged text."""
        if not self.is_loaded:
            raise RuntimeError("Model not loaded")

        start = time.time()

        # Load and resample audio
        waveform, sr = torchaudio.load(audio_path)
        if sr != 16000:
            resampler = torchaudio.transforms.Resample(sr, 16000)
            waveform = resampler(waveform)
        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)

        # Process (note: the exact processor kwargs for Voxtral may differ
        # across transformers versions; adjust to match the installed API)
        inputs = self.processor(
            audios=waveform.squeeze().numpy(),
            text="<transcribe>",
            sampling_rate=16000,
            return_tensors="pt",
        ).to(self.model.device)

        with torch.no_grad():
            output_ids = self.model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=False,
            )

        # generate() returns prompt + continuation; decode only the new tokens
        new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
        tagged_text = self.processor.decode(new_tokens, skip_special_tokens=True).strip()
        elapsed_ms = (time.time() - start) * 1000

        return {
            "tagged_text": tagged_text,
            "plain_text": strip_tags(tagged_text),
            "tags": extract_tags(tagged_text),
            "processing_time_ms": elapsed_ms,
        }
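
`ModelService` leans on `extract_tags` and `strip_tags` from the data-formatting task. For orientation, minimal stand-ins consistent with how they are used here (hypothetical sketches; the real implementations live in `src/data/formatter.py`):

```python
import re

# Tags look like [excited] or [clears throat]
TAG_RE = re.compile(r"\[([^\]]+)\]")

def strip_tags(text: str) -> str:
    """Remove [tags] and collapse the whitespace they leave behind."""
    return re.sub(r"\s{2,}", " ", TAG_RE.sub("", text)).strip()

def extract_tags(text: str) -> list[dict]:
    """Return each tag with its character offset in the tagged text."""
    return [{"tag": m.group(1), "char_index": m.start()} for m in TAG_RE.finditer(text)]

print(strip_tags("[excited] Hello! [laughs]"))  # Hello!
```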

Step 3: Write FastAPI app

src/api/main.py:

"""FastAPI application for Evoxtral."""

import os
import tempfile
from contextlib import asynccontextmanager
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from dotenv import load_dotenv

from .model_service import ModelService
from .schemas import TranscribeResponse, HealthResponse

load_dotenv()

model_service = ModelService(
    adapter_path=os.getenv("EVOXTRAL_ADAPTER_PATH", None),
)


@asynccontextmanager
async def lifespan(app: FastAPI):
    model_service.load()
    yield


app = FastAPI(title="Evoxtral API", lifespan=lifespan)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)


@app.get("/health", response_model=HealthResponse)
async def health():
    return HealthResponse(status="ok", model_loaded=model_service.is_loaded)


@app.post("/transcribe", response_model=TranscribeResponse)
async def transcribe(file: UploadFile = File(...)):
    if not model_service.is_loaded:
        raise HTTPException(503, "Model not loaded")

    # os.path.splitext(".wav") treats a bare extension as a filename, so
    # fall back to ".wav" explicitly when no usable suffix is present
    suffix = os.path.splitext(file.filename or "")[1] or ".wav"
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name

    try:
        result = model_service.transcribe(tmp_path)
        return TranscribeResponse(**result)
    finally:
        os.unlink(tmp_path)

Step 4: Commit

git add src/api/
git commit -m "feat: add FastAPI backend with model serving"

Task 8: Frontend (Next.js + shadcn)

Files:

  • Modify: pages/index.tsx (or create it if it does not exist)
  • Create: components/AudioRecorder.tsx
  • Create: components/TaggedTranscript.tsx
  • Create: components/AudioPlayer.tsx

This task covers the Next.js frontend with:

  • Audio upload/record functionality
  • Display of tagged transcription with color-coded tags
  • Playback of re-synthesized audio
  • API integration with the FastAPI backend

Step 1: Install frontend dependencies

Run:

npx shadcn@latest init
npm install @phosphor-icons/react framer-motion

Step 2: Create AudioRecorder component

components/AudioRecorder.tsx:

"use client";

import { useState, useRef } from "react";
import { Microphone, Stop, Upload } from "@phosphor-icons/react";

interface AudioRecorderProps {
  onAudioReady: (file: File) => void;
  isProcessing: boolean;
}

export function AudioRecorder({ onAudioReady, isProcessing }: AudioRecorderProps) {
  const [isRecording, setIsRecording] = useState(false);
  const mediaRecorderRef = useRef<MediaRecorder | null>(null);
  const chunksRef = useRef<Blob[]>([]);
  const fileInputRef = useRef<HTMLInputElement>(null);

  const startRecording = async () => {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const mediaRecorder = new MediaRecorder(stream);
    mediaRecorderRef.current = mediaRecorder;
    chunksRef.current = [];

    mediaRecorder.ondataavailable = (e) => {
      if (e.data.size > 0) chunksRef.current.push(e.data);
    };

    mediaRecorder.onstop = () => {
      const blob = new Blob(chunksRef.current, { type: "audio/webm" });
      const file = new File([blob], "recording.webm", { type: "audio/webm" });
      onAudioReady(file);
      stream.getTracks().forEach((t) => t.stop());
    };

    mediaRecorder.start();
    setIsRecording(true);
  };

  const stopRecording = () => {
    mediaRecorderRef.current?.stop();
    setIsRecording(false);
  };

  const handleFileUpload = (e: React.ChangeEvent<HTMLInputElement>) => {
    const file = e.target.files?.[0];
    if (file) onAudioReady(file);
  };

  return (
    <div className="flex items-center gap-4">
      <button
        onClick={isRecording ? stopRecording : startRecording}
        disabled={isProcessing}
        className={`flex items-center gap-2 px-6 py-3 rounded-full font-medium transition-all ${
          isRecording
            ? "bg-red-500 text-white animate-pulse"
            : "bg-zinc-900 text-white hover:bg-zinc-700"
        } disabled:opacity-50`}
      >
        {isRecording ? <Stop size={20} /> : <Microphone size={20} />}
        {isRecording ? "Stop" : "Record"}
      </button>

      <span className="text-zinc-500">or</span>

      <button
        onClick={() => fileInputRef.current?.click()}
        disabled={isProcessing}
        className="flex items-center gap-2 px-6 py-3 rounded-full border border-zinc-300 hover:bg-zinc-50 transition-all disabled:opacity-50"
      >
        <Upload size={20} />
        Upload
      </button>
      <input
        ref={fileInputRef}
        type="file"
        accept="audio/*"
        onChange={handleFileUpload}
        className="hidden"
      />
    </div>
  );
}

Step 3: Create TaggedTranscript component

components/TaggedTranscript.tsx:

"use client";

import { motion } from "framer-motion";

interface TaggedTranscriptProps {
  text: string;
}

const TAG_COLORS: Record<string, string> = {
  excited: "bg-yellow-100 text-yellow-800",
  sad: "bg-blue-100 text-blue-800",
  angry: "bg-red-100 text-red-800",
  nervous: "bg-purple-100 text-purple-800",
  calm: "bg-green-100 text-green-800",
  frustrated: "bg-orange-100 text-orange-800",
  laughs: "bg-amber-100 text-amber-800",
  sighs: "bg-slate-100 text-slate-800",
  gasps: "bg-pink-100 text-pink-800",
  whispers: "bg-indigo-100 text-indigo-800",
  shouts: "bg-red-200 text-red-900",
  pause: "bg-gray-100 text-gray-600",
  crying: "bg-blue-200 text-blue-900",
  stammers: "bg-violet-100 text-violet-800",
  "clears throat": "bg-teal-100 text-teal-800",
};

function getTagColor(tag: string): string {
  return TAG_COLORS[tag.toLowerCase()] || "bg-zinc-100 text-zinc-700";
}

export function TaggedTranscript({ text }: TaggedTranscriptProps) {
  if (!text) return null;

  // Parse text into segments: tags and plain text
  const parts = text.split(/(\[[^\]]+\])/g).filter(Boolean);

  return (
    <div className="font-mono text-lg leading-relaxed space-x-1">
      {parts.map((part, i) => {
        const tagMatch = part.match(/^\[([^\]]+)\]$/);
        if (tagMatch) {
          const tag = tagMatch[1];
          return (
            <motion.span
              key={i}
              initial={{ opacity: 0, scale: 0.8 }}
              animate={{ opacity: 1, scale: 1 }}
              className={`inline-block px-2 py-0.5 rounded-md text-sm font-semibold ${getTagColor(tag)}`}
            >
              {tag}
            </motion.span>
          );
        }
        return <span key={i}>{part}</span>;
      })}
    </div>
  );
}

Step 4: Create main page

pages/index.tsx:

import { useState } from "react";
import { AudioRecorder } from "../components/AudioRecorder";
import { TaggedTranscript } from "../components/TaggedTranscript";

const API_URL = process.env.NEXT_PUBLIC_API_URL || "http://localhost:8000";

export default function Home() {
  const [isProcessing, setIsProcessing] = useState(false);
  const [taggedText, setTaggedText] = useState("");
  const [plainText, setPlainText] = useState("");
  const [error, setError] = useState("");
  const [audioUrl, setAudioUrl] = useState<string | null>(null);

  const handleAudio = async (file: File) => {
    setIsProcessing(true);
    setError("");
    setAudioUrl(URL.createObjectURL(file));

    const formData = new FormData();
    formData.append("file", file);

    try {
      const res = await fetch(`${API_URL}/transcribe`, {
        method: "POST",
        body: formData,
      });
      if (!res.ok) throw new Error(`API error: ${res.status}`);
      const data = await res.json();
      setTaggedText(data.tagged_text);
      setPlainText(data.plain_text);
    } catch (e: any) {
      setError(e.message);
    } finally {
      setIsProcessing(false);
    }
  };

  return (
    <main className="min-h-screen bg-white">
      <div className="max-w-4xl mx-auto px-6 py-16">
        <h1 className="text-4xl font-bold tracking-tight mb-2">Evoxtral</h1>
        <p className="text-zinc-500 mb-12">
          Emotion-aware transcription with inline audio tags
        </p>

        <div className="mb-12">
          <AudioRecorder onAudioReady={handleAudio} isProcessing={isProcessing} />
        </div>

        {isProcessing && (
          <div className="text-zinc-500 animate-pulse">Transcribing...</div>
        )}

        {error && (
          <div className="text-red-500 mb-4">Error: {error}</div>
        )}

        {taggedText && (
          <div className="space-y-8">
            <div>
              <h2 className="text-sm font-medium text-zinc-400 uppercase tracking-wide mb-3">
                Tagged Transcription
              </h2>
              <div className="p-6 rounded-xl border border-zinc-200 bg-zinc-50">
                <TaggedTranscript text={taggedText} />
              </div>
            </div>

            <div>
              <h2 className="text-sm font-medium text-zinc-400 uppercase tracking-wide mb-3">
                Plain Text
              </h2>
              <div className="p-6 rounded-xl border border-zinc-200">
                <p className="font-mono text-lg">{plainText}</p>
              </div>
            </div>

            {audioUrl && (
              <div>
                <h2 className="text-sm font-medium text-zinc-400 uppercase tracking-wide mb-3">
                  Original Audio
                </h2>
                <audio controls src={audioUrl} className="w-full" />
              </div>
            )}
          </div>
        )}
      </div>
    </main>
  );
}

Step 5: Commit

git add components/ pages/
git commit -m "feat: add Next.js frontend with audio recorder and tag display"

Task 9: Data Pipeline Runner (End-to-End)

Files:

  • Create: scripts/generate_data.py
  • Create: scripts/run_training.py
  • Create: scripts/run_eval.py

Step 1: Create data generation runner

scripts/generate_data.py:

"""End-to-end data generation: scripts -> synthesis -> dataset."""

import sys
sys.path.insert(0, ".")

from dotenv import load_dotenv
load_dotenv()

from src.data.script_generator import generate_full_dataset
from src.data.synthesizer import synthesize_dataset
from src.data.formatter import format_dataset


def main():
    print("=== Step 1: Generate tagged scripts ===")
    generate_full_dataset(total=1000, output_path="data/scripts/scripts.json")

    print("\n=== Step 2: Synthesize audio via ElevenLabs ===")
    synthesize_dataset(
        scripts_path="data/scripts/scripts.json",
        output_dir="data/audio",
    )

    print("\n=== Step 3: Format HuggingFace dataset ===")
    format_dataset(
        manifest_path="data/audio/manifest.json",
        output_dir="data/processed",
    )

    print("\nData pipeline complete!")


if __name__ == "__main__":
    main()

Step 2: Create training runner

scripts/run_training.py:

"""Run LoRA finetuning with W&B tracking."""

import sys
sys.path.insert(0, ".")

from dotenv import load_dotenv
load_dotenv()

from src.training.finetune import train
from src.training.config import EvoxtralTrainingConfig


def main():
    config = EvoxtralTrainingConfig()
    train(config)


if __name__ == "__main__":
    main()

Step 3: Create eval runner

scripts/run_eval.py:

"""Run Evoxtral-Bench evaluation on test set."""

import sys
sys.path.insert(0, ".")

from dotenv import load_dotenv
load_dotenv()

import torch
from datasets import load_from_disk
from src.api.model_service import ModelService
from src.eval.bench import evaluate, print_results, save_results


def main():
    print("Loading test dataset...")
    ds = load_from_disk("data/processed")
    test_ds = ds["test"]

    print("Loading model...")
    service = ModelService(
        adapter_path="model/evoxtral-lora",
    )
    service.load()

    print(f"Running inference on {len(test_ds)} test samples...")
    predictions = []
    ground_truths = []

    import os
    import tempfile

    import soundfile as sf

    for i, example in enumerate(test_ds):
        # Write audio to a temp file for inference, then clean it up
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            sf.write(tmp.name, example["audio"]["array"], example["audio"]["sampling_rate"])
            tmp_path = tmp.name
        try:
            result = service.transcribe(tmp_path)
            predictions.append(result["tagged_text"])
            ground_truths.append(example["tagged_text"])
        finally:
            os.unlink(tmp_path)

        if (i + 1) % 10 == 0:
            print(f"  {i + 1}/{len(test_ds)}")

    print("\nRunning Evoxtral-Bench...")
    result = evaluate(predictions, ground_truths)
    print_results(result)
    save_results(result, "eval_results.json")


if __name__ == "__main__":
    main()

Step 4: Commit

git add scripts/
git commit -m "feat: add end-to-end pipeline runners (data, train, eval)"

Task 10: Integration, Model Card & Submission

Files:

  • Create: model/README.md (HuggingFace model card)
  • Modify: README.md (add setup instructions)

Step 1: Write HuggingFace model card

model/README.md:

````markdown
---
license: apache-2.0
base_model: mistralai/Voxtral-Mini-3B-2507
tags:
  - speech
  - transcription
  - emotion
  - elevenlabs
  - audio-tags
  - lora
datasets:
  - evoxtral-synthetic-1k
---

# Evoxtral

LoRA adapter for Voxtral-Mini-3B that produces transcriptions with inline ElevenLabs v3 audio tags.

## Usage

```python
from transformers import AutoProcessor, VoxtralForConditionalGeneration
from peft import PeftModel

processor = AutoProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
model = VoxtralForConditionalGeneration.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
model = PeftModel.from_pretrained(model, "mistral-hackaton-2026/evoxtral-lora")
```

## Evoxtral-Bench Results

| Metric                | Score |
| --------------------- | ----- |
| WER                   | TBD   |
| Tag F1                | TBD   |
| Tag Position Accuracy | TBD   |

## Built For

Mistral AI Online Hackathon 2026 - Fine-tuning track
````


**Step 2: Run the full pipeline**

```bash
# 1. Generate data
python scripts/generate_data.py

# 2. Train
python scripts/run_training.py

# 3. Evaluate
python scripts/run_eval.py

# 4. Start backend
uvicorn src.api.main:app --host 0.0.0.0 --port 8000

# 5. Start frontend (separate terminal)
npm run dev
```

**Step 3: Final commit**

```bash
git add -A
git commit -m "feat: complete Evoxtral pipeline - data, training, eval, API, frontend"
```