# POntAvignon-4b
A 4B reasoning model that annotates French theater programmes against the Linked Art performing-arts ontology extension. Given raw programme markdown from the Festival d'Avignon (1947–present), it extracts structured JSON-LD entities with chain-of-thought reasoning.
POntAvignon-4b is based on Qwen3 but relies on the SYNTH syntax from Pleias' Baguettotron.
## Model Details
| | |
|---|---|
| Base model | Qwen/Qwen3-4B |
| Parameters | 4B |
| Training | Full SFT, 3 epochs, bf16 |
| Context | 16k tokens |
| Output | `<think>` reasoning + Linked Art JSON-LD |
| Train loss | 0.360 |
| Token accuracy | 96.6% |
| Valid JSON rate | 97% (on held-out test set) |
## Entity Types
The model extracts 7 entity types from a single programme, each as a separate query:
| Entity | Linked Art type | Description |
|---|---|---|
| A | PropositionalObject | The abstract creative Work — title, creator, BnF role, adaptation links |
| B_meta | Activity | Production metadata — title, venue, dates, genre, festival link |
| B_cast | Activity.produced_by | All performers — actors, dancers, musicians with BnF roles and characters |
| B_crew | Activity.produced_by | Creative/technical staff — director, lighting, costumes, set design with BnF roles |
| C | Activity | Single Performance — one specific date/time, venue, parent production link |
| Festival | Activity | Overall Event — festival edition (e.g. Festival d'Avignon 1996) |
| Text | LinguisticObject | Source literary work — the play or text being adapted/performed |
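For orientation, here is a minimal sketch of what an A (Work) entity could look like, written as a Python dict. The field names (`type`, `_label`, `identified_by`, `created_by`) loosely follow general Linked Art conventions, but this specific record is illustrative only, not verified model output:

```python
import json

# Hypothetical sketch of an A (Work) entity for "Médée" by Sénèque.
# Field names loosely follow Linked Art conventions; the real model
# output may use different properties and include BnF role codes.
work = {
    "type": "PropositionalObject",
    "_label": "Médée",
    "identified_by": [{"type": "Name", "content": "Médée"}],
    "created_by": {
        "type": "Creation",
        "carried_out_by": [{"type": "Person", "_label": "Sénèque"}],
    },
}

print(json.dumps(work, ensure_ascii=False, indent=2))
```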
## Reasoning

The model uses `<think>` tags to produce SYNTH-style dense reasoning traces before outputting JSON-LD. The reasoning:
- Names the task explicitly and stays focused on it
- Engages with document structure, era, and typographic conventions
- Works through French theatrical vocabulary and BnF role mapping
- Resolves ontological boundaries (Work vs Production vs Performance)
- Uses confidence markers: ● sure, ◐ probable, ○ guess, ⚠ risk
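Because the markers are single characters, a trace's overall confidence profile can be tallied with a few lines of Python. This helper is a convenience sketch, not part of the model's tooling:

```python
from collections import Counter

# Map each SYNTH confidence marker to a readable label.
MARKERS = {"●": "sure", "◐": "probable", "○": "guess", "⚠": "risk"}

def confidence_profile(trace: str) -> dict:
    """Tally the confidence markers appearing in a reasoning trace."""
    counts = Counter(ch for ch in trace if ch in MARKERS)
    return {label: counts[mark] for mark, label in MARKERS.items()}

print(confidence_profile("● title confirmed ◐ creator role ◐ genre ⚠ date"))
# {'sure': 1, 'probable': 2, 'guess': 0, 'risk': 1}
```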
## Usage

### With vLLM (recommended)
```python
from vllm import LLM, SamplingParams

llm = LLM(model="PleIAs/linked-art-qwen3-4b", dtype="bfloat16", max_model_len=4096)
tokenizer = llm.get_tokenizer()

programme = """____ PAGE 1 ____
FESTIVAL D'AVIGNON
COUR D'HONNEUR DU PALAIS DES PAPES
7, 8, 12, 15, 18, 19 juillet à 22 h
# Médée
de Sénèque
Mise en scène Jacques Lassalle
"""

messages = [
    {"role": "system", "content": "You annotate French theater programmes from the Festival d'Avignon against the Linked Art performing-arts ontology. Extract structured JSON-LD entities from programme markdown."},
    {"role": "user", "content": f"Extract the Work entity (A) from this theater programme. Output valid Linked Art JSON-LD for the Work as a PropositionalObject, including title, creator with BnF role, source attribution, and any adaptation/influence links.\n\nSource: Medee_FDA1996.md\n\n---\n\n{programme}"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\n"

output = llm.generate([prompt], SamplingParams(max_tokens=2048, temperature=0.7, top_p=0.9))
print(output[0].outputs[0].text)
```
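Since the prompt already opens the `<think>` block, the completion contains the reasoning trace followed by `</think>` and then the JSON-LD. A small post-processing helper (a sketch, assuming exactly that layout) separates the two:

```python
import json

def split_output(text: str):
    """Split a completion into (reasoning trace, parsed JSON-LD).

    Assumes the model closes its reasoning with </think> and then
    emits a single JSON object, as in the prompt above.
    """
    reasoning, _, tail = text.partition("</think>")
    return reasoning.strip(), json.loads(tail.strip())

# Example with a toy completion string:
demo = '◐ likely a Work entity</think>\n{"type": "PropositionalObject"}'
reasoning, entity = split_output(demo)
print(entity["type"])  # PropositionalObject
```

If the model truncates before closing the tag (see Limitations), `json.loads` will raise, so wrapping the call in a try/except is advisable in batch runs.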
### Demo notebook
A Colab notebook is available that runs all 7 entity types on a demo programme with a tabbed display. Runs on a free T4 GPU.
## Training Data
The model was trained on 12,507 samples derived from ~1,400 Festival d'Avignon programmes (1971–2022), spanning three source collections:
- BnF fonds — digitized paper programmes (1971–2002)
- CommAvignon — born-digital programmes from festival communications (2007–2022)
- SiteAvignon — programmes scraped from the festival website (2018)
Each sample pairs raw programme markdown with a Linked Art JSON-LD entity and a SYNTH-style reasoning trace. Reasoning traces were generated by Claude Sonnet (3,110 samples) and then scaled to the full dataset using a Gemma 12B backreasoning model trained on the Sonnet traces.
## Ontology
The model targets the Linked Art Performing Arts extension (v0.9), developed in collaboration with the ERC "From Stage to Data" project (Clarisse Bardiot). Key features:
- BnF role vocabulary — French-language controlled terms from the Bibliothèque nationale de France, mapped to person attributions
- Deterministic IDs — content-derived identifiers (e.g. `W-MEDE-JACQ-1996-a3f1`) for deduplication
- Source attribution — every extracted fact links back to the programme document as primary source
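A content-derived ID of this shape can be built from the work's title, creator, and year. The scheme below is a hypothetical reconstruction (the real derivation, including the hash suffix, is not documented here); it only illustrates why identical inputs always deduplicate to the same ID:

```python
import hashlib
import unicodedata

def work_id(title: str, creator: str, year: int) -> str:
    """Hypothetical sketch of a content-derived Work ID such as
    W-MEDE-JACQ-1996-a3f1. Not the model's actual scheme."""
    def slug(s: str, n: int = 4) -> str:
        # Strip accents, keep letters only, uppercase, truncate to n chars.
        s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
        return "".join(c for c in s if c.isalpha()).upper()[:n]
    # Short hash suffix makes the ID stable for identical inputs.
    digest = hashlib.sha1(f"{title}|{creator}|{year}".encode()).hexdigest()[:4]
    return f"W-{slug(title)}-{slug(creator)}-{year}-{digest}"

print(work_id("Médée", "Jacques Lassalle", 1996))
```

Because the ID is a pure function of the content, two extractions of the same work from different programmes collapse to one identifier.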
## Limitations
- Specialized corpus: trained exclusively on Festival d'Avignon programmes. Other festivals or theatrical traditions may need additional training data.
- French-centric: programme text, role vocabulary, and typographic conventions are French. Non-French programmes are out of scope.
- Large cast/crew: productions with 30+ team members may produce truncated B_cast/B_crew outputs near the context limit.
- Date inference: when the programme doesn't state the year explicitly, the model infers it from the filename (e.g. `FDA1996`). Without a filename, year extraction may fail.
- Reasoning traces improve accuracy and provide explainability but may contain intermediate errors that are corrected in the final JSON-LD output.