# POntAvignon-4b
A 4B reasoning model that annotates French theater programmes against the Linked Art performing-arts ontology extension. Given raw programme markdown from the Festival d'Avignon (1947–present), it extracts structured JSON-LD entities with chain-of-thought reasoning.
POntAvignon-4b is based on Qwen3 but relies on the SYNTH syntax from Pleias' Baguettotron.
## Model Details
| | |
|---|---|
| Base model | Qwen/Qwen3-4B |
| Parameters | 4B |
| Training | Full SFT, 3 epochs, bf16 |
| Context | 16k tokens |
| Output | `<think>` reasoning + Linked Art JSON-LD |
| Train loss | 0.360 |
| Token accuracy | 96.6% |
| Valid JSON rate | 97% (on held-out test set) |
## Entity Types
The model extracts 7 entity types from a single programme, each as a separate query:
| Entity | Linked Art type | Description |
|---|---|---|
| A | PropositionalObject | The abstract creative Work — title, creator, BnF role, adaptation links |
| B_meta | Activity | Production metadata — title, venue, dates, genre, festival link |
| B_cast | Activity.produced_by | All performers — actors, dancers, musicians with BnF roles and characters |
| B_crew | Activity.produced_by | Creative/technical staff — director, lighting, costumes, set design with BnF roles |
| C | Activity | Single Performance — one specific date/time, venue, parent production link |
| Festival | Activity | Overall Event — festival edition (e.g. Festival d'Avignon 1996) |
| Text | LinguisticObject | Source literary work — the play or text being adapted/performed |
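For orientation, here is a minimal sketch of what an A (Work) entity could look like, written as a Python dict. The field names (`type`, `_label`, `identified_by`, `created_by`) loosely follow general Linked Art conventions, but this specific record is illustrative only, not verified model output:

```python
import json

# Hypothetical sketch of an A (Work) entity for "Médée" by Sénèque.
# Field names loosely follow Linked Art conventions; the real model
# output may use different properties and include BnF role codes.
work = {
    "type": "PropositionalObject",
    "_label": "Médée",
    "identified_by": [{"type": "Name", "content": "Médée"}],
    "created_by": {
        "type": "Creation",
        "carried_out_by": [{"type": "Person", "_label": "Sénèque"}],
    },
}

print(json.dumps(work, ensure_ascii=False, indent=2))
```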
## Reasoning

The model uses `<think>` tags to produce SYNTH-style dense reasoning traces before outputting JSON-LD. The reasoning:
- Names the task explicitly and stays focused on it
- Engages with document structure, era, and typographic conventions
- Works through French theatrical vocabulary and BnF role mapping
- Resolves ontological boundaries (Work vs Production vs Performance)
- Uses confidence markers: ● sure, ◐ probable, ○ guess, ⚠ risk
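Because the markers are single characters, a trace's overall confidence profile can be tallied with a few lines of Python. This helper is a convenience sketch, not part of the model's tooling:

```python
from collections import Counter

# Map each SYNTH confidence marker to a readable label.
MARKERS = {"●": "sure", "◐": "probable", "○": "guess", "⚠": "risk"}

def confidence_profile(trace: str) -> dict:
    """Tally the confidence markers appearing in a reasoning trace."""
    counts = Counter(ch for ch in trace if ch in MARKERS)
    return {label: counts[mark] for mark, label in MARKERS.items()}

print(confidence_profile("● title confirmed ◐ creator role ◐ genre ⚠ date"))
# {'sure': 1, 'probable': 2, 'guess': 0, 'risk': 1}
```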
## Usage

### With vLLM (recommended)
```python
from vllm import LLM, SamplingParams

llm = LLM(model="PleIAs/linked-art-qwen3-4b", dtype="bfloat16", max_model_len=4096)
tokenizer = llm.get_tokenizer()

programme = """____ PAGE 1 ____
FESTIVAL D'AVIGNON
COUR D'HONNEUR DU PALAIS DES PAPES
7, 8, 12, 15, 18, 19 juillet à 22 h
# Médée
de Sénèque
Mise en scène Jacques Lassalle
"""

messages = [
    {"role": "system", "content": "You annotate French theater programmes from the Festival d'Avignon against the Linked Art performing-arts ontology. Extract structured JSON-LD entities from programme markdown."},
    {"role": "user", "content": f"Extract the Work entity (A) from this theater programme. Output valid Linked Art JSON-LD for the Work as a PropositionalObject, including title, creator with BnF role, source attribution, and any adaptation/influence links.\n\nSource: Medee_FDA1996.md\n\n---\n\n{programme}"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\n"

output = llm.generate([prompt], SamplingParams(max_tokens=2048, temperature=0.7, top_p=0.9))
print(output[0].outputs[0].text)
```
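Since the prompt already opens the `<think>` block, the completion contains the reasoning trace followed by `</think>` and then the JSON-LD. A small post-processing helper (a sketch, assuming exactly that layout) separates the two:

```python
import json

def split_output(text: str):
    """Split a completion into (reasoning trace, parsed JSON-LD).

    Assumes the model closes its reasoning with </think> and then
    emits a single JSON object, as in the prompt above.
    """
    reasoning, _, tail = text.partition("</think>")
    return reasoning.strip(), json.loads(tail.strip())

# Example with a toy completion string:
demo = '◐ likely a Work entity</think>\n{"type": "PropositionalObject"}'
reasoning, entity = split_output(demo)
print(entity["type"])  # PropositionalObject
```

If the model truncates before closing the tag (see Limitations), `json.loads` will raise, so wrapping the call in a try/except is advisable in batch runs.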
### Demo notebook
A Colab notebook is available that runs all 7 entity types on a demo programme with a tabbed display. Runs on a free T4 GPU.
## Training Data
The model was trained on 12,507 samples derived from ~1,400 Festival d'Avignon programmes (1971–2022), spanning three source collections:
- BnF fonds — digitized paper programmes (1971–2002)
- CommAvignon — born-digital programmes from festival communications (2007–2022)
- SiteAvignon — programmes scraped from the festival website (2018)
Each sample pairs raw programme markdown with a Linked Art JSON-LD entity and a SYNTH-style reasoning trace. Reasoning traces were generated by Claude Sonnet (3,110 samples) and then scaled to the full dataset using a Gemma 12B backreasoning model trained on the Sonnet traces.
## Ontology
The model targets the Linked Art Performing Arts extension (v0.9), developed in collaboration with the ERC "From Stage to Data" project (Clarisse Bardiot). Key features:
- BnF role vocabulary — French-language controlled terms from the Bibliothèque nationale de France, mapped to person attributions
- Deterministic IDs — content-derived identifiers (e.g. `W-MEDE-JACQ-1996-a3f1`) for deduplication
- Source attribution — every extracted fact links back to the programme document as primary source
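A content-derived ID of this shape can be built from the work's title, creator, and year. The scheme below is a hypothetical reconstruction (the real derivation, including the hash suffix, is not documented here); it only illustrates why identical inputs always deduplicate to the same ID:

```python
import hashlib
import unicodedata

def work_id(title: str, creator: str, year: int) -> str:
    """Hypothetical sketch of a content-derived Work ID such as
    W-MEDE-JACQ-1996-a3f1. Not the model's actual scheme."""
    def slug(s: str, n: int = 4) -> str:
        # Strip accents, keep letters only, uppercase, truncate to n chars.
        s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
        return "".join(c for c in s if c.isalpha()).upper()[:n]
    # Short hash suffix makes the ID stable for identical inputs.
    digest = hashlib.sha1(f"{title}|{creator}|{year}".encode()).hexdigest()[:4]
    return f"W-{slug(title)}-{slug(creator)}-{year}-{digest}"

print(work_id("Médée", "Jacques Lassalle", 1996))
```

Because the ID is a pure function of the content, two extractions of the same work from different programmes collapse to one identifier.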
## Limitations
- Specialized corpus: trained exclusively on Festival d'Avignon programmes. Other festivals or theatrical traditions may need additional training data.
- French-centric: programme text, role vocabulary, and typographic conventions are French. Non-French programmes are out of scope.
- Large cast/crew: productions with 30+ team members may produce truncated B_cast/B_crew outputs near the context limit.
- Date inference: when the programme doesn't state the year explicitly, the model infers it from the filename (e.g. `FDA1996`). Without a filename, year extraction may fail.
- Reasoning traces improve accuracy and provide explainability but may contain intermediate errors that are corrected in the final JSON-LD output.