POntAvignon-4b

A 4B reasoning model that annotates French theater programmes against the Linked Art performing-arts ontology extension. Given raw programme markdown from the Festival d'Avignon (1947–present), it extracts structured JSON-LD entities with chain-of-thought reasoning.

POntAvignon-4b is based on Qwen3 but relies on the SYNTH reasoning syntax from Pleias' Baguettotron.

Model Details

  • Base model: Qwen/Qwen3-4B
  • Parameters: 4B
  • Training: full SFT, 3 epochs, bf16
  • Context: 16k tokens
  • Output: <think> reasoning + Linked Art JSON-LD
  • Train loss: 0.360
  • Token accuracy: 96.6%
  • Valid JSON rate: 97% (on held-out test set)

Entity Types

The model extracts 7 entity types from a single programme, each as a separate query:

  • A (PropositionalObject): the abstract creative Work (title, creator, BnF role, adaptation links)
  • B_meta (Activity): production metadata (title, venue, dates, genre, festival link)
  • B_cast (Activity.produced_by): all performers (actors, dancers, musicians, with BnF roles and characters)
  • B_crew (Activity.produced_by): creative/technical staff (director, lighting, costumes, set design, with BnF roles)
  • C (Activity): a single Performance (one specific date/time, venue, parent production link)
  • Festival (Activity): the overall Event (a festival edition, e.g. Festival d'Avignon 1996)
  • Text (LinguisticObject): the source literary work (the play or text being adapted/performed)
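Since each entity type is a separate query over the same programme, a driver loop only needs one prompt stub per type. A minimal sketch follows; the prompt wording here is illustrative, not the exact prompts used in training:

```python
# Hypothetical per-entity prompt stubs -- illustrative wording only,
# not the exact prompts used during fine-tuning.
ENTITY_PROMPTS = {
    "A": "Extract the Work entity (A) as a PropositionalObject.",
    "B_meta": "Extract the Production metadata entity (B_meta) as an Activity.",
    "B_cast": "Extract all performers (B_cast) with BnF roles and characters.",
    "B_crew": "Extract the creative/technical staff (B_crew) with BnF roles.",
    "C": "Extract a single Performance entity (C) with date, time and venue.",
    "Festival": "Extract the overall Festival edition entity as an Activity.",
    "Text": "Extract the source literary work (Text) as a LinguisticObject.",
}

def build_queries(programme_md: str, source_name: str) -> dict:
    """Build one user message per entity type over the same programme."""
    return {
        entity: f"{prompt}\n\nSource: {source_name}\n\n---\n\n{programme_md}"
        for entity, prompt in ENTITY_PROMPTS.items()
    }
```

Each of the seven messages is then sent through the chat template exactly as in the Usage section below.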

Reasoning

The model uses <think> tags to produce SYNTH-style dense reasoning traces before outputting JSON-LD. The reasoning:

  • Names the task explicitly and stays focused on it
  • Engages with document structure, era, and typographic conventions
  • Works through French theatrical vocabulary and BnF role mapping
  • Resolves ontological boundaries (Work vs Production vs Performance)
  • Uses confidence markers: ● sure, ◐ probable, ○ guess, ⚠ risk
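The confidence markers make traces easy to audit after the fact. A small helper for tallying them, assuming the markers appear verbatim in the generated text:

```python
import re
from collections import Counter

# The four SYNTH confidence markers used in reasoning traces.
MARKERS = {"●": "sure", "◐": "probable", "○": "guess", "⚠": "risk"}

def confidence_profile(reasoning: str) -> Counter:
    """Tally the confidence markers found in a <think> trace."""
    return Counter(MARKERS[m] for m in re.findall("[●◐○⚠]", reasoning))
```

A trace dominated by ○ and ⚠ markers is a signal that the extracted entity deserves manual review.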

Usage

With vLLM (recommended)

from vllm import LLM, SamplingParams

llm = LLM(model="PleIAs/linked-art-qwen3-4b", dtype="bfloat16", max_model_len=4096)
tokenizer = llm.get_tokenizer()

programme = """____ PAGE 1 ____

FESTIVAL D'AVIGNON
COUR D'HONNEUR DU PALAIS DES PAPES
7, 8, 12, 15, 18, 19 juillet à 22 h

# Médée

de Sénèque
Mise en scène Jacques Lassalle
"""

messages = [
    {"role": "system", "content": "You annotate French theater programmes from the Festival d'Avignon against the Linked Art performing-arts ontology. Extract structured JSON-LD entities from programme markdown."},
    {"role": "user", "content": f"Extract the Work entity (A) from this theater programme. Output valid Linked Art JSON-LD for the Work as a PropositionalObject, including title, creator with BnF role, source attribution, and any adaptation/influence links.\n\nSource: Medee_FDA1996.md\n\n---\n\n{programme}"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\n"

output = llm.generate([prompt], SamplingParams(max_tokens=2048, temperature=0.7, top_p=0.9))
print(output[0].outputs[0].text)
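Because the prompt pre-opens the <think> tag, the generated text is the reasoning followed by </think> and then the JSON-LD. A minimal post-processing sketch that separates the two and validates the payload:

```python
import json

def split_output(generated: str) -> tuple:
    """Split a generation into its reasoning trace and JSON-LD payload.

    Assumes the prompt already opened the <think> tag, so the model
    emits `reasoning</think>json-ld`.
    """
    reasoning, _, tail = generated.partition("</think>")
    payload = json.loads(tail.strip())  # raises ValueError on invalid JSON-LD
    return reasoning.strip(), payload
```

The json.loads call doubles as the validity check behind the 97% valid-JSON figure reported above.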

Demo notebook

A Colab notebook is available that runs all 7 entity types on a demo programme and displays the results in a tabbed view; it fits on a free T4 GPU.

Training Data

The model was trained on 12,507 samples derived from ~1,400 Festival d'Avignon programmes (1971–2022), spanning three source collections:

  • BnF fonds — digitized paper programmes (1971–2002)
  • CommAvignon — born-digital programmes from festival communications (2007–2022)
  • SiteAvignon — programmes scraped from the festival website (2018)

Each sample pairs raw programme markdown with a Linked Art JSON-LD entity and a SYNTH-style reasoning trace. Reasoning traces were generated by Claude Sonnet for 3,110 samples, then scaled to the full dataset with a Gemma 12B backreasoning model trained on the Sonnet traces.
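For concreteness, one training sample described above might look like the following; the field names here are assumptions for illustration, not the released dataset schema:

```python
# Assumed training-sample layout -- field names are illustrative,
# not the released dataset schema.
sample = {
    "source": "Medee_FDA1996.md",            # programme markdown filename
    "entity_type": "A",                      # one of the 7 entity queries
    "programme": "____ PAGE 1 ____ ...",     # raw programme markdown
    "reasoning": "● Work: Médée, Sénèque",   # SYNTH-style trace
    "jsonld": {"type": "PropositionalObject", "_label": "Médée"},
}
```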

Ontology

The model targets the Linked Art Performing Arts extension (v0.9), developed in collaboration with the ERC "From Stage to Data" project (Clarisse Bardiot). Key features:

  • BnF role vocabulary — French-language controlled terms from the Bibliothèque nationale de France, mapped to person attributions
  • Deterministic IDs — content-derived identifiers (e.g. W-MEDE-JACQ-1996-a3f1) for deduplication
  • Source attribution — every extracted fact links back to the programme document as primary source
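The deterministic-ID scheme can be approximated as follows. The actual derivation is not documented here; this sketch assumes a 4-letter slug of title and creator plus a short content hash, which yields identifiers shaped like W-MEDE-JACQ-1996-a3f1 (the hash suffix will differ from the real scheme):

```python
import hashlib
import re
import unicodedata

def deterministic_id(prefix: str, title: str, creator: str, year: int) -> str:
    """Sketch of a content-derived ID in the style of W-MEDE-JACQ-1996-a3f1.

    Assumption: 4-letter accent-stripped slugs of title and creator,
    the year, and a short hash of the combined content.
    """
    def slug(s: str) -> str:
        # Strip accents, keep letters only, uppercase, truncate to 4 chars.
        s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
        return re.sub(r"[^A-Za-z]", "", s).upper()[:4]

    digest = hashlib.sha1(f"{title}|{creator}|{year}".encode()).hexdigest()[:4]
    return f"{prefix}-{slug(title)}-{slug(creator)}-{year}-{digest}"
```

Because the ID is a pure function of the content, re-extracting the same Work from two programmes produces the same identifier, which is what enables deduplication.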

Limitations

  • Specialized corpus: trained exclusively on Festival d'Avignon programmes. Other festivals or theatrical traditions may need additional training data.
  • French-centric: programme text, role vocabulary, and typographic conventions are French. Non-French programmes are out of scope.
  • Large cast/crew: productions with 30+ team members may produce truncated B_cast/B_crew outputs near the context limit.
  • Date inference: when the programme doesn't state the year explicitly, the model infers it from the filename (e.g. FDA1996). Without a filename, year extraction may fail.
  • Reasoning traces improve accuracy and provide explainability but may contain intermediate errors that are corrected in the final JSON-LD output.
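The filename-based year fallback described in the date-inference limitation can be mirrored in pre-processing, so that missing years are caught before querying the model. A sketch, assuming filenames embed the year as in FDA1996:

```python
import re
from typing import Optional

def infer_year(filename: Optional[str]) -> Optional[int]:
    """Pull a plausible festival year (1940s onward) from a filename
    such as Medee_FDA1996.md; return None when no year is present."""
    if not filename:
        return None
    match = re.search(r"(19[4-9]\d|20[0-2]\d)", filename)
    return int(match.group(1)) if match else None
```

When this returns None, the programme's year should be supplied out of band rather than left for the model to guess.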