cmboulanger committed on
Commit 1f74af3 · 1 Parent(s): 572de91

Add support for global rules

.claude/skills/optimize-element-descriptions/SKILL.md CHANGED
@@ -2,8 +2,7 @@
2
  name: optimize-element-descriptions
3
  description: Iteratively improve TEIElement descriptions in _build_schema() to maximise F1 against the gold standard. Use when annotation quality is low or when evaluation shows missed or spurious spans.
4
  disable-model-invocation: true
5
- argument-hint: [--max-items N] [--provider gemini|kisski|all]
6
- allowed-tools: Read, Edit, Bash
7
  ---
8
 
9
  # optimize-element-descriptions
@@ -59,6 +58,7 @@ Key principles (summary):
59
  - Add negative constraints: "never tag X as Y"
60
  - Include textual triggers (keywords, position) and inline surface-form examples
61
  - Prefix critical constraints with `CRITICAL:`
 
62
 
63
  Only edit descriptions for elements where you identified a clear failure pattern.
64
 
 
2
  name: optimize-element-descriptions
3
  description: Iteratively improve TEIElement descriptions in _build_schema() to maximise F1 against the gold standard. Use when annotation quality is low or when evaluation shows missed or spurious spans.
4
  disable-model-invocation: true
5
+ argument-hint: "--max-items N --provider gemini|kisski|all"
 
6
  ---
7
 
8
  # optimize-element-descriptions
 
58
  - Add negative constraints: "never tag X as Y"
59
  - Include textual triggers (keywords, position) and inline surface-form examples
60
  - Prefix critical constraints with `CRITICAL:`
61
+ - If a failure pattern affects **multiple element types**, add the constraint to `TEISchema.rules` instead of duplicating it in each element description — the prompt renders `rules` as a numbered "General Rules" section before all element descriptions.
62
 
63
  Only edit descriptions for elements where you identified a clear failure pattern.
64
 
README.md CHANGED
@@ -100,7 +100,7 @@ API keys for real LLM endpoints go in `.env` (see `.env` for the expected variab
100
 
101
  ## Quick example
102
 
103
- Element descriptions are the primary signal the LLM uses to decide what to annotate and how. See [docs/tei-element-descriptions.md](docs/tei-element-descriptions.md) for guidelines on writing effective descriptions (span framing, multiplicity, parent–child span pairs, negative constraints, and more).
104
 
105
  ```python
106
  from tei_annotator import (
@@ -110,18 +110,24 @@ from tei_annotator import (
110
  )
111
 
112
  # 1. Describe the elements you want to annotate
113
- schema = TEISchema(elements=[
114
- TEIElement(
115
- tag="persName",
116
- description="a person's name",
117
- attributes=[TEIAttribute(name="ref", description="authority URI")],
118
- ),
119
- TEIElement(
120
- tag="placeName",
121
- description="a geographical place name",
122
- attributes=[],
123
- ),
124
- ])
125
 
126
  # 2. Wrap your inference endpoint
127
  def my_call_fn(prompt: str) -> str:
 
100
 
101
  ## Quick example
102
 
103
+ Element descriptions are the primary signal the LLM uses to decide what to annotate and how. Cross-element constraints that apply to multiple span types (e.g. "always emit a `surname` span inside an enclosing `author` span") can be placed in `TEISchema.rules` instead of duplicating them in every element description — the prompt builder renders them as a numbered "General Rules" section before the per-element descriptions. See [docs/tei-element-descriptions.md](docs/tei-element-descriptions.md) for full guidelines.
104
 
105
  ```python
106
  from tei_annotator import (
 
110
  )
111
 
112
  # 1. Describe the elements you want to annotate
113
+ schema = TEISchema(
114
+ rules=[
115
+ # Cross-element constraints stated once, rendered before element descriptions
116
+ "Emit a 'surname' span within every enclosing 'persName' span.",
117
+ ],
118
+ elements=[
119
+ TEIElement(
120
+ tag="persName",
121
+ description="a person's name",
122
+ attributes=[TEIAttribute(name="ref", description="authority URI")],
123
+ ),
124
+ TEIElement(
125
+ tag="placeName",
126
+ description="a geographical place name",
127
+ attributes=[],
128
+ ),
129
+ ],
130
+ )
131
 
132
  # 2. Wrap your inference endpoint
133
  def my_call_fn(prompt: str) -> str:
docs/tei-element-descriptions.md CHANGED
@@ -14,10 +14,10 @@ The LLM is asked to **emit spans** — tuples of *(element name, verbatim text,
14
  surrounding context)*. It never writes raw XML. Descriptions therefore should
15
  be phrased in terms of *emitting a span*, not *wrapping text in a tag*.
16
 
17
- | Avoid | Prefer |
18
- |-------|--------|
19
- | "Wrap the author name in `<author>`." | "Emit an `author` span covering the full name text." |
20
- | "Nest `<surname>` inside `<author>`." | "The `surname` span must fall within the enclosing `author` span's text." |
21
 
22
  ---
23
 
@@ -80,10 +80,10 @@ Examples of effective negative constraints:
80
 
81
  > "A person's name (or surname alone) that follows 'in' is an editor — emit an
82
  > `editor` span, **never** a `title` span."
83
-
84
  > "An institutional report name (e.g. 'Amok Internal Report') must be tagged as
85
  > `note` with type='report', **NOT** as `orgName` or `title`."
86
-
87
  > "A label is always a number or short code — **never** a word or name. An
88
  > ALL-CAPS word at the start of an entry is an author surname, not a label."
89
 
@@ -101,10 +101,10 @@ span represents semantically.
101
 
102
  > "An editor's name typically follows keywords such as 'in', 'ed.', 'éd.',
103
  > 'Hrsg.', 'dir.', '(ed.)', '(eds.)'."
104
-
105
  > "A label appears at the very start of a bibliographic entry, before any author
106
  > or title."
107
-
108
  > "The place of publication may appear in parentheses immediately after the
109
  > title, e.g. 'Title (City, Region)' — the parenthesised location is the
110
  > pubPlace."
@@ -119,7 +119,7 @@ text looks like:
119
  > "Typical label forms: a plain number ('17'), a number with a trailing period
120
  > ('17.'), a number in square brackets ('[77]', '[ACL30]'), or a compound number
121
  > ('5,6')."
122
-
123
  > "Institutional report designations — such as 'Amok Internal Report', 'USGS
124
  > Open-File Report 97-123', or 'Technical Report No. 5' — must be tagged as
125
  > `note`."
@@ -133,11 +133,54 @@ when surrounding punctuation could reasonably be included:
133
 
134
  > "The separator that follows the label (period, dash, or space) is NOT part of
135
  > the label."
136
-
137
  > "Do not include the surrounding parentheses in the pubPlace span."
138
 
139
  ---
140
 
141
  ## Quick checklist
142
 
143
  Before finalising a description, ask:
@@ -149,3 +192,4 @@ Before finalising a description, ask:
149
  - [ ] Are there positional or keyword triggers that help the model find the span?
150
  - [ ] Are edge-case surface forms illustrated with a quoted example?
151
  - [ ] Are span boundaries (what's in / what's out) unambiguous?
 
 
14
  surrounding context)*. It never writes raw XML. Descriptions therefore should
15
  be phrased in terms of *emitting a span*, not *wrapping text in a tag*.
16
 
17
+ | Avoid | Prefer |
18
+ | --------------------------------------- | ------------------------------------------------------------------------- |
19
+ | "Wrap the author name in `<author>`." | "Emit an `author` span covering the full name text." |
20
+ | "Nest `<surname>` inside `<author>`." | "The `surname` span must fall within the enclosing `author` span's text." |
21
 
22
  ---
23
 
 
80
 
81
  > "A person's name (or surname alone) that follows 'in' is an editor — emit an
82
  > `editor` span, **never** a `title` span."
83
+ >
84
  > "An institutional report name (e.g. 'Amok Internal Report') must be tagged as
85
  > `note` with type='report', **NOT** as `orgName` or `title`."
86
+ >
87
  > "A label is always a number or short code — **never** a word or name. An
88
  > ALL-CAPS word at the start of an entry is an author surname, not a label."
89
 
 
101
 
102
  > "An editor's name typically follows keywords such as 'in', 'ed.', 'éd.',
103
  > 'Hrsg.', 'dir.', '(ed.)', '(eds.)'."
104
+ >
105
  > "A label appears at the very start of a bibliographic entry, before any author
106
  > or title."
107
+ >
108
  > "The place of publication may appear in parentheses immediately after the
109
  > title, e.g. 'Title (City, Region)' — the parenthesised location is the
110
  > pubPlace."
 
119
  > "Typical label forms: a plain number ('17'), a number with a trailing period
120
  > ('17.'), a number in square brackets ('[77]', '[ACL30]'), or a compound number
121
  > ('5,6')."
122
+ >
123
  > "Institutional report designations — such as 'Amok Internal Report', 'USGS
124
  > Open-File Report 97-123', or 'Technical Report No. 5' — must be tagged as
125
  > `note`."
 
133
 
134
  > "The separator that follows the label (period, dash, or space) is NOT part of
135
  > the label."
136
+ >
137
  > "Do not include the surrounding parentheses in the pubPlace span."
138
 
139
  ---
140
 
141
+ ### 8. Use `TEISchema.rules` for cross-element constraints
142
+
143
+ When the same constraint applies to **multiple element types**, put it in
144
+ `TEISchema.rules` rather than copying it into every element description.
145
+ The prompt builder renders `rules` as a numbered **"General Rules"** section
146
+ that appears before all per-element descriptions.
147
+
148
+ Good candidates for `rules`:
149
+
150
+ - Parent–child pairing constraints shared by several elements (e.g. "`surname`
151
+ and `forename` must always appear inside an enclosing `author` or `editor`
152
+ span")
153
+ - Constraints that span the same surface form from both sides (e.g. the rule
154
+ that `orgName` requires a sibling `author`/`editor` span, stated for both
155
+ `author` and `orgName`)
156
+ - Bibliographic conventions that apply across multiple roles (e.g. "a dash or
157
+ underscore may stand for a repeated author **or editor** name")
158
+
159
+ Keep the individual element `description` focused on element-specific cues
160
+ (triggers, surface forms, boundaries, negative constraints) and let `rules`
161
+ carry the shared structural invariants.
162
+
163
+ **Example** — in `_build_schema()`:
164
+
165
+ ```python
166
+ TEISchema(
167
+ rules=[
168
+ "For each person's name, emit an 'author' or 'editor' span covering "
169
+ "the full name AND separate 'surname', 'forename', or 'orgName' spans "
170
+ "for the individual name parts within that span.",
171
+ "Never emit 'surname', 'forename', or 'orgName' without a corresponding "
172
+ "enclosing 'author' or 'editor' span.",
173
+ ],
174
+ elements=[
175
+ TEIElement(tag="author", description="Names appearing at the start …"),
176
+ TEIElement(tag="surname", description="The inherited (family) name …"),
177
+ # 'surname' description no longer repeats the parent-span constraint
178
+ ],
179
+ )
180
+ ```
181
+
182
+ ---
183
+
184
  ## Quick checklist
185
 
186
  Before finalising a description, ask:
 
192
  - [ ] Are there positional or keyword triggers that help the model find the span?
193
  - [ ] Are edge-case surface forms illustrated with a quoted example?
194
  - [ ] Are span boundaries (what's in / what's out) unambiguous?
195
+ - [ ] Are cross-element constraints factored into `TEISchema.rules` rather than duplicated across descriptions?
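One way to act on the last checklist item is to scan the existing descriptions for sentences that recur across elements. The helper below is a hypothetical sketch, not part of this commit: `shared_sentences` is an invented name, it takes a plain `dict` of tag to description text, and it only catches verbatim repeats using a naive sentence split (abbreviations such as 'ed.' will be split too). Near-duplicates such as "repeated author name" versus "repeated editor name" still need a manual pass.

```python
from __future__ import annotations

import re
from collections import defaultdict


def shared_sentences(descriptions: dict[str, str], min_tags: int = 2) -> dict[str, list[str]]:
    """Map each sentence found in at least `min_tags` descriptions to the tags using it."""
    seen: dict[str, set[str]] = defaultdict(set)
    for tag, text in descriptions.items():
        # Naive split: sentence-ending punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                seen[sentence].add(tag)
    return {s: sorted(tags) for s, tags in seen.items() if len(tags) >= min_tags}


# Illustrative input only; these are not the real descriptions from _build_schema().
candidates = shared_sentences({
    "surname": "The family name of a person. Always emit inside an enclosing name span.",
    "forename": "The given name of a person. Always emit inside an enclosing name span.",
    "title": "Title of the cited work.",
})
```

Any sentence returned here is a candidate for `TEISchema.rules`; whether it really is a shared structural invariant remains a judgment call.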
scripts/evaluate_llm.py CHANGED
@@ -137,6 +137,19 @@ def _build_schema():
137
  return TEIAttribute(name=name, description=desc, allowed_values=allowed)
138
 
139
  return TEISchema(
140
  elements=[
141
  TEIElement(
142
  tag="label",
@@ -156,16 +169,8 @@ def _build_schema():
156
  tag="author",
157
  description=(
158
  "Name(s) of the author(s) of the cited work. "
159
- "Emit a separate 'author' span for each distinct author — never merge multiple "
160
- "authors into a single span. "
161
- "Each 'author' span covers the full name text of one author. "
162
- "Also emit separate 'surname', 'forename', or 'orgName' spans for the "
163
- "individual name parts; those spans must fall within the 'author' span's text. "
164
- "When an organisation is the author, emit both an 'author' span and an "
165
- "'orgName' span covering the same text — never emit 'orgName' alone in that role. "
166
  "Names appearing at the start of a bibliographic entry before the title and "
167
- "date are authors. "
168
- "In a bibliography, a dash or underscore may stand for a repeated author name."
169
  ),
170
  allowed_children=['surname', 'forename', 'orgName'],
171
  attributes=[],
@@ -174,53 +179,32 @@ def _build_schema():
174
  tag="editor",
175
  description=(
176
  "Name of an editor of the cited work. "
177
- "Emit an 'editor' span covering the full name text; also emit separate "
178
- "'surname', 'forename', or 'orgName' spans for the individual name parts — "
179
- "those spans must fall within the 'editor' span's text. "
180
  "An editor's name typically follows keywords such as 'in', 'ed.', 'éd.', "
181
  "'Hrsg.', 'dir.', '(ed.)', '(eds.)'. "
182
  "CRITICAL: A person's name (or surname alone) that follows 'in' is an editor — "
183
- "emit an 'editor' span (plus name-part spans), never a 'title' span. "
184
- "In a bibliography, a dash or underscore may stand for a repeated editor name."
185
  ),
186
  allowed_children=['surname', 'forename', 'orgName'],
187
  attributes=[],
188
  ),
189
  TEIElement(
190
  tag="surname",
191
- description=(
192
- "The inherited (family) name of a person. "
193
- "Always emit together with an enclosing 'author' or 'editor' span covering "
194
- "the full name — never emit a 'surname' span without a corresponding "
195
- "'author' or 'editor' span."
196
- ),
197
  allowed_children=[],
198
  attributes=[],
199
  ),
200
  TEIElement(
201
  tag="forename",
202
- description=(
203
- "The given (first) name or initials of a person. "
204
- "Always emit together with an enclosing 'author' or 'editor' span covering "
205
- "the full name — never emit a 'forename' span without a corresponding "
206
- "'author' or 'editor' span."
207
- ),
208
  allowed_children=[],
209
  attributes=[],
210
  ),
211
  TEIElement(
212
  tag="orgName",
213
- description=(
214
- "Name of an organisation. "
215
- "When the organisation is an author or editor of the cited work, you MUST emit "
216
- "both the 'orgName' span and an enclosing 'author' (or 'editor') span covering "
217
- "the same text. For example, if 'Acme Research Group' is an author, emit an "
218
- "'author' span AND an 'orgName' span both covering 'Acme Research Group'. "
219
- "Never emit 'orgName' alone when the organisation acts as author or editor."
220
- ),
221
  allowed_children=[],
222
  attributes=[],
223
- ),
224
  TEIElement(
225
  tag="title",
226
  description="Title of the cited work.",
@@ -261,8 +245,8 @@ def _build_schema():
261
  tag="biblScope",
262
  description=(
263
  "Scope reference within the cited item (page range, volume, issue). "
264
- "Emit a separate biblScope span for volume and issue. "
265
- ),
266
  allowed_children=[],
267
  attributes=[
268
  attr(
 
137
  return TEIAttribute(name=name, description=desc, allowed_values=allowed)
138
 
139
  return TEISchema(
140
+ rules=[
141
+ "For each person's name, emit an 'author' or 'editor' span covering the full name "
142
+ "AND separate 'surname', 'forename', or 'orgName' spans for the individual name "
143
+ "parts within that span.",
144
+ "Never emit 'surname', 'forename', or 'orgName' without a corresponding enclosing "
145
+ "'author' or 'editor' span.",
146
+ "When an organisation acts as author or editor, emit BOTH an 'orgName' span AND an "
147
+ "enclosing 'author' (or 'editor') span covering the same text.",
148
+ "Emit a separate 'author' span for each distinct author — never merge multiple "
149
+ "authors into a single span.",
150
+ "In a bibliography, a dash or underscore may stand for a repeated author or editor "
151
+ "name — tag it as 'author' or 'editor' accordingly.",
152
+ ],
153
  elements=[
154
  TEIElement(
155
  tag="label",
 
169
  tag="author",
170
  description=(
171
  "Name(s) of the author(s) of the cited work. "
172
  "Names appearing at the start of a bibliographic entry before the title and "
173
+ "date are authors."
 
174
  ),
175
  allowed_children=['surname', 'forename', 'orgName'],
176
  attributes=[],
 
179
  tag="editor",
180
  description=(
181
  "Name of an editor of the cited work. "
 
 
 
182
  "An editor's name typically follows keywords such as 'in', 'ed.', 'éd.', "
183
  "'Hrsg.', 'dir.', '(ed.)', '(eds.)'. "
184
  "CRITICAL: A person's name (or surname alone) that follows 'in' is an editor — "
185
+ "emit an 'editor' span (plus name-part spans), never a 'title' span."
 
186
  ),
187
  allowed_children=['surname', 'forename', 'orgName'],
188
  attributes=[],
189
  ),
190
  TEIElement(
191
  tag="surname",
192
+ description="The inherited (family) name of a person.",
 
 
 
 
 
193
  allowed_children=[],
194
  attributes=[],
195
  ),
196
  TEIElement(
197
  tag="forename",
198
+ description="The given (first) name or initials of a person.",
 
 
 
 
 
199
  allowed_children=[],
200
  attributes=[],
201
  ),
202
  TEIElement(
203
  tag="orgName",
204
+ description="Name of an organisation.",
205
  allowed_children=[],
206
  attributes=[],
207
+ ),
208
  TEIElement(
209
  tag="title",
210
  description="Title of the cited work.",
 
245
  tag="biblScope",
246
  description=(
247
  "Scope reference within the cited item (page range, volume, issue). "
248
+ "Emit a separate biblScope span for volume and issue."
249
+ ),
250
  allowed_children=[],
251
  attributes=[
252
  attr(
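The second rule moved into `rules` ("Never emit 'surname', 'forename', or 'orgName' without a corresponding enclosing 'author' or 'editor' span") can also be checked after the fact. The sketch below is an assumption-laden illustration, not code from this repository: it assumes spans have already been resolved to character offsets, and `Span` and `orphan_name_parts` are hypothetical names.

```python
from __future__ import annotations

from typing import NamedTuple


class Span(NamedTuple):
    # Hypothetical offset-based span; the annotator itself emits text-based spans.
    tag: str
    start: int
    end: int  # exclusive


NAME_PARTS = {"surname", "forename", "orgName"}
PARENTS = {"author", "editor"}


def orphan_name_parts(spans: list[Span]) -> list[Span]:
    """Return name-part spans not contained in any 'author' or 'editor' span."""
    parents = [s for s in spans if s.tag in PARENTS]
    return [
        s
        for s in spans
        if s.tag in NAME_PARTS
        and not any(p.start <= s.start and s.end <= p.end for p in parents)
    ]


example = [
    Span("author", 0, 10),
    Span("surname", 0, 5),    # enclosed by the author span
    Span("orgName", 0, 10),   # same extent as the author span, also fine
    Span("forename", 20, 25), # no enclosing parent: reported as an orphan
]
orphans = orphan_name_parts(example)
```

A check like this could help confirm that moving the constraint from the element descriptions into the shared rules did not increase the number of orphaned name-part spans.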
tei_annotator/models/schema.py CHANGED
@@ -22,6 +22,7 @@ class TEIElement:
22
  @dataclass
23
  class TEISchema:
24
  elements: list[TEIElement] = field(default_factory=list)
 
25
 
26
  def get(self, tag: str) -> TEIElement | None:
27
  for elem in self.elements:
 
22
  @dataclass
23
  class TEISchema:
24
  elements: list[TEIElement] = field(default_factory=list)
25
+ rules: list[str] = field(default_factory=list)
26
 
27
  def get(self, tag: str) -> TEIElement | None:
28
  for elem in self.elements:
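A minimal sketch of why the new field should be backward compatible: since `rules` is declared with `field(default_factory=list)`, code that builds `TEISchema(elements=[...])` without `rules` gets an empty list, and each instance gets its own list. The `TEIElement` here is a stripped-down stand-in, and the body of `get` is an assumed completion of the loop that the hunk above truncates.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TEIElement:
    # Simplified stand-in (assumption): the real class carries more fields.
    tag: str
    description: str = ""


@dataclass
class TEISchema:
    elements: list[TEIElement] = field(default_factory=list)
    rules: list[str] = field(default_factory=list)

    def get(self, tag: str) -> TEIElement | None:
        # Assumed completion: return the first element whose tag matches.
        for elem in self.elements:
            if elem.tag == tag:
                return elem
        return None


# Pre-existing call style: no rules passed, so rules defaults to [].
old_style = TEISchema(elements=[TEIElement(tag="persName")])

# New call style: cross-element constraints stated once.
new_style = TEISchema(
    rules=["Never emit 'surname' without an enclosing 'author' or 'editor' span."],
    elements=[TEIElement(tag="surname")],
)
```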
tei_annotator/prompting/templates/json_enforced.jinja2 CHANGED
@@ -1,6 +1,13 @@
1
  You are a TEI XML annotation assistant.
2
 
3
  ## TEI Schema
4
  {% for elem in schema.elements %}
5
  - `{{ elem.tag }}`: {{ elem.description }}{% if elem.attributes %} (attributes: {% for attr in elem.attributes %}`{{ attr.name }}`{% if not loop.last %}, {% endif %}{% endfor %}){% endif %}
6
  {% endfor %}
 
1
  You are a TEI XML annotation assistant.
2
 
3
  ## TEI Schema
4
+ {% if schema.rules %}
5
+ ### General Rules
6
+
7
+ {% for rule in schema.rules %}
8
+ {{ loop.index }}. {{ rule }}
9
+ {% endfor %}
10
+ {% endif %}
11
  {% for elem in schema.elements %}
12
  - `{{ elem.tag }}`: {{ elem.description }}{% if elem.attributes %} (attributes: {% for attr in elem.attributes %}`{{ attr.name }}`{% if not loop.last %}, {% endif %}{% endfor %}){% endif %}
13
  {% endfor %}
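The added template block can be approximated in plain Python to see what the model receives. This is a sketch of the logic only, not the Jinja2 rendering itself: `render_general_rules` is a hypothetical helper, and the exact blank lines in the real prompt depend on the environment's whitespace settings.

```python
from __future__ import annotations


def render_general_rules(rules: list[str]) -> str:
    """Approximate the `{% if schema.rules %}` block: a heading plus numbered rules."""
    if not rules:  # mirrors `{% if schema.rules %}`: an empty list renders nothing
        return ""
    lines = ["### General Rules", ""]
    # Jinja2's `loop.index` is 1-based, hence start=1.
    lines += [f"{i}. {rule}" for i, rule in enumerate(rules, start=1)]
    return "\n".join(lines)


section = render_general_rules([
    "Never emit 'surname' without an enclosing 'author' or 'editor' span.",
    "Emit a separate 'author' span for each distinct author.",
])
```

Because the whole block is guarded by the `if`, schemas that define no rules should produce the same prompt as before this change, apart from any whitespace differences.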
tei_annotator/prompting/templates/text_gen.jinja2 CHANGED
@@ -1,6 +1,14 @@
1
  You are a TEI XML annotation assistant. Your task is to identify named entities and spans in the source text and annotate them with TEI XML tags.
2
 
3
  ## TEI Schema
4
 
5
  The following TEI elements are in scope:
6
  {% for elem in schema.elements %}
 
1
  You are a TEI XML annotation assistant. Your task is to identify named entities and spans in the source text and annotate them with TEI XML tags.
2
 
3
  ## TEI Schema
4
+ {% if schema.rules %}
5
+ ### General Rules
6
+
7
+ {% for rule in schema.rules %}
8
+ {{ loop.index }}. {{ rule }}
9
+ {% endfor %}
10
+ {% endif %}
11
+ ### Element Descriptions
12
 
13
  The following TEI elements are in scope:
14
  {% for elem in schema.elements %}