jmarbach committed on
Commit
0fe4f42
·
1 Parent(s): 13e4574

update readMe

Files changed (1)
  1. README.md +62 -1
README.md CHANGED
@@ -2,4 +2,65 @@
2
  license: apache-2.0
3
  base_model:
4
  - answerdotai/ModernBERT-base
5
- ---
2
  license: apache-2.0
3
  base_model:
4
  - answerdotai/ModernBERT-base
5
+ ---
6
+
7
+ # Model Summary
8
+ QuillIndex is an indexing model developed by the [ETH Library](https://library.ethz.ch/). It is trained on the handwritten [School Board minutes](https://sr.ethz.ch/) (1854-1902) of [ETH Zurich](https://ethz.ch/en.html), using samples created by [ChronoQuill](https://github.com/eth-library/ChronoQuill), an HTR pipeline. Given an agenda item, QuillIndex assigns index labels to it. Its taxonomy is restricted to a closed label set derived from the underlying data, namely the annual indexes and their corresponding agenda items. Because the model classifies over this fixed set rather than generating text, it cannot hallucinate arbitrary labels.
9
+
10
+ ## Model Architecture & Evaluation
11
+ QuillIndex is an encoder-only sequence classifier that uses [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) as its pre-trained backbone. The complete technical report on QuillIndex, covering its architecture and evaluation, is available [here](https://www.research-collection.ethz.ch/server/api/core/bitstreams/8053d4d8-51b4-4103-8164-b5068ddb3903/content).
12
+
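To illustrate what multi-label classification over a closed vocabulary means in practice, here is a minimal sketch in plain Python. It assumes a hypothetical three-label taxonomy and made-up logits (neither comes from the model itself) and contrasts independent per-label sigmoids, as used in the usage example of this README, with a softmax that forces labels to compete.

```python
import math

# Hypothetical three-label taxonomy; the real label set ships in the model config.
TAXONOMY = ["Aufnahme", "Bericht", "Schüler"]


def sigmoid(x: float) -> float:
    """Map a single logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs: list[float]) -> list[float]:
    """Map logits to probabilities that sum to 1 (labels compete)."""
    shift = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for illustration only, one per taxonomy entry.
logits = [2.0, 1.5, -3.0]

multi_label = [sigmoid(x) for x in logits]  # each label is scored on its own
single_label = softmax(logits)              # scores are coupled across labels

# With per-label sigmoids, several labels can pass the threshold at once.
selected = [label for label, p in zip(TAXONOMY, multi_label) if p > 0.5]
print(selected)
```

Whatever the logits are, the selected labels are always drawn from the fixed taxonomy; the classifier can select none, one, or several of them, but it cannot emit a label outside that set.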
13
+ ## Environment Setup (Linux x86)
14
+
15
+ ```bash
16
+ uv venv quill_env --python 3.12
17
+ source quill_env/bin/activate
18
+
19
+ uv pip install torch torchvision # CUDA 12.8
20
+ uv pip install transformers
21
+ ```
22
+ ## Python Setup
23
+
24
+ ```python
25
+ import torch
26
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
27
+
28
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
29
+
30
+ model_id = "eth-library/QuillIndex"
31
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
32
+ model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
33
+
34
+ # Excerpt from an agenda item
35
+ input_string = """
36
+ § 270 Aufnahme von Schülern.
37
+ In Folge Berichtes & Antrages, des Direktors der polytechnischen Beitriefens von Schülern Schule Namens der Gesammtkonferenz und gestützt auf die vom Schulrathe ertheilte Vollmacht werden folgende in Zürich geprüfte Kandidaten als Schüler des Polytechnikums aufgenommen.
38
+ I. Bauschule I. Jahreskurs: 1. Köch, Johannes von Urner (Wohlfellen) 2. Pulpius, Leon n. Genf 3. Guasquet, Karl Jakob n. Basel 4. Kleffler, Henri n. Genf 5. Mglies, Carl Jakob n. Frankfurt II. Ingenieurschule 1 Jahreskurs. 6. Chialiva, Louis n. Lugano 7. Füsi, Carl n. Zürich 8. Schenker, Viktor n. Dornach (Solothurn)
39
+ """
40
+
41
+ inputs = tokenizer(input_string, return_tensors="pt").to(device)  # avoid shadowing the built-in input()
42
+ logits = model(**inputs).logits
43
+ prediction = (torch.sigmoid(logits) > 0.5).int() # Raise the 0.5 threshold for more restrictive label assignment
44
+
45
+ id2label = model.config.id2label
46
+ predicted_labels = [id2label[i] for i, flag in enumerate(prediction[0].tolist()) if flag == 1]
47
+
48
+ print(predicted_labels)
49
+ # ['Antrag', 'Aufnahme', 'Bericht', 'Direktor', 'Ingenieurschule', 'Schüler', 'Vollmacht']
50
+ ```
51
+
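The thresholding step above returns labels without their scores. As a hedged sketch, the helper below ranks labels by probability so the effect of the threshold is easier to inspect. The function name `rank_labels` and the example logits are hypothetical and not part of QuillIndex; the helper works on a plain list of floats, so it runs without loading the model.

```python
import math


def rank_labels(logits, id2label, threshold=0.5):
    """Return (label, probability) pairs above `threshold`, highest first.

    `logits` is a flat list of floats, one per label index.
    `id2label` maps integer indices to label names.
    """
    scored = []
    for index, logit in enumerate(logits):
        probability = 1.0 / (1.0 + math.exp(-logit))  # per-label sigmoid
        if probability > threshold:
            scored.append((id2label[index], probability))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up mapping and logits for illustration only.
example_id2label = {0: "Antrag", 1: "Aufnahme", 2: "Direktor"}
example_logits = [0.3, 2.2, -1.0]

ranked = rank_labels(example_logits, example_id2label)
strict = rank_labels(example_logits, example_id2label, threshold=0.8)

for label, probability in ranked:
    print(f"{label}: {probability:.2f}")
```

With the model loaded as in the usage example, a call along the lines of `rank_labels(logits[0].detach().tolist(), model.config.id2label)` should apply the same logic to real predictions. Raising `threshold` trades recall for precision.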
52
+ # License
53
+ We release the QuillIndex model weights under the Apache 2.0 license.
54
+
55
+ # Citation
56
+ If you use this model, please cite:
57
+ ```bibtex
58
+ @article{marbach2026closed,
59
+ title={Closed-Vocabulary Multi-Label Indexing Pipeline for Historical Documents},
60
+ author={Marbach, Jeremy},
61
+ year={2026},
62
+ publisher={ETH Zurich},
63
+ url={https://www.research-collection.ethz.ch/server/api/core/bitstreams/8053d4d8-51b4-4103-8164-b5068ddb3903/content}
64
+ }
65
+ ```
66
+