| | --- |
| | license: apache-2.0 |
| | base_model: |
| | - answerdotai/ModernBERT-base |
| | --- |
| | |
| | # Model Summary |
| | QuillIndex is an indexing model developed by the [ETH Library](https://library.ethz.ch/). It is trained on the handwritten [School Board minutes](https://sr.ethz.ch/) (1854-1902) of [ETH Zurich](https://ethz.ch/en.html), using samples produced by [ChronoQuill](https://github.com/eth-library/ChronoQuill), a handwritten text recognition (HTR) pipeline. QuillIndex assigns index labels to a given agenda item. Its taxonomy is a closed label set derived from the underlying data, namely the annual indexes and their corresponding agenda items. Because the model can only select from this fixed set, it cannot hallucinate arbitrary labels. |
| |
|
| | ## Model Architecture & Evaluation |
| | QuillIndex is an encoder-only sequence classifier that uses [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) as its pre-trained backbone. The taxonomy is stored in the model's config file as the `id2label` mapping. A complete description of QuillIndex, its architecture, and its evaluation can be found in the [technical report](https://www.research-collection.ethz.ch/server/api/core/bitstreams/8053d4d8-51b4-4103-8164-b5068ddb3903/content). |
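
To illustrate how the taxonomy can be read from a config file, here is a minimal sketch. The JSON excerpt is hypothetical: the three label names appear in this card's example output, but their indices and the three-entry size are assumptions made for illustration, not the real contents of the QuillIndex config.

```python
import json

# Hypothetical excerpt of a config.json; the real file holds the full taxonomy
# and the index assignments shown here are assumptions.
config_text = """
{
  "id2label": {"0": "Antrag", "1": "Aufnahme", "2": "Bericht"},
  "label2id": {"Antrag": 0, "Aufnahme": 1, "Bericht": 2}
}
"""

config = json.loads(config_text)

# JSON object keys are always strings, so convert them to integer class indices.
id2label = {int(idx): label for idx, label in config["id2label"].items()}

# Order the labels by index so that position i corresponds to logit i.
taxonomy = [id2label[i] for i in sorted(id2label)]

print(len(taxonomy), taxonomy)
```

When the model is loaded through `transformers`, the same mapping is available as `model.config.id2label`, so parsing the file by hand is only needed for inspection outside of Python model loading.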
| |
|
| | ## Environment Setup (Linux x86) |
| |
|
| | ```bash |
| | uv venv quill_env --python 3.12 |
| | source quill_env/bin/activate |
| | |
| | uv pip install torch torchvision # CUDA 12.8 |
| | uv pip install transformers |
| | ``` |
| | ## Python Setup |
| |
|
| | ```python |
| | import torch |
| | from transformers import AutoModelForSequenceClassification, AutoTokenizer |
| | |
| | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
| | |
| | model_id = "eth-library/QuillIndex" |
| | tokenizer = AutoTokenizer.from_pretrained(model_id) |
| | model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device) |
| | |
| | # Excerpt from an agenda item |
| | input_string = """ |
| | § 270 Aufnahme von Schülern. |
| | In Folge Berichtes & Antrages, des Direktors der polytechnischen Beitriefens von Schülern Schule Namens der Gesammtkonferenz und gestützt auf die vom Schulrathe ertheilte Vollmacht werden folgende in Zürich geprüfte Kandidaten als Schüler des Polytechnikums aufgenommen. |
| | I. Bauschule I. Jahreskurs: 1. Köch, Johannes von Urner (Wohlfellen) 2. Pulpius, Leon n. Genf 3. Guasquet, Karl Jakob n. Basel 4. Kleffler, Henri n. Genf 5. Mglies, Carl Jakob n. Frankfurt II. Ingenieurschule 1 Jahreskurs. 6. Chialiva, Louis n. Lugano 7. Füsi, Carl n. Zürich 8. Schenker, Viktor n. Dornach (Solothurn) |
| | """ |
| | |
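| | # Named `inputs` to avoid shadowing the built-in `input` |
| | inputs = tokenizer(input_string, return_tensors="pt").to(device) |
| | with torch.no_grad():  # Inference only, no gradients needed |
| |     logits = model(**inputs).logits |
| | # Independent sigmoid per label; raise the 0.5 threshold for more restrictive label assignment |
| | prediction = (torch.sigmoid(logits) > 0.5).int() |
| | |
| | id2label = model.config.id2label |
| | predicted_labels = [id2label[i] for i, flag in enumerate(prediction[0].tolist()) if flag == 1] |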
| | |
| | print(predicted_labels) |
| | # ['Antrag', 'Aufnahme', 'Bericht', 'Direktor', 'Ingenieurschule', 'Schüler', 'Vollmacht'] |
| | ``` |
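
The threshold comment in the snippet above can be made concrete with a small sketch. It uses plain Python so that it runs without downloading the model. The logits and the five-label `id2label` mapping are made-up values, and `labels_above` is a hypothetical helper, not part of QuillIndex or `transformers`.

```python
import math

# Made-up logits for one agenda item over a toy five-label taxonomy (assumption).
logits = [2.0, -1.0, 0.5, -3.0, 1.2]
id2label = {0: "Antrag", 1: "Aufnahme", 2: "Bericht", 3: "Direktor", 4: "Schüler"}


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def labels_above(logits, id2label, threshold=0.5):
    """Return (label, probability) pairs scoring above threshold, highest first."""
    scored = [(id2label[i], sigmoid(z)) for i, z in enumerate(logits)]
    picked = [(label, p) for label, p in scored if p > threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


default_labels = labels_above(logits, id2label)                # threshold 0.5
strict_labels = labels_above(logits, id2label, threshold=0.8)  # more restrictive

print([label for label, _ in default_labels])
print([label for label, _ in strict_labels])
```

With the real model, the first argument would be `logits[0].tolist()` and the mapping would be `model.config.id2label`. Each label is scored independently, so a higher threshold simply returns a subset of the labels returned at a lower one.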
| |
|
| | ## Generalization |
| | The taxonomy is derived from 19th-century ETH School Board minutes, and the model is fine-tuned exclusively on 19th-century German. Predictions on texts from other domains, languages, or periods may be unreliable. |
| |
|
| | # License |
| | We release QuillIndex under the Apache 2.0 license. |
| |
|
| | # Citation |
| | If you use this model, please cite: |
| | ```bibtex |
| | @article{marbach2026closed, |
| | title={Closed-Vocabulary Multi-Label Indexing Pipeline for Historical Documents}, |
| | author={Marbach, Jeremy}, |
| | year={2026}, |
| | publisher={ETH Zurich}, |
| | url={https://www.research-collection.ethz.ch/server/api/core/bitstreams/8053d4d8-51b4-4103-8164-b5068ddb3903/content} |
| | } |
| | ``` |
| |
|
| |
|