|
|
--- |
|
|
base_model: unsloth/Phi-3.5-mini-instruct |
|
|
language: |
|
|
- de |
|
|
- fr |
|
|
- it |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- text-generation-inference |
|
|
- transformers |
|
|
- unsloth |
|
|
- llama |
|
|
- trl |
|
|
datasets: |
|
|
- ipst/slds |
|
|
metrics: |
|
|
- bertscore |
|
|
- bleu |
|
|
- rouge |
|
|
--- |
|
|
# Model Card for Phi-3.5-mini-instruct-SLDS |
|
|
|
|
|
## Model Summary |
|
|
|
|
|
This model is a **Phi-3.5-mini-instruct fine-tuned on the Swiss Landmark Decisions Summarization (SLDS) dataset**. |
|
|
SLDS is a multilingual dataset of **20,000 Swiss Federal Supreme Court decisions** (1954–2024), each paired with **headnotes in German, French, and Italian**, resulting in ~60,000 decision–headnote pairs. |
|
|
|
|
|
The model is optimized for **legal abstractive summarization** and is capable of producing **concise, legally structured headnotes**. |
|
|
It can be used for both **monolingual** and **cross-lingual summarization** tasks. |
|
|
|
|
|
This model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library. |
|
|
|
|
|
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth) |
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
- **Primary Task**: Judicial summarization (decision → headnote generation). |
|
|
- **Languages**: German (`de`), French (`fr`), Italian (`it`). |
|
|
- **Scenarios**: |
|
|
- Monolingual summarization: e.g., German decision → German headnote. |
|
|
- Cross-lingual summarization: e.g., German decision → French headnote. |
|
|
- Legal research support: assisting in retrieval and navigation of court decisions. |
|
|
|
|
|
**Not intended for**: |
|
|
- Replacing human legal expertise. |
|
|
- Serving as an authoritative legal source. |
|
|
- Automated legal advice or decision-making. |
|
|
|
|
|
--- |
|
|
|
|
|
## Training Data |
|
|
|
|
|
- **Dataset**: [Swiss Landmark Decisions Summarization (SLDS)](https://huggingface.co/datasets/ipst/slds). |
|
|
- **Size**: ~20K decisions, ~60K decision–headnote pairs. |
|
|
- **Splits**: Train (1954–2021), Validation (2022), Test (2023–2024). |
|
|
- **Source**: [Swiss Federal Supreme Court](https://www.bger.ch). |
|
|
|
|
|
--- |
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
- **Base Models**: |
|
|
- Qwen2.5 family (0.5B–14B) |
|
|
- Llama 3.2 (3B) |
|
|
- Phi-3.5-mini |
|
|
|
|
|
- **Fine-tuning Objective**: Conditional generation (decision → headnote). |
|
|
- **Evaluation Metrics**: |
|
|
- Lexical: ROUGE-1/2/L, BLEU, BERTScore. |
|
|
- Domain-specific: LLM-as-a-Judge framework (DeepSeek V3) assessing five rubrics: accuracy, completeness, clarity, legal citations, and considerations. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Performance |
|
|
|
|
|
On the SLDS test set (2023–2024): |
|
|
|
|
|
| Model | Setting | BERTScore ↑ | BLEU ↑ | ROUGE-1 ↑ | ROUGE-2 ↑ | ROUGE-L ↑ | JUDGE ↑ | |
|
|
|:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- | |
|
|
| [Phi-3.5-mini](https://huggingface.co/ipst/Phi-3.5-mini-instruct-SLDS) | fine-tuned | 11.24 ± 3.82 | 34.84 ± 0.41 | 31.20 ± 2.08 | 14.11 ± 1.27 | 20.96 ± 1.35 | 15.25 ± 2.32 | |
|
|
| [Llama 3.2B](https://huggingface.co/ipst/Llama-3.2-3B-Instruct-SLDS) | fine-tuned | 15.20 ± 4.40 | 21.89 ± 0.42 | 31.89 ± 2.34 | 14.87 ± 1.61 | 22.49 ± 1.60 | 18.47 ± 2.99 | |
|
|
| [Qwen2.5 0.5B](https://huggingface.co/ipst/Qwen2.5-0.5B-Instruct-SLDS) | fine-tuned | -1.37 ± 3.85 | 32.20 ± 0.35 | 23.87 ± 1.68 | 9.46 ± 0.94 | 17.37 ± 1.09 | 5.80 ± 1.26 | |
|
|
| [Qwen2.5 1.5B](https://huggingface.co/ipst/Qwen2.5-1.5B-Instruct-SLDS) | fine-tuned | 19.81 ± 2.72 | 36.79 ± 0.34 | 33.03 ± 1.73 | 14.14 ± 1.08 | 22.67 ± 1.13 | 15.92 ± 2.27 | |
|
|
| [Qwen2.5 3B](https://huggingface.co/ipst/Qwen2.5-3B-Instruct-SLDS) | fine-tuned | 23.23 ± 2.80 | 38.42 ± 0.34 | 35.18 ± 1.79 | 15.66 ± 1.23 | 24.10 ± 1.17 | 20.31 ± 2.66 | |
|
|
| [Qwen2.5 7B](https://huggingface.co/ipst/Qwen2.5-7B-Instruct-SLDS) | fine-tuned | 29.59 ± 1.97 | 41.40 ± 0.34 | 39.24 ± 1.59 | 18.26 ± 1.25 | 26.44 ± 1.15 | 28.37 ± 3.07 | |
|
|
| [Qwen2.5 14B](https://huggingface.co/ipst/Qwen2.5-14B-Instruct-SLDS) | fine-tuned | **32.48 ± 1.98** | **41.80 ± 0.37** | 40.04 ± 1.74 | **19.99 ± 1.41** | **28.00 ± 1.28** | 31.38 ± 3.19 | |
|
|
| GPT-4o | one-shot | 30.44 ± 1.74 | 31.89 ± 0.25 | **42.12 ± 1.79** | 18.92 ± 1.22 | 25.92 ± 1.05 | 39.70 ± 2.66 | |
|
|
| Claude 3.5 Sonnet | one-shot | 5.53 ± 2.00 | 21.88 ± 0.25 | 41.86 ± 1.64 | 19.23 ± 1.19 | 27.67 ± 1.20 | 41.25 ± 2.90 | |
|
|
| DeepSeek-R1 | one-shot | 20.28 ± 1.45 | 22.37 ± 0.18 | 38.30 ± 1.82 | 15.97 ± 0.85 | 21.03 ± 0.84 | **42.28 ± 2.21** | |
|
|
| o3-mini | one-shot | 14.18 ± 1.31 | 20.55 ± 0.17 | 34.77 ± 1.43 | 11.92 ± 0.69 | 18.21 ± 0.67 | 34.82 ± 2.41 | |
|
|
|
|
|
- **Lexical metrics**: Fine-tuned models outperform in overlap-based scores. |
|
|
- **LLM-judge scores**: Larger proprietary and reasoning models outperform in legal precision. |
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Language imbalance**: German decisions dominate, while Italian remains underrepresented. |
|
|
- **Biases**: Headnotes reflect judicial style and conventions, not neutral summaries. |
|
|
- **Evaluation mismatch**: ROUGE and BLEU may not fully capture legal accuracy. |
|
|
- **Overfitting risk**: Models may overfit to formulaic headnote structures. |
|
|
- **Cross-lingual difficulty**: Some models struggle with non-monolingual headnote generation. |
|
|
|
|
|
--- |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
- **Sensitive information**: All data is anonymized by the Swiss Federal Supreme Court before publication. |
|
|
- **Legal risk**: Generated headnotes must not be used as official legal advice. |
|
|
- **Fair use**: Ensure attribution when reusing outputs. |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Cite |
|
|
|
|
|
If you use this model, please cite the dataset paper: |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{rolshoven-etal-2025-unlocking, |
|
|
title = "Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in {S}witzerland", |
|
|
author = {Rolshoven, Luca and |
|
|
Rasiah, Vishvaksenan and |
|
|
Bose, Srinanda Br{\"u}gger and |
|
|
Hostettler, Sarah and |
|
|
Burkhalter, Lara and |
|
|
St{\"u}rmer, Matthias and |
|
|
Niklaus, Joel}, |
|
|
editor = "Christodoulopoulos, Christos and |
|
|
Chakraborty, Tanmoy and |
|
|
Rose, Carolyn and |
|
|
Peng, Violet", |
|
|
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025", |
|
|
month = nov, |
|
|
year = "2025", |
|
|
address = "Suzhou, China", |
|
|
publisher = "Association for Computational Linguistics", |
|
|
url = "https://aclanthology.org/2025.findings-emnlp.832/", |
|
|
pages = "15382--15411", |
|
|
ISBN = "979-8-89176-335-7", |
|
|
} |
|
|
``` |