File size: 6,329 Bytes
f1dd719
 
 
bcdc73e
 
 
f1dd719
 
 
 
 
 
 
bcdc73e
 
 
 
 
 
f1dd719
bcdc73e
f1dd719
bcdc73e
f1dd719
bcdc73e
 
f1dd719
bcdc73e
 
 
 
f1dd719
 
bcdc73e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d1a2dcf
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
bcdc73e
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
---
base_model: unsloth/Phi-3.5-mini-instruct
language:
- de
- fr
- it
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
datasets:
- ipst/slds
metrics:
- bertscore
- bleu
- rouge
---
# Model Card for Phi-3.5-mini-instruct-SLDS

## Model Summary

This model is a **Phi-3.5-mini-instruct fine-tuned on the Swiss Landmark Decisions Summarization (SLDS) dataset**.  
SLDS is a multilingual dataset of **20,000 Swiss Federal Supreme Court decisions** (1954–2024), each paired with **headnotes in German, French, and Italian**, resulting in ~60,000 decision–headnote pairs.  

The model is optimized for **legal abstractive summarization** and is capable of producing **concise, legally structured headnotes**.  
It can be used for both **monolingual** and **cross-lingual summarization** tasks.

This model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

---

## Intended Use

- **Primary Task**: Judicial summarization (decision → headnote generation).  
- **Languages**: German (`de`), French (`fr`), Italian (`it`).  
- **Scenarios**:  
  - Monolingual summarization: e.g., German decision → German headnote.  
  - Cross-lingual summarization: e.g., German decision → French headnote.  
  - Legal research support: assisting in retrieval and navigation of court decisions.  

**Not intended for**:  
- Replacing human legal expertise.  
- Serving as an authoritative legal source.  
- Automated legal advice or decision-making.

---

## Training Data

- **Dataset**: [Swiss Landmark Decisions Summarization (SLDS)](https://huggingface.co/datasets/ipst/slds).  
- **Size**: ~20K decisions, ~60K decision–headnote pairs.  
- **Splits**: Train (1954–2021), Validation (2022), Test (2023–2024).  
- **Source**: [Swiss Federal Supreme Court](https://www.bger.ch).  

---

## Training Procedure

- **Base Models**:  
  - Qwen2.5 family (0.5B–14B)  
  - Llama 3.2 (3B)  
  - Phi-3.5-mini  

- **Fine-tuning Objective**: Conditional generation (decision → headnote).  
- **Evaluation Metrics**:  
  - Lexical: ROUGE-1/2/L, BLEU, BERTScore.  
  - Domain-specific: LLM-as-a-Judge framework (DeepSeek V3) assessing five rubrics: accuracy, completeness, clarity, legal citations, and considerations.  

---

## Model Performance

On the SLDS test set (2023–2024):

| Model | Setting | BERTScore ↑ | BLEU ↑ | ROUGE-1 ↑ | ROUGE-2 ↑ | ROUGE-L ↑ | JUDGE ↑ |
|:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |
| [Phi-3.5-mini](https://huggingface.co/ipst/Phi-3.5-mini-instruct-SLDS) | fine-tuned | 11.24 ± 3.82 | 34.84 ± 0.41 | 31.20 ± 2.08 | 14.11 ± 1.27 | 20.96 ± 1.35 | 15.25 ± 2.32 |
| [Llama 3.2B](https://huggingface.co/ipst/Llama-3.2-3B-Instruct-SLDS) | fine-tuned | 15.20 ± 4.40 | 21.89 ± 0.42 | 31.89 ± 2.34 | 14.87 ± 1.61 | 22.49 ± 1.60 | 18.47 ± 2.99 |
| [Qwen2.5 0.5B](https://huggingface.co/ipst/Qwen2.5-0.5B-Instruct-SLDS) | fine-tuned | -1.37 ± 3.85 | 32.20 ± 0.35 | 23.87 ± 1.68 | 9.46 ± 0.94 | 17.37 ± 1.09 | 5.80 ± 1.26 |
| [Qwen2.5 1.5B](https://huggingface.co/ipst/Qwen2.5-1.5B-Instruct-SLDS) | fine-tuned | 19.81 ± 2.72 | 36.79 ± 0.34 | 33.03 ± 1.73 | 14.14 ± 1.08 | 22.67 ± 1.13 | 15.92 ± 2.27 |
| [Qwen2.5 3B](https://huggingface.co/ipst/Qwen2.5-3B-Instruct-SLDS) | fine-tuned | 23.23 ± 2.80 | 38.42 ± 0.34 | 35.18 ± 1.79 | 15.66 ± 1.23 | 24.10 ± 1.17 | 20.31 ± 2.66 |
| [Qwen2.5 7B](https://huggingface.co/ipst/Qwen2.5-7B-Instruct-SLDS) | fine-tuned | 29.59 ± 1.97 | 41.40 ± 0.34 | 39.24 ± 1.59 | 18.26 ± 1.25 | 26.44 ± 1.15 | 28.37 ± 3.07 |
| [Qwen2.5 14B](https://huggingface.co/ipst/Qwen2.5-14B-Instruct-SLDS) | fine-tuned | **32.48 ± 1.98** | **41.80 ± 0.37** | 40.04 ± 1.74 | **19.99 ± 1.41** | **28.00 ± 1.28** | 31.38 ± 3.19 |
| GPT-4o | one-shot | 30.44 ± 1.74 | 31.89 ± 0.25 | **42.12 ± 1.79** | 18.92 ± 1.22 | 25.92 ± 1.05 | 39.70 ± 2.66 |
| Claude 3.5 Sonnet | one-shot | 5.53 ± 2.00 | 21.88 ± 0.25 | 41.86 ± 1.64 | 19.23 ± 1.19 | 27.67 ± 1.20 | 41.25 ± 2.90 |
| DeepSeek-R1 | one-shot | 20.28 ± 1.45 | 22.37 ± 0.18 | 38.30 ± 1.82 | 15.97 ± 0.85 | 21.03 ± 0.84 | **42.28 ± 2.21** |
| o3-mini | one-shot | 14.18 ± 1.31 | 20.55 ± 0.17 | 34.77 ± 1.43 | 11.92 ± 0.69 | 18.21 ± 0.67 | 34.82 ± 2.41 |

- **Lexical metrics**: Fine-tuned models outperform in overlap-based scores.  
- **LLM-judge scores**: Larger proprietary and reasoning models outperform in legal precision.  

---

## Limitations

- **Language imbalance**: German decisions dominate, while Italian remains underrepresented.  
- **Biases**: Headnotes reflect judicial style and conventions, not neutral summaries.  
- **Evaluation mismatch**: ROUGE and BLEU may not fully capture legal accuracy.  
- **Overfitting risk**: Models may overfit to formulaic headnote structures.  
- **Cross-lingual difficulty**: Some models struggle with non-monolingual headnote generation.  

---

## Ethical Considerations

- **Sensitive information**: All data is anonymized by the Swiss Federal Supreme Court before publication.  
- **Legal risk**: Generated headnotes must not be used as official legal advice.  
- **Fair use**: Ensure attribution when reusing outputs.  

---

## How to Cite

If you use this model, please cite the dataset paper:

```bibtex
@inproceedings{rolshoven-etal-2025-unlocking,
    title = "Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in {S}witzerland",
    author = {Rolshoven, Luca  and
      Rasiah, Vishvaksenan  and
      Bose, Srinanda Br{\"u}gger  and
      Hostettler, Sarah  and
      Burkhalter, Lara  and
      St{\"u}rmer, Matthias  and
      Niklaus, Joel},
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.832/",
    pages = "15382--15411",
    ISBN = "979-8-89176-335-7",
}
```