Update README.md
README.md
CHANGED
|
@@ -19,9 +19,9 @@ language:
|
|
| 19 |
---
|
| 20 |
|
| 21 |
|
| 22 |
-
# Phi-4-mini N5
|
| 23 |
|
| 24 |
-
This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) optimized for
|
| 25 |
|
| 26 |
## Model Details
|
| 27 |
|
|
@@ -44,6 +44,7 @@ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://hu
|
|
| 44 |
- **Steps:** 1,735
|
| 45 |
- **Training Metrics:**
|
| 46 |
- Final Training Loss: 0.7864
|
|
|
|
| 47 |
- Training samples/second: 2.209
|
| 48 |
- Learning rate (final): 6.26e-10
|
| 49 |
|
|
@@ -85,18 +86,18 @@ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://hu
|
|
| 85 |
|
| 86 |
### Intended Uses
|
| 87 |
|
| 88 |
-
- **
|
| 89 |
-
- **
|
| 90 |
-
- **
|
| 91 |
-
- **
|
| 92 |
|
| 93 |
### Limitations
|
| 94 |
|
| 95 |
-
- Trained on signaalberichten dataset (
|
| 96 |
-
-
|
| 97 |
-
-
|
| 98 |
-
- Limited to 4K token context (sufficient for
|
| 99 |
-
-
|
| 100 |
|
| 101 |
## How to Use
|
| 102 |
|
|
@@ -116,35 +117,32 @@ model = AutoModelForCausalLM.from_pretrained(
|
|
| 116 |
)
|
| 117 |
tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-merged")
|
| 118 |
|
| 119 |
-
# Prepare input -
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
},
|
| 127 |
-
"reportedBy": {
|
| 128 |
-
"@type": "Person",
|
| 129 |
-
"address": {
|
| 130 |
-
"@type": "PostalAddress",
|
| 131 |
-
"addressLocality": "Amsterdam"
|
| 132 |
-
}
|
| 133 |
-
}
|
| 134 |
-
}
|
| 135 |
|
| 136 |
messages = [
|
| 137 |
{
|
| 138 |
"role": "system",
|
| 139 |
-
"content": "
|
| 140 |
},
|
| 141 |
{
|
| 142 |
"role": "user",
|
| 143 |
-
"content": f"""
|
| 144 |
|
| 145 |
-
|
|
|
|
| 146 |
|
| 147 |
-
|
|
|
|
| 148 |
}
|
| 149 |
]
|
| 150 |
|
|
@@ -199,27 +197,20 @@ tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-adapter")
|
|
| 199 |
|
| 200 |
## Expected Output Format
|
| 201 |
|
| 202 |
-
The model
|
| 203 |
|
| 204 |
```json
|
| 205 |
{
|
| 206 |
-
"
|
| 207 |
-
"
|
| 208 |
-
|
| 209 |
-
|
| 210 |
-
|
| 211 |
-
|
| 212 |
-
"
|
| 213 |
-
|
| 214 |
-
|
| 215 |
-
|
| 216 |
-
"rdfs:label": "Melder",
|
| 217 |
-
"address": {
|
| 218 |
-
"@type": "PostalAddress",
|
| 219 |
-
"rdfs:label": "Adres in Amsterdam",
|
| 220 |
-
"addressLocality": "Amsterdam"
|
| 221 |
-
}
|
| 222 |
-
}
|
| 223 |
}
|
| 224 |
```
|
| 225 |
|
|
@@ -228,14 +219,14 @@ The model adds `rdfs:label` properties to make JSON-LD more human-readable:
|
|
| 228 |
The model was trained on the [UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps) dataset, which contains:
|
| 229 |
|
| 230 |
- **Source**: Signaalberichten (citizen complaints to municipalities)
|
| 231 |
-
- **Domain**:
|
| 232 |
-
- **N5 Examples**: 4,525
|
| 233 |
- **Average Token Length**: 1,636 tokens
|
| 234 |
- **Max Token Length**: 2,332 tokens
|
| 235 |
- **Format**: ChatML-formatted instruction-following examples
|
| 236 |
-
- **Task**:
|
| 237 |
|
| 238 |
-
**Important**: This
|
| 239 |
|
| 240 |
## Training Results
|
| 241 |
|
|
@@ -262,17 +253,20 @@ The model completed 3.1 epochs through the dataset:
|
|
| 262 |
- Large adapter due to r=512
|
| 263 |
- Includes all training configurations
|
| 264 |
|
| 265 |
-
##
|
| 266 |
|
| 267 |
-
|
| 268 |
|
| 269 |
-
|
| 270 |
-
|
| 271 |
-
|
| 272 |
-
|
| 273 |
-
|
| 274 |
|
| 275 |
-
|
| 276 |
|
| 277 |
## Performance Characteristics
|
| 278 |
|
|
@@ -289,9 +283,9 @@ If you use this model, please cite:
|
|
| 289 |
```bibtex
|
| 290 |
@misc{wim-n5-phi4-mini,
|
| 291 |
author = {UWV InnovatieHub},
|
| 292 |
-
title = {Phi-4-mini N5
|
| 293 |
year = {2025},
|
| 294 |
publisher = {HuggingFace},
|
| 295 |
url = {https://huggingface.co/UWV/wim-n5-phi4-mini-merged}
|
| 296 |
}
|
| 297 |
-
```
|
|
|
|
| 19 |
---
|
| 20 |
|
| 21 |
|
| 22 |
+
# Phi-4-mini N5 Complaint Categorization Fine-tune
|
| 23 |
|
| 24 |
+
This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct), trained on the signaalberichten dataset and optimized for categorizing citizen complaints into predefined topic and experience labels.
|
| 25 |
|
| 26 |
## Model Details
|
| 27 |
|
|
|
|
| 44 |
- **Steps:** 1,735
|
| 45 |
- **Training Metrics:**
|
| 46 |
- Final Training Loss: 0.7864
|
| 47 |
+
- Final Eval Loss: 0.7796
|
| 48 |
- Training samples/second: 2.209
|
| 49 |
- Learning rate (final): 6.26e-10
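
The reported losses can be read as perplexities. The sketch below assumes the losses are mean per-token cross-entropy values in nats, which is the usual convention for causal language model training but is not stated in this README. Under that assumption both values come out near 2.2.

```python
import math

# Loss values as reported in the training metrics above.
final_train_loss = 0.7864
final_eval_loss = 0.7796


def perplexity(mean_cross_entropy: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


train_ppl = perplexity(final_train_loss)
eval_ppl = perplexity(final_eval_loss)

print(f"train perplexity ~ {train_ppl:.2f}")
print(f"eval perplexity  ~ {eval_ppl:.2f}")
```

The eval loss sitting slightly below the training loss suggests no obvious overfitting at the final step, though a single pair of numbers is weak evidence.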
|
| 50 |
|
|
|
|
| 86 |
|
| 87 |
### Intended Uses
|
| 88 |
|
| 89 |
+
- **Complaint Categorization**: Classify citizen complaints into topic and experience categories
|
| 90 |
+
- **Municipal Service Analysis**: Analyze phone transcripts and written complaints
|
| 91 |
+
- **Topic Detection**: Identify what the complaint is about (e.g., waste, parking, permits)
|
| 92 |
+
- **Experience Analysis**: Determine how citizens experience the service (e.g., communication, speed, clarity)
|
| 93 |
|
| 94 |
### Limitations
|
| 95 |
|
| 96 |
+
- Trained on the signaalberichten dataset (Dutch municipal complaints)
|
| 97 |
+
- Fixed label vocabulary (cannot create new labels)
|
| 98 |
+
- Best performance on complaint/service interaction texts
|
| 99 |
+
- Limited to 4K token context (sufficient for most complaints)
|
| 100 |
+
- Specific to Dutch government/municipal contexts
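
Because the label vocabulary is fixed, it can help to check predicted labels against the known vocabulary before using them downstream. The following is a minimal sketch under stated assumptions: the two sets contain only the labels named in this README (the real vocabularies are larger), the helper name `split_known_labels` is hypothetical, and `"Geluidsoverlast"` is an invented out-of-vocabulary label used purely for illustration.

```python
# Assumption: partial vocabularies, limited to labels mentioned in this README.
ONDERWERP_LABELS = {
    "Vuil/ongedierte overlast",
    "Bruikbaarheid/beschikbaarheid afvalcontainers",
    "Parkeeroverlast",
    "Vergunningen",
}
BELEVING_LABELS = {
    "Communicatie",
    "Op de hoogte houden",
    "Statusinformatie",
    "Snelheid van afhandeling",
}


def split_known_labels(predicted, vocabulary):
    """Return (known, unknown) labels, preserving the predicted order."""
    known = [label for label in predicted if label in vocabulary]
    unknown = [label for label in predicted if label not in vocabulary]
    return known, unknown


# "Geluidsoverlast" is a made-up label standing in for an unexpected prediction.
known, unknown = split_known_labels(
    ["Vuil/ongedierte overlast", "Geluidsoverlast"], ONDERWERP_LABELS
)
print("known:", known)
print("unknown:", unknown)
```

Logging the `unknown` list rather than silently dropping it makes it easier to spot prompts or inputs that push the model outside its trained label set.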
|
| 101 |
|
| 102 |
## How to Use
|
| 103 |
|
|
|
|
| 117 |
)
|
| 118 |
tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-merged")
|
| 119 |
|
| 120 |
+
# Prepare input: Dutch complaint text to categorize (a citizen reporting litter, full containers, and rats)
|
| 121 |
+
complaint_text = """
|
| 122 |
+
Burger: Nou, waar ik dus over wil klagen is het afval in de buurt.
|
| 123 |
+
Het is echt niet normaal meer, met al die vuilniszakken die op straat worden gegooid.
|
| 124 |
+
De containers zijn vaak vol en er komen ook ratten.
|
| 125 |
+
Ik had al eens gebeld maar er wordt niks aan gedaan!
|
| 126 |
+
"""
|
| 127 |
|
| 128 |
messages = [
|
| 129 |
{
|
| 130 |
"role": "system",
|
| 131 |
+
"content": "Jij bent een expert in het toewijzen van labels aan een tekst."
|
| 132 |
},
|
| 133 |
{
|
| 134 |
"role": "user",
|
| 135 |
+
"content": f"""Analyseer de onderstaande tekst en bepaal welke labels van toepassing zijn.
|
| 136 |
+
|
| 137 |
+
**Onderwerp labels** (selecteer wat van toepassing is):
|
| 138 |
+
Vuil/ongedierte overlast, Bruikbaarheid/beschikbaarheid afvalcontainers,
|
| 139 |
+
Parkeeroverlast, Vergunningen, etc.
|
| 140 |
|
| 141 |
+
**Beleving labels** (selecteer wat van toepassing is):
|
| 142 |
+
Communicatie, Op de hoogte houden, Statusinformatie, Snelheid van afhandeling, etc.
|
| 143 |
|
| 144 |
+
**Tekst om te analyseren**:
|
| 145 |
+
{complaint_text}"""
|
| 146 |
}
|
| 147 |
]
|
| 148 |
|
|
|
|
| 197 |
|
| 198 |
## Expected Output Format
|
| 199 |
|
| 200 |
+
The model outputs a JSON response with categorization results:
|
| 201 |
|
| 202 |
```json
|
| 203 |
{
|
| 204 |
+
"reasoning": "Omdat de burger klaagt over afval dat op straat wordt gegooid, volle containers en rattenoverlast, zijn de onderwerpen 'Vuil/ongedierte overlast' en 'Bruikbaarheid/beschikbaarheid afvalcontainers' het meest van toepassing. De beleving is negatief: de burger ervaart frustratie over het uitblijven van actie en het gebrek aan terugkoppeling.",
|
| 205 |
+
"onderwerp_labels": [
|
| 206 |
+
"Vuil/ongedierte overlast",
|
| 207 |
+
"Bruikbaarheid/beschikbaarheid afvalcontainers"
|
| 208 |
+
],
|
| 209 |
+
"beleving_labels": [
|
| 210 |
+
"Op de hoogte houden",
|
| 211 |
+
"Statusinformatie",
|
| 212 |
+
"Communicatie"
|
| 213 |
+
]
|
| 214 |
}
|
| 215 |
```
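
Generated text may carry extra characters before or after the JSON object, so a tolerant parser can be useful. The sketch below is one possible approach, not the authors' post-processing: the function name `parse_categorization` is hypothetical, and the key names are taken from the example output above.

```python
import json


def parse_categorization(generated_text: str) -> dict:
    """Extract the first JSON object from generated text and check its keys."""
    start = generated_text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value and ignores any trailing text.
    result, _ = json.JSONDecoder().raw_decode(generated_text[start:])
    for key in ("reasoning", "onderwerp_labels", "beleving_labels"):
        if key not in result:
            raise ValueError(f"missing key: {key}")
    return result


# Illustrative string with text around the JSON object.
example = (
    'Antwoord: {"reasoning": "...", '
    '"onderwerp_labels": ["Vuil/ongedierte overlast"], '
    '"beleving_labels": ["Communicatie"]} einde'
)
parsed = parse_categorization(example)
print(parsed["onderwerp_labels"])
print(parsed["beleving_labels"])
```

Malformed JSON raises `json.JSONDecodeError`, a subclass of `ValueError`, so callers can catch both failure modes with a single `except ValueError`.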
|
| 216 |
|
|
|
|
| 219 |
The model was trained on the [UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps) dataset, which contains:
|
| 220 |
|
| 221 |
- **Source**: Signaalberichten (citizen complaints to municipalities)
|
| 222 |
+
- **Domain**: Phone transcripts and written complaints about municipal services
|
| 223 |
+
- **N5 Examples**: 4,525 complaint categorization tasks
|
| 224 |
- **Average Token Length**: 1,636 tokens
|
| 225 |
- **Max Token Length**: 2,332 tokens
|
| 226 |
- **Format**: ChatML-formatted instruction-following examples
|
| 227 |
+
- **Task**: Categorize complaints into predefined topic and experience labels
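
The token statistics above can be compared with the 4K context limit noted under Limitations. This is a rough sketch: it assumes "4K" means 4,096 tokens, and the helper `fits_context` is hypothetical. Whether the reported lengths cover the prompt only or the prompt plus response is not stated in this README.

```python
CONTEXT_WINDOW = 4096      # assumption: "4K" taken to mean 4,096 tokens
MAX_EXAMPLE_TOKENS = 2332  # max token length reported for the dataset
AVG_EXAMPLE_TOKENS = 1636  # average token length reported for the dataset

# Tokens left in the window for the longest and the average training example.
headroom_worst = CONTEXT_WINDOW - MAX_EXAMPLE_TOKENS
headroom_avg = CONTEXT_WINDOW - AVG_EXAMPLE_TOKENS


def fits_context(n_tokens: int, reserve_for_output: int = 0,
                 context_window: int = CONTEXT_WINDOW) -> bool:
    """Check whether an input plus a reserved output budget fits the window."""
    return n_tokens + reserve_for_output <= context_window


print("headroom (longest example):", headroom_worst)
print("headroom (average example):", headroom_avg)
print("longest example + 512 output tokens fits:",
      fits_context(MAX_EXAMPLE_TOKENS, reserve_for_output=512))
```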
|
| 228 |
|
| 229 |
+
**Important**: This is a different task and dataset from the WIM pipeline (N1-N4), which focuses on Wikipedia to JSON-LD conversion.
|
| 230 |
|
| 231 |
## Training Results
|
| 232 |
|
|
|
|
| 253 |
- Large adapter due to r=512
|
| 254 |
- Includes all training configurations
|
| 255 |
|
| 256 |
+
## Model Context
|
| 257 |
|
| 258 |
+
**Note**: Despite the "n5" naming, this model is NOT part of the WIM (Wikipedia to Knowledge Graph) pipeline that includes N1-N4. This is a separate task focused on complaint categorization.
|
| 259 |
|
| 260 |
+
### WIM Pipeline (Wikipedia to JSON-LD)
|
| 261 |
+
1. **N1**: Entity Extraction from Wikipedia text
|
| 262 |
+
2. **N2**: Schema.org Type Selection for entities
|
| 263 |
+
3. **N3**: Transform to JSON-LD format
|
| 264 |
+
4. **N4**: Validation of JSON-LD
|
| 265 |
|
| 266 |
+
### This Model (N5 - Complaint Categorization)
|
| 267 |
+
- **Task**: Categorize citizen complaints into topic and experience labels
|
| 268 |
+
- **Dataset**: Signaalberichten (municipal complaints)
|
| 269 |
+
- **Domain**: Government services and citizen interactions
|
| 270 |
|
| 271 |
## Performance Characteristics
|
| 272 |
|
|
|
|
| 283 |
```bibtex
|
| 284 |
@misc{wim-n5-phi4-mini,
|
| 285 |
author = {UWV InnovatieHub},
|
| 286 |
+
title = {Phi-4-mini N5 Complaint Categorization Model},
|
| 287 |
year = {2025},
|
| 288 |
publisher = {HuggingFace},
|
| 289 |
url = {https://huggingface.co/UWV/wim-n5-phi4-mini-merged}
|
| 290 |
}
|
| 291 |
+
```
|