Update README.md

README.md +57 -63

@@ -19,9 +19,9 @@ language:
 ---
-# Phi-4-mini N5 Label Addition Fine-tune
-This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) optimized for adding human-readable labels (rdfs:label) to JSON-LD structures, trained as part of the WIM (Text-to-Knowledge Graph) pipeline on the signaalberichten dataset.
 ## Model Details
@@ -44,6 +44,7 @@ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://hu
 - **Steps:** 1,735
 - **Training Metrics:**
   - Final Training Loss: 0.7864
   - Training samples/second: 2.209
   - Learning rate (final): 6.26e-10
@@ -85,18 +86,18 @@ This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://hu
 ### Intended Uses
-- **Label Addition**: Add human-readable Dutch labels (rdfs:label) to JSON-LD structures
-- **Knowledge Graph Enhancement**: Fifth step (N5) in the WIM pipeline
-- **Government Services**: Optimized for citizen complaints and government service descriptions
-- **JSON-LD Enrichment**: Make knowledge graphs more accessible with descriptive labels
 ### Limitations
-- Trained on signaalberichten dataset (different domain than N1-N3)
-- Best performance on government/municipal service contexts
-- Requires well-formed JSON-LD as input
-- Limited to 4K token context (sufficient for label addition)
-- Small training dataset (4,525 examples)
 ## How to Use
@@ -116,35 +117,32 @@ model = AutoModelForCausalLM.from_pretrained(
 )
 tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-merged")
-# Prepare input - JSON-LD without labels (citizen complaint)
-json_ld = {
-    "@context": "https://schema.org",
-    "@type": "Report",
-    "about": {
-        "@type": "CivicStructure",
-        "name": "Speeltuin Vondelpark"
-    },
-    "reportedBy": {
-        "@type": "Person",
-        "address": {
-            "@type": "PostalAddress",
-            "addressLocality": "Amsterdam"
-        }
-    }
-}
 messages = [
     {
         "role": "system",
-        "content": "Je bent een expert in het toevoegen van Nederlandse labels aan JSON-LD."
     },
     {
         "role": "user",
-        "content": f"""Voeg rdfs:label toe aan de volgende JSON-LD:
-{json.dumps(json_ld, ensure_ascii=False, indent=2)}
-Geef de complete JSON-LD terug met labels."""
     }
 ]
@@ -199,27 +197,20 @@ tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-adapter")
 ## Expected Output Format
-The model adds `rdfs:label` properties to make JSON-LD more human-readable:
 ```json
 {
-    "@context": "https://schema.org",
-    "@type": "Report",
-    "rdfs:label": "Melding",
-    "about": {
-        "@type": "CivicStructure",
-        "rdfs:label": "Speeltuin Vondelpark",
-        "name": "Speeltuin Vondelpark"
-    },
-    "reportedBy": {
-        "@type": "Person",
-        "rdfs:label": "Melder",
-        "address": {
-            "@type": "PostalAddress",
-            "rdfs:label": "Adres in Amsterdam",
-            "addressLocality": "Amsterdam"
-        }
-    }
 }
 ```
@@ -228,14 +219,14 @@ The model adds `rdfs:label` properties to make JSON-LD more human-readable:
 The model was trained on the [UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps) dataset, which contains:
 - **Source**: Signaalberichten (citizen complaints to municipalities)
-- **Domain**: Government services and municipal operations
-- **N5 Examples**: 4,525 label addition tasks
 - **Average Token Length**: 1,636 tokens
 - **Max Token Length**: 2,332 tokens
 - **Format**: ChatML-formatted instruction-following examples
-- **Task**: Add Dutch rdfs:label properties to JSON-LD
-**Important**: This dataset is different from the Wikipedia-based dataset used for N1-N3 models.
 ## Training Results
@@ -262,17 +253,20 @@ The model completed 3.1 epochs through the dataset:
   - Large adapter due to r=512
   - Includes all training configurations
-## Pipeline Context
-This model is part of the WIM (Text-to-Knowledge Graph) pipeline:
-1. **N1**: Entity Extraction
-2. **N2**: Schema.org Type Selection
-3. **N3**: Transform to JSON-LD
-4. **N4**: Validation
-5. **N5 (This Model)**: Add Human-Readable Labels
-N5 is trained on a different dataset (signaalberichten) than N1-N3, focusing on government services and citizen interactions rather than encyclopedic content.
 ## Performance Characteristics
@@ -289,9 +283,9 @@ If you use this model, please cite:
 ```bibtex
 @misc{wim-n5-phi4-mini,
   author = {UWV InnovatieHub},
-  title = {Phi-4-mini N5 Label Addition Model},
   year = {2025},
   publisher = {HuggingFace},
   url = {https://huggingface.co/UWV/wim-n5-phi4-mini-merged}
 }
-```

 ---
+# Phi-4-mini N5 Complaint Categorization Fine-tune
+This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) optimized for categorizing citizen complaints into predefined topic and experience labels, trained on the signaalberichten dataset.
 ## Model Details
 - **Steps:** 1,735
 - **Training Metrics:**
   - Final Training Loss: 0.7864
+  - Final Eval Loss: 0.7796
   - Training samples/second: 2.209
   - Learning rate (final): 6.26e-10
 ### Intended Uses
+- **Complaint Categorization**: Classify citizen complaints into topic and experience categories
+- **Municipal Service Analysis**: Analyze phone transcripts and written complaints
+- **Topic Detection**: Identify what the complaint is about (e.g., waste, parking, permits)
+- **Experience Analysis**: Determine how citizens experience the service (e.g., communication, speed, clarity)
 ### Limitations
+- Trained on signaalberichten dataset (Dutch municipal complaints)
+- Fixed label vocabulary (cannot create new labels)
+- Best performance on complaint/service interaction texts
+- Limited to 4K token context (sufficient for most complaints)
+- Specific to Dutch government/municipal contexts
 ## How to Use
 )
 tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n5-phi4-mini-merged")
+# Prepare input - complaint text for categorization
+complaint_text = """
+Burger: Nou, waar ik dus over wil klagen is het afval in de buurt.
+Het is echt niet normaal meer, met al die vuilniszakken die op straat worden gegooid.
+De containers zijn vaak vol en er komen ook ratten.
+Ik had al eens gebeld maar er wordt niks aan gedaan!
+"""
 messages = [
     {
         "role": "system",
+        "content": "Jij bent een expert in het toewijzen van labels aan een tekst."
     },
     {
         "role": "user",
+        "content": f"""Analyseer de onderstaande tekst en bepaal welke labels van toepassing zijn.
+**Onderwerp labels** (selecteer wat van toepassing is):
+Vuil/ongedierte overlast, Bruikbaarheid/beschikbaarheid afvalcontainers,
+Parkeeroverlast, Vergunningen, etc.
+**Beleving labels** (selecteer wat van toepassing is):
+Communicatie, Op de hoogte houden, Statusinformatie, Snelheid van afhandeling, etc.
+**Tekst om te analyseren**:
+{complaint_text}"""
     }
 ]
 ## Expected Output Format
+The model outputs a JSON response with categorization results:
 ```json
 {
+    "reasoning": "Omdat de burger klaagt over afval dat op straat wordt gegooid, volle containers en rattenoverlast, zijn de onderwerpen 'Vuil/ongedierte overlast' en 'Bruikbaarheid/beschikbaarheid afvalcontainers' het meest van toepassing. De beleving is negatief: de burger ervaart frustratie over het uitblijven van actie en het gebrek aan terugkoppeling.",
+    "onderwerp_labels": [
+        "Vuil/ongedierte overlast",
+        "Bruikbaarheid/beschikbaarheid afvalcontainers"
+    ],
+    "beleving_labels": [
+        "Op de hoogte houden",
+        "Statusinformatie",
+        "Communicatie"
+    ]
 }
 ```
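Because the label vocabulary is fixed, it can help to check the model's output against the allowed labels before using it downstream. The following is a minimal sketch, not part of the model card: `parse_categorization` is a hypothetical helper, and the two label sets contain only the onderwerp (topic) and beleving (experience) labels named in the prompt above, whose full lists are elided there ("etc."). It also assumes the model may wrap the JSON object in extra text, which may not be the case in practice.

```python
import json

# Partial vocabularies copied from the example prompt; the complete lists are
# not given in the README ("etc."), so these sets are placeholders (assumption).
ONDERWERP_LABELS = {
    "Vuil/ongedierte overlast",
    "Bruikbaarheid/beschikbaarheid afvalcontainers",
    "Parkeeroverlast",
    "Vergunningen",
}
BELEVING_LABELS = {
    "Communicatie",
    "Op de hoogte houden",
    "Statusinformatie",
    "Snelheid van afhandeling",
}


def parse_categorization(raw_output: str) -> dict:
    """Hypothetical helper: extract the JSON object and separate known from unknown labels."""
    # Take the span from the first "{" to the last "}" in case of surrounding text.
    start = raw_output.find("{")
    end = raw_output.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw_output[start:end + 1])

    result = {"reasoning": data.get("reasoning", ""), "unknown_labels": []}
    for key, allowed in (
        ("onderwerp_labels", ONDERWERP_LABELS),
        ("beleving_labels", BELEVING_LABELS),
    ):
        labels = data.get(key, [])
        # Keep labels from the fixed vocabulary; collect everything else separately.
        result[key] = [label for label in labels if label in allowed]
        result["unknown_labels"] += [label for label in labels if label not in allowed]
    return result
```

Labels that land in `unknown_labels` could be logged or dropped, depending on how strictly the downstream analysis depends on the fixed vocabulary.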
 The model was trained on the [UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-signaalberichten-to-jsonld-agent-steps) dataset, which contains:
 - **Source**: Signaalberichten (citizen complaints to municipalities)
+- **Domain**: Phone transcripts and written complaints about municipal services
+- **N5 Examples**: 4,525 complaint categorization tasks
 - **Average Token Length**: 1,636 tokens
 - **Max Token Length**: 2,332 tokens
 - **Format**: ChatML-formatted instruction-following examples
+- **Task**: Categorize complaints into predefined topic and experience labels
+**Important**: This task and dataset differ from those of the WIM pipeline (N1-N4), which focuses on Wikipedia-to-JSON-LD conversion.
 ## Training Results
   - Large adapter due to r=512
   - Includes all training configurations
+## Model Context
+**Note**: Despite the "n5" in its name, this model is NOT part of the WIM (Wikipedia to Knowledge Graph) pipeline, which consists of N1-N4. It addresses a separate task: complaint categorization.
+### WIM Pipeline (Wikipedia to JSON-LD):
+1. **N1**: Entity Extraction from Wikipedia text
+2. **N2**: Schema.org Type Selection for entities
+3. **N3**: Transform to JSON-LD format
+4. **N4**: Validation of JSON-LD
+### This Model (N5 - Complaint Categorization):
+- **Task**: Categorize citizen complaints into topic and experience labels
+- **Dataset**: Signaalberichten (municipal complaints)
+- **Domain**: Government services and citizen interactions
 ## Performance Characteristics
 ```bibtex
 @misc{wim-n5-phi4-mini,
   author = {UWV InnovatieHub},
+  title = {Phi-4-mini N5 Complaint Categorization Model},
   year = {2025},
   publisher = {HuggingFace},
   url = {https://huggingface.co/UWV/wim-n5-phi4-mini-merged}
 }
+```