Update README.md
Browse files
README.md
CHANGED
@@ -11,53 +11,270 @@ tags:
 - unsloth
 licence: license
 pipeline_tag: text-generation
 ---

-#
-This model is a fine-tuned version of [
-It has been trained using [TRL](https://github.com/huggingface/trl).

-##

 ```python
-
 ```

-##

-- TRL: 0.19.1
-- Transformers: 4.53.1
-- Pytorch: 2.7.1
-- Datasets: 4.0.0
-- Tokenizers: 0.21.2

-Cite TRL as:

 ```bibtex
-@misc{
- howpublished = {\url{https://github.com/huggingface/trl}}
 }
 ```

- unsloth
licence: license
pipeline_tag: text-generation
license: apache-2.0
datasets:
- UWV/wim-instruct-wiki-to-jsonld-agent-steps
language:
- nl
---

# Phi-4-mini N2 Schema.org Retrieval Fine-tune

This model is a fine-tuned version of [microsoft/Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) optimized for Schema.org type selection from entity descriptions, trained as part of the WIM (Wikipedia to Knowledge Graph) pipeline.

## Model Details

### Model Description

- **Developed by:** UWV InnovatieHub
- **Model type:** Causal Language Model with LoRA fine-tuning
- **Language(s):** Dutch (nl)
- **License:** MIT
- **Finetuned from:** microsoft/Phi-4-mini-instruct (3.82B parameters)
- **Training Framework:** Unsloth (memory- and speed-optimized fine-tuning)

### Training Details

- **Dataset:** [UWV/wim-instruct-wiki-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-wiki-to-jsonld-agent-steps)
- **Dataset Size:** 104,684 N2-specific examples (schema retrieval tasks)
- **Training Duration:** 16 hours 33 minutes
- **Hardware:** NVIDIA A100 80GB
- **Epochs:** 1.56
- **Steps:** 5,000
- **Training Metrics:**
  - Final Training Loss: 0.9303
  - Final Eval Loss: 0.7903
  - Training samples/second: 2.684
  - Gradient norm (final): ~0.57

### LoRA Configuration

```python
{
    "r": 512,                  # Rank (same as N1 for consistency)
    "lora_alpha": 1024,        # Alpha (2:1 ratio)
    "lora_dropout": 0.05,      # Dropout for regularization
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj"  # Attention layers only
    ]
}
```
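
For reference, here is how those hyperparameters would look as a PEFT `LoraConfig`. This is an illustrative reconstruction, not the exact Unsloth training call, which is not included in this card.

```python
from peft import LoraConfig

# Illustrative reconstruction of the adapter configuration listed above.
lora_config = LoraConfig(
    r=512,
    lora_alpha=1024,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```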

### Training Configuration

```python
{
    "model": "phi4-mini",
    "max_seq_length": 8192,
    "batch_size": 32,
    "gradient_accumulation_steps": 1,
    "effective_batch_size": 32,
    "learning_rate": 2e-5,
    "warmup_steps": 100,
    "max_grad_norm": 1.0,
    "lr_scheduler": "cosine",
    "optimizer": "paged_adamw_8bit",
    "bf16": True,
    "seed": 42
}
```
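
As a sketch only, these values map onto Hugging Face `TrainingArguments` roughly as follows; the output directory is a placeholder and the actual Unsloth training script may differ.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the configuration above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="wim-n2-phi4-mini",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    warmup_steps=100,
    max_steps=5000,
    max_grad_norm=1.0,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    bf16=True,
    seed=42,
)
```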

## Intended Uses & Limitations

### Intended Uses

- **Schema.org Type Selection**: Select appropriate Schema.org types for entities
- **Knowledge Graph Construction**: Second step (N2) in the WIM pipeline
- **Entity Classification**: Map entity descriptions to standardized Schema.org vocabulary
- **High-throughput Processing**: Optimized for batch processing of short sequences

### Limitations

- Optimized for the Schema.org vocabulary only
- Best performance on entity descriptions from encyclopedic content
- Requires entity descriptions from N1 output
- Limited to an 8K-token context (sufficient for all N2 examples)

## How to Use

### Option 1: Using the Merged Model (Recommended)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import json

# Load the merged model (ready to use)
model = AutoModelForCausalLM.from_pretrained(
    "UWV/wim-n2-phi4-mini-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n2-phi4-mini-merged")

# Prepare input (example from Dutch Wikipedia)
entities = [
    {
        "name": "Pedro Nunesplein",
        "description": "Een plein in Amsterdam genoemd naar Pedro Nunes"
    },
    {
        "name": "Amsterdam",
        "description": "Hoofdstad van Nederland"
    }
]

messages = [
    {
        "role": "system",
        "content": "Je bent een expert in schema.org vocabulaire en semantische mapping."
    },
    {
        "role": "user",
        "content": f"""Selecteer voor elke entiteit het meest passende Schema.org type:

{json.dumps(entities, ensure_ascii=False, indent=2)}

Geef een JSON array met elke entiteit en het Schema.org type."""
    }
]

# Apply chat template and generate
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=8192)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=500,
        temperature=0.1,  # Low temperature for consistent classification
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
if "assistant:" in response:
    response = response.split("assistant:")[-1].strip()

print(response)
```

### Option 2: Using the LoRA Adapter

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Load adapter
model = PeftModel.from_pretrained(
    base_model,
    "UWV/wim-n2-phi4-mini-adapter"
)
tokenizer = AutoTokenizer.from_pretrained("UWV/wim-n2-phi4-mini-adapter")

# Use same inference code as above...
```
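
If you start from the adapter but want a standalone checkpoint, a minimal sketch using PEFT's `merge_and_unload` is shown below; the output path is just an example and enough memory to hold the merged weights is assumed.

```python
# Merge the LoRA weights into the base model and save a standalone copy.
merged = model.merge_and_unload()
merged.save_pretrained("wim-n2-phi4-mini-merged-local")    # example path
tokenizer.save_pretrained("wim-n2-phi4-mini-merged-local")
```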

## Expected Output Format

The model outputs JSON with Schema.org type selections:

```json
[
  {
    "name": "Pedro Nunesplein",
    "schema_type": "Place",
    "schema_url": "https://schema.org/Place"
  },
  {
    "name": "Amsterdam",
    "schema_type": "City",
    "schema_url": "https://schema.org/City"
  }
]
```
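
Because the decoded response can contain text around the array, a small helper (hypothetical, not part of the released code) can extract and parse it:

```python
import json
import re

def extract_schema_types(response: str) -> list:
    """Parse the first JSON array found in a model response."""
    match = re.search(r"\[.*\]", response, re.DOTALL)  # outermost [...] span
    if match is None:
        raise ValueError("No JSON array found in model response")
    return json.loads(match.group(0))

# typed_entities = extract_schema_types(response)
```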

## Dataset Information

The model was trained on the [UWV/wim-instruct-wiki-to-jsonld-agent-steps](https://huggingface.co/datasets/UWV/wim-instruct-wiki-to-jsonld-agent-steps) dataset, which contains:

- **Source**: Entity descriptions from N1 processing of Dutch Wikipedia
- **Processing**: Multi-agent pipeline converting text to JSON-LD
- **N2 Examples**: 104,684 schema selection tasks (largest subset)
- **Average Token Length**: 663 tokens (very short sequences)
- **Max Token Length**: 7,488 tokens
- **Format**: ChatML-formatted instruction-following examples
- **Task**: Select appropriate Schema.org types for entities
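
To inspect the data, a minimal loading sketch follows; the split name is an assumption, so check the dataset card for the actual splits and for how N2 examples are marked.

```python
from datasets import load_dataset

# Split name "train" is assumed; see the dataset card for the real layout.
ds = load_dataset("UWV/wim-instruct-wiki-to-jsonld-agent-steps", split="train")
print(ds)      # features and number of rows
print(ds[0])   # one ChatML-formatted instruction example
```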

## Training Results

The model completed 1.56 epochs over the dataset:

- **Final Training Loss**: 0.9303
- **Training Efficiency**: 2.684 samples/second

### Loss Progression

- Started at a loss of ~0.77
- Stable training with gradual improvement
- Learning rate: cosine decay to 2e-12
- Gradient norms: stable around 0.5-0.7

## Model Versions

- **Merged Model**: `UWV/wim-n2-phi4-mini-merged` (7.17 GB)
  - Ready to use without adapter loading
  - Recommended for production inference
  - Merged successfully (no Phi-4-specific issues)

- **LoRA Adapter**: `UWV/wim-n2-phi4-mini-adapter` (~1.14 GB)
  - Requires the base Phi-4-mini-instruct model
  - Useful for further fine-tuning or experiments
  - Large adapter due to r=512 (same as N1)

## Pipeline Context

This model is part of the WIM (Wikipedia to Knowledge Graph) pipeline:

1. **N1**: Entity Extraction
2. **N2 (This Model)**: Schema.org Type Selection
3. **N3**: Transform to JSON-LD
4. **N4**: Validation
5. **N5**: Add Human-Readable Labels

N2 processes the largest number of examples (104K) but the shortest sequences, which makes it well suited to batch processing. Despite using a larger LoRA configuration (r=512) than this relatively simple task typically needs, the model trained efficiently and merged successfully.

## Performance Characteristics

- **Sequence Length**: Average 663 tokens (10x shorter than N1, 60x shorter than N3)
- **Batch Processing**: Can handle batch sizes of 32+ thanks to the short sequences (see the sketch below)
- **Inference Speed**: Very fast due to the short context requirements
- **Memory Usage**: ~11 GB VRAM with an 8K context
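
A minimal batched-inference sketch along those lines; `prompts` stands in for a list of chat-templated N2 prompts built as in the usage example above, and the batch size and padding settings are assumptions to adapt to your hardware.

```python
import torch

# `prompts`, `model`, and `tokenizer` are assumed to come from the usage example above.
batch_size = 32  # matches the training batch size
results = []

tokenizer.padding_side = "left"  # left-pad so generation continues right after each prompt
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

for start in range(0, len(prompts), batch_size):
    batch = prompts[start:start + batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True,
                       truncation=True, max_length=8192).to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=500,
                                 temperature=0.1, do_sample=True, top_p=0.95,
                                 pad_token_id=tokenizer.pad_token_id)
    # Keep only the generated continuation, dropping the prompt tokens
    generated = outputs[:, inputs["input_ids"].shape[1]:]
    results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
```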

## Citation

If you use this model, please cite:

```bibtex
@misc{wim-n2-phi4-mini,
  author    = {UWV InnovatieHub},
  title     = {Phi-4-mini N2 Schema.org Retrieval Model},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/UWV/wim-n2-phi4-mini-merged}
}
```