Upload README.md with huggingface_hub
README.md (changed):
````diff
@@ -34,17 +34,17 @@ model-index:
       config: mixed-domains
       split: test
     metrics:
-    - type: completeness-score
-      value: 0.640
-      name: Overall Completeness
     - type: pii-detection-rate
-      value:
       name: PII Detection Rate
     - type: semantic-preservation
-      value: 0.
       name: Semantic Preservation
     - type: latency
-      value:
       name: Average Latency (ms)
 ---
 
@@ -83,26 +83,42 @@ model-index:
 
 1. **Install llama.cpp** (if not already installed):
 ```bash
 git clone https://github.com/ggerganov/llama.cpp
-cd llama.cpp
 ```
 
-2. **Download
 ```bash
-# Download model files
 wget https://huggingface.co/Minibase/DeId-Small/resolve/main/model.gguf
 wget https://huggingface.co/Minibase/DeId-Small/resolve/main/deid_inference.py
 
-
-
-
 ```
 
-
 ```python
 import requests
 
-# De-identify text
 response = requests.post("http://127.0.0.1:8000/completion", json={
     "prompt": "Instruction: De-identify this text by replacing all personal information with placeholders.\n\nInput: Patient John Smith, born 1985-03-15, lives at 123 Main St.\n\nResponse: ",
     "max_tokens": 256,
@@ -110,22 +126,63 @@ model-index:
 })
 
 result = response.json()
-print(result["content"])
 ```
 
-### Python Client
 
 ```python
 from deid_inference import DeIdClient
 
-# Initialize client
 client = DeIdClient()
 
-# De-identify text
 sensitive_text = "Dr. Sarah Johnson called from (555) 123-4567 about patient Michael Brown."
 clean_text = client.deidentify_text(sensitive_text)
 
-print(clean_text)
 ```
 
 ## 📊 Benchmarks & Performance
@@ -135,9 +192,9 @@ print(clean_text) # "Dr. [FIRSTNAME_1] [LASTNAME_1] called from [PHONE_1] about
 | Metric | Score | Description |
 |--------|-------|-------------|
 | **PII Detection Rate** | **100%** | **Perfect detection when PII is present in input** |
-| **Completeness Score** | **
-| Semantic Preservation |
-| **Average Latency** | **
 
 ### Performance Insights
 
@@ -370,17 +427,11 @@ If you use DeId-Small in your research, please cite:
 }
 ```
 
-## 
 
 - **Website**: [minibase.ai](https://minibase.ai)
-- **Discord
-- **
-- **Email**: hello@minibase.ai
-
-### Support
-- 📚 **Documentation**: [docs.minibase.ai](https://docs.minibase.ai)
-- 💬 **Community Forum**: [forum.minibase.ai](https://forum.minibase.ai)
-- 🐛 **Bug Reports**: [GitHub Issues](https://github.com/minibase-ai/deid-small/issues)
 
 ## 📄 License
 
````
      config: mixed-domains
      split: test
    metrics:
    - type: pii-detection-rate
      value: 1.000
      name: PII Detection Rate
    - type: completeness-score
      value: 0.650
      name: Completeness Score
    - type: semantic-preservation
      value: 0.811
      name: Semantic Preservation
    - type: latency
      value: 477.0
      name: Average Latency (ms)
---

1. **Install llama.cpp** (if not already installed):
   ```bash
   # Clone and build llama.cpp
   git clone https://github.com/ggerganov/llama.cpp
   cd llama.cpp
   make

   # Return to project directory
   cd ../de-id-small
   ```

2. **Download the GGUF model**:
   ```bash
   # Download model files from HuggingFace
   wget https://huggingface.co/Minibase/DeId-Small/resolve/main/model.gguf
   wget https://huggingface.co/Minibase/DeId-Small/resolve/main/deid_inference.py
   wget https://huggingface.co/Minibase/DeId-Small/resolve/main/config.json
   wget https://huggingface.co/Minibase/DeId-Small/resolve/main/tokenizer_config.json
   wget https://huggingface.co/Minibase/DeId-Small/resolve/main/generation_config.json
   ```

3. **Start the model server**:
   ```bash
   # Start llama.cpp server with the GGUF model
   ../llama.cpp/llama-server \
     -m model.gguf \
     --host 127.0.0.1 \
     --port 8000 \
     --ctx-size 2048 \
     --n-gpu-layers 0 \
     --chat-template
   ```

4. **Make API calls**:
   ```python
   import requests

   # De-identify text via REST API
   response = requests.post("http://127.0.0.1:8000/completion", json={
       "prompt": "Instruction: De-identify this text by replacing all personal information with placeholders.\n\nInput: Patient John Smith, born 1985-03-15, lives at 123 Main St.\n\nResponse: ",
       "max_tokens": 256,
   })

   result = response.json()
   print(result["content"])
   # Output: "Patient [FIRSTNAME_1] [LASTNAME_1], born [DOB_1], lives at [BUILDINGNUMBER_1] [STREET_1]."
   ```
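The model is sensitive to this exact Instruction/Input/Response framing, so it helps to build the prompt in one place rather than inline. A small helper (the function name is ours, not part of the repo) can construct it:

```python
def build_deid_prompt(text: str) -> str:
    """Wrap raw text in the Instruction/Input/Response template shown above."""
    return (
        "Instruction: De-identify this text by replacing all personal "
        "information with placeholders.\n\n"
        f"Input: {text}\n\n"
        "Response: "
    )

print(build_deid_prompt("Patient John Smith, born 1985-03-15."))
```

Pass the result as the `prompt` field of the `/completion` request above.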

### Python Client (Recommended)

```python
# Download and use the provided Python client
from deid_inference import DeIdClient

# Initialize client (connects to local server)
client = DeIdClient()

# De-identify sensitive text
sensitive_text = "Dr. Sarah Johnson called from (555) 123-4567 about patient Michael Brown."
clean_text = client.deidentify_text(sensitive_text)

print(clean_text)
# Output: "Dr. [FIRSTNAME_1] [LASTNAME_1] called from [PHONE_1] about patient [FIRSTNAME_2] [LASTNAME_2]."

# Batch processing
texts = [
    "Employee John Doe earns $85,000 annually.",
    "Contact jane.smith@company.com for details."
]
clean_texts = client.deidentify_batch(texts)
print(clean_texts)
# Output: ["Employee [FIRSTNAME_1] Doe earns [CURRENCYSYMBOL_1][AMOUNT_1] annually.", "Contact [EMAIL_1] for details."]
```
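If you prefer not to depend on `requests` or on the downloaded `deid_inference.py`, the client's surface is easy to approximate with the standard library alone. The class below is a hypothetical stand-in, not the shipped client; its defaults (host, port, `max_tokens`) are assumptions taken from the Quickstart above:

```python
import json
import urllib.request

class MinimalDeIdClient:
    """Stdlib-only stand-in for DeIdClient (hypothetical; the shipped
    deid_inference.py may use different names and defaults)."""

    def __init__(self, base_url: str = "http://127.0.0.1:8000"):
        self.base_url = base_url

    def deidentify_text(self, text: str, max_tokens: int = 256) -> str:
        # Same prompt template as the REST example above
        prompt = (
            "Instruction: De-identify this text by replacing all personal "
            f"information with placeholders.\n\nInput: {text}\n\nResponse: "
        )
        payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
        req = urllib.request.Request(
            f"{self.base_url}/completion",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["content"].strip()

    def deidentify_batch(self, texts):
        # Naive sequential batching; a real client might parallelize
        return [self.deidentify_text(t) for t in texts]
```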

### Direct llama.cpp Usage

```python
# Alternative: Use llama.cpp directly without server
import subprocess

def deidentify_with_llama_cpp(text: str) -> str:
    prompt = f"Instruction: De-identify this text by replacing all personal information with placeholders.\n\nInput: {text}\n\nResponse: "

    # Run llama.cpp directly
    cmd = [
        "../llama.cpp/llama-cli",
        "-m", "model.gguf",
        "--prompt", prompt,
        "--ctx-size", "2048",
        "--n-predict", "256",
        "--temp", "0.1",
        "--log-disable"
    ]

    result = subprocess.run(cmd, capture_output=True, text=True, cwd=".")
    return result.stdout.strip()

# Usage
result = deidentify_with_llama_cpp("Patient Sarah Johnson, DOB 05/12/1980.")
print(result)
```
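Note that `llama-cli` typically echoes the prompt into stdout, so the raw string returned above may contain the whole template as well as the completion (this is an assumption about llama.cpp's default output; it varies by version and flags). A small splitter that keeps only the generated part:

```python
def extract_completion(stdout: str) -> str:
    """Keep only the text after the final 'Response: ' marker, if present."""
    marker = "Response: "
    idx = stdout.rfind(marker)
    return stdout[idx + len(marker):].strip() if idx != -1 else stdout.strip()

raw = ("Instruction: De-identify this text by replacing all personal "
       "information with placeholders.\n\n"
       "Input: Patient Sarah Johnson, DOB 05/12/1980.\n\n"
       "Response: Patient [FIRSTNAME_1] [LASTNAME_1], DOB [DOB_1].")
print(extract_completion(raw))  # Patient [FIRSTNAME_1] [LASTNAME_1], DOB [DOB_1].
```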

## 📊 Benchmarks & Performance

| Metric | Score | Description |
|--------|-------|-------------|
| **PII Detection Rate** | **100%** | **Perfect detection when PII is present in input** |
| **Completeness Score** | **65.0%** | **Percentage of texts fully de-identified** |
| **Semantic Preservation** | **81.1%** | **How well original meaning is preserved** |
| **Average Latency** | **477 ms** | **Response time performance** |
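The gap between a 100% detection rate and a 65.0% completeness score means the model reliably finds *some* PII in a text but does not always replace *all* of it. A toy illustration of how such a completeness metric could be computed (hypothetical scoring code, not Minibase's actual benchmark harness):

```python
def completeness_score(outputs, pii_spans):
    """Fraction of outputs in which none of the known PII strings survive.

    outputs[i] is a de-identified text; pii_spans[i] lists the PII strings
    that appeared in the corresponding input (hypothetical ground truth).
    """
    complete = sum(
        1 for text, spans in zip(outputs, pii_spans)
        if not any(span in text for span in spans)
    )
    return complete / len(outputs)

outputs = [
    "Patient [FIRSTNAME_1] [LASTNAME_1] was admitted.",  # fully scrubbed
    "Patient [FIRSTNAME_1] Smith was admitted.",         # surname leaked
]
pii_spans = [["John", "Smith"], ["John", "Smith"]]
print(completeness_score(outputs, pii_spans))  # 0.5
```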

### Performance Insights

}
```

## 🤝 Community & Support

- **Website**: [minibase.ai](https://minibase.ai)
- **Discord**: [Join our community](https://discord.com/invite/BrJn4D2Guh)
- **Documentation**: [docs.minibase.ai](https://docs.minibase.ai)

## 📄 License
