# spaCy NER Implementation
## Why spaCy for NER?
Using **spaCy's Named Entity Recognition (NER)** gives cleaner results than regex-only cleaning because:
1. **Intelligent entity extraction**: Recognizes PERSON entities using machine learning
2. **Context-aware**: Understands sentence structure and context
3. **Robust**: Handles various name formats (first, last, full, stage names)
4. **Language support**: spaCy ships models for many languages; the English model used here also works reasonably with romanized names
5. **Industry standard**: Used in production NLP applications
## How It Works
### Pipeline Overview
```
Original Name
↓
1. Translate Leetspeak (4→a, 3→e, 1→i)
↓
2. Remove Noise (emoji, LoRA terms, versions)
↓
3. spaCy NER - Extract PERSON entities
↓
4. Fallback to capitalized words if needed
↓
Cleaned Name
```
### Detailed Steps
#### Step 1: Leetspeak Translation
```python
"4kira LoRA v2" → "akira LoRA v2"
"1rene Model" → "irene Model"
"3mma Watson" → "emma Watson"
```
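A minimal sketch of this step, assuming only the three substitutions listed in the pipeline overview, and assuming a digit is translated only when a letter follows it so that version suffixes such as `v1` or `v3.5` are left alone. The names `LEET_MAP` and `translate_leetspeak` are illustrative and not taken from the notebook.

```python
import re

# Assumed mapping: only the substitutions named in the pipeline overview.
LEET_MAP = {"4": "a", "3": "e", "1": "i"}

# Match a mapped digit only when a letter follows it, so "v1" and "v3.5" survive.
_LEET_RE = re.compile(r"[431](?=[A-Za-z])")


def translate_leetspeak(text: str) -> str:
    """Replace leetspeak digits that sit directly before a letter."""
    return _LEET_RE.sub(lambda m: LEET_MAP[m.group(0)], text)


print(translate_leetspeak("4kira LoRA v2"))   # akira LoRA v2
print(translate_leetspeak("K4te Middleton"))  # Kate Middleton
```

The lookahead is a design choice for this sketch: it trades a few missed substitutions (a digit at the end of a word) for not corrupting version numbers.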
#### Step 2: Noise Removal
```python
"akira LoRA v2" → "akira"
"irene Model" → "irene"
"emma Watson" → "emma Watson"
```
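The noise-removal step could look roughly like the sketch below. The noise-term list (`lora`, `model`), the bracket handling, the emoji ranges, and the name `remove_noise` are assumptions inferred from the examples in this document; the notebook's actual patterns may differ.

```python
import re

# Bracketed metadata such as "(JG)" or "【LoRa】" (ASCII and full-width brackets).
_BRACKETED = re.compile(r"[\(\[\{【（][^\)\]\}】）]*[\)\]\}】）]")
# Version tags such as "v2" or "v3.5".
_VERSION = re.compile(r"\bv\d+(?:\.\d+)*\b", re.IGNORECASE)
# Assumed noise terms; extend as needed.
_NOISE = re.compile(r"\b(?:lora|model)\b", re.IGNORECASE)
# Rough emoji ranges (pictographs and miscellaneous symbols), not exhaustive.
_EMOJI = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF]")


def remove_noise(text: str) -> str:
    """Strip metadata, versions, noise terms and emoji from a raw name."""
    text = text.split("|")[0]           # keep only the part before the first "|"
    for pattern in (_BRACKETED, _VERSION, _NOISE, _EMOJI):
        text = pattern.sub(" ", text)
    return " ".join(text.split())       # collapse leftover whitespace


print(remove_noise("Emma Watson (JG) v3.5"))  # Emma Watson
print(remove_noise("akira LoRA v2"))          # akira
```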
#### Step 3: spaCy NER
```python
nlp("akira")
# Entities: [("akira", PERSON)]
# Result: "akira"
nlp("emma Watson")
# Entities: [("emma Watson", PERSON)]
# Result: "emma Watson"
```
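As a sketch of how the result might be picked from the entities, the helper below returns the first PERSON span. Taking the first entity (rather than, say, the longest) is an assumption, and `first_person_entity` is a hypothetical name. A `SimpleNamespace` stands in for a spaCy `Doc` so the example runs without a model installed.

```python
from types import SimpleNamespace


def first_person_entity(doc):
    """Return the text of the first PERSON entity in `doc`, or None."""
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            return ent.text
    return None


# Stand-in for a spaCy Doc: only the .ents, .text and .label_ attributes are used.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Acme", label_="ORG"),
    SimpleNamespace(text="Emma Watson", label_="PERSON"),
])
print(first_person_entity(fake_doc))  # Emma Watson
```

With a loaded pipeline the call would be `first_person_entity(nlp("emma Watson"))`; whether a given string is tagged as PERSON depends on the model.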
#### Step 4: Fallback
If spaCy doesn't find a PERSON entity:
- Extract capitalized words (likely names)
- Or return cleaned text as-is
## Examples
### Case 1: Simple Name
```
Input: "IU"
Output: "IU"
Process:
- Preprocess: "IU" (no noise)
- spaCy NER: Recognizes "IU" as PERSON
- Result: "IU"
```
### Case 2: Name with LoRA Terms
```
Input: "Scarlett Johansson【LoRa】"
Output: "Scarlett Johansson"
Process:
- Preprocess: "Scarlett Johansson" (removed 【LoRa】)
- spaCy NER: Recognizes "Scarlett Johansson" as PERSON
- Result: "Scarlett Johansson"
```
### Case 3: Leetspeak Name
```
Input: "4kira Anime Character v1"
Output: "akira"
Process:
- Leetspeak: "akira Anime Character v1"
- Preprocess: "akira Anime Character"
- spaCy NER: Recognizes "akira" as PERSON
- Result: "akira"
```
### Case 4: Complex Format
```
Input: "Gakki | Aragaki Yui | 新垣結衣"
Output: "Gakki"
Process:
- Preprocess: "Gakki" (kept first part before |)
- spaCy NER: Recognizes "Gakki" as PERSON
- Result: "Gakki"
```
### Case 5: With Metadata
```
Input: "Emma Watson (JG) v3.5"
Output: "Emma Watson"
Process:
- Preprocess: "Emma Watson" (removed (JG) and v3.5)
- spaCy NER: Recognizes "Emma Watson" as PERSON
- Result: "Emma Watson"
```
## Advantages Over Regex-Only
### Old Approach (Regex Only)
```python
# Just remove noise and hope for the best
name = remove_noise(name)
name = name.strip()
# Result: May include non-name words
```
Problems:
- Can't distinguish names from other capitalized words
- May include words like "Model", "Anime", "Character"
- No context awareness
- Language-dependent regex patterns needed
### New Approach (spaCy NER)
```python
# Intelligent entity extraction
preprocessed = remove_noise(name)
doc = nlp(preprocessed)
person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
# Result: Only actual person names
```
Benefits:
- ✅ Identifies actual person entities
- ✅ Ignores non-person words
- ✅ Context-aware (understands "Emma Watson" is one entity)
- ✅ Multi-language support
- ✅ Handles various name formats
## Comparison Examples
| Input | Regex Only | spaCy NER |
|-------|------------|-----------|
| `"Emma Watson Model"` | `"Emma Watson Model"` ❌ | `"Emma Watson"` ✅ |
| `"Anime Character Levi"` | `"Anime Character Levi"` ❌ | `"Levi"` ✅ |
| `"Taylor Swift v2"` | `"Taylor Swift"` ✅ | `"Taylor Swift"` ✅ |
| `"K4te Middleton"` | `"K4te Middleton"` ❌ | `"Kate Middleton"` ✅ |
| `"Celebrity IU"` | `"Celebrity IU"` ❌ | `"IU"` ✅ |
## spaCy Model Information
### Model Used
- **Name**: `en_core_web_sm`
- **Language**: English (but works reasonably with romanized names)
- **Size**: ~13 MB
- **Entities**: Recognizes PERSON, ORG, GPE, etc.
### Installation
```bash
# Install spaCy
pip install spacy
# Download model
python -m spacy download en_core_web_sm
```
The notebook automatically downloads the model if not found.
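A load-or-download helper along these lines would give that behaviour. This is a sketch, not the notebook's code: `load_model` is a hypothetical name, and the `load` and `download` parameters exist only so the retry logic can be exercised without spaCy installed. By default they resolve to `spacy.load` and `spacy.cli.download`.

```python
def load_model(name="en_core_web_sm", load=None, download=None):
    """Load a spaCy pipeline, downloading it once if it is missing."""
    if load is None or download is None:
        # Imported lazily so the function can be tested with stand-ins.
        import spacy
        from spacy.cli import download as spacy_download
        load = load or spacy.load
        download = download or spacy_download
    try:
        return load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed.
        download(name)
        return load(name)


# Typical use (requires spaCy and network access on first run):
# nlp = load_model()
```

In some notebook environments a freshly downloaded model may only become importable after a kernel restart, so the second `load` call can still fail there.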
### Performance
- **Speed**: ~1000-5000 docs/second
- **Accuracy**: High for common names
- **Memory**: Low (~100MB loaded)
## Fallback Strategy
If spaCy doesn't recognize a PERSON entity:
1. **Extract capitalized words**:
```python
"Unknown name here" → ["Unknown"]
```
2. **Return first few capitalized words**:
```python
"Celebrity Model Actor" → "Celebrity Model Actor"
```
3. **Last resort**: Return cleaned text as-is
This ensures we always get something, even for:
- Uncommon/rare names
- Nicknames
- Non-English names
- Stage names
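The three fallback levels above can be sketched as a single function. The limit of three words for "first few" and the name `fallback_name` are assumptions made for illustration.

```python
import re

# A capitalized word: an uppercase ASCII letter followed by word characters,
# apostrophes or hyphens (so "O'Brien" and "IU" both count).
_CAP_WORD = re.compile(r"\b[A-Z][\w'-]*")


def fallback_name(cleaned: str, max_words: int = 3) -> str:
    """Return up to `max_words` capitalized words, else the cleaned text as-is."""
    words = _CAP_WORD.findall(cleaned)
    if words:
        return " ".join(words[:max_words])
    return cleaned.strip()


print(fallback_name("the great Gakki"))    # Gakki
print(fallback_name("unknown name here"))  # unknown name here
```

Because the pattern only matches ASCII capitals, names written in other scripts fall through to the last resort and are returned unchanged.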
## Testing
### How to Verify spaCy is Working
Run Cell 5 and check the output:
```
✅ spaCy model loaded: en_core_web_sm
Name cleaning examples (with spaCy NER):
===================================================================================================
Original Name | Cleaned Name
===================================================================================================
Scarlett Johansson【LoRa】 | Scarlett Johansson
Emma Watson (JG) | Emma Watson
IU | IU
Belle Delphine | Belle Delphine
...
```
### Key Indicators
✅ **Good signs**:
- Person names cleanly extracted
- No extra words like "Model", "LoRA", "Celebrity"
- Multi-word names kept together (e.g., "Emma Watson" not just "Emma")
❌ **Issues to watch**:
- Empty results (extend the fallback logic)
- Partial names (e.g., only the first name)
- Non-names included (tune the preprocessing)
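Two of these checks are easy to automate. The sketch below flags empty results and leftover noise words; the word list and the name `find_issues` are assumptions, and partial names still need a manual look because they cannot be detected from the cleaned string alone.

```python
def find_issues(cleaned, noise_words=("model", "lora", "celebrity")):
    """Return a list of problems spotted in one cleaned name."""
    issues = []
    if not cleaned.strip():
        issues.append("empty")
    # Case-insensitive check for noise words that survived cleaning.
    leftovers = [w for w in cleaned.split() if w.lower() in noise_words]
    if leftovers:
        issues.append("noise: " + ", ".join(leftovers))
    return issues


print(find_issues("Emma Watson"))        # []
print(find_issues("Emma Watson Model"))  # ['noise: Model']
```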
## Customization
### Add More Languages
For better support of non-English names:
```python
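# Download the multilingual model first (shell command):
#   python -m spacy download xx_ent_wiki_sm
import spacy

# Load it in place of en_core_web_sm
nlp = spacy.load("xx_ent_wiki_sm")
# Note: this model labels people as "PER", so filter on "PER" instead of "PERSON"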
```
### Adjust Entity Extraction
To extract other entities:
```python
# Extract organizations too
entities = [ent.text for ent in doc.ents
            if ent.label_ in ["PERSON", "ORG"]]
```
### Custom Entity Rules
Add custom patterns for names spaCy might miss:
```python
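from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# Hypothetical pattern: two consecutive title-cased tokens, e.g. "Belle Delphine"
matcher.add("FULL_NAME", [[{"IS_TITLE": True}, {"IS_TITLE": True}]])
matches = matcher(doc)  # list of (match_id, start, end) tuples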
```
## Benefits for This Project
### Better Person Identification
With cleaner names:
- LLMs receive recognizable names
- "Emma Watson" instead of "Emma Watson Model LoRA v3"
- Better identification accuracy
### Reduced Ambiguity
spaCy helps distinguish:
- Person names vs. descriptive words
- "Celebrity IU" → "IU" (person)
- "Model Bella" → "Bella" (person)
### Improved Context for LLMs
Cleaner input = better prompts:
```
Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..."
After: "Given 'Emma Watson' (actress)..."
```
The LLM can now focus on identifying the person, not parsing the noise.
## Summary
✅ **spaCy NER** provides intelligent, context-aware name extraction
✅ **Better than regex** for handling complex name formats
✅ **Fallback strategy** ensures we always get a result
✅ **Industry standard** tool used in production NLP
✅ **Easy to use** with minimal code
The combination of:
1. Leetspeak translation
2. Noise removal
3. spaCy NER
4. Smart fallbacks
...results in clean, accurate person names ready for LLM annotation!