Laura Wagner
# spaCy NER Implementation
## Why spaCy for NER?
Using **spaCy's Named Entity Recognition (NER)** is significantly better than regex-based cleaning because:
1. **Intelligent entity extraction**: Recognizes PERSON entities using machine learning
2. **Context-aware**: Understands sentence structure and context
3. **Robust**: Handles various name formats (first, last, full, stage names)
4. **Language support**: spaCy provides models for many languages, although the `en_core_web_sm` model used here is trained on English
5. **Industry standard**: Used in production NLP applications
## How It Works
### Pipeline Overview
```
Original Name
↓
1. Translate Leetspeak (4→a, 3→e, 1→i)
↓
2. Remove Noise (emoji, LoRA terms, versions)
↓
3. spaCy NER - Extract PERSON entities
↓
4. Fallback to capitalized words if needed
↓
Cleaned Name
```
### Detailed Steps
#### Step 1: Leetspeak Translation
```python
"4kira LoRA v2" → "akira LoRA v2"
"1rene Model" → "irene Model"
"3mma Watson" → "emma Watson"
```
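A minimal sketch of this step, assuming the mapping shown in the pipeline overview (4→a, 3→e, 1→i). The function name `translate_leetspeak` and the lookahead rule (translate a digit only when a letter follows it) are assumptions made here so that version tags such as `v1` or `v3.5` stay untouched; the notebook's actual rule may differ.

```python
import re

# Mapping taken from the pipeline overview; extend as needed.
LEET_MAP = {"4": "a", "3": "e", "1": "i"}

# Assumption: only translate a digit that is directly followed by a letter,
# so "4kira" and "K4te" change while "v1" and "v3.5" are left alone.
_LEET_RE = re.compile(r"[431](?=[A-Za-z])")


def translate_leetspeak(text: str) -> str:
    """Replace leetspeak digits inside words with their letters."""
    return _LEET_RE.sub(lambda m: LEET_MAP[m.group(0)], text)


print(translate_leetspeak("4kira LoRA v2"))
print(translate_leetspeak("K4te Middleton"))
```

A plain `str.translate` over the whole string would be simpler, but it would also rewrite digits inside version tags before the noise-removal step sees them.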
#### Step 2: Noise Removal
```python
"akira LoRA v2" → "akira"
"irene Model" → "irene"
"emma Watson" → "emma Watson"
```
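The noise-removal step could look roughly like the sketch below. The term list `NOISE_TERMS`, the exact regular expressions, and the rule of keeping only the text before the first `|` are assumptions inferred from the examples in this document, not the notebook's confirmed implementation.

```python
import re

# Hypothetical noise vocabulary; tune it to the dataset.
NOISE_TERMS = ("lora", "model")

# Bracketed metadata such as 「LoRa」, (JG) or [v2].
_BRACKETED = re.compile(r"「[^」]*」|\([^)]*\)|\[[^\]]*\]")
# Version tags such as v2 or v3.5.
_VERSION = re.compile(r"\bv\d+(?:\.\d+)*\b", re.IGNORECASE)
_NOISE = re.compile(r"\b(?:" + "|".join(NOISE_TERMS) + r")\b", re.IGNORECASE)
# Anything that is not a letter, digit, whitespace, period, apostrophe or hyphen
# (this also drops emoji and other symbols).
_SYMBOLS = re.compile(r"[^\w\s.'-]")


def remove_noise(text: str) -> str:
    """Strip metadata, version tags, noise terms and symbols from a raw name."""
    text = text.split("|", 1)[0]       # keep the part before the first "|"
    text = _BRACKETED.sub(" ", text)
    text = _VERSION.sub(" ", text)
    text = _NOISE.sub(" ", text)
    text = _SYMBOLS.sub(" ", text)
    return " ".join(text.split())      # collapse repeated whitespace


print(remove_noise("Emma Watson (JG) v3.5"))
print(remove_noise("Scarlett Johansson「LoRa」"))
```

Bracketed segments are removed before the noise terms so that a tag such as `「LoRa」` disappears together with its brackets.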
#### Step 3: spaCy NER
```python
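doc = nlp("akira")
# Expected entities: [("akira", "PERSON")]
# Result: "akira"
doc = nlp("emma Watson")
# Expected entities: [("emma Watson", "PERSON")]
# Result: "emma Watson"
# Note: actual entities depend on the model; lowercase or single-token names may be missed, in which case Step 4 applies.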
```
#### Step 4: Fallback
If spaCy doesn't find a PERSON entity:
- Extract capitalized words (likely names)
- Or return cleaned text as-is
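Steps 3 and 4 can be sketched together as one function. Here `nlp` is any callable that returns an object with an `ents` attribute, which is how a loaded spaCy pipeline behaves; the function name `extract_person`, the choice of the first PERSON entity, and the capitalized-word regular expression are assumptions for illustration.

```python
import re
from typing import Callable

# Assumed heuristic: a run of consecutive capitalized words, e.g. "Belle Delphine".
_CAPITALIZED = re.compile(r"\b[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*")


def extract_person(text: str, nlp: Callable) -> str:
    """Return the first PERSON entity, else capitalized words, else the text."""
    doc = nlp(text)
    persons = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    if persons:
        return persons[0]

    # Fallback 1: the first run of capitalized words (likely a name).
    match = _CAPITALIZED.search(text)
    if match:
        return match.group(0)

    # Fallback 2: return the cleaned text unchanged.
    return text.strip()
```

With spaCy installed, this would be called as `extract_person("Emma Watson", spacy.load("en_core_web_sm"))`. Passing the pipeline in as an argument also makes the fallbacks easy to exercise with a stub in place of the model.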
## Examples
### Case 1: Simple Name
```
Input: "IU"
Output: "IU"
Process:
- Preprocess: "IU" (no noise)
- spaCy NER: Recognizes "IU" as PERSON
- Result: "IU"
```
### Case 2: Name with LoRA Terms
```
Input: "Scarlett Johansson「LoRa」"
Output: "Scarlett Johansson"
Process:
- Preprocess: "Scarlett Johansson" (removed 「LoRa」)
- spaCy NER: Recognizes "Scarlett Johansson" as PERSON
- Result: "Scarlett Johansson"
```
### Case 3: Leetspeak Name
```
Input: "4kira Anime Character v1"
Output: "akira"
Process:
- Leetspeak: "akira Anime Character v1"
- Preprocess: "akira Anime Character"
- spaCy NER: Recognizes "akira" as PERSON
- Result: "akira"
```
### Case 4: Complex Format
```
Input: "Gakki | Aragaki Yui | 新垣結衣"
Output: "Gakki"
Process:
- Preprocess: "Gakki" (kept first part before |)
- spaCy NER: Recognizes "Gakki" as PERSON
- Result: "Gakki"
```
### Case 5: With Metadata
```
Input: "Emma Watson (JG) v3.5"
Output: "Emma Watson"
Process:
- Preprocess: "Emma Watson" (removed (JG) and v3.5)
- spaCy NER: Recognizes "Emma Watson" as PERSON
- Result: "Emma Watson"
```
## Advantages Over Regex-Only
### Old Approach (Regex Only)
```python
# Just remove noise and hope for the best
name = remove_noise(name)
name = name.strip()
# Result: May include non-name words
```
Problems:
- Can't distinguish names from other capitalized words
- May include words like "Model", "Anime", "Character"
- No context awareness
- Language-dependent regex patterns needed
### New Approach (spaCy NER)
```python
# Intelligent entity extraction
preprocessed = remove_noise(name)
doc = nlp(preprocessed)
person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
# Result: Only actual person names
```
Benefits:
- ✅ Identifies actual person entities
- ✅ Ignores non-person words
- ✅ Context-aware (understands "Emma Watson" is one entity)
- ✅ Multi-language support
- ✅ Handles various name formats
## Comparison Examples
| Input | Regex Only | spaCy NER |
|-------|------------|-----------|
| `"Emma Watson Model"` | `"Emma Watson Model"` ❌ | `"Emma Watson"` ✅ |
| `"Anime Character Levi"` | `"Anime Character Levi"` ❌ | `"Levi"` ✅ |
| `"Taylor Swift v2"` | `"Taylor Swift"` ✅ | `"Taylor Swift"` ✅ |
| `"K4te Middleton"` | `"K4te Middleton"` ❌ | `"Kate Middleton"` ✅ |
| `"Celebrity IU"` | `"Celebrity IU"` ❌ | `"IU"` ✅ |
## spaCy Model Information
### Model Used
- **Name**: `en_core_web_sm`
- **Language**: English (but works reasonably with romanized names)
- **Size**: ~13 MB
- **Entities**: Recognizes PERSON, ORG, GPE, etc.
### Installation
```bash
# Install spaCy
pip install spacy
# Download model
python -m spacy download en_core_web_sm
```
The notebook automatically downloads the model if not found.
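A load-or-download helper along these lines would give that behavior. This is a sketch, not the notebook's actual cell: the function name `load_spacy_model` is hypothetical, while `spacy.load` and `spacy.cli.download` are real spaCy calls. In some notebook environments a kernel restart may still be needed after the download.

```python
def load_spacy_model(name: str = "en_core_web_sm"):
    """Load a spaCy pipeline, downloading it first if it is not installed."""
    # Imported here so the helper can be defined even before spaCy is installed.
    import spacy

    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package cannot be found.
        from spacy.cli import download

        download(name)
        return spacy.load(name)
```

Calling `nlp = load_spacy_model()` once at the top of the notebook keeps the download out of the name-cleaning code.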
### Performance
- **Speed**: roughly 1,000-5,000 short docs/second (hardware-dependent)
- **Accuracy**: High for common English names; lower for rare, lowercase, or non-English names
- **Memory**: Low (~100 MB loaded)
## Fallback Strategy
If spaCy doesn't recognize a PERSON entity:
1. **Extract capitalized words**:
```python
"Unknown name here" → ["Unknown"]
```
2. **Return first few capitalized words**:
```python
"Celebrity Model Actor" → "Celebrity Model Actor"
```
3. **Last resort**: Return cleaned text as-is
This ensures we always get something, even for:
- Uncommon/rare names
- Nicknames
- Non-English names
- Stage names
## Testing
### How to Verify spaCy is Working
Run Cell 5 and check the output:
```
✅ spaCy model loaded: en_core_web_sm
📊 Name cleaning examples (with spaCy NER):
===================================================================================================
Original Name | Cleaned Name
===================================================================================================
Scarlett Johansson「LoRa」 | Scarlett Johansson
Emma Watson (JG) | Emma Watson
IU | IU
Belle Delphine | Belle Delphine
...
```
### Key Indicators
✅ **Good signs**:
- Person names cleanly extracted
- No extra words like "Model", "LoRA", "Celebrity"
- Multi-word names kept together (e.g., "Emma Watson" not just "Emma")
❌ **Issues to watch**:
- Empty results (extend the fallback logic)
- Partial names (e.g., only first name)
- Non-names included (tune preprocessing)
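Two of these checks are easy to automate over the cleaned output. The sketch below is a hypothetical helper (`flag_issues` and `NOISE_WORDS` are names introduced here); it flags empty results and leftover noise words, while partial names still need manual review.

```python
from typing import List

# Words that, per the "good signs" list, should not survive cleaning.
NOISE_WORDS = {"model", "lora", "celebrity"}


def flag_issues(cleaned: str) -> List[str]:
    """Return warnings for one cleaned name; an empty list means it looks fine."""
    issues = []
    words = cleaned.split()
    if not words:
        issues.append("empty result")
    leftover = [w for w in words if w.lower() in NOISE_WORDS]
    if leftover:
        issues.append("noise words left: " + ", ".join(leftover))
    return issues


print(flag_issues("Emma Watson Model"))
```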
## Customization
### Add More Languages
For better support of non-English names:
```python
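# Download the multilingual model first (run this in a shell, not in Python):
#   python -m spacy download xx_ent_wiki_sm
import spacy

# Load the multilingual model in code
nlp = spacy.load("xx_ent_wiki_sm")
# Note: this model labels people as "PER" rather than "PERSON",
# so the entity filter must check for "PER" when it is used.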
```
### Adjust Entity Extraction
To extract other entities:
```python
# Extract organizations too
entities = [ent.text for ent in doc.ents
            if ent.label_ in ("PERSON", "ORG")]
```
### Custom Entity Rules
Add custom patterns for names spaCy might miss:
```python
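from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
# Illustrative pattern (an assumption, not the notebook's rule): two consecutive title-cased tokens
matcher.add("TWO_WORD_NAME", [[{"IS_TITLE": True}, {"IS_TITLE": True}]])
# Each match is a (match_id, start, end) tuple of token offsets
matches = matcher(nlp("Belle Delphine"))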
```
## Benefits for This Project
### Better Person Identification
With cleaner names:
- LLMs receive recognizable names
- "Emma Watson" instead of "Emma Watson Model LoRA v3"
- Better identification accuracy
### Reduced Ambiguity
spaCy helps distinguish:
- Person names vs. descriptive words
- "Celebrity IU" → "IU" (person)
- "Model Bella" → "Bella" (person)
### Improved Context for LLMs
Cleaner input = better prompts:
```
Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..."
After: "Given 'Emma Watson' (actress)..."
```
The LLM can now focus on identifying the person, not parsing the noise.
## Summary
✅ **spaCy NER** provides intelligent, context-aware name extraction
✅ **Better than regex** for handling complex name formats
✅ **Fallback strategy** ensures we always get a result
✅ **Industry standard** tool used in production NLP
✅ **Easy to use** with minimal code
The combination of:
1. Leetspeak translation
2. Noise removal
3. spaCy NER
4. Smart fallbacks
...results in clean, accurate person names ready for LLM annotation!