spaCy NER Implementation
Why spaCy for NER?
spaCy's Named Entity Recognition (NER) is a better fit for name cleaning than regex-only rules because:
- Intelligent entity extraction: Recognizes PERSON entities using machine learning
- Context-aware: Understands sentence structure and context
- Robust: Handles various name formats (first, last, full, stage names)
- Language support: spaCy provides models for many languages, and the English model used here works reasonably with romanized names
- Industry standard: Used in production NLP applications
How It Works
Pipeline Overview
Original Name
↓
1. Translate Leetspeak (4→a, 3→e, 1→i)
↓
2. Remove Noise (emoji, LoRA terms, versions)
↓
3. spaCy NER - Extract PERSON entities
↓
4. Fallback to capitalized words if needed
↓
Cleaned Name
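The control flow of the overview above can be sketched as a small driver function. This is only an illustration: `clean_name` and its four stage parameters are hypothetical names, and each stage is passed in as a callable so the sketch makes no claim about how the notebook implements them.

```python
from typing import Callable, Optional


def clean_name(
    raw: str,
    translate: Callable[[str], str],
    denoise: Callable[[str], str],
    find_person: Callable[[str], Optional[str]],
    fallback: Callable[[str], str],
) -> str:
    """Run the four stages in order; use the fallback only if NER finds nothing."""
    text = translate(raw)        # 1. leetspeak digits -> letters
    text = denoise(text)         # 2. strip emoji, LoRA terms, versions
    person = find_person(text)   # 3. NER: a PERSON entity text, or None
    if person:
        return person
    return fallback(text)        # 4. capitalized words or the cleaned text
```

Keeping the stages as parameters makes the ordering explicit and lets each stage be tested on its own.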
Detailed Steps
Step 1: Leetspeak Translation
"4kira LoRA v2" → "akira LoRA v2"
"1rene Model" → "irene Model"
"3mma Watson" → "emma Watson"
Step 2: Noise Removal
"akira LoRA v2" → "akira"
"irene Model" → "irene"
"emma Watson" → "emma Watson"
Step 3: spaCy NER
nlp("akira")
# Entities: [("akira", PERSON)]
# Result: "akira"
nlp("emma Watson")
# Entities: [("emma Watson", PERSON)]
# Result: "emma Watson"
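A minimal sketch of the entity-selection part of this step follows. `first_person` is a hypothetical helper name. It relies only on the `ents`, `label_`, and `text` attributes that spaCy exposes on a parsed document, so it works on any object with that shape.

```python
from typing import Iterable, Optional


def first_person(doc, labels: Iterable[str] = ("PERSON",)) -> Optional[str]:
    """Return the text of the first entity whose label is in `labels`, else None."""
    wanted = set(labels)
    for ent in doc.ents:
        if ent.label_ in wanted:
            return ent.text
    return None
```

With a loaded pipeline this would be called as `first_person(nlp("emma Watson"))`. Whether the small English model tags a lowercase or single-token name as PERSON can vary between model versions, which is why the fallback step exists.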
Step 4: Fallback
If spaCy doesn't find a PERSON entity:
- Extract capitalized words (likely names)
- Or return cleaned text as-is
Examples
Case 1: Simple Name
Input: "IU"
Output: "IU"
Process:
- Preprocess: "IU" (no noise)
- spaCy NER: Recognizes "IU" as PERSON
- Result: "IU"
Case 2: Name with LoRA Terms
Input: "Scarlett Johansson【LoRa】"
Output: "Scarlett Johansson"
Process:
- Preprocess: "Scarlett Johansson" (removed 【LoRa】)
- spaCy NER: Recognizes "Scarlett Johansson" as PERSON
- Result: "Scarlett Johansson"
Case 3: Leetspeak Name
Input: "4kira Anime Character v1"
Output: "akira"
Process:
- Leetspeak: "akira Anime Character v1"
- Preprocess: "akira Anime Character"
- spaCy NER: Recognizes "akira" as PERSON
- Result: "akira"
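The leetspeak step in this case can be sketched with a single regular expression. The mapping (4→a, 3→e, 1→i) comes from the pipeline overview. The rule that a digit is translated only when a letter directly follows it is an assumption made here so that version tags such as "v1" or "v3.5" are left alone, and `translate_leetspeak` is a hypothetical name.

```python
import re

# Mapping taken from the pipeline overview: 4->a, 3->e, 1->i
LEET_MAP = {"4": "a", "3": "e", "1": "i"}

# Assumption: translate a digit only when a letter follows it,
# so "v1" and "v3.5" are not rewritten.
_LEET_RE = re.compile(r"[431](?=[A-Za-z])")


def translate_leetspeak(text: str) -> str:
    """Replace leetspeak digits that sit directly before a letter."""
    return _LEET_RE.sub(lambda m: LEET_MAP[m.group(0)], text)


print(translate_leetspeak("4kira Anime Character v1"))  # akira Anime Character v1
```

A stricter or looser rule may be needed for names that legitimately contain digits.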
Case 4: Complex Format
Input: "Gakki | Aragaki Yui | 新垣結衣"
Output: "Gakki"
Process:
- Preprocess: "Gakki" (kept first part before |)
- spaCy NER: Recognizes "Gakki" as PERSON
- Result: "Gakki"
Case 5: With Metadata
Input: "Emma Watson (JG) v3.5"
Output: "Emma Watson"
Process:
- Preprocess: "Emma Watson" (removed (JG) and v3.5)
- spaCy NER: Recognizes "Emma Watson" as PERSON
- Result: "Emma Watson"
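The preprocessing seen in Cases 2, 4, and 5 can be approximated with a few regular expressions. This is a sketch under assumptions rather than the notebook's actual code: the `NOISE_TERMS` list, the set of bracket characters, the version pattern, and the emoji ranges are all illustrative choices, and `remove_noise` is reused here only as a convenient name.

```python
import re

# Hypothetical noise vocabulary; extend it to match your own data.
NOISE_TERMS = ("lora", "model")

_BRACKETED = re.compile(r"[\(\[\{【「][^\)\]\}】」]*[\)\]\}】」]")   # (JG), 【LoRa】, ...
_VERSION = re.compile(r"\bv\d+(?:\.\d+)*\b", re.IGNORECASE)        # v2, v3.5
_NOISE = re.compile(r"\b(?:" + "|".join(NOISE_TERMS) + r")\b", re.IGNORECASE)
_EMOJI = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF]")        # rough emoji ranges


def remove_noise(text: str) -> str:
    """Strip aliases after '|', bracketed tags, emoji, versions, and noise terms."""
    text = text.split("|")[0]            # keep only the part before the first '|'
    text = _BRACKETED.sub(" ", text)
    text = _EMOJI.sub(" ", text)
    text = _VERSION.sub(" ", text)
    text = _NOISE.sub(" ", text)
    return " ".join(text.split())        # collapse leftover whitespace


print(remove_noise("Emma Watson (JG) v3.5"))  # Emma Watson
```

Words such as "Anime" or "Character" are deliberately left in place here, since the examples above rely on the NER step to drop them.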
Advantages Over Regex-Only
Old Approach (Regex Only)
# Just remove noise and hope for the best
name = remove_noise(name)
name = name.strip()
# Result: May include non-name words
Problems:
- Can't distinguish names from other capitalized words
- May include words like "Model", "Anime", "Character"
- No context awareness
- Language-dependent regex patterns needed
New Approach (spaCy NER)
# Intelligent entity extraction
preprocessed = remove_noise(name)
doc = nlp(preprocessed)
person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
# Result: Only actual person names
Benefits:
- ✓ Identifies actual person entities
- ✓ Ignores non-person words
- ✓ Context-aware (understands "Emma Watson" is one entity)
- ✓ Multi-language support
- ✓ Handles various name formats
Comparison Examples
| Input | Regex Only | spaCy NER |
|---|---|---|
| "Emma Watson Model" | "Emma Watson Model" ❌ | "Emma Watson" ✅ |
| "Anime Character Levi" | "Anime Character Levi" ❌ | "Levi" ✅ |
| "Taylor Swift v2" | "Taylor Swift" ✅ | "Taylor Swift" ✅ |
| "K4te Middleton" | "K4te Middleton" ❌ | "Kate Middleton" ✅ |
| "Celebrity IU" | "Celebrity IU" ❌ | "IU" ✅ |
spaCy Model Information
Model Used
- Name: en_core_web_sm
- Language: English (but works reasonably with romanized names)
- Size: ~13 MB
- Entities: Recognizes PERSON, ORG, GPE, etc.
Installation
# Install spaCy
pip install spacy
# Download model
python -m spacy download en_core_web_sm
The notebook automatically downloads the model if not found.
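One way such an automatic download could look is sketched below. This is an assumption about the approach, not the notebook's code: `load_model` is a hypothetical helper, while `spacy.load` (which raises `OSError` when the package is missing) and `spacy.cli.download` are existing spaCy functions.

```python
def load_model(name: str = "en_core_web_sm"):
    """Load a spaCy pipeline, downloading the package once if it is missing."""
    import spacy  # imported lazily so this module can be imported without spaCy

    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed
        from spacy.cli import download

        download(name)
        return spacy.load(name)
```

In some notebook environments a freshly downloaded model is only importable after the kernel restarts, so a retry after restart may still be needed.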
Performance
- Speed: ~1000-5000 docs/second
- Accuracy: High for common names
- Memory: Low (~100MB loaded)
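Throughput in this range generally depends on batching rather than calling the pipeline once per name. A minimal sketch using spaCy's `nlp.pipe` follows. The helper name `person_names_batch` and the batch size of 256 are illustrative assumptions.

```python
from typing import Iterable, List


def person_names_batch(nlp, texts: Iterable[str], batch_size: int = 256) -> List[List[str]]:
    """Return the PERSON entity texts for each input, in input order."""
    results = []
    # nlp.pipe streams the texts through the pipeline in batches
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([ent.text for ent in doc.ents if ent.label_ == "PERSON"])
    return results
```

Each inner list may be empty, in which case the fallback logic applies to that input.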
Fallback Strategy
If spaCy doesn't recognize a PERSON entity:
1. Extract capitalized words:
"Unknown name here" → ["Unknown"]
2. Return first few capitalized words:
"Celebrity Model Actor" → "Celebrity Model Actor"
3. Last resort: Return cleaned text as-is
This ensures we always get something, even for:
- Uncommon/rare names
- Nicknames
- Non-English names
- Stage names
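The fallback order described in this section can be sketched as follows. `fallback_name` is a hypothetical name, and the limit of three words is an assumption standing in for "first few".

```python
def fallback_name(cleaned: str, max_words: int = 3) -> str:
    """Prefer the first few capitalized words; otherwise return the text as-is."""
    words = cleaned.split()
    # Words in scripts without letter case (e.g. CJK) never count as capitalized
    capitalized = [w for w in words if w[0].isupper()]
    if capitalized:
        return " ".join(capitalized[:max_words])
    return cleaned.strip()  # last resort: the cleaned text unchanged


print(fallback_name("Celebrity Model Actor"))  # Celebrity Model Actor
```

Because the last branch returns the cleaned text unchanged, non-empty input always yields a non-empty result.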
Testing
How to Verify spaCy is Working
Run Cell 5 and check the output:
✅ spaCy model loaded: en_core_web_sm
Name cleaning examples (with spaCy NER):
===================================================================================================
Original Name | Cleaned Name
===================================================================================================
Scarlett Johansson【LoRa】 | Scarlett Johansson
Emma Watson (JG) | Emma Watson
IU | IU
Belle Delphine | Belle Delphine
...
Key Indicators
✓ Good signs:
- Person names cleanly extracted
- No extra words like "Model", "LoRA", "Celebrity"
- Multi-word names kept together (e.g., "Emma Watson" not just "Emma")
⚠ Issues to watch:
- Empty results (extend the fallback logic)
- Partial names (e.g., only first name)
- Non-names included (tune preprocessing)
Customization
Add More Languages
For better support of non-English names:
# Download multilingual model
python -m spacy download xx_ent_wiki_sm
# Use in code
nlp = spacy.load("xx_ent_wiki_sm")
# Note: this model labels people as "PER", not "PERSON", so filter on "PER" when using it
Adjust Entity Extraction
To extract other entities:
# Extract organizations too
entities = [ent.text for ent in doc.ents
            if ent.label_ in ["PERSON", "ORG"]]
Custom Entity Rules
Add custom patterns for names spaCy might miss:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
# Add patterns for specific name formats
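As one possible way to fill in such rules, the sketch below uses spaCy's `entity_ruler` pipeline component instead of the raw `Matcher`, because the ruler writes its matches straight into `doc.ents`. The helper name `add_person_patterns` is hypothetical, and splitting names on whitespace is an assumption that may not match spaCy's tokenization for hyphenated or apostrophized names.

```python
from typing import Iterable


def add_person_patterns(nlp, names: Iterable[str]):
    """Register case-insensitive token patterns for known names as PERSON entities."""
    # Place the ruler before the statistical NER (if present) so that the
    # recognizer keeps the spans the ruler has already set.
    kwargs = {"before": "ner"} if "ner" in nlp.pipe_names else {}
    ruler = nlp.add_pipe("entity_ruler", **kwargs)
    patterns = [
        {"label": "PERSON", "pattern": [{"LOWER": tok} for tok in name.lower().split()]}
        for name in names
    ]
    ruler.add_patterns(patterns)
    return ruler
```

A call such as `add_person_patterns(nlp, ["IU", "Gakki"])` would then mark those stage names as PERSON even when the statistical model misses them.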
Benefits for This Project
Better Person Identification
With cleaner names:
- LLMs receive recognizable names
- "Emma Watson" instead of "Emma Watson Model LoRA v3"
- Better identification accuracy
Reduced Ambiguity
spaCy helps distinguish:
- Person names vs. descriptive words
- "Celebrity IU" → "IU" (person)
- "Model Bella" → "Bella" (person)
Improved Context for LLMs
Cleaner input = better prompts:
Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..."
After: "Given 'Emma Watson' (actress)..."
The LLM can now focus on identifying the person, not parsing the noise.
Summary
- ✓ spaCy NER provides intelligent, context-aware name extraction
- ✓ Better than regex for handling complex name formats
- ✓ Fallback strategy ensures we always get a result
- ✓ Industry standard tool used in production NLP
- ✓ Easy to use with minimal code
The combination of:
- Leetspeak translation
- Noise removal
- spaCy NER
- Smart fallbacks
...results in clean, accurate person names ready for LLM annotation!