Laura Wagner
# spaCy NER Implementation
## Why spaCy for NER?
Using **spaCy's Named Entity Recognition (NER)** generally gives cleaner results than regex-only cleaning because:
1. **Intelligent entity extraction**: Recognizes PERSON entities using machine learning
2. **Context-aware**: Understands sentence structure and context
3. **Robust**: Handles various name formats (first, last, full, stage names)
4. **Language support**: spaCy ships models for many languages; the default `en_core_web_sm` is English-only but copes reasonably with romanized names
5. **Industry standard**: Used in production NLP applications
## How It Works
### Pipeline Overview
```
Original Name
      ↓
1. Translate Leetspeak (4→a, 3→e, 1→i)
      ↓
2. Remove Noise (emoji, LoRA terms, versions)
      ↓
3. spaCy NER - Extract PERSON entities
      ↓
4. Fallback to capitalized words if needed
      ↓
Cleaned Name
```
### Detailed Steps
#### Step 1: Leetspeak Translation
```
"4kira LoRA v2" → "akira LoRA v2"
"1rene Model" → "irene Model"
"3mma Watson" → "emma Watson"
```
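The substitutions above can be sketched in a few lines. This is a minimal sketch, not the notebook's actual code: the name `translate_leetspeak` is hypothetical, and the rule that a digit is translated only when a letter directly follows it is an assumption made here so that version tags such as `v2` survive until the noise-removal step.

```python
import re

# Digit-to-letter map taken from the pipeline overview (4→a, 3→e, 1→i).
LEET_MAP = {"4": "a", "3": "e", "1": "i"}

# Assumption: a digit counts as leetspeak only when a letter follows it,
# so "4kira" is translated but the digits in "v2" or "v3.5" are left alone.
_LEET_DIGIT = re.compile(r"[431](?=[A-Za-z])")


def translate_leetspeak(text: str) -> str:
    """Replace leetspeak digits that sit directly before a letter."""
    return _LEET_DIGIT.sub(lambda m: LEET_MAP[m.group()], text)


print(translate_leetspeak("4kira LoRA v2"))   # akira LoRA v2
print(translate_leetspeak("K4te Middleton"))  # Kate Middleton
```

The lookahead keeps the rule narrow on purpose. A broader rule would also rewrite digits that are real numbers, which is worse than missing an occasional leetspeak character.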
#### Step 2: Noise Removal
```
"akira LoRA v2" → "akira"
"irene Model" → "irene"
"emma Watson" → "emma Watson"
```
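One way the noise-removal step could look is shown below. Everything here is an assumption made for illustration: the function name `remove_noise` matches the snippets later in this document, but the term list, the bracket styles, the version pattern, and the emoji ranges are guesses based on the examples, not the notebook's real rules.

```python
import re

# Assumption: noise terms inferred from the Step 2 examples; extend as needed.
NOISE_TERMS = ("lora", "model")

# Bracketed tags such as (JG), [tag], 「LoRa」 or 【tag】.
_BRACKETED = re.compile(r"\([^)]*\)|\[[^\]]*\]|「[^」]*」|【[^】]*】")
# Version markers such as v2 or v3.5.
_VERSION = re.compile(r"\bv\d+(?:\.\d+)*\b", re.IGNORECASE)
_NOISE = re.compile(r"\b(?:" + "|".join(NOISE_TERMS) + r")\b", re.IGNORECASE)
# Rough emoji coverage (pictographs, dingbats, variation selector); not exhaustive.
_EMOJI = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F]")


def remove_noise(text: str) -> str:
    """Strip tags, versions, noise terms and emoji, then tidy whitespace."""
    text = text.split("|")[0]  # keep only the part before the first "|"
    for pattern in (_BRACKETED, _VERSION, _NOISE, _EMOJI):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(remove_noise("akira LoRA v2"))            # akira
print(remove_noise("Emma Watson (JG) v3.5"))    # Emma Watson
```

Each pattern substitutes a space rather than an empty string, so removing a tag in the middle of a name never glues two words together.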
#### Step 3: spaCy NER
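```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Intended results; what the model actually returns can vary by version.
doc = nlp("akira")
# Entities: [("akira", PERSON)]
# Result: "akira"
doc = nlp("emma Watson")
# Entities: [("emma Watson", PERSON)]
# Result: "emma Watson"
```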
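The entity lookup itself can be wrapped in a small helper. The sketch below is illustrative only: `extract_person` and `load_model` are hypothetical names, and returning the first PERSON entity is an assumption, since the document does not say which entity wins when several are found. The pipeline is passed in as an argument so the helper can be tried with a stand-in object before spaCy is installed.

```python
from types import SimpleNamespace
from typing import Callable, Optional


def extract_person(nlp: Callable, text: str, label: str = "PERSON") -> Optional[str]:
    """Return the text of the first entity with the given label, or None."""
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == label:
            return ent.text
    return None


def load_model(name: str = "en_core_web_sm"):
    """Load a spaCy pipeline; imported lazily so extract_person has no hard dependency."""
    import spacy
    return spacy.load(name)


# Stand-in pipeline for illustration only; it mimics the doc.ents interface.
def fake_nlp(text: str):
    found = "Emma Watson" in text
    ents = [SimpleNamespace(text="Emma Watson", label_="PERSON")] if found else []
    return SimpleNamespace(ents=ents)


print(extract_person(fake_nlp, "Emma Watson"))  # Emma Watson
print(extract_person(fake_nlp, "Anime Style"))  # None
```

With spaCy and the model installed, `extract_person(load_model(), "emma Watson")` runs the same logic against the real pipeline.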
#### Step 4: Fallback
If spaCy doesn't find a PERSON entity:
- Extract capitalized words (likely names)
- Or return cleaned text as-is
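The four steps chain together in a fixed order, with the fallback used only when NER returns nothing. A minimal sketch of that control flow follows; `clean_name` and its parameter names are hypothetical, and each step is injected as a callable so the ordering can be seen without committing to any particular implementation.

```python
from typing import Callable, Optional


def clean_name(
    raw: str,
    translate: Callable[[str], str],
    denoise: Callable[[str], str],
    find_person: Callable[[str], Optional[str]],
    fallback: Callable[[str], str],
) -> str:
    """Run the four steps in order; fall back only when NER finds no person."""
    text = denoise(translate(raw))
    person = find_person(text)
    return person if person else fallback(text)


# Trivial stand-ins for each step, for illustration only.
result = clean_name(
    "4kira LoRA",
    translate=lambda s: s.replace("4", "a"),
    denoise=lambda s: s.replace("LoRA", "").strip(),
    find_person=lambda s: None,  # pretend NER found nothing
    fallback=lambda s: s,        # last resort: cleaned text as-is
)
print(result)  # akira
```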
## Examples
### Case 1: Simple Name
```
Input: "IU"
Output: "IU"
Process:
- Preprocess: "IU" (no noise)
- spaCy NER: Recognizes "IU" as PERSON
- Result: "IU"
```
### Case 2: Name with LoRA Terms
```
Input: "Scarlett Johansson「LoRa」"
Output: "Scarlett Johansson"
Process:
- Preprocess: "Scarlett Johansson" (removed 「LoRa」)
- spaCy NER: Recognizes "Scarlett Johansson" as PERSON
- Result: "Scarlett Johansson"
```
### Case 3: Leetspeak Name
```
Input: "4kira Anime Character v1"
Output: "akira"
Process:
- Leetspeak: "akira Anime Character v1"
- Preprocess: "akira Anime Character"
- spaCy NER: Recognizes "akira" as PERSON
- Result: "akira"
```
### Case 4: Complex Format
```
Input: "Gakki | Aragaki Yui | 新垣結衣"
Output: "Gakki"
Process:
- Preprocess: "Gakki" (kept first part before |)
- spaCy NER: Recognizes "Gakki" as PERSON
- Result: "Gakki"
```
### Case 5: With Metadata
```
Input: "Emma Watson (JG) v3.5"
Output: "Emma Watson"
Process:
- Preprocess: "Emma Watson" (removed (JG) and v3.5)
- spaCy NER: Recognizes "Emma Watson" as PERSON
- Result: "Emma Watson"
```
## Advantages Over Regex-Only
### Old Approach (Regex Only)
```python
# Just remove noise and hope for the best
name = remove_noise(name)
name = name.strip()
# Result: May include non-name words
```
Problems:
- Can't distinguish names from other capitalized words
- May include words like "Model", "Anime", "Character"
- No context awareness
- Language-dependent regex patterns needed
### New Approach (spaCy NER)
```python
# Entity extraction with spaCy (assumes nlp = spacy.load("en_core_web_sm"))
preprocessed = remove_noise(name)
doc = nlp(preprocessed)
person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
# Result: only the spans the model labels as PERSON
```
Benefits:
- ✅ Identifies actual person entities
- ✅ Ignores non-person words
- ✅ Context-aware (understands "Emma Watson" is one entity)
- ✅ Multi-language support (with a model trained for the target language)
- ✅ Handles various name formats
## Comparison Examples
| Input | Regex Only | spaCy NER |
|-------|------------|-----------|
| `"Emma Watson Model"` | `"Emma Watson Model"` ❌ | `"Emma Watson"` ✅ |
| `"Anime Character Levi"` | `"Anime Character Levi"` ❌ | `"Levi"` ✅ |
| `"Taylor Swift v2"` | `"Taylor Swift"` ✅ | `"Taylor Swift"` ✅ |
| `"K4te Middleton"` | `"K4te Middleton"` ❌ | `"Kate Middleton"` ✅ |
| `"Celebrity IU"` | `"Celebrity IU"` ❌ | `"IU"` ✅ |
## spaCy Model Information
### Model Used
- **Name**: `en_core_web_sm`
- **Language**: English (but works reasonably with romanized names)
- **Size**: ~13 MB
- **Entities**: Recognizes PERSON, ORG, GPE, etc.
### Installation
```bash
# Install spaCy
pip install spacy
# Download model
python -m spacy download en_core_web_sm
```
The notebook automatically downloads the model if not found.
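A common way to implement this kind of automatic download is to catch the `OSError` that `spacy.load` raises for a missing model and call `spacy.cli.download`. The sketch below assumes that pattern; the function name `load_or_download` and its injectable `load` and `download` parameters are hypothetical and exist only so the logic can be exercised without network access. The notebook's own code may differ.

```python
def load_or_download(name="en_core_web_sm", load=None, download=None):
    """Load a spaCy model, downloading it first if it is not installed."""
    if load is None:
        import spacy
        load = spacy.load
    try:
        return load(name)
    except OSError:
        # spacy.load raises OSError when the model package cannot be found.
        if download is None:
            from spacy.cli import download as spacy_download
            download = spacy_download
        download(name)
        return load(name)
```

In some environments a freshly downloaded model is only importable after the kernel restarts, so the second `load` call can still fail; rerunning the cell usually resolves that.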
### Performance
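- **Speed**: roughly 1,000-5,000 short texts per second; the exact rate depends on hardware, text length, and batching
- **Accuracy**: High for common names; typically lower for rare names, nicknames, and lowercase input
- **Memory**: Low (~100 MB loaded)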
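Throughput figures like these vary a lot between machines, so it can be worth measuring on the actual name list. The helper below is a sketch under stated assumptions: `docs_per_second` is a hypothetical name, and the only spaCy feature it relies on is `nlp.pipe`, which processes texts in batches.

```python
import time
from typing import Iterable


def docs_per_second(nlp, texts: Iterable[str], batch_size: int = 256) -> float:
    """Time one batched pass over the texts and return texts per second."""
    texts = list(texts)
    start = time.perf_counter()
    for _ in nlp.pipe(texts, batch_size=batch_size):
        pass  # consume the generator so every text is actually processed
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed if elapsed > 0 else float("inf")
```

Calling it with the loaded pipeline and a few thousand cleaned names gives a number that reflects the real workload rather than a generic benchmark.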
## Fallback Strategy
If spaCy doesn't recognize a PERSON entity:
1. **Extract capitalized words**:
```
"Unknown name here" → ["Unknown"]
```
2. **Return first few capitalized words**:
```
"Celebrity Model Actor" → "Celebrity Model Actor"
```
3. **Last resort**: Return cleaned text as-is
This ensures a result whenever the cleaned text is non-empty, even for:
- Uncommon/rare names
- Nicknames
- Non-English names
- Stage names
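The three fallback levels can be expressed as one short function. This is a sketch, not the notebook's code: `fallback_name` is a hypothetical name, and the cap of three words is an assumption standing in for "first few".

```python
def fallback_name(cleaned: str, max_words: int = 3) -> str:
    """Prefer capitalized words; otherwise return the cleaned text unchanged."""
    capitalized = [w for w in cleaned.split() if w[:1].isupper()]
    if capitalized:
        # Assumption: "first few" is taken to mean at most max_words words.
        return " ".join(capitalized[:max_words])
    # Last resort: nothing is capitalized (e.g. lowercase or CJK text).
    return cleaned


print(fallback_name("Unknown name here"))      # Unknown
print(fallback_name("Celebrity Model Actor"))  # Celebrity Model Actor
print(fallback_name("akira"))                  # akira
```

Scripts without letter case, such as Japanese or Chinese, never match the capitalization test, so they fall through to the last-resort branch and are returned as-is.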
## Testing
### How to Verify spaCy is Working
Run Cell 5 and check the output:
```
✅ spaCy model loaded: en_core_web_sm
📊 Name cleaning examples (with spaCy NER):
===================================================================================================
Original Name | Cleaned Name
===================================================================================================
Scarlett Johansson「LoRa」 | Scarlett Johansson
Emma Watson (JG) | Emma Watson
IU | IU
Belle Delphine | Belle Delphine
...
```
### Key Indicators
**Good signs**:
- Person names cleanly extracted
- No extra words like "Model", "LoRA", "Celebrity"
- Multi-word names kept together (e.g., "Emma Watson" not just "Emma")
**Issues to watch**:
- Empty results (extend the fallback logic)
- Partial names (e.g., only first name)
- Non-names included (tune preprocessing)
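These checks can be partly automated over the cleaned output. The sketch below is a rough heuristic, not part of the notebook: `audit` and `NOISE_WORDS` are hypothetical names, the word list comes from the "good signs" above, and the partial-name test only catches the simple case where the kept word is directly followed by another capitalized word in the original.

```python
from typing import List

# Words that should not survive cleaning, taken from the indicators above.
NOISE_WORDS = {"model", "lora", "celebrity"}


def audit(original: str, cleaned: str) -> List[str]:
    """Return a list of warnings for one (original, cleaned) pair."""
    issues = []
    kept = cleaned.split()
    if not kept:
        issues.append("empty result")
    if any(w.lower() in NOISE_WORDS for w in kept):
        issues.append("non-name word kept")
    words = original.split()
    if len(kept) == 1 and kept[0] in words:
        i = words.index(kept[0])
        nxt = words[i + 1] if i + 1 < len(words) else ""
        # Heuristic: a capitalized, non-noise word right after the kept word
        # suggests that part of the name was dropped.
        if nxt[:1].isupper() and nxt.lower() not in NOISE_WORDS:
            issues.append("possible partial name")
    return issues


print(audit("Emma Watson (JG)", "Emma"))               # ['possible partial name']
print(audit("Emma Watson Model", "Emma Watson Model")) # ['non-name word kept']
```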
## Customization
### Add More Languages
For better support of non-English names:
```python
import spacy

# Download the multilingual model first (shell command, not Python):
#   python -m spacy download xx_ent_wiki_sm
nlp = spacy.load("xx_ent_wiki_sm")
# Note: this model labels people as "PER", not "PERSON"
```
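Label names differ between models: the English models tag people as `PERSON`, while `xx_ent_wiki_sm` uses `PER`. A filter that accepts both keeps the rest of the pipeline unchanged when the model is swapped. The helper below is a sketch; `person_entities` and `PERSON_LABELS` are hypothetical names, and the stand-in document exists only to make the example runnable without spaCy.

```python
from types import SimpleNamespace

# "PERSON" is used by the English models, "PER" by xx_ent_wiki_sm.
PERSON_LABELS = frozenset({"PERSON", "PER"})


def person_entities(doc, labels=PERSON_LABELS):
    """Return the text of every entity whose label marks a person."""
    return [ent.text for ent in doc.ents if ent.label_ in labels]


# Stand-in document for illustration only; it mimics the doc.ents interface.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Aragaki Yui", label_="PER"),
    SimpleNamespace(text="Tokyo", label_="LOC"),
])
print(person_entities(fake_doc))  # ['Aragaki Yui']
```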
### Adjust Entity Extraction
To extract other entities:
```python
# Extract organizations too
entities = [
    ent.text for ent in doc.ents
    if ent.label_ in ("PERSON", "ORG")
]
```
### Custom Entity Rules
Add custom patterns for names spaCy might miss:
```python
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
# Add patterns for specific name formats
```
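For a fixed list of known names, spaCy's `EntityRuler` is an alternative to a hand-written `Matcher`, because it writes its matches straight into `doc.ents`. The sketch below uses that component instead of the `Matcher` shown above. The function names are hypothetical, and splitting names on whitespace is an assumption that can disagree with spaCy's tokenizer for names containing punctuation.

```python
from typing import Dict, Iterable, List


def build_person_patterns(names: Iterable[str]) -> List[Dict]:
    """Turn known names into case-insensitive EntityRuler token patterns."""
    patterns = []
    for name in names:
        # Assumption: whitespace splitting approximates spaCy's tokenization.
        tokens = [{"LOWER": part.lower()} for part in name.split()]
        if tokens:
            patterns.append({"label": "PERSON", "pattern": tokens})
    return patterns


def add_person_rules(nlp, names: Iterable[str]):
    """Insert an entity ruler ahead of the statistical NER (spaCy v3 API)."""
    ruler = nlp.add_pipe("entity_ruler", before="ner")
    ruler.add_patterns(build_person_patterns(names))
    return nlp


print(build_person_patterns(["IU"]))
# [{'label': 'PERSON', 'pattern': [{'LOWER': 'iu'}]}]
```

Placing the ruler before `ner` means the rule-based matches are set first and the statistical model works around them, which suits short stage names the model tends to miss.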
## Benefits for This Project
### Better Person Identification
With cleaner names:
- LLMs receive recognizable names
- "Emma Watson" instead of "Emma Watson Model LoRA v3"
- Better identification accuracy
### Reduced Ambiguity
spaCy helps distinguish:
- Person names vs. descriptive words
- "Celebrity IU" → "IU" (person)
- "Model Bella" → "Bella" (person)
### Improved Context for LLMs
Cleaner input = better prompts:
```
Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..."
After: "Given 'Emma Watson' (actress)..."
```
The LLM can now focus on identifying the person, not parsing the noise.
## Summary
- **spaCy NER** provides intelligent, context-aware name extraction
- **Better than regex** for handling complex name formats
- **Fallback strategy** ensures we always get a result
- **Industry standard** tool used in production NLP
- **Easy to use** with minimal code
The combination of:
1. Leetspeak translation
2. Noise removal
3. spaCy NER
4. Smart fallbacks
...results in clean, accurate person names ready for LLM annotation!