Laura Wagner

	# spaCy NER Implementation

	## Why spaCy for NER?

	spaCy's Named Entity Recognition (NER) tends to clean names more reliably than regex-only cleaning because:

	1. Statistical entity extraction: A trained model tags PERSON entities instead of relying on hand-written patterns
	2. Context-aware: Predictions use the surrounding tokens, so a multi-word name is kept as one span
	3. Robust: Handles various name formats (first, last, full, stage names)
	4. Language support: spaCy ships trained pipelines for many languages; this project uses the English one
	5. Industry standard: Widely used in production NLP applications

	## How It Works

	### Pipeline Overview

	```
	Original Name
	↓
	1. Translate Leetspeak (4→a, 3→e, 1→i)
	↓
	2. Remove Noise (emoji, LoRA terms, versions)
	↓
	3. spaCy NER - Extract PERSON entities
	↓
	4. Fallback to capitalized words if needed
	↓
	Cleaned Name
	```

	### Detailed Steps

	#### Step 1: Leetspeak Translation
	```
	"4kira LoRA v2" → "akira LoRA v2"
	"1rene Model" → "irene Model"
	"3mma Watson" → "emma Watson"
	```
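The translation step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name `translate_leetspeak`, the `LEET_MAP` table, and the rule that protects version tags such as `v2` are all assumptions. The protection rule is inferred from the examples, where `v1` and `v2` pass through this step unchanged.

```python
import re

# Digit-to-letter mapping taken from the pipeline overview (4→a, 3→e, 1→i).
LEET_MAP = str.maketrans({"4": "a", "3": "e", "1": "i"})

# Assumption: tokens such as "v1" or "V3.5" are version tags, not names.
VERSION_TAG = re.compile(r"[vV]\d+(?:\.\d+)*")


def translate_leetspeak(text: str) -> str:
    """Replace leetspeak digits inside word tokens, leaving numbers and version tags alone."""

    def fix(token: str) -> str:
        has_letter = any(ch.isalpha() for ch in token)
        # Pure numbers and version tags are returned untouched.
        if not has_letter or VERSION_TAG.fullmatch(token):
            return token
        return token.translate(LEET_MAP)

    # Splitting on single spaces keeps the original spacing intact.
    return " ".join(fix(tok) for tok in text.split(" "))
```

Translating per token rather than over the whole string is a deliberate choice here: a blind character replacement would turn `v1` into `vi` before the noise-removal step ever sees it.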

	#### Step 2: Noise Removal
	```
	"akira LoRA v2" → "akira"
	"irene Model" → "irene"
	"emma Watson" → "emma Watson"
	```
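A rough sketch of what the noise-removal step might look like is shown below. The function name `remove_noise` matches the helper used later in this document, but the body is an assumption: the noise vocabulary in `NOISE_TERMS`, the bracket and version patterns, and the emoji ranges are illustrative guesses reconstructed from the examples, not the project's real lists.

```python
import re

# Hypothetical noise vocabulary; the real list is not shown in this document.
NOISE_TERMS = ("lora", "model")

# Bracketed tags such as (JG), [x], 「LoRa」 or 【tag】.
BRACKETED = re.compile(r"[\(\[「【][^\)\]」】]*[\)\]」】]")
# Version tags such as v2 or v3.5.
VERSION = re.compile(r"\b[vV]\d+(?:\.\d+)*\b")
NOISE_WORD = re.compile(rf"\b(?:{'|'.join(NOISE_TERMS)})\b", re.IGNORECASE)
# Rough emoji coverage: misc symbols, dingbats, and the main emoji planes.
EMOJI = re.compile("[\u2600-\u27BF\U0001F300-\U0001FAFF]")


def remove_noise(text: str) -> str:
    """Strip aliases after '|', emoji, bracketed tags, version tags, and noise words."""
    text = text.split("|")[0]          # keep only the first alias
    text = EMOJI.sub(" ", text)
    text = BRACKETED.sub(" ", text)
    text = VERSION.sub(" ", text)
    text = NOISE_WORD.sub(" ", text)
    return " ".join(text.split())      # collapse leftover whitespace
```

Words like "Anime" and "Character" are intentionally left in place by this sketch, since the worked examples show them surviving preprocessing and being dropped only by the NER step.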

	#### Step 3: spaCy NER
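	```python
	import spacy

	nlp = spacy.load("en_core_web_sm")

	doc = nlp("akira")
	print([(ent.text, ent.label_) for ent in doc.ents])
	# Intended: [("akira", "PERSON")] → result "akira"

	doc = nlp("emma Watson")
	print([(ent.text, ent.label_) for ent in doc.ents])
	# Intended: [("emma Watson", "PERSON")] → result "emma Watson"
	# Actual predictions vary by model version, especially for lowercase input.
	```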

	#### Step 4: Fallback
	If spaCy doesn't find a PERSON entity:
	- Extract capitalized words (likely names)
	- Or return cleaned text as-is

	## Examples

	### Case 1: Simple Name
	```
	Input: "IU"
	Output: "IU"

	Process:
	- Preprocess: "IU" (no noise)
	- spaCy NER: Recognizes "IU" as PERSON
	- Result: "IU"
	```

	### Case 2: Name with LoRA Terms
	```
	Input: "Scarlett Johansson「LoRa」"
	Output: "Scarlett Johansson"

	Process:
	- Preprocess: "Scarlett Johansson" (removed 「LoRa」)
	- spaCy NER: Recognizes "Scarlett Johansson" as PERSON
	- Result: "Scarlett Johansson"
	```

	### Case 3: Leetspeak Name
	```
	Input: "4kira Anime Character v1"
	Output: "akira"

	Process:
	- Leetspeak: "akira Anime Character v1"
	- Preprocess: "akira Anime Character"
	- spaCy NER: Recognizes "akira" as PERSON
	- Result: "akira"
	```

	### Case 4: Complex Format
	```
	Input: "Gakki | Aragaki Yui | 新垣結衣"
	Output: "Gakki"

	Process:
	- Preprocess: "Gakki" (kept first part before |)
	- spaCy NER: Recognizes "Gakki" as PERSON
	- Result: "Gakki"
	```

	### Case 5: With Metadata
	```
	Input: "Emma Watson (JG) v3.5"
	Output: "Emma Watson"

	Process:
	- Preprocess: "Emma Watson" (removed (JG) and v3.5)
	- spaCy NER: Recognizes "Emma Watson" as PERSON
	- Result: "Emma Watson"
	```

	## Advantages Over Regex-Only

	### Old Approach (Regex Only)
	```python
	# Just remove noise and hope for the best
	name = remove_noise(name)
	name = name.strip()
	# Result: May include non-name words
	```

	Problems:
	- Can't distinguish names from other capitalized words
	- May include words like "Model", "Anime", "Character"
	- No context awareness
	- Language-dependent regex patterns needed

	### New Approach (spaCy NER)
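	```python
	import spacy

	nlp = spacy.load("en_core_web_sm")

	# Statistical entity extraction after regex preprocessing
	preprocessed = remove_noise(name)  # remove_noise: the Step 2 helper
	doc = nlp(preprocessed)
	person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
	# Result: only the spans the model tags as PERSON
	```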

	Benefits:
	- ✅ Identifies actual person entities
	- ✅ Ignores non-person words
	- ✅ Context-aware (understands "Emma Watson" is one entity)
	- ✅ Multi-language support (with a spaCy model trained for the target language)
	- ✅ Handles various name formats
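The entity filter above yields a list, so something still has to pick a single name from it. One possible selection rule is sketched below; the helper name `first_person` and the choice of "first matching entity wins" are assumptions, not behaviour confirmed by the notebook. The `labels` parameter is there because some models use a different tag set (for example `PER` instead of `PERSON`).

```python
from typing import Iterable, Optional


def first_person(doc, labels: Iterable[str] = ("PERSON",)) -> Optional[str]:
    """Return the text of the first entity whose label is in `labels`, or None."""
    wanted = set(labels)
    for ent in doc.ents:
        if ent.label_ in wanted:
            return ent.text.strip()
    # No matching entity: the caller should move on to the fallback strategy.
    return None
```

With a loaded pipeline this would be called as `first_person(nlp(remove_noise(name)))`. Returning `None` rather than an empty string keeps "no entity found" distinguishable from a real result, which makes the fallback decision explicit.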

	## Comparison Examples

	| Input | Regex Only | spaCy NER |
	|-------|------------|-----------|
	| `"Emma Watson Model"` | `"Emma Watson Model"` ❌ | `"Emma Watson"` ✅ |
	| `"Anime Character Levi"` | `"Anime Character Levi"` ❌ | `"Levi"` ✅ |
	| `"Taylor Swift v2"` | `"Taylor Swift"` ✅ | `"Taylor Swift"` ✅ |
	| `"K4te Middleton"` | `"K4te Middleton"` ❌ | `"Kate Middleton"` ✅ |
	| `"Celebrity IU"` | `"Celebrity IU"` ❌ | `"IU"` ✅ |

	## spaCy Model Information

	### Model Used
	- Name: `en_core_web_sm`
	- Language: English (but works reasonably with romanized names)
	- Size: ~13 MB
	- Entities: Recognizes PERSON, ORG, GPE, etc.

	### Installation
	```bash
	# Install spaCy
	pip install spacy

	# Download model
	python -m spacy download en_core_web_sm
	```

	The notebook automatically downloads the model if not found.
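A common way to implement this kind of automatic download is sketched below. It is an assumption about how the notebook does it, not a copy of its code, and `load_model` is a hypothetical helper name. It relies on `spacy.load` raising `OSError` when the model package is missing and on `spacy.cli.download` to install it.

```python
def load_model(name: str = "en_core_web_sm"):
    """Load a spaCy pipeline, downloading the package first if it is missing."""
    import spacy  # imported lazily so this helper can be defined without spaCy installed

    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed.
        from spacy.cli import download

        download(name)
        return spacy.load(name)
```

In some environments a freshly installed package only becomes importable after the interpreter or notebook kernel is restarted, so the second `spacy.load` call may still fail on the first run.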

	### Performance
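	- Speed: roughly 1,000-5,000 short docs/second (approximate; depends on hardware and batch size)
	- Accuracy: High for common English names; lower for rare, lowercase, or non-English names
	- Memory: Low (~100 MB once loaded)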
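Throughput in this range usually depends on batching inputs rather than calling `nlp` once per name. The sketch below shows the batching pattern with `nlp.pipe`; the helper name `person_entities_batch` and the default batch size are assumptions chosen for illustration.

```python
from typing import Iterable, List


def person_entities_batch(nlp, names: Iterable[str], batch_size: int = 256) -> List[List[str]]:
    """Run NER over many names with nlp.pipe and collect PERSON spans per input."""
    results = []
    # nlp.pipe processes texts in batches and yields one Doc per input, in order.
    for doc in nlp.pipe(names, batch_size=batch_size):
        results.append([ent.text for ent in doc.ents if ent.label_ == "PERSON"])
    return results
```

Loading the pipeline with unused components disabled, for example `spacy.load("en_core_web_sm", disable=["parser", "lemmatizer"])`, may reduce the per-document cost further, since only the entity recognizer is needed here.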

	## Fallback Strategy

	If spaCy doesn't recognize a PERSON entity:

	1. Extract capitalized words:
	```
	"unknown Name Here" → ["Name", "Here"]
	```

	2. Return first few capitalized words:
	```
	"Celebrity Model Actor" → "Celebrity Model Actor"
	```

	3. Last resort: Return cleaned text as-is

	This ensures a non-empty result whenever the cleaned text is non-empty, even for:
	- Uncommon/rare names
	- Nicknames
	- Non-English names
	- Stage names
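The three fallback tiers above could be combined roughly as follows. This is a sketch under stated assumptions: `fallback_name` is a hypothetical helper, "capitalized" is taken to mean "first character is uppercase", and "first few" is interpreted as at most three words via the `max_words` parameter.

```python
def fallback_name(cleaned: str, max_words: int = 3) -> str:
    """Pick a name when NER finds no PERSON entity.

    Tiers 1 and 2: keep the first `max_words` words that start with an
    uppercase letter. Tier 3: return the cleaned text unchanged.
    """
    # str.isupper is Unicode-aware, so accented capitals also count.
    words = [w for w in cleaned.split() if w[0].isupper()]
    if words:
        return " ".join(words[:max_words])
    # No capitalized words at all (lowercase or caseless scripts such as CJK).
    return cleaned.strip()
```

Text in scripts without letter case has no capitalized words, so it falls through to the last tier and is returned unchanged.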

	## Testing

	### How to Verify spaCy is Working

	Run Cell 5 and check the output:

	```
	✅ spaCy model loaded: en_core_web_sm

	📊 Name cleaning examples (with spaCy NER):
	===================================================================================================
	Original Name | Cleaned Name
	===================================================================================================
	Scarlett Johansson「LoRa」 | Scarlett Johansson
	Emma Watson (JG) | Emma Watson
	IU | IU
	Belle Delphine | Belle Delphine
	...
	```

	### Key Indicators

	✅ Good signs:
	- Person names cleanly extracted
	- No extra words like "Model", "LoRA", "Celebrity"
	- Multi-word names kept together (e.g., "Emma Watson" not just "Emma")

	❌ Issues to watch:
	- Empty results (extend the fallback logic)
	- Partial names (e.g., only first name)
	- Non-names included (tune preprocessing)
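These checks can be partly automated. The sketch below groups original names by the kind of issue seen in their cleaned form; `audit_cleaned` and the `LEFTOVER_NOISE` set are hypothetical and only mirror the warning signs listed above. The "single token" check is a heuristic that produces candidates for manual review, not confirmed errors.

```python
from typing import Dict, List

# Words this document flags as unwanted in a cleaned name.
LEFTOVER_NOISE = {"model", "lora", "celebrity"}


def audit_cleaned(pairs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group original names by the kind of problem seen in their cleaned form."""
    issues: Dict[str, List[str]] = {"empty": [], "single_token": [], "noise_left": []}
    for original, cleaned in pairs.items():
        tokens = cleaned.split()
        if not tokens:
            issues["empty"].append(original)
            continue
        # A one-word result from a multi-word input may be a truncated name.
        if len(tokens) == 1 and len(original.split()) > 1:
            issues["single_token"].append(original)
        if any(tok.lower() in LEFTOVER_NOISE for tok in tokens):
            issues["noise_left"].append(original)
    return issues
```

Legitimate results such as "Celebrity IU" → "IU" will also appear under `single_token`, so that bucket is best read as a review list.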

	## Customization

	### Add More Languages

	For better support of non-English names:

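	```python
	# Download the multilingual model first (run in a shell, not in Python):
	#   python -m spacy download xx_ent_wiki_sm

	import spacy

	# Use in code
	nlp = spacy.load("xx_ent_wiki_sm")
	# Note: this model labels people as "PER", not "PERSON"
	```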

	### Adjust Entity Extraction

	To extract other entities:

	```python
	# Extract organizations too (doc is the result of nlp(text))
	entities = [ent.text for ent in doc.ents
	            if ent.label_ in ("PERSON", "ORG")]
	```

	### Custom Entity Rules

	Add custom patterns for names spaCy might miss:

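	```python
	import spacy
	from spacy.matcher import Matcher

	nlp = spacy.load("en_core_web_sm")
	matcher = Matcher(nlp.vocab)
	# Example pattern (illustrative): two consecutive title-case tokens
	matcher.add("TITLE_CASE_PAIR", [[{"IS_TITLE": True}, {"IS_TITLE": True}]])

	doc = nlp("Belle Delphine")
	spans = [doc[start:end].text for _, start, end in matcher(doc)]
	```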

	## Benefits for This Project

	### Better Person Identification

	With cleaner names:
	- LLMs receive recognizable names
	- "Emma Watson" instead of "Emma Watson Model LoRA v3"
	- Better identification accuracy

	### Reduced Ambiguity

	spaCy helps distinguish:
	- Person names vs. descriptive words
	- "Celebrity IU" → "IU" (person)
	- "Model Bella" → "Bella" (person)

	### Improved Context for LLMs

	Cleaner input = better prompts:
	```
	Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..."
	After: "Given 'Emma Watson' (actress)..."
	```

	The LLM can now focus on identifying the person, not parsing the noise.

	## Summary

	- ✅ spaCy NER provides intelligent, context-aware name extraction
	- ✅ Better than regex for handling complex name formats
	- ✅ Fallback strategy ensures we always get a result
	- ✅ Industry standard tool used in production NLP
	- ✅ Easy to use with minimal code

	The combination of:
	1. Leetspeak translation
	2. Noise removal
	3. spaCy NER
	4. Smart fallbacks

	...results in clean, accurate person names ready for LLM annotation!