code / md /SPACY_NER_EXPLANATION.md

Laura Wagner

5f5806d about 2 months ago

spaCy NER Implementation

Why spaCy for NER?

Using spaCy's Named Entity Recognition (NER) is significantly better than regex-based cleaning because:

Intelligent entity extraction: Recognizes PERSON entities using machine learning
Context-aware: Understands sentence structure and context
Robust: Handles various name formats (first, last, full, stage names)
Language support: spaCy ships trained pipelines for many languages (this project uses the English one)
Industry standard: Used in production NLP applications

How It Works

Pipeline Overview

Original Name
    ↓
1. Translate Leetspeak (4→a, 3→e, 1→i)
    ↓
2. Remove Noise (emoji, LoRA terms, versions)
    ↓
3. spaCy NER - Extract PERSON entities
    ↓
4. Fallback to capitalized words if needed
    ↓
Cleaned Name

Detailed Steps

Step 1: Leetspeak Translation

"4kira LoRA v2" → "akira LoRA v2"
"1rene Model" → "irene Model"
"3mma Watson" → "emma Watson"

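The translation step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name `translate_leetspeak` and the rule for skipping version-like tokens are assumptions, added so that `v2` or `v3.5` is left alone for the noise-removal step.

```python
import re

# Digit-to-letter mapping taken from the pipeline overview (4→a, 3→e, 1→i).
LEET_MAP = str.maketrans({"4": "a", "3": "e", "1": "i"})

# Assumption: tokens such as "v2", "v3.5" or "2" are version markers, not names.
VERSION_TOKEN = re.compile(r"^v?\d+(\.\d+)*$", re.IGNORECASE)


def translate_leetspeak(text: str) -> str:
    """Translate leetspeak digits inside word-like tokens only."""
    translated = []
    for token in text.split(" "):
        has_letter = any(ch.isalpha() for ch in token)
        if has_letter and not VERSION_TOKEN.match(token):
            translated.append(token.translate(LEET_MAP))
        else:
            # Version markers and purely numeric tokens pass through unchanged.
            translated.append(token)
    return " ".join(translated)
```

Skipping version tokens matters because a blanket character replacement would turn `v3.5` into `ve.5` before the version pattern has a chance to remove it.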
Step 2: Noise Removal

"akira LoRA v2" → "akira"
"irene Model" → "irene"
"emma Watson" → "emma Watson"

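A regex-based noise filter along these lines would reproduce the examples above. It is a sketch under assumptions: the term list `NOISE_TERMS`, the bracket styles, and the emoji ranges are illustrative choices, not the project's actual configuration.

```python
import re

# Hypothetical term list, inferred from the Step 2 examples.
NOISE_TERMS = ("lora", "model")

BRACKETED = re.compile(r"「[^」]*」|\([^)]*\)|\[[^\]]*\]")
VERSION = re.compile(r"\bv\d+(?:\.\d+)*\b", re.IGNORECASE)
# Rough emoji coverage: pictographs plus the older symbol/dingbat blocks.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
NOISE_WORDS = re.compile(r"\b(?:" + "|".join(NOISE_TERMS) + r")\b", re.IGNORECASE)


def remove_noise(text: str) -> str:
    """Strip bracketed tags, emoji, version markers and known noise terms."""
    # Keep only the part before the first "|" (see Case 4 in the examples).
    text = text.split("|")[0]
    for pattern in (BRACKETED, EMOJI, VERSION, NOISE_WORDS):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())
```

Words that are not in the term list, such as "Anime" or "Character", survive this step and are left for the NER stage to filter out.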
Step 3: spaCy NER

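nlp("akira")
# Entities: [("akira", "PERSON")]
# Result: "akira"

nlp("emma Watson")
# Entities: [("emma Watson", "PERSON")]
# Result: "emma Watson"

# Note: these are the intended results. A small model can miss lowercase
# or single-word names; Step 4 covers that case.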
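Turning the entity list into a single result needs a selection rule. The sketch below assumes the simplest one, taking the first PERSON entity; the helper name `first_person` is hypothetical, and the pipeline is passed in as a parameter so the function works with any loaded spaCy model.

```python
from typing import Any, Callable, Optional


def first_person(text: str, nlp: Callable[[str], Any]) -> Optional[str]:
    """Return the text of the first entity labeled PERSON, or None."""
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            return ent.text
    # No PERSON entity found: the caller should apply the fallback step.
    return None


# Usage, assuming spaCy and en_core_web_sm are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   first_person("emma Watson", nlp)
```

Returning `None` rather than an empty string keeps "no entity found" distinct from a real result, which makes the hand-off to Step 4 explicit.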

Step 4: Fallback

If spaCy doesn't find a PERSON entity:

Extract capitalized words (likely names)
Or return cleaned text as-is

Examples

Case 1: Simple Name

Input:  "IU"
Output: "IU"

Process:
  - Preprocess: "IU" (no noise)
  - spaCy NER: Recognizes "IU" as PERSON
  - Result: "IU"

Case 2: Name with LoRA Terms

Input:  "Scarlett Johansson「LoRa」"
Output: "Scarlett Johansson"

Process:
  - Preprocess: "Scarlett Johansson" (removed 「LoRa」)
  - spaCy NER: Recognizes "Scarlett Johansson" as PERSON
  - Result: "Scarlett Johansson"

Case 3: Leetspeak Name

Input:  "4kira Anime Character v1"
Output: "akira"

Process:
  - Leetspeak: "akira Anime Character v1"
  - Preprocess: "akira Anime Character"
  - spaCy NER: Recognizes "akira" as PERSON
  - Result: "akira"

Case 4: Complex Format

Input:  "Gakki | Aragaki Yui | 新垣結衣"
Output: "Gakki"

Process:
  - Preprocess: "Gakki" (kept first part before |)
  - spaCy NER: Recognizes "Gakki" as PERSON
  - Result: "Gakki"

Case 5: With Metadata

Input:  "Emma Watson (JG) v3.5"
Output: "Emma Watson"

Process:
  - Preprocess: "Emma Watson" (removed (JG) and v3.5)
  - spaCy NER: Recognizes "Emma Watson" as PERSON
  - Result: "Emma Watson"

Advantages Over Regex-Only

Old Approach (Regex Only)

# Just remove noise and hope for the best
name = remove_noise(name)
name = name.strip()
# Result: May include non-name words

Problems:

Can't distinguish names from other capitalized words
May include words like "Model", "Anime", "Character"
No context awareness
Language-dependent regex patterns needed

New Approach (spaCy NER)

import spacy

nlp = spacy.load("en_core_web_sm")

# Intelligent entity extraction
preprocessed = remove_noise(name)
doc = nlp(preprocessed)
person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
# Result: only the spans the model labels as PERSON

Benefits:

✅ Identifies actual person entities
✅ Ignores non-person words
✅ Context-aware (understands "Emma Watson" is one entity)
✅ Multi-language support
✅ Handles various name formats

Comparison Examples

| Input | Regex Only | spaCy NER |
|---|---|---|
| `"Emma Watson Model"` | `"Emma Watson Model"` ❌ | `"Emma Watson"` ✅ |
| `"Anime Character Levi"` | `"Anime Character Levi"` ❌ | `"Levi"` ✅ |
| `"Taylor Swift v2"` | `"Taylor Swift"` ✅ | `"Taylor Swift"` ✅ |
| `"K4te Middleton"` | `"K4te Middleton"` ❌ | `"Kate Middleton"` ✅ |
| `"Celebrity IU"` | `"Celebrity IU"` ❌ | `"IU"` ✅ |

spaCy Model Information

Model Used

Name: en_core_web_sm
Language: English (but works reasonably with romanized names)
Size: ~13 MB
Entities: Recognizes PERSON, ORG, GPE, etc.

Installation

# Install spaCy
pip install spacy

# Download model
python -m spacy download en_core_web_sm

The notebook automatically downloads the model if not found.
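A load-or-download helper could look roughly like this. It is a sketch, not the notebook's code: `load_or_download` is a hypothetical name, and the `load` and `download` parameters exist only so the logic can be exercised without spaCy installed. By default they resolve to `spacy.load` and `spacy.cli.download`.

```python
from typing import Any, Callable, Optional


def load_or_download(
    name: str = "en_core_web_sm",
    load: Optional[Callable[[str], Any]] = None,
    download: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Load a spaCy pipeline, downloading it first if it is missing."""
    if load is None or download is None:
        # Imported lazily so the helper can be tested with stand-ins.
        import spacy
        from spacy.cli import download as spacy_download

        load = load or spacy.load
        download = download or spacy_download
    try:
        return load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed.
        download(name)
        return load(name)
```

In some environments a freshly downloaded model is only importable after the kernel restarts, so the second `load` call may still fail there.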

Performance

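Speed: ~1000-5000 docs/second (rough figure; depends on hardware and text length)
Accuracy: High for common names, lower for rare, lowercase, or non-English names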
Memory: Low (~100MB loaded)
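Throughput depends heavily on the machine and the input, so it is worth measuring locally. The helper below is a small sketch; `docs_per_second` is a hypothetical name, and it accepts any callable that maps an iterable of texts to an iterable of results, such as spaCy's `nlp.pipe`.

```python
import time
from typing import Any, Callable, Iterable, Sequence


def docs_per_second(
    pipe: Callable[[Sequence[str]], Iterable[Any]],
    texts: Sequence[str],
) -> float:
    """Time one pass over `texts` and return documents processed per second."""
    start = time.perf_counter()
    # Consume the iterator fully; nlp.pipe is lazy and does no work otherwise.
    count = sum(1 for _ in pipe(texts))
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


# Usage, assuming a loaded pipeline `nlp` and a list of names `names`:
#   rate = docs_per_second(nlp.pipe, names)
```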

Fallback Strategy

If spaCy doesn't recognize a PERSON entity:

Extract capitalized words:
```
"Unknown name here" → ["Unknown"]
```

Return first few capitalized words:

"Celebrity Model Actor" → "Celebrity Model Actor"

Last resort: Return cleaned text as-is

This ensures we always get something, even for:

Uncommon/rare names
Nicknames
Non-English names
Stage names
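The fallback order described above can be sketched as a single function. The name `fallback_name` and the `max_words` limit of three are assumptions; the text only says "first few".

```python
def fallback_name(cleaned: str, max_words: int = 3) -> str:
    """Return the first few capitalized words, or the cleaned text as-is."""
    # str.isupper() on the first character also covers non-ASCII capitals.
    capitalized = [word for word in cleaned.split() if word[0].isupper()]
    if capitalized:
        return " ".join(capitalized[:max_words])
    # Last resort: nothing capitalized (e.g. lowercase or CJK text).
    return cleaned.strip()
```

For scripts without letter case, such as `新垣結衣`, no word counts as capitalized, so the cleaned text is returned unchanged.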

Testing

How to Verify spaCy is Working

Run Cell 5 and check the output:

✅ spaCy model loaded: en_core_web_sm

📊 Name cleaning examples (with spaCy NER):
===================================================================================================
Original Name                                      | Cleaned Name
===================================================================================================
Scarlett Johansson「LoRa」                        | Scarlett Johansson
Emma Watson (JG)                                  | Emma Watson
IU                                                | IU
Belle Delphine                                    | Belle Delphine
...

Key Indicators

✅ Good signs:

Person names cleanly extracted
No extra words like "Model", "LoRA", "Celebrity"
Multi-word names kept together (e.g., "Emma Watson" not just "Emma")

❌ Issues to watch:

Empty results (extend the fallback logic)
Partial names (e.g., only first name)
Non-names included (tune preprocessing)
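These checks can be partly automated. The sketch below flags results for manual review; the function name, the flag names, and the `NOISE_WORDS` set (taken from the "good signs" list) are illustrative assumptions, and a single-word result is only a hint, since names like "IU" are legitimately one word.

```python
from typing import List

# Words that should not survive cleaning, per the "good signs" list.
NOISE_WORDS = {"model", "lora", "celebrity"}


def audit_cleaned_name(cleaned: str) -> List[str]:
    """Return a list of warning flags for one cleaned name."""
    words = cleaned.split()
    if not words:
        return ["empty"]
    flags = []
    if any(word.lower() in NOISE_WORDS for word in words):
        flags.append("noise_word")
    if len(words) == 1:
        # Possibly a partial name; review by hand.
        flags.append("single_word")
    return flags
```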

Customization

Add More Languages

For better support of non-English names:

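# Download multilingual model
python -m spacy download xx_ent_wiki_sm

# Use in code
nlp = spacy.load("xx_ent_wiki_sm")
# Note: this model labels people as "PER", not "PERSON",
# so the label filter has to change accordingly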

Adjust Entity Extraction

To extract other entities:

# Extract organizations too
entities = [ent.text for ent in doc.ents
            if ent.label_ in ["PERSON", "ORG"]]

Custom Entity Rules

Add custom patterns for names spaCy might miss:

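from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# Example pattern: two consecutive title-cased tokens, e.g. "Belle Delphine"
matcher.add("FULL_NAME", [[{"IS_TITLE": True}, {"IS_TITLE": True}]])
# Each match is a (match_id, start, end) tuple of token offsets into doc
matches = matcher(doc)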

Benefits for This Project

Better Person Identification

With cleaner names:

LLMs receive recognizable names
"Emma Watson" instead of "Emma Watson Model LoRA v3"
Better identification accuracy

Reduced Ambiguity

spaCy helps distinguish:

Person names vs. descriptive words
"Celebrity IU" → "IU" (person)
"Model Bella" → "Bella" (person)

Improved Context for LLMs

Cleaner input = better prompts:

Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..."
After:  "Given 'Emma Watson' (actress)..."

The LLM can now focus on identifying the person, not parsing the noise.

Summary

✅ spaCy NER provides intelligent, context-aware name extraction
✅ Better than regex for handling complex name formats
✅ Fallback strategy ensures we always get a result
✅ Industry standard tool used in production NLP
✅ Easy to use with minimal code

The combination of:

Leetspeak translation
Noise removal
spaCy NER
Smart fallbacks

...results in clean, accurate person names ready for LLM annotation!