Laura Wagner

spaCy NER Implementation

Why spaCy for NER?

spaCy's Named Entity Recognition (NER) offers several advantages over regex-only cleaning:

  1. Learned entity extraction: Recognizes PERSON entities with a statistical model instead of hand-written patterns
  2. Context-aware: Uses the surrounding tokens to decide where an entity starts and ends
  3. Robust: Handles various name formats (first, last, full, stage names)
  4. Language support: Pretrained models are available for many languages
  5. Industry standard: Widely used in production NLP applications

How It Works

Pipeline Overview

Original Name
    ↓
1. Translate Leetspeak (4→a, 3→e, 1→i)
    ↓
2. Remove Noise (emoji, LoRA terms, versions)
    ↓
3. spaCy NER - Extract PERSON entities
    ↓
4. Fallback to capitalized words if needed
    ↓
Cleaned Name

Detailed Steps

Step 1: Leetspeak Translation

"4kira LoRA v2" → "akira LoRA v2"
"1rene Model" → "irene Model"
"3mma Watson" → "emma Watson"

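A minimal sketch of this step, assuming only the three substitutions listed in the pipeline overview. The function name `translate_leetspeak` and the rule that skips version tags such as `v2` are illustrative assumptions, not the notebook's actual code; without such a rule, `v1` would turn into `vi`.

```python
import re

# Digit-to-letter substitutions named in the pipeline overview.
LEET_TABLE = str.maketrans({"4": "a", "3": "e", "1": "i"})

# Assumption: tokens such as "v2" or "v3.5" are version tags and stay intact
# so that the later noise-removal step can still recognize them.
VERSION_TAG = re.compile(r"v\d+(\.\d+)*", re.IGNORECASE)


def translate_leetspeak(text: str) -> str:
    """Replace leetspeak digits in word-like tokens; leave versions and plain numbers alone."""
    translated = []
    for token in text.split(" "):
        is_version = VERSION_TAG.fullmatch(token) is not None
        has_letter = any(ch.isalpha() for ch in token)
        if is_version or not has_letter:
            translated.append(token)
        else:
            translated.append(token.translate(LEET_TABLE))
    return " ".join(translated)


print(translate_leetspeak("4kira LoRA v2"))  # akira LoRA v2
```

Because the mapping is applied blindly to every digit in a word-like token, a name that legitimately contains one of these digits would be altered as well; a real implementation may need an allow-list.
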
Step 2: Noise Removal

"akira LoRA v2" → "akira"
"irene Model" → "irene"
"emma Watson" → "emma Watson"

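The noise-removal step could be sketched with a handful of regular expressions. The name `remove_noise` matches the helper called in the later snippets, but the body below is an assumption: the term list `NOISE_TERMS`, the bracket styles, and the emoji ranges are illustrative and almost certainly narrower than what the notebook uses.

```python
import re

# Hypothetical noise vocabulary; extend it to match your data.
NOISE_TERMS = ("lora", "model")

BRACKETED = re.compile(r"\([^)]*\)|\[[^\]]*\]|「[^」]*」")   # (JG), [tag], 「LoRa」
VERSION = re.compile(r"\bv\d+(?:\.\d+)*\b", re.IGNORECASE)   # v2, v3.5
NOISE = re.compile(r"\b(?:" + "|".join(NOISE_TERMS) + r")\b", re.IGNORECASE)
# Rough emoji coverage only: common pictographs plus dingbats and misc symbols.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def remove_noise(text: str) -> str:
    """Strip bracketed tags, emoji, version markers, and known noise terms."""
    text = text.split("|")[0]          # keep only the part before the first "|"
    for pattern in (BRACKETED, EMOJI, VERSION, NOISE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())      # collapse the whitespace left behind


print(remove_noise("Emma Watson (JG) v3.5"))  # Emma Watson
```

Words such as "Anime" or "Character" are deliberately left in place here; separating them from the actual name is the job of the NER step that follows.
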
Step 3: spaCy NER

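import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative output; actual predictions can vary with the model version.
doc = nlp("akira")
# Entities: [("akira", PERSON)]
# Result: "akira"

doc = nlp("emma Watson")
# Entities: [("emma Watson", PERSON)]
# Result: "emma Watson"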

Step 4: Fallback

If spaCy doesn't find a PERSON entity:

  • Extract capitalized words (likely names)
  • Or return cleaned text as-is

Examples

Case 1: Simple Name

Input:  "IU"
Output: "IU"

Process:
  - Preprocess: "IU" (no noise)
  - spaCy NER: Recognizes "IU" as PERSON
  - Result: "IU"

Case 2: Name with LoRA Terms

Input:  "Scarlett Johansson「LoRa」"
Output: "Scarlett Johansson"

Process:
  - Preprocess: "Scarlett Johansson" (removed 「LoRa」)
  - spaCy NER: Recognizes "Scarlett Johansson" as PERSON
  - Result: "Scarlett Johansson"

Case 3: Leetspeak Name

Input:  "4kira Anime Character v1"
Output: "akira"

Process:
  - Leetspeak: "akira Anime Character v1"
  - Preprocess: "akira Anime Character"
  - spaCy NER: Recognizes "akira" as PERSON
  - Result: "akira"

Case 4: Complex Format

Input:  "Gakki | Aragaki Yui | 新垣結衣"
Output: "Gakki"

Process:
  - Preprocess: "Gakki" (kept first part before |)
  - spaCy NER: Recognizes "Gakki" as PERSON
  - Result: "Gakki"

Case 5: With Metadata

Input:  "Emma Watson (JG) v3.5"
Output: "Emma Watson"

Process:
  - Preprocess: "Emma Watson" (removed (JG) and v3.5)
  - spaCy NER: Recognizes "Emma Watson" as PERSON
  - Result: "Emma Watson"

Advantages Over Regex-Only

Old Approach (Regex Only)

# Just remove noise and hope for the best
name = remove_noise(name)
name = name.strip()
# Result: May include non-name words

Problems:

  • Can't distinguish names from other capitalized words
  • May include words like "Model", "Anime", "Character"
  • No context awareness
  • Language-dependent regex patterns needed

New Approach (spaCy NER)

# Intelligent entity extraction
preprocessed = remove_noise(name)
doc = nlp(preprocessed)
person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
# Result: only the spans spaCy tagged as PERSON

Benefits:

  • ✅ Identifies actual person entities
  • ✅ Ignores non-person words
  • ✅ Context-aware (understands "Emma Watson" is one entity)
  • ✅ Multi-language support
  • ✅ Handles various name formats

Comparison Examples

Input                  | Regex Only                | spaCy NER
"Emma Watson Model"    | "Emma Watson Model" ❌    | "Emma Watson" ✅
"Anime Character Levi" | "Anime Character Levi" ❌ | "Levi" ✅
"Taylor Swift v2"      | "Taylor Swift" ✅         | "Taylor Swift" ✅
"K4te Middleton"       | "K4te Middleton" ❌       | "Kate Middleton" ✅
"Celebrity IU"         | "Celebrity IU" ❌         | "IU" ✅

spaCy Model Information

Model Used

  • Name: en_core_web_sm
  • Language: English (but works reasonably with romanized names)
  • Size: ~13 MB
  • Entities: Recognizes PERSON, ORG, GPE, etc.

Installation

# Install spaCy
pip install spacy

# Download model
python -m spacy download en_core_web_sm

The notebook automatically downloads the model if not found.
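
The automatic download could look roughly like the following. This is a sketch rather than the notebook's actual cell: the helper name `load_model` is hypothetical. It relies on `spacy.load` raising `OSError` when the model package is missing and on `spacy.cli.download` to fetch it. In some notebook environments a freshly downloaded model may only become loadable after the kernel is restarted.

```python
def load_model(name: str = "en_core_web_sm"):
    """Load a spaCy pipeline, downloading it first if it is not installed."""
    import spacy  # imported lazily so the helper can be defined before spaCy is installed

    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package cannot be found.
        from spacy.cli import download

        download(name)
        return spacy.load(name)


# Usage (requires spaCy to be installed):
# nlp = load_model()
```
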

Performance

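  • Speed: roughly 1,000-5,000 docs/second (approximate; depends on hardware, text length, and batch size)
  • Accuracy: High for common names; less reliable for rare names, nicknames, and stage names, which the fallback strategy covers
  • Memory: Low (~100 MB loaded)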
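
Since throughput varies so much between machines, it is worth measuring on your own data. The helper below is a small sketch; `measure_throughput` is a hypothetical name, and it only assumes that `nlp` exposes spaCy's `pipe` method.

```python
import time


def measure_throughput(nlp, texts, batch_size: int = 256) -> float:
    """Return documents processed per second for one pass over `texts`."""
    start = time.perf_counter()
    # nlp.pipe streams texts through the pipeline in batches, which is
    # usually faster than calling nlp(text) once per string.
    processed = sum(1 for _ in nlp.pipe(texts, batch_size=batch_size))
    elapsed = time.perf_counter() - start
    return processed / elapsed if elapsed > 0 else float("inf")


# Usage (requires spaCy and the model to be installed):
# import spacy
# nlp = spacy.load(
#     "en_core_web_sm",
#     disable=["tagger", "parser", "attribute_ruler", "lemmatizer"],
# )
# print(measure_throughput(nlp, ["Emma Watson", "Taylor Swift v2"] * 1000))
```

Disabling the components that NER does not need, as in the commented usage, is a common way to speed up an entity-only workload.
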

Fallback Strategy

If spaCy doesn't recognize a PERSON entity:

  1. Extract capitalized words:

    "Unknown name here" → ["Unknown"]

  2. Return first few capitalized words:

    "Celebrity Model Actor" → "Celebrity Model Actor"

  3. Last resort: Return cleaned text as-is

This ensures we always get something, even for:

  • Uncommon/rare names
  • Nicknames
  • Non-English names
  • Stage names
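
The whole chain, NER first and then the fallbacks in order, might be sketched as below. The function name `extract_name` and the `max_words=3` cutoff for "first few" are assumptions; `nlp` is any loaded spaCy pipeline.

```python
def extract_name(text: str, nlp, max_words: int = 3) -> str:
    """Return the first PERSON entity, else capitalized words, else the text itself."""
    doc = nlp(text)
    persons = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    if persons:
        return persons[0]

    # Fallbacks 1 and 2: keep the first few capitalized words, which are likely names.
    capitalized = [word for word in text.split() if word[0].isupper()]
    if capitalized:
        return " ".join(capitalized[:max_words])

    # Last resort: return the cleaned text unchanged.
    return text
```

Taking only the first PERSON entity is one possible policy; joining all of them or picking the longest are equally plausible choices depending on the data.
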

Testing

How to Verify spaCy is Working

Run Cell 5 and check the output:

✅ spaCy model loaded: en_core_web_sm

📊 Name cleaning examples (with spaCy NER):
===================================================================================================
Original Name                                      | Cleaned Name
===================================================================================================
Scarlett Johansson「LoRa」                        | Scarlett Johansson
Emma Watson (JG)                                  | Emma Watson
IU                                                | IU
Belle Delphine                                    | Belle Delphine
...

Key Indicators

✅ Good signs:

  • Person names cleanly extracted
  • No extra words like "Model", "LoRA", "Celebrity"
  • Multi-word names kept together (e.g., "Emma Watson" not just "Emma")

❌ Issues to watch:

  • Empty results (extend the fallback logic)
  • Partial names (e.g., only first name)
  • Non-names included (tune preprocessing)
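
These checks can be partly automated. The sketch below flags suspicious results for manual review; the helper names, the `NOISE_WORDS` set, and the "fewer words than name-like tokens" heuristic are all assumptions, and the heuristic will produce some false positives (for example, "Anime Character Levi" cleaned to "Levi").

```python
# Words that should not survive cleaning (taken from the good-signs list; extend as needed).
NOISE_WORDS = {"model", "lora", "celebrity"}


def _looks_like_name_word(word: str) -> bool:
    """Alphabetic, capitalized, and not a known noise word."""
    return word.isalpha() and word[0].isupper() and word.lower() not in NOISE_WORDS


def audit_cleaned_names(pairs):
    """Return (original, cleaned, issue) tuples for results worth a manual look."""
    flagged = []
    for original, cleaned in pairs:
        words = cleaned.split()
        expected = sum(1 for w in original.split() if _looks_like_name_word(w))
        if not words:
            flagged.append((original, cleaned, "empty result"))
        elif any(w.lower() in NOISE_WORDS for w in words):
            flagged.append((original, cleaned, "non-name word included"))
        elif len(words) < expected:
            flagged.append((original, cleaned, "possibly partial name"))
    return flagged


for row in audit_cleaned_names([("Emma Watson (JG)", "Emma"), ("IU", "IU")]):
    print(row)
```
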

Customization

Add More Languages

For better support of non-English names:

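# Download the multilingual model (shell)
python -m spacy download xx_ent_wiki_sm

# Use in code (Python)
nlp = spacy.load("xx_ent_wiki_sm")
# Note: this model tags people as "PER" rather than "PERSON",
# so the label filter must be adjusted accordingly.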

Adjust Entity Extraction

To extract other entities:

# Extract organizations too
entities = [ent.text for ent in doc.ents
            if ent.label_ in ["PERSON", "ORG"]]

Custom Entity Rules

Add custom patterns for names spaCy might miss:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# Add patterns for specific name formats
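
One way to fill in the patterns is sketched below. Note that it swaps in spaCy's `entity_ruler` component rather than the bare `Matcher` above, because the ruler writes its matches directly into `doc.ents`, which is where the PERSON filter looks. The helper name `add_known_names` and the example names are assumptions for illustration.

```python
def add_known_names(nlp, names):
    """Register exact-match PERSON patterns for names the statistical model may miss."""
    # Place the ruler before the "ner" component (when present) so its
    # entities are set first and the model adjusts its predictions around them.
    if "ner" in nlp.pipe_names:
        ruler = nlp.add_pipe("entity_ruler", before="ner")
    else:
        ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns([{"label": "PERSON", "pattern": name} for name in names])
    return ruler


# Usage (requires spaCy and the model to be installed):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# add_known_names(nlp, ["IU", "Gakki"])
# print([(ent.text, ent.label_) for ent in nlp("IU").ents])
```

String patterns like these match the exact text and are case-sensitive; token-level patterns would be needed for anything more flexible.
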

Benefits for This Project

Better Person Identification

With cleaner names:

  • LLMs receive recognizable names
  • "Emma Watson" instead of "Emma Watson Model LoRA v3"
  • Better identification accuracy

Reduced Ambiguity

spaCy helps distinguish:

  • Person names vs. descriptive words
  • "Celebrity IU" → "IU" (person)
  • "Model Bella" → "Bella" (person)

Improved Context for LLMs

Cleaner input = better prompts:

Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..."
After:  "Given 'Emma Watson' (actress)..."

The LLM can now focus on identifying the person, not parsing the noise.

Summary

  • ✅ spaCy NER provides intelligent, context-aware name extraction
  • ✅ Better than regex for handling complex name formats
  • ✅ Fallback strategy ensures we always get a result
  • ✅ Industry standard tool used in production NLP
  • ✅ Easy to use with minimal code

The combination of:

  1. Leetspeak translation
  2. Noise removal
  3. spaCy NER
  4. Smart fallbacks

...results in clean, accurate person names ready for LLM annotation!