# spaCy NER Implementation

## Why spaCy for NER?

Using **spaCy's Named Entity Recognition (NER)** is significantly better than regex-based cleaning because:

1. **Intelligent entity extraction**: Recognizes PERSON entities using machine learning
2. **Context-aware**: Understands sentence structure and context
3. **Robust**: Handles various name formats (first, last, full, stage names)
4. **Language support**: Works with multiple languages and scripts
5. **Industry standard**: Used in production NLP applications

## How It Works

### Pipeline Overview

```
Original Name
    ↓
1. Translate Leetspeak (4→a, 3→e, 1→i)
    ↓
2. Remove Noise (emoji, LoRA terms, versions)
    ↓
3. spaCy NER - Extract PERSON entities
    ↓
4. Fallback to capitalized words if needed
    ↓
Cleaned Name
```

### Detailed Steps

#### Step 1: Leetspeak Translation

```
"4kira LoRA v2" → "akira LoRA v2"
"1rene Model"   → "irene Model"
"3mma Watson"   → "emma Watson"
```

#### Step 2: Noise Removal

```
"akira LoRA v2" → "akira"
"irene Model"   → "irene"
"emma Watson"   → "emma Watson"
```

#### Step 3: spaCy NER

```python
nlp("akira")
# Entities: [("akira", PERSON)]
# Result: "akira"

nlp("emma Watson")
# Entities: [("emma Watson", PERSON)]
# Result: "emma Watson"
```

#### Step 4: Fallback

If spaCy doesn't find a PERSON entity:

- Extract capitalized words (likely names)
- Or return cleaned text as-is

## Examples

### Case 1: Simple Name

```
Input:  "IU"
Output: "IU"

Process:
- Preprocess: "IU" (no noise)
- spaCy NER: Recognizes "IU" as PERSON
- Result: "IU"
```

### Case 2: Name with LoRA Terms

```
Input:  "Scarlett Johansson「LoRa」"
Output: "Scarlett Johansson"

Process:
- Preprocess: "Scarlett Johansson" (removed 「LoRa」)
- spaCy NER: Recognizes "Scarlett Johansson" as PERSON
- Result: "Scarlett Johansson"
```

### Case 3: Leetspeak Name

```
Input:  "4kira Anime Character v1"
Output: "akira"

Process:
- Leetspeak: "akira Anime Character v1"
- Preprocess: "akira Anime Character"
- spaCy NER: Recognizes "akira" as PERSON
- Result: "akira"
```

### Case 4: Complex Format

```
Input:  "Gakki | Aragaki Yui | 新垣結衣"
Output: "Gakki"

Process:
- Preprocess: "Gakki" (kept first part before |)
- spaCy NER: Recognizes "Gakki" as PERSON
- Result: "Gakki"
```

### Case 5: With Metadata

```
Input:  "Emma Watson (JG) v3.5"
Output: "Emma Watson"

Process:
- Preprocess: "Emma Watson" (removed (JG) and v3.5)
- spaCy NER: Recognizes "Emma Watson" as PERSON
- Result: "Emma Watson"
```

## Advantages Over Regex-Only

### Old Approach (Regex Only)

```python
# Just remove noise and hope for the best
name = remove_noise(name)
name = name.strip()
# Result: May include non-name words
```

Problems:

- Can't distinguish names from other capitalized words
- May include words like "Model", "Anime", "Character"
- No context awareness
- Language-dependent regex patterns needed

### New Approach (spaCy NER)

```python
# Intelligent entity extraction
preprocessed = remove_noise(name)
doc = nlp(preprocessed)
person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
# Result: Only actual person names
```

Benefits:

- ✅ Identifies actual person entities
- ✅ Ignores non-person words
- ✅ Context-aware (understands "Emma Watson" is one entity)
- ✅ Multi-language support
- ✅ Handles various name formats

## Comparison Examples

| Input | Regex Only | spaCy NER |
|-------|------------|-----------|
| `"Emma Watson Model"` | `"Emma Watson Model"` ❌ | `"Emma Watson"` ✅ |
| `"Anime Character Levi"` | `"Anime Character Levi"` ❌ | `"Levi"` ✅ |
| `"Taylor Swift v2"` | `"Taylor Swift"` ✅ | `"Taylor Swift"` ✅ |
| `"K4te Middleton"` | `"K4te Middleton"` ❌ | `"Kate Middleton"` ✅ |
| `"Celebrity IU"` | `"Celebrity IU"` ❌ | `"IU"` ✅ |

## spaCy Model Information

### Model Used

- **Name**: `en_core_web_sm`
- **Language**: English (but works reasonably with romanized names)
- **Size**: ~13 MB
- **Entities**: Recognizes PERSON, ORG, GPE, etc.
### Installation

```bash
# Install spaCy
pip install spacy

# Download model
python -m spacy download en_core_web_sm
```

The notebook automatically downloads the model if not found.

### Performance

- **Speed**: ~1000-5000 docs/second (approximate; varies with hardware and text length)
- **Accuracy**: High for common names
- **Memory**: Low (~100MB loaded)

## Fallback Strategy

If spaCy doesn't recognize a PERSON entity:

1. **Extract capitalized words**:
   ```python
   "Unknown name here" → ["Unknown"]
   ```

2. **Return first few capitalized words**:
   ```python
   "Celebrity Model Actor" → "Celebrity Model Actor"
   ```

3. **Last resort**: Return cleaned text as-is

This ensures we always get something, even for:

- Uncommon/rare names
- Nicknames
- Non-English names
- Stage names

## Testing

### How to Verify spaCy is Working

Run Cell 5 and check the output:

```
✅ spaCy model loaded: en_core_web_sm

📊 Name cleaning examples (with spaCy NER):
===================================================================================================
Original Name | Cleaned Name
===================================================================================================
Scarlett Johansson「LoRa」 | Scarlett Johansson
Emma Watson (JG) | Emma Watson
IU | IU
Belle Delphine | Belle Delphine
...
```

### Key Indicators

✅ **Good signs**:

- Person names cleanly extracted
- No extra words like "Model", "LoRA", "Celebrity"
- Multi-word names kept together (e.g., "Emma Watson" not just "Emma")

❌ **Issues to watch**:

- Empty results (extend the fallback logic)
- Partial names (e.g., only first name)
- Non-names included (tune preprocessing)

## Customization

### Add More Languages

For better support of non-English names:

```python
# Download the multilingual model first (run in a shell):
#   python -m spacy download xx_ent_wiki_sm
import spacy

# Use in code
nlp = spacy.load("xx_ent_wiki_sm")
# Note: this model labels people as "PER" rather than "PERSON",
# so the entity filter must check for "PER" when it is used.
```

### Adjust Entity Extraction

To extract other entities:

```python
# Extract organizations too
entities = [ent.text for ent in doc.ents if ent.label_ in ["PERSON", "ORG"]]
```

### Custom Entity Rules

Add custom patterns for names spaCy might miss:

```python
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# Add patterns for specific name formats.
# Illustrative pattern only: two consecutive title-cased tokens.
matcher.add("TITLE_CASE_NAME", [[{"IS_TITLE": True}, {"IS_TITLE": True}]])
```

## Benefits for This Project

### Better Person Identification

With cleaner names:

- LLMs receive recognizable names
- "Emma Watson" instead of "Emma Watson Model LoRA v3"
- Better identification accuracy

### Reduced Ambiguity

spaCy helps distinguish:

- Person names vs. descriptive words
- "Celebrity IU" → "IU" (person)
- "Model Bella" → "Bella" (person)

### Improved Context for LLMs

Cleaner input = better prompts:

```
Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..."
After:  "Given 'Emma Watson' (actress)..."
```

The LLM can now focus on identifying the person, not parsing the noise.

## Summary

- ✅ **spaCy NER** provides intelligent, context-aware name extraction
- ✅ **Better than regex** for handling complex name formats
- ✅ **Fallback strategy** ensures we always get a result
- ✅ **Industry standard** tool used in production NLP
- ✅ **Easy to use** with minimal code

The combination of:

1. Leetspeak translation
2. Noise removal
3. spaCy NER
4. Smart fallbacks

...results in clean, accurate person names ready for LLM annotation!
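
The fallback part of that combination could look something like the sketch below. It is an illustration under assumptions rather than the notebook's implementation: `capitalized_words` and `choose_name` are hypothetical names, and the `max_words=3` limit is a guess at what "first few capitalized words" means.

```python
def capitalized_words(text):
    """Return the words that start with an uppercase letter."""
    return [word for word in text.split() if word[:1].isupper()]


def choose_name(person_entities, cleaned_text, max_words=3):
    """Pick the final name in the order described in the fallback strategy.

    1. The first PERSON entity found by spaCy, if any.
    2. Otherwise the first few capitalized words (max_words is an assumed limit).
    3. Otherwise the cleaned text as-is.
    """
    if person_entities:
        return person_entities[0]
    caps = capitalized_words(cleaned_text)
    if caps:
        return " ".join(caps[:max_words])
    return cleaned_text


# No PERSON entity found, so the capitalized words are returned.
print(choose_name([], "Celebrity Model Actor"))   # Celebrity Model Actor
# No entity and no capitalized words, so the cleaned text is returned unchanged.
print(choose_name([], "unknown name here"))       # unknown name here
```

Scripts without letter case, such as CJK names, yield no capitalized words and therefore fall through to the last branch, which returns the cleaned text unchanged.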