| # spaCy NER Implementation |
|
|
| ## Why spaCy for NER? |
|
|
| Using **spaCy's Named Entity Recognition (NER)** generally works better than regex-only cleaning because: |
|
|
| 1. **Intelligent entity extraction**: Recognizes PERSON entities with a trained statistical model |
| 2. **Context-aware**: Uses sentence structure and surrounding words to find entity boundaries |
| 3. **Robust**: Handles many name formats (first, last, full, stage names), though rare names can still be missed |
| 4. **Language support**: Pretrained pipelines exist for many languages; each model covers the language(s) it was trained on |
| 5. **Industry standard**: Widely used in production NLP applications |
|
|
| ## How It Works |
|
|
| ### Pipeline Overview |
|
|
| ``` |
| Original Name |
| ↓ |
| 1. Translate Leetspeak (4→a, 3→e, 1→i) |
| ↓ |
| 2. Remove Noise (emoji, LoRA terms, versions) |
| ↓ |
| 3. spaCy NER - Extract PERSON entities |
| ↓ |
| 4. Fallback to capitalized words if needed |
| ↓ |
| Cleaned Name |
| ``` |
|
|
| ### Detailed Steps |
|
|
| #### Step 1: Leetspeak Translation |
| ``` |
| "4kira LoRA v2" → "akira LoRA v2" |
| "1rene Model" → "irene Model" |
| "3mma Watson" → "emma Watson" |
| ``` |
|
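One way to implement this step is sketched below. It is an illustration, not the notebook's actual code: the name `translate_leetspeak`, the three-character map (taken from the pipeline overview), and the rule that skips version tags such as `v3.5` are assumptions. The skip rule is there so that version numbers are not rewritten before Step 2 removes them.

```python
import re

# Digit-to-letter map from the pipeline overview (4→a, 3→e, 1→i).
LEET_MAP = str.maketrans({"4": "a", "3": "e", "1": "i"})

# Tokens such as "v2" or "v3.5" are version tags, not leetspeak (assumed rule).
VERSION_TAG = re.compile(r"v\d+(\.\d+)*", re.IGNORECASE)


def translate_leetspeak(text: str) -> str:
    """Replace leetspeak digits inside word-like tokens only."""
    translated = []
    for token in text.split(" "):
        has_letter = any(ch.isalpha() for ch in token)
        if has_letter and not VERSION_TAG.fullmatch(token):
            token = token.translate(LEET_MAP)
        translated.append(token)
    return " ".join(translated)
```

Purely numeric tokens are left untouched because they contain no letters.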
|
| #### Step 2: Noise Removal |
| ``` |
| "akira LoRA v2" → "akira" |
| "irene Model" → "irene" |
| "emma Watson" → "emma Watson" |
| ``` |
|
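A minimal sketch of the noise-removal step follows. The function name matches the `remove_noise` call used later in this document, but the body is an assumption: the noise-term list, the bracket styles, the emoji ranges, and the "keep the first segment before `|`" rule are inferred from the examples here and would need tuning on real data.

```python
import re

# Hypothetical noise vocabulary; extend it to match the dataset.
NOISE_TERMS = ("lora", "model")

BRACKETED = re.compile(r"「[^」]*」|\([^)]*\)|\[[^\]]*\]")
VERSION_TAG = re.compile(r"\bv\d+(?:\.\d+)*\b", re.IGNORECASE)
NOISE_WORD = re.compile(r"\b(?:" + "|".join(NOISE_TERMS) + r")\b", re.IGNORECASE)
# Covers the most common emoji blocks only, not every emoji code point.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def remove_noise(text: str) -> str:
    """Strip brackets, version tags, noise words, and emoji from a name."""
    text = text.split("|")[0]          # keep the first alias only
    text = BRACKETED.sub(" ", text)    # drop 「...」, (...), [...]
    text = VERSION_TAG.sub(" ", text)  # drop v2, v3.5, ...
    text = NOISE_WORD.sub(" ", text)   # drop LoRA, Model, ...
    text = EMOJI.sub(" ", text)
    return " ".join(text.split())      # collapse leftover whitespace
```

Words such as "Anime" or "Character" are deliberately left in place here; separating them from the name is the job of the NER step.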
|
| #### Step 3: spaCy NER |
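| ```python |
| import spacy |
|
| nlp = spacy.load("en_core_web_sm") |
|
| doc = nlp("emma Watson") |
| print([(ent.text, ent.label_) for ent in doc.ents]) |
| # Intended: one PERSON entity spanning "emma Watson" -> result "emma Watson" |
|
| doc = nlp("akira") |
| print([(ent.text, ent.label_) for ent in doc.ents]) |
| # Intended: one PERSON entity "akira" -> result "akira" |
| # Short or lowercase inputs may yield no entity; Step 4 then applies |
| ``` |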
|
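The entity filter can be wrapped in a small helper. This is a sketch under two assumptions: `nlp` is any loaded spaCy pipeline whose NER component uses the `PERSON` label (as `en_core_web_sm` does), and the helper names `extract_person_names` and `first_person_or_none` are hypothetical.

```python
from typing import Callable, List, Optional


def extract_person_names(nlp: Callable, text: str) -> List[str]:
    """Return the text of every entity the pipeline labels PERSON."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]


def first_person_or_none(nlp: Callable, text: str) -> Optional[str]:
    """Return the first PERSON entity, or None so the caller can fall back."""
    names = extract_person_names(nlp, text)
    return names[0] if names else None
```

With a real model this would be called as `first_person_or_none(spacy.load("en_core_web_sm"), "emma Watson")`. A `None` result is the signal to move on to Step 4.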
|
| #### Step 4: Fallback |
| If spaCy doesn't find a PERSON entity: |
| - Extract capitalized words (likely names) |
| - Or return cleaned text as-is |
|
|
| ## Examples |
|
| The cases below show the intended behavior. Actual entities depend on the spaCy model and version; when no PERSON entity is found, the fallback described under "Fallback Strategy" applies. |
|
| ### Case 1: Simple Name |
| ``` |
| Input: "IU" |
| Output: "IU" |
| |
| Process: |
| - Preprocess: "IU" (no noise) |
| - spaCy NER: Recognizes "IU" as PERSON |
| - Result: "IU" |
| ``` |
|
|
| ### Case 2: Name with LoRA Terms |
| ``` |
| Input: "Scarlett Johansson「LoRa」" |
| Output: "Scarlett Johansson" |
| |
| Process: |
| - Preprocess: "Scarlett Johansson" (removed 「LoRa」) |
| - spaCy NER: Recognizes "Scarlett Johansson" as PERSON |
| - Result: "Scarlett Johansson" |
| ``` |
|
|
| ### Case 3: Leetspeak Name |
| ``` |
| Input: "4kira Anime Character v1" |
| Output: "akira" |
| |
| Process: |
| - Leetspeak: "akira Anime Character v1" |
| - Preprocess: "akira Anime Character" |
| - spaCy NER: Recognizes "akira" as PERSON |
| - Result: "akira" |
| ``` |
|
|
| ### Case 4: Complex Format |
| ``` |
| Input: "Gakki | Aragaki Yui | 新垣結衣" |
| Output: "Gakki" |
| |
| Process: |
| - Preprocess: "Gakki" (kept first part before |) |
| - spaCy NER: Recognizes "Gakki" as PERSON |
| - Result: "Gakki" |
| ``` |
|
|
| ### Case 5: With Metadata |
| ``` |
| Input: "Emma Watson (JG) v3.5" |
| Output: "Emma Watson" |
| |
| Process: |
| - Preprocess: "Emma Watson" (removed (JG) and v3.5) |
| - spaCy NER: Recognizes "Emma Watson" as PERSON |
| - Result: "Emma Watson" |
| ``` |
|
|
| ## Advantages Over Regex-Only |
|
|
| ### Old Approach (Regex Only) |
| ```python |
| # Just remove noise and hope for the best |
| name = remove_noise(name) |
| name = name.strip() |
| # Result: May include non-name words |
| ``` |
|
|
| Problems: |
| - Can't distinguish names from other capitalized words |
| - May include words like "Model", "Anime", "Character" |
| - No context awareness |
| - Language-dependent regex patterns needed |
|
|
| ### New Approach (spaCy NER) |
| ```python |
| # Intelligent entity extraction |
| preprocessed = remove_noise(name) |
| doc = nlp(preprocessed) |
| person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"] |
| # Result: only the spans the model labels as PERSON |
| ``` |
|
|
| Benefits: |
| - ✅ Identifies actual person entities |
| - ✅ Ignores non-person words |
| - ✅ Context-aware (understands "Emma Watson" is one entity) |
| - ✅ Multi-language support (with a suitable pretrained model) |
| - ✅ Handles various name formats |
|
|
| ## Comparison Examples |
|
|
| Input | Regex Only | Full Pipeline (with spaCy NER) |
| |-------|------------|-----------| |
| | `"Emma Watson Model"` | `"Emma Watson Model"` ❌ | `"Emma Watson"` ✅ | |
| | `"Anime Character Levi"` | `"Anime Character Levi"` ❌ | `"Levi"` ✅ | |
| | `"Taylor Swift v2"` | `"Taylor Swift"` ✅ | `"Taylor Swift"` ✅ | |
| | `"K4te Middleton"` | `"K4te Middleton"` ❌ | `"Kate Middleton"` ✅ | |
| | `"Celebrity IU"` | `"Celebrity IU"` ❌ | `"IU"` ✅ | |
|
|
| ## spaCy Model Information |
|
|
| ### Model Used |
| - **Name**: `en_core_web_sm` |
| - **Language**: English (but works reasonably with romanized names) |
| - **Size**: ~13 MB |
| - **Entities**: Recognizes PERSON, ORG, GPE, etc. |
|
|
| ### Installation |
| ```bash |
| # Install spaCy |
| pip install spacy |
| |
| # Download model |
| python -m spacy download en_core_web_sm |
| ``` |
|
|
| The notebook automatically downloads the model if not found. |
|
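The load-or-download behavior can be sketched as follows. This is an assumption about how the notebook might do it, not a copy of its code; the function name `load_spacy_model` is hypothetical. It relies on `spacy.load` raising `OSError` when the model package is missing and on `spacy.cli.download` to install it.

```python
def load_spacy_model(name: str = "en_core_web_sm"):
    """Load a spaCy pipeline, downloading it first if it is not installed."""
    # Imported lazily so this helper can be defined without spaCy present.
    import spacy
    from spacy.cli import download

    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package cannot be found.
        download(name)
        return spacy.load(name)
```

In some notebook environments a freshly downloaded model may only become loadable after the kernel is restarted, so the second `spacy.load` call can still fail there.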
|
| ### Performance |
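| - **Speed**: roughly 1,000-5,000 short documents per second (a rough, hardware-dependent estimate) |
| - **Accuracy**: High for common English names; lower for rare, lowercase, or non-English names |
| - **Memory**: Low (about 100 MB once loaded) |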
|
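Throughput in this range usually depends on processing names in batches rather than calling `nlp` once per string. A minimal sketch using spaCy's `nlp.pipe`, assuming `nlp` is a loaded pipeline with a `PERSON` label; the helper name and the batch size of 256 are illustrative choices, not measured optima.

```python
from typing import Iterable, Iterator, List


def extract_persons_batch(
    nlp, texts: Iterable[str], batch_size: int = 256
) -> Iterator[List[str]]:
    """Yield the PERSON entities for each text, processing texts in batches."""
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
```

Results come back in the same order as the input, so they can be zipped with the original names.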
|
| ## Fallback Strategy |
|
|
| If spaCy doesn't recognize a PERSON entity: |
|
|
| 1. **Extract capitalized words**: |
| ``` |
| "unknown Name here" → ["Name"] |
| ``` |
|
|
| 2. **Return first few capitalized words**: |
| ``` |
| "Celebrity Model Actor" → "Celebrity Model Actor" |
| ``` |
|
|
| 3. **Last resort**: Return cleaned text as-is |
|
|
| This ensures the pipeline always returns a candidate (which may still contain non-name words), even for: |
| - Uncommon/rare names |
| - Nicknames |
| - Non-English names |
| - Stage names |
|
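The three fallback steps above can be sketched as a single function. The name `fallback_name`, the ASCII-only capitalization regex, and the limit of three words for "first few" are assumptions made for illustration.

```python
import re
from typing import List

# A word that starts with an ASCII capital letter, e.g. "Emma" or "IU".
CAPITALIZED_WORD = re.compile(r"\b[A-Z][\w'-]*")


def fallback_name(cleaned: str, person_names: List[str], max_words: int = 3) -> str:
    """Pick a name from NER output, else capitalized words, else the cleaned text."""
    if person_names:
        return person_names[0]
    capitalized = CAPITALIZED_WORD.findall(cleaned)
    if capitalized:
        return " ".join(capitalized[:max_words])
    # Last resort: lowercase or non-Latin input is returned unchanged.
    return cleaned.strip()
```

As the second example above shows, this step cannot tell a name from any other capitalized word, so its output is a candidate rather than a confirmed name.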
|
| ## Testing |
|
|
| ### How to Verify spaCy is Working |
|
|
| Run Cell 5 and check the output: |
|
|
| ``` |
| ✅ spaCy model loaded: en_core_web_sm |
| |
| 📊 Name cleaning examples (with spaCy NER): |
| =================================================================================================== |
| Original Name | Cleaned Name |
| =================================================================================================== |
| Scarlett Johansson「LoRa」 | Scarlett Johansson |
| Emma Watson (JG) | Emma Watson |
| IU | IU |
| Belle Delphine | Belle Delphine |
| ... |
| ``` |
|
|
| ### Key Indicators |
|
|
| ✅ **Good signs**: |
| - Person names cleanly extracted |
| - No extra words like "Model", "LoRA", "Celebrity" |
| - Multi-word names kept together (e.g., "Emma Watson" not just "Emma") |
|
|
| ❌ **Issues to watch**: |
| - Empty results (extend the fallback logic) |
| - Partial names (e.g., only first name) |
| - Non-names included (tune preprocessing) |
|
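Two of these checks are easy to automate. The sketch below flags empty results and leftover noise words; the function name `audit_cleaned_name` and the word list (taken from the "Good signs" bullet) are assumptions. Partial names are harder to detect reliably and are left for manual review.

```python
from typing import List

# Words that should never survive cleaning (assumed list).
SUSPECT_WORDS = {"model", "lora", "celebrity"}


def audit_cleaned_name(cleaned: str) -> List[str]:
    """Return a list of issue labels for one cleaned name (empty if none)."""
    words = cleaned.split()
    if not words:
        return ["empty result"]
    issues = []
    if any(word.lower() in SUSPECT_WORDS for word in words):
        issues.append("contains non-name word")
    return issues
```

Running this over the full output of Cell 5 gives a quick count of how many names still need attention.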
|
| ## Customization |
|
|
| ### Add More Languages |
|
|
| For better support of non-English names: |
|
|
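| ```python |
| # Download the multilingual model first (shell command): |
| #   python -m spacy download xx_ent_wiki_sm |
|
| import spacy |
|
| nlp = spacy.load("xx_ent_wiki_sm") |
| # Note: this model labels people as "PER", not "PERSON" |
| person_names = [ent.text for ent in nlp("Aragaki Yui").ents if ent.label_ == "PER"] |
| ``` |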
|
|
| ### Adjust Entity Extraction |
|
|
| To extract other entities: |
|
|
| ```python |
| # Extract organizations too |
| entities = [ent.text for ent in doc.ents |
| if ent.label_ in ["PERSON", "ORG"]] |
| ``` |
|
|
| ### Custom Entity Rules |
|
|
| Add custom patterns for names spaCy might miss: |
|
|
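| ```python |
| from spacy.matcher import Matcher |
|
| matcher = Matcher(nlp.vocab) |
| # Illustrative pattern (an assumption): two consecutive title-cased tokens |
| matcher.add("TWO_WORD_NAME", [[{"IS_TITLE": True}, {"IS_TITLE": True}]]) |
| doc = nlp("Belle Delphine") |
| candidates = [doc[start:end].text for _, start, end in matcher(doc)] |
| ``` |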
|
|
| ## Benefits for This Project |
|
|
| ### Better Person Identification |
|
|
| With cleaner names: |
| - LLMs receive recognizable names |
| - "Emma Watson" instead of "Emma Watson Model LoRA v3" |
| - Better identification accuracy |
|
|
| ### Reduced Ambiguity |
|
|
| spaCy helps distinguish: |
| - Person names vs. descriptive words |
| - "Celebrity IU" → "IU" (person) |
| - "Model Bella" → "Bella" (person) |
|
|
| ### Improved Context for LLMs |
|
|
| Cleaner input = better prompts: |
| ``` |
| Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..." |
| After: "Given 'Emma Watson' (actress)..." |
| ``` |
|
|
| The LLM can now focus on identifying the person, not parsing the noise. |
|
|
| ## Summary |
|
|
| - ✅ **spaCy NER** provides intelligent, context-aware name extraction |
| - ✅ **Better than regex** for handling complex name formats |
| - ✅ **Fallback strategy** ensures we always get a result |
| - ✅ **Industry standard** tool used in production NLP |
| - ✅ **Easy to use** with minimal code |
|
|
| The combination of: |
| 1. Leetspeak translation |
| 2. Noise removal |
| 3. spaCy NER |
| 4. Smart fallbacks |
|
|
| ...results in clean, accurate person names ready for LLM annotation! |
|
|