# spaCy NER Implementation

## Why spaCy for NER?

Using **spaCy's Named Entity Recognition (NER)** improves on regex-only cleaning because:

1. **Learned entity extraction**: Recognizes PERSON entities with a statistical model instead of hand-written patterns
2. **Context-aware**: Uses the surrounding tokens to decide whether a word is part of a name
3. **Robust**: Handles various name formats (first, last, full, stage names)
4. **Language support**: Models are available for multiple languages and scripts
5. **Industry standard**: Widely used in production NLP applications
## How It Works

### Pipeline Overview

```
Original Name
↓
1. Translate Leetspeak (4→a, 3→e, 1→i)
↓
2. Remove Noise (emoji, LoRA terms, versions)
↓
3. spaCy NER - Extract PERSON entities
↓
4. Fallback to capitalized words if needed
↓
Cleaned Name
```
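As a rough sketch of the control flow above, the stages can be chained as plain functions. Every name here (`clean_name`, `preprocess`, `find_people`, `fallback`) is hypothetical and not taken from the notebook. Returning the first PERSON match is also an assumption, since the overview does not say how multiple matches are resolved.

```python
from typing import Callable, Iterable, List


def clean_name(
    raw: str,
    preprocess: Iterable[Callable[[str], str]],
    find_people: Callable[[str], List[str]],
    fallback: Callable[[str], str],
) -> str:
    """Run the preprocessing steps, then NER, then the fallback if NER finds nothing."""
    text = raw
    for step in preprocess:       # steps 1-2: leetspeak translation, noise removal
        text = step(text)
    people = find_people(text)    # step 3: PERSON entities, e.g. from spaCy
    if people:
        return people[0]          # assumption: keep the first PERSON match
    return fallback(text)         # step 4: capitalized words or cleaned text


# Stand-in stages for illustration only; a real run would plug in the
# leetspeak, noise-removal, and spaCy-based functions.
demo = clean_name(
    "  IU  ",
    preprocess=[str.strip],
    find_people=lambda text: [],  # pretend NER found nothing
    fallback=lambda text: text,
)
print(demo)  # IU
```

Passing the stages in as callables keeps each one testable on its own and makes it easy to swap the NER model.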
### Detailed Steps

#### Step 1: Leetspeak Translation

```
"4kira LoRA v2" → "akira LoRA v2"
"1rene Model"   → "irene Model"
"3mma Watson"   → "emma Watson"
```
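One way to implement this mapping is `str.translate`, applied token by token so that version tags such as `v2` are left alone, as in the examples above. The function name `translate_leetspeak`, the version-tag regex, and the restriction to the three digits listed in the pipeline overview are assumptions; the notebook's actual mapping may be larger.

```python
import re

# Only the substitutions named in the pipeline overview (assumed to be the full map).
LEET_MAP = str.maketrans({"4": "a", "3": "e", "1": "i"})

# Tokens like "v2" or "v3.5" are version tags, not leetspeak (assumed format).
VERSION_RE = re.compile(r"v\d+(?:\.\d+)*", re.IGNORECASE)


def translate_leetspeak(text: str) -> str:
    """Replace leetspeak digits inside words, skipping version tags and pure numbers."""

    def fix(token: str) -> str:
        is_version = VERSION_RE.fullmatch(token) is not None
        has_letter = any(ch.isalpha() for ch in token)
        if is_version or not has_letter:
            return token
        return token.translate(LEET_MAP)

    # Split on single spaces so the original spacing is preserved.
    return " ".join(fix(token) for token in text.split(" "))


print(translate_leetspeak("4kira LoRA v2"))   # akira LoRA v2
print(translate_leetspeak("K4te Middleton"))  # Kate Middleton
```

A known limitation of this sketch is that any other token mixing letters and the mapped digits is translated too, so real identifiers containing `1`, `3`, or `4` would need an exclusion list.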
#### Step 2: Noise Removal

```
"akira LoRA v2" → "akira"
"irene Model"   → "irene"
"emma Watson"   → "emma Watson"
```
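A minimal sketch of this step using only the standard library. The noise-term list, the bracket styles, keeping only the text before the first `|`, and treating Unicode category `So` as "emoji" are all assumptions chosen to reproduce the examples in this document; the notebook's real rules are not shown here.

```python
import re
import unicodedata

NOISE_TERMS = ("lora", "model")  # assumed list, taken from the examples above

BRACKETED_RE = re.compile(r"【[^】]*】|\([^)]*\)|\[[^\]]*\]")
VERSION_RE = re.compile(r"\bv\d+(?:\.\d+)*\b", re.IGNORECASE)
NOISE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in NOISE_TERMS) + r")\b",
    re.IGNORECASE,
)


def remove_noise(text: str) -> str:
    """Strip aliases after '|', bracketed tags, version tags, noise terms, and symbols."""
    text = text.split("|")[0]          # keep only the first alias
    text = BRACKETED_RE.sub(" ", text)  # 【LoRa】, (JG), [tag]
    text = VERSION_RE.sub(" ", text)    # v2, v3.5
    text = NOISE_RE.sub(" ", text)      # LoRA, Model
    # Drop "Symbol, other" characters, which covers most emoji (approximation).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "So")
    return " ".join(text.split())       # collapse leftover whitespace


print(remove_noise("Emma Watson (JG) v3.5"))     # Emma Watson
print(remove_noise("Scarlett Johansson【LoRa】"))  # Scarlett Johansson
```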
#### Step 3: spaCy NER

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Intended results; actual entities depend on the model and its version
doc = nlp("akira")
# Entities: [("akira", PERSON)]
# Result: "akira"

doc = nlp("emma Watson")
# Entities: [("emma Watson", PERSON)]
# Result: "emma Watson"
```
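The extraction itself can be wrapped in a small helper. This is a sketch: `find_people` and `load_default_model` are hypothetical names, and the helper accepts any callable that returns an object with `.ents`, which is what a loaded spaCy pipeline provides.

```python
from typing import Callable, List


def find_people(text: str, nlp: Callable) -> List[str]:
    """Return the text of every entity the pipeline labels as PERSON."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]


def load_default_model():
    """Load the English model named in this document (needs spaCy and the model installed)."""
    import spacy  # imported lazily so find_people can be exercised without spaCy

    return spacy.load("en_core_web_sm")
```

With the model installed, `find_people("Emma Watson", load_default_model())` would be expected to return `["Emma Watson"]`, although short or lowercase inputs may yield an empty list, which is what the fallback step is for.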
#### Step 4: Fallback

If spaCy doesn't find a PERSON entity:

- Extract capitalized words (likely names)
- Otherwise, return the cleaned text as-is
## Examples

The traces below show the intended behavior; the entities spaCy actually returns depend on the model and its version.

### Case 1: Simple Name

```
Input: "IU"
Output: "IU"
Process:
- Preprocess: "IU" (no noise)
- spaCy NER: Recognizes "IU" as PERSON
- Result: "IU"
```
### Case 2: Name with LoRA Terms

```
Input: "Scarlett Johansson【LoRa】"
Output: "Scarlett Johansson"
Process:
- Preprocess: "Scarlett Johansson" (removed 【LoRa】)
- spaCy NER: Recognizes "Scarlett Johansson" as PERSON
- Result: "Scarlett Johansson"
```
### Case 3: Leetspeak Name

```
Input: "4kira Anime Character v1"
Output: "akira"
Process:
- Leetspeak: "akira Anime Character v1"
- Preprocess: "akira Anime Character"
- spaCy NER: Recognizes "akira" as PERSON
- Result: "akira"
```
### Case 4: Complex Format

```
Input: "Gakki | Aragaki Yui | 新垣結衣"
Output: "Gakki"
Process:
- Preprocess: "Gakki" (kept first part before |)
- spaCy NER: Recognizes "Gakki" as PERSON
- Result: "Gakki"
```
### Case 5: With Metadata

```
Input: "Emma Watson (JG) v3.5"
Output: "Emma Watson"
Process:
- Preprocess: "Emma Watson" (removed (JG) and v3.5)
- spaCy NER: Recognizes "Emma Watson" as PERSON
- Result: "Emma Watson"
```
## Advantages Over Regex-Only

### Old Approach (Regex Only)

```python
# Just remove noise and hope for the best
name = remove_noise(name)
name = name.strip()
# Result: May include non-name words
```

Problems:

- Can't distinguish names from other capitalized words
- May include words like "Model", "Anime", "Character"
- No context awareness
- Needs language-dependent regex patterns
### New Approach (spaCy NER)

```python
# Model-based entity extraction
preprocessed = remove_noise(name)
doc = nlp(preprocessed)
person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
# Result: Only the spans the model labels as PERSON
```

Benefits:

- ✅ Identifies person entities
- ✅ Ignores non-person words
- ✅ Context-aware (treats "Emma Watson" as one entity)
- ✅ Multi-language support (with a suitable model)
- ✅ Handles various name formats
## Comparison Examples

| Input | Regex Only | spaCy NER |
|-------|------------|-----------|
| `"Emma Watson Model"` | `"Emma Watson Model"` ❌ | `"Emma Watson"` ✅ |
| `"Anime Character Levi"` | `"Anime Character Levi"` ❌ | `"Levi"` ✅ |
| `"Taylor Swift v2"` | `"Taylor Swift"` ✅ | `"Taylor Swift"` ✅ |
| `"K4te Middleton"` | `"K4te Middleton"` ❌ | `"Kate Middleton"` ✅ |
| `"Celebrity IU"` | `"Celebrity IU"` ❌ | `"IU"` ✅ |
## spaCy Model Information

### Model Used

- **Name**: `en_core_web_sm`
- **Language**: English (but works reasonably well with romanized names)
- **Size**: ~13 MB
- **Entities**: Recognizes PERSON, ORG, GPE, etc.

### Installation

```bash
# Install spaCy
pip install spacy

# Download the model
python -m spacy download en_core_web_sm
```

The notebook automatically downloads the model if it is not found.
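The load-or-download behavior could look roughly like the sketch below. It relies on `spacy.load` raising `OSError` for a missing model and on `spacy.cli.download`; the function name `load_model` is hypothetical, and the `load` and `download` parameters exist only so the logic can be exercised without network access. The notebook's actual code may differ.

```python
from typing import Any, Callable, Optional


def load_model(
    name: str = "en_core_web_sm",
    load: Optional[Callable[[str], Any]] = None,
    download: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Load a spaCy pipeline, downloading it once if it is not installed."""
    if load is None or download is None:
        import spacy  # lazy import: only needed for the default functions
        from spacy.cli import download as spacy_download

        load = load or spacy.load
        download = download or spacy_download
    try:
        return load(name)
    except OSError:  # spacy.load raises OSError when the model is missing
        download(name)
        return load(name)
```

In a notebook, `nlp = load_model()` would then be the only call needed; in some environments a kernel restart may still be required after the first download.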
### Performance

- **Speed**: ~1,000-5,000 docs/second (approximate; depends on hardware and text length)
- **Accuracy**: High for common names; lower for uncommon names, nicknames, and non-English names
- **Memory**: Low (~100 MB loaded)
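When many names are cleaned at once, spaCy's documentation recommends `nlp.pipe`, which streams texts through the pipeline in batches instead of calling `nlp` once per string. A minimal sketch, where the function name `find_people_batch` and the batch size of 256 are illustrative assumptions rather than values from the notebook:

```python
from typing import Iterable, List


def find_people_batch(texts: Iterable[str], nlp, batch_size: int = 256) -> List[List[str]]:
    """Run NER over many strings with nlp.pipe and collect PERSON spans per input."""
    results = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([ent.text for ent in doc.ents if ent.label_ == "PERSON"])
    return results
```

The result has one list per input, in input order, so an empty inner list marks a name that needs the fallback.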
## Fallback Strategy

If spaCy doesn't recognize a PERSON entity:

1. **Extract capitalized words**:
   ```
   "Unknown name here" → ["Unknown"]
   ```
2. **Return the first few capitalized words**:
   ```
   "Celebrity Model Actor" → "Celebrity Model Actor"
   ```
3. **Last resort**: Return the cleaned text as-is

This ensures a non-empty result whenever the cleaned text is non-empty, even for:

- Uncommon/rare names
- Nicknames
- Non-English names
- Stage names
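The three fallback steps can be sketched as follows. The names `capitalized_words` and `fallback_name` are hypothetical, and `max_words=3` is an assumed reading of "first few".

```python
from typing import List


def capitalized_words(text: str) -> List[str]:
    """Return the whitespace-separated words that start with an uppercase letter."""
    return [word for word in text.split() if word[0].isupper()]


def fallback_name(cleaned: str, max_words: int = 3) -> str:
    """Prefer the first few capitalized words; otherwise return the cleaned text as-is."""
    words = capitalized_words(cleaned)
    if words:
        return " ".join(words[:max_words])
    return cleaned.strip()  # last resort: nothing capitalized (e.g. lowercase or CJK)


print(fallback_name("Unknown name here"))      # Unknown
print(fallback_name("Celebrity Model Actor"))  # Celebrity Model Actor
print(fallback_name("akira"))                  # akira
```

Because scripts without letter case have no uppercase characters, names written in them always reach the last-resort branch and are returned unchanged.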
## Testing

### How to Verify spaCy is Working

Run Cell 5 and check the output:

```
✓ spaCy model loaded: en_core_web_sm
Name cleaning examples (with spaCy NER):
===================================================================================================
Original Name | Cleaned Name
===================================================================================================
Scarlett Johansson【LoRa】 | Scarlett Johansson
Emma Watson (JG) | Emma Watson
IU | IU
Belle Delphine | Belle Delphine
...
```
### Key Indicators

✅ **Good signs**:

- Person names cleanly extracted
- No extra words like "Model", "LoRA", "Celebrity"
- Multi-word names kept together (e.g., "Emma Watson", not just "Emma")

❌ **Issues to watch**:

- Empty results (extend the fallback logic)
- Partial names (e.g., only the first name)
- Non-names included (tune the preprocessing)
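These checks can be automated with a small table of expected results. The helper below is a sketch: `check_cleaning` is a hypothetical name, and `cleaner` stands for whatever function the notebook uses to clean a single name.

```python
from typing import Callable, Iterable, List, Tuple


def check_cleaning(
    cleaner: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> List[str]:
    """Return a description of every case that is empty or differs from the expected name."""
    problems = []
    for raw, expected in cases:
        got = cleaner(raw)
        if not got.strip():
            problems.append(f"{raw!r}: empty result")
        elif got != expected:
            problems.append(f"{raw!r}: expected {expected!r}, got {got!r}")
    return problems


cases = [
    ("Scarlett Johansson【LoRa】", "Scarlett Johansson"),
    ("Emma Watson (JG)", "Emma Watson"),
    ("IU", "IU"),
]

# str.strip stands in for the real cleaner here, so the first two cases are reported.
for problem in check_cleaning(str.strip, cases):
    print(problem)
```

An empty list means every case matched, so the same call can be turned into an assertion in a test cell.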
## Customization

### Add More Languages

For better support of non-English names:

```python
# First download the multilingual model from a shell:
#   python -m spacy download xx_ent_wiki_sm
import spacy

nlp = spacy.load("xx_ent_wiki_sm")
# Note: this model labels people as "PER" rather than "PERSON",
# so the label filter needs to accept "PER" as well.
```
### Adjust Entity Extraction

To extract other entities:

```python
# Extract organizations too
entities = [ent.text for ent in doc.ents
            if ent.label_ in ["PERSON", "ORG"]]
```
### Custom Entity Rules

Add custom patterns for names spaCy might miss:

```python
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# Add patterns for specific name formats
```
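For a fixed list of known names, spaCy's `entity_ruler` component is an alternative to a hand-built `Matcher`, because it writes its matches straight into `doc.ents`. The sketch below uses a blank English pipeline so it runs without a downloaded model; the helper name `build_name_ruler_pipeline` and the example names are assumptions for illustration, not part of the notebook.

```python
import spacy


def build_name_ruler_pipeline(known_names):
    """Create a blank English pipeline that tags the given names as PERSON."""
    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(
        [{"label": "PERSON", "pattern": name} for name in known_names]
    )
    return nlp


nlp = build_name_ruler_pipeline(["IU", "Belle Delphine"])
doc = nlp("Celebrity IU")
people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
print(people)  # ['IU']
```

With a trained pipeline such as `en_core_web_sm`, the same ruler can be added with `nlp.add_pipe("entity_ruler", before="ner")` so that the rule-based matches take precedence over the statistical ones. String patterns are matched case-sensitively by default.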
## Benefits for This Project

### Better Person Identification

With cleaner names:

- LLMs receive recognizable names
- "Emma Watson" instead of "Emma Watson Model LoRA v3"
- Better identification accuracy

### Reduced Ambiguity

spaCy helps distinguish:

- Person names vs. descriptive words
- "Celebrity IU" → "IU" (person)
- "Model Bella" → "Bella" (person)

### Improved Context for LLMs

Cleaner input = better prompts:

```
Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..."
After:  "Given 'Emma Watson' (actress)..."
```

The LLM can now focus on identifying the person, not parsing the noise.
## Summary

- ✅ **spaCy NER** provides context-aware name extraction
- ✅ **Better than regex** for handling complex name formats
- ✅ **Fallback strategy** returns a result even when no PERSON entity is found
- ✅ **Industry standard** tool used in production NLP
- ✅ **Easy to use** with minimal code

The combination of:

1. Leetspeak translation
2. Noise removal
3. spaCy NER
4. Smart fallbacks

...results in clean person names ready for LLM annotation.