# spaCy NER Implementation

## Why spaCy for NER?

**spaCy's Named Entity Recognition (NER)** is a better fit for this cleaning task than regex alone because:

1. **Intelligent entity extraction**: A trained statistical model labels PERSON entities instead of relying on hand-written patterns
2. **Context-aware**: Predictions take the surrounding tokens and sentence structure into account
3. **Robust**: Handles various name formats (first, last, full, stage names)
4. **Language support**: Trained pipelines exist for many languages, plus a multilingual NER model
5. **Industry standard**: Widely used in production NLP applications

## How It Works

### Pipeline Overview

```
Original Name
    ↓
1. Translate Leetspeak (4→a, 3→e, 1→i)
    ↓
2. Remove Noise (emoji, LoRA terms, versions)
    ↓
3. spaCy NER - Extract PERSON entities
    ↓
4. Fallback to capitalized words if needed
    ↓
Cleaned Name
```
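
The flow above could be wired together roughly as follows. This is a minimal sketch, not the notebook's actual code: `clean_name` and its four callable parameters are hypothetical names, and each stage is passed in so that the order and the fallback decision stay explicit.

```python
from typing import Callable, Optional


def clean_name(
    raw: str,
    translate: Callable[[str], str],
    denoise: Callable[[str], str],
    extract_person: Callable[[str], Optional[str]],
    fallback: Callable[[str], str],
) -> str:
    """Run the four pipeline stages in order (all stage names are hypothetical)."""
    text = translate(raw)          # 1. leetspeak translation
    text = denoise(text)           # 2. noise removal
    person = extract_person(text)  # 3. NER; None or "" when no PERSON is found
    return person if person else fallback(text)  # 4. fallback


# Toy stand-ins so the sketch runs without spaCy installed.
demo = clean_name(
    "4kira LoRA",
    translate=lambda s: s.replace("4", "a"),
    denoise=lambda s: s.replace("LoRA", "").strip(),
    extract_person=lambda s: None,  # pretend NER found nothing
    fallback=lambda s: s,
)
print(demo)  # akira
```

Passing the stages as parameters keeps the sketch independent of how each step is implemented.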

### Detailed Steps

#### Step 1: Leetspeak Translation
```
"4kira LoRA v2" → "akira LoRA v2"
"1rene Model" → "irene Model"
"3mma Watson" → "emma Watson"
```
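
One way this step might be implemented is sketched below. The digit mapping comes from the pipeline overview; the function name `translate_leetspeak` and the rule that version-like tokens (such as `v1` or `3.5`) are left alone are assumptions, inferred from the examples in this document where version tags survive this step unchanged.

```python
import re

# Mapping taken from the pipeline overview: 4→a, 3→e, 1→i.
LEET_MAP = str.maketrans({"4": "a", "3": "e", "1": "i"})

# Assumption: tokens that look like versions or plain numbers are skipped.
VERSION_TOKEN = re.compile(r"v?\d+(?:\.\d+)*", re.IGNORECASE)


def translate_leetspeak(text: str) -> str:
    """Replace leetspeak digits inside words, leaving version tokens intact."""
    tokens = []
    for token in text.split(" "):
        if VERSION_TOKEN.fullmatch(token):
            tokens.append(token)  # e.g. "v1", "v3.5", "2024"
        else:
            tokens.append(token.translate(LEET_MAP))
    return " ".join(tokens)


print(translate_leetspeak("4kira LoRA v2"))  # akira LoRA v2
print(translate_leetspeak("4kira Anime Character v1"))  # akira Anime Character v1
```

Version tokens wrapped in punctuation, such as `(v1)`, are not covered by this simple check and would need extra handling.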

#### Step 2: Noise Removal
```
"akira LoRA v2" → "akira"
"irene Model" → "irene"
"emma Watson" → "emma Watson"
```
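
The sketch below shows how such a noise filter could look with plain regular expressions. The patterns and the term list (`lora`, `model`) are illustrative assumptions rather than the notebook's real rules; keeping only the text before the first `|` follows the behavior shown in Case 4.

```python
import re

# Assumed noise vocabulary; the real list is likely longer.
NOISE_TERMS = re.compile(r"\b(?:lora|model)\b", re.IGNORECASE)
VERSION = re.compile(r"\bv\d+(?:\.\d+)*\b", re.IGNORECASE)
# Text wrapped in (), [], 「」 or 【】 is treated as metadata.
BRACKETED = re.compile(r"[(\[「【][^)\]」】]*[)\]」】]")
# Rough emoji coverage: pictograph blocks, misc symbols, variation selector.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def remove_noise(text: str) -> str:
    """Strip aliases, bracketed metadata, emoji, versions, and noise terms."""
    text = text.split("|")[0]  # keep the part before the first "|"
    for pattern in (BRACKETED, EMOJI, VERSION, NOISE_TERMS):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse leftover whitespace


print(remove_noise("Emma Watson (JG) v3.5"))  # Emma Watson
print(remove_noise("Scarlett Johansson「LoRa」"))  # Scarlett Johansson
```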

#### Step 3: spaCy NER
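```python
doc = nlp("akira")
# Entities (illustrative): [("akira", "PERSON")]
# Result: "akira"

doc = nlp("emma Watson")
# Entities (illustrative): [("emma Watson", "PERSON")]
# Result: "emma Watson"
# Actual predictions depend on the model; short or lowercase inputs
# may yield no PERSON entity, which Step 4 handles.
```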
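
A small helper for this step might look like the sketch below. `extract_person` is a hypothetical name, and returning only the first PERSON span is an assumption. The pipeline object is passed in as a parameter, and a stand-in is used here so the example runs without spaCy installed.

```python
from types import SimpleNamespace
from typing import Callable, Optional


def extract_person(text: str, nlp: Callable) -> Optional[str]:
    """Return the first PERSON span found by `nlp`, or None."""
    doc = nlp(text)
    persons = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    return persons[0] if persons else None


# With spaCy installed, `nlp` would be the loaded pipeline:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")

# Stand-in pipeline; it mimics only the `doc.ents`, `ent.text`,
# and `ent.label_` attributes used above.
def fake_nlp(text: str) -> SimpleNamespace:
    ents = []
    if "Watson" in text:
        ents.append(SimpleNamespace(text="emma Watson", label_="PERSON"))
    return SimpleNamespace(ents=ents)


print(extract_person("emma Watson", fake_nlp))  # emma Watson
print(extract_person("LoRA v2", fake_nlp))      # None
```

Returning `None` rather than an empty string makes the "no entity found" case explicit for the fallback step.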

#### Step 4: Fallback
If spaCy doesn't find a PERSON entity:
- Extract capitalized words (likely names)
- Or return cleaned text as-is

## Examples

### Case 1: Simple Name
```
Input:  "IU"
Output: "IU"

Process:
  - Preprocess: "IU" (no noise)
  - spaCy NER: Recognizes "IU" as PERSON
  - Result: "IU"
```

### Case 2: Name with LoRA Terms
```
Input:  "Scarlett Johansson「LoRa」"
Output: "Scarlett Johansson"

Process:
  - Preprocess: "Scarlett Johansson" (removed 「LoRa」)
  - spaCy NER: Recognizes "Scarlett Johansson" as PERSON
  - Result: "Scarlett Johansson"
```

### Case 3: Leetspeak Name
```
Input:  "4kira Anime Character v1"
Output: "akira"

Process:
  - Leetspeak: "akira Anime Character v1"
  - Preprocess: "akira Anime Character"
  - spaCy NER: Recognizes "akira" as PERSON
  - Result: "akira"
```

### Case 4: Complex Format
```
Input:  "Gakki | Aragaki Yui | 新垣結衣"
Output: "Gakki"

Process:
  - Preprocess: "Gakki" (kept first part before |)
  - spaCy NER: Recognizes "Gakki" as PERSON
  - Result: "Gakki"
```

### Case 5: With Metadata
```
Input:  "Emma Watson (JG) v3.5"
Output: "Emma Watson"

Process:
  - Preprocess: "Emma Watson" (removed (JG) and v3.5)
  - spaCy NER: Recognizes "Emma Watson" as PERSON
  - Result: "Emma Watson"
```

## Advantages Over Regex-Only

### Old Approach (Regex Only)
```python
# Just remove noise and hope for the best
name = remove_noise(name)
name = name.strip()
# Result: May include non-name words
```

Problems:
- Can't distinguish names from other capitalized words
- May include words like "Model", "Anime", "Character"
- No context awareness
- Language-dependent regex patterns needed

### New Approach (spaCy NER)
```python
# Intelligent entity extraction
preprocessed = remove_noise(name)
doc = nlp(preprocessed)
person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
# Result: Only the spans the model labels as PERSON
```

Benefits:
- ✅ Identifies actual person entities
- ✅ Ignores non-person words
- ✅ Context-aware (understands "Emma Watson" is one entity)
- ✅ Multi-language support
- ✅ Handles various name formats

## Comparison Examples

| Input | Regex Only | spaCy NER Pipeline |
|-------|------------|--------------------|
| `"Emma Watson Model"` | `"Emma Watson Model"` ❌ | `"Emma Watson"` ✅ |
| `"Anime Character Levi"` | `"Anime Character Levi"` ❌ | `"Levi"` ✅ |
| `"Taylor Swift v2"` | `"Taylor Swift"` ✅ | `"Taylor Swift"` ✅ |
| `"K4te Middleton"` | `"K4te Middleton"` ❌ | `"Kate Middleton"` ✅ |
| `"Celebrity IU"` | `"Celebrity IU"` ❌ | `"IU"` ✅ |

## spaCy Model Information

### Model Used
- **Name**: `en_core_web_sm`
- **Language**: English (but works reasonably with romanized names)
- **Size**: ~13 MB
- **Entities**: Recognizes PERSON, ORG, GPE, etc.

### Installation
```bash
# Install spaCy
pip install spacy

# Download model
python -m spacy download en_core_web_sm
```

The notebook automatically downloads the model if not found.
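
The notebook's own loading code is not shown here; a load-or-download helper could be sketched as follows. It relies on `spacy.load` raising `OSError` for a missing package and on `spacy.cli.download`. The function name `load_model` and its `load` and `download` parameters are hypothetical and exist mainly so the logic can be exercised with stand-ins.

```python
def load_model(name="en_core_web_sm", load=None, download=None):
    """Load a spaCy pipeline, downloading it once if it is missing."""
    if load is None or download is None:
        # Imported lazily so the function can be exercised with stand-ins.
        import spacy
        from spacy.cli import download as spacy_download

        load = load or spacy.load
        download = download or spacy_download
    try:
        return load(name)
    except OSError:
        # spacy.load raises OSError when the model package is not installed.
        download(name)
        return load(name)


# Usage with spaCy installed:
#   nlp = load_model("en_core_web_sm")
```

In some notebook environments a freshly installed model package may only become loadable after the kernel is restarted, so the second `load` call is not guaranteed to succeed.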

### Performance
- **Speed**: ~1000-5000 docs/second (rough figure; varies with hardware, text length, and batch size)
- **Accuracy**: High for common names
- **Memory**: Low (~100MB loaded)
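
Because throughput depends on the machine and the inputs, it may be worth measuring it directly. The sketch below times `nlp.pipe`, spaCy's batched processing method; `docs_per_second` is a hypothetical helper, and the stand-in pipeline at the end exists only so the example runs without spaCy.

```python
import time


def docs_per_second(nlp, texts, batch_size=256):
    """Measure how many texts per second `nlp.pipe` processes."""
    start = time.perf_counter()
    count = sum(1 for _ in nlp.pipe(texts, batch_size=batch_size))
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


class FakePipeline:
    """Stand-in exposing only the `pipe` method used above."""

    def pipe(self, texts, batch_size=256):
        for text in texts:
            yield text


rate = docs_per_second(FakePipeline(), ["Emma Watson", "IU"] * 500)
print(f"{rate:.0f} docs/second")
```

With a real pipeline, passing a representative sample of the names to be cleaned gives a more meaningful number than synthetic input.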

## Fallback Strategy

If spaCy doesn't recognize a PERSON entity:

1. **Extract capitalized words**:
   ```
   "Unknown name here" → ["Unknown"]
   ```

2. **Return the first few capitalized words**:
   ```
   "Celebrity Model Actor" → "Celebrity Model Actor"
   ```

3. **Last resort**: Return the cleaned text as-is

This ensures we always get something, even for:
- Uncommon/rare names
- Nicknames
- Non-English names
- Stage names
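
The three fallback rules could be expressed as a small function like the one below. This is a sketch under assumptions: `fallback_name` is a hypothetical name, "first few" is taken to mean at most three words, and a word counts as capitalized when its first character is uppercase.

```python
def fallback_name(cleaned: str, max_words: int = 3) -> str:
    """Pick a name when NER finds no PERSON entity."""
    # Rule 1: collect words whose first character is uppercase.
    capitalized = [word for word in cleaned.split() if word[0].isupper()]
    if capitalized:
        # Rule 2: keep only the first few capitalized words.
        return " ".join(capitalized[:max_words])
    # Rule 3: last resort, return the cleaned text unchanged.
    return cleaned.strip()


print(fallback_name("Celebrity Model Actor"))  # Celebrity Model Actor
print(fallback_name("unknown name here"))      # unknown name here
```

Note that rule 2 cannot tell names from other capitalized words, so results from the fallback path deserve a closer look than NER results.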

## Testing

### How to Verify spaCy is Working

Run Cell 5 and check the output:

```
✅ spaCy model loaded: en_core_web_sm

📊 Name cleaning examples (with spaCy NER):
===================================================================================================
Original Name                                      | Cleaned Name
===================================================================================================
Scarlett Johansson「LoRa」                        | Scarlett Johansson
Emma Watson (JG)                                  | Emma Watson
IU                                                | IU
Belle Delphine                                    | Belle Delphine
...
```

### Key Indicators

✅ **Good signs**:
- Person names cleanly extracted
- No extra words like "Model", "LoRA", "Celebrity"
- Multi-word names kept together (e.g., "Emma Watson" not just "Emma")

❌ **Issues to watch**:
- Empty results (extend the fallback logic)
- Partial names (e.g., only first name)
- Non-names included (tune preprocessing)
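
Some of these checks can be automated. The sketch below flags empty results and leftover noise words; `audit_cleaned` is a hypothetical helper and the word list simply reuses the examples named above. Partial names are harder to detect mechanically and are left to manual review.

```python
# Words that should not survive cleaning (taken from the examples above).
NOISE_WORDS = {"model", "lora", "celebrity"}


def audit_cleaned(cleaned: str) -> list:
    """Return a list of human-readable issues found in a cleaned name."""
    issues = []
    if not cleaned.strip():
        issues.append("empty result")
    leftover = [w for w in cleaned.split() if w.lower() in NOISE_WORDS]
    if leftover:
        issues.append("non-name words: " + ", ".join(leftover))
    return issues


print(audit_cleaned("Emma Watson"))        # []
print(audit_cleaned("Emma Watson Model"))  # ['non-name words: Model']
```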

## Customization

### Add More Languages

For better support of non-English names:

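```python
# Download the multilingual model first (run in a shell):
#   python -m spacy download xx_ent_wiki_sm

import spacy

# Note: this model labels people as "PER", not "PERSON",
# so the label filter must be adjusted accordingly.
nlp = spacy.load("xx_ent_wiki_sm")
```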

### Adjust Entity Extraction

To extract other entities:

```python
# Extract organizations too
entities = [ent.text for ent in doc.ents
            if ent.label_ in ["PERSON", "ORG"]]
```

### Custom Entity Rules

Add custom patterns for names spaCy might miss:

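```python
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# Illustrative pattern: two consecutive title-cased tokens, e.g. "Aragaki Yui"
matcher.add("TWO_PART_NAME", [[{"IS_TITLE": True}, {"IS_TITLE": True}]])
matches = matcher(doc)  # list of (match_id, start, end) tuples
```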

## Benefits for This Project

### Better Person Identification

With cleaner names:
- LLMs receive recognizable names
- "Emma Watson" instead of "Emma Watson Model LoRA v3"
- Better identification accuracy

### Reduced Ambiguity

spaCy helps distinguish:
- Person names vs. descriptive words
- "Celebrity IU" → "IU" (person)
- "Model Bella" → "Bella" (person)

### Improved Context for LLMs

Cleaner input = better prompts:
```
Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..."
After:  "Given 'Emma Watson' (actress)..."
```

The LLM can now focus on identifying the person, not parsing the noise.

## Summary

- ✅ **spaCy NER** provides intelligent, context-aware name extraction
- ✅ **Better than regex** for handling complex name formats
- ✅ **Fallback strategy** ensures we always get a result
- ✅ **Industry standard** tool used in production NLP
- ✅ **Easy to use** with minimal code

The combination of:
1. Leetspeak translation
2. Noise removal
3. spaCy NER
4. Smart fallbacks

...results in clean, accurate person names ready for LLM annotation!