# spaCy NER Implementation

## Why spaCy for NER?

**spaCy's Named Entity Recognition (NER)** extracts person names more reliably than regex-only cleaning because:

1. **Intelligent entity extraction**: Recognizes PERSON entities with a trained statistical model
2. **Context-aware**: Uses sentence structure and surrounding words
3. **Robust**: Handles various name formats (first, last, full, stage names)
4. **Language support**: spaCy ships trained pipelines for many languages; the English model used here also copes reasonably with romanized names
5. **Industry standard**: Used in production NLP applications

## How It Works

### Pipeline Overview

```
Original Name
    ↓
1. Translate Leetspeak (4→a, 3→e, 1→i)
    ↓
2. Remove Noise (emoji, LoRA terms, versions)
    ↓
3. spaCy NER - Extract PERSON entities
    ↓
4. Fallback to capitalized words if needed
    ↓
Cleaned Name
```
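The flow above can be read as a plain function composition. The following is a minimal sketch of that control flow only, not the notebook's actual code: `clean_name` and its four callable parameters are hypothetical names, and the lambdas are toy stand-ins for the real stages.

```python
from typing import Callable, Optional


def clean_name(
    name: str,
    translate: Callable[[str], str],
    denoise: Callable[[str], str],
    extract_person: Callable[[str], Optional[str]],
    fallback: Callable[[str], str],
) -> str:
    """Run the four stages in order; use the fallback only when NER finds nothing."""
    text = denoise(translate(name))
    person = extract_person(text)
    return person if person else fallback(text)


# Toy stand-ins for the real stages, just to show the ordering.
result = clean_name(
    "4kira v2",
    translate=lambda s: s.replace("4", "a"),
    denoise=lambda s: s.replace(" v2", ""),
    extract_person=lambda s: None,  # pretend NER found no PERSON entity
    fallback=lambda s: s,
)
print(result)  # akira
```

Passing the stages in as callables keeps the ordering explicit and lets each stage be tested on its own.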

### Detailed Steps

#### Step 1: Leetspeak Translation
```python
# "4kira LoRA v2" → "akira LoRA v2"
# "1rene Model"   → "irene Model"
# "3mma Watson"   → "emma Watson"
```
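One way this step could be written is sketched below. The digit map (4→a, 3→e, 1→i) comes from the pipeline overview; the function name `translate_leetspeak` and the rule that version tokens such as `v2` or `v3.5` and purely numeric tokens are left untouched are assumptions, added so that a version marker is not turned into letters.

```python
import re

# Digit-to-letter map taken from the pipeline overview.
LEET_MAP = str.maketrans({"4": "a", "3": "e", "1": "i"})
# Assumption: tokens like "v2" or "v3.5" are version markers, not leetspeak.
VERSION_TOKEN = re.compile(r"^v\d+(\.\d+)*$", re.IGNORECASE)


def translate_leetspeak(name: str) -> str:
    """Replace leetspeak digits inside word-like tokens only."""
    tokens = []
    for token in name.split(" "):
        has_letter = any(ch.isalpha() for ch in token)
        if has_letter and not VERSION_TOKEN.match(token):
            token = token.translate(LEET_MAP)
        tokens.append(token)
    return " ".join(tokens)


print(translate_leetspeak("4kira LoRA v2"))   # akira LoRA v2
print(translate_leetspeak("K4te Middleton"))  # Kate Middleton
```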

#### Step 2: Noise Removal
```python
# "akira LoRA v2" → "akira"
# "irene Model"   → "irene"
# "emma Watson"   → "emma Watson"
```
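A rough sketch of what `remove_noise` might look like follows. Everything in it is an assumption chosen to reproduce the examples in this document: the noise-term list (`lora`, `model`), the bracket styles, the approximate emoji ranges, and keeping only the text before the first `|`.

```python
import re

# Assumed patterns; a real list of noise terms would likely be longer.
BRACKETED = re.compile(r"「[^」]*」|\([^)]*\)|\[[^\]]*\]")
VERSION = re.compile(r"\bv\d+(?:\.\d+)*\b", re.IGNORECASE)
NOISE_TERMS = re.compile(r"\b(?:lora|model)\b", re.IGNORECASE)
# Approximate emoji coverage: common pictographs and dingbats only.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def remove_noise(name: str) -> str:
    """Strip brackets, version tags, noise terms, and emoji from a name."""
    name = name.split("|")[0]          # keep the first alias only
    name = BRACKETED.sub(" ", name)    # drop 「...」, (...), [...]
    name = VERSION.sub(" ", name)      # drop v2, v3.5, ...
    name = NOISE_TERMS.sub(" ", name)  # drop LoRA, Model, ...
    name = EMOJI.sub(" ", name)
    return " ".join(name.split())      # collapse leftover whitespace


print(remove_noise("Emma Watson (JG) v3.5"))  # Emma Watson
```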

#### Step 3: spaCy NER
```python
nlp("akira")
# Entities: [("akira", PERSON)]
# Result: "akira"

nlp("emma Watson")
# Entities: [("emma Watson", PERSON)]
# Result: "emma Watson"
```
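The entity filter in this step can be wrapped in a small helper. This is a sketch: `extract_person_names` is a hypothetical name, and the pipeline is passed in as a parameter so the helper works with any loaded spaCy model. Whether a given string is actually tagged PERSON depends on the model.

```python
from typing import Callable, List


def extract_person_names(text: str, nlp: Callable) -> List[str]:
    """Return the text of every entity the pipeline labels PERSON."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]


# Typical use (requires spaCy and the model to be installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   names = extract_person_names("Emma Watson", nlp)
```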

#### Step 4: Fallback
If spaCy doesn't find a PERSON entity:
- Extract capitalized words (likely names)
- Or return cleaned text as-is

## Examples

### Case 1: Simple Name
```
Input:  "IU"
Output: "IU"

Process:
  - Preprocess: "IU" (no noise)
  - spaCy NER: Recognizes "IU" as PERSON
  - Result: "IU"
```

### Case 2: Name with LoRA Terms
```
Input:  "Scarlett Johansson「LoRa」"
Output: "Scarlett Johansson"

Process:
  - Preprocess: "Scarlett Johansson" (removed 「LoRa」)
  - spaCy NER: Recognizes "Scarlett Johansson" as PERSON
  - Result: "Scarlett Johansson"
```

### Case 3: Leetspeak Name
```
Input:  "4kira Anime Character v1"
Output: "akira"

Process:
  - Leetspeak: "akira Anime Character v1"
  - Preprocess: "akira Anime Character"
  - spaCy NER: Recognizes "akira" as PERSON
  - Result: "akira"
```

### Case 4: Complex Format
```
Input:  "Gakki | Aragaki Yui | 新垣結衣"
Output: "Gakki"

Process:
  - Preprocess: "Gakki" (kept first part before |)
  - spaCy NER: Recognizes "Gakki" as PERSON
  - Result: "Gakki"
```

### Case 5: With Metadata
```
Input:  "Emma Watson (JG) v3.5"
Output: "Emma Watson"

Process:
  - Preprocess: "Emma Watson" (removed (JG) and v3.5)
  - spaCy NER: Recognizes "Emma Watson" as PERSON
  - Result: "Emma Watson"
```

## Advantages Over Regex-Only

### Old Approach (Regex Only)
```python
# Just remove noise and hope for the best
name = remove_noise(name)
name = name.strip()
# Result: May include non-name words
```

Problems:
- Can't distinguish names from other capitalized words
- May include words like "Model", "Anime", "Character"
- No context awareness
- Language-dependent regex patterns needed

### New Approach (spaCy NER)
```python
# Intelligent entity extraction
preprocessed = remove_noise(name)
doc = nlp(preprocessed)
person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
# Result: Only actual person names
```

Benefits:
- ✅ Identifies actual person entities
- ✅ Ignores non-person words
- ✅ Context-aware (understands "Emma Watson" is one entity)
- ✅ Multi-language support
- ✅ Handles various name formats

## Comparison Examples

| Input | Regex Only | Full pipeline (spaCy NER) |
|-------|------------|---------------------------|
| `"Emma Watson Model"` | `"Emma Watson Model"` ❌ | `"Emma Watson"` ✅ |
| `"Anime Character Levi"` | `"Anime Character Levi"` ❌ | `"Levi"` ✅ |
| `"Taylor Swift v2"` | `"Taylor Swift"` ✅ | `"Taylor Swift"` ✅ |
| `"K4te Middleton"` | `"K4te Middleton"` ❌ | `"Kate Middleton"` ✅ |
| `"Celebrity IU"` | `"Celebrity IU"` ❌ | `"IU"` ✅ |

## spaCy Model Information

### Model Used
- **Name**: `en_core_web_sm`
- **Language**: English (but works reasonably with romanized names)
- **Size**: ~13 MB
- **Entities**: Recognizes PERSON, ORG, GPE, etc.

### Installation
```bash
# Install spaCy
pip install spacy

# Download model
python -m spacy download en_core_web_sm
```

The notebook automatically downloads the model if not found.
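A load-or-download step of this kind might look like the sketch below. It is not the notebook's actual cell: `load_or_download` is a hypothetical helper, and its `load` and `download` parameters exist only so the logic can be exercised without spaCy installed. By default it uses `spacy.load`, which raises `OSError` when the model package is missing, and `spacy.cli.download`.

```python
def load_or_download(model_name="en_core_web_sm", load=None, download=None):
    """Load a spaCy model, downloading it first if it is not installed."""
    if load is None or download is None:
        # Imported lazily so the helper can be tested with stand-ins.
        import spacy
        from spacy.cli import download as spacy_download

        load = load or spacy.load
        download = download or spacy_download
    try:
        return load(model_name)
    except OSError:
        # spacy.load raises OSError when the model package cannot be found.
        download(model_name)
        return load(model_name)


# Typical use: nlp = load_or_download("en_core_web_sm")
```

In some notebook environments a freshly downloaded model may only become importable after a kernel restart.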

### Performance
- **Speed**: ~1000-5000 docs/second
- **Accuracy**: High for common names
- **Memory**: Low (~100MB loaded)

## Fallback Strategy

If spaCy doesn't recognize a PERSON entity:

1. **Extract capitalized words**:
   ```python
   # "Unknown name here" → ["Unknown"]
   ```

2. **Return first few capitalized words**:
   ```python
   # "Celebrity Model Actor" → "Celebrity Model Actor"
   ```

3. **Last resort**: Return cleaned text as-is

This ensures we always get something, even for:
- Uncommon/rare names
- Nicknames
- Non-English names
- Stage names
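The fallback chain described in this section can be sketched as follows. The name `fallback_name`, the regex, and the limit of three words (one reading of "first few") are assumptions. The pattern only matches ASCII capitals, so names in other scripts fall through to the last-resort branch.

```python
import re

# A word starting with an ASCII capital letter, e.g. "Emma", "IU", "O'Neil".
CAPITALIZED = re.compile(r"\b[A-Z][\w'-]*")


def fallback_name(cleaned: str, max_words: int = 3) -> str:
    """Used when NER finds no PERSON: keep the first few capitalized words."""
    words = CAPITALIZED.findall(cleaned)
    if words:
        return " ".join(words[:max_words])
    # Last resort: return the cleaned text unchanged.
    return cleaned.strip()


print(fallback_name("Celebrity Model Actor"))  # Celebrity Model Actor
print(fallback_name("unknown name here"))      # unknown name here
```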

## Testing

### How to Verify spaCy is Working

Run Cell 5 and check the output:

```
✅ spaCy model loaded: en_core_web_sm

📊 Name cleaning examples (with spaCy NER):
===================================================================================================
Original Name                                      | Cleaned Name
===================================================================================================
Scarlett Johansson「LoRa」                          | Scarlett Johansson
Emma Watson (JG)                                  | Emma Watson
IU                                                | IU
Belle Delphine                                    | Belle Delphine
...
```

### Key Indicators

✅ **Good signs**:
- Person names cleanly extracted
- No extra words like "Model", "LoRA", "Celebrity"
- Multi-word names kept together (e.g., "Emma Watson" not just "Emma")

❌ **Issues to watch**:
- Empty results (extend the fallback logic)
- Partial names (e.g., only first name)
- Non-names included (tune preprocessing)
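Some of these indicators can be checked automatically. The sketch below covers two of them, empty results and leftover noise words. `check_cleaned` is a hypothetical helper, and the noise-word set is just the three words named in the list above. Detecting partial names is harder and is not attempted here.

```python
from typing import List

# The extra words called out in the "good signs" list.
NOISE_WORDS = {"model", "lora", "celebrity"}


def check_cleaned(original: str, cleaned: str) -> List[str]:
    """Return a list of problems found in one cleaned name (empty if none)."""
    issues = []
    if not cleaned.strip():
        issues.append("empty result")
    leftover = [w for w in cleaned.split() if w.lower() in NOISE_WORDS]
    if leftover:
        issues.append("noise words left: " + ", ".join(leftover))
    return issues


print(check_cleaned("Emma Watson Model", "Emma Watson"))  # []
```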

## Customization

### Add More Languages

For better support of non-English names:

```python
import spacy

# First download the multilingual model from a shell:
#   python -m spacy download xx_ent_wiki_sm
nlp = spacy.load("xx_ent_wiki_sm")
# Note: this model labels people as "PER", not "PERSON"
```

### Adjust Entity Extraction

To extract other entities:

```python
# Extract organizations too
entities = [ent.text for ent in doc.ents
            if ent.label_ in ["PERSON", "ORG"]]
```

### Custom Entity Rules

Add custom patterns for names spaCy might miss:

```python
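import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Hypothetical pattern: two consecutive title-cased tokens, e.g. "Emma Watson"
matcher.add("TWO_PART_NAME", [[{"IS_TITLE": True}, {"IS_TITLE": True}]])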
```

## Benefits for This Project

### Better Person Identification

With cleaner names:
- LLMs receive recognizable names
- "Emma Watson" instead of "Emma Watson Model LoRA v3"
- Better identification accuracy

### Reduced Ambiguity

spaCy helps distinguish:
- Person names vs. descriptive words
- "Celebrity IU" → "IU" (person)
- "Model Bella" → "Bella" (person)

### Improved Context for LLMs

Cleaner input = better prompts:
```
Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..."
After:  "Given 'Emma Watson' (actress)..."
```

The LLM can now focus on identifying the person, not parsing the noise.

## Summary

- ✅ **spaCy NER** provides intelligent, context-aware name extraction
- ✅ **Better than regex** for handling complex name formats
- ✅ **Fallback strategy** ensures we always get a result
- ✅ **Industry standard** tool used in production NLP
- ✅ **Easy to use** with minimal code

The combination of:
1. Leetspeak translation
2. Noise removal
3. spaCy NER
4. Smart fallbacks

...results in clean, accurate person names ready for LLM annotation!