	---
	license: apache-2.0
	language:
	- en
	- multilingual
	pipeline_tag: token-classification
	tags:
	- gliner
	- ner
	- token-classification
	- social-media
	- username-extraction
	library_name: gliner
	base_model: urchade/gliner_small-v2.1
	---

	# HandleAtlas-166m

	A fine-tuned [GLiNER small v2.1](https://huggingface.co/urchade/gliner_small-v2.1) model (~166M params)
	for extracting social-media handles from short bios. It was trained on Twitter/X bios; the
	handle patterns are expected to carry over to bios on other platforms, with some loss of
	accuracy (see Limitations).

	## Labels

	- `instagram_username`
	- `snapchat_username`
	- `youtube_username`
	- `twitch_username`
	- `tiktok_username`
	- `discord_username`
	- `x_username`
	- `cashapp_username`
	- `onlyfans_username`
	- `tumblr_username`
	- `github_username`
	- `kofi_username`
	- `patreon_username`
	- `roblox_username`
	- `generic_username`

	`generic_username` is a fallback for handle-shaped strings without a clear platform
	indicator.

	## Usage

	```python
	from gliner import GLiNER

	model = GLiNER.from_pretrained("LumeData/HandleAtlas-166m")

	labels = [
	    "instagram_username", "snapchat_username", "youtube_username",
	    "twitch_username", "tiktok_username", "discord_username",
	    "x_username", "cashapp_username", "onlyfans_username",
	    "tumblr_username", "github_username", "kofi_username",
	    "patreon_username", "roblox_username", "generic_username",
	]

	text = "Insta: foodgrammer | Snap: chefchef | DC: gamer420 | $cashtag"
	# Each predicted entity is a dict with "text", "label", and "score" keys.
	for ent in model.predict_entities(text, labels, threshold=0.5):
	    print(f"{ent['text']!r} -> {ent['label']} ({ent['score']:.2f})")
	```
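
The loop above prints a flat list of entities. One way to consume them is to group handles by platform. This is a minimal sketch that assumes only the `text`, `label`, and `score` keys used above; the helper name `group_handles` is hypothetical and the sample scores are invented for illustration, not model output.

```python
from collections import defaultdict

def group_handles(entities):
    """Group predicted handles by label, keeping the best score per handle."""
    best = defaultdict(dict)
    for ent in entities:
        handle, label, score = ent["text"], ent["label"], ent["score"]
        # Keep the highest score if the same handle is predicted more than once.
        if score > best[label].get(handle, 0.0):
            best[label][handle] = score
    return {label: dict(handles) for label, handles in best.items()}

# Illustrative entities in the shape used by the loop above (scores are invented).
sample = [
    {"text": "foodgrammer", "label": "instagram_username", "score": 0.97},
    {"text": "chefchef", "label": "snapchat_username", "score": 0.91},
    {"text": "foodgrammer", "label": "instagram_username", "score": 0.88},
]
grouped = group_handles(sample)
print(grouped)
```

With a loaded model, the same helper could be called as `group_handles(model.predict_entities(text, labels, threshold=0.5))`.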

	## Training

	- Base: `urchade/gliner_small-v2.1`
	- Real data: ~1,000 hand-labeled Twitter bios
	- Synthetic data: ~2,200 generated bios (template-based, plus Instagram→Discord text
	rewriting to cover the `discord_username` class)
	- Case augmentation: each training record is emitted in original + fully-lowercased
	form so the model is robust to casing of platform prefixes (`Dc:`/`dc:`/`DC:` etc.)
	- 5 epochs, batch 4 × grad-accum 2 (effective batch size 8), lr 5e-6 (encoder) / 1e-5 (heads), cosine schedule
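
The case-augmentation step in the list above can be sketched as follows. This is an illustration, not the actual training script: the record layout (`tokenized_text` plus token-index `ner` spans) is an assumption based on the format commonly used for GLiNER training data, and `augment_case` is a hypothetical helper name. Working on tokens rather than raw characters means lowercasing never shifts span indices.

```python
def augment_case(records):
    """Emit each record in its original form plus a fully lowercased copy."""
    out = []
    for rec in records:
        out.append(rec)
        out.append({
            # Lowercase every token; token positions stay the same.
            "tokenized_text": [tok.lower() for tok in rec["tokenized_text"]],
            # Spans are [start_token, end_token, label] and are copied unchanged.
            "ner": [list(span) for span in rec["ner"]],
        })
    return out

# Assumed record layout, for illustration only.
record = {
    "tokenized_text": ["DC", ":", "gamer420"],
    "ner": [[2, 2, "discord_username"]],
}
augmented = augment_case([record])
for rec in augmented:
    print(rec["tokenized_text"], rec["ner"])
```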

	## Eval

	On a 100-record held-out slice of real Twitter bios:

	| metric    | value |
	|-----------|-------|
	| precision | 0.849 |
	| recall    | 0.929 |
	| F1        | 0.887 |

	Per-label F1 is highest for youtube, tiktok, twitch, and onlyfans (1.00 each), followed by
	instagram (0.95), generic (0.88), cashapp (0.86), and snapchat (0.80).
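
As a quick consistency check on the overall numbers, F1 is the harmonic mean of precision and recall, and the reported values agree with each other to three decimals:

```python
precision, recall = 0.849, 0.929

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.3f}")  # 0.887
```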

	## Recommended thresholds

	- Default: `threshold=0.5`
	- For `generic_username`, raise the threshold to `0.65` to reduce false positives; it is
	the catch-all label and over-fires at the default threshold.
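
The usage example passes a single `threshold` to `predict_entities`, so one way to apply a stricter cut-off to `generic_username` is to predict at the default threshold and filter afterwards. A minimal sketch, assuming entity dictionaries with `label` and `score` keys; `apply_label_thresholds` is a hypothetical helper, not part of the library, and the sample scores are invented.

```python
DEFAULT_THRESHOLD = 0.5
# Per-label overrides; labels not listed here use the default.
LABEL_THRESHOLDS = {"generic_username": 0.65}

def apply_label_thresholds(entities, default=DEFAULT_THRESHOLD, overrides=LABEL_THRESHOLDS):
    """Keep entities whose score meets the threshold for their label."""
    return [
        ent for ent in entities
        if ent["score"] >= overrides.get(ent["label"], default)
    ]

# Illustrative predictions (scores are invented).
sample = [
    {"text": "foodgrammer", "label": "instagram_username", "score": 0.58},
    {"text": "maybe_handle", "label": "generic_username", "score": 0.58},
    {"text": "xgamerx", "label": "generic_username", "score": 0.71},
]
kept = apply_label_thresholds(sample)
print([ent["text"] for ent in kept])
```

With a loaded model, this would be applied as `apply_label_thresholds(model.predict_entities(text, labels, threshold=0.5))`.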

	## Limitations

	- Trained on patterns common in Twitter/X bios; performance on other domains
	(LinkedIn-style profiles, Reddit, forum signatures) is likely to be lower.
	- There is no `discord_invite` label, so invite codes will either be labeled
	`discord_username` or skipped.
	- In multi-line bios with many handles, the model can occasionally confuse the labels of
	adjacent URLs (e.g., `patreon.com/x | github.com/x` chains).