---
license: apache-2.0
language:
- en
- multilingual
pipeline_tag: token-classification
tags:
- gliner
- ner
- token-classification
- social-media
- username-extraction
library_name: gliner
base_model: urchade/gliner_small-v2.1
---

# HandleAtlas-166m

A fine-tuned [GLiNER small v2.1](https://huggingface.co/urchade/gliner_small-v2.1) (~166M params) for extracting social-media handles from short bios. Built on Twitter/X bios, but the patterns generalize to other platforms.

## Labels

- `instagram_username`
- `snapchat_username`
- `youtube_username`
- `twitch_username`
- `tiktok_username`
- `discord_username`
- `x_username`
- `cashapp_username`
- `onlyfans_username`
- `tumblr_username`
- `github_username`
- `kofi_username`
- `patreon_username`
- `roblox_username`
- `generic_username`

`generic_username` is a fallback for handle-shaped strings without a clear platform indicator.

## Usage

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("LumeData/HandleAtlas-166m")

labels = [
    'instagram_username', 'snapchat_username', 'youtube_username',
    'twitch_username', 'tiktok_username', 'discord_username', 'x_username',
    'cashapp_username', 'onlyfans_username', 'tumblr_username',
    'github_username', 'kofi_username', 'patreon_username',
    'roblox_username', 'generic_username',
]

text = "Insta: foodgrammer | Snap: chefchef | DC: gamer420 | $cashtag"

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(f"{ent['text']!r} -> {ent['label']} ({ent['score']:.2f})")
```

## Training

- Base: `urchade/gliner_small-v2.1`
- Real data: ~1,000 hand-labeled Twitter bios
- Synthetic data: ~2,200 generated bios (template-based + IG→Discord text rewriting for the `discord_username` class)
- Case augmentation: each training record is emitted in original + fully-lowercased form so the model is robust to casing of platform prefixes (`Dc:`/`dc:`/`DC:` etc.)
- 5 epochs, batch 4 × grad-accum 2, lr 5e-6 (encoder) / 1e-5 (heads), cosine schedule

## Eval

On a 100-record held-out slice of real Twitter bios:

| metric    | value |
|-----------|-------|
| precision | 0.849 |
| recall    | 0.929 |
| F1        | 0.887 |

Strong per-label F1 on instagram (0.95), youtube (1.00), tiktok (1.00), twitch (1.00), onlyfans (1.00), generic (0.88), cashapp (0.86), snapchat (0.80).

## Recommended thresholds

- Default: `threshold=0.5`
- For `generic_username`, bump to `0.65` to reduce false positives; it's the catch-all label and over-fires at the default threshold.

## Limitations

- Trained on patterns common in Twitter/X bios; performance on other domains (LinkedIn-style, Reddit, forum sigs) will be lower.
- `discord_invite` is not predicted; invite codes will be classified as `discord_username` or skipped.
- Multi-line bios with many handles can occasionally confuse adjacent URL labels (e.g., `patreon.com/x | github.com/x` chains).
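
## Applying per-label thresholds

One way to apply the values from "Recommended thresholds" is to run the model at the default `0.5` and then post-filter by label. The sketch below is only an illustration: the helper `filter_by_label_threshold` and the `LABEL_THRESHOLDS` table are hypothetical names, not part of GLiNER, and the sample entities are hand-written stand-ins with made-up scores so the snippet runs without downloading the model.

```python
# Hypothetical post-filter; names and the threshold table are illustrative only.
DEFAULT_THRESHOLD = 0.5
LABEL_THRESHOLDS = {"generic_username": 0.65}  # stricter cut for the catch-all label


def filter_by_label_threshold(entities, label_thresholds=LABEL_THRESHOLDS,
                              default=DEFAULT_THRESHOLD):
    """Keep entities whose score meets the threshold configured for their label."""
    return [
        ent for ent in entities
        if ent["score"] >= label_thresholds.get(ent["label"], default)
    ]


# Stand-ins shaped like the dicts used in the Usage section
# (keys "text", "label", "score"); the scores are invented for this example.
sample = [
    {"text": "foodgrammer", "label": "instagram_username", "score": 0.91},
    {"text": "cashtag", "label": "generic_username", "score": 0.58},
    {"text": "gamer420", "label": "generic_username", "score": 0.70},
]

kept = filter_by_label_threshold(sample)
print([ent["text"] for ent in kept])  # ['foodgrammer', 'gamer420']
```

With the real model, the same filter would take the output of `model.predict_entities(text, labels, threshold=0.5)` in place of `sample`. Filtering after a single pass keeps the other labels at the default threshold while tightening only `generic_username`.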