---
license: apache-2.0
language:
- en
- multilingual
pipeline_tag: token-classification
tags:
- gliner
- ner
- token-classification
- social-media
- username-extraction
library_name: gliner
base_model: urchade/gliner_small-v2.1
---
# HandleAtlas-166m
A fine-tuned [GLiNER small v2.1](https://huggingface.co/urchade/gliner_small-v2.1) (~166M params)
for extracting social-media handles from short bios. It was trained on Twitter/X bios, but the
handle patterns carry over to other platforms.
## Labels
- `instagram_username`
- `snapchat_username`
- `youtube_username`
- `twitch_username`
- `tiktok_username`
- `discord_username`
- `x_username`
- `cashapp_username`
- `onlyfans_username`
- `tumblr_username`
- `github_username`
- `kofi_username`
- `patreon_username`
- `roblox_username`
- `generic_username`

`generic_username` is a fallback for handle-shaped strings without a clear platform
indicator.
## Usage
```python
from gliner import GLiNER

model = GLiNER.from_pretrained("LumeData/HandleAtlas-166m")
labels = ['instagram_username', 'snapchat_username', 'youtube_username', 'twitch_username', 'tiktok_username', 'discord_username', 'x_username', 'cashapp_username', 'onlyfans_username', 'tumblr_username', 'github_username', 'kofi_username', 'patreon_username', 'roblox_username', 'generic_username']
text = "Insta: foodgrammer | Snap: chefchef | DC: gamer420 | $cashtag"

# Print each detected handle with its predicted label and confidence score.
for ent in model.predict_entities(text, labels, threshold=0.5):
    print(f"{ent['text']!r} -> {ent['label']} ({ent['score']:.2f})")
```
## Training
- Base: `urchade/gliner_small-v2.1`
- Real data: ~1,000 hand-labeled Twitter bios
- Synthetic data: ~2,200 generated bios (template-based, plus IG→Discord text rewriting
  for the `discord_username` class)
- Case augmentation: each training record is emitted in both its original and fully lowercased
  form, so the model is robust to the casing of platform prefixes (`Dc:`/`dc:`/`DC:` etc.)
- 5 epochs, batch size 4 × gradient accumulation 2 (effective batch size 8), learning rate 5e-6 (encoder) / 1e-5 (heads), cosine schedule
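
The case-augmentation step can be sketched as follows. This is a minimal illustration, not the actual training script: it assumes records in the GLiNER-style `{"tokenized_text": [...], "ner": [[start, end, label], ...]}` layout, and `augment_with_lowercase` is a hypothetical helper name.

```python
from copy import deepcopy


def augment_with_lowercase(records):
    """Return each record followed by a fully lowercased copy.

    Assumes GLiNER-style records (an assumption, not confirmed by the card):
    {"tokenized_text": [...], "ner": [[start, end, label], ...]}.
    Lowercasing leaves the token count unchanged, so span indices stay valid.
    """
    augmented = []
    for record in records:
        augmented.append(record)
        lowered = deepcopy(record)
        lowered["tokenized_text"] = [tok.lower() for tok in record["tokenized_text"]]
        augmented.append(lowered)
    return augmented


# Hypothetical training record for illustration.
example = {
    "tokenized_text": ["DC", ":", "gamer420"],
    "ner": [[2, 2, "discord_username"]],
}
augmented = augment_with_lowercase([example])
```

Because the spans are token indices, the lowercased copy can reuse the original annotations unchanged.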
## Eval
On a 100-record held-out slice of real Twitter bios:

| metric    | value |
|-----------|-------|
| precision | 0.849 |
| recall    | 0.929 |
| F1        | 0.887 |

Strong per-label F1 on instagram (0.95), youtube (1.00), tiktok (1.00), twitch (1.00),
onlyfans (1.00), generic (0.88), cashapp (0.86), snapchat (0.80).
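
As a quick consistency check, the overall F1 reported above matches the harmonic mean of the reported precision and recall. The helper name `f1_score` is only illustrative:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


overall_f1 = f1_score(0.849, 0.929)
print(f"{overall_f1:.3f}")  # prints 0.887
```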
## Recommended thresholds
- Default: `threshold=0.5`
- For `generic_username`, bump to `0.65` to reduce false positives; it's the
  catch-all label and over-fires at the default threshold.
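
One way to apply a stricter threshold to a single label is to call `predict_entities` once at the lowest threshold (`0.5`) and filter the results afterwards. The sketch below assumes entity dicts with `label` and `score` keys, as in the usage example; `filter_by_label_threshold` is a hypothetical helper and the predictions are made up for illustration.

```python
def filter_by_label_threshold(entities, per_label, default=0.5):
    """Keep entities whose score meets the threshold for their label."""
    return [
        ent for ent in entities
        if ent["score"] >= per_label.get(ent["label"], default)
    ]


# Made-up predictions in the shape returned by `predict_entities`.
predictions = [
    {"text": "foodgrammer", "label": "instagram_username", "score": 0.91},
    {"text": "vibes", "label": "generic_username", "score": 0.58},
    {"text": "gamer420", "label": "generic_username", "score": 0.72},
]
kept = filter_by_label_threshold(predictions, {"generic_username": 0.65})
```

Filtering after a single prediction pass avoids running the model once per threshold.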
## Limitations
- Trained on patterns common in Twitter/X bios; performance on other domains
  (LinkedIn-style, Reddit, forum sigs) will be lower.
- `discord_invite` is not a predicted label; invite codes will be classified as
  `discord_username` or skipped.
- In multi-line bios with many handles, labels for adjacent URLs can occasionally be
  confused (e.g., `patreon.com/x | github.com/x` chains).
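
If invite codes matter downstream, one possible workaround is to relabel them after prediction. The sketch below is a heuristic, not part of the model: it assumes entities carry character offsets under `start` and `end`, that invites look like `discord.gg/<code>` or `discord.com/invite/<code>`, and it uses `discord_invite` as a purely post-hoc label. The helper name is hypothetical.

```python
import re

# Assumed invite URL shapes; extend this if your data uses other forms.
INVITE_PATTERN = re.compile(
    r"(?:discord\.gg|discord(?:app)?\.com/invite)/[A-Za-z0-9-]+$",
    re.IGNORECASE,
)


def relabel_discord_invites(text, entities):
    """Relabel discord_username spans that end an invite URL as discord_invite."""
    result = []
    for ent in entities:
        # Look at the text up to the end of the span: an invite URL ends there.
        if ent["label"] == "discord_username" and INVITE_PATTERN.search(text[: ent["end"]]):
            ent = {**ent, "label": "discord_invite"}
        result.append(ent)
    return result


# Made-up example: one invite code and one real handle.
bio = "join discord.gg/abc123 | dc: gamer420"
code_start = bio.index("abc123")
name_start = bio.index("gamer420")
entities = [
    {"text": "abc123", "label": "discord_username", "start": code_start, "end": code_start + 6, "score": 0.70},
    {"text": "gamer420", "label": "discord_username", "start": name_start, "end": name_start + 8, "score": 0.88},
]
relabeled = relabel_discord_invites(bio, entities)
```

Invite codes that the model skips entirely are not recovered by this step; it only corrects spans that were predicted as `discord_username`.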