---
license: mit
tags:
- text-classification
- crypto
- technology
- twitter
- x
- fasttext-distillation
---
This is a classification model that scores tweets and profiles by the probability that they are tech- or crypto-related. It was created for a job that did not work out. It is a TF-IDF model distilled from a transformer model that I also built; I may upload that one later.

# Techpto Classifier

This repository contains a lightweight production classifier for detecting whether X/Twitter posts and profiles are crypto-related, tech-related, both, or neither.

## Files

- `text_classifier.json`: Rust-compatible hashed logistic-regression classifier.
- `model_config.json`: labels, expected inputs, and recommended thresholds.
- `distill_metrics.json`: proxy evaluation metrics from distillation.
- `recommended_thresholds_distillation.json`: thresholds tuned against the V7 fastText teacher.
- `full_run_manifest.json`: counts and thresholds from the large full-corpus run.

## Recommended Thresholds

The high-precision full-corpus run used:

```json
{
  "post_crypto": 0.85,
  "post_tech": 0.90,
  "profile_crypto": 0.90,
  "profile_tech": 0.99
}
```

The original distillation-tuned thresholds were:

```json
{
  "post_crypto": 0.58,
  "post_tech": 0.44,
  "profile_crypto": 0.34,
  "profile_tech": 0.38
}
```

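
As a minimal sketch of how either threshold set might be applied, the snippet below compares per-label scores with the matching thresholds. The `matched_labels` helper and the shape of the `scores` dictionary are assumptions made for illustration; only the threshold keys and values come from this README, and whether the real pipeline uses `>=` or `>` is not stated here.

```python
# Threshold values copied from the two sets listed above.
HIGH_PRECISION = {
    "post_crypto": 0.85,
    "post_tech": 0.90,
    "profile_crypto": 0.90,
    "profile_tech": 0.99,
}
DISTILLATION_TUNED = {
    "post_crypto": 0.58,
    "post_tech": 0.44,
    "profile_crypto": 0.34,
    "profile_tech": 0.38,
}


def matched_labels(scores, thresholds):
    """Return the labels whose score meets its threshold (hypothetical helper).

    `scores` maps a label such as "post_crypto" to a probability in [0, 1].
    Labels without a configured threshold are ignored.
    """
    return sorted(
        label
        for label, score in scores.items()
        if label in thresholds and score >= thresholds[label]
    )


# Illustrative scores for a single post, not real model output.
example_scores = {"post_crypto": 0.70, "post_tech": 0.95}
print(matched_labels(example_scores, HIGH_PRECISION))
print(matched_labels(example_scores, DISTILLATION_TUNED))
```

With these made-up scores, the post counts as tech under both sets but as crypto only under the distillation-tuned thresholds.
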

## Full-Corpus Run

Using the high-precision thresholds:

- Posts scanned: `928,484,069`
- Post matches: `7,728,133`
- Profiles scanned: `2,667,815,773`
- Profile matches: `7,915,096`

One corrupt post shard was skipped and is listed in `full_run_manifest.json`.

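
For a sense of how selective the high-precision thresholds are, the match rates can be worked out from the counts above. This is plain arithmetic on the reported numbers; the `match_rate` helper is only illustrative.

```python
def match_rate(matches, scanned):
    """Fraction of scanned items that matched (illustrative helper)."""
    return matches / scanned


# Counts as reported for the full-corpus run.
post_rate = match_rate(7_728_133, 928_484_069)
profile_rate = match_rate(7_915_096, 2_667_815_773)

print(f"posts:    {post_rate:.2%}")
print(f"profiles: {profile_rate:.2%}")
```

That works out to roughly 0.83% of posts and 0.30% of profiles.
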

## Notes

This is not a standard Transformers checkpoint. It is a compact hashed-feature linear classifier intended for very high-throughput local scanning. Metrics in `distill_metrics.json` are proxy metrics against teacher/weak labels rather than a final human-labeled benchmark.

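
To make "hashed-feature linear classifier" concrete, here is a minimal sketch of the general technique: tokens are hashed into a fixed number of buckets, the bucket weights are summed with a bias, and a sigmoid turns the sum into a probability. Everything in it is an assumption for illustration, including the tokenizer, the CRC32 hash, the bucket count, and the toy weights. It also uses raw token counts rather than TF-IDF weighting to stay short. The actual feature scheme and the layout of `text_classifier.json` are not described here.

```python
import math
import re
import zlib

NUM_BUCKETS = 1 << 18  # assumed size of the hashed feature space


def bucket(token):
    """Map a token to a bucket index with a stable hash (CRC32, an assumption)."""
    return zlib.crc32(token.encode("utf-8")) % NUM_BUCKETS


def featurize(text):
    """Lowercase, split into word-like tokens, and count tokens per bucket."""
    counts = {}
    for token in re.findall(r"[a-z0-9_#@$]+", text.lower()):
        idx = bucket(token)
        counts[idx] = counts.get(idx, 0) + 1
    return counts


def predict_proba(text, weights, bias):
    """Logistic regression over hashed features: sigmoid(w . x + b)."""
    z = bias + sum(weights.get(i, 0.0) * n for i, n in featurize(text).items())
    return 1.0 / (1.0 + math.exp(-z))


# Toy weights for a single hypothetical label; not taken from the real model.
toy_weights = {bucket("bitcoin"): 2.5, bucket("wallet"): 1.0}
toy_bias = -2.0

print(predict_proba("New Bitcoin wallet released", toy_weights, toy_bias))
print(predict_proba("Nice weather today", toy_weights, toy_bias))
```

Scoring in this style is a sparse dot product followed by a sigmoid, so it needs no vocabulary lookup table, which fits the high-throughput goal described above.
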