pompompur-in/techpto-classifier

	---
	license: mit
	tags:
	- text-classification
	- crypto
	- technology
	- twitter
	- x
	- fasttext-distillation
	---
	This is a classification model that scores tweets and profiles by the probability that they are tech- or crypto-related. It was created for a job that fell short. It is a TF-IDF model distilled from a transformer model that I also made; I may upload that one soon.

	# Techpto Classifier

	This repository contains a lightweight production classifier for detecting whether X/Twitter posts and profiles are crypto-related, tech-related, both, or neither.

	## Files

	- `text_classifier.json`: Rust-compatible hashed logistic-regression classifier.
	- `model_config.json`: labels, expected inputs, and recommended thresholds.
	- `distill_metrics.json`: proxy evaluation metrics from distillation.
	- `recommended_thresholds_distillation.json`: thresholds tuned against the V7 fastText teacher.
	- `full_run_manifest.json`: counts and thresholds from the large full-corpus run.

	## Recommended Thresholds

	The high-precision full-corpus run used:

	```json
	{
	"post_crypto": 0.85,
	"post_tech": 0.90,
	"profile_crypto": 0.90,
	"profile_tech": 0.99
	}
	```

	The original distillation-tuned thresholds were:

	```json
	{
	"post_crypto": 0.58,
	"post_tech": 0.44,
	"profile_crypto": 0.34,
	"profile_tech": 0.38
	}
	```
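
To show how the two threshold sets above change the outcome for the same scores, here is a minimal Python sketch. The threshold values are copied from the JSON blocks above. The `label` function, the shape of the `scores` dict, and the `>=` comparison are assumptions for illustration, not the repository's actual code.

```python
# Thresholds copied from the two JSON blocks above.
HIGH_PRECISION = {
    "post_crypto": 0.85,
    "post_tech": 0.90,
    "profile_crypto": 0.90,
    "profile_tech": 0.99,
}
DISTILLATION_TUNED = {
    "post_crypto": 0.58,
    "post_tech": 0.44,
    "profile_crypto": 0.34,
    "profile_tech": 0.38,
}


def label(kind, scores, thresholds):
    """Map per-class probabilities to 'crypto', 'tech', 'both', or 'neither'.

    `kind` is "post" or "profile". `scores` is a hypothetical dict such as
    {"crypto": 0.91, "tech": 0.12}. Using >= at the threshold is an assumption.
    """
    is_crypto = scores["crypto"] >= thresholds[f"{kind}_crypto"]
    is_tech = scores["tech"] >= thresholds[f"{kind}_tech"]
    if is_crypto and is_tech:
        return "both"
    if is_crypto:
        return "crypto"
    if is_tech:
        return "tech"
    return "neither"


# Made-up scores for one post, labeled under each threshold set.
example = {"crypto": 0.70, "tech": 0.50}
print(label("post", example, HIGH_PRECISION))
print(label("post", example, DISTILLATION_TUNED))
```

With these made-up scores, the post is rejected by the high-precision thresholds but counts as both crypto and tech under the distillation-tuned ones. This is the precision/recall trade-off between the two sets.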

	## Full-Corpus Run

	Using the high-precision thresholds:

	- Posts scanned: `928,484,069`
	- Post matches: `7,728,133`
	- Profiles scanned: `2,667,815,773`
	- Profile matches: `7,915,096`

	One corrupt post shard was skipped and is listed in `full_run_manifest.json`.
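
The counts above work out to match rates of roughly 0.8% for posts and 0.3% for profiles. The short Python snippet below reproduces that arithmetic. The variable names are illustrative and the numbers are copied from the list above.

```python
# Counts copied from the full-corpus run listed above.
posts_scanned = 928_484_069
post_matches = 7_728_133
profiles_scanned = 2_667_815_773
profile_matches = 7_915_096

# Fraction of scanned items that passed the high-precision thresholds.
post_rate = post_matches / posts_scanned
profile_rate = profile_matches / profiles_scanned

print(f"post match rate: {post_rate:.3%}")
print(f"profile match rate: {profile_rate:.3%}")
```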

	## Notes

	This is not a standard Transformers checkpoint. It is a compact hashed-feature linear classifier intended for very high-throughput local scanning. Metrics in `distill_metrics.json` are proxy metrics against teacher/weak labels rather than a final human-labeled benchmark.
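
For readers unfamiliar with hashed-feature linear classifiers, the following Python sketch shows the general technique: features are hashed into a fixed number of buckets, and a logistic function is applied to the sum of the matching weights. Everything here is an assumption for illustration, including the tokenization, the CRC32 hash, the bucket count, and the toy weights. It does not read `text_classifier.json`, whose schema is not described in this README, and it may not match the feature scheme the real model uses.

```python
import math
import zlib

NUM_BUCKETS = 2 ** 18  # hypothetical table size, not taken from the model


def features(text):
    """Lowercased whitespace tokens plus adjacent-token bigrams (assumed scheme)."""
    tokens = text.lower().split()
    return tokens + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def bucket(feature):
    """Hash a feature string into one of NUM_BUCKETS slots (the hashing trick)."""
    return zlib.crc32(feature.encode("utf-8")) % NUM_BUCKETS


def score(text, weights, bias):
    """Return sigmoid(bias + sum of the weights of the hashed features)."""
    z = bias + sum(weights.get(bucket(f), 0.0) for f in features(text))
    return 1.0 / (1.0 + math.exp(-z))


# Toy weights for illustration only; a real model would load learned values.
toy_weights = {bucket("bitcoin"): 2.0, bucket("airdrop"): 1.5}
print(score("New Bitcoin airdrop today", toy_weights, bias=-1.0))
```

Because scoring is a handful of hash computations and table lookups per item, with no tokenizer vocabulary or matrix multiplications, this style of model suits the high-throughput scanning described above.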