A lightweight Naive Bayes classifier
designed to detect whether a given text should be filtered out from an academic paper
before being used for language-model pretraining or RAG (Retrieval-Augmented Generation).
WITTR is trained on text from academic and research-style corpora.
It distinguishes between two main categories:
The goal is to provide an automatic corpus-cleaning step
that keeps only the informative text suitable for model training.
| File | Description |
|---|---|
| `wittr_naive.pkl` | Trained Naive Bayes classifier |
| `wittr_naive_vectorizer.pkl` | TF-IDF vectorizer (must be used with the model) |
| `wittr_naive.py` | Simple Python script |
| `README.md` | This documentation |
from huggingface_hub import hf_hub_download
import joblib
repo = "Lucanix/wittr"
clf = joblib.load(hf_hub_download(repo, "wittr_naive.pkl"))
vectorizer = joblib.load(hf_hub_download(repo, "wittr_naive_vectorizer.pkl"))
text = ["18 Bales G. S. & and Chrzan, D. C. Dynamics of irreversible island growth during submonolayer epitaxy. *Phys. Rev. B* **50**, 6057–6067 (1994)."]
X = vectorizer.transform(text)
print(clf.predict(X)) # → [1]
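For the corpus-cleaning use case, the same two calls can be wrapped in a small helper that keeps only the passages the classifier does not flag. This is a minimal sketch, not part of the repository: `filter_passages` and its `drop_label` parameter are hypothetical names, and treating label `1` as "filter out" is an assumption based only on the reference-entry example above.

```python
def filter_passages(passages, clf, vectorizer, drop_label=1):
    """Return the passages whose predicted label differs from drop_label.

    clf and vectorizer are the objects loaded in the snippet above.
    drop_label=1 is an assumption; check the label meaning before relying on it.
    """
    if not passages:
        return []
    # Vectorize all passages in one batch, then predict once.
    X = vectorizer.transform(passages)
    labels = clf.predict(X)
    # Keep a passage only when its label is not the one marked for removal.
    return [p for p, label in zip(passages, labels) if label != drop_label]
```

Assuming `paragraphs` is a list of strings extracted from one paper, `kept = filter_passages(paragraphs, clf, vectorizer)` would return the subset to pass on to pretraining or retrieval indexing. Predicting in a single batch avoids calling `transform` once per passage.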