WITTR or Wait, Is That The References?

A lightweight Naive Bayes classifier
that detects whether a passage from an academic paper should be filtered out
before the text is used for language-model pretraining or Retrieval-Augmented Generation (RAG).

🧠 Concept

WITTR is trained on text from academic and research-style corpora.
It distinguishes between two main categories:

✅ 0 – meaningful content such as main academic paragraphs, analysis, or discussion
❌ 1 – metadata or non-content text such as author names, URLs, DOIs, references, publication years, or institutional names

The goal is to provide an automatic corpus-cleaning step
that keeps only the informative text suitable for model training.
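
As a rough illustration of that cleaning step, the sketch below keeps only the segments labeled 0. The names `keep_content` and `toy_predict` are hypothetical and not part of this repository, and `toy_predict` is a stand-in heuristic for demonstration, not the WITTR model.

```python
from typing import Callable, List, Sequence


def keep_content(
    segments: Sequence[str],
    predict: Callable[[Sequence[str]], Sequence[int]],
) -> List[str]:
    """Return only the segments labeled 0 (meaningful content)."""
    labels = predict(segments)
    return [seg for seg, label in zip(segments, labels) if label == 0]


def toy_predict(segments: Sequence[str]) -> List[int]:
    # Stand-in heuristic for illustration only, NOT the WITTR model:
    # treat anything containing a DOI or URL marker as metadata (label 1).
    return [1 if ("doi" in s.lower() or "http" in s.lower()) else 0 for s in segments]


segments = [
    "We analyze island growth during submonolayer epitaxy.",
    "doi:10.0000/placeholder",  # placeholder identifier, not a real DOI
]
cleaned = keep_content(segments, toy_predict)
print(cleaned)
```

With the real model, `predict` could be something like `lambda segs: clf.predict(vectorizer.transform(list(segs)))`, using the `clf` and `vectorizer` objects loaded in the Usage Example.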

⚙️ Model Details

Framework: scikit-learn
Architecture: Multinomial Naive Bayes
Vectorization: TF–IDF
Language: English academic text
Accuracy: ≈ 0.9777
Intended Use: academic text preprocessing, corpus filtering before LLM or RAG pipelines
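
For readers unfamiliar with this combination, here is a minimal sketch of how a TF–IDF vectorizer and a Multinomial Naive Bayes classifier fit together in scikit-learn. The training texts and labels are invented for illustration; the actual WITTR training data, preprocessing, and hyperparameters are not described here, so default settings are assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration (not the WITTR dataset).
texts = [
    "We discuss how the growth rate depends on temperature.",
    "The results suggest a transition between two regimes.",
    "Smith J. and Lee K. Journal of Examples 12, 34-56 (2001).",
    "Department of Physics, Example University, doi placeholder",
]
labels = [0, 0, 1, 1]  # 0 = meaningful content, 1 = metadata / non-content

# Fit the vectorizer and the classifier separately, mirroring the
# two-artifact layout (classifier + vectorizer) of this repository.
vectorizer = TfidfVectorizer()  # default settings: an assumption
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()  # default smoothing: an assumption
clf.fit(X, labels)

# New text must go through the same fitted vectorizer before prediction.
new_text = ["Lee K. Journal of Examples 7, 1-9 (1999)."]
pred = clf.predict(vectorizer.transform(new_text))
print(pred)
```

The key design point is that the classifier only understands the feature space of the vectorizer it was trained with, which is why the two files must be used together.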

📦 Files

File	Description
`wittr_naive.pkl`	Trained Naive Bayes classifier
`wittr_naive_vectorizer.pkl`	TF–IDF vectorizer (must be used with the model)
`wittr_naive.py`	Simple Python script
`README.md`	This documentation

🚀 Usage Example

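from huggingface_hub import hf_hub_download
import joblib

repo = "Lucanix/wittr"

# Download and load the classifier and its matching TF–IDF vectorizer.
# The vectorizer must be the one shipped with the model.
clf = joblib.load(hf_hub_download(repo, "wittr_naive.pkl"))
vectorizer = joblib.load(hf_hub_download(repo, "wittr_naive_vectorizer.pkl"))

# Input is a list of strings; this example is a reference-list entry.
text = ["18 Bales G. S. & and Chrzan, D. C. Dynamics of irreversible island growth during submonolayer epitaxy. *Phys. Rev. B* **50**, 6057–6067 (1994)."]
X = vectorizer.transform(text)  # convert text to TF–IDF features
print(clf.predict(X))  # → [1]  (1 = metadata / non-content)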
