license: apache-2.0
language:
  - en
metrics:
  - name: exact_match
    value: 0.9777
    verified: false
pipeline_tag: text-classification
library_name: sklearn
datasets:
  - Lucanix/meta-paper-classify

WITTR or Wait, Is That The References?

A lightweight Naive Bayes classifier
that detects whether a piece of text from an academic paper should be filtered out
before the paper is used for language-model pretraining or RAG (Retrieval-Augmented Generation).


🧠 Concept

WITTR is trained on text from academic and research-style corpora.
It distinguishes between two main categories:

  • ✅ 0 – meaningful content such as main academic paragraphs, analysis, or discussion
  • ❌ 1 – metadata or non-content text such as author names, URLs, DOIs, references, publication years, or institutional names

The goal is to provide an automatic corpus-cleaning step
that keeps only the informative text suitable for model training.
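
The cleaning step described above can be sketched as a small filter. This is a minimal illustration, not the script shipped in the repository: the function name `keep_content` is hypothetical, and it assumes `vectorizer` and `clf` are the two artifacts loaded as in the usage example (any objects with `transform` and `predict` methods would work).

```python
def keep_content(paragraphs, vectorizer, clf):
    """Return only the paragraphs labelled 0 (meaningful content).

    `keep_content` is a hypothetical helper. `vectorizer` and `clf` are
    assumed to be the loaded TF-IDF vectorizer and Naive Bayes model.
    """
    if not paragraphs:
        # Nothing to classify; avoid calling predict on zero samples.
        return []
    features = vectorizer.transform(paragraphs)
    labels = clf.predict(features)
    # Label 1 marks metadata / non-content text, so keep only label 0.
    return [p for p, label in zip(paragraphs, labels) if int(label) == 0]
```

Splitting a paper into paragraphs first (for example on blank lines) and passing the resulting list to `keep_content` would yield the text to keep for training or retrieval.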


โš™๏ธ Model Details

  • Framework: scikit-learn
  • Architecture: Multinomial Naive Bayes
  • Vectorization: TF–IDF
  • Language: English academic text
  • Accuracy: ≈ 0.9777 (exact match, self-reported and not verified)
  • Intended Use: academic text preprocessing, corpus filtering before LLM or RAG pipelines
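
For readers unfamiliar with this combination, the following is a rough sketch of how a TF–IDF vectorizer and a Multinomial Naive Bayes classifier fit together in scikit-learn. The four training sentences are made up for illustration and are not the real training data, and the default `TfidfVectorizer()` settings are an assumption, since the actual training configuration is not documented here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy, made-up examples (NOT the real dataset): 0 = content, 1 = metadata.
texts = [
    "We analyse island growth during submonolayer epitaxy.",
    "The results suggest that nucleation is irreversible at low temperature.",
    "J. Smith, University of Example, j.smith@example.org",
    "Phys. Rev. B 50, 6057 (1994). doi:10.0000/example",
]
labels = [0, 0, 1, 1]

# Default settings are an assumption; the real hyperparameters are unknown.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)   # learn the vocabulary and IDF weights

clf = MultinomialNB()
clf.fit(X, labels)

# New text must go through the SAME fitted vectorizer before prediction.
new_X = vectorizer.transform(["Our analysis shows growth is irreversible."])
pred = clf.predict(new_X)
print(pred)
```

The vectorizer and the classifier are fitted together, which is why the two are stored as separate files that have to be loaded as a pair.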

📦 Files

| File | Description |
|---|---|
| wittr_naive.pkl | Trained Naive Bayes classifier |
| wittr_naive_vectorizer.pkl | TF–IDF vectorizer (must be used with the model) |
| wittr_naive.py | Simple Python script |
| README.md | This documentation |

🚀 Usage Example

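from huggingface_hub import hf_hub_download
import joblib

repo = "Lucanix/wittr"

# Download both artifacts from the Hub and load them; the vectorizer and the
# classifier were fitted together and must be used as a pair.
clf = joblib.load(hf_hub_download(repo, "wittr_naive.pkl"))
vectorizer = joblib.load(hf_hub_download(repo, "wittr_naive_vectorizer.pkl"))

# A reference-list entry, i.e. text that should be filtered out.
text = ["18 Bales G. S. & and Chrzan, D. C. Dynamics of irreversible island growth during submonolayer epitaxy. *Phys. Rev. B* **50**, 6057–6067 (1994)."]
X = vectorizer.transform(text)  # turn raw text into TF-IDF features
print(clf.predict(X))  # → [1]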
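
`predict` returns hard 0/1 labels. scikit-learn's `MultinomialNB` also exposes `predict_proba`, so a softer score is possible when a custom cut-off is wanted. The helper below is a hedged sketch: the name `metadata_probability` is hypothetical, and it assumes the loaded classifier was trained with the integer labels 0 and 1 described above.

```python
def metadata_probability(texts, vectorizer, clf):
    """Return, for each text, the estimated probability of label 1 (metadata).

    Hypothetical helper; assumes `clf.classes_` contains the integer label 1.
    """
    features = vectorizer.transform(texts)
    # predict_proba columns follow the order of clf.classes_.
    column = list(clf.classes_).index(1)
    return [float(row[column]) for row in clf.predict_proba(features)]
```

With the `clf` and `vectorizer` loaded above, `metadata_probability(text, vectorizer, clf)` would give one score per input, and a stricter or looser threshold than the default decision could then be applied depending on how aggressive the filtering should be.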