---
license: apache-2.0
language:
- en
metrics:
- name: exact_match
  value: 0.9777
  verified: false
pipeline_tag: text-classification
library_name: sklearn
datasets:
- Lucanix/meta-paper-classify
---

# WITTR, or Wait, Is That The References?

A lightweight Naive Bayes classifier that detects whether a passage of text should be **filtered out** of an academic paper before the paper is used for **language-model pretraining** or **RAG (Retrieval-Augmented Generation)**.

---

## 🧠 Concept

WITTR is trained on text from academic and research-style corpora.
It distinguishes between two categories:

- ✅ **0** – meaningful content such as main academic paragraphs, analysis, or discussion
- ❌ **1** – metadata or non-content text such as author names, URLs, DOIs, references, publication years, or institutional names

The goal is to provide an **automatic corpus-cleaning step** that keeps only the informative text suitable for model training.

---

## ⚙️ Model Details

- **Framework:** scikit-learn
- **Architecture:** Multinomial Naive Bayes
- **Vectorization:** TF–IDF
- **Language:** English academic text
- **Accuracy:** ≈ 0.9777 (self-reported exact match, not verified)
- **Intended Use:** academic text preprocessing, corpus filtering before LLM or RAG pipelines

---

## 📦 Files

| File | Description |
|------|-------------|
| `wittr_naive.pkl` | Trained Naive Bayes classifier |
| `wittr_naive_vectorizer.pkl` | TF–IDF vectorizer (must be used with the model) |
| `wittr_naive.py` | Simple Python script |
| `README.md` | This documentation |

---

## 🚀 Usage Example

```python
from huggingface_hub import hf_hub_download
import joblib

repo = "Lucanix/wittr"

# Download and load the classifier and its matching TF-IDF vectorizer.
clf = joblib.load(hf_hub_download(repo, "wittr_naive.pkl"))
vectorizer = joblib.load(hf_hub_download(repo, "wittr_naive_vectorizer.pkl"))

# A bibliography entry, which the model should flag as non-content (label 1).
text = ["18 Bales G. S. & and Chrzan, D. C. Dynamics of irreversible island growth during submonolayer epitaxy. *Phys. Rev. B* **50**, 6057–6067 (1994)."]

X = vectorizer.transform(text)  # always transform with the bundled vectorizer
print(clf.predict(X))  # → [1]
```
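
The corpus-cleaning step described under Concept could be sketched roughly as follows. `filter_content` is a hypothetical helper, not something shipped in this repository. It assumes only that `clf` and `vectorizer` are the two objects loaded in the usage example, exposing `predict` and `transform`, and that label `0` means content to keep.

```python
from typing import Any, List, Sequence


def filter_content(paragraphs: Sequence[str], clf: Any, vectorizer: Any) -> List[str]:
    """Return only the paragraphs the classifier labels 0 (meaningful content).

    Hypothetical helper: `clf` and `vectorizer` are assumed to be the objects
    loaded from `wittr_naive.pkl` and `wittr_naive_vectorizer.pkl`.
    """
    # Drop empty or whitespace-only strings before vectorizing.
    texts = [p for p in paragraphs if p.strip()]
    if not texts:
        return []

    # Vectorize the whole batch once, then predict one label per paragraph.
    features = vectorizer.transform(texts)
    labels = clf.predict(features)

    # Keep label 0 (content); discard label 1 (references, metadata, ...).
    return [p for p, label in zip(texts, labels) if int(label) == 0]
```

With the objects from the usage example in scope, `cleaned = filter_content(paragraphs, clf, vectorizer)` would return the paragraphs to pass on to a pretraining or RAG pipeline. How a paper is split into paragraphs is left to the caller, since this card does not specify the granularity the model was trained on.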