"""
PubGuard — Scientific Publication Gatekeeper
=============================================
Multi-head document classifier for the PubVerse pipeline.
Determines whether extracted PDF text represents a genuine
scientific publication vs. junk, and flags AI-generated or
offensive content.

Classification heads:
    1. doc_type  – scientific_paper | poster | abstract_only | junk
    2. ai_detect – human | ai_generated
    3. toxicity  – clean | toxic

Architecture mirrors openalex-topic-classifier:
    model2vec (StaticModel) → L2-normalised embeddings → per-head
    linear classifiers (sklearn / small torch heads) stored as
    numpy weight matrices for zero-dependency inference.

Usage:
    from pubguard import PubGuard

    guard = PubGuard()
    guard.initialize()
    verdict = guard.screen(text)
    # verdict = {
    #     'doc_type': {'label': 'scientific_paper', 'score': 0.94},
    #     'ai_generated': {'label': 'human', 'score': 0.87},
    #     'toxicity': {'label': 'clean', 'score': 0.99},
    #     'pass': True
    # }
"""
from .classifier import PubGuard
from .config import PubGuardConfig
__version__ = "0.1.0"
__all__ = [
"PubGuard",
"PubGuardConfig",
]
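
# --- Illustrative sketch (not part of the PubGuard API) ---------------------
# The module docstring describes inference as: embedding -> L2 normalisation ->
# per-head linear classifier stored as numpy weights. The helper below is a
# minimal sketch of that idea under stated assumptions: the name
# `_sketch_head_predict`, the weight/bias shapes, and the softmax scoring are
# all hypothetical and may differ from what `.classifier` actually does.
import numpy as np


def _sketch_head_predict(embedding, weights, bias, labels):
    """Score one embedding with one linear head (hypothetical helper).

    embedding: shape (dim,); weights: shape (dim, n_labels); bias: (n_labels,).
    Returns {'label': str, 'score': float}, shaped like a verdict entry.
    """
    vec = np.asarray(embedding, dtype=np.float64)
    norm = np.linalg.norm(vec)
    if norm > 0:
        vec = vec / norm  # L2-normalise; an all-zero vector is left unchanged
    logits = vec @ np.asarray(weights, dtype=np.float64)
    logits = logits + np.asarray(bias, dtype=np.float64)
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    probs = exp / exp.sum()
    best = int(np.argmax(probs))
    return {"label": labels[best], "score": float(probs[best])}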