"""
PubGuard — Scientific Publication Gatekeeper
=============================================
Multi-head document classifier for the PubVerse pipeline.
Determines whether extracted PDF text represents a genuine
scientific publication vs. junk, and flags AI-generated or
offensive content.

Classification heads:
    1. doc_type  – scientific_paper | poster | abstract_only | junk
    2. ai_detect – human | ai_generated
    3. toxicity  – clean | toxic
Architecture mirrors openalex-topic-classifier:
model2vec (StaticModel) → L2-normalised embeddings → per-head
linear classifiers (sklearn / small torch heads) stored as
numpy weight matrices for zero-dependency inference.
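
Illustrative sketch only, not the PubGuard implementation: one way a
single head could turn an embedding into a label and score. The function
names, the softmax step, the label order, and the weight layout (one row
per label) are assumptions made for this example, and plain Python lists
stand in for the numpy weight matrices.

```python
import math

# Hypothetical label order for the doc_type head.
DOC_TYPE_LABELS = ["scientific_paper", "poster", "abstract_only", "junk"]


def l2_normalise(vec):
    # Scale the vector to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def score_head(embedding, weights, bias, labels):
    # Linear layer on the unit-length embedding: one logit per label.
    unit = l2_normalise(embedding)
    logits = [
        sum(w * x for w, x in zip(row, unit)) + b
        for row, b in zip(weights, bias)
    ]
    # Numerically stable softmax (assumed here; the source only says
    # "linear classifiers").
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}
```

Each head would carry its own weights and bias, so the three heads can
share one embedding of the input text.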

Usage:
    from pubguard import PubGuard

    guard = PubGuard()
    guard.initialize()
    verdict = guard.screen(text)
    # verdict = {
    #     'doc_type': {'label': 'scientific_paper', 'score': 0.94},
    #     'ai_generated': {'label': 'human', 'score': 0.87},
    #     'toxicity': {'label': 'clean', 'score': 0.99},
    #     'pass': True
    # }
"""
from .classifier import PubGuard
from .config import PubGuardConfig
from .errors import (
    PubVerseError,
    build_pubguard_error,
    empty_input_error,
    unreadable_pdf_error,
    models_missing_error,
    gate_bypassed,
    format_error_line,
    PIPELINE_ERRORS,
)
__version__ = "0.1.0"
__all__ = [
    "PubGuard",
    "PubGuardConfig",
    "PubVerseError",
    "build_pubguard_error",
    "empty_input_error",
    "unreadable_pdf_error",
    "models_missing_error",
    "gate_bypassed",
    "format_error_line",
    "PIPELINE_ERRORS",
]