"""
PubGuard — Scientific Publication Gatekeeper
=============================================

Multi-head document classifier for the PubVerse pipeline.
Determines whether extracted PDF text represents a genuine
scientific publication vs. junk, and flags AI-generated or
offensive content.

Classification heads:
    1. doc_type   – scientific_paper | poster | abstract_only | junk
    2. ai_detect  – human | ai_generated
    3. toxicity   – clean | toxic

Architecture mirrors openalex-topic-classifier:
    model2vec (StaticModel) → L2-normalised embeddings → per-head
    linear classifiers (sklearn / small torch heads) stored as
    numpy weight matrices for zero-dependency inference.

Usage:
    from pubguard import PubGuard

    guard = PubGuard()
    guard.initialize()
    verdict = guard.screen(text)
    # verdict = {
    #   'doc_type': {'label': 'scientific_paper', 'score': 0.94},
    #   'ai_generated': {'label': 'human', 'score': 0.87},
    #   'toxicity': {'label': 'clean', 'score': 0.99},
    #   'pass': True
    # }
"""

from .classifier import PubGuard
from .config import PubGuardConfig
from .errors import (
    PubVerseError,
    build_pubguard_error,
    empty_input_error,
    unreadable_pdf_error,
    models_missing_error,
    gate_bypassed,
    format_error_line,
    PIPELINE_ERRORS,
)

__version__ = "0.1.0"
__all__ = [
    "PubGuard",
    "PubGuardConfig",
    "PubVerseError",
    "build_pubguard_error",
    "empty_input_error",
    "unreadable_pdf_error",
    "models_missing_error",
    "gate_bypassed",
    "format_error_line",
    "PIPELINE_ERRORS",
]
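

# ---------------------------------------------------------------------------
# Illustrative sketch only; not part of the PubGuard API and not exported.
#
# The module docstring describes L2-normalised embeddings fed to per-head
# linear classifiers. The helper below is a hypothetical, standard-library
# illustration of that idea for a single head. Its name, its parameters, the
# row-per-label weight layout, and the use of softmax to produce the score
# are all assumptions made for this example; the real heads in `.classifier`
# may differ.
# ---------------------------------------------------------------------------
import math


def _sketch_linear_head(embedding, weights, bias, labels):
    """Hypothetical example: score one embedding with one linear head.

    embedding: list of floats (one document vector)
    weights:   list of rows, one row of floats per label (assumed layout)
    bias:      list of floats, one per label
    labels:    list of label names, in the same order as the weight rows
    """
    # L2-normalise the embedding; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(x * x for x in embedding))
    unit = [x / norm for x in embedding] if norm > 0 else list(embedding)

    # Linear layer: one logit per label.
    logits = [
        sum(w * x for w, x in zip(row, unit)) + b
        for row, b in zip(weights, bias)
    ]

    # Numerically stable softmax over the logits.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)

    # Return the top label in the same shape as the verdict entries above.
    best = max(range(len(labels)), key=lambda i: exps[i])
    return {"label": labels[best], "score": exps[best] / total}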