jimnoneill committed
Commit b9d33f4 · verified · 1 Parent(s): 635afb8

Upload README.md with huggingface_hub

  1. README.md +83 -0
README.md ADDED
@@ -0,0 +1,83 @@
---
license: mit
language:
- en
tags:
- document-classification
- scientific-posters
- multimodal
- model2vec
- poster-detection
library_name: model2vec
pipeline_tag: text-classification
---

# PosterSentry: Multimodal Scientific Poster Classifier

PosterSentry classifies PDFs as **scientific posters** or **non-posters** (papers, proceedings,
abstracts, newsletters) using a multimodal approach that combines text embeddings with visual
and structural features extracted from the PDF.

PosterSentry is part of the [posters.science](https://posters.science) initiative at the
[FAIR Data Innovations Hub](https://fairdataihub.org).

## Architecture

Three feature channels are concatenated into a single 542-dimensional vector (512 + 15 + 15):

| Channel | Features | Dimension | Signal |
|---------|----------|-----------|--------|
| **Text** | model2vec (potion-base-32M) embedding | 512 | Semantic content |
| **Visual** | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
| **Structural** | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |

A single LogisticRegression classifier is applied to the concatenated vector after StandardScaler normalization.

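
The inference path described above (concatenate, standardize, score) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the random placeholder features, and the made-up means, standard deviations, and weights are all hypothetical and are not taken from the PosterSentry code.

```python
import math
import random

TEXT_DIM, VISUAL_DIM, STRUCT_DIM = 512, 15, 15  # channel sizes from the table above


def build_feature_vector(text_emb, visual_feats, struct_feats):
    """Concatenate the three channels into one 542-dimensional list."""
    dims = (len(text_emb), len(visual_feats), len(struct_feats))
    if dims != (TEXT_DIM, VISUAL_DIM, STRUCT_DIM):
        raise ValueError(f"unexpected channel dimensions: {dims}")
    return list(text_emb) + list(visual_feats) + list(struct_feats)


def standardize(vec, means, stds):
    """Z-score each feature, the transform StandardScaler applies once fitted."""
    return [(x - m) / s for x, m, s in zip(vec, means, stds)]


def poster_probability(vec, weights, bias):
    """Logistic regression score: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder values only: random features and made-up model parameters.
rng = random.Random(0)
features = build_feature_vector(
    [rng.gauss(0, 1) for _ in range(TEXT_DIM)],
    [rng.gauss(0, 1) for _ in range(VISUAL_DIM)],
    [rng.gauss(0, 1) for _ in range(STRUCT_DIM)],
)
means = [0.0] * len(features)
stds = [1.0] * len(features)
weights = [rng.uniform(-0.05, 0.05) for _ in features]

prob = poster_probability(standardize(features, means, stds), weights, bias=0.0)
```

In a real model the means, standard deviations, weights, and bias would come from fitting on the training set rather than from a random generator.
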
## Performance

| Metric | Value |
|--------|-------|
| Accuracy | **87.3%** |
| F1 (poster) | 87.1% |
| F1 (non-poster) | 87.4% |
| Inference throughput | ~300 docs/sec (CPU) |

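
For readers unfamiliar with per-class F1, the sketch below shows how accuracy and the two F1 scores in the table are defined. The labels are a tiny invented example, not the PosterSentry evaluation set, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented labels: 1 = poster, 0 = non-poster.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

acc = accuracy(y_true, y_pred)
f1_poster = f1_for_class(y_true, y_pred, cls=1)
f1_non_poster = f1_for_class(y_true, y_pred, cls=0)
```
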
### Top Features by Importance

1. `size_per_page_kb` (+7.65): Posters are dense, high-resolution single pages
2. `page_count` (-5.49): More pages make a poster less likely
3. `file_size_kb` (-5.44): Multi-page documents are larger overall
4. `img_height` (+1.38): Posters are large-format
5. `color_diversity` (+0.95): Posters are visually rich

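
The ranking above can be reproduced by sorting the signed values by magnitude. The sketch below assumes the numbers are per-feature weights where a positive sign pushes toward "poster"; the README does not state this explicitly. It also assumes, from the name alone, that `size_per_page_kb` is file size divided by page count.

```python
# Signed importance values copied from the list above.
importances = {
    "size_per_page_kb": 7.65,
    "page_count": -5.49,
    "file_size_kb": -5.44,
    "img_height": 1.38,
    "color_diversity": 0.95,
}

# Rank by absolute value so strong negative signals sort next to strong positive ones.
ranked = sorted(importances.items(), key=lambda kv: abs(kv[1]), reverse=True)


def size_per_page_kb(file_size_kb, page_count):
    """Assumed definition: total file size spread over the pages (at least one)."""
    return file_size_kb / max(page_count, 1)


# A 2 MB single-page PDF is far denser per page than a 2 MB 16-page PDF.
single_page = size_per_page_kb(2048, 1)
multi_page = size_per_page_kb(2048, 16)
```

Under these assumptions, a large single-page file scores high on the strongest positive feature, while the same file size spread over many pages does not.
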
## Training Data

Trained on **3,606 real documents** (no synthetic data):

- **1,803 verified scientific posters** from Zenodo and Figshare (sampled from a 28K+ corpus)
- **1,803 verified non-posters**: multi-page papers, proceedings, newsletters

The dataset is available at [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data).

## Usage

```python
from poster_sentry import PosterSentry

sentry = PosterSentry()
sentry.initialize()  # must be called before classifying

result = sentry.classify("document.pdf")
# Example result:
# {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}
```

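
To screen a whole folder, the single-file call above can be wrapped in a loop. The helper below is a hypothetical sketch, not part of the `poster_sentry` package: it relies only on a `classify(path)` method that returns the dictionary shape shown above, and it assumes `confidence` can be compared against a threshold. `StubSentry` is a stand-in so the example runs without the real model.

```python
from pathlib import Path


def classify_folder(classifier, folder, threshold=0.5):
    """Return paths of PDFs in `folder` that the classifier flags as posters."""
    posters = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        result = classifier.classify(str(pdf))
        if result["is_poster"] and result["confidence"] >= threshold:
            posters.append(result["path"])
    return posters


class StubSentry:
    """Hypothetical stand-in: flags any file with 'poster' in its name."""

    def classify(self, path):
        is_poster = "poster" in Path(path).name.lower()
        return {"is_poster": is_poster, "confidence": 0.9, "path": path}
```

With the real package, an initialized `PosterSentry` instance would be passed in place of `StubSentry()`.
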
## Citation

```bibtex
@software{poster_sentry_2026,
  title = {PosterSentry: Multimodal Scientific Poster Classifier},
  author = {O'Neill, Jamey and {FAIR Data Innovations Hub}},
  year = {2026},
  url = {https://huggingface.co/fairdataihub/poster-sentry},
  note = {Part of the posters.science initiative}
}
```