jimnoneill committed · Commit 0e48d34 (verified) · 1 Parent(s): 0602538

Updated README with logo, pipeline diagram, related models, grant acknowledgment

Files changed (1)
  1. README.md +130 -19
README.md CHANGED
@@ -8,22 +8,57 @@ tags:
8
  - multimodal
9
  - model2vec
10
  - poster-detection
 
 
 
 
11
  library_name: model2vec
12
  pipeline_tag: text-classification
 
13
  ---
14
 
 
 
 
 
15
  # PosterSentry: Multimodal Scientific Poster Classifier
16
 
17
- PosterSentry classifies PDFs as **scientific posters** vs **non-posters** (papers, proceedings,
18
- abstracts, newsletters) using a multimodal approach that combines text embeddings with visual
19
- and structural features from the PDF.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
20
 
21
- Part of the [posters.science](https://posters.science) initiative at
22
- [FAIR Data Innovations Hub](https://fairdataihub.org).
 
 
 
 
 
 
 
 
23
 
24
  ## Architecture
25
 
26
- Three feature channels concatenated into a 542-dimensional vector:
27
 
28
  | Channel | Features | Dimension | Signal |
29
  |---------|----------|-----------|--------|
@@ -31,53 +66,129 @@ Three feature channels concatenated into a 542-dimensional vector:
31
  | **Visual** | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
32
  | **Structural** | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |
33
 
34
- Single LogisticRegression classifier with StandardScaler normalization.
35
 
36
  ## Performance
37
 
 
 
38
  | Metric | Value |
39
  |--------|-------|
40
- | Accuracy | **87.3%** |
41
  | F1 (poster) | 87.1% |
42
  | F1 (non-poster) | 87.4% |
43
- | Inference | ~300 docs/sec (CPU) |
 
 
44
 
45
  ### Top Features by Importance
46
 
47
- 1. `size_per_page_kb` (+7.65): Posters are dense, high-res single pages
48
- 2. `page_count` (-5.49): More pages = not a poster
49
- 3. `file_size_kb` (-5.44): Multi-page docs are bigger overall
50
- 4. `img_height` (+1.38): Posters are large-format
51
- 5. `color_diversity` (+0.95): Posters are visually rich
 
 
 
 
 
 
 
52
 
53
  ## Training Data
54
 
55
- Trained on **3,606 real documents** (zero synthetic data):
 
 
 
 
 
56
 
57
- - **1,803 verified scientific posters** from Zenodo & Figshare (sampled from 28K+ corpus)
58
- - **1,803 verified non-posters**: multi-page papers, proceedings, newsletters
59
 
60
- See [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data).
61
 
62
  ## Usage
63
 
 
 
64
  ```python
65
  from poster_sentry import PosterSentry
66
 
67
  sentry = PosterSentry()
68
  sentry.initialize()
 
 
69
  result = sentry.classify("document.pdf")
 
70
  # {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}
 
 
 
71
  ```
72
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
73
  ## Citation
74
 
75
  ```bibtex
76
  @software{poster_sentry_2026,
77
  title = {PosterSentry: Multimodal Scientific Poster Classifier},
78
- author = {O'Neill, Jamey and FAIR Data Innovations Hub},
79
  year = {2026},
80
  url = {https://huggingface.co/fairdataihub/poster-sentry},
81
  note = {Part of the posters.science initiative}
82
  }
83
  ```
 
 
 
 
 
 
 
 
 
 
 
 
 
8
  - multimodal
9
  - model2vec
10
  - poster-detection
11
+ - machine-actionable
12
+ - FAIR-data
13
+ - posters-science
14
+ - quality-control
15
  library_name: model2vec
16
  pipeline_tag: text-classification
17
+ thumbnail: PosterSentry.png
18
  ---
19
 
20
+ <div align="center">
21
+ <img src="PosterSentry.png" alt="PosterSentry Logo" width="400"/>
22
+ </div>
23
+
24
  # PosterSentry: Multimodal Scientific Poster Classifier
25
 
26
+ ## Model Description
27
+
28
+ PosterSentry is a lightweight, CPU-optimized multimodal classifier that determines whether a PDF is a **scientific poster** or a **non-poster** (paper, proceedings, newsletter, abstract book, etc.).
29
+
30
+ Part of the quality control pipeline for [**posters.science**](https://posters.science), a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).
31
+
32
+ Developed by the [**FAIR Data Innovations Hub**](https://fairdataihub.org/) at the California Medical Innovations Institute (CalMI²).
33
+
34
+ ## Related Models & Tools
35
+
36
+ | Resource | Description | Link |
37
+ |----------|-------------|------|
38
+ | **PosterSentry** | Multimodal poster classifier (this model) | [fairdataihub/poster-sentry](https://huggingface.co/fairdataihub/poster-sentry) |
39
+ | **Llama-3.1-8B-Poster-Extraction** | Poster → structured JSON extraction | [fairdataihub/Llama-3.1-8B-Poster-Extraction](https://huggingface.co/fairdataihub/Llama-3.1-8B-Poster-Extraction) |
40
+ | **poster2json** | Python library for poster extraction | [PyPI](https://pypi.org/project/poster2json/) · [Docs](https://fairdataihub.github.io/poster2json/) · [GitHub](https://github.com/fairdataihub/poster2json) |
41
+ | **poster-json-schema** | DataCite-based poster metadata schema | [GitHub](https://github.com/fairdataihub/poster-json-schema) |
42
+ | **Platform** | posters.science | [posters.science](https://posters.science) |
43
+
44
+ ### Pipeline Position
45
+
46
+ PosterSentry sits at the front of the posters.science pipeline: it screens incoming PDFs before the expensive Llama-based extraction step.
47
 
48
+ ```
49
+ PDF Input
50
+     │
51
+     ▼
52
+ ┌──────────────┐     ┌──────────────────────────────────┐     ┌──────────────┐
53
+ │ PosterSentry │ ──► │ Llama-3.1-8B-Poster-Extraction   │ ──► │ poster2json  │
54
+ │  (classify)  │     │ (extract structured metadata)    │     │  (validate)  │
55
+ └──────────────┘     └──────────────────────────────────┘     └──────────────┘
56
+    poster? ✓           raw text → JSON schema                   FAIR output
57
+ ```
58
 
59
  ## Architecture
60
 
61
+ Three feature channels concatenated into a **542-dimensional** vector, fed to a single LogisticRegression:
62
 
63
  | Channel | Features | Dimension | Signal |
64
  |---------|----------|-----------|--------|
 
66
  | **Visual** | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
67
  | **Structural** | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |
68
 
69
+ Each classifier head is a single linear layer stored as a numpy `.npz` file (10 KB). Inference is pure numpy; no torch is required at prediction time.
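
The head described above can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions, not the packaged implementation: the parameter names (`mean`, `scale`, `coef`, `intercept`) and both function names are hypothetical, and random values stand in for the extracted features and trained weights.

```python
import numpy as np

EMBED_DIM, VISUAL_DIM, STRUCT_DIM = 512, 15, 15  # 512 + 15 + 15 = 542


def build_features(text_emb, visual, structural):
    """Concatenate the three channels into one 542-dimensional vector."""
    features = np.concatenate([text_emb, visual, structural]).astype(np.float32)
    assert features.shape == (EMBED_DIM + VISUAL_DIM + STRUCT_DIM,)
    return features


def predict_proba(features, mean, scale, coef, intercept):
    """StandardScaler-style transform followed by a logistic-regression head."""
    z = (features - mean) / scale          # standardize each feature
    logit = float(z @ coef + intercept)    # single linear layer
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid -> P(poster)


rng = np.random.default_rng(0)
features = build_features(
    rng.normal(size=EMBED_DIM),   # stand-in for the text embedding
    rng.normal(size=VISUAL_DIM),  # stand-in for the visual features
    rng.normal(size=STRUCT_DIM),  # stand-in for the structural features
)

# Placeholder parameters (hypothetical); a real head would load trained values.
mean = np.zeros(542, dtype=np.float32)
scale = np.ones(542, dtype=np.float32)
coef = rng.normal(scale=0.05, size=542).astype(np.float32)
intercept = 0.0

p = predict_proba(features, mean, scale, coef, intercept)
print(f"P(poster) = {p:.3f}")
```

Because the scaler and the linear layer are both element-wise or dot-product operations, the whole prediction path needs only array arithmetic, which is consistent with the 10 KB head size.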
70
 
71
  ## Performance
72
 
73
+ Validated on 3,606 real scientific documents:
74
+
75
  | Metric | Value |
76
  |--------|-------|
77
+ | **Accuracy** | **87.3%** |
78
  | F1 (poster) | 87.1% |
79
  | F1 (non-poster) | 87.4% |
80
+ | Precision (poster) | 88.2% |
81
+ | Recall (poster) | 85.9% |
82
+ | Inference speed | ~300 docs/sec (CPU) |
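
As a sanity check on the table, F1 is the harmonic mean of precision and recall, so the poster F1 can be recomputed from the two rounded values listed. The helper name `f1_score` below is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the table above.
poster_f1 = f1_score(0.882, 0.859)
print(f"F1 (poster) from rounded precision/recall: {poster_f1:.3f}")
```

The recomputed value is about 0.870, within roughly 0.1 percentage point of the reported 87.1%. A gap of that size is compatible with precision and recall having been rounded before being tabulated.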
83
 
84
  ### Top Features by Importance
85
 
86
+ | Rank | Feature | Coefficient | Signal |
87
+ |------|---------|------------|--------|
88
+ | 1 | `size_per_page_kb` | +7.65 | Posters are dense, high-res single pages |
89
+ | 2 | `page_count` | -5.49 | More pages = not a poster |
90
+ | 3 | `file_size_kb` | -5.44 | Multi-page docs are bigger overall |
91
+ | 4 | `img_height` | +1.38 | Posters are large-format |
92
+ | 5 | `page_height_pt` | +1.38 | Large physical dimensions |
93
+ | 6 | `avg_font_size` | -1.10 | Papers use smaller fonts |
94
+ | 7 | `is_landscape` | +0.98 | Some posters are landscape |
95
+ | 8 | `color_diversity` | +0.95 | Posters are visually rich |
96
+ | 9 | `edge_density` | +0.79 | More visual edges in posters |
97
+ | 10 | `text_block_count` | +0.75 | Multi-column poster layouts |
98
 
99
  ## Training Data
100
 
101
+ Trained on **3,606 real documents**, with zero synthetic data:
102
+
103
+ | Class | Count | Source |
104
+ |-------|-------|--------|
105
+ | **Poster** | 1,803 | Verified scientific posters from Zenodo & Figshare |
106
+ | **Non-poster** | 1,803 | Multi-page papers, proceedings, newsletters, abstract books |
107
 
108
+ Sampled from the [posters.science](https://posters.science) corpus of **30,000+ classified PDFs** (28,111 posters, 2,036 non-posters from Zenodo and Figshare).
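
A balanced draw of this kind can be sketched as follows. The README does not say how the training script actually samples, so the function `balanced_sample`, the seed, and the file names are all hypothetical; only the corpus and class sizes come from the text above.

```python
import random


def balanced_sample(posters, non_posters, n_per_class, seed=0):
    """Draw the same number of items from each class without replacement."""
    rng = random.Random(seed)
    # Never request more items than the smaller class can supply.
    n = min(n_per_class, len(posters), len(non_posters))
    return rng.sample(posters, n), rng.sample(non_posters, n)


# Placeholder file names sized like the corpus described above.
posters = [f"poster_{i}.pdf" for i in range(28_111)]
non_posters = [f"doc_{i}.pdf" for i in range(2_036)]

pos, neg = balanced_sample(posters, non_posters, n_per_class=1_803)
print(len(pos), len(neg), len(pos) + len(neg))
```

Sampling an equal count per class keeps the classifier from simply learning the corpus prior, which is heavily skewed toward posters here.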
 
109
 
110
+ Training data: [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data)
111
 
112
  ## Usage
113
 
114
+ ### Python API
115
+
116
  ```python
117
  from poster_sentry import PosterSentry
118
 
119
  sentry = PosterSentry()
120
  sentry.initialize()
121
+
122
+ # Classify a PDF (uses text + visual + structural features)
123
  result = sentry.classify("document.pdf")
124
+ print(f"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}")
125
  # {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}
126
+
127
+ # Batch classification
128
+ results = sentry.classify_batch(["poster1.pdf", "paper.pdf", "newsletter.pdf"])
129
  ```
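
When screening PDFs ahead of the extraction step, one option is to route low-confidence predictions to manual review. The sketch below assumes `classify_batch` returns a list of dicts shaped like the `classify` result shown above, which the README does not confirm. The helper `split_by_confidence` and the 0.9 threshold are illustrative choices, and the result dicts are hand-written stand-ins so the example runs without the library.

```python
def split_by_confidence(results, threshold=0.9):
    """Separate confident poster predictions from everything else."""
    accepted, review = [], []
    for r in results:
        if r["is_poster"] and r["confidence"] >= threshold:
            accepted.append(r["path"])
        else:
            review.append(r["path"])
    return accepted, review


# Hand-written stand-ins for classifier output (not real predictions).
results = [
    {"is_poster": True, "confidence": 0.97, "path": "poster1.pdf"},
    {"is_poster": False, "confidence": 0.97, "path": "paper.pdf"},
    {"is_poster": True, "confidence": 0.62, "path": "newsletter.pdf"},
]

accepted, review = split_by_confidence(results)
print("accepted:", accepted)
print("review:", review)
```

Only `poster1.pdf` passes with these stand-in values; the other two paths land in the review list.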
130
 
131
+ ### Installation
132
+
133
+ ```bash
134
+ pip install git+https://github.com/fairdataihub/poster-repo-qc.git
135
+
136
+ # Or install from source
137
+ git clone https://github.com/fairdataihub/poster-repo-qc.git
138
+ cd poster-repo-qc
139
+ pip install -e ".[train]"
140
+ ```
141
+
142
+ ### Training
143
+
144
+ ```bash
145
+ python scripts/train_poster_sentry.py --n-per-class 2000
146
+ ```
147
+
148
+ Training completes in ~40 minutes on CPU (PDF rendering is the bottleneck, not the classifier).
149
+
150
+ ## Model Specifications
151
+
152
+ | Attribute | Value |
153
+ |-----------|-------|
154
+ | Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
155
+ | Embedding dimension | 512 |
156
+ | Visual features | 15 (color, edge, FFT, whitespace) |
157
+ | Structural features | 15 (page geometry, fonts, text blocks) |
158
+ | Total input dimension | 542 |
159
+ | Classifier | LogisticRegression (sklearn) + StandardScaler |
160
+ | Head file size | 10 KB (.npz) |
161
+ | Precision | float32 |
162
+ | GPU required | No (CPU-only) |
163
+ | License | MIT |
164
+
165
+ ## System Requirements
166
+
167
+ - **CPU**: Any modern CPU (no GPU needed)
168
+ - **RAM**: ≥ 4 GB
169
+ - **Python**: ≥ 3.10
170
+ - **Dependencies**: numpy, model2vec, scikit-learn, PyMuPDF, Pillow
171
+
172
  ## Citation
173
 
174
  ```bibtex
175
  @software{poster_sentry_2026,
176
  title = {PosterSentry: Multimodal Scientific Poster Classifier},
177
+ author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
178
  year = {2026},
179
  url = {https://huggingface.co/fairdataihub/poster-sentry},
180
  note = {Part of the posters.science initiative}
181
  }
182
  ```
183
+
184
+ ## License
185
+
186
+ This model is released under the [MIT License](https://opensource.org/licenses/MIT).
187
+
188
+ ## Acknowledgments
189
+
190
+ - [FAIR Data Innovations Hub](https://fairdataihub.org/) at California Medical Innovations Institute (CalMI²)
191
+ - [posters.science](https://posters.science) platform
192
+ - [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
193
+ - HuggingFace for model hosting infrastructure
194
+ - Funded by The Navigation Fund ([10.71707/rk36-9x79](https://doi.org/10.71707/rk36-9x79)): "Poster Sharing and Discovery Made Easy"