---
license: mit
language:
  - en
tags:
  - document-classification
  - scientific-posters
  - multimodal
  - model2vec
  - poster-detection
  - machine-actionable
  - FAIR-data
  - posters-science
  - quality-control
library_name: model2vec
pipeline_tag: text-classification
thumbnail: PosterSentry.png
---

<div align="center">
  <img src="PosterSentry.png" alt="PosterSentry Logo" width="400"/>
</div>

# PosterSentry: Multimodal Scientific Poster Classifier

## Model Description

PosterSentry is a lightweight, CPU-optimized multimodal classifier that determines whether a PDF is a **scientific poster** or a **non-poster** (paper, proceedings, newsletter, abstract book, etc.).

PosterSentry is part of the quality control pipeline for [**posters.science**](https://posters.science), a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).

It is developed by the [**FAIR Data Innovations Hub**](https://fairdataihub.org/) at the California Medical Innovations Institute (CalMI²).

## Related Models & Tools

| Resource | Description | Link |
|----------|-------------|------|
| **PosterSentry** | Multimodal poster classifier (this model) | [fairdataihub/poster-sentry](https://huggingface.co/fairdataihub/poster-sentry) |
| **Llama-3.1-8B-Poster-Extraction** | Poster → structured JSON extraction | [fairdataihub/Llama-3.1-8B-Poster-Extraction](https://huggingface.co/fairdataihub/Llama-3.1-8B-Poster-Extraction) |
| **poster2json** | Python library for poster extraction | [PyPI](https://pypi.org/project/poster2json/) · [Docs](https://fairdataihub.github.io/poster2json/) · [GitHub](https://github.com/fairdataihub/poster2json) |
| **poster-json-schema** | DataCite-based poster metadata schema | [GitHub](https://github.com/fairdataihub/poster-json-schema) |
| **Platform** | posters.science | [posters.science](https://posters.science) |

### Pipeline Position

PosterSentry sits at the front of the posters.science pipeline, screening incoming PDFs before the expensive Llama-based extraction:

```
PDF Input
   │
   ▼
┌──────────────┐     ┌────────────────────────────────┐     ┌─────────────┐
│ PosterSentry │ ──► │ Llama-3.1-8B-Poster-Extraction │ ──► │ poster2json │
│ (classify)   │     │ (extract structured metadata)  │     │ (validate)  │
└──────────────┘     └────────────────────────────────┘     └─────────────┘
   poster? ✓              raw text → JSON schema             FAIR output
```
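
The screening step can be sketched as a simple gate in front of extraction. This is a minimal illustration rather than the platform's actual code: `screen_pdfs`, the `min_confidence` threshold, and the stand-in `fake_classify` function are hypothetical. Only the `is_poster`, `confidence`, and `path` keys are taken from the usage example in this card.

```python
from typing import Callable, Iterable

# Result shape mirrors the dict shown under Usage:
# {'is_poster': bool, 'confidence': float, 'path': str}
Result = dict


def screen_pdfs(
    paths: Iterable[str],
    classify: Callable[[str], Result],
    min_confidence: float = 0.5,  # assumed threshold, not taken from the source
) -> tuple[list[str], list[str]]:
    """Split paths into (forward to extraction, rejected)."""
    accepted, rejected = [], []
    for path in paths:
        result = classify(path)
        if result["is_poster"] and result["confidence"] >= min_confidence:
            accepted.append(path)
        else:
            rejected.append(path)
    return accepted, rejected


def fake_classify(path: str) -> Result:
    """Stand-in classifier for the demo: file names containing 'poster' count as posters."""
    return {"is_poster": "poster" in path, "confidence": 0.9, "path": path}


accepted, rejected = screen_pdfs(["poster1.pdf", "paper.pdf"], fake_classify)
print(accepted, rejected)
```

In a real deployment the `classify` argument would presumably be the `sentry.classify` method shown under Usage, so that only accepted paths reach the slower extraction model.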

## Architecture

Three feature channels are concatenated into a **542-dimensional** vector (512 + 15 + 15) and fed to a single LogisticRegression classifier:

| Channel | Features | Dimension | Signal |
|---------|----------|-----------|--------|
| **Text** | model2vec (potion-base-32M) embedding | 512 | Semantic content |
| **Visual** | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
| **Structural** | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |

The classifier head is a single linear layer stored as a NumPy `.npz` file (10 KB). Inference is pure NumPy; no torch is required at prediction time.
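
A rough sketch of what such a NumPy-only forward pass could look like is given below. The function name, argument names, and the random placeholder weights are assumptions for illustration. The real layout of the `.npz` file is not documented in this card, so nothing is loaded from disk here.

```python
import numpy as np

TEXT_DIM, VISUAL_DIM, STRUCT_DIM = 512, 15, 15  # channel sizes from the table above


def predict_poster_proba(text_emb, visual, structural, mean, scale, coef, intercept):
    """Concatenate the three channels, standardize, and apply a logistic head."""
    x = np.concatenate([text_emb, visual, structural]).astype(np.float32)
    z = (x - mean) / scale                # StandardScaler-style transform
    logit = float(z @ coef + intercept)   # single linear layer
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid gives P(poster)


# Random placeholder inputs and weights, used only to check shapes.
rng = np.random.default_rng(0)
dim = TEXT_DIM + VISUAL_DIM + STRUCT_DIM
text_emb = rng.normal(size=TEXT_DIM)
visual = rng.normal(size=VISUAL_DIM)
structural = rng.normal(size=STRUCT_DIM)
mean, scale = np.zeros(dim), np.ones(dim)
coef, intercept = rng.normal(size=dim) * 0.01, 0.0

proba = predict_poster_proba(text_emb, visual, structural, mean, scale, coef, intercept)
print(dim, round(proba, 3))
```

Because the head is one dot product plus a sigmoid, the cost per document is dominated by feature extraction rather than by the classifier itself.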

## Performance

Validated on 3,606 real scientific documents:

| Metric | Value |
|--------|-------|
| **Accuracy** | **87.3%** |
| F1 (poster) | 87.1% |
| F1 (non-poster) | 87.4% |
| Precision (poster) | 88.2% |
| Recall (poster) | 85.9% |
| Inference speed | ~300 docs/sec (CPU) |
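
As a quick consistency check, the poster F1 score can be recomputed from the precision and recall in the table. The helper below is illustrative and not part of the package.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall for the poster class, taken from the table above.
poster_f1 = f1_score(0.882, 0.859)
print(f"{poster_f1:.3f}")
```

The result is roughly 0.870, within about 0.1 percentage point of the reported 87.1%. The small gap is presumably due to rounding in the published precision and recall.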

### Top Features by Importance

| Rank | Feature | Coefficient | Signal |
|------|---------|------------|--------|
| 1 | `size_per_page_kb` | +7.65 | Posters are dense, high-res single pages |
| 2 | `page_count` | -5.49 | More pages = not a poster |
| 3 | `file_size_kb` | -5.44 | Multi-page docs are bigger overall |
| 4 | `img_height` | +1.38 | Posters are large-format |
| 5 | `page_height_pt` | +1.38 | Large physical dimensions |
| 6 | `avg_font_size` | -1.10 | Papers use smaller fonts |
| 7 | `is_landscape` | +0.98 | Some posters are landscape |
| 8 | `color_diversity` | +0.95 | Posters are visually rich |
| 9 | `edge_density` | +0.79 | More visual edges in posters |
| 10 | `text_block_count` | +0.75 | Multi-column poster layouts |
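
The ordering in this table is consistent with ranking features by the absolute value of their coefficient, with the sign giving the direction of the effect. The sketch below reproduces that reading from the listed values. It assumes, as the use of StandardScaler in the specifications suggests, that the coefficients apply to standardized features and are therefore comparable with one another.

```python
# Coefficients copied from the table above.
coefficients = {
    "size_per_page_kb": 7.65,
    "page_count": -5.49,
    "file_size_kb": -5.44,
    "img_height": 1.38,
    "page_height_pt": 1.38,
    "avg_font_size": -1.10,
    "is_landscape": 0.98,
    "color_diversity": 0.95,
    "edge_density": 0.79,
    "text_block_count": 0.75,
}

# Rank by magnitude; Python's sort is stable, so ties keep their table order.
ranked = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Positive coefficients push toward "poster", negative ones toward "non-poster".
toward_poster = [name for name, c in ranked if c > 0]
toward_non_poster = [name for name, c in ranked if c < 0]

for rank, (name, c) in enumerate(ranked, start=1):
    print(f"{rank:2d}  {name:<18} {c:+.2f}")
```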

## Training Data

Trained on **3,606 real documents**, with no synthetic data:

| Class | Count | Source |
|-------|-------|--------|
| **Poster** | 1,803 | Verified scientific posters from Zenodo & Figshare |
| **Non-poster** | 1,803 | Multi-page papers, proceedings, newsletters, abstract books |

Sampled from the [posters.science](https://posters.science) corpus of **30,000+ classified PDFs** (28,111 posters, 2,036 non-posters from Zenodo and Figshare).
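
One way to draw a balanced set from such an imbalanced corpus is to sample the same number of documents from each class. The card does not describe the actual sampling procedure, so the sketch below is an assumption, and the file names are placeholders.

```python
import random


def balanced_sample(posters, non_posters, n_per_class, seed=0):
    """Draw the same number of items from each class, without replacement."""
    rng = random.Random(seed)
    # Never request more than the smaller class can supply.
    n = min(n_per_class, len(posters), len(non_posters))
    return rng.sample(posters, n), rng.sample(non_posters, n)


# Placeholder corpus with the class sizes quoted above.
posters = [f"poster_{i}.pdf" for i in range(28_111)]
non_posters = [f"doc_{i}.pdf" for i in range(2_036)]

poster_set, non_poster_set = balanced_sample(posters, non_posters, n_per_class=1_803)
print(len(poster_set), len(non_poster_set))
```

Capping the request at the size of the smaller class keeps the two sets equal in size even when more documents per class are requested than the non-poster class contains.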

Training data: [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data)

## Usage

### Python API

```python
from poster_sentry import PosterSentry

sentry = PosterSentry()
sentry.initialize()

# Classify a PDF (uses text + visual + structural features)
result = sentry.classify("document.pdf")
print(f"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}")
# `result` is a dict, e.g. {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}

# Batch classification
results = sentry.classify_batch(["poster1.pdf", "paper.pdf", "newsletter.pdf"])
```

### Installation

```bash
pip install git+https://github.com/fairdataihub/poster-repo-qc.git

# Or install from source
git clone https://github.com/fairdataihub/poster-repo-qc.git
cd poster-repo-qc
pip install -e ".[train]"
```

### Training

```bash
python scripts/train_poster_sentry.py --n-per-class 2000
```

Training completes in ~40 minutes on CPU (PDF rendering is the bottleneck, not the classifier).

## Model Specifications

| Attribute | Value |
|-----------|-------|
| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
| Embedding dimension | 512 |
| Visual features | 15 (color, edge, FFT, whitespace) |
| Structural features | 15 (page geometry, fonts, text blocks) |
| Total input dimension | 542 |
| Classifier | LogisticRegression (sklearn) + StandardScaler |
| Head file size | 10 KB (.npz) |
| Precision | float32 |
| GPU required | No (CPU-only) |
| License | MIT |

## System Requirements

- **CPU**: Any modern CPU (no GPU needed)
- **RAM**: ≥4 GB
- **Python**: ≥3.10
- **Dependencies**: numpy, model2vec, scikit-learn, PyMuPDF, Pillow

## Citation

```bibtex
@software{poster_sentry_2026,
  title = {PosterSentry: Multimodal Scientific Poster Classifier},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  url = {https://huggingface.co/fairdataihub/poster-sentry},
  note = {Part of the posters.science initiative}
}
```

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgments

- [FAIR Data Innovations Hub](https://fairdataihub.org/) at California Medical Innovations Institute (CalMI²)
- [posters.science](https://posters.science) platform
- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
- HuggingFace for model hosting infrastructure
- Funded by The Navigation Fund ([10.71707/rk36-9x79](https://doi.org/10.71707/rk36-9x79)): "Poster Sharing and Discovery Made Easy"