---
license: mit
language:
- en
tags:
- document-classification
- scientific-posters
- multimodal
- model2vec
- poster-detection
- machine-actionable
- FAIR-data
- posters-science
- quality-control
library_name: model2vec
pipeline_tag: text-classification
thumbnail: PosterSentry.png
---
# PosterSentry — Multimodal Scientific Poster Classifier
## Model Description
PosterSentry is a lightweight, CPU-optimized multimodal classifier that determines whether a PDF is a **scientific poster** or a **non-poster** (paper, proceedings, newsletter, abstract book, etc.).
It is part of the quality-control pipeline for [**posters.science**](https://posters.science), a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).
It was developed by the [**FAIR Data Innovations Hub**](https://fairdataihub.org/) at the California Medical Innovations Institute (CalMI²).
## Related Models & Tools
| Resource | Description | Link |
|----------|-------------|------|
| **PosterSentry** | Multimodal poster classifier (this model) | [fairdataihub/poster-sentry](https://huggingface.co/fairdataihub/poster-sentry) |
| **Llama-3.1-8B-Poster-Extraction** | Poster → structured JSON extraction | [fairdataihub/Llama-3.1-8B-Poster-Extraction](https://huggingface.co/fairdataihub/Llama-3.1-8B-Poster-Extraction) |
| **poster2json** | Python library for poster extraction | [PyPI](https://pypi.org/project/poster2json/) · [Docs](https://fairdataihub.github.io/poster2json/) · [GitHub](https://github.com/fairdataihub/poster2json) |
| **poster-json-schema** | DataCite-based poster metadata schema | [GitHub](https://github.com/fairdataihub/poster-json-schema) |
| **Platform** | posters.science | [posters.science](https://posters.science) |
### Pipeline Position
PosterSentry sits at the front of the posters.science pipeline, screening incoming PDFs so that only posters reach the expensive Llama-based extraction step:
```
PDF Input
│
▼
┌──────────────┐ ┌───────────────────────────────────┐ ┌──────────────┐
│ PosterSentry │ ──► │ Llama-3.1-8B-Poster-Extraction │ ──► │ poster2json │
│ (classify) │ │ (extract structured metadata) │ │ (validate) │
└──────────────┘ └───────────────────────────────────┘ └──────────────┘
poster? ✓ raw text → JSON schema FAIR output
```
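The gating step in the diagram can be sketched as follows. This is a minimal illustration, not the posters.science implementation: `screen_and_extract`, `fake_classify`, and `fake_extract` are hypothetical names, and the two callables stand in for PosterSentry and the Llama-based extractor.

```python
from typing import Any, Callable, Dict, Iterable, List


def screen_and_extract(
    pdf_paths: Iterable[str],
    classify: Callable[[str], Dict[str, Any]],
    extract: Callable[[str], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Run the expensive extractor only on PDFs the classifier accepts.

    `classify` is assumed to return a dict with an 'is_poster' key,
    mirroring the result shape shown in the Usage section.
    `extract` is a hypothetical stand-in for the extraction stage.
    """
    extracted = []
    for path in pdf_paths:
        if classify(path)["is_poster"]:
            extracted.append(extract(path))
    return extracted


# Toy stand-ins so the sketch runs without any models installed.
def fake_classify(path: str) -> Dict[str, Any]:
    return {"is_poster": path.startswith("poster"), "confidence": 0.9, "path": path}


def fake_extract(path: str) -> Dict[str, Any]:
    return {"source": path}


kept = screen_and_extract(["poster1.pdf", "paper.pdf"], fake_classify, fake_extract)
print(kept)  # [{'source': 'poster1.pdf'}]
```

Passing the two stages in as callables keeps the screening logic testable without loading either model.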
## Architecture
Three feature channels are concatenated into a **542-dimensional** vector (512 + 15 + 15) and fed to a single logistic regression classifier:
| Channel | Features | Dimension | Signal |
|---------|----------|-----------|--------|
| **Text** | model2vec (potion-base-32M) embedding | 512 | Semantic content |
| **Visual** | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
| **Structural** | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |
The classifier head is a single linear layer stored as a NumPy `.npz` file (10 KB). Inference is pure NumPy; torch is not required at prediction time.
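As a rough sketch of what this inference path could look like, the snippet below concatenates the three channels, applies StandardScaler-style normalisation (the scaler is listed under Model Specifications), and evaluates a logistic head. The function name, parameter names, and random placeholder values are assumptions for illustration; the actual PosterSentry code and `.npz` layout may differ.

```python
import numpy as np

TEXT_DIM, VISUAL_DIM, STRUCT_DIM = 512, 15, 15


def predict_proba(text_emb, visual, structural, mean, scale, weights, bias):
    """Concatenate the three channels and apply a scaled logistic head."""
    x = np.concatenate([text_emb, visual, structural]).astype(np.float32)
    assert x.shape == (TEXT_DIM + VISUAL_DIM + STRUCT_DIM,)  # 542 features
    z = (x - mean) / scale  # StandardScaler-style normalisation
    logit = float(z @ weights + bias)
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid gives P(poster)


# Placeholder parameters (hypothetical); a real head would be loaded from disk.
rng = np.random.default_rng(0)
dim = TEXT_DIM + VISUAL_DIM + STRUCT_DIM
mean = np.zeros(dim, dtype=np.float32)
scale = np.ones(dim, dtype=np.float32)
weights = rng.normal(size=dim).astype(np.float32)
bias = 0.0

p = predict_proba(
    rng.normal(size=TEXT_DIM),
    rng.normal(size=VISUAL_DIM),
    rng.normal(size=STRUCT_DIM),
    mean, scale, weights, bias,
)
print(f"P(poster) = {p:.3f}")
```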
## Performance
Validated on 3,606 real scientific documents:
| Metric | Value |
|--------|-------|
| **Accuracy** | **87.3%** |
| F1 (poster) | 87.1% |
| F1 (non-poster) | 87.4% |
| Precision (poster) | 88.2% |
| Recall (poster) | 85.9% |
| Inference speed | ~300 docs/sec (CPU) |
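As a quick consistency check, F1 is the harmonic mean of precision and recall. The sketch below recomputes the poster-class F1 from the rounded precision and recall in the table; it comes out at about 0.870, which agrees with the reported 87.1% up to rounding of the inputs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Poster-class values copied from the table above (already rounded).
precision, recall = 0.882, 0.859
f1 = f1_score(precision, recall)
print(f"F1 (poster) ~ {f1:.3f}")
```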
### Top Features by Importance
| Rank | Feature | Coefficient | Signal |
|------|---------|------------|--------|
| 1 | `size_per_page_kb` | +7.65 | Posters are dense, high-res single pages |
| 2 | `page_count` | -5.49 | More pages = not a poster |
| 3 | `file_size_kb` | -5.44 | Multi-page docs are bigger overall |
| 4 | `img_height` | +1.38 | Posters are large-format |
| 5 | `page_height_pt` | +1.38 | Large physical dimensions |
| 6 | `avg_font_size` | -1.10 | Papers use smaller fonts |
| 7 | `is_landscape` | +0.98 | Some posters are landscape |
| 8 | `color_diversity` | +0.95 | Posters are visually rich |
| 9 | `edge_density` | +0.79 | More visual edges in posters |
| 10 | `text_block_count` | +0.75 | Multi-column poster layouts |
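The ranking above appears to follow the absolute value of each coefficient, with the sign giving the direction of the effect. The sketch below reproduces that ordering from the tabulated values. It assumes the coefficients apply to standardized features (as the StandardScaler in Model Specifications suggests), which is what makes their magnitudes comparable.

```python
# Coefficients copied from the table above.
coefficients = {
    "size_per_page_kb": 7.65,
    "page_count": -5.49,
    "file_size_kb": -5.44,
    "img_height": 1.38,
    "page_height_pt": 1.38,
    "avg_font_size": -1.10,
    "is_landscape": 0.98,
    "color_diversity": 0.95,
    "edge_density": 0.79,
    "text_block_count": 0.75,
}

# Sort by magnitude; Python's sort is stable, so ties keep their table order.
ranked = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)

for rank, (name, coef) in enumerate(ranked, start=1):
    direction = "poster" if coef > 0 else "non-poster"
    print(f"{rank:2d}. {name:<18} {coef:+.2f} -> pushes toward {direction}")
```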
## Training Data
The model was trained on **3,606 real documents**, with no synthetic data:
| Class | Count | Source |
|-------|-------|--------|
| **Poster** | 1,803 | Verified scientific posters from Zenodo & Figshare |
| **Non-poster** | 1,803 | Multi-page papers, proceedings, newsletters, abstract books |
These were sampled from the [posters.science](https://posters.science) corpus of **30,000+ classified PDFs** from Zenodo and Figshare (28,111 posters and 2,036 non-posters).
Training data: [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data)
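One way to draw a class-balanced sample from such an imbalanced corpus is plain undersampling, sketched below. `balanced_sample` and the generated file names are hypothetical, and capping the request at the size of the smaller class is an assumption of this sketch; the actual sampling logic lives in the training script and may differ.

```python
import random
from typing import List, Sequence, Tuple


def balanced_sample(
    posters: Sequence[str],
    non_posters: Sequence[str],
    n_per_class: int,
    seed: int = 0,
) -> List[Tuple[str, int]]:
    """Draw the same number of items from each class (label 1 = poster)."""
    # Assumption: never request more than the smaller class can supply.
    n = min(n_per_class, len(posters), len(non_posters))
    rng = random.Random(seed)
    sample = [(path, 1) for path in rng.sample(list(posters), n)]
    sample += [(path, 0) for path in rng.sample(list(non_posters), n)]
    rng.shuffle(sample)
    return sample


# Placeholder file names sized like the corpus counts quoted above.
posters = [f"poster_{i}.pdf" for i in range(28111)]
non_posters = [f"doc_{i}.pdf" for i in range(2036)]

dataset = balanced_sample(posters, non_posters, n_per_class=1803)
print(len(dataset))  # 3606
```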
## Usage
### Python API
```python
from poster_sentry import PosterSentry
sentry = PosterSentry()
sentry.initialize()
# Classify a PDF (uses text + visual + structural features)
result = sentry.classify("document.pdf")
print(f"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}")
# Example result dict: {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}
# Batch classification
results = sentry.classify_batch(["poster1.pdf", "paper.pdf", "newsletter.pdf"])
```
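Downstream code will often want to act on the confidence as well as the label. The sketch below splits results into accepted posters, rejected non-posters, and low-confidence cases for manual review. It assumes that `classify_batch` returns a list of dicts shaped like the single-file result above, and that `confidence` refers to the predicted label; the `triage` helper and the 0.8 threshold are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Result = Dict[str, Any]


def triage(
    results: List[Result], min_confidence: float = 0.8
) -> Tuple[List[str], List[str], List[str]]:
    """Split classifier results into (posters, non_posters, needs_review)."""
    posters, non_posters, needs_review = [], [], []
    for r in results:
        if r["confidence"] < min_confidence:
            needs_review.append(r["path"])  # too uncertain to decide automatically
        elif r["is_poster"]:
            posters.append(r["path"])
        else:
            non_posters.append(r["path"])
    return posters, non_posters, needs_review


# Hand-written results in the assumed shape; in practice these would come
# from sentry.classify_batch([...]).
results = [
    {"is_poster": True, "confidence": 0.97, "path": "poster1.pdf"},
    {"is_poster": False, "confidence": 0.91, "path": "paper.pdf"},
    {"is_poster": True, "confidence": 0.55, "path": "newsletter.pdf"},
]
posters, non_posters, needs_review = triage(results)
print(posters, non_posters, needs_review)
```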
### Installation
```bash
pip install git+https://github.com/fairdataihub/poster-repo-qc.git
# Or install from source
git clone https://github.com/fairdataihub/poster-repo-qc.git
cd poster-repo-qc
pip install -e ".[train]"
```
### Training
```bash
python scripts/train_poster_sentry.py --n-per-class 2000
```
Training completes in ~40 minutes on CPU (PDF rendering is the bottleneck, not the classifier).
## Model Specifications
| Attribute | Value |
|-----------|-------|
| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
| Embedding dimension | 512 |
| Visual features | 15 (color, edge, FFT, whitespace) |
| Structural features | 15 (page geometry, fonts, text blocks) |
| Total input dimension | 542 |
| Classifier | LogisticRegression (sklearn) + StandardScaler |
| Head file size | 10 KB (.npz) |
| Numeric precision | float32 |
| GPU required | No (CPU-only) |
| License | MIT |
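The small head size is plausible from the dimensions alone. The sketch below writes a 542-dimensional float32 head to an in-memory `.npz` archive and reads it back. The array names (`coef`, `intercept`, `mean`, `scale`) and the choice to store scaler statistics next to the coefficients are assumptions for illustration, not the documented file layout.

```python
import io

import numpy as np

DIM = 542  # 512 text + 15 visual + 15 structural
rng = np.random.default_rng(0)

# Write a hypothetical head to an in-memory .npz archive.
buffer = io.BytesIO()
np.savez(
    buffer,
    coef=rng.normal(size=DIM).astype(np.float32),
    intercept=np.float32(0.1),
    mean=np.zeros(DIM, dtype=np.float32),
    scale=np.ones(DIM, dtype=np.float32),
)
size_bytes = buffer.getbuffer().nbytes
print(f"head size: {size_bytes / 1024:.1f} KB")

# Read it back the way an inference path might.
buffer.seek(0)
head = np.load(buffer)
print(sorted(head.files), head["coef"].dtype, head["coef"].shape)
```

Three float32 vectors of length 542 take about 6.5 KB before archive overhead, which is the same order of magnitude as the 10 KB listed above.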
## System Requirements
- **CPU**: Any modern CPU (no GPU needed)
- **RAM**: ≥ 4 GB
- **Python**: ≥3.10
- **Dependencies**: numpy, model2vec, scikit-learn, PyMuPDF, Pillow
## Citation
```bibtex
@software{poster_sentry_2026,
title = {PosterSentry: Multimodal Scientific Poster Classifier},
author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
year = {2026},
url = {https://huggingface.co/fairdataihub/poster-sentry},
note = {Part of the posters.science initiative}
}
```
## License
This model is released under the [MIT License](https://opensource.org/licenses/MIT).
## Acknowledgments
- [FAIR Data Innovations Hub](https://fairdataihub.org/) at California Medical Innovations Institute (CalMI²)
- [posters.science](https://posters.science) platform
- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
- [Hugging Face](https://huggingface.co) for model hosting infrastructure
- Funded by The Navigation Fund ([10.71707/rk36-9x79](https://doi.org/10.71707/rk36-9x79)) — "Poster Sharing and Discovery Made Easy"