---
license: mit
language:
  - en
tags:
  - document-classification
  - scientific-posters
  - multimodal
  - model2vec
  - poster-detection
  - machine-actionable
  - FAIR-data
  - posters-science
  - quality-control
library_name: model2vec
pipeline_tag: text-classification
thumbnail: PosterSentry.png
---

<div align="center">
  <img src="PosterSentry.png" alt="PosterSentry Logo" width="400"/>
</div>

# PosterSentry — Multimodal Scientific Poster Classifier

## Model Description

PosterSentry is a lightweight, CPU-optimized multimodal classifier that determines whether a PDF is a **scientific poster** or a **non-poster** (paper, proceedings, newsletter, abstract book, etc.).

It is part of the quality control pipeline for [**posters.science**](https://posters.science), a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).

Developed by the [**FAIR Data Innovations Hub**](https://fairdataihub.org/) at the California Medical Innovations Institute (CalMI²).

## Related Models & Tools

| Resource | Description | Link |
|----------|-------------|------|
| **PosterSentry** | Multimodal poster classifier (this model) | [fairdataihub/poster-sentry](https://huggingface.co/fairdataihub/poster-sentry) |
| **Llama-3.1-8B-Poster-Extraction** | Poster → structured JSON extraction | [fairdataihub/Llama-3.1-8B-Poster-Extraction](https://huggingface.co/fairdataihub/Llama-3.1-8B-Poster-Extraction) |
| **poster2json** | Python library for poster extraction | [PyPI](https://pypi.org/project/poster2json/) · [Docs](https://fairdataihub.github.io/poster2json/) · [GitHub](https://github.com/fairdataihub/poster2json) |
| **poster-json-schema** | DataCite-based poster metadata schema | [GitHub](https://github.com/fairdataihub/poster-json-schema) |
| **Platform** | posters.science | [posters.science](https://posters.science) |

### Pipeline Position

PosterSentry sits at the front of the posters.science pipeline — it screens incoming PDFs before the expensive Llama-based extraction:

```
PDF Input
   │
   ▼
┌──────────────┐     ┌───────────────────────────────────┐     ┌──────────────┐
│ PosterSentry │ ──► │ Llama-3.1-8B-Poster-Extraction    │ ──► │ poster2json  │
│ (classify)   │     │ (extract structured metadata)      │     │ (validate)   │
└──────────────┘     └───────────────────────────────────┘     └──────────────┘
   poster? ✓              raw text → JSON schema                  FAIR output
```
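The gating step in the diagram can be sketched as follows. This is a minimal illustration, not the posters.science implementation: `screen_then_extract`, `fake_classify`, and `fake_extract` are hypothetical names, and the only thing borrowed from this card is the `is_poster` key of the classification result.

```python
from typing import Callable, Optional


def screen_then_extract(
    pdf_path: str,
    classify: Callable[[str], dict],
    extract: Callable[[str], dict],
) -> Optional[dict]:
    """Run the expensive extraction step only for PDFs classified as posters."""
    verdict = classify(pdf_path)
    if not verdict["is_poster"]:
        return None  # non-posters never reach the extraction model
    return extract(pdf_path)


# Hypothetical stand-ins so the sketch runs without any models installed.
def fake_classify(path: str) -> dict:
    return {"is_poster": path.endswith("poster.pdf"), "confidence": 0.9, "path": path}


def fake_extract(path: str) -> dict:
    return {"source": path, "title": "placeholder"}


kept = screen_then_extract("my_poster.pdf", fake_classify, fake_extract)
skipped = screen_then_extract("paper.pdf", fake_classify, fake_extract)
```

In a real deployment the two callables would presumably wrap `PosterSentry.classify` and the poster2json extraction call.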

## Architecture

Three feature channels are concatenated into a **542-dimensional** vector (512 + 15 + 15) and fed to a single LogisticRegression:

| Channel | Features | Dimension | Signal |
|---------|----------|-----------|--------|
| **Text** | model2vec (potion-base-32M) embedding | 512 | Semantic content |
| **Visual** | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
| **Structural** | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |

The classifier head is a single linear layer stored as a numpy `.npz` file (10 KB). Inference is pure numpy, so torch is not required at prediction time.
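To make the data flow concrete, here is a rough numpy-only sketch of what such a head computes: concatenate the three channels, standardize, apply one linear layer, and squash with a sigmoid. The function name, the parameter names (`mean`, `scale`, `coef`, `intercept`), and the random placeholder values are assumptions for illustration; the actual array names inside the `.npz` file are not documented here.

```python
import numpy as np

TEXT_DIM, VISUAL_DIM, STRUCT_DIM = 512, 15, 15


def predict_poster_proba(text_vec, visual_vec, struct_vec, mean, scale, coef, intercept):
    """Concatenate the three channels, standardize, and apply a logistic head."""
    x = np.concatenate([text_vec, visual_vec, struct_vec]).astype(np.float32)
    z = (x - mean) / scale                # StandardScaler-style normalization
    logit = float(z @ coef + intercept)   # single linear layer
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid -> P(poster)


rng = np.random.default_rng(0)
dim = TEXT_DIM + VISUAL_DIM + STRUCT_DIM  # 542

# Random placeholder parameters (assumption); real values come from training.
mean = np.zeros(dim, dtype=np.float32)
scale = np.ones(dim, dtype=np.float32)
coef = rng.normal(size=dim).astype(np.float32)
intercept = 0.0

proba = predict_poster_proba(
    rng.normal(size=TEXT_DIM),
    rng.normal(size=VISUAL_DIM),
    rng.normal(size=STRUCT_DIM),
    mean, scale, coef, intercept,
)
```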

## Performance

Validated on 3,606 real scientific documents:

| Metric | Value |
|--------|-------|
| **Accuracy** | **87.3%** |
| F1 (poster) | 87.1% |
| F1 (non-poster) | 87.4% |
| Precision (poster) | 88.2% |
| Recall (poster) | 85.9% |
| Inference speed | ~300 docs/sec (CPU) |
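As a quick consistency check, the poster F1 can be recomputed from the reported precision and recall using the standard harmonic-mean formula. With the rounded table values it comes out at roughly 0.870, which agrees with the reported 87.1% up to rounding of the inputs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall for the poster class, taken from the table above.
poster_f1 = f1_score(0.882, 0.859)
print(f"F1 (poster) = {poster_f1:.3f}")
```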

### Top Features by Importance

| Rank | Feature | Coefficient | Signal |
|------|---------|------------|--------|
| 1 | `size_per_page_kb` | +7.65 | Posters are dense, high-res single pages |
| 2 | `page_count` | -5.49 | More pages = not a poster |
| 3 | `file_size_kb` | -5.44 | Multi-page docs are bigger overall |
| 4 | `img_height` | +1.38 | Posters are large-format |
| 5 | `page_height_pt` | +1.38 | Large physical dimensions |
| 6 | `avg_font_size` | -1.10 | Papers use smaller fonts |
| 7 | `is_landscape` | +0.98 | Some posters are landscape |
| 8 | `color_diversity` | +0.95 | Posters are visually rich |
| 9 | `edge_density` | +0.79 | More visual edges in posters |
| 10 | `text_block_count` | +0.75 | Multi-column poster layouts |
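For a logistic regression on standardized inputs, the sign of a coefficient indicates which class the feature pushes toward and the magnitude indicates how strongly. The table appears to be ordered by absolute coefficient; the sketch below reproduces that ordering from the listed values and splits the features by direction. It only re-reads the numbers above and does not load the model.

```python
# Coefficients copied from the table above.
coefficients = {
    "size_per_page_kb": 7.65,
    "page_count": -5.49,
    "file_size_kb": -5.44,
    "img_height": 1.38,
    "page_height_pt": 1.38,
    "avg_font_size": -1.10,
    "is_landscape": 0.98,
    "color_diversity": 0.95,
    "edge_density": 0.79,
    "text_block_count": 0.75,
}

# Rank by magnitude; sorted() is stable, so ties keep their table order.
ranked = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)

toward_poster = [name for name, c in ranked if c > 0]
toward_non_poster = [name for name, c in ranked if c < 0]
```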

## Training Data

Trained on **3,606 real documents** — zero synthetic data:

| Class | Count | Source |
|-------|-------|--------|
| **Poster** | 1,803 | Verified scientific posters from Zenodo & Figshare |
| **Non-poster** | 1,803 | Multi-page papers, proceedings, newsletters, abstract books |

Sampled from the [posters.science](https://posters.science) corpus of **30,000+ classified PDFs** (28,111 posters, 2,036 non-posters from Zenodo and Figshare).

Training data: [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data)

## Usage

### Python API

```python
from poster_sentry import PosterSentry

sentry = PosterSentry()
sentry.initialize()

# Classify a PDF (uses text + visual + structural features)
result = sentry.classify("document.pdf")
print(f"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}")
# result is a dict, e.g. {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}

# Batch classification
results = sentry.classify_batch(["poster1.pdf", "paper.pdf", "newsletter.pdf"])
```
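One possible way to post-process batch results is to keep only confident poster predictions. The sketch below assumes that `classify_batch` returns a list of dicts shaped like the single-file result shown above; that return shape, the helper name `confident_posters`, and the 0.9 threshold are all illustrative assumptions. The sample data is hand-written so the snippet runs without the model.

```python
def confident_posters(results, min_confidence=0.9):
    """Return paths predicted as posters with at least `min_confidence`."""
    return [
        r["path"]
        for r in results
        if r["is_poster"] and r["confidence"] >= min_confidence
    ]


# Hand-written stand-in for the output of sentry.classify_batch(...).
sample_results = [
    {"is_poster": True, "confidence": 0.97, "path": "poster1.pdf"},
    {"is_poster": False, "confidence": 0.97, "path": "paper.pdf"},
    {"is_poster": True, "confidence": 0.62, "path": "newsletter.pdf"},
]

accepted = confident_posters(sample_results)
```

Low-confidence items could be routed to manual review instead of being dropped; the right threshold depends on how costly a false positive is downstream.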

### Installation

```bash
pip install git+https://github.com/fairdataihub/poster-repo-qc.git

# Or install from source
git clone https://github.com/fairdataihub/poster-repo-qc.git
cd poster-repo-qc
pip install -e ".[train]"
```

### Training

```bash
python scripts/train_poster_sentry.py --n-per-class 2000
```

Training completes in ~40 minutes on CPU (PDF rendering is the bottleneck, not the classifier).

## Model Specifications

| Attribute | Value |
|-----------|-------|
| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
| Embedding dimension | 512 |
| Visual features | 15 (color, edge, FFT, whitespace) |
| Structural features | 15 (page geometry, fonts, text blocks) |
| Total input dimension | 542 |
| Classifier | LogisticRegression (sklearn) + StandardScaler |
| Head file size | 10 KB (.npz) |
| Numeric precision | float32 |
| GPU required | No (CPU-only) |
| License | MIT |

## System Requirements

- **CPU**: Any modern CPU (no GPU needed)
- **RAM**: ≥4GB
- **Python**: ≥3.10
- **Dependencies**: numpy, model2vec, scikit-learn, PyMuPDF, Pillow

## Citation

```bibtex
@software{poster_sentry_2026,
  title = {PosterSentry: Multimodal Scientific Poster Classifier},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  url = {https://huggingface.co/fairdataihub/poster-sentry},
  note = {Part of the posters.science initiative}
}
```

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgments

- [FAIR Data Innovations Hub](https://fairdataihub.org/) at California Medical Innovations Institute (CalMI²)
- [posters.science](https://posters.science) platform
- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
- HuggingFace for model hosting infrastructure
- Funded by The Navigation Fund ([10.71707/rk36-9x79](https://doi.org/10.71707/rk36-9x79)) — "Poster Sharing and Discovery Made Easy"