---
license: mit
language:
- en
tags:
- document-classification
- scientific-posters
- multimodal
- model2vec
- poster-detection
- machine-actionable
- FAIR-data
- posters-science
- quality-control
library_name: model2vec
pipeline_tag: text-classification
thumbnail: PosterSentry.png
---

<div align="center">
<img src="PosterSentry.png" alt="PosterSentry Logo" width="400"/>
</div>

# PosterSentry: Multimodal Scientific Poster Classifier

## Model Description

PosterSentry is a lightweight, CPU-optimized multimodal classifier that determines whether a PDF is a **scientific poster** or a **non-poster** (paper, proceedings, newsletter, abstract book, etc.).

It is part of the quality control pipeline for [**posters.science**](https://posters.science), a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).

It was developed by the [**FAIR Data Innovations Hub**](https://fairdataihub.org/) at the California Medical Innovations Institute (CalMI²).

## Related Models & Tools

| Resource | Description | Link |
|----------|-------------|------|
| **PosterSentry** | Multimodal poster classifier (this model) | [fairdataihub/poster-sentry](https://huggingface.co/fairdataihub/poster-sentry) |
| **Llama-3.1-8B-Poster-Extraction** | Poster → structured JSON extraction | [fairdataihub/Llama-3.1-8B-Poster-Extraction](https://huggingface.co/fairdataihub/Llama-3.1-8B-Poster-Extraction) |
| **poster2json** | Python library for poster extraction | [PyPI](https://pypi.org/project/poster2json/) · [Docs](https://fairdataihub.github.io/poster2json/) · [GitHub](https://github.com/fairdataihub/poster2json) |
| **poster-json-schema** | DataCite-based poster metadata schema | [GitHub](https://github.com/fairdataihub/poster-json-schema) |
| **Platform** | posters.science | [posters.science](https://posters.science) |

### Pipeline Position

PosterSentry sits at the front of the posters.science pipeline: it screens incoming PDFs before the expensive Llama-based extraction.

```
   PDF Input
       │
       ▼
┌──────────────┐     ┌───────────────────────────────────┐     ┌──────────────┐
│ PosterSentry │ ──► │  Llama-3.1-8B-Poster-Extraction   │ ──► │ poster2json  │
│  (classify)  │     │   (extract structured metadata)   │     │  (validate)  │
└──────────────┘     └───────────────────────────────────┘     └──────────────┘
   poster? ✓                raw text → JSON schema                FAIR output
```

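The screening step can be sketched as a simple gate in front of the extractor. This is a minimal illustration, not the platform's actual code: `classify` stands for any callable returning a dict with `is_poster` and `confidence` keys (the shape shown in the Usage section), while `extract`, `min_confidence`, and the stand-in functions are hypothetical names introduced here.

```python
from typing import Callable, Optional


def screen_then_extract(
    pdf_path: str,
    classify: Callable[[str], dict],
    extract: Callable[[str], dict],
    min_confidence: float = 0.5,
) -> Optional[dict]:
    """Run the expensive extraction only for PDFs classified as posters.

    `extract` and `min_confidence` are hypothetical; they are not part of
    the PosterSentry API.
    """
    verdict = classify(pdf_path)
    if not verdict["is_poster"] or verdict["confidence"] < min_confidence:
        return None  # rejected: skip the costly extraction step
    return extract(pdf_path)


# Stand-in callables for demonstration (swap in the real classifier/extractor).
def fake_classify(path: str) -> dict:
    return {"is_poster": path.endswith("poster.pdf"), "confidence": 0.97, "path": path}


def fake_extract(path: str) -> dict:
    return {"source": path, "title": "placeholder"}


accepted = screen_then_extract("my_poster.pdf", fake_classify, fake_extract)
rejected = screen_then_extract("paper.pdf", fake_classify, fake_extract)
```

Passing the classifier and extractor in as callables keeps the gate independent of either component, so the threshold logic can be tested without loading any model.
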
## Architecture

Three feature channels are concatenated into a **542-dimensional** vector and fed to a single LogisticRegression:

| Channel | Features | Dimension | Signal |
|---------|----------|-----------|--------|
| **Text** | model2vec (potion-base-32M) embedding | 512 | Semantic content |
| **Visual** | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
| **Structural** | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |

The classifier head is a single linear layer stored as a numpy `.npz` file (10 KB). Inference is pure numpy; no torch is required at prediction time.

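The arithmetic behind such a head is small enough to sketch directly. The example below is an illustration under stated assumptions, not the packaged implementation: the function name, argument layout, and toy parameters are hypothetical, and it uses plain Python lists so it runs without dependencies (the same steps map directly onto numpy arrays).

```python
import math

TEXT_DIM, VISUAL_DIM, STRUCTURAL_DIM = 512, 15, 15
TOTAL_DIM = TEXT_DIM + VISUAL_DIM + STRUCTURAL_DIM  # 542


def predict_poster_probability(text, visual, structural, mean, scale, weights, bias):
    """Concatenate the three channels, standardize, and apply a logistic head."""
    features = list(text) + list(visual) + list(structural)
    if len(features) != TOTAL_DIM:
        raise ValueError(f"expected {TOTAL_DIM} features, got {len(features)}")
    # StandardScaler step: (x - mean) / scale, element-wise.
    standardized = [(x - m) / s for x, m, s in zip(features, mean, scale)]
    # Single linear layer followed by a sigmoid.
    logit = sum(w * x for w, x in zip(weights, standardized)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy parameters (all hypothetical): zero weights give a probability of 0.5.
zeros = [0.0] * TOTAL_DIM
ones = [1.0] * TOTAL_DIM
probability = predict_poster_probability(
    [0.1] * TEXT_DIM, [0.2] * VISUAL_DIM, [0.3] * STRUCTURAL_DIM,
    mean=zeros, scale=ones, weights=zeros, bias=0.0,
)
print(probability)  # 0.5
```
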
## Performance

Validated on 3,606 real scientific documents:

| Metric | Value |
|--------|-------|
| **Accuracy** | **87.3%** |
| F1 (poster) | 87.1% |
| F1 (non-poster) | 87.4% |
| Precision (poster) | 88.2% |
| Recall (poster) | 85.9% |
| Inference speed | ~300 docs/sec (CPU) |

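As a consistency check, the poster F1 can be recomputed from the reported precision and recall with the standard formula F1 = 2PR / (P + R). The helper below is only an illustration. Because the table values are rounded to one decimal place, the result matches the reported 87.1% only to within rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


poster_f1 = f1_score(precision=0.882, recall=0.859)
print(f"{poster_f1:.3f}")  # 0.870
```
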
### Top Features by Importance

| Rank | Feature | Coefficient | Signal |
|------|---------|-------------|--------|
| 1 | `size_per_page_kb` | +7.65 | Posters are dense, high-res single pages |
| 2 | `page_count` | -5.49 | More pages = not a poster |
| 3 | `file_size_kb` | -5.44 | Multi-page docs are bigger overall |
| 4 | `img_height` | +1.38 | Posters are large-format |
| 5 | `page_height_pt` | +1.38 | Large physical dimensions |
| 6 | `avg_font_size` | -1.10 | Papers use smaller fonts |
| 7 | `is_landscape` | +0.98 | Some posters are landscape |
| 8 | `color_diversity` | +0.95 | Posters are visually rich |
| 9 | `edge_density` | +0.79 | More visual edges in posters |
| 10 | `text_block_count` | +0.75 | Multi-column poster layouts |

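One way to read these coefficients: in a logistic regression, a coefficient is the change in log-odds per unit of its feature, so `exp(coefficient)` is the multiplicative change in the odds of "poster". The sketch below assumes the coefficients apply to standardized features (the model pairs LogisticRegression with a StandardScaler), so one unit would mean one standard deviation. That interpretation is an assumption, not something the table states.

```python
import math

# Coefficients copied from the table above (top three features).
coefficients = {
    "size_per_page_kb": 7.65,
    "page_count": -5.49,
    "file_size_kb": -5.44,
}

# exp(coef) = factor by which the poster odds change per +1 unit of the feature.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds x {ratio:.4g}")
```

A ratio above 1 pushes the prediction toward "poster" and a ratio below 1 pushes it away, which matches the signs in the table.
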
## Training Data

Trained on **3,606 real documents**, with no synthetic data:

| Class | Count | Source |
|-------|-------|--------|
| **Poster** | 1,803 | Verified scientific posters from Zenodo & Figshare |
| **Non-poster** | 1,803 | Multi-page papers, proceedings, newsletters, abstract books |

The documents were sampled from the [posters.science](https://posters.science) corpus of **30,000+ classified PDFs** (28,111 posters and 2,036 non-posters from Zenodo and Figshare).

Training data: [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data)

## Usage

### Python API

```python
from poster_sentry import PosterSentry

sentry = PosterSentry()
sentry.initialize()

# Classify a PDF (uses text + visual + structural features)
result = sentry.classify("document.pdf")
# Example result: {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}
print(f"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}")

# Batch classification
results = sentry.classify_batch(["poster1.pdf", "paper.pdf", "newsletter.pdf"])
```

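A common follow-up is to keep only the confident poster predictions from a batch. The helper below is a sketch, not part of the package: `select_posters` and `min_confidence` are hypothetical names, and it assumes `classify_batch` returns a list of dicts with the same keys as the single-file result (`is_poster`, `confidence`, `path`).

```python
def select_posters(results, min_confidence=0.9):
    """Return paths of results flagged as posters with enough confidence."""
    return [
        r["path"]
        for r in results
        if r["is_poster"] and r["confidence"] >= min_confidence
    ]


# Stand-in results using the assumed dict shape.
sample_results = [
    {"is_poster": True, "confidence": 0.97, "path": "poster1.pdf"},
    {"is_poster": False, "confidence": 0.99, "path": "paper.pdf"},
    {"is_poster": True, "confidence": 0.62, "path": "newsletter.pdf"},
]
print(select_posters(sample_results))  # ['poster1.pdf']
```
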
### Installation

```bash
pip install git+https://github.com/fairdataihub/poster-repo-qc.git

# Or install from source
git clone https://github.com/fairdataihub/poster-repo-qc.git
cd poster-repo-qc
pip install -e ".[train]"
```

### Training

```bash
python scripts/train_poster_sentry.py --n-per-class 2000
```

Training completes in ~40 minutes on CPU (PDF rendering is the bottleneck, not the classifier).

## Model Specifications

| Attribute | Value |
|-----------|-------|
| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
| Embedding dimension | 512 |
| Visual features | 15 (color, edge, FFT, whitespace) |
| Structural features | 15 (page geometry, fonts, text blocks) |
| Total input dimension | 542 |
| Classifier | LogisticRegression (sklearn) + StandardScaler |
| Head file size | 10 KB (.npz) |
| Numeric precision | float32 |
| GPU required | No (CPU-only) |
| License | MIT |

## System Requirements

- **CPU**: Any modern CPU (no GPU needed)
- **RAM**: ≥4 GB
- **Python**: ≥3.10
- **Dependencies**: numpy, model2vec, scikit-learn, PyMuPDF, Pillow

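A quick way to confirm that an environment meets these requirements is sketched below. The helper name is hypothetical. The import names are the usual ones for each package (`sklearn` for scikit-learn, `fitz` for PyMuPDF, `PIL` for Pillow).

```python
import importlib.util
import sys

# Import names for the listed dependencies.
REQUIRED_MODULES = ["numpy", "model2vec", "sklearn", "fitz", "PIL"]


def check_environment(min_python=(3, 10)):
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")
    for module in REQUIRED_MODULES:
        if importlib.util.find_spec(module) is None:
            problems.append(f"missing module: {module}")
    return problems


for problem in check_environment():
    print(problem)
```
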
## Citation

```bibtex
@software{poster_sentry_2026,
  title = {PosterSentry: Multimodal Scientific Poster Classifier},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  url = {https://huggingface.co/fairdataihub/poster-sentry},
  note = {Part of the posters.science initiative}
}
```

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgments

- [FAIR Data Innovations Hub](https://fairdataihub.org/) at California Medical Innovations Institute (CalMI²)
- [posters.science](https://posters.science) platform
- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
- HuggingFace for model hosting infrastructure
- Funded by The Navigation Fund ([10.71707/rk36-9x79](https://doi.org/10.71707/rk36-9x79)): "Poster Sharing and Discovery Made Easy"