---
license: mit
language:
- en
tags:
- document-classification
- scientific-posters
- multimodal
- model2vec
- poster-detection
- machine-actionable
- FAIR-data
- posters-science
- quality-control
library_name: model2vec
pipeline_tag: text-classification
thumbnail: PosterSentry.png
---
PosterSentry Logo
# PosterSentry: Multimodal Scientific Poster Classifier

## Model Description

PosterSentry is a lightweight, CPU-optimized multimodal classifier that determines whether a PDF is a **scientific poster** or a **non-poster** (paper, proceedings, newsletter, abstract book, etc.).

It is part of the quality control pipeline for [**posters.science**](https://posters.science), a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).

Developed by the [**FAIR Data Innovations Hub**](https://fairdataihub.org/) at the California Medical Innovations Institute (CalMI²).

## Related Models & Tools

| Resource | Description | Link |
|----------|-------------|------|
| **PosterSentry** | Multimodal poster classifier (this model) | [fairdataihub/poster-sentry](https://huggingface.co/fairdataihub/poster-sentry) |
| **Llama-3.1-8B-Poster-Extraction** | Poster → structured JSON extraction | [fairdataihub/Llama-3.1-8B-Poster-Extraction](https://huggingface.co/fairdataihub/Llama-3.1-8B-Poster-Extraction) |
| **poster2json** | Python library for poster extraction | [PyPI](https://pypi.org/project/poster2json/) · [Docs](https://fairdataihub.github.io/poster2json/) · [GitHub](https://github.com/fairdataihub/poster2json) |
| **poster-json-schema** | DataCite-based poster metadata schema | [GitHub](https://github.com/fairdataihub/poster-json-schema) |
| **Platform** | posters.science | [posters.science](https://posters.science) |

### Pipeline Position

PosterSentry sits at the front of the posters.science pipeline, screening incoming PDFs before the expensive Llama-based extraction:

```
PDF Input
       │
       ▼
┌──────────────┐     ┌───────────────────────────────────┐     ┌──────────────┐
│ PosterSentry │ ──► │ Llama-3.1-8B-Poster-Extraction    │ ──► │ poster2json  │
│  (classify)  │     │ (extract structured metadata)     │     │  (validate)  │
└──────────────┘     └───────────────────────────────────┘     └──────────────┘
   poster? ✓                raw text → JSON schema               FAIR output
```

## Architecture

Three feature channels are concatenated into a **542-dimensional** vector and fed to a single LogisticRegression:

| Channel | Features | Dimension | Signal |
|---------|----------|-----------|--------|
| **Text** | model2vec (potion-base-32M) embedding | 512 | Semantic content |
| **Visual** | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
| **Structural** | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |

The classifier head is a single linear layer stored as a numpy `.npz` file (10 KB). Inference is pure numpy; no torch is required at prediction time.

## Performance

Validated on 3,606 real scientific documents:

| Metric | Value |
|--------|-------|
| **Accuracy** | **87.3%** |
| F1 (poster) | 87.1% |
| F1 (non-poster) | 87.4% |
| Precision (poster) | 88.2% |
| Recall (poster) | 85.9% |
| Inference speed | ~300 docs/sec (CPU) |

### Top Features by Importance

| Rank | Feature | Coefficient | Signal |
|------|---------|------------|--------|
| 1 | `size_per_page_kb` | +7.65 | Posters are dense, high-res single pages |
| 2 | `page_count` | -5.49 | More pages = not a poster |
| 3 | `file_size_kb` | -5.44 | Multi-page docs are bigger overall |
| 4 | `img_height` | +1.38 | Posters are large-format |
| 5 | `page_height_pt` | +1.38 | Large physical dimensions |
| 6 | `avg_font_size` | -1.10 | Papers use smaller fonts |
| 7 | `is_landscape` | +0.98 | Some posters are landscape |
| 8 | `color_diversity` | +0.95 | Posters are visually rich |
| 9 | `edge_density` | +0.79 | More visual edges in posters |
| 10 | `text_block_count` | +0.75 | Multi-column poster layouts |

## Training Data

Trained on **3,606 real documents**, with no synthetic data:

| Class | Count | Source |
|-------|-------|--------|
| **Poster** | 1,803 | Verified scientific posters from Zenodo & Figshare |
| **Non-poster** | 1,803 | Multi-page papers, proceedings, newsletters, abstract books |

Sampled from the [posters.science](https://posters.science) corpus of **30,000+ classified PDFs** (28,111 posters, 2,036 non-posters from Zenodo and Figshare).

Training data: [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data)

## Usage

### Python API

```python
from poster_sentry import PosterSentry

sentry = PosterSentry()
sentry.initialize()

# Classify a PDF (uses text + visual + structural features)
result = sentry.classify("document.pdf")
print(f"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}")
# {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}

# Batch classification
results = sentry.classify_batch(["poster1.pdf", "paper.pdf", "newsletter.pdf"])
```

### Installation

```bash
pip install git+https://github.com/fairdataihub/poster-repo-qc.git

# Or install from source
git clone https://github.com/fairdataihub/poster-repo-qc.git
cd poster-repo-qc
pip install -e ".[train]"
```

### Training

```bash
python scripts/train_poster_sentry.py --n-per-class 2000
```

Training completes in ~40 minutes on CPU (PDF rendering is the bottleneck, not the classifier).
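
To make the classifier stage concrete, here is a minimal sketch of how a 542-dimensional feature vector could be scored with a StandardScaler-style transform followed by a logistic-regression head, using numpy only. It is an illustration under stated assumptions, not the code in this repository: the array names (`mean`, `scale`, `coef`, `intercept`), the helper functions, and the 0.5 decision threshold are hypothetical, and the weights are random placeholders rather than the trained head.

```python
import io

import numpy as np

TEXT_DIM, VISUAL_DIM, STRUCT_DIM = 512, 15, 15  # channel sizes from the model card


def build_features(text_emb, visual, structural):
    """Concatenate the three channels into one 542-dimensional float32 vector."""
    vec = np.concatenate([text_emb, visual, structural]).astype(np.float32)
    assert vec.shape == (TEXT_DIM + VISUAL_DIM + STRUCT_DIM,)
    return vec


def predict_proba(x, head):
    """Standardize x, apply the linear head, and return P(poster) via a sigmoid."""
    z = (x - head["mean"]) / head["scale"]  # same arithmetic as StandardScaler.transform
    logit = float(z @ head["coef"] + head["intercept"])
    return 1.0 / (1.0 + np.exp(-logit))


# Placeholder head with random weights. The key names are hypothetical and do
# not describe the layout of the released .npz file.
rng = np.random.default_rng(0)
dim = TEXT_DIM + VISUAL_DIM + STRUCT_DIM
buf = io.BytesIO()
np.savez(
    buf,
    mean=np.zeros(dim, dtype=np.float32),
    scale=np.ones(dim, dtype=np.float32),
    coef=rng.normal(scale=0.05, size=dim).astype(np.float32),
    intercept=np.float32(0.0),
)
buf.seek(0)
head = dict(np.load(buf))  # round-trip through the .npz format

# Random stand-ins for the text embedding, visual stats, and structural stats.
x = build_features(
    rng.normal(size=TEXT_DIM),
    rng.random(VISUAL_DIM),
    rng.random(STRUCT_DIM),
)
p = predict_proba(x, head)
print(f"P(poster) = {p:.3f}, is_poster = {p >= 0.5}")
```

In a real setup the four arrays would presumably be exported from the fitted scikit-learn scaler and classifier after training, which is what would allow prediction to run without scikit-learn or torch.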

## Model Specifications

| Attribute | Value |
|-----------|-------|
| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
| Embedding dimension | 512 |
| Visual features | 15 (color, edge, FFT, whitespace) |
| Structural features | 15 (page geometry, fonts, text blocks) |
| Total input dimension | 542 |
| Classifier | LogisticRegression (sklearn) + StandardScaler |
| Head file size | 10 KB (.npz) |
| Precision | float32 |
| GPU required | No (CPU-only) |
| License | MIT |

## System Requirements

- **CPU**: Any modern CPU (no GPU needed)
- **RAM**: ≥ 4 GB
- **Python**: ≥ 3.10
- **Dependencies**: numpy, model2vec, scikit-learn, PyMuPDF, Pillow

## Citation

```bibtex
@software{poster_sentry_2026,
  title = {PosterSentry: Multimodal Scientific Poster Classifier},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  url = {https://huggingface.co/fairdataihub/poster-sentry},
  note = {Part of the posters.science initiative}
}
```

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgments

- [FAIR Data Innovations Hub](https://fairdataihub.org/) at California Medical Innovations Institute (CalMI²)
- [posters.science](https://posters.science) platform
- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
- HuggingFace for model hosting infrastructure
- Funded by The Navigation Fund ([10.71707/rk36-9x79](https://doi.org/10.71707/rk36-9x79)): "Poster Sharing and Discovery Made Easy"