@@ -8,22 +8,57 @@ tags:
   - multimodal
   - model2vec
   - poster-detection
 library_name: model2vec
 pipeline_tag: text-classification
 ---
 # PosterSentry — Multimodal Scientific Poster Classifier
-PosterSentry classifies PDFs as **scientific posters** vs **non-posters** (papers, proceedings,
-abstracts, newsletters) using a multimodal approach that combines text embeddings with visual
-and structural features from the PDF.
-Part of the [posters.science](https://posters.science) initiative at
-[FAIR Data Innovations Hub](https://fairdataihub.org).
 ## Architecture
-Three feature channels concatenated into a 542-dimensional vector:
 | Channel | Features | Dimension | Signal |
 |---------|----------|-----------|--------|
@@ -31,53 +66,129 @@ Three feature channels concatenated into a 542-dimensional vector:
 | **Visual** | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
 | **Structural** | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |
-Single LogisticRegression classifier with StandardScaler normalization.
 ## Performance
 | Metric | Value |
 |--------|-------|
-| Accuracy | **87.3%** |
 | F1 (poster) | 87.1% |
 | F1 (non-poster) | 87.4% |
-| Inference | ~300 docs/sec (CPU) |
 ### Top Features by Importance
-1. `size_per_page_kb` (+7.65) — Posters are dense, high-res single pages
-2. `page_count` (-5.49) — More pages = not a poster
-3. `file_size_kb` (-5.44) — Multi-page docs are bigger overall
-4. `img_height` (+1.38) — Posters are large-format
-5. `color_diversity` (+0.95) — Posters are visually rich
 ## Training Data
-Trained on **3,606 real documents** (zero synthetic data):
-- **1,803 verified scientific posters** from Zenodo & Figshare (sampled from 28K+ corpus)
-- **1,803 verified non-posters** — multi-page papers, proceedings, newsletters
-See [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data).
 ## Usage
 ```python
 from poster_sentry import PosterSentry
 sentry = PosterSentry()
 sentry.initialize()
 result = sentry.classify("document.pdf")
 # {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}
 ```
 ## Citation
 ```bibtex
 @software{poster_sentry_2026,
   title = {PosterSentry: Multimodal Scientific Poster Classifier},
-  author = {O'Neill, Jamey and FAIR Data Innovations Hub},
   year = {2026},
   url = {https://huggingface.co/fairdataihub/poster-sentry},
   note = {Part of the posters.science initiative}
 }
 ```

   - multimodal
   - model2vec
   - poster-detection
+  - machine-actionable
+  - FAIR-data
+  - posters-science
+  - quality-control
 library_name: model2vec
 pipeline_tag: text-classification
+thumbnail: PosterSentry.png
 ---
+<div align="center">
+  <img src="PosterSentry.png" alt="PosterSentry Logo" width="400"/>
+</div>
 # PosterSentry — Multimodal Scientific Poster Classifier
+## Model Description
+PosterSentry is a lightweight, CPU-optimized multimodal classifier that determines whether a PDF is a **scientific poster** or a **non-poster** (paper, proceedings, newsletter, abstract book, etc.).
+Part of the quality control pipeline for [**posters.science**](https://posters.science), a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).
+Developed by the [**FAIR Data Innovations Hub**](https://fairdataihub.org/) at the California Medical Innovations Institute (CalMI²).
+## Related Models & Tools
+| Resource | Description | Link |
+|----------|-------------|------|
+| **PosterSentry** | Multimodal poster classifier (this model) | [fairdataihub/poster-sentry](https://huggingface.co/fairdataihub/poster-sentry) |
+| **Llama-3.1-8B-Poster-Extraction** | Poster → structured JSON extraction | [fairdataihub/Llama-3.1-8B-Poster-Extraction](https://huggingface.co/fairdataihub/Llama-3.1-8B-Poster-Extraction) |
+| **poster2json** | Python library for poster extraction | [PyPI](https://pypi.org/project/poster2json/) · [Docs](https://fairdataihub.github.io/poster2json/) · [GitHub](https://github.com/fairdataihub/poster2json) |
+| **poster-json-schema** | DataCite-based poster metadata schema | [GitHub](https://github.com/fairdataihub/poster-json-schema) |
+| **Platform** | posters.science | [posters.science](https://posters.science) |
+### Pipeline Position
+PosterSentry sits at the front of the posters.science pipeline — it screens incoming PDFs before the expensive Llama-based extraction:
+```
+PDF Input
+   │
+   ▼
+┌──────────────┐     ┌───────────────────────────────────┐     ┌──────────────┐
+│ PosterSentry │ ──► │ Llama-3.1-8B-Poster-Extraction    │ ──► │ poster2json  │
+│ (classify)   │     │ (extract structured metadata)      │     │ (validate)   │
+└──────────────┘     └───────────────────────────────────┘     └──────────────┘
+   poster? ✓              raw text → JSON schema                  FAIR output
+```
 ## Architecture
+Three feature channels are concatenated into a **542-dimensional** vector and fed to a single LogisticRegression:
 | Channel | Features | Dimension | Signal |
 |---------|----------|-----------|--------|
 | **Visual** | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
 | **Structural** | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |
+The classifier head is a single linear layer stored as a NumPy `.npz` file (10 KB). Inference is pure NumPy; torch is not required at prediction time.
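The scaling and linear head described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual code: the function name `predict_proba`, the parameter names, and the random parameters are assumptions, and a real head would be loaded from the `.npz` file, whose array keys are not documented here.

```python
import numpy as np

def predict_proba(features, mean, scale, coef, intercept):
    """Return P(poster) for one or more 542-dimensional feature vectors."""
    x = np.atleast_2d(np.asarray(features, dtype=np.float32))
    z = (x - mean) / scale                 # StandardScaler-style transform
    logits = z @ coef + intercept          # single linear layer
    return 1.0 / (1.0 + np.exp(-logits))   # logistic sigmoid

# Hypothetical parameters for illustration only; real values would come
# from the stored head (for example via np.load on the .npz file).
rng = np.random.default_rng(0)
dim = 512 + 15 + 15  # text embedding + visual + structural features
mean = np.zeros(dim, dtype=np.float32)
scale = np.ones(dim, dtype=np.float32)
coef = rng.normal(scale=0.1, size=dim).astype(np.float32)
intercept = np.float32(0.0)

probs = predict_proba(rng.normal(size=(3, dim)), mean, scale, coef, intercept)
print(probs.shape)  # (3,)
```

Because the head reduces to one matrix-vector product per document, feature extraction rather than classification is likely to dominate inference time.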
 ## Performance
+Validated on 3,606 real scientific documents:
 | Metric | Value |
 |--------|-------|
+| **Accuracy** | **87.3%** |
 | F1 (poster) | 87.1% |
 | F1 (non-poster) | 87.4% |
+| Precision (poster) | 88.2% |
+| Recall (poster) | 85.9% |
+| Inference speed | ~300 docs/sec (CPU) |
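As a quick consistency check, F1 is the harmonic mean of precision and recall. The sketch below recomputes the poster F1 from the rounded precision and recall in the table; the helper name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values from the table above.
poster_f1 = f1_score(0.882, 0.859)
print(f"{poster_f1:.3f}")  # 0.870
```

The result is about 0.870, which agrees with the reported 87.1% to within the rounding of the table's precision and recall values.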
 ### Top Features by Importance
+| Rank | Feature | Coefficient | Signal |
+|------|---------|------------|--------|
+| 1 | `size_per_page_kb` | +7.65 | Posters are dense, high-res single pages |
+| 2 | `page_count` | -5.49 | More pages = not a poster |
+| 3 | `file_size_kb` | -5.44 | Multi-page docs are bigger overall |
+| 4 | `img_height` | +1.38 | Posters are large-format |
+| 5 | `page_height_pt` | +1.38 | Large physical dimensions |
+| 6 | `avg_font_size` | -1.10 | Papers use smaller fonts |
+| 7 | `is_landscape` | +0.98 | Some posters are landscape |
+| 8 | `color_diversity` | +0.95 | Posters are visually rich |
+| 9 | `edge_density` | +0.79 | More visual edges in posters |
+| 10 | `text_block_count` | +0.75 | Multi-column poster layouts |
 ## Training Data
+Trained on **3,606 real documents** — zero synthetic data:
+| Class | Count | Source |
+|-------|-------|--------|
+| **Poster** | 1,803 | Verified scientific posters from Zenodo & Figshare |
+| **Non-poster** | 1,803 | Multi-page papers, proceedings, newsletters, abstract books |
+Sampled from the [posters.science](https://posters.science) corpus of **30,000+ classified PDFs** (28,111 posters, 2,036 non-posters from Zenodo and Figshare).
+Training data: [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data)
 ## Usage
+### Python API
 ```python
 from poster_sentry import PosterSentry
 sentry = PosterSentry()
 sentry.initialize()
+# Classify a PDF (uses text + visual + structural features)
 result = sentry.classify("document.pdf")
+print(f"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}")
 # {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}
+# Batch classification
+results = sentry.classify_batch(["poster1.pdf", "paper.pdf", "newsletter.pdf"])
 ```
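When PosterSentry is used as a screening step, the results can be filtered before anything is passed to a more expensive stage. The sketch below assumes that `classify_batch` returns a list of dictionaries shaped like the `classify()` output shown above, which is not confirmed here. The helper `select_posters` and the 0.9 threshold are hypothetical choices for illustration.

```python
def select_posters(results, min_confidence=0.9):
    """Keep paths classified as posters with at least min_confidence."""
    return [
        r["path"]
        for r in results
        if r["is_poster"] and r["confidence"] >= min_confidence
    ]

# Hypothetical results in the shape of the classify() output shown above.
results = [
    {"is_poster": True, "confidence": 0.97, "path": "poster1.pdf"},
    {"is_poster": False, "confidence": 0.97, "path": "paper.pdf"},
    {"is_poster": True, "confidence": 0.62, "path": "newsletter.pdf"},
]
print(select_posters(results))  # ['poster1.pdf']
```

Low-confidence posters are dropped here; depending on the workload, routing them to manual review may be a better choice than discarding them.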
+### Installation
+```bash
+pip install git+https://github.com/fairdataihub/poster-repo-qc.git
+# Or install from source
+git clone https://github.com/fairdataihub/poster-repo-qc.git
+cd poster-repo-qc
+pip install -e ".[train]"
+```
+### Training
+```bash
+python scripts/train_poster_sentry.py --n-per-class 2000
+```
+Training completes in ~40 minutes on CPU (PDF rendering is the bottleneck, not the classifier).
+## Model Specifications
+| Attribute | Value |
+|-----------|-------|
+| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
+| Embedding dimension | 512 |
+| Visual features | 15 (color, edge, FFT, whitespace) |
+| Structural features | 15 (page geometry, fonts, text blocks) |
+| Total input dimension | 542 |
+| Classifier | LogisticRegression (sklearn) + StandardScaler |
+| Head file size | 10 KB (.npz) |
+| Numeric precision | float32 |
+| GPU required | No (CPU-only) |
+| License | MIT |
+## System Requirements
+- **CPU**: Any modern CPU (no GPU needed)
+- **RAM**: ≥4 GB
+- **Python**: ≥3.10
+- **Dependencies**: numpy, model2vec, scikit-learn, PyMuPDF, Pillow
 ## Citation
 ```bibtex
 @software{poster_sentry_2026,
   title = {PosterSentry: Multimodal Scientific Poster Classifier},
+  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
   year = {2026},
   url = {https://huggingface.co/fairdataihub/poster-sentry},
   note = {Part of the posters.science initiative}
 }
 ```
+## License
+This model is released under the [MIT License](https://opensource.org/licenses/MIT).
+## Acknowledgments
+- [FAIR Data Innovations Hub](https://fairdataihub.org/) at California Medical Innovations Institute (CalMI²)
+- [posters.science](https://posters.science) platform
+- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
+- HuggingFace for model hosting infrastructure
+- Funded by The Navigation Fund ([10.71707/rk36-9x79](https://doi.org/10.71707/rk36-9x79)) — "Poster Sharing and Discovery Made Easy"