Updated README with logo, pipeline diagram, related models, grant acknowledgment
README.md
CHANGED
@@ -8,22 +8,57 @@ tags:
- multimodal
- model2vec
- poster-detection
- machine-actionable
- FAIR-data
- posters-science
- quality-control
library_name: model2vec
pipeline_tag: text-classification
thumbnail: PosterSentry.png
---

<div align="center">
<img src="PosterSentry.png" alt="PosterSentry Logo" width="400"/>
</div>

# PosterSentry: Multimodal Scientific Poster Classifier

## Model Description

PosterSentry is a lightweight, CPU-optimized multimodal classifier that determines whether a PDF is a **scientific poster** or a **non-poster** (paper, proceedings, newsletter, abstract book, etc.).

Part of the quality control pipeline for [**posters.science**](https://posters.science), a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).

Developed by the [**FAIR Data Innovations Hub**](https://fairdataihub.org/) at the California Medical Innovations Institute (CalMI²).

## Related Models & Tools

| Resource | Description | Link |
|----------|-------------|------|
| **PosterSentry** | Multimodal poster classifier (this model) | [fairdataihub/poster-sentry](https://huggingface.co/fairdataihub/poster-sentry) |
| **Llama-3.1-8B-Poster-Extraction** | Poster → structured JSON extraction | [fairdataihub/Llama-3.1-8B-Poster-Extraction](https://huggingface.co/fairdataihub/Llama-3.1-8B-Poster-Extraction) |
| **poster2json** | Python library for poster extraction | [PyPI](https://pypi.org/project/poster2json/) · [Docs](https://fairdataihub.github.io/poster2json/) · [GitHub](https://github.com/fairdataihub/poster2json) |
| **poster-json-schema** | DataCite-based poster metadata schema | [GitHub](https://github.com/fairdataihub/poster-json-schema) |
| **Platform** | posters.science | [posters.science](https://posters.science) |

### Pipeline Position

PosterSentry sits at the front of the posters.science pipeline, screening incoming PDFs before the expensive Llama-based extraction:

```
PDF Input
    │
    ▼
┌──────────────┐     ┌───────────────────────────────────┐     ┌──────────────┐
│ PosterSentry │ ──► │ Llama-3.1-8B-Poster-Extraction    │ ──► │ poster2json  │
│  (classify)  │     │  (extract structured metadata)    │     │  (validate)  │
└──────────────┘     └───────────────────────────────────┘     └──────────────┘
   poster? ✓            raw text → JSON schema                  FAIR output
```

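The screening step in the diagram can be sketched as a simple gate. This is a minimal illustration, not the posters.science implementation: `run_pipeline`, `extract`, and `validate` are hypothetical placeholders for the downstream stages, and the `is_poster` key follows the result dictionary shown under Usage.

```python
from typing import Callable, Optional


def run_pipeline(
    pdf_path: str,
    classify: Callable[[str], dict],
    extract: Callable[[str], dict],
    validate: Callable[[dict], dict],
) -> Optional[dict]:
    """Run the expensive extraction only for PDFs classified as posters.

    All three callables are hypothetical stand-ins for the pipeline stages.
    """
    verdict = classify(pdf_path)
    if not verdict["is_poster"]:
        return None  # rejected before the costly extraction stage
    return validate(extract(pdf_path))
```

Passing the stages in as callables keeps the gate independent of how each stage is implemented, so a cheap classifier can be swapped in or out without touching the extraction code.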
## Architecture

Three feature channels are concatenated into a **542-dimensional** vector and fed to a single LogisticRegression:

| Channel | Features | Dimension | Signal |
|---------|----------|-----------|--------|
| **Visual** | Color stats, edge density, FFT spatial complexity, whitespace | 15 | Visual layout |
| **Structural** | Page count, area, font diversity, text blocks, density | 15 | PDF geometry |

The classifier head is a single linear layer stored as a numpy `.npz` file (10 KB). Inference is pure numpy; no torch is required at prediction time.

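To illustrate what pure-numpy inference over the concatenated vector could look like, here is a minimal sketch. It assumes the three channels are a 512-dimensional text embedding plus 15 visual and 15 structural features (the dimensions listed under Model Specifications), and that the head stores StandardScaler statistics alongside logistic-regression weights. The parameter names `mean`, `scale`, `coef`, and `intercept` are hypothetical and are not taken from the actual `.npz` file.

```python
import numpy as np


def predict_poster_proba(text_emb, visual, structural, mean, scale, coef, intercept):
    """Return P(poster) for one document using a scaler plus a linear head.

    Parameter names are illustrative; the real head file may differ.
    """
    # Concatenate the three channels: 512 + 15 + 15 = 542 features.
    x = np.concatenate([text_emb, visual, structural]).astype(np.float32)
    # Standardize the way sklearn's StandardScaler does: (x - mean) / scale.
    z = (x - mean) / scale
    # Logistic regression: sigmoid of the linear score.
    logit = float(z @ coef + intercept)
    return 1.0 / (1.0 + np.exp(-logit))


# Random stand-in inputs, only to show the expected shapes.
rng = np.random.default_rng(0)
p = predict_poster_proba(
    text_emb=rng.normal(size=512),
    visual=rng.normal(size=15),
    structural=rng.normal(size=15),
    mean=np.zeros(542),
    scale=np.ones(542),
    coef=rng.normal(size=542) * 0.01,
    intercept=0.0,
)
print(f"P(poster) = {p:.3f}")
```

Because the head is one dot product and a sigmoid, the cost per document at prediction time is dominated by feature extraction rather than by the classifier itself.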
## Performance

Validated on 3,606 real scientific documents:

| Metric | Value |
|--------|-------|
| **Accuracy** | **87.3%** |
| F1 (poster) | 87.1% |
| F1 (non-poster) | 87.4% |
| Precision (poster) | 88.2% |
| Recall (poster) | 85.9% |
| Inference speed | ~300 docs/sec (CPU) |

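As a quick consistency check, F1 is the harmonic mean of precision and recall. The sketch below recomputes the poster-class F1 from the rounded precision and recall in the table; the result is about 0.870, which agrees with the reported 87.1% to within the rounding of the inputs. The helper name `f1_score` is local to this example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported poster-class precision and recall from the table above.
f1_poster = f1_score(0.882, 0.859)
print(f"F1 (poster) = {f1_poster:.3f}")
```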
### Top Features by Importance

| Rank | Feature | Coefficient | Signal |
|------|---------|------------|--------|
| 1 | `size_per_page_kb` | +7.65 | Posters are dense, high-res single pages |
| 2 | `page_count` | -5.49 | More pages = not a poster |
| 3 | `file_size_kb` | -5.44 | Multi-page docs are bigger overall |
| 4 | `img_height` | +1.38 | Posters are large-format |
| 5 | `page_height_pt` | +1.38 | Large physical dimensions |
| 6 | `avg_font_size` | -1.10 | Papers use smaller fonts |
| 7 | `is_landscape` | +0.98 | Some posters are landscape |
| 8 | `color_diversity` | +0.95 | Posters are visually rich |
| 9 | `edge_density` | +0.79 | More visual edges in posters |
| 10 | `text_block_count` | +0.75 | Multi-column poster layouts |

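The ranking above is consistent with ordering features by the absolute value of their coefficient, a common way to read a logistic regression fitted on standardized inputs. The sketch below reproduces that ordering from the values in the table; whether the ranking was actually derived this way is an assumption.

```python
# Coefficients copied from the table above (feature name -> coefficient).
coefficients = {
    "size_per_page_kb": 7.65,
    "page_count": -5.49,
    "file_size_kb": -5.44,
    "img_height": 1.38,
    "page_height_pt": 1.38,
    "avg_font_size": -1.10,
    "is_landscape": 0.98,
    "color_diversity": 0.95,
    "edge_density": 0.79,
    "text_block_count": 0.75,
}

# Sort by magnitude; the sign tells which class the feature pushes toward.
ranked = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)

for rank, (name, coef) in enumerate(ranked, start=1):
    direction = "poster" if coef > 0 else "non-poster"
    print(f"{rank:2d}. {name:<18} {coef:+.2f} -> {direction}")
```

Positive coefficients raise the poster score and negative ones lower it, so magnitude gives the strength of a feature and sign gives its direction.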
## Training Data

Trained on **3,606 real documents**, with zero synthetic data:

| Class | Count | Source |
|-------|-------|--------|
| **Poster** | 1,803 | Verified scientific posters from Zenodo & Figshare |
| **Non-poster** | 1,803 | Multi-page papers, proceedings, newsletters, abstract books |

Sampled from the [posters.science](https://posters.science) corpus of **30,000+ classified PDFs** (28,111 posters, 2,036 non-posters from Zenodo and Figshare).

Training data: [fairdataihub/poster-sentry-training-data](https://huggingface.co/datasets/fairdataihub/poster-sentry-training-data)

## Usage

### Python API

```python
from poster_sentry import PosterSentry

sentry = PosterSentry()
sentry.initialize()

# Classify a PDF (uses text + visual + structural features)
result = sentry.classify("document.pdf")
# result: {'is_poster': True, 'confidence': 0.97, 'path': 'document.pdf'}
print(f"Is poster: {result['is_poster']}, Confidence: {result['confidence']:.2f}")

# Batch classification
results = sentry.classify_batch(["poster1.pdf", "paper.pdf", "newsletter.pdf"])
```

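A common follow-up is to route batch results by confidence. The sketch below assumes `classify_batch` returns a list of dictionaries shaped like the single-document result, and that `confidence` refers to the predicted label; neither is documented here, so treat both as assumptions. The helper `split_by_confidence` and the 0.8 threshold are hypothetical.

```python
def split_by_confidence(results, threshold=0.8):
    """Split results into accepted posters, rejected files, and uncertain cases."""
    posters, non_posters, review = [], [], []
    for r in results:
        if r["confidence"] < threshold:
            review.append(r["path"])  # low confidence: flag for manual review
        elif r["is_poster"]:
            posters.append(r["path"])
        else:
            non_posters.append(r["path"])
    return posters, non_posters, review


# Hand-written example results (not real model output).
example = [
    {"is_poster": True, "confidence": 0.97, "path": "poster1.pdf"},
    {"is_poster": False, "confidence": 0.91, "path": "paper.pdf"},
    {"is_poster": True, "confidence": 0.55, "path": "newsletter.pdf"},
]
posters, non_posters, review = split_by_confidence(example)
```

With the reported accuracy of 87.3%, a review bucket for low-confidence files is one way to keep borderline documents out of the automated path.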
### Installation

```bash
pip install git+https://github.com/fairdataihub/poster-repo-qc.git

# Or install from source
git clone https://github.com/fairdataihub/poster-repo-qc.git
cd poster-repo-qc
pip install -e ".[train]"
```

### Training

```bash
python scripts/train_poster_sentry.py --n-per-class 2000
```

Training completes in ~40 minutes on CPU (PDF rendering is the bottleneck, not the classifier).

## Model Specifications

| Attribute | Value |
|-----------|-------|
| Embedding backbone | minishlab/potion-base-32M (model2vec StaticModel) |
| Embedding dimension | 512 |
| Visual features | 15 (color, edge, FFT, whitespace) |
| Structural features | 15 (page geometry, fonts, text blocks) |
| Total input dimension | 542 |
| Classifier | LogisticRegression (sklearn) + StandardScaler |
| Head file size | 10 KB (.npz) |
| Precision | float32 |
| GPU required | No (CPU-only) |
| License | MIT |

## System Requirements

- **CPU**: Any modern CPU (no GPU needed)
- **RAM**: ≥4 GB
- **Python**: ≥3.10
- **Dependencies**: numpy, model2vec, scikit-learn, PyMuPDF, Pillow

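The requirements above can be checked before installation with a short script. This is a sketch using only the standard library; the function name `check_environment` is hypothetical, and the import names reflect how these packages are normally imported (PyMuPDF as `fitz`, scikit-learn as `sklearn`, Pillow as `PIL`).

```python
import importlib.util
import sys

# Import names for the listed dependencies.
REQUIRED_MODULES = ["numpy", "model2vec", "sklearn", "fitz", "PIL"]


def check_environment():
    """Return (python_ok, missing_modules) for the requirements listed above."""
    python_ok = sys.version_info >= (3, 10)
    missing = [m for m in REQUIRED_MODULES if importlib.util.find_spec(m) is None]
    return python_ok, missing


python_ok, missing = check_environment()
print(f"Python >= 3.10: {python_ok}; missing modules: {missing or 'none'}")
```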
## Citation

```bibtex
@software{poster_sentry_2026,
  title = {PosterSentry: Multimodal Scientific Poster Classifier},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  url = {https://huggingface.co/fairdataihub/poster-sentry},
  note = {Part of the posters.science initiative}
}
```

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgments

- [FAIR Data Innovations Hub](https://fairdataihub.org/) at California Medical Innovations Institute (CalMI²)
- [posters.science](https://posters.science) platform
- [MinishLab](https://github.com/MinishLab) for the model2vec embedding backbone
- Hugging Face for model hosting infrastructure
- Funded by The Navigation Fund ([10.71707/rk36-9x79](https://doi.org/10.71707/rk36-9x79)), "Poster Sharing and Discovery Made Easy"