Llama-3.1-8B-Poster-Extraction
Model Description
This model powers the extraction pipeline for posters.science, a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).
The model converts raw poster text into structured JSON metadata conforming to the poster-json-schema, a DataCite-based schema extended with poster-specific metadata, including conference information, content sections, and figure/table captions.
Developed by the FAIR Data Innovations Hub at the California Medical Innovations Institute (CalMI²).
poster2json Library
This model is the core of the poster2json Python library:
| Resource | Link |
|---|---|
| PyPI | poster2json |
| Documentation | fairdataihub.github.io/poster2json |
| GitHub | fairdataihub/poster2json |
| API Repository | fairdataihub/posters-science-extraction-api |
| Platform | posters.science |
Quick Install
pip install poster2json
Python Usage
from poster2json import extract_poster
result = extract_poster("path/to/poster.pdf")
print(result["titles"][0]["title"])
print(result["creators"])
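To keep the extracted metadata, one option is to write it to disk as JSON. The sketch below assumes `result` is a plain, JSON-serializable `dict` (the usage above indexes it like one); `save_record` is a hypothetical helper, not part of poster2json.

```python
import json
from pathlib import Path


def save_record(record: dict, out_path: str) -> Path:
    """Write an extracted record to disk as UTF-8 JSON (hypothetical helper)."""
    path = Path(out_path)
    # Create the output folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps accented author names readable in the file.
    path.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return path
```

With the library installed, this could be called as `save_record(extract_poster("path/to/poster.pdf"), "out/poster.json")`.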
Output Schema
Output conforms to the poster-json-schema, which is based on the DataCite Metadata Schema with poster-specific extensions:
{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [
    {
      "name": "Garcia, Sofia",
      "givenName": "Sofia",
      "familyName": "Garcia",
      "nameType": "Personal",
      "affiliation": ["University of California, San Diego"]
    }
  ],
  "titles": [
    { "title": "Machine Learning Approaches to Diabetic Retinopathy Detection" }
  ],
  "publicationYear": 2025,
  "subjects": [
    { "subject": "Machine Learning" },
    { "subject": "Diabetic Retinopathy" }
  ],
  "descriptions": [
    {
      "description": "This poster presents machine learning methods for automated diabetic retinopathy screening...",
      "descriptionType": "Abstract"
    }
  ],
  "conference": {
    "conferenceName": "AMIA 2025 Annual Symposium",
    "conferenceLocation": "San Francisco, CA"
  },
  "content": {
    "sections": [
      { "sectionTitle": "Introduction", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." },
      { "sectionTitle": "Results", "sectionContent": "..." },
      { "sectionTitle": "Conclusions", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [
    { "caption": "Figure 1. ROC curves showing model performance across datasets" }
  ],
  "tableCaptions": [
    { "caption": "Table 1. Summary of demographic characteristics" }
  ],
  "rightsList": [
    { "rights": "Creative Commons Attribution 4.0 International" }
  ],
  "formats": ["PDF"]
}
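As an illustration of how a record in this shape might be consumed, the sketch below pulls out the main title, the author names, and the section titles. `summarize_record` is a hypothetical helper; it assumes only the field names shown in the example above and falls back to empty values when a field is absent.

```python
def summarize_record(record: dict) -> dict:
    """Collect the title, author names, and section titles from a record."""
    titles = record.get("titles") or [{}]
    sections = (record.get("content") or {}).get("sections") or []
    return {
        "title": titles[0].get("title", ""),
        "authors": [c.get("name", "") for c in record.get("creators") or []],
        "sections": [s.get("sectionTitle", "") for s in sections],
    }


# A trimmed copy of the example record above.
example = {
    "creators": [{"name": "Garcia, Sofia", "nameType": "Personal"}],
    "titles": [
        {"title": "Machine Learning Approaches to Diabetic Retinopathy Detection"}
    ],
    "content": {
        "sections": [
            {"sectionTitle": "Introduction", "sectionContent": "..."},
            {"sectionTitle": "Methods", "sectionContent": "..."},
        ]
    },
}

print(summarize_record(example))
```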
Key Schema Fields (DataCite-based)
| Field | Description |
|---|---|
| creators | Authors with name, affiliation, ORCID identifiers |
| titles | Main title and alternative/translated titles |
| subjects | Keywords and classification codes (MeSH, LCSH) |
| descriptions | Abstract, methods, technical information |
| conference | Conference name, location, dates, URI |
| content.sections | Extracted poster sections with titles and content |
| imageCaptions | Figure captions extracted from the poster |
| tableCaptions | Table captions extracted from the poster |
| fundingReferences | Grant information (funder, award number) |
| rightsList | License information (CC-BY, etc.) |
| relatedIdentifiers | DOIs, URLs to related resources |
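A quick way to see which of these fields an extraction actually filled in is to walk the record and list the empty or absent ones. This is a minimal sketch: `missing_fields` is a hypothetical helper, and because the table does not say which fields the schema requires, it only reports absence rather than validating.

```python
KEY_FIELDS = [
    "creators", "titles", "subjects", "descriptions", "conference",
    "content.sections", "imageCaptions", "tableCaptions",
    "fundingReferences", "rightsList", "relatedIdentifiers",
]


def missing_fields(record: dict, fields=KEY_FIELDS) -> list:
    """Return the dotted field names that are absent or empty in the record."""
    missing = []
    for dotted in fields:
        node = record
        for key in dotted.split("."):
            # Treat None, empty strings, empty lists, and empty dicts as missing.
            if isinstance(node, dict) and node.get(key) not in (None, "", [], {}):
                node = node[key]
            else:
                missing.append(dotted)
                break
    return missing


record = {
    "creators": [{"name": "Garcia, Sofia"}],
    "titles": [{"title": "Example poster"}],
    "content": {"sections": []},
}
print(missing_fields(record))
```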
Model Specifications
| Attribute | Value |
|---|---|
| Base Model | meta-llama/Llama-3.1-8B-Instruct |
| Parameters | 8 Billion |
| Context Length | 128K tokens |
| Architecture | LLaMA 3.1 |
| Precision | bfloat16 |
| License | Llama 3.1 Community License |
Performance
Validated on 10 manually annotated scientific posters:
| Metric | Score | Threshold |
|---|---|---|
| Word Capture | 0.96 | ≥0.75 |
| ROUGE-L | 0.89 | ≥0.75 |
| Number Capture | 0.93 | ≥0.75 |
| Field Proportion | 0.99 | 0.50–2.00 |
Pass Rate: 10/10 (100%)
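The thresholds in the table can be applied as a simple check. The sketch below assumes that a poster passes only when all four metrics are within bounds, which the table suggests but does not state; the dictionary keys are hypothetical names, and the metric computations themselves are not reproduced here.

```python
# (lower bound, upper bound); None means no upper bound.
THRESHOLDS = {
    "word_capture": (0.75, None),
    "rouge_l": (0.75, None),
    "number_capture": (0.75, None),
    "field_proportion": (0.50, 2.00),
}


def failed_metrics(scores: dict) -> list:
    """Return the names of metrics that fall outside their threshold."""
    failed = []
    for name, (low, high) in THRESHOLDS.items():
        value = scores[name]
        if value < low or (high is not None and value > high):
            failed.append(name)
    return failed


# Scores reported in the table above.
reported = {
    "word_capture": 0.96,
    "rouge_l": 0.89,
    "number_capture": 0.93,
    "field_proportion": 0.99,
}
print(failed_metrics(reported))  # → []
```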
Direct Usage (Transformers)
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "jimnoneill/Llama-3.1-8B-Poster-Extraction"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
prompt = """Extract structured metadata from the following scientific poster.
Return valid JSON conforming to the poster-json-schema with fields:
creators, titles, publicationYear, subjects, descriptions, conference, content, imageCaptions, tableCaptions.
Poster Content:
[Your poster text here]
"""
messages = [{"role": "user", "content": prompt}]
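# add_generation_prompt=True appends the assistant header so the model
# answers rather than continuing the user turn.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
# temperature only takes effect when sampling is enabled.
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.1)
# Decode only the newly generated tokens, not the echoed prompt.
prompt_len = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)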
print(response)
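The decoded response is plain text, and the model may wrap the JSON in extra prose. A minimal sketch for recovering the first parseable JSON object from such a string is shown below; `first_json_object` is a hypothetical helper built on the standard library's `json.JSONDecoder.raw_decode`, not something the model or poster2json is known to provide.

```python
import json


def first_json_object(text: str) -> dict:
    """Return the first JSON object that parses, scanning from each '{'."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


sample = 'Here is the metadata:\n{"titles": [{"title": "Example"}]}\nDone.'
print(first_json_object(sample)["titles"][0]["title"])  # → Example
```

Scanning brace by brace is slower than a single parse but tolerates stray text before or after the object.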
System Requirements
- GPU: NVIDIA CUDA-capable, ≥16GB VRAM (RTX 4090 recommended)
- RAM: ≥32GB
- Supports 8-bit quantization for memory-constrained environments
- Compatible with vLLM and other inference optimization frameworks
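The VRAM figure follows roughly from the parameter count and precision. The sketch below estimates the memory needed for the weights alone, using the 8 billion parameters listed in the specifications; it leaves out the KV cache, activations, and framework overhead, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights only, in GiB."""
    return n_params * bytes_per_param / 2**30


N_PARAMS = 8e9  # "8 Billion" from the specifications table

print(round(weight_memory_gib(N_PARAMS, 2), 1))  # bfloat16, 2 bytes each → 14.9
print(round(weight_memory_gib(N_PARAMS, 1), 1))  # 8-bit, 1 byte each → 7.5
```

On this estimate, bfloat16 weights nearly fill a 16GB card, which is why 8-bit quantization helps in memory-constrained environments.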
Citation
@software{poster2json2026,
title = {poster2json: Scientific Poster to JSON Metadata Extraction},
author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
year = {2026},
url = {https://github.com/fairdataihub/poster2json},
doi = {10.5281/zenodo.18320010}
}
License
This model is released under the Llama 3.1 Community License.
Acknowledgments
- FAIR Data Innovations Hub at California Medical Innovations Institute (CalMI²)
- posters.science platform
- Meta AI for the Llama 3.1 base model
- HuggingFace for model hosting infrastructure
- Funded by The Navigation Fund (10.71707/rk36-9x79): "Poster Sharing and Discovery Made Easy"