Llama-3.1-8B-Poster-Extraction

Model Description

This model powers the extraction pipeline for posters.science, a platform for making scientific conference posters Findable, Accessible, Interoperable, and Reusable (FAIR).

The model converts raw poster text into structured JSON metadata conforming to the poster-json-schema—a DataCite-based schema extended for poster-specific metadata including conference information, content sections, and figure/table captions.

Developed by the FAIR Data Innovations Hub at the California Medical Innovations Institute (CalMI²).

poster2json Library

This model is the core of the poster2json Python library:

Quick Install

pip install poster2json

Python Usage

from poster2json import extract_poster

result = extract_poster("path/to/poster.pdf")
print(result["titles"][0]["title"])
print(result["creators"])
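Based on the usage above, `extract_poster` appears to return a plain Python dict matching the schema below (an assumption, not confirmed by the library docs), so persisting it needs only the standard `json` module. The field values here are hypothetical placeholders:

```python
import json

# A minimal dict of the shape extract_poster() returns
# (hypothetical values; a real call parses an actual PDF).
result = {
    "titles": [{"title": "Example Poster Title"}],
    "creators": [{"name": "Doe, Jane", "nameType": "Personal"}],
    "publicationYear": 2025,
}

# Persist the extracted metadata alongside the source PDF
with open("poster_metadata.json", "w", encoding="utf-8") as fh:
    json.dump(result, fh, indent=2, ensure_ascii=False)

# Round-trip check: the file parses back to the same dict
with open("poster_metadata.json", encoding="utf-8") as fh:
    assert json.load(fh) == result
```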

Output Schema

Output conforms to the poster-json-schema, based on the DataCite Metadata Schema with poster-specific extensions:

{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [
    {
      "name": "Garcia, Sofia",
      "givenName": "Sofia",
      "familyName": "Garcia",
      "nameType": "Personal",
      "affiliation": ["University of California, San Diego"]
    }
  ],
  "titles": [
    { "title": "Machine Learning Approaches to Diabetic Retinopathy Detection" }
  ],
  "publicationYear": 2025,
  "subjects": [
    { "subject": "Machine Learning" },
    { "subject": "Diabetic Retinopathy" }
  ],
  "descriptions": [
    {
      "description": "This poster presents machine learning methods for automated diabetic retinopathy screening...",
      "descriptionType": "Abstract"
    }
  ],
  "conference": {
    "conferenceName": "AMIA 2025 Annual Symposium",
    "conferenceLocation": "San Francisco, CA"
  },
  "content": {
    "sections": [
      { "sectionTitle": "Introduction", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." },
      { "sectionTitle": "Results", "sectionContent": "..." },
      { "sectionTitle": "Conclusions", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [
    { "caption": "Figure 1. ROC curves showing model performance across datasets" }
  ],
  "tableCaptions": [
    { "caption": "Table 1. Summary of demographic characteristics" }
  ],
  "rightsList": [
    { "rights": "Creative Commons Attribution 4.0 International" }
  ],
  "formats": ["PDF"]
}
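Full validation would run the published JSON Schema through a library such as `jsonschema`; for a quick stdlib-only sanity check of the core fields, a sketch like the following (hypothetical helper, not part of poster2json) can suffice:

```python
def check_core_fields(poster: dict) -> list[str]:
    """Return a list of missing or invalid core fields (empty list = OK)."""
    problems = []
    # List-valued fields that should be present and non-empty
    for field in ("creators", "titles"):
        if not poster.get(field):
            problems.append(f"missing or empty: {field}")
    # publicationYear should be a plausible integer year
    year = poster.get("publicationYear")
    if not isinstance(year, int) or not 1900 <= year <= 2100:
        problems.append("publicationYear is not a valid year")
    return problems

sample = {
    "creators": [{"name": "Garcia, Sofia"}],
    "titles": [{"title": "Machine Learning Approaches to Diabetic Retinopathy Detection"}],
    "publicationYear": 2025,
}
assert check_core_fields(sample) == []
```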

Key Schema Fields (DataCite-based)

| Field | Description |
|---|---|
| creators | Authors with name, affiliation, ORCID identifiers |
| titles | Main title and alternative/translated titles |
| subjects | Keywords and classification codes (MeSH, LCSH) |
| descriptions | Abstract, methods, technical information |
| conference | Conference name, location, dates, URI |
| content.sections | Extracted poster sections with titles and content |
| imageCaptions | Figure captions extracted from the poster |
| tableCaptions | Table captions extracted from the poster |
| fundingReferences | Grant information (funder, award number) |
| rightsList | License information (CC-BY, etc.) |
| relatedIdentifiers | DOIs, URLs to related resources |

Model Specifications

| Attribute | Value |
|---|---|
| Base Model | meta-llama/Llama-3.1-8B-Instruct |
| Parameters | 8 billion |
| Context Length | 128K tokens |
| Architecture | LLaMA 3.1 |
| Precision | bfloat16 |
| License | Llama 3.1 Community License |

Performance

Validated on 10 manually annotated scientific posters:

| Metric | Score | Threshold |
|---|---|---|
| Word Capture | 0.96 | ≥0.75 |
| ROUGE-L | 0.89 | ≥0.75 |
| Number Capture | 0.93 | ≥0.75 |
| Field Proportion | 0.99 | 0.50–2.00 |

Pass Rate: 10/10 (100%)
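The word-capture metric is not defined in detail here. One plausible reading (an assumption for illustration, not poster2json's actual implementation) is the fraction of unique words from the source poster text that survive into the serialized JSON output:

```python
import json
import re

def word_capture(source_text: str, extracted: dict) -> float:
    """Fraction of unique words from the poster text found in the JSON output.

    Assumed definition for illustration; not the project's exact metric.
    """
    def tokenize(s: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", s.lower()))

    source_words = tokenize(source_text)
    if not source_words:
        return 1.0
    json_words = tokenize(json.dumps(extracted))
    return len(source_words & json_words) / len(source_words)

score = word_capture(
    "Deep learning detects diabetic retinopathy",
    {"titles": [{"title": "Deep learning detects diabetic retinopathy"}]},
)
assert score == 1.0
```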

Direct Usage (Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jimnoneill/Llama-3.1-8B-Poster-Extraction"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

prompt = """Extract structured metadata from the following scientific poster.
Return valid JSON conforming to the poster-json-schema with fields:
creators, titles, publicationYear, subjects, descriptions, conference, content, imageCaptions, tableCaptions.

Poster Content:
[Your poster text here]
"""

messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=0.1)
# Decode only the newly generated tokens, skipping the echoed prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
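The model's reply may wrap the JSON in explanatory text or a code fence. A defensive way to recover the first JSON object from the decoded response (a generic stdlib sketch, not a poster2json API) is to locate the opening brace and use `json.JSONDecoder.raw_decode`, which tolerates trailing text:

```python
import json

def extract_json_object(text: str) -> dict:
    """Parse the first top-level JSON object found in model output."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object in response")
    decoder = json.JSONDecoder()
    # raw_decode returns (object, end_index) and ignores trailing text
    obj, _ = decoder.raw_decode(text[start:])
    return obj

reply = 'Here is the metadata:\n{"titles": [{"title": "Example"}]}\nDone.'
metadata = extract_json_object(reply)
assert metadata["titles"][0]["title"] == "Example"
```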

System Requirements

  • GPU: NVIDIA CUDA-capable, ≥16GB VRAM (RTX 4090 recommended)
  • RAM: ≥32GB
  • Supports 8-bit quantization for memory-constrained environments
  • Compatible with vLLM and other inference optimization frameworks
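For the memory-constrained case above, 8-bit loading can be enabled through `bitsandbytes` via the standard Transformers quantization config. This is an untested configuration sketch (requires the `bitsandbytes` package and a CUDA GPU):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the model with 8-bit weights to roughly halve VRAM usage
model = AutoModelForCausalLM.from_pretrained(
    "jimnoneill/Llama-3.1-8B-Poster-Extraction",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```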

Citation

@software{poster2json2026,
  title = {poster2json: Scientific Poster to JSON Metadata Extraction},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  url = {https://github.com/fairdataihub/poster2json},
  doi = {10.5281/zenodo.18320010}
}

License

This model is released under the Llama 3.1 Community License.

Acknowledgments

  • FAIR Data Innovations Hub at California Medical Innovations Institute (CalMI²)
  • posters.science platform
  • Meta AI for the Llama 3.1 base model
  • HuggingFace for model hosting infrastructure
  • Funded by The Navigation Fund (10.71707/rk36-9x79) — "Poster Sharing and Discovery Made Easy"