fine-tuned-model-led

This model is a fine-tuned version of allenai/led-base-16384 for automatic metadata extraction from Spanish-language academic documents (books, theses, journal articles, conference papers). It was developed as part of a thesis project at SEDICI (Servicio de Difusión de la Creación Intelectual, UNLP).

What it does

Given the plain text of an academic document (extracted from its PDF), the model extracts structured metadata fields such as title, authors, date, abstract, keywords, subject, and document type, and returns them as a JSON object.
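As a rough illustration of consuming that output, the sketch below parses a generated string into a Python dict. The helper name `parse_metadata`, the sample string, and its field names (`title`, `authors`, `date`) are assumptions for illustration only; the actual schema is defined by the project pipeline.

```python
import json
from typing import Any, Dict, Optional


def parse_metadata(generated: str) -> Optional[Dict[str, Any]]:
    """Parse generated text as a JSON object.

    Returns None when no valid JSON object can be recovered, which can
    happen with any generative model.
    """
    # Keep only the span between the first "{" and the last "}" in case
    # the generation carries stray text around the object.
    start, end = generated.find("{"), generated.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(generated[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Hypothetical output string; the field names are assumed, not the real schema.
sample = '{"title": "Example title", "authors": ["Surname, Name"], "date": "2020"}'
metadata = parse_metadata(sample)
```

Returning `None` instead of raising lets a caller decide whether to retry, skip the document, or flag it for manual review.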

Base model

This model is a fine-tune of allenai/led-base-16384 (Longformer Encoder-Decoder), which supports input sequences of up to 16,384 tokens, long enough to cover the full text of many academic documents.

Usage

This model is designed to run as part of the full extraction pipeline. See the project repository and documentation for setup instructions:

Code & full pipeline: https://github.com/nahuelPanigo/document_extraction_llm
Documentation: https://nahuelpanigo.github.io/document_extraction_llm/
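For quick experiments outside the pipeline, the checkpoint can probably be loaded with the standard Hugging Face `transformers` LED classes. The following is a minimal sketch under several assumptions: that the repository ID is `Nahpanigo99/fine-tuned-model-led`, that it ships tokenizer files, and that raw document text is an acceptable input. The pipeline may apply its own preprocessing and prompt format, so results from this sketch may differ from the documented workflow. The function names and `max_new_tokens` default are illustrative.

```python
from typing import List

MODEL_ID = "Nahpanigo99/fine-tuned-model-led"  # assumed repository ID on the hosting page
MAX_INPUT_TOKENS = 16384  # context length of the LED base model


def global_attention_positions(seq_len: int) -> List[int]:
    """Build a mask with global attention on the first token only.

    This is a common default for LED models, not necessarily what the
    project pipeline uses.
    """
    if seq_len <= 0:
        return []
    return [1] + [0] * (seq_len - 1)


def extract_metadata_text(document_text: str, max_new_tokens: int = 512) -> str:
    """Run generation on raw document text and return the decoded string."""
    # Imported lazily so the helper above works without these packages.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

    # Truncate inputs that exceed the model's context length.
    inputs = tokenizer(
        document_text,
        return_tensors="pt",
        truncation=True,
        max_length=MAX_INPUT_TOKENS,
    )
    seq_len = inputs["input_ids"].shape[1]
    global_attention_mask = torch.tensor([global_attention_positions(seq_len)])

    with torch.no_grad():
        output_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            global_attention_mask=global_attention_mask,
            max_new_tokens=max_new_tokens,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `extract_metadata_text(text)` downloads the checkpoint on first use and returns the raw generated string, which still needs to be parsed as JSON.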

Training data

Fine-tuned on a curated dataset of academic documents from the SEDICI repository, with manually validated metadata used as ground truth.

Model size: 0.2B params (Safetensors, tensor type F32)