Instructions for using savi8sant8s/ptbr-post-ocr-sc-llm with libraries and local apps.

Libraries

How to use savi8sant8s/ptbr-post-ocr-sc-llm with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="savi8sant8s/ptbr-post-ocr-sc-llm")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("savi8sant8s/ptbr-post-ocr-sc-llm", dtype="auto")

Local Apps

vLLM

How to use savi8sant8s/ptbr-post-ocr-sc-llm with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "savi8sant8s/ptbr-post-ocr-sc-llm"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "savi8sant8s/ptbr-post-ocr-sc-llm",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
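
The same request can also be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style completions payload. The helper names `build_completion_request` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "savi8sant8s/ptbr-post-ocr-sc-llm"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint.

    The default base_url is an assumption matching the `vllm serve` command above.
    """
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion.

    Requires a running server; raises urllib.error.URLError otherwise.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style completions responses carry the output in choices[i].text.
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` sends the same payload as the curl command. Passing `base_url="http://localhost:30000"` should target a server on another port, such as the SGLang setup described in the next section.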

Use Docker

docker model run hf.co/savi8sant8s/ptbr-post-ocr-sc-llm

SGLang

How to use savi8sant8s/ptbr-post-ocr-sc-llm with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "savi8sant8s/ptbr-post-ocr-sc-llm" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "savi8sant8s/ptbr-post-ocr-sc-llm",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "savi8sant8s/ptbr-post-ocr-sc-llm" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "savi8sant8s/ptbr-post-ocr-sc-llm",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use savi8sant8s/ptbr-post-ocr-sc-llm with Docker Model Runner:
```
docker model run hf.co/savi8sant8s/ptbr-post-ocr-sc-llm
```

A proposal for post-OCR spelling correction using Language Models

Paper: https://openreview.net/forum?id=p5P9R9AKr5
Repository: https://github.com/savi8sant8s/ptbr-post-ocr-sc-llm

Fine-tuned models:

BART Portuguese: https://huggingface.co/adalbertojunior/bart-base-portuguese;
ByT5 Portuguese: https://huggingface.co/pierreguillou/byt5-small-qa-squad-v1.1-portuguese;
Gervásio PTBR: https://huggingface.co/PORTULAN/gervasio-7b-portuguese-ptbr-decoder;
Sabiá: https://huggingface.co/maritaca-ai/sabia-7b.

Abstract:

This work explores the use of Language Models (LMs) to correct residual errors in texts extracted by OCR and HTR (Handwritten Text Recognition) systems. We propose a general approach and use images of Brazilian handwritten essays from the BRESSAY dataset as a use case. Two standard LMs (BART and ByT5) and two LLMs (LLaMA 1 and LLaMA 2) were evaluated in this context. The results indicate that the smaller LMs outperformed the LLMs in reducing error rates, measured as character error rate (CER) and word error rate (WER). Traditional correction methods, such as SymSpell and Norvig's spelling corrector, were effective in some cases but fell short of the results obtained by the LMs. ByT5, with its byte-level tokenization, improved both CER and WER, showing strong performance on highly noisy text. As a result, smaller LMs, after fine-tuning, are more efficient and cheaper for post-OCR correction. We identify and propose promising future studies involving correction at broader levels of context, such as paragraphs.
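
For readers unfamiliar with the two metrics named in the abstract, the sketch below shows how CER and WER are commonly computed: the Levenshtein edit distance between the reference and the corrected text, taken over characters (CER) or whitespace-separated words (WER), divided by the reference length. This is the usual textbook definition and an assumption here; the paper's exact evaluation setup (text normalization, tokenization) is not described in this card, and the example strings are made up for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example: the OCR output confuses the letter "o" with the digit "0".
reference = "o gato preto"
ocr_output = "o gat0 preto"
print(f"CER = {cer(reference, ocr_output):.3f}")
print(f"WER = {wer(reference, ocr_output):.3f}")
```

A single wrong character costs little in CER but invalidates a whole word in WER, which is why the two metrics are usually reported together.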

Methodology:

Results:

Citation:

@inproceedings{
  araujo2024a,
  title={A proposal for post-{OCR} spelling correction using Language Models},
  author={S{\'a}vio Santos de Ara{\'u}jo and Byron Leite Dantas Bezerra and Arthur Flor de Sousa Neto and Cleber Zanchettin},
  booktitle={Latinx in AI @ NeurIPS 2024},
  year={2024},
  url={https://openreview.net/forum?id=p5P9R9AKr5}
}

Model tree for savi8sant8s/ptbr-post-ocr-sc-llm

Base model: adalbertojunior/bart-base-portuguese (this model is a fine-tune of it)
