Bar-JEPA
Per-bar numerical value recovery from vertical bar chart images. A self-supervised I-JEPA encoder (ViT-H, finetuned on synthetic bar charts) produces feature maps consumed by a lightweight keypoint decoder. The decoder outputs heatmaps for bar corners, value-axis ticks and the coordinate origin, which are post-processed with NMS, OCR and RANSAC regression to recover numerical bar values.
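As a rough illustration of the last post-processing step, the sketch below fits a pixel-to-value mapping from OCR-read tick labels with a small RANSAC loop and applies it to a bar's top edge. It is not taken from the repository: the function names (`fit_line`, `ransac_axis`), the tolerance, the iteration count and the sample coordinates are all hypothetical, and a linear value axis is assumed.

```python
import random

def fit_line(p, q):
    """Exact line value = a * pixel_y + b through two (pixel_y, value) points."""
    (y0, v0), (y1, v1) = p, q
    a = (v1 - v0) / (y1 - y0)
    return a, v0 - a * y0

def ransac_axis(ticks, tol, iters=200, seed=0):
    """Robustly fit value = a * pixel_y + b to (pixel_y, value) tick pairs.

    Hypothetical helper: `tol` is the maximum value-space residual for a
    tick to count as an inlier.
    """
    rng = random.Random(seed)
    best = []
    for _ in range(iters):
        p, q = rng.sample(ticks, 2)
        if p[0] == q[0]:
            continue  # degenerate pair on the same pixel row
        a, b = fit_line(p, q)
        inliers = [(y, v) for y, v in ticks if abs(a * y + b - v) <= tol]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 2:
        raise ValueError("not enough consistent ticks to fit the axis")
    # Least-squares refit on the largest consensus set.
    n = len(best)
    mean_y = sum(y for y, _ in best) / n
    mean_v = sum(v for _, v in best) / n
    a = sum((y - mean_y) * (v - mean_v) for y, v in best)
    a /= sum((y - mean_y) ** 2 for y, _ in best)
    return a, mean_v - a * mean_y

# Made-up ticks: four consistent labels plus one OCR misread (15 read as 150).
ticks = [(400, 0.0), (300, 10.0), (200, 20.0), (100, 30.0), (250, 150.0)]
a, b = ransac_axis(ticks, tol=0.5)
bar_top_y = 150                 # hypothetical bar-corner keypoint row
bar_value = a * bar_top_y + b   # value at the bar's top edge
```

The misread tick is rejected because no line through it agrees with more than one other tick, so the refit uses only the four consistent labels.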
Paper: Bar-JEPA: Extracting Values from Bar Chart with Joint-Embedding Predictive Architecture — Poonam, Epple & Ropinski, Ulm University (ICDAR 2026).
Code: github.com/dralois/Bar-JEPA
Pipeline
Image → variable-resolution patch extraction (≤256 patches, 14 px)
→ frozen I-JEPA ViT-H encoder
→ classic decoder (2× deconv-BN-ReLU + 1×1 conv, 32 keypoint channels)
→ origin / classification / regression heatmap heads
→ NMS + PaddleOCR + RANSAC → bar values
The encoder is frozen during decoder training. Variable-resolution inputs follow the Pix2Struct aspect-ratio-preserving scaling strategy.
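The aspect-ratio-preserving scaling can be sketched as follows. This follows the Pix2Struct recipe (pick the largest patch grid that keeps the image's aspect ratio and stays within the patch budget); whether the repository computes the grid in exactly this way is an assumption, and `patch_grid` is a hypothetical name.

```python
import math

def patch_grid(height, width, patch=14, max_patches=256):
    """Return (rows, cols, resized_height, resized_width) for an image.

    The image is rescaled so that rows * cols <= max_patches while the
    grid keeps roughly the original aspect ratio.
    """
    scale = math.sqrt(max_patches * (patch / height) * (patch / width))
    rows = max(min(math.floor(scale * height / patch), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch), max_patches), 1)
    return rows, cols, rows * patch, cols * patch

# A 600x800 (height x width) chart maps to a 13x18 grid, i.e. 234 patches,
# and is resized to 182x252 pixels before patch extraction.
grid = patch_grid(600, 800)
```

A wide chart therefore gets more columns than rows instead of being squashed into a square input, which is the point of the variable-resolution encoder.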
Checkpoints
Download checkpoints and place them in ./output/. The ViT-H base checkpoint (IN1K-vit.h.14-300e.pth.tar) must also be present there before encoder finetuning.
| File | Description |
|---|---|
| kp-cl-arp-ctt-ft-latest.pth.tar | Classic decoder, ARP encoder + Chart-to-Text FT, UB PMC finetuned (best) |
| kp-cl-arp-ft-latest.pth.tar | Classic decoder, ARP encoder, UB PMC finetuned |
| kp-cl-noarp-ft-latest.pth.tar | Classic decoder, fixed-resolution encoder, UB PMC finetuned |
| kp-cl-vanilla-ft-latest.pth.tar | Classic decoder, vanilla ImageNet-only encoder, UB PMC finetuned |
| kp-spl-arp-ft-latest.pth.tar | Simple decoder, ARP encoder, UB PMC finetuned |
Usage
All tasks go through bar-jepa/main.py. Install the dependencies either with pixi or manually (PyTorch ≥ 2.3 plus PaddleOCR).
# Setup
pixi install
# or: pip install -e ".[torch]" && pip install paddlepaddle paddleocr
Encoder finetuning:
python bar-jepa/main.py \
--mode finetune \
--fname bar-jepa/configs/charts/vith14_arp.yaml \
--devices cuda:0
Decoder training (pretraining → UB PMC finetuning):
python bar-jepa/main.py --mode decoder \
--fname bar-jepa/configs/keypoint/classic_arp.yaml --devices cuda:0
python bar-jepa/main.py --mode decoder \
--fname bar-jepa/configs/keypoint/classic_arp.yaml --devices cuda:0 \
--override meta.do_finetune=true data.root_path=./UBPMC data.is_ubpmc=true
Evaluation:
python bar-jepa/main.py --mode eval \
--fname bar-jepa/configs/eval/classic_arp.yaml --devices cuda:0
# Run all five configurations at once:
python scripts/run_all_evals.py
Training
Encoder finetuning — self-supervised I-JEPA objective on 100k synthetic bar charts. 50 epochs, AdamW, cosine LR 5×10⁻⁵ → 1×10⁻⁶ (6-epoch warm-up), cosine weight decay 0.02–0.04, effective batch size 40, 2× NVIDIA RTX A6000. Optional additional 25 epochs on 15k Chart-to-Text real-world charts.
Decoder training — supervised keypoint regression on 17k synthetic charts. 50 epochs, AdamW, cosine LR 1×10⁻³ → 1×10⁻⁵ (3-epoch warm-up), cosine weight decay 0.04–0.1, batch size 1360, 1× NVIDIA RTX A6000. Finetuned for 30 more epochs on UB PMC / ICPR CHART-Infographics 2022 (1316 vertical bar charts).
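Both schedules above have the same shape: a warm-up followed by cosine decay. The sketch below reproduces that shape per epoch. The linear warm-up from `start=0.0` and the per-epoch (rather than per-iteration) stepping are assumptions, since the text only gives the peak and final values and the warm-up length; `cosine_lr` is a hypothetical helper, not a function from the repository.

```python
import math

def cosine_lr(epoch, total=50, warmup=6, peak=5e-5, final=1e-6, start=0.0):
    """Learning rate at a given epoch: linear warm-up, then cosine decay."""
    if epoch < warmup:
        return start + (peak - start) * epoch / warmup
    progress = (epoch - warmup) / max(1, total - warmup)
    return final + 0.5 * (peak - final) * (1 + math.cos(math.pi * progress))

# Encoder finetuning settings: 5e-5 -> 1e-6 over 50 epochs, 6 warm-up epochs.
encoder_lrs = [cosine_lr(e) for e in range(51)]

# Decoder training settings: 1e-3 -> 1e-5 over 50 epochs, 3 warm-up epochs.
decoder_lrs = [cosine_lr(e, warmup=3, peak=1e-3, final=1e-5) for e in range(51)]
```

The weight-decay ranges quoted above could be scheduled with the same cosine term, without the warm-up branch.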
Performance
Variable-resolution encoder (ARP), classic decoder, synthetic pretraining only:
| Dataset | Bar F1 | Tick F1 | Acc (ε=0.05) | Acc (ε=0.02) |
|---|---|---|---|---|
| UB PMC (real-world) | 0.740 | 0.827 | 0.450 | 0.341 |
| Synthetic (100 charts) | 0.956 | 0.940 | 0.778 | 0.629 |
With additional Chart-to-Text real-world finetuning:
| Dataset | Bar F1 | Tick F1 | Acc (ε=0.05) | Acc (ε=0.02) |
|---|---|---|---|---|
| UB PMC (real-world) | 0.785 | 0.842 | 0.499 | 0.365 |
| Synthetic (100 charts) | 0.961 | 0.951 | 0.792 | 0.657 |
Accuracy criterion: |h_gt − h_pred| / h_gt ≤ ε (same as Zhou et al. 2021).
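A minimal sketch of this criterion, assuming predicted and ground-truth bars have already been matched one-to-one (how unmatched bars are counted is not stated here). The function name `value_accuracy` and the sample numbers are made up for illustration. The inequality is written as |h_gt − h_pred| ≤ ε·|h_gt| to avoid dividing by zero; that rearrangement is a choice of this sketch.

```python
def value_accuracy(gt, pred, eps):
    """Fraction of bars whose relative error is within eps."""
    if len(gt) != len(pred) or not gt:
        raise ValueError("gt and pred must be non-empty and the same length")
    hits = sum(abs(g - p) <= eps * abs(g) for g, p in zip(gt, pred))
    return hits / len(gt)

# Relative errors of the four made-up bars: 3%, 2.5%, 7.5%, 1%.
gt = [10.0, 20.0, 40.0, 50.0]
pred = [10.3, 19.5, 43.0, 50.5]
acc_05 = value_accuracy(gt, pred, eps=0.05)  # 3 of 4 bars within 5%
acc_02 = value_accuracy(gt, pred, eps=0.02)  # 1 of 4 bars within 2%
```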
Dataset
Training data at dralois/Bar-JEPA. Download and place at ./data (encoder, 100k) and ./data_decoder (decoder, 17k), or generate from scratch:
python bar-gen/generator.py --output ./data --count 100000
| Config | Split | Samples | Local path |
|---|---|---|---|
| encoder_training | train / test | 100,000 / 100 | ./data |
| decoder_training | train / test | 17,000 / 3,400 | ./data_decoder |
| pipeline_testing | test | 100 | n/a |
Limitations
- Vertical bar charts only; no stacked bars, error bars or 3D effects.
- OCR (PaddleOCR latin_PP-OCRv5_mobile_rec) is an external dependency for tick label reading.
- Encoder operates in latent space only, making integration with multimodal language models non-trivial.
Citation
@inproceedings{poonam2026bar-jepa,
title = {Bar-JEPA: Extracting Values from Bar Chart with Joint-Embedding Predictive Architecture},
author = {Poonam, Poonam and Epple, Alexander and Ropinski, Timo},
booktitle = {ICDAR},
year = {2026}
}
License
Model code is derived from facebookresearch/ijepa and licensed under the same terms. See bar-jepa/LICENSE.