Bar-JEPA

Per-bar numerical value recovery from vertical bar chart images. A self-supervised I-JEPA encoder (ViT-H, finetuned on synthetic bar charts) produces feature maps consumed by a lightweight keypoint decoder. The decoder outputs heatmaps for bar corners, value-axis ticks and the coordinate origin, which are post-processed with NMS, OCR and RANSAC regression to recover numerical bar values.

Paper: Bar-JEPA: Extracting Values from Bar Chart with Joint-Embedding Predictive Architecture — Poonam, Epple & Ropinski, Ulm University (ICDAR 2026).
Code: github.com/dralois/Bar-JEPA

Pipeline

Image → variable-resolution patch extraction (≤256 patches, 14 px)
      → frozen I-JEPA ViT-H encoder
      → classic decoder (2× deconv-BN-ReLU + 1×1 conv, 32 keypoint channels)
      → origin / classification / regression heatmap heads
      → NMS + PaddleOCR + RANSAC → bar values
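
The final pipeline stage can be pictured with a small sketch. It assumes RANSAC is used to fit a line from tick pixel rows to their OCR-read values, so that a single misread tick label does not skew the axis scale. The source does not spell out this detail, and the function names, tolerance and sample data below are hypothetical, not taken from the repository.

```python
import random


def least_squares(points):
    """Ordinary least-squares line value = a * y + b through (y, value) points."""
    n = len(points)
    sy = sum(y for y, _ in points)
    sv = sum(v for _, v in points)
    syy = sum(y * y for y, _ in points)
    syv = sum(y * v for y, v in points)
    a = (n * syv - sy * sv) / (n * syy - sy * sy)
    return a, (sv - a * sy) / n


def ransac_axis_fit(ticks, tol=0.5, iters=200, seed=0):
    """Robustly fit value = a * y_pixel + b to (y_pixel, ocr_value) tick pairs."""
    rng = random.Random(seed)
    best = []
    for _ in range(iters):
        # Hypothesis: the exact line through two randomly sampled ticks.
        (y0, v0), (y1, v1) = rng.sample(ticks, 2)
        if y0 == y1:
            continue
        a = (v1 - v0) / (y1 - y0)
        b = v0 - a * y0
        # Consensus set: ticks whose OCR value agrees with the hypothesis.
        inliers = [(y, v) for y, v in ticks if abs(a * y + b - v) <= tol]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 2:
        raise ValueError("need at least two ticks with distinct pixel rows")
    # Refit on the largest consensus set only.
    return least_squares(best)


# Tick rows (pixels, y grows downward) with OCR values; "30" was misread as 80.
ticks = [(400, 0.0), (340, 10.0), (280, 20.0),
         (220, 80.0), (160, 40.0), (100, 50.0)]
a, b = ransac_axis_fit(ticks)
bar_top_y = 250                 # pixel row of a detected bar's top edge
bar_value = a * bar_top_y + b   # approximately 25.0
```

The misread tick is excluded from the consensus set, so the bar top at row 250 still maps to a value of about 25.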

The encoder is frozen during decoder training. Variable-resolution inputs follow the Pix2Struct aspect-ratio-preserving (ARP) scaling strategy.
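
As a rough illustration of aspect-ratio-preserving scaling, the sketch below computes the patch grid for an image under a budget of 256 patches of 14 px, using the scaling rule from Pix2Struct. Whether the repository applies exactly this formula is an assumption, and `arp_patch_grid` is a hypothetical helper name.

```python
import math


def arp_patch_grid(height, width, patch=14, max_patches=256):
    """Return (rows, cols, resized_hw) for an aspect-ratio-preserving resize.

    The image is scaled so that rows * cols <= max_patches while the
    height/width ratio is kept as close to the original as the grid allows.
    """
    scale = math.sqrt(max_patches * (patch / height) * (patch / width))
    rows = max(min(math.floor(scale * height / patch), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch), max_patches), 1)
    return rows, cols, (rows * patch, cols * patch)


# A 600 x 800 (height x width) chart becomes a 13 x 18 grid of 14 px patches.
rows, cols, resized = arp_patch_grid(600, 800)
```

A wide chart therefore receives more columns than rows instead of being squashed into a square input.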

Checkpoints

Download checkpoints and place them in ./output/. The ViT-H base checkpoint (IN1K-vit.h.14-300e.pth.tar) must also be present there before encoder finetuning.

File	Description
`kp-cl-arp-ctt-ft-latest.pth.tar`	Classic decoder, ARP encoder + Chart-to-Text FT, UB PMC finetuned (best)
`kp-cl-arp-ft-latest.pth.tar`	Classic decoder, ARP encoder, UB PMC finetuned
`kp-cl-noarp-ft-latest.pth.tar`	Classic decoder, fixed-resolution encoder, UB PMC finetuned
`kp-cl-vanilla-ft-latest.pth.tar`	Classic decoder, vanilla ImageNet-only encoder, UB PMC finetuned
`kp-spl-arp-ft-latest.pth.tar`	Simple decoder, ARP encoder, UB PMC finetuned

Usage

All tasks go through bar-jepa/main.py. Install the dependencies with pixi, or manually with PyTorch ≥ 2.3 and PaddleOCR.

# Setup
pixi install
# or: pip install -e ".[torch]" && pip install paddlepaddle paddleocr

Encoder finetuning:

python bar-jepa/main.py \
  --mode finetune \
  --fname bar-jepa/configs/charts/vith14_arp.yaml \
  --devices cuda:0

Decoder training (synthetic pretraining first, then UB PMC finetuning):

python bar-jepa/main.py --mode decoder \
  --fname bar-jepa/configs/keypoint/classic_arp.yaml --devices cuda:0

python bar-jepa/main.py --mode decoder \
  --fname bar-jepa/configs/keypoint/classic_arp.yaml --devices cuda:0 \
  --override meta.do_finetune=true data.root_path=./UBPMC data.is_ubpmc=true

Evaluation:

python bar-jepa/main.py --mode eval \
  --fname bar-jepa/configs/eval/classic_arp.yaml --devices cuda:0

# Run all five configurations at once:
python scripts/run_all_evals.py

Training

Encoder finetuning — self-supervised I-JEPA objective on 100k synthetic bar charts. 50 epochs, AdamW, cosine LR 5×10⁻⁵ → 1×10⁻⁶ (6-epoch warm-up), cosine weight decay 0.02–0.04, effective batch size 40, 2× NVIDIA RTX A6000. Optional additional 25 epochs on 15k Chart-to-Text real-world charts.

Decoder training — supervised keypoint regression on 17k synthetic charts. 50 epochs, AdamW, cosine LR 1×10⁻³ → 1×10⁻⁵ (3-epoch warm-up), cosine weight decay 0.04–0.1, batch size 1360, 1× NVIDIA RTX A6000. Finetuned for 30 more epochs on UB PMC / ICPR CHART-Infographics 2022 (1316 vertical bar charts).
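
The learning-rate schedules above can be sketched as a linear warm-up followed by cosine decay. This is a simplified per-epoch version; the warm-up start value `start_lr` is not stated in the source and is a placeholder assumption, and the actual training code may step the schedule per iteration instead.

```python
import math


def warmup_cosine_lr(epoch, total_epochs, warmup_epochs, ref_lr, final_lr,
                     start_lr=0.0):
    """Learning rate at a given epoch: linear warm-up, then cosine decay."""
    if epoch < warmup_epochs:
        # Linear ramp from start_lr (assumed) up to the peak rate ref_lr.
        return start_lr + (ref_lr - start_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return final_lr + (ref_lr - final_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))


# Encoder finetuning settings from the text: 50 epochs, 6 warm-up epochs.
encoder_lrs = [warmup_cosine_lr(e, 50, 6, 5e-5, 1e-6) for e in range(51)]
# Decoder training settings from the text: 50 epochs, 3 warm-up epochs.
decoder_lrs = [warmup_cosine_lr(e, 50, 3, 1e-3, 1e-5) for e in range(51)]
```

The peak rate is reached at the end of warm-up (epoch 6 for the encoder, epoch 3 for the decoder) and decays to the final rate by epoch 50.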

Performance

Variable-resolution encoder (ARP), classic decoder, encoder finetuned on synthetic charts only:

Dataset	Bar F1	Tick F1	Acc (ε=0.05)	Acc (ε=0.02)
UB PMC (real-world)	0.740	0.827	0.450	0.341
Synthetic (100 charts)	0.956	0.940	0.778	0.629

With additional Chart-to-Text real-world finetuning:

Dataset	Bar F1	Tick F1	Acc (ε=0.05)	Acc (ε=0.02)
UB PMC (real-world)	0.785	0.842	0.499	0.365
Synthetic (100 charts)	0.961	0.951	0.792	0.657

Accuracy criterion: |h_gt − h_pred| / h_gt ≤ ε (same as Zhou et al. 2021).
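
A minimal sketch of this criterion, assuming predicted bars have already been matched one-to-one with ground-truth bars and that all ground-truth heights are positive. The function name and sample values are illustrative only.

```python
def value_accuracy(gt_heights, pred_heights, eps):
    """Fraction of bars with relative error |h_gt - h_pred| / h_gt <= eps."""
    hits = sum(
        abs(h_gt - h_pred) / h_gt <= eps
        for h_gt, h_pred in zip(gt_heights, pred_heights)
    )
    return hits / len(gt_heights)


gt = [10.0, 20.0, 40.0, 50.0]
pred = [10.3, 19.2, 40.5, 60.0]   # relative errors: 3%, 4%, 1.25%, 20%

acc_05 = value_accuracy(gt, pred, eps=0.05)   # 0.75
acc_02 = value_accuracy(gt, pred, eps=0.02)   # 0.25
```

Tightening ε from 0.05 to 0.02 rejects the bars with 3% and 4% error, which is why the ε=0.02 column in the tables above is always lower.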

Dataset

Training data is available at dralois/Bar-JEPA. Download it and place it at ./data (encoder, 100k) and ./data_decoder (decoder, 17k), or generate it from scratch:

python bar-gen/generator.py --output ./data --count 100000

Config	Split	Samples	Local path
`encoder_training`	train / test	100,000 / 100	`./data`
`decoder_training`	train / test	17,000 / 3,400	`./data_decoder`
`pipeline_testing`	test	100	—

Limitations

Vertical bar charts only; no stacked bars, error bars or 3D effects.
OCR (PaddleOCR latin_PP-OCRv5_mobile_rec) is an external dependency for tick label reading.
Encoder operates in latent space only, making integration with multimodal language models non-trivial.

Citation

@inproceedings{poonam2026bar-jepa,
  title     = {Bar-JEPA: Extracting Values from Bar Chart with Joint-Embedding Predictive Architecture},
  author    = {Poonam, Poonam and Epple, Alexander and Ropinski, Timo},
  booktitle = {ICDAR},
  year      = {2026}
}

License

Model code is derived from facebookresearch/ijepa and licensed under the same terms. See bar-jepa/LICENSE.


