Bar-JEPA

Per-bar numerical value recovery from vertical bar chart images. A self-supervised I-JEPA encoder (ViT-H, finetuned on synthetic bar charts) produces feature maps consumed by a lightweight keypoint decoder. The decoder outputs heatmaps for bar corners, value-axis ticks and the coordinate origin, which are post-processed with NMS, OCR and RANSAC regression to recover numerical bar values.

Paper: Bar-JEPA: Extracting Values from Bar Chart with Joint-Embedding Predictive Architecture — Poonam, Epple & Ropinski, Ulm University (ICDAR 2026).
Code: github.com/dralois/Bar-JEPA
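
The NMS step mentioned in the overview can be illustrated with a minimal pure-Python sketch. It keeps heatmap cells that exceed a threshold and are strictly larger than all neighbours in a small window. The function name `heatmap_peaks`, the threshold and the window radius are assumptions for illustration and are not taken from the repository.

```python
def heatmap_peaks(heatmap, threshold=0.5, radius=1):
    """Return (row, col, score) for cells that are strict local maxima.

    `threshold` and `radius` are illustrative defaults, not the values
    used by Bar-JEPA.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Collect every neighbour inside the (2*radius+1)^2 window.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
                if (rr, cc) != (r, c)
            ]
            # Strict comparison: exact ties produce no peak.
            if all(score > n for n in neighbours):
                peaks.append((r, c, score))
    return peaks


# Toy 4x4 heatmap: 0.6 sits next to the stronger 0.9 and is suppressed.
toy = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8],
]
peaks = heatmap_peaks(toy)
```

In practice the same idea is usually applied per keypoint channel with a max-pooling operation on the GPU; the loop form above is only meant to show the logic.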

Pipeline

Image → variable-resolution patch extraction (≤256 patches, 14 px)
      → frozen I-JEPA ViT-H encoder
      → classic decoder (2× deconv-BN-ReLU + 1×1 conv, 32 keypoint channels)
      → origin / classification / regression heatmap heads
      → NMS + PaddleOCR + RANSAC → bar values
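
The last stage of the pipeline maps pixel positions to chart values. The sketch below shows one way this could work, assuming a linear value axis: RANSAC fits value = a * pixel + b to (tick pixel, OCR value) pairs so that a misread tick label is ignored, and the fitted line converts a bar-top pixel into a value. The function names, tolerance, iteration count and tick data are hypothetical; the repository's post-processing may differ.

```python
import random


def fit_line(points):
    """Least-squares fit of value = a * pixel + b over (pixel, value) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    a = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return a, mean_y - a * mean_x


def ransac_axis(ticks, tol=0.5, iters=200, seed=0):
    """Fit the axis mapping while ignoring outlier ticks (illustrative)."""
    rng = random.Random(seed)
    best = []
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(ticks, 2)
        if x1 == x2:
            continue  # degenerate pair, no unique line
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [(x, y) for x, y in ticks if abs(a * x + b - y) <= tol]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 2:
        raise ValueError("not enough consistent ticks to fit the axis")
    # Refit on all inliers of the best hypothesis.
    return fit_line(best)


# (pixel y, OCR value); the last tick simulates an OCR misread.
ticks = [(400, 0.0), (300, 10.0), (200, 20.0), (100, 30.0), (250, 80.0)]
a, b = ransac_axis(ticks)
bar_value = a * 150 + b  # bar top detected at pixel y = 150
```

With the four consistent ticks the fitted mapping is value = -0.1 * pixel + 40, so the bar top at pixel 150 corresponds to a value of about 25.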

The encoder is frozen during decoder training. Variable-resolution inputs follow the Pix2Struct aspect-ratio-preserving scaling strategy.
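
As a rough illustration of aspect-ratio-preserving scaling, the sketch below computes a patch grid that fits the 256-patch budget with 14 px patches, following the scaling rule used by Pix2Struct-style preprocessing. The function name `patch_grid` is hypothetical, and the repository's exact rounding may differ.

```python
import math


def patch_grid(height, width, patch=14, max_patches=256):
    """Return (rows, cols, (resized_h, resized_w)) for an input image.

    The image is scaled so that rows * cols <= max_patches while the
    aspect ratio is kept as close to the original as the grid allows.
    """
    scale = math.sqrt(max_patches * (patch / height) * (patch / width))
    rows = max(min(math.floor(scale * height / patch), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch), max_patches), 1)
    return rows, cols, (rows * patch, cols * patch)


rows, cols, size = patch_grid(480, 640)
```

For a 640×480 image this gives a 13 × 18 grid (234 patches) and a resized image of 252×182 px, whereas a fixed-resolution encoder would force every chart into the same square grid.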

Checkpoints

Download checkpoints and place them in ./output/. The ViT-H base checkpoint (IN1K-vit.h.14-300e.pth.tar) must also be present there before encoder finetuning.

| File | Description |
| --- | --- |
| kp-cl-arp-ctt-ft-latest.pth.tar | Classic decoder, ARP encoder + Chart-to-Text FT, UB PMC finetuned (best) |
| kp-cl-arp-ft-latest.pth.tar | Classic decoder, ARP encoder, UB PMC finetuned |
| kp-cl-noarp-ft-latest.pth.tar | Classic decoder, fixed-resolution encoder, UB PMC finetuned |
| kp-cl-vanilla-ft-latest.pth.tar | Classic decoder, vanilla ImageNet-only encoder, UB PMC finetuned |
| kp-spl-arp-ft-latest.pth.tar | Simple decoder, ARP encoder, UB PMC finetuned |

Usage

All tasks go through bar-jepa/main.py. Install the dependencies either with pixi or manually (PyTorch ≥ 2.3 and PaddleOCR).

# Setup
pixi install
# or: pip install -e ".[torch]" && pip install paddlepaddle paddleocr

Encoder finetuning:

python bar-jepa/main.py \
  --mode finetune \
  --fname bar-jepa/configs/charts/vith14_arp.yaml \
  --devices cuda:0

Decoder training (pretraining → UB PMC finetuning):

python bar-jepa/main.py --mode decoder \
  --fname bar-jepa/configs/keypoint/classic_arp.yaml --devices cuda:0

python bar-jepa/main.py --mode decoder \
  --fname bar-jepa/configs/keypoint/classic_arp.yaml --devices cuda:0 \
  --override meta.do_finetune=true data.root_path=./UBPMC data.is_ubpmc=true
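
The `--override` flag takes dotted `key=value` pairs. The sketch below shows how such pairs could be merged into a nested config loaded from YAML. It is an assumption about the general mechanism, not the repository's actual parser; `parse_value`, `apply_overrides` and the starting config values are hypothetical.

```python
def parse_value(text):
    """Interpret a raw override string as bool, int, float, or plain string."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def apply_overrides(config, overrides):
    """Write each dotted key=value pair into the nested dict, in place."""
    for item in overrides:
        dotted_key, raw = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return config


# Hypothetical defaults standing in for the contents of classic_arp.yaml.
config = {"meta": {"do_finetune": False}, "data": {"root_path": "./data_decoder"}}
apply_overrides(
    config,
    ["meta.do_finetune=true", "data.root_path=./UBPMC", "data.is_ubpmc=true"],
)
```

After the call, `config["meta"]["do_finetune"]` is `True` and `config["data"]` points at `./UBPMC` with `is_ubpmc` set, which mirrors the finetuning command above.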

Evaluation:

python bar-jepa/main.py --mode eval \
  --fname bar-jepa/configs/eval/classic_arp.yaml --devices cuda:0

# Run all five configurations at once:
python scripts/run_all_evals.py

Training

Encoder finetuning — self-supervised I-JEPA objective on 100k synthetic bar charts. 50 epochs, AdamW, cosine LR 5×10⁻⁵ → 1×10⁻⁶ (6-epoch warm-up), cosine weight decay 0.02–0.04, effective batch size 40, 2× NVIDIA RTX A6000. Optional additional 25 epochs on 15k Chart-to-Text real-world charts.

Decoder training — supervised keypoint regression on 17k synthetic charts. 50 epochs, AdamW, cosine LR 1×10⁻³ → 1×10⁻⁵ (3-epoch warm-up), cosine weight decay 0.04–0.1, batch size 1360, 1× NVIDIA RTX A6000. Finetuned for 30 more epochs on UB PMC / ICPR CHART-Infographics 2022 (1316 vertical bar charts).
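
The learning-rate schedules above combine a warm-up phase with cosine decay. A minimal per-epoch sketch is shown below, assuming a linear warm-up from a hypothetical `start_lr` (the starting value is not stated here) and ignoring per-iteration stepping, which the actual training code may use.

```python
import math


def cosine_lr(epoch, total_epochs, warmup_epochs, peak_lr, final_lr, start_lr=0.0):
    """Learning rate at a given epoch: linear warm-up, then cosine decay.

    `start_lr` is an assumed warm-up starting point, not a documented value.
    """
    if epoch < warmup_epochs:
        return start_lr + (peak_lr - start_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))


# Encoder finetuning settings: 50 epochs, 6 warm-up epochs, 5e-5 -> 1e-6.
encoder_schedule = [cosine_lr(e, 50, 6, 5e-5, 1e-6) for e in range(51)]
```

The schedule reaches the peak rate at the end of warm-up (epoch 6) and the final rate at epoch 50. The decoder schedule follows from the same function with 3 warm-up epochs and 1e-3 → 1e-5.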

Performance

Variable-resolution encoder (ARP), classic decoder, synthetic pretraining only:

| Dataset | Bar F1 | Tick F1 | Acc (ε=0.05) | Acc (ε=0.02) |
| --- | --- | --- | --- | --- |
| UB PMC (real-world) | 0.740 | 0.827 | 0.450 | 0.341 |
| Synthetic (100 charts) | 0.956 | 0.940 | 0.778 | 0.629 |

With additional Chart-to-Text real-world finetuning:

| Dataset | Bar F1 | Tick F1 | Acc (ε=0.05) | Acc (ε=0.02) |
| --- | --- | --- | --- | --- |
| UB PMC (real-world) | 0.785 | 0.842 | 0.499 | 0.365 |
| Synthetic (100 charts) | 0.961 | 0.951 | 0.792 | 0.657 |

Accuracy criterion: |h_gt − h_pred| / h_gt ≤ ε (same as Zhou et al. 2021).
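
The accuracy criterion can be written as a short function. This is a sketch under the assumption that ground-truth values are positive and that predictions are already matched one-to-one with ground-truth bars; the name `value_accuracy` and the sample numbers are illustrative only.

```python
def value_accuracy(gt_values, pred_values, eps):
    """Fraction of bars with relative error |h_gt - h_pred| / h_gt <= eps.

    Assumes every ground-truth value is positive and that the two lists
    are aligned bar by bar.
    """
    hits = sum(
        abs(h_gt - h_pred) / h_gt <= eps
        for h_gt, h_pred in zip(gt_values, pred_values)
    )
    return hits / len(gt_values)


# Illustrative values: relative errors are 4%, 7.5% and 1.25%.
gt = [10.0, 20.0, 40.0]
pred = [10.4, 21.5, 40.5]
acc_05 = value_accuracy(gt, pred, eps=0.05)
acc_02 = value_accuracy(gt, pred, eps=0.02)
```

Here two of three bars pass at ε=0.05 and one of three passes at ε=0.02, which shows how the stricter threshold lowers the reported accuracy.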

Dataset

Training data is available at dralois/Bar-JEPA. Download it and place it at ./data (encoder, 100k) and ./data_decoder (decoder, 17k), or generate it from scratch:

python bar-gen/generator.py --output ./data --count 100000

| Config | Split | Samples | Local path |
| --- | --- | --- | --- |
| encoder_training | train / test | 100,000 / 100 | ./data |
| decoder_training | train / test | 17,000 / 3,400 | ./data_decoder |
| pipeline_testing | test | 100 | |

Limitations

  • Vertical bar charts only; no stacked bars, error bars or 3D effects.
  • OCR (PaddleOCR latin_PP-OCRv5_mobile_rec) is an external dependency for tick label reading.
  • Encoder operates in latent space only, making integration with multimodal language models non-trivial.

Citation

@inproceedings{poonam2026bar-jepa,
  title     = {Bar-JEPA: Extracting Values from Bar Chart with Joint-Embedding Predictive Architecture},
  author    = {Poonam, Poonam and Epple, Alexander and Ropinski, Timo},
  booktitle = {ICDAR},
  year      = {2026}
}

License

Model code is derived from facebookresearch/ijepa and licensed under the same terms. See bar-jepa/LICENSE.
