Thinking with Tables (TWT)

Model summary

Thinking with Tables (TWT) is a multi-modal model family trained for Tabular-Vision Multi-Modal Understanding (TVMU): jointly using table-related images, text, and executable Python in a sandbox to solve multi-modal table question answering and multi-modal table prediction tasks.

The approach is program-aided neuro-symbolic reasoning: the model emits structured outputs (<analy>, <code>, <answer>), code is executed in a controlled environment, and results are fed back in <code_result> for multi-turn reasoning until a final answer is produced.

| Artifact | Link |
|---|---|
| TWT (RL / GRPO) | Kunyang-YU/TWT-RL |
| TWT (SFT) | Kunyang-YU/TWT-SFT |
| Training data | Kunyang-YU/TWT-Training |
| GitHub | kunyang-YU/Thinking-with-Tables |

Paper: Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning (arXiv:2603.24004, 2026).

Code: See the project README for clone URL, environment setup, ms-swift training scripts, and evaluation pipelines.


Base model and training

  • Typical base: multi-modal vision-language model (e.g. Qwen3-VL-8B in the paper).
  • Stage 1 - Task-Oriented SFT (TO-SFT): supervised fine-tuning so the model follows the <analy> / <code> / <answer> protocol; <code_result> is masked during training (loss on model-generated tokens only).
  • Stage 2 - Adaptive Loss-Scaled GRPO (AL-GRPO): reinforcement learning with multi-turn rollouts (code execution in sandbox, reward on final answer); only successfully executed code segments contribute to the GRPO loss, with separate loss scaling for default vs. code tokens.
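The <code_result> masking used in TO-SFT can be sketched as follows. This is an illustrative reimplementation only (character-level rather than token-level), not the project's training code:

```python
import re

# Sketch of TO-SFT label masking: spans inside <code_result>...</code_result>
# carry no loss (mask = 0) because the sandbox, not the model, produced them.
# All other positions keep loss (mask = 1).
def loss_mask(transcript: str) -> list:
    mask = [1] * len(transcript)
    for m in re.finditer(r"<code_result>.*?</code_result>", transcript, re.DOTALL):
        for i in range(m.start(), m.end()):
            mask[i] = 0
    return mask

demo = "<code>print(1)</code><code_result>1</code_result><answer>1</answer>"
mask = loss_mask(demo)
# the <code_result>...</code_result> span is zeroed; the rest stays 1
```

In a real pipeline the same idea applies to token IDs (label set to the ignore index for sandbox-produced tokens) rather than characters.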

Approximate data scale (from the paper / README): SFT on the order of ~1.5K table QA + ~1.2K table prediction + ~5K additional samples; GRPO on the order of ~0.5K table QA + ~0.4K table prediction (exact counts may vary by release).


Intended use

Primary use cases

  • Table QA: Given a table header/sample image, natural-language question, and path to the full CSV, the model generates code (e.g. pandas) to read and reason over the full table in a sandbox.
  • Table prediction: Given a task image, metadata, and paths to tabular data / images, the model may call task-specific tools (e.g. ResNet50Predict for vision features) plus tabular code to produce a classification or regression answer.
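For illustration, here is the kind of <code> segment the model might emit for a table QA question. The table contents and column names are invented, and an in-memory string stands in for the CSV path the model would normally be given:

```python
import io
import pandas as pd

# Stand-in for the full CSV the model would read from the provided path;
# in deployment this would be pd.read_csv("<path from the prompt>").
csv_text = "region,revenue\nNorth,120\nSouth,95\nWest,140\n"
df = pd.read_csv(io.StringIO(csv_text))

# Question: "Which region has the highest revenue?"
best = df.loc[df["revenue"].idxmax(), "region"]
print(best)
```

The key point is that the prompt carries only a header/sample image plus a file path; the code, not the prompt, touches the full table.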

Out-of-scope / not guaranteed

  • Tasks that require no code execution, or settings without sandbox/environment access, may fall outside the training distribution.
  • The model is not a substitute for financial, legal, or safety-critical decisions without human review.

Output format (inference contract)

The model is expected to produce parseable segments:

| Tag | Role |
|---|---|
| <analy>...</analy> | Reasoning |
| <code>...</code> | Python for the sandbox |
| <code_result>...</code_result> | Filled by the system after execution (not model-generated) |
| <answer>...</answer> | Final answer |

Inference is typically multi-turn: generate → execute <code> → append <code_result> → repeat until <answer>.
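This loop can be sketched in a few lines. `generate` and `execute` below are placeholders for a real model call (e.g. via an OpenAI-compatible API) and the sandbox; the stubs make the loop runnable without either:

```python
import re

# Minimal sketch of the multi-turn inference contract: run the model,
# execute any emitted <code>, feed back <code_result>, stop at <answer>.
def run_episode(generate, execute, max_turns=5):
    transcript = ""
    for _ in range(max_turns):
        out = generate(transcript)
        transcript += out
        ans = re.search(r"<answer>(.*?)</answer>", transcript, re.DOTALL)
        if ans:
            return ans.group(1).strip()
        code = re.findall(r"<code>(.*?)</code>", out, re.DOTALL)
        if code:
            result = execute(code[-1])
            transcript += f"<code_result>{result}</code_result>"
    return None

# Deterministic stand-in for the model: emits code on the first turn,
# then the final answer once a <code_result> is present.
def stub_generate(transcript):
    if "<code_result>" not in transcript:
        return "<analy>need to compute</analy><code>2 + 3</code>"
    return "<answer>5</answer>"

answer = run_episode(stub_generate, lambda src: eval(src))
```

A production driver would add timeouts, a restricted sandbox instead of `eval`, and error handling for malformed tags.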


Evaluation notes

  • Table QA: Benchmarks include settings aligned with TAT-QA, FinQA, HiTab, and similar; the evaluation harness uses an OpenAI-compatible API (e.g. vLLM) plus a sandbox loop.
  • Table prediction: Tasks such as Pawpularity, Adoption, SkinCA, Paintings require ResNet50 weights and a tools.py / ResNet50Predict(task, image_path) interface in the sandbox path.
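As a hedged sketch of how model-generated code might combine the vision tool with tabular features: the stub below is not the project's tools.py; it returns a dummy score so the code shape can be shown without ResNet50 weights, and the row fields are hypothetical:

```python
# Placeholder for the ResNet50Predict(task, image_path) interface named above;
# the real tool runs a ResNet50 forward pass on the image in the sandbox.
def ResNet50Predict(task: str, image_path: str) -> float:
    return 0.5  # dummy score for illustration only

# The kind of <code> the model might emit for a prediction task
# (row fields here are hypothetical, not from the benchmark schema).
row = {"image_path": "pets/001.jpg", "age_months": 12}
vision_score = ResNet50Predict("Pawpularity", row["image_path"])
prediction = vision_score * 100  # e.g. map score onto the Pawpularity scale
```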

Important: At inference time the full table is usually not pasted into the prompt; the model must read the file via generated code using the provided path.


Limitations

  • Depends on a correctly configured sandbox (timeouts, allowed imports, file paths, optional ResNet weights).
  • Code execution failures or path errors break the multi-turn loop; robustness depends on deployment.
  • Performance varies by table complexity, OCR/image quality, and domain shift from training data.

Citation

@article{yu2026thinking,
  title={Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning},
  author={Yu, Kun-Yang and Zhou, Zhi and Tian, Shi-Yu and Yang, Xiao-Wen and Jia, Zi-Yi and Yang, Ming and Cheng, Zi-Jian and Guo, Lan-Zhe and Li, Yu-Feng},
  journal={arXiv preprint arXiv:2603.24004},
  year={2026}
}

This model is fine-tuned from Qwen3-VL; thanks to the Qwen team!


Contact

Questions about paths or reproduction: open an issue in the project repository or email yuky@lamda.nju.edu.cn (as listed in the main README).
