Thinking with Tables (TWT)
Model summary
Thinking with Tables (TWT) is a multi-modal model family trained for Tabular-Vision Multi-Modal Understanding (TVMU): jointly using table-related images, text, and executable Python in a sandbox to solve multi-modal table question answering and multi-modal table prediction tasks.
The approach is program-aided neuro-symbolic reasoning: the model emits structured outputs (`<analy>`, `<code>`, `<answer>`), code is executed in a controlled environment, and results are fed back in `<code_result>` for multi-turn reasoning until a final answer is produced.
| Artifact | Link |
|---|---|
| TWT (RL / GRPO) | Kunyang-YU/TWT-RL |
| TWT (SFT) | Kunyang-YU/TWT-SFT |
| Training data | Kunyang-YU/TWT-Training |
| GitHub | kunyang-YU/Thinking-with-Tables |
Paper: Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning (arXiv:2603.24004, 2026).
Code: See the project README for clone URL, environment setup, ms-swift training scripts, and evaluation pipelines.
Base model and training
- Base model: a multi-modal vision-language model (the paper uses Qwen/Qwen3-VL-8B-Instruct).
- Stage 1 – Task-Oriented SFT (TO-SFT): supervised fine-tuning so the model follows the `<analy>`/`<code>`/`<answer>` protocol; `<code_result>` is masked during training (loss on model-generated tokens only).
- Stage 2 – Adaptive Loss-Scaled GRPO (AL-GRPO): reinforcement learning with multi-turn rollouts (code execution in a sandbox, reward on the final answer); only successfully executed code segments contribute to the GRPO loss (default + code loss scaling).
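The `<code_result>` masking in Stage 1 can be sketched as follows. This is a minimal illustration, assuming a tag-delimited transcript and the common Hugging Face/PyTorch convention of labeling ignored tokens with `-100`; the toy character-level tokenizer is an invention for self-containment, not the project's tokenizer.

```python
import re

IGNORE_INDEX = -100  # conventional "ignore" label for PyTorch cross-entropy losses

def mask_code_results(text, tokenize):
    """Build (input_ids, labels) where tokens inside <code_result>...</code_result>
    get IGNORE_INDEX, so system-filled execution output contributes no loss.
    `tokenize` is any callable str -> list[int]."""
    input_ids, labels = [], []
    # Split the transcript into code-result spans and everything else.
    for piece in re.split(r"(<code_result>.*?</code_result>)", text, flags=re.S):
        ids = tokenize(piece)
        input_ids.extend(ids)
        if piece.startswith("<code_result>"):
            labels.extend([IGNORE_INDEX] * len(ids))  # masked: no gradient
        else:
            labels.extend(ids)  # supervised: model-generated tokens
    return input_ids, labels

# Toy tokenizer for illustration only: one "token" per character.
toy = lambda s: [ord(c) for c in s]
ids, labels = mask_code_results(
    "<code>df.head()</code><code_result>42</code_result><answer>42</answer>", toy
)
assert len(ids) == len(labels)
```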
Approximate data scale (from the paper / README): SFT on the order of ~1.5K table QA + ~1.2K table prediction + ~5K additional samples; GRPO on the order of ~0.5K table QA + ~0.4K table prediction (exact counts may vary by release).
Intended use
Primary use cases
- Table QA: Given a table header/sample image, natural-language question, and path to the full CSV, the model generates code (e.g. pandas) to read and reason over the full table in a sandbox.
- Table prediction: Given a task image, metadata, and paths to tabular data / images, the model may call task-specific tools (e.g. `ResNet50Predict` for vision features) plus tabular code to produce a classification or regression answer.
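To make the table-QA use case concrete, here is the kind of `<code>` segment the model might emit. This is an invented sketch: the CSV contents, column names, and question are illustrative, and the data is inlined so the snippet is self-contained (in deployment the model would call `pd.read_csv` on the path given in the prompt).

```python
import io
import pandas as pd

# In deployment the sandbox reads the CSV path from the prompt;
# here a tiny table is inlined so the sketch runs on its own.
csv_text = "year,revenue\n2022,100\n2023,150\n2024,180\n"
df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical question: "In which year did revenue grow the most?"
growth = df["revenue"].diff()           # year-over-year change
best_year = int(df.loc[growth.idxmax(), "year"])
print(best_year)  # -> 2023
```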
Out-of-scope / not guaranteed
- Tasks that require no code or no environment access may not match the training distribution.
- The model is not a substitute for financial, legal, or safety-critical decisions without human review.
Output format (inference contract)
The model is expected to produce parseable segments:
| Tag | Role |
|---|---|
| `<analy>...</analy>` | Reasoning |
| `<code>...</code>` | Python for the sandbox |
| `<code_result>...</code_result>` | Filled by the system after execution (not model-generated) |
| `<answer>...</answer>` | Final answer |
Inference is typically multi-turn: generate → execute `<code>` → append `<code_result>` → repeat until `<answer>`.
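The multi-turn contract above can be sketched as a driver loop. `generate` and `run_in_sandbox` here are hypothetical stand-ins for the model endpoint and the execution environment, not the project's actual interfaces; the fake model at the bottom exists only to exercise the loop.

```python
import re

def extract(tag, text):
    """Return the content of the last <tag>...</tag> segment, or None."""
    matches = re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.S)
    return matches[-1] if matches else None

def solve(prompt, generate, run_in_sandbox, max_turns=8):
    """Drive the generate -> execute -> append loop until <answer> appears.
    `generate(transcript)` and `run_in_sandbox(code)` are stand-ins for
    the model API and the sandbox."""
    transcript = prompt
    for _ in range(max_turns):
        turn = generate(transcript)
        transcript += turn
        answer = extract("answer", turn)
        if answer is not None:
            return answer.strip()
        code = extract("code", turn)
        if code is not None:
            result = run_in_sandbox(code)
            transcript += f"<code_result>{result}</code_result>"
    return None  # no answer within the turn budget

# Tiny fake model: emits code on turn 1, reads the result on turn 2.
def fake_generate(transcript):
    if "<code_result>" not in transcript:
        return "<analy>Need to compute.</analy><code>print(2 + 2)</code>"
    return "<answer>4</answer>"

print(solve("Q: what is 2 + 2?", fake_generate, lambda code: "4"))  # -> 4
```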
Evaluation notes
- Table QA: Benchmarks include settings aligned with TAT-QA, FinQA, HiTab, and similar; the evaluation harness uses an OpenAI-compatible API (e.g. vLLM) plus a sandbox loop.
- Table prediction: Tasks such as Pawpularity, Adoption, SkinCA, and Paintings require ResNet50 weights and a `tools.py` / `ResNet50Predict(task, image_path)` interface on the sandbox path.
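For smoke-testing the sandbox loop without downloading ResNet50 weights, the tool contract can be mocked. This stub only mimics the signature; the task names and dummy outputs are illustrative assumptions, and the real tool would run a ResNet50 forward pass on the image.

```python
# Hypothetical stub of the tools.py / ResNet50Predict(task, image_path)
# interface; dummy outputs stand in for real model predictions.
def ResNet50Predict(task, image_path):
    """Map (task, image_path) to a prediction, matching the tool contract."""
    dummy = {
        "Pawpularity": 38.0,   # regression-style score
        "Adoption": "fast",    # classification-style label
    }
    if task not in dummy:
        raise ValueError(f"unknown task: {task}")
    return dummy[task]

print(ResNet50Predict("Pawpularity", "images/cat_001.jpg"))  # -> 38.0
```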
Important: At inference time the full table is usually not pasted into the prompt; the model must read the file via generated code using the provided path.
Limitations
- Depends on a correctly configured sandbox (timeouts, allowed imports, file paths, optional ResNet weights).
- Code execution failures or path errors break the multi-turn loop; robustness depends on deployment.
- Performance varies by table complexity, OCR/image quality, and domain shift from training data.
Citation
@article{yu2026thinking,
title={Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning},
author={Yu, Kun-Yang and Zhou, Zhi and Tian, Shi-Yu and Yang, Xiao-Wen and Jia, Zi-Yi and Yang, Ming and Cheng, Zi-Jian and Guo, Lan-Zhe and Li, Yu-Feng},
journal={arXiv preprint arXiv:2603.24004},
year={2026}
}
This model is fine-tuned from Qwen3-VL; thanks to the Qwen team!
Contact
Questions about paths or reproduction: open an issue in the project repository or email yuky@lamda.nju.edu.cn (as listed in the main README).