Thinking with Tables (TWT)

Model summary

Thinking with Tables (TWT) is a multi-modal model family trained for Tabular-Vision Multi-Modal Understanding (TVMU): jointly using table-related images, text, and executable Python in a sandbox to solve multi-modal table question answering and multi-modal table prediction tasks.

The approach is program-aided neuro-symbolic reasoning: the model emits structured outputs (<analy>, <code>, <answer>), code is executed in a controlled environment, and results are fed back in <code_result> for multi-turn reasoning until a final answer is produced.

| Artifact | Link |
|---|---|
| TWT (RL / GRPO) | Kunyang-YU/TWT-RL |
| TWT (SFT) | Kunyang-YU/TWT-SFT |
| Training data | Kunyang-YU/TWT-Training |
| GitHub | kunyang-YU/Thinking-with-Tables |

Paper: Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning (arXiv:2603.24004, 2026).

Code: See the project README for clone URL, environment setup, ms-swift training scripts, and evaluation pipelines.


Base model and training

  • Typical base: multi-modal vision-language model (e.g. Qwen3-VL-8B in the paper).
  • Stage 1 - Task-Oriented SFT (TO-SFT): supervised fine-tuning so the model follows the <analy> / <code> / <answer> protocol; <code_result> is masked during training (loss on model-generated tokens only).
  • Stage 2 - Adaptive Loss-Scaled GRPO (AL-GRPO): reinforcement learning with multi-turn rollouts (code execution in sandbox, reward on final answer); only successfully executed code segments contribute to the GRPO loss, with separate loss scaling for default vs. code tokens.
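The <code_result> masking used in TO-SFT can be sketched as follows. This is an illustrative reimplementation only (character-level rather than token-level), not the project's training code:

```python
import re

# Sketch of TO-SFT label masking: spans inside <code_result>...</code_result>
# carry no loss (mask = 0) because the sandbox, not the model, produced them.
# All other positions keep loss (mask = 1).
def loss_mask(transcript: str) -> list:
    mask = [1] * len(transcript)
    for m in re.finditer(r"<code_result>.*?</code_result>", transcript, re.DOTALL):
        for i in range(m.start(), m.end()):
            mask[i] = 0
    return mask

demo = "<code>print(1)</code><code_result>1</code_result><answer>1</answer>"
mask = loss_mask(demo)
# the <code_result>...</code_result> span is zeroed; the rest stays 1
```

In a real pipeline the same idea applies to token IDs (label set to the ignore index for sandbox-produced tokens) rather than characters.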

Approximate data scale (from the paper / README): SFT on the order of ~1.5K table QA + ~1.2K table prediction + ~5K additional samples; GRPO on the order of ~0.5K table QA + ~0.4K table prediction (exact counts may vary by release).


Intended use

Primary use cases

  • Table QA: Given a table header/sample image, natural-language question, and path to the full CSV, the model generates code (e.g. pandas) to read and reason over the full table in a sandbox.
  • Table prediction: Given a task image, metadata, and paths to tabular data / images, the model may call task-specific tools (e.g. ResNet50Predict for vision features) plus tabular code to produce a classification or regression answer.
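For illustration, here is the kind of <code> segment the model might emit for a table QA question. The table contents and column names are invented, and an in-memory string stands in for the CSV path the model would normally be given:

```python
import io
import pandas as pd

# Stand-in for the full CSV the model would read from the provided path;
# in deployment this would be pd.read_csv("<path from the prompt>").
csv_text = "region,revenue\nNorth,120\nSouth,95\nWest,140\n"
df = pd.read_csv(io.StringIO(csv_text))

# Question: "Which region has the highest revenue?"
best = df.loc[df["revenue"].idxmax(), "region"]
print(best)
```

The key point is that the prompt carries only a header/sample image plus a file path; the code, not the prompt, touches the full table.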

Out-of-scope / not guaranteed

  • Tasks that require no code execution, or settings without sandbox/environment access, may fall outside the training distribution.
  • The model is not a substitute for financial, legal, or safety-critical decisions without human review.

Output format (inference contract)

The model is expected to produce parseable segments:

| Tag | Role |
|---|---|
| <analy>...</analy> | Reasoning |
| <code>...</code> | Python for the sandbox |
| <code_result>...</code_result> | Filled by the system after execution (not model-generated) |
| <answer>...</answer> | Final answer |

Inference is typically multi-turn: generate → execute <code> → append <code_result> → repeat until <answer>.
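This loop can be sketched in a few lines. `generate` and `execute` below are placeholders for a real model call (e.g. via an OpenAI-compatible API) and the sandbox; the stubs make the loop runnable without either:

```python
import re

# Minimal sketch of the multi-turn inference contract: run the model,
# execute any emitted <code>, feed back <code_result>, stop at <answer>.
def run_episode(generate, execute, max_turns=5):
    transcript = ""
    for _ in range(max_turns):
        out = generate(transcript)
        transcript += out
        ans = re.search(r"<answer>(.*?)</answer>", transcript, re.DOTALL)
        if ans:
            return ans.group(1).strip()
        code = re.findall(r"<code>(.*?)</code>", out, re.DOTALL)
        if code:
            result = execute(code[-1])
            transcript += f"<code_result>{result}</code_result>"
    return None

# Deterministic stand-in for the model: emits code on the first turn,
# then the final answer once a <code_result> is present.
def stub_generate(transcript):
    if "<code_result>" not in transcript:
        return "<analy>need to compute</analy><code>2 + 3</code>"
    return "<answer>5</answer>"

answer = run_episode(stub_generate, lambda src: eval(src))
```

A production driver would add timeouts, a restricted sandbox instead of `eval`, and error handling for malformed tags.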


Evaluation notes

  • Table QA: Benchmarks include settings aligned with TAT-QA, FinQA, HiTab, and similar; the evaluation harness uses an OpenAI-compatible API (e.g. vLLM) plus a sandbox loop.
  • Table prediction: Tasks such as Pawpularity, Adoption, SkinCA, Paintings require ResNet50 weights and a tools.py / ResNet50Predict(task, image_path) interface in the sandbox path.
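As a hedged sketch of how model-generated code might combine the vision tool with tabular features: the stub below is not the project's tools.py; it returns a dummy score so the code shape can be shown without ResNet50 weights, and the row fields are hypothetical:

```python
# Placeholder for the ResNet50Predict(task, image_path) interface named above;
# the real tool runs a ResNet50 forward pass on the image in the sandbox.
def ResNet50Predict(task: str, image_path: str) -> float:
    return 0.5  # dummy score for illustration only

# The kind of <code> the model might emit for a prediction task
# (row fields here are hypothetical, not from the benchmark schema).
row = {"image_path": "pets/001.jpg", "age_months": 12}
vision_score = ResNet50Predict("Pawpularity", row["image_path"])
prediction = vision_score * 100  # e.g. map score onto the Pawpularity scale
```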

Important: At inference time the full table is usually not pasted into the prompt; the model must read the file via generated code using the provided path.


Limitations

  • Depends on a correctly configured sandbox (timeouts, allowed imports, file paths, optional ResNet weights).
  • Code execution failures or path errors break the multi-turn loop; robustness depends on deployment.
  • Performance varies by table complexity, OCR/image quality, and domain shift from training data.

Citation

@article{yu2026thinking,
  title={Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning},
  author={Yu, Kun-Yang and Zhou, Zhi and Tian, Shi-Yu and Yang, Xiao-Wen and Jia, Zi-Yi and Yang, Ming and Cheng, Zi-Jian and Guo, Lan-Zhe and Li, Yu-Feng},
  journal={arXiv preprint arXiv:2603.24004},
  year={2026}
}

This model is fine-tuned from Qwen3-VL; thanks to the Qwen team!


Contact

Questions about paths or reproduction: open an issue in the project repository or email yuky@lamda.nju.edu.cn (as listed in the main README).
