arxiv:2602.11731

Thinking with Drafting: Optical Decompression via Logical Reconstruction

Published on Feb 12

Submitted by Cheng Tan on Feb 13

ByteDance

Authors: Siyuan Li, Cheng Tan

Abstract

Visual reasoning is enhanced by reconstructing logical structures from compressed visual tokens through a DSL-based approach that generates deterministic visual proofs for verification.

AI-generated summary

Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
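As a rough illustration of the draft-then-render idea, the sketch below parses a tiny text DSL into segment records and renders them as text bars. The DSL syntax (`seg <row> <label> <length>`), the `Segment` type, and the `parse_draft` and `render` functions are all hypothetical and invented for this example; the abstract does not specify the paper's actual DSL or renderer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    row: str     # name of the row the segment belongs to
    label: str   # symbolic label shown inside the segment
    length: int  # segment length in abstract units


def parse_draft(source: str) -> list[Segment]:
    """Parse lines of the (hypothetical) form `seg <row> <label> <length>`."""
    segments = []
    for line_no, raw in enumerate(source.strip().splitlines(), start=1):
        parts = raw.split()
        if len(parts) != 4 or parts[0] != "seg":
            raise ValueError(f"line {line_no}: expected 'seg <row> <label> <length>'")
        segments.append(Segment(parts[1], parts[2], int(parts[3])))
    return segments


def render(segments: list[Segment]) -> str:
    """Render each row as a text bar. No randomness is involved, so the
    same draft always produces the same output."""
    rows: dict[str, list[Segment]] = {}
    for seg in segments:
        rows.setdefault(seg.row, []).append(seg)
    lines = []
    for name, segs in rows.items():
        # Each unit of length is drawn two characters wide.
        cells = (seg.label.center(seg.length * 2, "-") for seg in segs)
        lines.append(f"{name}: |" + "|".join(cells) + "|")
    return "\n".join(lines)


draft = """
seg A x 3
seg A x 3
seg B x 3
seg B y 2
"""
print(render(parse_draft(draft)))
```

The point of the sketch is only that a malformed draft fails loudly at parse time and a well-formed one renders identically on every run, which is what makes the rendering usable as a check rather than as a free-form picture.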

Community

chengtan9907

Paper author Paper submitter about 11 hours ago

The core idea of Thinking with Drafting (TwD) is super refreshing: instead of letting a multimodal model “guess the answer” with fluent CoT or pretty-looking diagrams, it forces the model to draft its reasoning into executable structure. Not vibes. Not plausible pixels. But strict, renderable DSL code.

The “optical decompression” framing is also 🔥 — OCR gives you symbols, but not logical topology. TwD says: real understanding = reconstructing the hidden structure behind those symbols. And the moment the model has to commit to aligned segments, brackets, and cross-row constraints, hallucination becomes much harder.

What I like most is the shift from:

generate explanation → hope it’s right
to
generate structure → verify it deterministically

That feels like a big step toward trustworthy multimodal reasoning.
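To make the "generate structure → verify it deterministically" step concrete, here is a minimal sketch of what a deterministic check over a drafted structure could look like. The data layout (rows of labeled segments plus bracket totals) and the `verify` function are my own assumptions for illustration, not the paper's actual DSL or verifier.

```python
def verify(rows, brackets):
    """Check a drafted structure and return a list of violations.

    rows:     {row_name: [(label, length), ...]}  (hypothetical layout)
    brackets: {row_name: claimed_total_length}
    An empty list means the draft is internally consistent.
    """
    problems = []
    seen = {}  # label -> length first observed, shared across rows
    for row, segments in rows.items():
        for label, length in segments:
            if length <= 0:
                problems.append(f"{row}: segment {label!r} has non-positive length")
            # Cross-row constraint: one label must mean one length everywhere.
            if seen.setdefault(label, length) != length:
                problems.append(
                    f"{row}: segment {label!r} length {length} "
                    f"conflicts with {seen[label]}"
                )
    for row, total in brackets.items():
        if row not in rows:
            problems.append(f"{row}: bracket refers to a missing row")
            continue
        # Bracket constraint: the claimed total must equal the sum of segments.
        actual = sum(length for _, length in rows[row])
        if actual != total:
            problems.append(f"{row}: bracket says {total}, segments sum to {actual}")
    return problems


brackets = {"A": 6, "B": 5}
good = {"A": [("x", 3), ("x", 3)], "B": [("x", 3), ("y", 2)]}
bad = {"A": [("x", 3), ("x", 3)], "B": [("x", 4), ("y", 2)]}

print(verify(good, brackets))  # no violations expected
for problem in verify(bad, brackets):
    print(problem)
```

In the `bad` draft, changing one segment breaks both the cross-row constraint and the bracket total, so a single wrong commitment gets flagged in two places instead of slipping through as a plausible-sounding explanation.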