Emergent Compositional Communication for Latent World Properties
Abstract
Multi-agent systems with Gumbel-Softmax communication develop compositional representations of physical properties from video features without supervision, demonstrating scalable and causally robust learning of latent variables.
Can multi-agent communication pressure extract discrete, compositional representations of invisible physical properties from frozen video features? We show that agents communicating through a Gumbel-Softmax bottleneck with iterated learning develop positionally disentangled protocols for latent properties (elasticity, friction, mass ratio) without property labels or supervision on message structure. With 4 agents, 100% of 80 seeds converge to near-perfect compositionality (PosDis=0.999, 98.3% holdout accuracy). Controls confirm that multi-agent structure -- not bandwidth or temporal coverage -- drives this effect. Causal intervention shows surgical property disruption (~15% drop on the targeted property, <3% on others). A controlled backbone comparison reveals that the perceptual prior determines what is communicable: DINOv2 dominates on spatially visible ramp physics (98.3% vs 95.1%), while V-JEPA 2 dominates on dynamics-only collision physics (87.4% vs 77.7%, d=2.74). Scale-matched (d=3.37) and frame-matched (d=6.53) controls attribute this gap entirely to video-native pretraining. The frozen protocol supports action-conditioned planning (91.5%) with counterfactual velocity reasoning (r=0.780). Validation on real camera footage from Physics 101 confirms that the results transfer: mass-comparison accuracy reaches 85.6% on unseen objects, temporal dynamics contribute +11.2% beyond static appearance, agent-scaling compositionality replicates at 90% for 4 agents, and causal intervention extends to real video (d=1.87, p=0.022).
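The Gumbel-Softmax bottleneck named in the abstract can be sketched in plain NumPy, outside any training loop. This is a minimal illustration, not the paper's implementation: the actual agents, message length, vocabulary size, and temperature schedule are assumptions here. Each message position draws a (near-)one-hot symbol by adding Gumbel(0, 1) noise to the sender's logits and applying a temperature-scaled softmax; in an autograd framework the `hard` sample would be made differentiable with the straight-through trick.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, hard=True, rng=None):
    """Sample one (near-)one-hot symbol per row of `logits`.

    logits: array of shape (num_positions, vocab_size), unnormalized scores.
    tau:    softmax temperature; lower values give sharper, more discrete samples.
    hard:   if True, return the one-hot argmax of the relaxed sample
            (the straight-through estimator would route gradients
            through the soft sample in a framework with autograd).
    """
    if rng is None:
        rng = np.random.default_rng()
    # Gumbel(0, 1) noise via the inverse-CDF trick: -log(-log(U)).
    u = rng.uniform(size=logits.shape)
    gumbel = -np.log(-np.log(u + 1e-20) + 1e-20)
    y = logits + gumbel
    # Temperature-scaled softmax, with max-subtraction for numerical stability.
    z = np.exp((y - y.max(-1, keepdims=True)) / tau)
    soft = z / z.sum(-1, keepdims=True)
    if hard:
        onehot = np.zeros_like(soft)
        onehot[np.arange(soft.shape[0]), soft.argmax(-1)] = 1.0
        return onehot
    return soft

# Hypothetical usage: a 3-symbol message over a vocabulary of 8.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 8))
message = gumbel_softmax(logits, tau=0.5, rng=rng)
```

As `tau` decreases, the soft samples approach exact one-hot vectors, which is what lets a discrete communication channel be trained end-to-end.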
Community
Discrete bottlenecks on frozen V-JEPA 2 features yield compositional physics codes: 100% sender convergence, and the protocol validates on real camera footage at 85.6%. V-JEPA 2 outperforms DINOv2 on collision physics -- video-native pretraining matters.