Title: MicroVerse: A Preliminary Exploration Toward a Micro-World Simulation

URL Source: https://arxiv.org/html/2603.00585

Rongsheng Wang 1,3  Minghao Wu 1  Hongru Zhou 2  Zhihan Yu 1  Zhenyang Cai 1  Junying Chen 1  Benyou Wang 1,3

1 The Chinese University of Hong Kong, Shenzhen

2 Peking Union Medical College Hospital

3 Shenzhen Loop Area Institute

Abstract

Recent advances in video generation have opened new avenues for macroscopic simulation of complex dynamic systems, but their application to microscopic phenomena remains largely unexplored. Microscale simulation holds great promise for biomedical applications such as drug discovery, organ-on-chip systems, and disease mechanism studies, while also showing potential in education and interactive visualization. In this work, we introduce MicroWorldBench, a multi-level rubric-based benchmark for microscale simulation tasks. MicroWorldBench enables systematic, rubric-based evaluation through 459 unique expert-annotated criteria spanning multiple microscale simulation tasks (e.g., organ-level processes, cellular dynamics, and subcellular molecular interactions) and evaluation dimensions (e.g., scientific fidelity, visual quality, instruction following). MicroWorldBench reveals that current SOTA video generation models fail at microscale simulation, exhibiting violations of physical laws, temporal inconsistency, and misalignment with expert criteria. To address these limitations, we construct MicroSim-10K, a high-quality, expert-verified simulation dataset. Leveraging this dataset, we train MicroVerse, a video generation model tailored for microscale simulation that can accurately reproduce complex microscale mechanisms. Our work is the first to introduce the concept of Micro-World Simulation and present a proof of concept, paving the way for applications in biology, education, and scientific visualization. Our data and code are publicly available at https://github.com/FreedomIntelligence/MicroVerse.

1 Introduction

World models LeCun (2022); Bruce et al. (2024); Lu et al. (2024) have been extensively studied for their ability to simulate environments and agent interactions. They offer a unified computational framework for perceiving surroundings, controlling actions, and predicting outcomes, thereby reducing reliance on real-world trials. This not only benefits robotics engines Luo and Du (2024); Lu et al. (2024) and reinforcement learning planners Hafner et al. (2020); Agarwal et al. (2025), but also enhances decision-making, supports safe exploration, and enables scalable learning.

Recently, video generative models have demonstrated strong potential to acquire commonsense knowledge directly from raw video data, ranging from physical laws in the real world to embodied behavioral patterns Brooks et al. (2024), laying the foundation for their use as real-world simulators. For example, prior work Luo and Du (2024) employs video-guided goal-conditioned exploration, grounding large-scale video generation model priors into continuous action spaces through self-supervision, enabling robots to master complex manipulation skills without explicit actions or rewards; and other works Lu et al. (2024) leverage video generation models for embodied decision-making, allowing agents to imaginatively explore their environment with high generative quality and consistent exploration.

Image 1: Refer to caption

Figure 1: Failure cases of Sora and Veo3 on Microscale Simulation. Although Sora and Veo3 generate results that appear visually correct, their violations of physical laws are particularly evident.

Despite tremendous progress in video generation for natural scenes and human-centered domains OpenAI (2024); Google DeepMind (2025); Kong et al. (2024); Wan et al. (2025); Yang et al. (2024), research efforts have remained predominantly focused on the macroscopic scale. This success has not translated effectively to the microscopic scale, where current state-of-the-art models fail to produce physically plausible or biologically meaningful dynamics, as shown in Figure 1. Microscopic simulation, which tracks the interactions of atoms, molecules, and cells to uncover underlying mechanisms, is crucial for applications in materials science, biomedical research Dario et al. (2000), education Romme (2002), and interactive visualization White (1992). The failure of existing models, primarily due to a lack of incorporated biomedical knowledge, highlights a critical gap despite the strong potential of microscale simulation for generating clinically realistic dynamics in fields like drug discovery and disease modeling. To address this, we aim to explore the potential of educational microscale simulations of biological mechanisms.

In this work, we introduce MicroWorldBench, a multi-level rubric-based benchmark for microscale simulation tasks comprising 459 real-world tasks that span organ-level, cellular, and subcellular processes. These tasks were jointly selected from a large candidate pool by LLMs and domain experts for their diversity and relevance, with each task paired with self-contained, objective evaluation criteria specifying the essentials for valid simulation. Our extensive experiments across a broad spectrum of video generation models reveal that while most maintain superficial visual coherence and adhere to prompts, they perform poorly in microscale settings, consistently failing to generate biologically plausible dynamics. These failures indicate that current models, trained predominantly on human-scale videos, lack grounding in microphysical principles and knowledge.

To mitigate this gap, we introduce MicroVerse, a video generation model tailored for microscale simulation. MicroVerse is built on the Wan2.1 Wan et al. (2025) model and trained with MicroSim-10K, the first microscale dataset, containing 9,601 expert-verified scenarios. Unlike human-scale datasets, MicroSim-10K emphasizes physical plausibility and biological fidelity across diverse microscale mechanisms. On MicroWorldBench, MicroVerse surpasses the original model by more than +2.7 points in scientific fidelity, highlighting the importance of domain-specific data.

Our contributions are summarized as follows: (i) We introduce the concept of Micro-World Simulation and present a proof of concept, which includes a clear objective, a dedicated benchmark, a training dataset, and a tailored model; (ii) We propose MicroWorldBench, the first rubric-based benchmark specifically designed for evaluating microscale simulation in video generation; (iii) We construct MicroSim-10K, a large-scale, expert-verified dataset of microscale simulation videos; (iv) We introduce MicroVerse, a fine-tuned video generation model built upon MicroSim-10K, achieving competitive performance on MicroWorldBench by reducing violations of scientific constraints and improving temporal and spatial consistency.

2 MicroWorldBench: A Rubric-Based Benchmark for Microscale Simulation

Image 2: Refer to caption

Figure 2: Illustration of the MicroWorldBench Evaluation Process. A MicroWorldBench example consists of a generated microscopic video and a set of task-specific evaluation criteria written by experts. An MLLM-based scoring system rates each video according to each criterion.

Generic Evaluation Fails to Capture Microscale Simulation Dynamics Existing evaluation methods for video models often rely on generic scoring rules or high-level principles Huang et al. (2024); Zheng et al. (2025); He et al. (2024), which are insufficient for microscale simulation. Such methods overlook the need for fine-grained microscopic simulations, resulting in misaligned outcomes and failing to capture deficiencies in physical plausibility and biological fidelity. In this work, our proposed rubric-based evaluation addresses this gap by introducing task-specific criteria with differentiated weights. Rubrics highlight the most critical dimensions identified by experts and ensure that evaluations emphasize substantive shortcomings rather than being diluted by aggregate scoring.

In this section, we introduce the core structure of the rubric-based benchmark, covering task selection (Sec. 2.1), prompt design (Sec. 2.2), and rubric construction (Sec. 2.3), and describe the methodology for model evaluation (Sec. 2.4).

2.1 Task Choice

Biological systems are inherently hierarchical, encompassing levels from society, body, organ, and tissue to cell, organelle, protein, and gene Qu et al. (2011). Given constraints of practicality, impact, and data availability, in this work we focus on three representative levels as a principled sampling of this hierarchy. Importantly, this choice does not discard existing scientific frameworks, but rather reflects a consensus-based selection of the most representative and tractable scales.

    1. Organ-level simulations are essential because they connect microscale behaviors with macroscopic physiological functions. Dynamic processes such as cardiac contraction or vascular deformation are directly related to medical diagnosis, surgical planning, and education. A benchmark that evaluates these dynamics provides a direct path toward clinically relevant applications.
    2. Cellular-level simulations are central to biology and medicine, as cell migration, proliferation, and interaction underpin processes such as tissue growth, wound healing, and immune response. Accurate modeling at this level enables researchers and students to visualize and understand the driving forces of health and disease, creating opportunities for both discovery and pedagogy.
    3. Subcellular-level simulations present the most fine-grained view, capturing biochemical and biophysical mechanisms that govern life at its foundation, such as fusion, apoptosis, and signaling cascades. Evaluating generative models at this level is particularly important, as these processes are both visually subtle and mechanistically complex, requiring high fidelity and physical plausibility.

2.2 Prompt Suite

Both the sampling process of diffusion-based video generation models and the development of expert-driven evaluation rubrics are computationally expensive. To ensure efficiency, we control the number of tasks while maintaining diversity and coverage. The construction follows a two-stage pipeline: (1) collecting tasks related to microscale simulation from YouTube; and (2) expert filtering to retain only scientifically meaningful tasks. The final suite contains 459 tasks: 238 at the organ level, 189 at the cellular level, and 32 at the subcellular level. The proportion of tasks is consistent with the distribution of levels in the collected videos.

Collecting and Generating Prompts We retrieved over 8,000 YouTube videos using topic-specific queries related to organ-level, cellular-level, and subcellular-level simulations. For each video, we collected metadata including titles and descriptions. This information was then provided to GPT-4o, which generated tasks describing the microscale mechanism. In total, we generated 8,162 tasks. The prompts used to instruct GPT-4o are provided in Appendix J.

Expert Filtering We filtered the generated tasks based on two criteria: (1) the diversity of the tasks, and (2) the practical relevance of the tasks. For diversity, we asked GPT-4o to classify each task into one of the following categories: Organ-level simulations, Cellular-level simulations, or Subcellular-level simulations. For practical relevance, we invited three biology experts, and each task had to receive agreement from at least two of the three experts. A task was retained in MicroWorldBench only if it satisfied both criteria. Classification prompts are in Appendix K.

2.3 Rubric Criteria

As shown in Figure 2, each MicroWorldBench example includes a task instruction and rubric criteria, drafted by LLMs and refined by experts. These criteria evaluate scientific fidelity, visual quality, and instruction following. Scientific fidelity emphasizes mechanistic accuracy rather than visual realism. An LLM-based grader then scores the output, providing a standardized, interpretable assessment.

Due to limited expert availability and efficiency concerns, we adopt a collaborative approach where LLMs generate initial rubric drafts and experts perform revision and validation. This method not only improves the efficiency of rubric construction but also ensures broader coverage and more comprehensive consideration despite the small number of experts.

Stage 1: Rubric Drafts Generation For each task, GPT-5 generates a set of fine-grained criteria $P = \{(a_i, d_i, s_i, w_i)\}_{i=1}^{N}$, where $a_i$ denotes the evaluation dimension, $d_i$ is the description of the $i$-th criterion, $s_i \in \{+1, -1\}$ is the polarity indicating whether the point contributes ($+1$) or deducts ($-1$), and $w_i \in (0, 1]$ is the weight reflecting its importance (e.g., $w_i = 1.0$ for core scientific requirements, $w_i = 0.5$ for key but secondary requirements, and $w_i = 0.2$ for auxiliary or presentational points).

The score for each task is defined as $S = \sum_{i=1}^{N} s_i \cdot w_i$. To ensure comparability across tasks, we normalize it: $S_{\text{norm}} = \frac{S}{\sum_{i=1}^{N} w_i^{+}} \times 100$, where $\sum_i w_i^{+}$ is the maximum score attainable from the positive criteria, ensuring a maximum of 100 and preventing minor positives from offsetting severe scientific errors.
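The scoring scheme above can be sketched in a few lines of code. The names (`Criterion`, `score_task`) and the example criteria are illustrative, not taken from the released evaluation code:

```python
# Sketch of the MicroWorldBench rubric scoring scheme (Sec. 2.3).
# Criterion fields follow the paper: polarity s_i in {+1, -1}, weight w_i in (0, 1].
from dataclasses import dataclass

@dataclass
class Criterion:
    dimension: str    # a_i, e.g. "scientific_fidelity"
    description: str  # d_i
    polarity: int     # s_i: +1 contributes, -1 deducts
    weight: float     # w_i in (0, 1]

def score_task(criteria, satisfied):
    """Raw score S = sum(s_i * w_i) over satisfied criteria, normalized by
    the sum of positive weights so the maximum attainable score is 100."""
    raw = sum(c.polarity * c.weight for c, hit in zip(criteria, satisfied) if hit)
    positive_total = sum(c.weight for c in criteria if c.polarity > 0)
    return 100.0 * raw / positive_total

criteria = [
    Criterion("scientific_fidelity", "core mechanism shown", +1, 1.0),
    Criterion("scientific_fidelity", "violates membrane physics", -1, 1.0),
    Criterion("visual_quality", "clear rendering", +1, 0.2),
]
# Video satisfies the core criterion and the visual one, no violation:
print(score_task(criteria, [True, False, True]))  # → 100.0
```

Note how a single triggered deduction with $w_i = 1.0$ cancels the core positive criterion, so a scientifically wrong but pretty video cannot score well.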

Stage 2: Expert Revision and Validation Domain experts refine the LLM-generated rubric through the following actions:

  • Deleting or filtering criteria: Experts refine the criteria by modifying or removing descriptions $d_i$ that are redundant, irrelevant, or scientifically trivial.

  • Adjusting weights: When the weight of a criterion does not align with the scientific validity of the task, experts modify the corresponding weight $w_i$.

  • Supplementing criteria: If the automatically generated criteria fail to cover essential scientific dimensions, experts can introduce new tuples $(a_j, d_j, s_j, w_j)$.

We invited three experts to participate in the revision and validation process. Each expert first independently reviewed and modified the evaluation criteria, including adjusting weights, removing redundant items, and supplementing any missing dimensions. All modifications were documented with a clear rationale to ensure transparency. The proposed changes from all experts were then aggregated, and conflicts were resolved through discussion and majority voting. For more analysis of expert revision and validation, refer to Appendix C.

2.4 Evaluation Results and Analysis

Settings We evaluated video generation models on microscopic simulation tasks using MicroWorldBench, including open-source models (e.g., Wan2.1 Wan et al. (2025), HunyuanVideo Kong et al. (2024)) and commercial models (e.g., Sora OpenAI (2024), Veo3 Google DeepMind (2025)). Inference was conducted once per model under default settings to ensure fairness and consistent resolution. Rubric evaluation employed LLM-as-a-Judge Zheng et al. (2023), with GPT-5 serving as the judge. Configurations and sampling details are provided in Appendix E.

Table 1: Performance comparison of different video generation models on MicroWorldBench. Bold indicates the best performance.

Table 2: Performance comparison of different video generation models on MicroWorldBench (dimension-wise scores). Bold indicates the best performance.

Overall Results As shown in Table 1, the performance of different models varies significantly across organ-level, cellular-level, and subcellular-level tasks. Although commercial closed-source models, such as Veo3, substantially outperform open-source models in overall scores, their advantage is mainly confined to the visual quality dimension rather than scientific fidelity.

Visual Quality vs. Scientific Fidelity Table 2 shows that nearly all models achieve high scores in visual quality (80–97), yet their scientific fidelity lags far behind (most open-source models score only 15–43). This result demonstrates that current models often generate videos that “look right” but fail to strictly adhere to physical and biological laws.

Performance Differences Across Hierarchical Tasks Both advanced open-source models (e.g., Wan2.2-T2V-A14B) and top commercial models (Sora, Veo3) exhibit lower performance on cellular and subcellular tasks compared to organ-level simulations. This may be attributed to the higher requirements for physical and biological consistency in these tasks, as well as the scarcity of microscale training data that can capture complex dynamics.

Scale Effects in Open-Source Models Within the Wan series, increasing model size from 1.3B to 14B mainly improves visual quality, while scientific fidelity shows little significant growth. This suggests that expanding model parameters alone is not sufficient to solve the core scientific fidelity challenges in microscale simulation.

3 MicroVerse: Toward Microscale Simulation via an Expert-Verified Dataset

The results of MicroWorldBench indicate that current models remain limited in their ability to model microscale mechanisms governed by physical and biological principles. Most large-scale video datasets, such as InternVid Wang et al. (2023b), UCF101 Soomro et al. (2012), and OpenVid-1M Nan et al. (2024), primarily consist of natural scenes or human activities, offering little relevance to microscopic processes. To address this challenge, we propose a new microscale simulation model, termed MicroVerse, which explicitly incorporates physical grounding and fine-grained biological dynamics. A key prerequisite for developing such a model is the availability of domain-specific data that accurately capture microscopic processes with physical fidelity.

Image 3: Refer to caption

Figure 3: Overview of our data filtering pipeline. Each stage applies specific filters and shows the volume of data removed and retained.

3.1 Data Construction: MicroSim-10K

Collecting videos from YouTube We used the official YouTube API to search for videos related to microsimulation and filtered them based on the following criteria: (1) resolution of at least 720p; and (2) licensed under Creative Commons. These requirements ensure that the collected videos are suitable and freely available for training. In total, we obtained 12,848 relevant videos.

Splitting videos After obtaining the videos, we segmented them into multiple semantically consistent, short clips. We used OpenCLIP Ilharco et al. (2021) for video segmentation: whenever the similarity between adjacent frames fell below 0.85, a split was made. In total, 67,853 clips were generated. Since not all clips were related to microsimulation, we trained a classifier based on VideoMAE Tong et al. (2022) to filter them. The classifier achieved an accuracy of over 92%, significantly improving the quality of the dataset. With its help, 34,318 clips were filtered out. For details of the microsimulation clip classifier, refer to Appendix L.
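The splitting rule above can be sketched concisely. In the paper the frame embeddings come from OpenCLIP; here they are passed in precomputed so the rule itself is self-contained, and the function names are illustrative:

```python
# Similarity-based clip splitting (Sec. 3.1): place a cut wherever the cosine
# similarity between adjacent frame embeddings drops below the 0.85 threshold.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def split_points(frame_embeddings, threshold=0.85):
    """Return indices i where a new clip starts (cut between frames i-1 and i)."""
    return [i for i in range(1, len(frame_embeddings))
            if cosine(frame_embeddings[i - 1], frame_embeddings[i]) < threshold]

# Toy 2-D "embeddings": the scene changes between frames 1 and 2.
frames = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
print(split_points(frames))  # → [2]
```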

Automatic and expert filtering To improve the quality and physical consistency of the clips, we first applied OpenCV (https://github.com/opencv/opencv-python) to detect black borders and EasyOCR (https://github.com/JaidedAI/EasyOCR) to detect subtitles, filtering out clips whose semantic content was affected and retaining 12,194 clips. Experts then reviewed the data, removing meaningless or physically inconsistent clips, resulting in 9,601 clips.
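A black-border check of the kind described above can be approximated without OpenCV: a frame row or column whose mean intensity is near zero is treated as a letterbox border. The threshold (16/255) and band width are assumptions for illustration, not values from the paper:

```python
# Illustrative letterbox detector: inspect the outermost rows/columns of a
# grayscale frame and flag the clip if any edge band is nearly black.
def has_black_border(gray, thresh=16, band=2):
    """gray: 2-D list of 0-255 intensities; checks the outermost `band` rows/cols."""
    h, w = len(gray), len(gray[0])
    rows = gray[:band] + gray[-band:]
    cols = [[row[j] for row in gray] for j in list(range(band)) + list(range(w - band, w))]
    mean = lambda xs: sum(xs) / len(xs)
    return any(mean(line) < thresh for line in rows + cols)

clean = [[128] * 8 for _ in range(8)]
boxed = [[0] * 8, [0] * 8] + [[0, 0] + [128] * 4 + [0, 0] for _ in range(4)] + [[0] * 8, [0] * 8]
print(has_black_border(clean), has_black_border(boxed))  # → False True
```

In practice one would run this on frames decoded with OpenCV and drop clips where most frames trigger the check.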

Generating captions We leverage a multimodal LLM (GPT-4o) to generate detailed captions. Due to context limits, we uniformly sampled 8 frames per clip as visual input. To minimize hallucinations, we supply the video title and description.
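The uniform 8-frame sampling used for captioning reduces to picking evenly spaced indices over the clip; a minimal sketch (the function name is ours):

```python
# Uniform frame sampling for captioning (Sec. 3.1): draw k frames spread
# evenly across a clip of num_frames frames, always including both endpoints.
def uniform_frame_indices(num_frames, k=8):
    """Pick k frame indices spread evenly over [0, num_frames - 1]."""
    if num_frames <= k:
        return list(range(num_frames))
    step = (num_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]

print(uniform_frame_indices(120))  # → [0, 17, 34, 51, 68, 85, 102, 119]
```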

3.2 Data Statistics

3.2.1 Fundamental Attributes

Image 4: Refer to caption

(a) Distribution of Categories

Image 5: Refer to caption

(b) Distribution of Video Duration

Image 6: Refer to caption

(c) Distribution of Caption Length

Figure 4: Distributions of fundamental video attributes in the MicroSim-10K.

MicroSim-10K is the first large-scale dataset dedicated to microscale simulation, comprising 9,601 high-quality video clips. As shown in Figure4, all clips have a resolution of at least 720p and a duration of 5–60 seconds, ensuring that each captures a complete and coherent microscopic process. The dataset spans diverse biological mechanisms across organ, cellular, and subcellular levels, offering broad coverage of key scenarios. Each clip is paired with a detailed caption generated by a multimodal LLM and validated by experts, with an average length of around 150 words, providing precise semantic alignment for model training.

3.2.2 Popularity and Relevance

Image 7: Refer to caption

(a) Distribution of Video Views

Image 8: Refer to caption

(b) Distribution of Video Likes

Image 9: Refer to caption

(c) Distribution of Comments

Figure 5: Distributions of video popularity indicators in the MicroSim-10K.

To capture the educational and communicative value of microscale simulations, MicroSim-10K retains metadata such as views, likes, and comments. As shown in Figure5, the videos in MicroSim-10K have been widely viewed, with many reaching hundreds of thousands of views, and they have received substantial likes and comments, reflecting strong popularity and broad accessibility across both scientific and public communities.

3.2.3 Realism and Distribution

We compare the distribution of MicroSim-10K with real-world microscopy videos using Fréchet Video Distance (FVD). Using the method described in Section 3.1, we collected 377 real biological videos from YouTube and obtained 643 video clips after preprocessing. As shown in Table 3, the FVD between MicroSim-10K and real biological videos is 123.9. This result indicates that our expert-verified MicroSim-10K lies remarkably close to the real microscopy distribution in terms of visual statistics and structural descriptors, effectively bridging the gap between simulated and real experimental data.

Table 3: FVD comparison across models (lower is better). The FVD between MicroSim-10K and real biological videos is 123.9, indicating a close distributional alignment.
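FVD is a Fréchet distance between Gaussian fits of video features (an I3D network in the standard FVD recipe). The sketch below computes the Fréchet distance under a simplifying diagonal-covariance assumption; a full FVD implementation additionally needs the feature extractor and a matrix square root for full covariances:

```python
# Fréchet distance between two Gaussians with diagonal covariances:
# d^2 = |mu1 - mu2|^2 + sum_i (v1_i + v2_i - 2*sqrt(v1_i * v2_i))
import math

def frechet_diagonal(mu1, var1, mu2, var2):
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

# Identical feature distributions give distance 0; shifted means increase it.
print(frechet_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))  # → 0.0
print(frechet_diagonal([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0]))  # → 25.0
```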

3.3 Training MicroVerse

For training, we fine-tune the Wan2.1 model. A text prompt $P$ is encoded as a sequence $P = (p_0, p_1, \ldots, p_m)$, while the target video $V$ is decomposed into $T$ frames. Each frame is mapped into the latent space via a VAE Kingma and Welling (2013) encoder, yielding the sequence $L = (l_0, l_1, \ldots, l_T)$. The text input $P$ is transformed into embeddings $E$ using a CLIP text encoder, and the latent sequence $L$ is processed by a Diffusion Transformer (DiT) Peebles and Xie (2023).

The training objective is to predict the injected noise through a denoising diffusion process. At timestep $t$, the loss function is defined as:

$$\mathcal{L} = \mathbb{E}\left[\left\|\varepsilon - \varepsilon_{\theta}(L_t, t, E)\right\|^2\right], \tag{1}$$

where $L_t$ is the noisy latent representation at timestep $t$, $\varepsilon$ denotes the injected noise, $\varepsilon_{\theta}$ is the model’s noise prediction, and $E$ is the text embedding.

During fine-tuning, the text conditioning is entirely masked with probability 10%, enabling Classifier-Free Guidance (CFG) Ho and Salimans (2022) training. This mixture of unconditional and conditional training improves the generation quality of the model during inference.
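The conditioning dropout and the guidance step it enables can be sketched as follows; the names (`maybe_drop_condition`, `null_embedding`, `p_uncond`) are illustrative, and the real training loop would operate on tensors rather than lists:

```python
# CFG training (Sec. 3.3): with probability p_uncond, replace the text
# embedding with a null embedding before the denoiser sees it.
import random

def maybe_drop_condition(text_embedding, null_embedding, p_uncond=0.1, rng=random):
    """Return the null embedding with probability p_uncond, else the real one."""
    return null_embedding if rng.random() < p_uncond else text_embedding

# At inference, CFG combines the unconditional and conditional noise
# predictions: eps = eps_uncond + s * (eps_cond - eps_uncond).
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

print(cfg_combine([0.0, 0.0], [1.0, 2.0], 3.0))  # → [3.0, 6.0]
```

Training on both branches is what makes the unconditional prediction `eps_uncond` available at inference time.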

4 Experiments

Experiment Settings We train MicroVerse using 8 NVIDIA H200 GPUs, fully fine-tuning all parameters of Wan2.1-T2V-1.3B Wan et al. (2025) with a learning rate of 1e-5 and a batch size of 8. The training process is designed to improve the model’s capability to generate microscopic simulation videos conditioned on text prompts. We conducted a comparison with other models on MicroWorldBench. Additional training details are provided in Appendix D.

Human Evaluation To evaluate alignment with human preferences, we conducted a human study comparing MicroVerse with Sora and Veo3. The evaluation included 60 samples across three levels of microsimulation (20 samples per level), all sourced from the 20 most popular microsimulation videos on YouTube. Model outputs were randomly shuffled, and three evaluators independently selected the preferred result based on instruction fidelity and visual clarity, or marked a tie. The final results were reported as preference ratios.

4.1 Results of our MicroVerse

Improvement in Scientific Fidelity Table 2 shows that MicroVerse achieves a significant improvement in Scientific Fidelity, reaching a score of 43.0 and outperforming all open-source models. This enhancement is attributed to training on the physics-grounded MicroSim-10K dataset, which enables the model to better adhere to biological and physical laws. Although there is a slight decrease in Visual Quality (68.5) and Instruction Following (49.3), this does not affect our core objective: advancing scientific fidelity.

Breakthrough in Subcellular-Level Tasks According to Table 1, on the highly challenging subcellular-level tasks, MicroVerse achieves a score of 53.3, surpassing all open-source models. This demonstrates that our dataset enables MicroVerse to make notable progress on microscale simulation tasks where existing models typically struggle.

4.2 Analysis

Scaling Results We identify two main limitations in the performance of the 1.3B model: first, the improvement in scientific fidelity is relatively modest when fine-tuning a small-parameter model; second, there is a slight decline in visual quality and instruction-following capabilities. To address these issues, we scale the model parameters to 14B and employ a mixed-domain training strategy, combining MicroSim-10K with an equivalent amount of high-quality general-domain data randomly sampled from OpenVid Nan et al. (2024). As shown in Table 4, this dual scaling of both model capacity and data diversity significantly enhances performance across all dimensions, achieving state-of-the-art results among open-source models. For more ablation studies on dataset filtering, dataset size, and training recipes, please refer to Appendix B.

Table 4: Impact of Data Scale and Mixed-Domain Training on MicroVerse Performance. Bold indicates the best performance. FT indicates fine-tuning.

Human Evaluation Results Figure 6 shows the results of the human evaluation. Compared with the Wan2.1-1.3B model, MicroVerse performs excellently in the dimension of Scientific Fidelity, which further validates the effectiveness of MicroSim-10K. In addition, the Cohen’s Kappa coefficient among the three independent experts was above 0.80, indicating strong inter-rater agreement and confirming the reliability of the scoring process. More details on the Cohen’s Kappa coefficient can be found in Appendix G.
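Cohen’s kappa corrects raw agreement for the agreement expected by chance. The paper reports kappa above 0.80 across three experts (presumably averaged over rater pairs); a minimal two-rater sketch:

```python
# Cohen's kappa for two raters over categorical labels:
# kappa = (p_observed - p_expected) / (1 - p_expected)
from collections import Counter

def cohens_kappa(r1, r2):
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1.0 - expected)

a = ["win", "win", "tie", "win", "loss", "win"]
b = ["win", "win", "tie", "win", "win", "win"]
print(round(cohens_kappa(a, b), 3))  # → 0.6
```

Raw agreement here is 5/6, but kappa is only 0.6 because both raters choose “win” so often that much of that agreement is expected by chance.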

Image 10: Refer to caption

(a) Human Evaluation Result of MicroVerse and Wan2.1-1.3B

Image 11: Refer to caption

(b) Consistency between LLMs and Humans

Image 12: Refer to caption

(c) Consistency across different LLMs

Figure 6: Human Evaluation and Consistency Results.

Consistency among Judges in MicroWorldBench To ensure that MicroWorldBench’s evaluation aligns closely with human judgment across all dimensions, we conducted human preference labeling on a large set of generated videos. Specifically, we computed the agreement of evaluation outcomes across different judge models as well as between the models and humans. Figure 6 shows the consistency relationships among different models and between the models and humans. For more analysis of evaluation consistency, please refer to Appendix C.

5 Related Work

World Model World models LeCun (2022); Bruce et al. (2024); Lu et al. (2024) have garnered significant attention. They simulate dynamic environments by predicting future states and estimating rewards based on current observations and actions. Their ability to model state transitions has been extended to real-world scenarios through joint learning of policies and world models, improving sample efficiency in simulated robotics Seo et al. (2023), real-world robots Wu et al. (2022), clinical decision-making Yang et al. (2025), and autonomous driving Wang et al. (2023a). For example, some work Du et al. (2023) explores long-horizon video planning by combining vision–language and text-to-video models. Others Luo and Du (2024) focus on linking video models to continuous actions through goal-conditioned exploration. Recent works Lu et al. (2024) also use video generative models to let agents explore environments more effectively. MeWM Yang et al. (2025) applies world modeling to medical image analysis and clinical decision-making.

Video Generation Video generation has seen rapid progress in the past two years. The release of Sora OpenAI (2024) has ignited strong research interest in text-to-video generation, leading to breakthroughs in quality, coherence, and controllability Blattmann et al. (2023). Other commercial systems such as Veo3, Kling, HunyuanVideo Kong et al. (2024), and Hailuo HailuoAI (2024) have achieved impressive performance and are widely applied in video production, advertising, and education. With the technology maturing, domain-specific models are emerging to address specialized needs. For instance, MedGen Wang et al. (2025) generates accurate, high-quality medical videos for health education, while AniSora Jiang et al. (2025) focuses on producing detailed and stylistically rich animated content. Despite these advances, the use of video generation for microscale simulation remains largely unexplored.

Rubric Evaluation Rubric-based evaluation has become a standard approach for assessing LLMs on open-ended tasks, offering task-specific and interpretable criteria that improve grading consistency. HealthBench Arora et al. (2025) scales this paradigm to 5,000 multi-turn conversations with 48k clinician-authored rubrics covering accuracy, safety, and communication. Building on this, Baichuan-M2 Team (2025) dynamically generates case-specific rubrics as verifiable reward signals for reinforcement learning, enabling adaptive and context-aware supervision. Rubrics as Rewards (RaR) Gunjal et al. (2025) further formalizes rubric-based RL and shows significant gains over Likert-style scoring. These efforts highlight rubric-guided evaluation and training as a promising methodology for developing reliable and aligned LLMs.

6 Conclusions

Video generation models excel at natural and human-centered macroscopic scenes but fail to capture faithful microscale dynamics. This work introduces MicroWorldBench, the first rubric-based benchmark for microscale video generation, with 459 expert-curated tasks and well-defined rubric criteria. In addition, we build MicroSim-10K and develop MicroVerse, which demonstrates remarkable performance on microscale simulation tasks. By integrating physical constraints and expert supervision, MicroVerse not only improves visual fidelity but also advances toward biologically meaningful dynamics, enabling applications in biomedical research, education, and interactive scientific visualization.

Limitation

Our work aims to explore the potential of educational microscale simulations of biological mechanisms, rather than the reproduction of results observed in wet lab experiments. However, our current approach does not explicitly incorporate the underlying physical laws that govern biomedical microscale dynamics, such as fluid mechanics in blood flow, diffusion–reaction equations in molecular transport, or biomechanical constraints in cellular processes. This limitation restricts the applicability of the model in scenarios that require high-precision scientific simulation and prediction.

Ethics Statement

All data are publicly available and compliant with YouTube’s terms, and we exclude personal or sensitive content. Captions were auto-generated by MLLMs and manually verified to remove inappropriate or identifiable material. The dataset is intended strictly for research purposes and should not be used in non-research settings. We do not own the copyright to these data and will publicly release only the URLs linked to the data rather than the raw data.

Acknowledgements

This work was supported by Major Frontier Exploration Program (Grant No. C10120250085) from the Shenzhen Medical Academy of Research and Translation (SMART), Shenzhen Medical Research Fund (B2503005), the Shenzhen Science and Technology Program (JCYJ20220818103001002), NSFC grant 72495131, Shenzhen Doctoral Startup Funding (RCBS20221008093330065), Tianyuan Fund for Mathematics of National Natural Science Foundation of China (NSFC) (12326608), Shenzhen Science and Technology Program (Shenzhen Key Laboratory Grant No. ZDSYS20230626091302006), the 1+1+1 CUHK-CUHK(SZ)-GDSTC Joint Collaboration Fund, Guangdong Provincial Key Laboratory of Mathematical Foundations for Artificial Intelligence (2023B1212010001), the International Science and Technology Cooperation Center, Ministry of Science and Technology of China (under grant 2024YFE0203000), and Shenzhen Stability Science Program 2023.

References

  • N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y. Chen, Y. Cui, Y. Ding, et al. (2025) Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575.
  • R. K. Arora, J. Wei, et al. (2025) HealthBench: evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775.
  • A. Blattmann, S. Frolov, R. Rombach, and P. Esser (2023) Stable video diffusion: scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127.
  • T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh (2024) Video generation models as world simulators. OpenAI Research. https://openai.com/research/video-generation-models-as-world-simulators
  • J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al. (2024) Genie: generative interactive environments. In Forty-first International Conference on Machine Learning.
  • P. Dario, M. C. Carrozza, A. Benvenuto, and A. Menciassi (2000) Micro-systems in biomedical applications. Journal of Micromechanics and Microengineering 10 (2), pp. 235.
  • Y. Du, M. Yang, P. Florence, F. Xia, A. Wahid, B. Ichter, P. Sermanet, T. Yu, P. Abbeel, J. B. Tenenbaum, L. Kaelbling, A. Zeng, and J. Tompson (2023) Video language planning. arXiv preprint arXiv:2310.10625.
  • Google DeepMind (2025) Veo 3.
  • A. Gunjal, A. Wang, E. Lau, V. Nath, B. Liu, and S. Hendryx (2025) Rubrics as rewards: reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746.
  • D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi (2020) Dream to control: learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603.
  • HailuoAI (2024) HailuoAI.
  • X. He, D. Jiang, G. Zhang, M. Ku, A. Soni, S. Siu, H. Chen, A. Chandra, Z. Jiang, A. Arulraj, et al. (2024) VideoScore: building automatic metrics to simulate fine-grained human feedback for video generation. arXiv preprint arXiv:2406.15252.
  • J. Ho and T. Salimans (2022) Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
  • Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, et al. (2024) VBench: comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21807–21818.
  • G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V. Shankar, H. Namkoong, J. Miller, H. Hajishirzi, A. Farhadi, and L. Schmidt (2021) OpenCLIP.
  • Y. Jiang, B. Xu, S. Yang, M. Yin, J. Liu, C. Xu, S. Wang, Y. Wu, B. Zhu, X. Zhang, X. Zheng, J. Xu, Y. Zhang, J. Hou, and H. Sun (2025) AniSora: exploring the frontiers of animation video generation in the sora era. arXiv preprint arXiv:2412.10255.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, et al. (2024) HunyuanVideo: a systematic framework for large video generative models. arXiv preprint arXiv:2412.03603.
  • Y. LeCun (2022) A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review 62 (1), pp. 1–62.
  • T. Lu, T. Shu, A. Yuille, D. Khashabi, and J. Chen (2024) Generative world explorer. arXiv preprint arXiv:2411.11844.
  • Y. Luo and Y. Du (2024) Grounding video models to actions through goal conditioned exploration. arXiv preprint arXiv:2411.07223.
  • K. Nan, R. Xie, P. Zhou, T. Fan, Z. Yang, Z. Chen, X. Li, J. Yang, and Y. Tai (2024) OpenVid-1M: a large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371.
  • OpenAI (2024) Video generation models as world simulators.
  • W. Peebles and S. Xie (2023) Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205.
  • Z. Qu, A. Garfinkel, J. N. Weiss, and M. Nivala (2011) Multi-scale modeling in biology: how to bridge the gaps between scales? Progress in Biophysics and Molecular Biology 107 (1), pp. 21–31.
  • G. Romme (2002) The educational value of microworld simulation. Tilburg University, Netherlands.
  • Y. Seo, D. Hafner, H. Liu, F. Liu, S. James, K. Lee, and P. Abbeel (2023) Masked world models for visual control. arXiv preprint arXiv:2206.14244.
  • K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
  • B. Team (2025) Baichuan-M2: scaling medical capability with large verifier system. arXiv preprint arXiv:2509.02208.
  • Z. Tong, Y. Song, J. Wang, and L. Wang (2022) VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems 35, pp. 10078–10093.
  • T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. (2025) Wan: open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314.
  • R. Wang, J. Chen, K. Ji, Z. Cai, S. Chen, Y. Yang, and B. Wang (2025) MedGen: unlocking medical video generation by scaling granularly-annotated medical videos. arXiv preprint arXiv:2507.05675.
  • X. Wang, Z. Zhu, G. Huang, X. Chen, J. Zhu, and J. Lu (2023a) DriveDreamer: towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777.
  • Y. Wang, Y. He, Y. Li, K. Li, J. Yu, X. Ma, X. Li, G. Chen, X. Chen, Y. Wang, et al. (2023b) InternVid: a large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942.
  • B. Y. White (1992) A microworld-based approach to science education. In New Directions in Educational Technology, pp. 227–242.
  • P. Wu, A. Escontrela, D. Hafner, K. Goldberg, and P. Abbeel (2022) DayDreamer: world models for physical robot learning. arXiv preprint arXiv:2206.14176.
  • Y. Yang, Z. Wang, Q. Liu, S. Sun, K. Wang, R. Chellappa, Z. Zhou, A. Yuille, L. Zhu, Y. Zhang, et al. (2025) Medical world model: generative simulation of tumor evolution for treatment planning. arXiv preprint arXiv:2506.02327.
  • Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, et al. (2024) CogVideoX: text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072.
  • D. Zheng, Z. Huang, H. Liu, K. Zou, Y. He, F. Zhang, Y. Zhang, J. He, W. Zheng, Y. Qiao, et al. (2025) VBench-2.0: advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755.
  • L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica (2023) Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.

Appendix A Data Filtering Pipeline

As illustrated in Figure 7, our data construction process is designed to ensure high fidelity and semantic richness. We start with an initial collection of 128K microsimulation videos. To guarantee data quality, we implement a rigorous cleaning protocol that filters out non-microsimulation samples and eliminates temporal inconsistencies. Furthermore, we utilize OpenCV to crop black borders and employ EasyOCR to detect and remove hard-coded subtitles, thereby reducing visual noise. The preprocessed videos undergo a human-in-the-loop verification stage, where experts assess the content for meaningfulness and consistency. In the final stage, we leverage Multimodal Large Language Models (MLLMs) to automatically generate descriptive captions, resulting in a high-quality dataset suitable for downstream representation learning tasks.
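The black-border cropping step can be illustrated with a minimal luminance-scan sketch. This is a hypothetical stand-in for our OpenCV-based implementation, operating on a single grayscale frame represented as a 2D list of pixel intensities:

```python
def crop_black_borders(frame, threshold=10):
    """Crop rows/columns whose brightest pixel is below `threshold`.

    `frame` is a 2D list of grayscale values (0-255). Returns the
    cropped sub-frame, or the original frame if it is entirely dark.
    """
    # Rows and columns that contain at least one non-dark pixel.
    rows = [i for i, row in enumerate(frame) if max(row) > threshold]
    cols = [j for j in range(len(frame[0]))
            if max(row[j] for row in frame) > threshold]
    if not rows or not cols:
        return frame
    # Keep the bounding box spanned by the first/last bright row and column.
    return [row[cols[0]:cols[-1] + 1] for row in frame[rows[0]:rows[-1] + 1]]
```

In practice one would apply the same bounding box to every frame of a clip so that the video keeps a constant resolution after cropping.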


Figure 7: Overview of the data construction process.

To ensure absolute data purity and avoid potential data leakage, we implemented a rigorous three-level deduplication pipeline. As shown in Table 5, we found zero overlap at high similarity thresholds across source, text, and vision levels.
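As an illustration of the text-level stage, a greedy threshold-based deduplication can be sketched as follows. This is a simplified, hypothetical version using word-set Jaccard similarity; the actual pipeline compares source identifiers, text embeddings, and visual features:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word sets (a simple text-level proxy)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dedup(captions, threshold=0.60):
    """Greedily keep captions whose similarity to every kept one is below `threshold`."""
    kept = []
    for c in captions:
        if all(jaccard(c, k) < threshold for k in kept):
            kept.append(c)
    return kept
```

With the lowered threshold of 0.60 used in our analysis, near-duplicate captions such as paraphrases of the same task description would be collapsed into a single entry.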

Table 5: Three-level deduplication pipeline results.

Furthermore, we lowered the text-level and vision-level thresholds to 0.60 to help remove potentially overlapping tasks and ensure absolute data purity. Table 6 shows the task distribution after deduplication.

Table 6: Task distribution after deduplication.

We re-evaluated model performance on the clean test set (445 tasks), as shown in Table 7. The scores remained highly stable, with only negligible (decimal-level) drops; we therefore retained the full 459-task benchmark.

Table 7: Performance comparison on original vs. clean test set.

Regarding private models like Veo3, we acknowledge they may have seen YouTube data. However, MicroWorldBench evaluates scientific fidelity, not just generalization. Even if a model has seen similar data, generating a simulation that strictly adheres to our expert-weighted physical rubrics remains a valid measure of capability. We plan to expand the benchmark with specialized microscopy databases in the future to broaden the domain distribution.

Appendix B Ablation Study

In this section, we present a series of ablation studies to evaluate the impact of different components and hyperparameters on the model’s performance.

Ablation Study on dataset filtering.

Filtered data (MicroSim-10K) yields more balanced and reliable improvements than using raw, uncleaned data, boosting scientific fidelity while avoiding major drops in visual quality and instruction following. Table 8 summarizes the results.

Table 8: Ablation study on dataset filtering.

Ablation Study on dataset size.

Increasing the size of high-quality training data yields steady gains in scientific fidelity and instruction following with only minor visual-quality tradeoffs. Table 9 summarizes the results.

Table 9: Ablation study on dataset size.

Ablation Study on training recipes (CFG rate).

Excessive CFG sharply degrades scientific fidelity and visual quality. Table 10 summarizes the results.

Table 10: Ablation study on CFG rate.

Ablation Study on training recipes (training steps).

As shown in the training curves, the model converges rapidly and stabilizes around 5,000 steps. Therefore, we adopt this number of training steps as the default setting for reporting efficiency.

Ablation Study on training recipes (number of frames).

We selected 81 frames as a balanced trade-off:

  • It provides sufficient temporal duration to capture complete biological events (e.g., cell division) while maintaining high temporal coherence and manageable computational cost.

  • Additionally, 81 frames is the native, optimal temporal window for the Wan2.1 architecture Wan et al. (2025).

Appendix C Analysis of the Process of Expert Revision and Validation

All experts involved in our evaluation hold doctoral degrees (Ph.D.) and possess extensive research experience in cellular and molecular biology. Table 11 provides the background information for each expert.

Table 11: Background information of the experts involved in the evaluation.


(a) Expert 1


(b) Expert 2


(c) Expert 3

Figure 8: Actions Log of Three Independent Experts.

We analyzed the frequency with which three experts employed the four types of rubric operations when handling different tasks. As shown in Figure 8, all experts tended to favor Adjust Weights, while Supplement was used relatively infrequently. Follow-up interviews with the three experts revealed that the Supplement operation is more cumbersome, as it requires identifying additional evaluation criteria beyond those automatically generated by the LLM, which can introduce extra burden.

As detailed in Table 12, experts performed three specific types of interventions to ensure biological and physical processes were captured: (1) filtering scientifically trivial criteria; (2) adjusting weights (e.g., setting w_i = 1.0) to prioritize underlying scientific mechanisms over superficial visual quality; and (3) supplementing missing physical constraints. As shown in Table 12, only about 7.6% of the 459 tasks required experts to add missing scientific criteria. This indicates that GPT-5’s initial rubric drafts already captured the essential mechanisms, while experts played a crucial role in refining the final outputs.
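The weight-adjustment operation presumes a weighted aggregate over rubric criteria. A minimal sketch of such a scoring function follows; this is a hypothetical illustration, and the exact aggregation used by the judge may differ:

```python
def rubric_score(criteria):
    """Weighted rubric score in [0, 1].

    `criteria` is a list of (weight, grade) pairs, where grade in [0, 1]
    is the judge's assessment of one criterion. Raising a criterion's
    weight (e.g., to 1.0) increases its influence on the final score,
    which is how expert weight edits prioritize scientific mechanisms.
    """
    total = sum(w for w, _ in criteria)
    return sum(w * g for w, g in criteria) / total if total else 0.0
```

Under this formulation, down-weighting superficial visual criteria and up-weighting mechanistic ones directly shifts the score toward scientific fidelity.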

Table 12: Detailed expert actions during the rubric revision process.

Furthermore, Table 13 shows that experts intervened more frequently in subcellular-level tasks to enforce strict physical constraints where GPT-5 is less reliable, ensuring scientific accuracy.

Table 13: Expert intervention frequency across different biological scales.

Furthermore, we conducted a correlation analysis between human experts and LLMs to validate the reliability of automated evaluation. Three domain experts independently evaluated a subset of videos, and we quantified the scoring agreement between a human-only panel and a panel including GPT-5. As shown in Table 14, the inclusion of GPT-5 maintains or even slightly improves the inter-rater agreement (Fleiss’ Kappa).

We also report the agreement across different biological scales in Table15. Overall, GPT-5 shows agreement with expert scoring at a level comparable to human evaluators, and in some cases its consistency is even slightly higher. This suggests that GPT-5 is a reliable evaluator rather than a source of additional variance.

Table 14: Scoring agreement between human experts and LLMs.

Table 15: Agreement across biological scales.

Appendix D Training Settings on MicroVerse

We implemented our training pipeline utilizing a high-performance computational node equipped with 8 NVIDIA H200 GPUs. To ensure the model fully captures the intricate dynamics and visual nuances specific to the microscopic domain, we adopted a full parameter fine-tuning strategy on the pre-trained Wan2.1-T2V-1.3B model.

We utilized a global batch size of 8 and set the learning rate to 1×10⁻⁵ with a weight decay of 0.01 to prevent overfitting. To maximize computational efficiency and memory utilization without compromising numerical precision, we employed bfloat16 (bf16) mixed-precision training. Furthermore, full gradient checkpointing was enabled to significantly reduce the memory footprint during backpropagation.

For the visual data configuration, the model was trained to generate high-fidelity video sequences at a resolution of 480×832 pixels with a temporal duration of 81 frames. A Classifier-Free Guidance (CFG) conditioning-dropout rate of 0.1 was applied to randomly drop the text conditioning during training, thereby reinforcing the model’s capability for unconditional generation and improving overall robustness. Detailed training configurations are enumerated in Table 16.
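The conditioning-dropout step can be sketched as follows. This is a hypothetical stand-in; the real trainer drops conditioning at the embedding level within each batch rather than on raw caption strings:

```python
import random

def apply_cfg_dropout(captions, rate=0.1, seed=None):
    """Replace each caption with the empty (null) condition at probability `rate`.

    Training on a mixture of conditioned and unconditioned samples is what
    lets the model serve both branches of classifier-free guidance at
    inference time.
    """
    rng = random.Random(seed)
    return ["" if rng.random() < rate else c for c in captions]
```

With a rate of 0.1, roughly one in ten training samples sees the null condition, which is a common setting for CFG-style training.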

Table 16: Detailed hyperparameters used during the training phase of MicroVerse.

Appendix E Inference Settings on MicroVerse

To ensure a rigorous and unbiased assessment of generation quality, we standardized the inference protocol across all evaluated models. The inference configuration strictly mirrors the training resolution settings (480×832, 81 frames) to avoid potential domain shifts during the generation phase.

We employed a standard sampling strategy with 50 inference steps, striking an optimal balance between generation latency and visual fidelity. A guidance scale of 5.0 was selected to ensure strong adherence to the text prompts while maintaining natural visual diversity and avoiding artifacts associated with excessive guidance.

Crucially, to uphold the integrity of our evaluation benchmark, we enforced a strict single-shot inference policy. For each test prompt, only a single video sample was generated using a fixed random seed. This approach eliminates the possibility of cherry-picking—selecting the best output from multiple attempts—thereby providing an honest reflection of the model’s stability and generalized performance. The comprehensive parameter settings for inference are listed in Table 17.
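For reference, the guidance scale enters at each sampling step through the standard classifier-free guidance update, which extrapolates from the unconditional toward the conditional prediction. A minimal numeric sketch, operating element-wise on plain lists rather than the model’s actual tensors:

```python
def cfg_combine(eps_uncond, eps_cond, scale=5.0):
    """Classifier-free guidance: eps = eps_uncond + scale * (eps_cond - eps_uncond).

    scale = 1.0 recovers the purely conditional prediction; larger values
    push the sample harder toward the text prompt, at the risk of the
    over-guidance artifacts noted in the text.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

A scale of 5.0, as used here, is a moderate setting that strengthens prompt adherence without the saturation typical of much larger values.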

Table 17: Standardized inference parameter settings for model evaluation.

Appendix F Prompt GPT-5 to Generate Rubric Criteria

Appendix G Inter-Rater Reliability Among Human Evaluators

We used Cohen’s Kappa coefficient to measure pairwise agreement among the three experts. Table 18 indicates strong agreement among the experts, confirming the reliability of the scoring process.
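Cohen’s kappa for a pair of raters can be computed directly from their label sequences; a small self-contained sketch (illustrative only, equivalent in spirit to standard library implementations):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa between two raters' label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from the raters' marginal label rates.
    """
    assert len(r1) == len(r2) and r1, "sequences must be non-empty and aligned"
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[l] * c2[l] for l in set(c1) | set(c2)) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

Kappa equals 1 under perfect agreement and 0 when agreement is no better than chance, which is why it is preferred over raw percent agreement for rater reliability.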

Table 18: Comparison of Cohen’s Kappa values between experts.

Appendix H A Rubric Example from MicroWorldBench


Figure 9: Example of a cell mitosis simulation video frame, taken from an excellent example video on YouTube.

Table 19: Rubric Example for Cell Mitosis Evaluation

Appendix I Example of real biological video clips


Figure 10: Example of real biological video clips.

Appendix J Generate Prompt from the Video Title and Description

Appendix K Prompt LLM Classifies Based on Task Descriptions

Appendix L VideoMAE-Based Microsimulation Classifier

To filter video clips related to microsimulation, we trained a classifier based on the VideoMAE model using 2,580 manually annotated samples. Training was implemented with the Transformers library, with a learning rate of 5e-5 and a total of 10 epochs, enabling the model to effectively capture video features for accurate classification. Our classifier achieves an accuracy of 92% on the test set.

Table 20: Dataset statistics for microsimulation classification.
