Crafter World Model (Beta):

Update:

Action sensitivity appears to be fixed in the current training setup, and the project has now moved into a beta phase.

The main system is now reliably action-conditioned enough to expose publicly as a serious research prototype rather than just an interesting video generator. The current focus is no longer “does the model use actions at all?” but instead:

  • improving rollout quality over longer horizons
  • reducing residual failure modes
  • cleaning up the training code and notebook structure
  • making the project easier for other researchers to reproduce and build on

A lot of that can likely be fixed simply by scale, but I want to clean things up first.

Action Sensitivity Update

Current rollout behaviour:

Training progress

ODE progress

Rollout progress


What this repository is:

This repository contains my current work on an action-conditioned world model for Crafter, forming the first phase of a broader research agenda around:

  • model-based reinforcement learning.
  • imagination-based control.
  • long-horizon planning.
  • sparse-reward environments.
  • scalable world models trained on consumer hardware.

This project is currently best understood as a research prototype in beta.

It already includes a usable tokenizer, a latent-space action-conditioned world model, exported checkpoints, and an interactive demo/web app path. It is not yet the full agent stack. In particular, this repository mainly implements the first part of the Dreamer-4 style pipeline: the tokenizer and the action-conditioned world model. It does not yet include the downstream RL-in-imagination / agent-learning stage.


Current status:

The current system can:

  • compress Crafter observations into compact latent tokens.
  • model future latent dynamics conditioned on actions.
  • generate coherent multi-step rollouts.
  • decode those rollouts back into plausible video.
  • expose the model through an interactive imagined-game demo.

Earlier versions of this project could generate convincing futures without really following the supplied action sequence. That was the central bottleneck. The current setup appears to have resolved that issue well enough for public beta release.

That said, the model is still not perfect. Remaining weaknesses include:

  • small-object confusion.
  • some inventory and HUD detail errors.
  • considerable object-location drift.
  • mixing of similar sprites or structures.
  • degradation over longer autoregressive rollouts (can sometimes end stuck surrounded by stone).

So this is a serious, working beta research system, not a final benchmarked product.


Hardware note:

A major goal of this project is to show that meaningful world-model research can be done on consumer hardware.

This work was trained on a single RTX 3090 (24 GB).

The setup should also be feasible on 3060-class (12 GB) GPUs with smaller microbatches, at the cost of training speed.
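The usual trick for fitting smaller microbatches is gradient accumulation: split each batch into microbatches that fit in memory and accumulate gradients before stepping, so the effective batch size stays the same. The sketch below is a generic PyTorch pattern, not this repo's actual training loop; `model`, `loss_fn`, and `microbatches` are placeholders.

```python
import torch

def train_step(model, optimizer, loss_fn, microbatches):
    # Generic gradient-accumulation sketch (placeholder names, not the
    # repo's actual training loop): process several small microbatches,
    # then take one optimizer step with the averaged gradient.
    optimizer.zero_grad()
    n = len(microbatches)
    for x, y in microbatches:
        loss = loss_fn(model(x), y) / n  # average loss over microbatches
        loss.backward()                  # gradients accumulate in .grad
    optimizer.step()
```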


Project goal:

The immediate goal is to learn a world model that can:

  1. compress Crafter observations into useful latent tokens.
  2. model future latent dynamics conditioned on actions.
  3. produce multi-step rollouts that are both visually coherent and action-faithful.

The longer-term goal is to use these learned dynamics for:

  • planning.
  • control.
  • reinforcement learning in imagination.
  • eventually more general agents that can reason over imagined futures.

Relation to prior work:

This project is strongly inspired by recent scalable world-model work, especially the combination of:

  • causal or masked video tokenizers.
  • latent-space dynamics models.
  • action-conditioned rollout generation.
  • evaluation through rollout quality and action sensitivity rather than reconstruction alone.

It is inspired by Dreamer-4 style work, but it is not a full reproduction, and it currently covers only the world-modeling part of the pipeline, not the later RL agent-training phase.

It also draws from work on:

  • Crafter as a benchmark for sparse-reward, compositional environments.
  • masked autoencoders as tokenizers for generative models.
  • diffusion / shortcut-style training for latent dynamics.

What is included:

This repository currently includes code and assets for:

  • Crafter data collection.
  • causal MAE tokenizer pretraining.
  • latent world-model pretraining.
  • evaluation and diagnostics.
  • interactive imagination demo / web-app deployment path.
  • exported checkpoints in PyTorch, safetensors, and ONNX formats.

It also includes supporting outputs such as:

  • rollout visualisations.
  • validation plots.
  • action sensitivity plots.
  • failure-mode examples.
  • exported checkpoints under checkpoints/.

Included checkpoints and exports:

The current exported files live under checkpoints/ and include:

  • mae_model.safetensors
  • mae_decode.onnx
  • world_model.safetensors
  • world_model_ema.safetensors
  • world_model.onnx

These are intended to support:

  • direct checkpoint download.
  • lightweight inference experiments.
  • Hugging Face Spaces deployment.
  • future ONNX Runtime / browser / API-based demos.

Interactive demo / app:

This repo includes a usable interactive demo path for testing the world model as an imagined game.

The basic idea is:

  • start from a real context window from the Crafter dataset.
  • choose an action.
  • predict the next latent frame with the world model.
  • decode it through the MAE decoder.
  • feed the prediction back into context.
  • continue rolling forward open-loop.

This is not the real Crafter environment running underneath. It is the model’s imagined continuation of the game.

The Hugging Face Space version is intended to make this easy to test without needing to run the training code locally.

Intended controls:

  • Arrow keys / WASD: movement.
  • Space: interact / do.
  • Tab: noop.
  • Shift: sleep.
  • 1–0: place / craft actions.
  • R: reset.
  • G: save gif/json in the local demo version.

Training data:

The current beta model was trained primarily on Crafter human expert data.

That choice was deliberate. At this stage, I wanted to maximize the density of meaningful action-conditioned transitions. The full plan is to make this setup fairly game-agnostic: collect data with random policies, train in imagination, gather new data with the improved policy, and retrain, forming a feedback loop.

A later stage of the project will revisit broader or more mixed data collection, including more game-agnostic or random-policy style data, but the current release is mainly built around the human expert regime. This is only 100 episodes of human expert gameplay, so it is not a huge dataset by any means.


Data collection policy:

This repository also includes my current Crafter data-collection policy code, which was designed to improve action-conditioned learning by increasing the fraction of transitions where actions produce meaningful state changes.

Key ideas include:

  • stuck detection through frame-change heuristics.
  • forced interaction bursts when the agent appears stuck.
  • periodic interleaving of interaction actions during movement.
  • adaptive exploration behaviour.
  • cleaner action classification logic to avoid action-name matching bugs.
  • shard-based storage with episode metadata, gifs, and achievement events.

This was all mainly to improve achievement coverage while remaining a fairly game-agnostic loop (explore, try to interact, try to craft). It gets 10/22 achievements on average, with some shards reaching 14.
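As an illustration of the stuck-detection idea above, the heuristic could look something like the following; the window size and change threshold are made-up values, not the ones used in this repo.

```python
import numpy as np

class StuckDetector:
    # Frame-change stuck heuristic (illustrative values, not the repo's):
    # if the mean pixel change stays tiny over a window of recent steps,
    # the agent is probably stuck and an interaction burst can be forced.
    def __init__(self, window=8, change_threshold=1.0):
        self.window = window
        self.change_threshold = change_threshold
        self.changes = []

    def update(self, prev_frame: np.ndarray, frame: np.ndarray) -> bool:
        """Return True when recent frames have barely changed."""
        diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
        self.changes.append(float(diff.mean()))
        self.changes = self.changes[-self.window:]
        return (len(self.changes) == self.window
                and max(self.changes) < self.change_threshold)
```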


Main components:

1. Causal MAE tokenizer:

The tokenizer is a causal masked autoencoder trained on Crafter frame sequences.

Main properties:

  • independent masking across frames (with possible future experiments combining higher masking ratios and tube masking).
  • spatial self-attention within frames.
  • periodic temporal causal attention.
  • bottlenecked latent representation.
  • MAE-style masked reconstruction objective.
  • LPIPS-assisted reconstruction training.
  • latent outputs intended for downstream world modeling, not just pretty decoding.
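To make the masking distinction concrete, here is a minimal sketch contrasting independent per-frame masking with tube masking (the same patches hidden in every frame). The shapes and the 75% ratio are illustrative, not this tokenizer's exact configuration.

```python
import torch

def frame_masks(num_frames, num_patches, mask_ratio=0.75, tube=False):
    # Sample which patches stay visible. With tube=False every frame gets
    # its own random mask; with tube=True all frames share one mask.
    # Ratio and shapes are illustrative, not the repo's exact settings.
    keep = int(num_patches * (1 - mask_ratio))
    if tube:
        noise = torch.rand(1, num_patches).expand(num_frames, -1)  # shared
    else:
        noise = torch.rand(num_frames, num_patches)                # per frame
    ranks = noise.argsort(dim=1).argsort(dim=1)  # rank of each patch's noise
    return ranks < keep  # True = patch is kept (visible to the encoder)
```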

A major lesson from this project so far is that the tokenizer matters a lot more than it may seem at first.

In particular:

  • high masking turned out to be important.
  • lower masking can give cleaner-looking reconstructions while producing worse downstream action sensitivity.
  • decoder quality alone is not a sufficient measure of whether the latent space is good for dynamics.

So although the decoder is mostly used for visualisation, the latent space quality is still critical, because the world model operates in that latent space.


2. Latent world model:

The world model is trained in latent space using an action-conditioned architecture based around a DiT-style token backbone.

Current ingredients include:

  • action-conditioned latent prediction.
  • shortcut-forcing style training.
  • bucketed context / prediction-length sampling.
  • autoregressive rollout evaluation.
  • action-sensitivity diagnostics.
  • EMA checkpointing.
  • validation across multiple (context, prediction) regimes.
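Bucketed (context, prediction)-length sampling can be sketched as below; the bucket list and the uniform choice over buckets are illustrative assumptions, not the actual schedule used here.

```python
import random

# Illustrative (context_len, pred_len) buckets, not the repo's schedule.
BUCKETS = [(8, 1), (8, 4), (16, 8), (32, 16)]

def sample_window(episode_len, rng=random):
    # Pick a bucket, then a random window inside the episode that fits
    # both the context segment and the prediction segment.
    ctx, pred = rng.choice(BUCKETS)
    start = rng.randrange(0, episode_len - (ctx + pred) + 1)
    return start, start + ctx, start + ctx + pred  # [start, ctx_end, pred_end)
```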

The world model now produces:

  • coherent future rollouts.
  • much better action sensitivity than earlier versions.
  • usable imagined-game behaviour in open loop.

This is the main milestone that moved the project into beta.


3. Diagnostics and evaluation:

I track progress with multiple diagnostics rather than relying on training loss alone.

These include:

  • fixed-noise / denoising validation.
  • ODE-style reconstruction validation.
  • autoregressive rollout evaluation.
  • action sensitivity evaluation.
  • rollout gifs.
  • failure-case inspection.
  • multi-regime validation over several context/prediction bucket pairs.
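One way to picture the action-sensitivity evaluation: roll the model forward under the true actions and under a shifted copy of them, then compare the resulting latents. A model that ignores actions scores near zero. The `predict` callable and its signature are assumptions standing in for the world model, not the repo's API.

```python
import numpy as np

def action_sensitivity(predict, context, true_actions):
    # Roll out twice from the same context: once with the true actions,
    # once with a shifted copy, and measure the mean latent divergence.
    # `predict(context, action)` is a placeholder for the world model.
    alt_actions = np.roll(true_actions, 1)
    ctx_t, ctx_a, diffs = list(context), list(context), []
    for a_t, a_a in zip(true_actions, alt_actions):
        z_t = predict(ctx_t, a_t)
        z_a = predict(ctx_a, a_a)
        diffs.append(np.abs(z_t - z_a).mean())
        ctx_t = ctx_t[1:] + [z_t]
        ctx_a = ctx_a[1:] + [z_a]
    return float(np.mean(diffs))  # ~0 if the model ignores actions
```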

This matters because a model can look good in one metric while still failing in the behaviour I actually care about.


4. Latent-space analysis:

I am also investigating better ways to reason about what makes a good latent space for downstream world modeling.

The current exploratory tooling includes:

  • UMAP visualisation of latent structure.
  • GMM complexity analysis over latent features.
  • checkpoint-to-checkpoint latent comparisons.
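A minimal sketch of the GMM complexity probe, assuming pooled latent vectors and scikit-learn: fit mixtures of increasing size and look at where BIC stops improving. The component range and covariance type are illustrative; the UMAP step would be analogous but needs the umap-learn package.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_complexity(latents: np.ndarray, max_components=8, seed=0):
    # Fit GMMs of increasing size to pooled latent vectors and return
    # the component count with the lowest BIC, plus all BIC scores.
    # Settings here are illustrative, not the repo's actual analysis.
    bics = []
    for k in range(1, max_components + 1):
        gm = GaussianMixture(n_components=k, covariance_type="diag",
                             random_state=seed).fit(latents)
        bics.append(gm.bic(latents))
    return int(np.argmin(bics)) + 1, bics
```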

At the moment this remains exploratory. I do not yet think I fully understand how to interpret these plots in a way that is directly actionable for world-model training, but I think it is an important direction.

There is space in this repo for that analysis to become much more systematic over time.

Below are example UMAPs of latent spaces known to have good and bad action sensitivity, though I am not sure these are the best way to probe such spaces:

Example latent analysis:

Good action sensitivity: UMAP of good latent space

Bad action sensitivity: UMAP of bad latent space

Good action sensitivity with samples: UMAP of good latent space with examples

I need to properly go over these models. I think the GMM is currently set up incorrectly, and the UMAPs are not really comparable, as they are computed on different splits.


Current strengths:

The current beta model already shows several encouraging properties:

  • coherent latent rollouts.
  • meaningful action conditioning.
  • usable open-loop imagination.
  • multi-step rollout generation.
  • stable training on a single consumer GPU.
  • a clear path to demo deployment through Hugging Face Spaces.

Current failure modes:

The project is still very much an active research system, and several failure modes remain important.

Failure modes:

Typical world-model failures include:

  • snapping away from the chosen movement direction.
  • object-position drift across rollout steps.
  • poor spatial memory (it can lock the player inside a wall of stone).
  • rare details disappearing (a tokenizer issue; I am not sure whether the encoder or the decoder is at fault).
  • NPCs and tiles becoming blurry or unstable and/or swapping places (moving sand/stone or coal).
  • confusion between furnaces, crafting tables, and similar sprites.
  • imperfect preservation of object identity.
  • occasional loss of fine HUD detail.

Example failure cases:

Failure mode: arrows and NPC swap
Failure mode: enemy, furnace, and tree blurring
Failure mode: furnace and crafting issues
Failure mode: action following and object position (images/action_following_and_object_position.gif)

These examples are included deliberately. I do not want the repo to present only the successes. The failure modes are a major part of the research story.


Why the tokenizer matters so much:

One of the clearest takeaways from this work is that a tokenizer can look visually decent while still being a poor substrate for dynamics learning.

A world model does not need the prettiest decoder output. It needs latents that preserve the distinctions required for:

  • causality.
  • controllability.
  • object identity.
  • local change.
  • action consequence.

That is why masking level, bottleneck structure, and latent organisation matter so much here.

My current view is that a good world-model tokenizer is not just a compression model. It is part of the dynamics-learning problem.


Repository state:

A few caveats up front:

  • the training code is still a bit messy.
  • some scripts were written for active notebook-based iteration.
  • local paths may need editing before reuse.
  • there are still older comments, experimental branches, and rough edges.
  • names and interfaces may change as the project is cleaned up.

I am still sharing it because the core technical direction is now clear and useful.

Cleaning up the code for a more polished release is one of the next major tasks.


Intended direction:

My aim is for this repository to become a strong base for other researchers who want to work on:

  • world models.
  • latent dynamics.
  • imagination-based planning.
  • action-conditioned generative models.
  • model-based RL on consumer hardware.

Over time I want this project to include:

  • cleaner training scripts.
  • clearer explanations of each component.
  • more structured ablations.
  • better evaluation tools.
  • fuller reproduction instructions.
  • eventual downstream agent-training in imagination.

Scope of this release:

This release should be understood as:

  • a beta research release.
  • a working action-conditioned world model.
  • a portfolio / research artifact.
  • a foundation for future planning and RL work.

It should not be understood as:

  • a polished library.
  • a final benchmark result.
  • a full Dreamer-4 reproduction.
  • a complete end-to-end agent-training system.

If you want to explore the project:

Good places to start are:

  • the exported checkpoints under checkpoints/.
  • the demo / app.
  • the tokenizer training code.
  • the world-model training code.
  • the validation plots and rollout gifs.
  • the failure-mode examples.

Acknowledgements:

This project was strongly influenced by several pieces of prior work (see Relation to prior work above).

I also want to acknowledge ChatGPT 5.4 and Claude Opus 4.6 for coding assistance, debugging help, and iteration support during development.

Any mistakes, implementation choices, and deviations from the referenced work are my own.


Citation:

If this repository is useful to your work, please cite the repository and the relevant upstream papers.


Status:

Beta. Active research. Action sensitivity fixed in the current setup, with further training, testing, cleanup, and longer-horizon improvement still in progress.
