metadata
license: apache-2.0
tags:
  - game-ai
  - flow-matching
  - action-prediction
  - elden-ring
  - vla
base_model: Qwen/Qwen3.5-4B

Pi-Lumine 4B — Flow-Matching Action Decoder for Elden Ring

A Pi0.5-style flow-matching action decoder trained on top of a frozen Qwen3.5-4B VLM backbone.

Architecture

  • Base VLM: Qwen/Qwen3.5-4B (frozen; weights are not included in this repository and are downloaded at runtime)
  • Action Decoder: FiLM-conditioned transformer with cross-attention to VLM hidden states
    • 2 decoder layers, VLM dim 2560 → decoder dim 1024, 8 attention heads
    • Projection layers between the VLM width (2560) and the decoder width (1024) decouple the decoder from the VLM hidden size
    • Instruction-conditioned via AdaptiveRMSNorm (FiLM)
    • Sinusoidal time embedding for flow matching
    • ~64M trainable parameters
  • Action Space: 6 steps x 20 dims (4 sticks + 16 buttons per step)
  • Training: flow-matching objective; at inference, actions are sampled by Euler integration of the learned ODE
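
The Euler sampling step can be illustrated with a minimal sketch. It is not the released inference code. It assumes the common convention that t=0 is Gaussian noise and t=1 is the action chunk, uses 10 integration steps (an arbitrary choice), and replaces the trained decoder with a hypothetical `toy_velocity` function whose target is known, so the loop can be checked without the model. Only the 6 x 20 action shape comes from the list above.

```python
import random

HORIZON, ACTION_DIM = 6, 20  # 6 steps x 20 dims, as listed above


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the trained decoder's velocity prediction.

    For the straight-line path x_t = (1 - t) * noise + t * target, the
    velocity that carries x_t to target is (target - x_t) / (1 - t).
    """
    return [[(g - v) / (1.0 - t) for v, g in zip(row, goal_row)]
            for row, goal_row in zip(x, target)]


def euler_sample(velocity_fn, num_steps=10, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the action-chunk shape.
    x = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)] for _ in range(HORIZON)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0, so the toy field stays finite
        v = velocity_fn(x, t)
        x = [[xi + dt * vi for xi, vi in zip(xr, vr)] for xr, vr in zip(x, v)]
    return x


# Assumed target chunk, used only to make the toy field well defined.
target = [[0.5] * ACTION_DIM for _ in range(HORIZON)]
actions = euler_sample(lambda x, t: toy_velocity(x, t, target))
print(len(actions), len(actions[0]))
```

In the real pipeline, `velocity_fn` would presumably be the action decoder conditioned on the VLM hidden states and the instruction; the integration loop itself would keep the same shape.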

Files

  • action_decoder.pt — Trained action decoder weights
  • decoder_config.json — Architecture and tokenizer config
  • tokenizer.json / tokenizer_config.json — Tokenizer with special tokens
  • chat_template.jinja — Chat template
  • processor_config.json — Processor config
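
Before loading the decoder, it can help to confirm that a local copy of the repository contains every file listed above. The sketch below uses only the Python standard library; the helper names `missing_files` and `load_decoder_config` are hypothetical, and the keys inside decoder_config.json are not documented here, so the config is returned as a plain dictionary without assuming its contents.

```python
import json
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "action_decoder.pt",
    "decoder_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "chat_template.jinja",
    "processor_config.json",
]


def missing_files(repo_dir):
    """Return the expected file names that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def load_decoder_config(repo_dir):
    """Parse decoder_config.json into a dict; no particular keys are assumed."""
    with open(Path(repo_dir) / "decoder_config.json", encoding="utf-8") as f:
        return json.load(f)


# Example: list what is missing from the current directory.
print(missing_files("."))
```

The weights in action_decoder.pt would still need to be loaded with the training framework that produced them, which this sketch does not cover.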