The Journey
I recently stopped using Midjourney, after 3 years of (almost) non-stop usage. Not because I don't think the model and features are worth the price, but because my attention has shifted towards local and self-hosted tools, to reduce cloud dependencies.
That being said, I realized I was sitting on over 86k generations across images and videos, and figured it was a great opportunity to cause myself some new form of stress: figure out how to build an interactive experience around the archive, and make it available as a standalone VR package too.
And... that's how we got to The Journey.
What is it
Four distinct environments built around 86,352 Midjourney generations (84,143 images + 2,209 video clips) spanning April 2023 to March 2026.
- The Room -- Thousands of image tiles drifting across an infinity room. Music and voice reactive. 6DOF movement. Fly through three years of output.
- The Spectrum -- Two personal video pieces watched from inside a 1930s train cabin.
- The Cellar -- 150 curated images across three connected spaces. Walk through, approach, inspect.
- The Data -- Every number, every pattern, every decision surfaced as an editorial data piece. Volume over time, prompt evolution, version migration, color distribution, subject territory, format choices. Not a dashboard.
Desktop version includes all four. VR version (Quest 3) includes The Room, The Spectrum, and The Cellar.
How it started
The original idea was simpler: make an endless video from 86K images. Chronological, 1/24th of a second per frame, subliminal speed, occasional holds. A farewell document for 3 years of visual obsession, closing with the end of the subscription.
Then it got complicated.
Chronological ordering told a story of aesthetic evolution, but clustering by visual similarity created a more compelling flow. The question became: what does this archive actually look like when you stop treating it as files and start treating it as data?
That led to embedding all 84,143 images using CSD (Contrastive Style Descriptor) -- a ViT-L model specifically trained to capture visual style independent of content. Not CLIP (groups by what's in the image), not DINOv2 (groups by spatial structure). CSD uses gradient reversal to disentangle style from semantics: two images of hands -- one anime, one B&W photography -- land in completely different regions of the embedding space. That distinction is what makes the clustering useful for both the interactive experience and LoRA training sets.
500 style clusters emerged. Some obvious (analog film grain, pixel art, brutalist architecture), some surprising (a consistent cluster of liminal fluorescent interiors that spans 2023 through 2025). The clusters became the visual backbone of the application -- each represented by rotating sample mosaics that ensure repeat visits surface different images.
Two videos were rendered from the full image set, sorted by luminance (dark to bright), interpolated to 120fps. Accessible inside The Spectrum's train cabin.
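The dark-to-bright ordering is straightforward to reproduce: compute one mean luminance value per image, sort by it, and hand the ordered list to FFmpeg. A minimal stdlib sketch, assuming the per-image mean RGB has already been computed elsewhere (in practice e.g. by downscaling each image to 1x1 with an image library; the function names here are illustrative):

```python
def luminance(rgb):
    """Rec. 709 luminance weights applied to a 0-255 mean RGB triple
    (gamma is ignored for this sketch)."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def dark_to_bright(frames):
    """Sort (filename, mean_rgb) pairs dark-to-bright, as for the two
    Spectrum videos."""
    return [name for name, rgb in sorted(frames, key=lambda fr: luminance(fr[1]))]
```

The sorted filenames can then be written out as an FFmpeg concat list, with interpolation to 120fps done as a separate pass.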
How it was built
Data Pipeline (Python)
The pipeline evolved through several iterations:
- Extraction -- Unpack 176 zip archives, extract metadata (creation timestamp, full prompt, parameters, job UUID) from PNG text chunks
- Flattening -- 86,352 files into a single flat folder, timestamp-named (`YYYY-MM-DD-HH-MM-SS-[stem].[ext]`) so that alphabetical order = chronological order
- Embedding -- CSD ViT-L style embeddings (768-dimensional) for all 84,143 images. Separate embeddings for the 2,209 video clips (single-frame extraction)
- Clustering -- KMeans on CSD embeddings, 500 clusters. Auto-labelled via CLIP text similarity against a vocabulary extracted from the actual prompt corpus
- Dimensionality reduction -- UMAP 2D + 3D coordinates for spatial navigation
- Video rendering -- FFmpeg pipelines with custom filter chains (1:1 crop, pillarbox, chromatic aberration, colour grain, rounded corners), then editing and post-processing in DaVinci Resolve
- Statistics -- Prompt text analysis, co-occurrence matrices, per-version / per-year / per-format breakdowns, colour and subject classification
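The extraction step above reads metadata straight out of PNG text chunks. Here is a stdlib-only sketch of that read (an image library like Pillow can do the same job; this walks the chunk layout from the PNG spec directly, and the function name is illustrative):

```python
import struct
import zlib

def png_text_chunks(path):
    """Yield (keyword, value) pairs from a PNG's tEXt/zTXt chunks --
    where Midjourney stores the prompt, parameters, and job metadata."""
    with open(path, "rb") as f:
        assert f.read(8) == b"\x89PNG\r\n\x1a\n", "not a PNG file"
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            length, ctype = struct.unpack(">I4s", header)
            data = f.read(length)
            f.read(4)  # skip CRC
            if ctype == b"tEXt":
                key, _, value = data.partition(b"\x00")
                yield key.decode("latin-1"), value.decode("latin-1")
            elif ctype == b"zTXt":
                # after the keyword's NUL, one compression-method byte,
                # then zlib-deflated text
                key, _, rest = data.partition(b"\x00")
                yield key.decode("latin-1"), zlib.decompress(rest[1:]).decode("latin-1")
            if ctype == b"IEND":
                break
```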
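The clustering step is a library KMeans run over the 768-dimensional CSD vectors. As a dependency-free illustration of what that step actually does, here is Lloyd's algorithm in plain Python -- toy low-dimensional tuples stand in for the embeddings, and the naive spread-out initialization is a simplification (real implementations seed with something like kmeans++):

```python
def kmeans(points, k, iters=50):
    """Minimal Lloyd's-algorithm KMeans over tuples of floats."""
    # naive deterministic init: pick k evenly spaced points as seeds
    centers = [points[i * len(points) // k] for i in range(k)]

    def nearest(p):
        # index of the closest center by squared Euclidean distance
        return min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))

    for _ in range(iters):
        # assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # update step: each center moves to its cluster's mean
        new_centers = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return [nearest(p) for p in points], centers
```

With 500 clusters over 84k embeddings the mechanics are identical, just vectorized; the auto-labelling then matches each cluster centroid against CLIP text embeddings of the prompt-corpus vocabulary.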
Desktop Application (Tauri v2)
- React 18 + TypeScript + Vite
- Three.js / React Three Fiber for all 3D environments
- D3.js for data visualizations
- Framer Motion for transitions
- Tauri v2 for native desktop packaging (NSIS installer)
- Custom encrypted asset pipeline (AES-256-CTR) for distribution
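The core of any CTR-mode pipeline is the same regardless of cipher: encrypt a per-block counter to get a keystream, then XOR it with the data. The sketch below substitutes an HMAC-SHA256 keystream for AES so it stays stdlib-only -- the actual asset pipeline uses real AES-256-CTR via a crypto library -- but the counter/XOR mechanics are the same, and one call both encrypts and decrypts:

```python
import hashlib
import hmac
import struct

def ctr_xor(key, nonce, data):
    """CTR-mode sketch: derive a keystream block from (nonce, counter),
    XOR it with the data. HMAC-SHA256 stands in for AES here so the
    example has no dependencies; symmetric, so the same call decrypts."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = struct.pack(">Q", offset // 32)
        keystream = hmac.new(key, nonce + counter, hashlib.sha256).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, keystream))
    return bytes(out)
```

CTR's appeal for asset packs is random access: any block can be decrypted independently by computing its counter, with no need to stream from the start of the file.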
VR Application (Godot 4.6 + OpenXR)
- Native Quest 3 standalone APK
- OpenXR with hand tracking support
- 5,500 instanced tiles with per-frame drift animation and voice reactivity
- OGV Theora video playback in immersive video rooms
- Voice mic processing (LoFi distortion, bandpass 1400Hz, pitch shift 1.4x, delay 550ms, reverb)
- godot-xr-tools for controller interaction and UI panels
- Particle tunnel transitions between all environments
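In Godot the voice chain is built from stock AudioEffect resources on a bus, but the math of two of its stages is easy to show offline. A pure-Python sketch of the 1400 Hz bandpass (an RBJ-cookbook biquad, a plausible stand-in for Godot's filter) and the 550 ms feed-forward delay; pitch shift and reverb are omitted:

```python
import math

def biquad_bandpass(samples, fs, f0, q):
    """RBJ-cookbook band-pass biquad (0 dB peak gain at f0)."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b0, b1, b2 = alpha, 0.0, -alpha
    a0, a1, a2 = 1 + alpha, -2 * math.cos(w0), 1 - alpha
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        # direct-form I difference equation, normalized by a0
        y = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def add_delay(samples, fs, delay_s, mix=0.5):
    """Feed-forward delay: dry signal plus an attenuated delayed copy."""
    d = int(delay_s * fs)
    return [x + mix * (samples[i - d] if i >= d else 0.0)
            for i, x in enumerate(samples)]
```

Chained together (bandpass, then pitch shift, then delay, then reverb), the effect is the lo-fi "radio voice" heard inside the VR rooms.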
How to install
Desktop
- Download `The Journey_1.0.0_x64-setup.exe` from releases
- Run the installer
- On first launch, the app downloads and extracts the asset pack (~2GB)
VR (Meta Quest 3)
- Enable developer mode on Quest 3
- Download the APK
- Sideload it: `adb install -r journey.apk`
- Find it under Unknown Sources

Note: it does theoretically run on the Quest 2 as well, but my Q2 has decided to stop working. The main showstopper would likely be The Room and its 5.5k tiles, given the reduced GPU power.
Why CSD over CLIP
CLIP encodes what an image depicts. CSD encodes how it looks. For an archive spanning 3 years of aesthetic experimentation -- analog film grain, voxel art, cinematic stills, pixel horror, documentary photography, fashion editorial -- the visual style is the meaningful axis, not the subject matter. A portrait shot on Kodak Ektachrome and a portrait rendered as anime are fundamentally different images that happen to contain the same subject. CSD puts them where they belong.
The model: CSD on HuggingFace | Paper (Somepalli et al., 2024)
Credits
Created by Mattia Veltri
3D Assets (Sketchfab)
- Train cabin (1930s) -- thomas.rynne
- Warehouse -- Nicholas01
- Original Backrooms -- rjh41
- Liminal Hall -- Schmoldt5000
- The House Lost at Sea -- mdonze
Midjourney profile: @atmrmattv
Inspirations
The visual environments draw from Refik Anadol's approach to data as spatial material -- treating the archive not as files to browse but as a physical space to inhabit.
The data section is informed by the editorial data storytelling of The Pudding, the bespoke interactive graphics of the New York Times (particularly Shirley Wu's work on decision trees and network visualizations), Nadieh Bremer's generative data art, and Giorgia Lupi's data humanism philosophy -- the idea that data visualization should make things more human, not more efficient.