The Journey
I recently stopped using Midjourney, after 3 years of (almost) non-stop usage. Not because I don't think the model and features are worth the price, but because my attention has shifted towards local and self-hosted tools, to reduce cloud dependencies.
That being said, I realized I was sitting on over 86k generations across images and videos, and figured it was a great opportunity to cause myself some new form of stress: figure out how to build an interactive experience around the archive, and make it available as a standalone VR package too.
And... that's how we got to The Journey.
What is it
Four distinct environments built around 86,352 Midjourney generations (84,143 images + 2,209 video clips) spanning April 2023 to March 2026.
- The Room -- Thousands of image tiles drifting across an infinity room. Music and voice reactive. 6DOF movement. Fly through three years of output.
- The Spectrum -- Two personal video pieces watched from inside a 1930s train cabin.
- The Cellar -- 150 curated images across three connected spaces. Walk through, approach, inspect.
- The Data -- Every number, every pattern, every decision surfaced as an editorial data piece. Volume over time, prompt evolution, version migration, color distribution, subject territory, format choices. Not a dashboard.
Desktop version includes all four. VR version (Quest 3) includes The Room, The Spectrum, and The Cellar.
How it started
The original idea was simpler: make an endless video from 86K images. Chronological, 1/24th of a second per frame, subliminal speed, occasional holds. A farewell document for 3 years of visual obsession, closing with the end of the subscription.
Then it got complicated.
Chronological ordering told a story of aesthetic evolution, but clustering by visual similarity created a more compelling flow. The question became: what does this archive actually look like when you stop treating it as files and start treating it as data?
That led to embedding all 84,143 images using CSD (Contrastive Style Descriptor) -- a ViT-L model specifically trained to capture visual style independent of content. Not CLIP (groups by what's in the image), not DINOv2 (groups by spatial structure). CSD uses gradient reversal to disentangle style from semantics: two images of hands -- one anime, one B&W photography -- land in completely different regions of the embedding space. That distinction is what makes the clustering useful for both the interactive experience and LoRA training sets.
500 style clusters emerged. Some obvious (analog film grain, pixel art, brutalist architecture), some surprising (a consistent cluster of liminal fluorescent interiors that spans 2023 through 2025). The clusters became the visual backbone of the application -- each represented by rotating sample mosaics that ensure repeat visits surface different images.
Two videos were rendered from the full image set, sorted by luminance (dark to bright), interpolated to 120fps. Accessible inside The Spectrum's train cabin.
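The dark-to-bright ordering is straightforward to reproduce: compute one mean luminance value per image, sort by it, and hand the ordered list to FFmpeg. A minimal stdlib sketch, assuming the per-image mean RGB has already been computed elsewhere (in practice e.g. by downscaling each image to 1x1 with an image library; the function names here are illustrative):

```python
def luminance(rgb):
    """Rec. 709 luminance weights applied to a 0-255 mean RGB triple
    (gamma is ignored for this sketch)."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def dark_to_bright(frames):
    """Sort (filename, mean_rgb) pairs dark-to-bright, as for the two
    Spectrum videos."""
    return [name for name, rgb in sorted(frames, key=lambda fr: luminance(fr[1]))]
```

The sorted filenames can then be written out as an FFmpeg concat list, with interpolation to 120fps done as a separate pass.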
How it was built
Data Pipeline (Python)
The pipeline evolved through several iterations:
- Extraction -- Unpack 176 zip archives, extract metadata (creation timestamp, full prompt, parameters, job UUID) from PNG text chunks
- Flattening -- 86,352 files into a single flat folder, timestamp-named (`YYYY-MM-DD-HH-MM-SS-[stem].[ext]`) so that alphabetical order = chronological order
- Embedding -- CSD ViT-L style embeddings (768-dimensional) for all 84,143 images. Separate embeddings for the 2,209 video clips (single-frame extraction)
- Clustering -- KMeans on CSD embeddings, 500 clusters. Auto-labelled via CLIP text similarity against a vocabulary extracted from the actual prompt corpus
- Dimensionality reduction -- UMAP 2D + 3D coordinates for spatial navigation
- Video rendering -- FFmpeg pipelines with custom filter chains (1:1 crop, pillarbox, chromatic aberration, colour grain, rounded corners), then editing and post-processing in DaVinci Resolve
- Statistics -- Prompt text analysis, co-occurrence matrices, per-version / per-year / per-format breakdowns, colour and subject classification
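The extraction step above reads metadata straight out of PNG text chunks. Here is a stdlib-only sketch of that read (an image library like Pillow can do the same job; this walks the chunk layout from the PNG spec directly, and the function name is illustrative):

```python
import struct
import zlib

def png_text_chunks(path):
    """Yield (keyword, value) pairs from a PNG's tEXt/zTXt chunks --
    where Midjourney stores the prompt, parameters, and job metadata."""
    with open(path, "rb") as f:
        assert f.read(8) == b"\x89PNG\r\n\x1a\n", "not a PNG file"
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            length, ctype = struct.unpack(">I4s", header)
            data = f.read(length)
            f.read(4)  # skip CRC
            if ctype == b"tEXt":
                key, _, value = data.partition(b"\x00")
                yield key.decode("latin-1"), value.decode("latin-1")
            elif ctype == b"zTXt":
                # after the keyword's NUL, one compression-method byte,
                # then zlib-deflated text
                key, _, rest = data.partition(b"\x00")
                yield key.decode("latin-1"), zlib.decompress(rest[1:]).decode("latin-1")
            if ctype == b"IEND":
                break
```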
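The clustering step is a library KMeans run over the 768-dimensional CSD vectors. As a dependency-free illustration of what that step actually does, here is Lloyd's algorithm in plain Python -- toy low-dimensional tuples stand in for the embeddings, and the naive spread-out initialization is a simplification (real implementations seed with something like kmeans++):

```python
def kmeans(points, k, iters=50):
    """Minimal Lloyd's-algorithm KMeans over tuples of floats."""
    # naive deterministic init: pick k evenly spaced points as seeds
    centers = [points[i * len(points) // k] for i in range(k)]

    def nearest(p):
        # index of the closest center by squared Euclidean distance
        return min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))

    for _ in range(iters):
        # assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # update step: each center moves to its cluster's mean
        new_centers = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return [nearest(p) for p in points], centers
```

With 500 clusters over 84k embeddings the mechanics are identical, just vectorized; the auto-labelling then matches each cluster centroid against CLIP text embeddings of the prompt-corpus vocabulary.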
Desktop Application (Tauri v2)
- React 18 + TypeScript + Vite
- Three.js / React Three Fiber for all 3D environments
- D3.js for data visualizations
- Framer Motion for transitions
- Tauri v2 for native desktop packaging (NSIS installer)
- Custom encrypted asset pipeline (AES-256-CTR) for distribution
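The core of any CTR-mode pipeline is the same regardless of cipher: encrypt a per-block counter to get a keystream, then XOR it with the data. The sketch below substitutes an HMAC-SHA256 keystream for AES so it stays stdlib-only -- the actual asset pipeline uses real AES-256-CTR via a crypto library -- but the counter/XOR mechanics are the same, and one call both encrypts and decrypts:

```python
import hashlib
import hmac
import struct

def ctr_xor(key, nonce, data):
    """CTR-mode sketch: derive a keystream block from (nonce, counter),
    XOR it with the data. HMAC-SHA256 stands in for AES here so the
    example has no dependencies; symmetric, so the same call decrypts."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = struct.pack(">Q", offset // 32)
        keystream = hmac.new(key, nonce + counter, hashlib.sha256).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, keystream))
    return bytes(out)
```

CTR's appeal for asset packs is random access: any block can be decrypted independently by computing its counter, with no need to stream from the start of the file.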
VR Application (Godot 4.6 + OpenXR)
- Native Quest 3 standalone APK
- OpenXR with hand tracking support
- 5,500 instanced tiles with per-frame drift animation and voice reactivity
- OGV Theora video playback in immersive video rooms
- Voice mic processing (LoFi distortion, bandpass 1400Hz, pitch shift 1.4x, delay 550ms, reverb)
- godot-xr-tools for controller interaction and UI panels
- Particle tunnel transitions between all environments
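In Godot the voice chain is built from stock AudioEffect resources on a bus, but the math of two of its stages is easy to show offline. A pure-Python sketch of the 1400 Hz bandpass (an RBJ-cookbook biquad, a plausible stand-in for Godot's filter) and the 550 ms feed-forward delay; pitch shift and reverb are omitted:

```python
import math

def biquad_bandpass(samples, fs, f0, q):
    """RBJ-cookbook band-pass biquad (0 dB peak gain at f0)."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b0, b1, b2 = alpha, 0.0, -alpha
    a0, a1, a2 = 1 + alpha, -2 * math.cos(w0), 1 - alpha
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        # direct-form I difference equation, normalized by a0
        y = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def add_delay(samples, fs, delay_s, mix=0.5):
    """Feed-forward delay: dry signal plus an attenuated delayed copy."""
    d = int(delay_s * fs)
    return [x + mix * (samples[i - d] if i >= d else 0.0)
            for i, x in enumerate(samples)]
```

Chained together (bandpass, then pitch shift, then delay, then reverb), the effect is the lo-fi "radio voice" heard inside the VR rooms.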
How to install
Desktop
- Download `The Journey_1.0.0_x64-setup.exe` from releases
- Run the installer
- On first launch, the app downloads and extracts the asset pack (~2GB)
VR (Meta Quest 3)
- Enable developer mode on Quest 3
- Download the APK
- Sideload it: `adb install -r journey.apk`
- Find it under Unknown Sources

Note: it does theoretically run on the Quest 2 as well, but my Q2 has decided to stop working. The main showstopper would likely be The Room and its 5.5k tiles, given the reduced GPU power.
Why CSD over CLIP
CLIP encodes what an image depicts. CSD encodes how it looks. For an archive spanning 3 years of aesthetic experimentation -- analog film grain, voxel art, cinematic stills, pixel horror, documentary photography, fashion editorial -- the visual style is the meaningful axis, not the subject matter. A portrait shot on Kodak Ektachrome and a portrait rendered as anime are fundamentally different images that happen to contain the same subject. CSD puts them where they belong.
The model: CSD on HuggingFace | Paper (Somepalli et al., 2024)
Credits
Created by Mattia Veltri
3D Assets (Sketchfab)
- Train cabin (1930s) -- thomas.rynne
- Warehouse -- Nicholas01
- Original Backrooms -- rjh41
- Liminal Hall -- Schmoldt5000
- The House Lost at Sea -- mdonze
Midjourney profile: @atmrmattv
Inspirations
The visual environments draw from Refik Anadol's approach to data as spatial material -- treating the archive not as files to browse but as a physical space to inhabit.
The data section is informed by the editorial data storytelling of The Pudding, the bespoke interactive graphics of the New York Times (particularly Shirley Wu's work on decision trees and network visualizations), Nadieh Bremer's generative data art, and Giorgia Lupi's data humanism philosophy -- the idea that data visualization should make things more human, not more efficient.