I taught a SO-101 arm to pick a strawberry. Then I asked what it was thinking.
Here is the video that presents the main idea of this blog -
All you need to try this yourself is a standard SO-101 leader-follower setup, one wrist camera, and an object for pick and place. An overhead camera would indeed help, but I wanted to keep the setup as minimal as possible to make later probing into the VLA's brain simpler.
Speaking of the VLA, the policy is a fine-tuned SmolVLA, trained on just 40 teleoperated demonstrations I collected over a weekend. It picks the strawberry up, carries it, and drops it onto a green napkin.
Honestly, that part is not hard at all. You can probably knock it out over a weekend with little to no experience in robotics by following the guides curated by HuggingFace, with a little help from your coding agents for the setup and hardware debugging.
My main objective with this experiment was a deeper understanding of what the model is doing internally - even for such simple tasks. Could we map the different stages of pick up like reaching, grasping, transporting, releasing and returning home? Was the frozen vision-language backbone carrying the task structure, or was the action expert just replaying motor traces? And more importantly, could I watch its hidden state move through the task like a little brain trajectory?
This is especially interesting when we realise that human operators collecting teleop data tend to follow some standard and what we would call obvious policies for completing such tasks. It would be cool to visualise the model imbibing these policies implicitly.
This article is the story of that endeavour along with the multiple rounds of instrumentation, audits and repetitions until we arrived at a setup that survives contact with the data.
Here are the tl;dr takeaways:
- Stage is linearly decodable from the frozen SmolVLA VLM hidden states at about 86%.
- The original unsupervised stage labels looked beautiful and were wrong in exactly the dangerous places: GRASP, TRANSPORT, RELEASE.
- A strict left-to-right (Bakis) HMM made each episode renderable, but made recovery episodes impossible to represent since going back to a previous stage is forbidden by this model.
- A supervised MS-TCRNet trained on 11 sparsely hand-labeled episodes caught recovery that the old HMM models could not.
- The final brain videos became clearer and more honest after I stopped forcing frames onto broken centroids and embedded all 28,450 v1 hidden states directly.
The model picking the strawberry is a nice showcase. The far cooler win is the glimpse into how the model...thinks.
1. Setup
Here is the entire setup in one place.
Hardware
- SO-101 leader and follower pair, 6 DoF, Feetech STS3215 servos. It cost around $300 to purchase and set up. I got mine here (not an advert or promotion, it's the same link among the vendor options by HuggingFace) - https://shop.wowrobo.com/products/so-arm101-diy-kit-assembled-version-1
- One wrist-mounted USB camera, 640x480 at 30 fps - came included with the robot above.
- RTX 5090 laptop GPU for local inference though lower specs should do just fine.
- Colab A100 for the original fine-tune.
Task
- Pick one strawberry from a wooden table.
- Move it to a green napkin.
- Same camera pose, same lighting, same table. VLA training deteriorates significantly as you introduce more sources of variation across episodes. The fewer variations you have, the less data you need to collect.
- The strawberry position varies but stays within the field of view of the camera.
- I also tried using the electrical load on the gripper motor as a rough proxy for grip force, keeping it under about 50 on the SO-101 Present_Load scale, since the fruit visibly deforms at greater loads. This signal is very noisy and was only a stand-in for a proper force sensor.
Data
- 40 teleop demonstrations recorded with the leader arm and a custom OpenCV visualiser. I originally started with Rerun but hit delays in recording and saving, so I switched to a more customised setup showing the information I needed.
- Episodes average around 23 seconds each.
- A grand total of 28,450 frames.
- Published as cn0303/so101-strawberry-pick-v1 in LeRobot v3.0 format.
Model
- Started from lerobot/smolvla_base.
- Fine-tuned for 20,000 steps, batch size 64, on a Colab A100.
- Final checkpoint: cn0303/smolvla-so101-strawberry-v1.
- Roughly $3-4 of compute and took around 3 hours and 30 minutes.
SmolVLA is a nice fit for this kind of experiment because it is small enough to actually run inference on a laptop. The released model uses a SmolVLM2-500M vision-language backbone plus a flow-matching action expert. During fine-tuning, I ensured the VLM stayed frozen; the action expert and the small state-projection head are what get tuned.
One detail will matter later: SmolVLA truncates the VLM to half depth at init. The 32-layer SmolVLM2 text decoder becomes 16 layers. The paper reports that this roughly halves compute while costing about 1.8 pp on LIBERO.
That means the thing I probe later is a frozen, half-depth pretrained backbone. My 40 demos never touched those VLM weights.
Why SmolVLA instead of OpenVLA / pi0?
OpenVLA and pi0 are bigger and heavier for a simple local inference setup. They are doable, but the longer fine-tuning time, higher compute cost, and heavier inference requirements made me table them for later. SmolVLA was the best option to start with and iterate on quickly, and it hit all three constraints I cared about at once: public base checkpoint, editable Python source, and local inference.
The honest caveat: v1 does not have a formal 20-trial scorecard. I formalised that infrastructure later on for other versions/experiments so that you can quantitatively measure success as a score of task completion, trajectory smoothness, damage to berry, total time taken etc in a neat rubric; I will share that approach in upcoming blogs. But when I say "v1 worked," I mean the qualitative bar where the arm could pick the strawberry on the desk without crushing it and lay it on the napkin. Real, useful, not a benchmark claim.
Now the interesting question:
After a VLA works, how do you tell what it has learned?
2. Initial Temptation: Make It Hierarchical
A strawberry pick has certain obvious stages:
- start at home
- move toward the strawberry
- position around it
- grasp
- transport
- release
- return home
If you were to teleoperate a robot to pick up a berry from the table, or even do it by hand, odds are you would follow a similar approach, although the time spent in each segment would differ significantly. We are so accustomed to such simple manual tasks that they run happily in our subconscious without us paying them much attention.
So based on this the first idea is obvious: make these stages explicit and train a hierarchical policy. Add stage tokens. Put a planner on top.
I spent a while reading before touching code. The literature has a few families of approaches:
| Approach | Why I did not start there |
|---|---|
| Stage-conditioned VLAs | You need new data or retraining, and fixed-camera tasks are vulnerable to the "vision shortcut": the image already tells the model what phase it is in, so the prompt token can get ignored. (Shortcut Learning in Generalist Robot Policies, 2025 (https://arxiv.org/abs/2508.06426); the underlying mechanism is the causal confusion (https://arxiv.org/abs/1905.11979) of de Haan, Jayaraman & Levine, NeurIPS 2019.) |
| One policy per subtask | Aside from taking way longer and needing more resources, boundary drift compounds. Five 90%-reliable skills chained together do not give a 90%-reliable system. |
| LLM planner over a low-level policy | Great when the task set is broad. Overkill for one fixed pick-and-place task with five-ish phases. |
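
To put a number on the compounding problem in the table, here is the arithmetic for the "five 90%-reliable skills" case. It assumes the five skills fail independently, which is a simplification of mine and not something measured in this project.

```python
# Success rate of a chain of independent skills is the product of the per-skill rates.
per_skill_success = 0.90  # the "90%-reliable" figure from the table
num_skills = 5

chain_success = per_skill_success ** num_skills
print(f"{chain_success:.3f}")  # 0.590
```

Under that assumption the chained system succeeds roughly 59% of the time, and real boundary drift between skills would push it lower still.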
I wanted something smaller and more diagnostic:
Do not retrain the policy. Wrap it in a transparent observer. Then ask whether the frozen VLM already encodes task stage.
The plan became:
- Discover per-frame task stages from the demonstrations.
- Use those stages to build a finite-state observer.
- Probe the frozen VLM hidden states to see where stage information lives.
- Visualise the model's hidden state through the episode.
3. Using Hidden Markov Models for labelling the stages of each episode
To instrument the policy, I first needed stage labels for every frame.
I did not have human labels yet. So I started with the classic (and slightly lazy) move: fit a Hidden Markov Model over proprioceptive features.
Each frame had 16 features:
- 6 joint positions
- 6 joint velocities
- gripper position
- gripper velocity
- action magnitude
- action-delta magnitude
I z-scored per episode and fit one HMM jointly across all 40 episodes. Joint fitting matters because it gives cross-episode alignment: "stage 3" means the same thing in every demo.
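A minimal sketch of the per-episode z-scoring step, using only the standard library. The function name `zscore_episode`, the toy two-feature episodes, and the guard for constant columns are illustrative assumptions, not the project's actual code.

```python
from statistics import mean, pstdev

def zscore_episode(frames):
    """Z-score every feature column within one episode (frames: list of feature rows)."""
    columns = list(zip(*frames))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # a constant column would divide by zero
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in frames]

# Two toy "episodes" with 2 features each (the real frames have 16).
episodes = [
    [[0.0, 10.0], [1.0, 12.0], [2.0, 14.0]],
    [[5.0, 100.0], [7.0, 100.0], [9.0, 100.0]],  # second feature never changes
]
normalised = [zscore_episode(ep) for ep in episodes]

# Joint fitting: one stacked sequence plus per-episode lengths, so a single
# model sees every episode and "stage 3" means the same thing in all of them.
stacked = [row for ep in normalised for row in ep]
lengths = [len(ep) for ep in normalised]
```

Normalising inside each episode removes per-episode offsets before the joint fit, while the `lengths` list keeps episode boundaries intact.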
Fully-Connected HMM Failed Immediately
The first HMM was fully connected. It could jump from any stage to any other stage. It discovered motion regimes, not human-meaningful task phases. The live episode playback flickered between stages rapidly. During a grasp it might bounce between APPROACH and TRANSPORT because one frame's proprio features happened to look a bit like an earlier cluster.
So I needed an HMM that was stricter and would not flicker like this in its predictions.
Bakis HMM Fixed The Flicker
A Bakis HMM is a left-to-right HMM. State i can stay at i or move to i+1. It cannot go backwards.
That matches a clean pick:
HOME_START -> REACH_OUT -> APPROACH -> GRASP -> TRANSPORT -> RELEASE -> RETURN_HOME
I made three changes:
- Use K=7, splitting HOME into HOME_START and RETURN_HOME.
- Force the transition matrix to be upper-bidiagonal: self-loop 0.93, forward 0.07, final state absorbing.
- Initialise each state's Gaussian from normalised episode time, so state order matches task order from the beginning.
I used 0.93/0.07 as a simple first-pass smoothing prior. It strongly discourages flickering while still allowing forward progress. It does not claim or reflect the true duration model for every stage. A more principled version would estimate stage-specific forward probabilities from hand-labeled or pseudo-labeled durations, but that comes later on.
The time initialisation is the load-bearing trick. If you let k-means initialise states arbitrarily, you cannot easily recover the intended left-to-right order after freezing the transition matrix.
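The two structural pieces described above can be sketched in a few lines. The 0.93 / 0.07 values and K=7 come from the text; slicing each episode into K equal time windows is my assumption about what "initialise from normalised episode time" looks like, and both function names are hypothetical.

```python
K = 7
STAY, ADVANCE = 0.93, 0.07

def bakis_transition_matrix(k, stay, advance):
    """Upper-bidiagonal matrix: each state may only stay or move one step forward."""
    A = [[0.0] * k for _ in range(k)]
    for i in range(k - 1):
        A[i][i] = stay
        A[i][i + 1] = advance
    A[k - 1][k - 1] = 1.0  # the final state is absorbing
    return A

def time_based_init(num_frames, k):
    """Assign frame t to state floor(k * t / num_frames): equal slices in task order."""
    return [min(k * t // num_frames, k - 1) for t in range(num_frames)]

A = bakis_transition_matrix(K, STAY, ADVANCE)
init_states = time_based_init(num_frames=700, k=K)
# The per-state Gaussian means would then be initialised from the frames in each slice.
```

Because the initial assignment is already ordered in time, state index and task order agree before any fitting happens, which is the property a k-means initialisation does not give you.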
The result looked fantastic.
| Metric | Fully-connected K=6 | Bakis K=7 |
|---|---|---|
| Stage revisits across 40 episodes | many | 0 |
| Backward transition mass | ~0.5 | 0.0 |
| Largest single-state share | 44% | 18% |
| RF agreement with HMM labels | 88.8% | 94.0% |
The aligned timeline was beautiful:
The transition matrix showed exactly what I wanted: diagonal plus one forward band, no backward mass:
And the retry plot was perfectly clean:
One episode in detail:
At this point, the easy narrative would be:
Great. The robot has stages. The observer works. Now probe the VLM.
Except there are two distinct problems -
- When I rewatched some of the old camera-POV playbacks, the labels assigned by the Bakis HMM were considerably off from what I was seeing.
- The strictly sequential constraint of the Bakis HMM prevents it from handling recovery episodes. Say the berry accidentally falls during TRANSPORT. In reality we go back to APPROACH and then GRASP before resuming TRANSPORT, but this architecture does not allow that.
Here is a video that illustrates both of these issues clearly -
A short list of all issues we see here -
- The robot has begun grasping but the label assigned is still APPROACH
- What is labelled as GRASP is very clearly TRANSPORT
- After the berry drops, the label stays stuck in RELEASE for a very long time since the model cannot go backwards.
So 2 issues - the labels assigned by the model are not trustworthy and the current architecture cannot handle realistic recovery scenarios. Let us tackle each of them one by one.
4. The Labels Looked Right. They Weren’t.
The dangerous sentence in the previous section is:
RF agreement with HMM labels = 94%.
That is not the same as "the stages are correct 94% of the time."
It only says the Random Forest and the HMM agree. If the HMM is wrong, the RF can faithfully learn the wrong thing.
So I built a small browser labeler and hand-labeled transition points.
It is a simple labelling tool, made with the help of Claude Code, that loads an episode and lets you treat it like a clip in video editing software: you can shorten and lengthen each segment and save the labels as JSON. To help place the segment boundaries precisely, the bottom panel shows the joint velocities and the gripper load extrapolated from the motor load reading.
With this tool you can quickly record the sparse stage boundaries: about 6 marks per episode in clean cases, and a couple more for recovery edge cases. The first audit used 8 episodes:
3, 7, 11, 17, 22, 28, 33, 38
About 48 transition marks. Roughly 45 minutes of work.
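Sparse boundary marks become a per-frame answer key by filling each stage forward until the next mark. The sketch below shows that expansion and the per-stage "diagonal" (the fraction of a true stage's frames that the predictor gets right). The `(start_frame, stage)` mark format and both function names are assumptions for illustration; the labeler's real JSON layout is not shown in this post.

```python
def boundaries_to_labels(marks, num_frames):
    """marks: sorted (start_frame, stage) pairs, first mark at frame 0."""
    labels = []
    for idx, (start, stage) in enumerate(marks):
        end = marks[idx + 1][0] if idx + 1 < len(marks) else num_frames
        labels.extend([stage] * (end - start))
    return labels

def per_stage_recall(truth, pred):
    """Fraction of each true stage's frames that the predictor labels correctly."""
    hits, totals = {}, {}
    for t, p in zip(truth, pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return {stage: hits[stage] / totals[stage] for stage in totals}

human = [(0, "HOME_START"), (4, "APPROACH"), (8, "GRASP"), (10, "TRANSPORT")]
truth = boundaries_to_labels(human, num_frames=14)

# A toy predictor that enters every stage two frames late.
late = [(0, "HOME_START"), (6, "APPROACH"), (10, "GRASP"), (12, "TRANSPORT")]
pred = boundaries_to_labels(late, num_frames=14)

recall = per_stage_recall(truth, pred)
print(recall)
# {'HOME_START': 1.0, 'APPROACH': 0.5, 'GRASP': 0.0, 'TRANSPORT': 0.5}
```

Even a small, consistent lag wipes out a short stage entirely: the two GRASP frames are both labelled APPROACH, while the long stages still look respectable.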
The result was the moment this project changed shape.
| Human stage | Bakis diagonal | Mostly predicted as |
|---|---|---|
| HOME_START | 0.95 | correct |
| REACH_OUT | 0.71 | APPROACH |
| APPROACH | 0.55 | TRANSPORT |
| GRASP | 0.08 | APPROACH (82%) |
| TRANSPORT | 0.33 | GRASP |
| RELEASE | 0.61 | TRANSPORT |
| RETURN_HOME | 0.47 | RELEASE |
Plain English:
The Bakis HMM was calling GRASP frames APPROACH four out of five times.
That is not a small calibration error. That breaks every downstream number that depends on per-stage truth:
- per-stage F1
- probe confusion matrices
- stage centroids
- geometry plots
- brain trajectory maps
So we need a better solution that is capable of labelling more accurately and handling recovery episodes.
5. Replacing Bakis: Cleaner Labelling and Handling Recovery Episodes
After the hand audit, I needed a real segmenter shootout. At this point, the problem had split into two separate questions:
- Can I get cleaner and more accurate stage labels on normal pick-and-place episodes?
- Can the segmenter represent recovery when the robot has to go backwards in the task?
So I tried four alternatives while keeping the original Bakis model as the baseline.
Another key insight from hand labeling was how much I struggled to separate REACH_OUT and APPROACH myself. At what degree of proximity can one claim that reaching out has turned into approaching? Combining the two into a single POSITIONING stage removes that judgment call and one stage with it. The question that defines this stage then becomes much simpler -
Has the robot moved from home into a good pre-grasp position?
So the cleaned-up stage set became:
HOME_START -> POSITIONING -> GRASP -> TRANSPORT -> RELEASE -> RETURN_HOME
That gave me a 6-stage problem instead of a 7-stage problem, with boundaries that were easier to label and easier to defend.
The first improvement was an HSMM, or Hidden Semi-Markov Model.
A normal HMM decides the stage frame by frame. It can prefer staying in the same state, but it does not explicitly model how long a stage should last. An HSMM adds that missing piece: duration.
In plain English, the HSMM can say:
POSITIONING usually lasts longer. RELEASE is usually short. GRASP has its own typical duration.
That is already closer to the real task. The robot does not spend equal time in every phase. It may spend a long time positioning, a shorter time transporting, and only a brief moment releasing. So adding duration priors made the model much more realistic than a plain Bakis HMM with the same stay/advance probability everywhere.
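A short worked example of why a fixed self-loop is a poor duration model. In a plain HMM the time spent in a state follows a geometric distribution set by the self-loop probability. Plugging in the 0.93 self-loop from the earlier Bakis model and the 30 fps camera rate gives the numbers below; reading that prior literally as a duration model is my framing, since the text already says it was only a smoothing prior.

```python
FPS = 30
stay = 0.93

def p_duration(d, stay):
    """Probability of staying exactly d frames: (d - 1) self-loops, then one exit."""
    return stay ** (d - 1) * (1.0 - stay)

expected_frames = 1.0 / (1.0 - stay)      # mean of the geometric distribution
expected_seconds = expected_frames / FPS

# The single most likely duration is one frame, for every stage alike.
most_likely = max(range(1, 200), key=lambda d: p_duration(d, stay))

print(round(expected_frames, 1), round(expected_seconds, 2), most_likely)  # 14.3 0.48 1
```

An implied dwell of about half a second per stage, identical for every stage, is far from a 23-second episode where positioning is long and release is brief. An HSMM replaces this implicit geometric shape with an explicit per-stage duration distribution.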
Then I tried Bakis-HSMM.
This kept the clean left-to-right structure of Bakis, but added HSMM-style duration modelling. So it got the best of both worlds for clean episodes:
- no flicker
- interpretable stages
- better duration behaviour
- clean left-to-right timelines
But it still inherited the main Bakis limitation: it could not go backwards. If the strawberry slipped during transport, the model could not return to POSITIONING or GRASP. It had no effective transition for that. So it could produce a beautiful timeline on clean episodes, but it could not honestly represent recovery.
The next family of models was MS-TCRNet.
This is a temporal segmentation network. Instead of forcing a hand-designed transition matrix, it learns to label each frame using temporal context. In simple terms, it looks at a sequence of robot motion and predicts which stage each frame belongs to. Unlike Bakis, it is not forced to move only left-to-right. If the robot returns to POSITIONING after a failed grasp or dropped berry, MS-TCRNet can represent that.
I first tried MS-TCRNet using pseudo-labels from the previous models. That looked promising at first, but the problem was obvious: if the teacher labels are wrong, the student learns the teacher’s mistakes. It became a better imitator of an imperfect segmenter, not a truly better source of truth.
The real jump came from training MS-TCRNet on sparse human labels from 11 episodes. Those labels were not dense, perfect annotations of every frame from the beginning. They were mostly stage boundaries: the moments where I said, “this is where POSITIONING ends and GRASP begins,” or “this is where the recovery starts.”
That small amount of human supervision led to significant gains. The supervised MS-TCRNet was no longer just copying Bakis. It had learned from actual stage boundaries, and it could catch recovery behaviour that the left-to-right models structurally could not express.
Comparing all 5 approaches against held-out v1 numbers on episodes 7, 27, and 34:
| Variant | MoF | F1@25 | F1@50 | ep34 recovery |
|---|---|---|---|---|
| bakis_raw | 0.441 | 0.564 | 0.292 | FAIL |
| bakis_refined | 0.439 | 0.577 | 0.302 | FAIL |
| bakis_hsmm | 0.721 | 0.863 | 0.812 | FAIL |
| hsmm_v6_tuned | 0.658 | 0.826 | 0.590 | FAIL |
| mstcrnet_sup | 0.894 | 0.872 | 0.872 | PASS |
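
For readers unfamiliar with the column names: the sketch below assumes MoF and F1@k follow their usual action-segmentation definitions (frame-wise accuracy, and segment-level F1 where a predicted segment counts as correct if its IoU with an unmatched same-label ground-truth segment reaches k%). The post does not show its evaluation code, so treat this as an illustration of the metrics, with a toy recovery episode of my own.

```python
def segments(labels):
    """Collapse per-frame labels into (label, start, end_exclusive) runs."""
    runs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            runs.append((labels[start], start, t))
            start = t
    return runs

def mof(truth, pred):
    """Mean over frames: plain per-frame accuracy."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

def f1_at_k(truth, pred, iou_threshold):
    gt, pr = segments(truth), segments(pred)
    matched = [False] * len(gt)
    tp = fp = 0
    for label, ps, pe in pr:
        best_iou, best_j = 0.0, -1
        for j, (glabel, gs, ge) in enumerate(gt):
            if glabel != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
        if best_j >= 0 and best_iou >= iou_threshold and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1
    fn = len(gt) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy recovery episode: the berry slips, so POSITIONING and GRASP happen twice.
truth = (["POSITIONING"] * 4 + ["GRASP"] * 2 + ["TRANSPORT"] * 2
         + ["POSITIONING"] * 2 + ["GRASP"] * 2 + ["TRANSPORT"] * 4)
# A left-to-right model cannot go back, so it smears TRANSPORT over the recovery.
pred = ["POSITIONING"] * 4 + ["GRASP"] * 2 + ["TRANSPORT"] * 10

print(mof(truth, pred), round(f1_at_k(truth, pred, 0.25), 3), round(f1_at_k(truth, pred, 0.50), 3))
# 0.75 0.667 0.444
```

The frame-wise score stays fairly high because most frames are still right, while the segment-level scores drop sharply because three ground-truth segments are never predicted at all.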
I have visualised Episode 34 below. In this episode the strawberry slips and the policy recovers, which makes it a good basis for comparing the two models.
Bakis-HSMM:
MS-TCRNet:
Bakis-HSMM cannot emit a backward transition, so it smears the recovery forward. It runs through TRANSPORT and then into a long RELEASE block without ever revisiting POSITIONING. MS-TCRNet, however, catches the backward edge. In its prediction row, POSITIONING reappears around frame 335 and runs through the recovery attempt, matching the second POSITIONING segment in ground truth. Even the ending is clean, as it is able to link starting at home with returning home at the tail end.
Qualitatively, the comparison looked like this:
| Model | What happened |
|---|---|
| Bakis baseline | Clean-looking, wrong middle stages. Could not recover. |
| 6-state HSMM | Added explicit duration priors and merged REACH_OUT + APPROACH into POSITIONING. Big jump. |
| Bakis-HSMM | Cleanest interpretable model for clean episodes. Still cannot go backwards. |
| MS-TCRNet from pseudo-labels | Looked promising, but mostly learned the teacher's behavior. |
| MS-TCRNet from real labels | Best recovery behavior. Learned from 11 sparsely hand-labeled episodes. |
The important one-line descriptions of the three best approaches are as follows:
- Bakis-HSMM is the best clean, interpretable, structured observer.
- HSMM-v6-tuned has explicit recovery edges / anomaly-style behaviour, but did not generalize cleanly on ep34.
- MS-TCRNet supervised is the best labeler when I care about matching real episodes, especially recovery episodes.
That is why I did not throw the older models away. They answer different questions.
If I want a clean explanatory baseline, I still like Bakis-HSMM. If I want explicit recovery edges and anomaly-style behaviour, HSMM-v6-tuned is useful. If I want the best supervised segmentation target for probing SmolVLA, I use MS-TCRNet.
6. Probing SmolVLA: The Frozen Backbone Already Knows A Lot
The next question was the one I cared about most:
Does the frozen VLM know what stage the robot is in?
This wording can be misleading, so let me be precise.
The VLM weights did not learn from my strawberry demos (frozen weights during finetuning). I am looking at the activation state the frozen VLM produces for each frame: wrist image, prompt, and robot state tokens flowing through a pretrained visual-language model.
That is still worth inspecting because the action expert is trained to read from those frozen activations. In SmolVLA, the frozen backbone is the perceptual interface. The fine-tune does not rewrite the visual features; it teaches the action expert how to use them.
So when I say "the model's brain" in this section, read it as: the frozen perceptual workspace that the trained action expert sees, not "a VLM whose weights were updated by this task."
The simplest way to view this happens to be a cornerstone of interpretable AI - the linear probe.
Linear probe, in one paragraph
Take the hidden state from one layer of a frozen model. Train a simple classifier to predict a target from it. Here the target is "what stage is this frame?" If a linear classifier gets high accuracy, the stage is linearly decodable from that hidden state. This does not prove the model uses that variable causally. It does tell you the information is present and easy to read.
Practical tidbit - The Funny PyTorch Hook Bug
The obvious hook did not work:
layer = policy.model.vlm_with_expert.get_vlm_model().text_model.layers[8]
h = layer.register_forward_hook(hook)
# run select_action(...)
# hook never fires
SmolVLA's wrapper manually iterates through the VLM submodules. It does not call the decoder layer object in the usual way, so a hook on the layer never fires.
The fix was to hook a submodule that is actually called:
layers = policy.model.vlm_with_expert.get_vlm_model().text_model.layers
handles = [
    layers[i].post_attention_layernorm.register_forward_hook(make_hook(i))
    for i in PROBE_LAYERS
]
That gave me per-frame hidden states from layers [2, 4, 8, 12, 15].
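The mechanism behind that bug can be shown without loading a model. The toy below is a dependency-free stand-in, not PyTorch itself: `TinyModule` and `TinyLayer` are hypothetical classes that copy one relevant behaviour of forward hooks, namely that they run only when the module object itself is called.

```python
class TinyModule:
    """Hypothetical hookable module: hooks run only inside __call__."""
    def __init__(self, fn):
        self.fn, self.hooks = fn, []

    def register_forward_hook(self, hook):
        self.hooks.append(hook)

    def __call__(self, x):
        out = self.fn(x)
        for hook in self.hooks:
            hook(self, x, out)
        return out

class TinyLayer(TinyModule):
    """A 'decoder layer' made of two hookable submodules."""
    def __init__(self):
        self.norm = TinyModule(lambda x: x - 1)
        self.mlp = TinyModule(lambda x: x * 2)
        super().__init__(lambda x: self.mlp(self.norm(x)))

fired = []
layer = TinyLayer()
layer.register_forward_hook(lambda m, i, o: fired.append("layer"))
layer.norm.register_forward_hook(lambda m, i, o: fired.append("norm"))

# A wrapper that walks the submodules itself and never calls layer(...).
wrapper_output = layer.mlp(layer.norm(5))
fired_by_wrapper = list(fired)   # ['norm']: the layer-level hook stayed silent

layer(5)                         # calling the layer object fires both hooks
```

The computation is identical either way; only the call path differs. That is why moving the hook onto a submodule the wrapper really calls is enough to start capturing activations.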
Feature for each frame:
- mean-pool the 241-token hidden state
- tokens include image patches, language tokens, and state token
- same architecture for every layer
- logistic regression probe with a standard scaler
- episode-level train/test split
- label-shuffled control with fixed seed
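The recipe in this list can be sketched end to end on synthetic data. Everything here is a stand-in: the activations are random clusters rather than real SmolVLA states, the sizes are tiny, and the probe is a softmax regression trained by plain gradient descent in NumPy, swapped in for whatever logistic-regression implementation the real runs used.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EPISODES, FRAMES, TOKENS, DIM, NUM_STAGES = 10, 60, 8, 16, 3

# Synthetic cache: (episode, frame, token, dim), with a stage-dependent offset.
stage_dirs = rng.normal(size=(NUM_STAGES, DIM))
labels = np.tile(np.repeat(np.arange(NUM_STAGES), FRAMES // NUM_STAGES), (NUM_EPISODES, 1))
hidden = rng.normal(size=(NUM_EPISODES, FRAMES, TOKENS, DIM)) + stage_dirs[labels][:, :, None, :]

features = hidden.mean(axis=2)  # mean-pool over tokens -> (episode, frame, dim)

# Split by episode so neighbouring frames never straddle train and test.
train_eps, test_eps = np.arange(8), np.arange(8, 10)
X_tr, y_tr = features[train_eps].reshape(-1, DIM), labels[train_eps].reshape(-1)
X_te, y_te = features[test_eps].reshape(-1, DIM), labels[test_eps].reshape(-1)

mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)  # scaler statistics from train only
X_tr, X_te = (X_tr - mu) / sd, (X_te - mu) / sd

def fit_softmax(X, y, num_classes, steps=300, lr=0.5):
    """Multinomial logistic regression fitted with full-batch gradient descent."""
    W = np.zeros((X.shape[1], num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (P - Y) / len(X)
        b -= lr * (P - Y).mean(axis=0)
    return W, b

def accuracy(W, b, X, y):
    return float(((X @ W + b).argmax(axis=1) == y).mean())

W, b = fit_softmax(X_tr, y_tr, NUM_STAGES)
probe_acc = accuracy(W, b, X_te, y_te)

# Control: same probe, same features, labels shuffled with a fixed seed.
y_shuffled = np.random.default_rng(1).permutation(y_tr)
Wc, bc = fit_softmax(X_tr, y_shuffled, NUM_STAGES)
control_acc = accuracy(Wc, bc, X_te, y_te)
```

On data this small a single shuffled control can land well away from chance, so the gap between `probe_acc` and `control_acc` is only meaningful on a real-sized cache, ideally averaged over several shuffles.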
Important nuance: this is not "pure image." The mean-pooled VLM-side state includes contextualised language and state tokens too. The task text is constant, but the hidden states after attention are not merely a constant offset.
Result: Flat Across Depth
I ran the probes twice:
- Original K=7 Bakis labels.
- Cleaner K=6 MS-TCRNet labels.
Same hidden-state cache. Same train/test split. Only the target labels changed.
| Layer | K=7 Bakis acc / F1 / ctrl | K=6 MS-TCRNet acc / F1 / ctrl | Delta acc |
|---|---|---|---|
| 2 | 0.810 / 0.793 / 0.157 | 0.860 / 0.824 / 0.248 | +5.1 pp |
| 4 | 0.810 / 0.794 / 0.157 | 0.871 / 0.838 / 0.240 | +6.1 pp |
| 8 | 0.809 / 0.795 / 0.136 | 0.869 / 0.834 / 0.228 | +6.0 pp |
| 12 | 0.811 / 0.795 / 0.149 | 0.861 / 0.824 / 0.243 | +5.0 pp |
| 15 | 0.806 / 0.789 / 0.149 | 0.861 / 0.822 / 0.235 | +5.5 pp |
Probe accuracy by layer, now under the K=6 MS-TCRNet labels — flat across depth, best layer 4:
Each marker is a separate linear probe trained on one frozen VLM layer's hidden state to read off the task stage; the near-flat curves mean stage information is spread evenly across depth, and the green line marks the best layer (layer 4).
Blue is the probe's raw accuracy on the true labels (0.86); orange is its macro-F1 (0.83), which weights all six stages equally so it isn't flattered by the common, easy ones. Red is the control: the same probe trained on randomly shuffled labels (~0.24, the chance floor for these class sizes). The wide blue-over-red gap is what shows the signal is real and not memorisation.
The important part is not the exact best layer. It is the flatness.
Under cleaner labels, stage is decodable at about 86% from layer 2 onward. The best layer shifts from 12 to 4, but the range across layers is only about 1.1 pp.
The controls matter too. K=7 shuffled controls sit near chance, about 0.14. K=6 controls sit around 0.23-0.25 because the class distribution is imbalanced, but the real probes are still far above them.
What I take from this:
- The stage signal is shallow.
- The frozen pretrained backbone already contains a lot of the visual/task structure.
- The original 81% number was partly eaten by broken labels.
- The probe result is about stage decodability, not end-to-end policy performance.
The remaining confusions are also intuitive. Under K=6, the dominant off-diagonal errors are:
- HOME_START vs RETURN_HOME: same pose, different time.
- GRASP vs POSITIONING: boundary frames and visually similar pre-grasp motion.
Changing layers does not solve a temporal ambiguity. Adding time does.
The per-stage F1 heatmap:
Each cell is the F1 score (a 0–1 measure that balances false alarms against misses) for one stage at one layer, with darker green = better. Along the layer axis, each stage is about equally decodable at every depth; the differences are between stages, not between layers. It is the same flatness, per class.
TRANSPORT and RETURN_HOME are easy everywhere (0.90+), while HOME_START and RELEASE stay the hard ones (0.65–0.78): HOME_START shares a pose with RETURN_HOME and RELEASE is brief, so no layer rescues them.
Going Past Aggregate: No Layer Specialises Either
Aggregate accuracy is the easy thing to plot, and it hides a sharper question. Maybe some layer is secretly the GRASP expert while another is the TRANSPORT expert, and the two cancel out to "flat."
So I broke per-layer accuracy down by stage:
| Stage | L2 | L4 | L8 | L12 | L15 | Range |
|---|---|---|---|---|---|---|
| HOME_START | 0.675 | 0.695 | 0.689 | 0.677 | 0.652 | 0.043 |
| POSITIONING | 0.873 | 0.889 | 0.889 | 0.879 | 0.875 | 0.017 |
| GRASP | 0.832 | 0.846 | 0.820 | 0.831 | 0.841 | 0.026 |
| TRANSPORT | 0.924 | 0.920 | 0.922 | 0.922 | 0.924 | 0.005 |
| RELEASE | 0.737 | 0.769 | 0.778 | 0.730 | 0.735 | 0.048 |
| RETURN_HOME | 0.904 | 0.906 | 0.905 | 0.903 | 0.905 | 0.003 |
The head-to-head between layer 2 (shallowest we probed) and layer 12 (the original best under K=7) is the part that genuinely surprised me:
| | HOME | POS | GRASP | TRANS | REL | RETURN | overall |
|---|---|---|---|---|---|---|---|
| L2 | 0.675 | 0.873 | 0.832 | 0.924 | 0.737 | 0.904 | 0.860 |
| L12 | 0.677 | 0.879 | 0.831 | 0.922 | 0.730 | 0.903 | 0.861 |
| Δ | +0.002 | +0.006 | −0.001 | −0.002 | −0.007 | −0.001 | +0.001 |
Every per-stage delta is under 1pp. L2 and L12 are functionally identical on every class. No specialisation, no trade.
So the "flat across depth" finding is actually stronger than the aggregate plot lets on. It is not that the layers have different per-stage strengths that happen to average out — it is that every layer is roughly equally good at every stage. The frozen backbone has stage information broadly and uniformly distributed across depth.
This is the per-class version of the analysis Tenney, Das & Pavlick (2019) ran for BERT, where they found POS tagging gets resolved early in the network and coreference late. The robotics analog would be "motion-based stages decodable at different depths than semantic stages." With only five layers sampled here we can rule out the gross version of that pattern; a denser sweep at the shallow end (layers 0-7) would be the right way to actually answer it. That's a v2 experiment, not a v1 conclusion.
7. Vision Separates Stages Better; With Honest Labels, Fusing It Helps
Now that I had VLM features, I wanted to compare two spaces:
- the 16-dimensional proprioceptive HMM feature space
- the 960-dimensional VLM hidden-state space
For each representation I computed per-stage centroids and a Mahalanobis-style separability score:
sep(a, b) = distance(centroid_a, centroid_b) / pooled_within_stage_std
Higher means the two stages are easier to separate. I ran this on the K=6 MS-TCRNet stages (the six I use everywhere after Section 5), with centroids built from training episodes only.
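A minimal sketch of that score on toy 2D points. How exactly the pooled within-stage standard deviation was computed is not spelled out, so the version here (root-mean-square deviation of every point from its own stage centroid, averaged over dimensions) is an assumption, as are the function names.

```python
import math
from itertools import combinations

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def pooled_within_std(groups, centroids):
    """RMS deviation of points from their own stage centroid, per dimension."""
    sq, n = 0.0, 0
    for stage, points in groups.items():
        for p in points:
            sq += sum((x - c) ** 2 for x, c in zip(p, centroids[stage]))
            n += len(p)
    return math.sqrt(sq / n)

def mean_separability(groups):
    """Mean over all stage pairs of centroid distance / pooled within-stage std."""
    centroids = {s: centroid(pts) for s, pts in groups.items()}
    std = pooled_within_std(groups, centroids)
    scores = [math.dist(centroids[a], centroids[b]) / std
              for a, b in combinations(groups, 2)]
    return sum(scores) / len(scores)

# Same centroids, different spread: tight clusters separate better.
tight = {"GRASP": [[0, 0], [0, 2]], "TRANSPORT": [[10, 0], [10, 2]]}
loose = {"GRASP": [[0, 0], [0, 8]], "TRANSPORT": [[10, 0], [10, 8]]}

ratio = mean_separability(tight) / mean_separability(loose)
print(round(ratio, 2))  # 4.0
```

The centroids are identical in both toy sets, so the entire 4x difference comes from the within-stage spread. That is the sense in which a representation can "separate stages more cleanly" without moving the stage means at all.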
| Representation | Mean off-diagonal separability |
|---|---|
| Proprio HMM space | 6.22 |
| VLM layers | 31.1 – 32.3 |
The VLM-side hidden states separate stages about 5x more cleanly than the proprio features (5.06x on the across-layer mean), and the advantage is almost perfectly flat across depth. All 15 stage pairs are more separable in vision — every point lands above the diagonal:
So the real question: does that cleaner geometry help the online stage classifier? I fused the proprio Random Forest with the layer-4 vision probe and ran Viterbi under three strategies, scored on 8 held-out episodes:
log_emis = vision_weight * log(probe_probs) + (1 - vision_weight) * log(rf_probs)
| Strategy | Δ vs proprio-only (MS-TCRNet truth) | Δ vs proprio-only (human truth) |
|---|---|---|
| scalar fusion (w=0.3) | +5.2 pp | +4.2 pp |
| per-state pair-aware | +5.1 pp | +4.0 pp |
| agreement-boost (vision only confirms) | +0.8 pp | +1.3 pp |

(“human truth” here means the hand-labeled held-out episodes, not a full dense annotation campaign over the entire dataset.)
Naive scalar fusion wins, and the gain lands where it should: TRANSPORT goes from 0.69 to 0.90, a 21-point jump. That is the stage proprioception is worst at. Mid-carry, the joint angles look almost identical whether or not the strawberry is in the gripper, and only the wrist image knows the difference.
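Here is a small, dependency-free sketch of scalar fusion followed by Viterbi decoding, using the log_emis formula above with w = 0.3. The two-stage toy (0 = GRASP, 1 = TRANSPORT), the probabilities, the sticky transition matrix, and the function names are all made up for illustration. It mimics the situation described here: proprioception cannot tell the stages apart, vision can.

```python
import math

def fuse_log_emissions(probe_probs, rf_probs, vision_weight=0.3, eps=1e-9):
    """Weighted sum of log-probabilities from the vision probe and the proprio RF."""
    return [
        [vision_weight * math.log(pv + eps) + (1 - vision_weight) * math.log(pr + eps)
         for pv, pr in zip(frame_v, frame_r)]
        for frame_v, frame_r in zip(probe_probs, rf_probs)
    ]

def viterbi(log_emis, log_trans, log_start):
    """Most likely state path under the given emissions and transitions."""
    k = len(log_start)
    score = [log_start[s] + log_emis[0][s] for s in range(k)]
    back = []
    for t in range(1, len(log_emis)):
        prev, score, ptr = score, [], []
        for s in range(k):
            best = max(range(k), key=lambda r: prev[r] + log_trans[r][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emis[t][s])
        back.append(ptr)
    path = [max(range(k), key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Vision is confident and flips halfway; proprio is undecided throughout.
probe_probs = [[0.9, 0.1]] * 4 + [[0.1, 0.9]] * 4
rf_probs = [[0.5, 0.5]] * 8
log_trans = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
log_start = [math.log(0.5), math.log(0.5)]

fused_path = viterbi(fuse_log_emissions(probe_probs, rf_probs, 0.3), log_trans, log_start)
proprio_only_path = viterbi(fuse_log_emissions(probe_probs, rf_probs, 0.0), log_trans, log_start)
print(fused_path)         # [0, 0, 0, 0, 1, 1, 1, 1]
print(proprio_only_path)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

With the vision weight at zero the decoder has no evidence for a switch and never leaves the first stage. At 0.3 the accumulated vision evidence over four frames outweighs the transition penalty and the path flips at the right frame.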
The first time I ran the sweep, naive fusion actually hurt by about 3 points, and the only mode that helped was a cautious one: let vision confirm the proprio decision, never override it. The fault was in the answer key. That first sweep scored against the Bakis labels, and the Bakis labels are a deterministic function of the proprioceptive features — that is what the HMM was fit on. A proprio classifier can reproduce a proprio-defined label set almost perfectly; mine agreed with Bakis 94% of the time. Vision had nothing left to add, so it could only inject noise. This was not because proprioception was truly better. It was because the benchmark itself was proprio-defined. I had built a test proprio could not lose.
Scored against labels that are not a disguised copy of proprio — MS-TCRNet, and human annotations as a neutral check on the hand-labeled held-out episodes — proprio falls to about 80%, and vision becomes the stronger signal on exactly the stages proprio is blind to. Fusion helps.
Two smaller notes survive the correction. The per-state weights still collapse to 0.30 ± 0.01, because the geometric advantage is uniform across every pair, so per-state fusion never pulls away from the scalar case. And agreement_boost stays mildly positive — the conservative choice for when you do not trust your labels. But once the labels are honest, letting vision vote beats letting it only confirm.
The real lesson sits one level above the model:
Grade a fusion experiment against a truth that is not secretly one of your inputs. The modality of your answer key decides which signal looks useful.
8. The Brain Video: Watching Hidden States Move Through The Task
After finishing all of this, one of my friends working in a similar domain suggested the following -
Project the VLM hidden state for each frame into 2D and draw it as a moving dot over the stage map.
My initial attempt at this used stage centroids:
- Group hidden states by stage label.
- Compute a 960-dimensional centroid for each stage.
- Embed the six centroids in 2D with MDS.
- Solve a linear map W: 960 -> 2 so the hidden-state centroids land on those 2D points.
- For each frame, draw hidden_state @ W.
In code, the core projection was:
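import numpy as np
# centroids: (6, 960), mds_coords: (6, 2) -> W: (960, 2), the minimum-norm least-squares map
W, *_ = np.linalg.lstsq(centroids, mds_coords, rcond=None)
position_t = hidden_state_t @ W  # (960,) @ (960, 2) -> (2,)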
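For readers who want to run the whole centroid pipeline end to end, here is a self-contained sketch on synthetic data. The hidden states and labels are random stand-ins, and the 2D embedding uses classical MDS written out in NumPy as a swap-in; the MDS variant actually used for the video is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stages, dim, frames_per_stage = 6, 960, 50

# Synthetic stand-ins for hidden states and stage labels (not the real data).
stage_means = rng.normal(size=(n_stages, dim))
labels = np.repeat(np.arange(n_stages), frames_per_stage)
hidden = stage_means[labels] + 0.1 * rng.normal(size=(labels.size, dim))

# 1. One 960-dimensional centroid per stage label.
centroids = np.stack([hidden[labels == k].mean(axis=0) for k in range(n_stages)])

# 2. Classical MDS: double-center squared distances, keep the top-2 eigenvectors.
sq_dists = ((centroids[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
J = np.eye(n_stages) - np.ones((n_stages, n_stages)) / n_stages
B = -0.5 * J @ sq_dists @ J
eigvals, eigvecs = np.linalg.eigh(B)            # eigenvalues in ascending order
top = np.argsort(eigvals)[::-1][:2]
mds_coords = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

# 3. Minimum-norm least-squares map W: 960 -> 2, then project every frame.
W, *_ = np.linalg.lstsq(centroids, mds_coords, rcond=None)
positions = hidden @ W                          # shape (300, 2)

# Six anchors, 960 unknowns per output axis: the fit is exact by construction.
fit_error = np.abs(centroids @ W - mds_coords).max()
```

Note that `fit_error` is numerically zero. With only six anchors and 960 free parameters per axis, the centroids land exactly on their targets whatever the labels are, so a perfect-looking fit is not evidence that the map is meaningful.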
As a snapshot, it looks like this -
This worked visually, but I did not fully trust it.
The problem is that the centroid version bakes the labels into the geometry. It first decides where the six labeled stages should sit, then forces every frame through a projection built from only those six anchors. That makes the visualisation really fragile. A frame can only be interpreted relative to a hand-made stage map. The more I looked at it, the more it felt like I was visualising my labels as much as I was visualising the model.
Here's how it looks as the red point traverses through the 6 centroids with the clouds of different points overlaid in the background -
So I rebuilt the brain map without stage centroids:
- Take all 28,450 v1 hidden states.
- Reduce them directly to 2D.
- Color each point by MS-TCRNet label only after the fact.
- Animate the current frame as a red dot.
This now visualises the full cloud of hidden states instead of only six stage averages. The label set still affects the colors and stage interpretation, but not the 2D coordinates.
I tried PCA, t-SNE, and UMAP.
- PCA was honest but visually muddy.
- UMAP gave clean cluster regions.
- t-SNE gave a dense, populated field where the dot always lives among other dots.
I liked both the UMAP and t-SNE visualisations, so I decided to keep both.
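The key property of the rebuilt map is that labels never touch the coordinates. Here is a minimal sketch of that idea using PCA via SVD, chosen only because it needs nothing beyond NumPy; t-SNE and UMAP would slot into the same place through their own libraries. The data, the labels, and the frame indices are synthetic placeholders.

```python
import numpy as np

def pca_2d(hidden):
    """Project (N, D) hidden states onto their top-2 principal components."""
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
hidden = rng.normal(size=(500, 64))       # synthetic stand-in, not the real hidden states
labels = rng.integers(0, 6, size=500)     # hypothetical stage labels

coords = pca_2d(hidden)                   # labels play no part in the coordinates
colors = labels                           # labels are attached only for plotting
episode_frames = np.arange(100, 180)      # hypothetical frame indices of one episode
trajectory = coords[episode_frames]       # path of the red dot through the cloud
```

Swapping the label set changes `colors` and nothing else, which is exactly the separation the centroid version lacked.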
The episode below is ep27, a clean pick held out from MS-TCRNet training. It has hand-labeled ground truth, and MS-TCRNet gets F1@30 = 1.00 here. The hidden states are from layer 4, the best layer under the K=6 probe rerun.
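A note on the metric, with a hedge: the sketch below assumes F1@30 means the segmental F1 score common in action segmentation, where a predicted segment counts as correct if it overlaps an unmatched ground-truth segment of the same label with IoU of at least 0.30. That reading, and the helper names, are assumptions rather than the exact evaluation code.

```python
import numpy as np

def segments(labels):
    """Split a frame-label sequence into (label, start, end) runs, end exclusive."""
    labels = np.asarray(labels)
    cuts = np.flatnonzero(np.diff(labels)) + 1
    starts = np.concatenate(([0], cuts))
    ends = np.concatenate((cuts, [labels.size]))
    return [(int(labels[s]), int(s), int(e)) for s, e in zip(starts, ends)]

def segmental_f1(pred, truth, iou_threshold=0.30):
    """F1 over segments: a prediction is a hit if its best same-label,
    still-unmatched ground-truth segment has IoU >= iou_threshold."""
    pred_segs, true_segs = segments(pred), segments(truth)
    matched = [False] * len(true_segs)
    tp = 0
    for label, ps, pe in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (tl, ts, te) in enumerate(true_segs):
            if tl != label or matched[j]:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_threshold:
            matched[best_j] = True
            tp += 1
    fp = len(pred_segs) - tp
    fn = len(true_segs) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy episode: the first boundary is predicted two frames late.
truth = [0] * 10 + [1] * 10 + [2] * 10
pred = [0] * 12 + [1] * 8 + [2] * 10
f1_loose = segmental_f1(pred, truth, iou_threshold=0.30)
f1_strict = segmental_f1(pred, truth, iou_threshold=0.90)
```

Under this reading, a score of 1.00 says every segment was found in the right order with enough overlap. It does not by itself say every boundary is frame-exact, as the toy example shows when the threshold is tightened.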
UMAP version:
t-SNE version:
This is the visualisation I wanted from the beginning. The dot walks through stage regions. The stage bar fills left to right. The wrist camera shows the physical motion. You can correlate what the arm is doing with where the hidden state sits. The best part is not that the dot is pretty. The best part is that this version no longer forces the trajectory through hand-made stage centroids.
9. What I Learned
1. A model working is not the same as you understanding it
The robot picked the strawberry before I had a trustworthy explanation of what was happening inside. That is normal, but it is also dangerous. The more convincing the demo, the easier it is to stop asking whether your interpretation tools are true.
2. The frozen backbone was already doing useful perception work
Layer 2 of the frozen SmolVLM2 backbone already contains enough information to decode task stage at about 86% under cleaner labels. That does not prove the action expert uses this representation in the same way. It does mean the stage structure is present and easy to read.
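For anyone unfamiliar with what "linearly decodable" means in practice, here is a minimal probe sketch. It is a ridge-regularised least-squares classifier on synthetic clusters, picked for brevity; the probe type, the data, and the split below are assumptions, not the setup behind the 86% figure.

```python
import numpy as np

def fit_linear_probe(X, y, n_classes, ridge=1e-3):
    """Ridge-regularised least-squares probe onto one-hot stage targets."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])     # append a bias column
    Y = np.eye(n_classes)[y]                          # one-hot targets
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)               # weights, shape (D + 1, K)

def predict(W, X):
    """Predicted class = output column with the largest linear score."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ W).argmax(axis=1)

rng = np.random.default_rng(0)
n_classes, dim = 6, 32
means = 3.0 * rng.normal(size=(n_classes, dim))       # synthetic stage clusters
y = rng.integers(0, n_classes, size=1200)
X = means[y] + rng.normal(size=(1200, dim))           # stand-in for frozen hidden states

# The features stay fixed; only the linear map W is fit.
W = fit_linear_probe(X[:900], y[:900], n_classes)
held_out_acc = float(np.mean(predict(W, X[900:]) == y[900:]))
```

On real rollouts it is safer to split by episode rather than by frame, since neighbouring frames are nearly identical and would leak across a random split.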
3. Validate the validator early
Eight hand-labeled episodes were enough to catch the core bug. Not 800. Not a perfect dense annotation campaign. Sparse boundary labels are cheap, and they prevent weeks of analysis from sitting on top of wrong targets.
4. Constraints buy stability and sell flexibility
The Bakis HMM was not dumb. It solved a real problem: flicker. But the exact constraint that made it readable also made retries impossible. That is the trade. The lesson is not "never use strict structure." The lesson is "name the behavior your structure forbids."
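The trade is easy to see in the transition matrix itself. The sketch below builds the allowed-transition mask of a left-to-right (Bakis) HMM; the `max_jump=1` setting and the uniform probabilities are assumptions for illustration, not the fitted model.

```python
import numpy as np

def bakis_transition_mask(n_states, max_jump=1):
    """Allowed transitions in a left-to-right (Bakis) HMM: stay, or move
    forward by at most max_jump states. Backward moves are forbidden."""
    idx = np.arange(n_states)
    step = idx[None, :] - idx[:, None]       # step[i, j] = j - i
    return (step >= 0) & (step <= max_jump)

mask = bakis_transition_mask(6)

# Row-normalise the mask into a uniform transition matrix over allowed moves.
A = mask / mask.sum(axis=1, keepdims=True)

retry_prob = A[2, 1]       # moving back one stage, as a retry would
forward_prob = A[1, 2]     # moving on to the next stage
```

Every entry below the diagonal is exactly zero, so no amount of evidence in the observations can make the decoder step back a stage. That is what removes flicker, and it is also what makes a retry unrepresentable.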
5. Pretty interpretability can be wrong
The first brain video I got was seductive and told an exciting story. The centroid projection made it look as if the model's hidden state was running ahead of the task — predicting what it was about to do. For a few minutes that was thrilling. It was also an artifact: the projection forced every frame through a fragile six-anchor map, and once I audited it the effect disappeared. This is the trap — a visualisation can be internally consistent, beautiful, and false. Always audit your findings before proclaiming a great victory.
6. The pipeline is the contribution
The v1 policy is nice. The audit loop is more reusable:
- train a small VLA
- infer stages
- hand-check sparse boundaries
- probe hidden states
- compare geometry
- visualise trajectories
That is the part I would carry to the next robot task.
What Is Still Open
- A real rollout scorecard for v1. The model works qualitatively, but I do not have a published 20-trial number for this version.
- More recovery examples. MS-TCRNet catches ep34, but the recovery prior is still based on little data.
- A better comparison between proprio and VLM geometries. I say the spaces differ, but there are cleaner statistical tests still sitting on the table.
Closing
The honest summary:
I fine-tuned a small VLA and it picked a strawberry. Then the project became less about training and more about measurement.
The first stage model looked clean and was wrong. The first brain map looked meaningful and was wrong. Each wrong thing became useful only because it failed under an audit.
That is the pattern I trust now:
Build the thing. Make the measurement. Break the measurement. Keep the parts that survive.
Model: cn0303/smolvla-so101-strawberry-v1
Dataset: cn0303/so101-strawberry-pick-v1
I will release the full code once I have cleaned up the repo. I also plan to share the reusable pieces first: the visualisers, labeler, and lower-level analysis tools.
Let me know what you think. I am happy to answer any questions below.
Credits
A huge thank you to Lucas Mair for many of the ideas and brainstorming sessions, and to Luca Frattini for tips on finetuning the VLA and data collection. I am working alongside Akshay Khanna at Alpine Valley to build robotics for automated berry harvesting for Europe.










