# No signal, no GPU, no second chances: building an AI for the woods
Picture the worst possible place to run a machine learning model. No cell signal.
No cloud to phone home to. No GPU, just a Raspberry Pi and a little accelerator
chip the size of a stamp, running off a battery I've been babying since sunrise.
And the user? The user is standing in a damp forest about to decide whether to put
something in their mouth.
That's foraging. That's the actual deployment target. I'm not shipping a chatbot
that gets to be wrong in an interesting way. I'm shipping a thing that, if it's
confidently wrong, helps somebody have the last bad afternoon of their life.
So when people ask why my models are so small, I laugh a little. Small isn't the
compromise here. Small is the only thing that survives contact with the woods. The
whole project is a study in constraints, and honestly, the constraints did most of
the design work for me. I just had to stop fighting them.
Here's the stack: a domain router plus three expert classifiers, all
`tf_efficientnet_lite2`, about nine million parameters each. Four models, roughly
0.04 billion parameters total, which in the year of our lord 2026 is a rounding
error. They run offline on the device and they run on a plain CPU, no GPU required.
That's not me being humble. That's me being a forager with a dead phone two miles
from the trailhead. The "edge" everybody name-drops is, for me, just Tuesday.
## Constraint number one: the model is not allowed to vote
My first real architecture was clever and I was proud of it, which is usually the
tell that something is about to go wrong. The router would decide a photo was a
"plant," then I'd run two expert models on it and take whichever was more
confident. Max confidence wins. Democratic. Elegant.
It also tried to feed people poison hemlock.
Here's the failure, and it's a beautiful one. A deadly plant (say, poison hemlock)
gets routed to "plant." My high-value forageables expert has never seen hemlock in
its life, but it has seen ramps, and hemlock and ramps are cousins that have killed
people who confused them. So the high-value model looks at a deadly plant and
announces, at 0.9 confidence, that it's ramps. Delicious ramps. Meanwhile my
medicinals expert, the one that actually knows hemlock is death, correctly flags it
as deadly at lower confidence. Max confidence wins. The confident idiot
out-votes the cautious expert. In my tests this leaked deadly-as-edible about six
percent of the time, and not one of my per-model benchmarks caught it, because no
single model was ever wrong. The *system* was wrong.
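
A minimal sketch of that failure mode, with made-up numbers. The `Verdict` type, the expert names, and the 0.60 figure are assumptions for illustration; only the 0.9 "ramps" call comes from the story above.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    expert: str        # which expert produced this verdict (hypothetical names)
    label: str         # predicted class name
    confidence: float  # top-1 probability reported by that expert
    deadly: bool       # True if the predicted class is flagged deadly


def max_confidence_vote(verdicts):
    """The flawed fusion rule: whichever expert is most confident wins."""
    return max(verdicts, key=lambda v: v.confidence)


# Illustrative outputs for one photo of poison hemlock routed to "plant".
verdicts = [
    Verdict("high_value", "ramps", 0.90, deadly=False),          # never saw hemlock
    Verdict("medicinals", "poison_hemlock", 0.60, deadly=True),  # correct, less sure
]

winner = max_confidence_vote(verdicts)
print(winner.label, winner.deadly)  # ramps False
```

Each expert is behaving exactly as it was trained to, so a per-model benchmark has nothing to flag. The error only appears in the fusion step.
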
So I killed the vote. Now the router picks exactly one expert and that expert owns
the call, full stop. The mushroom expert never sees a plant, so it can never call a
plant anything. On top of that, any "deadly" verdict vetoes a more-confident "safe"
one, because in this domain a false reassurance is the only error that actually
matters. Deadly-as-edible dropped from around six percent to half a percent. The
constraint (keep it small, keep it dumb, keep the experts in their lanes) turned
out to be the safety feature.
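
One way the two rules could look in code. This is a sketch under assumptions, not the shipped logic: the domain names, the candidate list format, and the `veto_floor` parameter (a minimum probability before a deadly class is allowed to veto) are all hypothetical.

```python
def route(domain_scores):
    """Pick exactly one expert: the router's top-scoring domain."""
    return max(domain_scores, key=domain_scores.get)


def decide(candidates, deadly_labels, veto_floor=0.2):
    """Return the (label, confidence) that owns the call.

    candidates: (label, confidence) pairs from the ONE chosen expert.
    Any deadly candidate at or above veto_floor beats a more confident
    safe candidate. veto_floor is an assumed knob, not from the post.
    """
    vetoes = [c for c in candidates
              if c[0] in deadly_labels and c[1] >= veto_floor]
    pool = vetoes if vetoes else candidates
    return max(pool, key=lambda c: c[1])


# Illustrative numbers only.
domain_scores = {"mushroom": 0.05, "plant_high_value": 0.15, "plant_medicinal": 0.80}
expert = route(domain_scores)

candidates = [("yarrow", 0.55), ("poison_hemlock", 0.35), ("elderflower", 0.10)]
verdict = decide(candidates, deadly_labels={"poison_hemlock"})
print(expert, verdict)  # plant_medicinal ('poison_hemlock', 0.35)
```

The asymmetry is deliberate: a wrong "deadly" costs a meal, a wrong "safe" costs a lot more.
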
## Constraint number two: being safe is really easy if you don't mind being useless
Now the push and pull. Once you've been spooked by a near-miss, the temptation is to
crank every safety dial to the max. Refuse by default. When in doubt, abstain. Very
responsible. Very noble.
It's also how you build a tool nobody uses. I pushed the confidence gate up and
watched the thing turn into a coward. It started abstaining on blackberries.
Blackberries. At the safe-but-useless end of the curve it would refuse on roughly
half of perfectly edible finds, which is a fantastic way to teach your users that the
gadget is a paperweight and their own guessing is faster. And here's the part that
took me an embarrassing while to accept: you cannot gate your way to zero. I mapped
the whole curve. Loosen the gate and you get a decisive, useful tool that
occasionally calls something dangerous safe. Tighten it and you get a safe tool that
cries "I don't know" at a raspberry. There is no magic threshold that gives you both,
because the residual risk isn't low confidence; it's the model being confidently,
specifically wrong, and no confidence knob catches that.
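
The shape of that curve can be reproduced with a toy sweep. Everything here is illustrative: the record format, the eight hand-written records, and the thresholds are assumptions chosen to show the effect, not measurements from the project.

```python
def gate_metrics(records, threshold):
    """Return (refusal rate on edibles, deadly-as-edible rate) at one gate.

    records: (confidence, predicted_deadly, truly_deadly) tuples.
    """
    edible = [r for r in records if not r[2]]
    deadly = [r for r in records if r[2]]
    # An edible find is refused when confidence falls below the gate.
    refused = sum(1 for conf, _, _ in edible if conf < threshold)
    # A leak is a truly deadly item that clears the gate with a "safe" verdict.
    leaked = sum(1 for conf, pred_deadly, _ in deadly
                 if conf >= threshold and not pred_deadly)
    return refused / len(edible), leaked / len(deadly)


# Toy data, invented for illustration.
records = [
    (0.95, False, False), (0.80, False, False),
    (0.65, False, False), (0.55, False, False),  # edible, called safe
    (0.90, True, True), (0.70, True, True),      # deadly, called deadly
    (0.60, False, True),                         # deadly, hesitantly called safe
    (0.97, False, True),                         # deadly, confidently called safe
]

for t in (0.5, 0.7, 0.96):
    print(t, gate_metrics(records, t))
```

On this toy set the leak rate goes from 0.5 to 0.25 as the gate tightens and then stays there, while refusals on edibles climb from 0.0 to 1.0. The 0.97 record is the confidently-wrong case that no threshold below it can catch.
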
Tightening the screw was a dead end. So instead of asking the model to be more sure,
I started asking it for more evidence.
The move we landed on (and I'll be straight, this lives in a test harness right now,
not in the shipped app yet) is to stop treating one photo as the whole story. Show
the model the same subject from a couple of angles, or even just a few augmented crops
of the one shot, and fuse the results before it commits. In our harness this was
close to a free lunch: the multi-angle version drove deadly-as-edible toward zero on
the domains we tested while *lifting* accuracy on the safe stuff, not crushing it.
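
The post doesn't say how the harness fuses views, so the sketch below assumes the simplest option: average the softmax probabilities across views and take the top class. The class list and logits are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(per_view_logits):
    """Average per-view softmax probabilities (assumed fusion rule)."""
    probs = [softmax(logits) for logits in per_view_logits]
    n_views = len(probs)
    n_classes = len(probs[0])
    return [sum(p[i] for p in probs) / n_views for i in range(n_classes)]


classes = ["ramps", "poison_hemlock"]  # illustrative two-class expert
views = [
    [2.0, 0.0],  # leaf close-up: looks like ramps
    [0.0, 3.0],  # stem shot: strongly hemlock
    [0.0, 2.0],  # whole plant: hemlock
]

fused = fuse_views(views)
top = classes[max(range(len(fused)), key=fused.__getitem__)]
print(top)  # poison_hemlock
```

A single misleading angle gets outvoted by evidence here rather than by a rival model's confidence, which is the difference from the max-confidence scheme earlier.
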
Even better is making the second photo a targeted ask, made only when the top guess is a
safe-looking thing that has a deadly twin. The app turns to you and says "okay,
photograph the stem before I sign off on this." The friction is the feature. The
moment of "hang on, show me more" is exactly the moment a careful forager would slow
down anyway. We're not nagging with a banner nobody reads; we're building the caution
into the interaction. That part's still being wired in, and I'll write it up properly
when it is.
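
Since that part isn't wired in yet, the following is only a guess at its shape: a lookup from safe-looking labels to their deadly twin and the plant part worth photographing. The table contents, field names, and `next_step` function are hypothetical.

```python
# Hypothetical lookup: safe-looking label -> (deadly twin, part to photograph).
DEADLY_TWINS = {
    "ramps": ("poison_hemlock", "stem"),
}


def next_step(top_label):
    """Decide whether to report the guess or ask for a targeted second photo."""
    twin = DEADLY_TWINS.get(top_label)
    if twin is None:
        # No known deadly look-alike: no extra friction.
        return {"action": "report", "label": top_label}
    lookalike, part = twin
    return {
        "action": "request_photo",
        "part": part,
        "label": top_label,
        "rule_out": lookalike,
    }


print(next_step("blackberry")["action"])  # report
print(next_step("ramps")["action"])       # request_photo
```

The extra photo is only requested where a deadly twin exists, so blackberries never pay the friction cost.
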
## The one-line bug that told me my safety net was broken
A short detour into humility, because this is a field notes post and field notes
should include the faceplants.
For ages I believed my out-of-distribution detector didn't work. This is the piece
that's supposed to notice when you point the camera at, I don't know, a car, or your
own shoe, and say "that's not food, I'm not playing." I was using an energy score for
it, and every time I measured the thing it scored about 0.25 AUROC, which for the
non-nerds means it was worse than a coin flip (a coin flip scores 0.5). Inverted.
Useless. I shelved it and
moved on, mildly betrayed.
Then I actually read my own code. The energy score is a log-sum-exp over the logits,
and the standard trick subtracts the max logit for numerical stability and is
supposed to add it right back. Mine subtracted it and forgot to add it back. One
term. That single missing piece didn't just make the
number wrong, it flipped its meaning, so my detector was confidently pointing the
wrong direction the entire time. Fixed the one line, re-ran it, and the same detector
jumped to around 0.90 AUROC on my hardest domain. It had been working all along. I
had been reading the dial upside down.
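
A reconstruction of the bug, assuming the usual energy-score definition (negative temperature times log-sum-exp of the logits, where lower energy means more in-distribution). The function names and example logits are illustrative, not the project's code.

```python
import math


def energy_score(logits, temperature=1.0):
    """E(x) = -T * logsumexp(logits / T). Lower = more in-distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    # Stable log-sum-exp: subtract the max inside, add it back outside.
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -temperature * lse


def energy_score_buggy(logits, temperature=1.0):
    """Same computation with the '+ m' term forgotten."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = math.log(sum(math.exp(z - m) for z in scaled))  # missing "m +"
    return -temperature * lse


peaked = [10.0, 0.0, 0.0]  # confident, in-distribution-looking logits
flat = [1.0, 1.0, 1.0]     # diffuse logits, typical of an unfamiliar input

print(energy_score(peaked) < energy_score(flat))              # True
print(energy_score_buggy(peaked) > energy_score_buggy(flat))  # True
```

Without the max added back, the score no longer depends on how large the logits are, only on how flat they are. A peaked prediction scores near zero and a diffuse one scores lower, so the ordering reverses. That fits an AUROC well under 0.5.
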
The lesson I keep taped to the inside of my skull now: when a detector says it's
failing, check that you're not holding it backwards before you believe it. SHOCKING,
I know.
## The actual thesis
Here's why I think the constraints were a gift and not a punishment. When you've got
a giant model and infinite compute, you can paper over hard calls. You can be a
little wrong everywhere and call it nuance. Out here, on a stamp-sized chip in a
forest, there's nowhere to hide. Every tradeoff has to be made out loud: do I want
decisive or do I want safe, and where exactly do I put the line, and who gets hurt at
each setting. The edge didn't limit the engineering. It made the engineering honest.
Which, now that I say it out loud, is the whole reason I'm doing any of this. When
everything else goes digital and frictionless and confident, the stuff that keeps you
alive is still physical, still slow, still asks you to look twice. I built a machine
that's good enough to be worth beating. The goal was never to replace the forager. It
was to keep them sharp.
Or so I tell myself, two miles from the trailhead, photographing a stem.
What would you have done at the safety-versus-useful fork: held the line on refusing,
or chased the second photo? And what's the dumbest one-line bug that ever cost you a
week?