Add Field Notes writeup + README link
# No signal, no GPU, no second chances: building an AI for the woods

Picture the worst possible place to run a machine learning model. No cell signal.
No cloud to phone home to. No GPU, just a Raspberry Pi and a little accelerator
chip the size of a stamp, running off a battery I've been babying since sunrise.
And the user? The user is standing in a damp forest about to decide whether to put
something in their mouth.

That's foraging. That's the actual deployment target. I'm not shipping a chatbot
that gets to be wrong in an interesting way. I'm shipping a thing that, if it's
confidently wrong, helps somebody have the last bad afternoon of their life.

So when people ask why my models are so small, I laugh a little. Small isn't the
compromise here. Small is the only thing that survives contact with the woods. The
whole project is a study in constraints, and honestly, the constraints did most of
the design work for me. I just had to stop fighting them.

Here's the stack: a domain router plus three expert classifiers, all
`tf_efficientnet_lite2`, about nine million parameters each. Four models, roughly
0.04 billion parameters total, which in the year of our lord 2026 is a rounding
error. They run offline on the device and they run on a plain CPU, no GPU required.
That's not me being humble. That's me being a forager with a dead phone two miles
from the trailhead. The "edge" everybody name-drops is, for me, just Tuesday.

## Constraint number one: the model is not allowed to vote

My first real architecture was clever and I was proud of it, which is usually the
tell that something is about to go wrong. The router would decide a photo was a
"plant," then I'd run two expert models on it and take whichever was more
confident. Max confidence wins. Democratic. Elegant.

It also tried to feed people poison hemlock.

Here's the failure, and it's a beautiful one. A deadly plant (say, poison hemlock)
gets routed to "plant." My high-value forageables expert has never seen hemlock in
its life, but it has seen ramps, and hemlock and ramps are cousins that have killed
people who confused them. So the high-value model looks at a deadly plant and
announces, at 0.9 confidence, that it's ramps. Delicious ramps. Meanwhile my
medicinals expert, the one that actually knows hemlock is death, correctly flags it
as deadly at lower confidence. Max confidence wins. The confident idiot
out-votes the cautious expert. In my tests this leaked deadly-as-edible about six
percent of the time, and not one of my per-model benchmarks caught it, because no
single model was ever wrong. The *system* was wrong.

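To make that failure concrete, here is a minimal sketch of max-confidence fusion in plain Python. It is not the project's code: the `Verdict` type, the expert names, and the 0.6 confidence for the cautious expert are illustrative assumptions, and only the 0.9 figure comes from the account above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    expert: str        # which expert produced this call (hypothetical names)
    label: str         # predicted class name
    deadly: bool       # whether that class is tagged deadly
    confidence: float  # top-1 probability reported by the expert


def max_confidence_vote(verdicts):
    """The retired rule: whichever expert is most confident wins."""
    return max(verdicts, key=lambda v: v.confidence)


# Illustrative numbers: 0.9 is from the text, 0.6 is made up.
hemlock_photo = [
    Verdict("high_value", "ramps", deadly=False, confidence=0.9),
    Verdict("medicinals", "poison_hemlock", deadly=True, confidence=0.6),
]

winner = max_confidence_vote(hemlock_photo)

# A "leak" is a safe-sounding final answer when some expert said deadly.
leaked = (not winner.deadly) and any(v.deadly for v in hemlock_photo)
print(winner.label, leaked)
```

Neither toy expert is doing anything it wasn't trained to do, which is why a per-model benchmark would stay green while the fused answer leaks.
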
So I killed the vote. Now the router picks exactly one expert and that expert owns
the call, full stop. The mushroom expert never sees a plant, so it can never call a
plant anything. On top of that, any "deadly" verdict vetoes a more-confident "safe"
one, because in this domain a false reassurance is the only error that actually
matters. Deadly-as-edible dropped from around six percent to half a percent. The
constraint (keep it small, keep it dumb, keep the experts in their lanes) turned
out to be the safety feature.

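A rough sketch of single-owner routing with a deadly veto, under loudly stated assumptions. The text doesn't say where the competing verdicts come from once only one expert runs, so this version assumes the veto is applied over the owning expert's top candidates. `Candidate`, `route`, `decide`, `classify`, the `veto_floor` value, and the toy experts are all hypothetical names and numbers, not the shipped implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str
    deadly: bool
    confidence: float


def route(router_scores):
    """Pick exactly one domain; no other expert is consulted."""
    return max(router_scores, key=router_scores.get)


def decide(candidates, veto_floor=0.05):
    """Return the owning expert's call, letting any deadly candidate veto.

    veto_floor is a hypothetical noise floor so that a vanishingly small
    deadly probability does not veto everything.
    """
    deadly = [c for c in candidates if c.deadly and c.confidence >= veto_floor]
    if deadly:
        return max(deadly, key=lambda c: c.confidence)
    return max(candidates, key=lambda c: c.confidence)


def classify(router_scores, experts, photo):
    """Route to one expert, then apply the veto to that expert's candidates."""
    domain = route(router_scores)
    return domain, decide(experts[domain](photo))


# Toy experts standing in for real models (assumption: each returns candidates).
experts = {
    "plant": lambda photo: [
        Candidate("ramps", False, 0.70),
        Candidate("poison_hemlock", True, 0.25),
    ],
    "mushroom": lambda photo: [Candidate("chanterelle", False, 0.99)],
}

domain, call = classify({"plant": 0.8, "mushroom": 0.2}, experts, photo=None)
print(domain, call.label)
```

The mushroom expert's 0.99 never enters the decision because it was never asked, and the 0.25 deadly candidate beats the 0.70 safe one by rule rather than by confidence.
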
## Constraint number two: being safe is really easy if you don't mind being useless

Now the push and pull. Once you've been spooked by a near-miss, the temptation is to
crank every safety dial to the max. Refuse by default. When in doubt, abstain. Very
responsible. Very noble.

It's also how you build a tool nobody uses. I pushed the confidence gate up and
watched the thing turn into a coward. It started abstaining on blackberries.
Blackberries. At the safe-but-useless end of the curve it would refuse on roughly
half of perfectly edible finds, which is a fantastic way to teach your users that the
gadget is a paperweight and their own guessing is faster. And here's the part that
took me an embarrassing while to accept: you cannot gate your way to zero. I mapped
the whole curve. Loosen the gate and you get a decisive, useful tool that
occasionally calls something dangerous safe. Tighten it and you get a safe tool that
cries "I don't know" at a raspberry. There is no magic threshold that gives you both,
because the residual risk isn't low confidence; it's the model being confidently,
specifically wrong, and no confidence knob catches that.

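The shape of that curve can be sketched with a toy threshold sweep. The seven samples below are invented to illustrate the argument and are not measurements from the project; `sweep` and its return values are hypothetical names.

```python
# Each sample: (confidence, predicted_edible, truly_deadly). Made-up toy data.
samples = [
    (0.97, True, True),    # confidently, specifically wrong
    (0.55, True, True),    # low-confidence miss: a gate can catch this one
    (0.95, True, False),
    (0.80, True, False),
    (0.65, True, False),
    (0.50, True, False),
    (0.90, False, True),   # deadly, correctly flagged
]


def sweep(samples, gate):
    """Return (refusal rate on edible finds, deadly-as-edible count) at one gate."""
    edible = [s for s in samples if not s[2]]
    refused = sum(1 for conf, _, _ in edible if conf < gate)
    leaks = sum(
        1
        for conf, predicted_edible, deadly in samples
        if deadly and predicted_edible and conf >= gate
    )
    return refused / len(edible), leaks


for gate in (0.4, 0.6, 0.9, 0.96):
    print(gate, sweep(samples, gate))
```

On this toy data, raising the gate from 0.4 to 0.6 removes the low-confidence leak, but going all the way to 0.96 refuses every edible find while the 0.97 mistake still gets through.
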
Tightening the screw was a dead end. So instead of asking the model to be more sure,
I started asking it for more evidence.

The move we landed on (and I'll be straight, this lives in a test harness right now,
not in the shipped app yet) is to stop treating one photo as the whole story. Show
the model the same subject from a couple of angles, or even just a few augmented crops
of the one shot, and fuse the results before it commits. In our harness this was
close to a free lunch: the multi-angle version drove deadly-as-edible toward zero on
the domains we tested while *lifting* accuracy on the safe stuff, not crushing it.
Even better is making the second photo a targeted ask: only when the top guess is a
safe-looking thing that has a deadly twin. The app turns to you and says "okay,
photograph the stem before I sign off on this." The friction is the feature. The
moment of "hang on, show me more" is exactly the moment a careful forager would slow
down anyway. We're not nagging with a banner nobody reads; we're building the caution
into the interaction. That part's still being wired in, and I'll write it up properly
when it is.

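A minimal sketch of both ideas, with assumptions flagged: the text says only "fuse the results," so plain averaging of per-class probabilities is my stand-in for whatever the harness does. The `DEADLY_TWINS` table, `fuse_views`, `next_step`, and the `min_views` parameter are hypothetical.

```python
def fuse_views(view_probs):
    """Average per-class probabilities across views of one subject.

    Averaging is an assumed fusion rule; the real harness may do otherwise.
    """
    n = len(view_probs)
    classes = view_probs[0].keys()
    return {c: sum(p[c] for p in view_probs) / n for c in classes}


# Hypothetical lookalike table: safe-looking label -> (deadly twin, part to shoot).
DEADLY_TWINS = {"ramps": ("poison_hemlock", "stem")}


def next_step(fused, n_views, min_views=2):
    """Ask for a targeted second photo only when the top guess has a deadly twin."""
    top = max(fused, key=fused.get)
    if top in DEADLY_TWINS and n_views < min_views:
        _twin, part = DEADLY_TWINS[top]
        return f"photograph the {part} before I sign off on this"
    return f"call: {top}"


first = {"ramps": 0.7, "poison_hemlock": 0.3}      # leaf shot, made-up numbers
second = {"ramps": 0.1, "poison_hemlock": 0.9}     # stem shot, made-up numbers

print(next_step(first, n_views=1))
print(next_step(fuse_views([first, second]), n_views=2))
```

In this toy run the first photo alone would have said ramps; the targeted stem shot pulls the fused answer over to the deadly twin. A blackberry, which has no entry in the table, would be called on the first photo with no extra friction.
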
## The one-line bug that told me my safety net was broken

A short detour into humility, because this is a field notes post and field notes
should include the faceplants.

For ages I believed my out-of-distribution detector didn't work. This is the piece
that's supposed to notice when you point the camera at, I don't know, a car, or your
own shoe, and say "that's not food, I'm not playing." I was using an energy score for
it, and every time I measured the thing it scored about 0.25 AUROC, which for the
non-nerds means it was worse than a coin flip (a coin flip scores 0.5). Inverted.
Useless. I shelved it and moved on, mildly betrayed.

Then I actually read my own code. The energy formula subtracts the max value for
numerical stability and is supposed to add it right back. Mine subtracted it and
forgot to add it back. One term. That single missing piece didn't just make the
number wrong, it flipped its meaning, so my detector was confidently pointing the
wrong direction the entire time. Fixed the one line, re-ran it, and the same detector
jumped to around 0.90 AUROC on my hardest domain. It had been working all along. I
had been reading the dial upside down.

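For anyone who wants to see the flip, here is a reconstruction of that class of bug in plain Python. It is a sketch from the description above, not the original code. It assumes the standard energy score with temperature 1, that is, energy = -logsumexp(logits), where lower energy means "looks familiar." The function names and the two logit vectors are made up for illustration.

```python
import math


def logsumexp(logits):
    """Stable log(sum(exp(x))): subtract the max, then add it back."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def logsumexp_buggy(logits):
    """Same computation, but the max is never added back."""
    m = max(logits)
    return math.log(sum(math.exp(x - m) for x in logits))


def energy(logits, lse=logsumexp):
    """Energy score at temperature 1 (assumed): lower means more in-distribution."""
    return -lse(logits)


confident = [9.0, 1.0, 0.5]  # one dominant logit, like a familiar subject
flat = [1.0, 0.9, 1.1]       # small, undifferentiated logits, like a shoe

print("correct:", energy(confident), energy(flat))
print("buggy:  ", energy(confident, logsumexp_buggy), energy(flat, logsumexp_buggy))
```

With the max added back, the confident input gets the lower energy. Without it, the score collapses to a value between 0 and log(number of classes) that is smallest in magnitude for peaked outputs, so on this pair the ordering reverses and the flat input looks like the familiar one.
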
The lesson I keep taped to the inside of my skull now: when a detector says it's
failing, check that you're not holding it backwards before you believe it. SHOCKING,
I know.

## The actual thesis

Here's why I think the constraints were a gift and not a punishment. When you've got
a giant model and infinite compute, you can paper over hard calls. You can be a
little wrong everywhere and call it nuance. Out here, on a stamp-sized chip in a
forest, there's nowhere to hide. Every tradeoff has to be made out loud: do I want
decisive or do I want safe, and where exactly do I put the line, and who gets hurt at
each setting. The edge didn't limit the engineering. It made the engineering honest.

Which, now that I say it out loud, is the whole reason I'm doing any of this. When
everything else goes digital and frictionless and confident, the stuff that keeps you
alive is still physical, still slow, still asks you to look twice. I built a machine
that's good enough to be worth beating. The goal was never to replace the forager. It
was to keep them sharp.

Or so I tell myself, two miles from the trailhead, photographing a stem.

What would you have done at the safety-versus-useful fork: held the line on refusing,
or chased the second photo? And what's the dumbest one-line bug that ever cost you a
week?