Forager-Field-Notes / FIELD_NOTES.md

HomesteaderLabs

Add Field Notes writeup + README link

6ef157d verified 1 day ago

No signal, no GPU, no second chances: building an AI for the woods

Picture the worst possible place to run a machine learning model. No cell signal. No cloud to phone home to. No GPU, just a Raspberry Pi and a little accelerator chip the size of a stamp, running off a battery I've been babying since sunrise. And the user? The user is standing in a damp forest about to decide whether to put something in their mouth.

That's foraging. That's the actual deployment target. I'm not shipping a chatbot that gets to be wrong in an interesting way. I'm shipping a thing that, if it's confidently wrong, helps somebody have the last bad afternoon of their life.

So when people ask why my models are so small, I laugh a little. Small isn't the compromise here. Small is the only thing that survives contact with the woods. The whole project is a study in constraints, and honestly, the constraints did most of the design work for me. I just had to stop fighting them.

Here's the stack: a domain router plus three expert classifiers, all tf_efficientnet_lite2, about nine million parameters each. Four models, roughly 36 million parameters total (about 0.04 billion), which in the year of our lord 2026 is a rounding error. They run offline on the device and they run on a plain CPU, no GPU required. That's not me being humble. That's me being a forager with a dead phone two miles from the trailhead. The "edge" everybody name-drops is, for me, just Tuesday.

Constraint number one: the model is not allowed to vote

My first real architecture was clever and I was proud of it, which is usually the tell that something is about to go wrong. The router would decide a photo was a "plant," then I'd run two expert models on it and take whichever was more confident. Max confidence wins. Democratic. Elegant.

It also tried to feed people poison hemlock.

Here's the failure, and it's a beautiful one. A deadly plant — say, poison hemlock — gets routed to "plant." My high-value forageables expert has never seen hemlock in its life, but it has seen ramps, and hemlock and ramps are cousins that have killed people who confused them. So the high-value model looks at a deadly plant and announces, at 0.9 confidence, that it's ramps. Delicious ramps. Meanwhile my medicinals expert, the one that actually knows hemlock is death, correctly flags it as deadly at lower confidence. Max confidence wins. The confident idiot out-votes the cautious expert. In my tests this leaked deadly-as-edible about six percent of the time, and not one of my per-model benchmarks caught it, because no single model was ever wrong. The system was wrong.
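A minimal sketch of that failure mode, assuming each expert returns a top label, a verdict, and a confidence. The `Prediction` type, the expert names, and the 0.62 figure are hypothetical stand-ins; only the 0.9 "ramps" call comes from the account above.

```python
# Hypothetical expert outputs for one photo of poison hemlock.
# Labels, verdicts and the 0.62 confidence are illustrative, not measured values.
from dataclasses import dataclass


@dataclass
class Prediction:
    expert: str
    label: str
    verdict: str       # "edible" or "deadly"
    confidence: float  # top-1 probability reported by that expert


def max_confidence_vote(predictions):
    """The fusion rule described above: the single most confident expert wins."""
    return max(predictions, key=lambda p: p.confidence)


hemlock_photo = [
    # This expert has never seen hemlock, so it picks its nearest known class.
    Prediction("high_value", "ramps", "edible", 0.90),
    # This expert knows hemlock but is less sure about this particular shot.
    Prediction("medicinals", "poison hemlock", "deadly", 0.62),
]

winner = max_confidence_vote(hemlock_photo)
print(winner.label, winner.verdict)  # ramps edible
```

Neither prediction is unreasonable on its own terms, which is why a per-model benchmark would not flag it. The error only appears in the fusion step.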

So I killed the vote. Now the router picks exactly one expert and that expert owns the call, full stop. The mushroom expert never sees a plant, so it can never call a plant anything. On top of that, any "deadly" verdict vetoes a more-confident "safe" one, because in this domain a false reassurance is the only error that actually matters. Deadly-as-edible dropped from around six percent to half a percent. The constraint — keep it small, keep it dumb, keep the experts in their lanes — turned out to be the safety feature.
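One way the single-owner rule plus the deadly veto could look in code. This is a sketch under assumptions, not the shipped logic: the post doesn't say how the veto is computed, so here it is read as "any deadly class above a small probability floor overrides a more confident safe class". The `DEADLY` set, the `veto_floor` value, and the stub experts are all hypothetical.

```python
# Illustrative only: class names, probabilities and the veto floor are assumptions.
DEADLY = {"poison hemlock", "death cap"}


def decide(chosen_expert, experts, image, veto_floor=0.10):
    """Run only the expert the router picked, then apply a deadly veto.

    experts maps an expert name to a callable returning {label: probability}.
    Any deadly class at or above veto_floor overrides a more confident safe class.
    """
    probs = experts[chosen_expert](image)
    deadly_hits = {label: p for label, p in probs.items()
                   if label in DEADLY and p >= veto_floor}
    if deadly_hits:
        return max(deadly_hits, key=deadly_hits.get), "deadly"
    return max(probs, key=probs.get), "safe"


def medicinals_expert(image):
    # Stub standing in for a real classifier.
    return {"yarrow": 0.70, "poison hemlock": 0.25, "plantain": 0.05}


def mushroom_expert(image):
    # Never called for a plant photo, so it can never mislabel one.
    raise AssertionError("mushroom expert should not see plant photos")


experts = {"medicinals": medicinals_expert, "mushroom": mushroom_expert}
verdict = decide("medicinals", experts, image=None)
print(verdict)  # ('poison hemlock', 'deadly')
```

The safe class wins on raw confidence here (0.70 against 0.25), and the veto still returns the deadly verdict. That asymmetry is the point: a false alarm costs a meal, a false reassurance costs much more.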

Constraint number two: being safe is really easy if you don't mind being useless

Now the push and pull. Once you've been spooked by a near-miss, the temptation is to crank every safety dial to the max. Refuse by default. When in doubt, abstain. Very responsible. Very noble.

It's also how you build a tool nobody uses. I pushed the confidence gate up and watched the thing turn into a coward. It started abstaining on blackberries. Blackberries. At the safe-but-useless end of the curve it would refuse on roughly half of perfectly edible finds, which is a fantastic way to teach your users that the gadget is a paperweight and their own guessing is faster.

And here's the part that took me an embarrassing while to accept: you cannot gate your way to zero. I mapped the whole curve. Loosen the gate and you get a decisive, useful tool that occasionally calls something dangerous safe. Tighten it and you get a safe tool that cries "I don't know" at a raspberry. There is no magic threshold that gives you both, because the residual risk isn't low confidence. It's the model being confidently, specifically wrong, and no confidence knob catches that.
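The shape of that curve can be sketched with a threshold sweep. The rows in `SAMPLES` are made up purely to illustrate the trade-off, and `sweep` is a hypothetical helper, not the harness used for the numbers above.

```python
# Toy validation set: (top-1 confidence, predicted verdict, true verdict).
# These rows are invented to show the shape of the trade-off, not real results.
SAMPLES = [
    (0.98, "edible", "deadly"),   # confidently, specifically wrong
    (0.97, "edible", "edible"),
    (0.88, "edible", "edible"),
    (0.74, "edible", "edible"),
    (0.66, "deadly", "deadly"),
    (0.58, "edible", "edible"),
    (0.52, "edible", "deadly"),
    (0.45, "edible", "edible"),
]


def sweep(samples, gate):
    """Return (edible_refused_rate, deadly_as_edible_rate) at one gate value."""
    edible = [s for s in samples if s[2] == "edible"]
    deadly = [s for s in samples if s[2] == "deadly"]
    # Below the gate the tool abstains, so edible finds get refused.
    refused = sum(1 for conf, _, _ in edible if conf < gate)
    # At or above the gate the tool commits, so a wrong "edible" leaks through.
    leaked = sum(1 for conf, pred, _ in deadly
                 if conf >= gate and pred == "edible")
    return refused / len(edible), leaked / len(deadly)


for gate in (0.40, 0.60, 0.80, 0.95, 0.99):
    refused, leaked = sweep(SAMPLES, gate)
    print(f"gate={gate:.2f}  edible refused={refused:.2f}  "
          f"deadly-as-edible={leaked:.2f}")
```

In this toy set the 0.98 mistake survives every gate that still lets any edible find through. Leakage only reaches zero at the gate that refuses everything, which is the "cannot gate your way to zero" problem in miniature.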

Tightening the screw was a dead end. So instead of asking the model to be more sure, I started asking it for more evidence.

The move I landed on (and I'll be straight, this lives in a test harness right now, not in the shipped app yet) is to stop treating one photo as the whole story. Show the model the same subject from a couple of angles, or even just a few augmented crops of the one shot, and fuse the results before it commits. In my harness this was close to a free lunch: the multi-angle version drove deadly-as-edible toward zero on the domains I tested while lifting accuracy on the safe stuff, not crushing it. Even better is making the second photo a targeted ask, requested only when the top guess is a safe-looking thing that has a deadly twin. The app turns to you and says "okay, photograph the stem before I sign off on this." The friction is the feature. The moment of "hang on, show me more" is exactly the moment a careful forager would slow down anyway. I'm not nagging with a banner nobody reads; I'm building the caution into the interaction. That part's still being wired in, and I'll write it up properly when it is.
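A rough sketch of both ideas, with loud caveats: the post doesn't say how views are fused, so this assumes a plain mean over per-class probabilities, and the `DEADLY_TWINS` table, the prompt text, and `min_views` are hypothetical.

```python
# Assumed lookup: safe-looking labels that have a deadly twin, and what to ask for.
DEADLY_TWINS = {"ramps": "photograph the stem"}


def fuse_views(view_probs):
    """Average per-class probabilities across views (simple mean fusion)."""
    labels = view_probs[0].keys()
    n = len(view_probs)
    return {label: sum(v[label] for v in view_probs) / n for label in labels}


def next_step(view_probs, min_views=2):
    """Return ("ask", prompt) or ("answer", label) for the fused prediction."""
    fused = fuse_views(view_probs)
    top = max(fused, key=fused.get)
    # Only add friction when the top guess is a safe label with a deadly twin.
    if top in DEADLY_TWINS and len(view_probs) < min_views:
        return "ask", DEADLY_TWINS[top]
    return "answer", top


one_view = [{"ramps": 0.8, "poison hemlock": 0.2}]
print(next_step(one_view))   # ('ask', 'photograph the stem')

# The second angle disagrees, and the fused result changes the call.
two_views = one_view + [{"ramps": 0.1, "poison hemlock": 0.9}]
print(next_step(two_views))  # ('answer', 'poison hemlock')
```

The probabilities are invented for the demo. A label without a deadly twin would be answered from a single view, so the extra photo is only requested where it can change a dangerous outcome.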

The one-line bug that told me my safety net was broken

A short detour into humility, because this is a field notes post and field notes should include the faceplants.

For ages I believed my out-of-distribution detector didn't work. This is the piece that's supposed to notice when you point the camera at, I don't know, a car, or your own shoe, and say "that's not food, I'm not playing." I was using an energy score for it, and every time I measured the thing it scored about 0.25 AUROC, which for the non-nerds means it was worse than a coin flip (a coin flip scores 0.5). Not just weak: inverted. Useless. I shelved it and moved on, mildly betrayed.
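For readers who want the "inverted" claim spelled out, here is a small pairwise AUROC on made-up scores. The `auroc` helper and both score lists are illustrative and say nothing about the real detector's outputs.

```python
def auroc(scores_pos, scores_neg):
    """Chance that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Invented detector scores: out-of-distribution photos should score HIGHER.
ood_scores = [0.10, 0.20, 0.45, 0.75]       # shoes, cars, and so on
in_dist_scores = [0.40, 0.50, 0.70, 0.80]   # actual forageables

print(auroc(ood_scores, in_dist_scores))  # 0.25

# Negating every score reverses the ranking, so the AUROC becomes 1 - 0.25.
flipped = auroc([-s for s in ood_scores], [-s for s in in_dist_scores])
print(flipped)  # 0.75
```

A score well below 0.5 is a hint that the detector carries signal but the ranking is pointed the wrong way, which is a different failure from a detector that knows nothing.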

Then I actually read my own code. The energy formula subtracts the max value for numerical stability and is supposed to add it right back. Mine subtracted it and forgot to add it back. One term. That single missing piece didn't just make the number wrong, it flipped its meaning, so my detector was confidently pointing the wrong direction the entire time. Fixed the one line, re-ran it, and the same detector jumped to around 0.90 AUROC on my hardest domain. It had been working all along. I had been reading the dial upside down.
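A reconstruction of that bug as described, not the original code. It assumes the common energy definition, the negative log-sum-exp of the logits with temperature 1, and the two logit vectors are invented to stand in for a familiar and an unfamiliar image.

```python
import math


def logsumexp_buggy(logits):
    """Subtracts the max for stability but never adds it back."""
    m = max(logits)
    return math.log(sum(math.exp(z - m) for z in logits))


def logsumexp(logits):
    """Stable log-sum-exp: subtract the max inside, add it back outside."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def energy(logits, lse=logsumexp):
    """Energy score E(x) = -logsumexp(logits); higher suggests out-of-distribution."""
    return -lse(logits)


peaked = [12.0, 1.0, 0.5]  # illustrative logits with one strong class
flat = [1.0, 0.9, 1.1]     # illustrative logits with no strong class

# Correct version: the flat, unfamiliar-looking input gets the higher energy.
print(energy(peaked) < energy(flat))                                    # True

# Buggy version: the ordering reverses, so the detector points the wrong way.
print(energy(peaked, logsumexp_buggy) > energy(flat, logsumexp_buggy))  # True
```

Without the max added back, the score no longer depends on how large the logits are, only on how spread out they are, and that is enough to reverse the ranking on examples like these.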

The lesson I keep taped to the inside of my skull now: when a detector says it's failing, check that you're not holding it backwards before you believe it. SHOCKING, I know.

The actual thesis

Here's why I think the constraints were a gift and not a punishment. When you've got a giant model and infinite compute, you can paper over hard calls. You can be a little wrong everywhere and call it nuance. Out here, on a stamp-sized chip in a forest, there's nowhere to hide. Every tradeoff has to be made out loud: do I want decisive or do I want safe, and where exactly do I put the line, and who gets hurt at each setting. The edge didn't limit the engineering. It made the engineering honest.

Which, now that I say it out loud, is the whole reason I'm doing any of this. When everything else goes digital and frictionless and confident, the stuff that keeps you alive is still physical, still slow, still asks you to look twice. I built a machine that's good enough to be worth beating. The goal was never to replace the forager. It was to keep them sharp.

Or so I tell myself, two miles from the trailhead, photographing a stem.

What would you have done at the safety-versus-useful fork — held the line on refusing, or chased the second photo? And what's the dumbest one-line bug that ever cost you a week?