Forager-Field-Notes / FIELD_NOTES.md

HomesteaderLabs

Add Field Notes writeup + README link

6ef157d verified 1 day ago

	# No signal, no GPU, no second chances: building an AI for the woods

	Picture the worst possible place to run a machine learning model. No cell signal.
	No cloud to phone home to. No GPU, just a Raspberry Pi and a little accelerator
	chip the size of a stamp, running off a battery I've been babying since sunrise.
	And the user? The user is standing in a damp forest about to decide whether to put
	something in their mouth.

	That's foraging. That's the actual deployment target. I'm not shipping a chatbot
	that gets to be wrong in an interesting way. I'm shipping a thing that, if it's
	confidently wrong, helps somebody have the last bad afternoon of their life.

	So when people ask why my models are so small, I laugh a little. Small isn't the
	compromise here. Small is the only thing that survives contact with the woods. The
	whole project is a study in constraints, and honestly, the constraints did most of
	the design work for me. I just had to stop fighting them.

	Here's the stack: a domain router plus three expert classifiers, all
	`tf_efficientnet_lite2`, about nine million parameters each. Four models, roughly
	0.04 billion parameters total, which in the year of our lord 2026 is a rounding
	error. They run offline on the device and they run on a plain CPU, no GPU required.
	That's not me being humble. That's me being a forager with a dead phone two miles
	from the trailhead. The "edge" everybody name-drops is, for me, just Tuesday.

	## Constraint number one: the model is not allowed to vote

	My first real architecture was clever and I was proud of it, which is usually the
	tell that something is about to go wrong. The router would decide a photo was a
	"plant," then I'd run two expert models on it and take whichever was more
	confident. Max confidence wins. Democratic. Elegant.

	It also tried to feed people poison hemlock.

	Here's the failure, and it's a beautiful one. A deadly plant — say, poison hemlock —
	gets routed to "plant." My high-value forageables expert has never seen hemlock in
	its life, but it has seen ramps, and hemlock and ramps are cousins that have killed
	people who confused them. So the high-value model looks at a deadly plant and
	announces, at 0.9 confidence, that it's ramps. Delicious ramps. Meanwhile my
	medicinals expert, the one that actually knows hemlock is death, correctly flags it
	as deadly at lower confidence. Max confidence wins. The confident idiot
	out-votes the cautious expert. In my tests this leaked deadly-as-edible about six
	percent of the time, and not one of my per-model benchmarks caught it, because no
	single model was ever wrong. The system was wrong.
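To make that failure concrete, here is a minimal sketch of max-confidence fusion. It is not the real pipeline: the `Verdict` type, the expert names, and the 0.70 figure for the cautious expert are all made up for illustration. Only the 0.9 "ramps" call comes from the story above.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    expert: str
    label: str
    deadly: bool
    confidence: float


def max_confidence_vote(verdicts):
    """The flawed fusion rule: whichever expert is most confident wins."""
    return max(verdicts, key=lambda v: v.confidence)


# Hypothetical outputs for one photo of poison hemlock routed to "plant".
verdicts = [
    # Never trained on hemlock, so it picks its nearest known class.
    Verdict("high_value", "ramps", deadly=False, confidence=0.90),
    # Knows hemlock, but is less sure (0.70 is an illustrative number).
    Verdict("medicinals", "poison hemlock", deadly=True, confidence=0.70),
]

winner = max_confidence_vote(verdicts)
print(winner.expert, winner.label, "deadly" if winner.deadly else "safe")
```

Each expert is behaving as designed on its own label set, which is why a per-model benchmark stays green while the fused answer is the dangerous one.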

	So I killed the vote. Now the router picks exactly one expert and that expert owns
	the call, full stop. The mushroom expert never sees a plant, so it can never call a
	plant anything. On top of that, any "deadly" verdict vetoes a more-confident "safe"
	one, because in this domain a false reassurance is the only error that actually
	matters. Deadly-as-edible dropped from around six percent to half a percent. The
	constraint — keep it small, keep it dumb, keep the experts in their lanes — turned
	out to be the safety feature.
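A rough sketch of the routed version, under some loud assumptions: the function name `decide`, the `deadly_floor` parameter and its 0.2 default, and the idea that the veto is applied across the chosen expert's own class probabilities are all guesses for illustration, not the shipped logic.

```python
def decide(domain_probs, experts, image, deadly_floor=0.2):
    """Route to exactly one expert, then let any plausible deadly class veto.

    domain_probs: dict of domain name -> router probability.
    experts: dict of domain name -> callable returning a list of
             (label, probability, is_deadly) tuples.
    deadly_floor: hypothetical minimum probability for a deadly class to veto.
    """
    # 1. The router picks one expert; no other expert ever sees the image.
    domain = max(domain_probs, key=domain_probs.get)
    predictions = experts[domain](image)

    # 2. Deadly veto: a plausible deadly class beats a more-confident safe one.
    deadly = [p for p in predictions if p[2] and p[1] >= deadly_floor]
    if deadly:
        label, _, _ = max(deadly, key=lambda p: p[1])
        return domain, label, "deadly"

    label, _, _ = max(predictions, key=lambda p: p[1])
    return domain, label, "safe"


def plant_expert(image):
    # Toy output: the safe class is more confident than the deadly one.
    return [("ramps", 0.6, False), ("poison hemlock", 0.3, True),
            ("blackberry", 0.1, False)]


def mushroom_expert(image):
    raise AssertionError("mushroom expert must never see a plant")


experts = {"plant": plant_expert, "mushroom": mushroom_expert}
result = decide({"plant": 0.8, "mushroom": 0.2}, experts, image=None)
print(result)
```

The asymmetry is deliberate: a deadly class only needs to clear a low floor, while a safe class has to win outright with no plausible deadly competitor.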

	## Constraint number two: being safe is really easy if you don't mind being useless

	Now the push and pull. Once you've been spooked by a near-miss, the temptation is to
	crank every safety dial to the max. Refuse by default. When in doubt, abstain. Very
	responsible. Very noble.

	It's also how you build a tool nobody uses. I pushed the confidence gate up and
	watched the thing turn into a coward. It started abstaining on blackberries.
	Blackberries. At the safe-but-useless end of the curve it would refuse on roughly
	half of perfectly edible finds, which is a fantastic way to teach your users that the
	gadget is a paperweight and their own guessing is faster. And here's the part that
	took me an embarrassing while to accept: you cannot gate your way to zero. I mapped
	the whole curve. Loosen the gate and you get a decisive, useful tool that
	occasionally calls something dangerous safe. Tighten it and you get a safe tool that
	cries "I don't know" at a raspberry. There is no magic threshold that gives you both,
	because the residual risk isn't low confidence — it's the model being confidently,
	specifically wrong, and no confidence knob catches that.
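The shape of that curve can be reproduced with a toy sweep. Everything below is invented data for illustration, not the measurements from the project: each sample is a confidence for an "edible" prediction plus the true label, and the gate abstains below the threshold.

```python
def sweep(samples, thresholds):
    """For each threshold, return (threshold, refused_edible, leaked_deadly).

    refused_edible: fraction of truly edible finds the gate abstains on.
    leaked_deadly: fraction of truly deadly finds accepted as edible.
    """
    edible = [c for c, truth in samples if truth == "edible"]
    deadly = [c for c, truth in samples if truth == "deadly"]
    rows = []
    for t in thresholds:
        refused_edible = sum(c < t for c in edible) / len(edible)
        leaked_deadly = sum(c >= t for c in deadly) / len(deadly)
        rows.append((t, refused_edible, leaked_deadly))
    return rows


# Toy data: (confidence that the find is edible, true label).
samples = [
    (0.55, "edible"), (0.65, "edible"), (0.75, "edible"),
    (0.85, "edible"), (0.95, "edible"), (0.97, "edible"),
    (0.60, "deadly"),  # low confidence: a gate can catch this one
    (0.96, "deadly"),  # confidently wrong: no sane gate catches this one
]

rows = sweep(samples, [0.5, 0.7, 0.9])
for threshold, refused, leaked in rows:
    print(f"gate={threshold:.1f}  refused edible={refused:.2f}  leaked deadly={leaked:.2f}")
```

In this toy set, raising the gate from 0.5 to 0.9 refuses more and more good finds, yet the 0.96 deadly sample still gets through. Catching it would take a threshold so high that nearly everything edible is refused too.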

	Tightening the screw was a dead end. So instead of asking the model to be more sure,
	I started asking it for more evidence.

	The move we landed on — and I'll be straight, this lives in a test harness right now,
	not in the shipped app yet — is to stop treating one photo as the whole story. Show
	the model the same subject from a couple of angles, or even just a few augmented crops
	of the one shot, and fuse the results before it commits. In our harness this was
	close to a free lunch: the multi-angle version drove deadly-as-edible toward zero on
	the domains we tested while lifting accuracy on the safe stuff, not crushing it.
	Even better is making the second photo a targeted ask — only when the top guess is a
	safe-looking thing that has a deadly twin. The app turns to you and says "okay,
	photograph the stem before I sign off on this." The friction is the feature. The
	moment of "hang on, show me more" is exactly the moment a careful forager would slow
	down anyway. We're not nagging with a banner nobody reads; we're building the caution
	into the interaction. That part's still being wired in, and I'll write it up properly
	when it is.
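Until that writeup exists, here is one way the idea could look in code. This is a guess at the shape, not the harness: averaging class probabilities is just one simple fusion rule, and `DEADLY_TWINS`, `fuse_views`, `next_action`, and `required_views` are hypothetical names. The ramps and hemlock pairing is borrowed from the story above.

```python
# Hypothetical lookup of safe-looking labels that have a deadly lookalike.
DEADLY_TWINS = {"ramps": "poison hemlock"}


def fuse_views(view_probs):
    """Average per-class probabilities across views of the same subject."""
    n = len(view_probs)
    return {label: sum(v[label] for v in view_probs) / n
            for label in view_probs[0]}


def next_action(view_probs, required_views=2):
    """Answer right away, or ask for another photo when the top guess has a deadly twin."""
    fused = fuse_views(view_probs)
    top = max(fused, key=fused.get)
    if top in DEADLY_TWINS and len(view_probs) < required_views:
        return "ask_for_photo", top
    return "answer", top


first = {"ramps": 0.7, "poison hemlock": 0.3, "blackberry": 0.0}
stem = {"ramps": 0.2, "poison hemlock": 0.8, "blackberry": 0.0}

print(next_action([first]))        # top guess has a deadly twin: ask for more
print(next_action([first, stem]))  # fused evidence decides
print(next_action([{"ramps": 0.05, "poison hemlock": 0.05, "blackberry": 0.9}]))
```

The extra photo is only requested when the top guess sits in the twin table, so a blackberry never triggers the friction.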

	## The one-line bug that told me my safety net was broken

	A short detour into humility, because this is a field notes post and field notes
	should include the faceplants.

	For ages I believed my out-of-distribution detector didn't work. This is the piece
	that's supposed to notice when you point the camera at, I don't know, a car, or your
	own shoe, and say "that's not food, I'm not playing." I was using an energy score for
	it, and every time I measured the thing it scored about 0.25 AUROC, which for the
	non-nerds means it was worse than a coin flip (a coin flip scores 0.5). Inverted. Useless. I shelved it and
	moved on, mildly betrayed.

	Then I actually read my own code. The energy formula is a log-sum-exp, which subtracts the max logit for
	numerical stability and is supposed to add it right back. Mine subtracted it and
	forgot to add it back. One term. That single missing piece didn't just make the
	number wrong, it flipped its meaning, so my detector was confidently pointing the
	wrong direction the entire time. Fixed the one line, re-ran it, and the same detector
	jumped to around 0.90 AUROC on my hardest domain. It had been working all along. I
	had been reading the dial upside down.
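For anyone who wants to see the flip happen, here is a small reconstruction of that class of bug. It is not the original code. It assumes the common convention where the score is the log-sum-exp of the logits (temperature 1) and higher means more in-distribution. The function names and the example logits are invented.

```python
import math


def logsumexp(logits):
    """Numerically stable log-sum-exp: subtract the max, then add it back."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def logsumexp_buggy(logits):
    """Same thing, minus the one term: the max is never added back."""
    m = max(logits)
    return math.log(sum(math.exp(x - m) for x in logits))


def energy_score(logits, lse=logsumexp):
    """Negative free energy; higher is assumed to mean more in-distribution."""
    return lse(logits)


in_dist = [9.0, 1.0, 0.5]  # one class clearly dominates
ood = [1.0, 0.9, 1.1]      # flat, low logits: "that's your shoe"

good_in, good_ood = energy_score(in_dist), energy_score(ood)
bad_in = energy_score(in_dist, logsumexp_buggy)
bad_ood = energy_score(ood, logsumexp_buggy)

print(f"correct: in={good_in:.3f}  ood={good_ood:.3f}")
print(f"buggy:   in={bad_in:.3f}  ood={bad_ood:.3f}")
```

Without the max added back, a peaked logit vector scores near zero and a flat one scores near log(K), so the ranking between familiar and unfamiliar inputs reverses. That is how a working detector can land below 0.5 AUROC.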

	The lesson I keep taped to the inside of my skull now: when a detector says it's
	failing, check that you're not holding it backwards before you believe it. SHOCKING,
	I know.

	## The actual thesis

	Here's why I think the constraints were a gift and not a punishment. When you've got
	a giant model and infinite compute, you can paper over hard calls. You can be a
	little wrong everywhere and call it nuance. Out here, on a stamp-sized chip in a
	forest, there's nowhere to hide. Every tradeoff has to be made out loud: do I want
	decisive or do I want safe, and where exactly do I put the line, and who gets hurt at
	each setting. The edge didn't limit the engineering. It made the engineering honest.

	Which, now that I say it out loud, is the whole reason I'm doing any of this. When
	everything else goes digital and frictionless and confident, the stuff that keeps you
	alive is still physical, still slow, still asks you to look twice. I built a machine
	that's good enough to be worth beating. The goal was never to replace the forager. It
	was to keep them sharp.

	Or so I tell myself, two miles from the trailhead, photographing a stem.

	What would you have done at the safety-versus-useful fork — held the line on refusing,
	or chased the second photo? And what's the dumbest one-line bug that ever cost you a
	week?