Field Notes and Lessons Learned: Finetuning a 1.5B Travel Risk Analyst and low-cost deployment.

Published June 14, 2026

Working in security operations has always been an interesting challenge. Every time a team member travels, we assess their exposure to active conflict, deteriorating infrastructure, or civil unrest. Someone has to synthesize that picture fast: ACLED data, State Department advisories, live airspace restrictions, regional news. Normally, that means more than a few tabs and hours of work.

I wanted to automate it. Not with a 70B model I can't run, and not by paying per-token indefinitely for something I could own. This hackathon was the push I needed to actually ship it small.

Here's what I built, what broke, who I helped, and what I learned.

The product

Model:

Qwen2.5-1.5B fine-tuned on a custom travel risk dataset (Firemedic15/Travel_Risk_Data on HuggingFace), published as Firemedic15/qwen25-1.5B-travel-risk-analysis-merged.

Backend:

Hosted on Modal.com with an A10G GPU. Exposes a single POST endpoint that accepts a prompt and returns structured text. No always-on GPU cost. Modal spins up the container on demand and tears it down after two minutes of inactivity.

Frontend:

A HuggingFace Gradio Space (Firemedic15/OSINTTool) powered by smolagents. The agent orchestrates five tools: ACLED event fetches, RSS headline pulls from 30+ sources, State Department travel advisories, airspace status, and a source availability checker. It then synthesizes everything into a structured threat brief.

The Architecture:

The architecture is pretty clean in concept: Space calls Modal, Modal runs inference, Space renders the brief. No weights on the Space. No GPU cost on HuggingFace. The heavy lifting is isolated behind one endpoint.
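The Space-to-Modal hop can be sketched with nothing but the standard library. This is a minimal illustration, not the code running in the Space: the URL is a placeholder, and the "prompt" payload field and the 180-second timeout are assumptions.

```python
import json
import urllib.request

# Hypothetical URL: Modal assigns the real one at deploy time.
ENDPOINT_URL = "https://example--travel-risk.modal.run"


def build_request(prompt: str, url: str = ENDPOINT_URL) -> urllib.request.Request:
    """Build the POST request the Space would send to the inference endpoint."""
    # Assumed payload shape: a single "prompt" field.
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_brief(prompt: str, timeout: float = 180.0) -> str:
    """Send the prompt and return the raw response text.

    The generous timeout leaves room for a cold start while the container boots.
    """
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Keeping the request builder separate from the network call makes the payload easy to inspect without touching the GPU.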

The Training Side

I built the dataset to reflect what an OSINT analyst actually needs. Country-level risk assessments, conflict event summaries, advisory-level reasoning, embassy contacts, airspace flags. Not generic instruction-following. Domain-specific reasoning about security contexts.

The base model is Qwen2.5-1.5B-Instruct. I chose it because Qwen2.5 punches above its weight on structured output, handles multi-turn instruction formats cleanly, and 1.5B fits inside the memory constraints I was working with. The training script used HuggingFace Jobs with SFT, and it took more iterations than I expected to get the dependency stack right. Transformers, trl, peft, and accelerate have opinions about each other that they do not always express clearly up front.
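One habit that would have shortened the dependency hunt is printing the installed versions at the top of every training job, so each log records the stack it ran on. A small sketch using only the standard library; the package list simply mirrors the four libraries named above.

```python
from importlib.metadata import PackageNotFoundError, version

TRAINING_STACK = ("transformers", "trl", "peft", "accelerate")


def report_versions(packages=TRAINING_STACK):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in report_versions().items():
        print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

The `name==version` lines can be pasted straight into a requirements file once a combination works.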

The merged model is public. Load it, test it, break it, tune it. Have fun!

The Deployment Side

This is where I spent most of my time, and where I learned the most. Modal changed its API recently, and it was quite the challenge. Within two modal deploy runs, I hit three separate deprecation errors, including these two renames:

container_idle_timeout was renamed to scaledown_window (February 24, 2025)
@modal.web_endpoint was renamed to @modal.fastapi_endpoint (March 5, 2025)

Neither was obvious until the traceback appeared. The pattern every time: the bottom of the traceback says DeprecationError, the middle tells you what it was renamed to. Fix one line, redeploy, find the next one.

Sometimes the most obvious thing doesn't become obvious until the right error appears.
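In hindsight, a quick scan of the deployment script before running modal deploy would have caught both renames in one pass. Here is a rough sketch of that check. The rename table contains only the two names listed above, and the helper itself is hypothetical, not part of Modal.

```python
import re

# Old name -> new name, taken from the two renames listed above.
RENAMES = {
    "container_idle_timeout": "scaledown_window",
    "web_endpoint": "fastapi_endpoint",
}


def find_deprecated(source: str):
    """Return (line_number, old_name, new_name) for each deprecated name found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for old, new in RENAMES.items():
            # \b keeps "web_endpoint" from matching inside a longer identifier.
            if re.search(rf"\b{re.escape(old)}\b", line):
                hits.append((lineno, old, new))
    return hits
```

Passing the text of a deployment script to `find_deprecated` returns every offending line at once, instead of one traceback per deploy.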

What Actually Works

The pipeline runs end to end. You select a country, pick your news sources, set a lookback window for conflict events, and click run. The agent fans out across tools, pulls live data, and returns a structured brief covering:

Severity assessment with confidence rating
Recent conflict events from ACLED
News headlines with source attribution
State Department advisory level and notes
Airspace restrictions
Embassy contacts for your passport country

The 1.5B model handles the synthesis. It is not perfect: it occasionally loses the JSON schema mid-output, and its tool-call reasoning is shallower than a 72B model's. But it is running on hardware I control, it costs me GPU time only when it's being used, and I understand every layer of it.

That last part matters more than the benchmark.
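For the schema drift mentioned above, one cheap guard is to pull the first complete JSON object out of the model's text and reject anything missing required keys, so a truncated brief triggers a retry instead of reaching the UI. A minimal sketch; the key names are hypothetical stand-ins for whatever the brief schema actually contains.

```python
import json

# Hypothetical field names; substitute the keys your brief schema actually uses.
REQUIRED_KEYS = ("severity", "confidence", "summary")


def extract_brief(text: str, required=REQUIRED_KEYS):
    """Return the first JSON object in `text` that has every required key, else None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and all(key in obj for key in required):
            return obj
        start = text.find("{", start + 1)
    return None
```

A `None` result is the signal to re-prompt or fall back, which is easier to reason about than a half-parsed brief.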

What I Learned

Small models require tight prompts. The bigger models tolerate ambiguity in the system prompt. The 1.5B does not. If your prompt schema is loose, the output is loose. Every token of the instruction matters.
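As an illustration of what "tight" can look like, the sketch below generates the system prompt from a schema so that every key, its allowed values, and the output shape are stated explicitly. The schema and wording are assumptions for the example, not the prompt used in the tool.

```python
import json

# Hypothetical schema: field name -> short description of the allowed value.
BRIEF_SCHEMA = {
    "severity": "one of: low, moderate, high, critical",
    "confidence": "number between 0 and 1",
    "summary": "two sentences, plain text",
}


def build_system_prompt(schema=BRIEF_SCHEMA) -> str:
    """Spell out the output contract so nothing is left for the model to infer."""
    fields = "\n".join(f'- "{name}": {rule}' for name, rule in schema.items())
    skeleton = json.dumps({name: "..." for name in schema})
    return (
        "You are a travel risk analyst.\n"
        "Reply with exactly one JSON object and no other text.\n"
        f"Required keys, in this order:\n{fields}\n"
        f"Shape: {skeleton}"
    )
```

Generating the prompt from the same schema used for validation keeps the two from drifting apart.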

The deployment layer is where projects die. I've seen more interesting ML projects fail in deployment than in training. The model works on your laptop; then it doesn't work on Modal because of a six-week-old API rename. You have to treat deployment like a first-class engineering problem, not an afterthought.

Build the logging first. The agent trace tab in the Gradio UI — the raw output from every tool call and model step — saved me hours of debugging. Every time something looked wrong in the brief, I could see exactly where the agent went sideways. Observability before optimization.
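The same idea works outside any particular agent framework. Below is a minimal sketch of a tracing decorator that records every tool call in a list a UI tab could render. The `fetch_advisory` tool is a hypothetical stand-in, and this is not how smolagents implements its own logging.

```python
import functools
import time

TRACE = []  # in-memory log; the UI layer can render this list as-is


def traced(tool):
    """Wrap a tool so every call records its arguments, outcome, and duration."""
    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        entry = {"tool": tool.__name__, "args": args, "kwargs": kwargs}
        started = time.perf_counter()
        try:
            entry["result"] = tool(*args, **kwargs)
            return entry["result"]
        except Exception as exc:
            # Record the failure, then let it propagate unchanged.
            entry["error"] = repr(exc)
            raise
        finally:
            entry["seconds"] = round(time.perf_counter() - started, 3)
            TRACE.append(entry)
    return wrapper


@traced
def fetch_advisory(country: str) -> str:
    # Hypothetical stand-in for a real tool; returns canned text.
    return f"Advisory placeholder for {country}"
```

Because failures are logged before they propagate, the trace shows the last thing the agent tried even when a run crashes.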

The builder advantage is real. Buying API access to a hosted model is faster on day one. But I now have a fine-tuned model in a domain that matters to my work, a deployment pattern I understand, and a stack I can extend. The next capability I add doesn't require a vendor conversation.

Who did I help?

My goal here is to help lean teams make decisions faster and more efficiently. With this tool, I can parallelize the risk analysis and dissemination process. Although it's extremely early in the trial run, it certainly looks promising.

What's Next

The tool call reliability needs work. Right now, the 1.5B model sometimes hallucinates tool arguments or misses a required field in the JSON schema. The fix is a combination of better prompt engineering and potentially a second fine-tuning pass with tool-call examples specifically.
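Until a second fine-tuning pass happens, a guard between the model and the tools can at least stop malformed calls from executing. A rough sketch, assuming tool calls arrive as a name plus a dict of arguments; the tool names and argument specs here are hypothetical.

```python
# Hypothetical tool specs: argument name -> expected Python type.
TOOL_SPECS = {
    "fetch_acled_events": {"country": str, "lookback_days": int},
    "fetch_advisory": {"country": str},
}


def validate_tool_call(name: str, arguments: dict) -> list:
    """Return a list of problems; an empty list means the call is safe to run."""
    spec = TOOL_SPECS.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]
    problems = []
    for arg, expected in spec.items():
        if arg not in arguments:
            problems.append(f"missing argument: {arg}")
        elif not isinstance(arguments[arg], expected):
            problems.append(f"{arg} should be {expected.__name__}")
    for arg in arguments:
        if arg not in spec:
            problems.append(f"unexpected argument: {arg}")
    return problems
```

The returned problem list can be fed back to the model as a correction message, and the rejected calls double as candidate training examples.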

I also want to add streaming. Right now, the Gradio Space spins for 60-90 seconds on a cold start while Modal boots the container. Streaming responses and a loading state that shows tool-call progress in real time would make this significantly more usable.
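Gradio accepts generator functions as event handlers and pushes each yielded value to the output component, so the progress display could start as a plain generator. A minimal sketch of that shape, with the tool list passed in as hypothetical (label, callable) pairs:

```python
def run_with_progress(country, tools):
    """Yield a status line before each tool runs, then the combined results.

    `tools` is a list of (label, callable) pairs; each callable takes the country.
    """
    results = []
    for step, (label, tool) in enumerate(tools, start=1):
        yield f"[{step}/{len(tools)}] running {label}..."
        results.append(f"{label}: {tool(country)}")
    yield "\n".join(results)
```

This only covers tool-call progress. Streaming the model's own tokens would also need the Modal endpoint to return a streaming response.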

And I want to write the governance paper. There's a real research question hiding in this project about what it means to deploy an agentic analyst in a regulated enterprise context: how you document the model lineage, validate the outputs, and handle hallucinations in a security operations setting. That's a different kind of artifact than a Gradio demo, but it matters just as much.
