Codebase Analysis & Ollama Qwen3.5 Fallback Integration
Codebase Analysis Summary
I've reviewed all 18 source files across the entire project. Here's a comprehensive analysis:
Issues Found
🔴 Critical Issues
1. inference.py: API_BASE_URL crashes if env var is missing (Line 23)
API_BASE_URL = os.environ["API_BASE_URL"] # KeyError if missing!
API_KEY = os.environ["API_KEY"] # Same problem
These use os.environ[...] which raises KeyError at import-time if the env vars aren't set. The API_KEY on line 25 has the same issue. During local development/testing, this crashes immediately.
Fix: Change to os.getenv() with sensible defaults, and add the Ollama fallback here.
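A minimal sketch of the env-var fix, assuming a helper named `load_settings` (hypothetical, not part of the codebase). The primary API defaults are placeholders chosen for illustration; the Ollama URL and model name are the ones this plan proposes. `os.environ.get` behaves like `os.getenv` and returns the default instead of raising `KeyError`.

```python
import os
from typing import Dict, Mapping, Optional

# Placeholder defaults (assumptions for illustration; adjust to the real deployment).
DEFAULTS: Dict[str, str] = {
    "API_BASE_URL": "http://localhost:8000/v1",    # hypothetical primary endpoint
    "API_KEY": "",                                 # empty string means "no key configured"
    "OLLAMA_BASE_URL": "http://localhost:11434/v1",
    "OLLAMA_MODEL": "qwen2.5:3b",
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read settings without raising KeyError when a variable is unset."""
    source = os.environ if env is None else env
    return {name: source.get(name, default) for name, default in DEFAULTS.items()}


# Importing the module no longer fails when the variables are missing.
settings = load_settings()
API_BASE_URL = settings["API_BASE_URL"]
API_KEY = settings["API_KEY"]
```

Passing an explicit mapping to `load_settings` keeps the lookup testable without mutating the process environment.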
2. server/app.py: NetworkStatsResponse model mismatch (Lines 205-207)
The endpoint GET /stats returns NetworkStatsResponse(**env.get_network_stats()) but get_network_stats() now returns extra fields (false_flag_accuracy, stealth_detection_rate, burst_ticks, false_flags_seen, stealth_attacks_seen, config_params) that are not defined in the NetworkStatsResponse Pydantic model. This will cause a validation error or silently drop fields depending on Pydantic config.
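One way to catch this kind of drift early is to compare the keys a function returns against the field names a response model declares. The sketch below uses only the standard library; `diff_fields`, `declared_fields`, and the sample `stats` dictionary are hypothetical stand-ins, not the project's real model or output.

```python
from typing import Any, Iterable, Mapping, Set, Tuple


def diff_fields(
    declared: Iterable[str], payload: Mapping[str, Any]
) -> Tuple[Set[str], Set[str]]:
    """Return (extra, missing): keys only in the payload, and fields only in the model."""
    declared_set = set(declared)
    returned_set = set(payload)
    return returned_set - declared_set, declared_set - returned_set


# Hypothetical stand-ins for the model's declared fields and get_network_stats().
declared_fields = ["total_sessions", "blocked"]
stats = {"total_sessions": 10, "blocked": 3, "false_flag_accuracy": 0.9, "burst_ticks": 2}

extra, missing = diff_fields(declared_fields, stats)
print(sorted(extra))    # ['burst_ticks', 'false_flag_accuracy']
print(sorted(missing))  # []
```

With Pydantic v2 the declared names could presumably be taken from `NetworkStatsResponse.model_fields` (a dict keyed by field name), so a unit test could assert that `extra` is empty.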
3. server/firewall_environment.py: INSPECT action bug (Lines 410-414)
if inspected and session_id not in self.inspected_sessions:
    metadata["revealed"] = True
    self.inspected_sessions[session_id] = session
    self.pending_sessions[session_id] = session  # ← BUG
After popping from pending_sessions on line 400, the session is re-added to both inspected_sessions AND pending_sessions. This creates a duplicate reference: the session exists in both pools. When the session is later acted upon again (block after inspect), the code pops from inspected_sessions (line 397) but then also tries pending_sessions.pop() on line 441. This is not necessarily a crash, but it means:
- The session count in state() double-counts inspected sessions
- _rebuild_queue already deduplicates, so functional behavior is OK
- But the pending_session_count metric is inflated
🟡 Moderate Issues
4. data_loader.py: Session TTL hardcoded vs config-driven (Lines 480-481)
ttl = 2 if malicious else 3 # Hardcoded, ignoring config!
The _build_session method hardcodes TTL values, but _spawn_sessions() in firewall_environment.py overwrites these at lines 537-540. This is not a bug (the overwrite works), but it's dead code that could confuse developers.
5. server/app.py: StepResponse model mismatch (Line 186)
env.step() returns info.score and info.passed fields, but StepResponse.info is typed as Dict[str, Any], so this works. However, StepResponse.state uses StateResponse, which doesn't declare focus_observation as a List[float]; this could cause validation issues in edge cases.
6. Unused import in models.py (Line 3)
List from typing is imported but Callable is not; Callable is instead imported directly in graders.py (line 13). Minor, no functional impact.
🟢 Things That Are Correct
| Component | Status | Notes |
|---|---|---|
| Reward Engine | ✅ Correct | Multi-objective weights sum to 1.0, anti-degenerate policy design is sound |
| Threat Engine | ✅ Correct | Kill chain model, stealth blending, escalation modifiers all work correctly |
| Traffic Generator | ✅ Correct | 22-dim features, 5 benign profiles, 5×4 malicious profiles, normalization bounds |
| Grading System | ✅ Correct | Deterministic seeding, weighted scoring, pass constraints |
| Heuristic Agent | ✅ Correct | 8-rule policy matches between inference.py and heuristic_agent.py |
| Task Configs | ✅ Correct | Monotonic difficulty progression across all 8 new parameters |
| Docker Setup | ✅ Correct | Proper non-root user, healthcheck, port exposure |
| OpenEnv YAML | ✅ Correct | Matches implementation: tasks, tools, action/observation spaces |
| Test Suite | ✅ Correct | 14 tests covering generators, rewards, threats, environment, graders |
| Client | ✅ Correct | Clean REST client matching all server endpoints |
Proposed Changes
1. Ollama Qwen 3.5 Fallback in inference.py
The user wants the fallback to use Ollama with Qwen 3.5 (likely qwen3:0.6b or qwen2.5:3b via Ollama cloud). I'll update inference.py to:
- Add Ollama as the fallback LLM provider when the primary API fails
- Use qwen2.5:3b (closest to "qwin3.5"; Qwen 2.5 3B is the widely-available Ollama model)
- Keep the heuristic as the final safety-net fallback
Clarification needed: I'm interpreting "qwin3.5 cloud" as Qwen 2.5 3B via Ollama's local server (http://localhost:11434/v1). Ollama uses an OpenAI-compatible API, so we can reuse the same OpenAI client. If you mean a different model (e.g., qwen3:0.6b, qwen2.5:7b, or a cloud-hosted Qwen endpoint), please let me know and I'll adjust.
[MODIFY] inference.py
- Add OLLAMA_BASE_URL and OLLAMA_MODEL environment variables with defaults
- Create a secondary OpenAI client pointing to Ollama
- In get_action(), on primary API failure → try Ollama → then heuristic
- Fix API_BASE_URL and API_KEY to use os.getenv() with defaults
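The fallback order described above (primary API, then Ollama, then heuristic) can be sketched as a small provider chain. This is an illustration under stated assumptions, not the actual get_action() implementation: `get_action_with_fallback`, the stand-in provider functions, and the action strings "BLOCK" and "ALLOW" are all hypothetical. In inference.py the providers would wrap the primary client and a second OpenAI-compatible client pointed at OLLAMA_BASE_URL.

```python
import logging
from typing import Callable, Sequence, Tuple

logger = logging.getLogger(__name__)

# A provider takes a prompt and returns the model's raw reply, or raises on failure.
Provider = Callable[[str], str]


def get_action_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Provider]],
    heuristic: Callable[[], str],
) -> Tuple[str, str]:
    """Try each provider in order; fall back to the heuristic if all of them fail."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any provider failure triggers fallback
            logger.warning("%s provider failed: %s", name, exc)
    return "heuristic", heuristic()


# Stand-in providers for illustration only.
def primary(prompt: str) -> str:
    raise ConnectionError("primary API unreachable")


def ollama(prompt: str) -> str:
    return "BLOCK"


def heuristic() -> str:
    return "ALLOW"


source, action = get_action_with_fallback(
    "observation", [("primary", primary), ("ollama", ollama)], heuristic
)
print(source, action)  # ollama BLOCK
```

Returning the provider name alongside the action makes it easy to log which tier actually answered, which helps when verifying the fallback chain manually.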
2. Fix NetworkStatsResponse model mismatch
[MODIFY] models.py
- Add the missing fields to NetworkStatsResponse: false_flag_accuracy, stealth_detection_rate, burst_ticks, false_flags_seen, stealth_attacks_seen, config_params
3. Fix INSPECT duplicate session bug
[MODIFY] firewall_environment.py
- Remove the line that re-adds the session to pending_sessions after INSPECT (line 414). The session should only be in inspected_sessions during the inspection phase.
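The intended behavior after this fix can be sketched with a toy model of the two pools. `SessionPools` and its methods are hypothetical and much simpler than the real environment class; whether other code relies on inspected sessions remaining in pending_sessions is not verified here.

```python
from typing import Any, Dict, Optional


class SessionPools:
    """Toy model of the pending and inspected pools; not the real environment."""

    def __init__(self) -> None:
        self.pending_sessions: Dict[str, Any] = {}
        self.inspected_sessions: Dict[str, Any] = {}

    def take(self, session_id: str) -> Optional[Any]:
        """Remove and return the session from whichever pool holds it."""
        session = self.inspected_sessions.pop(session_id, None)
        if session is None:
            session = self.pending_sessions.pop(session_id, None)
        return session

    def inspect(self, session_id: str) -> bool:
        """Move a pending session to the inspected pool; True if newly revealed."""
        if session_id in self.inspected_sessions:
            return False  # already inspected, nothing new to reveal
        session = self.pending_sessions.pop(session_id, None)
        if session is None:
            return False  # unknown session id
        self.inspected_sessions[session_id] = session
        # Deliberately NOT re-added to pending_sessions, so it is counted once.
        return True

    def total_sessions(self) -> int:
        """Count every live session exactly once across both pools."""
        return len(self.pending_sessions) + len(self.inspected_sessions)
```

A regression test along these lines would inspect one session and assert that it appears only in inspected_sessions and that the total count stays at one.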
Open Questions
- Qwen model version: I'm defaulting to qwen2.5:3b via Ollama at http://localhost:11434/v1. Should I use a different model name or a remote Ollama endpoint URL?
- API_KEY fallback: Should the Ollama fallback use "ollama" as the API key (Ollama doesn't require one), or do you have a specific cloud-hosted Ollama endpoint that needs authentication?
Verification Plan
Automated Tests
- Run pytest tests/ to verify no regressions
- Run python scripts/check_accuracy.py to validate all parameter checks pass
Manual Verification
- Test inference with Ollama running locally to verify the fallback chain works