diff --git "a/Roadmap.html" "b/Roadmap.html"
deleted file mode 100644
--- "a/Roadmap.html"
+++ /dev/null
@@ -1,2140 +0,0 @@
-AgentOrg CodeReview — OpenEnv Roadmap
- - - - - -
-
OpenEnv Hackathon Submission · v1.0
-

- The Complete
- Build Roadmap
- AgentOrg · CodeReview -

-

- A production-grade OpenEnv environment where AI agents act as senior code reviewers inside a software organization. Built from your actual architecture patterns — MoE routing, DAG orchestration, event-driven agents, Kafka messaging. Every design decision traced back to your repos. -

-
-
- Domain - Code Review -
-
- Tasks - 3 (Easy → Hard) -
-
- Reward - Partial Credit 0.0–1.0 -
-
- Deploy - HF Spaces · Docker -
-
- Build Time - 4 Days -
-
-
- -
- - -
-
- 01 -

Repository Deep Analysis // what we learned from your code

-
- -
- -
-
- Autonomous-Multi-Agent-AI-Organization - primary -
-
- Production-grade event-driven multi-agent system for the Amazon Nova hackathon. 7 specialized agents (CEO, CTO, FE, BE, QA, DevOps, Finance) running asynchronously over Apache Kafka, coordinated via a Go gRPC DAG engine. -
-
- Go 1.22 · Fiber - gRPC - Apache Kafka - Python 3.11 - Amazon Bedrock - Rust MoE - Next.js 14 - PostgreSQL 15 - Redis - Helm + Terraform - AES-256-GCM - RS256 JWT -
-
    -
  • MoE (Mixture of Experts) Rust scorer does sub-millisecond model routing — directly inspires the grader's confidence-weighted scoring
  • -
  • Finance Agent enforces per-agent token budgets — inspires the noise_budget mechanic that kills episodes with too many false positives
  • -
  • DAG engine plans tasks with dependencies — same structure for episode action sequencing
  • -
  • Per-agent LLM provider switching (Bedrock / OpenAI / Anthropic) — inspires observation_format parameter to test agents under different input modes
  • -
  • WS-Hub for live dashboard events — directly port as /ws/events WebSocket in FastAPI
  • -
  • run_demo.py — direct template for baseline.py
  • -
  • docker-compose.local.yml pattern — clean local mode without auth overhead, mirrors Dockerfile approach
  • -
-
- -
-
- moltbot / OpenClaw - gateway patterns -
-
- Multi-channel personal AI assistant bridging WhatsApp, Telegram, Slack, Discord, Signal, and iMessage through a unified gateway/daemon architecture with a plugin/skill system and MCP server integration. -
-
- Cloudflare Workers - MCP protocol - Plugin/Skill system - Channel abstraction - Device pairing - Gateway/daemon -
-
    -
  • Channel abstraction layer — inspires the observation_format parameter (diff_only / full_pr / structured) as a channel-level concern
  • -
  • Plugin/skill system — the 3 task modules follow this exact pattern: each task is a pluggable module with a defined interface
  • -
  • Gateway/daemon split — mirrors app.py (thin gateway) + env.py (daemon with state) architecture
  • -
  • MCP server integration — the /ws/events endpoint is the same concept: a live event stream that tools can subscribe to
  • -
  • Unified input routing despite heterogeneous sources — the Action discriminator union type reflects this
  • -
-
- -
-
- cli repo - tooling patterns -
-
- Command-line orchestration layer that wraps agent workflows, enabling scripted interaction patterns for developer tooling, CI/CD hooks, and automated test runners. -
-
- CLI design - Agent workflows - Scripted interaction - Test automation -
-
    -
  • CLI-first interaction pattern — baseline.py follows CLI conventions: --url, --seeds, --task flags with clean tabular output
  • -
  • Workflow scripting approach — validate.py is built as a CI-style script with exit codes, same pattern
  • -
  • Automated test runner design — test_api.py follows the same philosophy: call endpoints, assert contracts
  • -
  • Structured output for machine consumption — StepResult and EpisodeResult designed to be parseable by shell scripts
  • -
-
- -
-
- hive - agent architecture -
-
- Outcome-driven agent development framework with goal-defined graphs, per-agent/team budget controls, LiteLLM-compatible model routing, and MCP server tool connectivity. Focuses on agent evaluation and repeatability. -
-
- Goal-defined graphs - Budget controls - LiteLLM routing - MCP tools - Agent evaluation - Reproducibility -
-
    -
  • Goal-defined graphs — directly maps to the episode structure: each episode is a goal graph where finding all issues = success
  • -
  • Budget controls per agent — confirms the noise_budget mechanic is the right abstraction, not just clever but architecturally coherent
  • -
  • Repeatability focus — seeded scenario generation (scenario_bank.py) is the direct implementation of this principle
  • -
  • LiteLLM-compatible model routing — baseline.py should be provider-agnostic, same as hive's model routing layer
  • -
  • Outcome-driven evaluation — the grader returns episode_result with issues_found, issues_missed, false_positives: outcome-oriented, not step-oriented
  • -
-
- -
-
- -
- - -
-
- 02 -

Environment Concept // the what and why

-
- -
-

What is AgentOrg CodeReview Env?

-

- A simulation of the QA/Reviewer agent role inside your Autonomous-Multi-Agent-AI-Organization. The agent under evaluation receives a synthetic pull request — exactly like what the Backend Engineer agent would produce — and must act as a senior code reviewer: identifying bugs, security vulnerabilities, and architectural problems, then deciding whether to approve or request changes. -

-
- -
-

Why this domain wins on real-world utility (30% of score)

-

- Every software company does code review. It's a bottleneck — senior engineers spend 3–5 hours per week reviewing code. An agent that can reliably catch 80% of security vulnerabilities and architectural problems before human review would be immediately deployable. This is the exact use case companies pay for. The graders are deterministic (no LLM calls) and reproducible. The difficulty progression from "spot the null dereference" to "evaluate service-level architectural tradeoffs" is genuine — it mirrors how human engineers progress from junior to senior. -

-
- -
-

Why it hasn't been done in OpenEnv yet

-

- Most agent environments test information retrieval, math reasoning, or game-playing. Code review requires multi-step reasoning over structured text with domain-specific knowledge, precise output format requirements, severity calibration, and a terminal verdict — a unique combination that no existing OpenEnv environment covers. -

-
-
- -
- - -
-
- 03 -

Project Structure // every file, every role

-
- -
-
agentorg-codereview-env/
-
-
├── openenv.yaml            spec: name, version, tasks, obs/action schema, endpoints
-├── Dockerfile              python:3.11-slim · port 7860 · uvicorn · HEALTHCHECK
-├── requirements.txt        fastapi · uvicorn[standard] · pydantic>=2.0 · httpx
-├── README.md               HF Space header + full docs: obs/action spaces + baseline scores
-
-├── app.py                  FastAPI: /reset /step /state /health + WS /ws/events
-
-├── codereview_env/
-│   ├── __init__.py
-│   ├── models.py           ALL Pydantic types · zero logic, pure schemas
-│   ├── env.py              episode state machine · reset/step/state methods
-│   ├── scenario_bank.py    30+ synthetic PR scenarios, seeded + deterministic
-│   │
-│   ├── tasks/
-│   │   ├── __init__.py
-│   │   ├── task_easy.py    bug_detection: single-file Python diffs, 1–3 bugs
-│   │   ├── task_medium.py  security_audit: multi-file, severity-weighted
-│   │   └── task_hard.py    architectural_review: cross-service, verdict required
-│   │
-│   └── graders/
-│       ├── __init__.py
-│       ├── bug_grader.py       precision + recall + false-positive penalty
-│       ├── security_grader.py  severity multipliers: critical=0.4, high=0.25...
-│       ├── arch_grader.py      issue score + verdict bonus + explanation quality
-│       └── grader_utils.py     fuzzy_match() · keyword_overlap() · confidence scorer
-
-├── scripts/
-│   ├── baseline.py         naive keyword agent · seeds 0–9 · tabular output
-│   └── validate.py         openenv validate-compatible smoke test · CI exit codes
-
-└── tests/
-    ├── test_env.py         reset() clean state · done=True fires · step count limits
-    ├── test_graders.py     10 grader unit tests · score always in [0.0, 1.0]
-    └── test_api.py         full HTTP contract tests via httpx · schema validation
-
-
- -
- - -
-
- 04 -

Implementation Layers // click to expand each layer

-
- -
- - -
-
-
1
-
-
models.py — all typed schemas
-
codereview_env/models.py · write this FIRST, nothing else until done
-
-
-
-
-
-

Every piece of data that flows through the system lives here. Zero imports from logic files. This forces you to finalize the contract before anything depends on it — the same discipline your multi-agent repo applied to Kafka message schemas.

- -

Enums (define all values explicitly)

-
    -
  • TaskId: bug_detection | security_audit | architectural_review
  • -
  • ActionType: comment | flag_issue | request_changes | approve | ask_question
  • -
  • Severity: low | medium | high | critical — feeds the reward multiplier directly
  • -
  • Category: bug | security | style | performance | architecture | design
  • -
  • Verdict: LGTM | REQUEST_CHANGES | NEEDS_DISCUSSION — required for hard task terminal action
  • -
- -

Core Models

-
    -
  • FileChange: filename: str, patch: str, additions: int, deletions: int
  • -
  • GroundTruthIssue: category, severity, filename, line_number, description, keywords: list[str], required_verdict: Optional[Verdict]
  • -
  • Observation: task_id, pr_title, pr_description, diff: str, files_changed: list[FileChange], step_count: int, max_steps: int, history: list[ActionRecord], noise_budget: int
  • -
  • Action: action_type, body, filename, line_number, severity, category, verdict — add @model_validator that raises if flag_issue is sent without severity + category
  • -
  • StepResult: observation: Observation, reward: float, done: bool, info: dict — info carries issues_found_so_far, issues_total, score_so_far, noise_budget_remaining
  • -
  • ResetResult: observation: Observation, task_id: TaskId, seed: int, scenario_hash: str
  • -
  • EpisodeResult: task_id, seed, total_steps, final_score: float, issues_found: list, issues_missed: list, false_positives: list, verdict_correct: Optional[bool]
  • -
- -
# critical: Action validator
-@model_validator(mode='after')
-def validate_flag_issue(self):
-    if self.action_type == ActionType.FLAG_ISSUE:
-        if not self.severity or not self.category:
-            raise ValueError(
-                "flag_issue requires severity and category"
-            )
-    if self.action_type in (ActionType.APPROVE, ActionType.REQUEST_CHANGES):
-        if not self.verdict:
-            raise ValueError(
-                "approve/request_changes requires verdict"
-            )
-    return self
-
-
-
- - -
-
-
2
-
-
scenario_bank.py — the heart of the environment
-
codereview_env/scenario_bank.py · 30+ scenarios, seed-deterministic
-
-
-
-
-
-

This is what makes the environment actually useful for benchmarking. Inspired by how your orchestrator generates task configs from a project idea — here you generate synthetic PR scenarios from a seed. Every scenario must be realistic enough that a human engineer would recognize it as a real PR.

- -

Scenario structure

-
    -
  • Each scenario is a dataclass: pr_title, pr_description, files: list[FileChange], ground_truth_issues: list[GroundTruthIssue]
  • -
  • Each GroundTruthIssue has a keywords list — 8-15 domain-specific words that a good review body should contain for full credit
  • -
  • scenario_hash: md5 of the scenario content — lets validate.py confirm determinism across runs
  • -
- -
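The scenario structure and its hash can be sketched with plain dataclasses. This is a minimal illustration, assuming the hash is taken over canonical JSON (sorted keys) so that equal content always produces the same digest. The `scenario_hash` helper and the trimmed `GroundTruthIssue` fields are assumptions for the example, not the roadmap's final schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FileChange:
    filename: str
    patch: str
    additions: int
    deletions: int


@dataclass(frozen=True)
class GroundTruthIssue:
    # Trimmed for illustration; the full model carries more fields.
    category: str
    severity: str
    filename: str
    line_number: int
    keywords: tuple[str, ...] = ()


@dataclass(frozen=True)
class Scenario:
    pr_title: str
    pr_description: str
    files: tuple[FileChange, ...]
    ground_truth_issues: tuple[GroundTruthIssue, ...]


def scenario_hash(scenario: Scenario) -> str:
    """md5 over canonical JSON so equal content always hashes equally."""
    payload = json.dumps(asdict(scenario), sort_keys=True, ensure_ascii=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

Sorting keys matters here: without it, two logically identical scenarios could serialize differently and validate.py would report a false determinism failure.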

Task 1 — Bug Detection scenarios (10 minimum)

-
    -
  • Scenario 1: Python function with off-by-one error on list slice (range(len(x)) vs range(len(x)-1)). Keywords: off-by-one, index, out of range, slice, boundary
  • -
  • Scenario 2: None dereference — dict.get() result used directly without null check. Keywords: None, null check, KeyError, AttributeError, guard clause
  • -
  • Scenario 3: Wrong comparison operator — assignment in a conditional (if x = 5 in pseudo). Keywords: assignment, comparison, conditional, operator, typo
  • -
  • Scenario 4: Mutable default argument in Python (def fn(items=[])). Keywords: mutable, default, argument, shared state, closure, persistent
  • -
  • Scenario 5: Integer overflow in loop counter — counter never resets, wraps around. Keywords: overflow, counter, integer, reset, boundary, infinite
  • -
  • Scenario 6: Race condition — two reads of same variable without lock. Keywords: race condition, thread, concurrent, lock, atomic, synchronization
  • -
  • Scenario 7: Wrong exception type caught (catches Exception instead of specific type). Keywords: exception, broad, catch-all, specific, silent, swallow
  • -
  • Scenario 8: Float equality comparison (if x == 0.1). Keywords: float, equality, precision, epsilon, comparison, IEEE 754
  • -
  • Scenario 9: Return inside finally block overrides exception. Keywords: finally, return, exception, control flow, override, suppress
  • -
  • Scenario 10: Type coercion bug: a string compared to an int with == always returns False. Keywords: type, coercion, comparison, string, integer, implicit
  • -
- -

Task 2 — Security Audit scenarios (10 minimum)

-
    -
  • Scenario 1: SQL injection via f-string in a Django view. Files: views.py + models.py. Keywords: SQL injection, parameterized, f-string, user input, ORM, raw query
  • -
  • Scenario 2: Hardcoded secret — SECRET_KEY = "dev-secret-123" in settings.py. Keywords: hardcoded, secret, environment variable, .env, credential, exposure
  • -
  • Scenario 3: JWT decoded without signature verification (the token is accepted with the "none" algorithm). Keywords: JWT, signature, verification, algorithm, none, bypass
  • -
  • Scenario 4: XSS via unescaped template variable — mark_safe() on user content. Keywords: XSS, cross-site scripting, mark_safe, escape, sanitize, inject
  • -
  • Scenario 5: Path traversal — open(base_path + user_input) without normalization. Keywords: path traversal, directory, normalization, join, sanitize, escape
  • -
  • Scenario 6: Insecure deserialization — pickle.loads() on user-provided data. Keywords: deserialization, pickle, arbitrary code, RCE, untrusted, injection
  • -
  • Scenario 7: Broken auth — CORS wildcard (*) on sensitive API endpoint. Keywords: CORS, wildcard, origin, cross-origin, authentication, header
  • -
  • Scenario 8: Timing attack in password comparison (== instead of hmac.compare_digest). Keywords: timing attack, constant time, hmac, comparison, side channel
  • -
  • Scenario 9: Missing rate limiting on login endpoint. Keywords: rate limit, brute force, throttle, attempt, lockout, login
  • -
  • Scenario 10: Exposed debug endpoint in production (DEBUG=True in settings). Keywords: debug, production, sensitive, stack trace, information disclosure
  • -
- -

Task 3 — Architectural Review scenarios (10 minimum)

-
    -
  • Scenario 1: Frontend service calling database directly (bypassing API layer). Required verdict: REQUEST_CHANGES. Keywords: direct access, coupling, separation of concerns, API gateway, data layer
  • -
  • Scenario 2: Synchronous HTTP call inside Kafka message handler (blocks event loop). Required verdict: REQUEST_CHANGES. Keywords: synchronous, blocking, event loop, async, non-blocking, timeout
  • -
  • Scenario 3: Missing retry logic on external API call — no circuit breaker. Required verdict: REQUEST_CHANGES. Keywords: retry, circuit breaker, resilience, idempotent, backoff, failure
  • -
  • Scenario 4: God object — one class handles auth, billing, and user management. Required verdict: REQUEST_CHANGES. Keywords: single responsibility, god object, cohesion, separation, refactor
  • -
  • Scenario 5: N+1 query problem — loop inside a loop hitting the database. Required verdict: REQUEST_CHANGES. Keywords: N+1, query, loop, batch, eager load, select_related
  • -
  • Scenario 6: Missing pagination on endpoint that returns all records. Required verdict: REQUEST_CHANGES. Keywords: pagination, limit, offset, memory, unbounded, cursor
  • -
  • Scenario 7: Synchronous file upload blocking the request thread. Required verdict: REQUEST_CHANGES. Keywords: async, upload, background task, streaming, thread, non-blocking
  • -
  • Scenario 8: Missing idempotency key on payment mutation endpoint. Required verdict: REQUEST_CHANGES. Keywords: idempotency, duplicate, payment, retry, key, mutation
  • -
  • Scenario 9: Shared mutable state between microservices via direct DB write. Required verdict: REQUEST_CHANGES. Keywords: shared state, microservice, event, eventual consistency, ownership, coupling
  • -
  • Scenario 10: Clean architecture violation — domain logic in HTTP handler. Required verdict: REQUEST_CHANGES. Keywords: clean architecture, domain, handler, concern, presentation, business logic
  • -
- -
def get_scenario(task_id: TaskId, seed: int) -> Scenario:
-    rng = random.Random(seed)
-    bank = SCENARIOS[task_id]               # list of Scenario
-    idx = rng.randint(0, len(bank) - 1)
-    scenario = bank[idx]
-    # optionally shuffle distractor issues using rng
-    return scenario
-
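One property worth checking before relying on seeds 0–9: a seeded `randint` draw is deterministic per seed, but nothing guarantees that ten consecutive seeds reach all ten scenarios in a bank. The sketch below measures coverage and shows a modulo-based alternative. The helper names and the bank size of 10 are illustrative assumptions.

```python
import random


def pick_index_random(seed: int, bank_size: int) -> int:
    """Index selection as in get_scenario above: seeded RNG, one draw."""
    return random.Random(seed).randint(0, bank_size - 1)


def pick_index_modulo(seed: int, bank_size: int) -> int:
    """Alternative: seeds 0..bank_size-1 each map to a distinct scenario."""
    return seed % bank_size


def coverage(picker, seeds: range, bank_size: int) -> set[int]:
    """Set of scenario indices reached by the given seeds."""
    return {picker(seed, bank_size) for seed in seeds}


if __name__ == "__main__":
    seeds = range(10)
    print("random picker covers:", sorted(coverage(pick_index_random, seeds, 10)))
    print("modulo picker covers:", sorted(coverage(pick_index_modulo, seeds, 10)))
```

If the random picker leaves gaps, the baseline average over seeds 0–9 would silently skip some scenarios; the modulo variant trades randomness for guaranteed coverage.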
-
-
- - -
-
-
3
-
-
env.py — the episode state machine
-
codereview_env/env.py · all transitions explicit, all state logged
-
-
-
-
-
-

The environment class holds all episode state. Borrowing the pattern from your DAG orchestrator, state transitions are explicit and every transition is logged to history. The done condition mirrors your agent task lifecycle: episodes end on a terminal action, when max_steps is reached, or when the noise_budget is exhausted.

- -

State machine rules

-
    -
  • done=True fires on: approve action, request_changes action, step_count >= max_steps, OR noise_budget reaching 0
  • -
  • history stores every action taken this episode — the agent can read back what it already flagged
  • -
  • noise_budget starts at 5 — decremented per false positive flag_issue. At 0: done=True, score capped at current value (no further credit possible)
  • -
  • step_count increments on every /step call including non-flag actions (questions and comments count toward the max_steps limit, not the noise_budget)
  • -
  • reset() always clears all state — no carryover between episodes
  • -
- -

Intermediate reward logic

-
    -
  • On flag_issue: immediately call grader. If matches a ground truth issue: add partial credit to running_score. If false positive: decrement noise_budget.
  • -
  • On approve/request_changes: finalize episode. Add verdict bonus (hard task only). Return done=True.
  • -
  • On ask_question, comment: no reward change — these are neutral actions for multi-turn reasoning
  • -
  • Reward at each step = (running_score / max_possible_score) rounded to 4 decimals
  • -
- -
class CodeReviewEnv:
-    TASK_MAX_STEPS = {
-        "bug_detection": 10,
-        "security_audit": 15,
-        "architectural_review": 20,
-    }
-
-    def reset(self, task_id: str, seed: int = 42) -> ResetResult:
-        scenario = get_scenario(task_id, seed)
-        self._state = EpisodeState(
-            task_id=task_id, seed=seed, scenario=scenario,
-            step_count=0, noise_budget=5,
-            max_steps=self.TASK_MAX_STEPS[task_id],
-            actions_taken=[], running_score=0.0, done=False,
-        )
-        return ResetResult(
-            observation=self._build_obs(),
-            task_id=task_id, seed=seed,
-            scenario_hash=scenario.hash
-        )
-
-    def step(self, action: Action) -> StepResult:
-        s = self._state
-        if s.done:
-            raise RuntimeError("episode is done, call reset()")
-        s.step_count += 1
-        s.actions_taken.append(action)
-        reward = _apply_action(s, action)   # calls grader
-        s.done = (
-            action.action_type in (ActionType.APPROVE, ActionType.REQUEST_CHANGES)
-            or s.step_count >= s.max_steps
-            or s.noise_budget <= 0
-        )
-        return StepResult(
-            observation=self._build_obs(),
-            reward=reward, done=s.done,
-            info={"step": s.step_count, "score": s.running_score,
-                  "noise_budget": s.noise_budget}
-        )
-
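The termination and reward rules in this layer can be isolated as two small pure functions, which makes them easy to unit-test apart from the HTTP layer. This is a minimal sketch, assuming string action types and a hypothetical `EpisodeCounters` holder; the real `EpisodeState` carries more fields.

```python
from dataclasses import dataclass

TERMINAL_ACTIONS = {"approve", "request_changes"}


@dataclass
class EpisodeCounters:
    # Hypothetical subset of EpisodeState, enough for the two rules below.
    step_count: int = 0
    max_steps: int = 10
    noise_budget: int = 5
    running_score: float = 0.0


def is_done(state: EpisodeCounters, action_type: str) -> bool:
    """True when any of the three termination rules fires."""
    return (
        action_type in TERMINAL_ACTIONS
        or state.step_count >= state.max_steps
        or state.noise_budget <= 0
    )


def step_reward(state: EpisodeCounters, max_possible_score: float) -> float:
    """Normalized running score, rounded to 4 decimals and kept in [0, 1]."""
    if max_possible_score <= 0:
        return 0.0
    ratio = state.running_score / max_possible_score
    return round(min(1.0, max(0.0, ratio)), 4)
```

Keeping both functions free of I/O means tests/test_env.py can cover every done condition without starting the server.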
-
-
- - -
-
-
4
-
-
graders/ — the scoring intelligence
-
codereview_env/graders/ · pure functions, deterministic, no LLM calls
-
-
-
-
-
-

Graders are pure functions — same input always produces same output. They never call an LLM. Inspired by your Rust MoE scorer which assigns confidence values to expert routing decisions: here each matched issue gets a confidence_score based on keyword overlap between the agent's body text and the ground truth keyword list.

- -

bug_grader.py

-
def grade_bug(actions: list[Action], ground_truth: list[GroundTruthIssue]) -> float:
-    tp, fp = 0, 0
-    matched = set()
-    for action in actions:
-        if action.action_type != ActionType.FLAG_ISSUE: continue
-        match = find_best_match(action, ground_truth, matched)
-        if match:
-            confidence = keyword_overlap(action.body, match.keywords)
-            tp += confidence      # partial credit on confidence
-            matched.add(match.id)
-        else:
-            fp += 1
-    recall = tp / len(ground_truth) if ground_truth else 0
-    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
-    score = 0.7 * recall + 0.3 * precision
-    return round(min(1.0, max(0.0, score)), 4)
- -
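To see what the 0.7/0.3 weighting does in practice, the arithmetic can be separated from the matching logic. The sketch below assumes matching has already happened and takes the per-match confidences directly; `bug_score` is a hypothetical helper for the example, not the grader itself.

```python
def bug_score(confidences: list[float], false_positives: int, total_bugs: int) -> float:
    """0.7 * recall + 0.3 * precision with confidence-weighted true positives."""
    tp = sum(confidences)
    recall = tp / total_bugs if total_bugs else 0.0
    denominator = tp + false_positives
    precision = tp / denominator if denominator > 0 else 0.0
    return round(min(1.0, max(0.0, 0.7 * recall + 0.3 * precision)), 4)


if __name__ == "__main__":
    print(bug_score([1.0, 1.0], 0, 2))  # both bugs found, full confidence -> 1.0
    print(bug_score([1.0], 0, 2))       # one of two bugs, no noise -> 0.65
    print(bug_score([], 3, 2))          # only false positives -> 0.0
```

Note that an agent finding one of two bugs with no false positives scores 0.65, not 0.5: precision stays at 1.0 and contributes its full 0.3.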

security_grader.py — severity multipliers

-
    -
  • CRITICAL issue found = +0.40 of episode score
  • -
  • HIGH issue found = +0.25
  • -
  • MEDIUM issue found = +0.15
  • -
  • LOW issue found = +0.05
  • -
  • False alarm on CRITICAL = -0.15 (heavy penalty — false security alerts are dangerous)
  • -
  • Total normalizes to 1.0 regardless of how many issues exist in the scenario
  • -
- -
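The severity weighting and normalization can be sketched as follows. This assumes matches arrive as `(severity, confidence)` pairs and that the false-alarm penalty is subtracted after normalization, as in the scoring section; the function name and input shapes are assumptions for illustration.

```python
SEVERITY_WEIGHTS = {"critical": 0.40, "high": 0.25, "medium": 0.15, "low": 0.05}
FALSE_CRITICAL_PENALTY = 0.15


def security_score(
    found: list[tuple[str, float]],
    ground_truth_severities: list[str],
    false_critical_alarms: int = 0,
) -> float:
    """Severity-weighted score normalized by the scenario's maximum."""
    max_possible = sum(SEVERITY_WEIGHTS[sev] for sev in ground_truth_severities)
    if max_possible == 0:
        return 0.0
    earned = sum(SEVERITY_WEIGHTS[sev] * conf for sev, conf in found)
    score = earned / max_possible - FALSE_CRITICAL_PENALTY * false_critical_alarms
    return round(min(1.0, max(0.0, score)), 4)
```

Dividing by `max_possible` is what lets a scenario with one critical and one high issue reach 1.0 even though the raw weights only sum to 0.65.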

arch_grader.py — three components

-
    -
  • Issue detection score (0.6 weight): same as security grader but for architecture categories
  • -
  • Verdict correctness (0.2 weight): 0.2 bonus if verdict == scenario.required_verdict, 0.0 if wrong or missing
  • -
  • Explanation quality (0.2 weight): 0.2 × the proportion of correctly flagged architectural issues whose body exceeds 80 chars (good architects explain the tradeoff, not just name the problem)
  • -
- -
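The three components combine into one score as sketched below. This follows the proportion form of the quality term given in the scoring section, and assumes the issue-detection score has already been computed and normalized to 0.0–1.0. `arch_score` and its parameters are hypothetical names.

```python
from typing import Optional


def arch_score(
    issue_detection: float,
    verdict: Optional[str],
    required_verdict: str,
    correct_issue_bodies: list[str],
) -> float:
    """0.6 * issue detection + 0.2 * verdict + 0.2 * explanation quality."""
    issue_part = 0.60 * issue_detection
    verdict_part = 0.20 if verdict is not None and verdict == required_verdict else 0.0
    if correct_issue_bodies:
        detailed = sum(1 for body in correct_issue_bodies if len(body) > 80)
        quality_part = 0.20 * detailed / len(correct_issue_bodies)
    else:
        quality_part = 0.0
    total = issue_part + verdict_part + quality_part
    return round(min(1.0, max(0.0, total)), 4)
```

A length threshold is a crude proxy for explanation quality, so padding a body past 80 characters would earn the credit; pairing it with a keyword-overlap floor is one way to tighten it.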

grader_utils.py — shared helpers

-
def keyword_overlap(body: str, keywords: list[str]) -> float:
-    """Returns 0.0–1.0 confidence score based on keyword coverage."""
-    if not body or not keywords: return 0.5  # missing body = half credit
-    body_lower = body.lower()
-    hits = sum(1 for kw in keywords if kw.lower() in body_lower)
-    return min(1.0, hits / max(4, len(keywords) * 0.6))
-
-def find_best_match(action, ground_truth, already_matched):
-    """Line-number match (exact) OR category+file fuzzy match."""
-    for gt in ground_truth:
-        if gt.id in already_matched: continue
-        line_match = (action.line_number and
-                      abs(action.line_number - gt.line_number) <= 3)
-        cat_match = (action.category == gt.category and
-                     action.filename == gt.filename)
-        if line_match or cat_match: return gt
-    return None
-
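A few worked values make the behaviour of `keyword_overlap` concrete. The function below is copied from the helper above so the example runs on its own; the keyword list is the one from Bug Detection Scenario 1, and the review bodies are made up for illustration.

```python
def keyword_overlap(body: str, keywords: list[str]) -> float:
    """Returns 0.0–1.0 confidence score based on keyword coverage."""
    if not body or not keywords:
        return 0.5  # missing body = half credit
    body_lower = body.lower()
    hits = sum(1 for kw in keywords if kw.lower() in body_lower)
    return min(1.0, hits / max(4, len(keywords) * 0.6))


KEYWORDS = ["off-by-one", "index", "out of range", "slice", "boundary"]

if __name__ == "__main__":
    print(keyword_overlap("", KEYWORDS))             # 0.5 (empty body)
    print(keyword_overlap("looks fine", KEYWORDS))   # 0.0 (no hits)
    print(keyword_overlap("index error", KEYWORDS))  # 0.25 (1 hit / 4)
    print(keyword_overlap(
        "Off-by-one: the slice index passes the boundary", KEYWORDS))  # 1.0 (4 hits / 4)
```

One consequence worth deciding on deliberately: an empty body earns 0.5 while a body with no keyword hits earns 0.0, so an agent is rewarded more for saying nothing than for a weak explanation.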
-
-
- - -
-
-
5
-
-
app.py — thin FastAPI gateway
-
app.py · no business logic · serialize/deserialize only
-
-
-
-
-
-

Directly inspired by your Go gateway pattern (Fiber HTTP in the multi-agent repo) — the API layer does nothing except serialize and deserialize. All logic lives in env.py. Single global env instance per process. No session management needed for the hackathon.

- -

Required routes

-
    -
  • POST /reset → body: {task_id: str, seed: int} → returns ResetResult
  • -
  • POST /step → body: Action → returns StepResult
  • -
  • GET /state → returns current Observation (no body required)
  • -
  • GET /health → returns {status: "ok", version: "1.0.0", env_ready: bool}
  • -
- -

Bonus routes (10% creativity score)

-
    -
  • GET /ws/events — WebSocket that emits step events as JSON in real time (mirrors your WS-Hub)
  • -
  • GET /leaderboard — top 5 episode scores per task, stored in memory
  • -
  • POST /submit — agent posts {agent_name, task_id, score, seed} to appear on leaderboard
  • -
- -
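The in-memory leaderboard behind /leaderboard and /submit needs very little state. This is a minimal sketch, assuming one sorted list per task that is trimmed to five entries on every submission; the function names and the module-level `_boards` dict are illustrative, and the FastAPI route wiring is left out.

```python
from collections import defaultdict

TOP_N = 5
_boards: dict[str, list[dict]] = defaultdict(list)


def submit(agent_name: str, task_id: str, score: float, seed: int) -> None:
    """Record a score and keep only the top TOP_N entries for the task."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0.0, 1.0]")
    board = _boards[task_id]
    board.append(
        {"agent_name": agent_name, "task_id": task_id, "score": score, "seed": seed}
    )
    # list.sort is stable, so earlier submissions win ties.
    board.sort(key=lambda entry: entry["score"], reverse=True)
    del board[TOP_N:]


def leaderboard(task_id: str) -> list[dict]:
    """Return a copy so callers cannot mutate the stored board."""
    return list(_boards.get(task_id, []))
```

Because scores are self-reported through /submit, this board is only as trustworthy as its clients; recomputing the score server-side from the seed would be the stricter design.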
@app.post("/reset")
-async def reset_env(req: ResetRequest) -> ResetResult:
-    return env.reset(req.task_id, req.seed)
-
-@app.post("/step")
-async def step_env(action: Action) -> StepResult:
-    result = env.step(action)
-    await broadcast_event(result)  # → /ws/events
-    return result
-
-@app.websocket("/ws/events")
-async def ws_events(ws: WebSocket):
-    await ws.accept()
-    clients.add(ws)
-    try:
-        while True: await ws.receive_text()
-    finally: clients.discard(ws)
-
-
-
- - -
-
-
6
-
-
Dockerfile + HF Space deployment
-
Dockerfile · README.md · port 7860
-
-
-
-
-
-

Dockerfile (exact)

-
FROM python:3.11-slim
-WORKDIR /app
-COPY requirements.txt .
-RUN pip install --no-cache-dir -r requirements.txt
-COPY . .
-EXPOSE 7860
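-# python:3.11-slim ships without curl, so probe /health with the standard library
-HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
-  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health', timeout=5)" || exit 1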
-CMD ["uvicorn", "app:app", \
-     "--host", "0.0.0.0", \
-     "--port", "7860", \
-     "--workers", "1"]
- -

requirements.txt (keep it minimal — fast HF build)

-
fastapi==0.110.0
-uvicorn[standard]==0.27.0
-pydantic>=2.0
-websockets==12.0
- -

README.md HF Space header

-
---
-title: AgentOrg CodeReview Env
-emoji: 🔍
-colorFrom: purple
-colorTo: teal
-sdk: docker
-pinned: false
-tags:
-  - openenv
-  - code-review
-  - agent-evaluation
-  - reinforcement-learning
----
- -

Critical build failure points to watch

-
    -
  • Missing entry in requirements.txt — test docker build from zero, not from cached layers
  • -
  • Relative imports breaking inside Docker — use absolute imports throughout or add __init__.py everywhere
  • -
  • PORT mismatch: HF Docker Spaces route traffic to port 7860 by default (overridable with app_port in the README header). uvicorn must bind to 0.0.0.0 on that port
  • -
  • HEALTHCHECK must pass before HF marks the Space as running — test curl locally first
  • -
-
-
-
- - -
-
-
7
-
-
scripts/baseline.py — reproducible agent scores
-
scripts/baseline.py · mirrors run_demo.py · seeds 0–9
-
-
-
-
-
-

Directly mirrors your run_demo.py. A naive agent that uses only keyword matching: no LLM, no reasoning. Its low scores anchor the scale: going by the expected scores in section 06, a strong LLM agent should land roughly 2.6× higher on the easy task and about 6.4× higher on the hard one. Run with --url to point at your live HF Space for judge verification.

- -

Naive agent strategy

-
    -
  • Read the full diff from the observation
  • -
  • Apply a keyword dictionary: if "eval(" in diff → flag security/high, if "None" in diff without "is not None" → flag bug/medium, if "SELECT" with "%" or "+" → flag security/critical, etc.
  • -
  • After flagging, send request_changes with verdict=REQUEST_CHANGES to terminate episode
  • -
  • This scores ~0.25–0.35 easy, ~0.15–0.20 medium, ~0.05–0.12 hard
  • -
- -
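The keyword dictionary can be expressed as a table of `(regex, category, severity)` tuples, which is the shape the baseline loop expects from `KEYWORD_RULES`. The specific patterns below are hypothetical examples, not the roadmap's final rule set, and `match_rules` is an illustrative helper.

```python
import re

# Hypothetical rule table: (regex, category, severity).
KEYWORD_RULES = [
    (r"\beval\(", "security", "high"),
    (r"pickle\.loads\(", "security", "critical"),
    (r"SELECT\b.*(%|\+)", "security", "critical"),
    (r"def \w+\(.*=\[\]", "bug", "medium"),
]


def match_rules(diff: str) -> list[tuple[str, str]]:
    """Return (category, severity) for every rule whose pattern appears in the diff."""
    findings = []
    for pattern, category, severity in KEYWORD_RULES:
        if re.search(pattern, diff, re.IGNORECASE):
            findings.append((category, severity))
    return findings


if __name__ == "__main__":
    sample = (
        '+    cursor.execute("SELECT * FROM users WHERE id = %s" % user_id)\n'
        "+def add(item, items=[]):\n"
    )
    print(match_rules(sample))
```

Each rule fires at most once per diff and carries no line number, so matches rely on the category+filename path in the grader; that is part of why this agent scores low.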
# scripts/baseline.py
-import re
-import requests  # client-side dependency; not listed in the env's requirements.txt
-
-def run_episode(url: str, task_id: str, seed: int) -> float:
-    reset = requests.post(f"{url}/reset",
-                            json={"task_id": task_id, "seed": seed})
-    obs = reset.json()["observation"]
-    diff = obs["diff"]
-
-    # KEYWORD_RULES: list of (regex, category, severity) tuples
-    for pattern, cat, sev in KEYWORD_RULES:
-        if re.search(pattern, diff, re.IGNORECASE):
-            step = requests.post(f"{url}/step", json={
-                "action_type": "flag_issue",
-                "category": cat, "severity": sev,
-                "body": f"Detected {cat} pattern: {pattern}"
-            }).json()
-            if step["done"]:  # max_steps or noise_budget ended the episode
-                return step["reward"]
-
-    final = requests.post(f"{url}/step", json={
-        "action_type": "request_changes",
-        "verdict": "REQUEST_CHANGES",
-        "body": "Baseline review complete"
-    })
-    return final.json()["reward"]
-
-
-
- -
-
- -
- - -
-
- 05 -

Task Specifications // what the agent faces

-
- -
-
-
● EASY
-
Bug Detection
-
- Single-file Python diffs, 30–80 lines. Agent must identify functional bugs: off-by-one errors, null dereferences, type mismatches, wrong operators, mutable defaults. No security knowledge required. -
-
task config
-
-
max_steps: 10
-
ground truth: 2–3 bugs per scenario
-
no verdict required
-
noise_budget: 5 false positives
-
-
-
-
◆ MEDIUM
-
Security Audit
-
- Multi-file PRs spanning 2–4 files. Requires domain knowledge: SQL injection, XSS, JWT flaws, hardcoded secrets, insecure deserialization. Severity weighting — missing a CRITICAL is harshly penalized. -
-
task config
-
-
critical finding = 0.40 score
-
high finding = 0.25 score
-
medium = 0.15, low = 0.05
-
max_steps: 15
-
-
-
-
■ HARD
-
Architectural Review
-
- Cross-service PRs, 3–6 files. Requires synthesizing tradeoffs: SOLID violations, coupling, scalability bottlenecks, missing resilience patterns. Must produce LGTM or REQUEST_CHANGES verdict. Explanation quality scored. -
-
task config
-
-
verdict required (0.2 bonus)
-
explanation quality scored (0.2)
-
max_steps: 20
-
body > 80 chars for full credit
-
-
-
-
- -
- - -
-
- 06 -

Scoring & Reward // how 0.0–1.0 is computed

-
- -
-
# ── TASK 1: Bug Detection ──────────────────────────────────────────────
-score = (0.7 × recall) + (0.3 × precision)
-recall    = true_positives / total_ground_truth_bugs
-precision = true_positives / (true_positives + false_positives)
-# confidence modifier per match: keyword_overlap(agent.body, gt.keywords)
-
-# ── TASK 2: Security Audit ─────────────────────────────────────────────
-weights = {critical: 0.40, high: 0.25, medium: 0.15, low: 0.05}
-score = Σ(weight[sev] × confidence) / max_possible
-penalty = false_critical_alarms × 0.15  # dangerous false alarm = heavy cost
-score = max(0.0, score - penalty)
-
-# ── TASK 3: Architectural Review ───────────────────────────────────────
-issue_score    = 0.60 × (weighted issue detection, same as task 2)
-verdict_score  = 0.20 × (1 if verdict == required_verdict else 0)
-quality_score  = 0.20 × (proportion of correctly flagged issues where len(body) > 80)
-score = issue_score + verdict_score + quality_score
-
-# ── NOISE BUDGET (all tasks) ───────────────────────────────────────────
-noise_budget starts at 5
-each false_positive flag_issue:  budget -= 1
-budget == 0:  done = True, score = current running_score (no further credit)
-
- -
Hackathon scoring rubric
- -
-
-
30%
-
Real-world utility
-
Does this model a task someone would actually automate? Code review: yes. Every software company pays for this.
-
-
-
25%
-
Task & grader quality
-
3 tasks with genuine difficulty range. Graders are deterministic, score varies across inputs, hard task challenges frontier models.
-
-
-
20%
-
Environment design
-
Clean state machine, noise_budget mechanic, partial credit at every step, sensible done conditions, typed action space.
-
-
-
15%
-
Code quality & spec compliance
-
openenv validate passes, docker build works, HF Space deploys, baseline reproduces, full typing.
-
-
-
10%
-
Creativity & novelty
-
noise_budget mechanic, MoE-inspired confidence scoring, multi-format observations, WebSocket event stream, leaderboard.
-
-
- -
Expected baseline scores (seed=0 through seed=9 average)
Task                  Naive Baseline  Expected Strong LLM  Score Spread
bug_detection         0.31            0.82                 0.03 – 0.35
security_audit        0.18            0.71                 0.10 – 0.25
architectural_review  0.09            0.58                 0.04 – 0.15
- -
- Critical grader requirement: Run each grader with 20 different random inputs and verify the score distribution is NOT flat. At least 4 distinct score buckets must appear. A grader that always returns 0.0 or always returns 1.0 is an immediate disqualification. -
-
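The flat-distribution check can be automated with a small harness. This is a sketch, assuming a grader under test is wrapped as a callable that takes a seeded `random.Random` and returns a score, and that "bucket" means a 0.1-wide bin; both are assumptions, as are the helper names.

```python
import random
from typing import Callable


def distinct_buckets(
    grader: Callable[[random.Random], float],
    runs: int = 20,
    n_buckets: int = 10,
    seed: int = 0,
) -> int:
    """Run `grader` `runs` times and count how many score buckets appear."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(runs):
        score = grader(rng)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        # A score of exactly 1.0 falls into the last bucket.
        seen.add(min(int(score * n_buckets), n_buckets - 1))
    return len(seen)


def assert_not_flat(grader: Callable[[random.Random], float], min_buckets: int = 4) -> None:
    """Raise AssertionError if the grader's scores land in too few buckets."""
    found = distinct_buckets(grader)
    if found < min_buckets:
        raise AssertionError(f"only {found} score buckets, need {min_buckets}")
```

In practice the wrapped callable would use the `rng` to build random action lists for a fixed scenario and pass them to `grade_bug` or its siblings.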
- -
- - -
-
- 07 -

4-Day Build Order // no step skipped, no order changed

-
- -
- -
-
DAY 1 — MORNING
-
Lock all types in models.py
-
-
Define all 5 enums: TaskId, ActionType, Severity, Category, Verdict
-
Define all 8 Pydantic models: FileChange, GroundTruthIssue, ActionRecord, Observation, Action, StepResult, ResetResult, EpisodeResult (ActionRecord backs Observation.history; EpisodeState is a plain dataclass and comes on Day 2)
-
Add @model_validator on Action (flag_issue requires severity+category, approve/request_changes requires verdict)
-
Write 10 isinstance assertions in a __main__ block to verify all models parse correctly
-
Commit. Do not move on until all models are correct and type-checked with mypy or pyright
-
-
- -
-
DAY 1 — AFTERNOON
-
Write 10 Task 1 scenarios in scenario_bank.py
-
-
Create the Scenario dataclass and GroundTruthIssue dataclass (separate from Pydantic models — plain Python dataclasses are fine here for speed)
-
Write all 10 bug detection scenarios with realistic Python diffs and full keyword lists (8–15 keywords per issue)
-
Write get_scenario(task_id, seed) using random.Random(seed) for determinism
-
Add scenario_hash as the md5 of json.dumps(dataclasses.asdict(scenario), sort_keys=True), used later by validate.py (Scenario is a plain dataclass, so it has no .dict() method)
-
Test: get_scenario("bug_detection", 0) == get_scenario("bug_detection", 0) across 100 calls. Must be identical every time.
-
-
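The determinism and hashing steps above can be sketched as follows. This is a minimal illustration, not the project's actual scenario bank: the field names on `Scenario` and `GroundTruthIssue`, the two-entry `SCENARIOS` table, and the helper name `scenario_hash()` are assumptions made for the example.

```python
import dataclasses
import hashlib
import json
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundTruthIssue:
    # Hypothetical fields; the real dataclass may carry more.
    filename: str
    line: int
    category: str
    keywords: tuple[str, ...]


@dataclass(frozen=True)
class Scenario:
    task_id: str
    diff: str
    issues: tuple[GroundTruthIssue, ...]


# Hypothetical two-entry bank; the roadmap calls for 10 scenarios per task.
SCENARIOS = {
    "bug_detection": [
        Scenario(
            "bug_detection",
            "+ for i in range(len(xs) + 1):",
            (GroundTruthIssue("loop.py", 1, "logic", ("off-by-one", "index", "range")),),
        ),
        Scenario(
            "bug_detection",
            "+ if user = None:",
            (GroundTruthIssue("auth.py", 1, "syntax", ("assignment", "comparison")),),
        ),
    ],
}


def get_scenario(task_id: str, seed: int) -> Scenario:
    # A local RNG keeps the choice independent of global random state.
    rng = random.Random(seed)
    return rng.choice(SCENARIOS[task_id])


def scenario_hash(scenario: Scenario) -> str:
    # sort_keys makes the JSON text, and therefore the digest, stable.
    payload = json.dumps(dataclasses.asdict(scenario), sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

Because the RNG is constructed from the seed on every call, repeated calls with the same `(task_id, seed)` pair return equal scenarios and equal hashes, which is the property the 100-call test is meant to confirm.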
- -
-
DAY 1 — EVENING
-
grader_utils.py + bug_grader.py — fully tested
-
-
Write keyword_overlap() with unit tests: empty body returns 0.5, all keywords hit returns 1.0, zero keywords returns 0.0
-
Write find_best_match() — line number ±3 OR (category + filename match)
-
Write grade_bug() with the 0.7×recall + 0.3×precision formula
-
Write 5 test cases in tests/test_graders.py, covering at least: perfect agent gets 1.0, empty agent gets 0.0, half-correct gets ~0.5, all false positives gets near 0.0
-
Verify score is ALWAYS in [0.0, 1.0] — add a clamp just in case
-
-
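One way the matching rule and the 0.7×recall + 0.3×precision formula could fit together is sketched below. The `Issue` shape, the `is_match()` helper, and the tie-breaking by line distance are assumptions for illustration. The sketch also leaves out keyword-overlap confidence and treats every match as full credit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Issue:
    """Hypothetical minimal shape shared by ground-truth issues and agent flags."""
    filename: str
    line: int
    category: str


def is_match(flag: Issue, truth: Issue, tolerance: int = 3) -> bool:
    # Literal reading of the rule: line within ±3 OR (category + filename match).
    near_line = abs(flag.line - truth.line) <= tolerance
    same_spot = flag.category == truth.category and flag.filename == truth.filename
    return near_line or same_spot


def find_best_match(flag: Issue, truths: list[Issue], used: set[int]) -> Optional[int]:
    """Return the index of the closest unused ground-truth issue, or None."""
    candidates = [
        i for i, truth in enumerate(truths)
        if i not in used and is_match(flag, truth)
    ]
    if not candidates:
        return None
    # Assumed tie-break: prefer the ground-truth issue nearest by line number.
    return min(candidates, key=lambda i: abs(flag.line - truths[i].line))


def grade_bug(flags: list[Issue], truths: list[Issue]) -> float:
    used: set[int] = set()
    for flag in flags:
        idx = find_best_match(flag, truths, used)
        if idx is not None:
            used.add(idx)  # each ground-truth issue can be credited only once
    recall = len(used) / len(truths) if truths else 0.0
    precision = len(used) / len(flags) if flags else 0.0
    score = 0.7 * recall + 0.3 * precision
    return max(0.0, min(1.0, score))  # clamp, as the plan asks
```

Under this simplified scoring, an agent that finds one of two issues with no false positives gets 0.7×0.5 + 0.3×1.0 = 0.65, so the "half-correct gets ~0.5" target depends on how partial credit is defined in the real grader.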
- -
-
DAY 2 — MORNING
-
env.py — Task 1 end-to-end working
-
-
Write EpisodeState as a plain dataclass (not Pydantic — internal state doesn't need serialization)
-
Write reset() — constructs EpisodeState, returns ResetResult with clean observation
-
Write step() — validates action, increments step_count, calls _apply_action(), checks done condition
-
Write _apply_action() — calls bug_grader for flag_issue, updates running_score and noise_budget
-
Write state() — builds and returns current Observation from EpisodeState
-
Write tests/test_env.py: reset returns step_count=0, step increments it, done=True on max_steps, done=True on approve action
-
-
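The step and done-condition bullets above might look roughly like this. It is a sketch under stated assumptions: the `max_steps` default, the `verdict_given` flag, and passing `is_false_positive` in directly (instead of calling a grader) are all simplifications made for the example. The starting budget of 5 is the value quoted in the noise-budget description elsewhere in this roadmap.

```python
from dataclasses import dataclass


@dataclass
class EpisodeState:
    task_id: str
    max_steps: int = 10        # assumed default; the roadmap does not fix a value
    step_count: int = 0
    noise_budget: int = 5      # starting budget from the noise-budget description
    verdict_given: bool = False


TERMINAL_ACTIONS = {"approve", "request_changes"}


def is_done(state: EpisodeState) -> bool:
    # The three termination paths: verdict, step limit, exhausted noise budget.
    return (
        state.verdict_given
        or state.step_count >= state.max_steps
        or state.noise_budget <= 0
    )


def step(state: EpisodeState, action_type: str, is_false_positive: bool = False) -> bool:
    """Apply one action and return the new done flag."""
    if is_done(state):
        raise RuntimeError("episode already finished; call reset() first")
    state.step_count += 1
    if action_type in TERMINAL_ACTIONS:
        state.verdict_given = True
    elif action_type == "flag_issue" and is_false_positive:
        state.noise_budget -= 1  # each false positive costs one unit
    return is_done(state)
```

Keeping the termination logic in a single `is_done()` function makes it straightforward to write one test per termination path.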
- -
-
DAY 2 — AFTERNOON
-
app.py — FastAPI wrapper running
-
-
Write the 4 required routes: /reset, /step, /state, /health
-
Add single global env = CodeReviewEnv() instance
-
Test every route with curl — do not proceed until all 4 return correct JSON
-
Add /ws/events WebSocket endpoint with simple broadcast to connected clients
-
Run tests/test_api.py via httpx.AsyncClient against the running server
-
-
- -
-
DAY 2 — EVENING
-
Dockerfile — local container confirmed working
-
-
Write Dockerfile (exactly as specified in Layer 6)
-
docker build -t codereview-env . — must complete with zero errors
-
docker run -p 7860:7860 codereview-env — must start cleanly
-
curl http://localhost:7860/health — must return {"status":"ok"}
-
Run full curl sequence: /reset → /step × 3 → /state. Verify JSON at each step.
-
COMMIT with message "day 2: task 1 end-to-end working in container"
-
-
- -
-
DAY 3 — MORNING
-
Task 2 + 3 scenarios and graders
-
-
Write all 10 security_audit scenarios with multi-file diffs and severity labels
-
Write security_grader.py with severity multipliers and false-alarm penalty
-
Write all 10 architectural_review scenarios with required_verdict field
-
Write arch_grader.py with issue_score + verdict_score + quality_score components
-
Extend env.py to dispatch to the correct grader based on task_id in EpisodeState
-
Run all grader tests — verify scores vary across 20 random inputs for each grader
-
-
- -
-
DAY 3 — AFTERNOON
-
scripts/baseline.py — capture reproducible scores
-
-
Write the naive keyword-matching agent (see Layer 7)
-
Run against all 3 tasks, seeds 0–9, URL pointing at local container
-
Record every score in a table. Confirm seed=42 always gives same score across 3 independent runs.
-
Copy scores into README.md baseline table
-
Add --url flag so judges can point baseline.py at the live HF Space
-
-
- -
-
DAY 3 — EVENING
-
openenv.yaml + validate.py
-
-
Write openenv.yaml — name, version, tasks (3), observation_space (typed), action_space (typed), endpoints
-
Write scripts/validate.py — calls /health, /reset × 3 tasks, /step × 1, /state, asserts all schemas match openenv.yaml
-
Run openenv validate (if CLI available) OR python scripts/validate.py locally — must exit 0
-
Rebuild Docker container and run validate.py against it — must still exit 0
-
-
- -
-
DAY 4 — ALL DAY
-
README.md + HF Space + final smoke test
-
-
Write full README.md: HF header, environment description (3 paragraphs), observation space table, action space table, task descriptions with baseline scores, setup instructions for local + Docker, baseline scores table
-
Create HF Space at huggingface.co/new-space, SDK: Docker
-
Push repository to HF Space — watch build log, fix any import or port errors
-
Once live: curl https://your-space.hf.space/health — must return 200
-
Run scripts/validate.py --url https://your-space.hf.space — must exit 0
-
Run scripts/baseline.py --url https://your-space.hf.space — must produce same scores as local
-
Final commit: "submission ready — HF Space live, baseline reproducible"
-
-
- -
-
- -
- - -
-
- 08 -

5 Innovations From Your Repos // what makes this unique

-
- -
- -
-
MoE-Scoring Rust Module
-
Confidence-Weighted Matching
-
- Your Rust MoE module routes to experts based on a confidence score. In the grader, each matched issue earns a confidence score (0.0–1.0) based on keyword overlap between the agent's body text and the ground truth keyword list. A match with 10+ keyword hits = 1.0 credit. A match with 2 hits = 0.3 credit. This creates a smooth gradient reward instead of binary hit/miss — agents learn to write better explanations over time. -
-
- -
-
Finance Agent Budget Enforcement
-
Noise Budget Mechanic
-
- Your Finance Agent kills tasks that exceed token budgets. The noise_budget starts at 5 per episode. Every false positive flag_issue costs 1 from the budget. When the budget hits 0, the episode ends immediately and the agent receives no further credit. This blocks the "dump every possible issue" strategy, which would otherwise exploit the recall-heavy weighting of the grader by trading a small precision loss for guaranteed recall. No other OpenEnv environment has this. -
-
- -
-
moltbot Channel Abstraction
-
Multi-Format Observations
-
- Your moltbot abstracts over input channels. reset() accepts observation_format: "diff_only" | "full_pr" | "structured". Structured mode parses the diff into a JSON tree of added/removed/context lines with file names, line numbers, and change types. This makes the environment immediately useful as a research tool: test whether structured vs unstructured input changes agent score across models. -
-
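As an illustration of the structured mode, a unified diff could be parsed into a JSON-ready tree along these lines. The function name `parse_diff()` and the output keys (`filename`, `lines`, `type`, `line`, `text`) are assumptions, not the environment's actual schema.

```python
import json
import re

# Matches hunk headers such as "@@ -10,4 +12,6 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def parse_diff(diff_text: str) -> list[dict]:
    """Parse a unified diff into per-file lists of added/removed/context lines."""
    files: list[dict] = []
    current = None
    old_no = new_no = 0
    for raw in diff_text.splitlines():
        if raw.startswith("+++ "):
            name = raw[4:].strip()
            if name.startswith("b/"):
                name = name[2:]  # drop git's "b/" prefix
            current = {"filename": name, "lines": []}
            files.append(current)
            continue
        if raw.startswith("--- ") or current is None:
            continue  # old-file header, or preamble before the first file
        hunk = HUNK_RE.match(raw)
        if hunk:
            old_no, new_no = int(hunk.group(1)), int(hunk.group(2))
            continue
        if raw.startswith("+"):
            current["lines"].append({"type": "added", "line": new_no, "text": raw[1:]})
            new_no += 1
        elif raw.startswith("-"):
            current["lines"].append({"type": "removed", "line": old_no, "text": raw[1:]})
            old_no += 1
        elif raw.startswith(" "):
            current["lines"].append({"type": "context", "line": new_no, "text": raw[1:]})
            old_no += 1
            new_no += 1
    return files


if __name__ == "__main__":
    sample = "--- a/app.py\n+++ b/app.py\n@@ -1,2 +1,2 @@\n import os\n-x = 1\n+x = 2\n"
    print(json.dumps(parse_diff(sample), indent=2))
```

Removed lines carry their old-file line number, while added and context lines carry the new-file number. A known limitation of this sketch is that a changed line whose content itself begins with `++ ` or `-- ` would be mistaken for a file header.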
- -
-
WS-Hub Go Service
-
WebSocket Event Stream
-
- Your Go WS-Hub broadcasts agent events to the Next.js dashboard in real time. Mirrored as /ws/events — every step() call broadcasts the StepResult as JSON to all connected WebSocket clients. This enables live monitoring during judge evaluation runs, makes the HF Space feel alive rather than static, and is a direct demonstration of your architectural thinking. -
-
- -
-
hive Repeatability Focus
-
Scenario Hash + Leaderboard
-
- Your hive repo emphasizes repeatability and outcome tracking. Every scenario has a deterministic hash. validate.py compares the hash returned by /reset with the expected hash for that (task_id, seed) pair — proving the scenario generator is deterministic across deployments. The /leaderboard endpoint stores top-5 scores per task in memory, giving the HF Space a live competitive dimension. -
-
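The in-memory top-5 store behind /leaderboard could be as small as the sketch below. The class name `Leaderboard`, its method names, and the `agent` label are hypothetical. As with any in-memory store, the data is lost when the process restarts.

```python
from collections import defaultdict

TOP_N = 5  # top-5 per task, as described above


class Leaderboard:
    """Keeps the best TOP_N scores per task in process memory."""

    def __init__(self, top_n: int = TOP_N) -> None:
        self._top_n = top_n
        self._scores: dict[str, list[tuple[float, str]]] = defaultdict(list)

    def submit(self, task_id: str, agent: str, score: float) -> None:
        entries = self._scores[task_id]
        entries.append((score, agent))
        # Stable sort: among equal scores, earlier submissions stay ahead.
        entries.sort(key=lambda entry: entry[0], reverse=True)
        del entries[self._top_n:]  # trim to the best top_n

    def top(self, task_id: str) -> list[dict]:
        return [
            {"agent": agent, "score": score}
            for score, agent in self._scores.get(task_id, [])
        ]
```

A route handler would then only need to call `top(task_id)` and return the list as JSON.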
- -
-
- -
- - -
-
- 09 -

Disqualification Traps // common failures that kill submissions

-
- -
- -
-
Flat grader scores
-
Run each grader with 20 different random inputs. If the output distribution has fewer than 4 distinct buckets, your grader is broken. Check: does a perfect agent score 1.0? Does an empty agent score 0.0? Does a partial agent score ~0.5?
-
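The 20-input distribution check can be automated with a small helper like the one below. The roadmap does not define a "bucket", so this sketch assumes ten equal-width buckets over [0.0, 1.0]. The names `score_buckets()` and `check_grader()` are illustrative.

```python
import random
from typing import Callable


def score_buckets(scores: list[float], n_buckets: int = 10) -> set[int]:
    """Map each score in [0.0, 1.0] to an equal-width bucket index."""
    buckets = set()
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        # A score of exactly 1.0 falls into the last bucket.
        buckets.add(min(int(score * n_buckets), n_buckets - 1))
    return buckets


def check_grader(
    grader: Callable[[object], float],
    make_input: Callable[[random.Random], object],
    runs: int = 20,
    min_buckets: int = 4,
    seed: int = 0,
) -> bool:
    """Return True if the grader's scores land in at least min_buckets buckets."""
    rng = random.Random(seed)  # seeded so the check itself is reproducible
    scores = [grader(make_input(rng)) for _ in range(runs)]
    return len(score_buckets(scores)) >= min_buckets
```

Here `make_input` stands in for whatever generates a random agent submission for a given grader. A grader that always returns 0.0 or 1.0 fails the check because all of its scores land in a single bucket.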
- -
-
done=True never fires
-
Trace your done condition explicitly. Three ways it must fire: approve/request_changes action, step_count >= max_steps, noise_budget <= 0. If done never fires, the episode never terminates, so the agentic eval loop keeps calling /step and hangs.
-
- -
-
Non-reproducible baseline
-
Run baseline.py with seed=42 three times from a fresh process. All three runs must produce byte-for-byte identical score tables. If they differ, your scenario_bank.py is not deterministic. Use random.Random(seed) not global random.
-
- -
-
openenv.yaml schema drift
-
If the YAML says action.severity is a string enum but the API accepts bare integers, validate.py will fail. Write the YAML after the models are final, not before. Copy enum values directly from models.py into the YAML.
-
- -
-
Docker port mismatch
-
HF Spaces requires port 7860. Verify: EXPOSE 7860 in Dockerfile, uvicorn --port 7860, and HEALTHCHECK hitting port 7860. All three must match. A common failure is uvicorn defaulting to 8000.
-
- -
-
Relative imports in Docker
-
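from .models import Action works when the code is imported as part of a package, but fails with "attempted relative import with no known parent package" when a module is run as a top-level script. This is easy to trigger in Docker if the WORKDIR or the uvicorn target is wrong. Use absolute imports throughout: from codereview_env.models import Action. Add an empty __init__.py to every package directory.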
-
- -
-
- -
- - -
-
- 10 -

Final Submission Checklist // click to mark complete

-
- -
- -
-
models.py — all 5 enums, all 8 Pydantic models, @model_validator on Action, mypy clean
-
scenario_bank.py — 30 scenarios (10 per task), keywords per issue, seed determinism verified
-
grader_utils.py — keyword_overlap(), find_best_match(), unit tested
-
bug_grader.py — grade_bug(), score varies across 20 inputs, always [0.0, 1.0]
-
security_grader.py — severity multipliers, false-alarm penalty, normalized to 1.0
-
arch_grader.py — issue + verdict + quality components, all three tasks dispatch correctly
-
env.py — reset/step/state, noise_budget, 3 done conditions, clean reset between episodes
-
app.py — 4 required routes + /ws/events + /leaderboard, all return correct JSON
-
tests/ — test_env.py + test_graders.py + test_api.py, all pytest green
-
Dockerfile — docker build succeeds, docker run starts, /health returns 200, HEALTHCHECK passes
-
openenv.yaml — name, version, 3 tasks, observation_space, action_space, endpoints all filled
-
scripts/baseline.py — runs all 3 tasks, seeds 0–9, tabular output, seed=42 reproducible across 3 runs
-
scripts/validate.py — calls /health /reset /step /state, asserts schemas, exits 0
-
README.md — HF Space header, environment description, obs/action tables, task descriptions, baseline scores, setup instructions
-
HF Space — deployed, live URL responds, /health returns 200, sdk: docker tag visible
-
Live smoke test — validate.py --url [HF URL] exits 0, baseline.py --url [HF URL] matches local scores
-
Grader distribution check — each grader produces 4+ distinct score buckets across 20 random inputs
-
Hard task challenge check — frontier model (GPT-4 or Claude) scores < 0.75 on architectural_review (it should be genuinely hard)
-
- -
-
-
-
-
- -
- - - - - - - \ No newline at end of file