---
license: mit
base_model: Qwen/Qwen3-4B-Instruct-2507
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- hydrology
- agent
- tool-use
- grpo
- reinforcement-learning
- qwen3
- ef5
- crest
- function-calling
datasets:
- anonymousOwl/HydroAgent-dataset
---

# HydroAgent: Qwen3-4B-Instruct fine-tuned for hydrologic model calibration

**HydroAgent** is a tool-using language model that calibrates the
[EF5/CREST](https://github.com/HyDROSLab/EF5) distributed hydrologic model.
Given a USGS streamflow gage and a precipitation-driven simulation, the agent
iteratively proposes physically plausible parameter sets, runs the simulator,
inspects the resulting NSE / peak / volume metrics, and revises until the
model fits the observations.

This release is the **GRPO step-100 checkpoint** of the SFT + RL pipeline
described in [chrimerss/HydroLLM](https://github.com/chrimerss/HydroLLM).

- **Base model:** [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
- **Training:** full fine-tuning, BF16, FSDP, no LoRA
- **RL framework:** [verl 0.5](https://github.com/volcengine/verl) GRPO with [SGLang](https://github.com/sgl-project/sglang) rollouts
- **Tool format:** Hermes-style `<tool_call>` JSON (Qwen3-Instruct native)
- **Hardware:** 4× H100, ~30 min/step, K=6 rollouts × up to 50 multi-turn calls each

## How the agent works

The model has access to three tools and runs a multi-turn calibration loop:

| Tool | Purpose |
|---|---|
| `set_parameters` | Set 11 tunable CREST multipliers: `wm`, `b`, `im`, `ke`, `fc`, `under`, `leaki`, `alpha`, `beta`, `alpha0`, `iwu` |
| `run_simulation` | Execute EF5 with the current parameters and produce a hydrograph |
| `evaluate` | Score the latest run vs. observations: NSE, CC, KGE, peak ratio, lag |
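
NSE (Nash-Sutcliffe efficiency) is the metric the rest of this card keeps returning to. The sketch below uses the textbook definition, one minus the ratio of squared simulation error to the variance of the observations. The function name and sample series are illustrative assumptions; the `evaluate` tool's actual implementation lives in the HydroLLM repository and may differ in details such as missing-data handling.

```python
from statistics import fmean


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency (textbook form).

    1.0 is a perfect fit; 0.0 means the simulation is no better than
    predicting the observed mean; negative values are worse than that.
    """
    if len(observed) != len(simulated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = fmean(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("NSE is undefined for a constant observed series")
    return 1.0 - squared_error / variance


# Made-up hourly discharge values, for illustration only.
obs = [1.0, 2.0, 3.0, 4.0]
perfect = nse(obs, obs)           # simulation identical to observations
mean_only = nse(obs, [2.5] * 4)   # simulation equal to the observed mean
print(perfect, mean_only)
```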

Each rollout typically follows: `set_parameters → run_simulation → evaluate → set_parameters → …`
until NSE plateaus or the agent runs out of turns. Inputs to the agent are a
short system prompt describing the calibration task and a per-gage user
message with watershed metadata (basin area, lat/lon, time window).
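
The control flow of one rollout can be sketched as a plain loop. This is only an illustration of the stopping logic (plateau or turn budget): in the real system the proposals come from the language model and the scores from EF5. Every name here (`calibrate`, `propose`, `run_and_score`, `max_rounds`, `patience`, `min_gain`, and the toy stand-ins) is hypothetical, and the defaults are assumptions rather than values from the training setup.

```python
from typing import Callable, Dict, List, Tuple

# The 11 CREST multipliers exposed by set_parameters.
PARAM_NAMES = ["wm", "b", "im", "ke", "fc", "under", "leaki",
               "alpha", "beta", "alpha0", "iwu"]

Params = Dict[str, float]
History = List[Tuple[Params, float]]


def calibrate(propose: Callable[[History], Params],
              run_and_score: Callable[[Params], float],
              max_rounds: int = 16,
              patience: int = 3,
              min_gain: float = 1e-3) -> Tuple[Params, float, History]:
    """Propose, simulate, evaluate; stop on an NSE plateau or when rounds run out."""
    history: History = []
    best_params: Params = {}
    best_nse = float("-inf")
    stale = 0
    for _ in range(max_rounds):
        params = propose(history)        # stands in for set_parameters
        score = run_and_score(params)    # stands in for run_simulation + evaluate
        history.append((params, score))
        if score > best_nse + min_gain:
            best_params, best_nse, stale = params, score, 0
        else:
            stale += 1
            if stale >= patience:        # NSE has plateaued
                break
    return best_params, best_nse, history


def toy_propose(history: History) -> Params:
    """Toy policy: lower every multiplier by 0.05 per round (not the LLM)."""
    value = 1.0 - 0.05 * len(history)
    return {name: value for name in PARAM_NAMES}


def toy_score(params: Params) -> float:
    """Toy objective peaking when every multiplier is 0.8 (not EF5)."""
    return 1.0 - sum((v - 0.8) ** 2 for v in params.values())


best_params, best_nse, history = calibrate(toy_propose, toy_score)
print(f"best NSE {best_nse:.3f} after {len(history)} rounds")
```

With the toy stand-ins the score rises for five rounds, then three non-improving rounds trigger the plateau stop.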

## Training data

The agent is trained on **10 CONUS USGS gages** (basin areas
539 to 2401 km²), each paired with **MRMS 1 km hourly precipitation** as
forcing and **hourly USGS streamflow observations** as the target, over
60-day windows selected to contain a clear flood event (rising and receding
limbs, edge-buffered).

| Gage ID | Basin area (km²) | Lat | Lon | Window (UTC) |
|---|---:|---:|---:|---|
| 11383500 | 539 | 40.0140 | -121.9483 | 2018-05-19 to 2018-07-17 |
| 11043000 | 575 | 33.4798 | -117.1439 | 2019-03-15 to 2019-05-13 |
| 11152000 | 632 | 36.2805 | -121.3227 | 2018-05-29 to 2018-07-27 |
| 02294781 | 1064 | 27.8245 | -81.8017 | 2018-04-29 to 2018-06-27 |
| 02312000 | 1476 | 28.4800 | -82.1776 | 2018-11-15 to 2019-01-13 |
| 07195430 | 1489 | 36.1086 | -94.5333 | 2018-01-04 to 2018-03-04 |
| 11179000 | 1639 | 37.5871 | -121.9608 | 2018-06-03 to 2018-08-01 |
| 14301000 | 1727 | 45.7040 | -123.7554 | 2018-09-11 to 2018-11-09 |
| 14207500 | 1828 | 45.3507 | -122.6762 | 2018-04-09 to 2018-06-07 |
| 11376000 | 2401 | 40.3871 | -122.2386 | 2018-09-21 to 2018-11-19 |

**Held-out evaluation gages** (never seen during training):

| Gage ID | Basin area (km²) | Lat | Lon | Window (UTC) |
|---|---:|---:|---:|---|
| 02338660 | 329 | 33.2357 | -84.9876 | 2018-07-01 to 2018-08-31 |
| 01403060 | 2033 | 40.5511 | -74.5483 | 2018-11-11 to 2019-01-09 |
| 06279500 | 40792 | 44.7585 | -108.1816 | 2018-06-13 to 2018-08-11 |
| 07144100 | 3209 | 37.8831 | -97.4245 | 2019-03-30 to 2019-05-28 |

The full training dataset (CONUS terrain rasters, per-gage MRMS hourly
precipitation clips, USGS hourly streamflow observations, daily PET, the
EF5 control template, and the 73 GPT-4o calibration trajectories that seed
the SFT phase) is published as
[**anonymousOwl/HydroAgent-dataset**](https://huggingface.co/datasets/anonymousOwl/HydroAgent-dataset).
See that repo's README for the per-folder layout and provenance.

## Reward

Two reward layers shape the policy:

**Per-turn (returned by tools):**

| Tool call | Reward |
|---|---|
| `set_parameters` (valid) | `+0.02` |
| `run_simulation` (valid) | `+0.05` |
| `evaluate` (valid) | `ΔNSE` (this turn minus previous best) |
| Any tool (invalid) | `-0.5` |
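
As a rough illustration, the per-turn table maps onto a small lookup function. The function and argument names are hypothetical. Two behaviours are assumptions not stated in the table: unknown tool names are treated as invalid calls, and the caller must supply the previous best NSE (the table does not say what baseline applies to the first `evaluate`).

```python
from typing import Optional

# Fixed rewards for valid calls, taken from the per-turn table.
FIXED_REWARDS = {"set_parameters": 0.02, "run_simulation": 0.05}
INVALID_PENALTY = -0.5


def turn_reward(tool: str,
                valid: bool,
                nse: Optional[float] = None,
                previous_best: Optional[float] = None) -> float:
    """Per-turn reward for a single tool call."""
    if not valid:
        return INVALID_PENALTY
    if tool == "evaluate":
        if nse is None or previous_best is None:
            raise ValueError("evaluate needs both nse and previous_best")
        return nse - previous_best           # delta NSE vs. previous best
    # Assumption: an unrecognised tool name counts as an invalid call.
    return FIXED_REWARDS.get(tool, INVALID_PENALTY)


print(turn_reward("set_parameters", True))
print(turn_reward("evaluate", True, nse=0.62, previous_best=0.55))
print(turn_reward("run_simulation", False))
```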

**Terminal (returned at end of trajectory):**

| Component | Value |
|---|---|
| Best NSE (clipped) | `[-1, 1]` |
| Target-met bonus | `+0.5` if best NSE > gage target |
| Iteration bonus | `+0.02 × n_evaluates` |
| Improvement bonus | `+0.10 × max(0, n_improvements - 1)` |
| Empty-trajectory penalty | `-1.0` |
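
A worked example makes the terminal components concrete. The sketch assumes the components are simply summed and that a trajectory with no `evaluate` call counts as empty and receives only the penalty; the table lists the components but not how they combine, so treat both points, and the function name, as assumptions.

```python
from typing import Optional


def terminal_reward(best_nse: Optional[float],
                    target_nse: float,
                    n_evaluates: int,
                    n_improvements: int) -> float:
    """Sum of the terminal reward components (assumed additive)."""
    if best_nse is None or n_evaluates == 0:
        return -1.0                                   # empty-trajectory penalty
    reward = max(-1.0, min(1.0, best_nse))            # best NSE clipped to [-1, 1]
    if best_nse > target_nse:
        reward += 0.5                                 # target-met bonus
    reward += 0.02 * n_evaluates                      # iteration bonus
    reward += 0.10 * max(0, n_improvements - 1)       # improvement bonus
    return reward


# 0.72 (NSE) + 0.5 (target met) + 0.10 (5 evaluates) + 0.20 (3 improvements)
example = terminal_reward(best_nse=0.72, target_nse=0.6,
                          n_evaluates=5, n_improvements=3)
print(f"{example:.2f}")  # 1.52
```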

## GRPO settings

| Setting | Value |
|---|---|
| Algorithm | GRPO (group-relative advantages) |
| K (rollouts per prompt) | 6 |
| Train batch size | 4 prompts (24 trajectories per step) |
| Max assistant turns | 50 |
| Learning rate | 1e-6 with 5% warmup |
| Entropy coefficient | 0.01 |
| KL loss coefficient | 0.05 (anchored to base policy) |
| Sampling | `temperature=1.0`, `top_p=0.95` |
| Steps in this checkpoint | **100** |
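
"Group-relative advantages" means each trajectory is scored against the other K rollouts for the same prompt rather than against a learned value function. A minimal sketch of the usual normalisation, assuming one scalar reward per trajectory, follows. The function name and `eps` are illustrative, and the use of the population standard deviation is an assumption: verl's own implementation may differ in such details.

```python
from statistics import fmean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise one prompt's K rewards to zero mean and roughly unit spread."""
    mean = fmean(rewards)
    std = pstdev(rewards)                  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, K = 6 rollouts, made-up trajectory rewards.
group = [1.52, 0.40, -0.96, 0.90, 0.10, 1.10]
advantages = group_relative_advantages(group)
for reward, advantage in zip(group, advantages):
    print(f"reward {reward:+.2f} -> advantage {advantage:+.2f}")
```

If all six rollouts for a prompt earn the same reward, every advantage is zero and that prompt contributes no policy-gradient signal for the step.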

## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "anonymousOwl/HydroAgent"
# Load the tokenizer (with its chat template) and the BF16 weights.
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
# device_map="auto" places the weights on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="bfloat16", device_map="auto")
```

The model emits Hermes-style tool calls, e.g.:

```
<tool_call>
{"name": "set_parameters", "arguments": {"wm": 1.0, "b": 1.0, "im": 0.5, ...}}
</tool_call>
```

Render the prompt with `tokenizer.apply_chat_template(..., tools=HYDRO_TOOLS)`,
extract the `<tool_call>` blocks from the generated text, and dispatch each
call to your EF5 sandbox. See
[`modal_app/eval.py`](https://github.com/chrimerss/HydroLLM/blob/main/modal_app/eval.py)
for a reference SGLang loop with retry-on-parse-failure logic.
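
For orientation, extracting and dispatching tool calls might look like the sketch below. It is not the logic in `modal_app/eval.py`: the regular expression, the function names, and the handler registry are all hypothetical, and malformed blocks are simply skipped so that a caller can decide whether to retry.

```python
import json
import re
from typing import Any, Callable, Dict, List

# Matches the JSON payload between Hermes-style tool-call tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Return well-formed tool calls found in generated text, in order."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # malformed JSON: skip here, let the caller retry
        if (isinstance(call, dict)
                and isinstance(call.get("name"), str)
                and isinstance(call.get("arguments"), dict)):
            calls.append(call)
    return calls


def dispatch(call: Dict[str, Any], handlers: Dict[str, Callable[..., Any]]) -> Any:
    """Route one parsed call to its handler (e.g. a wrapper around EF5)."""
    handler = handlers.get(call["name"])
    if handler is None:
        return {"error": f"unknown tool: {call['name']}"}
    return handler(**call["arguments"])


generated = (
    "Lower the impervious fraction first.\n"
    "<tool_call>\n"
    '{"name": "set_parameters", "arguments": {"wm": 1.0, "b": 1.0, "im": 0.5}}\n'
    "</tool_call>\n"
    "<tool_call>\n{not valid json}\n</tool_call>"
)

# Stand-in handler; a real one would write the multipliers into the EF5 control file.
handlers = {"set_parameters": lambda **params: {"status": "ok", "n_params": len(params)}}

calls = parse_tool_calls(generated)
results = [dispatch(call, handlers) for call in calls]
print(results)
```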

For full reproduction (image, EF5 binary, multi-turn rollout, reward
computation), use the
[HydroLLM repository](https://github.com/chrimerss/HydroLLM).

## Limitations

- Trained on **10 small/medium CONUS basins** (≤ 2401 km²) over short flood
  windows. Generalization to large basins (> 3000 km²), arid catchments, or
  out-of-CONUS regions is unverified.
- Calibrates **CREST parameter multipliers only**; it does not modify routing,
  initial conditions, or sub-basin structure.
- The agent depends on a working EF5 toolchain; the weights alone do not
  perform calibration without the simulation environment in the loop.
- This is a research checkpoint, not a production tool. NSE on held-out
  gages varies substantially with basin and event.

## License

MIT, matching the upstream [HydroLLM repository](https://github.com/chrimerss/HydroLLM).
Note that the base [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
is distributed under the Apache 2.0 license.

## Citation

```bibtex
@software{hydrollm2026,
  title  = {HydroLLM: Reinforcement Learning Fine-Tuning of LLMs with Hydrologic Simulation Feedback},
  year   = {2026},
  url    = {https://github.com/chrimerss/HydroLLM}
}
```

## Acknowledgement

Compute for this research was sponsored by [Modal](https://modal.com).