CLOUDSENSE / OPENENV BENCHMARK
ENVIRONMENT ONLINE · V1.0.0
FINOPS · REINFORCEMENT LEARNING · Q1 2025
Teach agents to spend wisely.
CloudSense is an OpenEnv-compatible RL benchmark that simulates real AWS accounts
with authentic pricing, utilization, and dependency graphs. Agents must identify
waste, optimize spend, and reason about blast radius (the cascading
infrastructure impact of every action) without breaking production.
Issue No. 001
Region US-EAST-1
Pricing ON-DEMAND
Model QWEN 2.5 72B
Tasks
03
Easy / Medium / Hard
Resources
61 total
6 · 15 · 40 per task
Monthly spend
$18.3k
Summed across accounts
Blast levels
05
None → Critical
Three escalating scenarios
EASY · STARTUP
Startup Cleanup
A 6-resource dev/staging account. Obvious waste, no production, no dependencies. Tests basic cost-optimization fundamentals.
Steps 10
Spend $627
Baseline 0.94
MEDIUM · MID-SIZE
Mid-Size Audit
15 resources mixing prod and non-prod. Must distinguish seasonal spikes, failover replicas, and expiring reservations from genuine waste.
Steps 20
Spend $3.5k
Baseline 0.78
HARD · ENTERPRISE
Enterprise FinOps
40 interdependent resources. Cross-region replication, oversized Elasticsearch, NAT Gateway traps. Requires blast-radius reasoning.
Steps 45
Spend $14.2k
Baseline 0.76
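The three scenario cards can be cross-checked against the headline totals with a minimal Python sketch. The `Scenario` class and its field names are hypothetical and not part of CloudSense; reading "Steps" as a per-episode step budget and "Baseline" as a baseline score is an assumption; spend values are the rounded figures shown on the cards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical container for the figures printed on each scenario card."""
    name: str
    difficulty: str
    resources: int
    max_steps: int            # assumed meaning of "Steps"
    monthly_spend_usd: float  # rounded, as shown on the card
    baseline_score: float     # assumed meaning of "Baseline"


SCENARIOS = [
    Scenario("Startup Cleanup", "easy", 6, 10, 627.0, 0.94),
    Scenario("Mid-Size Audit", "medium", 15, 20, 3_500.0, 0.78),
    Scenario("Enterprise FinOps", "hard", 40, 45, 14_200.0, 0.76),
]

# Totals that the overview tiles report as "61 total" and "$18.3k".
total_resources = sum(s.resources for s in SCENARIOS)
total_spend_usd = sum(s.monthly_spend_usd for s in SCENARIOS)

if __name__ == "__main__":
    print(f"resources: {total_resources}")
    print(f"monthly spend: ${total_spend_usd / 1000:.1f}k")
```

Because the Medium and Hard spends are rounded to the nearest $100, the summed figure is approximate.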
Action Space / 09 verbs
01
rightsize_resource
shrink to cheaper instance type
02
terminate_resource
remove unused infrastructure
03
add_lifecycle_policy
S3 tiering · ~70% savings
04
enable_autoscaling
dynamic capacity · ~20% savings
05
purchase_reservation
steady workloads · ~30% savings
06
change_storage_class
Glacier / IA tiers
07
schedule_uptime
business-hours only
08
request_more_info
defer, gather context
09
skip_resource
safe for critical prod
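The nine verbs above map naturally onto an enumeration. The sketch below is illustrative only: the `Action` enum, the `NOMINAL_SAVINGS` table, and `estimated_monthly_savings` are hypothetical names, and the savings rates are just the approximate figures quoted in the list, not the environment's actual reward logic.

```python
from enum import Enum
from typing import Optional


class Action(str, Enum):
    """The nine action verbs listed above (enum itself is hypothetical)."""
    RIGHTSIZE_RESOURCE = "rightsize_resource"
    TERMINATE_RESOURCE = "terminate_resource"
    ADD_LIFECYCLE_POLICY = "add_lifecycle_policy"
    ENABLE_AUTOSCALING = "enable_autoscaling"
    PURCHASE_RESERVATION = "purchase_reservation"
    CHANGE_STORAGE_CLASS = "change_storage_class"
    SCHEDULE_UPTIME = "schedule_uptime"
    REQUEST_MORE_INFO = "request_more_info"
    SKIP_RESOURCE = "skip_resource"


# Approximate savings rates quoted in the list; other verbs have no quoted rate.
NOMINAL_SAVINGS = {
    Action.ADD_LIFECYCLE_POLICY: 0.70,
    Action.ENABLE_AUTOSCALING: 0.20,
    Action.PURCHASE_RESERVATION: 0.30,
}


def estimated_monthly_savings(action: Action, monthly_cost_usd: float) -> Optional[float]:
    """Return a rough savings estimate, or None when no rate is quoted."""
    rate = NOMINAL_SAVINGS.get(action)
    if rate is None:
        return None
    return monthly_cost_usd * rate


if __name__ == "__main__":
    for action in Action:
        print(action.value, estimated_monthly_savings(action, 100.0))
```

Returning `None` for verbs without a quoted rate avoids implying a savings figure the page does not state.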
HTTP Endpoints / OpenEnv
GET
/tasks
list available scenarios
POST
/step
execute action · JSON body
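A client for these two endpoints might be sketched as follows, using only the Python standard library. The base URL and the JSON field names (`action`, `resource_id`) are assumptions for illustration; the page only states that `/step` takes a JSON body, so check the actual schema before relying on this.

```python
import json
from urllib.request import Request

# Assumption: the environment is served locally; the real host is not given.
BASE_URL = "http://localhost:8000"


def build_tasks_request(base_url: str = BASE_URL) -> Request:
    """Build a GET /tasks request that lists the available scenarios."""
    return Request(f"{base_url}/tasks", method="GET")


def build_step_request(action: str, resource_id: str,
                       base_url: str = BASE_URL) -> Request:
    """Build a POST /step request carrying one action as a JSON body."""
    # Hypothetical payload shape; field names are not confirmed by the source.
    payload = {"action": action, "resource_id": resource_id}
    return Request(
        f"{base_url}/step",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_step_request("skip_resource", "res-001")
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

With a server running, either request object can be passed to `urllib.request.urlopen` to send it.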
Run the benchmark yourself.