Kshitij Thakkar

kshitijthakkar

AI & ML interests

Building the evaluation and observability layer for AI. Creator of TraceVerse, which turns real-world LLM interactions into datasets, benchmarks, and cost-efficient model insights.

Recent Activity

New activity in rl-llm-wiki/knowledge-base about 20 hours ago

source: arxiv:2203.13151 - Multi-armed bandits (GP-TS) for online TLM pre-training hyperparameter optimization

#356 opened about 22 hours ago by

New activity in rl-llm-wiki/knowledge-base about 21 hours ago

source: arxiv:2607.02490 - VRRL (GRPO + novel multi-turn credit-assignment for VLM visually-grounded self-reflection)

#362 opened about 21 hours ago by

source: arxiv:2607.01612 - C3RL (PPO reward-shaping to fix RLVR's "calibrated but wrong" overconfidence failure mode)

#361 opened about 21 hours ago by

source: arxiv:2607.01715 - Distributionally Robust Listwise Preference Optimization (DPO: pairwise BT -> listwise PL + label-noise robustness)

#360 opened about 21 hours ago by

source: arxiv:2607.01763 - Denser != Better (formal theorem: why GRPO forgets less than dense self-distillation in continual post-training)

#359 opened about 21 hours ago by

source: arxiv:2607.02390 - DecompRL (critic-free RLVR for hierarchical/modular code generation, formal variance-reduced estimator)

#358 opened about 21 hours ago by

source: arxiv:2607.02073 - MAVEN (GRPO + per-action Shapley-style evidence rewards for long-context reasoning)

#357 opened about 21 hours ago by

New activity in rl-llm-wiki/knowledge-base 1 day ago

source: arxiv:2601.22208 - Stalled, Biased, and Confused (taxonomy of LLM reasoning failures in cloud RCA)

#354 opened 1 day ago by

source: arxiv:2501.06706 - AIOpsLab (holistic framework for evaluating AI agents on autonomous-cloud incident lifecycle)

#351 opened 1 day ago by

source: arxiv:2502.05352 - ITBench (evaluating AI agents on real-world IT automation: SRE/CISO/FinOps)

#350 opened 1 day ago by

source: arxiv:2602.18583 - Luna-2 (single-token SLM evaluation via per-metric LoRA)

#349 opened 1 day ago by

source: arxiv:2406.00975 - Luna (lightweight RAG hallucination evaluator model)

#348 opened 1 day ago by

source: arxiv:2603.03378 - AOI (GRPO-trained multi-agent SRE diagnosis, failure trajectories as training signal)

#353 opened 1 day ago by

source: arxiv:2504.18776 - ThinkFL (RL fine-tuning for microservice failure localization)

#346 opened 1 day ago by

source: arxiv:2504.13958 - ToolRL (reward design for tool-integrated reasoning via GRPO)

#344 opened 1 day ago by

source: arxiv:2011.02511 - Offline RL from Human Feedback in Real-World Seq2Seq Tasks

#343 opened 1 day ago by

source: arxiv:2005.07064 - Multi-agent Communication meets Natural Language (language drift taxonomy)

#342 opened 1 day ago by

source: arxiv:2408.13518 - SePO (selective preference optimization via token-level reward estimation)

#341 opened 1 day ago by

source: arxiv:2406.18629 - Step-DPO (step-wise preference optimization for long-chain reasoning)

#340 opened 1 day ago by

topic: algorithms/distributional-alignment-and-divergence-choice

#339 opened 1 day ago by