Hugging Face
🏗️
Building on HF
49.1
TFLOPS
Kshitij Thakkar
PRO
kshitijthakkar
45
18
80
31 followers · 118 following
Mandark-droid
kshitij-thakkar-2061b924
AI & ML interests
Building the evaluation and observability layer for AI. Creator of TraceVerse, which turns real-world LLM interactions into datasets, benchmarks, and cost-efficient model insights.
kshitijthakkar's activity
New activity in
rl-llm-wiki/knowledge-base
about 20 hours ago
source: arxiv:2203.13151 - Multi-armed bandits (GP-TS) for online TLM pre-training hyperparameter optimization
2
#356 opened about 22 hours ago by
kshitijthakkar
New activity in
rl-llm-wiki/knowledge-base
about 21 hours ago
source: arxiv:2607.02490 - VRRL (GRPO + novel multi-turn credit-assignment for VLM visually-grounded self-reflection)
2
#362 opened about 21 hours ago by
kshitijthakkar
source: arxiv:2607.01612 - C3RL (PPO reward-shaping to fix RLVR's "calibrated but wrong" overconfidence failure mode)
2
#361 opened about 21 hours ago by
kshitijthakkar
source: arxiv:2607.01715 - Distributionally Robust Listwise Preference Optimization (DPO: pairwise BT -> listwise PL + label-noise robustness)
2
#360 opened about 21 hours ago by
kshitijthakkar
source: arxiv:2607.01763 - Denser != Better (formal theorem: why GRPO forgets less than dense self-distillation in continual post-training)
3
#359 opened about 21 hours ago by
kshitijthakkar
source: arxiv:2607.02390 - DecompRL (critic-free RLVR for hierarchical/modular code generation, formal variance-reduced estimator)
2
#358 opened about 21 hours ago by
kshitijthakkar
source: arxiv:2607.02073 - MAVEN (GRPO + per-action Shapley-style evidence rewards for long-context reasoning)
2
#357 opened about 21 hours ago by
kshitijthakkar
New activity in
rl-llm-wiki/knowledge-base
1 day ago
source: arxiv:2601.22208 - Stalled, Biased, and Confused (taxonomy of LLM reasoning failures in cloud RCA)
3
#354 opened 1 day ago by
kshitijthakkar
source: arxiv:2501.06706 - AIOpsLab (holistic framework for evaluating AI agents on autonomous-cloud incident lifecycle)
3
#351 opened 1 day ago by
kshitijthakkar
source: arxiv:2502.05352 - ITBench (evaluating AI agents on real-world IT automation: SRE/CISO/FinOps)
3
#350 opened 1 day ago by
kshitijthakkar
source: arxiv:2602.18583 - Luna-2 (single-token SLM evaluation via per-metric LoRA)
4
#349 opened 1 day ago by
kshitijthakkar
source: arxiv:2406.00975 - Luna (lightweight RAG hallucination evaluator model)
3
#348 opened 1 day ago by
kshitijthakkar
source: arxiv:2603.03378 - AOI (GRPO-trained multi-agent SRE diagnosis, failure trajectories as training signal)
2
#353 opened 1 day ago by
kshitijthakkar
source: arxiv:2504.18776 - ThinkFL (RL fine-tuning for microservice failure localization)
2
#346 opened 1 day ago by
kshitijthakkar
source: arxiv:2504.13958 - ToolRL (reward design for tool-integrated reasoning via GRPO)
2
#344 opened 1 day ago by
kshitijthakkar
source: arxiv:2011.02511 - Offline RL from Human Feedback in Real-World Seq2Seq Tasks
2
#343 opened 1 day ago by
kshitijthakkar
source: arxiv:2005.07064 - Multi-agent Communication meets Natural Language (language drift taxonomy)
2
#342 opened 1 day ago by
kshitijthakkar
source: arxiv:2408.13518 - SePO (selective preference optimization via token-level reward estimation)
3
#341 opened 1 day ago by
kshitijthakkar
source: arxiv:2406.18629 - Step-DPO (step-wise preference optimization for long-chain reasoning)
2
#340 opened 1 day ago by
kshitijthakkar
topic: algorithms/distributional-alignment-and-divergence-choice
2
#339 opened 1 day ago by
kshitijthakkar