1️⃣ Build a solid RL env with Verifiers (Prime Intellect)
2️⃣ Generate synthetic data: <200 games sampled from GPT-5-mini playing in the env
3️⃣ SFT warm-up to teach format
4️⃣ Group-based RL (CISPO) against opponents making 20-70% random moves
5️⃣ RL again with stronger opponents (0-25% random moves) + 1.25 temperature to push exploration and shake off suboptimal strategies
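For intuition, here is a minimal sketch (my own illustration, not the actual training code) of the tunable opponent behind steps 4️⃣ and 5️⃣: with probability `epsilon` it plays a random legal move, otherwise it defers to a strong move function. `best_move_fn` is an assumption here, e.g. a minimax tic-tac-toe solver.

```python
import random

class NoisyOpponent:
    """Plays a strong move most of the time, but a uniformly random
    legal move with probability `epsilon` (the tunable skill knob)."""

    def __init__(self, epsilon: float, best_move_fn):
        self.epsilon = epsilon            # 0.2-0.7 in stage 4, 0.0-0.25 in stage 5
        self.best_move_fn = best_move_fn  # assumed: e.g. a minimax solver

    def move(self, board, legal_moves):
        if random.random() < self.epsilon:
            return random.choice(legal_moves)  # blunder on purpose
        return self.best_move_fn(board)        # otherwise play well

# Hypothetical usage; lambda is a dummy stand-in for a real solver
stage4_opponent = NoisyOpponent(epsilon=random.uniform(0.2, 0.7),
                                best_move_fn=lambda board: 4)
```

Annealing `epsilon` down between the two RL stages is what makes the curriculum progressively harder.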
I am thrilled to announce the launch of version 2 of the Open Japanese LLM Leaderboard. This initiative is driven by the "Fine-tuning and Evaluation" team, led by Professor Miyao at the University of Tokyo, under the Research and Development Center for Large Language Models (LLMC) at Japan's National Institute of Informatics (NII).
Strategic and technical upgrades:
- Our new backend features eight A100 GPUs, enabling the evaluation of open-source models of more than 100B parameters.
- Submissions now require a Hugging Face Hub login to ensure accountability.
- We have added metrics for evaluation time and CO₂ emissions (thx to Code Carbon 🌱), alongside reasoning capabilities.
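For a sense of how the time and emissions metrics can be captured, here is a minimal sketch wrapping Code Carbon around an evaluation run. This is not the leaderboard's actual harness; `run_evaluation` is a hypothetical placeholder.

```python
import time
from codecarbon import EmissionsTracker

def run_evaluation():
    """Placeholder for the real benchmark harness (hypothetical)."""
    sum(i * i for i in range(10_000_000))  # stand-in workload

tracker = EmissionsTracker(project_name="llm-eval")  # writes emissions.csv by default
tracker.start()
start = time.perf_counter()

run_evaluation()

eval_seconds = time.perf_counter() - start
emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent for the run
print(f"eval time: {eval_seconds:.0f}s | CO2eq: {emissions_kg:.4f} kg")
```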
Datasets and evaluation standards:
- New datasets cover reasoning, mathematics, exams, and instruction following.
- Math evaluations now span from grade-school levels to expert-tier challenges (GSM8K, PolyMath, AIME).
- While integrating English-heavy and multilingual benchmarks (including Humanity's Last Exam, GPQA, and BBH in both English and Japanese), we continue to prioritize unique Japanese cultural datasets.
Local Gemma 4 agent 🕵️🗺️: drop in a mysterious map, get the location, live weather, and top spots to visit
I've been exploring what google/gemma-4-E4B-it can do in a local agentic setup and put together a notebook with Gemma + Haystack AI Framework covering 4 demos.
Another interesting one is the GitHub Agent.
I initially tried to load all tools from the GitHub MCP server, quickly filling the context available on Colab -> unusable, forgetful agent ❌
Then I used the Searchable Toolset 🧰. It dynamically discovers the right tools from the GitHub MCP server on the fly, loading only what it actually needs for the task at hand, keeping context lean.
Now it actually works.
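The underlying pattern is simple: embed the tool descriptions once, then retrieve only the few tools relevant to the current request. A generic sketch of that idea (not Haystack's actual Searchable Toolset API; the tool names and descriptions below are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Descriptions as an MCP server might expose them (illustrative subset)
tools = {
    "create_issue": "Open a new issue in a GitHub repository",
    "list_pull_requests": "List pull requests for a repository",
    "get_file_contents": "Read the contents of a file in a repository",
}

names = list(tools)
corpus = embedder.encode(list(tools.values()), convert_to_tensor=True)

def select_tools(task: str, k: int = 2) -> list[str]:
    """Return only the k tools most relevant to the task."""
    query = embedder.encode(task, convert_to_tensor=True)
    hits = util.semantic_search(query, corpus, top_k=k)[0]
    return [names[hit["corpus_id"]] for hit in hits]

print(select_tools("file a bug report about a crash"))  # -> ['create_issue', ...]
```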
The notebook also contains:
- Multimodal weather agent: the mystery map demo above
- Visual Question Answering from a paper
- RAG on Rock music
Our lab recently released a paper where we introduce ShadowPEFT, a new Parameter-Efficient Fine-Tuning (PEFT) paradigm tailored for edge computing scenarios.
Traditional approaches such as LoRA and its variants inject trainable parameters directly into the Transformer weights, requiring tight coupling with the backbone.
ShadowPEFT instead enhances the frozen large base model by adding a lightweight, centralized, pretrainable, and detachable Shadow network. This shadow network operates in parallel with the base model, delivering learned corrections to each decoder layer. Because the shadow module is architecturally decoupled from the backbone, it can be independently trained, stored, and deployed, benefiting edge computing scenarios and edge-cloud collaborative computing.
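As a rough mental model (my own sketch under assumptions, not the paper's implementation), the shadow network can be pictured as a stack of bottleneck blocks, one per decoder layer, whose outputs are added to the frozen backbone's hidden states:

```python
import torch
import torch.nn as nn

class ShadowBlock(nn.Module):
    """One lightweight correction block (bottleneck size is a guess)."""
    def __init__(self, hidden: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.up(torch.relu(self.down(h)))

class ShadowNetwork(nn.Module):
    """Centralized, detachable module running alongside a frozen backbone:
    it receives each decoder layer's hidden states and returns a correction."""
    def __init__(self, n_layers: int, hidden: int):
        super().__init__()
        self.blocks = nn.ModuleList(ShadowBlock(hidden) for _ in range(n_layers))

    def correct(self, layer_idx: int, h: torch.Tensor) -> torch.Tensor:
        return h + self.blocks[layer_idx](h)  # deliver learned correction

shadow = ShadowNetwork(n_layers=24, hidden=2048)
h = torch.randn(1, 16, 2048)  # hidden states from one frozen decoder layer
h = shadow.correct(0, h)      # corrected states fed onward
```

Because only `ShadowNetwork` holds trainable parameters, it can be shipped and swapped independently of the backbone, which is the property that matters on the edge.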
It all starts with Reinforcement Learning with Verifiable Rewards:
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training
In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s)
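The verifier itself can be as small as a string check. A minimal sketch (the `<answer>` tag format is an assumption, just one common formatting choice):

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward from an exact-match check on the final answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # unparseable output earns nothing (or a format penalty)
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(verifiable_reward("reasoning... <answer>42</answer>", "42"))  # 1.0
```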
Consider a more complex tic-tac-toe env ❌⭕ It adds (sketched below):
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions
(envs can also include tools)
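Here is what such an env can look like, stripped to the bone. An illustration only: a real Verifiers env has a different interface, and a skilled opponent would mix solver moves with random ones according to the skill knob.

```python
import random

WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

class TicTacToeEnv:
    def __init__(self, opponent_random_rate: float = 0.5):
        self.p_random = opponent_random_rate  # tunable opponent skill

    def reset(self) -> str:
        self.board = [" "] * 9
        return self._render()

    def step(self, move: int):
        """Model plays X; opponent answers with O. Returns (obs, reward, done)."""
        if self.board[move] != " ":
            return self._render(), -1.0, True   # illegal move forfeits
        self.board[move] = "X"
        if self._wins("X"):
            return self._render(), 1.0, True
        empty = [i for i, c in enumerate(self.board) if c == " "]
        if not empty:
            return self._render(), 0.0, True    # draw
        # Fully random opponent here; a real env would play a solver
        # move with probability 1 - self.p_random.
        self.board[random.choice(empty)] = "O"
        if self._wins("O"):
            return self._render(), -1.0, True
        if " " not in self.board:
            return self._render(), 0.0, True    # draw after opponent's move
        return self._render(), 0.0, False

    def _wins(self, p: str) -> bool:
        return any(all(self.board[i] == p for i in line) for line in WIN_LINES)

    def _render(self) -> str:
        b = self.board
        return "\n".join("|".join(b[r:r+3]) for r in (0, 3, 6))

env = TicTacToeEnv(opponent_random_rate=0.5)
obs = env.reset()
obs, reward, done = env.step(4)  # model takes the center
```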
---
What happens during training?
We use Group Relative Policy Optimization (GRPO) with a tic-tac-toe env
No critic model needed: the group is the baseline. Simpler than PPO.
1️⃣ Rollout generation: from the same board, model plays N games via sampling
2️⃣ Each game scored with deterministic rewards (win, format, ...)
3️⃣ Mean score computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ Model updated to favor trajectories above baseline
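Steps 3️⃣-4️⃣ in code (a sketch; GRPO implementations also commonly divide by the group's standard deviation, shown here as an option):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, normalize_std: bool = True):
    """Advantage of each rollout = its reward minus the group mean,
    optionally scaled by the group's std (a common GRPO choice)."""
    adv = rewards - rewards.mean()
    if normalize_std:
        adv = adv / (rewards.std() + 1e-6)
    return adv

# 8 rollouts from the same board: +1 win, 0 draw, -1 loss
rewards = torch.tensor([1.0, -1.0, 0.0, 1.0, 1.0, -1.0, 0.0, 0.0])
print(group_relative_advantages(rewards))  # positive -> reinforced
```

Rollouts above the group baseline get positive advantage and their token log-probs are pushed up; below-baseline rollouts are pushed down.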