1️⃣ Build a solid RL env with Verifiers (Prime Intellect)
2️⃣ Generate synthetic data: <200 games sampled from GPT-5-mini playing in the env
3️⃣ SFT warm-up to teach format
4️⃣ Group-based RL (CISPO) against opponents making 20-70% random moves
5️⃣ RL again with stronger opponents (0-25% random moves) + 1.25 temperature to push exploration and shake off suboptimal strategies
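For intuition, here is a minimal sketch (my own illustration, not the actual training code) of the tunable opponent behind steps 4️⃣ and 5️⃣: with probability `epsilon` it plays a random legal move, otherwise it defers to a strong move function. `best_move_fn` is an assumption here, e.g. a minimax tic-tac-toe solver.

```python
import random

class NoisyOpponent:
    """Plays a strong move most of the time, but a uniformly random
    legal move with probability `epsilon` (the tunable skill knob)."""

    def __init__(self, epsilon: float, best_move_fn):
        self.epsilon = epsilon            # 0.2-0.7 in stage 4, 0.0-0.25 in stage 5
        self.best_move_fn = best_move_fn  # assumed: e.g. a minimax solver

    def move(self, board, legal_moves):
        if random.random() < self.epsilon:
            return random.choice(legal_moves)  # blunder on purpose
        return self.best_move_fn(board)        # otherwise play well

# Hypothetical usage; lambda is a dummy stand-in for a real solver
stage4_opponent = NoisyOpponent(epsilon=random.uniform(0.2, 0.7),
                                best_move_fn=lambda board: 4)
```

Annealing `epsilon` down between the two RL stages is what makes the curriculum progressively harder.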
I am thrilled to announce the launch of version 2 of the Open Japanese LLM Leaderboard. This initiative is driven by the "Fine-tuning and Evaluation" team, led by Professor Miyao at the University of Tokyo, under the Research and Development Center for Large Language Models (LLMC) at Japan's National Institute of Informatics (NII).
Strategic and technical upgrades:
- Our new backend features eight A100 GPUs, enabling the evaluation of open-source models of more than 100B parameters.
- Submissions now require a Hugging Face Hub login to ensure accountability.
- We have added metrics for evaluation time and CO₂ emissions (thx to Code Carbon 🌱), alongside reasoning capabilities.
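For a sense of how the time and emissions metrics can be captured, here is a minimal sketch wrapping Code Carbon around an evaluation run. This is not the leaderboard's actual harness; `run_evaluation` is a hypothetical placeholder.

```python
import time
from codecarbon import EmissionsTracker

def run_evaluation():
    """Placeholder for the real benchmark harness (hypothetical)."""
    sum(i * i for i in range(10_000_000))  # stand-in workload

tracker = EmissionsTracker(project_name="llm-eval")  # writes emissions.csv by default
tracker.start()
start = time.perf_counter()

run_evaluation()

eval_seconds = time.perf_counter() - start
emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent for the run
print(f"eval time: {eval_seconds:.0f}s | CO2eq: {emissions_kg:.4f} kg")
```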
Datasets and evaluation standards:
- New datasets cover reasoning, mathematics, exams, and instruction following.
- Math evaluations now span from grade-school levels to expert-tier challenges (GSM8K, PolyMath, AIME).
- While integrating English-heavy and multilingual benchmarks (including Humanity's Last Exam, GPQA, and BBH in both English and Japanese), we continue to prioritize unique Japanese cultural datasets.
Local Gemma 4 agent 🕵️🗺️: drop in a mysterious map, get the location, live weather, and top spots to visit
I've been exploring what google/gemma-4-E4B-it can do in a local agentic setup and put together a notebook with Gemma + Haystack AI Framework covering 4 demos.
Another interesting one is the GitHub Agent.
I initially tried to load all tools from the GitHub MCP server, quickly filling the context available on Colab -> unusable, forgetful agent ❌
Then I used the Searchable Toolset 🧰. It dynamically discovers the right tools from the GitHub MCP server on the fly, loading only what it actually needs for the task at hand, keeping context lean.
Now it actually works.
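The underlying pattern is simple: embed the tool descriptions once, then retrieve only the few tools relevant to the current request. A generic sketch of that idea (not Haystack's actual Searchable Toolset API; the tool names and descriptions below are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Descriptions as an MCP server might expose them (illustrative subset)
tools = {
    "create_issue": "Open a new issue in a GitHub repository",
    "list_pull_requests": "List pull requests for a repository",
    "get_file_contents": "Read the contents of a file in a repository",
}

names = list(tools)
corpus = embedder.encode(list(tools.values()), convert_to_tensor=True)

def select_tools(task: str, k: int = 2) -> list[str]:
    """Return only the k tools most relevant to the task."""
    query = embedder.encode(task, convert_to_tensor=True)
    hits = util.semantic_search(query, corpus, top_k=k)[0]
    return [names[hit["corpus_id"]] for hit in hits]

print(select_tools("file a bug report about a crash"))  # -> ['create_issue', ...]
```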
The notebook also contains:
- Multimodal weather agent: the mystery map demo above
- Visual Question Answering from a paper
- RAG on Rock music
Our lab recently released a paper where we introduce ShadowPEFT, a new Parameter-Efficient Fine-Tuning (PEFT) paradigm tailored for edge computing scenarios.
Traditional approaches such as LoRA and its variants inject trainable parameters directly into the Transformer weights, requiring tight coupling with the backbone.
ShadowPEFT instead enhances the frozen large base model by adding a lightweight, centralized, pretrainable, and detachable Shadow network. This shadow network operates in parallel with the base model, delivering learned corrections to each decoder layer. Because the shadow module is architecturally decoupled from the backbone, it can be independently trained, stored, and deployed, benefiting edge computing scenarios and edge-cloud collaborative computing.
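As a rough mental model (my own sketch under assumptions, not the paper's implementation), the shadow network can be pictured as a stack of bottleneck blocks, one per decoder layer, whose outputs are added to the frozen backbone's hidden states:

```python
import torch
import torch.nn as nn

class ShadowBlock(nn.Module):
    """One lightweight correction block (bottleneck size is a guess)."""
    def __init__(self, hidden: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.up(torch.relu(self.down(h)))

class ShadowNetwork(nn.Module):
    """Centralized, detachable module running alongside a frozen backbone:
    it receives each decoder layer's hidden states and returns a correction."""
    def __init__(self, n_layers: int, hidden: int):
        super().__init__()
        self.blocks = nn.ModuleList(ShadowBlock(hidden) for _ in range(n_layers))

    def correct(self, layer_idx: int, h: torch.Tensor) -> torch.Tensor:
        return h + self.blocks[layer_idx](h)  # deliver learned correction

shadow = ShadowNetwork(n_layers=24, hidden=2048)
h = torch.randn(1, 16, 2048)  # hidden states from one frozen decoder layer
h = shadow.correct(0, h)      # corrected states fed onward
```

Because only `ShadowNetwork` holds trainable parameters, it can be shipped and swapped independently of the backbone, which is the property that matters on the edge.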
It all starts with Reinforcement Learning with Verifiable Rewards:
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training
In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s)
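The verifier itself can be as small as a string check. A minimal sketch (the `<answer>` tag format is an assumption, just one common formatting choice):

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward from an exact-match check on the final answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # unparseable output earns nothing (or a format penalty)
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(verifiable_reward("reasoning... <answer>42</answer>", "42"))  # 1.0
```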
Consider a more complex tic-tac-toe env ❌⭕ It adds (sketched below):
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions
(envs can also include tools)
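Here is what such an env can look like, stripped to the bone. An illustration only: a real Verifiers env has a different interface, and a skilled opponent would mix solver moves with random ones according to the skill knob.

```python
import random

WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

class TicTacToeEnv:
    def __init__(self, opponent_random_rate: float = 0.5):
        self.p_random = opponent_random_rate  # tunable opponent skill

    def reset(self) -> str:
        self.board = [" "] * 9
        return self._render()

    def step(self, move: int):
        """Model plays X; opponent answers with O. Returns (obs, reward, done)."""
        if self.board[move] != " ":
            return self._render(), -1.0, True   # illegal move forfeits
        self.board[move] = "X"
        if self._wins("X"):
            return self._render(), 1.0, True
        empty = [i for i, c in enumerate(self.board) if c == " "]
        if not empty:
            return self._render(), 0.0, True    # draw
        # Fully random opponent here; a real env would play a solver
        # move with probability 1 - self.p_random.
        self.board[random.choice(empty)] = "O"
        if self._wins("O"):
            return self._render(), -1.0, True
        if " " not in self.board:
            return self._render(), 0.0, True    # draw after opponent's move
        return self._render(), 0.0, False

    def _wins(self, p: str) -> bool:
        return any(all(self.board[i] == p for i in line) for line in WIN_LINES)

    def _render(self) -> str:
        b = self.board
        return "\n".join("|".join(b[r:r+3]) for r in (0, 3, 6))

env = TicTacToeEnv(opponent_random_rate=0.5)
obs = env.reset()
obs, reward, done = env.step(4)  # model takes the center
```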
---
What happens during training?
We use Group Relative Policy Optimization (GRPO) with a tic-tac-toe env
No critic model needed: the group is the baseline. Simpler than PPO.
1️⃣ Rollout generation: from the same board, model plays N games via sampling
2️⃣ Each game scored with deterministic rewards (win, format, ...)
3️⃣ Mean score computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ Model updated to favor trajectories above baseline
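Steps 3️⃣-4️⃣ in code (a sketch; GRPO implementations also commonly divide by the group's standard deviation, shown here as an option):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, normalize_std: bool = True):
    """Advantage of each rollout = its reward minus the group mean,
    optionally scaled by the group's std (a common GRPO choice)."""
    adv = rewards - rewards.mean()
    if normalize_std:
        adv = adv / (rewards.std() + 1e-6)
    return adv

# 8 rollouts from the same board: +1 win, 0 draw, -1 loss
rewards = torch.tensor([1.0, -1.0, 0.0, 1.0, 1.0, -1.0, 0.0, 0.0])
print(group_relative_advantages(rewards))  # positive -> reinforced
```

Rollouts above the group baseline get positive advantage and their token log-probs are pushed up; below-baseline rollouts are pushed down.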