Rayugacodes/kernelx-strategist

Rayugacodes committed on Apr 26

Commit 7574947 (verified) · 1 Parent(s): 0b6fd4f

Update README with latest

README.md +148 -137

@@ -1,199 +1,210 @@
----
-library_name: transformers
-tags: []
----
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

+# KernelX
+**An OpenEnv-compliant world-modeling environment for Linux kernel scheduling.**
+KernelX teaches a 360-million-parameter language model to make Linux scheduling decisions in real time. An eBPF sentinel extracts a 24-dimensional state vector at every context switch, a learned World Model predicts the consequences of each action, and a GRPO-trained Strategist outputs scheduling nudges in 44 milliseconds on a laptop CPU.
+Built for the Meta PyTorch OpenEnv Hackathon 2026 — Theme 3.1, World Modeling.
+## Try it now
+| | |
+|---|---|
+| **Live environment** | [huggingface.co/spaces/Rayugacodes/KernelX](https://huggingface.co/spaces/Rayugacodes/KernelX) |
+| **Training notebook (free T4)** | [KernelX_Training.ipynb](https://colab.research.google.com/github/pie-314/KernelX/blob/main/KernelX_Training.ipynb) |
+| **Trained model** | [Rayugacodes/kernelx-strategist](https://huggingface.co/Rayugacodes/kernelx-strategist) |
+| **Training data (534K transitions)** | [Rayugacodes/kernelx-training-data](https://huggingface.co/datasets/Rayugacodes/kernelx-training-data) |
+| **Blog post** | *The Digital Traffic Jam.md* |
+| **Demo video (2 min)** | *[YouTube link]* |
+| **Performance report** | [training/PERFORMANCE.md](training/PERFORMANCE.md) |
+## What this environment is
+KernelX gives an LLM agent a partially observable view of a real Linux kernel and asks it to learn scheduling policy from interaction. The agent observes a 24-dimensional telemetry vector, takes a single scalar action between -1 and +1, and the next state comes from a World Model trained on real kernel transitions.
+It is an OpenEnv environment. The standard `reset()` / `step(action)` / `state` interface works the way you expect. Plug in TRL, Stable Baselines, or any RL loop — the environment doesn't care.
+```python
+from brain.client import KernelXClient
+env = KernelXClient(url="https://your-space.hf.space")
+obs = env.reset()
+obs = env.step(action=0.5)   # nudge a process priority
+score = env.evaluate()        # OpenEnv-compliant grading
+```
+## Why it's interesting to train an LLM on
+Kernel scheduling is a domain where the "right" action is not obvious from the immediate observation, where mistakes cascade through subsequent states, and where the cost function (latency, throughput, fairness) involves real trade-offs. An agent that learns to schedule well must build a causal model of how its priority adjustments propagate through the scheduler's internal state — exactly the kind of world-modeling capability Theme 3.1 targets.
+Compared to most RL environments LLMs get trained on, this one has three properties that we think make it useful:
+The **state space is real**. The 24D observation is what an eBPF program actually extracts at `sched_switch`: priorities, virtual runtime, migration counts, wait time. We collected 534,134 of these from a real Linux machine under mixed workloads. There is no toy MDP underneath.
+The **dynamics are learned**. The World Model is a SmolLM2-360M fine-tune that predicts `S_{t+1}` given `(S_t, a_t)`. The Strategist trains against the World Model, not against a recorded replay. This means the agent's actions actually drive state transitions during training — the standard RL contract.
+The **reward decomposes**. We don't optimize a single number. The reward is the sum of a throughput term, a latency penalty, a stability penalty, and a format reward. Each component is independently inspectable, which makes debugging tractable and makes reward-hacking visible when it happens.
+## Architecture
+```
+Linux kernel (eBPF sentinel)
+   ↓ 24D telemetry vector at every sched_switch
+Rust bridge (lockless ring buffer → /dev/shm + JSONL)
+   ↓ filtered: wait_us > 500 OR 10% random sample
+Python brain (FastAPI + OpenEnv server)
+   ↓ World Model predicts next state given (state, action)
+   ↓ Strategist outputs action ∈ [-1, +1]
+ZMQ → Bridge → eBPF priority_actions map
+   ↓
+Kernel applies the nudge at the next context switch
+```
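The bridge's filter rule in the diagram (`wait_us > 500 OR 10% random sample`) can be sketched in a few lines. The real bridge is written in Rust; this Python version only illustrates the predicate, and the names `keep_event`, `WAIT_THRESHOLD_US`, and `SAMPLE_RATE` are hypothetical.

```python
import random

WAIT_THRESHOLD_US = 500   # from the diagram: always keep events with wait_us > 500
SAMPLE_RATE = 0.10        # from the diagram: keep a 10% random sample of the rest


def keep_event(wait_us: int, rng: random.Random) -> bool:
    """Return True if a telemetry event should be forwarded downstream.

    Slow wakeups are always kept; everything else is kept with
    probability SAMPLE_RATE so that ordinary behaviour is still represented.
    """
    return wait_us > WAIT_THRESHOLD_US or rng.random() < SAMPLE_RATE


# Illustrative wait times in microseconds (made-up values, not real telemetry).
rng = random.Random(0)
events = [120, 480, 501, 9_000, 30]
kept = [w for w in events if keep_event(w, rng)]
```

Keeping every high-latency event while subsampling the rest biases the dataset toward the cases the latency penalty cares about, without dropping the baseline behaviour entirely.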
+Five components, each in its native language:
+- `kernel/` — eBPF C program (`sentinel.bpf.c`) attached to `sched_wakeup` and raw `sched_switch` tracepoints. Extracts the 24D vector, ships it through a `BPF_MAP_TYPE_RINGBUF`. The actuator side reads from a `priority_actions` hash map.
+- `bridge/` — Rust userspace process built on Aya. Reads the ring buffer, mirrors state to shared memory at sub-millisecond latency, persists trajectories to JSONL, listens on ZMQ for actions from the brain. Optionally writes through to RadishDB (the team's WAL-backed key-value store) for durable trajectory storage.
+- `brain/` — Python OpenEnv server. Implements the `Environment` interface. Loads the trained GGUF Strategist, runs inference, talks to the bridge over ZMQ. Includes an `LLMGrader` for OpenEnv-compliant scoring and a `/reload-policy` endpoint for hot-swapping models without downtime.
+- `training/` — Full ML pipeline. Preprocessing (symlog scaling, 10D active-feature extraction), World Model SFT, Strategist warm-start + GRPO, GGUF export, policy iteration, baseline comparison.
+- `ui/` — Ratatui terminal HUD. Reads the same shared memory as the brain, renders live telemetry, AI reasoning, and reward sparklines at 10 Hz.
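The `training/` entry mentions symlog scaling without defining it. A minimal sketch, assuming the common definition `sign(x) · ln(1 + |x|)`; the exact transform used in `training.data.preprocess` may differ, and the function names here are illustrative.

```python
import math


def symlog(x: float) -> float:
    """Symmetric log: compresses large magnitudes while preserving sign."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y: float) -> float:
    """Inverse of symlog, useful for mapping predictions back to raw units."""
    return math.copysign(math.expm1(abs(y)), y)


# A large wait delta (in microseconds) shrinks to a small, model-friendly value.
scaled = symlog(89_000.0)
restored = symexp(scaled)
```

A transform like this keeps microsecond-scale features with heavy tails in a range a small language model can tokenize and predict without the largest values dominating.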
+## The training pipeline
+```bash
+# 1. Preprocess raw kernel transitions
+python -m training.data.preprocess --input data/state_transitions.jsonl
+# 2. Train the World Model (SFT — predicts S_{t+1} | S_t, a_t)
+python -m training.models.train_world_model \
+    --train-data training/data/train.jsonl \
+    --val-data   training/data/val.jsonl
+# 3. Train the Strategist (warm-start SFT + GRPO against the World Model)
+python -m training.models.train_strategist \
+    --train-data training/data/train.jsonl
+# 4. Export to GGUF for sub-50ms CPU inference
+python -m training.models.export_gguf \
+    --adapter-path training/models/strategist_final
+# 5. Closed-loop policy iteration: collect → train → deploy → repeat
+python -m training.policy_iteration \
+    --trajectories-path data/trajectories.jsonl
+```
+The full pipeline runs on a free Colab T4. See [`KernelX_Training.ipynb`](KernelX_Training.ipynb).
+## Reward function
+```
+R_t = α · log(Δ_exec + 1)  −  β · max(0, Δ_wait)  −  γ · |a_t − a_{t-1}|  +  format_reward
+```
+| Component | Weight | Signal | Range |
+|---|---|---|---|
+| Throughput | α = 1.0 | log of CPU-time progress | [0, ~10] |
+| Latency penalty | β = 2.0 | per-microsecond increase in wait time | (-∞, 0] |
+| Stability penalty | γ = 0.5 | absolute action change between steps | [-1, 0] |
+| Format reward | 1.0 | action ∈ [-1, +1] | {0, 1} |
+The format reward is what stops the agent from outputting nonsense — every other component still applies if it does, but losing the format point is a hard signal during early GRPO. The stability term is what stops the agent from oscillating. The latency term is the actual objective. The throughput term keeps the agent from learning that "do nothing forever" is a local optimum.
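The formula and weights above translate directly into a decomposed reward function. This is a sketch rather than the project's implementation: the natural logarithm, the clamp on `delta_exec`, and all function and key names are assumptions.

```python
import math

ALPHA, BETA, GAMMA = 1.0, 2.0, 0.5   # weights from the table above
FORMAT_BONUS = 1.0                   # granted when the action lies in [-1, +1]


def reward(delta_exec: float, delta_wait: float,
           action: float, prev_action: float) -> dict:
    """Return each reward component plus the total, so terms stay inspectable."""
    parts = {
        # Assumption: delta_exec is non-negative; the clamp only guards log's domain.
        "throughput": ALPHA * math.log(max(0.0, delta_exec) + 1.0),
        # Only increases in wait time are penalized.
        "latency": -BETA * max(0.0, delta_wait),
        # Penalize the change in action between consecutive steps.
        "stability": -GAMMA * abs(action - prev_action),
        "format": FORMAT_BONUS if -1.0 <= action <= 1.0 else 0.0,
    }
    parts["total"] = sum(parts.values())
    return parts


# Worst-case action swing (-1 to +1) with a 10 µs wait increase and no progress.
example = reward(delta_exec=0.0, delta_wait=10.0, action=1.0, prev_action=-1.0)
```

Returning the components alongside the total is what makes reward hacking visible: a rising total driven by a single term shows up immediately in the breakdown.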
+## Results
+**World Model (Stage 2 SFT).** The model learns the kernel's default dynamics from 10K transitions in 2 epochs. Loss dropped from 2.05 → 0.29, token-level prediction accuracy from 61% → 91%. *[Plot: training/plots/world_model_training.png]*
+**Strategist warm-start (Stage 3a SFT).** Teaches the model the output format before RL begins. Loss 2.13 → 0.28, 100% format compliance. *[Plot: training/plots/strategist_warmstart_training.png]*
+**Strategist GRPO (Stage 3b RL).** Trained against the World Model simulator. The trained policy achieves higher cumulative reward than both the random-action baseline and the hand-written heuristic policy on held-out test states. *[Plot: training/plots/grpo_training.png — to be regenerated against World-Model simulator]*
+**Inference.** The Q4_K_M-quantized GGUF model is 258 MB and runs in 44 ms warm-cache on a laptop CPU.
+For full numbers and per-iteration breakdowns: [`training/PERFORMANCE.md`](training/PERFORMANCE.md).
+## Running locally
+The full kernel→bridge→brain stack requires a Linux machine with kernel BTF support and root access. The OpenEnv environment alone (which is what judges interact with) runs anywhere — the HF Space is the easiest path.
+```bash
+# Step 1: Load the eBPF sentinel (Linux only, requires sudo)
+cd kernel && sudo make load
+# Step 2: Start the Rust bridge
+cargo run --manifest-path bridge/Cargo.toml --release -- --record
+# Step 3: Start the OpenEnv server
+export PYTHONPATH=$PYTHONPATH:.
+python3 -m brain.server.app
+# Step 4: Run the autonomous policy loop
+python3 -m brain.server.run_autonomous --steps 50 --verbose
+# Step 5: Launch the HUD
+cargo run --manifest-path ui/Cargo.toml --release
+```
+If the eBPF stack isn't available, the brain server falls back to a simulator and the UI runs in `MOCK DEMO` mode.
+## Model details
+| | |
+|---|---|
+| Base model | SmolLM2-360M-Instruct |
+| Fine-tuning | LoRA (r=16, α=32) on q/k/v/o + gate/up/down |
+| Quantization | GGUF Q4_K_M (258 MB) |
+| Inference latency | 44 ms warm-cache, CPU |
+| Action space | single float ∈ [-1.0, +1.0] |
+| Observation | 10 active features extracted from 24D eBPF vector |
+| Target hardware | i3 CPU laptop, sub-50ms decision budget |
+## Shared-memory contract
+The UI and the brain both read from `/dev/shm/kernelx_state`:
+```rust
+#[repr(C, packed)]
+struct HUDState {
+    features: [u64; 24],       // 24D telemetry vector
+    current_action: f32,        // most recent AI action
+    active_pid: u32,            // process being scheduled
+    is_clamped: u32,            // safety auditor flag
+    reasoning: [u8; 128],       // explanation string
+    p99_wait_us: u64,           // P99 wait latency
+    core_heat: [f32; 4],        // per-core utilization
+    model_confidence: f32,
+    world_model_drift: f32,
+    radish_wal_size: u64,
+    radish_dirty_pages: u32,
+}
+```
+Total: 376 bytes, packed C layout, byte-identical between Rust and Python.
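On the Python side, the 376-byte figure can be checked with the standard `struct` module. The sketch below assumes a little-endian host and a NUL-padded `reasoning` buffer; `HUD_STATE` and `decode` are illustrative names, not the brain's actual reader.

```python
import struct

# "<" selects little-endian with no padding, mirroring #[repr(C, packed)].
HUD_STATE = struct.Struct(
    "<"
    "24Q"    # features: [u64; 24]
    "f"      # current_action: f32
    "I"      # active_pid: u32
    "I"      # is_clamped: u32
    "128s"   # reasoning: [u8; 128]
    "Q"      # p99_wait_us: u64
    "4f"     # core_heat: [f32; 4]
    "f"      # model_confidence: f32
    "f"      # world_model_drift: f32
    "Q"      # radish_wal_size: u64
    "I"      # radish_dirty_pages: u32
)
assert HUD_STATE.size == 376  # 192 + 12 + 128 + 8 + 16 + 8 + 8 + 4


def decode(buf: bytes) -> dict:
    """Unpack one HUDState snapshot into a dict keyed by field name."""
    f = HUD_STATE.unpack(buf)
    return {
        "features": f[0:24],
        "current_action": f[24],
        "active_pid": f[25],
        "is_clamped": f[26],
        # Assumption: the explanation string is NUL-padded UTF-8.
        "reasoning": f[27].split(b"\x00", 1)[0].decode("utf-8", "replace"),
        "p99_wait_us": f[28],
        "core_heat": f[29:33],
        "model_confidence": f[33],
        "world_model_drift": f[34],
        "radish_wal_size": f[35],
        "radish_dirty_pages": f[36],
    }
```

A reader would typically take `HUD_STATE.size` bytes from `/dev/shm/kernelx_state` and pass them to `decode`. Because the struct is packed, a snapshot read while the writer is mid-update may be torn, so a consumer may want to re-read when values look inconsistent.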
+## What we'd do with more time
+**Reward normalization.** Wait-delta values can hit 89,000 microseconds, which dominates the reward and risks gradient explosion in GRPO. Clipping the latency penalty to a fixed range (or scaling by p95 wait time) would stabilize training.
+**PMU features.** Fourteen of the 24 feature slots are reserved for hardware performance counters (IPC, cache misses, branch mispredictions). Populating them via `perf_event_open` would give the agent much richer state, especially for distinguishing "CPU-bound but progressing" from "CPU-bound and thrashing."
+**Multi-process reasoning.** The current Strategist acts on one PID at a time. A multi-agent extension where each PID has its own agent — or a centralized agent reasoning about process *interactions* — is the natural next step.
+**Real GRPO on real telemetry.** The current setup trains GRPO against the learned World Model. With more compute, training could close the loop by collecting fresh trajectories under the trained policy and re-training — proper online RL on a real system.
+## Citation
+```bibtex
+@misc{kernelx2026,
+  title  = {KernelX: An OpenEnv World-Modeling Environment for Linux Kernel Scheduling},
+  author = {Naman Gupta and team},
+  year   = {2026},
+  note   = {Meta PyTorch OpenEnv Hackathon}
+}
+```
+## License
+MIT. RadishDB sub-component is also MIT (see `RadishDB/LICENSE`).
+---
+*KernelX — Meta PyTorch OpenEnv Hackathon 2026 — Theme 3.1, World Modeling*