---
license: apache-2.0
base_model: meta-llama/Llama-3.1-8B
tags:
- sequence-compression
- kv-cache
- long-context
- efficiency
metrics:
- perplexity
---
# IronCell — Mark 1: Technical Brief

**GitHub Repository:** [gaoang1111/IronMan](https://github.com/gaoang1111/IronMan)

**Checkpoints:** [HuggingFace - IronCell-Mark-1](https://huggingface.co/ddddamn/IronCell-Mark-1)

**Training Logs:** [WandB Overview](https://wandb.ai/gaoang001111-none/IronMan/overview)

---
## Core Efficiency Metrics

| Metric | Value / Performance |
| :--- | :--- |
| **VRAM Footprint** | **Reduced by 93.75%** (6.25% of the baseline requirement) |
| **Logic Integrity (PPL)** | **11.20** (FineWeb, zero-overlap) |
| **Baseline (Llama 3.1 8B)** | 7.40 PPL |

> **The Verdict:** A marginal increase in perplexity is exchanged for context capacity that would otherwise be impossible on consumer-grade GPUs.
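The 6.25% figure follows directly from the 16:1 sequence compression described below, assuming the KV-cache footprint scales linearly with the stored sequence length; a quick arithmetic check:

```python
# Sanity check: a 16:1 compression ratio keeps 1/16 of the baseline KV-cache,
# i.e. a 93.75% reduction in the sequence-dependent VRAM footprint.
compression_ratio = 16
kv_fraction = 1 / compression_ratio        # 0.0625 -> 6.25% of the baseline requirement
vram_reduction = 1 - kv_fraction           # 0.9375 -> 93.75% reduction
print(f"{kv_fraction:.2%} remaining, {vram_reduction:.2%} saved")
```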
---

## Cellular Differentiation Theory

The project views a pre-trained LLM as a powerful but rigid "state machine" and treats the homologous base (Llama 3.1 8B) as a "stem cell". Through induced functional differentiation, the model is split into collaborating units (a minimal sketch follows the list):

* **Compressor (`cmp`):** Specialized in distilling raw text chunks into dense semantic latent vectors.
* **Generator (`gen`):** A causal language model trained to reconstruct and reason based on these compressed vectors.
* **Projector (`proj`):** A linear mapping that translates compressor hidden states into the generator's hidden space.
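A minimal sketch of how these units could be wired together; the class name, module attributes, and 4096-dimensional hidden sizes are illustrative assumptions, not the released code:

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Linear map (`proj`) from the compressor's hidden space into the generator's hidden space."""

    def __init__(self, cmp_dim: int = 4096, gen_dim: int = 4096):
        super().__init__()
        self.linear = nn.Linear(cmp_dim, gen_dim)

    def forward(self, cmp_hidden: torch.Tensor) -> torch.Tensor:
        # cmp_hidden: (batch, num_chunks, cmp_dim) latent vectors distilled by `cmp`
        # returns:    (batch, num_chunks, gen_dim) soft tokens consumable by `gen`
        return self.linear(cmp_hidden)
```

In this reading, `cmp` distills each raw chunk into one latent vector, `proj` maps it into the generator's hidden space, and `gen` attends over the projected vectors in place of the original tokens.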
---

## Zipper Layout (Masked Parallel Training)

To achieve **16:1** sequence compression, IronCell utilizes a "control chain + raw chunks" layout:

1. **Structural Chain:** Formatted as `[<bos>][<soc>] V-1 [<eoc>] V0 [<eoc>] V1 [<eoc>] ... [Raw_Token chunks]`
2. **Zipper (Staircase) Mask:** A custom attention mask ensures each raw segment attends only to its permitted control tokens, maintaining causal integrity without information leakage (see the mask sketch below).
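A simplified sketch of one way to build such a staircase mask, assuming a layout of `[control chain][chunk 0][chunk 1]...` where each chunk may attend causally within itself plus to the control tokens up to its own compressed slot; the exact `<soc>`/`<eoc>` bookkeeping of the released layout may differ:

```python
import torch

def zipper_mask(num_chunks: int, chunk_len: int, ctrl_prefix: int = 2) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for [control chain][chunk 0][chunk 1]...

    The control chain holds `ctrl_prefix` leading tokens (e.g. <bos><soc>) plus one
    compressed slot per chunk. Each raw chunk attends causally within itself and to
    the control tokens up to and including its own slot; chunks never see each other.
    """
    ctrl_len = ctrl_prefix + num_chunks
    total = ctrl_len + num_chunks * chunk_len
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Plain causal attention inside the control chain itself.
    mask[:ctrl_len, :ctrl_len] = torch.tril(torch.ones(ctrl_len, ctrl_len, dtype=torch.bool))

    for i in range(num_chunks):
        start = ctrl_len + i * chunk_len
        end = start + chunk_len
        # Causal attention within the raw chunk.
        mask[start:end, start:end] = torch.tril(torch.ones(chunk_len, chunk_len, dtype=torch.bool))
        # The "staircase": chunk i additionally sees the prefix plus slots 0..i.
        mask[start:end, : ctrl_prefix + i + 1] = True

    return mask
```

For example, `zipper_mask(num_chunks=4, chunk_len=16)` covers 64 raw tokens compressed into 4 latent slots, matching the 16:1 ratio.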
---

## Training & Reproducibility

The entire differentiation process is reproducible in an afternoon (**~5 hours**) using an **8×A800** node.

### Phase 1: Alignment

* **Objective:** Train only the projector and the newly added special-token embeddings; the rest of the model stays frozen (a freeze-policy sketch follows).
* **Performance:** Training loss drops from 12.8 to 4.12 within ~20 steps as the compressed signal aligns with the generator.
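A sketch of the freeze policy this phase implies, assuming the combined model exposes the projector as `model.proj` and that `new_token_ids` holds the IDs of the added special tokens (both names are hypothetical):

```python
import torch

def freeze_for_phase1(model, new_token_ids):
    """Phase 1: train only the projector and the embedding rows of the new special tokens."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.proj.parameters():          # projector module, attribute name assumed
        p.requires_grad = True

    emb = model.get_input_embeddings().weight
    emb.requires_grad = True
    row_mask = torch.zeros_like(emb)
    row_mask[new_token_ids] = 1.0
    # Zero the gradient of every pre-existing vocabulary row so only new tokens update.
    emb.register_hook(lambda grad: grad * row_mask)
```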
### Phase 2: Differentiation

* **Objective:** Unfreeze all model weights and continue training with **L2 regularization** (see the sketch below).
* **Performance:** Eval loss declines steadily from 2.72 to 2.41.
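A sketch of the corresponding Phase 2 setup; the L2 term is shown here as plain AdamW weight decay, which is one reading of the brief, and the hyperparameter values are placeholders rather than the reported settings:

```python
import torch

# Phase 2: unfreeze everything and train with an L2 penalty (here: AdamW weight decay).
for p in model.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
```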
---

## Data Specifications

* **Source:** FineWeb-Edu (HuggingFace); a loading sketch follows this list.
* **Scale:** Phase 2 uses 10,000 samples.
* **Length:** Individual samples range from 10k to 30k characters.
* **Protocol:** A **zero-overlap** sampling strategy is maintained for the first 150 training steps.
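A sketch of the data selection under these specifications, assuming the `HuggingFaceFW/fineweb-edu` dataset ID and its `text` field; the exact config or subset used is not stated here:

```python
from datasets import load_dataset

# Stream FineWeb-Edu and keep documents whose raw length falls in the 10k-30k character
# range, stopping once the 10,000 Phase 2 samples are collected.
stream = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
filtered = (ex["text"] for ex in stream if 10_000 <= len(ex["text"]) <= 30_000)
phase2_texts = [next(filtered) for _ in range(10_000)]
```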