IronCell Mark 1: Technical Brief
GitHub Repository: gaoang1111/IronMan
Checkpoints: HuggingFace - IronCell-Mark-1
Training Logs: WandB Overview
Core Efficiency Metrics
| Metric | Value / Performance |
|---|---|
| VRAM Footprint | Reduced by 93.75% (down to 6.25% of the baseline requirement) |
| Logic Integrity (PPL) | 11.20 (FineWeb Zero-Overlap) |
| Baseline (Llama 3.1 8B) | 7.40 PPL |
The Verdict: a marginal increase in perplexity exchanged for context capacity that would otherwise be impossible on consumer-grade GPUs.
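For reference, the two VRAM figures are consistent with the 16:1 sequence compression described below, under the assumption that the context footprint scales linearly with the number of resident tokens:

$$\frac{1}{16} = 6.25\%, \qquad 1 - \frac{1}{16} = 93.75\%$$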
Cellular Differentiation Theory
The project views a pre-trained LLM as a powerful but rigid "state machine" and treats the homologous base (Llama 3.1 8B) as a "stem cell". Through induced functional differentiation, the model is split into collaborating units:
- Compressor (`cmp`): specialized in distilling raw text chunks into dense semantic latent vectors.
- Generator (`gen`): a causal language model trained to reconstruct and reason over these compressed vectors.
- Projector (`proj`): a linear mapping that translates compressor hidden states into the generator's hidden space (see the sketch after this list).
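A minimal PyTorch sketch of what such a projector could look like. The hidden sizes, the class name, and the choice of a single linear layer are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Hypothetical linear bridge from the compressor's hidden space to
    the generator's hidden space. Dimensions below are placeholders."""

    def __init__(self, cmp_hidden: int = 4096, gen_hidden: int = 4096):
        super().__init__()
        self.proj = nn.Linear(cmp_hidden, gen_hidden)

    def forward(self, cmp_states: torch.Tensor) -> torch.Tensor:
        # cmp_states: [batch, num_chunks, cmp_hidden]
        # returns:    [batch, num_chunks, gen_hidden]
        return self.proj(cmp_states)
```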
Zipper Layout (Masked Parallel Training)
To achieve 16:1 sequence compression, IronCell utilizes a "control chain + raw chunks" layout:
- Structural Chain: formatted as `[<bos>][<soc>] V-1 [<eoc>] V0 [<eoc>] V1 [<eoc>] ... [Raw_Token chunks]`
- Zipper (Staircase) Mask: a custom attention mask ensures each raw segment attends only to its permitted control tokens, maintaining causal integrity without information leakage (a mask-construction sketch follows this list).
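A rough sketch of how a staircase mask of this kind could be built, assuming one control slot per raw chunk and the "control chain first, raw chunks after" ordering shown above. The function name, slot counts, and layout details are assumptions for illustration, not the repo's actual implementation.

```python
import torch

def zipper_mask(num_chunks: int, chunk_len: int, ctrl_per_chunk: int = 1) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for a layout of the form
    [ctrl_0 .. ctrl_{K-1}] [chunk_0] ... [chunk_{K-1}]:
    control tokens attend causally to earlier control tokens; raw chunk i
    attends causally within itself plus the control slots of chunks 0..i,
    and never to tokens of other raw chunks."""
    n_ctrl = num_chunks * ctrl_per_chunk
    total = n_ctrl + num_chunks * chunk_len
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Control chain: standard causal mask among control tokens.
    mask[:n_ctrl, :n_ctrl] = torch.tril(torch.ones(n_ctrl, n_ctrl)).bool()

    for i in range(num_chunks):
        start = n_ctrl + i * chunk_len
        end = start + chunk_len
        # "Staircase": chunk i sees only the control slots of chunks 0..i.
        mask[start:end, : (i + 1) * ctrl_per_chunk] = True
        # Causal attention inside the chunk itself.
        mask[start:end, start:end] = torch.tril(torch.ones(chunk_len, chunk_len)).bool()
    return mask
```

Because raw chunks never attend to one another, all chunks can be packed into a single sequence and trained in parallel without leaking information across chunk boundaries.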
Training & Reproducibility
The entire differentiation process is reproducible in an afternoon (~5 hours) using an 8×A800 node.
Phase 1: Alignment
- Objective: Only the projector and new special tokens are trained.
- Performance: The compressed signal aligns quickly; loss drops from 12.8 to 4.12 in ~20 steps (a freezing sketch follows this list).
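A hedged sketch of the Phase 1 freezing logic, assuming a Hugging Face-style model object: everything is frozen except the projector and the embedding rows of the newly added special tokens. The object names (`model`, `projector`, `new_token_ids`) are placeholders, not the repo's API.

```python
import torch

def freeze_for_alignment(model, projector, new_token_ids):
    """Phase 1 (sketch): train only the projector and the new special-token
    embedding rows; all other weights stay frozen."""
    for p in model.parameters():
        p.requires_grad_(False)
    for p in projector.parameters():
        p.requires_grad_(True)

    # The embedding matrix must carry gradients, but only the rows of the
    # new special tokens should actually update; mask the rest with a hook.
    emb = model.get_input_embeddings().weight
    emb.requires_grad_(True)
    keep = torch.zeros(emb.shape[0], 1)
    keep[list(new_token_ids)] = 1.0

    def _mask_grad(grad):
        return grad * keep.to(grad.device, grad.dtype)

    emb.register_hook(_mask_grad)
```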
Phase 2: Differentiation
- Objective: The full model weights are unfrozen and trained with L2 regularization.
- Performance: Eval loss declines steadily from 2.72 to 2.41 (an optimizer sketch follows this list).
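A minimal sketch of Phase 2, assuming the L2 regularization is realized as standard weight decay in AdamW; the learning rate, decay strength, and whether the repo instead penalizes the distance to the original Llama weights are not specified here.

```python
import torch

def unfreeze_for_differentiation(model, lr: float = 1e-5, weight_decay: float = 0.01):
    """Phase 2 (sketch): unfreeze all weights and apply L2 regularization
    via AdamW's decoupled weight decay. Hyperparameters are placeholders."""
    for p in model.parameters():
        p.requires_grad_(True)
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
```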
Data Specifications
- Source: FineWeb-Edu (HuggingFace).
- Scale: Phase 2 uses 10,000 samples.
- Length: Individual documents range from 10k to 30k characters.
- Protocol: A zero-overlap sampling strategy was maintained for the first 150 training steps (a sampling sketch follows this list).
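A sampling sketch consistent with the numbers above, using the public `HuggingFaceFW/fineweb-edu` dataset in streaming mode. The config/split names, the length filter, and the use of the document `id` as the de-duplication key are assumptions; the repo's actual data pipeline may differ.

```python
from datasets import load_dataset

def sample_fineweb_edu(num_samples: int = 10_000,
                       min_chars: int = 10_000,
                       max_chars: int = 30_000) -> list[str]:
    """Stream FineWeb-Edu, keep documents in the 10k-30k character band,
    and never emit the same document twice (zero-overlap sampling)."""
    stream = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
    seen, samples = set(), []
    for row in stream:
        text = row["text"]
        if not (min_chars <= len(text) <= max_chars):
            continue
        key = row.get("id", text[:128])  # fall back to a text prefix if no id
        if key in seen:
            continue
        seen.add(key)
        samples.append(text)
        if len(samples) >= num_samples:
            break
    return samples
```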
Base model: meta-llama/Llama-3.1-8B (model tree: ddddamn/IronCell-Mark-1)