---
license: apache-2.0
base_model: meta-llama/Llama-3.1-8B
tags:
- sequence-compression
- kv-cache
- long-context
- efficiency
metrics:
- perplexity
---
# IronCell — Mark 1: Technical Brief

**GitHub Repository:** [gaoang1111/IronMan](https://github.com/gaoang1111/IronMan)

**Checkpoints:** [HuggingFace - IronCell-Mark-1](https://huggingface.co/ddddamn/IronCell-Mark-1)

**Training Logs:** [WandB Overview](https://wandb.ai/gaoang001111-none/IronMan/overview)

---
## Core Efficiency Metrics

| Metric | Value / Performance |
| :--- | :--- |
| **VRAM Footprint** | **Reduced by 93.75%** (6.25% of the baseline requirement) |
| **Logic Integrity (PPL)** | **11.20** (FineWeb, zero-overlap) |
| **Baseline (Llama 3.1 8B)** | 7.40 PPL |

> **The Verdict:** A marginal increase in perplexity is exchanged for context capacity that would otherwise be impossible on consumer-grade GPUs.
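The 6.25% figure follows directly from the 16:1 sequence compression described below, assuming the KV-cache footprint scales linearly with the stored sequence length; a quick arithmetic check:

```python
# Sanity check: a 16:1 compression ratio keeps 1/16 of the baseline KV-cache,
# i.e. a 93.75% reduction in the sequence-dependent VRAM footprint.
compression_ratio = 16
kv_fraction = 1 / compression_ratio        # 0.0625 -> 6.25% of the baseline requirement
vram_reduction = 1 - kv_fraction           # 0.9375 -> 93.75% reduction
print(f"{kv_fraction:.2%} remaining, {vram_reduction:.2%} saved")
```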
---

## Cellular Differentiation Theory

The project views a pre-trained LLM as a powerful but rigid "state machine" and treats the homologous base (Llama 3.1 8B) as a "stem cell". Through induced functional differentiation, the model is split into collaborating units (a minimal sketch follows the list):

* **Compressor (`cmp`):** Specialized in distilling raw text chunks into dense semantic latent vectors.
* **Generator (`gen`):** A causal language model trained to reconstruct and reason based on these compressed vectors.
* **Projector (`proj`):** A linear mapping that translates compressor hidden states into the generator's hidden space.
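A minimal sketch of how these units could be wired together; the class name, module attributes, and 4096-dimensional hidden sizes are illustrative assumptions, not the released code:

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Linear map (`proj`) from the compressor's hidden space into the generator's hidden space."""

    def __init__(self, cmp_dim: int = 4096, gen_dim: int = 4096):
        super().__init__()
        self.linear = nn.Linear(cmp_dim, gen_dim)

    def forward(self, cmp_hidden: torch.Tensor) -> torch.Tensor:
        # cmp_hidden: (batch, num_chunks, cmp_dim) latent vectors distilled by `cmp`
        # returns:    (batch, num_chunks, gen_dim) soft tokens consumable by `gen`
        return self.linear(cmp_hidden)
```

In this reading, `cmp` distills each raw chunk into one latent vector, `proj` maps it into the generator's hidden space, and `gen` attends over the projected vectors in place of the original tokens.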
---

## Zipper Layout (Masked Parallel Training)

To achieve **16:1** sequence compression, IronCell utilizes a "control chain + raw chunks" layout:

1. **Structural Chain:** Formatted as `[<bos>][<soc>] V-1 [<eoc>] V0 [<eoc>] V1 [<eoc>] ... [Raw_Token chunks]`
2. **Zipper (Staircase) Mask:** A custom attention mask ensures each raw segment attends only to its permitted control tokens, maintaining causal integrity without information leakage (see the mask sketch below).
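A simplified sketch of one way to build such a staircase mask, assuming a layout of `[control chain][chunk 0][chunk 1]...` where each chunk may attend causally within itself plus to the control tokens up to its own compressed slot; the exact `<soc>`/`<eoc>` bookkeeping of the released layout may differ:

```python
import torch

def zipper_mask(num_chunks: int, chunk_len: int, ctrl_prefix: int = 2) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for [control chain][chunk 0][chunk 1]...

    The control chain holds `ctrl_prefix` leading tokens (e.g. <bos><soc>) plus one
    compressed slot per chunk. Each raw chunk attends causally within itself and to
    the control tokens up to and including its own slot; chunks never see each other.
    """
    ctrl_len = ctrl_prefix + num_chunks
    total = ctrl_len + num_chunks * chunk_len
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Plain causal attention inside the control chain itself.
    mask[:ctrl_len, :ctrl_len] = torch.tril(torch.ones(ctrl_len, ctrl_len, dtype=torch.bool))

    for i in range(num_chunks):
        start = ctrl_len + i * chunk_len
        end = start + chunk_len
        # Causal attention within the raw chunk.
        mask[start:end, start:end] = torch.tril(torch.ones(chunk_len, chunk_len, dtype=torch.bool))
        # The "staircase": chunk i additionally sees the prefix plus slots 0..i.
        mask[start:end, : ctrl_prefix + i + 1] = True

    return mask
```

For example, `zipper_mask(num_chunks=4, chunk_len=16)` covers 64 raw tokens compressed into 4 latent slots, matching the 16:1 ratio.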
---

## Training & Reproducibility

The entire differentiation process is reproducible in an afternoon (**~5 hours**) using an **8×A800** node.

### Phase 1: Alignment

* **Objective:** Train only the projector and the newly added special-token embeddings; the rest of the model stays frozen (a freeze-policy sketch follows).
* **Performance:** Training loss drops from 12.8 to 4.12 within ~20 steps as the compressed signal aligns with the generator.
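A sketch of the freeze policy this phase implies, assuming the combined model exposes the projector as `model.proj` and that `new_token_ids` holds the IDs of the added special tokens (both names are hypothetical):

```python
import torch

def freeze_for_phase1(model, new_token_ids):
    """Phase 1: train only the projector and the embedding rows of the new special tokens."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.proj.parameters():          # projector module, attribute name assumed
        p.requires_grad = True

    emb = model.get_input_embeddings().weight
    emb.requires_grad = True
    row_mask = torch.zeros_like(emb)
    row_mask[new_token_ids] = 1.0
    # Zero the gradient of every pre-existing vocabulary row so only new tokens update.
    emb.register_hook(lambda grad: grad * row_mask)
```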
### Phase 2: Differentiation

* **Objective:** Unfreeze all model weights and continue training with **L2 regularization** (see the sketch below).
* **Performance:** Eval loss declines steadily from 2.72 to 2.41.
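A sketch of the corresponding Phase 2 setup; the L2 term is shown here as plain AdamW weight decay, which is one reading of the brief, and the hyperparameter values are placeholders rather than the reported settings:

```python
import torch

# Phase 2: unfreeze everything and train with an L2 penalty (here: AdamW weight decay).
for p in model.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
```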
---

## Data Specifications

* **Source:** FineWeb-Edu (HuggingFace); a loading sketch follows this list.
* **Scale:** Phase 2 uses 10,000 samples.
* **Length:** Individual samples range from 10k to 30k characters.
* **Protocol:** A **zero-overlap** sampling strategy is maintained for the first 150 training steps.
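A sketch of the data selection under these specifications, assuming the `HuggingFaceFW/fineweb-edu` dataset ID and its `text` field; the exact config or subset used is not stated here:

```python
from datasets import load_dataset

# Stream FineWeb-Edu and keep documents whose raw length falls in the 10k-30k character
# range, stopping once the 10,000 Phase 2 samples are collected.
stream = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
filtered = (ex["text"] for ex in stream if 10_000 <= len(ex["text"]) <= 30_000)
phase2_texts = [next(filtered) for _ in range(10_000)]
```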