---
license: mit
language:
- en
base_model:
- NousResearch/Llama-2-7b-hf
pipeline_tag: text-generation
tags:
- llm
- llama
- fine-tune
- lema
- vram-optimization
- low-resource-computing
- chat
- merged
---
# LEMA-Llama-2-7b (Proof of Concept)
This model is a demonstration of the **[LEMA (Layer-wise Efficient Memory Abstraction)](https://github.com/Pomilon/LEMA)** framework. It proves that large language models (7B+) can be fine-tuned on consumer-grade hardware with limited VRAM (e.g., 16GB Tesla P100) by virtualizing GPU memory.
**Key Achievement:**
Fine-tuned Llama-2-7B using only **6.36 GB of VRAM** (standard LoRA typically requires ~14GB+ for this configuration).
> Training code is available in the GitHub repository: [**LEMA-llama**](https://github.com/Pomilon/LEMA-llama)
## Model Details
- **Base Model:** `NousResearch/Llama-2-7b-hf`
- **Framework:** LEMA v1.0
- **Fine-Tuning Method:** LoRA (Rank 16, Alpha 32)
- **Memory Strategy:** Streaming (Triple-Buffer: Disk -> RAM -> VRAM)
- **Precision:** FP16
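The triple-buffer streaming strategy can be pictured as a three-stage producer/consumer pipeline. The sketch below is purely illustrative (the stage names, queue sizes, and string stand-ins for tensors are assumptions for exposition, not LEMA's actual API): bounded queues mean at most one layer is staged in RAM and one resident in "VRAM" at any time, while the next layer is prefetched in the background.

```python
import queue
import threading

# Illustrative triple-buffer pipeline (Disk -> RAM -> VRAM).
# Names and structure are assumptions, not LEMA's real implementation.
NUM_LAYERS = 4

def disk_reader(ram_buffer):
    """Stage 1: stream layer weights from disk into a bounded RAM buffer."""
    for layer_id in range(NUM_LAYERS):
        weights = f"weights_of_layer_{layer_id}"  # stand-in for a tensor read
        ram_buffer.put((layer_id, weights))
    ram_buffer.put(None)  # sentinel: no more layers

def ram_to_vram(ram_buffer, vram_buffer):
    """Stage 2: move the next layer from RAM to (simulated) VRAM."""
    while (item := ram_buffer.get()) is not None:
        vram_buffer.put(item)  # in LEMA this would be a host->device copy
    vram_buffer.put(None)

def compute(vram_buffer, results):
    """Stage 3: run the current layer while the next one is prefetched."""
    while (item := vram_buffer.get()) is not None:
        layer_id, weights = item
        results.append(f"ran layer {layer_id} with {weights}")

ram_buffer = queue.Queue(maxsize=1)   # at most one layer staged in RAM
vram_buffer = queue.Queue(maxsize=1)  # at most one layer resident in VRAM
results = []
threads = [
    threading.Thread(target=disk_reader, args=(ram_buffer,)),
    threading.Thread(target=ram_to_vram, args=(ram_buffer, vram_buffer)),
    threading.Thread(target=compute, args=(vram_buffer, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # -> 4: every layer ran despite the bounded buffers
```

The `maxsize=1` bound is what caps memory: the disk reader blocks until the downstream stages drain their slots, so resident footprint never exceeds one layer per tier.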
## Training Configuration
The model was trained to learn a strict custom chat format (`[LEMA_REPLY]`) to verify that weight updates were successfully applied.
- **Hardware:** NVIDIA Tesla P100 (16GB VRAM)
- **Batch Size:** 8 (Gradient Accumulation: 1)
- **Sequence Length:** 512
- **Steps:** 625 (1 Epoch over 5k examples)
- **Optimizer:** AdamW (lr=1e-4)
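The step count follows directly from the dataset size and batch configuration above:

```python
# Sanity-check the step count: 5,000 examples, batch size 8, grad accum 1.
examples = 5_000
batch_size = 8
grad_accum = 1
steps_per_epoch = examples // (batch_size * grad_accum)
print(steps_per_epoch)  # -> 625
```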
### Memory Efficiency
| Metric | Standard PEFT/LoRA | **LEMA (This Run)** |
| :--- | :--- | :--- |
| **Peak VRAM** | OOM | **6.36 GB** |
| **System RAM** | N/A (run OOMs) | **2.40 GB** |
*Note: Standard PEFT typically OOMs at Batch Size 4-8 on 16GB cards with 512 context. LEMA held steady at <7GB.*
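For context, the LoRA adapter itself is tiny; the VRAM savings come from streaming the frozen base weights, not from the adapter. The back-of-the-envelope estimate below makes assumptions that are *not* taken from this repo's config (target modules `q_proj`/`v_proj` only, hidden size 4096, 32 decoder layers) and uses the Rank 16 setting from above:

```python
# Rough LoRA adapter size estimate for Llama-2-7B. Assumed (not from this
# repo's config): targets q_proj and v_proj only, hidden size 4096,
# 32 decoder layers, FP16 storage. Rank 16 is from the training config.
rank = 16
hidden = 4096
layers = 32
targets_per_layer = 2  # q_proj, v_proj (assumed)

# Each adapted 4096x4096 projection adds A (rank x hidden) + B (hidden x rank)
params_per_target = rank * hidden + hidden * rank
total_params = params_per_target * targets_per_layer * layers
megabytes_fp16 = total_params * 2 / 1024**2  # 2 bytes per FP16 value

print(total_params)           # -> 8388608 trainable parameters (~8.4M)
print(round(megabytes_fp16))  # -> 16 (MB in FP16)
```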
## Training Logs
The training loss converged smoothly, demonstrating stable learning despite the layer-wise streaming architecture.
```
Step 10/625 | Loss: 2.1732 | VRAM: 6.36GB
Step 100/625 | Loss: 0.0677 | VRAM: 6.36GB
Step 200/625 | Loss: 0.0462 | VRAM: 6.36GB
Step 300/625 | Loss: 0.0407 | VRAM: 6.36GB
Step 400/625 | Loss: 0.0412 | VRAM: 6.36GB
Step 500/625 | Loss: 0.0459 | VRAM: 6.36GB
Step 600/625 | Loss: 0.0406 | VRAM: 6.36GB
Final Step | Training Complete
```
## Derived Metrics
* Total Training Time: 5h 40m
* Average Step Time: 32.23s
* Peak VRAM: 6.36GB (stable)
* Peak RAM: 2.52GB
Full raw logs are available [here](training_logs.txt).
## Limitations & Known Issues
**⚠️ Warning: Experimental Proof-of-Concept**
This model was trained for only **1 epoch** as a mechanical stress test of the LEMA library. While it successfully learned the new vocabulary and special tags, it has not yet mastered the logical structure or grammar of the custom template.
- **Token Looping:** The model may repeat tags like `[LEMA_REPLY]` multiple times in a loop.
- **Hallucinations:** It may invent creative definitions for terms it hasn't seen in its original pre-training (e.g., hallucinating an acronym for LEMA).
- **Overfitting:** Due to the small, highly repetitive synthetic dataset and 1-epoch training, the model is likely overfit to the specific examples provided.
- **Template Grammar:** It often skips the `Explanation:` and `Confidence:` fields.
To achieve production-grade results and make the model usable for general tasks, training for 3-5 epochs on a much larger, more diverse dataset (50k+ examples) is recommended.
## Usage
This model uses a custom prompt format for testing purposes:
```text
<|system|>
You are a precise assistant trained using LEMA.
<|user|>
What is LEMA?
<|assistant|>
[LEMA_REPLY]
Answer: ...
Explanation: ...
Confidence: High
[/LEMA_REPLY]
```
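A small helper can keep the template consistent when building prompts programmatically. `build_prompt` is a hypothetical convenience function, not part of this repo; it mirrors the format above:

```python
# Hypothetical helper (not part of this repo) that assembles the custom
# test-format prompt used by this model.
def build_prompt(system: str, user: str) -> str:
    return (
        f"<|system|>\n{system}\n\n"
        f"<|user|>\n{user}\n\n"
        "<|assistant|>\n[LEMA_REPLY]\nAnswer:"
    )

prompt = build_prompt(
    "You are a precise assistant trained using LEMA.",
    "What is LEMA?",
)
print(prompt)
```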
### Loading with Transformers
Since the LoRA adapter has been merged into the base model, you can load it as a standard Llama model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged model in FP16 and place it on the GPU
model = AutoModelForCausalLM.from_pretrained(
    "Pomilon/LEMA-llama-2-7b",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Pomilon/LEMA-llama-2-7b")

prompt = (
    "<|system|>\nYou are a precise assistant trained using LEMA.\n\n"
    "<|user|>\nWhat is LEMA?\n\n"
    "<|assistant|>\n[LEMA_REPLY]\nAnswer:"
)
# Send inputs to the same device the model was placed on
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## About LEMA
LEMA is an experimental framework designed to democratize LLM fine-tuning. It treats model weights as a stream of data rather than a static block, allowing models to be processed layer-by-layer. This trades computation time (latency) for massive memory savings.
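The time-for-memory trade can be sketched in a few lines. This is a conceptual illustration only (the layer and forward functions are toy stand-ins, not LEMA code): at any moment only one layer's weights are materialized, so peak memory is a single layer rather than the whole model, at the cost of repeatedly loading from slower storage.

```python
# Conceptual sketch of weights-as-a-stream processing (illustrative only).
def load_layer(i):
    return {"id": i, "weights": [i] * 4}  # stand-in for reading from disk

def apply_layer(layer, x):
    return x + sum(layer["weights"])      # stand-in for the real forward op

def streamed_forward(x, num_layers):
    for i in range(num_layers):
        layer = load_layer(i)  # materialize this layer only
        x = apply_layer(layer, x)
        del layer              # free it before loading the next
    return x

print(streamed_forward(0, 3))  # -> 12 (layers contribute 0 + 4 + 8)
```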
[Check out the GitHub Repository](https://github.com/Pomilon/LEMA)