---
license: mit
language:
- en
base_model:
- NousResearch/Llama-2-7b-hf
pipeline_tag: text-generation
tags:
- llm
- llama
- fine-tune
- lema
- vram-optimization
- low-resource-computing
- chat
- merged
---
# LEMA-Llama-2-7b (Proof of Concept)
This model is a demonstration of the **[LEMA (Layer-wise Efficient Memory Abstraction)](https://github.com/Pomilon/LEMA)** framework. It proves that large language models (7B+) can be fine-tuned on consumer-grade hardware with limited VRAM (e.g., 16GB Tesla P100) by virtualizing GPU memory.
**Key Achievement:**
Fine-tuned Llama-2-7B using only **6.36 GB of VRAM** (standard LoRA typically requires ~14GB+ for this configuration).
> Training code is available in the GitHub repository: [**LEMA-llama**](https://github.com/Pomilon/LEMA-llama)
## Model Details
- **Base Model:** `NousResearch/Llama-2-7b-hf`
- **Framework:** LEMA v1.0
- **Fine-Tuning Method:** LoRA (Rank 16, Alpha 32)
- **Memory Strategy:** Streaming (Triple-Buffer: Disk -> RAM -> VRAM)
- **Precision:** FP16
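The triple-buffer streaming strategy can be pictured as a three-stage producer/consumer pipeline. The sketch below is purely illustrative (the stage names, queue sizes, and string stand-ins for tensors are assumptions for exposition, not LEMA's actual API): bounded queues mean at most one layer is staged in RAM and one resident in "VRAM" at any time, while the next layer is prefetched in the background.

```python
import queue
import threading

# Illustrative triple-buffer pipeline (Disk -> RAM -> VRAM).
# Names and structure are assumptions, not LEMA's real implementation.
NUM_LAYERS = 4

def disk_reader(ram_buffer):
    """Stage 1: stream layer weights from disk into a bounded RAM buffer."""
    for layer_id in range(NUM_LAYERS):
        weights = f"weights_of_layer_{layer_id}"  # stand-in for a tensor read
        ram_buffer.put((layer_id, weights))
    ram_buffer.put(None)  # sentinel: no more layers

def ram_to_vram(ram_buffer, vram_buffer):
    """Stage 2: move the next layer from RAM to (simulated) VRAM."""
    while (item := ram_buffer.get()) is not None:
        vram_buffer.put(item)  # in LEMA this would be a host->device copy
    vram_buffer.put(None)

def compute(vram_buffer, results):
    """Stage 3: run the current layer while the next one is prefetched."""
    while (item := vram_buffer.get()) is not None:
        layer_id, weights = item
        results.append(f"ran layer {layer_id} with {weights}")

ram_buffer = queue.Queue(maxsize=1)   # at most one layer staged in RAM
vram_buffer = queue.Queue(maxsize=1)  # at most one layer resident in VRAM
results = []
threads = [
    threading.Thread(target=disk_reader, args=(ram_buffer,)),
    threading.Thread(target=ram_to_vram, args=(ram_buffer, vram_buffer)),
    threading.Thread(target=compute, args=(vram_buffer, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # -> 4: every layer ran despite the bounded buffers
```

The `maxsize=1` bound is what caps memory: the disk reader blocks until the downstream stages drain their slots, so resident footprint never exceeds one layer per tier.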
## Training Configuration
The model was trained to learn a strict custom chat format (`[LEMA_REPLY]`) to verify that weight updates were successfully applied.
- **Hardware:** NVIDIA Tesla P100 (16GB VRAM)
- **Batch Size:** 8 (Gradient Accumulation: 1)
- **Sequence Length:** 512
- **Steps:** 625 (1 Epoch over 5k examples)
- **Optimizer:** AdamW (lr=1e-4)
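The step count follows directly from the dataset size and batch configuration above:

```python
# Sanity-check the step count: 5,000 examples, batch size 8, grad accum 1.
examples = 5_000
batch_size = 8
grad_accum = 1
steps_per_epoch = examples // (batch_size * grad_accum)
print(steps_per_epoch)  # -> 625
```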
### Memory Efficiency
| Metric | Standard PEFT/LoRA | **LEMA (This Run)** |
| :--- | :--- | :--- |
| **Peak VRAM** | OOM | **6.36 GB** |
| **System RAM** | N/A (run OOMs) | **2.40 GB** |
*Note: Standard PEFT typically OOMs at Batch Size 4-8 on 16GB cards with 512 context. LEMA held steady at <7GB.*
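For context, the LoRA adapter itself is tiny; the VRAM savings come from streaming the frozen base weights, not from the adapter. The back-of-the-envelope estimate below makes assumptions that are *not* taken from this repo's config (target modules `q_proj`/`v_proj` only, hidden size 4096, 32 decoder layers) and uses the Rank 16 setting from above:

```python
# Rough LoRA adapter size estimate for Llama-2-7B. Assumed (not from this
# repo's config): targets q_proj and v_proj only, hidden size 4096,
# 32 decoder layers, FP16 storage. Rank 16 is from the training config.
rank = 16
hidden = 4096
layers = 32
targets_per_layer = 2  # q_proj, v_proj (assumed)

# Each adapted 4096x4096 projection adds A (rank x hidden) + B (hidden x rank)
params_per_target = rank * hidden + hidden * rank
total_params = params_per_target * targets_per_layer * layers
megabytes_fp16 = total_params * 2 / 1024**2  # 2 bytes per FP16 value

print(total_params)           # -> 8388608 trainable parameters (~8.4M)
print(round(megabytes_fp16))  # -> 16 (MB in FP16)
```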
## Training Logs
The training loss converged smoothly, demonstrating stable learning despite the layer-wise streaming architecture.
```
Step 10/625 | Loss: 2.1732 | VRAM: 6.36GB
Step 100/625 | Loss: 0.0677 | VRAM: 6.36GB
Step 200/625 | Loss: 0.0462 | VRAM: 6.36GB
Step 300/625 | Loss: 0.0407 | VRAM: 6.36GB
Step 400/625 | Loss: 0.0412 | VRAM: 6.36GB
Step 500/625 | Loss: 0.0459 | VRAM: 6.36GB
Step 600/625 | Loss: 0.0406 | VRAM: 6.36GB
Final Step | Training Complete
```
## Derived Metrics
* Total Training Time: 5h 40m
* Average Step Time: 32.23s
* Peak VRAM: 6.36GB (stable)
* Peak RAM: 2.52GB
Full raw logs are available [here](training_logs.txt).
## Limitations & Known Issues
**⚠️ Warning: Experimental Proof-of-Concept**
This model was trained for only **1 epoch** as a mechanical stress test of the LEMA library. While it successfully learned the new vocabulary and special tags, it has not yet mastered the logical structure or grammar of the custom template.
- **Token Looping:** The model may repeat tags like `[LEMA_REPLY]` multiple times in a loop.
- **Hallucinations:** It may invent creative definitions for terms it hasn't seen in its original pre-training (e.g., hallucinating an acronym for LEMA).
- **Overfitting:** Due to the small, highly repetitive synthetic dataset and 1-epoch training, the model is likely overfit to the specific examples provided.
- **Template Grammar:** It often skips the `Explanation:` and `Confidence:` fields.
To achieve production-grade results and make the model usable for general tasks, training for 3-5 epochs on a much larger, more diverse dataset (50k+ examples) is recommended.
## Usage
This model uses a custom prompt format for testing purposes:
```text
<|system|>
You are a precise assistant trained using LEMA.
<|user|>
What is LEMA?
<|assistant|>
[LEMA_REPLY]
Answer: ...
Explanation: ...
Confidence: High
[/LEMA_REPLY]
```
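A small helper can keep the template consistent when building prompts programmatically. `build_prompt` is a hypothetical convenience function, not part of this repo; it mirrors the format above:

```python
# Hypothetical helper (not part of this repo) that assembles the custom
# test-format prompt used by this model.
def build_prompt(system: str, user: str) -> str:
    return (
        f"<|system|>\n{system}\n\n"
        f"<|user|>\n{user}\n\n"
        "<|assistant|>\n[LEMA_REPLY]\nAnswer:"
    )

prompt = build_prompt(
    "You are a precise assistant trained using LEMA.",
    "What is LEMA?",
)
print(prompt)
```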
### Loading with Transformers
Since the LoRA adapter has been merged into the base model, you can load it as a standard Llama model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged model in FP16 and place it on the GPU
model = AutoModelForCausalLM.from_pretrained(
    "Pomilon/LEMA-llama-2-7b",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Pomilon/LEMA-llama-2-7b")

prompt = (
    "<|system|>\nYou are a precise assistant trained using LEMA.\n\n"
    "<|user|>\nWhat is LEMA?\n\n"
    "<|assistant|>\n[LEMA_REPLY]\nAnswer:"
)
# Send inputs to the same device the model was placed on
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## About LEMA
LEMA is an experimental framework designed to democratize LLM fine-tuning. It treats model weights as a stream of data rather than a static block, allowing models to be processed layer-by-layer. This trades computation time (latency) for massive memory savings.
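The time-for-memory trade can be sketched in a few lines. This is a conceptual illustration only (the layer and forward functions are toy stand-ins, not LEMA code): at any moment only one layer's weights are materialized, so peak memory is a single layer rather than the whole model, at the cost of repeatedly loading from slower storage.

```python
# Conceptual sketch of weights-as-a-stream processing (illustrative only).
def load_layer(i):
    return {"id": i, "weights": [i] * 4}  # stand-in for reading from disk

def apply_layer(layer, x):
    return x + sum(layer["weights"])      # stand-in for the real forward op

def streamed_forward(x, num_layers):
    for i in range(num_layers):
        layer = load_layer(i)  # materialize this layer only
        x = apply_layer(layer, x)
        del layer              # free it before loading the next
    return x

print(streamed_forward(0, 3))  # -> 12 (layers contribute 0 + 4 + 8)
```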
[Check out the GitHub Repository](https://github.com/Pomilon/LEMA)