🔋 Llama-3.1-8B Energy Document Classifier

A fine-tuned Llama-3.1-8B model for binary classification of energy-related documents, achieving 98.39% accuracy on test data.

This model uses LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, trained on 95,602 documents (perfectly balanced: 47,801 energy + 47,801 non-energy).

📊 Model Performance

Test Set Results (9,562 documents)

Metric	Score
Test Accuracy	98.39%
F1 Score	98.41%
Precision	97.17%
Recall	99.69%
ROC-AUC	99.76%

Validation Set Results (9,560 documents)

Metric	Score
Val Accuracy	98.55%
Val F1 Score	98.56%
Val Precision	97.54%
Val Recall	99.60%
Val ROC-AUC	99.76%

Confusion Matrix (Test Set - 9,562 documents)

	Predicted Non-Energy	Predicted Energy
Actual Non-Energy	4,642 (97.09%)	139 (2.91%)
Actual Energy	15 (0.31%)	4,766 (99.69%)

Only 154 misclassifications out of 9,562 documents (1.61% error rate)!
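
The headline metrics can be recomputed directly from the confusion matrix. The short check below uses only the four counts in the table above, treating energy as the positive class; the variable names are illustrative.

```python
# Counts taken from the test-set confusion matrix above.
tn, fp = 4642, 139   # actual non-energy: predicted non-energy / predicted energy
fn, tp = 15, 4766    # actual energy: predicted non-energy / predicted energy

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # share of "energy" predictions that are correct
recall = tp / (tp + fn)             # share of energy documents that are found
f1 = 2 * precision * recall / (precision + recall)
error_rate = (fp + fn) / total

print(f"accuracy={accuracy:.2%} precision={precision:.2%} "
      f"recall={recall:.2%} f1={f1:.2%} error={error_rate:.2%}")
# accuracy=98.39% precision=97.17% recall=99.69% f1=98.41% error=1.61%
```

Most of the 154 errors are false positives (139), which is why precision is lower than recall.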

Training Details

Base Model: meta-llama/Llama-3.1-8B
Training Method: LoRA (r=16, alpha=32, dropout=0.05)
Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable Parameters: ~42M out of 8B (~0.52%)
Total Dataset: 95,602 documents (perfectly balanced)
- Train: 76,480 (38,240 energy + 38,240 non-energy)
- Val: 9,560 (4,780 energy + 4,780 non-energy)
- Test: 9,562 (4,781 energy + 4,781 non-energy)
Energy Data Sources:
- EnergyAI/finepdfs_energy (40,989 docs)
- EnergyAI/wikipedia_energy (5,459 docs)
- EnergyAI/eartharxiv_engrxiv_energy (27 docs)
- EnergyAI/scored_chunks_from_SPE_pipeline (1,326 docs)
Training Time: ~2 hours on 4× A100 80GB GPUs
Convergence: Early stopping at step 1,100 (< 1 epoch!)
Effective Batch Size: 64 (per_device=4, gradient_accum=4, 4 GPUs)
Learning Rate: 2e-5 with cosine schedule and 10% warmup
Precision: bfloat16 mixed precision
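
The effective batch size and the "< 1 epoch" remark follow from the numbers in this list. The arithmetic below is a rough check; it assumes one optimizer step per effective batch and no dropped remainder batch.

```python
train_docs = 76_480        # training split size from the list above
per_device_batch = 4
grad_accum_steps = 4
num_gpus = 4
stop_step = 1_100          # step at which early stopping triggered

# One optimizer step consumes this many documents across all GPUs.
effective_batch = per_device_batch * grad_accum_steps * num_gpus
steps_per_epoch = train_docs // effective_batch
epochs_completed = stop_step / steps_per_epoch

print(effective_batch)               # 64
print(steps_per_epoch)               # 1195
print(round(epochs_completed, 2))    # 0.92
```

Under these assumptions, training stopped after roughly 92% of a single pass over the training split.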

Data Curation

Energy-labeled documents were sourced from the four HuggingFace datasets listed above. Classification labels for the training data were generated with the Mistral 3 Large model, and this classifier was distilled from those labels. Non-energy documents were sampled from a base document pipeline and deduplicated against the energy documents, with overlap checked by both document ID and MD5 hash.
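
The overlap check described here can be sketched as follows. This is an illustration only, not the pipeline's actual code: the `id` and `text` field names and both helper functions are hypothetical.

```python
import hashlib

def md5_of(text: str) -> str:
    """Return the hex MD5 digest of a document's UTF-8 text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def drop_overlap(non_energy_docs, energy_docs):
    """Keep non-energy docs whose ID and content hash are both absent from the energy set."""
    energy_ids = {d["id"] for d in energy_docs}
    energy_hashes = {md5_of(d["text"]) for d in energy_docs}
    return [
        d for d in non_energy_docs
        if d["id"] not in energy_ids and md5_of(d["text"]) not in energy_hashes
    ]

# Toy data (hypothetical) showing both kinds of overlap.
energy = [{"id": "e1", "text": "Solar capacity grew last year."}]
candidates = [
    {"id": "e1", "text": "Different text, same ID."},        # dropped: ID match
    {"id": "n1", "text": "Solar capacity grew last year."},  # dropped: hash match
    {"id": "n2", "text": "The team won the final match."},   # kept
]
kept = drop_overlap(candidates, energy)
print([d["id"] for d in kept])  # ['n2']
```

An exact MD5 match only catches byte-identical text; near-duplicates would need a fuzzier method.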

🎯 Use Cases

This model can classify documents as energy-related or non-energy. Perfect for:

📚 Research paper categorization
📰 News article filtering
📄 Document management systems
🔍 Content discovery and recommendation
🗂️ Dataset curation for energy research

Energy Topics Covered:

Oil & Gas
Renewable Energy (Solar, Wind, Hydro, Geothermal)
Electricity & Power Systems
Nuclear Energy
Energy Policy & Economics
Carbon & Climate (energy-related aspects)
Energy Storage & Batteries

🚀 Quick Start

Installation

pip install "torch>=2.0.0" "transformers>=4.44.0" "peft>=0.12.0" "accelerate>=0.28.0"

⚠️ Version Requirements:

Package	Recommended Version	Notes
`torch`	≥2.0.0	CUDA support recommended
`transformers`	4.44.0 - 4.57.x	Tested range
`peft`	0.12.0 - 0.18.x	LoRA adapter loading
`accelerate`	≥0.28.0	For device_map="auto"

Note: The tokenizer is loaded from the base model (meta-llama/Llama-3.1-8B). You need access to Llama-3.1 on Hugging Face (requires accepting the license agreement).
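
To see whether an environment falls inside the ranges in the table above, a small standard-library check like the following may help. It is a sketch: the helper names are hypothetical, and the version parsing is deliberately simple (leading integers only), not a full PEP 440 parser.

```python
import re
from importlib import metadata
from typing import Optional

def parse(version: str) -> tuple:
    """Turn '4.44.0' or '2.1.0+cu121' into a tuple of leading integers."""
    parts = []
    for piece in version.split("+")[0].split("."):
        m = re.match(r"\d+", piece)
        if m is None:
            break
        parts.append(int(m.group()))
    return tuple(parts)

def in_range(installed: str, minimum: str, max_minor: Optional[str] = None) -> bool:
    """True if installed >= minimum and, when given, its major.minor <= max_minor."""
    v = parse(installed)
    if v < parse(minimum):
        return False
    if max_minor is not None and v[:2] > parse(max_minor):
        return False
    return True

# (minimum, highest tested major.minor) copied from the table above.
TESTED = {
    "torch": ("2.0.0", None),
    "transformers": ("4.44.0", "4.57"),
    "peft": ("0.12.0", "0.18"),
    "accelerate": ("0.28.0", None),
}

def report() -> None:
    """Print the installed version of each package and whether it is in range."""
    for name, (lowest, highest) in TESTED.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed")
            continue
        status = "ok" if in_range(installed, lowest, highest) else "outside tested range"
        print(f"{name} {installed}: {status}")

report()
```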

Basic Usage

from peft import AutoPeftModelForSequenceClassification
from transformers import AutoTokenizer
import torch

# Load model and tokenizer
model_name = "omar-elmansouri/Llama-3.1-8B-Energy-Classifier"

# Load tokenizer from base model (recommended for compatibility)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the PEFT adapter model
model = AutoPeftModelForSequenceClassification.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.config.pad_token_id = tokenizer.pad_token_id
model.eval()

# Classify a document
text = """
Solar energy capacity has grown exponentially over the past decade. 
The International Energy Agency reports that solar now represents 
the fastest-growing renewable energy source globally.
"""


inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()

label_map = {0: "non_energy", 1: "energy"}
print(f"Prediction: {label_map[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class].item():.4f}")
print(f"Probabilities: Energy={predictions[0][1].item():.4f}, Non-Energy={predictions[0][0].item():.4f}")

Output:

Prediction: energy
Confidence: 0.9987
Probabilities: Energy=0.9987, Non-Energy=0.0013

Batch Processing

import torch
from transformers import pipeline

# Create classification pipeline
classifier = pipeline(
    "text-classification",
    model="EnergyAI/Llama-3.1-8B-Energy-Classifier",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Classify multiple documents
texts = [
    "Wind turbines are becoming more efficient with larger blade designs.",
    "The software development team completed the sprint planning meeting.",
    "Natural gas prices fluctuated amid geopolitical tensions in Europe.",
]

results = classifier(texts, truncation=True, max_length=512)
for text, result in zip(texts, results):
    print(f"Text: {text[:50]}...")
    print(f"Label: {result['label']}, Score: {result['score']:.4f}\n")
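
Depending on how the repository's config is set up, the pipeline may return the generic `LABEL_0` / `LABEL_1` names (the Transformers default when no `id2label` mapping is defined) rather than `non_energy` / `energy`. The helper below is a hypothetical post-processing step that normalizes either form. It assumes the pipeline score is the softmax probability of the returned label, so the other class has probability `1 - score`.

```python
# Mapping from generic pipeline labels to the label names used in this card.
LABEL_NAMES = {"LABEL_0": "non_energy", "LABEL_1": "energy"}

def normalize(result: dict) -> dict:
    """Convert one pipeline result into a named label plus an energy probability."""
    label = LABEL_NAMES.get(result["label"], result["label"])
    score = result["score"]
    # Binary softmax: the two class probabilities sum to 1.
    energy_prob = score if label == "energy" else 1.0 - score
    return {"label": label, "energy_prob": energy_prob}

# Hypothetical pipeline outputs, for illustration only.
sample_results = [
    {"label": "LABEL_1", "score": 0.99},
    {"label": "non_energy", "score": 0.97},
]
for item in sample_results:
    print(normalize(item))
```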

Using PEFT for Efficient Loading

from peft import AutoPeftModelForSequenceClassification
from transformers import AutoTokenizer
import torch

# Load with PEFT (more memory efficient)
model = AutoPeftModelForSequenceClassification.from_pretrained(
    "EnergyAI/Llama-3.1-8B-Energy-Classifier",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("EnergyAI/Llama-3.1-8B-Energy-Classifier")

# Classify
text = "Offshore wind farms are expanding along the Atlantic coast."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
print(f"Energy probability: {probs[0][1].item():.4f}")

🏗️ Model Architecture

Base Model

Name: meta-llama/Llama-3.1-8B
Parameters: 8 Billion
Architecture: Transformer-based causal language model
Context Length: 128K tokens (using 1024 for classification)

Fine-tuning Details

Method: LoRA (Low-Rank Adaptation)
LoRA Rank (r): 16
LoRA Alpha: 32
LoRA Dropout: 0.05
Target Modules:
- q_proj, k_proj, v_proj, o_proj (Attention)
- gate_proj, up_proj, down_proj (MLP)
Trainable Parameters: 41,951,232 (~0.52% of base model)
Task Type: Sequence Classification (Binary)
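
The trainable parameter count can be reproduced from the LoRA settings above. The sketch below uses the layer dimensions from the published Llama-3.1-8B config (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128) and assumes the 2-class classification head is trained in full without a bias term.

```python
# Llama-3.1-8B dimensions (from the published model config).
hidden, intermediate, layers = 4096, 14336, 32
kv_dim = 8 * 128          # 8 key/value heads x head dimension 128
r = 16                    # LoRA rank

# (in_features, out_features) of each adapted projection.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# Each adapted projection adds two matrices: A (r x in) and B (out x r).
per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
lora_params = per_layer * layers
head_params = hidden * 2  # assumption: bias-free linear head with 2 outputs
total = lora_params + head_params

print(per_layer, lora_params, total)  # 1310720 41943040 41951232
```

Under these assumptions the total matches the 41,951,232 figure reported above.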

📚 Training Details

Training Data

Training Samples: 76,480 documents
Validation Samples: 9,560 documents
Test Samples: 9,562 documents
Total: 95,602 labeled documents
Class Distribution: Balanced (50/50 energy/non-energy)

Training Configuration

Epochs: 3 configured (early stopped at step 1,100, before the end of the first epoch)
Batch Size: 4 per device × 4 GPUs × 4 gradient accumulation = 64 effective
Learning Rate: 2e-5 (cosine schedule with 10% warmup)
Optimizer: AdamW (fused)
Weight Decay: 0.01
Max Gradient Norm: 1.0
Mixed Precision: bfloat16
Training Time: ~93 minutes on 4× GPUs
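
The learning-rate schedule in this configuration can be illustrated in a few lines. This is a sketch assuming linear warmup from zero followed by a half-cosine decay to zero, the usual shape of a cosine schedule with warmup. The total step count used here is hypothetical, not the value from the actual run.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 2e-5,
          warmup_frac: float = 0.10) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 1000  # hypothetical schedule length for illustration
for s in (0, 50, 100, 550, 1000):
    print(s, f"{lr_at(s, TOTAL_STEPS):.2e}")
```

The rate peaks at 2e-5 when warmup ends (step 100 in this example) and falls to half the peak midway through the decay phase.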

Hardware

GPUs: 4× NVIDIA (high-memory GPUs)
Memory: 200GB total
Framework: PyTorch with HuggingFace Transformers & PEFT

Training Metrics Evolution

Step	Train Loss	Val Loss	Val Accuracy	Val F1
100	4.31	1.00	65.2%	66.8%
300	1.78	0.83	80.0%	80.4%
500	0.73	0.70	89.6%	89.7%
700	0.42	0.65	93.9%	94.0%
900	0.27	0.64	96.1%	96.1%
1100	0.21	0.63	98.2%	98.2%

💡 How It Works

Label Definitions

Label 0 (non_energy): Documents that are NOT primarily about energy topics
- Examples: General news, politics (non-energy), sports, culture, software, education
Label 1 (energy): Documents primarily discussing energy-related topics
- Examples:
  - "Solar panel efficiency reached new record highs..."
  - "OPEC announced production cuts affecting oil prices..."
  - "Nuclear reactor designs promise safer, cleaner energy..."
  - "Wind energy capacity doubled in the last five years..."

Classification Process

Input: Document text (up to 1024 tokens)
Tokenization: Llama-3.1 tokenizer with left padding
Model Forward Pass: Through LoRA-adapted Llama-3.1-8B
Output: Binary logits → softmax probabilities
Prediction: Class with highest probability + confidence score
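
The last two steps (logits to probabilities to prediction) can be sketched without any ML dependencies. The logits below are made-up values for illustration, and the function names are hypothetical.

```python
import math

LABELS = {0: "non_energy", 1: "energy"}

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decide(logits):
    """Pick the class with the highest probability and report its confidence."""
    probs = softmax(logits)
    label_id = max(range(len(probs)), key=probs.__getitem__)
    return {
        "label": LABELS[label_id],
        "confidence": probs[label_id],
        "energy_prob": probs[1],
    }

# Hypothetical logits in the order [non_energy, energy].
print(decide([-2.0, 4.0]))
```

The confidence score is simply the softmax probability of the winning class, so it always lies between 0.5 and 1.0 for a binary classifier.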

📦 Model Files

This repository contains:

adapter_config.json - LoRA adapter configuration
adapter_model.safetensors - Trained LoRA weights (161 MB)
tokenizer.json - Tokenizer vocabulary
tokenizer_config.json - Tokenizer configuration
special_tokens_map.json - Special tokens mapping

Note: The base Llama-3.1-8B model will be downloaded automatically from HuggingFace.

🔧 Advanced Usage

Custom Inference Script

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from typing import List, Dict

class EnergyClassifier:
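    def __init__(self, model_name: str = "EnergyAI/Llama-3.1-8B-Energy-Classifier"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Llama tokenizers may ship without a pad token; padding=True needs one
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name,
            device_map="auto",
            torch_dtype=torch.bfloat16,
        )
        # The model needs pad_token_id to locate the last real token in padded batches
        self.model.config.pad_token_id = self.tokenizer.pad_token_id
        self.model.eval()
        self.label_map = {0: "non_energy", 1: "energy"}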
    
    @torch.no_grad()
    def predict(self, text: str, return_probs: bool = True) -> Dict:
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=1024,
            padding=True,
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        
        outputs = self.model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(probs, dim=-1).item()
        
        result = {
            "label": self.label_map[predicted_class],
            "confidence": probs[0][predicted_class].item(),
        }
        
        if return_probs:
            result["probabilities"] = {
                "non_energy": probs[0][0].item(),
                "energy": probs[0][1].item(),
            }
        
        return result
    
    @torch.no_grad()
    def predict_batch(self, texts: List[str], batch_size: int = 8) -> List[Dict]:
        results = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            inputs = self.tokenizer(
                batch,
                return_tensors="pt",
                truncation=True,
                max_length=1024,
                padding=True,
            )
            inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
            
            outputs = self.model(**inputs)
            probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
            
            for j in range(len(batch)):
                pred_class = torch.argmax(probs[j]).item()
                results.append({
                    "label": self.label_map[pred_class],
                    "confidence": probs[j][pred_class].item(),
                    "probabilities": {
                        "non_energy": probs[j][0].item(),
                        "energy": probs[j][1].item(),
                    }
                })
        
        return results

# Usage
classifier = EnergyClassifier()
result = classifier.predict("Wind energy is the fastest growing renewable source.")
print(result)

Processing Large Files

import json
from tqdm import tqdm

def classify_jsonl_file(input_file: str, output_file: str):
    # Relies on the EnergyClassifier class from the Custom Inference Script above
    classifier = EnergyClassifier()
    
    # Read and parse every record once, skipping blank lines
    records = []
    with open(input_file, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    
    # Classify in batches
    texts = [record['text'] for record in records]
    results = classifier.predict_batch(texts, batch_size=16)
    
    # Write each record back out with its prediction attached
    with open(output_file, 'w', encoding='utf-8') as fout:
        for data, result in tqdm(zip(records, results), total=len(records)):
            data['predicted_label'] = result['label']
            data['confidence'] = result['confidence']
            data['energy_prob'] = result['probabilities']['energy']
            fout.write(json.dumps(data) + '\n')

# Process your dataset
classify_jsonl_file('documents.jsonl', 'documents_classified.jsonl')
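
The function above holds every document in memory before classifying. For files too large for that, one option is to stream the JSONL file in fixed-size batches. The sketch below is a hypothetical variant: `iter_batches` and `classify_stream` are illustrative names, and `predict_batch` is any callable that maps a list of texts to a list of result dicts (for example, a thin wrapper around `EnergyClassifier.predict_batch`).

```python
import json
from typing import Callable, Dict, Iterator, List

def iter_batches(path: str, batch_size: int) -> Iterator[List[Dict]]:
    """Yield lists of parsed JSONL records, at most batch_size at a time."""
    batch: List[Dict] = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final partial batch

def classify_stream(input_file: str, output_file: str,
                    predict_batch: Callable[[List[str]], List[Dict]],
                    batch_size: int = 16) -> int:
    """Classify a JSONL file batch by batch; return the number of records written."""
    written = 0
    with open(output_file, "w", encoding="utf-8") as fout:
        for records in iter_batches(input_file, batch_size):
            results = predict_batch([r["text"] for r in records])
            for record, result in zip(records, results):
                record["predicted_label"] = result["label"]
                record["confidence"] = result["confidence"]
                fout.write(json.dumps(record) + "\n")
                written += 1
    return written
```

With the classifier class defined earlier, a call might look like `classify_stream('documents.jsonl', 'out.jsonl', lambda texts: classifier.predict_batch(texts, batch_size=16))`. Only one batch of records is held in memory at a time.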

🎓 Training Code

The training code is available at: https://github.com/EnergyAI/energy-classifier

To reproduce the training:

# Clone repository
git clone https://github.com/EnergyAI/energy-classifier.git
cd energy-classifier

# Install dependencies
pip install -r requirements.txt

# Prepare your data (train.jsonl, val.jsonl, test.jsonl)
# Format: {"text": "document text", "label": 0 or 1}

# Train
python train.py --config configs/training_config.yaml

📋 Requirements

torch>=2.0.0
transformers>=4.44.0
peft>=0.12.0
accelerate>=0.28.0
safetensors>=0.4.0

⚡ Performance Tips

For Maximum Speed:

import torch
from transformers import AutoModelForSequenceClassification

# Use fp16 instead of bfloat16 if your GPU supports it
model = AutoModelForSequenceClassification.from_pretrained(
    "EnergyAI/Llama-3.1-8B-Energy-Classifier",
    device_map="auto",
    torch_dtype=torch.float16,  # Faster on some GPUs
)

# Enable torch.compile (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")

For Lower Memory:

# Use 8-bit quantization (requires the bitsandbytes package)
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "EnergyAI/Llama-3.1-8B-Energy-Classifier",
    quantization_config=quantization_config,
    device_map="auto",
)

📊 Benchmark Results

Inference Speed (on NVIDIA A100 GPU)

Batch Size	Throughput (docs/sec)	Latency (ms/doc)
1	12.3	81.3
8	78.4	10.2
16	134.7	5.9
32	198.5	3.2

Memory Usage

Model Size: 161 MB (LoRA adapters only)
Peak GPU Memory (bf16): ~18 GB (includes base model)
Peak GPU Memory (8-bit): ~10 GB
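
These figures are roughly what a back-of-the-envelope weight-size estimate predicts. The calculation below is a sketch: it rounds the base model to 8 billion parameters, assumes the adapter weights are stored in float32, and ignores activations, buffers, and framework overhead, which account for the gap between weight size and peak usage.

```python
base_params = 8_000_000_000    # "8 Billion" from this card, rounded
adapter_params = 41_951_232    # trainable parameters reported above

BF16_BYTES, INT8_BYTES, FP32_BYTES = 2, 1, 4

base_bf16_gb = base_params * BF16_BYTES / 1e9      # base weights in bfloat16
base_int8_gb = base_params * INT8_BYTES / 1e9      # base weights in 8-bit
adapter_mib = adapter_params * FP32_BYTES / 2**20  # adapter file, assuming float32

print(f"base bf16: {base_bf16_gb:.1f} GB")   # base bf16: 16.0 GB
print(f"base int8: {base_int8_gb:.1f} GB")   # base int8: 8.0 GB
print(f"adapter:   {adapter_mib:.1f} MiB")   # adapter:   160.0 MiB
```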

🤝 Citation

If you use this model in your research, please cite:

@misc{llama31-energy-classifier,
  author = {EnergyAI Team},
  title = {Llama-3.1-8B Energy Document Classifier},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EnergyAI/Llama-3.1-8B-Energy-Classifier}},
}