🔋 Llama-3.1-8B Energy Document Classifier

A fine-tuned Llama-3.1-8B model for binary classification of energy-related documents, achieving 98.39% accuracy on test data.

This model uses LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, trained on 95,602 documents (perfectly balanced: 47,801 energy + 47,801 non-energy).

📊 Model Performance

Test Set Results (9,562 documents)

Metric         Score
Test Accuracy  98.39%
F1 Score       98.41%
Precision      97.17%
Recall         99.69%
ROC-AUC        99.76%

Validation Set Results (9,560 documents)

Metric         Score
Val Accuracy   98.55%
Val F1 Score   98.56%
Val Precision  97.54%
Val Recall     99.60%
Val ROC-AUC    99.76%

Confusion Matrix (Test Set - 9,562 documents)

                   Predicted Non-Energy  Predicted Energy
Actual Non-Energy  4,642 (97.09%)        139 (2.91%)
Actual Energy      15 (0.31%)            4,766 (99.69%)

Only 154 misclassifications out of 9,562 documents (1.61% error rate)!
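
As a sanity check, the headline metrics can be recomputed directly from the confusion matrix counts above:

tn, fp = 4642, 139   # actual non-energy row
fn, tp = 15, 4766    # actual energy row

total = tn + fp + fn + tp                            # 9,562
accuracy = (tp + tn) / total                         # 0.9839
precision = tp / (tp + fp)                           # 0.9717
recall = tp / (tp + fn)                              # 0.9969
f1 = 2 * precision * recall / (precision + recall)   # 0.9841

print(f"acc={accuracy:.4f} p={precision:.4f} r={recall:.4f} f1={f1:.4f}")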

Training Details

  • Base Model: meta-llama/Llama-3.1-8B
  • Training Method: LoRA (r=16, alpha=32, dropout=0.05); a peft config sketch follows this list
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Trainable Parameters: ~42M out of 8B (~0.52%)
  • Total Dataset: 95,602 documents (perfectly balanced)
    • Train: 76,480 (38,240 energy + 38,240 non-energy)
    • Val: 9,560 (4,780 energy + 4,780 non-energy)
    • Test: 9,562 (4,781 energy + 4,781 non-energy)
  • Energy Data Sources:
    • EnergyAI/finepdfs_energy (40,989 docs)
    • EnergyAI/wikipedia_energy (5,459 docs)
    • EnergyAI/eartharxiv_engrxiv_energy (27 docs)
    • EnergyAI/scored_chunks_from_SPE_pipeline (1,326 docs)
  • Training Time: ~1.5 hours on 4× A100 80GB GPUs
  • Convergence: Early stopping at step 1,100 (less than 1 epoch)
  • Effective Batch Size: 64 (per_device=4, gradient_accum=4, 4 GPUs)
  • Learning Rate: 2e-5 with cosine schedule and 10% warmup
  • Precision: bfloat16 mixed precision
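
For reference, the LoRA setup above maps to a peft configuration along these lines (a minimal sketch reconstructed from the stated hyperparameters, not the original training script):

import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # binary sequence classification
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
)

base = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.1-8B", num_labels=2, torch_dtype=torch.bfloat16
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # ~42M trainable (~0.52%)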

Data Curation

Energy-labeled documents were sourced from four HuggingFace datasets (see above). Non-energy documents were sampled from a base document pipeline, with deduplication to ensure no overlap with energy documents (validated by both document ID and MD5 hash matching).
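
A minimal sketch of that dedup step (field names like "id" and "text" are assumptions; the actual pipeline is not published here):

import hashlib

def md5_of(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Toy stand-ins for the real corpora
energy_docs = [{"id": "e1", "text": "Solar capacity grew rapidly."}]
candidates = [
    {"id": "e1", "text": "Solar capacity grew rapidly."},    # dropped: ID match
    {"id": "n1", "text": "The team won the championship."},  # kept
]

energy_ids = {d["id"] for d in energy_docs}
energy_hashes = {md5_of(d["text"]) for d in energy_docs}

non_energy = [
    d for d in candidates
    if d["id"] not in energy_ids and md5_of(d["text"]) not in energy_hashes
]
print(len(non_energy))  # 1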

🎯 Use Cases

This model can classify documents as energy-related or non-energy. Perfect for:

  • 📚 Research paper categorization
  • 📰 News article filtering
  • 📄 Document management systems
  • 🔍 Content discovery and recommendation
  • 🗂️ Dataset curation for energy research

Energy Topics Covered:

  • Oil & Gas
  • Renewable Energy (Solar, Wind, Hydro, Geothermal)
  • Electricity & Power Systems
  • Nuclear Energy
  • Energy Policy & Economics
  • Carbon & Climate (energy-related aspects)
  • Energy Storage & Batteries

🚀 Quick Start

Installation

pip install "torch>=2.0.0" "transformers>=4.44.0" "peft>=0.12.0" "accelerate>=0.28.0"

⚠️ Version Requirements:

Package       Recommended Version  Notes
torch         ≥2.0.0               CUDA support recommended
transformers  4.44.0 - 4.57.x      Tested range
peft          0.12.0 - 0.18.x      LoRA adapter loading
accelerate    ≥0.28.0              For device_map="auto"
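
To confirm your environment falls within the tested ranges, something like:

import torch, transformers, peft, accelerate

for name, mod in [("torch", torch), ("transformers", transformers),
                  ("peft", peft), ("accelerate", accelerate)]:
    print(f"{name}=={mod.__version__}")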

Note: The tokenizer is loaded from the base model (meta-llama/Llama-3.1-8B). You need access to Llama-3.1 on Hugging Face (requires accepting the license agreement).

Basic Usage

from peft import AutoPeftModelForSequenceClassification
from transformers import AutoTokenizer
import torch

# Load model and tokenizer
model_name = "omar-elmansouri/Llama-3.1-8B-Energy-Classifier"

# Load tokenizer from base model (recommended for compatibility)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the PEFT adapter model
model = AutoPeftModelForSequenceClassification.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.config.pad_token_id = tokenizer.pad_token_id
model.eval()

# Classify a document
text = """
Solar energy capacity has grown exponentially over the past decade. 
The International Energy Agency reports that solar now represents 
the fastest-growing renewable energy source globally.
"""


inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()

label_map = {0: "non_energy", 1: "energy"}
print(f"Prediction: {label_map[predicted_class]}")
print(f"Confidence: {predictions[0][predicted_class].item():.4f}")
print(f"Probabilities: Energy={predictions[0][1].item():.4f}, Non-Energy={predictions[0][0].item():.4f}")

Output:

Prediction: energy
Confidence: 0.9987
Probabilities: Energy=0.9987, Non-Energy=0.0013

Batch Processing

import torch
from transformers import pipeline

# Create classification pipeline
classifier = pipeline(
    "text-classification",
    model="EnergyAI/Llama-3.1-8B-Energy-Classifier",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Classify multiple documents
texts = [
    "Wind turbines are becoming more efficient with larger blade designs.",
    "The software development team completed the sprint planning meeting.",
    "Natural gas prices fluctuated amid geopolitical tensions in Europe.",
]

results = classifier(texts, truncation=True, max_length=1024)
for text, result in zip(texts, results):
    print(f"Text: {text[:50]}...")
    print(f"Label: {result['label']}, Score: {result['score']:.4f}\n")

Using PEFT for Efficient Loading

from peft import AutoPeftModelForSequenceClassification
from transformers import AutoTokenizer
import torch

# Load with PEFT (more memory efficient)
model = AutoPeftModelForSequenceClassification.from_pretrained(
    "EnergyAI/Llama-3.1-8B-Energy-Classifier",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("EnergyAI/Llama-3.1-8B-Energy-Classifier")

# Classify
text = "Offshore wind farms are expanding along the Atlantic coast."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
print(f"Energy probability: {probs[0][1].item():.4f}")

πŸ—οΈ Model Architecture

Base Model

  • Name: meta-llama/Llama-3.1-8B
  • Parameters: 8 Billion
  • Architecture: Transformer-based causal language model
  • Context Length: 128K tokens (using 1024 for classification)

Fine-tuning Details

  • Method: LoRA (Low-Rank Adaptation)
  • LoRA Rank (r): 16
  • LoRA Alpha: 32
  • LoRA Dropout: 0.05
  • Target Modules:
    • q_proj, k_proj, v_proj, o_proj (Attention)
    • gate_proj, up_proj, down_proj (MLP)
  • Trainable Parameters: 41,951,232 (~0.52% of base model)
  • Task Type: Sequence Classification (Binary)

📚 Training Details

Training Data

  • Training Samples: 76,480 documents
  • Validation Samples: 9,560 documents
  • Test Samples: 9,562 documents
  • Total: 95,602 labeled documents
  • Class Distribution: Balanced (50/50 energy/non-energy)

Training Configuration

  • Epochs: up to 3, early stopped at step 1,100 (before the end of epoch 1)
  • Batch Size: 4 per device × 4 GPUs × 4 gradient accumulation = 64 effective
  • Learning Rate: 2e-5 (cosine schedule with 10% warmup)
  • Optimizer: AdamW (fused)
  • Weight Decay: 0.01
  • Max Gradient Norm: 1.0
  • Mixed Precision: bfloat16
  • Training Time: ~93 minutes on 4× A100 80GB GPUs; a TrainingArguments sketch follows
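
Expressed as HuggingFace TrainingArguments, the configuration above looks roughly like this (a sketch; output_dir and anything not listed above are assumptions):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama31-energy-classifier",  # assumed name
    num_train_epochs=3,
    per_device_train_batch_size=4,  # x 4 GPUs x 4 accumulation = 64 effective
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch_fused",
    weight_decay=0.01,
    max_grad_norm=1.0,
    bf16=True,
)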

Hardware

  • GPUs: 4× NVIDIA A100 80GB
  • Memory: 320 GB GPU memory total
  • Framework: PyTorch with HuggingFace Transformers & PEFT

Training Metrics Evolution

Step   Train Loss  Val Loss  Val Accuracy  Val F1
100    4.31        1.00      65.2%         66.8%
300    1.78        0.83      80.0%         80.4%
500    0.73        0.70      89.6%         89.7%
700    0.42        0.65      93.9%         94.0%
900    0.27        0.64      96.1%         96.1%
1,100  0.21        0.63      98.2%         98.2%

💡 How It Works

Label Definitions

  • Label 0 (non_energy): Documents that are NOT primarily about energy topics

    • Examples: General news, politics (non-energy), sports, culture, software, education
  • Label 1 (energy): Documents primarily discussing energy-related topics

    • Examples:
      • "Solar panel efficiency reached new record highs..."
      • "OPEC announced production cuts affecting oil prices..."
      • "Nuclear reactor designs promise safer, cleaner energy..."
      • "Wind energy capacity doubled in the last five years..."

Classification Process

  1. Input: Document text (up to 1024 tokens)
  2. Tokenization: Llama-3.1 tokenizer with left padding (see the note after this list)
  3. Model Forward Pass: Through the LoRA-adapted Llama-3.1-8B
  4. Output: Binary logits → softmax probabilities
  5. Prediction: Class with highest probability + confidence score
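
One practical note on step 2: Llama's tokenizer ships without a pad token, and decoder-only classifiers pool the last non-pad position, so when batching it is safest to set padding explicitly (a small sketch; the single-text examples above work without it):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.padding_side = "left"  # match the training-time padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# The model must also know the pad token id so it can locate the
# last real token in each padded sequence:
# model.config.pad_token_id = tokenizer.pad_token_id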

📦 Model Files

This repository contains:

  • adapter_config.json - LoRA adapter configuration
  • adapter_model.safetensors - Trained LoRA weights (161 MB)
  • tokenizer.json - Tokenizer vocabulary
  • tokenizer_config.json - Tokenizer configuration
  • special_tokens_map.json - Special tokens mapping

Note: The base Llama-3.1-8B model will be downloaded automatically from HuggingFace.

🔧 Advanced Usage

Custom Inference Script

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from typing import List, Dict

class EnergyClassifier:
    def __init__(self, model_name: str = "EnergyAI/Llama-3.1-8B-Energy-Classifier"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Llama tokenizers ship without a pad token; batched padding needs one
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name,
            device_map="auto",
            torch_dtype=torch.bfloat16,
        )
        self.model.config.pad_token_id = self.tokenizer.pad_token_id
        self.model.eval()
        self.label_map = {0: "non_energy", 1: "energy"}
    
    @torch.no_grad()
    def predict(self, text: str, return_probs: bool = True) -> Dict:
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=1024,
            padding=True,
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        
        outputs = self.model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(probs, dim=-1).item()
        
        result = {
            "label": self.label_map[predicted_class],
            "confidence": probs[0][predicted_class].item(),
        }
        
        if return_probs:
            result["probabilities"] = {
                "non_energy": probs[0][0].item(),
                "energy": probs[0][1].item(),
            }
        
        return result
    
    @torch.no_grad()
    def predict_batch(self, texts: List[str], batch_size: int = 8) -> List[Dict]:
        results = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            inputs = self.tokenizer(
                batch,
                return_tensors="pt",
                truncation=True,
                max_length=1024,
                padding=True,
            )
            inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
            
            outputs = self.model(**inputs)
            probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
            
            for j in range(len(batch)):
                pred_class = torch.argmax(probs[j]).item()
                results.append({
                    "label": self.label_map[pred_class],
                    "confidence": probs[j][pred_class].item(),
                    "probabilities": {
                        "non_energy": probs[j][0].item(),
                        "energy": probs[j][1].item(),
                    }
                })
        
        return results

# Usage
classifier = EnergyClassifier()
result = classifier.predict("Wind energy is the fastest growing renewable source.")
print(result)

Processing Large Files

import json
from tqdm import tqdm

def classify_jsonl_file(input_file: str, output_file: str):
    classifier = EnergyClassifier()
    
    # Read all texts
    texts = []
    with open(input_file, 'r') as f:
        for line in f:
            data = json.loads(line)
            texts.append(data['text'])
    
    # Classify in batches
    results = classifier.predict_batch(texts, batch_size=16)
    
    # Write results
    with open(input_file, 'r') as fin, open(output_file, 'w') as fout:
        for line, result in tqdm(zip(fin, results), total=len(texts)):
            data = json.loads(line)
            data['predicted_label'] = result['label']
            data['confidence'] = result['confidence']
            data['energy_prob'] = result['probabilities']['energy']
            fout.write(json.dumps(data) + '\n')

# Process your dataset
classify_jsonl_file('documents.jsonl', 'documents_classified.jsonl')

🎓 Training Code

The training code is available at https://github.com/EnergyAI/energy-classifier.

To reproduce the training:

# Clone repository
git clone https://github.com/EnergyAI/energy-classifier.git
cd energy-classifier

# Install dependencies
pip install -r requirements.txt

# Prepare your data (train.jsonl, val.jsonl, test.jsonl)
# Format: {"text": "document text", "label": 0 or 1}

# Train
python train.py --config configs/training_config.yaml

📋 Requirements

torch>=2.0.0
transformers>=4.44.0
peft>=0.12.0
accelerate>=0.28.0
safetensors>=0.4.0

⚡ Performance Tips

For Maximum Speed:

# fp16 can be faster than bfloat16 on some GPUs
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "EnergyAI/Llama-3.1-8B-Energy-Classifier",
    device_map="auto",
    torch_dtype=torch.float16,  # faster on some GPUs
)

# Enable torch.compile (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")
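
If you have memory to spare, you can also merge the LoRA weights into the base model once, so inference skips the adapter indirection (a sketch using peft's merge_and_unload):

import torch
from peft import AutoPeftModelForSequenceClassification

peft_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "EnergyAI/Llama-3.1-8B-Energy-Classifier",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = peft_model.merge_and_unload()  # plain model, no adapter overhead
model.eval()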

For Lower Memory:

# 8-bit quantization (requires the bitsandbytes package)
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "EnergyAI/Llama-3.1-8B-Energy-Classifier",
    quantization_config=quantization_config,
    device_map="auto",
)

📊 Benchmark Results

Inference Speed (on NVIDIA A100 GPU)

Batch Size  Throughput (docs/sec)  Latency (ms/doc)
1           12.3                   81.3
8           78.4                   12.8
16          134.7                  7.4
32          198.5                  5.0
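
Throughput depends heavily on GPU, sequence length, and batch size, so treat the table above as indicative. A minimal sketch for measuring docs/sec on your own hardware (assumes the model and tokenizer from Basic Usage are already loaded on a CUDA device):

import time
import torch

def throughput(model, tokenizer, text, batch_size=16, iters=10):
    inputs = tokenizer([text] * batch_size, return_tensors="pt",
                       truncation=True, max_length=1024, padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        model(**inputs)                 # warmup
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(**inputs)
        torch.cuda.synchronize()
    return batch_size * iters / (time.perf_counter() - start)

print(f"{throughput(model, tokenizer, 'Grid storage is expanding.'):.1f} docs/sec")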

Memory Usage

  • Model Size: 161 MB (LoRA adapters only)
  • Peak GPU Memory (bf16): ~18 GB (includes base model)
  • Peak GPU Memory (8-bit): ~10 GB

🤝 Citation

If you use this model in your research, please cite:

@misc{llama31-energy-classifier,
  author = {EnergyAI Team},
  title = {Llama-3.1-8B Energy Document Classifier},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/EnergyAI/Llama-3.1-8B-Energy-Classifier}},
}

📄 License

This model is released under the Llama 3.1 Community License Agreement. See the license for details.

Base Model: meta-llama/Llama-3.1-8B

πŸ™ Acknowledgments

  • Meta AI for the Llama 3.1 base model
  • HuggingFace for the transformers and PEFT libraries
  • Research team for dataset curation and annotation

Happy Classifying! 🔋⚡
