
Cloud Log Classifier using CodeBERT

This project implements a cloud platform log classifier using a fine-tuned CodeBERT model. It can classify logs from AWS, Azure, and GCP with high accuracy.

The model was fine-tuned on a dataset of simulated cloud logs using microsoft/codebert-base-mlm as the base model.

πŸ“‚ Project Structure

.
β”œβ”€β”€ Cloud_Classifier_using_codebert.ipynb  # Jupyter Notebook containing training and evaluation code
β”œβ”€β”€ cloud-log-classifier-final/            # Saved model directory (generated after training)
β”‚   β”œβ”€β”€ config.json
β”‚   β”œβ”€β”€ pytorch_model.bin
β”‚   β”œβ”€β”€ tokenizer_config.json
β”‚   └── ...
β”œβ”€β”€ cloud-log-classifier-final.zip         # Zipped model for distribution
└── README.md                              # This file
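
Note that the inference code below also reads a label_mapping.json file from the model directory. Its real contents come from your training run; a minimal illustrative version (assuming the three classes are named aws, azure, and gcp) could be written like this:

```python
import json

# Illustrative only: the actual mapping is produced during training.
# Place the resulting file inside cloud-log-classifier-final/.
# Keys in id2label are strings because JSON object keys must be strings;
# the classifier converts them back to integers when loading.
mappings = {
    "id2label": {"0": "aws", "1": "azure", "2": "gcp"},
    "label2id": {"aws": 0, "azure": 1, "gcp": 2},
}

with open("label_mapping.json", "w") as f:
    json.dump(mappings, f, indent=2)
```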

πŸš€ Usage

You can load the fine-tuned model in your own projects through the CloudLogClassifier class shown below.

Prerequisites

pip install torch transformers scikit-learn numpy

Python Inference Code

Save the following code as classifier.py or use it directly in your Python scripts. Ensure the unzipped cloud-log-classifier-final folder is in your working directory.

import torch
import json
import numpy as np
from transformers import RobertaForSequenceClassification, RobertaTokenizer

class CloudLogClassifier:
    """
    Reusable classifier for cloud platform detection from logs.
    """

    def __init__(self, model_path):
        """
        Load the fine-tuned model and tokenizer
        
        Args:
            model_path (str): Path to the directory containing the saved model files
        """
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        try:
            self.model = RobertaForSequenceClassification.from_pretrained(model_path)
            self.tokenizer = RobertaTokenizer.from_pretrained(model_path)
            self.model.to(self.device)
            self.model.eval()
            
            # Load label mapping
            with open(f'{model_path}/label_mapping.json', 'r') as f:
                mappings = json.load(f)
                # Convert keys back to integers for the dictionary
                self.id2label = {int(k): v for k, v in mappings['id2label'].items()}
                
        except Exception as e:
            raise RuntimeError(f"Failed to load model from {model_path}: {str(e)}")

    def predict(self, log_text):
        """
        Predict cloud platform from log text

        Args:
            log_text (str): Log text to classify

        Returns:
            dict: Prediction results with label and confidence
        """
        # Tokenize input
        inputs = self.tokenizer(
            log_text,
            return_tensors='pt',
            truncation=True,
            padding='max_length',
            max_length=128
        ).to(self.device)

        # Get prediction
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits
            probabilities = torch.nn.functional.softmax(logits, dim=-1)
            predicted_class = torch.argmax(probabilities, dim=-1).item()
            confidence = probabilities[0][predicted_class].item()

        return {
            'platform': self.id2label[predicted_class],
            'confidence': confidence,
            'all_probabilities': {
                self.id2label[i]: prob.item()
                for i, prob in enumerate(probabilities[0])
            }
        }

# Usage Example
if __name__ == "__main__":
    # Path to your unzipped model folder
    model_path = './cloud-log-classifier-final'
    
    try:
        classifier = CloudLogClassifier(model_path)
        
        test_logs = [
            "[    3.6936] ena 0000:00:05.0: Elastic Network Adapter (ENA)",
            "AzureLinuxAgent: INFO Starting Azure Linux Agent",
            "google_guest_agent INFO GCE Agent running"
        ]

        print("Predictions:")
        for log in test_logs:
            result = classifier.predict(log)
            print(f"\nLog: {log}")
            print(f"Predicted Platform: {result['platform'].upper()}")
            print(f"Confidence: {result['confidence']:.2%}")
            
    except Exception as e:
        print(f"Error: {e}")

πŸ“Š Model Performance

The model achieves high accuracy on the test set, effectively distinguishing between different cloud provider log formats (AWS, Azure, GCP).

| Metric    | Score |
|-----------|-------|
| Accuracy  | ~98%  |
| Precision | >95%  |
| Recall    | >95%  |
| F1-Score  | >95%  |
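
These figures come from the notebook's evaluation. As a self-contained illustration of how such metrics are computed (using made-up predictions, not the model's actual output), scikit-learn's metric helpers work like this:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical predictions on a small held-out set
# (labels: 0 = aws, 1 = azure, 2 = gcp)
y_true = [0, 0, 1, 1, 2, 2, 0, 1, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]

accuracy = accuracy_score(y_true, y_pred)
# Weighted averaging accounts for class imbalance across aws/azure/gcp
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

print(f"Accuracy:  {accuracy:.2%}")   # 9 of 10 correct -> 90.00%
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1-Score:  {f1:.2%}")
```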

πŸ› οΈ Training

The model was trained with the Trainer API from Hugging Face Transformers.

  • Base Model: microsoft/codebert-base-mlm
  • Epochs: 5
  • Batch Size: 16
  • Learning Rate: Default (5e-5)
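
The hyperparameters above can be expressed as a TrainingArguments configuration; this is an illustrative sketch (the output_dir is hypothetical), to be passed to Trainer together with the model and the tokenized train/eval datasets:

```python
from transformers import TrainingArguments

# Configuration matching the hyperparameters listed above
args = TrainingArguments(
    output_dir="./cloud-log-classifier",  # illustrative checkpoint directory
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=5e-5,  # the Trainer default
)
```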