You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

YAML Metadata Warning: empty or missing yaml metadata in repo card (https://huggingface.co/docs/hub/model-cards#model-card-metadata)

Cloud Log Classifier using CodeBERT

This project implements a cloud platform log classifier using a fine-tuned CodeBERT model. It can classify logs from AWS, Azure, and GCP with high accuracy.

The model was fine-tuned on a dataset of simulated cloud logs using microsoft/codebert-base-mlm as the base model.

📂 Project Structure

.
├── Cloud_Classifier_using_codebert.ipynb  # Jupyter Notebook containing training and evaluation code
├── cloud-log-classifier-final/            # Saved model directory (generated after training)
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer_config.json
│   └── ...
├── cloud-log-classifier-final.zip         # Zipped model for distribution
└── README.md                              # This file

🚀 Usage

You can use the fine-tuned model in your own projects using the CloudLogClassifier class.

Prerequisites

pip install torch transformers scikit-learn numpy

Python Inference Code

Save the following code as classifier.py or use it directly in your python scripts. Ensure you have the cloud-log-classifier-final folder (unzipped) in your working directory.

import torch
import json
import numpy as np
from transformers import RobertaForSequenceClassification, RobertaTokenizer

class CloudLogClassifier:
    """
    Reusable classifier for cloud platform detection from logs.
    """

    def __init__(self, model_path):
        """
        Load the fine-tuned model and tokenizer
        
        Args:
            model_path (str): Path to the directory containing the saved model files
        """
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        try:
            self.model = RobertaForSequenceClassification.from_pretrained(model_path)
            self.tokenizer = RobertaTokenizer.from_pretrained(model_path)
            self.model.to(self.device)
            self.model.eval()
            
            # Load label mapping
            with open(f'{model_path}/label_mapping.json', 'r') as f:
                mappings = json.load(f)
                # Convert keys back to integers for the dictionary
                self.id2label = {int(k): v for k, v in mappings['id2label'].items()}
                
        except Exception as e:
            raise RuntimeError(f"Failed to load model from {model_path}: {str(e)}")

    def predict(self, log_text):
        """
        Predict cloud platform from log text

        Args:
            log_text (str): Log text to classify

        Returns:
            dict: Prediction results with label and confidence
        """
        # Tokenize input
        inputs = self.tokenizer(
            log_text,
            return_tensors='pt',
            truncation=True,
            padding='max_length',
            max_length=128
        ).to(self.device)

        # Get prediction
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits
            probabilities = torch.nn.functional.softmax(logits, dim=-1)
            predicted_class = torch.argmax(probabilities, dim=-1).item()
            confidence = probabilities[0][predicted_class].item()

        return {
            'platform': self.id2label[predicted_class],
            'confidence': confidence,
            'all_probabilities': {
                self.id2label[i]: prob.item()
                for i, prob in enumerate(probabilities[0])
            }
        }

# Usage Example
if __name__ == "__main__":
    # Path to your unzipped model folder
    model_path = './cloud-log-classifier-final'
    
    try:
        classifier = CloudLogClassifier(model_path)
        
        test_logs = [
            "[    3.6936] ena 0000:00:05.0: Elastic Network Adapter (ENA)",
            "AzureLinuxAgent: INFO Starting Azure Linux Agent",
            "google_guest_agent INFO GCE Agent running"
        ]

        print("Predictions:")
        for log in test_logs:
            result = classifier.predict(log)
            print(f"\nLog: {log}")
            print(f"Predicted Platform: {result['platform'].upper()}")
            print(f"Confidence: {result['confidence']:.2%}")
            
    except Exception as e:
        print(f"Error: {e}")

📊 Model Performance

The model achieves high accuracy on the test set, effectively distinguishing between different cloud provider log formats (AWS, Azure, GCP).

Metric	Score
Accuracy	~98%
Precision	>95%
Recall	>95%
F1-Score	>95%

🛠️ Training

The model was trained using the Trainer API from HuggingFace Transformers.

Base Model: microsoft/codebert-base-mlm
Epochs: 5
Batch Size: 16
Learning Rate: Default (5e-5)

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support