# Cloud Log Classifier using CodeBERT

This project implements a cloud platform log classifier built on a fine-tuned CodeBERT model. Given a log line, it predicts which provider produced it: **AWS**, **Azure**, or **GCP**. It reaches roughly 98% accuracy on the test set (see Model Performance).

The model was fine-tuned on a dataset of simulated cloud logs using `microsoft/codebert-base-mlm` as the base model.

## 📂 Project Structure

```
.
├── Cloud_Classifier_using_codebert.ipynb  # Jupyter Notebook containing training and evaluation code
├── cloud-log-classifier-final/            # Saved model directory (generated after training)
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer_config.json
│   ├── label_mapping.json                 # id2label mapping read by CloudLogClassifier
│   └── ...
├── cloud-log-classifier-final.zip         # Zipped model for distribution
└── README.md                              # This file
```

## 🚀 Usage

You can use the fine-tuned model in your own projects using the `CloudLogClassifier` class.

### Prerequisites

```bash
pip install torch transformers scikit-learn numpy
```
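
The inference code in the next section also needs the unzipped model folder. If you only have `cloud-log-classifier-final.zip`, a minimal sketch for extracting it with Python's standard `zipfile` module might look like this. It assumes the archive contains the `cloud-log-classifier-final/` folder at its top level, which this README does not confirm; `extract_model` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_model(zip_path="cloud-log-classifier-final.zip", dest_dir="."):
    """Extract the zipped model archive into dest_dir and return the extracted names.

    Assumption: the archive holds a top-level cloud-log-classifier-final/ folder.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

Calling `extract_model()` from the project root should leave the model files where the example `model_path` expects them, provided the assumption about the archive layout holds.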

### Python Inference Code

Save the following code as `classifier.py`, or use it directly in your Python scripts. The example expects the unzipped `cloud-log-classifier-final` folder in your working directory.

```python
import json

import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

class CloudLogClassifier:
    """
    Reusable classifier for cloud platform detection from logs.
    """

    def __init__(self, model_path):
        """
        Load the fine-tuned model and tokenizer
        
        Args:
            model_path (str): Path to the directory containing the saved model files
        """
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        try:
            self.model = RobertaForSequenceClassification.from_pretrained(model_path)
            self.tokenizer = RobertaTokenizer.from_pretrained(model_path)
            self.model.to(self.device)
            self.model.eval()
            
            # Load label mapping
            with open(f'{model_path}/label_mapping.json', 'r') as f:
                mappings = json.load(f)
                # Convert keys back to integers for the dictionary
                self.id2label = {int(k): v for k, v in mappings['id2label'].items()}
                
        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise RuntimeError(f"Failed to load model from {model_path}: {e}") from e

    def predict(self, log_text):
        """
        Predict cloud platform from log text

        Args:
            log_text (str): Log text to classify

        Returns:
            dict: Prediction results with label and confidence
        """
        # Tokenize input
        inputs = self.tokenizer(
            log_text,
            return_tensors='pt',
            truncation=True,
            padding='max_length',
            max_length=128
        ).to(self.device)

        # Get prediction
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits
            probabilities = torch.nn.functional.softmax(logits, dim=-1)
            predicted_class = torch.argmax(probabilities, dim=-1).item()
            confidence = probabilities[0][predicted_class].item()

        return {
            'platform': self.id2label[predicted_class],
            'confidence': confidence,
            'all_probabilities': {
                self.id2label[i]: prob.item()
                for i, prob in enumerate(probabilities[0])
            }
        }

# Usage Example
if __name__ == "__main__":
    # Path to your unzipped model folder
    model_path = './cloud-log-classifier-final'
    
    try:
        classifier = CloudLogClassifier(model_path)
        
        test_logs = [
            "[    3.6936] ena 0000:00:05.0: Elastic Network Adapter (ENA)",
            "AzureLinuxAgent: INFO Starting Azure Linux Agent",
            "google_guest_agent INFO GCE Agent running"
        ]

        print("Predictions:")
        for log in test_logs:
            result = classifier.predict(log)
            print(f"\nLog: {log}")
            print(f"Predicted Platform: {result['platform'].upper()}")
            print(f"Confidence: {result['confidence']:.2%}")
            
    except Exception as e:
        print(f"Error: {e}")
```
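
Because `predict` returns a `confidence` score alongside the label, callers can choose to treat low-confidence predictions as undecided. The following is a minimal sketch of that idea, not part of the project code: `classify_with_threshold` and the `0.8` cutoff are hypothetical, and the helper works with any object whose `predict` method returns the dictionary shape shown above.

```python
def classify_with_threshold(classifier, logs, min_confidence=0.8):
    """Classify each log and mark predictions below min_confidence as 'unknown'.

    `classifier` is any object with a predict(log_text) method returning a dict
    with 'platform' and 'confidence' keys, such as CloudLogClassifier.
    The 0.8 default is an illustrative value, not a tuned threshold.
    """
    results = []
    for log in logs:
        prediction = classifier.predict(log)
        confident = prediction['confidence'] >= min_confidence
        results.append({
            'log': log,
            # Keep the predicted label only when the model is confident enough
            'platform': prediction['platform'] if confident else 'unknown',
            'confidence': prediction['confidence'],
        })
    return results
```

With the classifier from the example above, this could be called as `classify_with_threshold(classifier, test_logs)`. A suitable threshold would need to be chosen by inspecting confidences on your own logs.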

## 📊 Model Performance

On the test set, the model distinguishes between AWS, Azure, and GCP log formats with the following approximate scores:

| Metric | Score |
| :--- | :--- |
| **Accuracy** | ~98% |
| **Precision** | >95% |
| **Recall** | >95% |
| **F1-Score** | >95% |
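
Metrics like those in the table can be computed with scikit-learn, which is already listed in the prerequisites. The sketch below is an assumption about how that might be done, not the notebook's actual evaluation code: the labels are made-up toy data rather than project results, and macro averaging is assumed because this README does not state which averaging was used.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='macro', zero_division=0
    )
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }


# Toy labels for illustration only; these are not the project's test data.
y_true = ['aws', 'aws', 'azure', 'azure', 'gcp', 'gcp']
y_pred = ['aws', 'aws', 'azure', 'gcp', 'gcp', 'gcp']

for name, value in compute_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

In practice, `y_pred` would come from running `CloudLogClassifier.predict` over the test logs and collecting the `platform` field.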

## 🛠️ Training

The model was trained using the `Trainer` API from Hugging Face Transformers.

- **Base Model**: `microsoft/codebert-base-mlm`
- **Epochs**: 5
- **Batch Size**: 16
- **Learning Rate**: 5e-5 (the `Trainer` default)
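
The hyperparameters listed above map onto `TrainingArguments` roughly as follows. This is a sketch, not the notebook's actual configuration: `output_dir` is a placeholder, `build_training_args` is a hypothetical helper, and applying the batch size of 16 to training (rather than evaluation) is an assumption.

```python
# Hyperparameters taken from the list above.
HYPERPARAMS = {
    'num_train_epochs': 5,
    'per_device_train_batch_size': 16,  # assumption: 16 is the training batch size
    'learning_rate': 5e-5,              # the Trainer default
}


def build_training_args(output_dir='./results'):
    """Build TrainingArguments from HYPERPARAMS.

    output_dir is a placeholder path. transformers is imported lazily so the
    hyperparameters can be inspected without loading the library.
    """
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

The returned object would be passed as the `args` parameter of `Trainer`, together with the model and the tokenized datasets prepared in the notebook.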