# Cloud Log Classifier using CodeBERT

This project implements a cloud platform log classifier built on a fine-tuned CodeBERT model. It classifies logs as originating from **AWS**, **Azure**, or **GCP**, reaching roughly 98% accuracy on the test set.

The model was fine-tuned on a dataset of simulated cloud logs, using `microsoft/codebert-base-mlm` as the base model.

## Project Structure

```
.
├── Cloud_Classifier_using_codebert.ipynb   # Jupyter notebook containing training and evaluation code
├── cloud-log-classifier-final/             # Saved model directory (generated after training)
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer_config.json
│   └── ...
├── cloud-log-classifier-final.zip          # Zipped model for distribution
└── README.md                               # This file
```
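
Before loading the model, it can help to confirm that the unzipped directory actually contains the files listed above. The following is a minimal sketch, not part of the project: the helper name `missing_model_files` and the `REQUIRED_FILES` tuple are hypothetical, and the file list is taken from the tree above plus `label_mapping.json`, which the inference code in the Usage section reads. The `...` entry in the tree means other files may be needed as well.

```python
from pathlib import Path

# Hypothetical list: files named in the tree above, plus label_mapping.json,
# which the inference code reads. Other files may be needed as well.
REQUIRED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "tokenizer_config.json",
    "label_mapping.json",
)


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the names in `required` that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("./cloud-log-classifier-final")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected model files are present.")
```

An empty list means every file in the assumed list was found; it does not guarantee that the model will load.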

## Usage

You can use the fine-tuned model in your own projects through the `CloudLogClassifier` class.

### Prerequisites

```bash
pip install torch transformers scikit-learn numpy
```

### Python Inference Code

Save the following code as `classifier.py`, or use it directly in your Python scripts. Make sure the unzipped `cloud-log-classifier-final` folder is in your working directory.

```python
import torch
import json
import numpy as np
from transformers import RobertaForSequenceClassification, RobertaTokenizer


class CloudLogClassifier:
    """
    Reusable classifier for cloud platform detection from logs.
    """

    def __init__(self, model_path):
        """
        Load the fine-tuned model and tokenizer.

        Args:
            model_path (str): Path to the directory containing the saved model files
        """
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        try:
            self.model = RobertaForSequenceClassification.from_pretrained(model_path)
            self.tokenizer = RobertaTokenizer.from_pretrained(model_path)
            self.model.to(self.device)
            self.model.eval()

            # Load label mapping
            with open(f'{model_path}/label_mapping.json', 'r') as f:
                mappings = json.load(f)
            # JSON object keys are always strings, so convert them back to integers
            self.id2label = {int(k): v for k, v in mappings['id2label'].items()}
        except Exception as e:
            # Chain the original exception so the full traceback is preserved
            raise RuntimeError(f"Failed to load model from {model_path}: {e}") from e

    def predict(self, log_text):
        """
        Predict cloud platform from log text.

        Args:
            log_text (str): Log text to classify

        Returns:
            dict: Prediction results with label and confidence
        """
        # Tokenize input
        inputs = self.tokenizer(
            log_text,
            return_tensors='pt',
            truncation=True,
            padding='max_length',
            max_length=128
        ).to(self.device)

        # Get prediction
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits
            probabilities = torch.nn.functional.softmax(logits, dim=-1)
            predicted_class = torch.argmax(probabilities, dim=-1).item()
            confidence = probabilities[0][predicted_class].item()

        return {
            'platform': self.id2label[predicted_class],
            'confidence': confidence,
            'all_probabilities': {
                self.id2label[i]: prob.item()
                for i, prob in enumerate(probabilities[0])
            }
        }


# Usage Example
if __name__ == "__main__":
    # Path to your unzipped model folder
    model_path = './cloud-log-classifier-final'

    try:
        classifier = CloudLogClassifier(model_path)

        test_logs = [
            "[ 3.6936] ena 0000:00:05.0: Elastic Network Adapter (ENA)",
            "AzureLinuxAgent: INFO Starting Azure Linux Agent",
            "google_guest_agent INFO GCE Agent running"
        ]

        print("Predictions:")
        for log in test_logs:
            result = classifier.predict(log)
            print(f"\nLog: {log}")
            print(f"Predicted Platform: {result['platform'].upper()}")
            print(f"Confidence: {result['confidence']:.2%}")
    except Exception as e:
        print(f"Error: {e}")
```
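
The post-processing in `predict` can be followed without loading the model. The sketch below mirrors it in plain Python: a softmax turns raw logits into probabilities, the largest probability becomes the confidence, and `id2label` maps the winning index to a platform name. The function names `load_id2label` and `interpret_logits`, the example logits, and the label order in `EXAMPLE_MAPPING_JSON` are assumptions for illustration; the real `label_mapping.json` may order the classes differently.

```python
import json
import math

# Hypothetical contents of label_mapping.json; the real label order may differ.
EXAMPLE_MAPPING_JSON = '{"id2label": {"0": "aws", "1": "azure", "2": "gcp"}}'


def load_id2label(mapping_json):
    """Parse the mapping and convert JSON's string keys back to integers."""
    mappings = json.loads(mapping_json)
    return {int(k): v for k, v in mappings["id2label"].items()}


def interpret_logits(logits, id2label):
    """Mirror predict(): softmax over logits, then argmax and label lookup."""
    # Subtract the max logit for numerical stability; the result is unchanged.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probabilities = [e / total for e in exps]
    predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
    return {
        "platform": id2label[predicted_class],
        "confidence": probabilities[predicted_class],
        "all_probabilities": {id2label[i]: p for i, p in enumerate(probabilities)},
    }


if __name__ == "__main__":
    id2label = load_id2label(EXAMPLE_MAPPING_JSON)
    result = interpret_logits([2.0, 0.0, 0.0], id2label)  # made-up logits
    print(f"{result['platform']}: {result['confidence']:.2%}")
```

Because the softmax output always sums to 1 across the three classes, a high confidence only says the log looks more like one platform than the other two.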

## Model Performance

On the test set, the model reliably distinguishes between the log formats of the three cloud providers (AWS, Azure, GCP).

| Metric | Score |
| :--- | :--- |
| **Accuracy** | ~98% |
| **Precision** | >95% |
| **Recall** | >95% |
| **F1-Score** | >95% |
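
For reference, the four metrics in the table can be computed from true and predicted labels as sketched below. This README does not state how precision, recall, and F1 are averaged across the three classes, so macro averaging is an assumption here, and the labels in the example are made up rather than taken from the project's test set.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []

    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,  # macro average (assumption)
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Made-up labels for illustration only.
    y_true = ["aws", "aws", "azure", "azure", "gcp", "gcp"]
    y_pred = ["aws", "azure", "azure", "azure", "gcp", "gcp"]
    for name, value in classification_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.2%}")
```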

## Training

The model was trained using the `Trainer` API from Hugging Face Transformers.

- **Base Model**: `microsoft/codebert-base-mlm`
- **Epochs**: 5
- **Batch Size**: 16
- **Learning Rate**: Default (5e-5)
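
The hyperparameters above map onto `TrainingArguments` roughly as follows. This is a sketch under assumptions, not the notebook's actual code: the output directory name is hypothetical, the batch size of 16 is assumed to be per device, and every setting not listed above is left at the library default. The step-count helper further assumes a single device and no gradient accumulation, and the dataset size in the example is made up.

```python
import math

# Values taken from the list above; "output_dir" is a hypothetical path.
HYPERPARAMS = {
    "output_dir": "./results",
    "num_train_epochs": 5,
    "per_device_train_batch_size": 16,  # assumption: 16 is the per-device size
    "learning_rate": 5e-5,
}


def build_training_args():
    """Create TrainingArguments from HYPERPARAMS (requires `transformers`)."""
    # Imported lazily so the rest of this sketch runs without the library.
    from transformers import TrainingArguments

    return TrainingArguments(**HYPERPARAMS)


def total_optimizer_steps(num_examples, hyperparams=HYPERPARAMS):
    """Optimizer steps for a full run on one device, no gradient accumulation."""
    batch_size = hyperparams["per_device_train_batch_size"]
    # The last, smaller batch of each epoch still counts as one step.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * hyperparams["num_train_epochs"]


if __name__ == "__main__":
    # 1,000 is a made-up dataset size for illustration.
    print(total_optimizer_steps(1000))
```

The object returned by `build_training_args()` would be passed to `Trainer` together with the model and the tokenized datasets.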