Swapnanil09's picture
Update README.md
16f4882 verified
# Cloud Log Classifier using CodeBERT
This project implements a cloud platform log classifier using a fine-tuned CodeBERT model. It can classify logs from **AWS**, **Azure**, and **GCP** with high accuracy.
The model was fine-tuned on a dataset of simulated cloud logs using `microsoft/codebert-base-mlm` as the base model.
## πŸ“‚ Project Structure
```
.
β”œβ”€β”€ Cloud_Classifier_using_codebert.ipynb # Jupyter Notebook containing training and evaluation code
β”œβ”€β”€ cloud-log-classifier-final/ # Saved model directory (generated after training)
β”‚ β”œβ”€β”€ config.json
β”‚ β”œβ”€β”€ pytorch_model.bin
β”‚ β”œβ”€β”€ tokenizer_config.json
β”‚ └── ...
β”œβ”€β”€ cloud-log-classifier-final.zip # Zipped model for distribution
└── README.md # This file
```
## πŸš€ Usage
You can use the fine-tuned model in your own projects using the `CloudLogClassifier` class.
### Prerequisites
```bash
pip install torch transformers scikit-learn numpy
```
### Python Inference Code
Save the following code as `classifier.py` or use it directly in your python scripts. Ensure you have the `cloud-log-classifier-final` folder (unzipped) in your working directory.
```python
import torch
import json
import numpy as np
from transformers import RobertaForSequenceClassification, RobertaTokenizer
class CloudLogClassifier:
"""
Reusable classifier for cloud platform detection from logs.
"""
def __init__(self, model_path):
"""
Load the fine-tuned model and tokenizer
Args:
model_path (str): Path to the directory containing the saved model files
"""
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
try:
self.model = RobertaForSequenceClassification.from_pretrained(model_path)
self.tokenizer = RobertaTokenizer.from_pretrained(model_path)
self.model.to(self.device)
self.model.eval()
# Load label mapping
with open(f'{model_path}/label_mapping.json', 'r') as f:
mappings = json.load(f)
# Convert keys back to integers for the dictionary
self.id2label = {int(k): v for k, v in mappings['id2label'].items()}
except Exception as e:
raise RuntimeError(f"Failed to load model from {model_path}: {str(e)}")
def predict(self, log_text):
"""
Predict cloud platform from log text
Args:
log_text (str): Log text to classify
Returns:
dict: Prediction results with label and confidence
"""
# Tokenize input
inputs = self.tokenizer(
log_text,
return_tensors='pt',
truncation=True,
padding='max_length',
max_length=128
).to(self.device)
# Get prediction
with torch.no_grad():
outputs = self.model(**inputs)
logits = outputs.logits
probabilities = torch.nn.functional.softmax(logits, dim=-1)
predicted_class = torch.argmax(probabilities, dim=-1).item()
confidence = probabilities[0][predicted_class].item()
return {
'platform': self.id2label[predicted_class],
'confidence': confidence,
'all_probabilities': {
self.id2label[i]: prob.item()
for i, prob in enumerate(probabilities[0])
}
}
# Usage Example
if __name__ == "__main__":
# Path to your unzipped model folder
model_path = './cloud-log-classifier-final'
try:
classifier = CloudLogClassifier(model_path)
test_logs = [
"[ 3.6936] ena 0000:00:05.0: Elastic Network Adapter (ENA)",
"AzureLinuxAgent: INFO Starting Azure Linux Agent",
"google_guest_agent INFO GCE Agent running"
]
print("Predictions:")
for log in test_logs:
result = classifier.predict(log)
print(f"\nLog: {log}")
print(f"Predicted Platform: {result['platform'].upper()}")
print(f"Confidence: {result['confidence']:.2%}")
except Exception as e:
print(f"Error: {e}")
```
## πŸ“Š Model Performance
The model achieves high accuracy on the test set, effectively distinguishing between different cloud provider log formats (AWS, Azure, GCP).
| Metric | Score |
| :--- | :--- |
| **Accuracy** | ~98% |
| **Precision** | >95% |
| **Recall** | >95% |
| **F1-Score** | >95% |
## πŸ› οΈ Training
The model was trained using the `Trainer` API from HuggingFace Transformers.
- **Base Model**: microsoft/codebert-base-mlm
- **Epochs**: 5
- **Batch Size**: 16
- **Learning Rate**: Default (5e-5)