| library_name: transformers | |
| metrics: | |
| - accuracy | |
| - precision | |
| - recall | |
| - f1 | |
| license: apache-2.0 | |
| base_model: | |
| - google-bert/bert-base-uncased | |
| pipeline_tag: text-classification | |
| # Model Card for bert-log-anomaly-detection | |
| <!-- Provide a quick summary of what the model is/does. --> | |
| ### Model Summary | |
| 1. `bert-log-anomaly-detection` is a BERT-based NLP model fine-tuned to detect anomalies in individual SQL transaction logs. | |
| 2. The model classifies each database transaction log as either `Normal` or `Anomaly`, with the goal of supporting AI-powered fraud detection and cybersecurity monitoring systems. | |
| 3. This model was developed as part of the _Samsung × KBTG Digital Fraud Cybersecurity Hackathon_ (Thailand) under the AI-Powered Fraud Detection & Prevention track. | |
| ### Model Description | |
| <!-- Provide a longer summary of what this model is. --> | |
| This model analyzes individual SQL database transaction logs and detects abnormal patterns that may indicate fraudulent, malicious, or suspicious behavior. | |
| Demo: Hackathon prototype | |
| - **Developed by:** Aungruk Vanichanai, Napat Wanitwatthakorn, Thanakrit Sriphiphattana | |
| - **Shared by:** Aungruk Vanichanai | |
| - **Model type:** Transformer-based binary text classifier | |
| - **Language(s) (NLP):** English (SQL logs in text format) | |
| - **License:** Apache 2.0 | |
| - **Finetuned from model:** google-bert/bert-base-uncased | |
| ### Model Sources | |
| <!-- Provide the basic links for the model. --> | |
| - **GitHub Repository:** https://github.com/AungMoonLord/AI-Cybersecurity-Hackathon/tree/main/New%20Finetune%20Hackathon | |
| ### How to Get Started with the Model | |
| <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. --> | |
| ## Step 1 (Setup) | |
| ```python | |
| import torch | |
| from transformers import BertForSequenceClassification, BertTokenizer | |
| MODEL_PATH = "AungMoonLord/bert-log-anomaly-detection" | |
| model = BertForSequenceClassification.from_pretrained(MODEL_PATH) | |
| tokenizer = BertTokenizer.from_pretrained(MODEL_PATH) | |
| model.eval() | |
| ``` | |
| ## Step 2 (Clean and Label Logs) — Optional, but may slightly improve accuracy, recall, and F1-score | |
| ```python | |
| # Perform log preprocessing | |
| def add_prefix_token(text):  # apply the same preprocessing at training and inference time | |
|     # Clean the log: replace tabs with spaces and trim surrounding whitespace | |
|     text = text.replace("\t", " ") | |
|     text = text.strip() | |
|     # Add a prefix token; slicing avoids an IndexError on logs shorter than 4 characters | |
|     if text[:1].isalpha() or text[3:4].isalpha(): | |
|         return "[SQL]\n" + text | |
|     else: | |
|         return "[LOG]\n" + text | |
| ``` | |
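To illustrate which prefix the rule in Step 2 assigns, the sketch below applies the same check to two of the sample logs used later in this card. The helper name `choose_prefix` is hypothetical and returns only the tag. It uses slicing rather than direct indexing so that very short inputs do not raise an error, which is an assumption about the desired behavior and not something the card states.

```python
def choose_prefix(log_text):
    """Hypothetical stand-alone version of the Step 2 prefix rule (returns only the tag)."""
    text = log_text.replace("\t", " ").strip()
    # A log whose 1st or 4th character is a letter is treated as a bare SQL statement;
    # anything else (for example a log that starts with a timestamp) is tagged as a log line.
    if text[:1].isalpha() or text[3:4].isalpha():
        return "[SQL]"
    return "[LOG]"

# Sample logs copied from the inference examples in this card.
samples = [
    "SELECT * FROM users WHERE id = 1 OR 1=1",
    "3051-06-22T07:20:02.296945Z 3 Query select e3mJKDCCY from 7Q8SpG8LLEWhrfpe4s5 where ph4d = 'a1S9hQa92uC1EAyJf2Y';",
]

for sample in samples:
    print(choose_prefix(sample), sample[:30])
```

The bare `SELECT` statement starts with a letter and is tagged `[SQL]`, while the timestamped entry starts with digits and is tagged `[LOG]`.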
| ## Step 3 (Create the Function for Log Classification) | |
| ```python | |
| def predict_log(log_text): | |
|     log_text = add_prefix_token(log_text) | |
|     inputs = tokenizer( | |
|         log_text, | |
|         return_tensors="pt", | |
|         truncation=True, | |
|         padding=True,  # only has an effect when several logs are tokenized together (batch size > 1) | |
|         max_length=128 | |
|     ) | |
|     with torch.no_grad(): | |
|         logits = model(**inputs).logits | |
|     pred = torch.argmax(logits, dim=1).item()  # predicted class index for the single input log | |
|     prob = torch.softmax(logits, dim=-1).tolist()[0]  # class probabilities, ordered by class index | |
|     return "Normal" if pred == 1 else "Anomaly", prob | |
| ``` | |
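To make the post-processing in `predict_log` concrete, the sketch below reproduces the softmax and argmax steps in plain Python on made-up logits, with no model required. The helper name `decode_logits` and the logit values are illustrative assumptions. The index-to-label mapping (1 = `Normal`, 0 = `Anomaly`) follows the function above.

```python
import math

# Label order as used by predict_log in this card: index 0 -> "Anomaly", index 1 -> "Normal".
LABELS = ("Anomaly", "Normal")

def decode_logits(logits):
    """Hypothetical helper: turn raw two-class logits into (label, probabilities)."""
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # argmax over the class probabilities
    pred = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[pred], probs

# Made-up logits for illustration only; these are not real model outputs.
label, probs = decode_logits([2.0, -1.0])
print(label, [round(p, 4) for p in probs])
```

Because softmax is monotonic, taking the argmax of the logits (as `predict_log` does) and of the probabilities gives the same class.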
| ## Step 4 (Sample Inferences) | |
| ```python | |
| # Example 1 | |
| text1 = "SELECT * FROM users WHERE id = 1 OR 1=1" | |
| print(predict_log(text1)) | |
| # Example 2 | |
| text2 = "2025-01-06 14:23:45 | User: anonymous | IP: 203.154.89.102 | Duration: 0.05s SELECT * FROM users WHERE username = 'admin' OR '1'='1' -- ' AND password = 'x'" | |
| print(predict_log(text2)) | |
| # Example 3 | |
| text3 = "3051-06-22T07:20:02.296945Z 3 Query select e3mJKDCCY from 7Q8SpG8LLEWhrfpe4s5 where ph4d = 'a1S9hQa92uC1EAyJf2Y';" | |
| print(predict_log(text3)) | |
| ``` | |
| ### Application in Hackathon Project | |
| <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app --> | |
| - The model is integrated into an n8n workflow, developed by Waris Sripatoomrak, that automates fraud detection in financial transaction logs. | |
| ### Out-of-Scope Use | |
| - Multi-log sequence anomaly detection | |
| - Non-textual anomaly detection | |
| ## Training Data | |
| - SQL database transaction logs (1,611 samples) synthetically generated by ChatGPT, Qwen, DeepSeek, Grok, Gemini, and Claude | |
| - Each log labeled as either `Normal` or `Anomaly` | |
| - Data prepared for single-log classification | |
| ## Evaluation | |
| #### Metrics | |
| ##### - Training Set | |
| | Metric | Value | | |
| | --------------- | ------ | | |
| | Accuracy | 0.8950 | | |
| | Precision | 0.8580 | | |
| | Recall | 0.9026 | | |
| | F1-score | 0.8797 | | |
| | Validation Loss | 0.3279 | | |
| ##### - Test Set (Baseline — No Step 2 Preprocessing) | |
| | Metric | Value | | |
| | --------------- | ------ | | |
| | Accuracy | 0.6950 | | |
| | Precision | 0.6639 | | |
| | Recall | 0.7900 | | |
| | F1-score | 0.7215 | | |
| | Validation Loss | 0.6251 | | |
| ##### - Test Set (Full Pipeline — With Step 2 Preprocessing) | |
| | Metric | Value | | |
| | --------------- | ------ | | |
| | Accuracy | 0.7000 | | |
| | Precision | 0.6613 | | |
| | Recall | 0.8200 | | |
| | F1-score | 0.7321 | | |
| | Validation Loss | 0.6344 | | |
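As a quick consistency check on the tables above, recall that the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported precision and recall values. The function name `f1_score` and the dictionary layout are illustrative choices, not part of the original evaluation code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the metric tables in this card.
reported = {
    "training": (0.8580, 0.9026, 0.8797),
    "test_baseline": (0.6639, 0.7900, 0.7215),
    "test_full_pipeline": (0.6613, 0.8200, 0.7321),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed={f1_score(p, r):.4f} reported={f1:.4f}")
```

All three recomputed values agree with the reported F1-scores to four decimal places.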
| #### Summary | |
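| The model favors recall over precision: with Step 2 preprocessing it reaches a test-set recall of 0.8200 at a precision of 0.6613 (F1-score 0.7321). This makes it suited to fraud detection and cybersecurity use cases where missing an anomaly is costlier than a false alarm. Test accuracy (0.7000) is well below training accuracy (0.8950), so performance on unseen logs is noticeably weaker than on the training data. | |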