---
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: token-classification
library_name: transformers
tags:
- token-classification
- named-entity-recognition
- ner
- bert
- logs
datasets:
- Aliph0th/logtheus-ml-ds
---

# Log Entity Extractor (BERT-based Token Classifier)

A fine-tuned BERT model for extracting canonical attributes from log lines using token classification (an NER-style task). Created for my coursework.

## Model Description

This model is based on `bert-base-uncased` and trained to perform **token classification** on log messages. It extracts structured attributes (service, level, event, error_code, user_id, ip, etc.) from unstructured log text using a BIO tagging scheme.

**Use case:** Convert raw log lines into canonical, structured key-value pairs for downstream analysis, alerting, or aggregation.
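
Concretely, the BIO scheme tags each token either `O` (outside any entity) or `B-`/`I-` plus a canonical field name. A simplified illustration on the dataset's example line (whitespace tokens for readability; the model actually operates on WordPiece subwords):

```python
# Simplified BIO tagging of a log line. Whitespace tokens are used here for
# readability; the model itself tags WordPiece subwords.
tokens = ["[auth]",    "failed", "login", "for", "user", "123",       "from", "10.1.2.3"]
tags   = ["B-service", "O",      "O",     "O",   "O",    "B-user_id", "O",    "B-ip"]
```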

## Model Details

- **Base Model:** `bert-base-uncased`
- **Task:** Token Classification (Named Entity Recognition)
- **Training Data:** Annotated log lines (character-level entity offsets)
- **Input:** Raw log text (string)
- **Output:** Per-token BIO labels → grouped entities as canonical attributes

## Canonical Label Set

The model extracts attributes from these canonical fields (the sketch after the table shows how they expand into BIO tags):

| Field | Description |
|-------|-------------|
| service | Application or service name (e.g., "auth", "api") |
| level | Log level (e.g., "info", "error", "warn") |
| timestamp | Timestamp or date reference |
| environment | Deployment environment (e.g., "prod", "staging") |
| event | Event type or action (e.g., "login", "request") |
| error_message | Human-readable error message |
| status_code | HTTP or service status code |
| duration | Duration of a request or operation |
| ip | IP address (client or server) |
| method | HTTP method (GET, POST, etc.) |
| path | URL path or resource path |
| useragent | User-Agent header |
| hostname | Server hostname |

## Usage

### Installation

```bash
pip install transformers torch
```

### Python (Hugging Face Transformers)

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "Aliph0th/logtheus-ml"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "[auth] failed login for user 123 from 10.1.2.3 code=E401"

# Tokenize and forward pass
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
outputs = model(**inputs)
logits = outputs.logits

# Get predicted label IDs
predicted_ids = torch.argmax(logits, dim=-1)

# Map back to label names (one label per subword token, including [CLS]/[SEP])
id2label = model.config.id2label
predictions = [[id2label[int(p)] for p in pred] for pred in predicted_ids]
print(predictions)
```
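
To merge subword predictions into whole entities, the `pipeline` API is usually more convenient; `aggregation_strategy="simple"` groups contiguous `B-`/`I-` tokens into spans:

```python
from transformers import pipeline

# Token-classification pipeline that aggregates subword predictions
# into whole entity spans with a combined confidence score.
ner = pipeline(
    "token-classification",
    model="Aliph0th/logtheus-ml",
    aggregation_strategy="simple",
)

for entity in ner("[auth] failed login for user 123 from 10.1.2.3 code=E401"):
    # Each entity is a dict with entity_group, word, score, start, end
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```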

The raw snippet above yields only per-token labels; the inference wrapper in the [project repository](https://github.com/Aliph0th/logtheus-ml) groups them into a structured JSON object with (an invented example follows this list):

- `attributes`: High-confidence extractions (dict of canonical_field → value)
- `low_confidence_attributes`: Below-threshold extractions
- `attribute_confidence`: Per-field confidence scores
- `message`: Original log text
- `confidence`: Overall prediction confidence (0-1)
- `model_version`: Model version string
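
A hypothetical response for the example line above (shape per the field list; every value is invented for illustration):

```python
# Hypothetical output shape -- all values invented for illustration only.
result = {
    "attributes": {"service": "auth", "event": "login", "ip": "10.1.2.3"},
    "low_confidence_attributes": {"error_code": "E401"},
    "attribute_confidence": {"service": 0.98, "event": 0.91, "ip": 0.99, "error_code": 0.42},
    "message": "[auth] failed login for user 123 from 10.1.2.3 code=E401",
    "confidence": 0.87,
    "model_version": "model_v1",
}
```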

## Training

### Dataset Format

Training data is in JSONL format with character-offset annotations:

```json
{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3","entities":[{"start":1,"end":5,"label":"service"},{"start":29,"end":32,"label":"user_id"},{"start":38,"end":46,"label":"ip"}]}
```

Dataset used: [Aliph0th/logtheus-ml-ds](https://huggingface.co/datasets/Aliph0th/logtheus-ml-ds)

**Fields:**
- `text`: Raw log line (string)
- `entities`: List of entity annotations
  - `start`, `end`: Character-level offsets into `text` (0-indexed, end-exclusive; see the alignment sketch after this list)
  - `label`: Canonical field name
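
Training requires converting these character offsets into per-token BIO labels. A minimal sketch of that alignment via the tokenizer's offset mapping (assuming end-exclusive offsets; the repository's training script is authoritative and may differ in detail):

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

record = json.loads(
    '{"text":"[auth] failed login for user 123 from 10.1.2.3",'
    '"entities":[{"start":1,"end":5,"label":"service"},'
    '{"start":29,"end":32,"label":"user_id"},'
    '{"start":38,"end":46,"label":"ip"}]}'
)

encoding = tokenizer(record["text"], return_offsets_mapping=True)
labels = []
for start, end in encoding["offset_mapping"]:
    if start == end:  # special tokens such as [CLS] and [SEP]
        labels.append("O")
        continue
    tag = "O"
    for ent in record["entities"]:
        if start >= ent["start"] and end <= ent["end"]:
            # First subword of an entity gets B-, continuations get I-
            tag = f"{'B' if start == ent['start'] else 'I'}-{ent['label']}"
            break
    labels.append(tag)

print(list(zip(encoding.tokens(), labels)))
```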

### Training Procedure

```bash
# 1. Prepare raw log files (deduplicate, split train/val)
python scripts/process_data.py data/annotated/ --p 0.8

# 2. Train the model
python training/train_token_classifier.py \
  --train-file data/train.jsonl \
  --val-file data/val.jsonl \
  --output-dir artifacts/model_v1 \
  --base-model bert-base-uncased \
  --epochs 5 \
  --batch-size 16
```

**Hyperparameters** (see the `TrainingArguments` sketch after this list):
- Learning rate: 3e-5
- Batch size: 16 (per device)
- Epochs: 5 (with early stopping on validation F1)
- Optimizer: AdamW
- Weight decay: 0.01
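
A minimal sketch of how these hyperparameters might map onto `transformers.TrainingArguments` (an illustration only; the repository's `training/train_token_classifier.py` is the authoritative recipe):

```python
from transformers import TrainingArguments

# The documented hyperparameters expressed as TrainingArguments,
# as they would be passed to transformers.Trainer.
args = TrainingArguments(
    output_dir="artifacts/model_v1",
    learning_rate=3e-5,              # documented learning rate
    per_device_train_batch_size=16,  # batch size 16 per device
    num_train_epochs=5,
    weight_decay=0.01,               # AdamW is the Trainer default optimizer
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for early stopping
    metric_for_best_model="f1",      # early stopping tracks validation F1
)
```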

## Limitations

- **English logs only:** Trained on ASCII/UTF-8 log text in English
- **Format dependency:** Works best on semi-structured logs (key=value, JSON, or naturally worded messages); highly custom formats may need domain-specific preprocessing

## Contact & Support

For issues, questions, or contributions, please visit:
- **Repository:** https://github.com/Aliph0th/logtheus-ml
- **Issues:** https://github.com/Aliph0th/logtheus-ml/issues

## Acknowledgments

- Based on [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
- Built with [Hugging Face Transformers](https://huggingface.co/transformers/)