Aliph0th/logtheus-ml-large
+---
+language:
+- en
+base_model:
+- google-bert/bert-large-uncased
+pipeline_tag: token-classification
+library_name: transformers
+tags:
+- Safetensors
+- token-classification
+- named-entity-recognition
+- ner
+- bert
+- logs
+datasets:
+- Aliph0th/logtheus-ml-ds
+---
+# Log Entity Extractor (BERT-based Token Classifier)
+A fine-tuned BERT model for extracting canonical attributes from log lines using token classification (NER-style task).
+## Model Description
+This model is based on `bert-large-uncased` and trained to perform **token classification** on log messages. It extracts structured attributes (service, level, event, error_code, user_id, ip, etc.) from unstructured log text using a BIO tagging scheme.
+**Use case:** Convert raw log lines into canonical, structured key-value pairs for downstream analysis, alerting, or aggregation.
+## Model Details
+- **Base Model:** `bert-large-uncased`
+- **Task:** Token Classification (Named Entity Recognition)
+- **Training Data:** Annotated log lines (character-level entity offsets)
+- **Input:** Raw log text (string)
+- **Output:** Per-token BIO labels → grouped entities as canonical attributes
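The grouping step named in the Output line can be sketched as follows. This is a minimal illustration, not the project's actual post-processing: `group_bio_entities` is a hypothetical helper, and the offsets are written by hand here. In practice they could come from a fast tokenizer called with `return_offsets_mapping=True`.

```python
from typing import List, Tuple


def group_bio_entities(
    text: str,
    offsets: List[Tuple[int, int]],
    labels: List[str],
) -> List[Tuple[str, str]]:
    """Merge per-token BIO labels into (field, value) pairs.

    offsets[i] is the (start, end) character span of token i in text;
    labels[i] is its predicted tag, e.g. "B-ip", "I-ip", or "O".
    """
    entities = []   # finished (field, value) pairs
    current = None  # [field, start, end] of the entity being built
    for (start, end), tag in zip(offsets, labels):
        if start == end:  # special tokens such as [CLS]/[SEP] have empty spans
            continue
        prefix, _, field = tag.partition("-")
        if prefix == "I" and current is not None and current[0] == field:
            current[2] = end  # extend the open entity
            continue
        if current is not None:  # any other tag closes the open entity
            entities.append((current[0], text[current[1]:current[2]]))
            current = None
        if prefix in ("B", "I"):  # a stray I- tag is treated as a new entity
            current = [field, start, end]
    if current is not None:
        entities.append((current[0], text[current[1]:current[2]]))
    return entities


# Hand-written example: offsets and labels are illustrative, not model output.
example_text = "[auth] failed login from 10.1.2.3"
example_offsets = [(0, 0), (0, 1), (1, 5), (5, 6), (7, 13), (14, 19), (20, 24),
                   (25, 27), (27, 28), (28, 29), (29, 30), (30, 31), (31, 32),
                   (32, 33), (0, 0)]
example_labels = ["O", "O", "B-service", "O", "O", "B-event", "O",
                  "B-ip", "I-ip", "I-ip", "I-ip", "I-ip", "I-ip", "I-ip", "O"]
entities = group_bio_entities(example_text, example_offsets, example_labels)
print(entities)
```

Slicing the original text by character offsets, rather than joining subword tokens, keeps values such as IP addresses intact even though the tokenizer splits them into several pieces.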
+## Canonical Label Set
+The model extracts the following canonical fields:
+| Field | Description |
+|-------|-------------|
+| service | Application or service name (e.g., "auth", "api") |
+| level | Log level (e.g., "info", "error", "warn") |
+| timestamp | Timestamp or date reference |
+| environment | Deployment environment (e.g., "prod", "staging") |
+| event | Event type or action (e.g., "login", "request") |
+| error_message | Human-readable error message |
+| status_code | HTTP or service status code |
+| duration | Elapsed time of a request or operation |
+| ip | IP address (client or server) |
+| method | HTTP method (GET, POST, etc.) |
+| path | URL path or resource path |
+| useragent | User-Agent header |
+| hostname | Server hostname |
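Under a BIO scheme, each field in the table above maps to a `B-` (begin) and an `I-` (inside) tag, plus a single `O` tag for everything else. The sketch below builds such a label set from the listed fields. The ordering and the numeric IDs are assumptions made for illustration; the authoritative mapping is whatever `model.config.id2label` contains in the published checkpoint.

```python
# Fields copied from the table above; the shipped model may define more.
CANONICAL_FIELDS = [
    "service", "level", "timestamp", "environment", "event",
    "error_message", "status_code", "duration", "ip", "method",
    "path", "useragent", "hostname",
]

# "O" marks tokens outside any field; each field gets a B- and an I- tag.
labels = ["O"] + [
    f"{prefix}-{field}" for field in CANONICAL_FIELDS for prefix in ("B", "I")
]

# Assumed ID assignment: position in the list. Check model.config.id2label.
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels), labels[:5])
```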
+## Usage
+### Installation
+```bash
+pip install transformers torch
+```
+### Python (Hugging Face Transformers)
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+import torch
+model_name = "Aliph0th/logtheus-ml-large"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForTokenClassification.from_pretrained(model_name)
+model.eval()  # disable dropout for inference
+text = "[auth] failed login for user 123 from 10.1.2.3 code=E401"
+# Tokenize and run a forward pass without tracking gradients
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
+with torch.no_grad():
+    logits = model(**inputs).logits
+# Pick the highest-scoring label ID for each token (batch of one)
+predicted_ids = torch.argmax(logits, dim=-1)[0]
+# Map IDs back to BIO label names and print them next to their tokens
+id2label = model.config.id2label
+tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+for token, label_id in zip(tokens, predicted_ids):
+    print(token, id2label[int(label_id)])
+```
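+Calling the model directly yields only per-token BIO labels; a post-processing step (not shown here) turns them into a structured JSON object with: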
+- `attributes`: High-confidence extractions (dict of canonical_field → value)
+- `low_confidence_attributes`: Below-threshold extractions
+- `attribute_confidence`: Per-field confidence scores
+- `message`: Original log text
+- `confidence`: Overall prediction confidence (0-1)
+- `model_version`: Model version string
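How the output fields listed above fit together can be illustrated with a small sketch. Everything here is an assumption made for illustration: `build_result` is a hypothetical helper, the 0.5 threshold is arbitrary, and the overall `confidence` is taken to be the mean of the per-field scores. The card does not state how the real pipeline computes any of these.

```python
from typing import Dict, List, Tuple


def build_result(
    message: str,
    scored_entities: List[Tuple[str, str, float]],
    threshold: float = 0.5,          # hypothetical cut-off, not from the model card
    model_version: str = "unknown",  # placeholder version string
) -> Dict[str, object]:
    """Split (field, value, confidence) triples into the documented output keys."""
    attributes: Dict[str, str] = {}
    low_confidence: Dict[str, str] = {}
    attribute_confidence: Dict[str, float] = {}
    for field, value, score in scored_entities:
        # Keep only the most confident value when a field appears more than once.
        if score <= attribute_confidence.get(field, -1.0):
            continue
        attributes.pop(field, None)
        low_confidence.pop(field, None)
        attribute_confidence[field] = score
        target = attributes if score >= threshold else low_confidence
        target[field] = value
    scores = list(attribute_confidence.values())
    # Assumption: overall confidence is the mean of the per-field scores.
    overall = sum(scores) / len(scores) if scores else 0.0
    return {
        "attributes": attributes,
        "low_confidence_attributes": low_confidence,
        "attribute_confidence": attribute_confidence,
        "message": message,
        "confidence": overall,
        "model_version": model_version,
    }


# Illustrative scores only, not real model output.
result = build_result(
    "[auth] failed login from 10.1.2.3",
    [("service", "auth", 0.98), ("ip", "10.1.2.3", 0.91), ("event", "login", 0.42)],
)
print(result)
```

Per-entity scores could, for example, be derived from the softmax of the logits over each entity's tokens, but that choice is also left open by the card.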
+## Training
+### Dataset Format
+Training data in JSONL format with character-offset annotations:
+```json
+{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3","entities":[{"start":1,"end":5,"label":"service"},{"start":29,"end":32,"label":"user_id"},{"start":38,"end":46,"label":"ip"}]}
+```
+Dataset used: [Aliph0th/logtheus-ml-ds](https://huggingface.co/datasets/Aliph0th/logtheus-ml-ds)
+**Fields:**
+- `id`: Example identifier (string)
+- `text`: Raw log line (string)
+- `entities`: List of entity annotations
+  - `start`, `end`: Character offsets into `text` (0-indexed, `end` exclusive)
+  - `label`: Canonical field name
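Character-offset annotations have to be projected onto tokens before a token classifier can train on them. The sketch below shows one way to do that, assuming a token belongs to an entity when its span lies fully inside `[start, end)`. `spans_to_bio` is a hypothetical helper, and the regex tokenizer is only a stand-in so the example runs without downloading a model; a real pipeline would more likely take offsets from the BERT tokenizer with `return_offsets_mapping=True`.

```python
import re
from typing import List, Tuple


def spans_to_bio(offsets: List[Tuple[int, int]], entities: List[dict]) -> List[str]:
    """Assign a BIO tag to every token from character-level entity spans."""
    tags = ["O"] * len(offsets)
    for entity in entities:
        first = True
        for i, (tok_start, tok_end) in enumerate(offsets):
            if tok_start == tok_end:
                continue  # empty span: special token, stays "O"
            # A token belongs to the entity when it lies fully inside [start, end).
            if tok_start >= entity["start"] and tok_end <= entity["end"]:
                tags[i] = ("B-" if first else "I-") + entity["label"]
                first = False
    return tags


# Hypothetical annotated line using labels from the canonical field table.
text = "GET /api/users 200 12ms"
entities = [
    {"start": 0, "end": 3, "label": "method"},
    {"start": 4, "end": 14, "label": "path"},
    {"start": 15, "end": 18, "label": "status_code"},
    {"start": 19, "end": 23, "label": "duration"},
]

# Stand-in tokenizer: runs of word characters and single punctuation marks.
offsets = [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]
tags = spans_to_bio(offsets, entities)
for (start, end), tag in zip(offsets, tags):
    print(text[start:end], tag)
```

This sketch tags every token inside a span. Some training setups instead label only the first subword of each word and mask the rest with `-100` so the loss ignores them; the card does not say which variant was used.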
+### Training Procedure
+```bash
+# 1. Prepare raw log files (deduplicate, split train/val)
+python scripts/process_data.py data/annotated/ --p 0.8
+# 2. Train model
+python training/train_token_classifier.py \
+  --train-file data/train.jsonl \
+  --val-file data/val.jsonl \
+  --output-dir artifacts/model_v1 \
+  --base-model bert-large-uncased \
+  --epochs 5 \
+  --batch-size 16
+```
+**Hyperparameters:**
+- Learning rate: 3e-5
+- Batch size: 16 (per device)
+- Epochs: 5 (with early stopping on validation F1)
+- Optimizer: AdamW
+- Weight decay: 0.01
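The card does not specify which F1 drives early stopping or how much patience is allowed. The sketch below assumes exact-match, micro-averaged entity-level F1 and a patience of two epochs; both `entity_f1` and `best_epoch` are hypothetical helpers, not functions from the training script.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def entity_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    """Micro-averaged F1 over exact-match entity spans, one set per log line."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0  # also covers empty predictions or empty gold sets
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def best_epoch(f1_per_epoch: List[float], patience: int = 2) -> int:
    """Return the 0-based epoch with the best F1 seen before stopping.

    Training stops once `patience` epochs pass without a new best score.
    """
    best, best_idx = -1.0, -1
    for epoch, f1 in enumerate(f1_per_epoch):
        if f1 > best:
            best, best_idx = f1, epoch
        elif epoch - best_idx >= patience:
            break
    return best_idx


# Illustrative values only, not measured results for this model.
stop_at = best_epoch([0.70, 0.78, 0.77, 0.76, 0.80], patience=2)
print(stop_at)
```

With exact matching, a predicted span that is off by a single character counts as both a false positive and a false negative, which makes this metric stricter than token-level accuracy.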
+## Limitations
+- **English logs only:** Trained on ASCII/UTF-8 log text in English
+- **Format dependency:** Works best on semi-structured logs (key=value, JSON, or naturally worded messages); highly custom formats may need domain-specific preprocessing
+## Contact & Support
+For issues, questions, or contributions, please visit:
+- **Repository:** https://github.com/Aliph0th/logtheus-ml
+- **Issues:** https://github.com/Aliph0th/logtheus-ml/issues
+## Acknowledgments
+- Based on [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
+- Built with [Hugging Face Transformers](https://huggingface.co/transformers/)