| language: | |
| - en | |
| base_model: | |
| - google-bert/bert-large-uncased | |
| pipeline_tag: token-classification | |
| library_name: transformers | |
| tags: | |
| - token-classification | |
| - named-entity-recognition | |
| - ner | |
| - bert | |
| - logs | |
| datasets: | |
| - Aliph0th/logtheus-ml-ds | |
| # Log Entity Extractor (BERT-based Token Classifier) | |
| A fine-tuned BERT model for extracting canonical attributes from log lines using token classification (NER-style task). Created for my coursework. | |
| ## Model Description | |
| This model is based on `bert-large-uncased` and trained to perform **token classification** on log messages. It extracts structured attributes (service, level, event, error_code, user_id, ip, etc.) from unstructured log text using a BIO tagging scheme. | |
| **Use case:** Convert raw log lines into canonical, structured key-value pairs for downstream analysis, alerting, or aggregation. | |
| ## Model Details | |
| - **Base Model:** `bert-large-uncased` | |
| - **Task:** Token Classification (Named Entity Recognition) | |
| - **Training Data:** Annotated log lines (character-level entity offsets) | |
| - **Input:** Raw log text (string) | |
| - **Output:** Per-token BIO labels → grouped entities as canonical attributes | |
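The last step, grouping per-token BIO labels into entities, can be sketched as follows. This is a minimal illustration, not the project's actual post-processing: `group_bio` is a hypothetical helper, it works on whole-word tokens (WordPiece sub-tokens would need to be merged first), and the example tokens and labels are hand-written rather than model output.

```python
def group_bio(tokens, labels):
    """Merge (token, BIO label) pairs into (field, text) entities.

    An I- tag that does not continue the currently open field is
    treated leniently as the start of a new entity.
    """
    entities = []
    field, parts = None, []
    for token, label in zip(tokens, labels):
        prefix, _, name = label.partition("-")
        if prefix == "I" and name == field:
            # Continuation of the entity that is currently open.
            parts.append(token)
            continue
        if field is not None:
            # Any other tag closes the open entity.
            entities.append((field, " ".join(parts)))
        if prefix in ("B", "I"):
            field, parts = name, [token]
        else:
            field, parts = None, []
    if field is not None:
        entities.append((field, " ".join(parts)))
    return entities


# Hand-written example (not model output).
tokens = ["auth", "error", "connection", "timed", "out"]
labels = ["B-service", "B-level", "B-error_message", "I-error_message", "I-error_message"]
entities = group_bio(tokens, labels)
print(entities)
```

Each resulting pair maps a canonical field to the text span it covers, which is the key-value form the downstream use case needs.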
| ## Canonical Label Set | |
| The model extracts the following canonical fields: | |
| | Field | Description | | |
| |-------|-------------| | |
| | service | Application or service name (e.g., "auth", "api") | | |
| | level | Log level (e.g., "info", "error", "warn") | | |
| | timestamp | Timestamp or date reference | | |
| | environment | Deployment environment (e.g., "prod", "staging") | | |
| | event | Event type or action (e.g., "login", "request") | | |
| | error_message | Human-readable error message | | |
| | status_code | HTTP or service status code | | |
| | duration | Elapsed time of a request or operation | |
| | ip | IP address (client or server) | | |
| | method | HTTP method (GET, POST, etc.) | | |
| | path | URL path or resource path | | |
| | useragent | User-Agent header | | |
| | hostname | Server hostname | | |
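To show how a BIO tag inventory follows from this table, the sketch below expands each field into a `B-` and an `I-` tag and adds `O`. The helper name `build_bio_labels` and the ordering of the tags are assumptions made for illustration; the authoritative mapping is whatever `model.config.id2label` contains.

```python
# Field names copied from the table above.
CANONICAL_FIELDS = [
    "service", "level", "timestamp", "environment", "event",
    "error_message", "status_code", "duration", "ip", "method",
    "path", "useragent", "hostname",
]


def build_bio_labels(fields):
    """Return ["O", "B-x", "I-x", ...] for the given field names (hypothetical helper)."""
    labels = ["O"]
    for field in fields:
        labels.append(f"B-{field}")
        labels.append(f"I-{field}")
    return labels


labels = build_bio_labels(CANONICAL_FIELDS)
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))  # 13 fields * 2 tags + "O" = 27
```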
| ## Usage | |
| ### Installation | |
| ```bash | |
| pip install transformers torch | |
| ``` | |
| ### Python (Hugging Face Transformers) | |
| ```python | |
| from transformers import AutoTokenizer, AutoModelForTokenClassification | |
| import torch | |
| model_name = "Aliph0th/logtheus-ml-large" | |
| tokenizer = AutoTokenizer.from_pretrained(model_name) | |
| model = AutoModelForTokenClassification.from_pretrained(model_name) | |
| text = "[auth] failed login for user 123 from 10.1.2.3 code=E401" | |
| # Tokenize and forward pass | |
| inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256) | |
| outputs = model(**inputs) | |
| logits = outputs.logits | |
| # Get predicted label IDs | |
| predicted_ids = torch.argmax(logits, dim=-1) | |
| # Map back to label names | |
| id2label = model.config.id2label | |
| predictions = [[id2label[int(p)] for p in pred] for pred in predicted_ids] | |
| print(predictions) | |
| ``` | |
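| The snippet above prints raw per-token BIO labels only; it does not build the structured output. The project's post-processed result is a JSON object with: | |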
| - `attributes`: High-confidence extractions (dict of canonical_field → value) | |
| - `low_confidence_attributes`: Below-threshold extractions | |
| - `attribute_confidence`: Per-field confidence scores | |
| - `message`: Original log text | |
| - `confidence`: Overall prediction confidence (0-1) | |
| - `model_version`: Model version string | |
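A rough sketch of how an object with these keys could be assembled from already-grouped entities is shown below. The source does not say how confidence is computed, so everything here is an assumption: the `build_result` helper, the `0.5` threshold, taking the minimum field score as the overall confidence, and the made-up scores in the example.

```python
def build_result(message, scored, threshold=0.5, model_version="unknown"):
    """Assemble the output dict from {field: (value, score)}.

    Assumptions: `threshold` separates high- from low-confidence fields,
    and the overall confidence is the lowest per-field score.
    """
    attributes, low_confidence, per_field = {}, {}, {}
    for field, (value, score) in scored.items():
        per_field[field] = score
        target = attributes if score >= threshold else low_confidence
        target[field] = value
    overall = min(per_field.values()) if per_field else 0.0
    return {
        "attributes": attributes,
        "low_confidence_attributes": low_confidence,
        "attribute_confidence": per_field,
        "message": message,
        "confidence": overall,
        "model_version": model_version,
    }


# Made-up scores, for illustration only.
result = build_result(
    "[auth] failed login from 10.1.2.3",
    {"service": ("auth", 0.97), "event": ("login", 0.41), "ip": ("10.1.2.3", 0.93)},
)
print(result["attributes"])
print(result["low_confidence_attributes"])
```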
| ## Training | |
| ### Dataset Format | |
| Training data in JSONL format with character-offset annotations: | |
| ```json | |
| {"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3","entities":[{"start":1,"end":5,"label":"service"},{"start":29,"end":32,"label":"user_id"},{"start":38,"end":46,"label":"ip"}]} | |
| ``` | |
| Dataset used: [Aliph0th/logtheus-ml-ds](https://huggingface.co/datasets/Aliph0th/logtheus-ml-ds) | |
| **Fields:** | |
| - `text`: Raw log line (string) | |
| - `entities`: List of entity annotations | |
| - `start`, `end`: Character-level offsets in `text` (0-indexed, `end` exclusive) | |
| - `label`: Canonical field name | |
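Character-offset annotations have to be turned into per-token BIO labels before training. The sketch below illustrates the idea with plain whitespace tokenization, which is an assumption made to keep the example self-contained; a real pipeline would more likely align against the tokenizer's sub-token offsets (for example via `return_offsets_mapping=True` on a fast tokenizer). `spans_to_bio` is a hypothetical helper, not the project's code.

```python
import re


def spans_to_bio(text, entities):
    """Label each whitespace-delimited token using [start, end) character spans."""
    tokens, labels = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        label = "O"
        for ent in entities:
            # The token overlaps the entity span.
            if tok_start < ent["end"] and tok_end > ent["start"]:
                # The first overlapping token gets B-, later ones get I-.
                prefix = "B" if tok_start <= ent["start"] else "I"
                label = f"{prefix}-{ent['label']}"
                break
        tokens.append(match.group())
        labels.append(label)
    return tokens, labels


record = {
    "text": "[auth] failed login from 10.1.2.3",
    "entities": [
        {"start": 1, "end": 5, "label": "service"},
        {"start": 25, "end": 33, "label": "ip"},
    ],
}
tokens, labels = spans_to_bio(record["text"], record["entities"])
print(list(zip(tokens, labels)))
```

Note that whitespace tokens are coarse: `[auth]` keeps its brackets even though the annotated span covers only `auth`, which is one reason sub-token alignment is preferable in practice.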
| ### Training Procedure | |
| ```bash | |
| # 1. Prepare raw log files (deduplicate, split train/val) | |
| python scripts/process_data.py data/annotated/ --p 0.8 | |
| # 2. Train model | |
| python training/train_token_classifier.py \ | |
| --train-file data/train.jsonl \ | |
| --val-file data/val.jsonl \ | |
| --output-dir artifacts/model_v1 \ | |
| --base-model bert-large-uncased \ | |
| --epochs 5 \ | |
| --batch-size 16 | |
| ``` | |
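The preparation step (deduplicate, then split into train and validation) could look roughly like the sketch below. The behaviour of `scripts/process_data.py` is not documented here, so this is only a guess at the general idea: it assumes `--p 0.8` is the training fraction, deduplicates on the exact `text` value, and uses a hypothetical `dedupe_and_split` helper with a fixed seed.

```python
import random


def dedupe_and_split(records, train_fraction=0.8, seed=0):
    """Drop records with duplicate text, shuffle, and split into (train, val)."""
    seen, unique = set(), []
    for rec in records:
        if rec["text"] not in seen:
            seen.add(rec["text"])
            unique.append(rec)
    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(unique)
    cut = int(len(unique) * train_fraction)
    return unique[:cut], unique[cut:]


# Ten distinct lines plus two exact duplicates.
records = [{"text": f"line {i}", "entities": []} for i in range(10)]
records += [{"text": "line 0", "entities": []}, {"text": "line 3", "entities": []}]
train, val = dedupe_and_split(records)
print(len(train), len(val))
```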
| **Hyperparameters:** | |
| - Learning rate: 3e-5 | |
| - Batch size: 16 (per device) | |
| - Epochs: 5 (with early stopping by F1) | |
| - Optimizer: AdamW | |
| - Weight decay: 0.01 | |
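Early stopping by F1 can be illustrated with the sketch below. It assumes an exact-match, entity-level F1 over `(start, end, label)` spans and a patience of 2 epochs; neither detail is stated in this card, and the per-epoch scores in the example are invented for illustration, not training results.

```python
def span_f1(gold, predicted):
    """Exact-match F1 over sets of (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_epoch(f1_per_epoch, patience=2):
    """Return the index of the best epoch seen before training would stop."""
    best_idx, best_f1, stale = 0, float("-inf"), 0
    for idx, f1 in enumerate(f1_per_epoch):
        if f1 > best_f1:
            best_idx, best_f1, stale = idx, f1, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_idx


gold = [(1, 5, "service"), (25, 33, "ip")]
pred = [(1, 5, "service"), (20, 24, "event")]
score = span_f1(gold, pred)

# Invented validation scores: training stops after epoch index 3.
kept = best_epoch([0.70, 0.78, 0.77, 0.76, 0.90])
print(score, kept)
```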
| ## Limitations | |
| - **English logs only:** Trained on ASCII/UTF-8 log text in English | |
| - **Format dependency:** Works best on semi-structured logs (key=value, JSON, or naturally worded messages); highly custom formats may need domain-specific preprocessing | |
| ## Contact & Support | |
| For issues, questions, or contributions, please visit: | |
| - **Repository:** https://github.com/Aliph0th/logtheus-ml | |
| - **Issues:** https://github.com/Aliph0th/logtheus-ml/issues | |
| ## Acknowledgments | |
| - Based on [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) | |
| - Built with [Hugging Face Transformers](https://huggingface.co/transformers/) |