---
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: token-classification
library_name: transformers
tags:
- token-classification
- named-entity-recognition
- ner
- bert
- logs
datasets:
- Aliph0th/logtheus-ml-ds
---
# Log Entity Extractor (BERT-based Token Classifier)
A fine-tuned BERT model that extracts canonical attributes from log lines using token classification (an NER-style task). Created for my coursework.
## Model Description
This model is based on `bert-base-uncased` and trained to perform **token classification** on log messages. It extracts structured attributes (service, level, event, error_code, user_id, ip, etc.) from unstructured log text using a BIO tagging scheme.
**Use case:** Convert raw log lines into canonical, structured key-value pairs for downstream analysis, alerting, or aggregation.
## Model Details
- **Base Model:** `bert-base-uncased`
- **Task:** Token Classification (Named Entity Recognition)
- **Training Data:** Annotated log lines (character-level entity offsets)
- **Input:** Raw log text (string)
- **Output:** Per-token BIO labels → grouped entities as canonical attributes
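The grouping step in the last bullet can be sketched as follows. This is a minimal illustration, assuming word-level tokens and labels of the form `B-<field>`, `I-<field>`, or `O`; the helper name `group_bio` is hypothetical and is not taken from the repository.

```python
def group_bio(tokens, labels):
    """Merge BIO-tagged tokens into (field, text) pairs (hypothetical helper)."""
    entities = []
    field, span = None, []
    for token, label in zip(tokens, labels):
        prefix, _, name = label.partition("-")
        if prefix == "I" and name == field:
            span.append(token)  # continue the entity that is currently open
            continue
        if field is not None:
            entities.append((field, " ".join(span)))  # close the open entity
        # "B-x" opens a new entity; a stray "I-x" is treated as a start too.
        field, span = (name, [token]) if prefix in ("B", "I") else (None, [])
    if field is not None:
        entities.append((field, " ".join(span)))
    return entities


tokens = ["auth", "failed", "login", "from", "10.1.2.3"]
labels = ["B-service", "O", "B-event", "O", "B-ip"]
print(group_bio(tokens, labels))
```

With real model output, the tokens would be BERT wordpieces, so pieces starting with `##` would need to be merged back into words before or during grouping.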
## Canonical Label Set
The model extracts the following canonical fields:
| Field | Description |
|-------|-------------|
| service | Application or service name (e.g., "auth", "api") |
| level | Log level (e.g., "info", "error", "warn") |
| timestamp | Timestamp or date reference |
| environment | Deployment environment (e.g., "prod", "staging") |
| event | Event type or action (e.g., "login", "request") |
| error_message | Human-readable error message |
| status_code | HTTP or service status code |
| duration | Elapsed time of a request or operation |
| ip | IP address (client or server) |
| method | HTTP method (GET, POST, etc.) |
| path | URL path or resource path |
| useragent | User-Agent header |
| hostname | Server hostname |
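Under a BIO scheme, each field in the table typically contributes a `B-` and an `I-` label, plus a shared `O` label for tokens outside any entity. The sketch below builds such a label map from the 13 fields listed above. The ordering and the `B-`/`I-` naming are assumptions; the authoritative mapping is the one stored in `model.config.id2label`.

```python
CANONICAL_FIELDS = [
    "service", "level", "timestamp", "environment", "event",
    "error_message", "status_code", "duration", "ip", "method",
    "path", "useragent", "hostname",
]

# "O" marks tokens outside any entity; each field gets a begin and an inside tag.
labels = ["O"] + [f"{prefix}-{field}" for field in CANONICAL_FIELDS for prefix in ("B", "I")]
label2id = {label: index for index, label in enumerate(labels)}
id2label = {index: label for label, index in label2id.items()}

print(len(labels))  # 13 fields * 2 tags + "O" = 27
```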
## Usage
### Installation
```bash
pip install transformers torch
```
### Python (Hugging Face Transformers)
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "Aliph0th/logtheus-ml-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

text = "[auth] failed login for user 123 from 10.1.2.3 code=E401"

# Tokenize and run a forward pass (no gradients needed for inference)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

# Get the predicted label ID for each token of the single input sequence
predicted_ids = torch.argmax(logits, dim=-1)[0]

# Map label IDs back to BIO label names and print them next to their tokens
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predicted_ids):
    print(token, id2label[int(label_id)])
```
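The snippet above only prints the predicted BIO label for each token. Grouping those labels into attributes is a separate post-processing step; the resulting structured JSON object contains: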
- `attributes`: High-confidence extractions (dict of canonical_field → value)
- `low_confidence_attributes`: Below-threshold extractions
- `attribute_confidence`: Per-field confidence scores
- `message`: Original log text
- `confidence`: Overall prediction confidence (0-1)
- `model_version`: Model version string
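How the high- and low-confidence groups are separated is not specified here. One plausible reading, sketched below, is a fixed per-field threshold. The `threshold` value of 0.5, the `split_by_confidence` helper, and the use of the mean as the overall `confidence` are all assumptions for illustration, not the project's documented behavior; `model_version` is left out because its format is not described.

```python
def split_by_confidence(message, scored, threshold=0.5):
    """Build a result dict shaped like the fields listed above.

    `scored` maps field -> (value, confidence). Hypothetical helper.
    """
    result = {
        "attributes": {},
        "low_confidence_attributes": {},
        "attribute_confidence": {},
        "message": message,
    }
    for field, (value, conf) in scored.items():
        bucket = "attributes" if conf >= threshold else "low_confidence_attributes"
        result[bucket][field] = value
        result["attribute_confidence"][field] = conf
    scores = list(result["attribute_confidence"].values())
    # Assumption: overall confidence is the mean of the per-field scores.
    result["confidence"] = sum(scores) / len(scores) if scores else 0.0
    return result


example = split_by_confidence(
    "[auth] failed login from 10.1.2.3",
    {"service": ("auth", 0.9), "ip": ("10.1.2.3", 0.3)},
)
print(example)
```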
## Training
### Dataset Format
Training data in JSONL format with character-offset annotations:
```json
{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3","entities":[{"start":1,"end":5,"label":"service"},{"start":29,"end":32,"label":"user_id"},{"start":38,"end":46,"label":"ip"}]}
```
Dataset used: [Aliph0th/logtheus-ml-ds](https://huggingface.co/datasets/Aliph0th/logtheus-ml-ds)
**Fields:**
- `text`: Raw log line (string)
- `entities`: List of entity annotations
  - `start`, `end`: Character-level offsets in `text` (0-indexed, `end` exclusive)
  - `label`: Canonical field name
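To train a token classifier, character-offset annotations like these have to be projected onto tokens. The sketch below does this with a simple regex tokenizer so that it runs without dependencies, using the `service` and `ip` spans from the example record. The real training script presumably aligns against BERT wordpieces instead (an assumption), and `to_bio` is a hypothetical name.

```python
import json
import re


def to_bio(text, entities):
    """Project character-offset entities onto regex tokens as BIO labels."""
    tokens, labels = [], []
    started = set()  # indices of entities that already received a B- token
    for match in re.finditer(r"[\w.]+|[^\w\s]", text):
        label = "O"
        for index, ent in enumerate(entities):
            # A token belongs to an entity if it lies fully inside its span
            # (offsets are treated as end-exclusive).
            if ent["start"] <= match.start() and match.end() <= ent["end"]:
                prefix = "I" if index in started else "B"
                started.add(index)
                label = f"{prefix}-{ent['label']}"
                break
        tokens.append(match.group())
        labels.append(label)
    return tokens, labels


line = (
    '{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3",'
    '"entities":[{"start":1,"end":5,"label":"service"},'
    '{"start":38,"end":46,"label":"ip"}]}'
)
record = json.loads(line)
tokens, labels = to_bio(record["text"], record["entities"])
for token, label in zip(tokens, labels):
    print(token, label)
```

Printing `record["text"][ent["start"]:ent["end"]]` for each entity is also a quick way to catch off-by-one errors in annotations.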
### Training Procedure
```bash
# 1. Prepare raw log files (deduplicate, split train/val)
python scripts/process_data.py data/annotated/ --p 0.8
# 2. Train model
python training/train_token_classifier.py \
--train-file data/train.jsonl \
--val-file data/val.jsonl \
--output-dir artifacts/model_v1 \
--base-model bert-base-uncased \
--epochs 5 \
--batch-size 16
```
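Step 1 can be pictured as follows. This only illustrates deduplicating records and splitting them at a 0.8 ratio; it is not the code of `scripts/process_data.py`, and the function name, the seed, and the reading of `--p` as the training fraction are assumptions.

```python
import random


def dedupe_and_split(records, train_fraction=0.8, seed=42):
    """Drop records with duplicate text, shuffle, and split into train/val."""
    seen, unique = set(), []
    for record in records:
        if record["text"] not in seen:
            seen.add(record["text"])
            unique.append(record)
    random.Random(seed).shuffle(unique)  # fixed seed keeps the split reproducible
    cut = int(len(unique) * train_fraction)
    return unique[:cut], unique[cut:]


# 20 records, but only 10 distinct texts
records = [{"text": f"log line {i % 10}"} for i in range(20)]
train, val = dedupe_and_split(records)
print(len(train), len(val))  # 8 2
```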
**Hyperparameters:**
- Learning rate: 3e-5
- Batch size: 16 (per device)
- Epochs: 5 (with early stopping by F1)
- Optimizer: AdamW
- Weight decay: 0.01
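"Early stopping by F1" usually means tracking validation F1 after each epoch and stopping once it no longer improves. The sketch below shows that idea in plain Python. The `patience` value, the exact-match entity-level F1 definition, and the per-epoch scores are made up for illustration and were not taken from the training script.

```python
def entity_f1(gold, predicted):
    """Exact-match F1 over sets of (start, end, label) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_epoch(f1_per_epoch, patience=1):
    """Return the index of the best epoch seen before stopping.

    Training stops after `patience` consecutive epochs without improvement.
    """
    best_index, best_f1, stale = 0, float("-inf"), 0
    for index, f1 in enumerate(f1_per_epoch):
        if f1 > best_f1:
            best_index, best_f1, stale = index, f1, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_index


gold = [(1, 5, "service"), (38, 46, "ip")]
pred = [(1, 5, "service"), (38, 45, "ip")]  # the ip span is one character short
print(entity_f1(gold, pred))  # 0.5

# Illustrative validation scores only, not measured results
print(best_epoch([0.71, 0.80, 0.79, 0.85, 0.84]))  # 1
```

With `patience=1` the loop stops at the first drop, so the later, higher score is never reached; a larger patience trades extra training time for a chance to recover from a temporary dip.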
## Limitations
- **English logs only:** Trained on ASCII/UTF-8 log text in English
- **Format dependency:** Works best on semi-structured logs (key=value, JSON, or naturally worded messages); highly custom formats may need domain-specific preprocessing
## Contact & Support
For issues, questions, or contributions, please visit:
- **Repository:** https://github.com/Aliph0th/logtheus-ml
- **Issues:** https://github.com/Aliph0th/logtheus-ml/issues
## Acknowledgments
- Based on [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
- Built with [Hugging Face Transformers](https://huggingface.co/transformers/)