---
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: token-classification
library_name: transformers
tags:
- token-classification
- named-entity-recognition
- ner
- bert
- logs
datasets:
- Aliph0th/logtheus-ml-ds
---
# Log Entity Extractor (BERT-based Token Classifier)

A fine-tuned BERT model that extracts canonical attributes from log lines using token classification (an NER-style task). Created as coursework.

## Model Description

This model is based on `bert-base-uncased` and trained to perform **token classification** on log messages. It extracts structured attributes (service, level, event, error_code, user_id, ip, etc.) from unstructured log text using a BIO tagging scheme.

**Use case:** Convert raw log lines into canonical, structured key-value pairs for downstream analysis, alerting, or aggregation.

## Model Details

- **Base Model:** `bert-base-uncased`
- **Task:** Token Classification (Named Entity Recognition)
- **Training Data:** Annotated log lines (character-level entity offsets)
- **Input:** Raw log text (string)
- **Output:** Per-token BIO labels → grouped entities as canonical attributes
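
The last step, turning per-token BIO labels into grouped entities, can be sketched as follows. This is a minimal illustration rather than the repository's code: the function name `group_bio` and the choice to join tokens with spaces are assumptions.

```python
def group_bio(tokens, labels):
    """Merge consecutive B-/I- tagged tokens into (field, value) pairs."""
    entities = []
    field, words = None, []
    for token, label in zip(tokens, labels):
        prefix, _, name = label.partition("-")
        continues = prefix == "I" and name == field
        if not continues:
            # Close the span that was open, if any.
            if field is not None:
                entities.append((field, " ".join(words)))
            field, words = None, []
            # A stray I- tag without a preceding B- is treated as a span start.
            if prefix in ("B", "I"):
                field = name
        if field is not None:
            words.append(token)
    if field is not None:
        entities.append((field, " ".join(words)))
    return entities


tokens = ["auth", "failed", "login", "from", "10.1.2.3"]
labels = ["B-service", "O", "B-event", "O", "B-ip"]
print(group_bio(tokens, labels))
```

With real model output, WordPiece fragments (tokens starting with `##`) would also need to be merged back into words before joining.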

## Canonical Label Set

The model extracts attributes from these canonical fields:

| Field | Description |
|-------|-------------|
| service | Application or service name (e.g., "auth", "api") |
| level | Log level (e.g., "info", "error", "warn") |
| timestamp | Timestamp or date reference |
| environment | Deployment environment (e.g., "prod", "staging") |
| event | Event type or action (e.g., "login", "request") |
| error_message | Human-readable error message |
| status_code | HTTP or service status code |
| duration | Duration of a request or operation |
| ip | IP address (client or server) |
| method | HTTP method (GET, POST, etc.) |
| path | URL path or resource path |
| useragent | User-Agent header |
| hostname | Server hostname |
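
Under a BIO scheme, each field in the table typically yields a `B-` (begin) and an `I-` (inside) tag, plus a shared `O` tag for tokens outside any entity. The sketch below builds such a label set from the table; the ordering of IDs is an assumption, and the authoritative mapping is whatever `model.config.id2label` contains.

```python
# Fields copied from the table above.
CANONICAL_FIELDS = [
    "service", "level", "timestamp", "environment", "event",
    "error_message", "status_code", "duration", "ip", "method",
    "path", "useragent", "hostname",
]


def build_label_maps(fields):
    """Return (label2id, id2label) for a BIO tag set over the given fields."""
    labels = ["O"]
    for field in fields:
        labels += [f"B-{field}", f"I-{field}"]
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(CANONICAL_FIELDS)
print(len(label2id))  # 13 fields * 2 tags + "O" = 27
```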

## Usage

### Installation

```bash
pip install transformers torch
```

### Python (Hugging Face Transformers)

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "Aliph0th/logtheus-ml"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "[auth] failed login for user 123 from 10.1.2.3 code=E401"

# Tokenize and run a forward pass without tracking gradients
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring label ID for each token of the single input
predicted_ids = torch.argmax(logits, dim=-1)[0]

# Map label IDs back to label names and pair them with their word pieces
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predicted_ids):
    print(token, id2label[int(label_id)])
```

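The snippet above prints the raw per-token BIO labels. Once those labels are grouped into entities, the full extraction result is a structured JSON object with: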
- `attributes`: High-confidence extractions (dict of canonical_field → value)
- `low_confidence_attributes`: Below-threshold extractions
- `attribute_confidence`: Per-field confidence scores
- `message`: Original log text
- `confidence`: Overall prediction confidence (0-1)
- `model_version`: Model version string
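
How the fields listed above relate to each other can be illustrated with a small sketch. It is not the project's implementation: the function name `build_result`, the 0.5 threshold, averaging per-field scores into the overall `confidence`, and the `"v1"` version string are all assumptions made for illustration.

```python
from statistics import mean


def build_result(message, entities, threshold=0.5, model_version="v1"):
    """Assemble the output dict from (field, value, score) tuples.

    `threshold` and `model_version` are hypothetical defaults.
    """
    attributes, low_conf, scores = {}, {}, {}
    for field, value, score in entities:
        # Keep only the highest-scoring value when a field repeats.
        if field in scores and scores[field] >= score:
            continue
        scores[field] = score
        attributes.pop(field, None)
        low_conf.pop(field, None)
        target = attributes if score >= threshold else low_conf
        target[field] = value
    return {
        "attributes": attributes,
        "low_confidence_attributes": low_conf,
        "attribute_confidence": scores,
        "message": message,
        # Assumed aggregation: mean of the per-field scores.
        "confidence": mean(scores.values()) if scores else 0.0,
        "model_version": model_version,
    }


result = build_result(
    "[auth] failed login from 10.1.2.3",
    [("service", "auth", 0.98), ("ip", "10.1.2.3", 0.40)],
)
print(result["attributes"], result["low_confidence_attributes"])
```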

## Training

### Dataset Format

Training data in JSONL format with character-offset annotations:

```json
{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3","entities":[{"start":1,"end":5,"label":"service"},{"start":29,"end":32,"label":"user_id"},{"start":38,"end":46,"label":"ip"}]}
```

Dataset used: [Aliph0th/logtheus-ml-ds](https://huggingface.co/datasets/Aliph0th/logtheus-ml-ds)

**Fields:**
- `id`: Record identifier (string)
- `text`: Raw log line (string)
- `entities`: List of entity annotations
  - `start`, `end`: Character-level offsets in text (0-indexed, `end` exclusive)
  - `label`: Canonical field name
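
Character offsets have to be projected onto tokens before training. The sketch below shows the idea with a simple regex tokenizer standing in for WordPiece; with a fast Hugging Face tokenizer the token spans would normally come from `return_offsets_mapping=True` instead. The function name `char_spans_to_bio` is hypothetical, and `end` is treated as exclusive, which is what the `service` span (1 to 5 for `auth`) implies.

```python
import re


def char_spans_to_bio(text, entities):
    """Return (tokens, labels) by aligning character spans to regex tokens."""
    tokens, labels = [], []
    # Words (keeping dotted runs such as IPs together) or single punctuation marks.
    for match in re.finditer(r"\w+(?:\.\w+)*|[^\w\s]", text):
        tok_start, tok_end = match.span()
        label = "O"
        for ent in entities:
            # A token belongs to an entity if it lies fully inside [start, end).
            if ent["start"] <= tok_start and tok_end <= ent["end"]:
                prefix = "B" if tok_start == ent["start"] else "I"
                label = f"{prefix}-{ent['label']}"
                break
        tokens.append(match.group())
        labels.append(label)
    return tokens, labels


record = {
    "text": "[auth] failed login for user 123 from 10.1.2.3",
    "entities": [
        {"start": 1, "end": 5, "label": "service"},
        {"start": 29, "end": 32, "label": "user_id"},
        {"start": 38, "end": 46, "label": "ip"},
    ],
}
tokens, labels = char_spans_to_bio(record["text"], record["entities"])
for token, label in zip(tokens, labels):
    print(token, label)
```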

### Training Procedure

```bash
# 1. Prepare raw log files (deduplicate, split train/val)
python scripts/process_data.py data/annotated/ --p 0.8
# 2. Train model
python training/train_token_classifier.py \
  --train-file data/train.jsonl \
  --val-file data/val.jsonl \
  --output-dir artifacts/model_v1 \
  --base-model bert-base-uncased \
  --epochs 5 \
  --batch-size 16
```
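
The contents of `scripts/process_data.py` are not shown here, so the following is only a sketch of what the deduplicate-and-split step might look like. It assumes that `--p 0.8` is the training fraction and that duplicates are detected by identical `text`; the function name and the fixed seed are hypothetical.

```python
import random


def dedupe_and_split(records, train_fraction=0.8, seed=42):
    """Drop records with duplicate text, shuffle, and split into train/val."""
    seen, unique = set(), []
    for record in records:
        if record["text"] not in seen:
            seen.add(record["text"])
            unique.append(record)
    # Seeded shuffle so the split is reproducible.
    random.Random(seed).shuffle(unique)
    cut = int(len(unique) * train_fraction)
    return unique[:cut], unique[cut:]


# 20 records, but only 10 distinct texts.
records = [{"text": f"log line {i % 10}"} for i in range(20)]
train, val = dedupe_and_split(records)
print(len(train), len(val))  # 8 2
```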

**Hyperparameters:**
- Learning rate: 3e-5
- Batch size: 16 (per device)
- Epochs: 5 (with early stopping by F1)
- Optimizer: AdamW
- Weight decay: 0.01
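
Early stopping by F1 needs two pieces: a span-level F1 score on the validation set and a rule for when to stop. The sketch below illustrates both under stated assumptions: exact-match micro F1 over `(start, end, label)` spans, and a patience of 2 epochs. Neither the metric variant nor the patience value is given above.

```python
def span_f1(gold, predicted):
    """Micro F1 over sets of (start, end, label) spans; exact match only."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_epoch(f1_per_epoch, patience=2):
    """Return the 1-based epoch to keep, stopping after `patience` stale epochs."""
    best, best_f1, stale = 0, float("-inf"), 0
    for epoch, f1 in enumerate(f1_per_epoch, start=1):
        if f1 > best_f1:
            best, best_f1, stale = epoch, f1, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best


gold = [(1, 5, "service"), (38, 46, "ip")]
pred = [(1, 5, "service"), (38, 45, "ip")]  # second span has a wrong end offset
print(span_f1(gold, pred))
print(best_epoch([0.70, 0.78, 0.77, 0.76, 0.90]))
```

In the second call training stops after epoch 4, so the later improvement at epoch 5 is never reached; that is the usual trade-off when choosing a small patience.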

## Limitations

- **English logs only:** Trained on ASCII/UTF-8 log text in English
- **Format dependency:** Works best on semi-structured logs (key=value, JSON, or naturally worded messages); highly custom formats may need domain-specific preprocessing

## Contact & Support

For issues, questions, or contributions, please visit:
- **Repository:** https://github.com/Aliph0th/logtheus-ml
- **Issues:** https://github.com/Aliph0th/logtheus-ml/issues

## Acknowledgments

- Based on [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
- Built with [Hugging Face Transformers](https://huggingface.co/transformers/)