Aliph0th committed on
Commit
ea080e9
·
verified ·
1 Parent(s): 9cb6b94

add: README

Files changed (1)
  1. README.md +154 -0
README.md ADDED
@@ -0,0 +1,154 @@
---
language:
- en
base_model:
- google-bert/bert-large-uncased
pipeline_tag: token-classification
library_name: transformers
tags:
- Safetensors
- token-classification
- named-entity-recognition
- ner
- bert
- logs
datasets:
- Aliph0th/logtheus-ml-ds
---

# Log Entity Extractor (BERT-based Token Classifier)

A fine-tuned BERT model for extracting canonical attributes from log lines using token classification (NER-style task).

## Model Description

This model is based on `bert-large-uncased` and trained to perform **token classification** on log messages. It extracts structured attributes (service, level, event, error_code, user_id, ip, etc.) from unstructured log text using a BIO tagging scheme.

**Use case:** Convert raw log lines into canonical, structured key-value pairs for downstream analysis, alerting, or aggregation.

## Model Details

- **Base Model:** `bert-large-uncased`
- **Task:** Token Classification (Named Entity Recognition)
- **Training Data:** Annotated log lines (character-level entity offsets)
- **Input:** Raw log text (string)
- **Output:** Per-token BIO labels → grouped entities as canonical attributes

## Canonical Label Set

The model extracts attributes from these canonical fields:

| Field | Description |
|-------|-------------|
| service | Application or service name (e.g., "auth", "api") |
| level | Log level (e.g., "info", "error", "warn") |
| timestamp | Timestamp or date reference |
| environment | Deployment environment (e.g., "prod", "staging") |
| event | Event type or action (e.g., "login", "request") |
| error_message | Human-readable error message |
| status_code | HTTP or service status code |
| duration | Duration |
| ip | IP address (client or server) |
| method | HTTP method (GET, POST, etc.) |
| path | URL path or resource path |
| useragent | User-Agent header |
| hostname | Server hostname |
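
To illustrate how a BIO tagging scheme turns the fields in the table into classifier labels, here is a minimal sketch. It assumes the common `B-<field>` / `I-<field>` / `O` naming; the helper `build_bio_labels` and the resulting ID order are hypothetical, and the authoritative mapping is whatever `model.config.id2label` contains.

```python
# Canonical fields copied from the table above.
CANONICAL_FIELDS = [
    "service", "level", "timestamp", "environment", "event",
    "error_message", "status_code", "duration", "ip", "method",
    "path", "useragent", "hostname",
]


def build_bio_labels(fields):
    """Return "O" plus a B-/I- tag pair for every field (assumed naming)."""
    labels = ["O"]
    for field in fields:
        labels.extend([f"B-{field}", f"I-{field}"])
    return labels


labels = build_bio_labels(CANONICAL_FIELDS)
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))  # 13 fields * 2 tags + "O" = 27
```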

## Usage

### Installation

```bash
pip install transformers torch
```

### Python (Hugging Face Transformers)

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "Aliph0th/logtheus-ml"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()  # make sure dropout is disabled for inference

text = "[auth] failed login for user 123 from 10.1.2.3 code=E401"

# Tokenize and run a forward pass without tracking gradients
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring label ID for each token of the single input
predicted_ids = torch.argmax(logits, dim=-1)[0].tolist()

# Map IDs to label names and pair them with their WordPiece tokens
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predictions = [(token, id2label[p]) for token, p in zip(tokens, predicted_ids)]
print(predictions)
```
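
The per-token labels printed above still have to be merged into field values. The following is a minimal sketch of that grouping step, assuming the labels follow the `B-<field>` / `I-<field>` / `O` convention and that a character span is available for each token (fast tokenizers can return these with `return_offsets_mapping=True`). The helper `group_entities` and the hand-written offsets and labels are illustrative only and are not taken from the repository.

```python
def group_entities(text, offsets, labels):
    """Merge per-token BIO labels into (field, value) pairs."""
    entities = []
    current = None  # [field, start, end] of the entity being built
    for (start, end), label in zip(offsets, labels):
        prefix, _, field = label.partition("-")
        if start == end or prefix == "O":
            field = None  # special token (zero-width span) or outside any entity
        if field and prefix == "I" and current and current[0] == field:
            current[2] = end  # continuation: extend the open entity
            continue
        if current:
            entities.append((current[0], text[current[1]:current[2]]))
            current = None
        if field:
            current = [field, start, end]  # B- tag, or a stray I- tag treated as a start
    if current:
        entities.append((current[0], text[current[1]:current[2]]))
    return entities


text = "[auth] failed login for user 123 from 10.1.2.3 code=E401"
# Hand-written spans and labels for a few tokens (illustrative, not model output).
offsets = [(0, 0), (0, 1), (1, 5), (5, 6), (14, 19), (38, 40), (40, 41),
           (41, 42), (42, 43), (43, 44), (44, 45), (45, 46), (0, 0)]
labels = ["O", "O", "B-service", "O", "B-event", "B-ip", "I-ip",
          "I-ip", "I-ip", "I-ip", "I-ip", "I-ip", "O"]

print(group_entities(text, offsets, labels))
# [('service', 'auth'), ('event', 'login'), ('ip', '10.1.2.3')]
```

Slicing the original text by character span, rather than joining WordPiece tokens, keeps values such as IP addresses intact even when the tokenizer splits them into many pieces.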

Note that the snippet above prints only raw per-token labels. The structured JSON object produced after grouping those labels into entities contains:
- `attributes`: High-confidence extractions (dict of canonical_field → value)
- `low_confidence_attributes`: Below-threshold extractions
- `attribute_confidence`: Per-field confidence scores
- `message`: Original log text
- `confidence`: Overall prediction confidence (0-1)
- `model_version`: Model version string
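
As a rough illustration of how such an object could be assembled, the sketch below splits extractions by a confidence threshold. Everything beyond the key names listed above is an assumption: the helper `build_result`, the threshold of `0.5`, the version string, and the use of the mean as the overall confidence are placeholders, not the project's actual logic.

```python
import json


def build_result(message, scored, threshold=0.5, model_version="v1"):
    """Build the output dict from scored = {field: (value, confidence)}.

    threshold, model_version and the mean-based overall score are assumptions.
    """
    attributes, low_confidence, per_field = {}, {}, {}
    for field, (value, score) in scored.items():
        per_field[field] = score
        target = attributes if score >= threshold else low_confidence
        target[field] = value
    overall = sum(per_field.values()) / len(per_field) if per_field else 0.0
    return {
        "attributes": attributes,
        "low_confidence_attributes": low_confidence,
        "attribute_confidence": per_field,
        "message": message,
        "confidence": overall,
        "model_version": model_version,
    }


result = build_result(
    "[auth] failed login for user 123 from 10.1.2.3 code=E401",
    {"service": ("auth", 0.9), "ip": ("10.1.2.3", 0.4)},
)
print(json.dumps(result, indent=2))
```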

## Training

### Dataset Format

Training data is stored in JSONL format with character-offset annotations:

```json
{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3","entities":[{"start":1,"end":5,"label":"service"},{"start":29,"end":32,"label":"user_id"},{"start":38,"end":46,"label":"ip"}]}
```

Dataset used: [Aliph0th/logtheus-ml-ds](https://huggingface.co/datasets/Aliph0th/logtheus-ml-ds)

**Fields:**
- `id`: Record identifier (string)
- `text`: Raw log line (string)
- `entities`: List of entity annotations
  - `start`, `end`: Character-level offsets in `text` (0-indexed, `end` exclusive)
  - `label`: Canonical field name
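
To show how character-offset annotations relate to BIO tags, here is a simplified sketch that tags whitespace-separated tokens. This is an assumption made for readability: an actual training pipeline would align spans to the model's WordPiece tokens instead, and the helper `spans_to_bio` is hypothetical.

```python
import re


def spans_to_bio(text, entities):
    """Tag whitespace-separated tokens with BIO labels from character spans."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        label = "O"
        for ent in entities:
            # The token overlaps the half-open entity span [start, end)
            if tok_start < ent["end"] and tok_end > ent["start"]:
                prefix = "B" if tok_start <= ent["start"] else "I"
                label = f"{prefix}-{ent['label']}"
                break
        tagged.append((match.group(), label))
    return tagged


record = {
    "text": "[auth] failed login for user 123 from 10.1.2.3",
    "entities": [
        {"start": 1, "end": 5, "label": "service"},
        {"start": 38, "end": 46, "label": "ip"},
    ],
}

# Sanity-check the offsets before tagging.
for ent in record["entities"]:
    print(ent["label"], "->", record["text"][ent["start"]:ent["end"]])

tagged = spans_to_bio(record["text"], record["entities"])
print(tagged)
```

Printing the slice for each annotation is a cheap way to catch off-by-one offsets, such as a span that accidentally includes a leading space.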

### Training Procedure

```bash
# 1. Prepare raw log files (deduplicate, split train/val)
python scripts/process_data.py data/annotated/ --p 0.8

# 2. Train model
python training/train_token_classifier.py \
--train-file data/train.jsonl \
--val-file data/val.jsonl \
--output-dir artifacts/model_v1 \
--base-model bert-large-uncased \
--epochs 5 \
--batch-size 16
```
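
The preparation step (deduplicate, then split into train and validation sets) can be pictured with the sketch below. It is not the code of `scripts/process_data.py`; the helper `dedupe_and_split`, deduplication by exact `text` match, the fixed seed, and the reading of `--p 0.8` as the training fraction are all assumptions.

```python
import random


def dedupe_and_split(records, train_fraction=0.8, seed=0):
    """Drop records with duplicate text, shuffle, and split into train/val."""
    seen, unique = set(), []
    for rec in records:
        if rec["text"] not in seen:
            seen.add(rec["text"])
            unique.append(rec)
    random.Random(seed).shuffle(unique)  # seeded for a reproducible split
    cut = int(len(unique) * train_fraction)
    return unique[:cut], unique[cut:]


records = [{"text": f"line {i}"} for i in range(10)] + [{"text": "line 0"}]
train, val = dedupe_and_split(records)
print(len(train), len(val))  # 8 2
```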

**Hyperparameters:**
- Learning rate: 3e-5
- Batch size: 16 (per device)
- Epochs: 5 (with early stopping by F1)
- Optimizer: AdamW
- Weight decay: 0.01
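
The README does not say which F1 variant drives early stopping. A common choice for NER-style tasks is exact-match, entity-level F1, sketched below under that assumption; the helper `entity_f1` is hypothetical and not part of the training script.

```python
def entity_f1(gold, predicted):
    """Exact-match entity-level F1 over (start, end, label) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(1, 5, "service"), (38, 46, "ip")}
predicted = {(1, 5, "service"), (38, 45, "ip")}  # the ip span is one character short
print(entity_f1(gold, predicted))  # 0.5
```

Under this metric a span that is off by a single character counts as a miss, which is why boundary quality in the annotations matters.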

## Limitations

- **English logs only:** Trained on ASCII/UTF-8 log text in English
- **Format dependency:** Works best on semi-structured logs (key=value, JSON, or naturally worded messages); highly custom formats may need domain-specific preprocessing

## Contact & Support

For issues, questions, or contributions, please visit:
- **Repository:** https://github.com/Aliph0th/logtheus-ml
- **Issues:** https://github.com/Aliph0th/logtheus-ml/issues

## Acknowledgments

- Based on [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
- Built with [Hugging Face Transformers](https://huggingface.co/transformers/)