	---
	language:
	- en
	base_model:
	- google-bert/bert-large-uncased
	pipeline_tag: token-classification
	library_name: transformers
	tags:
	- token-classification
	- named-entity-recognition
	- ner
	- bert
	- logs
	datasets:
	- Aliph0th/logtheus-ml-ds
	---

	# Log Entity Extractor (BERT-based Token Classifier)

	A fine-tuned BERT model that extracts canonical attributes from log lines using token classification (an NER-style task). Created for my coursework.

	## Model Description

	This model is based on `bert-large-uncased` and trained to perform token classification on log messages. It extracts structured attributes (service, level, event, error_code, user_id, ip, etc.) from unstructured log text using a BIO tagging scheme.

	Use case: Convert raw log lines into canonical, structured key-value pairs for downstream analysis, alerting, or aggregation.

	## Model Details

	- Base Model: `bert-large-uncased`
	- Task: Token Classification (Named Entity Recognition)
	- Training Data: Annotated log lines (character-level entity offsets)
	- Input: Raw log text (string)
	- Output: Per-token BIO labels → grouped entities as canonical attributes
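
The grouping step in the last bullet can be sketched as follows. This is a minimal illustration, not the project's actual post-processing: it assumes labels are named `B-<field>` / `I-<field>` / `O` and works on whole words rather than WordPiece sub-tokens. The function name `group_bio` and the sample labels are hypothetical.

```python
from typing import List, Optional, Tuple


def group_bio(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive B-/I- tagged tokens into (field, text) pairs."""
    entities: List[Tuple[str, str]] = []
    field: Optional[str] = None
    parts: List[str] = []

    for token, label in zip(tokens, labels):
        if label.startswith("I-") and field == label[2:]:
            # Continuation of the entity that is currently open
            parts.append(token)
            continue
        # Any other label closes the open entity, if there is one
        if field is not None:
            entities.append((field, " ".join(parts)))
            field, parts = None, []
        if label.startswith("B-"):
            # Start a new entity; "O" and stray I- tags are ignored
            field, parts = label[2:], [token]

    if field is not None:
        entities.append((field, " ".join(parts)))
    return entities


# Hypothetical word-level tokens and labels for illustration
tokens = ["auth", "failed", "login", "from", "10.1.2.3"]
labels = ["B-service", "B-event", "I-event", "O", "B-ip"]
entities = group_bio(tokens, labels)
print(entities)
```

With real model output, sub-tokens (for example `##` pieces) would first need to be merged back into words or character spans before grouping.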

	## Canonical Label Set

	The model extracts the following canonical fields:

	| Field | Description |
	|-------|-------------|
	| service | Application or service name (e.g., "auth", "api") |
	| level | Log level (e.g., "info", "error", "warn") |
	| timestamp | Timestamp or date reference |
	| environment | Deployment environment (e.g., "prod", "staging") |
	| event | Event type or action (e.g., "login", "request") |
	| error_message | Human-readable error message |
	| status_code | HTTP or service status code |
	| duration | Duration |
	| ip | IP address (client or server) |
	| method | HTTP method (GET, POST, etc.) |
	| path | URL path or resource path |
	| useragent | User-Agent header |
	| hostname | Server hostname |

	## Usage

	### Installation

	```bash
	pip install transformers torch
	```

	### Python (Hugging Face Transformers)

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	import torch

	model_name = "Aliph0th/logtheus-ml"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)

	text = "[auth] failed login for user 123 from 10.1.2.3 code=E401"

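	# Tokenize and run a forward pass (no gradients are needed for inference)
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
	with torch.no_grad():
	    logits = model(**inputs).logits

	# Get the predicted label ID for each token of the single input line
	predicted_ids = torch.argmax(logits, dim=-1)[0]

	# Map IDs back to label names and pair them with their tokens
	# (the output includes special tokens such as [CLS] and [SEP])
	id2label = model.config.id2label
	tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
	predictions = [(tok, id2label[int(p)]) for tok, p in zip(tokens, predicted_ids)]
	print(predictions)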
	```

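	The snippet above prints only the raw per-token BIO labels. After entity grouping and confidence thresholding, the full extraction output is a structured JSON object with: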
	- `attributes`: High-confidence extractions (dict of canonical_field → value)
	- `low_confidence_attributes`: Below-threshold extractions
	- `attribute_confidence`: Per-field confidence scores
	- `message`: Original log text
	- `confidence`: Overall prediction confidence (0-1)
	- `model_version`: Model version string
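
As a rough sketch of how such an object could be assembled from grouped entities: the threshold of `0.5`, the version string `"v1"`, and the choice of the mean per-field score as the overall confidence are all assumptions made for illustration, since this README does not state them. `build_result` is a hypothetical helper, not part of the project's API.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical values: the real threshold and version string are not documented here
THRESHOLD = 0.5
MODEL_VERSION = "v1"


def build_result(message: str, entities: List[Tuple[str, str, float]]) -> Dict:
    """Assemble the output object from (field, value, score) triples."""
    best: Dict[str, Tuple[str, float]] = {}
    for field, value, score in entities:
        # Keep only the highest-scoring value for each field
        if field not in best or score > best[field][1]:
            best[field] = (value, score)

    scores = {field: score for field, (_, score) in best.items()}
    return {
        "attributes": {f: v for f, (v, s) in best.items() if s >= THRESHOLD},
        "low_confidence_attributes": {f: v for f, (v, s) in best.items() if s < THRESHOLD},
        "attribute_confidence": scores,
        "message": message,
        # Assumption: overall confidence is the mean of the per-field scores
        "confidence": sum(scores.values()) / len(scores) if scores else 0.0,
        "model_version": MODEL_VERSION,
    }


result = build_result(
    "[auth] failed login for user 123 from 10.1.2.3 code=E401",
    [("service", "auth", 0.98), ("ip", "10.1.2.3", 0.40)],
)
print(json.dumps(result, indent=2))
```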

	## Training

	### Dataset Format

	Training data in JSONL format with character-offset annotations:

	```json
	{"id":"1","text":"[auth] failed login for user 123 from 10.1.2.3","entities":[{"start":1,"end":5,"label":"service"},{"start":29,"end":32,"label":"user_id"},{"start":38,"end":46,"label":"ip"}]}
	```

	Dataset used: [Aliph0th/logtheus-ml-ds](https://huggingface.co/datasets/Aliph0th/logtheus-ml-ds)

	Fields:
	- `id`: Record identifier (string)
	- `text`: Raw log line (string)
	- `entities`: List of entity annotations
	- `start`, `end`: Character offsets of the entity in `text` (0-indexed, `end` exclusive)
	- `label`: Canonical field name of the entity
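
To train a token classifier, character-offset annotations like these have to be projected onto tokens. The sketch below shows one way to do that, under stated assumptions: the regex-based `simple_offsets` is a stand-in for the offset mapping a fast tokenizer would provide, and the `B-`/`I-` label naming is assumed rather than taken from the training script.

```python
import re
from typing import Dict, List, Tuple


def simple_offsets(text: str) -> List[Tuple[int, int]]:
    """Stand-in tokenizer: split into word runs and single punctuation marks."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def spans_to_bio(token_offsets: List[Tuple[int, int]], entities: List[Dict]) -> List[str]:
    """Assign a BIO label to each token from character-level entity spans."""
    labels = ["O"] * len(token_offsets)
    for ent in entities:
        first = True
        for i, (tok_start, tok_end) in enumerate(token_offsets):
            if tok_start == tok_end:
                continue  # zero-width offsets (e.g. special tokens) stay "O"
            # Token lies fully inside the entity span (end is exclusive)
            if tok_start >= ent["start"] and tok_end <= ent["end"]:
                labels[i] = ("B-" if first else "I-") + ent["label"]
                first = False
    return labels


text = "[auth] failed login for user 123 from 10.1.2.3"
annotations = [
    {"start": 1, "end": 5, "label": "service"},
    {"start": 38, "end": 46, "label": "ip"},
]
offsets = simple_offsets(text)
bio = spans_to_bio(offsets, annotations)
for (start, end), label in zip(offsets, bio):
    print(f"{text[start:end]!r:10} {label}")
```

Note how the IP address is split into several tokens: the first receives `B-ip` and the rest `I-ip`, which is what lets the entity be reassembled at inference time.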

	### Training Procedure

	```bash
	# 1. Prepare raw log files (deduplicate, split train/val)
	python scripts/process_data.py data/annotated/ --p 0.8

	# 2. Train model
	python training/train_token_classifier.py \
	--train-file data/train.jsonl \
	--val-file data/val.jsonl \
	--output-dir artifacts/model_v1 \
	--base-model bert-large-uncased \
	--epochs 5 \
	--batch-size 16
	```
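
For readers without access to the repository, step 1 might look roughly like the sketch below. It is not the contents of `scripts/process_data.py`: it assumes that `--p 0.8` means the training fraction, that duplicates are detected by exact `text` match, and that the split is a seeded random shuffle. The function name and seed are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def dedupe_and_split(
    records: List[Dict], train_fraction: float = 0.8, seed: int = 42
) -> Tuple[List[Dict], List[Dict]]:
    """Drop records with duplicate text, then split into train and validation sets."""
    seen = set()
    unique = []
    for record in records:
        if record["text"] not in seen:
            seen.add(record["text"])
            unique.append(record)

    # Shuffle with a fixed seed so the split is reproducible
    random.Random(seed).shuffle(unique)
    cut = int(len(unique) * train_fraction)
    return unique[:cut], unique[cut:]


# Ten records, of which only five have distinct text
records = [{"text": f"line {i % 5}", "entities": []} for i in range(10)]
train, val = dedupe_and_split(records, train_fraction=0.8)
print(len(train), len(val))
```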

	Hyperparameters:
	- Learning rate: 3e-5
	- Batch size: 16 (per device)
	- Epochs: 5 (with early stopping on validation F1)
	- Optimizer: AdamW
	- Weight decay: 0.01
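
The exact F1 variant and stopping rule are not specified above. As one plausible reading, the sketch below computes an exact-match, entity-level F1 over `(start, end, label)` spans and stops once the best score is a given number of epochs old. The `patience` value of 2 and both function names are assumptions for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def entity_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Exact-match F1: a prediction counts only if span and label both match."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def should_stop(f1_history: List[float], patience: int = 2) -> bool:
    """Stop when the best F1 so far is at least `patience` epochs old."""
    if not f1_history:
        return False
    best_epoch = f1_history.index(max(f1_history))
    return len(f1_history) - 1 - best_epoch >= patience


gold = {(1, 5, "service"), (38, 46, "ip")}
pred = {(1, 5, "service"), (38, 45, "ip")}  # second span is off by one character
score = entity_f1(gold, pred)
print(score)
print(should_stop([0.80, 0.85, 0.84, 0.83]))
```

Exact-match scoring is strict: a span that is off by a single character counts as both a false positive and a false negative.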

	## Limitations

	- English logs only: Trained on ASCII/UTF-8 log text in English
	- Format dependency: Works best on semi-structured logs (key=value, JSON, or naturally worded messages); highly custom formats may need domain-specific preprocessing

	## Contact & Support

	For issues, questions, or contributions, please visit:
	- Repository: https://github.com/Aliph0th/logtheus-ml
	- Issues: https://github.com/Aliph0th/logtheus-ml/issues

	## Acknowledgments

	- Based on [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
	- Built with [Hugging Face Transformers](https://huggingface.co/transformers/)