	---
	language:
	- en
	- sl
	library_name: transformers
	pipeline_tag: token-classification
	tags:
	- menthos
	- modernbert
	- log-parsing
	- ner
	- cybersecurity
	---

	# MENTHOS-logparse

	## English

	### Model Description

	MENTHOS-logparse is a token-classification model fine-tuned from `answerdotai/ModernBERT-base` to extract structured fields from raw log lines.
	It uses a maximum sequence length of 256 tokens.

	### Intended Use

	- BIO-style token labeling of log lines.
	- Extraction of fields such as request URL, status, error level, and timestamp.
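
The labeling described above can be sketched with the standard `transformers` token-classification classes. This is a minimal sketch, not the authors' inference code: the repository id is taken from the citation URL in this card, the helper name `tag_log_line` is hypothetical, and it assumes a `transformers` release with ModernBERT support and a checkpoint whose `config.id2label` holds readable label names.

```python
MODEL_ID = "LHRS-UM-FERI/MENTHOS-logparse"  # repo id taken from the citation URL
MAX_LENGTH = 256  # maximum sequence length stated in this card


def tag_log_line(line, model_id=MODEL_ID):
    """Return (token, label) pairs for one raw log line (hypothetical helper)."""
    # Imported lazily so that defining the function does not require the libraries.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(model_id)
    model.eval()

    # Truncate to the sequence length the model was trained with.
    enc = tokenizer(line, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]  # shape: (sequence_length, num_labels)

    label_ids = logits.argmax(dim=-1).tolist()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    # Assumption: id2label in the saved config maps ids to BIO label strings.
    return [(tok, model.config.id2label[i]) for tok, i in zip(tokens, label_ids)]
```

Calling `tag_log_line('192.0.2.10 - - [18/Apr/2022:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 512')` downloads the checkpoint on first use and returns one pair per subword token, including special tokens. The sample log line is illustrative only.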

	### Label Space

	The training code defines BIO labels (with special handling for ignored and padding tokens), including:

	- `B-request_url`, `I-request_url`
	- `B-status`
	- `B-error_level`, `I-error_level`
	- `B-error_message`, `I-error_message`
	- `B-time_received`, `I-time_received`
	- `B-remote_host`, `I-remote_host`
	- additional labels for request-header fields

	The full token-label mapping is defined in the training code.
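
To turn per-token BIO labels into extracted fields, consecutive `B-`/`I-` tags of the same type are usually merged into one span. The sketch below shows one common way to do that. It is an illustration, not the project's post-processing code: the function name `decode_bio` is hypothetical, it assumes an `O` tag for tokens outside any field (the card does not list one), and it uses whitespace tokens for readability, whereas real subword tokens would need tokenizer-aware joining.

```python
def decode_bio(tokens, labels):
    """Group BIO-tagged tokens into (field, text) spans (hypothetical helper)."""
    spans = []
    current_field, current_tokens = None, []

    def flush():
        # Close the span that is currently open, if any.
        if current_field is not None:
            spans.append((current_field, " ".join(current_tokens)))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always starts a new span.
            flush()
            current_field, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_field == label[2:]:
            # An I- tag continues the open span of the same field.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            flush()
            current_field, current_tokens = None, []
    flush()
    return spans


tokens = ["[error]", "File", "does", "not", "exist"]
labels = ["B-error_level", "B-error_message", "I-error_message",
          "I-error_message", "I-error_message"]
fields = decode_bio(tokens, labels)
print(fields)
```

Dropping an `I-` tag that has no matching open span is a design choice made here for simplicity; some decoders instead treat such a tag as the start of a new span.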

	### Training Data

	Trained on the MENTHOS log-parsing dataset.

	### Benchmark Results

	Benchmark results for the MENTHOS evaluation set.

	| model | samples | accuracy | precision | recall | f1 | p50 latency (ms) | throughput (samples/s) |
	| ---------------- | ------: | -------: | --------: | -------: | -------: | ---------------: | ---------------------: |
	| MENTHOS-logparse | 744 | 0.988710 | 0.949088 | 0.936223 | 0.941009 | 24.2599 | 43.75 |

	Reference baseline (Morpheus ONNX):

	| baseline model | accuracy | f1 | p50 latency (ms) | throughput (samples/s) |
	| ------------------------- | -------: | -------: | ---------------: | ---------------------: |
	| log-parsing-20220418.onnx | 0.984583 | 0.932764 | 119.8934 | 8.08 |
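
The relative differences between the two rows can be derived directly from the reported figures. The short calculation below uses only numbers from the tables above. The variable names are illustrative, and the ratios are meaningful only if both models were measured under comparable conditions, which this card does not state.

```python
# Figures copied from the two benchmark tables in this card.
menthos = {"accuracy": 0.988710, "f1": 0.941009,
           "p50_latency_ms": 24.2599, "throughput": 43.75}
morpheus = {"accuracy": 0.984583, "f1": 0.932764,
            "p50_latency_ms": 119.8934, "throughput": 8.08}

# Ratios greater than 1 favor MENTHOS-logparse.
latency_speedup = morpheus["p50_latency_ms"] / menthos["p50_latency_ms"]
throughput_gain = menthos["throughput"] / morpheus["throughput"]

# Absolute differences in quality metrics.
f1_delta = menthos["f1"] - morpheus["f1"]
accuracy_delta = menthos["accuracy"] - morpheus["accuracy"]

print(f"p50 latency ratio: {latency_speedup:.2f}x")
print(f"throughput ratio:  {throughput_gain:.2f}x")
print(f"F1 difference:       {f1_delta:+.6f}")
print(f"accuracy difference: {accuracy_delta:+.6f}")
```

On these figures, p50 latency is about 4.9 times lower and throughput about 5.4 times higher than the baseline, while F1 and accuracy differ by less than one percentage point.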

	### Benchmark Plots

	![LogParsing F1: MENTHOS vs Morpheus](./log-parsing_f1_menthos_vs_morpheus.png)

	![LogParsing Throughput: MENTHOS vs Morpheus](./log-parsing_throughput_samples_per_sec_menthos_vs_morpheus.png)

	![LogParsing Latency Percentiles](./plots/latency_percentiles_log-parsing.png)

	### Limitations

	- Labels are assigned by aligning the tokenized values of the structured columns with matching token substrings in the log line.
	- Domain shift in log formats can reduce extraction quality.
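
The first limitation is easier to see with a small example of substring-based label alignment. The sketch below is an assumption about how such alignment can work, not the project's training code: the names `find_sublist` and `align_labels` are hypothetical, whitespace splitting stands in for the real tokenizer, and the `O` tag for unlabeled tokens is assumed.

```python
def find_sublist(haystack, needle):
    """Return the start index of the first occurrence of needle, or -1."""
    n = len(needle)
    if n == 0:
        return -1
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start
    return -1


def align_labels(line_tokens, fields):
    """Assign BIO labels by locating each field value inside the log line."""
    labels = ["O"] * len(line_tokens)
    for name, value in fields.items():
        value_tokens = value.split()  # stand-in for the real tokenizer
        start = find_sublist(line_tokens, value_tokens)
        if start < 0:
            continue  # value not found verbatim: the field stays unlabeled
        labels[start] = f"B-{name}"
        for i in range(start + 1, start + len(value_tokens)):
            labels[i] = f"I-{name}"
    return labels


line_tokens = "192.0.2.10 GET /index.html 200".split()
fields = {"remote_host": "192.0.2.10", "request_url": "/index.html", "status": "200"}
labels = align_labels(line_tokens, fields)
print(list(zip(line_tokens, labels)))
```

In a scheme like this, a value that appears more than once in the line is matched at its first occurrence, and a value that is reformatted between the structured column and the raw line is not matched at all. Both cases can introduce label noise.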

	### Citation

	```
	@misc{borovic_li-dobnik_kranjec_ferme_2026,
	title = {MENTHOS-logparse},
	author = {Borovic and {Li Dobnik} and Kranjec and Ferme},
	year = {2026},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/LHRS-UM-FERI/MENTHOS-logparse}}
	}
	```

	---

	## Slovenščina

	### Opis modela

	MENTHOS-logparse je model za klasifikacijo tokenov, naučen iz `answerdotai/ModernBERT-base`, za ekstrakcijo strukturiranih polj iz surovih log zapisov.
	Uporablja maksimalno dolžino zaporedja 256.

	### Namen uporabe

	- BIO označevanje tokenov v log vrsticah.
	- Uporabno za polja kot so URL zahteve, status, error level, časovni žig ipd.

	### Prostor oznak

	Skripta učenja definira BIO oznake (ter posebne oznake za ignoriranje/padding), npr.:

	- `B-request_url`, `I-request_url`
	- `B-status`
	- `B-error_level`, `I-error_level`
	- `B-error_message`, `I-error_message`
	- `B-time_received`, `I-time_received`
	- `B-remote_host`, `I-remote_host`

	Celotno mapiranje je definirano v učni kodi.

	### Učni podatki

	Učenje je potekalo na MENTHOS log-parsing datasetu.

	### Rezultati benchmarka

	| model | vzorcev | accuracy | precision | recall | f1 | p50 latenca (ms) | prepustnost (vzorcev/s) |
	| ---------------- | ------: | -------: | --------: | -------: | -------: | ---------------: | ----------------------: |
	| MENTHOS-logparse | 744 | 0.988710 | 0.949088 | 0.936223 | 0.941009 | 24.2599 | 43.75 |

	Referenčni baseline (Morpheus ONNX):

	| baseline model | accuracy | f1 | p50 latenca (ms) | prepustnost (vzorcev/s) |
	| ------------------------- | -------: | -------: | ---------------: | ----------------------: |
	| log-parsing-20220418.onnx | 0.984583 | 0.932764 | 119.8934 | 8.08 |

	### Grafi benchmarka

	![LogParsing F1: MENTHOS vs Morpheus](./log-parsing_f1_menthos_vs_morpheus.png)

	![LogParsing Throughput: MENTHOS vs Morpheus](./log-parsing_throughput_samples_per_sec_menthos_vs_morpheus.png)

	![LogParsing Latency Percentiles](./plots/latency_percentiles_log-parsing.png)

	### Omejitve

	- Ujemanje oznak temelji na poravnavi tokeniziranih podnizov.
	- Pri drugačnih log formatih se lahko kakovost ekstrakcije zmanjša.

	### Citiranje

	```
	@misc{borovic_li-dobnik_kranjec_ferme_2026,
	title = {MENTHOS-logparse},
	author = {Borovic and {Li Dobnik} and Kranjec and Ferme},
	year = {2026},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/LHRS-UM-FERI/MENTHOS-logparse}}
	}
	```