jboksa/modbert-chunker-base

jboksa committed on Mar 21

Commit e5a5213 · verified · 1 Parent(s): 7e8ca69

Update README.md

Files changed (1)

README.md +20 -20

@@ -1,20 +1,20 @@
----
-language:
-- pl
-- en
-license: apache-2.0
-base_model: answerdotai/ModernBERT-base
-tags:
-- chunking
-- semantic-segmentation
-- token-classification
-- modernbert
-- nlp
-- rag
-pipeline_tag: token-classification
-datasets:
-- wikimedia/wikipedia
----
 # ModernBERT Chunker Base 🚀
@@ -31,10 +31,10 @@ This model is a fine-tuned version of **ModernBERT-base**, specialized in **sema
 The easiest way to use this model is through the official library:
 ```python
-from modbert_chunker import Chunker
 # Load the model (runs optimally on CUDA or CPU)
-chunker = Chunker.from_pretrained("jboksa/modbert-chunker-base")
 text = "Your long multi-topic document..."
 chunks = chunker.chunk(text)
@@ -47,7 +47,7 @@ for chunk in chunks:
 ### Dataset
 The model was trained on **Wikipedia (20231101 version)** for both Polish and English.
-- **Preprocessing**: Full articles were cleaned of wiki-noise (references, external links, metadata).
 - **Ground Truth**: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
 - **Packing**: Multiple articles were packed into single `8192` token sequences to maximize training efficiency.

+---
+language:
+- pl
+- en
+license: apache-2.0
+base_model: answerdotai/ModernBERT-base
+tags:
+- chunking
+- semantic-segmentation
+- token-classification
+- modernbert
+- nlp
+- rag
+pipeline_tag: token-classification
+datasets:
+- wikimedia/wikipedia
+---
 # ModernBERT Chunker Base 🚀
 The easiest way to use this model is through the official library:
 ```python
+from fine_chunker import Chunker
 # Load the model (runs optimally on CUDA or CPU)
+chunker = Chunker.from_pretrained(device="cpu", use_onnx=True)
 text = "Your long multi-topic document..."
 chunks = chunker.chunk(text)
 ### Dataset
 The model was trained on **Wikipedia (20231101 version)** for both Polish and English.
+- **Preprocessing**: Full articles were cleaned of wiki-noise (references, external links, metadata). Additionally, the first letter of 40% of chunks was lowercased, and the final period was removed from 40% of chunks.
 - **Ground Truth**: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
 - **Packing**: Multiple articles were packed into single `8192` token sequences to maximize training efficiency.
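
The dataset steps described in the README can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the author's training code: whitespace splitting stands in for the ModernBERT tokenizer, the labeling scheme (1 on the first token of each chunk, 0 elsewhere) is assumed, the noise is read as "lowercase the first character" and "drop the trailing period", and the helper names `split_paragraphs`, `add_noise`, `label_article`, and `pack` are hypothetical.

```python
import random

MAX_TOKENS = 8192  # sequence length stated in the README
NOISE_RATE = 0.4   # the 40% rate stated in the README


def split_paragraphs(article):
    """Split an article on blank lines, the ground-truth boundary marker."""
    return [p.strip() for p in article.split("\n\n") if p.strip()]


def add_noise(chunk, rng):
    """Randomly lowercase the first character and drop a trailing period."""
    if chunk and rng.random() < NOISE_RATE:
        chunk = chunk[0].lower() + chunk[1:]
    if chunk.endswith(".") and rng.random() < NOISE_RATE:
        chunk = chunk[:-1]
    return chunk


def label_article(article, rng):
    """Return (tokens, labels); label 1 marks the first token of a chunk.

    Assumption: whitespace splitting replaces the real tokenizer.
    """
    tokens, labels = [], []
    for paragraph in split_paragraphs(article):
        words = add_noise(paragraph, rng).split()
        if not words:  # the chunk may become empty after noise
            continue
        tokens.extend(words)
        labels.extend([1] + [0] * (len(words) - 1))
    return tokens, labels


def pack(articles, rng, max_tokens=MAX_TOKENS):
    """Greedily pack labeled articles into sequences of at most max_tokens."""
    sequences, cur_tokens, cur_labels = [], [], []
    for article in articles:
        tokens, labels = label_article(article, rng)
        # Truncate a single article that alone exceeds the limit.
        tokens, labels = tokens[:max_tokens], labels[:max_tokens]
        if cur_tokens and len(cur_tokens) + len(tokens) > max_tokens:
            sequences.append((cur_tokens, cur_labels))
            cur_tokens, cur_labels = [], []
        cur_tokens = cur_tokens + tokens
        cur_labels = cur_labels + labels
    if cur_tokens:
        sequences.append((cur_tokens, cur_labels))
    return sequences
```

Calling `pack(articles, random.Random(0))` on a list of cleaned article strings yields `(tokens, labels)` pairs. The greedy strategy here is only one plausible way to fill `8192`-token sequences; the README does not say which packing method was actually used.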