jboksa committed on
Commit
e5a5213
·
verified ·
1 Parent(s): 7e8ca69

Update README.md

Files changed (1)
  1. README.md +20 -20
README.md CHANGED
@@ -1,20 +1,20 @@
- ---
- language:
- - pl
- - en
- license: apache-2.0
- base_model: answerdotai/ModernBERT-base
- tags:
- - chunking
- - semantic-segmentation
- - token-classification
- - modernbert
- - nlp
- - rag
- pipeline_tag: token-classification
- datasets:
- - wikimedia/wikipedia
- ---
 
  # ModernBERT Chunker Base 🚀
 
@@ -31,10 +31,10 @@ This model is a fine-tuned version of **ModernBERT-base**, specialized in **sema
  The easiest way to use this model is through the official library:
 
  ```python
- from modbert_chunker import Chunker
 
  # Load the model (runs optimally on CUDA or CPU)
- chunker = Chunker.from_pretrained("jboksa/modbert-chunker-base")
 
  text = "Your long multi-topic document..."
  chunks = chunker.chunk(text)
@@ -47,7 +47,7 @@ for chunk in chunks:
 
  ### Dataset
  The model was trained on **Wikipedia (20231101 version)** for both Polish and English.
- - **Preprocessing**: Full articles were cleaned of wiki-noise (references, external links, metadata).
  - **Ground Truth**: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
  - **Packing**: Multiple articles were packed into single `8192` token sequences to maximize training efficiency.
 
 
+ ---
+ language:
+ - pl
+ - en
+ license: apache-2.0
+ base_model: answerdotai/ModernBERT-base
+ tags:
+ - chunking
+ - semantic-segmentation
+ - token-classification
+ - modernbert
+ - nlp
+ - rag
+ pipeline_tag: token-classification
+ datasets:
+ - wikimedia/wikipedia
+ ---
 
  # ModernBERT Chunker Base 🚀
 
 
  The easiest way to use this model is through the official library:
 
  ```python
+ from fine_chunker import Chunker
 
  # Load the model (runs optimally on CUDA or CPU)
+ chunker = Chunker.from_pretrained(device="cpu", use_onnx=True)
 
  text = "Your long multi-topic document..."
  chunks = chunker.chunk(text)
 
 
  ### Dataset
  The model was trained on **Wikipedia (20231101 version)** for both Polish and English.
+ - **Preprocessing**: Full articles were cleaned of wiki-noise (references, external links, metadata). Additionally, the first letter of 40% of chunks was lowercased, and the final period of 40% of chunks was removed.
  - **Ground Truth**: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
  - **Packing**: Multiple articles were packed into single `8192` token sequences to maximize training efficiency.
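
The preprocessing step added in this commit can be sketched roughly as follows. This is an illustrative guess, not the author's training code: the function name `augment_chunks`, its parameters, and the fixed seed are all hypothetical, and it assumes that "chunk starts" means the first character of each paragraph and that the "last chunk dot" is a trailing period. The presumable goal is to keep the model from relying only on capitalization and sentence-final punctuation as boundary cues.

```python
import random


def augment_chunks(chunks, p_lower=0.4, p_strip_dot=0.4, seed=0):
    """Hypothetical sketch: weaken surface boundary cues in training chunks.

    With probability p_lower the first character of a chunk is lowercased;
    with probability p_strip_dot a trailing period is removed.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    augmented = []
    for chunk in chunks:
        if chunk and rng.random() < p_lower:
            chunk = chunk[0].lower() + chunk[1:]
        if chunk.endswith(".") and rng.random() < p_strip_dot:
            chunk = chunk[:-1]
        augmented.append(chunk)
    return augmented


# Ground-truth chunks come from paragraph boundaries ("\n\n"), as described above.
article = "First paragraph about topic A.\n\nSecond paragraph about topic B."
paragraphs = [p for p in article.split("\n\n") if p.strip()]
augmented = augment_chunks(paragraphs)
```

Because the two perturbations are drawn independently, a given chunk may receive neither, one, or both of them.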