Update README.md
README.md CHANGED
@@ -205,6 +205,25 @@ for i, (c, t) in enumerate(zip(chunks, token_pos)):
## Experimental

The following script supports specifying a maximum number of tokens per chunk. When a chunk is about to exceed `max_tokens_per_chunk` and no token satisfies the `prob_threshold`, the chunker is forced to split at the best candidate position seen so far. This script can be seen as a new, experimental version of the scripts above.

```python
import torch
from transformers import AutoTokenizer, BertForTokenClassification
import math

model_path = "tim1900/bert-chunker-3"

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    padding_side="right",
    model_max_length=255,
    trust_remote_code=True,
)

device = "cpu"  # or "cuda"

model = BertForTokenClassification.from_pretrained(
    model_path,
).to(device)

def chunk_text_with_max_chunk_size(model, text, tokenizer, prob_threshold=0.5, max_tokens_per_chunk=400):
    with torch.no_grad():
        ...  # (function body truncated in this diff view)
```
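
For orientation, here is a minimal sketch of how such a forced fallback can work, assuming a token-classification head where the probability of label 1 marks a split point. Everything about it is illustrative: `sketch_chunk_with_fallback` is a hypothetical name, it skips the sliding-window handling a real implementation needs for inputs longer than the model's context, and it returns only the chunk strings rather than the `(chunks, token_pos)` pair the actual function produces.

```python
# A minimal sketch of threshold chunking with a max-size fallback -- an
# illustration, not the actual bert-chunker-3 implementation. Assumptions:
# label 1 of the token-classification head means "split here", and the text
# fits into a single forward pass (no sliding window over long inputs).
def sketch_chunk_with_fallback(model, text, tokenizer,
                               prob_threshold=0.5, max_tokens_per_chunk=400):
    with torch.no_grad():
        enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
        ids = torch.tensor([enc["input_ids"]], device=model.device)
        probs = torch.softmax(model(ids).logits[0], dim=-1)[:, 1]  # P(split) per token

    offsets = enc["offset_mapping"]  # (char_start, char_end) for each token
    chunks, start = [], 0
    best_pos, best_prob = None, -1.0
    for i, p in enumerate(probs.tolist()):
        if p > best_prob:                        # remember best fallback position
            best_pos, best_prob = i, p
        if p > prob_threshold:                   # confident split at this token
            split = i
        elif i - start + 1 >= max_tokens_per_chunk:
            split = best_pos                     # forced: reuse best position seen
        else:
            continue
        chunks.append(text[offsets[start][0]:offsets[split][1]])
        start, best_pos, best_prob = split + 1, None, -1.0
    if start < len(offsets):                     # trailing remainder
        chunks.append(text[offsets[start][0]:])
    return chunks
```

The key idea is simply to keep track of the highest-probability position inside the current chunk, so there is always a sensible place to split when the size cap forces a cut.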

@@ -379,8 +398,7 @@ Published on: 6 August 2024"

```python
"""
# Chunk the text. The prob_threshold should be between (0, 1): the lower it is, the more chunks are generated.
# Adjust it to your needs: when prob_threshold is very small, like 0.000001, every token becomes its own chunk;
# when it is set to 1, the whole text becomes one chunk. The chunker is forced to split at the best candidate
# position seen so far whenever a chunk is about to exceed max_tokens_per_chunk and no token satisfies prob_threshold.
chunks, token_pos = chunk_text_with_max_chunk_size(model, ad, tokenizer, prob_threshold=0.5, max_tokens_per_chunk=400)
# print chunks
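# A simple way to print them; the loop header matches the context line shown
# in this diff, but the exact output format is an assumption.
for i, (c, t) in enumerate(zip(chunks, token_pos)):
    print(f"-----chunk: {i}----token_index: {t}--------")
    print(c)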
```