---
license: mit
language:
- en
- zh
pipeline_tag: token-classification
---
# bert-chunker-Chinese-2

[GitHub](https://github.com/jackfsuia/bert-chunker/tree/main/bcc2)


bert-chunker-Chinese-2 (Chinese text chunker) is a text chunker built on BertForTokenClassification. It predicts the start token of each chunk (for use in RAG and similar pipelines) and uses a sliding window to cut documents of any size into chunks. We see it as an alternative to the [semantic chunker](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb); notably, it works not only on structured text but also on **unstructured and messy text**. It is a new version of [bc-chinese](https://huggingface.co/tim1900/bert-chunker-chinese), with a revised data-labeling and training pipeline that makes it more stable and useful.
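
To make the sliding-window idea concrete, here is a minimal toy sketch that does not use the model. The function names `sliding_window_boundaries` and `cut`, and the boolean list `is_start`, are hypothetical stand-ins for illustration: in the real scripts the per-token decisions come from the classifier and depend on the window's context.

```python
def sliding_window_boundaries(is_start, window):
    """Collect chunk-start indices by sliding a fixed-size window over the tokens.

    `is_start[i]` is a stand-in for the model's per-token decision.
    """
    n = len(is_start)
    boundaries = []
    start = 0
    while start < n:
        end = min(start + window, n)
        # Mirrors the `sp > 0` filter in the Usage script: the first position
        # of a window is never reported as a new boundary.
        hits = [i for i in range(start + 1, end) if is_start[i]]
        if hits:
            boundaries.extend(hits)
            start = hits[-1]  # resume from the last boundary found
        else:
            start = end  # nothing found: move on to the next window
    return boundaries


def cut(tokens, boundaries):
    """Slice a token list at the given chunk-start indices."""
    edges = [0] + boundaries + [len(tokens)]
    return [tokens[a:b] for a, b in zip(edges[:-1], edges[1:])]


tokens = list("abcdefghijklmnopqrst")  # 20 toy "tokens"
is_start = [i in (3, 7, 15) for i in range(20)]
boundaries = sliding_window_boundaries(is_start, window=5)
chunks = ["".join(c) for c in cut(tokens, boundaries)]
print(boundaries)  # [3, 7, 15]
print(chunks)      # ['abc', 'defg', 'hijklmno', 'pqrst']
```

Restarting each window at the last boundary found means every chunk is judged with its own beginning in view, which is how a fixed 512-token model can handle documents of any length.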

Updates:
- 2025.5.12: An experimental script that **supports specifying the maximum number of tokens per chunk** is now available [below](#experimental).
## Usage
Run the following:

```python
# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, BertForTokenClassification
import math

model_path = "tim1900/bert-chunker-Chinese-2"

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    padding_side="right",
    model_max_length=512,
    trust_remote_code=True,
)

device = "cpu"  # or 'cuda'

model = BertForTokenClassification.from_pretrained(
    model_path,
).to(device)

@torch.no_grad()  # inference only, so skip gradient tracking
def chunk_text(model, text, tokenizer, prob_threshold=0.5):
    # Sliding-window chunking: the model sees at most MAX_TOKENS tokens at a time.
    MAX_TOKENS = 512
    # Tokenize the whole document without truncation; windows are cut manually below.
    tokens = tokenizer(text, return_tensors="pt", truncation=False)
    input_ids = tokens["input_ids"]
    # Keep [CLS] and [SEP] aside so they can be re-attached to every window.
    CLS = input_ids[:, 0].unsqueeze(0)
    SEP = input_ids[:, -1].unsqueeze(0)
    input_ids = input_ids[:, 1:-1]
    model.eval()
    split_str_poses = []
    token_pos = []
    windows_start = 0
    windows_end = 0
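    # A token starts a chunk when P(start) > prob_threshold, which is equivalent to
    # logit_1 - logit_0 > -log(1 / prob_threshold - 1).
    logits_threshold = math.log(1 / prob_threshold - 1)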
    print(f"Processing {input_ids.shape[1]} tokens...")
    while windows_end <= input_ids.shape[1]:
        windows_end = windows_start + MAX_TOKENS - 2

        ids = torch.cat((CLS, input_ids[:, windows_start:windows_end], SEP), 1)

        ids = ids.to(model.device)

        output = model(
            input_ids=ids,
            attention_mask=torch.ones(1, ids.shape[1], device=model.device),
        )
        logits = output["logits"][:, 1:-1, :]
        chunk_decision = logits[:, :, 1] > (logits[:, :, 0] - logits_threshold)
        greater_rows_indices = torch.where(chunk_decision)[1].tolist()

        # Proceed only if a boundary was found beyond the first token of the window.
        if len(greater_rows_indices) > 0 and (
            not (greater_rows_indices[0] == 0 and len(greater_rows_indices) == 1)
        ):

            split_str_pos = [
                tokens.token_to_chars(sp + windows_start + 1).start
                for sp in greater_rows_indices
                if sp > 0
            ]
            token_pos += [
                sp + windows_start + 1 for sp in greater_rows_indices if sp > 0
            ]
            split_str_poses += split_str_pos

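            # Restart the window at the last predicted boundary.
            windows_start = greater_rows_indices[-1] + windows_start

        else:
            # No boundary in this window: move on to the next window.
            windows_start = windows_end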

    substrings = [
        text[i:j] for i, j in zip([0] + split_str_poses, split_str_poses + [len(text)])
    ]
    token_pos = [0] + token_pos
    return substrings, token_pos


# chunking
print("\n>>>>>>>>> Chunking...")
doc = r'''经典中式家常菜：红烧肉 详尽版食谱

红烧肉，作为中华美食殿堂中一颗璀璨的明珠，以其色泽红亮、肥而不腻、入口即化的独特风味，征服了无数食客的味蕾。它不仅仅是一道菜，更是一种情怀，是记忆中妈妈厨房里飘出的诱人香气，是团圆饭桌上最温暖的慰藉。要烹制出一盘完美的红烧肉，需要耐心、技巧以及对细节的专注。以下将为您详尽分解从选材到成品的每一个步骤，力求让您在家也能复刻出餐厅级别的美味。本食谱旨在深入剖析，故篇幅较长，以确保每个环节都清晰透彻。

一、 精选主料：奠定美味的基石

•   猪肉选择。

    ◦   首选带皮的五花肉。

    ◦   层次分明为佳。

    ◦   肥瘦相间是关键。

    ◦   厚度约三指左右。

    ◦   重量约一斤半为宜。

    ◦   确保猪皮完整无缺。

    ◦   新鲜肉品呈鲜红色。

    ◦   脂肪部分应洁白细腻。

•   预处理工作。

    ◦   用镊子拔除残留猪毛。

    ◦   将肉块置于冷水中浸泡。

    ◦   浸泡时间约三十分钟。

    ◦   中途可换水一至两次。

    ◦   目的是泡出部分血水。

    ◦   然后用刀刮洗猪皮表面。

    ◦   彻底清除污物和杂质。

    ◦   最后用流动水冲洗干净。

二、 准备辅料：调配灵魂之味

•   主要调味品。

    ◦   老抽：负责上色。

    ◦   用量需谨慎控制。

    ◦   过多会导致发黑发苦。

    ◦   生抽：提供咸鲜底味。

    ◦   与老抽比例约二比一。

    ◦   优质花雕酒是精髓。

    ◦   去腥增香效果显著。

    ◦   冰糖：首选黄冰糖。

    ◦   炒出的糖色更红亮。

    ◦   风味也比白糖醇和。

•   香辛料组合。

    ◦   生姜一大块，切片。

    ◦   大葱一根，切成长段。

    ◦   蒜头数瓣，轻轻拍松。

    ◦   八角两至三颗，增香。

    ◦   桂皮一小段，勿过多。

    ◦   香叶两片，增添风味。

    ◦   草果一颗，可拍裂开。

    ◦   干辣椒依个人口味添加。

•   其他基础材料。

    ◦   食用油适量，需耐高温。

    ◦   最好使用菜籽油或花生油。

    ◦   食盐少许，用于最后调味。

    ◦   因为酱油已有咸度。

    ◦   准备足量的开水备用。

    ◦   切记不可使用冷水。

    ◦   冷水会使肉质收缩变柴。

三、 精细加工：关键步骤解析

•   肉块改刀。

    ◦   将洗净的五花肉捞出。

    ◦   用厨房纸彻底吸干水分。

    ◦   这一步非常重要。

    ◦   能有效防止后续溅油。

    ◦   将肉切成三厘米见方块。

    ◦   大小尽量保持均匀一致。

    ◦   以确保受热和入味均匀。

    ◦   切面应能看到完美层次。

•   焯水去腥。

    ◦   冷水下锅，放入切好的肉。

    ◦   同时加入几片生姜。

    ◦   倒入一汤匙花雕酒。

    ◦   开大火煮沸，撇净浮沫。

    ◦   浮沫是血水和杂质所致。

    ◦   务必撇除干净直至汤清。

    ◦   焯水时间约五到八分钟。

    ◦   煮至肉块变色定型即可。

•   捞出与冲洗。

    ◦   用漏勺将肉块捞出。

    ◦   立即放入温水中冲洗。

    ◦   洗去表面残留的浮沫。

    ◦   注意水温不宜过低。

    ◦   再次用厨房纸吸干水分。

    ◦   防止入锅时油花四溅。

    ◦   此时肉块呈灰白色。

    ◦   经过焯水已无肉腥味。

四、 核心工艺：炒糖色与煸炒

•   炒制糖色。

    ◦   锅烧热，倒入少量底油。

    ◦   放入准备好的冰糖。

    ◦   开中小火慢慢搅动。

    ◦   观察冰糖融化的过程。

    ◦   先从固体变为液态。

    ◦   再从小泡转为密集大泡。

    ◦   当大泡逐渐回落消失时。

    ◦   糖液颜色开始加深。

•   观察颜色变化。

    ◦   从浅黄色变为枣红色。

    ◦   这个瞬间非常关键。

    ◦   枣红色时立即下入肉块。

    ◦   过早则甜腻，过晚则发苦。

    ◦   动作务必迅速而准确。

    ◦   糖色是红亮色泽的来源。

    ◦   也是风味层次的基础。

•   煸炒肉块。

    ◦   快速颠锅，使每块肉均匀裹上糖色。

    ◦   持续翻炒约三到五分钟。

    ◦   直到肉块表面微微焦黄。

    ◦   部分油脂被煸炒出来。

    ◦   这样吃起来肥而不腻。

    ◦   同时香味物质充分释放。

    ◦   煸出的猪油可倒出部分。

    ◦   留作炒青菜风味极佳。

五、 炖煮入味：时间与火候的艺术

•   加入调料。

    ◦   沿着锅边烹入花雕酒。

    ◦   瞬间激发出浓郁酒香。

    ◦   接着倒入适量生抽。

    ◦   再加入少许老抽上色。

    ◦   放入所有香料：姜、葱、蒜等。

    ◦   与肉块一起翻炒均匀。

    ◦   让酱香与肉香充分融合。

    ◦   翻炒约两分钟至香气扑鼻。

•   注入开水。

    ◦   务必一次性加足开水。

    ◦   水量要完全没过肉块。

    ◦   甚至可以略多一些。

    ◦   避免中途再次加水。

    ◦   大火烧开后转小火。

    ◦   盖上锅盖，慢火焖炖。

    ◦   这是“入口即化”的关键。

    ◦   时间至少需要一小时。

•   慢炖过程。

    ◦   保持汤面微沸即可。

    ◦   火候切忌过大过急。

    ◦   否则容易烧干且肉不烂。

    ◦   期间可偶尔开盖查看。

    ◦   用勺子轻轻推动一下。

    ◦   防止粘锅底的情况发生。

    ◦   但尽量不要频繁翻动。

    ◦   以免影响肉块的完整。

六、 收汁与装盘：成就最终美味

•   大火收汁。

    ◦   炖煮一小时后。

    ◦   用筷子戳一下瘦肉部分。

    ◦   若能轻松戳透即表示已软烂。

    ◦   此时根据汤汁咸度加盐。

    ◦   开大火，将汤汁收浓。

    ◦   用锅铲不停搅动。

    ◦   防止糊底，并让汤汁变稠。

    ◦   均匀包裹在每一块肉上。

•   收汁技巧。

    ◦   收到汤汁浓稠如蜜。

    ◦   油亮红润的汤汁紧裹肉块。

    ◦   锅中泛起密集的大泡。

    ◦   即可准备关火出锅。

    ◦   收汁程度依个人喜好。

    ◦   喜欢拌饭可多留些汤汁。

    ◦   整个过程需密切留意。

    ◦   最后阶段变化非常迅速。

•   最终成品。

    ◦   将红烧肉盛入预热好的盘中。

    ◦   可烫几棵小油菜围边。

    ◦   既点缀色彩，又解油腻。

    ◦   撒上少许葱花或香菜末。

    ◦   一道色香味俱全的红烧肉完成。

    ◦   肉质软糯，肥而不腻。

    ◦   咸中带甜，回味无穷。

    ◦   配上一碗白米饭是绝配。

七、 要点总结与升华

•   成功关键。

    ◦   选材是基础，务必新鲜。

    ◦   焯水步骤不可省略。

    ◦   炒糖色是技术核心。

    ◦   火候控制是成败关键。

    ◦   耐心慢炖是美味保证。

•   变化与创新。

    ◦   可加入土豆、鹌鹑蛋同烧。

    ◦   吸收肉汤，滋味更丰富。

    ◦   也可尝试用啤酒代替水。

    ◦   别有一番风味层次。

    ◦   但万变不离其宗。

    ◦   核心技法仍需掌握。

烹饪是一门需要实践的艺术，红烧肉更是如此。希望这份详尽的食谱能成为您厨房路上的得力助手，愿您能享受从准备到品尝的整个过程，与家人朋友分享这份由时间与匠心凝聚而成的温暖美味。每一次尝试都是一次经验的积累，祝您烹饪愉快，早日成就属于自己的招牌红烧肉！
'''
# Chunk the text. prob_threshold must lie strictly between 0 and 1: the lower it is, the more chunks are generated.
# Adjust it to your needs. With a very small value such as 0.000001, nearly every token becomes its own chunk;
# as it approaches 1, the whole text tends to stay in one chunk (exactly 1 is not allowed, since math.log(0) is undefined).
chunks, token_pos = chunk_text(model, doc, tokenizer, prob_threshold=0.5)

# print chunks
for i, (c, t) in enumerate(zip(chunks, token_pos)):
    print(f"-----chunk: {i}----token_idx: {t}--------")
    print(c)
```
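
The `prob_threshold` comparison inside `chunk_text` is done on logits rather than on probabilities. The sketch below, using only the standard library, checks that the two views agree for a two-class softmax. The helper names `start_probability` and `is_chunk_start` and the sample logit pairs are illustrative assumptions, not part of the released script.

```python
import math


def start_probability(logit_0, logit_1):
    """Softmax probability of class 1 ("chunk start") for a two-class logit pair."""
    m = max(logit_0, logit_1)  # subtract the max for numerical stability
    e0, e1 = math.exp(logit_0 - m), math.exp(logit_1 - m)
    return e1 / (e0 + e1)


def is_chunk_start(logit_0, logit_1, prob_threshold=0.5):
    """Same comparison as in chunk_text, applied to a single token's logits."""
    logits_threshold = math.log(1 / prob_threshold - 1)
    return logit_1 > logit_0 - logits_threshold


# The logit-space test and the probability-space test give the same answer.
for l0, l1 in [(2.0, -1.0), (0.3, 0.4), (-1.5, 3.0)]:
    for p in (0.1, 0.5, 0.9):
        assert is_chunk_start(l0, l1, p) == (start_probability(l0, l1) > p)
```

Because the threshold is computed as `math.log(1 / prob_threshold - 1)`, a value of exactly 1 raises a `ValueError` and a value of exactly 0 raises a `ZeroDivisionError`, so `prob_threshold` should stay strictly inside (0, 1).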
## Experimental
The following script supports specifying the maximum number of tokens per chunk. When a chunk is about to exceed `max_tokens_per_chunk` and no token satisfies `prob_threshold`, the chunker is forced to split at the best candidate position seen so far. This script can be seen as a new, experimental version of the script above.
```python
# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, BertForTokenClassification
import math

model_path = "tim1900/bert-chunker-Chinese-2"

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    padding_side="right",
    model_max_length=512,
    trust_remote_code=True,
)

device = "cpu"  # or 'cuda'

model = BertForTokenClassification.from_pretrained(
    model_path,
).to(device)

def chunk_text_with_max_chunk_size(model, text, tokenizer, prob_threshold=0.5, max_tokens_per_chunk=400):
    with torch.no_grad():
        
        # Sliding-window chunking: the model sees at most MAX_TOKENS tokens at a time.
        MAX_TOKENS = 512
        # Tokenize the whole document without truncation; windows are cut manually below.
        tokens = tokenizer(text, return_tensors="pt", truncation=False)
        input_ids = tokens["input_ids"]
        CLS = input_ids[:, 0].unsqueeze(0)
        SEP = input_ids[:, -1].unsqueeze(0)
        input_ids = input_ids[:, 1:-1]
        model.eval()
        split_str_poses = []
        token_pos = []
        windows_start = 0
        windows_end = 0
        logits_threshold = math.log(1 / prob_threshold - 1)
        
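        unchunk_tokens = 0  # tokens accumulated since the last boundary
        backup_pos = None  # best fallback boundary seen since the last boundary
        best_logits = torch.finfo(torch.float32).min  # score of that fallback boundary
        # Stride used when a window yields no boundary (446 tokens for MAX_TOKENS = 512).
        STEP = round(((MAX_TOKENS - 2) // 2) * 1.75)
        print(f"Processing {input_ids.shape[1]} tokens...")
        # Iterate until the window start passes the last token.
        while windows_start < input_ids.shape[1]: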
            windows_end = windows_start + MAX_TOKENS - 2 
            ids = torch.cat((CLS, input_ids[:, windows_start:windows_end], SEP), 1)
            ids = ids.to(model.device)
            output = model(
                input_ids=ids,
                attention_mask=torch.ones(1, ids.shape[1], device=model.device),
            )
            logits = output["logits"][:, 1:-1, :]
            
            
            # Per-token score: how strongly the model prefers "chunk start" over "not a start".
            logit_diff = logits[:, :, 1] - logits[:, :, 0]
            chunk_decision = logit_diff > -logits_threshold
            greater_rows_indices = torch.where(chunk_decision)[1].tolist()

            # Proceed only if a boundary was found beyond the first token of the window.
            if len(greater_rows_indices) > 0 and (
                not (greater_rows_indices[0] == 0 and len(greater_rows_indices) == 1)
            ):

                
                # Tokens before the first boundary in this window (index 0 is excluded).
                unchunk_tokens_this_window = (
                    greater_rows_indices[0]
                    if greater_rows_indices[0] != 0
                    else greater_rows_indices[1]
                )

                # manually chunk
                if unchunk_tokens + unchunk_tokens_this_window > max_tokens_per_chunk:
                    big_windows_end = max_tokens_per_chunk - unchunk_tokens
                    # .item() keeps positions as plain ints instead of 0-d tensors.
                    candidates = logit_diff[:, 1:big_windows_end]
                    max_value, max_index = candidates.max(), candidates.argmax().item() + 1
                    if best_logits < max_value:
                        backup_pos = windows_start + max_index
                    
                    windows_start = backup_pos
                    
                    
                    split_str_pos = [tokens.token_to_chars(backup_pos + 1).start]
                    split_str_poses = split_str_poses + split_str_pos
                    token_pos = token_pos + [backup_pos]
                    best_logits = torch.finfo(torch.float32).min
                    backup_pos = -1
                    unchunk_tokens = 0
                    
                # auto chunk    
                else:
                    
                    if len(greater_rows_indices) >= 2:
                        for gi, (gri0,gri1) in enumerate(zip(greater_rows_indices[:-1],greater_rows_indices[1:])):
                            
                            if gri1 - gri0 > max_tokens_per_chunk:
                                greater_rows_indices=greater_rows_indices[:gi+1]
                                break
                                
                    split_str_pos = [tokens.token_to_chars(sp + windows_start + 1).start for sp in greater_rows_indices if sp > 0]
                    split_str_poses = split_str_poses + split_str_pos
                    token_pos = token_pos+ [sp + windows_start for sp in greater_rows_indices if sp > 0]
                    
                    windows_start = greater_rows_indices[-1] + windows_start
                    best_logits = torch.finfo(torch.float32).min
                    backup_pos = -1
                    unchunk_tokens = 0

            else:

                # unchunk_tokens_this_window = min(windows_end - windows_start,STEP)
                unchunk_tokens_this_window = min(windows_start+STEP,input_ids.shape[1]) - windows_start

                # manually chunk
                if unchunk_tokens + unchunk_tokens_this_window > max_tokens_per_chunk:
                    big_windows_end =  max_tokens_per_chunk - unchunk_tokens
                    if logit_diff.shape[1] > 1:
                        
                        # .item() keeps positions as plain ints instead of 0-d tensors.
                        candidates = logit_diff[:, 1:big_windows_end]
                        max_value, max_index = candidates.max(), candidates.argmax().item() + 1
                        if best_logits < max_value:
                            backup_pos = windows_start + max_index
                        
                        
                    windows_start = backup_pos
                    split_str_pos = [tokens.token_to_chars(backup_pos + 1).start]
                    split_str_poses = split_str_poses + split_str_pos
                    token_pos = token_pos + [backup_pos]
                    best_logits = torch.finfo(torch.float32).min
                    backup_pos = -1
                    unchunk_tokens = 0
                else:
                # auto leave
                    if logit_diff.shape[1] > 1:
                        # .item() keeps positions as plain ints instead of 0-d tensors.
                        max_value, max_index = logit_diff[:, 1:].max(), logit_diff[:, 1:].argmax().item() + 1
                        if best_logits < max_value:
                            best_logits = max_value
                            backup_pos = windows_start + max_index
                       
                    unchunk_tokens = unchunk_tokens + STEP
                    windows_start = windows_start + STEP

        substrings = [
            text[i:j] for i, j in zip([0] + split_str_poses, split_str_poses + [len(text)])
        ]
        token_pos = [0] + token_pos
    return substrings, token_pos
# chunking
print("\n>>>>>>>>> Chunking...")
doc = r'''经典中式家常菜：红烧肉 详尽版食谱

红烧肉，作为中华美食殿堂中一颗璀璨的明珠，以其色泽红亮、肥而不腻、入口即化的独特风味，征服了无数食客的味蕾。它不仅仅是一道菜，更是一种情怀，是记忆中妈妈厨房里飘出的诱人香气，是团圆饭桌上最温暖的慰藉。要烹制出一盘完美的红烧肉，需要耐心、技巧以及对细节的专注。以下将为您详尽分解从选材到成品的每一个步骤，力求让您在家也能复刻出餐厅级别的美味。本食谱旨在深入剖析，故篇幅较长，以确保每个环节都清晰透彻。

一、 精选主料：奠定美味的基石

•   猪肉选择。

    ◦   首选带皮的五花肉。

    ◦   层次分明为佳。

    ◦   肥瘦相间是关键。

    ◦   厚度约三指左右。

    ◦   重量约一斤半为宜。

    ◦   确保猪皮完整无缺。

    ◦   新鲜肉品呈鲜红色。

    ◦   脂肪部分应洁白细腻。

•   预处理工作。

    ◦   用镊子拔除残留猪毛。

    ◦   将肉块置于冷水中浸泡。

    ◦   浸泡时间约三十分钟。

    ◦   中途可换水一至两次。

    ◦   目的是泡出部分血水。

    ◦   然后用刀刮洗猪皮表面。

    ◦   彻底清除污物和杂质。

    ◦   最后用流动水冲洗干净。

二、 准备辅料：调配灵魂之味

•   主要调味品。

    ◦   老抽：负责上色。

    ◦   用量需谨慎控制。

    ◦   过多会导致发黑发苦。

    ◦   生抽：提供咸鲜底味。

    ◦   与老抽比例约二比一。

    ◦   优质花雕酒是精髓。

    ◦   去腥增香效果显著。

    ◦   冰糖：首选黄冰糖。

    ◦   炒出的糖色更红亮。

    ◦   风味也比白糖醇和。

•   香辛料组合。

    ◦   生姜一大块，切片。

    ◦   大葱一根，切成长段。

    ◦   蒜头数瓣，轻轻拍松。

    ◦   八角两至三颗，增香。

    ◦   桂皮一小段，勿过多。

    ◦   香叶两片，增添风味。

    ◦   草果一颗，可拍裂开。

    ◦   干辣椒依个人口味添加。

•   其他基础材料。

    ◦   食用油适量，需耐高温。

    ◦   最好使用菜籽油或花生油。

    ◦   食盐少许，用于最后调味。

    ◦   因为酱油已有咸度。

    ◦   准备足量的开水备用。

    ◦   切记不可使用冷水。

    ◦   冷水会使肉质收缩变柴。

三、 精细加工：关键步骤解析

•   肉块改刀。

    ◦   将洗净的五花肉捞出。

    ◦   用厨房纸彻底吸干水分。

    ◦   这一步非常重要。

    ◦   能有效防止后续溅油。

    ◦   将肉切成三厘米见方块。

    ◦   大小尽量保持均匀一致。

    ◦   以确保受热和入味均匀。

    ◦   切面应能看到完美层次。

•   焯水去腥。

    ◦   冷水下锅，放入切好的肉。

    ◦   同时加入几片生姜。

    ◦   倒入一汤匙花雕酒。

    ◦   开大火煮沸，撇净浮沫。

    ◦   浮沫是血水和杂质所致。

    ◦   务必撇除干净直至汤清。

    ◦   焯水时间约五到八分钟。

    ◦   煮至肉块变色定型即可。

•   捞出与冲洗。

    ◦   用漏勺将肉块捞出。

    ◦   立即放入温水中冲洗。

    ◦   洗去表面残留的浮沫。

    ◦   注意水温不宜过低。

    ◦   再次用厨房纸吸干水分。

    ◦   防止入锅时油花四溅。

    ◦   此时肉块呈灰白色。

    ◦   经过焯水已无肉腥味。

四、 核心工艺：炒糖色与煸炒

•   炒制糖色。

    ◦   锅烧热，倒入少量底油。

    ◦   放入准备好的冰糖。

    ◦   开中小火慢慢搅动。

    ◦   观察冰糖融化的过程。

    ◦   先从固体变为液态。

    ◦   再从小泡转为密集大泡。

    ◦   当大泡逐渐回落消失时。

    ◦   糖液颜色开始加深。

•   观察颜色变化。

    ◦   从浅黄色变为枣红色。

    ◦   这个瞬间非常关键。

    ◦   枣红色时立即下入肉块。

    ◦   过早则甜腻，过晚则发苦。

    ◦   动作务必迅速而准确。

    ◦   糖色是红亮色泽的来源。

    ◦   也是风味层次的基础。

•   煸炒肉块。

    ◦   快速颠锅，使每块肉均匀裹上糖色。

    ◦   持续翻炒约三到五分钟。

    ◦   直到肉块表面微微焦黄。

    ◦   部分油脂被煸炒出来。

    ◦   这样吃起来肥而不腻。

    ◦   同时香味物质充分释放。

    ◦   煸出的猪油可倒出部分。

    ◦   留作炒青菜风味极佳。

五、 炖煮入味：时间与火候的艺术

•   加入调料。

    ◦   沿着锅边烹入花雕酒。

    ◦   瞬间激发出浓郁酒香。

    ◦   接着倒入适量生抽。

    ◦   再加入少许老抽上色。

    ◦   放入所有香料：姜、葱、蒜等。

    ◦   与肉块一起翻炒均匀。

    ◦   让酱香与肉香充分融合。

    ◦   翻炒约两分钟至香气扑鼻。

•   注入开水。

    ◦   务必一次性加足开水。

    ◦   水量要完全没过肉块。

    ◦   甚至可以略多一些。

    ◦   避免中途再次加水。

    ◦   大火烧开后转小火。

    ◦   盖上锅盖，慢火焖炖。

    ◦   这是“入口即化”的关键。

    ◦   时间至少需要一小时。

•   慢炖过程。

    ◦   保持汤面微沸即可。

    ◦   火候切忌过大过急。

    ◦   否则容易烧干且肉不烂。

    ◦   期间可偶尔开盖查看。

    ◦   用勺子轻轻推动一下。

    ◦   防止粘锅底的情况发生。

    ◦   但尽量不要频繁翻动。

    ◦   以免影响肉块的完整。

六、 收汁与装盘：成就最终美味

•   大火收汁。

    ◦   炖煮一小时后。

    ◦   用筷子戳一下瘦肉部分。

    ◦   若能轻松戳透即表示已软烂。

    ◦   此时根据汤汁咸度加盐。

    ◦   开大火，将汤汁收浓。

    ◦   用锅铲不停搅动。

    ◦   防止糊底，并让汤汁变稠。

    ◦   均匀包裹在每一块肉上。

•   收汁技巧。

    ◦   收到汤汁浓稠如蜜。

    ◦   油亮红润的汤汁紧裹肉块。

    ◦   锅中泛起密集的大泡。

    ◦   即可准备关火出锅。

    ◦   收汁程度依个人喜好。

    ◦   喜欢拌饭可多留些汤汁。

    ◦   整个过程需密切留意。

    ◦   最后阶段变化非常迅速。

•   最终成品。

    ◦   将红烧肉盛入预热好的盘中。

    ◦   可烫几棵小油菜围边。

    ◦   既点缀色彩，又解油腻。

    ◦   撒上少许葱花或香菜末。

    ◦   一道色香味俱全的红烧肉完成。

    ◦   肉质软糯，肥而不腻。

    ◦   咸中带甜，回味无穷。

    ◦   配上一碗白米饭是绝配。

七、 要点总结与升华

•   成功关键。

    ◦   选材是基础，务必新鲜。

    ◦   焯水步骤不可省略。

    ◦   炒糖色是技术核心。

    ◦   火候控制是成败关键。

    ◦   耐心慢炖是美味保证。

•   变化与创新。

    ◦   可加入土豆、鹌鹑蛋同烧。

    ◦   吸收肉汤，滋味更丰富。

    ◦   也可尝试用啤酒代替水。

    ◦   别有一番风味层次。

    ◦   但万变不离其宗。

    ◦   核心技法仍需掌握。

烹饪是一门需要实践的艺术，红烧肉更是如此。希望这份详尽的食谱能成为您厨房路上的得力助手，愿您能享受从准备到品尝的整个过程，与家人朋友分享这份由时间与匠心凝聚而成的温暖美味。每一次尝试都是一次经验的积累，祝您烹饪愉快，早日成就属于自己的招牌红烧肉！。
'''
# Chunk the text. prob_threshold must lie strictly between 0 and 1: the lower it is, the more chunks are generated.
# Adjust it to your needs. With a very small value such as 0.000001, nearly every token becomes its own chunk.
# As it approaches 1, few or no tokens satisfy prob_threshold, so the chunker is forced to pick the best possible
# position whenever a chunk is about to exceed max_tokens_per_chunk.
chunks, token_pos = chunk_text_with_max_chunk_size(model, doc, tokenizer, prob_threshold=0.5, max_tokens_per_chunk=100)

# print chunks
for i, (c, t) in enumerate(zip(chunks, token_pos)):
    print(f"-----chunk: {i}----token_idx: {t}--------")
    print(c)
```
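
The forced-split rule can be illustrated on a plain list of per-token scores. This is a simplified sketch under stated assumptions, not the script's exact logic: the function name `split_with_max_len` and the toy scores are hypothetical, and the real script works window by window on logit differences while carrying a backup position across windows.

```python
def split_with_max_len(scores, threshold, max_tokens_per_chunk):
    """Return chunk-start indices for a list of per-token scores.

    A token whose score exceeds `threshold` starts a new chunk. If the current
    chunk has already reached `max_tokens_per_chunk` tokens, the highest-scoring
    position seen since the last boundary is used as a forced boundary.
    """
    starts = [0]
    i = 1
    while i < len(scores):
        chunk_start = starts[-1]
        if scores[i] > threshold:
            starts.append(i)  # natural boundary
        elif i - chunk_start >= max_tokens_per_chunk:
            # Forced boundary: best candidate among positions chunk_start+1 .. i.
            best = max(range(chunk_start + 1, i + 1), key=lambda k: scores[k])
            starts.append(best)
            i = best  # rescan from just after the forced boundary
        i += 1
    return starts


scores = [0.0, 0.1, 0.9, 0.2, 0.3, 0.25, 0.1, 0.8, 0.1]
print(split_with_max_len(scores, threshold=0.5, max_tokens_per_chunk=100))  # [0, 2, 7]
print(split_with_max_len(scores, threshold=0.5, max_tokens_per_chunk=3))    # [0, 2, 4, 7]
```

With the limit set to 3, the stretch between positions 2 and 7 is too long, so the sketch splits at position 4, the highest-scoring candidate in that stretch, even though its score is below the threshold.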
## Citation
```bibtex
@article{bert-chunker,
  title={bert-chunker: Efficient and Trained Chunking for Unstructured Documents}, 
  author={Yannan Luo},
  year={2024},
  url={https://github.com/jackfsuia/bert-chunker}
}
```
The base model is [bge-small-zh-v1.5](https://huggingface.co/BAAI/bge-small-zh-v1.5).