	---
	license: mit
	language:
	- en
	- zh
	pipeline_tag: token-classification
	---
	# bert-chunker-Chinese-2

	[GitHub](https://github.com/jackfsuia/bert-chunker/tree/main/bcc2)


	bert-chunker-Chinese-2 (Chinese text chunker) is a text chunker built on BertForTokenClassification that predicts the start token of each chunk (for use in RAG and similar pipelines). Using a sliding window, it cuts documents of any size into chunks. We see it as an alternative to the [semantic chunker](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb); notably, it works not only on structured text but also on unstructured and messy text. It is a new version of [bc-chinese](https://huggingface.co/tim1900/bert-chunker-chinese), for which we changed the data labeling and training pipeline to make it more stable and useful.
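
To make the idea of "predicting the start token of chunks" concrete, here is a minimal sketch that does not use the model at all. It assumes we already have one boolean flag per token saying whether that token opens a new chunk; the function name `group_by_start_flags` and the toy tokens are hypothetical and only illustrate how such flags turn into chunks.

```python
from typing import List


def group_by_start_flags(tokens: List[str], is_start: List[bool]) -> List[str]:
    """Join tokens into chunks, opening a new chunk at every flagged token."""
    if len(tokens) != len(is_start):
        raise ValueError("tokens and is_start must have the same length")
    chunks: List[List[str]] = []
    for token, start in zip(tokens, is_start):
        # The first token always opens a chunk, whether it is flagged or not.
        if start or not chunks:
            chunks.append([])
        chunks[-1].append(token)
    return ["".join(chunk) for chunk in chunks]


# Toy input: hand-written flags stand in for the model's per-token predictions.
tokens = ["红", "烧", "肉", "。", "选", "材", "。"]
flags = [True, False, False, False, True, False, False]
print(group_by_start_flags(tokens, flags))
```

In the real scripts the flags come from comparing the two output logits of each token, and the chunks are cut from the original string by character offset rather than by joining tokens.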

	Updates:
	- 2025.5.12: an experimental script that supports specifying the maximum number of tokens per chunk is now available [below](#experimental).
	## Usage
	Run the following:

	```python
	# -- coding: utf-8 --
	import torch
	from transformers import AutoTokenizer, BertForTokenClassification
	import math

	model_path = "tim1900/bert-chunker-Chinese-2"

	tokenizer = AutoTokenizer.from_pretrained(
	    model_path,
	    padding_side="right",
	    model_max_length=512,
	    trust_remote_code=True,
	)

	device = "cpu" # or 'cuda'

	model = BertForTokenClassification.from_pretrained(
	    model_path,
	).to(device)

	def chunk_text(model, text, tokenizer, prob_threshold=0.5):
	    # Sliding context-window chunking.
	    MAX_TOKENS = 512
	    tokens = tokenizer(text, return_tensors="pt", truncation=False)
	    input_ids = tokens["input_ids"]
	    # Keep [CLS] and [SEP] aside so they can be re-attached to every window.
	    CLS = input_ids[:, 0].unsqueeze(0)
	    SEP = input_ids[:, -1].unsqueeze(0)
	    input_ids = input_ids[:, 1:-1]
	    model.eval()
	    split_str_poses = []
	    token_pos = []
	    windows_start = 0
	    windows_end = 0
	    # Comparing the logits with this offset is equivalent to
	    # P(chunk start) > prob_threshold under a two-class softmax.
	    logits_threshold = math.log(1 / prob_threshold - 1)
	    print(f"Processing {input_ids.shape[1]} tokens...")
	    while windows_end <= input_ids.shape[1]:
	        windows_end = windows_start + MAX_TOKENS - 2

	        ids = torch.cat((CLS, input_ids[:, windows_start:windows_end], SEP), 1)
	        ids = ids.to(model.device)

	        # Inference only, so gradients are not needed.
	        with torch.no_grad():
	            output = model(
	                input_ids=ids,
	                attention_mask=torch.ones(1, ids.shape[1], device=model.device),
	            )
	        # Drop the [CLS] and [SEP] positions.
	        logits = output["logits"][:, 1:-1, :]
	        chunk_decision = logits[:, :, 1] > (logits[:, :, 0] - logits_threshold)
	        greater_rows_indices = torch.where(chunk_decision)[1].tolist()

	        # Did this window yield a split point other than its own first token?
	        if len(greater_rows_indices) > 0 and (
	            not (greater_rows_indices[0] == 0 and len(greater_rows_indices) == 1)
	        ):
	            split_str_pos = [
	                tokens.token_to_chars(sp + windows_start + 1).start
	                for sp in greater_rows_indices
	                if sp > 0
	            ]
	            token_pos += [
	                sp + windows_start + 1 for sp in greater_rows_indices if sp > 0
	            ]
	            split_str_poses += split_str_pos

	            # Restart the next window at the last split point found.
	            windows_start = greater_rows_indices[-1] + windows_start
	        else:
	            # No split point: move on to the next full window.
	            windows_start = windows_end

	    substrings = [
	        text[i:j] for i, j in zip([0] + split_str_poses, split_str_poses + [len(text)])
	    ]
	    token_pos = [0] + token_pos
	    return substrings, token_pos


	# chunking
	print("\n>>>>>>>>> Chunking...")
	doc = r'''经典中式家常菜：红烧肉详尽版食谱

	红烧肉，作为中华美食殿堂中一颗璀璨的明珠，以其色泽红亮、肥而不腻、入口即化的独特风味，征服了无数食客的味蕾。它不仅仅是一道菜，更是一种情怀，是记忆中妈妈厨房里飘出的诱人香气，是团圆饭桌上最温暖的慰藉。要烹制出一盘完美的红烧肉，需要耐心、技巧以及对细节的专注。以下将为您详尽分解从选材到成品的每一个步骤，力求让您在家也能复刻出餐厅级别的美味。本食谱旨在深入剖析，故篇幅较长，以确保每个环节都清晰透彻。

	一、精选主料：奠定美味的基石

	• 猪肉选择。

	◦ 首选带皮的五花肉。

	◦ 层次分明为佳。

	◦ 肥瘦相间是关键。

	◦ 厚度约三指左右。

	◦ 重量约一斤半为宜。

	◦ 确保猪皮完整无缺。

	◦ 新鲜肉品呈鲜红色。

	◦ 脂肪部分应洁白细腻。

	• 预处理工作。

	◦ 用镊子拔除残留猪毛。

	◦ 将肉块置于冷水中浸泡。

	◦ 浸泡时间约三十分钟。

	◦ 中途可换水一至两次。

	◦ 目的是泡出部分血水。

	◦ 然后用刀刮洗猪皮表面。

	◦ 彻底清除污物和杂质。

	◦ 最后用流动水冲洗干净。

	二、准备辅料：调配灵魂之味

	• 主要调味品。

	◦ 老抽：负责上色。

	◦ 用量需谨慎控制。

	◦ 过多会导致发黑发苦。

	◦ 生抽：提供咸鲜底味。

	◦ 与老抽比例约二比一。

	◦ 优质花雕酒是精髓。

	◦ 去腥增香效果显著。

	◦ 冰糖：首选黄冰糖。

	◦ 炒出的糖色更红亮。

	◦ 风味也比白糖醇和。

	• 香辛料组合。

	◦ 生姜一大块，切片。

	◦ 大葱一根，切成长段。

	◦ 蒜头数瓣，轻轻拍松。

	◦ 八角两至三颗，增香。

	◦ 桂皮一小段，勿过多。

	◦ 香叶两片，增添风味。

	◦ 草果一颗，可拍裂开。

	◦ 干辣椒依个人口味添加。

	• 其他基础材料。

	◦ 食用油适量，需耐高温。

	◦ 最好使用菜籽油或花生油。

	◦ 食盐少许，用于最后调味。

	◦ 因为酱油已有咸度。

	◦ 准备足量的开水备用。

	◦ 切记不可使用冷水。

	◦ 冷水会使肉质收缩变柴。

	三、精细加工：关键步骤解析

	• 肉块改刀。

	◦ 将洗净的五花肉捞出。

	◦ 用厨房纸彻底吸干水分。

	◦ 这一步非常重要。

	◦ 能有效防止后续溅油。

	◦ 将肉切成三厘米见方块。

	◦ 大小尽量保持均匀一致。

	◦ 以确保受热和入味均匀。

	◦ 切面应能看到完美层次。

	• 焯水去腥。

	◦ 冷水下锅，放入切好的肉。

	◦ 同时加入几片生姜。

	◦ 倒入一汤匙花雕酒。

	◦ 开大火煮沸，撇净浮沫。

	◦ 浮沫是血水和杂质所致。

	◦ 务必撇除干净直至汤清。

	◦ 焯水时间约五到八分钟。

	◦ 煮至肉块变色定型即可。

	• 捞出与冲洗。

	◦ 用漏勺将肉块捞出。

	◦ 立即放入温水中冲洗。

	◦ 洗去表面残留的浮沫。

	◦ 注意水温不宜过低。

	◦ 再次用厨房纸吸干水分。

	◦ 防止入锅时油花四溅。

	◦ 此时肉块呈灰白色。

	◦ 经过焯水已无肉腥味。

	四、核心工艺：炒糖色与煸炒

	• 炒制糖色。

	◦ 锅烧热，倒入少量底油。

	◦ 放入准备好的冰糖。

	◦ 开中小火慢慢搅动。

	◦ 观察冰糖融化的过程。

	◦ 先从固体变为液态。

	◦ 再从小泡转为密集大泡。

	◦ 当大泡逐渐回落消失时。

	◦ 糖液颜色开始加深。

	• 观察颜色变化。

	◦ 从浅黄色变为枣红色。

	◦ 这个瞬间非常关键。

	◦ 枣红色时立即下入肉块。

	◦ 过早则甜腻，过晚则发苦。

	◦ 动作务必迅速而准确。

	◦ 糖色是红亮色泽的来源。

	◦ 也是风味层次的基础。

	• 煸炒肉块。

	◦ 快速颠锅，使每块肉均匀裹上糖色。

	◦ 持续翻炒约三到五分钟。

	◦ 直到肉块表面微微焦黄。

	◦ 部分油脂被煸炒出来。

	◦ 这样吃起来肥而不腻。

	◦ 同时香味物质充分释放。

	◦ 煸出的猪油可倒出部分。

	◦ 留作炒青菜风味极佳。

	五、炖煮入味：时间与火候的艺术

	• 加入调料。

	◦ 沿着锅边烹入花雕酒。

	◦ 瞬间激发出浓郁酒香。

	◦ 接着倒入适量生抽。

	◦ 再加入少许老抽上色。

	◦ 放入所有香料：姜、葱、蒜等。

	◦ 与肉块一起翻炒均匀。

	◦ 让酱香与肉香充分融合。

	◦ 翻炒约两分钟至香气扑鼻。

	• 注入开水。

	◦ 务必一次性加足开水。

	◦ 水量要完全没过肉块。

	◦ 甚至可以略多一些。

	◦ 避免中途再次加水。

	◦ 大火烧开后转小火。

	◦ 盖上锅盖，慢火焖炖。

	◦ 这是“入口即化”的关键。

	◦ 时间至少需要一小时。

	• 慢炖过程。

	◦ 保持汤面微沸即可。

	◦ 火候切忌过大过急。

	◦ 否则容易烧干且肉不烂。

	◦ 期间可偶尔开盖查看。

	◦ 用勺子轻轻推动一下。

	◦ 防止粘锅底的情况发生。

	◦ 但尽量不要频繁翻动。

	◦ 以免影响肉块的完整。

	六、收汁与装盘：成就最终美味

	• 大火收汁。

	◦ 炖煮一小时后。

	◦ 用筷子戳一下瘦肉部分。

	◦ 若能轻松戳透即表示已软烂。

	◦ 此时根据汤汁咸度加盐。

	◦ 开大火，将汤汁收浓。

	◦ 用锅铲不停搅动。

	◦ 防止糊底，并让汤汁变稠。

	◦ 均匀包裹在每一块肉上。

	• 收汁技巧。

	◦ 收到汤汁浓稠如蜜。

	◦ 油亮红润的汤汁紧裹肉块。

	◦ 锅中泛起密集的大泡。

	◦ 即可准备关火出锅。

	◦ 收汁程度依个人喜好。

	◦ 喜欢拌饭可多留些汤汁。

	◦ 整个过程需密切留意。

	◦ 最后阶段变化非常迅速。

	• 最终成品。

	◦ 将红烧肉盛入预热好的盘中。

	◦ 可烫几棵小油菜围边。

	◦ 既点缀色彩，又解油腻。

	◦ 撒上少许葱花或香菜末。

	◦ 一道色香味俱全的红烧肉完成。

	◦ 肉质软糯，肥而不腻。

	◦ 咸中带甜，回味无穷。

	◦ 配上一碗白米饭是绝配。

	七、要点总结与升华

	• 成功关键。

	◦ 选材是基础，务必新鲜。

	◦ 焯水步骤不可省略。

	◦ 炒糖色是技术核心。

	◦ 火候控制是成败关键。

	◦ 耐心慢炖是美味保证。

	• 变化与创新。

	◦ 可加入土豆、鹌鹑蛋同烧。

	◦ 吸收肉汤，滋味更丰富。

	◦ 也可尝试用啤酒代替水。

	◦ 别有一番风味层次。

	◦ 但万变不离其宗。

	◦ 核心技法仍需掌握。

	烹饪是一门需要实践的艺术，红烧肉更是如此。希望这份详尽的食谱能成为您厨房路上的得力助手，愿您能享受从准备到品尝的整个过程，与家人朋友分享这份由时间与匠心凝聚而成的温暖美味。每一次尝试都是一次经验的积累，祝您烹饪愉快，早日成就属于自己的招牌红烧肉！
	'''
	# Chunk the text. prob_threshold must lie strictly between 0 and 1; the lower it is, the more chunks are generated.
	# Adjust it to your needs: with a very small value such as 0.000001, nearly every token becomes its own chunk,
	# and as it approaches 1, the whole text tends to become a single chunk (exactly 1 makes math.log raise a ValueError).
	chunks, token_pos = chunk_text(model, doc, tokenizer, prob_threshold=0.5)

	# print chunks
	for i, (c, t) in enumerate(zip(chunks, token_pos)):
	    print(f"-----chunk: {i}----token_idx: {t}--------")
	    print(c)
	```
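
The script converts `prob_threshold` into a logit offset with `math.log(1 / prob_threshold - 1)` instead of running a softmax on every token. The sketch below checks, on a few hand-picked logit pairs, that this comparison agrees with thresholding the two-class softmax probability directly. The helper names `start_probability` and `is_chunk_start` are hypothetical, and treating class index 1 as "chunk start" is inferred from the comparison in `chunk_text`.

```python
import math


def start_probability(logit_other: float, logit_start: float) -> float:
    """Two-class softmax probability of the 'chunk start' class."""
    m = max(logit_other, logit_start)  # subtract the max for numerical stability
    e_other = math.exp(logit_other - m)
    e_start = math.exp(logit_start - m)
    return e_start / (e_other + e_start)


def is_chunk_start(logit_other: float, logit_start: float, prob_threshold: float) -> bool:
    """The same comparison as in chunk_text, written for a single token."""
    logits_threshold = math.log(1 / prob_threshold - 1)
    return logit_start > logit_other - logits_threshold


# Both formulations should agree for every pair and every threshold.
for l0, l1 in [(2.0, -1.0), (0.3, 0.9), (-4.0, 5.0), (1.0, 1.5)]:
    for p in (0.1, 0.5, 0.9):
        assert is_chunk_start(l0, l1, p) == (start_probability(l0, l1) > p)
print("logit comparison matches probability threshold")
```

The equivalence follows from `softmax(l)[1] = sigmoid(l1 - l0)`, so `sigmoid(l1 - l0) > p` rearranges to `l1 - l0 > log(p / (1 - p))`, which is `-log(1 / p - 1)`. It also shows why `prob_threshold` must stay strictly below 1: at exactly 1 the argument of `math.log` is 0.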
	## Experimental
	The following script supports specifying the maximum number of tokens per chunk. When a chunk is about to exceed max_tokens_per_chunk and no token satisfies prob_threshold, the chunker is forced to split at the best position seen so far. This script can be seen as a new, experimental version of the script above.
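
The fallback rule can be illustrated on a plain list of numbers before reading the full script. The sketch below is a deliberately simplified stand-in, not the script's algorithm: it assumes one precomputed score per token (think of the logit difference), has no sliding window, and uses the hypothetical names `split_with_cap`, `scores`, `threshold` and `max_tokens`.

```python
from typing import List


def split_with_cap(scores: List[float], threshold: float, max_tokens: int) -> List[int]:
    """Return the token indices where a new chunk starts (index 0 is implicit)."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    starts: List[int] = []
    chunk_start = 0
    i = 1
    while i < len(scores):
        if scores[i] > threshold:
            # Confident boundary: split here.
            starts.append(i)
            chunk_start = i
        elif i - chunk_start >= max_tokens:
            # Cap reached without a confident boundary: fall back to the
            # best-scoring position inside the current chunk.
            candidates = range(chunk_start + 1, i + 1)
            best = max(candidates, key=lambda k: scores[k])
            starts.append(best)
            chunk_start = best
            i = best  # resume scanning right after the forced split
        i += 1
    return starts


scores = [0.0, -5.0, -3.0, -4.0, -1.0, -6.0, 2.0, -5.0]
print(split_with_cap(scores, threshold=0.0, max_tokens=3))
```

With a cap of 3 tokens, the only confident boundary is at index 6, so the earlier splits at indices 2 and 4 are forced ones chosen by score. The real script applies the same idea window by window and keeps the best candidate across windows in `backup_pos`.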
	```python
	# -- coding: utf-8 --
	import torch
	from transformers import AutoTokenizer, BertForTokenClassification
	import math

	model_path = "tim1900/bert-chunker-Chinese-2"

	tokenizer = AutoTokenizer.from_pretrained(
	    model_path,
	    padding_side="right",
	    model_max_length=512,
	    trust_remote_code=True,
	)

	device = "cpu" # or 'cuda'

	model = BertForTokenClassification.from_pretrained(
	    model_path,
	).to(device)

	def chunk_text_with_max_chunk_size(model, text, tokenizer, prob_threshold=0.5, max_tokens_per_chunk=400):
	    with torch.no_grad():
	        # Sliding context-window chunking.
	        MAX_TOKENS = 512
	        tokens = tokenizer(text, return_tensors="pt", truncation=False)
	        input_ids = tokens["input_ids"]
	        # Keep [CLS] and [SEP] aside so they can be re-attached to every window.
	        CLS = input_ids[:, 0].unsqueeze(0)
	        SEP = input_ids[:, -1].unsqueeze(0)
	        input_ids = input_ids[:, 1:-1]
	        model.eval()
	        split_str_poses = []
	        token_pos = []
	        windows_start = 0
	        windows_end = 0
	        logits_threshold = math.log(1 / prob_threshold - 1)

	        unchunk_tokens = 0  # tokens seen since the last split
	        backup_pos = None  # best fallback split position seen so far
	        best_logits = torch.finfo(torch.float32).min
	        STEP = round(((MAX_TOKENS - 2) // 2) * 1.75)
	        print(f"Processing {input_ids.shape[1]} tokens...")
	        while windows_start < input_ids.shape[1]:
	            windows_end = windows_start + MAX_TOKENS - 2
	            ids = torch.cat((CLS, input_ids[:, windows_start:windows_end], SEP), 1)
	            ids = ids.to(model.device)
	            output = model(
	                input_ids=ids,
	                attention_mask=torch.ones(1, ids.shape[1], device=model.device),
	            )
	            # Drop the [CLS] and [SEP] positions.
	            logits = output["logits"][:, 1:-1, :]

	            logit_diff = logits[:, :, 1] - logits[:, :, 0]

	            chunk_decision = logit_diff > -logits_threshold
	            greater_rows_indices = torch.where(chunk_decision)[1].tolist()

	            # Did this window yield a split point other than its own first token?
	            if len(greater_rows_indices) > 0 and (
	                not (greater_rows_indices[0] == 0 and len(greater_rows_indices) == 1)
	            ):
	                # Skip index 0 (the window's first token).
	                unchunk_tokens_this_window = (
	                    greater_rows_indices[0]
	                    if greater_rows_indices[0] != 0
	                    else greater_rows_indices[1]
	                )

	                # Forced split: the first split point comes too late.
	                if unchunk_tokens + unchunk_tokens_this_window > max_tokens_per_chunk:
	                    big_windows_end = max_tokens_per_chunk - unchunk_tokens
	                    max_value = logit_diff[:, 1:big_windows_end].max()
	                    max_index = logit_diff[:, 1:big_windows_end].argmax().item() + 1
	                    if best_logits < max_value:
	                        backup_pos = windows_start + max_index

	                    windows_start = backup_pos

	                    split_str_pos = [tokens.token_to_chars(backup_pos + 1).start]
	                    split_str_poses = split_str_poses + split_str_pos
	                    token_pos = token_pos + [backup_pos]
	                    best_logits = torch.finfo(torch.float32).min
	                    backup_pos = -1
	                    unchunk_tokens = 0

	                # Automatic split: use the predicted split points.
	                else:
	                    if len(greater_rows_indices) >= 2:
	                        for gi, (gri0, gri1) in enumerate(
	                            zip(greater_rows_indices[:-1], greater_rows_indices[1:])
	                        ):
	                            # Stop before a gap that would exceed the cap.
	                            if gri1 - gri0 > max_tokens_per_chunk:
	                                greater_rows_indices = greater_rows_indices[: gi + 1]
	                                break

	                    split_str_pos = [
	                        tokens.token_to_chars(sp + windows_start + 1).start
	                        for sp in greater_rows_indices
	                        if sp > 0
	                    ]
	                    split_str_poses = split_str_poses + split_str_pos
	                    token_pos = token_pos + [
	                        sp + windows_start for sp in greater_rows_indices if sp > 0
	                    ]

	                    windows_start = greater_rows_indices[-1] + windows_start
	                    best_logits = torch.finfo(torch.float32).min
	                    backup_pos = -1
	                    unchunk_tokens = 0

	            else:
	                unchunk_tokens_this_window = (
	                    min(windows_start + STEP, input_ids.shape[1]) - windows_start
	                )

	                # Forced split: advancing by STEP would exceed the cap.
	                if unchunk_tokens + unchunk_tokens_this_window > max_tokens_per_chunk:
	                    big_windows_end = max_tokens_per_chunk - unchunk_tokens
	                    if logit_diff.shape[1] > 1:
	                        max_value = logit_diff[:, 1:big_windows_end].max()
	                        max_index = logit_diff[:, 1:big_windows_end].argmax().item() + 1
	                        if best_logits < max_value:
	                            backup_pos = windows_start + max_index

	                    windows_start = backup_pos
	                    split_str_pos = [tokens.token_to_chars(backup_pos + 1).start]
	                    split_str_poses = split_str_poses + split_str_pos
	                    token_pos = token_pos + [backup_pos]
	                    best_logits = torch.finfo(torch.float32).min
	                    backup_pos = -1
	                    unchunk_tokens = 0
	                else:
	                    # No split yet: remember the best candidate and advance.
	                    if logit_diff.shape[1] > 1:
	                        max_value = logit_diff[:, 1:].max()
	                        max_index = logit_diff[:, 1:].argmax().item() + 1
	                        if best_logits < max_value:
	                            best_logits = max_value
	                            backup_pos = windows_start + max_index

	                    unchunk_tokens = unchunk_tokens + STEP
	                    windows_start = windows_start + STEP

	        substrings = [
	            text[i:j] for i, j in zip([0] + split_str_poses, split_str_poses + [len(text)])
	        ]
	        token_pos = [0] + token_pos
	        return substrings, token_pos

	# chunking
	print("\n>>>>>>>>> Chunking...")
	doc = r'''经典中式家常菜：红烧肉详尽版食谱

	红烧肉，作为中华美食殿堂中一颗璀璨的明珠，以其色泽红亮、肥而不腻、入口即化的独特风味，征服了无数食客的味蕾。它不仅仅是一道菜，更是一种情怀，是记忆中妈妈厨房里飘出的诱人香气，是团圆饭桌上最温暖的慰藉。要烹制出一盘完美的红烧肉，需要耐心、技巧以及对细节的专注。以下将为您详尽分解从选材到成品的每一个步骤，力求让您在家也能复刻出餐厅级别的美味。本食谱旨在深入剖析，故篇幅较长，以确保每个环节都清晰透彻。

	一、精选主料：奠定美味的基石

	• 猪肉选择。

	◦ 首选带皮的五花肉。

	◦ 层次分明为佳。

	◦ 肥瘦相间是关键。

	◦ 厚度约三指左右。

	◦ 重量约一斤半为宜。

	◦ 确保猪皮完整无缺。

	◦ 新鲜肉品呈鲜红色。

	◦ 脂肪部分应洁白细腻。

	• 预处理工作。

	◦ 用镊子拔除残留猪毛。

	◦ 将肉块置于冷水中浸泡。

	◦ 浸泡时间约三十分钟。

	◦ 中途可换水一至两次。

	◦ 目的是泡出部分血水。

	◦ 然后用刀刮洗猪皮表面。

	◦ 彻底清除污物和杂质。

	◦ 最后用流动水冲洗干净。

	二、准备辅料：调配灵魂之味

	• 主要调味品。

	◦ 老抽：负责上色。

	◦ 用量需谨慎控制。

	◦ 过多会导致发黑发苦。

	◦ 生抽：提供咸鲜底味。

	◦ 与老抽比例约二比一。

	◦ 优质花雕酒是精髓。

	◦ 去腥增香效果显著。

	◦ 冰糖：首选黄冰糖。

	◦ 炒出的糖色更红亮。

	◦ 风味也比白糖醇和。

	• 香辛料组合。

	◦ 生姜一大块，切片。

	◦ 大葱一根，切成长段。

	◦ 蒜头数瓣，轻轻拍松。

	◦ 八角两至三颗，增香。

	◦ 桂皮一小段，勿过多。

	◦ 香叶两片，增添风味。

	◦ 草果一颗，可拍裂开。

	◦ 干辣椒依个人口味添加。

	• 其他基础材料。

	◦ 食用油适量，需耐高温。

	◦ 最好使用菜籽油或花生油。

	◦ 食盐少许，用于最后调味。

	◦ 因为酱油已有咸度。

	◦ 准备足量的开水备用。

	◦ 切记不可使用冷水。

	◦ 冷水会使肉质收缩变柴。

	三、精细加工：关键步骤解析

	• 肉块改刀。

	◦ 将洗净的五花肉捞出。

	◦ 用厨房纸彻底吸干水分。

	◦ 这一步非常重要。

	◦ 能有效防止后续溅油。

	◦ 将肉切成三厘米见方块。

	◦ 大小尽量保持均匀一致。

	◦ 以确保受热和入味均匀。

	◦ 切面应能看到完美层次。

	• 焯水去腥。

	◦ 冷水下锅，放入切好的肉。

	◦ 同时加入几片生姜。

	◦ 倒入一汤匙花雕酒。

	◦ 开大火煮沸，撇净浮沫。

	◦ 浮沫是血水和杂质所致。

	◦ 务必撇除干净直至汤清。

	◦ 焯水时间约五到八分钟。

	◦ 煮至肉块变色定型即可。

	• 捞出与冲洗。

	◦ 用漏勺将肉块捞出。

	◦ 立即放入温水中冲洗。

	◦ 洗去表面残留的浮沫。

	◦ 注意水温不宜过低。

	◦ 再次用厨房纸吸干水分。

	◦ 防止入锅时油花四溅。

	◦ 此时肉块呈灰白色。

	◦ 经过焯水已无肉腥味。

	四、核心工艺：炒糖色与煸炒

	• 炒制糖色。

	◦ 锅烧热，倒入少量底油。

	◦ 放入准备好的冰糖。

	◦ 开中小火慢慢搅动。

	◦ 观察冰糖融化的过程。

	◦ 先从固体变为液态。

	◦ 再从小泡转为密集大泡。

	◦ 当大泡逐渐回落消失时。

	◦ 糖液颜色开始加深。

	• 观察颜色变化。

	◦ 从浅黄色变为枣红色。

	◦ 这个瞬间非常关键。

	◦ 枣红色时立即下入肉块。

	◦ 过早则甜腻，过晚则发苦。

	◦ 动作务必迅速而准确。

	◦ 糖色是红亮色泽的来源。

	◦ 也是风味层次的基础。

	• 煸炒肉块。

	◦ 快速颠锅，使每块肉均匀裹上糖色。

	◦ 持续翻炒约三到五分钟。

	◦ 直到肉块表面微微焦黄。

	◦ 部分油脂被煸炒出来。

	◦ 这样吃起来肥而不腻。

	◦ 同时香味物质充分释放。

	◦ 煸出的猪油可倒出部分。

	◦ 留作炒青菜风味极佳。

	五、炖煮入味：时间与火候的艺术

	• 加入调料。

	◦ 沿着锅边烹入花雕酒。

	◦ 瞬间激发出浓郁酒香。

	◦ 接着倒入适量生抽。

	◦ 再加入少许老抽上色。

	◦ 放入所有香料：姜、葱、蒜等。

	◦ 与肉块一起翻炒均匀。

	◦ 让酱香与肉香充分融合。

	◦ 翻炒约两分钟至香气扑鼻。

	• 注入开水。

	◦ 务必一次性加足开水。

	◦ 水量要完全没过肉块。

	◦ 甚至可以略多一些。

	◦ 避免中途再次加水。

	◦ 大火烧开后转小火。

	◦ 盖上锅盖，慢火焖炖。

	◦ 这是“入口即化”的关键。

	◦ 时间至少需要一小时。

	• 慢炖过程。

	◦ 保持汤面微沸即可。

	◦ 火候切忌过大过急。

	◦ 否则容易烧干且肉不烂。

	◦ 期间可偶尔开盖查看。

	◦ 用勺子轻轻推动一下。

	◦ 防止粘锅底的情况发生。

	◦ 但尽量不要频繁翻动。

	◦ 以免影响肉块的完整。

	六、收汁与装盘：成就最终美味

	• 大火收汁。

	◦ 炖煮一小时后。

	◦ 用筷子戳一下瘦肉部分。

	◦ 若能轻松戳透即表示已软烂。

	◦ 此时根据汤汁咸度加盐。

	◦ 开大火，将汤汁收浓。

	◦ 用锅铲不停搅动。

	◦ 防止糊底，并让汤汁变稠。

	◦ 均匀包裹在每一块肉上。

	• 收汁技巧。

	◦ 收到汤汁浓稠如蜜。

	◦ 油亮红润的汤汁紧裹肉块。

	◦ 锅中泛起密集的大泡。

	◦ 即可准备关火出锅。

	◦ 收汁程度依个人喜好。

	◦ 喜欢拌饭可多留些汤汁。

	◦ 整个过程需密切留意。

	◦ 最后阶段变化非常迅速。

	• 最终成品。

	◦ 将红烧肉盛入预热好的盘中。

	◦ 可烫几棵小油菜围边。

	◦ 既点缀色彩，又解油腻。

	◦ 撒上少许葱花或香菜末。

	◦ 一道色香味俱全的红烧肉完成。

	◦ 肉质软糯，肥而不腻。

	◦ 咸中带甜，回味无穷。

	◦ 配上一碗白米饭是绝配。

	七、要点总结与升华

	• 成功关键。

	◦ 选材是基础，务必新鲜。

	◦ 焯水步骤不可省略。

	◦ 炒糖色是技术核心。

	◦ 火候控制是成败关键。

	◦ 耐心慢炖是美味保证。

	• 变化与创新。

	◦ 可加入土豆、鹌鹑蛋同烧。

	◦ 吸收肉汤，滋味更丰富。

	◦ 也可尝试用啤酒代替水。

	◦ 别有一番风味层次。

	◦ 但万变不离其宗。

	◦ 核心技法仍需掌握。

	烹饪是一门需要实践的艺术，红烧肉更是如此。希望这份详尽的食谱能成为您厨房路上的得力助手，愿您能享受从准备到品尝的整个过程，与家人朋友分享这份由时间与匠心凝聚而成的温暖美味。每一次尝试都是一次经验的积累，祝您烹饪愉快，早日成就属于自己的招牌红烧肉！。
	'''
	# Chunk the text. prob_threshold must lie strictly between 0 and 1; the lower it is, the more chunks are generated.
	# Adjust it to your needs: with a very small value such as 0.000001, nearly every token becomes its own chunk.
	# As it approaches 1, the text tends to stay in one piece, so the chunker is forced to split at the best available
	# position whenever a chunk is about to exceed max_tokens_per_chunk and no token satisfies prob_threshold.
	chunks, token_pos = chunk_text_with_max_chunk_size(model, doc, tokenizer, prob_threshold=0.5, max_tokens_per_chunk = 100)

	# print chunks
	for i, (c, t) in enumerate(zip(chunks, token_pos)):
	    print(f"-----chunk: {i}----token_idx: {t}--------")
	    print(c)
	```
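
Both scripts build the returned chunks by slicing the original string at the character offsets collected in `split_str_poses`, so joining the chunks gives back the input text unchanged. The sketch below isolates that slicing step; the helper name `split_at` and the hand-picked offsets are hypothetical, standing in for positions the model would produce.

```python
from typing import List


def split_at(text: str, positions: List[int]) -> List[str]:
    """Cut text at the given character offsets, using the same slicing as the scripts."""
    return [text[i:j] for i, j in zip([0] + positions, positions + [len(text)])]


text = "一、精选主料。二、准备辅料。三、精细加工。"
chunks = split_at(text, [7, 14])  # offsets chosen by hand for illustration
for chunk in chunks:
    print(chunk)

# Nothing is dropped or duplicated: the chunks concatenate back to the input.
assert "".join(chunks) == text
```

This lossless property can be useful as a quick sanity check after chunking a real document, assuming the split offsets are sorted in increasing order.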
	## Citation
	```bibtex
	@article{bert-chunker,
	title={bert-chunker: Efficient and Trained Chunking for Unstructured Documents},
	author={Yannan Luo},
	year={2024},
	url={https://github.com/jackfsuia/bert-chunker}
	}
	```
	The base model is [bge-small-zh-v1.5](https://huggingface.co/BAAI/bge-small-zh-v1.5).