---
license: mit
language:
- en
- zh
pipeline_tag: token-classification
---
# bert-chunker-3.5

[GitHub](https://github.com/jackfsuia/bert-chunker/)

bert-chunker-3.5 is a text chunker built on BertForTokenClassification: it predicts the start token of each chunk (for use in RAG and similar pipelines) and, using a sliding window, cuts documents of any length into chunks. We see it as an alternative to the [Kamradt semantic chunker](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb), with one notable difference: it works not only on structured texts but also on **unstructured and messy texts**.
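
The core idea can be shown with a toy example: the model assigns every token a "chunk starts here" score, and the text is cut at the highest-scoring tokens. Everything below (token list, scores, threshold) is made up for illustration; in the real chunker these scores come from the model's logits:

```python
# Toy illustration: cut a token list wherever the "chunk start" score is high.
# Tokens, scores, and threshold are hand-made; bert-chunker produces the
# scores with BertForTokenClassification (class-1 logit minus class-0 logit).
tokens = ["Intro", "text", ".", "Topic", "B", "starts", ".", "Topic", "C", "here", "."]
scores = [9.0, -3.0, -4.0, 7.5, -2.0, -3.0, -4.0, 8.1, -2.5, -3.0, -4.0]

threshold = 5.0  # hypothetical cutoff for "this token starts a chunk"
starts = [i for i, s in enumerate(scores) if s >= threshold]

# Each chunk runs from one start token up to (but excluding) the next.
chunks = [
    " ".join(tokens[i:j]) for i, j in zip(starts, starts[1:] + [len(tokens)])
]
print(chunks)
# → ['Intro text .', 'Topic B starts .', 'Topic C here .']
```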

As a new version of bert-chunker-3, it achieves **competitive** [**performance**](#evaluation).

## Usage
Run the following:

```python
import torch
from transformers import AutoTokenizer, BertForTokenClassification
from collections import deque

model_path = r"tim1900/bert-chunker-3.5"

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    padding_side="right",
    trust_remote_code=True,
)

device = "cpu"  # or "cuda"

model = BertForTokenClassification.from_pretrained(model_path).to(device)


def split_text(text, max_tokens_per_chunk):
    MAX_LEN = 512  # BERT context size
    MAX_USE_LEN = MAX_LEN - 2  # minus [CLS] and [SEP]
    MASK_EDGE = 50  # edge tokens whose logits are discarded
    EFFECT_SIZE = MAX_USE_LEN - MASK_EDGE * 2  # tokens actually scored per window
    with torch.no_grad():
        unk_id = tokenizer.unk_token_id
        print("\n>>>>>>>>> Tokenizing text...")
        tokens = tokenizer(text, return_tensors="pt", truncation=False)
        print("\n>>>>>>>>> Chunking text...")
        input_ids = tokens["input_ids"].squeeze()[1:-1]
        len_of_input_ids = len(input_ids)
        # Pad both ends with [UNK] so the windows tile the sequence exactly.
        left = EFFECT_SIZE - len_of_input_ids % EFFECT_SIZE
        full_input_ids = torch.cat(
            [
                torch.tensor([unk_id] * MASK_EDGE),
                input_ids,
                torch.tensor([unk_id] * (MASK_EDGE + left)),
            ]
        )
        prob_list = []
        start_idx = 0
        # Slide a 512-token window over the document; keep only the central
        # EFFECT_SIZE logits of each window, where the context is reliable.
        while True:
            end_idx = start_idx + MAX_USE_LEN
            window_input_ids = torch.cat(
                [
                    torch.tensor([tokenizer.cls_token_id]),
                    full_input_ids[start_idx:end_idx],
                    torch.tensor([tokenizer.sep_token_id]),
                ]
            )
            window_input_ids = window_input_ids.to(model.device)
            output = model(
                input_ids=window_input_ids.unsqueeze(0),
                attention_mask=torch.ones(1, window_input_ids.shape[0], device=model.device),
            )
            logits = output["logits"][:, 1 + MASK_EDGE : -1 - MASK_EDGE, :]
            # Score of "this token starts a chunk" minus "it does not".
            logit_diff = logits[:, :, 1] - logits[:, :, 0]
            logit_diff = logit_diff.squeeze().tolist()
            prob_list = prob_list + logit_diff
            start_idx = start_idx + EFFECT_SIZE
            if end_idx == len(full_input_ids):
                break
    prob_list = prob_list[:len_of_input_ids]

    def find_split_points(numbers, m):
        # Greedily place each split at the highest score inside a window of
        # m + 1 positions, so no chunk exceeds the token budget.
        if m >= len(numbers):
            return [0]

        def sliding_window_max_indices(arr, M):
            dq = deque()
            result = []
            for i in range(len(arr)):
                while dq and dq[0] < i - M + 1:
                    dq.popleft()
                while dq and arr[dq[-1]] <= arr[i]:
                    dq.pop()
                dq.append(i)
                if i >= M - 1:
                    result.append(dq[0])
            return result

        max_pos = sliding_window_max_indices(numbers, m + 1)

        splits = [0]
        st = 1
        while st < len(max_pos):
            split = max_pos[st]
            splits.append(split)
            st = split + 1

        return splits

    token_split_points = find_split_points(prob_list, max_tokens_per_chunk)

    # Map token positions back to character offsets in the original string.
    str_split_points = [tokens.token_to_chars(pos + 1).start for pos in token_split_points]

    if str_split_points[0] != 0:
        str_split_points[0] = 0

    substrings = [
        text[i:j] for i, j in zip(str_split_points, str_split_points[1:] + [len(text)])
    ]

    return substrings, token_split_points

# chunking
txt = r"""The causes and effects of dropouts in vocational and professional education are more pressing than ever. A decreasing attractiveness of vocational education, particularly in payment and quality, causes higher dropout rates while hitting ongoing demographic changes resulting in extensive skill shortages for many regions. Therefore, tackling the internationally high dropout rates is of utmost political and scientific interest. This thematic issue contributes to the conceptualization, analysis, and prevention of vocational and professional dropouts by bringing together current research that progresses to a deeper processual understanding and empirical modelling of dropouts. It aims to expand our understanding of how dropout and decision processes leading to dropout can be conceptualized and measured in vocational and professional contexts. Another aim is to gather empirical studies on both predictors and dropout consequences. Based on this knowledge, the thematic issue intends to provide evidence of effective interventions to avoid dropouts and identify promising ways for future dropout research in professional and vocational education to support evidence-based vocational education policy.

We thus welcome research contributions (original empirical and conceptual/measurement-related articles, literature reviews, meta-analyses) on dropouts (e.g., premature terminations, intentions to terminate, vertical and horizontal dropouts) that are situated in vocational and professional education at workplaces, schools, or other tertiary professional education institutions.


Part 1 of the thematic series outlines central theories and measurement concepts for vocational and professional dropouts. Part 2 outlines measurement approaches for dropout. Part 3 investigates relevant predictors of dropout. Part 4 analyzes the effects of dropout on an individual, organizational, and systemic level. Part 5 deals with programs and interventions for the prevention of dropouts.

We welcome papers that include but are not limited to:

Theoretical papers on the concept and processes of vocational and professional dropout or retention
Measurement approaches to assess dropout or retention
Quantitative and qualitative papers on the causes of dropout or retention
Quantitative and qualitative papers on the effects of dropout or retention on learners, providers/organizations and the (educational) system
Design-based research and experimental papers on dropout prevention programs or retention
Submission instructions
Before submitting your manuscript, please ensure you have carefully read the Instructions for Authors for Empirical Research in Vocational Education and Training. The complete manuscript should be submitted through the Empirical Research in Vocational Education and Training submission system. To ensure that you submit to the correct thematic series please select the appropriate section in the drop-down menu upon submission. In addition, indicate within your cover letter that you wish your manuscript to be considered as part of the thematic series on series title. All submissions will undergo rigorous peer review, and accepted articles will be published within the journal as a collection.

Lead Guest Editor:
Prof. Dr. Viola Deutscher, University of Mannheim
viola.deutscher@uni-mannheim.de

Guest Editors:
Prof. Dr. Stefanie Findeisen, University of Konstanz
stefanie.findeisen@uni-konstanz.de

Prof. Dr. Christian Michaelis, Georg-August-University of Göttingen
christian.michaelis@wiwi.uni-goettingen.de

Deadline for submission
This Call for Papers is open from now until 29 February 2023. Submitted papers will be reviewed in a timely manner and published directly after acceptance (i.e., without waiting for the accomplishment of all other contributions). Thanks to the Empirical Research in Vocational Education and Training (ERVET) open access policy, the articles published in this thematic issue will have a wide, global audience.

Option of submitting abstracts: Interested authors should submit a letter of intent including a working title for the manuscript, names, affiliations, and contact information for all authors, and an abstract of no more than 500 words to the lead guest editor Viola Deutscher (viola.deutscher@uni-mannheim.de) by July, 31st 2023. Due to technical issues, we also ask authors who already submitted an abstract before May, 30th to send their abstracts again to the address stated above. However, abstract submission is optional and is not mandatory for the full paper submission.

Different dropout directions in vocational education and training: the role of the initiating party and trainees’ reasons for dropping out
The high rates of premature contract termination (PCT) in vocational education and training (VET) programs have led to an increasing number of studies examining the reasons why adolescents drop out. Since adol...

Authors:Christian Michaelis and Stefanie Findeisen
Citation:Empirical Research in Vocational Education and Training 2024 16:15
Content type:Research
Published on: 6 August 2024"
"""
chunks, token_pos = split_text(txt, max_tokens_per_chunk=400)

# print chunks
for i, (c, t) in enumerate(zip(chunks, token_pos)):
    print(f"========================== CHUNK: {i} TOKEN_IDX: {t} ==========================")
    print(c)
```
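
The `find_split_points` helper in the script relies on the classic monotonic-deque sliding-window maximum: for each window of `M` consecutive scores it reports the index of the largest one, in O(N) time overall. Isolated here for clarity (same logic as in the script, plus a small usage example):

```python
from collections import deque

def sliding_window_max_indices(arr, M):
    """Index of the maximum of each length-M window of arr, left to right."""
    dq = deque()  # indices whose values decrease; dq[0] is the current window max
    result = []
    for i in range(len(arr)):
        while dq and dq[0] < i - M + 1:      # drop indices that left the window
            dq.popleft()
        while dq and arr[dq[-1]] <= arr[i]:  # drop values dominated by arr[i]
            dq.pop()
        dq.append(i)
        if i >= M - 1:                       # first full window is complete
            result.append(dq[0])
    return result

print(sliding_window_max_indices([1, 3, 2, 5, 4, 0], 3))
# → [1, 3, 3, 3]
```

Because every index enters and leaves the deque at most once, the whole pass is linear, which is what keeps the chunker's boundary search O(N).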

## Evaluation
The following RAG evaluation uses code from [brandonstarxel/chunking_evaluation](https://github.com/brandonstarxel/chunking_evaluation); most of the results are taken from [Evaluating Chunking Strategies for Retrieval](https://research.trychroma.com/evaluating-chunking).

| Chunking | Size | Overlap | Recall | Precision | PrecisionΩ | IoU | Time complexity in token count N | Max chunk size strictly controllable |
|---------|------|---------|--------|-----------|------------|-----|----------------------------------|--------------------------------------|
| Recursive | <= 800 | 400 | 85.4 ± 34.9 | 1.5 ± 1.3 | 6.7 ± 5.2 | 1.5 ± 1.3 | **O(N)** | **Yes** |
| TokenText | 800 | 400 | 87.9 ± 31.7 | 1.4 ± 1.1 | 4.7 ± 3.1 | 1.4 ± 1.1 | **O(N)** | **Yes** |
| Recursive | <= 400 | 200 | 88.1 ± 31.6 | 3.3 ± 2.7 | 13.9 ± 10.4 | 3.3 ± 2.7 | **O(N)** | **Yes** |
| TokenText | 400 | 200 | 88.6 ± 29.7 | 2.7 ± 2.2 | 8.4 ± 5.1 | 2.7 ± 2.2 | **O(N)** | **Yes** |
| Recursive | <= 400 | 0 | 89.5 ± 29.7 | 3.6 ± 3.2 | 17.7 ± 14.0 | 3.6 ± 3.2 | **O(N)** | **Yes** |
| TokenText | 400 | 0 | 89.2 ± 29.2 | 2.7 ± 2.2 | 12.5 ± 8.1 | 2.7 ± 2.2 | **O(N)** | **Yes** |
| Recursive | <= 200 | 0 | 88.1 ± 30.1 | 7.0 ± 5.6 | 29.9 ± 18.4 | 6.9 ± 5.6 | **O(N)** | **Yes** |
| TokenText | 200 | 0 | 87.0 ± 30.8 | 5.2 ± 4.1 | 21.0 ± 11.9 | 5.1 ± 4.1 | **O(N)** | **Yes** |
| Kamradt | N/A (~660) | 0 | 83.6 ± 36.8 | 1.5 ± 1.6 | 7.4 ± 10.2 | 1.5 ± 1.6 | **O(N)** | No |
| KamradtMod | <= 300 | 0 | 87.1 ± 31.9 | 2.1 ± 2.0 | 10.5 ± 12.3 | 2.1 ± 2.0 | **O(N)** | **Yes** |
| Cluster | 400 (~182) | 0 | 91.3 ± 25.4 | 4.5 ± 3.4 | 20.7 ± 14.5 | 4.5 ± 3.4 | O(N<sup>2</sup>) | No |
| Cluster | 200 (~103) | 0 | 87.3 ± 29.8 | **8.0 ± 6.0** | **34.0 ± 19.7** | **8.0 ± 6.0** | O(N<sup>2</sup>) | No |
| LLM (GPT4o) | N/A (~240) | 0 | 91.9 ± 26.5 | 3.9 ± 3.2 | 19.9 ± 16.3 | 3.9 ± 3.2 | O(N<sup>2</sup>) | No |
| semchunk | <= 400 | 0 | 90.0 ± 29.1 | 3.6 ± 2.8 | 17.3 ± 12.6 | 3.6 ± 2.8 | **O(N)** | **Yes** |
| semchunk | <= 200 | 0 | 89.3 ± 28.7 | 6.8 ± 5.2 | 28.9 ± 17.1 | 6.7 ± 5.1 | **O(N)** | **Yes** |
| bert-chunker-3 (experimental, prob_threshold=0.50543) | <= 400 | 0 | 91.3 ± 26.6 | 5.4 ± 4.7 | 23.1 ± 17.6 | 5.4 ± 4.7 | **O(N)** | **Yes** |
| bert-chunker-3 (experimental, prob_threshold=0.50543) | <= 200 | 0 | 89.7 ± 27.9 | 7.6 ± 6.0 | 30.9 ± 19.1 | 7.7 ± 5.8 | **O(N)** | **Yes** |
| bert-chunker-3 (prob_threshold=0.50543) | N/A | 0 | 90.4 ± 28.7 | 3.3 ± 3.1 | 16.0 ± 17.0 | 3.3 ± 3.1 | **O(N)** | No |
| ★ bert-chunker-3.5 | <= 400 | 0 | **94.1 ± 22.5** | 4.3 ± 3.5 | 18.1 ± 13.3 | 4.3 ± 3.5 | **O(N)** | **Yes** |
| ★ bert-chunker-3.5 | <= 200 | 0 | 90.4 ± 26.2 | 7.7 ± 5.7 | 29.2 ± 17.9 | 7.6 ± 5.7 | **O(N)** | **Yes** |

## Citation
```bibtex
@article{bert-chunker,
  title={bert-chunker: Efficient and Trained Chunking for Unstructured Documents},
  author={Yannan Luo},
  year={2024},
  url={https://github.com/jackfsuia/bert-chunker}
}
```
The base model is [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).