tim1900 committed
Commit 2776e7b · verified · 1 Parent(s): 13cd299

Update README.md

Files changed (1): README.md +265 -1
README.md CHANGED
@@ -13,7 +13,7 @@ bert-chunker-3 is a text chunker based on BertForTokenClassification to predict
 
  Different from [bc-2](https://huggingface.co/tim1900/bert-chunker-2) and [bc](https://huggingface.co/tim1900/bert-chunker), to overcome the data distribution shift, our training data were labeled by an LLM and the training pipeline was improved, so it is **more stable**.
 
- Updates (2025.5.12): an experimental script that **supports specifying the maximum tokens per chunk** is [here](https://huggingface.co/tim1900/bert-chunker-3.1/blob/main/README.md)
+ Updates (2025.5.12): [an experimental script](#experimental) that **supports specifying the maximum tokens per chunk** is available now.
 
  ## Usage
  Run the following:
 
@@ -196,6 +196,270 @@ Published on: 6 August 2024"
  # when it is set to 1, the whole text will be one chunk.
  chunks, token_pos = chunk_text(model, ad, tokenizer, prob_threshold=0.5)
 
+ # print chunks
+ for i, (c, t) in enumerate(zip(chunks, token_pos)):
+     print(f"-----chunk: {i}----token_idx: {t}--------")
+     print(c)
+ ```
+ ## Experimental
+ The following script supports specifying the max tokens per chunk; it can be seen as a new, experimental version of the script above.
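Before reading the full script, the core idea of the budget-aware splitter can be sketched in isolation. The toy function below is an editorial illustration, not part of this commit: the function name, the made-up per-token split probabilities, and the simplified bookkeeping are all hypothetical. It splits wherever a token's split probability clears the threshold, and otherwise forces a split at the most probable position seen once the token budget would be exceeded, which mirrors the "auto chunk" / "manually chunk" branches of the real implementation.

```python
def budget_split(split_probs, prob_threshold=0.5, max_tokens_per_chunk=None):
    """Split wherever a token's split probability clears the threshold; if the
    budget is about to be exceeded first, force a split at the most probable
    position seen since the last boundary."""
    boundaries = []
    start = 0                       # index where the current chunk begins
    best_pos, best_prob = None, -1.0
    i = 1                           # position 0 can never start a new chunk
    while i < len(split_probs):
        p = split_probs[i]
        if p > best_prob:
            best_pos, best_prob = i, p
        if p >= prob_threshold:
            # confident split: the model itself wants a boundary here
            boundaries.append(i)
            start = i
            best_pos, best_prob = None, -1.0
        elif max_tokens_per_chunk is not None and (i - start + 1) >= max_tokens_per_chunk:
            # budget exhausted with no qualifying token: fall back to the
            # best-scoring position and resume scanning just after it
            boundaries.append(best_pos)
            start = best_pos
            i = best_pos
            best_pos, best_prob = None, -1.0
        i += 1
    return boundaries

# Without a budget only the confident position (index 2) splits; with a budget
# of 3 tokens per chunk, extra boundaries are forced at the best fallbacks.
print(budget_split([0.0, 0.1, 0.9, 0.2, 0.3, 0.1, 0.2], 0.5, None))  # [2]
print(budget_split([0.0, 0.1, 0.9, 0.2, 0.3, 0.1, 0.2], 0.5, 3))     # [2, 4, 6]
```

The real script below does the same bookkeeping, but over a sliding 255-token window of model logits rather than a flat probability list.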
+ ```python
+ import torch
+ from transformers import AutoTokenizer, BertForTokenClassification
+ import math
+ 
+ model_path = "tim1900/bert-chunker-3"
+ 
+ tokenizer = AutoTokenizer.from_pretrained(
+     model_path,
+     padding_side="right",
+     model_max_length=255,
+     trust_remote_code=True,
+ )
+ 
+ device = "cpu"  # or 'cuda'
+ 
+ model = BertForTokenClassification.from_pretrained(
+     model_path,
+ ).to(device)
+ 
+ def chunk_text(model, text, tokenizer, prob_threshold=0.5, max_tokens_per_chunk=None):
+     # Sliding-window chunking driven by prob_threshold when max_tokens_per_chunk is None.
+     # When max_tokens_per_chunk is set, the same chunking applies, but a split is forced
+     # at the best available position whenever a chunk is about to exceed
+     # max_tokens_per_chunk and no token satisfies prob_threshold.
+ 
+     with torch.no_grad():
+ 
+         # sliding context window chunking
+         MAX_TOKENS = 255
+         tokens = tokenizer(text, return_tensors="pt", truncation=False)
+         input_ids = tokens["input_ids"]
+         attention_mask = tokens["attention_mask"][:, 0:MAX_TOKENS]
+         attention_mask = attention_mask.to(model.device)
+         CLS = input_ids[:, 0].unsqueeze(0)
+         SEP = input_ids[:, -1].unsqueeze(0)
+         input_ids = input_ids[:, 1:-1]
+         model.eval()
+         split_str_poses = []
+         token_pos = []
+         windows_start = 0
+         windows_end = 0
+         logits_threshold = math.log(1 / prob_threshold - 1)
+ 
+         unchunk_tokens = 0
+         backup_pos = None
+         best_logits = torch.finfo(torch.float32).min
+         is_chunk_start = True
+ 
+         print(f"Processing {input_ids.shape[1]} tokens...")
+         while windows_end <= input_ids.shape[1]:
+ 
+             windows_end = windows_start + MAX_TOKENS - 2
+ 
+             ids = torch.cat((CLS, input_ids[:, windows_start:windows_end], SEP), 1)
+             ids = ids.to(model.device)
+ 
+             output = model(
+                 input_ids=ids,
+                 attention_mask=torch.ones(1, ids.shape[1], device=model.device),
+             )
+             logits = output["logits"][:, 1:-1, :]
+ 
+             logit_diff = logits[:, :, 1] - logits[:, :, 0]
+ 
+             chunk_decision = logit_diff > -logits_threshold
+             greater_rows_indices = torch.where(chunk_decision)[1].tolist()
+ 
+             # does any token in this window clear the threshold?
+             if len(greater_rows_indices) > 0 and (
+                 not (greater_rows_indices[0] == 0 and len(greater_rows_indices) == 1)
+             ):
+ 
+                 # exclude the first index
+                 unchunk_tokens_this_window = greater_rows_indices[0] if greater_rows_indices[0] != 0 else greater_rows_indices[1]
+                 # manually chunk
+                 if max_tokens_per_chunk is not None and unchunk_tokens + unchunk_tokens_this_window > max_tokens_per_chunk:
+                     big_windows_end = max_tokens_per_chunk - unchunk_tokens
+                     if is_chunk_start:
+                         max_value, max_index = logit_diff[:, 1:big_windows_end].max(), logit_diff[:, 1:big_windows_end].argmax() + 1
+                     else:
+                         max_value, max_index = logit_diff[:, :big_windows_end].max(), logit_diff[:, :big_windows_end].argmax()
+                     if best_logits < max_value:
+                         backup_pos = windows_start + max_index
+ 
+                     windows_start = backup_pos
+ 
+                     split_str_pos = [tokens.token_to_chars(backup_pos + 1).start]
+                     split_str_poses += split_str_pos
+                     token_pos += [backup_pos + 1]
+                     best_logits = torch.finfo(torch.float32).min
+                     backup_pos = -1
+                     unchunk_tokens = 0
+                     is_chunk_start = True
+ 
+                 # auto chunk
+                 else:
+                     split_str_pos = [tokens.token_to_chars(sp + windows_start + 1).start for sp in greater_rows_indices if sp > 0]
+                     split_str_poses += split_str_pos
+                     token_pos += [sp + windows_start + 1 for sp in greater_rows_indices if sp > 0]
+ 
+                     windows_start = greater_rows_indices[-1] + windows_start
+                     is_chunk_start = True
+ 
+             else:
+ 
+                 unchunk_tokens_this_window = windows_end - windows_start
+                 # manually chunk
+                 if max_tokens_per_chunk is not None and unchunk_tokens + unchunk_tokens_this_window > max_tokens_per_chunk:
+                     big_windows_end = max_tokens_per_chunk - unchunk_tokens
+                     if is_chunk_start:
+                         max_value, max_index = logit_diff[:, 1:big_windows_end].max(), logit_diff[:, 1:big_windows_end].argmax() + 1
+                     else:
+                         max_value, max_index = logit_diff[:, :big_windows_end].max(), logit_diff[:, :big_windows_end].argmax()
+                     if best_logits < max_value:
+                         backup_pos = windows_start + max_index
+ 
+                     windows_start = backup_pos
+                     split_str_pos = [tokens.token_to_chars(backup_pos + 1).start]
+                     split_str_poses += split_str_pos
+                     token_pos += [backup_pos + 1]
+                     best_logits = torch.finfo(torch.float32).min
+                     backup_pos = -1
+                     unchunk_tokens = 0
+                     is_chunk_start = True
+                 else:
+                     # auto leave
+                     if max_tokens_per_chunk is not None:
+                         if is_chunk_start:
+                             # at a chunk start, rule out the first position
+                             max_value, max_index = logit_diff[:, 1:].max(), logit_diff[:, 1:].argmax() + 1
+                         else:
+                             max_value, max_index = logit_diff[:, :].max(), logit_diff[:, :].argmax()
+                         if best_logits < max_value:
+                             best_logits = max_value
+                             backup_pos = windows_start + max_index
+ 
+                     unchunk_tokens += MAX_TOKENS - 2
+                     windows_start = windows_end
+                     is_chunk_start = False
+ 
+         substrings = [
+             text[i:j] for i, j in zip([0] + split_str_poses, split_str_poses + [len(text)])
+         ]
+         token_pos = [0] + token_pos
+         return substrings, token_pos
+ 
+ # chunking code docs
+ print("\n>>>>>>>>> Chunking code docs...")
+ doc = r"""
+ Of course, as our first example shows, it is not always _necessary_ to declare an expression holder before it is created or used. But doing so provides an extra measure of clarity to models, so we strongly recommend it.
+ 
+ ## Chapter 4 The Basics
+ 
+ ## Chapter 5 The DCP Ruleset
+ 
+ ### 5.1 A taxonomy of curvature
+ 
+ In disciplined convex programming, a scalar expression is classified by its _curvature_. There are four categories of curvature: _constant_, _affine_, _convex_, and _concave_. For a function \(f:\mathbf{R}^{n}\rightarrow\mathbf{R}\) defined on all \(\mathbf{R}^{n}\) the categories have the following meanings:
+ 
+ \[\begin{array}{llll}\text{constant}&f(\alpha x+(1-\alpha)y)=f(x)&\forall x,y\in \mathbf{R}^{n},\;\alpha\in\mathbf{R}\\ \text{affine}&f(\alpha x+(1-\alpha)y)=\alpha f(x)+(1-\alpha)f(y)&\forall x,y\in \mathbf{R}^{n},\;\alpha\in\mathbf{R}\\ \text{convex}&f(\alpha x+(1-\alpha)y)\leq\alpha f(x)+(1-\alpha)f(y)&\forall x,y \in\mathbf{R}^{n},\;\alpha\in[0,1]\\ \text{concave}&f(\alpha x+(1-\alpha)y)\geq\alpha f(x)+(1-\alpha)f(y)&\forall x,y \in\mathbf{R}^{n},\;\alpha\in[0,1]\end{array}\]
+ 
+ Of course, there is significant overlap in these categories. For example, constant expressions are also affine, and (real) affine expressions are both convex and concave.
+ 
+ Convex and concave expressions are real by definition. Complex constant and affine expressions can be constructed, but their usage is more limited; for example, they cannot appear as the left- or right-hand side of an inequality constraint.
+ 
+ ### Top-level rules
+ 
+ CVX supports three different types of disciplined convex programs:
+ 
+ * A _minimization problem_, consisting of a convex objective function and zero or more constraints.
+ * A _maximization problem_, consisting of a concave objective function and zero or more constraints.
+ * A _feasibility problem_, consisting of one or more constraints and no objective.
+ 
+ ### Constraints
+ 
+ Three types of constraints may be specified in disciplined convex programs:
+ 
+ * An _equality constraint_, constructed using \(==\), where both sides are affine.
+ * A _less-than inequality constraint_, using \(<=\), where the left side is convex and the right side is concave.
+ * A _greater-than inequality constraint_, using \(>=\), where the left side is concave and the right side is convex.
+ 
+ _Non_-equality constraints, constructed using \(\sim=\), are never allowed. (Such constraints are not convex.)
+ 
+ One or both sides of an equality constraint may be complex; inequality constraints, on the other hand, must be real. A complex equality constraint is equivalent to two real equality constraints, one for the real part and one for the imaginary part. An equality constraint with a real side and a complex side has the effect of constraining the imaginary part of the complex side to be zero."""
+ # Chunk the text. The prob_threshold should be in (0, 1). The lower it is, the more chunks are generated.
+ # Adjust it to your needs: when prob_threshold is very small, e.g. 0.000001, each token becomes its own chunk;
+ # when it is set to 1, the whole text becomes one chunk.
+ # With max_tokens_per_chunk=None, this is sliding-window chunking driven only by prob_threshold.
+ # With max_tokens_per_chunk set, a split is forced at the best available position whenever a chunk
+ # is about to exceed max_tokens_per_chunk and no token satisfies prob_threshold.
+ chunks, token_pos = chunk_text(model, doc, tokenizer, prob_threshold=0.5, max_tokens_per_chunk=None)
+ 
+ # print chunks
+ for i, (c, t) in enumerate(zip(chunks, token_pos)):
+     print(f"-----chunk: {i}----token_idx: {t}--------")
+     print(c)
+ 
+ # chunking ads
+ print("\n>>>>>>>>> Chunking ads...")
+ 
+ ad = r"""The causes and effects of dropouts in vocational and professional education are more pressing than ever. A decreasing attractiveness of vocational education, particularly in payment and quality, causes higher dropout rates while hitting ongoing demographic changes resulting in extensive skill shortages for many regions. Therefore, tackling the internationally high dropout rates is of utmost political and scientific interest. This thematic issue contributes to the conceptualization, analysis, and prevention of vocational and professional dropouts by bringing together current research that progresses to a deeper processual understanding and empirical modelling of dropouts. It aims to expand our understanding of how dropout and decision processes leading to dropout can be conceptualized and measured in vocational and professional contexts. Another aim is to gather empirical studies on both predictors and dropout consequences. Based on this knowledge, the thematic issue intends to provide evidence of effective interventions to avoid dropouts and identify promising ways for future dropout research in professional and vocational education to support evidence-based vocational education policy.
+ 
+ We thus welcome research contributions (original empirical and conceptual/measurement-related articles, literature reviews, meta-analyses) on dropouts (e.g., premature terminations, intentions to terminate, vertical and horizontal dropouts) that are situated in vocational and professional education at workplaces, schools, or other tertiary professional education institutions.
+ 
+ Part 1 of the thematic series outlines central theories and measurement concepts for vocational and professional dropouts. Part 2 outlines measurement approaches for dropout. Part 3 investigates relevant predictors of dropout. Part 4 analyzes the effects of dropout on an individual, organizational, and systemic level. Part 5 deals with programs and interventions for the prevention of dropouts.
+ 
+ We welcome papers that include but are not limited to:
+ 
+ Theoretical papers on the concept and processes of vocational and professional dropout or retention
+ Measurement approaches to assess dropout or retention
+ Quantitative and qualitative papers on the causes of dropout or retention
+ Quantitative and qualitative papers on the effects of dropout or retention on learners, providers/organizations and the (educational) system
+ Design-based research and experimental papers on dropout prevention programs or retention
+ Submission instructions
+ Before submitting your manuscript, please ensure you have carefully read the Instructions for Authors for Empirical Research in Vocational Education and Training. The complete manuscript should be submitted through the Empirical Research in Vocational Education and Training submission system. To ensure that you submit to the correct thematic series please select the appropriate section in the drop-down menu upon submission. In addition, indicate within your cover letter that you wish your manuscript to be considered as part of the thematic series on series title. All submissions will undergo rigorous peer review, and accepted articles will be published within the journal as a collection.
+ 
+ Lead Guest Editor:
+ Prof. Dr. Viola Deutscher, University of Mannheim
+ viola.deutscher@uni-mannheim.de
+ 
+ Guest Editors:
+ Prof. Dr. Stefanie Findeisen, University of Konstanz
+ stefanie.findeisen@uni-konstanz.de
+ 
+ Prof. Dr. Christian Michaelis, Georg-August-University of Göttingen
+ christian.michaelis@wiwi.uni-goettingen.de
+ 
+ Deadline for submission
+ This Call for Papers is open from now until 29 February 2023. Submitted papers will be reviewed in a timely manner and published directly after acceptance (i.e., without waiting for the accomplishment of all other contributions). Thanks to the Empirical Research in Vocational Education and Training (ERVET) open access policy, the articles published in this thematic issue will have a wide, global audience.
+ 
+ Option of submitting abstracts: Interested authors should submit a letter of intent including a working title for the manuscript, names, affiliations, and contact information for all authors, and an abstract of no more than 500 words to the lead guest editor Viola Deutscher (viola.deutscher@uni-mannheim.de) by July, 31st 2023. Due to technical issues, we also ask authors who already submitted an abstract before May, 30th to send their abstracts again to the address stated above. However, abstract submission is optional and is not mandatory for the full paper submission.
+ 
+ Different dropout directions in vocational education and training: the role of the initiating party and trainees’ reasons for dropping out
+ The high rates of premature contract termination (PCT) in vocational education and training (VET) programs have led to an increasing number of studies examining the reasons why adolescents drop out. Since adol...
+ 
+ Authors:Christian Michaelis and Stefanie Findeisen
+ Citation:Empirical Research in Vocational Education and Training 2024 16:15
+ Content type:Research
+ Published on: 6 August 2024"
+ """
+ # Chunk the text again, this time also capping each chunk at 400 tokens: a split is
+ # forced at the best available position whenever a chunk is about to exceed
+ # max_tokens_per_chunk and no token satisfies prob_threshold.
+ chunks, token_pos = chunk_text(model, ad, tokenizer, prob_threshold=0.5, max_tokens_per_chunk=400)
+ 
  # print chunks
  for i, (c, t) in enumerate(zip(chunks, token_pos)):
      print(f"-----chunk: {i}----token_idx: {t}--------")
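A side note on the decision rule used throughout the added script: `chunk_decision = logit_diff > -logits_threshold` with `logits_threshold = math.log(1 / prob_threshold - 1)` is simply the probability threshold applied in logit space, since a two-class softmax gives P(split) = sigmoid(logit_split - logit_nosplit). A standalone sanity check of that equivalence (an editorial aside, not part of the commit):

```python
import math

def split_prob(logit_nosplit, logit_split):
    # two-class softmax probability of the "split" label
    d = logit_split - logit_nosplit
    return 1.0 / (1.0 + math.exp(-d))

prob_threshold = 0.5
logits_threshold = math.log(1 / prob_threshold - 1)  # as in the script

# deciding on the logit difference is equivalent to thresholding the probability
for l0, l1 in [(0.3, 1.2), (2.0, -1.0), (0.5, 0.5)]:
    by_logits = (l1 - l0) > -logits_threshold
    by_prob = split_prob(l0, l1) > prob_threshold
    assert by_logits == by_prob
```

Working in logit space lets the script skip the softmax entirely while making exactly the same decisions.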