bhavnicksm committed on
Commit 50dc7aa · verified · 1 Parent(s): 4878fe0

Upload README.md with huggingface_hub

Files changed (1): README.md +64 −32

README.md CHANGED
@@ -1,61 +1,93 @@
  ---
  tags:
  - tokie
- - model2vec
- library_name: tokie
  ---

  <p align="center">
  <img src="tokie-banner.png" alt="tokie" width="600">
  </p>

- # potion-8m-edu-classifier

- Pre-built [tokie](https://github.com/chonkie-inc/tokie) tokenizer for [potion-8m-edu-classifier](https://huggingface.co/minishlab/potion-8m-edu-classifier).

- ## Quick Start (Python)

- ```bash
- pip install tokie
  ```

- ```python
- import tokie

- tokenizer = tokie.Tokenizer.from_pretrained("tokiers/potion-8m-edu-classifier")
- encoding = tokenizer.encode("Hello, world!")
- print(encoding.ids)
- print(encoding.attention_mask)
  ```

- ## Quick Start (Rust)

- ```toml
- [dependencies]
- tokie = { version = "0.0.7", features = ["hf"] }
  ```

- ```rust
- use tokie::Tokenizer;

- let tokenizer = Tokenizer::from_pretrained("tokiers/potion-8m-edu-classifier").unwrap();
- let encoding = tokenizer.encode("Hello, world!", true);
- println!("{:?}", encoding.ids);
  ```

- ## Files

- - `tokenizer.tkz` — tokie binary format (~10x smaller, loads in ~5ms)
- - `tokenizer.json` — original HuggingFace tokenizer
- - `model.safetensors` — original model weights
- - All other files from [potion-8m-edu-classifier](https://huggingface.co/minishlab/potion-8m-edu-classifier)

- ## About tokie

- **50x faster tokenization, 10x smaller model files, 100% accurate.**

- tokie is a drop-in replacement for HuggingFace tokenizers, built in Rust. See [GitHub](https://github.com/chonkie-inc/tokie) for benchmarks and documentation.

- ## License

- MIT OR Apache-2.0 (tokie library). Original model files retain their original license from [potion-8m-edu-classifier](https://huggingface.co/minishlab/potion-8m-edu-classifier).

  ---
+ library_name: model2vec
+ license: mit
+ model_name: tmpqsu1ee6a
  tags:
+ - embeddings
+ - static-embeddings
  - tokie
+ datasets:
+ - HuggingFaceFW/fineweb-edu-llama3-annotations
+ language:
+ - en
+ base_model:
+ - minishlab/potion-base-8M
  ---

  <p align="center">
  <img src="tokie-banner.png" alt="tokie" width="600">
  </p>

+ > Pre-built [tokie](https://github.com/chonkie-inc/tokie) tokenizer included (`tokenizer.tkz`). 5x faster tokenization, drop-in replacement for HuggingFace tokenizers.

+ ---
+
+ # potion-8m-edu-classifier Model Card
+
+ This [Model2Vec](https://github.com/MinishLab/model2vec) model is a fine-tuned version of [potion-base-8m](https://huggingface.co/minishlab/potion-base-8M).
+ It was trained to classify educational content, analogous to how the [fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) was used to filter for educational content.

+ It achieves the following performance on the evaluation split:

+ ```
+               precision    recall  f1-score   support
+
+            0       0.70      0.42      0.52      5694
+            1       0.75      0.86      0.80     26512
+            2       0.55      0.51      0.53     10322
+            3       0.54      0.45      0.49      3407
+            4       0.59      0.30      0.40       807
+            5       0.00      0.00      0.00         1
+
+     accuracy                           0.69     46743
+    macro avg       0.52      0.42      0.46     46743
+ weighted avg       0.68      0.69      0.68     46743
  ```

+ When thresholded to a binary classifier, it achieves a macro-averaged F1-score of `0.79`. The original classifier achieves `0.81` on the same dataset, but this classifier is orders of magnitude faster on CPU.

  ```
+               precision    recall  f1-score   support
+
+      not edu       0.96      0.98      0.97     42528
+          edu       0.70      0.54      0.61      4215
+
+     accuracy                           0.94     46743
+    macro avg       0.83      0.76      0.79     46743
+ weighted avg       0.93      0.94      0.93     46743
  ```

+ ## Installation

+ Install model2vec with the inference extra using pip:
+ ```
+ pip install model2vec[inference]
  ```

+ ## Usage

+ Load this model using the `from_pretrained` method:
+ ```python
+ from model2vec.inference import StaticModelPipeline

+ # Load a pretrained Model2Vec model
+ model = StaticModelPipeline.from_pretrained("minishlab/potion-8m-edu-classifier")

+ # Predict labels
+ label = model.predict(["Example sentence"])
+ ```

+ ## Library Authors

+ Model2Vec was developed by [Minish](https://github.com/MinishLab).

+ ## Citation

+ Please cite the [Model2Vec repository](https://github.com/MinishLab/model2vec) if you use this model in your work.
+ ```
+ @software{minishlab2024model2vec,
+   authors = {Stephan Tulkens, Thomas van Dongen},
+   title = {Model2Vec: Turn any Sentence Transformer into a Small Fast Model},
+   year = {2024},
+   url = {https://github.com/MinishLab/model2vec},
+ }
+ ```
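
A quick check on the numbers in the new README: the binary report's macro-averaged F1 of `0.79` is just the unweighted mean of the two per-class F1 scores. The sketch below also includes a hypothetical `to_binary` helper for the thresholding step; the cutoff of 3 is an assumption, since the card does not state which threshold produced the binary report.

```python
# Per-class F1 scores taken from the binary classification report above.
f1_not_edu = 0.97
f1_edu = 0.61

# Macro-averaged F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = (f1_not_edu + f1_edu) / 2
print(round(macro_f1, 2))  # 0.79, matching the report's macro avg row


# HYPOTHETICAL helper: collapse the 0-5 educational-quality scores into
# binary edu / not-edu labels. The cutoff of 3 is an assumption; the
# model card does not say which threshold was used.
def to_binary(scores, threshold=3):
    return ["edu" if s >= threshold else "not edu" for s in scores]


print(to_binary([0, 2, 3, 5]))  # ['not edu', 'not edu', 'edu', 'edu']
```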