@@ -1,61 +1,93 @@
 ---
 tags:
 - tokie
-- model2vec
-library_name: tokie
 ---
 <p align="center">
   <img src="tokie-banner.png" alt="tokie" width="600">
 </p>
-# potion-8m-edu-classifier
-Pre-built [tokie](https://github.com/chonkie-inc/tokie) tokenizer for [potion-8m-edu-classifier](https://huggingface.co/minishlab/potion-8m-edu-classifier).
-## Quick Start (Python)
-```bash
-pip install tokie
 ```
-```python
-import tokie
-tokenizer = tokie.Tokenizer.from_pretrained("tokiers/potion-8m-edu-classifier")
-encoding = tokenizer.encode("Hello, world!")
-print(encoding.ids)
-print(encoding.attention_mask)
 ```
-## Quick Start (Rust)
-```toml
-[dependencies]
-tokie = { version = "0.0.7", features = ["hf"] }
 ```
-```rust
-use tokie::Tokenizer;
-let tokenizer = Tokenizer::from_pretrained("tokiers/potion-8m-edu-classifier").unwrap();
-let encoding = tokenizer.encode("Hello, world!", true);
-println!("{:?}", encoding.ids);
 ```
-## Files
-- `tokenizer.tkz` — tokie binary format (~10x smaller, loads in ~5ms)
-- `tokenizer.json` — original HuggingFace tokenizer
-- `model.safetensors` — original model weights
-- All other files from [potion-8m-edu-classifier](https://huggingface.co/minishlab/potion-8m-edu-classifier)
-## About tokie
-**50x faster tokenization, 10x smaller model files, 100% accurate.**
-tokie is a drop-in replacement for HuggingFace tokenizers, built in Rust. See [GitHub](https://github.com/chonkie-inc/tokie) for benchmarks and documentation.
-## License
-MIT OR Apache-2.0 (tokie library). Original model files retain their original license from [potion-8m-edu-classifier](https://huggingface.co/minishlab/potion-8m-edu-classifier).

 ---
+library_name: model2vec
+license: mit
+model_name: tmpqsu1ee6a
 tags:
+- embeddings
+- static-embeddings
 - tokie
+datasets:
+- HuggingFaceFW/fineweb-edu-llama3-annotations
+language:
+- en
+base_model:
+- minishlab/potion-base-8M
 ---
 <p align="center">
   <img src="tokie-banner.png" alt="tokie" width="600">
 </p>
+> Pre-built [tokie](https://github.com/chonkie-inc/tokie) tokenizer included (`tokenizer.tkz`). 5x faster tokenization, drop-in replacement for HuggingFace tokenizers.
+---
+# potion-8m-edu-classifier Model Card
+This [Model2Vec](https://github.com/MinishLab/model2vec) model is a fine-tuned version of [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M).
+It was trained to score how educational a piece of text is, analogous to how the [fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) was used to filter for educational content.
+It achieves the following performance on the evaluation split:
+```
+              precision    recall  f1-score   support
+           0       0.70      0.42      0.52      5694
+           1       0.75      0.86      0.80     26512
+           2       0.55      0.51      0.53     10322
+           3       0.54      0.45      0.49      3407
+           4       0.59      0.30      0.40       807
+           5       0.00      0.00      0.00         1
+    accuracy                           0.69     46743
+   macro avg       0.52      0.42      0.46     46743
+weighted avg       0.68      0.69      0.68     46743
 ```
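+When thresholded to a binary classifier (labels `3` and above count as `edu`, which is the split that reproduces the support counts in the table below), it achieves a macro-averaged F1-score of `0.79`. The original classifier achieves `0.81` on the same dataset, but this classifier is orders of magnitude faster on CPU.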
 ```
+              precision    recall  f1-score   support
+     not edu       0.96      0.98      0.97     42528
+         edu       0.70      0.54      0.61      4215
+    accuracy                           0.94     46743
+   macro avg       0.83      0.76      0.79     46743
+weighted avg       0.93      0.94      0.93     46743
 ```
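The two reports can be cross-checked with a few lines of arithmetic. The sketch below assumes the binary split is "label 3 or higher means `edu`"; that threshold is inferred from the support counts rather than stated in the card, and the helper names `to_binary` and `f1` are illustrative only. It uses the rounded precision and recall values from the binary report.

```python
# Per-label support counts copied from the six-class report.
SUPPORT = {0: 5694, 1: 26512, 2: 10322, 3: 3407, 4: 807, 5: 1}


def to_binary(label: int, threshold: int = 3) -> str:
    """Collapse a 0-5 score into a binary label (threshold is an assumption)."""
    return "edu" if label >= threshold else "not edu"


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Re-derive the binary support counts from the six-class counts.
binary_support = {"not edu": 0, "edu": 0}
for label, count in SUPPORT.items():
    binary_support[to_binary(label)] += count

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = (f1(0.96, 0.98) + f1(0.70, 0.54)) / 2

print(binary_support)
print(round(macro_f1, 2))
```

Because the inputs are already rounded to two decimals, the recomputed value is only expected to agree with the reported macro F1 to two decimals.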
+## Installation
+Install model2vec with the inference extra using pip:
+```bash
+pip install "model2vec[inference]"
 ```
+## Usage
+Load this model using the `from_pretrained` method:
+```python
+from model2vec.inference import StaticModelPipeline
+# Load a pretrained Model2Vec model
+model = StaticModelPipeline.from_pretrained("minishlab/potion-8m-edu-classifier")
+# Predict one label per input text
+labels = model.predict(["Example sentence"])
+```
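One way to use the predictions for filtering is sketched below. `keep_educational` is a hypothetical helper, not part of Model2Vec. It assumes that the `predict` callable returns one label per input text, that each label converts to an integer on the 0-5 scale, and that `3` is a sensible cutoff (the threshold is inferred from the evaluation tables, not documented). The predictor is passed in as a plain callable so the helper can be tried with a stub before loading the real model.

```python
from typing import Callable, Iterable, List, Sequence


def keep_educational(
    texts: Sequence[str],
    predict: Callable[[List[str]], Iterable],
    threshold: int = 3,
) -> List[str]:
    """Return the texts whose predicted score is at least `threshold`.

    Hypothetical helper: assumes `predict` yields one label per text and
    that every label can be converted with int().
    """
    labels = predict(list(texts))
    return [text for text, label in zip(texts, labels) if int(label) >= threshold]


# Stub predictor standing in for the real model, for illustration only.
def fake_predict(batch: List[str]) -> List[str]:
    scores = {"a recipe blog": "1", "a calculus lecture": "4"}
    return [scores.get(text, "0") for text in batch]


kept = keep_educational(["a recipe blog", "a calculus lecture"], fake_predict)
print(kept)
```

With the pipeline loaded as in the usage snippet, the call would look like `keep_educational(docs, model.predict)`, provided the labels it returns meet the assumptions above.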
+## Library Authors
+Model2Vec was developed by [Minish](https://github.com/MinishLab).
+## Citation
+Please cite the [Model2Vec repository](https://github.com/MinishLab/model2vec) if you use this model in your work.
+```
+@software{minishlab2024model2vec,
+  author = {Stephan Tulkens and Thomas van Dongen},
+  title = {Model2Vec: Turn any Sentence Transformer into a Small Fast Model},
+  year = {2024},
+  url = {https://github.com/MinishLab/model2vec},
+}
+```