How to use kenhktsui/llm-data-textbook-quality-classifier-v1 with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="kenhktsui/llm-data-textbook-quality-classifier-v1")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("kenhktsui/llm-data-textbook-quality-classifier-v1")
model = AutoModelForSequenceClassification.from_pretrained("kenhktsui/llm-data-textbook-quality-classifier-v1")
```
2024-05-19: v2 has been released: llm-data-textbook-quality-fasttext-classifier-v2
A more optimized model has been released: kenhktsui/llm-data-textbook-quality-fasttext-classifier-v1
llm-data-textbook-quality-classifier-v1
This model classifies whether a text is of textbook quality. It can be used as a filter for data curation when training an LLM. Note that textbook quality is a subset of high quality.
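As a rough illustration of the filtering use case, the sketch below keeps only the texts whose positive-class probability clears a threshold. The helper names (`make_scorer`, `filter_by_quality`), the `positive_label` argument, and the 0.5 threshold are assumptions made for this example and are not part of the model card; check `model.config.id2label` for the actual label names before relying on it.

```python
from typing import Callable, Iterable, List


def make_scorer(positive_label: str) -> Callable[[str], float]:
    """Build a scoring function backed by this model.

    `positive_label` is an assumption: look up the real label name in
    `model.config.id2label` and pass it in.
    """
    # Imported lazily so the filtering helper below works without transformers.
    from transformers import pipeline

    pipe = pipeline(
        "text-classification",
        model="kenhktsui/llm-data-textbook-quality-classifier-v1",
    )

    def score(text: str) -> float:
        # top_k=None returns the score of every label, not only the best one.
        results = pipe([text], top_k=None, truncation=True)[0]
        return next(r["score"] for r in results if r["label"] == positive_label)

    return score


def filter_by_quality(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical cut-off; tune it on your own data
) -> List[str]:
    """Keep the texts whose quality score is at least `threshold`."""
    return [t for t in texts if score_fn(t) >= threshold]
```

Passing the scorer in as a plain callable keeps the filtering logic independent of the model, so the threshold can be tuned on cached scores without re-running inference.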
Benchmark
| Dataset | Sampling | Average Quality Score |
|---|---|---|
| nampdn-ai/tiny-textbooks | First 10,000 | 0.8618 |
| nampdn-ai/tiny-orca-textbooks | First 10,000 | 0.8544 |
| SciPhi/textbooks-are-all-you-need-lite | First 10,000 | 0.8109 |
| vikp/textbook_quality_programming | First 10,000 | 0.6883 |
| BEE-spoke-data/fineweb-100k_en-med | Full | 0.5516 |
| pszemraj/simple_wikipedia_LM | Full | 0.5386 |
| mattymchen/refinedweb-3m | Full | 0.2951 |
| JeanKaddour/minipile | Full | 0.2618 |
These scores match expectations. The textbook datasets score the highest, which suggests the classifier captures textbook quality. Wikipedia scores lower because it is not written as a textbook, and the broad web corpora (mattymchen/refinedweb-3m and JeanKaddour/minipile) score the lowest.
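The card does not state how the average quality score is computed. One plausible reading, sketched below, is the mean positive-class probability over the sampled texts. Here `score_fn` stands for any callable that maps a text to a score in [0, 1], and `limit` mirrors the "First 10,000" sampling in the table; both names are hypothetical.

```python
from itertools import islice
from statistics import fmean
from typing import Callable, Iterable, Optional


def average_quality_score(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    limit: Optional[int] = 10_000,
) -> float:
    """Mean score over the first `limit` texts (all texts if `limit` is None)."""
    sample = texts if limit is None else islice(texts, limit)
    scores = [score_fn(t) for t in sample]
    if not scores:
        raise ValueError("no texts to score")
    return fmean(scores)
```

Using `islice` means a streamed dataset can be passed in without loading it fully into memory.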
This model is a fine-tuned version of FacebookAI/xlm-roberta-base on an unknown dataset. It achieves the following results on the evaluation set:
- Loss: 0.2689
- Accuracy: 0.8833
- Precision: 0.7551
- Recall: 0.7598
- F1: 0.7574
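F1 is the harmonic mean of precision and recall, so the reported value can be checked from the two figures above. A minimal check (the function name `f1_score` is only illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the evaluation set in this card.
print(round(f1_score(0.7551, 0.7598), 4))  # 0.7574
```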
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
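For illustration, a linear schedule decays the learning rate from its initial value to zero over the run. The sketch below assumes no warmup (the card lists none) and takes the total of 62,500 steps from the final row of the training results table; `linear_lr` is a hypothetical helper, not a library function.

```python
def linear_lr(step: int, total_steps: int = 62_500, base_lr: float = 2e-5) -> float:
    """Learning rate at `step` under linear decay to zero, assuming no warmup."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


# Learning rate at the start, midpoint, and end of the single epoch.
for step in (0, 31_250, 62_500):
    print(step, linear_lr(step))
```

If training ran on a single device without gradient accumulation (an assumption, since the card does not say), 62,500 steps at a batch size of 16 correspond to roughly 1,000,000 training examples.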
Training results
| Training Loss | Epoch | Step | Accuracy | F1 | Validation Loss | Precision | Recall |
|---|---|---|---|---|---|---|---|
| 0.4745 | 0.01 | 500 | 0.8076 | 0.6181 | 0.4327 | 0.5898 | 0.6493 |
| 0.4088 | 0.02 | 1000 | 0.8346 | 0.5522 | 0.4287 | 0.7870 | 0.4254 |
| 0.3811 | 0.02 | 1500 | 0.8286 | 0.6651 | 0.3741 | 0.6257 | 0.7098 |
| 0.3762 | 0.03 | 2000 | 0.85 | 0.6529 | 0.3413 | 0.7334 | 0.5884 |
| 0.3647 | 0.04 | 2500 | 0.8427 | 0.6632 | 0.3852 | 0.6815 | 0.6460 |
| 0.3495 | 0.05 | 3000 | 0.8629 | 0.6987 | 0.3253 | 0.7385 | 0.6631 |
| 0.3508 | 0.06 | 3500 | 0.8335 | 0.6967 | 0.3605 | 0.6186 | 0.7973 |
| 0.3342 | 0.06 | 4000 | 0.8553 | 0.7075 | 0.3273 | 0.6865 | 0.7298 |
| 0.341 | 0.07 | 4500 | 0.8602 | 0.6679 | 0.3320 | 0.7759 | 0.5863 |
| 0.3344 | 0.08 | 5000 | 0.8531 | 0.6916 | 0.3441 | 0.6964 | 0.6868 |
| 0.3341 | 0.09 | 5500 | 0.8536 | 0.7027 | 0.3265 | 0.6849 | 0.7214 |
| 0.3319 | 0.1 | 6000 | 0.8599 | 0.7081 | 0.3266 | 0.7076 | 0.7085 |
| 0.3259 | 0.1 | 6500 | 0.8136 | 0.6907 | 0.3908 | 0.5736 | 0.8678 |
| 0.3391 | 0.11 | 7000 | 0.8642 | 0.6770 | 0.3338 | 0.7879 | 0.5934 |
| 0.3207 | 0.12 | 7500 | 0.8668 | 0.7224 | 0.3035 | 0.7221 | 0.7227 |
| 0.3191 | 0.13 | 8000 | 0.8543 | 0.7153 | 0.3179 | 0.6730 | 0.7631 |
| 0.3142 | 0.14 | 8500 | 0.8679 | 0.7052 | 0.3101 | 0.7585 | 0.6589 |
| 0.3195 | 0.14 | 9000 | 0.8636 | 0.7254 | 0.3433 | 0.7012 | 0.7515 |
| 0.3196 | 0.15 | 9500 | 0.8707 | 0.7191 | 0.3048 | 0.7506 | 0.6902 |
| 0.3176 | 0.16 | 10000 | 0.8597 | 0.7271 | 0.3177 | 0.6814 | 0.7794 |
| 0.3218 | 0.17 | 10500 | 0.8723 | 0.6993 | 0.3212 | 0.8031 | 0.6193 |
| 0.3175 | 0.18 | 11000 | 0.8601 | 0.7239 | 0.3366 | 0.6871 | 0.7648 |
| 0.3296 | 0.18 | 11500 | 0.8526 | 0.7190 | 0.3218 | 0.6622 | 0.7865 |
| 0.3249 | 0.19 | 12000 | 0.8731 | 0.7081 | 0.2926 | 0.7896 | 0.6418 |
| 0.3141 | 0.2 | 12500 | 0.8741 | 0.7215 | 0.3035 | 0.7683 | 0.6802 |
| 0.3126 | 0.21 | 13000 | 0.8659 | 0.7231 | 0.3127 | 0.7162 | 0.7302 |
| 0.3204 | 0.22 | 13500 | 0.8665 | 0.7233 | 0.3456 | 0.7190 | 0.7277 |
| 0.3108 | 0.22 | 14000 | 0.8674 | 0.7214 | 0.3018 | 0.7269 | 0.7160 |
| 0.3114 | 0.23 | 14500 | 0.8726 | 0.7016 | 0.2967 | 0.8002 | 0.6247 |
| 0.3071 | 0.24 | 15000 | 0.8768 | 0.7211 | 0.2904 | 0.7886 | 0.6643 |
| 0.2965 | 0.25 | 15500 | 0.8674 | 0.7310 | 0.3126 | 0.7117 | 0.7515 |
| 0.3022 | 0.26 | 16000 | 0.8738 | 0.7077 | 0.2887 | 0.7958 | 0.6372 |
| 0.3101 | 0.26 | 16500 | 0.8559 | 0.7251 | 0.3312 | 0.6683 | 0.7923 |
| 0.3154 | 0.27 | 17000 | 0.8575 | 0.7304 | 0.3221 | 0.6685 | 0.8048 |
| 0.3041 | 0.28 | 17500 | 0.8754 | 0.7248 | 0.2864 | 0.7704 | 0.6843 |
| 0.3093 | 0.29 | 18000 | 0.8603 | 0.7292 | 0.3101 | 0.6813 | 0.7844 |
| 0.3006 | 0.3 | 18500 | 0.8753 | 0.7111 | 0.3008 | 0.7999 | 0.6401 |
| 0.3108 | 0.3 | 19000 | 0.8689 | 0.7316 | 0.2911 | 0.7185 | 0.7452 |
| 0.3071 | 0.31 | 19500 | 0.8793 | 0.7366 | 0.2839 | 0.7725 | 0.7039 |
| 0.3002 | 0.32 | 20000 | 0.852 | 0.7239 | 0.3391 | 0.6550 | 0.8090 |
| 0.301 | 0.33 | 20500 | 0.8769 | 0.7396 | 0.2896 | 0.7505 | 0.7289 |
| 0.3075 | 0.34 | 21000 | 0.8785 | 0.7402 | 0.2891 | 0.7595 | 0.7219 |
| 0.2922 | 0.34 | 21500 | 0.8393 | 0.7164 | 0.4094 | 0.6210 | 0.8465 |
| 0.2973 | 0.35 | 22000 | 0.8787 | 0.7416 | 0.2962 | 0.7579 | 0.7260 |
| 0.2987 | 0.36 | 22500 | 0.8711 | 0.7430 | 0.2983 | 0.7119 | 0.7769 |
| 0.3071 | 0.37 | 23000 | 0.8739 | 0.7407 | 0.3167 | 0.7306 | 0.7510 |
| 0.2846 | 0.38 | 23500 | 0.8801 | 0.7401 | 0.2901 | 0.7707 | 0.7118 |
| 0.2924 | 0.38 | 24000 | 0.863 | 0.7299 | 0.3155 | 0.6922 | 0.7719 |
| 0.2938 | 0.39 | 24500 | 0.8724 | 0.7368 | 0.2973 | 0.7290 | 0.7448 |
| 0.2917 | 0.4 | 25000 | 0.8772 | 0.7436 | 0.2939 | 0.7446 | 0.7427 |
| 0.294 | 0.41 | 25500 | 0.8772 | 0.7394 | 0.2944 | 0.7528 | 0.7264 |
| 0.2979 | 0.42 | 26000 | 0.8774 | 0.7421 | 0.2819 | 0.7487 | 0.7356 |
| 0.2884 | 0.42 | 26500 | 0.873 | 0.7394 | 0.2932 | 0.7278 | 0.7515 |
| 0.2992 | 0.43 | 27000 | 0.8655 | 0.7419 | 0.3053 | 0.6872 | 0.8061 |
| 0.3018 | 0.44 | 27500 | 0.8788 | 0.7296 | 0.2781 | 0.7845 | 0.6818 |
| 0.305 | 0.45 | 28000 | 0.8785 | 0.7408 | 0.2760 | 0.7584 | 0.7239 |
| 0.2918 | 0.46 | 28500 | 0.8788 | 0.7381 | 0.2826 | 0.7659 | 0.7123 |
| 0.2998 | 0.46 | 29000 | 0.874 | 0.7403 | 0.2893 | 0.7319 | 0.7490 |
| 0.2875 | 0.47 | 29500 | 0.8803 | 0.7422 | 0.2891 | 0.7675 | 0.7185 |
| 0.2946 | 0.48 | 30000 | 0.8798 | 0.7534 | 0.2781 | 0.7415 | 0.7656 |
| 0.2907 | 0.49 | 30500 | 0.8752 | 0.7463 | 0.2860 | 0.7280 | 0.7656 |
| 0.2981 | 0.5 | 31000 | 0.8732 | 0.7402 | 0.3012 | 0.7276 | 0.7531 |
| 0.2948 | 0.5 | 31500 | 0.8792 | 0.7288 | 0.2777 | 0.7894 | 0.6768 |
| 0.2933 | 0.51 | 32000 | 0.8773 | 0.7449 | 0.2839 | 0.7428 | 0.7469 |
| 0.2891 | 0.52 | 32500 | 0.8795 | 0.7395 | 0.2774 | 0.7678 | 0.7131 |
| 0.2869 | 0.53 | 33000 | 0.8764 | 0.7432 | 0.2790 | 0.7405 | 0.7460 |
| 0.2907 | 0.54 | 33500 | 0.8764 | 0.7342 | 0.2889 | 0.7580 | 0.7118 |
| 0.2912 | 0.54 | 34000 | 0.8807 | 0.7537 | 0.2887 | 0.7464 | 0.7611 |
| 0.283 | 0.55 | 34500 | 0.8816 | 0.7386 | 0.2754 | 0.7847 | 0.6977 |
| 0.2877 | 0.56 | 35000 | 0.8727 | 0.7418 | 0.3036 | 0.7221 | 0.7627 |
| 0.2923 | 0.57 | 35500 | 0.8783 | 0.7349 | 0.2853 | 0.7693 | 0.7035 |
| 0.2902 | 0.58 | 36000 | 0.8772 | 0.7428 | 0.2881 | 0.7462 | 0.7394 |
| 0.2863 | 0.58 | 36500 | 0.8768 | 0.7501 | 0.2886 | 0.7303 | 0.7711 |
| 0.2837 | 0.59 | 37000 | 0.8801 | 0.7498 | 0.2753 | 0.7503 | 0.7494 |
| 0.3021 | 0.6 | 37500 | 0.8775 | 0.7508 | 0.2848 | 0.7330 | 0.7694 |
| 0.291 | 0.61 | 38000 | 0.88 | 0.7536 | 0.2793 | 0.7423 | 0.7652 |
| 0.2821 | 0.62 | 38500 | 0.88 | 0.7533 | 0.2867 | 0.7429 | 0.7640 |
| 0.2867 | 0.62 | 39000 | 0.8796 | 0.7553 | 0.2851 | 0.7367 | 0.7748 |
| 0.2846 | 0.63 | 39500 | 0.8828 | 0.7507 | 0.2813 | 0.7661 | 0.7360 |
| 0.2836 | 0.64 | 40000 | 0.8793 | 0.7523 | 0.2842 | 0.7406 | 0.7644 |
| 0.2835 | 0.65 | 40500 | 0.8792 | 0.7533 | 0.2797 | 0.7382 | 0.7690 |
| 0.2833 | 0.66 | 41000 | 0.8821 | 0.7382 | 0.2763 | 0.7895 | 0.6931 |
| 0.2743 | 0.66 | 41500 | 0.8833 | 0.7497 | 0.2852 | 0.7717 | 0.7289 |
| 0.2921 | 0.67 | 42000 | 0.8791 | 0.7438 | 0.2780 | 0.7561 | 0.7319 |
| 0.279 | 0.68 | 42500 | 0.8827 | 0.7407 | 0.2759 | 0.7882 | 0.6985 |
| 0.2752 | 0.69 | 43000 | 0.8796 | 0.7415 | 0.2795 | 0.7642 | 0.7202 |
| 0.2902 | 0.7 | 43500 | 0.8809 | 0.7374 | 0.2735 | 0.7824 | 0.6972 |
| 0.2832 | 0.7 | 44000 | 0.8815 | 0.7453 | 0.2742 | 0.7690 | 0.7231 |
| 0.2783 | 0.71 | 44500 | 0.8815 | 0.7452 | 0.2773 | 0.7692 | 0.7227 |
| 0.2879 | 0.72 | 45000 | 0.8838 | 0.7491 | 0.2716 | 0.7766 | 0.7235 |
| 0.2898 | 0.73 | 45500 | 0.8804 | 0.7503 | 0.2728 | 0.7513 | 0.7494 |
| 0.2771 | 0.74 | 46000 | 0.877 | 0.7470 | 0.2795 | 0.7370 | 0.7573 |
| 0.2743 | 0.74 | 46500 | 0.8707 | 0.7486 | 0.2833 | 0.7013 | 0.8028 |
| 0.2868 | 0.75 | 47000 | 0.8821 | 0.7526 | 0.2719 | 0.7575 | 0.7477 |
| 0.2771 | 0.76 | 47500 | 0.8833 | 0.7534 | 0.2784 | 0.7636 | 0.7435 |
| 0.2824 | 0.77 | 48000 | 0.8772 | 0.7520 | 0.2778 | 0.7291 | 0.7765 |
| 0.2819 | 0.78 | 48500 | 0.8825 | 0.7559 | 0.2772 | 0.7532 | 0.7585 |
| 0.2781 | 0.78 | 49000 | 0.881 | 0.7527 | 0.2747 | 0.7502 | 0.7552 |
| 0.2844 | 0.79 | 49500 | 0.8762 | 0.7532 | 0.2877 | 0.7215 | 0.7877 |
| 0.2732 | 0.8 | 50000 | 0.8809 | 0.7519 | 0.2738 | 0.7511 | 0.7527 |
| 0.2681 | 0.81 | 50500 | 0.8761 | 0.7543 | 0.2832 | 0.7191 | 0.7932 |
| 0.2795 | 0.82 | 51000 | 0.8856 | 0.7501 | 0.2755 | 0.7876 | 0.7160 |
| 0.2649 | 0.82 | 51500 | 0.8805 | 0.7584 | 0.2797 | 0.7360 | 0.7823 |
| 0.2776 | 0.83 | 52000 | 0.8833 | 0.7538 | 0.2671 | 0.7627 | 0.7452 |
| 0.2762 | 0.84 | 52500 | 0.8812 | 0.7576 | 0.2745 | 0.7416 | 0.7744 |
| 0.2803 | 0.85 | 53000 | 0.8847 | 0.7551 | 0.2766 | 0.7694 | 0.7415 |
| 0.2675 | 0.86 | 53500 | 0.8785 | 0.7506 | 0.2742 | 0.7392 | 0.7623 |
| 0.2725 | 0.86 | 54000 | 0.8826 | 0.7541 | 0.2720 | 0.7576 | 0.7506 |
| 0.2693 | 0.87 | 54500 | 0.8836 | 0.7537 | 0.2739 | 0.7650 | 0.7427 |
| 0.2745 | 0.88 | 55000 | 0.8792 | 0.7551 | 0.2751 | 0.7348 | 0.7765 |
| 0.273 | 0.89 | 55500 | 0.8812 | 0.7591 | 0.2762 | 0.7388 | 0.7807 |
| 0.2645 | 0.9 | 56000 | 0.8828 | 0.7514 | 0.2664 | 0.7647 | 0.7385 |
| 0.2698 | 0.9 | 56500 | 0.8814 | 0.7557 | 0.2728 | 0.7467 | 0.7648 |
| 0.2771 | 0.91 | 57000 | 0.8839 | 0.7553 | 0.2681 | 0.7635 | 0.7473 |
| 0.2663 | 0.92 | 57500 | 0.885 | 0.7595 | 0.2715 | 0.7617 | 0.7573 |
| 0.2546 | 0.93 | 58000 | 0.8796 | 0.7576 | 0.2836 | 0.7323 | 0.7848 |
| 0.2752 | 0.94 | 58500 | 0.8801 | 0.7570 | 0.2747 | 0.7363 | 0.7790 |
| 0.2645 | 0.94 | 59000 | 0.8834 | 0.7610 | 0.2733 | 0.7484 | 0.7740 |
| 0.2561 | 0.95 | 59500 | 0.8828 | 0.7580 | 0.2765 | 0.7508 | 0.7652 |
| 0.2753 | 0.96 | 60000 | 0.8815 | 0.7552 | 0.2721 | 0.7483 | 0.7623 |
| 0.251 | 0.97 | 60500 | 0.8822 | 0.7543 | 0.2735 | 0.7546 | 0.7540 |
| 0.2742 | 0.98 | 61000 | 0.8831 | 0.7594 | 0.2721 | 0.7497 | 0.7694 |
| 0.2734 | 0.98 | 61500 | 0.8836 | 0.7602 | 0.2712 | 0.7512 | 0.7694 |
| 0.2713 | 0.99 | 62000 | 0.8836 | 0.7581 | 0.2690 | 0.7556 | 0.7606 |
| 0.2764 | 1.0 | 62500 | 0.8833 | 0.7574 | 0.2689 | 0.7551 | 0.7598 |
Framework versions
- Transformers 4.35.2
- Pytorch 2.1.0+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0