Text Classification
Safetensors
English
modernbert
davidmezzetti committed on
Commit
95f8cad
·
1 Parent(s): 6f6a301

Add README

  1. README.md +67 -0
README.md ADDED
@@ -0,0 +1,67 @@
---
pipeline_tag: text-classification
tags: text-classification
base_model: jhu-clsp/ettin-encoder-32m
datasets: NeuML/wikipedia-domain-labels
language: en
license: apache-2.0
---

# Domain Labeler

This is an [Ettin 32M parameter model](https://huggingface.co/jhu-clsp/ettin-encoder-32m) fine-tuned with the [Wikipedia Domain Labels dataset](https://huggingface.co/datasets/NeuML/wikipedia-domain-labels) for text classification.

This model classifies text into one of the following classes.

```python
labels = [
    "aerospace", "agronomy", "artistic", "astronomy", "atmospheric_science", "automotive", "beauty",
    "biology", "celebrity", "chemistry", "civil_engineering", "communication_engineering",
    "computer_science_and_technology", "design", "drama_and_film", "economics",
    "electronic_science", "entertainment", "environmental_science", "fashion", "finance",
    "food", "gamble", "game", "geography", "health", "history", "hobby", "hydraulic_engineering",
    "instrument_science", "journalism_and_media_communication", "landscape_architecture", "law",
    "library", "literature", "materials_science", "mathematics", "mechanical_engineering",
    "medical", "mining_engineering", "movie", "music_and_dance", "news", "nuclear_science",
    "ocean_science", "optical_engineering", "painting", "pet",
    "petroleum_and_natural_gas_engineering", "philosophy", "photo", "physics", "politics",
    "psychology", "public_administration", "relationship", "religion", "sociology", "sports",
    "statistics", "systems_science", "textile_science", "topicality", "transportation_engineering",
    "travel", "urban_planning", "vulgar_language"
]
```

## Usage (txtai)

This model can be used with txtai to classify text into one of the domain labels above.

```python
from txtai.pipeline import Labels

# Load the fine-tuned classifier (dynamic=False disables zero-shot classification)
labels = Labels("NeuML/domain-labeler", dynamic=False)
labels("Text to classify")

# Get only the top label
labels("Text to classify", flatten=True)
```
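
The unflattened call above is expected to return `(id, score)` tuples, where `id` is a class index. The following is a minimal sketch of mapping those indices back to label names. It assumes the model's class order matches the `labels` list above, which should be checked against the model configuration. `to_names` is a hypothetical helper, not part of txtai, and the scores are made up for illustration.

```python
# First entries of the label list above, truncated for brevity
labels = ["aerospace", "agronomy", "artistic", "astronomy"]


def to_names(results, labels):
    """Hypothetical helper: replace class indices with label names."""
    return [(labels[index], score) for index, score in results]


# Made-up (id, score) tuples standing in for real pipeline output
results = [(3, 0.91), (0, 0.05)]
named = to_names(results, labels)
print(named)
```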

## Usage (Hugging Face Transformers)

The following code runs a Transformers `text-classification` pipeline.

```python
from transformers import pipeline

# Load the model into a text-classification pipeline
labels = pipeline("text-classification", model="NeuML/domain-labeler")
labels("Text to classify")
```

## Evaluation

The following are the metrics for the test dataset. Note that these labels overlap significantly, and overall accuracy is much higher when the categories are generalized. In other words, the "wrong" labels aren't necessarily wrong (e.g., Medical vs. Health, Entertainment vs. Celebrity). Accuracy is reported as a fraction; the other metrics are percentages.

| Accuracy | F1    | Precision | Recall | PR-AUC |
| -------- | ----- | --------- | ------ | ------ |
| 0.8426   | 83.97 | 83.96     | 84.26  | 90.033 |
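
To illustrate the point about generalized categories, here is a minimal sketch comparing strict accuracy with accuracy after collapsing overlapping labels. The `GROUPS` mapping and the sample labels are hypothetical and are not taken from the training code or the test dataset.

```python
# Hypothetical grouping of overlapping labels; not taken from the training code
GROUPS = {"medical": "health", "celebrity": "entertainment"}


def generalize(label):
    """Map a label to its broader group, or leave it unchanged."""
    return GROUPS.get(label, label)


def accuracy(expected, predicted, normalize=lambda label: label):
    """Fraction of predictions that match after applying normalize."""
    matches = sum(normalize(e) == normalize(p) for e, p in zip(expected, predicted))
    return matches / len(expected)


# Made-up labels for illustration only
expected = ["health", "celebrity", "physics", "law"]
predicted = ["medical", "entertainment", "physics", "finance"]

strict = accuracy(expected, predicted)
relaxed = accuracy(expected, predicted, generalize)
print(strict, relaxed)
```

In this toy data, two of the three strict misses are near-synonym pairs, so they count as correct once the labels are grouped.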

## Training code

[The training code used to build this model is here](https://huggingface.co/NeuML/domain-labeler/blob/main/train.py).