	---
	pipeline_tag: text-classification
	tags:
	- text-classification
	base_model: jhu-clsp/ettin-encoder-32m
	datasets: NeuML/wikipedia-domain-labels
	language: en
	license: apache-2.0
	---

	# Domain Labeler

	This is an [Ettin 32M parameter model](https://huggingface.co/jhu-clsp/ettin-encoder-32m) fine-tuned with the [Wikipedia Domain Labels dataset](https://huggingface.co/datasets/NeuML/wikipedia-domain-labels) for text classification.

	This model classifies text into one of the following classes.

	```python
	labels = [
	"aerospace", "agronomy", "artistic", "astronomy", "atmospheric_science", "automotive", "beauty",
	"biology", "celebrity", "chemistry", "civil_engineering", "communication_engineering",
	"computer_science_and_technology", "design", "drama_and_film", "economics",
	"electronic_science", "entertainment", "environmental_science", "fashion", "finance",
	"food", "gamble", "game", "geography", "health", "history", "hobby", "hydraulic_engineering",
	"instrument_science", "journalism_and_media_communication", "landscape_architecture", "law",
	"library", "literature", "materials_science", "mathematics", "mechanical_engineering",
	"medical", "mining_engineering", "movie", "music_and_dance", "news", "nuclear_science",
	"ocean_science", "optical_engineering", "painting", "pet",
	"petroleum_and_natural_gas_engineering", "philosophy", "photo", "physics", "politics",
	"psychology", "public_administration", "relationship", "religion", "sociology", "sports",
	"statistics", "systems_science", "textile_science", "topicality", "transportation_engineering",
	"travel", "urban_planning", "vulgar_language"
	]
	```

	## Usage (txtai)

	This model can be used to classify text into one of the domain labels above with txtai.

	```python
	from txtai.pipeline import Labels

	labels = Labels("NeuML/domain-labeler", dynamic=False)
	labels("Text to classify")

	# Get only the top label
	labels("Text to classify", flatten=True)
	```

	## Usage (Hugging Face Transformers)

	The model can also be run with a transformers `text-classification` pipeline.

	```python
	from transformers import pipeline

	labels = pipeline("text-classification", model="NeuML/domain-labeler")
	labels("Text to classify")
	```
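
When more than the single best label is useful, the pipeline results can be filtered by score. The sketch below assumes the classifier returns a list of `{"label", "score"}` dictionaries, which is what a transformers `text-classification` pipeline produces when called with `top_k`. The helper name `confident_labels` and the default threshold of 0.2 are illustrative assumptions and are not part of this model.

```python
def confident_labels(classifier, text, k=3, threshold=0.2):
    """Return up to k (label, score) pairs whose score meets the threshold.

    `classifier` is any callable that accepts (text, top_k=k) and returns a
    list of {"label": str, "score": float} dicts, such as a transformers
    text-classification pipeline. This helper is hypothetical.
    """
    results = classifier(text, top_k=k)

    # Some transformers versions wrap single-input results in an outer list
    if results and isinstance(results[0], list):
        results = results[0]

    # Sort by score, highest first, then drop low-confidence labels
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["label"], r["score"]) for r in ranked if r["score"] >= threshold]
```

With the pipeline created above, this could be called as `confident_labels(labels, "Text to classify")`. A suitable threshold depends on the application and would need to be tuned.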

	## Evaluation

	The following are the metrics for the test dataset. Note that many of these labels overlap significantly, so the overall accuracy is much higher when similar categories are grouped together. In other words, the "wrong" labels aren't necessarily wrong (e.g. Medical vs. Health, Entertainment vs. Celebrity).

	| Accuracy | F1 | Precision | Recall | PR-AUC |
	| -------- | ----- | --------- | ------ | ------ |
	| 0.8426 | 83.97 | 83.96 | 84.26 | 90.033 |
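
To illustrate how overlapping labels affect accuracy, the following is a minimal sketch that scores predictions both exactly and after mapping related labels to a shared group. The `GROUPS` mapping is hypothetical: it covers only the two overlapping pairs mentioned in this section, and the group names are made up. The sample data is a toy example and is not drawn from the test dataset.

```python
# Hypothetical grouping of overlapping labels, for illustration only
GROUPS = {
    "medical": "health_and_medicine",
    "health": "health_and_medicine",
    "entertainment": "entertainment_and_celebrity",
    "celebrity": "entertainment_and_celebrity",
}


def generalize(label):
    """Map a fine-grained label to its group, or leave it unchanged."""
    return GROUPS.get(label, label)


def accuracy(expected, predicted, generalized=False):
    """Fraction of predictions that match, optionally after grouping labels."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must be the same length")
    if not expected:
        return 0.0

    if generalized:
        expected = [generalize(x) for x in expected]
        predicted = [generalize(x) for x in predicted]

    matches = sum(1 for e, p in zip(expected, predicted) if e == p)
    return matches / len(expected)


# Toy data, not drawn from the test dataset
expected = ["medical", "celebrity", "physics", "law"]
predicted = ["health", "entertainment", "physics", "music_and_dance"]

print(accuracy(expected, predicted))                    # 0.25
print(accuracy(expected, predicted, generalized=True))  # 0.75
```

In this toy example, two of the three exact-match errors fall within the same group, so they count as correct once the labels are generalized.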

	## Training code

	[The training code used to build this model is here](https://huggingface.co/NeuML/domain-labeler/blob/main/train.py).