 ---
 license: mit
+language:
+- en
 ---
+# Model Card for Pebblo Classifier
+This model card describes the Pebblo Classifier, a machine learning model for text classification. Developed by DAXA.AI, it categorizes organizational documents, most of them agreements, into 20 distinct labels.
+## Model Details
+### Model Description
+The Pebblo Classifier is a DistilBERT-based model, fine-tuned from distilbert-base-uncased, targeting RAG (Retrieval-Augmented Generation) applications. It classifies text into categories such as "BOARD_MEETING_AGREEMENT" and "CONSULTING_AGREEMENT", streamlining document classification.
+- **Developed by:** DAXA.AI
+- **Funded by:** Open Source
+- **Model type:** Classification model
+- **Language(s) (NLP):** English
+- **License:** MIT
+- **Finetuned from model:** distilbert-base-uncased
+### Model Sources
+- **Repository:** [https://huggingface.co/daxa-ai/pebblo-classifier](https://huggingface.co/daxa-ai/pebblo-classifier)
+- **Demo:** [https://huggingface.co/spaces/daxa-ai/Daxa-Classifier](https://huggingface.co/spaces/daxa-ai/Daxa-Classifier)
+## Uses
+### Intended Use
+The model is designed for direct application in document classification, capable of immediate deployment without additional fine-tuning.
+### Recommendations
+End users should be aware of the model's potential biases and limitations before relying on its predictions.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+```python
+# Import necessary libraries
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+import joblib
+from huggingface_hub import hf_hub_download
+# Load the tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained("daxa-ai/pebblo-classifier")
+model = AutoModelForSequenceClassification.from_pretrained("daxa-ai/pebblo-classifier")
+# Example text
+text = "Please enter your text here."
+# Truncate to the model's maximum input length (512 tokens)
+encoded_input = tokenizer(text, return_tensors='pt', truncation=True)
+# Run inference without tracking gradients
+with torch.no_grad():
+    output = model(**encoded_input)
+# Apply softmax to the logits
+probabilities = torch.nn.functional.softmax(output.logits, dim=-1)
+# Get the predicted label
+predicted_label = torch.argmax(probabilities, dim=-1)
+# Name of the Hugging Face model repository
+REPO_NAME = "daxa-ai/pebblo-classifier"
+# Path to the label encoder file in the repository
+LABEL_ENCODER_FILE = "label encoder.joblib"
+# Download and cache the label encoder file
+# (hf_hub_download replaces the deprecated hf_hub_url + cached_download pair)
+filename = hf_hub_download(repo_id=REPO_NAME, filename=LABEL_ENCODER_FILE)
+# Load the label encoder
+label_encoder = joblib.load(filename)
+# Decode the predicted label
+decoded_label = label_encoder.inverse_transform(predicted_label.numpy())
+print(decoded_label)
+```
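The post-processing in the snippet above (softmax, argmax, label lookup) can be illustrated without downloading the model. The sketch below rests on assumptions: the logits are made-up numbers, `LABELS` is a hypothetical three-label subset rather than the model's real output space, and `softmax` and `top_k` are illustrative helpers, not part of any library.

```python
import math

# Hypothetical subset of labels, for illustration only
LABELS = ["BOARD_MEETING_AGREEMENT", "CONSULTING_AGREEMENT", "NORMAL_TEXT"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, labels, k=2):
    """Return the k most probable (label, probability) pairs, highest first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up logits standing in for one row of `output.logits`
example_logits = [0.2, 2.5, -1.0]
predictions = top_k(example_logits, LABELS)
print(predictions[0][0])
```

Looking at the top few probabilities rather than only the argmax can help flag low-confidence predictions for manual review.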
+## Training Details
+### Training Data
+The training dataset consists of 131,771 entries, with 20 unique labels. The labels span various document types, with instances distributed across three text sizes (128 ± x, 256 ± x, and 512 ± x words; x varies within 20).
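Because the training texts cluster around 128, 256, and 512 words, longer documents may benefit from being split into windows of comparable size before classification. This is a minimal sketch, assuming simple whitespace tokenization; `chunk_words` and its `size` parameter are hypothetical and not part of the Pebblo codebase.

```python
def chunk_words(text, size=256):
    """Split text into consecutive chunks of at most `size` whitespace-separated words."""
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# A synthetic 600-word document
document = "word " * 600
chunks = chunk_words(document, size=256)
print([len(chunk.split()) for chunk in chunks])
```

Word counts and token counts differ, so a chunk near 512 words may still exceed the model's 512-position limit and be truncated by the tokenizer.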
+Here are the labels along with their respective counts in the dataset:
+| Agreement Type                          | Instances |
+| --------------------------------------- | --------- |
+| BOARD_MEETING_AGREEMENT                 | 4,225     |
+| CONSULTING_AGREEMENT                    | 2,965     |
+| CUSTOMER_LIST_AGREEMENT                 | 9,000     |
+| DISTRIBUTION_PARTNER_AGREEMENT          | 8,339     |
+| EMPLOYEE_AGREEMENT                      | 3,921     |
+| ENTERPRISE_AGREEMENT                    | 3,820     |
+| ENTERPRISE_LICENSE_AGREEMENT            | 9,000     |
+| EXECUTIVE_SEVERANCE_AGREEMENT           | 9,000     |
+| FINANCIAL_REPORT_AGREEMENT              | 8,381     |
+| HARMFUL_ADVICE                          | 2,025     |
+| INTERNAL_PRODUCT_ROADMAP_AGREEMENT      | 7,037     |
+| LOAN_AND_SECURITY_AGREEMENT             | 9,000     |
+| MEDICAL_ADVICE                          | 2,359     |
+| MERGER_AGREEMENT                        | 7,706     |
+| NDA_AGREEMENT                           | 2,966     |
+| NORMAL_TEXT                             | 6,742     |
+| PATENT_APPLICATION_FILLINGS_AGREEMENT   | 9,000     |
+| PRICE_LIST_AGREEMENT                    | 9,000     |
+| SETTLEMENT_AGREEMENT                    | 9,000     |
+| SEXUAL_HARRASSMENT                      | 8,321     |
+## Evaluation
+### Testing Data & Metrics
+#### Testing Data
+Evaluation was performed on a dataset of 82,916 entries with a temperature range of 1-1.25 for randomness.
+Here are the labels along with their respective counts in the dataset:
+| Agreement Type                          | Instances |
+| --------------------------------------- | --------- |
+| BOARD_MEETING_AGREEMENT                 | 4,335     |
+| CONSULTING_AGREEMENT                    | 1,533     |
+| CUSTOMER_LIST_AGREEMENT                 | 4,995     |
+| DISTRIBUTION_PARTNER_AGREEMENT          | 7,231     |
+| EMPLOYEE_AGREEMENT                      | 1,433     |
+| ENTERPRISE_AGREEMENT                    | 1,616     |
+| ENTERPRISE_LICENSE_AGREEMENT            | 8,574     |
+| EXECUTIVE_SEVERANCE_AGREEMENT           | 5,177     |
+| FINANCIAL_REPORT_AGREEMENT              | 4,264     |
+| HARMFUL_ADVICE                          | 474       |
+| INTERNAL_PRODUCT_ROADMAP_AGREEMENT      | 4,116     |
+| LOAN_AND_SECURITY_AGREEMENT             | 6,354     |
+| MEDICAL_ADVICE                          | 289       |
+| MERGER_AGREEMENT                        | 7,079     |
+| NDA_AGREEMENT                           | 1,452     |
+| NORMAL_TEXT                             | 1,808     |
+| PATENT_APPLICATION_FILLINGS_AGREEMENT   | 6,177     |
+| PRICE_LIST_AGREEMENT                    | 5,453     |
+| SETTLEMENT_AGREEMENT                    | 5,806     |
+| SEXUAL_HARRASSMENT                      | 4,750     |
+#### Metrics
+| Agreement Type                              | precision | recall | f1-score | support |
+| ------------------------------------------- | --------- | ------ | -------- | ------- |
+| BOARD_MEETING_AGREEMENT                     | 0.93      | 0.95   | 0.94     | 4335    |
+| CONSULTING_AGREEMENT                        | 0.72      | 0.98   | 0.84     | 1533    |
+| CUSTOMER_LIST_AGREEMENT                     | 0.64      | 0.82   | 0.72     | 4995    |
+| DISTRIBUTION_PARTNER_AGREEMENT              | 0.83      | 0.47   | 0.61     | 7231    |
+| EMPLOYEE_AGREEMENT                          | 0.78      | 0.92   | 0.85     | 1433    |
+| ENTERPRISE_AGREEMENT                        | 0.29      | 0.40   | 0.34     | 1616    |
+| ENTERPRISE_LICENSE_AGREEMENT                | 0.88      | 0.79   | 0.83     | 8574    |
+| EXECUTIVE_SEVERANCE_AGREEMENT               | 0.92      | 0.85   | 0.89     | 5177    |
+| FINANCIAL_REPORT_AGREEMENT                  | 0.89      | 0.98   | 0.93     | 4264    |
+| HARMFUL_ADVICE                              | 0.79      | 0.95   | 0.86     | 474     |
+| INTERNAL_PRODUCT_ROADMAP_AGREEMENT          | 0.91      | 0.98   | 0.94     | 4116    |
+| LOAN_AND_SECURITY_AGREEMENT                 | 0.77      | 0.98   | 0.86     | 6354    |
+| MEDICAL_ADVICE                              | 0.81      | 0.99   | 0.89     | 289     |
+| MERGER_AGREEMENT                            | 0.89      | 0.77   | 0.83     | 7079    |
+| NDA_AGREEMENT                               | 0.70      | 0.57   | 0.62     | 1452    |
+| NORMAL_TEXT                                 | 0.79      | 0.97   | 0.87     | 1808    |
+| PATENT_APPLICATION_FILLINGS_AGREEMENT       | 0.95      | 0.99   | 0.97     | 6177    |
+| PRICE_LIST_AGREEMENT                        | 0.60      | 0.75   | 0.67     | 5453    |
+| SETTLEMENT_AGREEMENT                        | 0.82      | 0.54   | 0.65     | 5806    |
+| SEXUAL_HARRASSMENT                          | 0.97      | 0.94   | 0.95     | 4750    |
+|                                             |           |        |          |         |
+| accuracy                                    |           |        | 0.79     | 82916   |
+| macro avg                                   | 0.79      | 0.83   | 0.80     | 82916   |
+| weighted avg                                | 0.83      | 0.81   | 0.81     | 82916   |
+#### Results
+The table reports precision, recall, and f1-score for each of the 20 labels. Accuracy on the full test set is 0.79; the macro-averaged f1-score is 0.80 and the weighted-average f1-score is 0.81.
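For readers less familiar with these summary rows, the sketch below shows how an f1-score follows from precision and recall, and how macro and weighted averages differ. The helper names are illustrative, and the two-class numbers are made up for the example. Only the ENTERPRISE_AGREEMENT precision and recall come from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_average(values):
    """Unweighted mean: every class counts equally."""
    return sum(values) / len(values)

def weighted_average(values, supports):
    """Mean weighted by the number of true instances (support) per class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)

# ENTERPRISE_AGREEMENT row from the table: precision 0.29, recall 0.40
enterprise_f1 = f1_score(0.29, 0.40)
print(round(enterprise_f1, 2))

# Made-up two-class example showing how the two averages diverge
toy_f1 = [0.90, 0.50]
toy_support = [900, 100]
print(round(macro_average(toy_f1), 2), round(weighted_average(toy_f1, toy_support), 2))
```

The macro average treats small classes such as MEDICAL_ADVICE the same as large ones, while the weighted average is dominated by the classes with the most test instances.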

config.json CHANGED

@@ -9,53 +9,53 @@
   "dropout": 0.1,
   "hidden_dim": 3072,
   "id2label": {
-  "0": "Board meeting",
-  "1": "Consulting Agreement",
-  "2": "Customer List",
-  "3": "Distribution/Partner Agreement",
-  "4": "Enterprise License Agreement",
-  "5": "Executive Severance Agreement",
-  "6": "Financial Report",
   "7": "HARMFUL_ADVICE",
-  "8": "Internal Use Only",
-  "9": "Loan and security Agreement",
   "10": "MEDICAL_ADVICE",
-  "11": "Merger Agreement",
-  "12": "NDA",
   "13": "NORMAL_TEXT",
-  "14": "Patent Application Fillings",
-  "15": "Price list",
-  "16": "Secret Sauce",
-  "17": "Security Breach",
-  "18": "Settlement Agreement",
-  "19": "Sexual Harrassment",
-  "20": "employee agreement",
-  "21": "enterprise agreement"
 },
   "initializer_range": 0.02,
   "label2id": {
-  "Board meeting": 0,
-  "Consulting Agreement": 1,
   "MEDICAL_ADVICE": 10,
-  "Merger Agreement": 11,
-  "NDA": 12,
   "NORMAL_TEXT": 13,
-  "Patent Application Fillings": 14,
-  "Price list": 15,
-  "Secret Sauce": 16,
-  "Security Breach": 17,
-  "Settlement Agreement": 18,
-  "Sexual Harrassment": 19,
-  "Customer List": 2,
-  "employee agreement": 20,
-  "enterprise agreement": 21,
-  "Distribution/Partner Agreement": 3,
-  "Enterprise License Agreement": 4,
-  "Executive Severance Agreement": 5,
-  "Financial Report": 6,
   "HARMFUL_ADVICE": 7,
-  "Internal Use Only": 8,
-  "Loan and security Agreement": 9
 },
   "max_position_embeddings": 512,
   "model_type": "distilbert",
@@ -70,3 +70,4 @@
   "transformers_version": "4.36.2",
   "vocab_size": 30522
 }

   "dropout": 0.1,
   "hidden_dim": 3072,
   "id2label": {
+  "0": "BOARD_MEETING_AGREEMENT",
+  "1": "CONSULTING_AGREEMENT",
+  "2": "CUSTOMER_LIST_AGREEMENT",
+  "3": "DISTRIBUTION_PARTNER_AGREEMENT",
+  "4": "ENTERPRISE_LICENSE_AGREEMENT",
+  "5": "EXECUTIVE_SEVERANCE_AGREEMENT",
+  "6": "FINANCIAL_REPORT_AGREEMENT",
   "7": "HARMFUL_ADVICE",
+  "8": "INTERNAL_USE_ONLY_AGREEMENT",
+  "9": "LOAN_AND_SECURITY_AGREEMENT",
   "10": "MEDICAL_ADVICE",
+  "11": "MERGER_AGREEMENT",
+  "12": "NDA_AGREEMENT",
   "13": "NORMAL_TEXT",
+  "14": "PATENT_APPLICATION_FILLINGS_AGREEMENT",
+  "15": "PRICE_LIST_AGREEMENT",
+  "16": "SECRET_SAUCE_AGREEMENT",
+  "17": "SECURITY_BREACH_AGREEMENT",
+  "18": "SETTLEMENT_AGREEMENT",
+  "19": "SEXUAL_HARRASSMENT_AGREEMENT",
+  "20": "EMPLOYEE_AGREEMENT",
+  "21": "ENTERPRISE_AGREEMENT"
 },
   "initializer_range": 0.02,
   "label2id": {
+  "BOARD_MEETING_AGREEMENT": 0,
+  "CONSULTING_AGREEMENT": 1,
   "MEDICAL_ADVICE": 10,
+  "MERGER_AGREEMENT": 11,
+  "NDA_AGREEMENT": 12,
   "NORMAL_TEXT": 13,
+  "PATENT_APPLICATION_FILLINGS_AGREEMENT": 14,
+  "PRICE_LIST_AGREEMENT": 15,
+  "SECRET_SAUCE_AGREEMENT": 16,
+  "SECURITY_BREACH_AGREEMENT": 17,
+  "SETTLEMENT_AGREEMENT": 18,
+  "SEXUAL_HARRASSMENT_AGREEMENT": 19,
+  "CUSTOMER_LIST_AGREEMENT": 2,
+  "EMPLOYEE_AGREEMENT": 20,
+  "ENTERPRISE_AGREEMENT": 21,
+  "DISTRIBUTION_PARTNER_AGREEMENT": 3,
+  "ENTERPRISE_LICENSE_AGREEMENT": 4,
+  "EXECUTIVE_SEVERANCE_AGREEMENT": 5,
+  "FINANCIAL_REPORT_AGREEMENT": 6,
   "HARMFUL_ADVICE": 7,
+  "INTERNAL_USE_ONLY_AGREEMENT": 8,
+  "LOAN_AND_SECURITY_AGREEMENT": 9
 },
   "max_position_embeddings": 512,
   "model_type": "distilbert",
   "transformers_version": "4.36.2",
   "vocab_size": 30522
 }
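Since this change renames every label in both `id2label` and `label2id`, it can be worth checking that the two mappings remain inverses of each other. The sketch below uses a three-entry excerpt of the mappings shown above rather than the full file, and `mappings_are_inverse` is a hypothetical helper, not a transformers API.

```python
import json

# Excerpt of the mappings from config.json (three of the 22 entries)
config_excerpt = json.loads("""
{
  "id2label": {"0": "BOARD_MEETING_AGREEMENT", "7": "HARMFUL_ADVICE", "13": "NORMAL_TEXT"},
  "label2id": {"BOARD_MEETING_AGREEMENT": 0, "HARMFUL_ADVICE": 7, "NORMAL_TEXT": 13}
}
""")

def mappings_are_inverse(id2label, label2id):
    """Check that label2id maps every label back to its id and that sizes agree."""
    if len(id2label) != len(label2id):
        return False
    # JSON object keys are strings, so ids are converted to int before comparing
    return all(label2id.get(label) == int(idx) for idx, label in id2label.items())

is_consistent = mappings_are_inverse(
    config_excerpt["id2label"], config_excerpt["label2id"]
)
print(is_consistent)
```

The same check could be run on the full file by loading it with `json.load` instead of the inline excerpt.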