Kaleemullah committed on
Commit decae70 · 1 Parent(s): aed8f3e

Update README.md

Files changed (1):
  1. README.md +52 -14
README.md CHANGED
@@ -8,11 +8,37 @@ pipeline_tag: text-classification
  ---

  # Kaleemullah/paraphrase-mpnet-ad-classifier

- This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for text classification. The model has been trained using an efficient few-shot learning technique that involves:

- 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
- 2. Training a classification head with features from the fine-tuned Sentence Transformer.

  ## Usage

@@ -33,17 +59,29 @@ model = SetFitModel.from_pretrained("Kaleemullah/paraphrase-mpnet-ad-classifier"
  preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
  ```

- ## BibTeX entry and citation info

  ```bibtex
- @article{https://doi.org/10.48550/arxiv.2209.11055,
-   doi = {10.48550/ARXIV.2209.11055},
-   url = {https://arxiv.org/abs/2209.11055},
-   author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
-   keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
-   title = {Efficient Few-Shot Learning Without Prompts},
-   publisher = {arXiv},
-   year = {2022},
-   copyright = {Creative Commons Attribution 4.0 International}
  }
- ```
  ---

  # Kaleemullah/paraphrase-mpnet-ad-classifier
+ The model "Kaleemullah/paraphrase-mpnet-ad-classifier" is an adaptation of the [SetFit model](https://github.com/huggingface/setfit), a robust architecture for text classification tasks. Its distinctive feature is its ability to leverage few-shot learning techniques effectively, optimizing performance with limited training data. The model's architecture and training process comprise two primary stages:
+
+ - Fine-tuning a Sentence Transformer: At the core of the model is a Sentence Transformer, fine-tuned with a contrastive learning objective. This step enhances the transformer's ability to generate semantically rich, contextually nuanced sentence embeddings, which is critical for accurate classification in downstream tasks.
+
+ - Training a classification head: After fine-tuning, the model adds a classification head trained on the feature representations extracted from the fine-tuned Sentence Transformer. Its role is to map the sentence embeddings to specific classes, enabling effective text classification.
+
+ This design makes the model well suited to text classification scenarios where labeled data is scarce: it maximizes the efficiency of the learning process while remaining applicable in real-world settings where large-scale labeled datasets may not be available.
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61b2c130ac5ecaae3d1efe27/KQ_Sx7oSOJGFpy04KyCv_.png)
+
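+ The contrastive stage described above can be sketched in plain Python. The pair-generation step below is an illustrative, stdlib-only approximation (the function name and toy examples are ours, not from the model's actual training code): labeled texts are paired so that same-label pairs become positives and cross-label pairs become negatives, which is what the Sentence Transformer is fine-tuned on.
+
+ ```python
+ from itertools import combinations
+
+ def contrastive_pairs(texts, labels):
+     """Build (text_a, text_b, similarity) triples for contrastive fine-tuning.
+
+     Same-label pairs get similarity 1.0 (pull embeddings together);
+     different-label pairs get 0.0 (push them apart).
+     """
+     pairs = []
+     for (ta, la), (tb, lb) in combinations(zip(texts, labels), 2):
+         pairs.append((ta, tb, 1.0 if la == lb else 0.0))
+     return pairs
+
+ # Toy few-shot set: two ads, two non-ads (illustrative examples only).
+ texts = ["Buy now and save 50%!", "Limited offer on shoes",
+          "The match ended in a draw", "I walked home in the rain"]
+ labels = ["ad", "ad", "non-ad", "non-ad"]
+
+ pairs = contrastive_pairs(texts, labels)
+ # 4 texts -> C(4, 2) = 6 pairs: 2 positives, 4 negatives.
+ ```
+
+ In the actual SetFit pipeline these pairs feed a contrastive loss over sentence embeddings; the sketch only shows how few labeled examples expand into many training pairs.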
+ # Training Data
+ The training dataset for this text classification model is designed to differentiate between ad and non-ad texts. It comprises a balanced set of 2000 ad instances and 2000 non-ad instances, each curated to train the model effectively. The dataset composition is as follows:
+
+ ## Ads Data
+ - Web-scraped ads: The primary source of ad data is textual advertisements scraped from various websites, representing a diverse range of internet advertising styles and formats.
+ - Data augmentation via ChatGPT-3.5 Turbo: To enhance the dataset, the scraped ads were used as a basis for generating additional ad content with OpenAI's ChatGPT-3.5 Turbo, which created new ads by mimicking the structure and features of the originals. This ensured the inclusion of varied and realistic ad content in the training data.
+ - E-commerce listings: Listings from various e-commerce platforms were also included and treated as ads because of their promotional nature and direct product showcasing.
+
+ ## Non-Ads Data
+ Open-source datasets: The non-ads portion of the dataset spans a variety of text sources that contrast with ad-like text structures, including:
+ - Tweets from various open-source collections.
+ - Articles from BBC News.
+ - Narratives and stories from open-source human story databases.
+ - Conversations derived from human interaction datasets.
+ - Transcriptions of YouTube videos covering a wide range of topics and speaking styles.
+
+ ## Data Balance and Diversity
+ The dataset maintains an equal balance between ad and non-ad texts, with 2000 instances in each category, so the model is not biased towards either class during training. The diversity of sources for both ads and non-ads further contributes to the model's robustness, enabling it to distinguish between the two categories in varied contexts.
+
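+ The balancing described above can be sketched as follows; the source pools, counts, and the `label` field names are illustrative stand-ins for the actual curated data:
+
+ ```python
+ import random
+
+ def build_balanced_dataset(ads, non_ads, per_class=2000, seed=42):
+     """Truncate each class to `per_class` items, attach labels, and shuffle."""
+     if len(ads) < per_class or len(non_ads) < per_class:
+         raise ValueError("not enough examples to balance the classes")
+     data = ([{"text": t, "label": "ad"} for t in ads[:per_class]] +
+             [{"text": t, "label": "non-ad"} for t in non_ads[:per_class]])
+     random.Random(seed).shuffle(data)  # deterministic shuffle for reproducibility
+     return data
+
+ # Placeholder pools standing in for the scraped/augmented sources above.
+ ads = [f"ad text {i}" for i in range(2500)]
+ non_ads = [f"non-ad text {i}" for i in range(3000)]
+
+ dataset = build_balanced_dataset(ads, non_ads)
+ # -> 4000 items, exactly 2000 per class.
+ ```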

  ## Usage

  preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
  ```
 
+ ### Evaluation Results
+ The model was evaluated on unseen data to gauge its reliability and effectiveness in real-world applications. The key metric is:
+
+ - Accuracy: The model achieved 99.25% accuracy on the validation set, indicating strong proficiency in correctly classifying texts as ads or non-ads.
+
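+ The accuracy quoted above is simple proportion-correct over the validation set; a minimal sketch (the predictions and gold labels here are made-up stand-ins, not the actual validation data):
+
+ ```python
+ def accuracy(preds, golds):
+     """Fraction of predictions that match the gold labels."""
+     if len(preds) != len(golds):
+         raise ValueError("prediction/label length mismatch")
+     return sum(p == g for p, g in zip(preds, golds)) / len(golds)
+
+ # Toy example: 3 of 4 predictions correct.
+ preds = ["ad", "non-ad", "ad", "ad"]
+ golds = ["ad", "non-ad", "ad", "non-ad"]
+ acc = accuracy(preds, golds)  # -> 0.75
+ ```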
+ ### Limitations and Bias
+ While the model exhibits high accuracy, it is important to acknowledge its limitations and potential sources of bias:
+
+ - Data diversity: Performance is contingent on the diversity of the training data; the model may be less effective on ad or non-ad texts that deviate significantly from the styles and formats in the training dataset.
+ - Contextual limitations: The model may struggle with texts where the distinction between ads and non-ads is subtle or context-dependent.
+ - Language and cultural bias: Because the training data is predominantly English, the model may not generalize well to ads and non-ads in other languages or cultural contexts.
+
+ ### Ethical Considerations
+ - Misuse potential: The model could be misused in applications where the distinction between promotional content and genuine information is critical, such as news dissemination or educational content.
+ - Privacy concerns: Care must be taken that the model is not used to classify sensitive or private texts without proper authorization or consent.
+
 
  ```bibtex
+ @misc{Kaleemullah2023ParaphraseMPNetAdClassifier,
+   author = {Kaleem Ullah Qasim},
+   title = {Paraphrase MPNet for Ad Classification},
+   year = {2023},
+   organization = {Southwest Jiaotong University},
+   url = {https://huggingface.co/Kaleemullah/paraphrase-mpnet-ad-classifier},
  }
+ ```