Upload folder using huggingface_hub

Files changed (4)

README.md +269 -154
special_tokens_map.json +5 -35
tokenizer_config.json +0 -7
training_args.bin +3 -0

README.md CHANGED

@@ -1,199 +1,314 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
 ### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
+language:
+- en
+license: mit
 library_name: transformers
+tags:
+- text-classification
+- code-quality
+- documentation
+- code-comments
+- developer-tools
+- code-review
+- distilbert
+datasets:
+- synthetic
+metrics:
+- accuracy
+- f1
+- precision
+- recall
+base_model: distilbert-base-uncased
+pipeline_tag: text-classification
+widget:
+- text: "This function calculates the Fibonacci sequence using dynamic programming to avoid redundant calculations. Time complexity: O(n), Space complexity: O(n)"
+  example_title: "Excellent Comment"
+- text: "Calculates the sum of two numbers and returns the result"
+  example_title: "Helpful Comment"
+- text: "does stuff with numbers"
+  example_title: "Unclear Comment"
+- text: "DEPRECATED: Use calculate_new() instead. This method will be removed in v2.0"
+  example_title: "Outdated Comment"
+- text: "Validates user input against SQL injection attacks using parameterized queries"
+  example_title: "Excellent Example 2"
+- text: "magic happens here"
+  example_title: "Unclear Example 2"
+model-index:
+- name: code-comment-classifier
+  results:
+  - task:
+      type: text-classification
+      name: Text Classification
+    dataset:
+      name: Synthetic Code Comments
+      type: synthetic
+    metrics:
+    - type: accuracy
+      value: 0.9485
+      name: Accuracy
+      verified: false
+    - type: f1
+      value: 0.9468
+      name: F1 Score
+      verified: false
+    - type: precision
+      value: 0.9535
+      name: Precision
+      verified: false
+    - type: recall
+      value: 0.9485
+      name: Recall
+      verified: false
 ---
+# Code Comment Quality Classifier 🔍
+Automatically classify code comments into quality categories to improve code documentation and review processes.
+## 🎯 Model Description
+This fine-tuned DistilBERT model analyzes code comments and classifies them into **4 quality categories**:
+| Category | Precision | Recall | Description |
+|----------|-----------|--------|-------------|
+| 🌟 **Excellent** | 100% | 100% | Clear, comprehensive, highly informative comments with context |
+| ✅ **Helpful** | 88.9% | 100% | Good comments that add value but could be more detailed |
+| ⚠️ **Unclear** | 100% | 79.2% | Vague, confusing, or uninformative comments |
+| 🚫 **Outdated** | 92.3% | 100% | Deprecated, obsolete, or TODO comments |
+### 📊 Overall Performance
+- **Accuracy**: 94.85%
+- **F1 Score**: 94.68%
+- **Precision**: 95.35%
+- **Recall**: 94.85%
+## 🚀 Quick Start
+### Using Transformers Pipeline (Easiest)
+```python
+from transformers import pipeline
+# Load the classifier
+classifier = pipeline("text-classification", model="Snaseem2026/code-comment-classifier")
+# Classify comments
+comments = [
+    "This function uses dynamic programming for O(n) time complexity",
+    "does stuff",
+    "DEPRECATED: use new_function() instead"
+]
+results = classifier(comments)
+for comment, result in zip(comments, results):
+    print(f"{comment}: {result['label']} ({result['score']:.2%} confidence)")
+```
+### Manual Usage with Transformers
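+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("Snaseem2026/code-comment-classifier")
+model = AutoModelForSequenceClassification.from_pretrained("Snaseem2026/code-comment-classifier")
+inputs = tokenizer("Calculates the sum of two numbers", return_tensors="pt", truncation=True)
+with torch.no_grad():
+    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
+predicted_class = int(probs.argmax())
+confidence = probs[predicted_class].item()
+labels = ["excellent", "helpful", "unclear", "outdated"]  # order assumed from the evaluation report
+print(f"Quality: {labels[predicted_class]} (confidence: {confidence:.2%})")
+```
+## 💡 Use Cases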
+### 1. **Code Review Automation**
+Automatically flag low-quality comments during pull request reviews:
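+```python
+from transformers import pipeline
+# Build the pipeline once so repeated calls do not reload the model
+classifier = pipeline("text-classification", model="Snaseem2026/code-comment-classifier")
+def check_pr_comments(file_comments):
+    """Return the comments classified as 'unclear' or 'outdated'."""
+    results = classifier(file_comments)
+    return [c for c, r in zip(file_comments, results) if r['label'] in ['unclear', 'outdated']]
+```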
+### 2. **Documentation Quality Audits**
+Scan codebases to identify documentation that needs improvement.
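One way to gather the inputs for such an audit, sketched under the assumption that the codebase is Python: the standard-library `tokenize` module can pull out `#` comments together with their line numbers. The helper name `extract_comments` is hypothetical and not part of the model's repository; the returned strings could then be passed to the `pipeline` classifier shown earlier.

```python
import io
import tokenize
from typing import List, Tuple


def extract_comments(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, comment_text) pairs for every '#' comment in Python source."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            # Drop the leading '#' and surrounding whitespace
            comments.append((tok.start[0], tok.string.lstrip("#").strip()))
    return comments


sample = '''
def add(a, b):
    # Calculates the sum of two numbers and returns the result
    return a + b  # does stuff
'''

for line_no, text in extract_comments(sample):
    print(line_no, text)
```

Keeping the line number next to each comment makes it possible to point a reviewer at the exact location once a comment has been flagged.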
+### 3. **Developer Education**
+Help developers learn what constitutes good documentation practices.
+### 4. **IDE Integration**
+Provide real-time feedback on comment quality while coding.
+### 5. **Technical Debt Analysis**
+Identify outdated comments and TODOs that need addressing.
+## 🏋️ Training Details
+### Model Architecture
+- **Base Model**: [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased)
+- **Parameters**: 66.96 million
+- **Model Type**: Sequence Classification
+- **Framework**: PyTorch + Hugging Face Transformers
 ### Training Data
+- **Dataset Size**: 970 samples (776 train, 97 validation, 97 test)
+- **Data Source**: Synthetic code comments
+- **Classes**: 4 (balanced distribution)
+- **Language**: English
+### Training Hyperparameters
+- **Epochs**: 3
+- **Batch Size**: 16 (train), 32 (eval)
+- **Learning Rate**: 2e-5
+- **Optimizer**: AdamW
+- **Weight Decay**: 0.01
+- **Warmup Steps**: 500
+- **Max Sequence Length**: 512 tokens
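As a rough check on how these settings interact, the number of optimizer steps can be worked out from the figures listed above. This sketch assumes a single device, no gradient accumulation, and linear warmup from zero; none of these are stated in the card, so treat the result as illustrative rather than as a description of the actual run.

```python
import math

train_samples = 776   # training split size from the Training Data section
batch_size = 16       # training batch size
epochs = 3
warmup_steps = 500
peak_lr = 2e-5

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs


def warmup_lr(step: int) -> float:
    """Learning rate during linear warmup, capped at the configured peak."""
    return peak_lr * min(1.0, step / warmup_steps)


print(steps_per_epoch, total_steps)
print(f"{warmup_lr(total_steps):.2e}")
```

Under these assumptions training would end after 147 steps, before the 500 warmup steps complete, so the learning rate would top out near 5.9e-6 instead of 2e-5.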
+### Batch Processing
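+```python
+from transformers import pipeline
+classifier = pipeline("text-classification", model="Snaseem2026/code-comment-classifier")
+comments = [
+    "Implements binary search with O(log n) time complexity",
+    "TODO fix later",
+    "Handles user authentication",
+]
+# Classify the whole list in one call
+results = classifier(comments)
+for comment, result in zip(comments, results):
+    print(f"{comment}: {result['label']} ({result['score']:.2%})")
+```
+## 📈 Evaluation Results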
+### Test Set Performance (97 samples)
+```
+              precision    recall  f1-score   support
+   excellent     1.0000    1.0000    1.0000        25
+     helpful     0.8889    1.0000    0.9412        24
+     unclear     1.0000    0.7917    0.8837        24
+    outdated     0.9231    1.0000    0.9600        24
+    accuracy                         0.9485        97
+   macro avg     0.9530    0.9479    0.9462        97
+weighted avg     0.9535    0.9485    0.9468        97
+```
+### Key Findings
+- ✨ **Perfect classification** of excellent comments (100% precision & recall)
+- 🎯 **Zero false negatives** for helpful and outdated comments
+- ⚠️ Slight challenge distinguishing unclear comments from other categories
+- 📊 Strong overall performance with 94.85% accuracy
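The per-class precision and recall in the report pin down the underlying prediction counts. The sketch below assumes each figure is a ratio of whole-number counts over the listed supports and recovers true positives, false positives, and false negatives per class; the variable names are illustrative.

```python
# Per-class (precision, recall, support) copied from the classification report
report = {
    "excellent": (1.0000, 1.0000, 25),
    "helpful":   (0.8889, 1.0000, 24),
    "unclear":   (1.0000, 0.7917, 24),
    "outdated":  (0.9231, 1.0000, 24),
}

counts = {}
for label, (precision, recall, support) in report.items():
    true_pos = round(recall * support)        # samples of this class predicted correctly
    predicted = round(true_pos / precision)   # all samples predicted as this class
    counts[label] = {
        "tp": true_pos,
        "fp": predicted - true_pos,
        "fn": support - true_pos,
    }

total = sum(support for _, _, support in report.values())
accuracy = sum(c["tp"] for c in counts.values()) / total

for label, c in counts.items():
    print(label, c)
print(f"accuracy = {accuracy:.4f}")
```

The five missed `unclear` comments line up with the three false positives for `helpful` plus the two for `outdated`, which agrees with the reported 94.85% accuracy (92 of 97 correct).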
+## ⚠️ Limitations
+1. **Synthetic Training Data**: Model trained on synthetic examples; may require fine-tuning for specific domains (e.g., scientific computing, embedded systems)
+2. **English Only**: Currently supports English code comments only
+3. **No Code Context**: Evaluates comments in isolation without analyzing the actual code
+4. **Subjectivity**: Comment quality is inherently subjective; model reflects patterns in training data
+5. **Short Comments**: May struggle with very short comments (< 3 words)
+## 🎯 Intended Use
+### Recommended Use
+- Supplementary tool in code review automation
+- Documentation quality auditing
+- Developer education and training
+- IDE plugins for real-time feedback
+### Not Recommended
+- Sole decision-maker for code quality
+- Production-critical systems without human oversight
+- Evaluating non-English comments
+- Analyzing code quality (only evaluates comments)
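To keep a human in the loop as recommended above, predictions can be split by confidence. This is a minimal sketch: `route_predictions` and the 0.80 threshold are hypothetical choices, and the input mimics the list of `{'label', 'score'}` dictionaries that a `text-classification` pipeline returns. The scores in the example are made up.

```python
from typing import Dict, List, Tuple


def route_predictions(
    comments: List[str],
    results: List[Dict[str, object]],
    threshold: float = 0.80,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split (comment, label) pairs into auto-accepted and needs-review lists."""
    auto, review = [], []
    for comment, result in zip(comments, results):
        # Low-confidence predictions are deferred to a human reviewer
        target = auto if result["score"] >= threshold else review
        target.append((comment, result["label"]))
    return auto, review


comments = ["does stuff", "Handles user authentication"]
results = [
    {"label": "unclear", "score": 0.97},   # made-up score
    {"label": "helpful", "score": 0.55},   # made-up score
]

auto, review = route_predictions(comments, results)
print("auto:", auto)
print("needs review:", review)
```

A suitable threshold depends on the project; it is best chosen by inspecting scores on a sample of real comments.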
+## 🔧 How to Improve Performance
+### Fine-tune on Your Domain
+```python
+from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
+# Load the pre-trained model
+model = AutoModelForSequenceClassification.from_pretrained("Snaseem2026/code-comment-classifier")
+# Fine-tune on your domain-specific data
+training_args = TrainingArguments(
+    output_dir="./fine_tuned_model",
+    learning_rate=1e-5,  # Lower learning rate for fine-tuning
+    num_train_epochs=2,
+    per_device_train_batch_size=8,
+)
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=your_dataset,  # placeholder: your tokenized dataset with a "labels" column
+)
+trainer.train()
+```
+## 📝 License
+**MIT License** - Free to use, modify, and distribute for commercial and non-commercial purposes.
+## 🙏 Acknowledgments
+- Built with [🤗 Transformers](https://huggingface.co/transformers/)
+- Base model: [DistilBERT](https://huggingface.co/distilbert-base-uncased) by Hugging Face
+- Inspired by the need for better code documentation practices in software development
+## 📚 Citation
+If you use this model in your research or application, please cite:
+```bibtex
+@misc{code-comment-classifier-2026,
+  author = {Naseem, Sharyar},
+  title = {Code Comment Quality Classifier},
+  year = {2026},
+  publisher = {Hugging Face},
+  journal = {Hugging Face Model Hub},
+  howpublished = {\url{https://huggingface.co/Snaseem2026/code-comment-classifier}}
+}
+```
+## 📧 Contact
+For questions, suggestions, or collaboration:
+- 🤗 Hugging Face: [@Snaseem2026](https://huggingface.co/Snaseem2026)
+- 📫 Issues: Report on the model's discussion tab
+---
+<div align="center">
+**Made with ❤️ for the developer community**
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![Transformers](https://img.shields.io/badge/Transformers-4.35+-blue.svg)](https://github.com/huggingface/transformers)
+[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
+[🤗 Model Hub](https://huggingface.co/Snaseem2026/code-comment-classifier) • [Report Issue](https://huggingface.co/Snaseem2026/code-comment-classifier/discussions)
+</div>
+## Limitations
+- Trained on synthetic data; may require fine-tuning for specific domains
+- English comments only
+- Evaluates comments in isolation without code context
+- Comment quality assessment is subjective
+## Intended Use
+This model is designed for **educational and productivity purposes**. Use as a supplementary tool in code review processes, not as a replacement for human judgment.
+## License
+MIT License - Free to use, modify, and distribute.
+## Citation
+```bibtex
+@misc{code-comment-classifier-2026,
+  title={Code Comment Quality Classifier},
+  year={2026},
+  publisher={Hugging Face},
+  howpublished={\url{https://huggingface.co/Snaseem2026/code-comment-classifier}}
+}
+```
+---
+Built with [Hugging Face Transformers](https://huggingface.co/transformers/) • Base model: [DistilBERT](https://huggingface.co/distilbert-base-uncased)

special_tokens_map.json CHANGED

@@ -1,37 +1,7 @@
 {
-  "cls_token": {
-    "content": "[CLS]",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "mask_token": {
-    "content": "[MASK]",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "pad_token": {
-    "content": "[PAD]",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "sep_token": {
-    "content": "[SEP]",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "unk_token": {
-    "content": "[UNK]",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  }
 }

 {
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
 }

tokenizer_config.json CHANGED

@@ -46,18 +46,11 @@
   "do_lower_case": true,
   "extra_special_tokens": {},
   "mask_token": "[MASK]",
-  "max_length": 512,
   "model_max_length": 512,
-  "pad_to_multiple_of": null,
   "pad_token": "[PAD]",
-  "pad_token_type_id": 0,
-  "padding_side": "right",
   "sep_token": "[SEP]",
-  "stride": 0,
   "strip_accents": null,
   "tokenize_chinese_chars": true,
   "tokenizer_class": "DistilBertTokenizer",
-  "truncation_side": "right",
-  "truncation_strategy": "longest_first",
   "unk_token": "[UNK]"
 }


training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6832f2d92a4eb7be0c28d655c5fbe622f84d59c589ff33d2da3bdb508e7ac75c
+size 5777