aquiro1994/naics-github-classifier

@@ -1,205 +1,170 @@
- ---
-  license: mit
-  language:
-  - en
-  library_name: transformers
-  tags:
-  - text-classification
-  - naics
-  - industry-classification
-  - github
-  - roberta
-  datasets:
-  - custom
-  metrics:
-  - f1
-  - accuracy
-  pipeline_tag: text-classification
-  ---
-  # NAICS GitHub Repository Classifier
-  A fine-tuned RoBERTa-large model that classifies GitHub repositories into **19 NAICS (North American Industry
-  Classification System)** industry sectors based on repository metadata.
-  ## Model Description
-  This model takes GitHub repository information (name, description, topics, README) and predicts the most likely
-  industry sector the repository belongs to.
-  - **Model:** `roberta-large` (355M parameters)
-  - **Task:** Multi-class text classification (19 classes)
-  - **Language:** English
-  - **Training Data:** 6,588 labeled GitHub repositories
-  ## Intended Use
-  - Classifying GitHub repositories by industry sector
-  - Analyzing the open-source software ecosystem by industry
-  - Research on technology adoption across industries
-  ## NAICS Classes
-  | Label | NAICS Code | Industry Sector |
-  |-------|------------|-----------------|
-  | 0 | 11 | Agriculture, Forestry, Fishing and Hunting |
-  | 1 | 21 | Mining, Quarrying, Oil and Gas Extraction |
-  | 2 | 22 | Utilities |
-  | 3 | 23 | Construction |
-  | 4 | 31-33 | Manufacturing |
-  | 5 | 42 | Wholesale Trade |
-  | 6 | 44-45 | Retail Trade |
-  | 7 | 48-49 | Transportation and Warehousing |
-  | 8 | 51 | Information |
-  | 9 | 52 | Finance and Insurance |
-  | 10 | 53 | Real Estate and Rental and Leasing |
-  | 11 | 54 | Professional, Scientific, and Technical Services |
-  | 12 | 56 | Administrative and Support Services |
-  | 13 | 61 | Educational Services |
-  | 14 | 62 | Health Care and Social Assistance |
-  | 15 | 71 | Arts, Entertainment, and Recreation |
-  | 16 | 72 | Accommodation and Food Services |
-  | 17 | 81 | Other Services |
-  | 18 | 92 | Public Administration |
-  ## Usage
-  ### Quick Start
-  ```python
-  from transformers import pipeline
-  classifier = pipeline(
-      "text-classification",
-      model="alexanderquispe/naics-github-classifier"
-  )
-  text = "Repository: bank-api | Description: REST API for banking transactions | README: A secure API for
-  financial operations"
-  result = classifier(text)
-  print(result)
-  # [{'label': '52', 'score': 0.95}]  # Finance and Insurance
-  Full Example
-  from transformers import AutoModelForSequenceClassification, AutoTokenizer
-  import torch
-  model = AutoModelForSequenceClassification.from_pretrained("alexanderquispe/naics-github-classifier")
-  tokenizer = AutoTokenizer.from_pretrained("alexanderquispe/naics-github-classifier")
-  # Format input
-  text = "Repository: mediscan | Description: AI diagnostic tool for radiology | Topics: healthcare;
-  medical-imaging; deep-learning | README: MediScan uses computer vision to assist radiologists..."
-  inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
-  outputs = model(**inputs)
-  predicted_class = torch.argmax(outputs.logits, dim=1).item()
-  # Map to NAICS code
-  id2label = model.config.id2label
-  print(f"Predicted NAICS: {id2label[predicted_class]}")  # 62 (Health Care)
-  Input Format
-  The model expects text in this format:
-  Repository: {repo_name} | Description: {description} | Topics: {topics} | README: {readme_content}
-  ┌─────────────┬──────────┬───────────────────────────────────┐
-  │    Field    │ Required │            Description            │
-  ├─────────────┼──────────┼───────────────────────────────────┤
-  │ Repository  │ Yes      │ Repository name                   │
-  ├─────────────┼──────────┼───────────────────────────────────┤
-  │ Description │ No       │ Short description                 │
-  ├─────────────┼──────────┼───────────────────────────────────┤
-  │ Topics      │ No       │ Semicolon-separated tags          │
-  ├─────────────┼──────────┼───────────────────────────────────┤
-  │ README      │ No       │ README content (can be truncated) │
-  └─────────────┴──────────┴───────────────────────────────────┘
-  Training Details
-  Training Data
-  - Source: GitHub repositories labeled with NAICS codes
-  - Size: 6,588 examples
-  - Classes: 19 NAICS sectors
-  - Split: 70% train / 10% validation / 20% test
-  Training Hyperparameters
-  ┌─────────────────────────┬───────────────┐
-  │        Parameter        │     Value     │
-  ├─────────────────────────┼───────────────┤
-  │ Base Model              │ roberta-large │
-  ├─────────────────────────┼───────────────┤
-  │ Batch Size              │ 32            │
-  ├─────────────────────────┼───────────────┤
-  │ Learning Rate           │ 2e-5          │
-  ├─────────────────────────┼───────────────┤
-  │ Epochs                  │ 8             │
-  ├─────────────────────────┼───────────────┤
-  │ Max Sequence Length     │ 512           │
-  ├─────────────────────────┼───────────────┤
-  │ Optimizer               │ AdamW         │
-  ├─────────────────────────┼───────────────┤
-  │ Weight Decay            │ 0.01          │
-  ├─────────────────────────┼───────────────┤
-  │ Early Stopping Patience │ 5             │
-  └─────────────────────────┴───────────────┘
-  Preprocessing
-  Text preprocessing includes:
-  - Removal of markdown badges and formatting
-  - URL cleaning (keep domain names)
-  - License header removal
-  - Code block removal (keep language indicators)
-  - Technology term normalization (js → javascript, py → python)
-  - Whitespace normalization
-  Limitations
-  - Trained primarily on English repositories
-  - May not generalize to non-software repositories
-  - NAICS code 55 (Management of Companies and Enterprises) is excluded due to limited training data
-  - Performance may vary for repositories with minimal README content
-  Citation
-  @misc{naics-github-classifier,
-    author = {Alexander Quispe},
-    title = {NAICS GitHub Repository Classifier},
-    year = {2025},
-    publisher = {Hugging Face},
-    url = {https://huggingface.co/alexanderquispe/naics-github-classifier}
-  }
-  Repository
-  Training code and data preparation: https://github.com/alexanderquispe/naics-github-train
-  ---
-  **To upload:**
-  1. Go to https://huggingface.co/alexanderquispe/naics-github-classifier
-  2. Click the **"Files and versions"** tab
-  3. Click **"Edit"** on `README.md` (or create it)
-  4. Paste the content above
-  5. Click **"Commit changes"**
-  Or from Colab:
-  ```python
-  from huggingface_hub import upload_file
-  # Save the model card
-  model_card = """<paste the content above>"""
-  with open("README.md", "w") as f:
-      f.write(model_card)
-  upload_file(
-      path_or_fileobj="README.md",
-      path_in_repo="README.md",
-      repo_id="alexanderquispe/naics-github-classifier",
-      repo_type="model"
-  )

+---
+license: mit
+language:
+- en
+library_name: transformers
+tags:
+- text-classification
+- naics
+- industry-classification
+- github
+- roberta
+datasets:
+- custom
+metrics:
+- f1
+- accuracy
+pipeline_tag: text-classification
+---
+# NAICS GitHub Repository Classifier
+A fine-tuned RoBERTa-large model that classifies GitHub repositories into **19 NAICS (North American Industry Classification System)** industry sectors based on repository metadata.
+## Model Description
+This model takes GitHub repository information (name, description, topics, README) and predicts the most likely industry sector the repository belongs to.
+- **Model:** `roberta-large` (355M parameters)
+- **Task:** Multi-class text classification (19 classes)
+- **Language:** English
+- **Training Data:** 6,588 labeled GitHub repositories
+## Intended Use
+- Classifying GitHub repositories by industry sector
+- Analyzing the open-source software ecosystem by industry
+- Research on technology adoption across industries
+## NAICS Classes
+| Label | NAICS Code | Industry Sector |
+|-------|------------|-----------------|
+| 0 | 11 | Agriculture, Forestry, Fishing and Hunting |
+| 1 | 21 | Mining, Quarrying, Oil and Gas Extraction |
+| 2 | 22 | Utilities |
+| 3 | 23 | Construction |
+| 4 | 31-33 | Manufacturing |
+| 5 | 42 | Wholesale Trade |
+| 6 | 44-45 | Retail Trade |
+| 7 | 48-49 | Transportation and Warehousing |
+| 8 | 51 | Information |
+| 9 | 52 | Finance and Insurance |
+| 10 | 53 | Real Estate and Rental and Leasing |
+| 11 | 54 | Professional, Scientific, and Technical Services |
+| 12 | 56 | Administrative and Support Services |
+| 13 | 61 | Educational Services |
+| 14 | 62 | Health Care and Social Assistance |
+| 15 | 71 | Arts, Entertainment, and Recreation |
+| 16 | 72 | Accommodation and Food Services |
+| 17 | 81 | Other Services |
+| 18 | 92 | Public Administration |
+## Usage
+### Quick Start
+```python
+from transformers import pipeline
+classifier = pipeline(
+    "text-classification",
+    model="alexanderquispe/naics-github-classifier"
+)
+text = "Repository: bank-api | Description: REST API for banking transactions | README: A secure API for financial operations"
+result = classifier(text)
+print(result)
+# Example output (score is illustrative): [{'label': '52', 'score': 0.95}], where 52 is Finance and Insurance
+```
+### Full Example
+```python
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+import torch
+model = AutoModelForSequenceClassification.from_pretrained("alexanderquispe/naics-github-classifier")
+tokenizer = AutoTokenizer.from_pretrained("alexanderquispe/naics-github-classifier")
+# Format input
+text = "Repository: mediscan | Description: AI diagnostic tool for radiology | Topics: healthcare; medical-imaging; deep-learning | README: MediScan uses computer vision to assist radiologists..."
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+with torch.no_grad():  # inference only, so skip gradient tracking
+    outputs = model(**inputs)
+predicted_class = torch.argmax(outputs.logits, dim=1).item()
+# Map to NAICS code
+id2label = model.config.id2label
+print(f"Predicted NAICS: {id2label[predicted_class]}")  # 62 (Health Care)
+```
+## Input Format
+The model expects text in this format:
+```
+Repository: {repo_name} | Description: {description} | Topics: {topics} | README: {readme_content}
+```
+| Field | Required | Description |
+|-------|----------|-------------|
+| Repository | Yes | Repository name |
+| Description | No | Short description |
+| Topics | No | Semicolon-separated tags |
+| README | No | README content (can be truncated) |
+## Training Details
+### Training Data
+- **Source:** GitHub repositories labeled with NAICS codes
+- **Size:** 6,588 examples
+- **Classes:** 19 NAICS sectors
+- **Split:** 70% train / 10% validation / 20% test
+### Training Hyperparameters
+| Parameter | Value |
+|-----------|-------|
+| Base Model | `roberta-large` |
+| Batch Size | 32 |
+| Learning Rate | 2e-5 |
+| Epochs | 8 |
+| Max Sequence Length | 512 |
+| Optimizer | AdamW |
+| Weight Decay | 0.01 |
+| Early Stopping Patience | 5 |
+### Preprocessing
+Text preprocessing includes:
+- Removal of markdown badges and formatting
+- URL cleaning (keep domain names)
+- License header removal
+- Code block removal (keep language indicators)
+- Technology term normalization (js → javascript, py → python)
+- Whitespace normalization
+## Limitations
+- Trained primarily on English repositories
+- May not generalize to non-software repositories
+- NAICS code 55 (Management of Companies and Enterprises) is excluded due to limited training data
+- Performance may vary for repositories with minimal README content
+## Citation
+```bibtex
+@misc{naics-github-classifier,
+  author = {Alexander Quispe},
+  title = {NAICS GitHub Repository Classifier},
+  year = {2025},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/alexanderquispe/naics-github-classifier}
+}
+```
+## Repository
+Training code and data preparation: [github.com/alexanderquispe/naics-github-train](https://github.com/alexanderquispe/naics-github-train)
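
The input format in the updated model card (`Repository: {repo_name} | Description: {description} | Topics: {topics} | README: {readme_content}`) can be assembled with a small helper. The following is a minimal sketch, not the author's code: the function name `build_input_text` and its parameters are hypothetical. It assumes that empty optional fields are omitted entirely (as the Quick Start example does for Topics), that topics are joined with `"; "` (as in the Full Example), and that collapsing README whitespace roughly matches the card's "whitespace normalization" step.

```python
from typing import Optional, Sequence


def build_input_text(
    repo_name: str,
    description: Optional[str] = None,
    topics: Optional[Sequence[str]] = None,
    readme: Optional[str] = None,
) -> str:
    """Assemble the pipe-delimited input string described in the model card.

    Hypothetical helper: only `repo_name` is required; empty optional
    fields are left out (an assumption based on the Quick Start example).
    """
    if not repo_name or not repo_name.strip():
        raise ValueError("repo_name is required")

    parts = [f"Repository: {repo_name.strip()}"]

    if description and description.strip():
        parts.append(f"Description: {description.strip()}")

    # Topics are semicolon-separated in the card's Full Example.
    cleaned_topics = [t.strip() for t in (topics or []) if t and t.strip()]
    if cleaned_topics:
        parts.append("Topics: " + "; ".join(cleaned_topics))

    if readme and readme.strip():
        # Collapse newlines and repeated spaces into single spaces.
        parts.append("README: " + " ".join(readme.split()))

    return " | ".join(parts)


if __name__ == "__main__":
    text = build_input_text(
        "mediscan",
        description="AI diagnostic tool for radiology",
        topics=["healthcare", "medical-imaging", "deep-learning"],
        readme="MediScan uses computer vision\nto assist radiologists...",
    )
    print(text)
```

The resulting string can be passed to the `classifier` pipeline or the tokenizer shown in the model card; truncation to 512 tokens is still left to the tokenizer.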