---
license: mit
language:
- en
library_name: transformers
tags:
- text-classification
- naics
- industry-classification
- github
- roberta
datasets:
- custom
metrics:
- f1
- accuracy
pipeline_tag: text-classification
---

# NAICS GitHub Repository Classifier

A fine-tuned RoBERTa-large model that classifies GitHub repositories into one of **19 NAICS (North American Industry Classification System)** industry sectors based on repository metadata.

## Model Description

This model takes GitHub repository information (name, description, topics, README) and predicts the most likely industry sector the repository belongs to.

- **Model:** `roberta-large` (355M parameters)
- **Task:** Multi-class text classification (19 classes)
- **Language:** English
- **Training Data:** 6,588 labeled GitHub repositories

## Intended Use

- Classifying GitHub repositories by industry sector
- Analyzing the open-source software ecosystem by industry
- Research on technology adoption across industries

## NAICS Classes

| Label | NAICS Code | Industry Sector |
|-------|------------|-----------------|
| 0 | 11 | Agriculture, Forestry, Fishing and Hunting |
| 1 | 21 | Mining, Quarrying, Oil and Gas Extraction |
| 2 | 22 | Utilities |
| 3 | 23 | Construction |
| 4 | 31-33 | Manufacturing |
| 5 | 42 | Wholesale Trade |
| 6 | 44-45 | Retail Trade |
| 7 | 48-49 | Transportation and Warehousing |
| 8 | 51 | Information |
| 9 | 52 | Finance and Insurance |
| 10 | 53 | Real Estate and Rental |
| 11 | 54 | Professional, Scientific, Technical Services |
| 12 | 56 | Administrative and Support Services |
| 13 | 61 | Educational Services |
| 14 | 62 | Health Care and Social Assistance |
| 15 | 71 | Arts, Entertainment, and Recreation |
| 16 | 72 | Accommodation and Food Services |
| 17 | 81 | Other Services |
| 18 | 92 | Public Administration |
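
Since the model reports a NAICS code rather than a sector name, a small lookup table can make predictions easier to read. The sketch below copies the sector names from the table above. It assumes labels arrive as code strings such as `'52'`, and that range sectors are spelled `'31-33'`, `'44-45'`, and `'48-49'`; the helper `describe_prediction` is a hypothetical name, not part of the model or of `transformers`.

```python
# Sector names copied from the NAICS Classes table, keyed by NAICS code string.
# Assumption: range sectors use the hyphenated form shown in the table.
NAICS_SECTORS = {
    "11": "Agriculture, Forestry, Fishing and Hunting",
    "21": "Mining, Quarrying, Oil and Gas Extraction",
    "22": "Utilities",
    "23": "Construction",
    "31-33": "Manufacturing",
    "42": "Wholesale Trade",
    "44-45": "Retail Trade",
    "48-49": "Transportation and Warehousing",
    "51": "Information",
    "52": "Finance and Insurance",
    "53": "Real Estate and Rental",
    "54": "Professional, Scientific, Technical Services",
    "56": "Administrative and Support Services",
    "61": "Educational Services",
    "62": "Health Care and Social Assistance",
    "71": "Arts, Entertainment, and Recreation",
    "72": "Accommodation and Food Services",
    "81": "Other Services",
    "92": "Public Administration",
}


def describe_prediction(prediction: dict) -> str:
    """Render one result dict, e.g. {'label': '52', 'score': 0.95}, as text."""
    code = prediction["label"]
    # Fall back to a placeholder if the label is not a known code string.
    sector = NAICS_SECTORS.get(code, "Unknown sector")
    return f"{code} ({sector}), score={prediction['score']:.2f}"


print(describe_prediction({"label": "52", "score": 0.95}))
# 52 (Finance and Insurance), score=0.95
```

If the labels returned by the model use a different spelling, only the dictionary keys need to change.
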

## Usage

### Quick Start

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="alexanderquispe/naics-github-classifier"
)

text = "Repository: bank-api | Description: REST API for banking transactions | README: A secure API for financial operations"
# Truncate long inputs (e.g. full READMEs) to the 512-token limit used in training.
result = classifier(text, truncation=True, max_length=512)
print(result)
# Example output: [{'label': '52', 'score': 0.95}]  (52 = Finance and Insurance)
```

### Full Example

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("alexanderquispe/naics-github-classifier")
tokenizer = AutoTokenizer.from_pretrained("alexanderquispe/naics-github-classifier")

# Format input
text = "Repository: mediscan | Description: AI diagnostic tool for radiology | Topics: healthcare; medical-imaging; deep-learning | README: MediScan uses computer vision to assist radiologists..."

# Tokenize, truncating to the maximum sequence length used in training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Gradients are not needed for inference
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Map to NAICS code
id2label = model.config.id2label
print(f"Predicted NAICS: {id2label[predicted_class]}")  # e.g. 62 (Health Care)
```

## Input Format

The model expects text in this format:

```
Repository: {repo_name} | Description: {description} | Topics: {topics} | README: {readme_content}
```

| Field | Required | Description |
|-------|----------|-------------|
| Repository | Yes | Repository name |
| Description | No | Short description |
| Topics | No | Semicolon-separated tags |
| README | No | README content (can be truncated) |
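
A small helper can assemble this string from repository fields. The following is a minimal sketch: `format_repo_input` is a hypothetical name, and dropping empty optional fields entirely (as the Quick Start example does for `Topics`) is an assumption about how missing fields should be handled.

```python
from typing import Optional, Sequence


def format_repo_input(
    repo_name: str,
    description: Optional[str] = None,
    topics: Optional[Sequence[str]] = None,
    readme: Optional[str] = None,
) -> str:
    """Build the pipe-separated input string described in the table above."""
    if not repo_name:
        raise ValueError("Repository name is required")

    parts = [f"Repository: {repo_name}"]
    # Optional fields are skipped when empty (assumed behavior).
    if description:
        parts.append(f"Description: {description}")
    if topics:
        # Topics are semicolon-separated, as in the Full Example.
        parts.append("Topics: " + "; ".join(topics))
    if readme:
        parts.append(f"README: {readme}")
    return " | ".join(parts)


print(format_repo_input(
    "mediscan",
    description="AI diagnostic tool for radiology",
    topics=["healthcare", "medical-imaging", "deep-learning"],
))
# Repository: mediscan | Description: AI diagnostic tool for radiology | Topics: healthcare; medical-imaging; deep-learning
```

The returned string can be passed directly to the classifier or tokenizer.
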

## Training Details

### Training Data

- **Source:** GitHub repositories labeled with NAICS codes
- **Size:** 6,588 examples
- **Classes:** 19 NAICS sectors
- **Split:** 70% train / 10% validation / 20% test
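
For a rough sense of scale, the split percentages can be turned into approximate example counts. This is only arithmetic on the figures above: the rounding rule (integer division, remainder assigned to the test set) is an assumption, and the actual counts in the training code may differ by an example or two.

```python
def split_sizes(n_examples: int, train_pct: int = 70, val_pct: int = 10):
    """Approximate train/validation/test sizes from percentage splits."""
    n_train = n_examples * train_pct // 100
    n_val = n_examples * val_pct // 100
    # Whatever remains goes to the test set (assumed rounding rule).
    n_test = n_examples - n_train - n_val
    return n_train, n_val, n_test


print(split_sizes(6588))
# (4611, 658, 1319)
```
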

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base Model | `roberta-large` |
| Batch Size | 32 |
| Learning Rate | 2e-5 |
| Epochs | 8 |
| Max Sequence Length | 512 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Early Stopping Patience | 5 |
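
As an illustration, the table could be expressed as keyword arguments for `transformers.TrainingArguments`. This is a sketch under assumptions: the source does not say the model was trained with the `Trainer` API, the batch size of 32 is treated as per-device on a single device, and `output_dir` is a hypothetical path.

```python
# Values copied from the hyperparameter table above.
HPARAMS = {
    "base_model": "roberta-large",
    "batch_size": 32,
    "learning_rate": 2e-5,
    "epochs": 8,
    "max_seq_length": 512,
    "weight_decay": 0.01,
    "early_stopping_patience": 5,
}


def training_arguments_kwargs(hp: dict, output_dir: str = "naics-classifier-out") -> dict:
    """Map the table to TrainingArguments keyword arguments (assumed mapping)."""
    return {
        "output_dir": output_dir,  # hypothetical location
        "per_device_train_batch_size": hp["batch_size"],  # assumes one device
        "per_device_eval_batch_size": hp["batch_size"],
        "learning_rate": hp["learning_rate"],
        "num_train_epochs": hp["epochs"],
        "weight_decay": hp["weight_decay"],
    }


kwargs = training_arguments_kwargs(HPARAMS)
print(kwargs)

# With transformers installed, these could be used roughly as follows:
#   from transformers import TrainingArguments, EarlyStoppingCallback
#   args = TrainingArguments(**kwargs)
#   stopper = EarlyStoppingCallback(
#       early_stopping_patience=HPARAMS["early_stopping_patience"]
#   )
```

AdamW is the `Trainer` default optimizer, so it needs no explicit setting here. Early stopping with `EarlyStoppingCallback` additionally requires periodic evaluation, a `metric_for_best_model`, and `load_best_model_at_end=True`. The maximum sequence length is applied at tokenization time rather than through `TrainingArguments`.
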

### Preprocessing

Text preprocessing includes:

- Removal of markdown badges and formatting
- URL cleaning (keep domain names)
- License header removal
- Code block removal (keep language indicators)
- Technology term normalization (js → javascript, py → python)
- Whitespace normalization
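
The steps above can be approximated with a few regular expressions. The sketch below is not the training pipeline's actual code: the patterns, the function name `preprocess_readme`, and the alias table are assumptions made for illustration, and license header removal is left out because the rule for detecting those headers is not specified here.

```python
import re

FENCE = "`" * 3  # built programmatically so this example nests inside Markdown

# Assumed alias table; only js and py are named in the list above.
TERM_ALIASES = {"js": "javascript", "py": "python"}

CODE_BLOCK = re.compile(FENCE + r"([\w+-]*)\n.*?" + FENCE, flags=re.DOTALL)
LINKED_BADGE = re.compile(r"\[!\[[^\]]*\]\([^)]*\)\]\([^)]*\)")
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
URL = re.compile(r"https?://([^/\s)]+)[^\s)]*")
MARKUP = re.compile(r"[#*`>\[\]()]")
# Skip matches next to a dot so file names such as app.js stay untouched.
TERM = re.compile(r"(?<![\w.])(js|py)(?![\w.])", flags=re.IGNORECASE)


def preprocess_readme(text: str) -> str:
    """Approximate the preprocessing steps listed above."""
    text = CODE_BLOCK.sub(r" \1 ", text)   # drop code, keep language tag
    text = LINKED_BADGE.sub(" ", text)     # [![alt](img)](link) badges
    text = IMAGE.sub(" ", text)            # remaining inline images
    text = URL.sub(r"\1", text)            # keep only the domain name
    text = MARKUP.sub(" ", text)           # strip common Markdown symbols
    text = TERM.sub(lambda m: TERM_ALIASES[m.group(1).lower()], text)
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


sample = (
    "# Demo\n"
    "[![build](https://img.shields.io/badge/x.svg)](https://ci.example.com/run)\n"
    "See https://docs.example.com/guide/start for js tips.\n"
    + FENCE + "py\nprint(1)\n" + FENCE + "\n"
)
print(preprocess_readme(sample))
# Demo See docs.example.com for javascript tips. python
```

The order matters in this sketch: code blocks and badges are removed before URL cleaning so that badge URLs do not leave stray domain names behind.
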

## Limitations

- Trained primarily on English-language repositories
- May not generalize to non-software repositories
- NAICS code 55 (Management of Companies) is excluded due to limited training data
- Performance may vary for repositories with minimal README content

## Citation

```bibtex
@misc{naics-github-classifier,
  author = {Alexander Quispe},
  title = {NAICS GitHub Repository Classifier},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/alexanderquispe/naics-github-classifier}
}
```

## Repository

Training code and data preparation: [github.com/alexanderquispe/naics-github-train](https://github.com/alexanderquispe/naics-github-train)