---
license: mit
language:
- en
library_name: transformers
tags:
- text-classification
- naics
- industry-classification
- github
- roberta
datasets:
- custom
metrics:
- f1
- accuracy
pipeline_tag: text-classification
---
# NAICS GitHub Repository Classifier
A fine-tuned RoBERTa-large model that classifies GitHub repositories into **19 NAICS (North American Industry Classification System)** industry sectors based on repository metadata.
## Model Description
This model takes GitHub repository information (name, description, topics, README) and predicts the most likely industry sector for the repository.
- **Model:** `roberta-large` (355M parameters)
- **Task:** Multi-class text classification (19 classes)
- **Language:** English
- **Training Data:** 6,588 labeled GitHub repositories
## Intended Use
- Classifying GitHub repositories by industry sector
- Analyzing the open-source software ecosystem by industry
- Research on technology adoption across industries
## NAICS Classes
| Label | NAICS Code | Industry Sector |
|-------|------------|-----------------|
| 0 | 11 | Agriculture, Forestry, Fishing and Hunting |
| 1 | 21 | Mining, Quarrying, Oil and Gas Extraction |
| 2 | 22 | Utilities |
| 3 | 23 | Construction |
| 4 | 31-33 | Manufacturing |
| 5 | 42 | Wholesale Trade |
| 6 | 44-45 | Retail Trade |
| 7 | 48-49 | Transportation and Warehousing |
| 8 | 51 | Information |
| 9 | 52 | Finance and Insurance |
| 10 | 53 | Real Estate and Rental |
| 11 | 54 | Professional, Scientific, Technical Services |
| 12 | 56 | Administrative and Support Services |
| 13 | 61 | Educational Services |
| 14 | 62 | Health Care and Social Assistance |
| 15 | 71 | Arts, Entertainment, and Recreation |
| 16 | 72 | Accommodation and Food Services |
| 17 | 81 | Other Services |
| 18 | 92 | Public Administration |
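For downstream reporting it can help to have the table above as a lookup. The sketch below transcribes it into a dictionary; the names `NAICS_SECTORS`, `CODE_TO_SECTOR`, and `describe` are illustrative and are not part of the model or its config.

```python
# Label index -> (NAICS code, sector name), transcribed from the table above.
NAICS_SECTORS = {
    0: ("11", "Agriculture, Forestry, Fishing and Hunting"),
    1: ("21", "Mining, Quarrying, Oil and Gas Extraction"),
    2: ("22", "Utilities"),
    3: ("23", "Construction"),
    4: ("31-33", "Manufacturing"),
    5: ("42", "Wholesale Trade"),
    6: ("44-45", "Retail Trade"),
    7: ("48-49", "Transportation and Warehousing"),
    8: ("51", "Information"),
    9: ("52", "Finance and Insurance"),
    10: ("53", "Real Estate and Rental"),
    11: ("54", "Professional, Scientific, Technical Services"),
    12: ("56", "Administrative and Support Services"),
    13: ("61", "Educational Services"),
    14: ("62", "Health Care and Social Assistance"),
    15: ("71", "Arts, Entertainment, and Recreation"),
    16: ("72", "Accommodation and Food Services"),
    17: ("81", "Other Services"),
    18: ("92", "Public Administration"),
}

# Reverse lookup for outputs that report the NAICS code as a string label.
CODE_TO_SECTOR = {code: name for code, name in NAICS_SECTORS.values()}


def describe(label):
    """Accept a label index (int) or a NAICS code (str); return 'code: sector'."""
    if isinstance(label, int):
        code, name = NAICS_SECTORS[label]
    else:
        code, name = label, CODE_TO_SECTOR[label]
    return f"{code}: {name}"
```

`describe(9)` and `describe("52")` both resolve to the Finance and Insurance row.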
## Usage
### Quick Start
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="alexanderquispe/naics-github-classifier",
)

text = "Repository: bank-api | Description: REST API for banking transactions | README: A secure API for financial operations"
result = classifier(text)
print(result)
# Example output (score is illustrative):
# [{'label': '52', 'score': 0.95}]  # 52 = Finance and Insurance
```
### Full Example
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "alexanderquispe/naics-github-classifier"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format the input as "Field: value" pairs separated by " | "
text = "Repository: mediscan | Description: AI diagnostic tool for radiology | Topics: healthcare; medical-imaging; deep-learning | README: MediScan uses computer vision to assist radiologists..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Map the class index to its NAICS code
id2label = model.config.id2label
print(f"Predicted NAICS: {id2label[predicted_class]}")  # e.g. 62 (Health Care)
```
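The `argmax` step above keeps only the winning class. When runner-up sectors matter, the logits can be turned into probabilities and ranked. This is a dependency-free sketch, not part of the model's API; `top_k_predictions` is a hypothetical helper that takes a plain list of logits and an `id2label` mapping with integer keys.

```python
import math


def top_k_predictions(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]
```

With the Full Example it could be called as `top_k_predictions(outputs.logits[0].tolist(), model.config.id2label)`.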
## Input Format
The model expects text in this format:
```
Repository: {repo_name} | Description: {description} | Topics: {topics} | README: {readme_content}
```
| Field | Required | Description |
|-------|----------|-------------|
| Repository | Yes | Repository name |
| Description | No | Short description |
| Topics | No | Semicolon-separated tags |
| README | No | README content (may be truncated; the model reads at most 512 tokens) |
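One way to assemble this string from separate fields is sketched below. `build_input` is a hypothetical helper, not something shipped with the model. It assumes that empty optional fields are simply left out (as in the Quick Start example, which has no `Topics` field) and that topics are joined with `"; "` (as in the Full Example).

```python
def build_input(repo_name, description=None, topics=None, readme=None):
    """Join the available fields into the 'Field: value | ...' layout."""
    if not repo_name:
        raise ValueError("repo_name is required")
    parts = [f"Repository: {repo_name}"]
    # Optional fields are skipped when missing or empty (an assumption).
    if description:
        parts.append(f"Description: {description}")
    if topics:
        parts.append("Topics: " + "; ".join(topics))
    if readme:
        parts.append(f"README: {readme}")
    return " | ".join(parts)


text = build_input(
    "mediscan",
    description="AI diagnostic tool for radiology",
    topics=["healthcare", "medical-imaging", "deep-learning"],
)
print(text)
```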
## Training Details
### Training Data
- **Source:** GitHub repositories labeled with NAICS codes
- **Size:** 6,588 examples
- **Classes:** 19 NAICS sectors
- **Split:** 70% train / 10% validation / 20% test
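The card does not say how the 70/10/20 split was drawn. The sketch below shows one common approach, a per-class (stratified) split, as an assumption rather than a description of the actual training pipeline; `stratified_split` and its `seed` parameter are illustrative names.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Return (train, val, test) index lists, splitting each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        # Whatever remains goes to the test set.
        test += indices[n_train + n_val:]
    return train, val, test
```

Splitting per class keeps rare sectors represented in all three subsets, which matters when some classes have few examples.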
### Training Hyperparameters
| Parameter | Value |
|-----------|-------|
| Base Model | `roberta-large` |
| Batch Size | 32 |
| Learning Rate | 2e-5 |
| Epochs | 8 |
| Max Sequence Length | 512 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Early Stopping Patience | 5 |
### Preprocessing
Text preprocessing includes:
- Removal of markdown badges and formatting
- URL cleaning (domain names are kept)
- License header removal
- Code block removal (language indicators are kept)
- Technology term normalization (js → javascript, py → python)
- Whitespace normalization
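The steps above can be approximated with a few regular expressions. This is a rough sketch under stated assumptions, not the preprocessing code used for training: the patterns are guesses at what each step does, license header removal is left out because the card gives no rule for it, and `TERM_MAP` contains only the two pairs listed above.

```python
import re

FENCE = "`" * 3  # built indirectly so the snippet can sit inside a fenced block
CODE_BLOCK = re.compile(FENCE + r"(\w*)\n.*?" + FENCE, re.DOTALL)
TERM_MAP = {"js": "javascript", "py": "python"}  # only the pairs listed above


def preprocess(text):
    # Code blocks: drop the body, keep the language indicator if present.
    text = CODE_BLOCK.sub(r" \1 ", text)
    # Badges (linked images) and plain images are removed entirely.
    text = re.sub(r"\[!\[[^\]]*\]\([^)]*\)\]\([^)]*\)", " ", text)
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", " ", text)
    # Markdown links: keep the visible text and the target URL.
    text = re.sub(r"\[([^\]]*)\]\(([^)]*)\)", r"\1 \2", text)
    # URLs: keep only the domain name.
    text = re.sub(r"https?://([^/\s)]+)[^\s)]*", r"\1", text)
    # Formatting: heading markers, emphasis asterisks, inline-code backticks.
    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)
    text = re.sub(r"[*`]", "", text)
    # Technology terms: expand standalone abbreviations.
    text = re.sub(
        r"\b(js|py)\b",
        lambda m: TERM_MAP[m.group(0).lower()],
        text,
        flags=re.IGNORECASE,
    )
    # Whitespace: collapse runs into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Note that the word-boundary match also rewrites names such as `node.js` to `node.javascript`; whether that is desirable depends on the original normalization rules, which the card does not spell out.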
## Limitations
- Trained primarily on English repositories
- May not generalize to non-software repositories
- NAICS sector 55 (Management of Companies and Enterprises) is excluded because of limited training data
- Performance may vary for repositories with minimal README content
## Citation
```bibtex
@misc{naics-github-classifier,
author = {Alexander Quispe},
title = {NAICS GitHub Repository Classifier},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/alexanderquispe/naics-github-classifier}
}
```
## Repository
Training code and data preparation: [github.com/alexanderquispe/naics-github-train](https://github.com/alexanderquispe/naics-github-train)