NAICS GitHub Repository Classifier
A fine-tuned RoBERTa-large model that classifies GitHub repositories into 19 NAICS (North American Industry Classification System) industry sectors based on repository metadata.
Model Description
This model takes GitHub repository information (name, description, topics, README) and predicts the most likely industry sector the repository belongs to.
- Model: roberta-large (355M parameters)
- Task: Multi-class text classification (19 classes)
- Language: English
- Training Data: 6,588 labeled GitHub repositories
Intended Use
- Classifying GitHub repositories by industry sector
- Analyzing the open-source software ecosystem by industry
- Research on technology adoption across industries
NAICS Classes
| Label | NAICS Code | Industry Sector |
|---|---|---|
| 0 | 11 | Agriculture, Forestry, Fishing and Hunting |
| 1 | 21 | Mining, Quarrying, Oil and Gas Extraction |
| 2 | 22 | Utilities |
| 3 | 23 | Construction |
| 4 | 31-33 | Manufacturing |
| 5 | 42 | Wholesale Trade |
| 6 | 44-45 | Retail Trade |
| 7 | 48-49 | Transportation and Warehousing |
| 8 | 51 | Information |
| 9 | 52 | Finance and Insurance |
| 10 | 53 | Real Estate and Rental |
| 11 | 54 | Professional, Scientific, Technical Services |
| 12 | 56 | Administrative and Support Services |
| 13 | 61 | Educational Services |
| 14 | 62 | Health Care and Social Assistance |
| 15 | 71 | Arts, Entertainment, and Recreation |
| 16 | 72 | Accommodation and Food Services |
| 17 | 81 | Other Services |
| 18 | 92 | Public Administration |
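The usage examples in this card show predictions returned as NAICS code strings (for example '52'), so a small lookup table can make results easier to read. The sketch below transcribes the table above into a Python dictionary; the names `NAICS_SECTORS`, `CODE_TO_SECTOR`, and `sector_for_code` are hypothetical helpers and are not part of the published model.

```python
# Label index -> (NAICS code, sector name), transcribed from the table above.
NAICS_SECTORS = {
    0: ("11", "Agriculture, Forestry, Fishing and Hunting"),
    1: ("21", "Mining, Quarrying, Oil and Gas Extraction"),
    2: ("22", "Utilities"),
    3: ("23", "Construction"),
    4: ("31-33", "Manufacturing"),
    5: ("42", "Wholesale Trade"),
    6: ("44-45", "Retail Trade"),
    7: ("48-49", "Transportation and Warehousing"),
    8: ("51", "Information"),
    9: ("52", "Finance and Insurance"),
    10: ("53", "Real Estate and Rental"),
    11: ("54", "Professional, Scientific, Technical Services"),
    12: ("56", "Administrative and Support Services"),
    13: ("61", "Educational Services"),
    14: ("62", "Health Care and Social Assistance"),
    15: ("71", "Arts, Entertainment, and Recreation"),
    16: ("72", "Accommodation and Food Services"),
    17: ("81", "Other Services"),
    18: ("92", "Public Administration"),
}

# Reverse lookup: NAICS code string -> sector name.
CODE_TO_SECTOR = {code: name for code, name in NAICS_SECTORS.values()}


def sector_for_code(code: str) -> str:
    """Return the sector name for a NAICS code, or 'Unknown' if it is not in the table."""
    return CODE_TO_SECTOR.get(code, "Unknown")


print(sector_for_code("52"))  # Finance and Insurance
```

Code 55 is not among the 19 classes, so `sector_for_code("55")` falls through to "Unknown".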
Usage
Quick Start
from transformers import pipeline
classifier = pipeline(
"text-classification",
model="alexanderquispe/naics-github-classifier"
)
text = "Repository: bank-api | Description: REST API for banking transactions | README: A secure API for financial operations"
result = classifier(text)
print(result)
# [{'label': '52', 'score': 0.95}] # Finance and Insurance
Full Example
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
model = AutoModelForSequenceClassification.from_pretrained("alexanderquispe/naics-github-classifier")
tokenizer = AutoTokenizer.from_pretrained("alexanderquispe/naics-github-classifier")
# Format input
text = "Repository: mediscan | Description: AI diagnostic tool for radiology | Topics: healthcare; medical-imaging; deep-learning | README: MediScan uses computer vision to assist radiologists..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()
# Map to NAICS code
id2label = model.config.id2label
print(f"Predicted NAICS: {id2label[predicted_class]}") # 62 (Health Care)
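The example above reports only the single most likely class. When a repository plausibly spans several sectors, it can help to look at the top few candidates with their probabilities. The following is a minimal sketch in plain Python, assuming you have already converted the logits to a list; `top_k_from_logits` and the toy values are hypothetical and only illustrate the softmax-and-rank step.

```python
import math


def top_k_from_logits(logits, id2label, k=3):
    """Turn raw logits into softmax probabilities and return the k most likely labels."""
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy 3-class input for illustration only (the real model has 19 classes).
toy_logits = [0.5, 2.0, -1.0]
toy_id2label = {0: "51", 1: "52", 2: "62"}
for label, prob in top_k_from_logits(toy_logits, toy_id2label, k=2):
    print(label, round(prob, 3))
```

With the real model, the inputs would be `outputs.logits[0].tolist()` and `model.config.id2label`.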
Input Format
The model expects text in this format:
Repository: {repo_name} | Description: {description} | Topics: {topics} | README: {readme_content}
| Field | Required | Description |
|---|---|---|
| Repository | Yes | Repository name |
| Description | No | Short description |
| Topics | No | Semicolon-separated tags |
| README | No | README content (can be truncated) |
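A small helper can assemble this string from separate fields. This is a sketch, not the author's preprocessing code: the function name `format_repo_input` and the `max_readme_chars` cap are hypothetical. Skipping empty optional fields mirrors the Quick Start example, which omits Topics; whether the training pipeline did the same is an assumption.

```python
def format_repo_input(repo_name, description="", topics=None, readme="", max_readme_chars=2000):
    """Build the pipe-delimited input string described above.

    Optional fields are skipped when empty. max_readme_chars is an
    illustrative cap, not a value taken from the model's training setup;
    the tokenizer still truncates to its own max_length afterwards.
    """
    if not repo_name:
        raise ValueError("repo_name is required")
    parts = [f"Repository: {repo_name}"]
    if description:
        parts.append(f"Description: {description}")
    if topics:
        # Topics are semicolon-separated, as in the table above.
        parts.append("Topics: " + "; ".join(topics))
    if readme:
        parts.append(f"README: {readme[:max_readme_chars]}")
    return " | ".join(parts)


print(format_repo_input(
    "mediscan",
    description="AI diagnostic tool for radiology",
    topics=["healthcare", "medical-imaging", "deep-learning"],
))
```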
Training Details
Training Data
- Source: GitHub repositories labeled with NAICS codes
- Size: 6,588 examples
- Classes: 19 NAICS sectors
- Split: 70% train / 10% validation / 20% test
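To make the split sizes concrete, here is a minimal sketch of a shuffled 70/10/20 index split over 6,588 examples. The card does not say whether the actual split was stratified by class or which seed was used, so the plain shuffle, the seed value, and the function name `split_indices` are all assumptions.

```python
import random


def split_indices(n, train_frac=0.7, val_frac=0.1, seed=42):
    """Shuffle indices 0..n-1 and cut them into train/validation/test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder, roughly 20%
    return train, val, test


train, val, test = split_indices(6588)
print(len(train), len(val), len(test))  # 4611 658 1319
```

With 19 classes of likely uneven size, a stratified split may be preferable in practice so that rare sectors appear in every partition.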
Training Hyperparameters
| Parameter | Value |
|---|---|
| Base Model | roberta-large |
| Batch Size | 32 |
| Learning Rate | 2e-5 |
| Epochs | 8 |
| Max Sequence Length | 512 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Early Stopping Patience | 5 |
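As a rough worked example, these values imply an upper bound on the number of optimizer steps. The arithmetic below assumes a single device, no gradient accumulation, and a training set of 70% of the 6,588 examples rounded down; none of these details are stated in the card, and early stopping could end training before the bound is reached.

```python
import math

total_examples = 6588
train_examples = int(total_examples * 0.70)  # 70% training split, rounded down
batch_size = 32
epochs = 8

# The last batch of each epoch may be partial, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
max_steps = steps_per_epoch * epochs

print(train_examples, steps_per_epoch, max_steps)  # 4611 145 1160
```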
Preprocessing
Text preprocessing includes:
- Removal of markdown badges and formatting
- URL cleaning (keep domain names)
- License header removal
- Code block removal (keep language indicators)
- Technology term normalization (js → javascript, py → python)
- Whitespace normalization
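The steps above could be approximated with a few regular expressions. This is a hedged sketch, not the training pipeline's actual code: every pattern and the name `preprocess` are assumptions. It covers five of the six steps and omits license header removal, since the card does not describe how license headers are detected.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker, built this way to avoid nesting fences here
TERM_MAP = {"js": "javascript", "py": "python"}  # the two mappings named in the list above

# Fenced code block: capture the language tag, drop the body.
CODE_BLOCK = re.compile(FENCE + r"([\w+-]*)[^\n]*\n.*?" + FENCE, re.DOTALL)
# Badge: a Markdown image, optionally wrapped in a link.
BADGE = re.compile(r"\[?!\[[^\]]*\]\([^)]*\)(?:\]\([^)]*\))?")
# URL: capture the domain, drop scheme and path.
URL = re.compile(r"https?://([^/\s)]+)[^\s)]*")
# Basic Markdown markup characters (headings, emphasis, inline code, quotes).
MARKUP = re.compile(r"[#*`>]+")
TERM = re.compile(r"\b(" + "|".join(TERM_MAP) + r")\b", re.IGNORECASE)


def preprocess(text):
    """Apply an approximate version of the cleaning steps listed above."""
    text = CODE_BLOCK.sub(lambda m: " " + m.group(1) + " ", text)  # keep language tag only
    text = BADGE.sub(" ", text)                                     # remove badges
    text = URL.sub(r"\1", text)                                     # keep domain names
    text = MARKUP.sub(" ", text)                                    # strip formatting marks
    text = TERM.sub(lambda m: TERM_MAP[m.group(1).lower()], text)   # normalize tech terms
    return re.sub(r"\s+", " ", text).strip()                        # normalize whitespace


sample = (
    "# Demo\n"
    "[![build](https://img.shields.io/b.svg)](https://ci.example.com)\n"
    "A **py** tool. Docs: https://example.com/docs/start\n"
    + FENCE + "js\nconsole.log(1)\n" + FENCE + "\n"
)
print(preprocess(sample))
```

Order matters in this sketch: badges are removed before URL cleaning so that badge image domains do not leak into the output.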
Limitations
- Trained primarily on English repositories
- May not generalize to non-software repositories
- NAICS code 55 (Management of Companies) excluded due to limited training data
- Performance may vary for repositories with minimal README content
Citation
@misc{naics-github-classifier,
author = {Alexander Quispe},
title = {NAICS GitHub Repository Classifier},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/alexanderquispe/naics-github-classifier}
}
Repository
Training code and data preparation: github.com/alexanderquispe/naics-github-train