NAICS GitHub Repository Classifier

A fine-tuned RoBERTa-large model that classifies GitHub repositories into 19 NAICS (North American Industry Classification System) industry sectors based on repository metadata.

Model Description

This model takes GitHub repository information (name, description, topics, README) and predicts the most likely industry sector the repository belongs to.

Model: roberta-large (355M parameters)
Task: Multi-class text classification (19 classes)
Language: English
Training Data: 6,588 labeled GitHub repositories (total, before the train/validation/test split)

Intended Use

Classifying GitHub repositories by industry sector
Analyzing the open-source software ecosystem by industry
Research on technology adoption across industries

NAICS Classes

Label	NAICS Code	Industry Sector
0	11	Agriculture, Forestry, Fishing and Hunting
1	21	Mining, Quarrying, Oil and Gas Extraction
2	22	Utilities
3	23	Construction
4	31-33	Manufacturing
5	42	Wholesale Trade
6	44-45	Retail Trade
7	48-49	Transportation and Warehousing
8	51	Information
9	52	Finance and Insurance
10	53	Real Estate and Rental
11	54	Professional, Scientific, Technical Services
12	56	Administrative and Support Services
13	61	Educational Services
14	62	Health Care and Social Assistance
15	71	Arts, Entertainment, and Recreation
16	72	Accommodation and Food Services
17	81	Other Services
18	92	Public Administration
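For downstream analysis it can help to have the table above available as a lookup. The following is a minimal sketch that transcribes the table into a Python dict. The names `NAICS_SECTORS` and `sector_name` are illustrative and not part of the model's API, and the sketch assumes the model returns the NAICS code as a string label, as the usage examples in this card suggest.

```python
# Lookup transcribed from the NAICS Classes table: NAICS code -> sector name.
# NAICS_SECTORS and sector_name are hypothetical helper names.
NAICS_SECTORS = {
    "11": "Agriculture, Forestry, Fishing and Hunting",
    "21": "Mining, Quarrying, Oil and Gas Extraction",
    "22": "Utilities",
    "23": "Construction",
    "31-33": "Manufacturing",
    "42": "Wholesale Trade",
    "44-45": "Retail Trade",
    "48-49": "Transportation and Warehousing",
    "51": "Information",
    "52": "Finance and Insurance",
    "53": "Real Estate and Rental",
    "54": "Professional, Scientific, Technical Services",
    "56": "Administrative and Support Services",
    "61": "Educational Services",
    "62": "Health Care and Social Assistance",
    "71": "Arts, Entertainment, and Recreation",
    "72": "Accommodation and Food Services",
    "81": "Other Services",
    "92": "Public Administration",
}


def sector_name(code: str) -> str:
    """Return the sector name for a NAICS code string, or 'Unknown' if absent."""
    return NAICS_SECTORS.get(code, "Unknown")


print(sector_name("52"))
```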

Usage

Quick Start

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="alexanderquispe/naics-github-classifier"
)

text = "Repository: bank-api | Description: REST API for banking transactions | README: A secure API for financial operations"
result = classifier(text)
print(result)
# [{'label': '52', 'score': 0.95}]  # Finance and Insurance

Full Example

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("alexanderquispe/naics-github-classifier")
tokenizer = AutoTokenizer.from_pretrained("alexanderquispe/naics-github-classifier")

# Format input
text = "Repository: mediscan | Description: AI diagnostic tool for radiology | Topics: healthcare; medical-imaging; deep-learning | README: MediScan uses computer vision to assist radiologists..."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, so gradients are not needed
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Map to NAICS code
id2label = model.config.id2label
print(f"Predicted NAICS: {id2label[predicted_class]}")  # 62 (Health Care)
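The full example reports only the single most likely class. When a repository plausibly spans several sectors, it may be more useful to look at the top few probabilities. The sketch below applies a softmax to a list of logits and ranks the result; `top_k_probs` is a hypothetical helper, and the logits and labels shown are made-up toy values rather than model output.

```python
import math


def top_k_probs(logits, id2label, k=3):
    """Softmax over raw logits, then return the k most probable (label, prob) pairs."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy three-class case with invented logits (not real model output).
toy_logits = [0.1, 2.0, -1.0]
toy_labels = {0: "51", 1: "52", 2: "62"}
print(top_k_probs(toy_logits, toy_labels, k=2))
```

With the real model, the same helper could be called as `top_k_probs(outputs.logits[0].tolist(), model.config.id2label)`.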

Input Format

The model expects text in this format:

Repository: {repo_name} | Description: {description} | Topics: {topics} | README: {readme_content}

Field	Required	Description
Repository	Yes	Repository name
Description	No	Short description
Topics	No	Semicolon-separated tags
README	No	README content (can be truncated)
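The format above can be assembled with a small helper. This is a sketch only: `build_input` and its `max_readme_chars` parameter are hypothetical, and it assumes that optional fields are omitted entirely when missing (as the Quick Start example does with Topics) rather than left as empty placeholders.

```python
from typing import Optional, Sequence


def build_input(
    repo_name: str,
    description: Optional[str] = None,
    topics: Optional[Sequence[str]] = None,
    readme: Optional[str] = None,
    max_readme_chars: int = 2000,  # assumed cap; the tokenizer still truncates to 512 tokens
) -> str:
    """Join the available fields into the pipe-separated format the model expects."""
    parts = [f"Repository: {repo_name}"]
    if description:
        parts.append(f"Description: {description}")
    if topics:
        parts.append("Topics: " + "; ".join(topics))  # semicolon-separated tags
    if readme:
        parts.append(f"README: {readme[:max_readme_chars]}")
    return " | ".join(parts)


print(build_input(
    "mediscan",
    description="AI diagnostic tool for radiology",
    topics=["healthcare", "medical-imaging", "deep-learning"],
))
```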

Training Details

Training Data

Source: GitHub repositories labeled with NAICS codes
Size: 6,588 examples
Classes: 19 NAICS sectors
Split: 70% train / 10% validation / 20% test
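To make the split sizes concrete, here is a minimal sketch of a 70/10/20 partition over 6,588 examples. It assumes a plain seeded random shuffle; the card does not say whether the actual split was stratified by class or which seed was used, so the exact counts in the original training run may differ by rounding. `split_indices` is a hypothetical helper name.

```python
import random


def split_indices(n, train_frac=0.7, val_frac=0.1, seed=42):
    """Shuffle indices 0..n-1 and cut them into train / validation / test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility (seed is an assumption)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    # Whatever remains after train and validation goes to test.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train, val, test = split_indices(6588)
print(len(train), len(val), len(test))  # 4611 658 1319
```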

Training Hyperparameters

Parameter	Value
Base Model	`roberta-large`
Batch Size	32
Learning Rate	2e-5
Epochs	8
Max Sequence Length	512
Optimizer	AdamW
Weight Decay	0.01
Early Stopping Patience	5
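The interaction between 8 epochs and a patience of 5 is easy to misread, so a small sketch may help. It assumes one evaluation per epoch and validation loss as the monitored metric, neither of which is stated in this card; `epochs_run` and the loss values are hypothetical. Under those assumptions, early stopping only shortens the run when the best validation loss occurs in the first two epochs.

```python
def epochs_run(val_losses, patience=5, max_epochs=8):
    """Return how many epochs complete before training stops."""
    best = float("inf")
    stale = 0  # consecutive evaluations without improvement
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch  # patience exhausted, stop here
    return min(len(val_losses), max_epochs)


# Invented validation losses: the best value is at epoch 2, then no improvement.
plateau = [0.90, 0.70, 0.72, 0.71, 0.75, 0.73, 0.74, 0.76]
print(epochs_run(plateau))  # 7
```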

Preprocessing

Text preprocessing includes:

Removal of markdown badges and formatting
URL cleaning (keep domain names)
License header removal
Code block removal (keep language indicators)
Technology term normalization (js → javascript, py → python)
Whitespace normalization
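The steps above could look roughly like the following regex-based sketch. This is not the authors' preprocessing code: the patterns, the `preprocess` function, and the `TECH_ALIASES` table (which contains only the two examples given in the list) are assumptions. License header removal is left out because the card does not describe what counts as a license header.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker, built this way to keep the example readable

# Hypothetical alias table; the card gives only js and py as examples.
TECH_ALIASES = {"js": "javascript", "py": "python"}

CODE_BLOCK = re.compile(FENCE + r"(\w*)\n.*?" + FENCE, re.DOTALL)
ALIAS = re.compile(
    r"(?<![\w.-])(" + "|".join(TECH_ALIASES) + r")(?![\w.-])", re.IGNORECASE
)


def preprocess(text: str) -> str:
    # Code blocks: drop the body, keep the language indicator if there is one.
    text = CODE_BLOCK.sub(r" \1 ", text)
    # Badges: linked images [![alt](img)](link), then bare images ![alt](img).
    text = re.sub(r"\[!\[[^\]]*\]\([^)]*\)\]\([^)]*\)", " ", text)
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", " ", text)
    # URLs: keep only the domain name.
    text = re.sub(r"https?://(?:www\.)?([^/\s)]+)[^\s)]*", r"\1", text)
    # Markdown formatting: heading markers, emphasis, inline code ticks.
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)
    text = re.sub(r"[*`]+", " ", text)
    # Technology terms: expand standalone abbreviations (not "node.js" or "main.py").
    text = ALIAS.sub(lambda m: TECH_ALIASES[m.group(1).lower()], text)
    # Whitespace: collapse runs to a single space.
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "# My Tool\n"
    "[![build](https://img.shields.io/badge.svg)](https://ci.example.com)\n"
    "Written in py, see https://www.example.com/docs\n"
    + FENCE + "js\nconsole.log(1)\n" + FENCE + "\n"
)
print(preprocess(sample))
```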

Limitations

Trained primarily on English repositories
May not generalize to non-software repositories
NAICS code 55 (Management of Companies and Enterprises) is excluded due to limited training data
Performance may vary for repositories with minimal README content

Citation

@misc{naics-github-classifier,
  author = {{GitHub, Inc.} and Xu, Kevin and Quispe, Alexander},
  title = {NAICS GitHub Repository Classifier},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/alexanderquispe/naics-github-classifier}
}

Repository

Training code and data preparation: github.com/alexanderquispe/naics-github-train
