NAICS GitHub Repository Classifier

A fine-tuned RoBERTa-large model that classifies GitHub repositories into 19 NAICS (North American Industry Classification System) industry sectors based on repository metadata.

Model Description

This model takes GitHub repository information (name, description, topics, README) and predicts the most likely industry sector the repository belongs to.

  • Model: roberta-large (355M parameters)
  • Task: Multi-class text classification (19 classes)
  • Language: English
  • Training Data: 6,588 labeled GitHub repositories

Intended Use

  • Classifying GitHub repositories by industry sector
  • Analyzing the open-source software ecosystem by industry
  • Research on technology adoption across industries

NAICS Classes

Label  NAICS Code  Industry Sector
0      11          Agriculture, Forestry, Fishing and Hunting
1      21          Mining, Quarrying, Oil and Gas Extraction
2      22          Utilities
3      23          Construction
4      31-33       Manufacturing
5      42          Wholesale Trade
6      44-45       Retail Trade
7      48-49       Transportation and Warehousing
8      51          Information
9      52          Finance and Insurance
10     53          Real Estate and Rental
11     54          Professional, Scientific, Technical Services
12     56          Administrative and Support Services
13     61          Educational Services
14     62          Health Care and Social Assistance
15     71          Arts, Entertainment, and Recreation
16     72          Accommodation and Food Services
17     81          Other Services
18     92          Public Administration
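
To turn a prediction back into a readable sector name, the table above can be held as a small lookup. This is a minimal sketch: the names NAICS_CLASSES, SECTOR_BY_CODE, and describe are illustrative and not part of the model or its repository, and it assumes a prediction arrives either as a label index (int) or as a NAICS code string.

```python
# Label index -> (NAICS code, sector name), copied from the table above.
NAICS_CLASSES = {
    0: ("11", "Agriculture, Forestry, Fishing and Hunting"),
    1: ("21", "Mining, Quarrying, Oil and Gas Extraction"),
    2: ("22", "Utilities"),
    3: ("23", "Construction"),
    4: ("31-33", "Manufacturing"),
    5: ("42", "Wholesale Trade"),
    6: ("44-45", "Retail Trade"),
    7: ("48-49", "Transportation and Warehousing"),
    8: ("51", "Information"),
    9: ("52", "Finance and Insurance"),
    10: ("53", "Real Estate and Rental"),
    11: ("54", "Professional, Scientific, Technical Services"),
    12: ("56", "Administrative and Support Services"),
    13: ("61", "Educational Services"),
    14: ("62", "Health Care and Social Assistance"),
    15: ("71", "Arts, Entertainment, and Recreation"),
    16: ("72", "Accommodation and Food Services"),
    17: ("81", "Other Services"),
    18: ("92", "Public Administration"),
}

# Reverse lookup keyed by NAICS code, for outputs that report the code string.
SECTOR_BY_CODE = {code: name for code, name in NAICS_CLASSES.values()}


def describe(label):
    """Return 'code: sector' for a label index (int) or a NAICS code (str)."""
    if isinstance(label, int):
        code, name = NAICS_CLASSES[label]
    else:
        code, name = label, SECTOR_BY_CODE[label]
    return f"{code}: {name}"
```

Calling describe(9) or describe("52") yields the same string, so the helper works regardless of which form the caller has in hand.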

Usage

Quick Start

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="alexanderquispe/naics-github-classifier"
)

text = "Repository: bank-api | Description: REST API for banking transactions | README: A secure API for financial operations"
result = classifier(text)
print(result)
# Example output (score is illustrative): [{'label': '52', 'score': 0.95}]  # 52 = Finance and Insurance

Full Example

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("alexanderquispe/naics-github-classifier")
tokenizer = AutoTokenizer.from_pretrained("alexanderquispe/naics-github-classifier")

# Format input
text = "Repository: mediscan | Description: AI diagnostic tool for radiology | Topics: healthcare; medical-imaging; deep-learning | README: MediScan uses computer vision to assist radiologists..."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Map to NAICS code
id2label = model.config.id2label
print(f"Predicted NAICS: {id2label[predicted_class]}")  # e.g. 62 (Health Care and Social Assistance)

Input Format

The model expects text in this format:

Repository: {repo_name} | Description: {description} | Topics: {topics} | README: {readme_content}

Field        Required  Description
Repository   Yes       Repository name
Description  No        Short description
Topics       No        Semicolon-separated tags
README       No        README content (can be truncated)
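
The format can be assembled with a small helper. This is a sketch only: build_input_text is a hypothetical name, and dropping empty optional fields is an assumption based on the Quick Start example, which has no Topics segment.

```python
def build_input_text(repo_name, description=None, topics=None, readme=None):
    """Join repository fields into the 'Field: value | Field: value' string."""
    if not repo_name or not repo_name.strip():
        raise ValueError("repo_name is required")
    parts = [f"Repository: {repo_name.strip()}"]
    if description:
        parts.append(f"Description: {description.strip()}")
    if topics:
        # Topics are joined with semicolons, as in the Full Example.
        parts.append("Topics: " + "; ".join(t.strip() for t in topics))
    if readme:
        parts.append(f"README: {readme.strip()}")
    return " | ".join(parts)


text = build_input_text(
    "mediscan",
    description="AI diagnostic tool for radiology",
    topics=["healthcare", "medical-imaging", "deep-learning"],
    readme="MediScan uses computer vision to assist radiologists...",
)
```

The resulting string can be passed to the pipeline or tokenizer exactly as in the usage examples; the tokenizer's truncation still applies to long READMEs.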

Training Details

Training Data

  • Source: GitHub repositories labeled with NAICS codes
  • Size: 6,588 examples
  • Classes: 19 NAICS sectors
  • Split: 70% train / 10% validation / 20% test
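
As a rough illustration of what a 70/10/20 split means for 6,588 examples, here is a minimal sketch using only the standard library. The function name split_indices, the seed, and the rounding are assumptions; the card does not say how the split was drawn or whether it was stratified by class, so the author's actual partition sizes may differ slightly.

```python
import random


def split_indices(n, train=0.7, val=0.1, seed=0):
    """Shuffle indices 0..n-1 and cut them into train / validation / test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    n_train = round(n * train)
    n_val = round(n * val)
    # Whatever remains after train and validation becomes the test set.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(6588)
print(len(train_idx), len(val_idx), len(test_idx))  # 4612 659 1317
```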

Training Hyperparameters

Parameter                Value
Base Model               roberta-large
Batch Size               32
Learning Rate            2e-5
Epochs                   8
Max Sequence Length      512
Optimizer                AdamW
Weight Decay             0.01
Early Stopping Patience  5

Preprocessing

Text preprocessing includes:

  • Removal of markdown badges and formatting
  • URL cleaning (keep domain names)
  • License header removal
  • Code block removal (keep language indicators)
  • Technology term normalization (js → javascript, py → python)
  • Whitespace normalization
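
The steps above could look roughly like the following regex-based sketch. It is not the author's preprocessing code: the function name preprocess_readme, every pattern, and the order of the steps are assumptions, and the term map contains only the two pairs named in the list. License header removal is left out because the card does not describe how headers are detected.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly so the example stays self-contained
CODE_BLOCK = re.compile(FENCE + r"(\w+)?[^\n]*\n.*?" + FENCE, re.DOTALL)
LINKED_BADGE = re.compile(r"\[!\[[^\]]*\]\([^)]*\)\]\([^)]*\)")
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")
URL = re.compile(r"https?://([^/\s)]+)[^\s)]*")
HEADING = re.compile(r"^#{1,6}[ \t]*", re.MULTILINE)
EMPHASIS = re.compile(r"[*`]+")
# Only the two pairs the list mentions; the full mapping is not documented.
TERM_MAP = {"js": "javascript", "py": "python"}
# Lookarounds avoid rewriting file extensions such as "node.js" or "setup.py".
TERM = re.compile(r"(?<![\w.])(js|py)(?![\w.])", re.IGNORECASE)


def preprocess_readme(text):
    # Code blocks: drop the body, keep the language indicator if there is one.
    text = CODE_BLOCK.sub(lambda m: f" {m.group(1) or ''} ", text)
    # Badges and images carry no descriptive text, so remove them entirely.
    text = LINKED_BADGE.sub(" ", text)
    text = IMAGE.sub(" ", text)
    # Links: keep the link text and the target, then reduce URLs to their domain.
    text = LINK.sub(r"\1 \2", text)
    text = URL.sub(r"\1", text)
    # Markdown formatting: heading markers, emphasis, inline-code backticks.
    text = HEADING.sub("", text)
    text = EMPHASIS.sub("", text)
    # Technology term normalization.
    text = TERM.sub(lambda m: TERM_MAP[m.group(1).lower()], text)
    # Whitespace normalization.
    return re.sub(r"\s+", " ", text).strip()
```

Badges are removed before URL cleaning so that badge hosts do not leak into the text as domain names. Note that the URL pattern also consumes punctuation directly attached to a URL, which is acceptable for a classifier input but would need tightening for other uses.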

Limitations

  • Trained primarily on English repositories
  • May not generalize to non-software repositories
  • NAICS code 55 (Management of Companies) excluded due to limited training data
  • Performance may vary for repositories with minimal README content

Citation

@misc{naics-github-classifier,
  author = {Alexander Quispe},
  title = {NAICS GitHub Repository Classifier},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/alexanderquispe/naics-github-classifier}
}

Repository

Training code and data preparation: github.com/alexanderquispe/naics-github-train
