---
license: mit
language:
- en
library_name: transformers
tags:
- text-classification
- naics
- industry-classification
- github
- roberta
datasets:
- custom
metrics:
- f1
- accuracy
pipeline_tag: text-classification
---

# NAICS GitHub Repository Classifier

A fine-tuned RoBERTa-large model that classifies GitHub repositories into **19 NAICS (North American Industry Classification System)** industry sectors based on repository metadata.

## Model Description

This model takes GitHub repository information (name, description, topics, README) and predicts the industry sector the repository most likely belongs to.

- **Model:** `roberta-large` (355M parameters)
- **Task:** Multi-class text classification (19 classes)
- **Language:** English
- **Training Data:** 6,588 labeled GitHub repositories

## Intended Use

- Classifying GitHub repositories by industry sector
- Analyzing the open-source software ecosystem by industry
- Research on technology adoption across industries

## NAICS Classes

| Label | NAICS Code | Industry Sector |
|-------|------------|-----------------|
| 0 | 11 | Agriculture, Forestry, Fishing and Hunting |
| 1 | 21 | Mining, Quarrying, Oil and Gas Extraction |
| 2 | 22 | Utilities |
| 3 | 23 | Construction |
| 4 | 31-33 | Manufacturing |
| 5 | 42 | Wholesale Trade |
| 6 | 44-45 | Retail Trade |
| 7 | 48-49 | Transportation and Warehousing |
| 8 | 51 | Information |
| 9 | 52 | Finance and Insurance |
| 10 | 53 | Real Estate and Rental |
| 11 | 54 | Professional, Scientific, Technical Services |
| 12 | 56 | Administrative and Support Services |
| 13 | 61 | Educational Services |
| 14 | 62 | Health Care and Social Assistance |
| 15 | 71 | Arts, Entertainment, and Recreation |
| 16 | 72 | Accommodation and Food Services |
| 17 | 81 | Other Services |
| 18 | 92 | Public Administration |

## Usage

### Quick Start

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="alexanderquispe/naics-github-classifier"
)

text = "Repository: bank-api | Description: REST API for banking transactions | README: A secure API for financial operations"
result = classifier(text)
print(result)
# [{'label': '52', 'score': 0.95}]  # Finance and Insurance
```

### Full Example

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("alexanderquispe/naics-github-classifier")
tokenizer = AutoTokenizer.from_pretrained("alexanderquispe/naics-github-classifier")

# Format input
text = "Repository: mediscan | Description: AI diagnostic tool for radiology | Topics: healthcare; medical-imaging; deep-learning | README: MediScan uses computer vision to assist radiologists..."

# Tokenize, truncating to the 512-token limit used in training
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Map to NAICS code
id2label = model.config.id2label
print(f"Predicted NAICS: {id2label[predicted_class]}")  # 62 (Health Care)
```

## Input Format

The model expects text in this format:

```
Repository: {repo_name} | Description: {description} | Topics: {topics} | README: {readme_content}
```

| Field | Required | Description |
|-------|----------|-------------|
| Repository | Yes | Repository name |
| Description | No | Short description |
| Topics | No | Semicolon-separated tags |
| README | No | README content (can be truncated) |

## Training Details

### Training Data

- **Source:** GitHub repositories labeled with NAICS codes
- **Size:** 6,588 examples
- **Classes:** 19 NAICS sectors
- **Split:** 70% train / 10% validation / 20% test

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base Model | `roberta-large` |
| Batch Size | 32 |
| Learning Rate | 2e-5 |
| Epochs | 8 |
| Max Sequence Length | 512 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Early Stopping Patience | 5 |

### Preprocessing

Text preprocessing includes:

- Removal of Markdown badges and formatting
- URL cleaning (keep domain names)
- License header removal
- Code block removal (keep language indicators)
- Technology term normalization (js → javascript, py → python)
- Whitespace normalization

## Limitations

- Trained primarily on English repositories
- May not generalize to non-software repositories
- NAICS code 55 (Management of Companies) excluded due to limited training data
- Performance may vary for repositories with minimal README content

## Citation

```bibtex
@misc{naics-github-classifier,
  author = {Alexander Quispe},
  title = {NAICS GitHub Repository Classifier},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/alexanderquispe/naics-github-classifier}
}
```

## Repository

Training code and data preparation: [github.com/alexanderquispe/naics-github-train](https://github.com/alexanderquispe/naics-github-train)
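
## Example: Building the Input String

The layout described under Input Format can be assembled with a small helper. The following is a minimal sketch, not part of the published model code: the function name `build_input` is hypothetical, and dropping empty optional fields entirely is an assumption inferred from the Quick Start example, which has no `Topics` segment.

```python
from typing import Optional, Sequence


def build_input(
    repo_name: str,
    description: Optional[str] = None,
    topics: Optional[Sequence[str]] = None,
    readme: Optional[str] = None,
) -> str:
    """Join the available fields as 'Field: value' segments separated by ' | '."""
    if not repo_name or not repo_name.strip():
        # Repository is the only field the card marks as required.
        raise ValueError("Repository name is required")

    parts = [f"Repository: {repo_name.strip()}"]
    if description and description.strip():
        parts.append(f"Description: {description.strip()}")
    if topics:
        # Topics are semicolon-separated, as in the Full Example.
        cleaned = [t.strip() for t in topics if t and t.strip()]
        if cleaned:
            parts.append("Topics: " + "; ".join(cleaned))
    if readme and readme.strip():
        parts.append(f"README: {readme.strip()}")
    return " | ".join(parts)


text = build_input(
    "bank-api",
    description="REST API for banking transactions",
    readme="A secure API for financial operations",
)
print(text)
# Repository: bank-api | Description: REST API for banking transactions | README: A secure API for financial operations
```

The resulting string can be passed directly to the `classifier` pipeline or the tokenizer shown in the Usage section.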
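
## Example: Approximating the Preprocessing Steps

The preprocessing steps listed under Training Details can be approximated with regular expressions. This is a rough sketch only, not the author's pipeline (the actual code lives in the training repository). Every pattern here is an assumption, as are the license keyword list and the function name `clean_readme`. Only the two aliases named in the card are included, and all Markdown images are treated as badges for simplicity.

```python
import re
from urllib.parse import urlparse

FENCE = "`" * 3  # built dynamically so a literal fence does not end this Markdown block

# Fenced code block; group 1 captures the language indicator, if any.
CODE_BLOCK_RE = re.compile(FENCE + r"([\w+#-]*)[^\n]*\n.*?" + FENCE, re.DOTALL)
# Linked images (typical badges) and plain images.
IMAGE_RE = re.compile(r"\[!\[[^\]]*\]\([^)]*\)\]\([^)]*\)|!\[[^\]]*\]\([^)]*\)")
URL_RE = re.compile(r"https?://[^\s)\]>]+")
LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")
# Assumed heuristic: drop whole lines that look like license notices.
LICENSE_LINE_RE = re.compile(
    r"^.*(licensed under|copyright \(c\)|spdx-license-identifier).*$",
    re.IGNORECASE | re.MULTILINE,
)
ALIASES = {"js": "javascript", "py": "python"}
# Match standalone terms only, so 'node.js' and 'setup.py' are left alone.
ALIAS_RE = re.compile(r"(?<![\w.-])(js|py)(?![\w-]|\.\w)", re.IGNORECASE)


def clean_readme(text: str) -> str:
    text = CODE_BLOCK_RE.sub(r" \1 ", text)            # drop code, keep language
    text = IMAGE_RE.sub(" ", text)                     # drop badges and images
    text = URL_RE.sub(lambda m: urlparse(m.group(0)).netloc, text)  # keep domain
    text = LINK_RE.sub(r"\1 \2", text)                 # [label](target) -> label target
    text = LICENSE_LINE_RE.sub(" ", text)              # drop license notice lines
    text = re.sub(r"^#+\s*", "", text, flags=re.MULTILINE)  # heading markers
    text = re.sub(r"[*`]", "", text)                   # emphasis and inline-code marks
    text = ALIAS_RE.sub(lambda m: ALIASES[m.group(1).lower()], text)
    return re.sub(r"\s+", " ", text).strip()           # whitespace normalization


sample = (
    "# Demo\n"
    "[![build](https://img.shields.io/badge/x.svg)](https://ci.example.com/run)\n"
    "Docs: https://docs.example.com/guide\n"
    + FENCE + "python\nprint('hi')\n" + FENCE + "\n"
    "Written in py and js.\n"
    "Licensed under the MIT License.\n"
)
print(clean_readme(sample))
```

Code blocks are removed first so that URLs and Markdown syntax inside them never reach the later steps. Inputs cleaned differently from the training data may shift predictions, so the training repository remains the reference for the exact rules.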