aquiro1994
/

naics-github-classifier

@@ -1,199 +1,205 @@
----
-library_name: transformers
-tags: []
----
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

+  ---
+  license: mit
+  language:
+  - en
+  library_name: transformers
+  tags:
+  - text-classification
+  - naics
+  - industry-classification
+  - github
+  - roberta
+  datasets:
+  - custom
+  metrics:
+  - f1
+  - accuracy
+  pipeline_tag: text-classification
+  ---
+  # NAICS GitHub Repository Classifier
+  A fine-tuned RoBERTa-large model that classifies GitHub repositories into **19 NAICS (North American Industry
+  Classification System)** industry sectors based on repository metadata.
+  ## Model Description
+  This model takes GitHub repository information (name, description, topics, README) and predicts the most likely
+  industry sector the repository belongs to.
+  - **Model:** `roberta-large` (355M parameters)
+  - **Task:** Multi-class text classification (19 classes)
+  - **Language:** English
+  - **Training Data:** 6,588 labeled GitHub repositories
+  ## Intended Use
+  - Classifying GitHub repositories by industry sector
+  - Analyzing the open-source software ecosystem by industry
+  - Research on technology adoption across industries
+  ## NAICS Classes
+  | Label | NAICS Code | Industry Sector |
+  |-------|------------|-----------------|
+  | 0 | 11 | Agriculture, Forestry, Fishing and Hunting |
+  | 1 | 21 | Mining, Quarrying, Oil and Gas Extraction |
+  | 2 | 22 | Utilities |
+  | 3 | 23 | Construction |
+  | 4 | 31-33 | Manufacturing |
+  | 5 | 42 | Wholesale Trade |
+  | 6 | 44-45 | Retail Trade |
+  | 7 | 48-49 | Transportation and Warehousing |
+  | 8 | 51 | Information |
+  | 9 | 52 | Finance and Insurance |
+  | 10 | 53 | Real Estate and Rental |
+  | 11 | 54 | Professional, Scientific, Technical Services |
+  | 12 | 56 | Administrative and Support Services |
+  | 13 | 61 | Educational Services |
+  | 14 | 62 | Health Care and Social Assistance |
+  | 15 | 71 | Arts, Entertainment, and Recreation |
+  | 16 | 72 | Accommodation and Food Services |
+  | 17 | 81 | Other Services |
+  | 18 | 92 | Public Administration |
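A minimal sketch of turning a prediction back into a sector name, assuming the pipeline returns the NAICS code as the label string (as in the Quick Start sample output). The `NAICS_SECTORS` dictionary and `sector_for` helper are hypothetical names, transcribed by hand from the table above; `model.config.id2label` remains the authoritative mapping.

```python
# Hand-transcribed from the NAICS Classes table: NAICS code -> industry sector.
NAICS_SECTORS = {
    "11": "Agriculture, Forestry, Fishing and Hunting",
    "21": "Mining, Quarrying, Oil and Gas Extraction",
    "22": "Utilities",
    "23": "Construction",
    "31-33": "Manufacturing",
    "42": "Wholesale Trade",
    "44-45": "Retail Trade",
    "48-49": "Transportation and Warehousing",
    "51": "Information",
    "52": "Finance and Insurance",
    "53": "Real Estate and Rental",
    "54": "Professional, Scientific, Technical Services",
    "56": "Administrative and Support Services",
    "61": "Educational Services",
    "62": "Health Care and Social Assistance",
    "71": "Arts, Entertainment, and Recreation",
    "72": "Accommodation and Food Services",
    "81": "Other Services",
    "92": "Public Administration",
}


def sector_for(prediction: dict) -> str:
    """Return the sector name for one pipeline result such as {'label': '52', 'score': 0.95}."""
    return NAICS_SECTORS[prediction["label"]]


print(sector_for({"label": "52", "score": 0.95}))
```

If the deployed model emits labels in another form (for example `LABEL_9`), the lookup key would need to be adapted accordingly.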
+  ## Usage
+  ### Quick Start
+  ```python
+  from transformers import pipeline
+  classifier = pipeline(
+      "text-classification",
+      model="alexanderquispe/naics-github-classifier"
+  )
+  # Adjacent string literals are joined, so the long input stays on readable lines
+  text = (
+      "Repository: bank-api | Description: REST API for banking transactions | "
+      "README: A secure API for financial operations"
+  )
+  result = classifier(text)
+  print(result)
+  # Example output: [{'label': '52', 'score': 0.95}]  # Finance and Insurance
+  ```
+  ### Full Example
+  ```python
+  from transformers import AutoModelForSequenceClassification, AutoTokenizer
+  import torch
+  model = AutoModelForSequenceClassification.from_pretrained("alexanderquispe/naics-github-classifier")
+  tokenizer = AutoTokenizer.from_pretrained("alexanderquispe/naics-github-classifier")
+  model.eval()  # inference mode: disables dropout
+  # Format input
+  text = (
+      "Repository: mediscan | Description: AI diagnostic tool for radiology | "
+      "Topics: healthcare; medical-imaging; deep-learning | "
+      "README: MediScan uses computer vision to assist radiologists..."
+  )
+  inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
+  with torch.no_grad():  # no gradients are needed for prediction
+      outputs = model(**inputs)
+  predicted_class = torch.argmax(outputs.logits, dim=1).item()
+  # Map to NAICS code
+  id2label = model.config.id2label
+  print(f"Predicted NAICS: {id2label[predicted_class]}")  # e.g. 62 (Health Care)
+  ```
+  ### Input Format
+  The model expects text in this format:
+  `Repository: {repo_name} | Description: {description} | Topics: {topics} | README: {readme_content}`
+  | Field | Required | Description |
+  |-------|----------|-------------|
+  | Repository | Yes | Repository name |
+  | Description | No | Short description |
+  | Topics | No | Semicolon-separated tags |
+  | README | No | README content (can be truncated) |
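A small sketch of assembling the input string from the fields above. The helper name `format_repo_input` is hypothetical, and skipping empty optional fields is an assumption: the card does not say how missing fields were handled in training, although the Quick Start example omits `Topics` in the same way.

```python
from typing import Iterable, Optional


def format_repo_input(
    name: str,
    description: Optional[str] = None,
    topics: Optional[Iterable[str]] = None,
    readme: Optional[str] = None,
) -> str:
    """Build 'Repository: ... | Description: ... | Topics: ... | README: ...'."""
    parts = [f"Repository: {name}"]
    if description:
        parts.append(f"Description: {description}")
    if topics:
        # Topics are semicolon-separated, as in the field table.
        parts.append("Topics: " + "; ".join(topics))
    if readme:
        parts.append(f"README: {readme}")
    # Optional fields that are empty are skipped (assumption, see note above).
    return " | ".join(parts)


print(format_repo_input(
    "mediscan",
    description="AI diagnostic tool for radiology",
    topics=["healthcare", "medical-imaging", "deep-learning"],
))
```

The resulting string can be passed directly to the classifier or tokenizer.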
+  ## Training Details
+  ### Training Data
+  - **Source:** GitHub repositories labeled with NAICS codes
+  - **Size:** 6,588 examples
+  - **Classes:** 19 NAICS sectors
+  - **Split:** 70% train / 10% validation / 20% test
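For orientation, a sketch of what a 70/10/20 split of 6,588 examples looks like. The card does not state whether the split was stratified by class, which seed was used, or how fractional counts were rounded, so `split_indices`, the seed, and the floor rounding are all assumptions and the exact counts may differ from the author's.

```python
import random


def split_indices(n: int, seed: int = 0):
    """Shuffle indices 0..n-1 and cut them into 70% / 10% / 20% parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * 70 // 100  # floor rounding (assumption)
    n_val = n * 10 // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_indices(6588)
print(len(train), len(val), len(test))  # 4611 658 1319
```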
+  ### Training Hyperparameters
+  | Parameter | Value |
+  |-----------|-------|
+  | Base Model | roberta-large |
+  | Batch Size | 32 |
+  | Learning Rate | 2e-5 |
+  | Epochs | 8 |
+  | Max Sequence Length | 512 |
+  | Optimizer | AdamW |
+  | Weight Decay | 0.01 |
+  | Early Stopping Patience | 5 |
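As a rough sanity check on the training budget, the arithmetic below derives the number of optimizer steps implied by these values. It assumes about 4,611 training examples (70% of 6,588, rounded down), a single device, and no gradient accumulation; none of these details are stated in the card, and early stopping may end training before the final epoch.

```python
import math

train_examples = 6588 * 70 // 100  # 4611, assuming floor rounding of the 70% split
batch_size = 32
max_epochs = 8

# The last batch of an epoch may be partial, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
max_steps = steps_per_epoch * max_epochs

print(steps_per_epoch, max_steps)  # 145 1160
```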
+  ### Preprocessing
+  Text preprocessing includes:
+  - Removal of markdown badges and formatting
+  - URL cleaning (keep domain names)
+  - License header removal
+  - Code block removal (keep language indicators)
+  - Technology term normalization (js → javascript, py → python)
+  - Whitespace normalization
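The steps above could be sketched with regular expressions roughly as follows. This is an illustration, not the author's pipeline: the function name `clean_readme`, the patterns, the step order, and the two-entry alias table are assumptions, and license header removal is left out because the card does not describe its heuristic.

```python
import re

FENCE = "`" * 3  # built indirectly so the example can sit inside a fenced block
# Hypothetical alias table; the card names only these two mappings.
TERM_ALIASES = {"js": "javascript", "py": "python"}

# A fenced code block: capture the language indicator, drop the body.
CODE_BLOCK = re.compile(
    re.escape(FENCE) + r"[ \t]*([\w+#-]*)[^\n]*\n.*?" + re.escape(FENCE), re.DOTALL
)
LINKED_BADGE = re.compile(r"\[!\[[^\]]*\]\([^)]*\)\]\([^)]*\)")  # [![alt](img)](link)
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")                      # ![alt](img)
LINK = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")                    # [text](url)
URL = re.compile(r"https?://([^/\s)]+)[^\s)]*")                  # capture the domain
TERM = re.compile(r"\b(" + "|".join(TERM_ALIASES) + r")\b", re.IGNORECASE)


def clean_readme(text: str) -> str:
    text = CODE_BLOCK.sub(r" \1 ", text)    # remove code, keep language indicator
    text = LINKED_BADGE.sub(" ", text)      # remove badges
    text = IMAGE.sub(" ", text)
    text = LINK.sub(r"\1 \2", text)         # keep link text and target
    text = URL.sub(r"\1", text)             # reduce URLs to their domain
    text = re.sub(r"[#*`>]+", " ", text)    # strip common markdown markers
    text = TERM.sub(lambda m: TERM_ALIASES[m.group(1).lower()], text)
    return " ".join(text.split())           # normalize whitespace


sample = (
    "# Demo\n"
    "[![build](https://img.shields.io/b.svg)](https://ci.example.com/run)\n"
    "A js tool. See https://docs.example.com/guide for details.\n"
    + FENCE + "py\nprint('hi')\n" + FENCE + "\n"
)
print(clean_readme(sample))
```

Note that a word-boundary match on `js` also rewrites names such as `node.js`; a production version would likely need a more careful alias rule.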
+  ## Limitations
+  - Trained primarily on English-language repositories
+  - May not generalize to non-software repositories
+  - NAICS code 55 (Management of Companies and Enterprises) is excluded due to limited training data
+  - Performance may vary for repositories with minimal README content
+  ## Citation
+  ```bibtex
+  @misc{naics-github-classifier,
+    author = {Alexander Quispe},
+    title = {NAICS GitHub Repository Classifier},
+    year = {2025},
+    publisher = {Hugging Face},
+    url = {https://huggingface.co/alexanderquispe/naics-github-classifier}
+  }
+  ```
+  ## Repository
+  Training code and data preparation: https://github.com/alexanderquispe/naics-github-train