aquiro1994 committed · Commit ecd7012 · verified · 1 Parent(s): 915e4bd

Update README.md

Files changed (1)
  1. README.md +170 -205
README.md CHANGED
@@ -1,205 +1,170 @@
1
- ---
2
- license: mit
3
- language:
4
- - en
5
- library_name: transformers
6
- tags:
7
- - text-classification
8
- - naics
9
- - industry-classification
10
- - github
11
- - roberta
12
- datasets:
13
- - custom
14
- metrics:
15
- - f1
16
- - accuracy
17
- pipeline_tag: text-classification
18
- ---
19
-
20
- # NAICS GitHub Repository Classifier
21
-
22
- A fine-tuned RoBERTa-large model that classifies GitHub repositories into **19 NAICS (North American Industry
23
- Classification System)** industry sectors based on repository metadata.
24
-
25
- ## Model Description
26
-
27
- This model takes GitHub repository information (name, description, topics, README) and predicts the most likely
28
- industry sector the repository belongs to.
29
-
30
- - **Model:** `roberta-large` (355M parameters)
31
- - **Task:** Multi-class text classification (19 classes)
32
- - **Language:** English
33
- - **Training Data:** 6,588 labeled GitHub repositories
34
-
35
- ## Intended Use
36
-
37
- - Classifying GitHub repositories by industry sector
38
- - Analyzing open-source software ecosystem by industry
39
- - Research on technology adoption across industries
40
-
41
- ## NAICS Classes
42
-
43
- | Label | NAICS Code | Industry Sector |
44
- |-------|------------|-----------------|
45
- | 0 | 11 | Agriculture, Forestry, Fishing and Hunting |
46
- | 1 | 21 | Mining, Quarrying, Oil and Gas Extraction |
47
- | 2 | 22 | Utilities |
48
- | 3 | 23 | Construction |
49
- | 4 | 31-33 | Manufacturing |
50
- | 5 | 42 | Wholesale Trade |
51
- | 6 | 44-45 | Retail Trade |
52
- | 7 | 48-49 | Transportation and Warehousing |
53
- | 8 | 51 | Information |
54
- | 9 | 52 | Finance and Insurance |
55
- | 10 | 53 | Real Estate and Rental |
56
- | 11 | 54 | Professional, Scientific, Technical Services |
57
- | 12 | 56 | Administrative and Support Services |
58
- | 13 | 61 | Educational Services |
59
- | 14 | 62 | Health Care and Social Assistance |
60
- | 15 | 71 | Arts, Entertainment, and Recreation |
61
- | 16 | 72 | Accommodation and Food Services |
62
- | 17 | 81 | Other Services |
63
- | 18 | 92 | Public Administration |
64
-
65
- ## Usage
66
-
67
- ### Quick Start
68
-
69
- ```python
70
- from transformers import pipeline
71
-
72
- classifier = pipeline(
73
- "text-classification",
74
- model="alexanderquispe/naics-github-classifier"
75
- )
76
-
77
- text = "Repository: bank-api | Description: REST API for banking transactions | README: A secure API for
78
- financial operations"
79
- result = classifier(text)
80
- print(result)
81
- # [{'label': '52', 'score': 0.95}] # Finance and Insurance
82
-
83
- Full Example
84
-
85
- from transformers import AutoModelForSequenceClassification, AutoTokenizer
86
- import torch
87
-
88
- model = AutoModelForSequenceClassification.from_pretrained("alexanderquispe/naics-github-classifier")
89
- tokenizer = AutoTokenizer.from_pretrained("alexanderquispe/naics-github-classifier")
90
-
91
- # Format input
92
- text = "Repository: mediscan | Description: AI diagnostic tool for radiology | Topics: healthcare;
93
- medical-imaging; deep-learning | README: MediScan uses computer vision to assist radiologists..."
94
-
95
- inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
96
- outputs = model(**inputs)
97
- predicted_class = torch.argmax(outputs.logits, dim=1).item()
98
-
99
- # Map to NAICS code
100
- id2label = model.config.id2label
101
- print(f"Predicted NAICS: {id2label[predicted_class]}") # 62 (Health Care)
102
-
103
- Input Format
104
-
105
- The model expects text in this format:
106
-
107
- Repository: {repo_name} | Description: {description} | Topics: {topics} | README: {readme_content}
108
- ┌─────────────┬──────────┬───────────────────────────────────┐
109
- │ Field │ Required │ Description │
110
- ├─────────────┼──────────┼───────────────────────────────────┤
111
- │ Repository │ Yes │ Repository name │
112
- ├─────────────┼──────────┼───────────────────────────────────┤
113
- Description No Short description
114
- ├─────────────┼──────────┼───────────────────────────────────┤
115
- Topics │ No Semicolon-separated tags │
116
- ├─────────────┼──────────┼───────────────────────────────────┤
117
- README │ No │ README content (can be truncated) │
118
- └─────────────┴──────────┴───────────────────────────────────┘
119
- Training Details
120
-
121
- Training Data
122
-
123
- - Source: GitHub repositories labeled with NAICS codes
124
- - Size: 6,588 examples
125
- - Classes: 19 NAICS sectors
126
- - Split: 70% train / 10% validation / 20% test
127
-
128
- Training Hyperparameters
129
- ┌─────────────────────────┬───────────────┐
130
- │ Parameter │ Value │
131
- ├─────────────────────────┼───────────────┤
132
- Base Model │ roberta-large
133
- ├─────────────────────────┼───────────────┤
134
- Batch Size │ 32 │
135
- ├─────────────────────────┼───────────────┤
136
- Learning Rate │ 2e-5 │
137
- ├─────────────────────────┼───────────────┤
138
- │ Epochs │ 8 │
139
- ├─────────────────────────┼───────────────┤
140
- │ Max Sequence Length │ 512 │
141
- ├─────────────────────────┼───────────────┤
142
- Optimizer │ AdamW │
143
- ├─────────────────────────┼───────────────┤
144
- Weight Decay │ 0.01 │
145
- ├─────────────────────────┼───────────────┤
146
- Early Stopping Patience 5 │
147
- └─────────────────────────┴───────────────┘
148
- Preprocessing
149
-
150
- Text preprocessing includes:
151
- - Removal of markdown badges and formatting
152
- - URL cleaning (keep domain names)
153
- - License header removal
154
- - Code block removal (keep language indicators)
155
- - Technology term normalization (js → javascript, py → python)
156
- - Whitespace normalization
157
-
158
- Limitations
159
-
160
- - Trained primarily on English repositories
161
- - May not generalize to non-software repositories
162
- - NAICS code 55 (Management of Companies) excluded due to limited training data
163
- - Performance may vary for repositories with minimal README content
164
-
165
- Citation
166
-
167
- @misc{naics-github-classifier,
168
- author = {Alexander Quispe},
169
- title = {NAICS GitHub Repository Classifier},
170
- year = {2025},
171
- publisher = {Hugging Face},
172
- url = {https://huggingface.co/alexanderquispe/naics-github-classifier}
173
- }
174
-
175
- Repository
176
-
177
- Training code and data preparation: https://github.com/alexanderquispe/naics-github-train
178
-
179
- ---
180
-
181
- **To upload:**
182
-
183
- 1. Go to https://huggingface.co/alexanderquispe/naics-github-classifier
184
- 2. Click the **"Files and versions"** tab
185
- 3. Click **"Edit"** on `README.md` (or create it)
186
- 4. Paste the content above
187
- 5. Click **"Commit changes"**
188
-
189
- Or from Colab:
190
-
191
- ```python
192
- from huggingface_hub import upload_file
193
-
194
- # Save the model card
195
- model_card = """<paste the content above>"""
196
-
197
- with open("README.md", "w") as f:
198
- f.write(model_card)
199
-
200
- upload_file(
201
- path_or_fileobj="README.md",
202
- path_in_repo="README.md",
203
- repo_id="alexanderquispe/naics-github-classifier",
204
- repo_type="model"
205
- )
 
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ library_name: transformers
6
+ tags:
7
+ - text-classification
8
+ - naics
9
+ - industry-classification
10
+ - github
11
+ - roberta
12
+ datasets:
13
+ - custom
14
+ metrics:
15
+ - f1
16
+ - accuracy
17
+ pipeline_tag: text-classification
18
+ ---
19
+
20
+ # NAICS GitHub Repository Classifier
21
+
22
+ A fine-tuned RoBERTa-large model that classifies GitHub repositories into **19 NAICS (North American Industry Classification System)** industry sectors based on repository metadata.
23
+
24
+ ## Model Description
25
+
26
+ This model takes GitHub repository information (name, description, topics, README) and predicts the most likely industry sector the repository belongs to.
27
+
28
+ - **Model:** `roberta-large` (355M parameters)
29
+ - **Task:** Multi-class text classification (19 classes)
30
+ - **Language:** English
31
+ - **Training Data:** 6,588 labeled GitHub repositories
32
+
33
+ ## Intended Use
34
+
35
+ - Classifying GitHub repositories by industry sector
36
+ - Analyzing open-source software ecosystem by industry
37
+ - Research on technology adoption across industries
38
+
39
+ ## NAICS Classes
40
+
41
+ | Label | NAICS Code | Industry Sector |
42
+ |-------|------------|-----------------|
43
+ | 0 | 11 | Agriculture, Forestry, Fishing and Hunting |
44
+ | 1 | 21 | Mining, Quarrying, Oil and Gas Extraction |
45
+ | 2 | 22 | Utilities |
46
+ | 3 | 23 | Construction |
47
+ | 4 | 31-33 | Manufacturing |
48
+ | 5 | 42 | Wholesale Trade |
49
+ | 6 | 44-45 | Retail Trade |
50
+ | 7 | 48-49 | Transportation and Warehousing |
51
+ | 8 | 51 | Information |
52
+ | 9 | 52 | Finance and Insurance |
53
+ | 10 | 53 | Real Estate and Rental |
54
+ | 11 | 54 | Professional, Scientific, Technical Services |
55
+ | 12 | 56 | Administrative and Support Services |
56
+ | 13 | 61 | Educational Services |
57
+ | 14 | 62 | Health Care and Social Assistance |
58
+ | 15 | 71 | Arts, Entertainment, and Recreation |
59
+ | 16 | 72 | Accommodation and Food Services |
60
+ | 17 | 81 | Other Services |
61
+ | 18 | 92 | Public Administration |
62
+
63
+ ## Usage
64
+
65
+ ### Quick Start
66
+
67
+ ```python
68
+ from transformers import pipeline
69
+
70
+ classifier = pipeline(
71
+ "text-classification",
72
+ model="alexanderquispe/naics-github-classifier"
73
+ )
74
+
75
+ text = "Repository: bank-api | Description: REST API for banking transactions | README: A secure API for financial operations"
76
+ result = classifier(text)
77
+ print(result)
78
+ # [{'label': '52', 'score': 0.95}] # Finance and Insurance
79
+ ```
80
+
81
+ ### Full Example
82
+
83
+ ```python
84
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
85
+ import torch
86
+
87
+ model = AutoModelForSequenceClassification.from_pretrained("alexanderquispe/naics-github-classifier")
88
+ tokenizer = AutoTokenizer.from_pretrained("alexanderquispe/naics-github-classifier")
89
+
90
+ # Format input
91
+ text = "Repository: mediscan | Description: AI diagnostic tool for radiology | Topics: healthcare; medical-imaging; deep-learning | README: MediScan uses computer vision to assist radiologists..."
92
+
93
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
94
+ outputs = model(**inputs)
95
+ predicted_class = torch.argmax(outputs.logits, dim=1).item()
96
+
97
+ # Map to NAICS code
98
+ id2label = model.config.id2label
99
+ print(f"Predicted NAICS: {id2label[predicted_class]}") # 62 (Health Care)
100
+ ```
101
+
102
+ ## Input Format
103
+
104
+ The model expects text in this format:
105
+
106
+ ```
107
+ Repository: {repo_name} | Description: {description} | Topics: {topics} | README: {readme_content}
108
+ ```
109
+
110
+ | Field | Required | Description |
111
+ |-------|----------|-------------|
112
+ | Repository | Yes | Repository name |
113
+ | Description | No | Short description |
114
+ | Topics | No | Semicolon-separated tags |
115
+ | README | No | README content (can be truncated) |
116
+
117
+ ## Training Details
118
+
119
+ ### Training Data
120
+
121
+ - **Source:** GitHub repositories labeled with NAICS codes
122
+ - **Size:** 6,588 examples
123
+ - **Classes:** 19 NAICS sectors
124
+ - **Split:** 70% train / 10% validation / 20% test
125
+
126
+ ### Training Hyperparameters
127
+
128
+ | Parameter | Value |
129
+ |-----------|-------|
130
+ | Base Model | `roberta-large` |
131
+ | Batch Size | 32 |
132
+ | Learning Rate | 2e-5 |
133
+ | Epochs | 8 |
134
+ | Max Sequence Length | 512 |
135
+ | Optimizer | AdamW |
136
+ | Weight Decay | 0.01 |
137
+ | Early Stopping Patience | 5 |
138
+
139
+ ### Preprocessing
140
+
141
+ Text preprocessing includes:
142
+ - Removal of markdown badges and formatting
143
+ - URL cleaning (keep domain names)
144
+ - License header removal
145
+ - Code block removal (keep language indicators)
146
+ - Technology term normalization (js → javascript, py → python)
147
+ - Whitespace normalization
148
+
149
+ ## Limitations
150
+
151
+ - Trained primarily on English repositories
152
+ - May not generalize to non-software repositories
153
+ - NAICS code 55 (Management of Companies) excluded due to limited training data
154
+ - Performance may vary for repositories with minimal README content
155
+
156
+ ## Citation
157
+
158
+ ```bibtex
159
+ @misc{naics-github-classifier,
160
+ author = {Alexander Quispe},
161
+ title = {NAICS GitHub Repository Classifier},
162
+ year = {2025},
163
+ publisher = {Hugging Face},
164
+ url = {https://huggingface.co/alexanderquispe/naics-github-classifier}
165
+ }
166
+ ```
167
+
168
+ ## Repository
169
+
170
+ Training code and data preparation: [github.com/alexanderquispe/naics-github-train](https://github.com/alexanderquispe/naics-github-train)
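
The input format documented in the updated README (`Repository: ... | Description: ... | Topics: ... | README: ...`) can be assembled programmatically. The following is a minimal sketch, not code from the training repository: the helper name `build_input` and the decision to leave out empty optional fields are assumptions, the latter based on the Quick Start example, which has no `Topics` field.

```python
from typing import Optional, Sequence


def build_input(
    repo_name: str,
    description: Optional[str] = None,
    topics: Optional[Sequence[str]] = None,
    readme: Optional[str] = None,
) -> str:
    """Join repository fields into the 'Field: value | ...' layout from the README.

    Hypothetical helper: the name and the choice to skip empty optional
    fields are assumptions, not part of the published model card.
    """
    if not repo_name:
        # The README marks the Repository field as the only required one.
        raise ValueError("repo_name is required")

    parts = [f"Repository: {repo_name}"]
    if description:
        parts.append(f"Description: {description}")
    if topics:
        # The README describes topics as semicolon-separated tags.
        parts.append("Topics: " + "; ".join(topics))
    if readme:
        parts.append(f"README: {readme}")
    return " | ".join(parts)


# Rebuilds the Quick Start input string from its individual fields.
text = build_input(
    "bank-api",
    description="REST API for banking transactions",
    readme="A secure API for financial operations",
)
print(text)
```

The resulting string can be passed to the `pipeline` classifier or the tokenizer exactly as in the README examples. Whether the model was trained with empty fields omitted or left blank is not stated in the model card, so this behaviour may need adjusting.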