Create README.md
README.md
CHANGED
|
@@ -1,3 +1,20 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: mit
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
datasets:
|
| 4 |
+
- christopher/rosetta-code
|
| 5 |
+
pipeline_tag: text-classification
|
| 6 |
+
tags:
|
| 7 |
+
- code
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
This is a Core ML model that identifies the following programming languages:
|
| 11 |
+
|
| 12 |
+
```go, lua, perl, python, apl, shell, c, c#, c++, cobol, lisp, erlang, fortran, groovy, haskell, java, javascript, kotlin, objective-c, pascal, php, powershell, r, ruby, rust, scala, scheme, swift, dart, sql, text, mysql, typescript, ecma, cmake, html, latex, jinja, json, toml, css```
|
| 13 |
+
|
| 14 |
+
It was trained on a cleaned-up and filtered version of the rosetta-code dataset (https://huggingface.co/datasets/christopher/rosetta-code).
|
| 15 |
+
|
| 16 |
+
## ProgrammingLanguageIdentificationV1
|
| 17 |
+
First version of the programming language identification model. It was trained on 20,362 data points (including the validation set, which was picked automatically).
|
| 18 |
+
Because the number of snippets differs widely between languages (lowest: css, ecma, and toml with 1 each; highest: go with 1110), accuracy varies a lot from one language to another. Overall accuracy is 98.8% on the training and validation data.
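
To illustrate why a single overall figure can hide weak languages, here is a minimal Python sketch comparing overall accuracy with the unweighted per-language average. Only the snippet totals for go (1110) and css (1) come from this README; the `python` total and every "correct" count are hypothetical values chosen for illustration, not measured results of this model.

```python
# (correct predictions, total snippets) per language.
# Totals for "go" and "css" are from the README; everything else is hypothetical.
tallies = {
    "go": (1105, 1110),     # hypothetical correct count
    "python": (990, 1000),  # hypothetical total and correct count
    "css": (0, 1),          # hypothetical correct count
}


def per_language_accuracy(tallies):
    """Accuracy of each language taken on its own."""
    return {lang: correct / total for lang, (correct, total) in tallies.items()}


def overall_accuracy(tallies):
    """Accuracy over all snippets pooled together (dominated by large classes)."""
    correct = sum(c for c, _ in tallies.values())
    total = sum(t for _, t in tallies.values())
    return correct / total


def macro_accuracy(tallies):
    """Unweighted mean of per-language accuracies (every language counts equally)."""
    accuracies = per_language_accuracy(tallies)
    return sum(accuracies.values()) / len(accuracies)


if __name__ == "__main__":
    for lang, acc in per_language_accuracy(tallies).items():
        print(f"{lang}: {acc:.3f}")
    print(f"overall: {overall_accuracy(tallies):.3f}")
    print(f"macro:   {macro_accuracy(tallies):.3f}")
```

With tallies like these, the pooled figure stays high even if a language with a single snippet is always misclassified, so a per-language breakdown is the more informative number for rare languages.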
|
| 19 |
+
|
| 20 |
+
Future versions will focus on increasing dataset size.
|