Commit 3ee6292 by jimnoneill · verified · 1 parent: 2c23cee

Upload model card

Files changed (1)
  1. README.md +28 -15
README.md CHANGED
@@ -1,10 +1,12 @@
  ---
  language: en
  license: mit
- library_name: model2vec
+ library_name: transformers
  tags:
- - model2vec
- - static-embeddings
+ - transformers
+ - electra
+ - biomedical
+ - text-classification
  - topic-classification
  - openalex
  - scientific-papers
@@ -13,11 +15,18 @@ datasets:
  - jimnoneill/paper-to-field-training
  ---

- # Paper-to-Field Classifier
+ # Paper-to-Field Classifier (v3)

- Lightweight CPU-based topic classifier for scientific paper abstracts using the
+ Transformer-based topic classifier for scientific paper abstracts using the
  [OpenAlex taxonomy](https://docs.openalex.org/api-entities/topics) (4,516 topics → 245 subfields → 26 fields → 4 domains).

+ ## Performance
+
+ | Metric | Accuracy |
+ |--------|----------|
+ | Field (26 classes) | **86.3%** |
+ | Domain (4 classes) | **94.4%** |
+
  ## Usage

  ```python
@@ -35,26 +44,30 @@ print(result)
  # {
  # 'topic': {'id': 10209, 'name': 'Neural Machine Translation and Sequence Models', 'score': 0.87},
  # 'subfield': {'id': 1702, 'name': 'Artificial Intelligence'},
- # 'field': {'id': 17, 'name': 'Computer Science'},
+ # 'field': {'id': 17, 'name': 'Computer Science', 'score': 0.95},
  # 'domain': {'id': 3, 'name': 'Physical Sciences'}
  # }
  ```

  ## Model Details

- - **Base model**: [minishlab/potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) (Model2Vec)
- - **Fine-tuned on**: ~50K domain-balanced paper abstracts from OpenAlex
+ - **Architecture**: BioM-ELECTRA-Large (~335M params) fine-tuned for 26-class field classification
+ - **Fine-tuned on**: ~200K paper abstracts with DeepSeek-verified field labels (domain-balanced)
+ - **Label quality**: Training labels verified by DeepSeek LLM, replacing noisy OpenAlex labels (~50% error rate)
  - **Taxonomy**: OpenAlex (4,516 topics, 245 subfields, 26 fields, 4 domains)
- - **Input**: Paper title + abstract (truncated to 500 chars)
- - **Inference**: CPU-only, ~3,000 papers/second
+ - **Input**: Paper title + abstract (tokenizer truncates at 384 tokens)
+ - **Field prediction**: Classification head (26 classes with sqrt-weighted cross-entropy for class imbalance)
+ - **Topic resolution**: [CLS] embeddings + FAISS nearest-neighbor within predicted field
+ - **GPU recommended** for inference (works on CPU but slower)

  ## Training

- Trained on OpenAlex bulk data with papers filtered for:
- - English language
- - Has abstract
- - Primary topic confidence score > 0.8
- - Domain-balanced sampling (~12.5K per domain)
+ Trained on 200K domain-balanced paper abstracts from OpenAlex bulk data, re-annotated with
+ DeepSeek LLM for high-quality field labels (confidence >= 0.7 filter applied).
+
+ Hyperparameters: lr=1e-5, cosine schedule, batch=32 (grad accum 2 = effective 64), epochs=8,
+ warmup=6%, label smoothing=0.1, fp16, early stopping (patience 5), sqrt inverse-frequency
+ class weights.

  ## Install

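
Note on the diff above: the updated model card mentions "sqrt inverse-frequency class weights" for the 26-class field head but does not show how they are computed. The following is a minimal sketch of one common way to derive such weights, not the author's actual training code. The function name `sqrt_inverse_frequency_weights`, the rescaling so that the weights average to 1, and the zero weight for empty classes are all assumptions made for illustration.

```python
import math
from collections import Counter


def sqrt_inverse_frequency_weights(labels, num_classes):
    """Return one weight per class, proportional to 1 / sqrt(class count).

    Assumptions (not taken from the model card): weights are rescaled so
    they average to 1 over the classes that occur, and classes with no
    examples receive weight 0.0.
    """
    counts = Counter(labels)
    raw = [
        1.0 / math.sqrt(counts[c]) if counts[c] > 0 else 0.0
        for c in range(num_classes)
    ]
    present = sum(1 for w in raw if w > 0)
    scale = present / sum(raw) if present else 0.0
    return [w * scale for w in raw]


# Toy label distribution: class 0 has 100 samples, class 1 has 25, class 2 has 4.
labels = [0] * 100 + [1] * 25 + [2] * 4
weights = sqrt_inverse_frequency_weights(labels, num_classes=3)
print(weights)
```

Taking the square root softens plain inverse-frequency weighting: in the toy example the rarest class is 25 times less frequent than the most common one but receives only 5 times the weight. In a PyTorch training loop, a list like this could plausibly be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, together with `label_smoothing=0.1` as listed in the hyperparameters; whether the original training did exactly this is not stated in the commit.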