---
pipeline_tag: text-classification
tags:
- text-classification
- cefr
- word2vec
- doc2vec
- nlp
language:
- en
license: mit
---
# CEFR Doc2Vec Classifier

A Doc2Vec-based neural network model for classifying English text by CEFR (Common European Framework of Reference for Languages) proficiency levels.

The source code used to train this model is available at:
https://github.com/luantran/One-model-to-grade-them-all

## Model Description

This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency levels.
The Doc2Vec classifier feeds document embeddings into a fully connected neural network to capture semantic patterns characteristic of different proficiency levels.

The other models in this ensemble are:
- https://huggingface.co/theluantran/cefr-naive-bayes
- https://huggingface.co/theluantran/cefr-bert-classifier
30
+ ## Labels
31
+
32
+ The model classifies text into 5 CEFR proficiency levels:
33
+
34
+ * **A1**: Beginner
35
+ * **A2**: Elementary
36
+ * **B1**: Intermediate
37
+ * **B2**: Upper Intermediate
38
+ * **C1/C2**: Advanced
39
+
## Model Details

* **Type**: Doc2Vec + fully connected neural network
* **Frameworks**: gensim (Doc2Vec), PyTorch (neural network)
* **Task**: Multi-class text classification
* **Architecture**:
  * Doc2Vec embedding: document vectors sized by `embedding_dim` (100 in the shipped `config.json`)
  * Neural network: 128 hidden units with dropout (0.3)
  * Output: 5-class softmax classification
* **Input**: Raw text strings
* **Output**: Class predictions (0-4) with probability distributions
* **Files**:
  * `doc2vec_model.bin`: Trained Doc2Vec model (gensim binary format)
  * `nn_weights.pth`: Neural network state dictionary (PyTorch)
  * `config.json`: Model configuration (`embedding_dim`, `hidden_dim`, `num_classes`, `dropout_rate`)

## Usage

### Basic Prediction
```python
from huggingface_hub import snapshot_download
from gensim.models import Doc2Vec
import torch
import torch.nn as nn
import numpy as np
import json
import os

# Download model files
local_dir = "./doc2vec_model"
snapshot_download(
    repo_id="theluantran/cefr-doc2vec",
    local_dir=local_dir,
    allow_patterns=[
        "doc2vec_model*",
        "*.json",
        "nn_weights.pth"
    ]
)

# Define the neural network architecture
class Doc2VecClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, num_classes, dropout=0.3):
        super().__init__()
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Load the Doc2Vec model
doc2vec_model = Doc2Vec.load(os.path.join(local_dir, "doc2vec_model.bin"))

# Load the configuration
with open(os.path.join(local_dir, "config.json"), "r") as f:
    config = json.load(f)

# Reconstruct the neural network and load its weights
neural_network = Doc2VecClassifier(
    embedding_dim=config["embedding_dim"],
    hidden_dim=config["hidden_dim"],
    num_classes=config["num_classes"],
    dropout=config["dropout_rate"]
)
neural_network.load_state_dict(
    torch.load(os.path.join(local_dir, "nn_weights.pth"), map_location="cpu")
)
neural_network.eval()  # disable dropout for inference

# Predict
text = "This is a sample text to classify"
vector = doc2vec_model.infer_vector(text.split())

with torch.no_grad():
    tensor = torch.FloatTensor(vector).unsqueeze(0)
    output = neural_network(tensor)
    probabilities = torch.softmax(output, dim=1)

probs_array = probabilities.numpy()[0]
prediction = int(np.argmax(probs_array))

# Map the numeric prediction to a CEFR level
level_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
predicted_level = level_map[prediction]

print(f"Predicted level: {predicted_level}")
print(f"Confidence: {max(probs_array):.2%}")
print(f"All probabilities: {dict(zip(level_map.values(), probs_array))}")
```
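
The decoding at the end of the example can be factored into a small reusable helper. This is a sketch; `decode_probs` is not part of the repository:

```python
import numpy as np

LEVEL_MAP = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}

def decode_probs(probs_array):
    """Turn a 5-element probability array into (level, confidence, per-level dict)."""
    probs_array = np.asarray(probs_array, dtype=float)
    prediction = int(np.argmax(probs_array))
    per_level = {LEVEL_MAP[i]: float(p) for i, p in enumerate(probs_array)}
    return LEVEL_MAP[prediction], float(probs_array[prediction]), per_level

level, confidence, per_level = decode_probs([0.05, 0.10, 0.20, 0.45, 0.20])
```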

## Model Configuration

The `config.json` file contains the following parameters:
```json
{
  "embedding_dim": 100,
  "hidden_dim": 128,
  "num_classes": 5,
  "dropout_rate": 0.3
}
```

## Training

This model was trained on proprietary CEFR-labeled text data. The training process involves:

1. **Doc2Vec Embedding Training**: Training Doc2Vec embeddings on the corpus for 10 epochs with a minimum word count of 1
2. **Document Vector Generation**: Inferring a document vector for each training sample (100-dimensional, per `config.json`)
3. **Neural Network Training**: Training a fully connected neural network classifier on these embeddings

## License

This model is released for research and educational purposes. The training data is proprietary and not included.