---
pipeline_tag: text-classification
tags:
- text-classification
- cefr
- word2vec
- doc2vec
- nlp
language:
- en
license: mit
---
# CEFR Doc2Vec Classifier
A Doc2Vec-based neural network model for classifying English text by CEFR (Common European Framework of Reference for Languages) proficiency levels.
The source code to train this model can be found at: https://github.com/luantran/One-model-to-grade-them-all
## Model Description
This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency levels. The Doc2Vec classifier uses document embeddings fed into a fully connected neural network to capture semantic patterns characteristic of different proficiency levels.
The other models in this ensemble are:
- https://huggingface.co/theluantran/cefr-naive-bayes
- https://huggingface.co/theluantran/cefr-bert-classifier
## Labels
The model classifies text into 5 CEFR proficiency levels:
- A1: Beginner
- A2: Elementary
- B1: Intermediate
- B2: Upper Intermediate
- C1/C2: Advanced
## Model Details
- Type: Doc2Vec + Fully Connected Neural Network
- Frameworks: gensim (Doc2Vec), PyTorch (Neural Network)
- Task: Multi-class text classification
- Architecture:
  - Doc2Vec embedding: document vectors of dimension `embedding_dim` (100 in the configuration below)
  - Neural network: 128 hidden units with dropout (0.3)
  - Output: 5-class softmax classification
- Input: Raw text strings
- Output: Class predictions (0-4) with probability distributions
- Files:
  - `doc2vec_model.bin`: Trained Doc2Vec model (gensim binary format)
  - `nn_weights.pth`: Neural network state dictionary (PyTorch)
  - `config.json`: Model configuration (embedding_dim, hidden_dim, num_classes, dropout_rate)
## Usage
### Basic Prediction
```python
from huggingface_hub import snapshot_download
from gensim.models import Doc2Vec
import torch
import torch.nn as nn
import numpy as np
import json
import os

# Download model files
local_dir = "./doc2vec_model"
snapshot_download(
    repo_id="theluantran/cefr-doc2vec",
    local_dir=local_dir,
    local_dir_use_symlinks=False,
    allow_patterns=[
        "doc2vec_model*",
        "*.json",
        "nn_weights.pth"
    ]
)

# Define neural network architecture
class Doc2VecClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, num_classes, dropout=0.3):
        super(Doc2VecClassifier, self).__init__()
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Load Doc2Vec model
doc2vec_model = Doc2Vec.load(os.path.join(local_dir, "doc2vec_model.bin"))

# Load configuration
with open(os.path.join(local_dir, "config.json"), 'r') as f:
    config = json.load(f)

# Reconstruct and load neural network (map_location keeps CPU-only machines working)
neural_network = Doc2VecClassifier(
    embedding_dim=config['embedding_dim'],
    hidden_dim=config['hidden_dim'],
    num_classes=config['num_classes'],
    dropout=config['dropout_rate']
)
neural_network.load_state_dict(
    torch.load(os.path.join(local_dir, "nn_weights.pth"), map_location=torch.device("cpu"))
)
neural_network.eval()

# Predict
text = "This is a sample text to classify"
vector = doc2vec_model.infer_vector(text.split())
with torch.no_grad():
    tensor = torch.FloatTensor(vector).unsqueeze(0)
    output = neural_network(tensor)
    probabilities = torch.softmax(output, dim=1)
    probs_array = probabilities.numpy()[0]
    prediction = int(np.argmax(probs_array))

# Map numeric prediction to CEFR level
level_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
predicted_level = level_map[prediction]

print(f"Predicted level: {predicted_level}")
print(f"Confidence: {max(probs_array):.2%}")
print(f"All probabilities: {dict(zip(level_map.values(), probs_array))}")
```
## Model Configuration

The `config.json` file contains the following parameters:
```json
{
  "embedding_dim": 100,
  "hidden_dim": 128,
  "num_classes": 5,
  "dropout_rate": 0.3
}
```
## Training
This model was trained on proprietary CEFR-labeled text data. The training process involves three steps (see the sketch below):
- Doc2Vec Embedding Training: Training Doc2Vec embeddings on the corpus for 10 epochs with a minimum word count of 1
- Document Vector Generation: Inferring a document vector (of dimension `embedding_dim`, see the configuration above) for every training sample
- Neural Network Training: Training a fully connected neural network classifier on these embeddings
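The actual training script lives in the GitHub repository linked above; the snippet below is only a minimal, illustrative sketch of these three steps. The placeholder data (`train_texts`, `train_labels`), the optimizer settings, and the epoch count for the classifier are assumptions, not the repository's code; the Doc2Vec hyperparameters (10 epochs, min_count=1) and the classifier shape (128 hidden units, dropout 0.3, 5 classes) follow this card.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np
import torch
import torch.nn as nn

# --- Placeholder data; the real corpus is proprietary ---
train_texts = ["she go to school every day",
               "the committee's findings warrant a more nuanced interpretation"]
train_labels = [0, 4]  # 0=A1 ... 4=C1/C2

# 1) Doc2Vec embedding training: 10 epochs, minimum word count of 1
tagged_docs = [TaggedDocument(words=text.split(), tags=[str(i)])
               for i, text in enumerate(train_texts)]
doc2vec = Doc2Vec(vector_size=100, min_count=1, epochs=10)  # vector_size should match embedding_dim
doc2vec.build_vocab(tagged_docs)
doc2vec.train(tagged_docs, total_examples=doc2vec.corpus_count, epochs=doc2vec.epochs)

# 2) Document vector generation for all training samples
X = torch.from_numpy(
    np.vstack([doc2vec.infer_vector(text.split()) for text in train_texts])
).float()
y = torch.tensor(train_labels, dtype=torch.long)

# 3) Neural network training on the embeddings
# (Doc2VecClassifier is the class defined in the usage example above)
classifier = Doc2VecClassifier(embedding_dim=doc2vec.vector_size,
                               hidden_dim=128, num_classes=5, dropout=0.3)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)  # optimizer and lr are assumptions
criterion = nn.CrossEntropyLoss()

classifier.train()
for epoch in range(20):  # epoch count is illustrative
    optimizer.zero_grad()
    loss = criterion(classifier(X), y)
    loss.backward()
    optimizer.step()
```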
## License
This model is released for research and educational purposes. The training data is proprietary and not included.