---
pipeline_tag: text-classification
tags:
- text-classification
- cefr
- word2vec
- doc2vec
- nlp
language:
- en
license: mit
---
# CEFR Doc2Vec Classifier
A Doc2Vec-based neural network model for classifying English text by CEFR (Common European Framework of Reference for Languages) proficiency levels.
The source code to train this model can be found at: https://github.com/luantran/One-model-to-grade-them-all
## Model Description
This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency levels. The Doc2Vec classifier uses document embeddings fed into a fully connected neural network to capture semantic patterns characteristic of different proficiency levels.
The other models in this ensemble are:
- https://huggingface.co/theluantran/cefr-naive-bayes
- https://huggingface.co/theluantran/cefr-bert-classifier
## Labels
The model classifies text into 5 CEFR proficiency levels:
- A1: Beginner
- A2: Elementary
- B1: Intermediate
- B2: Upper Intermediate
- C1/C2: Advanced
## Model Details
- Type: Doc2Vec + Fully Connected Neural Network
- Frameworks: gensim (Doc2Vec), PyTorch (Neural Network)
- Task: Multi-class text classification
- Architecture:
  - Doc2Vec embedding: document vectors of dimension `embedding_dim` (100 in the configuration below)
  - Neural network: 128 hidden units with dropout (0.3)
  - Output: 5-class softmax classification
- Input: Raw text strings
- Output: Class predictions (0-4) with probability distributions
- Files:
  - `doc2vec_model.bin`: Trained Doc2Vec model (gensim binary format)
  - `nn_weights.pth`: Neural network state dictionary (PyTorch)
  - `config.json`: Model configuration (embedding_dim, hidden_dim, num_classes, dropout_rate)
## Usage
### Basic Prediction
```python
from huggingface_hub import snapshot_download
from gensim.models import Doc2Vec
import torch
import torch.nn as nn
import numpy as np
import json
import os

# Download model files
local_dir = "./doc2vec_model"
snapshot_download(
    repo_id="theluantran/cefr-doc2vec",
    local_dir=local_dir,
    local_dir_use_symlinks=False,
    allow_patterns=[
        "doc2vec_model*",
        "*.json",
        "nn_weights.pth"
    ]
)

# Define neural network architecture
class Doc2VecClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, num_classes, dropout=0.3):
        super(Doc2VecClassifier, self).__init__()
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Load Doc2Vec model
doc2vec_model = Doc2Vec.load(os.path.join(local_dir, "doc2vec_model.bin"))

# Load configuration
with open(os.path.join(local_dir, "config.json"), 'r') as f:
    config = json.load(f)

# Reconstruct and load neural network (map_location keeps CPU-only machines working)
neural_network = Doc2VecClassifier(
    embedding_dim=config['embedding_dim'],
    hidden_dim=config['hidden_dim'],
    num_classes=config['num_classes'],
    dropout=config['dropout_rate']
)
neural_network.load_state_dict(
    torch.load(os.path.join(local_dir, "nn_weights.pth"), map_location=torch.device("cpu"))
)
neural_network.eval()

# Predict
text = "This is a sample text to classify"
vector = doc2vec_model.infer_vector(text.split())
with torch.no_grad():
    tensor = torch.FloatTensor(vector).unsqueeze(0)
    output = neural_network(tensor)
    probabilities = torch.softmax(output, dim=1)
    probs_array = probabilities.numpy()[0]
    prediction = int(np.argmax(probs_array))

# Map numeric prediction to CEFR level
level_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
predicted_level = level_map[prediction]

print(f"Predicted level: {predicted_level}")
print(f"Confidence: {max(probs_array):.2%}")
print(f"All probabilities: {dict(zip(level_map.values(), probs_array))}")
```
## Model Configuration

The `config.json` file contains the following parameters:
```json
{
  "embedding_dim": 100,
  "hidden_dim": 128,
  "num_classes": 5,
  "dropout_rate": 0.3
}
```
## Training
This model was trained on proprietary CEFR-labeled text data. The training process involves three steps (see the sketch below):
- Doc2Vec Embedding Training: Training Doc2Vec embeddings on the corpus for 10 epochs with a minimum word count of 1
- Document Vector Generation: Inferring a document vector (of dimension `embedding_dim`, see the configuration above) for every training sample
- Neural Network Training: Training a fully connected neural network classifier on these embeddings
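The actual training script lives in the GitHub repository linked above; the snippet below is only a minimal, illustrative sketch of these three steps. The placeholder data (`train_texts`, `train_labels`), the optimizer settings, and the epoch count for the classifier are assumptions, not the repository's code; the Doc2Vec hyperparameters (10 epochs, min_count=1) and the classifier shape (128 hidden units, dropout 0.3, 5 classes) follow this card.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np
import torch
import torch.nn as nn

# --- Placeholder data; the real corpus is proprietary ---
train_texts = ["she go to school every day",
               "the committee's findings warrant a more nuanced interpretation"]
train_labels = [0, 4]  # 0=A1 ... 4=C1/C2

# 1) Doc2Vec embedding training: 10 epochs, minimum word count of 1
tagged_docs = [TaggedDocument(words=text.split(), tags=[str(i)])
               for i, text in enumerate(train_texts)]
doc2vec = Doc2Vec(vector_size=100, min_count=1, epochs=10)  # vector_size should match embedding_dim
doc2vec.build_vocab(tagged_docs)
doc2vec.train(tagged_docs, total_examples=doc2vec.corpus_count, epochs=doc2vec.epochs)

# 2) Document vector generation for all training samples
X = torch.from_numpy(
    np.vstack([doc2vec.infer_vector(text.split()) for text in train_texts])
).float()
y = torch.tensor(train_labels, dtype=torch.long)

# 3) Neural network training on the embeddings
# (Doc2VecClassifier is the class defined in the usage example above)
classifier = Doc2VecClassifier(embedding_dim=doc2vec.vector_size,
                               hidden_dim=128, num_classes=5, dropout=0.3)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)  # optimizer and lr are assumptions
criterion = nn.CrossEntropyLoss()

classifier.train()
for epoch in range(20):  # epoch count is illustrative
    optimizer.zero_grad()
    loss = criterion(classifier(X), y)
    loss.backward()
    optimizer.step()
```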
## License
This model is released for research and educational purposes. The training data is proprietary and not included.