MothMalone committed · verified
Commit 5b82c6a · 1 Parent(s): 366a570

Upload transformer-vi2en-v2 model

Files changed (6)
  1. README.md +125 -0
  2. best_model.pt +3 -0
  3. config.yaml +82 -0
  4. requirements.txt +4 -0
  5. src_vocab.json +0 -0
  6. tgt_vocab.json +0 -0
README.md ADDED
@@ -0,0 +1,125 @@
+ ---
+ language:
+ - vi
+ - en
+ tags:
+ - translation
+ - transformer
+ - seq2seq
+ license: mit
+ datasets:
+ - iwslt2015
+ metrics:
+ - bleu
+ ---
+
+ # v2_vi2en - Vietnamese-English Translation
+
+ ## Model Description
+
+ Improved Vi→En training with label smoothing and AdamW.
+
+ This model is trained from scratch using the Transformer architecture for machine translation.
+
+ ### Model Details
+
+ - **Language pair**: Vietnamese → English
+ - **Architecture**: Transformer (Encoder-Decoder)
+ - **Parameters**:
+   - d_model: 512
+   - n_heads: 8
+   - n_encoder_layers: 6
+   - n_decoder_layers: 6
+   - d_ff: 2048
+   - dropout: 0.1
+
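+ For orientation, the sketch below shows how these hyperparameters map onto PyTorch's built-in `torch.nn.Transformer`. This is a minimal illustration only, not the repository's implementation: the actual model class lives in `src.models.transformer` (see Usage) and adds token embeddings, positional encoding, and an output projection on top of the encoder-decoder core.
+
+ ```python
+ import torch.nn as nn
+
+ # Hypothetical stand-in with the same dimensions as this model card;
+ # the repository's own Transformer class is what is actually used.
+ core = nn.Transformer(
+     d_model=512,           # model / embedding width
+     nhead=8,               # attention heads
+     num_encoder_layers=6,
+     num_decoder_layers=6,
+     dim_feedforward=2048,  # inner feed-forward width
+     dropout=0.1,
+     batch_first=True,      # (batch, seq, feature) layout; an assumption, not from the card
+ )
+ ```
+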
+ ### Training Details
+
+ - **Optimizer**: AdamW
+ - **Learning Rate**: 0.0001
+ - **Batch Size**: 32 (effective 64 via gradient accumulation)
+ - **Label Smoothing**: 0.1
+ - **Scheduler**: warmup (4,000 steps)
+ - **Dataset**: IWSLT 2015 Vi-En
+
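+ The recipe above can be reproduced with standard PyTorch pieces: `torch.optim.AdamW`, `nn.CrossEntropyLoss(label_smoothing=...)`, and a warmup schedule. A minimal sketch, assuming `model` is the Transformer instance; the exact shape of the "warmup" scheduler is not specified in this card, so linear warmup to the peak LR is assumed here:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ # Values from config.yaml
+ lr, weight_decay, warmup_steps, smoothing = 1e-4, 0.01, 4000, 0.1
+
+ optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
+
+ # Linear warmup to the peak LR, then constant (assumed schedule).
+ scheduler = torch.optim.lr_scheduler.LambdaLR(
+     optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
+ )
+
+ # Label-smoothed cross-entropy; ignore_index=0 matches pad_idx=0 used below.
+ criterion = nn.CrossEntropyLoss(label_smoothing=smoothing, ignore_index=0)
+ ```
+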
+ ### Performance
+
+ The model is evaluated with BLEU on the IWSLT 2015 Vi-En test set.
+
+ ### Improvements
+
+ - Label smoothing (0.1)
+ - AdamW optimizer with weight decay
+ - Beam search (size=5)
+ - Gradient accumulation (see the sketch after this list)
+ - Early stopping
+
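+ Gradient accumulation is what turns the per-step batch of 32 into the effective batch of 64 (`gradient_accumulation_steps: 2` in config.yaml): gradients from two mini-batches are summed before one optimizer step. Continuing the sketch above, with a hypothetical `train_loader` of (src, tgt) batches and an assumed `model(src, tgt_in)` teacher-forcing interface:
+
+ ```python
+ accum_steps = 2      # config.yaml: gradient_accumulation_steps
+ max_grad_norm = 1.0  # config.yaml: max_grad_norm
+
+ optimizer.zero_grad()
+ for step, (src, tgt) in enumerate(train_loader):
+     logits = model(src, tgt[:, :-1])  # predict each target token from its prefix
+     loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
+     (loss / accum_steps).backward()   # scale so accumulated gradients average, not sum
+     if (step + 1) % accum_steps == 0:
+         torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
+         optimizer.step()
+         scheduler.step()
+         optimizer.zero_grad()
+ ```
+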
+ ## Usage
+
+ ```python
+ # Load model and translate
+ from src.models.transformer import Transformer
+ from src.inference.translator import Translator
+ from src.data.vocabulary import Vocabulary
+ import torch
+
+ # Load vocabularies
+ src_vocab = Vocabulary.load('src_vocab.json')
+ tgt_vocab = Vocabulary.load('tgt_vocab.json')
+
+ # Load model
+ model = Transformer(
+     src_vocab_size=len(src_vocab),
+     tgt_vocab_size=len(tgt_vocab),
+     d_model=512,
+     n_heads=8,
+     n_encoder_layers=6,
+     n_decoder_layers=6,
+     d_ff=2048,
+     dropout=0.1,
+     max_seq_length=512,
+     pad_idx=0
+ )
+
+ # map_location lets the checkpoint load regardless of where it was saved
+ checkpoint = torch.load('best_model.pt', map_location='cpu')
+ model.load_state_dict(checkpoint['model_state_dict'])
+ model.eval()  # disable dropout for inference
+
+ # Create translator (beam search, size 5)
+ translator = Translator(
+     model=model,
+     src_vocab=src_vocab,
+     tgt_vocab=tgt_vocab,
+     device='cuda',
+     decoding_method='beam',
+     beam_size=5
+ )
+
+ # Translate
+ vietnamese_text = "Xin chào, bạn khỏe không?"  # "Hello, how are you?"
+ translation = translator.translate(vietnamese_text)
+ print(translation)
+ ```
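+
+ Since the same hyperparameters ship in config.yaml (and PyYAML is pinned in requirements.txt), the constructor arguments above can also be read from the config instead of being hard-coded. A small sketch; the keys mirror the `model:` block of the config in this repo:
+
+ ```python
+ import yaml
+
+ with open('config.yaml') as f:
+     cfg = yaml.safe_load(f)
+
+ m = cfg['model']
+ model = Transformer(
+     src_vocab_size=len(src_vocab),
+     tgt_vocab_size=len(tgt_vocab),
+     d_model=m['d_model'],
+     n_heads=m['n_heads'],
+     n_encoder_layers=m['n_encoder_layers'],
+     n_decoder_layers=m['n_decoder_layers'],
+     d_ff=m['d_ff'],
+     dropout=m['dropout'],
+     max_seq_length=m['max_seq_length'],
+     pad_idx=0,
+ )
+ ```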
+
+ ## Training Data
+
+ - **Dataset**: IWSLT 2015 Vietnamese-English parallel corpus
+ - **Training pairs**: ~500,000 sentence pairs
+ - **Validation pairs**: ~50,000 sentence pairs
+ - **Test pairs**: ~3,000 sentence pairs
+
+ ## Limitations
+
+ - Trained specifically for Vietnamese to English translation
+ - Performance may vary on out-of-domain text
+ - Medical/technical domains may require fine-tuning
+
+ ## Citation
+
+ ```bibtex
+ @misc{nlp-transformer-mt,
+   author = {MothMalone},
+   title = {Transformer Machine Translation Vi-En},
+   year = {2025},
+   publisher = {HuggingFace},
+   howpublished = {\url{https://huggingface.co/MothMalone}}
+ }
+ ```
best_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2b3688b771169052d1bc9937c7bd3a218b2ac7a191094ce9f5ff3423e7a72239
+ size 1022502766
config.yaml ADDED
@@ -0,0 +1,82 @@
+ # Version 2: Improved Transformer
+ # With label smoothing, better optimizer, and beam search
+
+ # Data Configuration
+ data:
+   src_lang: "vi"
+   tgt_lang: "en"
+   train_src: "data/raw_opus100/train.vi.txt"
+   train_tgt: "data/raw_opus100/train.en.txt"
+   # Validation files (if they don't exist, auto-split from the training data)
+   val_src: "data/raw_opus100/val.vi.txt"
+   val_tgt: "data/raw_opus100/val.en.txt"
+   val_split: 0.1  # 10% of training data for validation if val files don't exist
+   test_src: "data/raw_opus100/public_test.vi.txt"
+   test_tgt: "data/raw_opus100/public_test.en.txt"
+   max_seq_length: 128
+
+ # Vocabulary
+ vocab:
+   src_vocab_size: 40000  # Increased for better Vietnamese coverage
+   tgt_vocab_size: 40000
+   min_freq: 2
+
+ # Model - Same architecture
+ model:
+   d_model: 512
+   n_heads: 8
+   n_encoder_layers: 6
+   n_decoder_layers: 6
+   d_ff: 2048
+   dropout: 0.1
+   max_seq_length: 512
+
+ # Training - Improved
+ training:
+   batch_size: 32
+   epochs: 8  # Good balance of quality and time
+   optimizer: "adamw"  # Changed to AdamW
+   learning_rate: 0.0001
+   weight_decay: 0.01  # Added weight decay
+   scheduler: "warmup"
+   warmup_steps: 4000
+   label_smoothing: 0.1  # Added label smoothing
+   gradient_accumulation_steps: 2  # Effective batch size = 64
+   max_grad_norm: 1.0
+
+   use_wandb: true
+   save_every: 1000
+   eval_every: 500
+   log_every: 100
+   early_stopping_patience: 5
+
+ # Inference
+ inference:
+   beam_size: 5  # Beam search
+   max_decode_length: 128
+   length_penalty: 0.6
+
+ # Paths
+ paths:
+   checkpoint_dir: "experiments/v2_vi2en/checkpoints"
+   log_dir: "experiments/v2_vi2en/logs"
+   vocab_dir: "data/vocab_v2_vi2en"
+
+ device: "cuda"
+ seed: 42
+
+ # Weights & Biases
+ wandb:
+   project: "nlp-transformer-mt"
+   entity: null
+
+ # Version info
+ version:
+   name: "v2_vi2en"
+   description: "Improved Vi→En training with label smoothing and AdamW"
+   improvements:
+     - "Label smoothing (0.1)"
+     - "AdamW optimizer with weight decay"
+     - "Beam search (size=5)"
+     - "Gradient accumulation"
+     - "Early stopping"
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ torch>=2.0.0
+ numpy>=1.21.0
+ pyyaml>=6.0
+ tqdm>=4.65.0
src_vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
tgt_vocab.json ADDED
The diff for this file is too large to render. See raw diff