MothMalone committed on
Commit bdb13ea · verified · 1 Parent(s): 08fe31b

Upload v3_vi2en_transformer model

Files changed (7)
  1. README.md +126 -0
  2. best_model.pt +3 -0
  3. config.yaml +88 -0
  4. requirements.txt +4 -0
  5. src_vocab.json +0 -0
  6. tgt_vocab.json +0 -0
  7. training_metrics.json +0 -0
README.md ADDED
@@ -0,0 +1,126 @@
+ ---
+ language:
+ - vi
+ - en
+ tags:
+ - translation
+ - transformer
+ - seq2seq
+ license: mit
+ datasets:
+ - iwslt2015
+ metrics:
+ - bleu
+ ---
+
+ # v3_vi2en - Vietnamese-English Translation
+
+ ## Model Description
+
+ An optimized large Vi→En model trained with advanced techniques.
+
+ This model is trained from scratch using the Transformer architecture for machine translation.
+
+ ### Model Details
+
+ - **Language pair**: Vietnamese → English
+ - **Architecture**: Transformer (Encoder-Decoder)
+ - **Parameters**:
+   - d_model: 1024
+   - n_heads: 16
+   - n_encoder_layers: 6
+   - n_decoder_layers: 6
+   - d_ff: 4096
+   - dropout: 0.3
+
+ ### Training Details
+
+ - **Optimizer**: AdamW
+ - **Learning Rate**: 0.0001
+ - **Batch Size**: 48 (with 8-step gradient accumulation, effective batch size 384)
+ - **Label Smoothing**: 0.1
+ - **Scheduler**: cosine with 8,000 warmup steps (sketched below)
+ - **Dataset**: IWSLT 2015 Vi-En
+
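+ This optimization setup can be reproduced in plain PyTorch. The following is a minimal sketch assuming a standard linear-warmup/cosine-decay schedule; the repository's own training code is not shown here, so the helper name and the `total_steps` value are illustrative.
+
+ ```python
+ import math
+ import torch
+
+ def make_optimization(model, lr=1e-4, warmup_steps=8000, total_steps=100_000):
+     # AdamW with the betas and weight decay from config.yaml
+     optimizer = torch.optim.AdamW(
+         model.parameters(), lr=lr, betas=(0.9, 0.98), weight_decay=0.01
+     )
+     # Linear warmup, then cosine decay to zero over the remaining steps
+     def lr_lambda(step):
+         if step < warmup_steps:
+             return step / max(1, warmup_steps)
+         progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
+         return 0.5 * (1.0 + math.cos(math.pi * progress))
+     scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
+     # Label smoothing as configured; ignore_index=0 assumes pad_idx=0
+     criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=0)
+     return optimizer, scheduler, criterion
+ ```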
+ ### Performance
+
+ BLEU is the tracked metric; per-checkpoint scores are logged in `training_metrics.json` in this repository.
+
+ ### Improvements
+
+ - Larger model (1024-dim, 16 heads)
+ - BPE tokenization (see the sketch below)
+ - Cosine learning rate schedule
+ - Mixed precision training
+ - Larger beam search (10)
+ - Longer sequences (256)
+
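+ The repository's own BPE tokenizer code is not shown here; as an illustration, a 50k-entry BPE vocabulary like the one in `config.yaml` could be built with the `tokenizers` library (an assumption, not this repo's implementation):
+
+ ```python
+ # Hypothetical BPE training sketch; the special-token names are assumptions.
+ from tokenizers import Tokenizer, models, pre_tokenizers, trainers
+
+ tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
+ tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
+ trainer = trainers.BpeTrainer(
+     vocab_size=50000,  # matches vocab.src_vocab_size in config.yaml
+     special_tokens=["<pad>", "<unk>", "<sos>", "<eos>"],
+ )
+ tokenizer.train(["data/raw_opus100/train.vi.txt"], trainer)
+ tokenizer.save("bpe_vi.json")
+ ```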
+ ## Usage
+
+ ```python
+ # Load model and translate
+ from src.models.transformer import Transformer
+ from src.inference.translator import Translator
+ from src.data.vocabulary import Vocabulary
+ import torch
+
+ # Load vocabularies
+ src_vocab = Vocabulary.load('src_vocab.json')
+ tgt_vocab = Vocabulary.load('tgt_vocab.json')
+
+ # Build the model with the same hyperparameters the checkpoint was trained
+ # with (see config.yaml: d_model=1024, n_heads=16, d_ff=4096, dropout=0.3)
+ model = Transformer(
+     src_vocab_size=len(src_vocab),
+     tgt_vocab_size=len(tgt_vocab),
+     d_model=1024,
+     n_heads=16,
+     n_encoder_layers=6,
+     n_decoder_layers=6,
+     d_ff=4096,
+     dropout=0.3,
+     max_seq_length=512,
+     pad_idx=0
+ )
+
+ checkpoint = torch.load('best_model.pt', map_location='cuda')
+ model.load_state_dict(checkpoint['model_state_dict'])
+ model.eval()
+
+ # Create translator (beam size 10 matches the inference settings in config.yaml)
+ translator = Translator(
+     model=model,
+     src_vocab=src_vocab,
+     tgt_vocab=tgt_vocab,
+     device='cuda',
+     decoding_method='beam',
+     beam_size=10
+ )
+
+ # Translate
+ vietnamese_text = "Xin chào, bạn khỏe không?"
+ translation = translator.translate(vietnamese_text)
+ print(translation)
+ ```
+
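+ For multiple inputs, the single-sentence API above can simply be called in a loop (a batched `translate` method is not documented here):
+
+ ```python
+ # Looping over sentences with the translator built above
+ sentences = ["Tôi thích học máy.", "Hẹn gặp lại ngày mai."]
+ for s in sentences:
+     print(translator.translate(s))
+ ```
+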
+ ## Training Data
+
+ - **Dataset**: IWSLT 2015 Vietnamese-English parallel corpus
+ - **Training pairs**: ~500,000 sentence pairs
+ - **Validation pairs**: ~50,000 sentence pairs
+ - **Test pairs**: ~3,000 sentence pairs
+
+ ## Limitations
+
+ - Trained specifically for Vietnamese-to-English translation
+ - Performance may vary on out-of-domain text
+ - Medical and technical domains may require fine-tuning
+
+ ## Citation
+
+ ```bibtex
+ @misc{nlp-transformer-mt,
+   author       = {MothMalone},
+   title        = {Transformer Machine Translation Vi-En},
+   year         = {2025},
+   publisher    = {HuggingFace},
+   howpublished = {\url{https://huggingface.co/MothMalone}}
+ }
+ ```
best_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bf5d084d0f5e95a3110c4c9ac57905cc477f8abd93c3fb067824eeb9a1a4b17f
+ size 3347137710
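This is a Git LFS pointer: the ~3.3 GB checkpoint is stored out of band, and the pointer's `oid` is the SHA-256 of the real file. A downloaded copy can be verified against it; a minimal sketch:

```python
# Verify a downloaded best_model.pt against the LFS pointer's sha256 oid
import hashlib

EXPECTED = "bf5d084d0f5e95a3110c4c9ac57905cc477f8abd93c3fb067824eeb9a1a4b17f"

digest = hashlib.sha256()
with open("best_model.pt", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        digest.update(chunk)
assert digest.hexdigest() == EXPECTED, "checkpoint is corrupt or incomplete"
```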
config.yaml ADDED
@@ -0,0 +1,88 @@
+ # Version 3: Vi→En Optimized Transformer
+ # Larger model with advanced techniques
+
+ # Data Configuration
+ data:
+   src_lang: "vi"
+   tgt_lang: "en"
+   train_src: "data/raw_opus100/train.vi.txt"
+   train_tgt: "data/raw_opus100/train.en.txt"
+   # Validation files (if they don't exist, a split is taken from training data)
+   val_src: "data/raw_opus100/val.vi.txt"
+   val_tgt: "data/raw_opus100/val.en.txt"
+   val_split: 0.1  # 10% of training data for validation if val files don't exist
+   test_src: "data/raw_opus100/public_test.vi.txt"
+   test_tgt: "data/raw_opus100/public_test.en.txt"
+   max_seq_length: 256  # Increased
+
+ # Vocabulary - BPE tokenization
+ vocab:
+   src_vocab_size: 50000  # Larger vocabulary
+   tgt_vocab_size: 50000
+   min_freq: 1
+   tokenization: "bpe"  # Use BPE instead of word-level
+
+ # Model - Larger (Transformer Big config)
+ model:
+   d_model: 1024  # Increased
+   n_heads: 16  # Increased
+   n_encoder_layers: 6
+   n_decoder_layers: 6
+   d_ff: 4096  # Increased
+   dropout: 0.3  # Higher dropout for regularization
+   max_seq_length: 512
+
+ # Training - Advanced
+ training:
+   batch_size: 48  # Smaller due to larger model
+   epochs: 10  # Good balance of quality and time
+   optimizer: "adamw"
+   learning_rate: 0.0001
+   weight_decay: 0.01
+   betas: [0.9, 0.98]
+   scheduler: "cosine"  # Cosine annealing
+   warmup_steps: 8000  # Longer warmup
+   label_smoothing: 0.1
+   gradient_accumulation_steps: 8  # Effective batch size = 48 × 8 = 384
+   max_grad_norm: 1.0
+
+   # Mixed precision training (if available)
+   use_amp: true
+   use_wandb: true  # Enable Weights & Biases logging
+
+   save_every: 20000  # Save less frequently to save disk space
+   eval_every: 100
+   log_every: 100
+   early_stopping_patience: 10
+
+ # Inference
+ inference:
+   beam_size: 10  # Larger beam
+   max_decode_length: 256
+   length_penalty: 0.8
+
+ # Paths
+ paths:
+   checkpoint_dir: "experiments/v3_vi2en/checkpoints"
+   log_dir: "experiments/v3_vi2en/logs"
+   vocab_dir: "data/vocab_v3_vi2en"
+
+ device: "cuda"
+ seed: 42
+
+ # Weights & Biases
+ wandb:
+   project: "nlp-transformer-mt"
+   entity: null  # Your wandb username (optional)
+
+ # Version info
+ version:
+   name: "v3_vi2en"
+   description: "Vi→En optimized large model with advanced techniques"
+   improvements:
+     - "Larger model (1024-dim, 16 heads)"
+     - "BPE tokenization"
+     - "Cosine learning rate schedule"
+     - "Mixed precision training"
+     - "Larger beam search (10)"
+     - "Longer sequences (256)"
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ torch>=2.0.0
+ numpy>=1.21.0
+ pyyaml>=6.0
+ tqdm>=4.65.0
src_vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
tgt_vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
training_metrics.json ADDED
The diff for this file is too large to render. See raw diff