results
This model is a fine-tuned version of t5-small on a parallel corpus of English–Telugu conversational text.
Model Description

This model is a Telugu colloquial language translator designed to convert English text into spoken (colloquial) Telugu. It is built on a transformer-based architecture and fine-tuned on translation tasks to produce natural, conversational output.
Key Features:
- Conversational style: generates spoken Telugu rather than formal Telugu.
- Context-aware translation: preserves the meaning and tone of English sentences.
- Efficient inference: uses sampling and top-p filtering for diverse translations.
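The sampling and top-p generation described above can be sketched as follows with the Transformers library. The repo id is taken from this card's model tree ("SujathaL/results") and the `top_p` value of 0.9 is an assumed illustration, not a documented setting:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Repo id as shown on this card; adjust if the model lives elsewhere.
model_id = "SujathaL/results"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("Where are you going?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,    # sampling, as described above
    top_p=0.9,         # nucleus (top-p) filtering; 0.9 is an assumed value
    max_new_tokens=64,
)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
```

Because sampling is enabled, repeated runs can produce different (but semantically similar) translations.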
Intended Uses & Limitations

Intended Uses:
- Language translation: converts English text into spoken Telugu.
- Conversational AI: can be integrated into chatbots, voice assistants, or language-learning apps.
- Educational tool: helps learners understand spoken Telugu in real-world contexts.
Limitations:
- Limited vocabulary: may struggle with highly technical or domain-specific terms.
- Context dependency: lacks deep contextual understanding for ambiguous sentences.
- Dataset bias: biases present in the training data may appear in translations.
- Grammar inconsistencies: spoken Telugu translations may not always be grammatically perfect.
Training and Evaluation Data

Training Data: The model was fine-tuned on a parallel corpus of English–Telugu conversational text. Source: ChatGPT.
Evaluation Data: The model was evaluated on a test set containing everyday English sentences.
Example categories:
- Common phrases (e.g., "Where are you going?" → "Ekadiki veluthunnaru?")
- Technical queries (e.g., "What is a data structure?" → "Data structure ante emiti?")
- General questions (e.g., "Can you explain this?" → "Idhi cheppagalava?")
Metrics Used:
- BLEU Score: measures translation accuracy against human reference translations.
- Perplexity: evaluates how well the model predicts the next token in a sequence.
- Human Evaluation: Telugu speakers reviewed translations for fluency and accuracy.
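To make the BLEU metric above concrete, here is a minimal standard-library sketch of corpus-level BLEU (modified n-gram precision with clipping plus a brevity penalty). Production evaluations typically use a library such as sacreBLEU; the function names here are illustrative:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # counts of all n-grams in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    # clipped (modified) n-gram precision for n = 1..max_n
    p_logs = []
    for n in range(1, max_n + 1):
        match, total = 0, 0
        for hyp, ref in zip(hypotheses, references):
            h, r = ngrams(hyp.split(), n), ngrams(ref.split(), n)
            match += sum(min(c, r[g]) for g, c in h.items())
            total += sum(h.values())
        if match == 0:
            return 0.0
        p_logs.append(math.log(match / total))
    # brevity penalty: punish hypotheses shorter than the references
    hyp_len = sum(len(h.split()) for h in hypotheses)
    ref_len = sum(len(r.split()) for r in references)
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(sum(p_logs) / max_n)
```

A perfect match scores 1.0, and any n-gram mismatch lowers the score, which mirrors how the model's outputs would be compared with human reference translations.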
Training procedure
Data Collection & Preprocessing

Data Sources:
- Parallel corpus of English–Telugu conversations, with a focus on colloquial spoken Telugu.
- Crowdsourced translations and datasets from existing NLP corpora.
- Manually curated Telugu phrases for informal, everyday speech.

Preprocessing Steps:
- Text tokenization: SentencePiece/BPE (Byte Pair Encoding) for handling subwords.
- Data cleaning: removed extra punctuation and normalized informal Telugu spellings.
- Sentence alignment: mapped English phrases → spoken Telugu translations for training.
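The cleaning and alignment steps above can be sketched with the standard library. The regexes and function names are illustrative assumptions, not the actual preprocessing pipeline used for this model:

```python
import re

def clean_text(text):
    # collapse repeated punctuation ("???" -> "?") and extra whitespace
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def align_pairs(english_lines, telugu_lines):
    # map each English phrase to its spoken-Telugu translation
    assert len(english_lines) == len(telugu_lines), "corpus must be parallel"
    return [(clean_text(e), clean_text(t))
            for e, t in zip(english_lines, telugu_lines)]
```

Each aligned (English, Telugu) pair then becomes one training example for the seq2seq model.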
Model Architecture & Training Configuration

- Base model: transformer-based sequence-to-sequence (seq2seq) architecture. Options: T5, mT5, MarianMT, BART, or a custom LSTM-based model.
- Embedding layer: converts words into vector representations.
- Encoder-decoder: processes English input and generates colloquial spoken Telugu.

Hyperparameters:
- Batch size: 16–64 (optimized for GPU memory).
- Optimizer: Adam with learning-rate scheduling.
- Loss function: cross-entropy loss for sequence prediction.
- Dropout & regularization: applied to prevent overfitting.
- Beam search & top-k sampling: used for natural-sounding output generation.
Training Configuration

- Hardware: NVIDIA A100/V100 GPUs or TPUs for faster training.
- Training duration: several hours to days, depending on dataset size.
- Dataset split: 80% training, 10% validation, 10% testing.
- Evaluation during training: BLEU score, perplexity (PPL), and human evaluation for spoken fluency.
- Fine-tuning process: adjusted beam search and temperature scaling for more contextually relevant translations.
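The 80/10/10 split above can be reproduced with a simple seeded shuffle; this is a generic sketch, not the exact splitting code used for this model:

```python
import random

def split_dataset(pairs, seed=42, train=0.8, val=0.1):
    # shuffle a copy so the caller's list is left untouched
    rng = random.Random(seed)
    pairs = pairs[:]
    rng.shuffle(pairs)
    n = len(pairs)
    n_train, n_val = int(n * train), int(n * val)
    return (pairs[:n_train],                  # 80% training
            pairs[n_train:n_train + n_val],   # 10% validation
            pairs[n_train + n_val:])          # 10% testing
```

Fixing the seed (42, matching the training hyperparameters below) keeps the split reproducible across runs.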
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- num_epochs: 10
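The hyperparameters listed above correspond roughly to the following Seq2SeqTrainingArguments configuration. This is a sketch: the `output_dir` value and the exact argument set are assumptions, not the card author's actual training script:

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the hyperparameters listed above.
args = Seq2SeqTrainingArguments(
    output_dir="results",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",        # betas=(0.9, 0.999), eps=1e-8 are the defaults
    lr_scheduler_type="linear",
    num_train_epochs=10,
)
```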
Qualitative Analysis

Strengths:
- Produces natural and fluent spoken Telugu translations.
- Handles short, conversational phrases accurately.
- Preserves context and informal nuances of Telugu speech.

Challenges:
- Long sentences may lose their colloquial tone or sound too formal.
- Domain-specific phrases (e.g., tech terms) may need further fine-tuning.
- Context switching in complex sentences sometimes leads to literal translations instead of natural Telugu speech.
Framework versions
- Transformers 4.49.0
- PyTorch 2.6.0+cu124
- Datasets 3.3.1
- Tokenizers 0.21.0
Model tree for SujathaL/results
Base model
google-t5/t5-small