---
library_name: transformers
license: apache-2.0
base_model: t5-small
tags:
- generated_from_trainer
model-index:
- name: results
  results: []
---

## results

This model is a fine-tuned version of [t5-small](https://huggingface.co/t5-small) on an English-to-colloquial-Telugu parallel corpus.

## Model Description

This model is a Telugu colloquial language translator designed to convert English text into spoken (colloquial) Telugu. It is built on a transformer-based architecture and fine-tuned on translation tasks to produce natural, conversational output.

Key features:
- **Conversational style:** generates spoken Telugu instead of formal Telugu.
- **Context-aware translation:** preserves the meaning and tone of the English sentence.
- **Efficient inference:** uses sampling with top-p filtering for diverse translations.

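The top-p (nucleus) filtering mentioned above can be sketched in plain Python. The token strings and probabilities below are made-up illustrations, not actual model output:

```python
def top_p_filter(token_probs, p=0.9):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches p, then renormalize; everything else is dropped."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    return {token: prob / total for token, prob in kept}

# Toy next-token distribution (hypothetical Telugu candidates).
probs = {"veluthunnaru": 0.6, "veltunnava": 0.25, "potunnaru": 0.1, "ani": 0.05}
filtered = top_p_filter(probs, p=0.8)
# Only the top two tokens survive (0.6 + 0.25 = 0.85 >= 0.8);
# sampling then happens over this renormalized subset.
```

Lower values of `p` make output more deterministic; higher values admit more of the tail and increase diversity.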
## Intended Uses & Limitations

Intended uses:
- **Language translation:** converts English text into spoken Telugu.
- **Conversational AI:** can be integrated into chatbots, voice assistants, or language-learning apps.
- **Educational tool:** helps learners understand spoken Telugu in real-world contexts.

Limitations:
- **Limited vocabulary:** may struggle with highly technical or domain-specific terms.
- **Context dependency:** lacks deep contextual understanding of ambiguous sentences.
- **Dataset bias:** biases present in the training data may surface in translations.
- **Grammar inconsistencies:** spoken-Telugu output may not always be grammatically perfect.

## Training and Evaluation Data

Training data:
- The model was fine-tuned on a parallel corpus of English-Telugu conversational text.
- Source: ChatGPT.

Evaluation data:
- The model was evaluated on a test set of everyday English sentences.

Example categories:
- Common phrases (e.g., "Where are you going?" → "Ekadiki veluthunnaru?")
- Technical queries (e.g., "What is data structure?" → "Data structure ante emiti?")
- General questions (e.g., "Can you explain this?" → "Idhi cheppagalava?")

Metrics used:
- **BLEU score:** measures translation accuracy against human reference translations.
- **Perplexity:** evaluates how well the model predicts the next token in a sequence.
- **Human evaluation:** Telugu speakers reviewed translations for fluency and accuracy.

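The two automatic metrics above can be sketched with the standard library. `unigram_precision` is a deliberately simplified stand-in for full BLEU (which also uses higher-order n-grams and a brevity penalty), and perplexity is just the exponential of the average negative log-likelihood per token:

```python
import math
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision, the core ingredient of BLEU: each
    candidate token is credited at most as often as it appears in
    the reference."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, ref[tok]) for tok, n in cand.items())
    return overlap / max(sum(cand.values()), 1)

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

score = unigram_precision("ekadiki veluthunnaru", "ekadiki veluthunnaru")
# A model that assigns probability 0.5 to every token has perplexity 2.
ppl = perplexity([math.log(0.5)] * 4)
```

In practice a library such as `sacrebleu` would be used for BLEU; this sketch only shows what the numbers mean.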
## Training procedure

### 1. Data Collection & Preprocessing

Data sources:
- Parallel corpus of English-Telugu conversations, focused on colloquial spoken Telugu.
- Crowdsourced translations and datasets from existing NLP corpora.
- Manually curated Telugu phrases for informal, everyday speech.

Preprocessing steps:
- **Tokenization:** SentencePiece/BPE (Byte Pair Encoding) for handling subwords.
- **Data cleaning:** removed extra punctuation and normalized informal Telugu spellings.
- **Sentence alignment:** mapped each English phrase to its spoken-Telugu translation for training.

### 2. Model Architecture & Training Configuration

- **Base model:** transformer-based sequence-to-sequence (seq2seq) architecture (options: T5, mT5, MarianMT, BART, or a custom LSTM-based model).
- **Embedding layer:** converts tokens into vector representations.
- **Encoder-decoder:** processes the English input and generates colloquial Telugu output.

Hyperparameters:
- **Batch size:** 16-64 (tuned to GPU memory).
- **Optimizer:** Adam with learning-rate scheduling.
- **Loss function:** cross-entropy loss for sequence prediction.
- **Dropout & regularization:** applied to prevent overfitting.
- **Beam search & top-k sampling:** used for natural-sounding output.

### 3. Training Configuration

- **Hardware:** NVIDIA A100/V100 GPUs or TPUs for faster training.
- **Training duration:** several hours to days, depending on dataset size.
- **Dataset split:** 80% training, 10% validation, 10% testing.
- **Evaluation during training:** BLEU score, perplexity (PPL), and human evaluation of spoken fluency.
- **Fine-tuning:** adjusted beam search and temperature scaling for more contextually relevant translations.

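The temperature scaling and top-k sampling mentioned in the steps above can be sketched as a single filtering function. The token names and logit values are invented for illustration:

```python
import math

def temperature_top_k(logits, temperature=0.7, k=2):
    """Scale logits by 1/temperature, keep the k highest, and softmax
    the survivors into a sampling distribution."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    top = dict(sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:k])
    z = sum(math.exp(v) for v in top.values())
    return {tok: math.exp(v) / z for tok, v in top.items()}

# Hypothetical next-token logits at one decoding step.
dist = temperature_top_k({"ante": 2.0, "emiti": 1.5, "idhi": 0.1},
                         temperature=0.7, k=2)
# Temperatures below 1 sharpen the distribution toward the top token;
# temperatures above 1 flatten it and increase diversity.
```

Beam search, by contrast, is deterministic: it keeps the highest-scoring partial sequences rather than sampling from a distribution like this one.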

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 10

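For reference, the hyperparameters listed above map onto a `transformers` Trainer configuration roughly as follows; `output_dir` is a placeholder, and this is a reconstruction rather than the exact training script:

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative reconstruction of the settings listed above.
args = Seq2SeqTrainingArguments(
    output_dir="results",              # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=10,
)
```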
### Qualitative Analysis

✅ Strengths:
- Produces natural, fluent spoken-Telugu translations.
- Handles short, conversational phrases accurately.
- Preserves the context and informal nuances of Telugu speech.

❌ Challenges:
- Long sentences may lose their colloquial tone or sound too formal.
- Domain-specific phrases (e.g., tech terms) may need further fine-tuning.
- Context switching in complex sentences sometimes yields literal translations instead of natural Telugu speech.

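A minimal inference sketch with the sampling settings described earlier; the base `t5-small` checkpoint is used here as a stand-in, since this card does not state the published model id. Replace it with the fine-tuned checkpoint path to get Telugu output:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "t5-small"  # stand-in; replace with the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("Where are you going?", return_tensors="pt")
# Sampling with top-p filtering, as described in the model's key features.
out = model.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=32)
text = tokenizer.decode(out[0], skip_special_tokens=True)
print(text)
```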

### Framework versions

- Transformers 4.49.0
- Pytorch 2.6.0+cu124
- Datasets 3.3.1
- Tokenizers 0.21.0
|