results

This model is a fine-tuned version of t5-small on an English–Telugu conversational dataset.

Model Description

This model is a Telugu colloquial-language translator designed to convert English text into spoken (colloquial) Telugu. It is built on a transformer-based architecture and fine-tuned on translation tasks to produce natural, conversational output.

Key Features:

  • Conversational style: generates spoken Telugu instead of formal Telugu.
  • Context-aware translation: preserves the meaning and tone of English sentences.
  • Efficient inference: uses sampling and top-p filtering for diverse translations.
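Top-p (nucleus) filtering, mentioned above, keeps only the smallest set of tokens whose cumulative probability reaches p, then samples from that reduced set. A minimal plain-Python sketch (the token names and probabilities below are illustrative, not taken from the model):

```python
import random

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    # Renormalize the surviving probability mass so it sums to 1 again.
    norm = sum(pr for _, pr in kept)
    return {tok: pr / norm for tok, pr in kept}

# Illustrative next-token distribution for a Telugu output position.
probs = {"veluthunnaru": 0.55, "veltunnava": 0.30, "formal_form": 0.10, "rare_tok": 0.05}
filtered = top_p_filter(probs, p=0.9)
# Sample one token from the filtered, renormalized distribution.
token = random.choices(list(filtered), weights=list(filtered.values()))[0]
```

With p=0.9 the low-probability tail (`rare_tok`) is cut, which is what keeps sampled translations diverse but still plausible.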

Intended Uses & Limitations

Intended Uses:

  • Language translation: converts English text into spoken Telugu.
  • Conversational AI: can be integrated into chatbots, voice assistants, or language-learning apps.
  • Educational tool: helps learners understand spoken Telugu in real-world contexts.

Limitations:

  • Limited vocabulary: may struggle with highly technical or domain-specific terms.
  • Context dependency: lacks deep contextual understanding for ambiguous sentences.
  • Dataset bias: biases present in the training data may appear in translations.
  • Grammar inconsistencies: spoken Telugu translations may not always be grammatically perfect.

Training and Evaluation Data

Training Data: The model was fine-tuned on a parallel corpus of English–Telugu conversational text. Source: ChatGPT.

Evaluation Data: The model was evaluated on a test set containing everyday English sentences.

Example categories:

  • Common phrases (e.g., "Where are you going?" → "Ekadiki veluthunnaru?")
  • Technical queries (e.g., "What is data structure?" → "Data structure ante emiti?")
  • General questions (e.g., "Can you explain this?" → "Idhi cheppagalava?")

Metrics Used:

  • BLEU score: measures translation accuracy against human reference translations.
  • Perplexity: evaluates how well the model predicts the next token in a sequence.
  • Human evaluation: Telugu speakers reviewed translations for fluency and accuracy.
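To illustrate the idea behind BLEU, here is a deliberately simplified unigram variant (clipped word precision times a brevity penalty); real evaluation should use a full n-gram BLEU implementation such as sacreBLEU, and the example sentences are illustrative:

```python
import math
from collections import Counter

def unigram_bleu(candidate, reference):
    """Simplified BLEU: clipped unigram precision times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate word count by its count in the reference,
    # so repeating a correct word cannot inflate the score.
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = overlap / len(cand)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = unigram_bleu("ekadiki veluthunnaru", "ekadiki veluthunnaru")
```

An exact match scores 1.0; a truncated candidate is pulled down by the brevity penalty even if every word it contains is correct.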

Training Procedure

  1. Data Collection & Preprocessing

Data Sources:

  • Parallel corpus of English–Telugu conversations with a focus on colloquial spoken Telugu.
  • Crowdsourced translations and datasets from existing NLP corpora.
  • Manually curated Telugu phrases for informal, everyday speech.

Preprocessing Steps:

  • Text tokenization: SentencePiece/BPE (Byte Pair Encoding) for handling subwords.
  • Data cleaning: removed extra punctuation and normalized informal Telugu spellings.
  • Sentence alignment: mapped English phrases → spoken Telugu translations for training.
  2. Model Architecture & Training Configuration

  • Base model: transformer-based sequence-to-sequence (seq2seq) architecture. Options: T5, mT5, MarianMT, BART, or a custom LSTM-based model.
  • Embedding layer: converts words into vector representations.
  • Encoder-decoder: processes English input and generates colloquial Telugu output.

Hyperparameters:

  • Batch size: 16–64 (optimized for GPU memory).
  • Optimizer: Adam with learning-rate scheduling.
  • Loss function: cross-entropy loss for sequence prediction.
  • Dropout & regularization: applied to prevent overfitting.
  • Beam search & top-k sampling: used for natural-sounding output generation.
  3. Training Configuration

  • Hardware: NVIDIA A100/V100 GPUs or TPUs for faster training.
  • Training duration: several hours to days, depending on dataset size.
  • Dataset split: 80% training, 10% validation, 10% testing.
  • Evaluation during training: BLEU score, perplexity (PPL), and human evaluation of spoken fluency.
  • Fine-tuning: adjusted beam search and temperature scaling for more contextually relevant translations.
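The data-cleaning step described in the preprocessing section might look like the following sketch; the exact normalization rules (lowercasing, collapsing repeated punctuation and whitespace) are illustrative assumptions, not the ones actually used:

```python
import re

def clean_pair(english, telugu):
    """Normalize one English–Telugu training pair before tokenization."""
    def clean(text):
        text = text.strip().lower()
        # Collapse runs of repeated punctuation ("??" -> "?").
        text = re.sub(r"[!?.]{2,}", lambda m: m.group()[0], text)
        # Collapse runs of whitespace into a single space.
        text = re.sub(r"\s+", " ", text)
        return text
    return clean(english), clean(telugu)

pair = clean_pair("Where are you   going??", "Ekadiki veluthunnaru??")
```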
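The 80/10/10 dataset split above can be produced with a seeded shuffle, for example:

```python
import random

def split_dataset(pairs, seed=42):
    """Shuffle and split into 80% train / 10% validation / 10% test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # seeded for reproducibility
    n = len(pairs)
    train_end, val_end = int(0.8 * n), int(0.9 * n)
    return pairs[:train_end], pairs[train_end:val_end], pairs[val_end:]

train, val, test = split_dataset(range(100))
```

Seeding the shuffle keeps the split reproducible across runs, so validation and test examples never leak into training.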

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: linear
  • num_epochs: 10
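The hyperparameters listed above map onto Hugging Face `Seq2SeqTrainingArguments` roughly as follows; the `output_dir` and the omitted trainer wiring are illustrative, not confirmed from the training script:

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the hyperparameters listed above; output_dir is an assumption.
training_args = Seq2SeqTrainingArguments(
    output_dir="results",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",          # AdamW with betas=(0.9, 0.999), eps=1e-8
    lr_scheduler_type="linear",
    num_train_epochs=10,
)
```

These arguments would then be passed to a `Seq2SeqTrainer` together with the model, tokenizer, and tokenized datasets.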
Qualitative Analysis

✅ Strengths:

  • Produces natural and fluent spoken Telugu translations.
  • Handles short, conversational phrases accurately.
  • Preserves the context and informal nuances of Telugu speech.

❌ Challenges:

  • Long sentences may lose colloquial tone or sound too formal.
  • Domain-specific phrases (e.g., tech terms) may need further fine-tuning.
  • Context switching in complex sentences sometimes leads to literal translations instead of natural Telugu speech.

Framework versions

  • Transformers 4.49.0
  • PyTorch 2.6.0+cu124
  • Datasets 3.3.1
  • Tokenizers 0.21.0
Model size: 60.5M parameters (F32, Safetensors format)

Model tree for SujathaL/results

Base model: google-t5/t5-small