---
title: Language Detection App
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
python_version: 3.9
app_file: app.py
license: mit
---
# Language Detection App

A powerful and elegant language detection application built with a Gradio frontend and a modular backend featuring multiple state-of-the-art ML models, organized by architecture and training dataset.
## Features

- **Clean Gradio Interface**: Simple, intuitive web interface for language detection
- **Multiple Model Architectures**: Choose between XLM-RoBERTa (Model A) and BERT (Model B) architectures
- **Multiple Training Datasets**: Models trained on standard (Dataset A) and enhanced (Dataset B) datasets
- **Centralized Configuration**: All model configurations and settings in one place
- **Modular Backend**: Easy-to-extend architecture for plugging in your own ML models
- **Real-time Detection**: Instant language detection with confidence scores
- **Multiple Predictions**: Shows the top 5 language predictions with confidence levels
- **100+ Languages**: Support for major world languages (varies by model)
- **Example Texts**: Pre-loaded examples in various languages for testing
- **Model Switching**: Seamlessly switch between different models
- **Extensible**: Abstract base class for implementing custom models
## Quick Start

### 1. Set Up the Environment

```bash
# Create a virtual environment
python -m venv venv

# Activate the environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
### 2. Test the Backend

```bash
# Run the test suite to verify everything works
python test_app.py

# Test specific model combinations
python test_model_a_dataset_a.py
python test_model_b_dataset_b.py
```

### 3. Launch the App

```bash
# Start the Gradio app
python app.py
```

The app will be available at `http://localhost:7860`.
## Model Architecture

The system is organized around two dimensions:

### Model Architectures

- **Model A**: XLM-RoBERTa-based architectures - excellent cross-lingual transfer capabilities
- **Model B**: BERT-based architectures - efficient and fast processing

### Training Datasets

- **Dataset A**: Standard multilingual language detection dataset - broad language coverage
- **Dataset B**: Enhanced/specialized language detection dataset - ultra-high accuracy focus
### Available Model Combinations

1. **Model A Dataset A** - XLM-RoBERTa + Standard Dataset
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 97.9%
   - **Size**: 278M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Balanced performance, robust cross-lingual capabilities, comprehensive language coverage
   - **Use Cases**: General-purpose language detection, multilingual content processing

2. **Model B Dataset A** - BERT + Standard Dataset
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 96.17%
   - **Size**: 178M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Fast inference, broad language support, efficient processing
   - **Use Cases**: High-throughput detection, real-time applications, resource-constrained environments

3. **Model A Dataset B** - XLM-RoBERTa + Enhanced Dataset
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.72%
   - **Size**: 278M parameters
   - **Training Loss**: 0.0176
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Exceptional accuracy, focused language support, state-of-the-art results
   - **Use Cases**: Research applications, high-precision detection, critical accuracy requirements

4. **Model B Dataset B** - BERT + Enhanced Dataset
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.85%
   - **Size**: 178M parameters
   - **Training Loss**: 0.0125
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Highest accuracy, ultra-low training loss, precision-optimized
   - **Use Cases**: Maximum precision applications, research requiring highest accuracy
### Core Components

- **`BaseLanguageModel`**: Abstract interface that all models must implement
- **`ModelRegistry`**: Manages model registration and creation with centralized configuration
- **`LanguageDetector`**: Main orchestrator for language detection
- **`model_config.py`**: Centralized configuration for all models and language mappings
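The abstract interface itself is not reproduced in this README; as a rough sketch (method names taken from the example below, everything else assumed rather than copied from the actual source), `BaseLanguageModel` might look like this, with a trivial subclass to show the contract:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseLanguageModel(ABC):
    """Sketch of the abstract interface every detection model implements."""

    @abstractmethod
    def predict(self, text: str) -> Dict[str, Any]:
        """Return a prediction dict, e.g. {"language": ..., "confidence": ...}."""

    @abstractmethod
    def get_supported_languages(self) -> List[str]:
        """Return the language codes this model can detect."""

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]:
        """Return metadata such as architecture, dataset, and accuracy."""


class DummyModel(BaseLanguageModel):
    """Trivial implementation used here only to illustrate the contract."""

    def predict(self, text: str) -> Dict[str, Any]:
        return {"language": "en", "confidence": 1.0}

    def get_supported_languages(self) -> List[str]:
        return ["en"]

    def get_model_info(self) -> Dict[str, Any]:
        return {"architecture": "dummy", "dataset": "none"}
```

Because the base class is abstract, forgetting to implement any of the three methods raises a `TypeError` at instantiation time, which keeps custom models honest about the interface.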
### Adding New Models

To add a new model combination:

1. Create a new file in `backend/models/` (e.g., `model_c_dataset_a.py`)
2. Inherit from `BaseLanguageModel`
3. Implement the required methods
4. Add the configuration to `model_config.py`
5. Register it in `ModelRegistry`
Example:

```python
# backend/models/model_c_dataset_a.py
from typing import Any, Dict, List

from .base_model import BaseLanguageModel
from .model_config import get_model_config


class ModelCDatasetA(BaseLanguageModel):
    def __init__(self):
        self.model_key = "model-c-dataset-a"
        self.config = get_model_config(self.model_key)
        # Initialize your model here

    def predict(self, text: str) -> Dict[str, Any]:
        # Implement the prediction logic
        pass

    def get_supported_languages(self) -> List[str]:
        # Return the supported language codes
        pass

    def get_model_info(self) -> Dict[str, Any]:
        # Return model metadata from the config
        pass
```

Then add the configuration in `model_config.py` and register the model in `language_detector.py`.
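The registration step follows a standard registry pattern. As a minimal, self-contained sketch (the project's real `ModelRegistry` may differ in naming and signatures), it maps model keys to factory callables:

```python
from typing import Callable, Dict


class ModelRegistry:
    """Minimal registry sketch: model keys map to factory callables."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}

    def register(self, key: str, factory: Callable[[], object]) -> None:
        # Later registrations under the same key overwrite earlier ones.
        self._factories[key] = factory

    def create(self, key: str) -> object:
        if key not in self._factories:
            raise KeyError(f"Unknown model key: {key!r}")
        # Instantiate lazily, only when the model is actually requested.
        return self._factories[key]()


# Hypothetical usage: register the new combination under its config key.
registry = ModelRegistry()
registry.register("model-c-dataset-a", lambda: "ModelCDatasetA instance")
```

Registering a factory rather than an instance means heavyweight model loading happens only when a model is first selected, not at import time.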
## Testing

The project includes comprehensive test suites:

- **`test_app.py`**: General app functionality tests
- **`test_model_a_dataset_a.py`**: Tests for XLM-RoBERTa + standard dataset
- **`test_model_b_dataset_b.py`**: Tests for BERT + enhanced dataset (highest accuracy)
- **Model comparison tests**: Automated testing across all model combinations
- **Model switching tests**: Verify seamless model switching
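One kind of check such a suite can run against any model, regardless of architecture, is a structural test on the top-5 output. This sketch (the stub data and helper name are illustrative, not from the project's test files) verifies that confidences are bounded and sorted:

```python
# Structural check on a model's top-5 prediction list: at most five
# entries, confidences in [0, 1], sorted from most to least likely.
def check_top5(predictions):
    assert len(predictions) <= 5
    confidences = [p["confidence"] for p in predictions]
    assert all(0.0 <= c <= 1.0 for c in confidences)
    assert confidences == sorted(confidences, reverse=True)


# Stub output standing in for a real model's prediction list.
stub = [
    {"language": "fr", "confidence": 0.92},
    {"language": "it", "confidence": 0.05},
    {"language": "es", "confidence": 0.02},
]
check_top5(stub)
```

Because the check only depends on the output shape, the same helper works across all four model combinations and in the model-switching tests.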
## Supported Languages

The models support different language sets based on their training:

- **Model A/B + Dataset A**: 100+ languages, including major European, Asian, African, and other world languages, based on the CC-100 dataset
- **Model A/B + Dataset B**: 20 carefully selected high-performance languages (Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese)
## Model Comparison

| Feature | Model A Dataset A | Model B Dataset A | Model A Dataset B | Model B Dataset B |
|---------|-------------------|-------------------|-------------------|-------------------|
| **Architecture** | XLM-RoBERTa | BERT | XLM-RoBERTa | BERT |
| **Dataset** | Standard | Standard | Enhanced | Enhanced |
| **Accuracy** | 97.9% | 96.17% | 99.72% | **99.85%** |
| **Model Size** | 278M | 178M | 278M | 178M |
| **Languages** | 100+ | 100+ | 20 (curated) | 20 (curated) |
| **Training Loss** | N/A | N/A | 0.0176 | **0.0125** |
| **Speed** | Moderate | **Fast** | Moderate | **Fast** |
| **Memory Usage** | Higher | **Lower** | Higher | **Lower** |
| **Best For** | Balanced performance | Speed & broad coverage | Ultra-high accuracy | **Maximum precision** |
### Model Selection Guide

- **Model B Dataset B**: Choose for maximum accuracy on 20 core languages (99.85%)
- **Model A Dataset B**: Choose for ultra-high accuracy on 20 core languages (99.72%)
- **Model A Dataset A**: Choose for balanced performance and comprehensive language coverage (97.9%)
- **Model B Dataset A**: Choose for fast inference and broad language coverage (96.17%)
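The guide above boils down to two questions. As a hypothetical helper (not part of the project's API), it can be expressed as:

```python
# Hypothetical helper encoding the selection guide above: pick a model key
# from two questions -- do you need broad (100+) language coverage, and do
# you prefer the faster BERT over the larger XLM-RoBERTa?
def choose_model(broad_coverage: bool, prefer_speed: bool) -> str:
    arch = "model-b" if prefer_speed else "model-a"
    dataset = "dataset-a" if broad_coverage else "dataset-b"
    return f"{arch}-{dataset}"
```

For example, needing only the 20 curated languages but wanting fast inference yields `"model-b-dataset-b"`, which is also the highest-accuracy combination.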
## Configuration

You can configure models using the centralized configuration system:

```python
# Default model selection
detector = LanguageDetector(model_key="model-a-dataset-a")  # Balanced XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-a")  # Fast BERT
detector = LanguageDetector(model_key="model-a-dataset-b")  # Ultra-high-accuracy XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-b")  # Maximum-precision BERT

# All configurations are centralized in backend/models/model_config.py
```
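The centralized config is essentially a lookup from model key to metadata. As a sketch of what `backend/models/model_config.py` could contain (the structure is assumed; the accuracy figures are the ones quoted in the comparison table above):

```python
# Sketch of a centralized model configuration: one dict entry per
# model key, consumed by get_model_config() (structure assumed).
MODEL_CONFIGS = {
    "model-a-dataset-a": {"architecture": "XLM-RoBERTa", "dataset": "standard", "accuracy": 0.979},
    "model-b-dataset-a": {"architecture": "BERT", "dataset": "standard", "accuracy": 0.9617},
    "model-a-dataset-b": {"architecture": "XLM-RoBERTa", "dataset": "enhanced", "accuracy": 0.9972},
    "model-b-dataset-b": {"architecture": "BERT", "dataset": "enhanced", "accuracy": 0.9985},
}


def get_model_config(key: str) -> dict:
    # Fail loudly on an unknown key so typos surface immediately.
    try:
        return MODEL_CONFIGS[key]
    except KeyError:
        raise ValueError(f"Unknown model key: {key!r}") from None
```

Keeping every model's metadata in one dict means adding a model combination touches a single file, and the registry, detector, and UI can all read from the same source of truth.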
## Project Structure

```
language-detection/
├── backend/
│   ├── models/
│   │   ├── model_config.py       # Centralized configuration
│   │   ├── base_model.py         # Abstract base class
│   │   ├── model_a_dataset_a.py  # XLM-RoBERTa + Standard
│   │   ├── model_b_dataset_a.py  # BERT + Standard
│   │   ├── model_a_dataset_b.py  # XLM-RoBERTa + Enhanced
│   │   ├── model_b_dataset_b.py  # BERT + Enhanced
│   │   └── __init__.py
│   └── language_detector.py      # Main orchestrator
├── tests/
├── app.py                        # Gradio interface
└── README.md
```
## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/new-model-combination`)
3. Implement your model following the `BaseLanguageModel` interface
4. Add configuration to `model_config.py`
5. Add tests for your implementation
6. Commit your changes (`git commit -m 'Add new model combination'`)
7. Push to the branch (`git push origin feature/new-model-combination`)
8. Open a Pull Request
## License

This project is open source and available under the MIT License.

## Acknowledgments

- **Hugging Face** for the Transformers library and model hosting platform
- **Model providers** for the fine-tuned language detection models used in this project
- **Gradio** for the excellent web interface framework
- **Open source community** for the foundational technologies that make this project possible