# BERT-Based Language Classification Model
This repository contains a fine-tuned BERT-based model for classifying text by language. The model identifies the language of a given sentence and was trained using the Hugging Face Transformers library. It supports post-training dynamic quantization for faster, lighter inference in deployment environments.
---
## Model Details
- **Model Name:** BERT Base for Language Classification
- **Model Architecture:** BERT Base
- **Task:** Language Identification
- **Dataset:** Custom dataset of multilingual text samples
- **Quantization:** Dynamic Quantization (INT8)
- **Fine-tuning Framework:** Hugging Face Transformers
---
## Usage
### Installation
```bash
pip install transformers torch
```
### Loading the Fine-tuned Model
```python
from transformers import pipeline

# Load the fine-tuned model and tokenizer from the saved directory
classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")

# Example input
text = "Bonjour, comment allez-vous?"  # French: "Hello, how are you?"

# Get the predicted language label and score
prediction = classifier(text)
print(f"Prediction: {prediction}")
```
---
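The pipeline returns a list of `{'label': ..., 'score': ...}` dicts. The exact label strings depend on the `id2label` mapping set during fine-tuning; as a post-processing sketch, assuming hypothetical short language codes as labels:

```python
# Hypothetical pipeline output for a batch of sentences; the label names
# here are illustrative and depend on the fine-tuning configuration
predictions = [
    {"label": "fr", "score": 0.97},
    {"label": "es", "score": 0.91},
]

# Map short language codes to readable names (illustrative only)
language_names = {"fr": "French", "es": "Spanish", "en": "English"}

for pred in predictions:
    name = language_names.get(pred["label"], pred["label"])
    print(f"{name} ({pred['score']:.2f})")
```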
## Saving and Testing the Model
### Saving
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_checkpoint = "bert-base-uncased"  # or the path to your fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

# Save the model and tokenizer
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
```
### Testing
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
text = "Ceci est un exemple de texte."  # French: "This is an example sentence."
print(classifier(text))
```
---
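Under the hood, the pipeline applies a softmax to the model's logits. If you need the full probability distribution rather than only the top label, you can do this step yourself; a minimal sketch with synthetic logits standing in for `model(**inputs).logits` from the fine-tuned model:

```python
import torch
import torch.nn.functional as F

# Synthetic logits for one sentence over three candidate languages;
# in a real run these come from model(**inputs).logits
logits = torch.tensor([[2.0, 0.5, -1.0]])

# Convert logits to a probability distribution and pick the argmax
probs = F.softmax(logits, dim=-1)
pred_id = int(probs.argmax(dim=-1))
print(pred_id, probs[0, pred_id].item())
```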
## Quantization
### Apply Dynamic Quantization
```python
import os

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./saved_model")

# Apply dynamic quantization: Linear weights are stored as INT8 and
# dequantized on the fly during inference
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# save_pretrained does not round-trip dynamically quantized modules,
# so save the state dict with torch.save instead
os.makedirs("./quantized_model", exist_ok=True)
torch.save(quantized_model.state_dict(), "./quantized_model/model_state_dict.pt")
```
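The main payoff of dynamic quantization is the smaller weight footprint (INT8 instead of FP32 for Linear layers). A standalone toy demonstration of the size difference, using a small `torch.nn.Sequential` stack rather than the actual BERT checkpoint:

```python
import os
import tempfile

import torch

# Toy stand-in for a Linear-heavy transformer stack (not the BERT model)
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 256),
)
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Save both state dicts and compare their on-disk sizes
with tempfile.TemporaryDirectory() as tmp:
    fp32_path = os.path.join(tmp, "fp32.pt")
    int8_path = os.path.join(tmp, "int8.pt")
    torch.save(model.state_dict(), fp32_path)
    torch.save(quantized.state_dict(), int8_path)
    fp32_size = os.path.getsize(fp32_path)
    int8_size = os.path.getsize(int8_path)

print(f"FP32: {fp32_size} bytes, INT8: {int8_size} bytes")
```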
### Load and Test Quantized Model
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("./saved_model")

# Rebuild the architecture, re-apply quantization, then load the saved weights
model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
quantized_model.load_state_dict(torch.load("./quantized_model/model_state_dict.pt"))

classifier = pipeline("text-classification", model=quantized_model, tokenizer=tokenizer)
text = "Hola, ¿cómo estás?"  # Spanish: "Hello, how are you?"
print(classifier(text))
```
---
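Quantization trades a small amount of numerical precision for efficiency. A quick toy sanity check (again on a stand-in model, not the BERT checkpoint) comparing FP32 and INT8 outputs on the same input:

```python
import torch

torch.manual_seed(0)

# Stand-in classifier head; not the fine-tuned BERT model
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 8),
)
model.eval()
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Run the same batch through both models and measure the largest deviation
x = torch.randn(4, 64)
with torch.no_grad():
    out_fp32 = model(x)
    out_int8 = quantized(x)

max_diff = (out_fp32 - out_int8).abs().max().item()
print(f"max |fp32 - int8| = {max_diff:.4f}")
```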
## Repository Structure
```
.
├── saved_model/         # Fine-tuned model
├── quantized_model/     # Quantized model
├── language-clasification.ipynb
├── README.md            # Documentation
```
---
## Limitations
- Performance may degrade for low-resource languages or languages underrepresented in the training dataset.
- Quantization may slightly reduce accuracy, but it shrinks the model and speeds up inference.
---
## Contributing
Feel free to submit issues or pull requests to improve performance or accuracy, or to add support for new languages.
---