# Bigram Language Model

A robust, feature-rich bigram language model implementation in Python with advanced text preprocessing, smoothing, and model persistence.
## Features
- Advanced Text Preprocessing: Automatic case normalization, punctuation handling, and tokenization
- Add-k Smoothing: Prevents zero probabilities and improves generation quality
- Sentence Boundary Handling: Proper start/end tokens for natural text generation
- Model Persistence: Save and load trained models in JSON format
- File Loading: Train on external text corpora
- Comprehensive Analysis: Probability analysis, model statistics, and evaluation metrics
- Error Handling: Robust error handling and graceful degradation
- Backward Compatibility: Legacy function support for existing code
## Requirements

- Python 3.7+
- Standard library modules only: `random`, `re`, `json`, `collections`, `typing`
## Quick Start

### Basic Usage

```python
from bigram import BigramLanguageModel

# Create and train the model
model = BigramLanguageModel(smoothing_factor=1.0)
sentences = model.get_default_sentences()
model.train(sentences)

# Generate text
generated_text = model.generate_text(start_word="the", max_words=15)
print(generated_text)
```
### Advanced Usage

```python
# Load training data from a file
sentences = model.load_data_from_file('your_corpus.txt')
model.train(sentences)

# Save the trained model
model.save_model('my_bigram_model.json')

# Load the model later
new_model = BigramLanguageModel()
new_model.load_model('my_bigram_model.json')

# Generate with different options
text = new_model.generate_text(
    start_word=None,      # random start word
    max_words=20,
    end_on_sentence=True
)
```
## API Reference

### BigramLanguageModel Class

#### Constructor

```python
BigramLanguageModel(smoothing_factor: float = 1.0)
```

- `smoothing_factor`: add-k smoothing parameter (default: 1.0)

### Core Methods

#### Training

```python
train(sentences: List[str]) -> None
```

Train the model on a list of sentences.
#### Text Generation

```python
generate_text(
    start_word: Optional[str] = None,
    max_words: int = 20,
    end_on_sentence: bool = True
) -> str
```

Generate text using the trained model.
#### Model Persistence

```python
save_model(filename: str) -> None
load_model(filename: str) -> None
```

Save and load trained models as JSON.
#### Analysis

```python
get_bigram_probability(word1: str, word2: str) -> float
get_next_word_candidates(current_word: str) -> Dict[str, float]
evaluate_perplexity(test_sentences: List[str]) -> float
```
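Perplexity here is the exponentiated average negative log probability of each bigram in the test set; lower is better. A minimal sketch of that computation (`bigram_prob` and `tokenize` are hypothetical helpers standing in for the model's internals, not the exact implementation):

```python
import math

def perplexity(sentences, bigram_prob, tokenize):
    """exp of the average negative log bigram probability over the test set.

    `bigram_prob(w1, w2)` returns a smoothed probability; `tokenize(sentence)`
    is assumed to include the <START>/<END> boundary tokens.
    """
    log_prob_sum = 0.0
    bigram_count = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        for w1, w2 in zip(tokens, tokens[1:]):
            log_prob_sum += math.log(bigram_prob(w1, w2))
            bigram_count += 1
    return math.exp(-log_prob_sum / bigram_count)
```

Because every probability is strictly positive under add-k smoothing, the logarithm is always defined, even for bigrams never seen in training.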
#### Data Loading

```python
load_data_from_file(filename: str) -> List[str]
get_default_sentences() -> List[str]
```
## Technical Details

### Text Preprocessing Pipeline

1. Case Normalization: convert all text to lowercase
2. Punctuation Handling: preserve sentence-ending punctuation, remove the rest
3. Tokenization: split the text into individual words
4. Boundary Tokens: add `<START>` and `<END>` markers
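The pipeline above can be sketched in a few lines; the regex patterns here are illustrative assumptions, not the exact ones used internally:

```python
import re

def preprocess(sentence):
    """Lowercase, strip non-terminal punctuation, tokenize, add boundary tokens."""
    text = sentence.lower()                   # 1. case normalization
    text = re.sub(r"[^\w\s.!?]", "", text)    # 2. drop punctuation except . ! ?
    tokens = re.findall(r"\w+|[.!?]", text)   # 3. tokenize words and end marks
    return ["<START>"] + tokens + ["<END>"]   # 4. boundary tokens
```

Keeping sentence-ending punctuation as its own token lets the model learn when to stop, which is what makes `end_on_sentence=True` work during generation.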
### Smoothing Algorithm

Add-k smoothing handles unseen bigrams:

```
P(w2|w1) = (Count(w1,w2) + k) / (Count(w1) + k × |V|)
```

where `k` is the smoothing factor and `|V|` is the vocabulary size.
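The formula maps directly onto bigram and unigram counts. A sketch, assuming counts are kept in plain dictionaries (the storage layout is an assumption, not the model's actual one):

```python
def bigram_probability(w1, w2, bigram_counts, unigram_counts, vocab_size, k=1.0):
    """Add-k smoothed P(w2 | w1) = (C(w1,w2) + k) / (C(w1) + k * |V|)."""
    numerator = bigram_counts.get(w1, {}).get(w2, 0) + k
    denominator = unigram_counts.get(w1, 0) + k * vocab_size
    return numerator / denominator
```

With `k=1` this is Laplace smoothing; smaller `k` values shift less probability mass toward unseen bigrams, while `k=0` recovers the unsmoothed maximum-likelihood estimate (and its zero probabilities).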
### Model Architecture

```
Input Text → Preprocessing → Bigram Counting → Probability Calculation → Text Generation
```
## Example Output

```
Training sentences:
1. The cat sat on the mat
2. The dog barked at the cat
3. The bird sang a beautiful song
...

Model trained on 10 sentences
Vocabulary size: 48

Generated text examples:
Starting with "the": the cat sat on a beautiful song
Starting with "a": a quick brown fox jumps over the lazy dog
```
## Testing

Run the main script to see the model in action:

```shell
python bigram.py
```

This will:

- Train on the default sentences
- Show model statistics
- Generate example text with different starting words
- Display bigram probabilities
- Demonstrate save/load functionality
## Performance Characteristics

- Training Time: O(n × m), where n = number of sentences and m = average sentence length
- Memory Usage: O(|V|²) worst case for bigram storage
- Generation Speed: O(k × |V|), where k = number of words to generate
- Vocabulary Scalability: handles vocabularies of 100K+ words
## Legacy Compatibility

The original functional interface is still supported:

```python
# Original functions still work
sentences = prepare_Data()
bigram_probs = build_bigram_model(sentences)
text = generate_text(bigram_probs, "the", 10)
```
## Limitations
- Context Window: Only considers the immediately preceding word
- Long-range Dependencies: Cannot capture long-distance relationships
- Vocabulary: Limited to words seen during training
- Generation Quality: May produce repetitive or incoherent long sequences
## Future Improvements
- N-gram models (trigram, 4-gram)
- Neural language model integration
- Better evaluation metrics (BLEU, perplexity)
- Interactive text generation interface
- Support for different languages
- Parallel training for large corpora
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## License
This project is open source and available under the MIT License.
## References

- Jurafsky, D. & Martin, J. H. (2023). *Speech and Language Processing* (3rd ed.)
- Manning, C. D. & Schütze, H. (1999). *Foundations of Statistical Natural Language Processing*
- Chen, S. F. & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling
## Contact

For questions, suggestions, or issues, please open an issue in the repository or contact the maintainer.

*Built with ❤️ for natural language processing education and research*