# Bigram Language Model

A robust, feature-rich bigram language model implementation in Python with advanced text preprocessing, smoothing, and model persistence.
## Features
- Advanced Text Preprocessing: Automatic case normalization, punctuation handling, and tokenization
- Add-k Smoothing: Prevents zero probabilities and improves generation quality
- Sentence Boundary Handling: Proper start/end tokens for natural text generation
- Model Persistence: Save and load trained models in JSON format
- File Loading: Train on external text corpora
- Comprehensive Analysis: Probability analysis, model statistics, and evaluation metrics
- Error Handling: Robust error handling and graceful degradation
- Backward Compatibility: Legacy function support for existing code
## Requirements

- Python 3.7+
- Standard library modules only: `random`, `re`, `json`, `collections`, `typing`
## Quick Start

### Basic Usage

```python
from bigram import BigramLanguageModel

# Create and train the model
model = BigramLanguageModel(smoothing_factor=1.0)
sentences = model.get_default_sentences()
model.train(sentences)

# Generate text
generated_text = model.generate_text(start_word="the", max_words=15)
print(generated_text)
```
### Advanced Usage

```python
# Load training data from a file
sentences = model.load_data_from_file('your_corpus.txt')
model.train(sentences)

# Save the trained model
model.save_model('my_bigram_model.json')

# Load the model later
new_model = BigramLanguageModel()
new_model.load_model('my_bigram_model.json')

# Generate with different options
text = new_model.generate_text(
    start_word=None,      # random start word
    max_words=20,
    end_on_sentence=True
)
```
## API Reference

### BigramLanguageModel Class

#### Constructor

```python
BigramLanguageModel(smoothing_factor: float = 1.0)
```

- `smoothing_factor`: add-k smoothing parameter (default: 1.0)

### Core Methods

#### Training

```python
train(sentences: List[str]) -> None
```

Train the model on a list of sentences.
#### Text Generation

```python
generate_text(
    start_word: Optional[str] = None,
    max_words: int = 20,
    end_on_sentence: bool = True
) -> str
```

Generate text using the trained model.
#### Model Persistence

```python
save_model(filename: str) -> None
load_model(filename: str) -> None
```

Save and load trained models as JSON.
#### Analysis

```python
get_bigram_probability(word1: str, word2: str) -> float
get_next_word_candidates(current_word: str) -> Dict[str, float]
evaluate_perplexity(test_sentences: List[str]) -> float
```
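Perplexity here is the exponentiated average negative log probability of each bigram in the test set; lower is better. A minimal sketch of that computation (`bigram_prob` and `tokenize` are hypothetical helpers standing in for the model's internals, not the exact implementation):

```python
import math

def perplexity(sentences, bigram_prob, tokenize):
    """exp of the average negative log bigram probability over the test set.

    `bigram_prob(w1, w2)` returns a smoothed probability; `tokenize(sentence)`
    is assumed to include the <START>/<END> boundary tokens.
    """
    log_prob_sum = 0.0
    bigram_count = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        for w1, w2 in zip(tokens, tokens[1:]):
            log_prob_sum += math.log(bigram_prob(w1, w2))
            bigram_count += 1
    return math.exp(-log_prob_sum / bigram_count)
```

Because every probability is strictly positive under add-k smoothing, the logarithm is always defined, even for bigrams never seen in training.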
#### Data Loading

```python
load_data_from_file(filename: str) -> List[str]
get_default_sentences() -> List[str]
```
## Technical Details

### Text Preprocessing Pipeline

1. Case Normalization: convert all text to lowercase
2. Punctuation Handling: preserve sentence-ending punctuation, remove the rest
3. Tokenization: split the text into individual words
4. Boundary Tokens: add `<START>` and `<END>` markers
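The pipeline above can be sketched in a few lines; the regex patterns here are illustrative assumptions, not the exact ones used internally:

```python
import re

def preprocess(sentence):
    """Lowercase, strip non-terminal punctuation, tokenize, add boundary tokens."""
    text = sentence.lower()                   # 1. case normalization
    text = re.sub(r"[^\w\s.!?]", "", text)    # 2. drop punctuation except . ! ?
    tokens = re.findall(r"\w+|[.!?]", text)   # 3. tokenize words and end marks
    return ["<START>"] + tokens + ["<END>"]   # 4. boundary tokens
```

Keeping sentence-ending punctuation as its own token lets the model learn when to stop, which is what makes `end_on_sentence=True` work during generation.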
### Smoothing Algorithm

Add-k smoothing handles unseen bigrams:

```
P(w2|w1) = (Count(w1,w2) + k) / (Count(w1) + k × |V|)
```

where `k` is the smoothing factor and `|V|` is the vocabulary size.
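The formula maps directly onto bigram and unigram counts. A sketch, assuming counts are kept in plain dictionaries (the storage layout is an assumption, not the model's actual one):

```python
def bigram_probability(w1, w2, bigram_counts, unigram_counts, vocab_size, k=1.0):
    """Add-k smoothed P(w2 | w1) = (C(w1,w2) + k) / (C(w1) + k * |V|)."""
    numerator = bigram_counts.get(w1, {}).get(w2, 0) + k
    denominator = unigram_counts.get(w1, 0) + k * vocab_size
    return numerator / denominator
```

With `k=1` this is Laplace smoothing; smaller `k` values shift less probability mass toward unseen bigrams, while `k=0` recovers the unsmoothed maximum-likelihood estimate (and its zero probabilities).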
### Model Architecture

```
Input Text → Preprocessing → Bigram Counting → Probability Calculation → Text Generation
```
## Example Output

```
Training sentences:
1. The cat sat on the mat
2. The dog barked at the cat
3. The bird sang a beautiful song
...

Model trained on 10 sentences
Vocabulary size: 48

Generated text examples:
Starting with "the": the cat sat on a beautiful song
Starting with "a": a quick brown fox jumps over the lazy dog
```
## Testing

Run the main script to see the model in action:

```shell
python bigram.py
```

This will:

- Train on the default sentences
- Show model statistics
- Generate example text with different starting words
- Display bigram probabilities
- Demonstrate save/load functionality
## Performance Characteristics

- Training Time: O(n × m), where n = number of sentences and m = average sentence length
- Memory Usage: O(|V|²) worst case for bigram storage
- Generation Speed: O(k × |V|), where k = number of words to generate
- Vocabulary Scalability: handles vocabularies of 100K+ words
## Legacy Compatibility

The original functional interface is still supported:

```python
# Original functions still work
sentences = prepare_Data()
bigram_probs = build_bigram_model(sentences)
text = generate_text(bigram_probs, "the", 10)
```
## Limitations
- Context Window: Only considers the immediately preceding word
- Long-range Dependencies: Cannot capture long-distance relationships
- Vocabulary: Limited to words seen during training
- Generation Quality: May produce repetitive or incoherent long sequences
## Future Improvements
- N-gram models (trigram, 4-gram)
- Neural language model integration
- Better evaluation metrics (BLEU, perplexity)
- Interactive text generation interface
- Support for different languages
- Parallel training for large corpora
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## License
This project is open source and available under the MIT License.
## References

- Jurafsky, D. & Martin, J. H. (2023). *Speech and Language Processing* (3rd ed.)
- Manning, C. D. & Schütze, H. (1999). *Foundations of Statistical Natural Language Processing*
- Chen, S. F. & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling
## Contact

For questions, suggestions, or issues, please open an issue in the repository or contact the maintainer.

*Built with ❤️ for natural language processing education and research*