Bigram Language Model

A robust and feature-rich bigram language model implementation in Python with advanced text preprocessing, smoothing techniques, and model persistence capabilities.

🌟 Features

  • Advanced Text Preprocessing: Automatic case normalization, punctuation handling, and tokenization
  • Add-k Smoothing: Prevents zero probabilities and improves generation quality
  • Sentence Boundary Handling: Proper start/end tokens for natural text generation
  • Model Persistence: Save and load trained models in JSON format
  • File Loading: Train on external text corpora
  • Comprehensive Analysis: Probability analysis, model statistics, and evaluation metrics
  • Error Handling: Robust error handling and graceful degradation
  • Backward Compatibility: Legacy function support for existing code

📋 Requirements

  • Python 3.7+
  • Standard library modules: random, re, json, collections, typing

🚀 Quick Start

Basic Usage

from bigram import BigramLanguageModel

# Create and train model
model = BigramLanguageModel(smoothing_factor=1.0)
sentences = model.get_default_sentences()
model.train(sentences)

# Generate text
generated_text = model.generate_text(start_word="the", max_words=15)
print(generated_text)

Advanced Usage

# Load training data from file
sentences = model.load_data_from_file('your_corpus.txt')
model.train(sentences)

# Save trained model
model.save_model('my_bigram_model.json')

# Load model later
new_model = BigramLanguageModel()
new_model.load_model('my_bigram_model.json')

# Generate with different options
text = new_model.generate_text(
    start_word=None,  # Random start
    max_words=20,
    end_on_sentence=True
)

📖 API Reference

BigramLanguageModel Class

Constructor

BigramLanguageModel(smoothing_factor: float = 1.0)
  • smoothing_factor: Add-k smoothing parameter (default: 1.0)

Core Methods

Training

train(sentences: List[str]) -> None

Train the model on a list of sentences.

Text Generation

generate_text(
    start_word: Optional[str] = None,
    max_words: int = 20,
    end_on_sentence: bool = True
) -> str

Generate text using the trained model.
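
Under the hood, bigram generation is a sampling walk over next-word distributions. As a hedged sketch (the transition table and its shape are illustrative, not the module's internals), assuming a dict mapping each word to a `{next_word: probability}` dict like the one `get_next_word_candidates` returns:

```python
import random

def generate(next_dist, start_word, max_words=20, seed=0):
    """Walk the bigram chain, sampling each next word, until <END> or max_words."""
    rng = random.Random(seed)
    words = [start_word]
    while len(words) < max_words:
        candidates = next_dist.get(words[-1])
        if not candidates:
            break  # no outgoing bigrams for this word
        nxt = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if nxt == "<END>":
            break
        words.append(nxt)
    return " ".join(words)

# Toy transition table (illustrative only).
chain = {"the": {"cat": 0.6, "dog": 0.4}, "cat": {"sat": 1.0}, "sat": {"<END>": 1.0}}
print(generate(chain, "the"))
```

Seeding the random generator makes runs reproducible; the actual method likely uses the module-level `random` state instead.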

Model Persistence

save_model(filename: str) -> None
load_model(filename: str) -> None

Save and load trained models.

Analysis

get_bigram_probability(word1: str, word2: str) -> float
get_next_word_candidates(current_word: str) -> Dict[str, float]
evaluate_perplexity(test_sentences: List[str]) -> float
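
Perplexity is the exponential of the negative average log-probability per token; lower is better. A minimal sketch of the metric itself, independent of the model class (the per-token probabilities here are illustrative inputs, not output of `evaluate_perplexity`):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(-mean log-probability) over a sequence of token probabilities."""
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / len(token_probs))

# A model that assigns uniform probability 1/8 to every token has perplexity 8.
print(perplexity([0.125] * 10))  # ≈ 8.0
```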

Data Loading

load_data_from_file(filename: str) -> List[str]
get_default_sentences() -> List[str]
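
A minimal sketch of file-based loading, assuming one sentence per non-empty line (the real `load_data_from_file` may split the corpus differently):

```python
import os
import tempfile

def load_sentences(filename):
    """Read a text corpus, treating each non-empty line as one sentence."""
    with open(filename, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Demo with a throwaway corpus file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("The cat sat on the mat\n\nThe dog barked at the cat\n")
    path = tmp.name
sentences = load_sentences(path)
os.remove(path)
print(sentences)  # ['The cat sat on the mat', 'The dog barked at the cat']
```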

🔧 Technical Details

Text Preprocessing Pipeline

  1. Case Normalization: Convert all text to lowercase
  2. Punctuation Handling: Preserve sentence-ending punctuation, remove others
  3. Tokenization: Split into individual words
  4. Boundary Tokens: Add <START> and <END> markers
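
The four steps above can be sketched as a single function. This is a hedged approximation, not the code in bigram.py: the exact regex and whether punctuation stays attached to the final word may differ.

```python
import re

START, END = "<START>", "<END>"

def preprocess(sentence):
    """Lowercase, drop punctuation except sentence-ending . ! ?, tokenize, add boundaries."""
    text = sentence.lower()
    text = re.sub(r"[^\w\s.!?]", "", text)  # keep word chars and sentence-enders
    tokens = text.split()
    return [START] + tokens + [END]

print(preprocess("The cat, naturally, sat on the mat!"))
# ['<START>', 'the', 'cat', 'naturally', 'sat', 'on', 'the', 'mat!', '<END>']
```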

Smoothing Algorithm

Uses Add-k smoothing to handle unseen bigrams:

P(w2|w1) = (Count(w1,w2) + k) / (Count(w1) + k × |V|)

Where:

  • k = smoothing factor
  • |V| = vocabulary size
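
A hedged sketch of the formula with toy counts (the counts, vocabulary size, and function name are illustrative, not the model's internals):

```python
from collections import Counter

def addk_probability(bigram_counts, unigram_counts, vocab_size, w1, w2, k=1.0):
    """Add-k smoothed P(w2 | w1) = (Count(w1,w2) + k) / (Count(w1) + k * |V|)."""
    return (bigram_counts[(w1, w2)] + k) / (unigram_counts[w1] + k * vocab_size)

# Toy counts: "the cat" seen twice, "the dog" once, |V| = 5.
bigrams = Counter({("the", "cat"): 2, ("the", "dog"): 1})
unigrams = Counter({"the": 3})

print(addk_probability(bigrams, unigrams, 5, "the", "cat"))   # (2+1)/(3+5) = 0.375
print(addk_probability(bigrams, unigrams, 5, "the", "bird"))  # (0+1)/(3+5) = 0.125
```

Note that the unseen bigram "the bird" still gets nonzero probability, which is exactly what the smoothing is for.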

Model Architecture

Input Text → Preprocessing → Bigram Counting → Probability Calculation → Text Generation

📊 Example Output

Training sentences:
1. The cat sat on the mat
2. The dog barked at the cat
3. The bird sang a beautiful song
...

Model trained on 10 sentences
Vocabulary size: 48

Generated text examples:
Starting with "the": the cat sat on a beautiful song
Starting with "a": a quick brown fox jumps over the lazy dog

🧪 Testing

Run the main script to see the model in action:

python bigram.py

This will:

  1. Train on default sentences
  2. Show model statistics
  3. Generate example text with different starting words
  4. Display bigram probabilities
  5. Demonstrate save/load functionality

📈 Performance Characteristics

  • Training Time: O(n × m) where n = number of sentences, m = average sentence length
  • Memory Usage: O(|V|²) in the worst case; in practice proportional to the number of distinct bigrams observed
  • Generation Speed: O(k × |V|) where k = number of words to generate
  • Vocabulary Scalability: practical for vocabularies into the tens of thousands of words, since storage grows with observed bigrams rather than the full |V|² table

🔄 Legacy Compatibility

The original functional interface is still supported:

# Original functions still work
sentences = prepare_Data()
bigram_probs = build_bigram_model(sentences)
text = generate_text(bigram_probs, "the", 10)

🚧 Limitations

  • Context Window: Only considers the immediately preceding word
  • Long-range Dependencies: Cannot capture long-distance relationships
  • Vocabulary: Limited to words seen during training
  • Generation Quality: May produce repetitive or incoherent long sequences

πŸ›£οΈ Future Improvements

  • N-gram models (trigram, 4-gram)
  • Neural language model integration
  • Better evaluation metrics (BLEU, perplexity)
  • Interactive text generation interface
  • Support for different languages
  • Parallel training for large corpora

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📄 License

This project is open source and available under the MIT License.

📚 References

  • Jurafsky, D. & Martin, J. H. (2023). Speech and Language Processing (3rd ed.)
  • Manning, C. D. & Schütze, H. (1999). Foundations of Statistical Natural Language Processing
  • Chen, S. F. & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling

📞 Contact

For questions, suggestions, or issues, please open an issue in the repository or contact the maintainer.


Built with ❤️ for natural language processing education and research
