title: Simple Text Analyzer
emoji: πŸš€
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 8501
tags:
  - streamlit
  - nlp
  - linguistics
  - japanese
  - corpus-linguistics
pinned: false
short_description: Lexical sophistication analyzer for EN and JP texts
license: cc-by-nc-4.0

Simple Text Analyzer

A web-based application for lexical sophistication analysis of English and Japanese texts. The tool provides detailed linguistic analysis using corpus-based frequency data and modern NLP techniques.

🌟 Features

Multi-Language Support

  • English: COCA corpus frequency analysis with unigrams, bigrams, and trigrams
  • Japanese: BCCWJ (written) and CSJ (spoken) corpus integration with POS-aware frequency matching

Analysis Capabilities

  • Lexical Sophistication: Frequency-based lexical complexity analysis
  • Part-of-Speech Analysis: Detailed POS tagging and classification
  • N-gram Analysis: Bigram and trigram frequency analysis
  • Content vs Function Words: Automatic classification and separate analysis
  • Batch Processing: Multiple file analysis with comparative results

Japanese Language Features ✨ NEW

  • BCCWJ Integration: Balanced Corpus of Contemporary Written Japanese
    • Raw frequency counts
    • Normalized frequency (per million words)
    • Frequency rankings
  • CSJ Integration: Corpus of Spontaneous Japanese (spoken data)
    • Academic and conversational speech patterns
    • Multiple speech style analysis
  • POS-Aware Matching: Composite key lookup using lemma + POS for accurate frequency matching
  • Robust Fallback System: Three-tier lookup strategy:
    1. Primary: lemma_pos composite key (e.g., "葌く_ε‹•θ©ž-θ‡ͺη«‹")
    2. Fallback 1: lemma only lookup
    3. Fallback 2: surface_form lookup
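The three-tier fallback above can be sketched in a few lines. The `lookup_frequency` helper and its field names are hypothetical, chosen only to mirror the strategy described; the table contents are invented.

```python
def lookup_frequency(token, freq_table):
    """Three-tier lookup: lemma_pos composite key, then lemma, then surface form."""
    for key in (f"{token['lemma']}_{token['pos']}",  # tier 1: composite key
                token["lemma"],                       # tier 2: lemma only
                token["surface"]):                    # tier 3: surface form
        if key in freq_table:
            return freq_table[key]
    return None  # no match in any tier

# Invented frequency table and token record:
table = {"葌く_ε‹•θ©ž-θ‡ͺη«‹": 1523, "θ‘γ£γŸ": 87}
token = {"surface": "θ‘γ£γŸ", "lemma": "葌く", "pos": "ε‹•θ©ž-θ‡ͺη«‹"}
lookup_frequency(token, table)  # β†’ 1523 (matched on the composite key)
```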

πŸš€ Quick Start

Prerequisites

  • Python 3.8+
  • uv (recommended) or pip for package management

Installation

# Clone the repository
git clone https://github.com/your-repo/simple-text-analyzer.git
cd simple-text-analyzer

# Install dependencies using uv
uv sync

# Or using pip
pip install -r requirements.txt

# Install required SpaCy models
python -m spacy download en_core_web_trf
python -m spacy download ja_core_news_md  # For Japanese support

Running the Application

# Using uv
uv run streamlit run web_app/app.py

# Or directly
streamlit run web_app/app.py

πŸ“Š Supported Corpora

English

  • COCA Spoken: Corpus of Contemporary American English (spoken subcorpus)
  • COCA Magazine: Magazine text frequency data
  • Bigram/Trigram Analysis: Multi-word expression frequency and association measures

Japanese

  • BCCWJ (Balanced Corpus of Contemporary Written Japanese)

    • 182,604 unique word forms with POS tags
    • Multiple text registers (books, newspapers, magazines, etc.)
    • Comprehensive written language coverage
  • CSJ (Corpus of Spontaneous Japanese)

    • 41,892 unique word forms from spoken data
    • Academic presentations and casual conversations
    • Natural speech pattern analysis

πŸ”§ Architecture

Core Components

  • LexicalSophisticationAnalyzer: Main analysis engine with multi-language support
  • ConfigManager: Flexible configuration system for corpus integration
  • ReferenceManager: Dynamic reference list management
  • SessionManager: State management for web interface

Japanese Integration Features

  • Composite Key Matching: Precision matching using lemma and POS combinations
  • Extensible Design: Easy addition of new subcorpora via YAML configuration
  • Fallback Mechanisms: Robust lookup strategies for maximum coverage
  • Performance Optimized: Pre-computed lookup dictionaries for fast analysis

πŸ“ File Structure

simple-text-analyzer/
β”œβ”€β”€ web_app/                 # Streamlit web application
β”‚   β”œβ”€β”€ app.py              # Main application entry
β”‚   β”œβ”€β”€ config_manager.py   # Configuration management
β”‚   β”œβ”€β”€ reference_manager.py # Reference list handling
β”‚   └── components/         # UI components
β”œβ”€β”€ text_analyzer/          # Core analysis modules
β”‚   β”œβ”€β”€ lexical_sophistication.py  # Main analyzer
β”‚   β”œβ”€β”€ frequency_analyzer.py      # Frequency analysis
β”‚   └── pos_parser.py       # POS tagging utilities
β”œβ”€β”€ config/                 # Configuration files
β”‚   └── reference_lists.yaml       # Corpus configurations
β”œβ”€β”€ resources/              # Corpus data files
β”‚   └── reference_lists/
β”‚       β”œβ”€β”€ en/            # English corpus files
β”‚       └── ja/            # Japanese corpus files
└── test/                  # Test modules

πŸ§ͺ Testing

Test the Japanese integration:

uv run python test_japanese_integration.py

Expected output:

  • βœ… SpaCy model loading
  • βœ… Reference data loading (182K+ BCCWJ entries, 41K+ CSJ entries)
  • βœ… Composite key lookup functionality
  • βœ… Fallback mechanism verification
  • βœ… Complete text analysis pipeline

πŸ“ˆ Usage Examples

Japanese Text Analysis

from text_analyzer.lexical_sophistication import LexicalSophisticationAnalyzer

# Initialize Japanese analyzer
analyzer = LexicalSophisticationAnalyzer(language="ja", model_size="md")

# Load Japanese corpus references
selected_indices = ["BCCWJ_frequency", "CSJ_frequency"]

# Analyze Japanese text
results = analyzer.analyze_text(
    "私は毎ζ—₯ε­¦ζ ‘γ«θ‘ŒγγΎγ™γ€‚", 
    selected_indices
)

# Access frequency scores
for token in results['token_details']:
    print(f"{token['token']}: BCCWJ={token.get('BCCWJ_frequency_lemma', 'NA')}")

English Text Analysis

# Initialize English analyzer
analyzer = LexicalSophisticationAnalyzer(language="en", model_size="trf")

# Analyze with COCA frequency data
results = analyzer.analyze_text(
    "The students studied linguistics carefully.",
    ["COCA_spoken_frequency"]
)

πŸ”§ Configuration

Adding New Japanese Subcorpora

The system is designed for easy expansion. To add a new subcorpus (e.g., BCCWJ Books):

# config/reference_lists.yaml
japanese:
  unigrams:
    BCCWJ_books_frequency:
      display_name: "BCCWJ Books - Frequency"
      description: "BCCWJ books subcorpus frequency data"
      files:
        token: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
        lemma: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
      format: "tsv"
      has_header: true
      enabled: true
      japanese_corpus: true
      columns:
        surface_form: 1  # lForm column
        lemma: 2         # lemma column
        pos: 3           # pos column
        frequency: 10    # PB_frequency column (books subcorpus)

No code changes are required: the system automatically detects and integrates new configurations.
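Auto-detection can work by simply iterating every `enabled: true` entry in the parsed YAML. The sketch below inlines a parsed config as a plain dict; the `enabled_corpora` helper is hypothetical, not the actual ConfigManager API.

```python
# A parsed reference_lists.yaml, inlined here as a plain dict for illustration:
config = {
    "japanese": {
        "unigrams": {
            "BCCWJ_books_frequency": {"enabled": True, "format": "tsv"},
            "legacy_list": {"enabled": False},
        }
    }
}

def enabled_corpora(config, language="japanese", level="unigrams"):
    """Collect every enabled subcorpus entry from the parsed configuration."""
    entries = config.get(language, {}).get(level, {})
    return {name: spec for name, spec in entries.items() if spec.get("enabled")}

list(enabled_corpora(config))  # β†’ ['BCCWJ_books_frequency']
```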

πŸ“š Research Applications

This tool is ideal for:

  • Language Learning Research: Analyzing text complexity for Japanese learners
  • Corpus Linguistics: Cross-linguistic frequency analysis
  • Computational Linguistics: Lexical sophistication measurement
  • Educational Assessment: Text difficulty evaluation
  • Translation Studies: Comparative lexical analysis

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“„ License

This project is licensed under CC BY-NC 4.0; see the LICENSE file for details.

πŸ™ Acknowledgments

  • BCCWJ: National Institute for Japanese Language and Linguistics
  • CSJ: National Institute for Japanese Language and Linguistics
  • COCA: Mark Davies, Brigham Young University
  • SpaCy: Explosion AI for robust NLP models

πŸ“ž Support

For questions, issues, or contributions:

  • Open an issue on GitHub
  • Contact: [Your contact information]

Happy analyzing! πŸš€πŸ“Š