title: Simple Text Analyzer
emoji: πŸš€
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 8501
tags:
  - streamlit
  - nlp
  - linguistics
  - japanese
  - corpus-linguistics
pinned: false
short_description: Lexical sophistication analyzer for EN and JP texts
license: cc-by-nc-4.0

Simple Text Analyzer

A web-based application for lexical sophistication analysis of English and Japanese texts. The tool provides detailed linguistic analysis using corpus-based frequency data and modern NLP techniques.

🌟 Features

Multi-Language Support

  • English: COCA corpus frequency analysis with unigrams, bigrams, and trigrams
  • Japanese: BCCWJ (written) and CSJ (spoken) corpus integration with POS-aware frequency matching

Analysis Capabilities

  • Lexical Sophistication: Frequency-based lexical complexity analysis
  • Part-of-Speech Analysis: Detailed POS tagging and classification
  • N-gram Analysis: Bigram and trigram frequency analysis
  • Content vs Function Words: Automatic classification and separate analysis
  • Batch Processing: Multiple file analysis with comparative results

Japanese Language Features ✨ NEW

  • BCCWJ Integration: Balanced Corpus of Contemporary Written Japanese
    • Raw frequency counts
    • Normalized frequency (per million words)
    • Frequency rankings
  • CSJ Integration: Corpus of Spontaneous Japanese (spoken data)
    • Academic and conversational speech patterns
    • Multiple speech style analysis
  • POS-Aware Matching: Composite key lookup using lemma + POS for accurate frequency matching
  • Robust Fallback System: Three-tier lookup strategy:
    1. Primary: lemma_pos composite key (e.g., "葌く_ε‹•θ©ž-θ‡ͺη«‹")
    2. Fallback 1: lemma only lookup
    3. Fallback 2: surface_form lookup
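The three-tier fallback above can be sketched in a few lines. The `lookup_frequency` helper and its field names are hypothetical, chosen only to mirror the strategy described; the table contents are invented.

```python
def lookup_frequency(token, freq_table):
    """Three-tier lookup: lemma_pos composite key, then lemma, then surface form."""
    for key in (f"{token['lemma']}_{token['pos']}",  # tier 1: composite key
                token["lemma"],                       # tier 2: lemma only
                token["surface"]):                    # tier 3: surface form
        if key in freq_table:
            return freq_table[key]
    return None  # no match in any tier

# Invented frequency table and token record:
table = {"葌く_ε‹•θ©ž-θ‡ͺη«‹": 1523, "θ‘γ£γŸ": 87}
token = {"surface": "θ‘γ£γŸ", "lemma": "葌く", "pos": "ε‹•θ©ž-θ‡ͺη«‹"}
lookup_frequency(token, table)  # β†’ 1523 (matched on the composite key)
```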

πŸš€ Quick Start

Prerequisites

  • Python 3.8+
  • uv (recommended) or pip for package management

Installation

# Clone the repository
git clone https://github.com/your-repo/simple-text-analyzer.git
cd simple-text-analyzer

# Install dependencies using uv
uv sync

# Or using pip
pip install -r requirements.txt

# Install required SpaCy models
python -m spacy download en_core_web_trf
python -m spacy download ja_core_news_md  # For Japanese support

Running the Application

# Using uv
uv run streamlit run web_app/app.py

# Or directly
streamlit run web_app/app.py

πŸ“Š Supported Corpora

English

  • COCA Spoken: Corpus of Contemporary American English (spoken subcorpus)
  • COCA Magazine: Magazine text frequency data
  • Bigram/Trigram Analysis: Multi-word expression frequency and association measures

Japanese

  • BCCWJ (Balanced Corpus of Contemporary Written Japanese)

    • 182,604 unique word forms with POS tags
    • Multiple text registers (books, newspapers, magazines, etc.)
    • Comprehensive written language coverage
  • CSJ (Corpus of Spontaneous Japanese)

    • 41,892 unique word forms from spoken data
    • Academic presentations and casual conversations
    • Natural speech pattern analysis

πŸ”§ Architecture

Core Components

  • LexicalSophisticationAnalyzer: Main analysis engine with multi-language support
  • ConfigManager: Flexible configuration system for corpus integration
  • ReferenceManager: Dynamic reference list management
  • SessionManager: State management for web interface

Japanese Integration Features

  • Composite Key Matching: Precision matching using lemma and POS combinations
  • Extensible Design: Easy addition of new subcorpora via YAML configuration
  • Fallback Mechanisms: Robust lookup strategies for maximum coverage
  • Performance Optimized: Pre-computed lookup dictionaries for fast analysis

πŸ“ File Structure

simple-text-analyzer/
β”œβ”€β”€ web_app/                 # Streamlit web application
β”‚   β”œβ”€β”€ app.py              # Main application entry
β”‚   β”œβ”€β”€ config_manager.py   # Configuration management
β”‚   β”œβ”€β”€ reference_manager.py # Reference list handling
β”‚   └── components/         # UI components
β”œβ”€β”€ text_analyzer/          # Core analysis modules
β”‚   β”œβ”€β”€ lexical_sophistication.py  # Main analyzer
β”‚   β”œβ”€β”€ frequency_analyzer.py      # Frequency analysis
β”‚   └── pos_parser.py       # POS tagging utilities
β”œβ”€β”€ config/                 # Configuration files
β”‚   └── reference_lists.yaml       # Corpus configurations
β”œβ”€β”€ resources/              # Corpus data files
β”‚   └── reference_lists/
β”‚       β”œβ”€β”€ en/            # English corpus files
β”‚       └── ja/            # Japanese corpus files
└── test/                  # Test modules

πŸ§ͺ Testing

Test the Japanese integration:

uv run python test_japanese_integration.py

Expected output:

  • βœ… SpaCy model loading
  • βœ… Reference data loading (182K+ BCCWJ entries, 41K+ CSJ entries)
  • βœ… Composite key lookup functionality
  • βœ… Fallback mechanism verification
  • βœ… Complete text analysis pipeline

πŸ“ˆ Usage Examples

Japanese Text Analysis

from text_analyzer.lexical_sophistication import LexicalSophisticationAnalyzer

# Initialize Japanese analyzer
analyzer = LexicalSophisticationAnalyzer(language="ja", model_size="md")

# Load Japanese corpus references
selected_indices = ["BCCWJ_frequency", "CSJ_frequency"]

# Analyze Japanese text
results = analyzer.analyze_text(
    "私は毎ζ—₯ε­¦ζ ‘γ«θ‘ŒγγΎγ™γ€‚", 
    selected_indices
)

# Access frequency scores
for token in results['token_details']:
    print(f"{token['token']}: BCCWJ={token.get('BCCWJ_frequency_lemma', 'NA')}")

English Text Analysis

# Initialize English analyzer
analyzer = LexicalSophisticationAnalyzer(language="en", model_size="trf")

# Analyze with COCA frequency data
results = analyzer.analyze_text(
    "The students studied linguistics carefully.",
    ["COCA_spoken_frequency"]
)

πŸ”§ Configuration

Adding New Japanese Subcorpora

The system is designed for easy expansion. To add a new subcorpus (e.g., BCCWJ Books):

# config/reference_lists.yaml
japanese:
  unigrams:
    BCCWJ_books_frequency:
      display_name: "BCCWJ Books - Frequency"
      description: "BCCWJ books subcorpus frequency data"
      files:
        token: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
        lemma: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
      format: "tsv"
      has_header: true
      enabled: true
      japanese_corpus: true
      columns:
        surface_form: 1  # lForm column
        lemma: 2         # lemma column
        pos: 3           # pos column
        frequency: 10    # PB_frequency column (books subcorpus)

No code changes are required: the system automatically detects and integrates new configurations.
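Auto-detection can work by simply iterating every `enabled: true` entry in the parsed YAML. The sketch below inlines a parsed config as a plain dict; the `enabled_corpora` helper is hypothetical, not the actual ConfigManager API.

```python
# A parsed reference_lists.yaml, inlined here as a plain dict for illustration:
config = {
    "japanese": {
        "unigrams": {
            "BCCWJ_books_frequency": {"enabled": True, "format": "tsv"},
            "legacy_list": {"enabled": False},
        }
    }
}

def enabled_corpora(config, language="japanese", level="unigrams"):
    """Collect every enabled subcorpus entry from the parsed configuration."""
    entries = config.get(language, {}).get(level, {})
    return {name: spec for name, spec in entries.items() if spec.get("enabled")}

list(enabled_corpora(config))  # β†’ ['BCCWJ_books_frequency']
```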

πŸ“š Research Applications

This tool is ideal for:

  • Language Learning Research: Analyzing text complexity for Japanese learners
  • Corpus Linguistics: Cross-linguistic frequency analysis
  • Computational Linguistics: Lexical sophistication measurement
  • Educational Assessment: Text difficulty evaluation
  • Translation Studies: Comparative lexical analysis

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“„ License

This project is licensed under CC BY-NC 4.0; see the LICENSE file for details.

πŸ™ Acknowledgments

  • BCCWJ: National Institute for Japanese Language and Linguistics
  • CSJ: National Institute for Japanese Language and Linguistics
  • COCA: Mark Davies, Brigham Young University
  • SpaCy: Explosion AI for robust NLP models

πŸ“ž Support

For questions, issues, or contributions:

  • Open an issue on GitHub
  • Contact: [Your contact information]

Happy analyzing! πŸš€πŸ“Š