---
title: Simple Text Analyzer
emoji: πŸš€
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 8501
tags:
  - streamlit
  - nlp
  - linguistics
  - japanese
  - corpus-linguistics
pinned: false
short_description: Lexical sophistication analyzer for EN and JP texts
license: cc-by-nc-4.0
---

# Simple Text Analyzer

A web-based application for lexical sophistication analysis of both English and Japanese texts. The tool provides detailed linguistic analysis using corpus-based frequency data and modern NLP techniques.

## 🌟 Features

### Multi-Language Support

- **English**: COCA corpus frequency analysis with unigrams, bigrams, and trigrams
- **Japanese**: BCCWJ (written) and CSJ (spoken) corpus integration with POS-aware frequency matching

### Analysis Capabilities

- **Lexical Sophistication**: Frequency-based lexical complexity analysis
- **Part-of-Speech Analysis**: Detailed POS tagging and classification
- **N-gram Analysis**: Bigram and trigram frequency analysis
- **Content vs. Function Words**: Automatic classification and separate analysis
- **Batch Processing**: Multiple-file analysis with comparative results

### Japanese Language Features ✨ **NEW**

- **BCCWJ Integration**: Balanced Corpus of Contemporary Written Japanese
  - Raw frequency counts
  - Normalized frequency (per million words)
  - Frequency rankings
- **CSJ Integration**: Corpus of Spontaneous Japanese (spoken data)
  - Academic and conversational speech patterns
  - Multiple speech-style analysis
- **POS-Aware Matching**: Composite-key lookup using `lemma + POS` for accurate frequency retrieval
- **Robust Fallback System**: Three-tier lookup strategy (sketched in code below):
  1. Primary: `lemma_pos` composite key (e.g., `葌く_ε‹•θ©ž-θ‡ͺη«‹`)
  2. Fallback 1: `lemma`-only lookup
  3. Fallback 2: `surface_form` lookup

## πŸš€ Quick Start

### Prerequisites

- Python 3.8+
- uv (recommended) or pip for package management

### Installation

```bash
# Clone the repository
git clone https://github.com/your-repo/simple-text-analyzer.git
cd simple-text-analyzer

# Install dependencies using uv
uv sync

# Or using pip
pip install -r requirements.txt

# Install the required spaCy models
python -m spacy download en_core_web_trf
python -m spacy download ja_core_news_md  # For Japanese support
```

### Running the Application

```bash
# Using uv
uv run streamlit run web_app/app.py

# Or directly
streamlit run web_app/app.py
```

## πŸ“Š Supported Corpora

### English

- **COCA Spoken**: Corpus of Contemporary American English (spoken subcorpus)
- **COCA Magazine**: Magazine text frequency data
- **Bigram/Trigram Analysis**: Multi-word expression frequency and association measures

### Japanese

- **BCCWJ (Balanced Corpus of Contemporary Written Japanese)**
  - 182,604 unique word forms with POS tags
  - Multiple text registers (books, newspapers, magazines, etc.)
  - Comprehensive written-language coverage
- **CSJ (Corpus of Spontaneous Japanese)**
  - 41,892 unique word forms from spoken data
  - Academic presentations and casual conversations
  - Natural speech pattern analysis
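As a concrete illustration of the three-tier fallback strategy described under Japanese Language Features, here is a minimal sketch. The function name, dictionary layout, and frequency values are hypothetical, for illustration only, and are not the project's actual API:

```python
from typing import Dict, Optional


def lookup_frequency(
    reference: Dict[str, float],
    lemma: str,
    pos: str,
    surface_form: str,
) -> Optional[float]:
    """Three-tier lookup: lemma_pos composite key, then lemma, then surface form."""
    # Tier 1: composite key, e.g. "葌く_ε‹•θ©ž-θ‡ͺη«‹"
    composite_key = f"{lemma}_{pos}"
    if composite_key in reference:
        return reference[composite_key]
    # Tier 2: fall back to the bare lemma
    if lemma in reference:
        return reference[lemma]
    # Tier 3: fall back to the surface form as it appeared in the text
    return reference.get(surface_form)


# Hypothetical BCCWJ entries keyed three ways (frequency values are made up)
bccwj = {"葌く_ε‹•θ©ž-θ‡ͺη«‹": 1852.3, "葌く": 1902.7, "葌き": 311.4}
print(lookup_frequency(bccwj, "葌く", "ε‹•θ©ž-θ‡ͺη«‹", "葌き"))  # 1852.3, matched at tier 1
```

Resolving the composite key first keeps homographs with different parts of speech apart; the bare-lemma and surface-form tiers only serve as coverage fallbacks.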
## πŸ”§ Architecture

### Core Components

- **LexicalSophisticationAnalyzer**: Main analysis engine with multi-language support
- **ConfigManager**: Flexible configuration system for corpus integration
- **ReferenceManager**: Dynamic reference list management
- **SessionManager**: State management for the web interface

### Japanese Integration Features

- **Composite-Key Matching**: Precise matching using lemma and POS combinations
- **Extensible Design**: Easy addition of new subcorpora via YAML configuration
- **Fallback Mechanisms**: Robust lookup strategies for maximum coverage
- **Performance Optimized**: Pre-computed lookup dictionaries for fast analysis

## πŸ“ File Structure

```
simple-text-analyzer/
β”œβ”€β”€ web_app/                       # Streamlit web application
β”‚   β”œβ”€β”€ app.py                     # Main application entry
β”‚   β”œβ”€β”€ config_manager.py          # Configuration management
β”‚   β”œβ”€β”€ reference_manager.py       # Reference list handling
β”‚   └── components/                # UI components
β”œβ”€β”€ text_analyzer/                 # Core analysis modules
β”‚   β”œβ”€β”€ lexical_sophistication.py  # Main analyzer
β”‚   β”œβ”€β”€ frequency_analyzer.py      # Frequency analysis
β”‚   └── pos_parser.py              # POS tagging utilities
β”œβ”€β”€ config/                        # Configuration files
β”‚   └── reference_lists.yaml       # Corpus configurations
β”œβ”€β”€ resources/                     # Corpus data files
β”‚   └── reference_lists/
β”‚       β”œβ”€β”€ en/                    # English corpus files
β”‚       └── ja/                    # Japanese corpus files
└── test/                          # Test modules
```

## πŸ§ͺ Testing

Test the Japanese integration:

```bash
uv run python test_japanese_integration.py
```

Expected output:

- βœ… spaCy model loading
- βœ… Reference data loading (182K+ BCCWJ entries, 41K+ CSJ entries)
- βœ… Composite-key lookup functionality
- βœ… Fallback mechanism verification
- βœ… Complete text analysis pipeline

## πŸ“ˆ Usage Examples

### Japanese Text Analysis

```python
from text_analyzer.lexical_sophistication import LexicalSophisticationAnalyzer

# Initialize the Japanese analyzer
analyzer = LexicalSophisticationAnalyzer(language="ja", model_size="md")

# Select Japanese corpus references
selected_indices = ["BCCWJ_frequency", "CSJ_frequency"]

# Analyze Japanese text
results = analyzer.analyze_text(
    "私は毎ζ—₯ε­¦ζ ‘γ«θ‘ŒγγΎγ™γ€‚",
    selected_indices
)

# Access frequency scores
for token in results['token_details']:
    print(f"{token['token']}: BCCWJ={token.get('BCCWJ_frequency_lemma', 'NA')}")
```

### English Text Analysis

```python
# Initialize the English analyzer
analyzer = LexicalSophisticationAnalyzer(language="en", model_size="trf")

# Analyze with COCA frequency data
results = analyzer.analyze_text(
    "The students studied linguistics carefully.",
    ["COCA_spoken_frequency"]
)
```
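### Batch Analysis (sketch)

Batch processing can also be driven from the same API. The following is a minimal sketch that reuses `analyze_text` as shown in the examples above; the `texts/` folder and the comparative summary at the end are illustrative assumptions, not part of the documented interface:

```python
from pathlib import Path

from text_analyzer.lexical_sophistication import LexicalSophisticationAnalyzer

analyzer = LexicalSophisticationAnalyzer(language="ja", model_size="md")
selected_indices = ["BCCWJ_frequency", "CSJ_frequency"]

# Analyze every .txt file in a folder and keep per-file results for comparison
# ("texts/" is a hypothetical input directory)
results_by_file = {}
for path in sorted(Path("texts").glob("*.txt")):
    text = path.read_text(encoding="utf-8")
    results_by_file[path.name] = analyzer.analyze_text(text, selected_indices)

# Illustrative comparative summary: token count per file
for name, results in results_by_file.items():
    print(f"{name}: {len(results['token_details'])} tokens")
```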
## πŸ”§ Configuration

### Adding New Japanese Subcorpora

The system is designed for easy expansion. To add a new subcorpus (e.g., BCCWJ Books):

```yaml
# config/reference_lists.yaml
japanese:
  unigrams:
    BCCWJ_books_frequency:
      display_name: "BCCWJ Books - Frequency"
      description: "BCCWJ books subcorpus frequency data"
      files:
        token: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
        lemma: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
      format: "tsv"
      has_header: true
      enabled: true
      japanese_corpus: true
      columns:
        surface_form: 1  # lForm column
        lemma: 2         # lemma column
        pos: 3           # pos column
        frequency: 10    # PB_frequency column (books subcorpus)
```

No code changes are required: the system automatically detects and integrates new configurations.

## πŸ“š Research Applications

This tool is ideal for:

- **Language Learning Research**: Analyzing text complexity for learners of Japanese
- **Corpus Linguistics**: Cross-linguistic frequency analysis
- **Computational Linguistics**: Lexical sophistication measurement
- **Educational Assessment**: Text difficulty evaluation
- **Translation Studies**: Comparative lexical analysis

## 🀝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## πŸ“„ License

This project is licensed under the CC BY-NC 4.0 license; see the [LICENSE](LICENSE) file for details.

## πŸ™ Acknowledgments

- **BCCWJ**: National Institute for Japanese Language and Linguistics
- **CSJ**: National Institute for Japanese Language and Linguistics
- **COCA**: Mark Davies, Brigham Young University
- **spaCy**: Explosion AI for robust NLP models

## πŸ“ž Support

For questions, issues, or contributions:

- Open an issue on GitHub
- Contact: [Your contact information]

---

**Happy analyzing!** πŸš€πŸ“Š