---
title: Simple Text Analyzer
emoji: πŸš€
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 8501
tags:
- streamlit
- nlp
- linguistics
- japanese
- corpus-linguistics
pinned: false
short_description: Lexical sophistication analyzer for EN and JP texts
license: cc-by-nc-4.0
---
# Simple Text Analyzer
A comprehensive web-based application for lexical sophistication analysis of both English and Japanese texts. This tool provides detailed linguistic analysis using corpus-based frequency data and modern NLP techniques.
## 🌟 Features
### Multi-Language Support
- **English**: COCA corpus frequency analysis with unigrams, bigrams, and trigrams
- **Japanese**: BCCWJ (written) and CSJ (spoken) corpus integration with POS-aware frequency matching
### Analysis Capabilities
- **Lexical Sophistication**: Frequency-based lexical complexity analysis
- **Part-of-Speech Analysis**: Detailed POS tagging and classification
- **N-gram Analysis**: Bigram and trigram frequency analysis
- **Content vs Function Words**: Automatic classification and separate analysis
- **Batch Processing**: Multiple file analysis with comparative results
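The content/function split above can be sketched by universal POS tag. This is an illustrative assumption about how the classification might work; the exact tag sets the app uses may differ:

```python
# Illustrative content/function word split by universal POS tag.
# These tag sets are assumptions, not the app's exact lists.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}
FUNCTION_POS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ"}

def classify_tokens(tagged_tokens):
    """Split (token, upos) pairs into content and function word lists."""
    content = [t for t, pos in tagged_tokens if pos in CONTENT_POS]
    function = [t for t, pos in tagged_tokens if pos in FUNCTION_POS]
    return content, function

tokens = [("students", "NOUN"), ("the", "DET"),
          ("studied", "VERB"), ("carefully", "ADV")]
content_words, function_words = classify_tokens(tokens)
```

In practice the POS tags would come from the SpaCy pipeline rather than being hand-supplied.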
### Japanese Language Features ✨ **NEW**
- **BCCWJ Integration**: Balanced Corpus of Contemporary Written Japanese
- Raw frequency counts
- Normalized frequency (per million words)
- Frequency rankings
- **CSJ Integration**: Corpus of Spontaneous Japanese (spoken data)
- Academic and conversational speech patterns
- Multiple speech style analysis
- **POS-Aware Matching**: Composite key lookup using `lemma + POS` for accurate frequency matching
- **Robust Fallback System**: Three-tier lookup strategy:
1. Primary: `lemma_pos` composite key (e.g., "葌く_ε‹•θ©ž-θ‡ͺη«‹")
2. Fallback 1: `lemma` only lookup
3. Fallback 2: `surface_form` lookup
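The three-tier strategy above can be sketched as a small lookup function. The table and parameter names here are assumptions for illustration; only the `lemma_pos` key format comes from the description above:

```python
# Minimal sketch of the three-tier fallback lookup described above.
# Table names and signature are illustrative assumptions.
def lookup_frequency(surface, lemma, pos,
                     lemma_pos_table, lemma_table, surface_table):
    """Return a frequency via composite key, then lemma, then surface form."""
    key = f"{lemma}_{pos}"                 # e.g. "葌く_ε‹•θ©ž-θ‡ͺη«‹"
    if key in lemma_pos_table:             # 1. primary: lemma + POS
        return lemma_pos_table[key]
    if lemma in lemma_table:               # 2. fallback: lemma only
        return lemma_table[lemma]
    return surface_table.get(surface)      # 3. fallback: surface form (or None)
```

Each tier only fires when the more precise one misses, so coverage degrades gracefully rather than failing outright.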
## πŸš€ Quick Start
### Prerequisites
- Python 3.8+
- uv (recommended) or pip for package management
### Installation
```bash
# Clone the repository
git clone https://github.com/your-repo/simple-text-analyzer.git
cd simple-text-analyzer
# Install dependencies using uv
uv sync
# Or using pip
pip install -r requirements.txt
# Install required SpaCy models
python -m spacy download en_core_web_trf
python -m spacy download ja_core_news_md # For Japanese support
```
### Running the Application
```bash
# Using uv
uv run streamlit run web_app/app.py
# Or directly
streamlit run web_app/app.py
```
## πŸ“Š Supported Corpora
### English
- **COCA Spoken**: Corpus of Contemporary American English (spoken subcorpus)
- **COCA Magazine**: Magazine text frequency data
- **Bigram/Trigram Analysis**: Multi-word expression frequency and association measures
### Japanese
- **BCCWJ (Balanced Corpus of Contemporary Written Japanese)**
- 182,604 unique word forms with POS tags
- Multiple text registers (books, newspapers, magazines, etc.)
- Comprehensive written language coverage
- **CSJ (Corpus of Spontaneous Japanese)**
- 41,892 unique word forms from spoken data
- Academic presentations and casual conversations
- Natural speech pattern analysis
## πŸ”§ Architecture
### Core Components
- **LexicalSophisticationAnalyzer**: Main analysis engine with multi-language support
- **ConfigManager**: Flexible configuration system for corpus integration
- **ReferenceManager**: Dynamic reference list management
- **SessionManager**: State management for web interface
### Japanese Integration Features
- **Composite Key Matching**: Precision matching using lemma and POS combinations
- **Extensible Design**: Easy addition of new subcorpora via YAML configuration
- **Fallback Mechanisms**: Robust lookup strategies for maximum coverage
- **Performance Optimized**: Pre-computed lookup dictionaries for fast analysis
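A pre-computed composite-key table of the kind described above might be built along these lines. The row layout, corpus size, and per-million normalization shown here are illustrative assumptions:

```python
# Sketch of building a pre-computed composite-key lookup table,
# including frequency normalized per million words.
# Row layout (surface, lemma, pos, raw_freq) is an assumption.
def build_lookup(rows, corpus_size):
    """Build a dict keyed on 'lemma_pos' with raw and per-million frequencies."""
    table = {}
    for surface, lemma, pos, freq in rows:
        per_million = freq / corpus_size * 1_000_000
        table[f"{lemma}_{pos}"] = {"raw": freq, "per_million": per_million}
    return table

rows = [("葌き", "葌く", "ε‹•θ©ž-θ‡ͺη«‹", 5000)]
table = build_lookup(rows, corpus_size=100_000_000)
# table["葌く_ε‹•θ©ž-θ‡ͺη«‹"]["per_million"] -> 50.0
```

Building the dictionary once at load time turns every per-token lookup into an O(1) hash access during analysis.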
## πŸ“ File Structure
```
simple-text-analyzer/
β”œβ”€β”€ web_app/                         # Streamlit web application
β”‚   β”œβ”€β”€ app.py                       # Main application entry
β”‚   β”œβ”€β”€ config_manager.py            # Configuration management
β”‚   β”œβ”€β”€ reference_manager.py         # Reference list handling
β”‚   └── components/                  # UI components
β”œβ”€β”€ text_analyzer/                   # Core analysis modules
β”‚   β”œβ”€β”€ lexical_sophistication.py    # Main analyzer
β”‚   β”œβ”€β”€ frequency_analyzer.py        # Frequency analysis
β”‚   └── pos_parser.py                # POS tagging utilities
β”œβ”€β”€ config/                          # Configuration files
β”‚   └── reference_lists.yaml         # Corpus configurations
β”œβ”€β”€ resources/                       # Corpus data files
β”‚   └── reference_lists/
β”‚       β”œβ”€β”€ en/                      # English corpus files
β”‚       └── ja/                      # Japanese corpus files
└── test/                            # Test modules
```
## πŸ§ͺ Testing
Test the Japanese integration:
```bash
uv run python test_japanese_integration.py
```
Expected output:
- βœ… SpaCy model loading
- βœ… Reference data loading (182K+ BCCWJ entries, 41K+ CSJ entries)
- βœ… Composite key lookup functionality
- βœ… Fallback mechanism verification
- βœ… Complete text analysis pipeline
## πŸ“ˆ Usage Examples
### Japanese Text Analysis
```python
from text_analyzer.lexical_sophistication import LexicalSophisticationAnalyzer

# Initialize Japanese analyzer
analyzer = LexicalSophisticationAnalyzer(language="ja", model_size="md")

# Select Japanese corpus references
selected_indices = ["BCCWJ_frequency", "CSJ_frequency"]

# Analyze Japanese text
results = analyzer.analyze_text(
    "私は毎ζ—₯ε­¦ζ ‘γ«θ‘ŒγγΎγ™γ€‚",
    selected_indices,
)

# Access frequency scores
for token in results['token_details']:
    print(f"{token['token']}: BCCWJ={token.get('BCCWJ_frequency_lemma', 'NA')}")
```
### English Text Analysis
```python
# Initialize English analyzer
analyzer = LexicalSophisticationAnalyzer(language="en", model_size="trf")

# Analyze with COCA frequency data
results = analyzer.analyze_text(
    "The students studied linguistics carefully.",
    ["COCA_spoken_frequency"],
)
```
## πŸ”§ Configuration
### Adding New Japanese Subcorpora
The system is designed for easy expansion. To add a new subcorpus (e.g., BCCWJ Books):
```yaml
# config/reference_lists.yaml
japanese:
  unigrams:
    BCCWJ_books_frequency:
      display_name: "BCCWJ Books - Frequency"
      description: "BCCWJ books subcorpus frequency data"
      files:
        token: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
        lemma: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
      format: "tsv"
      has_header: true
      enabled: true
      japanese_corpus: true
      columns:
        surface_form: 1   # lForm column
        lemma: 2          # lemma column
        pos: 3            # pos column
        frequency: 10     # PB_frequency column (books subcorpus)
```
No code changes are required: the system automatically detects and integrates new configurations.
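Auto-detection of enabled subcorpora could work roughly as follows. The config is shown here as an already-parsed dict (in the app it would be loaded from `reference_lists.yaml`); the traversal logic is an illustrative assumption that mirrors the key names in the YAML above:

```python
# Sketch of discovering enabled reference lists from the parsed config.
# Structure mirrors the YAML example; the traversal is an assumption.
def enabled_indices(config):
    """Collect the names of all enabled unigram reference lists."""
    return [
        name
        for lang in config.values()
        for name, spec in lang.get("unigrams", {}).items()
        if spec.get("enabled")
    ]

config = {
    "japanese": {
        "unigrams": {
            "BCCWJ_books_frequency": {"enabled": True},
            "CSJ_frequency": {"enabled": False},
        }
    }
}
# enabled_indices(config) -> ["BCCWJ_books_frequency"]
```

Because only `enabled: true` entries are collected, a new subcorpus added to the YAML file appears in the UI without touching the code.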
## πŸ“š Research Applications
This tool is ideal for:
- **Language Learning Research**: Analyzing text complexity for Japanese learners
- **Corpus Linguistics**: Cross-linguistic frequency analysis
- **Computational Linguistics**: Lexical sophistication measurement
- **Educational Assessment**: Text difficulty evaluation
- **Translation Studies**: Comparative lexical analysis
## 🀝 Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## πŸ“„ License
This project is licensed under the CC BY-NC 4.0 License - see the [LICENSE](LICENSE) file for details.
## πŸ™ Acknowledgments
- **BCCWJ**: National Institute for Japanese Language and Linguistics
- **CSJ**: National Institute for Japanese Language and Linguistics
- **COCA**: Mark Davies, Brigham Young University
- **SpaCy**: Explosion AI for robust NLP models
## πŸ“ž Support
For questions, issues, or contributions:
- Open an issue on GitHub
- Contact: [Your contact information]
---
**Happy analyzing!** πŸš€πŸ“Š