---
title: Simple Text Analyzer
emoji: πŸš€
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 8501
tags:
- streamlit
- nlp
- linguistics
- japanese
- corpus-linguistics
pinned: false
short_description: Lexical sophistication analyzer for EN and JP texts
license: cc-by-nc-4.0
---
# Simple Text Analyzer
A comprehensive web-based application for lexical sophistication analysis of both English and Japanese texts. This tool provides detailed linguistic analysis using corpus-based frequency data and modern NLP techniques.
## 🌟 Features
### Multi-Language Support
- **English**: COCA corpus frequency analysis with unigrams, bigrams, and trigrams
- **Japanese**: BCCWJ (written) and CSJ (spoken) corpus integration with POS-aware frequency matching
### Analysis Capabilities
- **Lexical Sophistication**: Frequency-based lexical complexity analysis
- **Part-of-Speech Analysis**: Detailed POS tagging and classification
- **N-gram Analysis**: Bigram and trigram frequency analysis
- **Content vs Function Words**: Automatic classification and separate analysis
- **Batch Processing**: Multiple file analysis with comparative results
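The content/function split above can be sketched by universal POS tag. This is an illustrative assumption about how the classification might work; the exact tag sets the app uses may differ:

```python
# Illustrative content/function word split by universal POS tag.
# These tag sets are assumptions, not the app's exact lists.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}
FUNCTION_POS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ"}

def classify_tokens(tagged_tokens):
    """Split (token, upos) pairs into content and function word lists."""
    content = [t for t, pos in tagged_tokens if pos in CONTENT_POS]
    function = [t for t, pos in tagged_tokens if pos in FUNCTION_POS]
    return content, function

tokens = [("students", "NOUN"), ("the", "DET"),
          ("studied", "VERB"), ("carefully", "ADV")]
content_words, function_words = classify_tokens(tokens)
```

In practice the POS tags would come from the SpaCy pipeline rather than being hand-supplied.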
### Japanese Language Features ✨ **NEW**
- **BCCWJ Integration**: Balanced Corpus of Contemporary Written Japanese
- Raw frequency counts
- Normalized frequency (per million words)
- Frequency rankings
- **CSJ Integration**: Corpus of Spontaneous Japanese (spoken data)
- Academic and conversational speech patterns
- Multiple speech style analysis
- **POS-Aware Matching**: Composite key lookup using `lemma + POS` for accurate frequency matching
- **Robust Fallback System**: Three-tier lookup strategy:
1. Primary: `lemma_pos` composite key (e.g., "葌く_ε‹•θ©ž-θ‡ͺη«‹")
2. Fallback 1: `lemma` only lookup
3. Fallback 2: `surface_form` lookup
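The three-tier strategy above can be sketched as a small lookup function. The table and parameter names here are assumptions for illustration; only the `lemma_pos` key format comes from the description above:

```python
# Minimal sketch of the three-tier fallback lookup described above.
# Table names and signature are illustrative assumptions.
def lookup_frequency(surface, lemma, pos,
                     lemma_pos_table, lemma_table, surface_table):
    """Return a frequency via composite key, then lemma, then surface form."""
    key = f"{lemma}_{pos}"                 # e.g. "葌く_ε‹•θ©ž-θ‡ͺη«‹"
    if key in lemma_pos_table:             # 1. primary: lemma + POS
        return lemma_pos_table[key]
    if lemma in lemma_table:               # 2. fallback: lemma only
        return lemma_table[lemma]
    return surface_table.get(surface)      # 3. fallback: surface form (or None)
```

Each tier only fires when the more precise one misses, so coverage degrades gracefully rather than failing outright.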
## πŸš€ Quick Start
### Prerequisites
- Python 3.8+
- uv (recommended) or pip for package management
### Installation
```bash
# Clone the repository
git clone https://github.com/your-repo/simple-text-analyzer.git
cd simple-text-analyzer
# Install dependencies using uv
uv sync
# Or using pip
pip install -r requirements.txt
# Install required SpaCy models
python -m spacy download en_core_web_trf
python -m spacy download ja_core_news_md # For Japanese support
```
### Running the Application
```bash
# Using uv
uv run streamlit run web_app/app.py
# Or directly
streamlit run web_app/app.py
```
## πŸ“Š Supported Corpora
### English
- **COCA Spoken**: Corpus of Contemporary American English (spoken subcorpus)
- **COCA Magazine**: Magazine text frequency data
- **Bigram/Trigram Analysis**: Multi-word expression frequency and association measures
### Japanese
- **BCCWJ (Balanced Corpus of Contemporary Written Japanese)**
- 182,604 unique word forms with POS tags
- Multiple text registers (books, newspapers, magazines, etc.)
- Comprehensive written language coverage
- **CSJ (Corpus of Spontaneous Japanese)**
- 41,892 unique word forms from spoken data
- Academic presentations and casual conversations
- Natural speech pattern analysis
## πŸ”§ Architecture
### Core Components
- **LexicalSophisticationAnalyzer**: Main analysis engine with multi-language support
- **ConfigManager**: Flexible configuration system for corpus integration
- **ReferenceManager**: Dynamic reference list management
- **SessionManager**: State management for web interface
### Japanese Integration Features
- **Composite Key Matching**: Precision matching using lemma and POS combinations
- **Extensible Design**: Easy addition of new subcorpora via YAML configuration
- **Fallback Mechanisms**: Robust lookup strategies for maximum coverage
- **Performance Optimized**: Pre-computed lookup dictionaries for fast analysis
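A pre-computed composite-key table of the kind described above might be built along these lines. The row layout, corpus size, and per-million normalization shown here are illustrative assumptions:

```python
# Sketch of building a pre-computed composite-key lookup table,
# including frequency normalized per million words.
# Row layout (surface, lemma, pos, raw_freq) is an assumption.
def build_lookup(rows, corpus_size):
    """Build a dict keyed on 'lemma_pos' with raw and per-million frequencies."""
    table = {}
    for surface, lemma, pos, freq in rows:
        per_million = freq / corpus_size * 1_000_000
        table[f"{lemma}_{pos}"] = {"raw": freq, "per_million": per_million}
    return table

rows = [("葌き", "葌く", "ε‹•θ©ž-θ‡ͺη«‹", 5000)]
table = build_lookup(rows, corpus_size=100_000_000)
# table["葌く_ε‹•θ©ž-θ‡ͺη«‹"]["per_million"] -> 50.0
```

Building the dictionary once at load time turns every per-token lookup into an O(1) hash access during analysis.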
## πŸ“ File Structure
```
simple-text-analyzer/
β”œβ”€β”€ web_app/                         # Streamlit web application
β”‚   β”œβ”€β”€ app.py                       # Main application entry
β”‚   β”œβ”€β”€ config_manager.py            # Configuration management
β”‚   β”œβ”€β”€ reference_manager.py         # Reference list handling
β”‚   └── components/                  # UI components
β”œβ”€β”€ text_analyzer/                   # Core analysis modules
β”‚   β”œβ”€β”€ lexical_sophistication.py    # Main analyzer
β”‚   β”œβ”€β”€ frequency_analyzer.py        # Frequency analysis
β”‚   └── pos_parser.py                # POS tagging utilities
β”œβ”€β”€ config/                          # Configuration files
β”‚   └── reference_lists.yaml         # Corpus configurations
β”œβ”€β”€ resources/                       # Corpus data files
β”‚   └── reference_lists/
β”‚       β”œβ”€β”€ en/                      # English corpus files
β”‚       └── ja/                      # Japanese corpus files
└── test/                            # Test modules
```
## πŸ§ͺ Testing
Test the Japanese integration:
```bash
uv run python test_japanese_integration.py
```
Expected output:
- βœ… SpaCy model loading
- βœ… Reference data loading (182K+ BCCWJ entries, 41K+ CSJ entries)
- βœ… Composite key lookup functionality
- βœ… Fallback mechanism verification
- βœ… Complete text analysis pipeline
## πŸ“ˆ Usage Examples
### Japanese Text Analysis
```python
from text_analyzer.lexical_sophistication import LexicalSophisticationAnalyzer

# Initialize Japanese analyzer
analyzer = LexicalSophisticationAnalyzer(language="ja", model_size="md")

# Select Japanese corpus references
selected_indices = ["BCCWJ_frequency", "CSJ_frequency"]

# Analyze Japanese text
results = analyzer.analyze_text(
    "私は毎ζ—₯ε­¦ζ ‘γ«θ‘ŒγγΎγ™γ€‚",
    selected_indices,
)

# Access frequency scores
for token in results['token_details']:
    print(f"{token['token']}: BCCWJ={token.get('BCCWJ_frequency_lemma', 'NA')}")
```
### English Text Analysis
```python
# Initialize English analyzer
analyzer = LexicalSophisticationAnalyzer(language="en", model_size="trf")

# Analyze with COCA frequency data
results = analyzer.analyze_text(
    "The students studied linguistics carefully.",
    ["COCA_spoken_frequency"],
)
```
## πŸ”§ Configuration
### Adding New Japanese Subcorpora
The system is designed for easy expansion. To add a new subcorpus (e.g., BCCWJ Books):
```yaml
# config/reference_lists.yaml
japanese:
  unigrams:
    BCCWJ_books_frequency:
      display_name: "BCCWJ Books - Frequency"
      description: "BCCWJ books subcorpus frequency data"
      files:
        token: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
        lemma: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
      format: "tsv"
      has_header: true
      enabled: true
      japanese_corpus: true
      columns:
        surface_form: 1   # lForm column
        lemma: 2          # lemma column
        pos: 3            # pos column
        frequency: 10     # PB_frequency column (books subcorpus)
```
No code changes are required: the system automatically detects and integrates new configurations.
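Auto-detection of enabled subcorpora could work roughly as follows. The config is shown here as an already-parsed dict (in the app it would be loaded from `reference_lists.yaml`); the traversal logic is an illustrative assumption that mirrors the key names in the YAML above:

```python
# Sketch of discovering enabled reference lists from the parsed config.
# Structure mirrors the YAML example; the traversal is an assumption.
def enabled_indices(config):
    """Collect the names of all enabled unigram reference lists."""
    return [
        name
        for lang in config.values()
        for name, spec in lang.get("unigrams", {}).items()
        if spec.get("enabled")
    ]

config = {
    "japanese": {
        "unigrams": {
            "BCCWJ_books_frequency": {"enabled": True},
            "CSJ_frequency": {"enabled": False},
        }
    }
}
# enabled_indices(config) -> ["BCCWJ_books_frequency"]
```

Because only `enabled: true` entries are collected, a new subcorpus added to the YAML file appears in the UI without touching the code.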
## πŸ“š Research Applications
This tool is ideal for:
- **Language Learning Research**: Analyzing text complexity for Japanese learners
- **Corpus Linguistics**: Cross-linguistic frequency analysis
- **Computational Linguistics**: Lexical sophistication measurement
- **Educational Assessment**: Text difficulty evaluation
- **Translation Studies**: Comparative lexical analysis
## 🀝 Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## πŸ“„ License
This project is licensed under the CC BY-NC 4.0 License - see the [LICENSE](LICENSE) file for details.
## πŸ™ Acknowledgments
- **BCCWJ**: National Institute for Japanese Language and Linguistics
- **CSJ**: National Institute for Japanese Language and Linguistics
- **COCA**: Mark Davies, Brigham Young University
- **SpaCy**: Explosion AI for robust NLP models
## πŸ“ž Support
For questions, issues, or contributions:
- Open an issue on GitHub
- Contact: [Your contact information]
---
**Happy analyzing!** πŸš€πŸ“Š