# NexForge Tokenizer Examples

This directory contains example scripts demonstrating advanced usage of the NexForge tokenizer.

## Quick Start

### Basic Tokenizer Creation

```python
from nexforgetokenizer import build_tokenizer

# Build a tokenizer with the default settings (40k vocab, min_frequency=2)
build_tokenizer(
    input_dir="path/to/your/files",
    output_path="custom_tokenizer.json",
    vocab_size=40000,
    min_frequency=2
)
```
### Example Scripts

1. **Basic Example** (`basic_usage.py`)
   - Simple tokenizer creation and usage
   - Basic encoding/decoding
   - Vocabulary inspection
2. **Advanced Usage** (`advanced_usage.py`)
   - Custom special tokens
   - Batch processing
   - Performance optimization
   - Error handling
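The batch-processing step amounts to gathering a bounded set of input files before handing them to the tokenizer builder. A minimal, standard-library sketch of that idea (the `gather_files` helper, the glob pattern, and the file names are illustrative, not part of the actual script):

```python
import tempfile
from pathlib import Path

def gather_files(input_dir, pattern="*.txt", max_files=1000):
    """Collect up to max_files matching files, sorted for reproducibility."""
    files = sorted(Path(input_dir).rglob(pattern))
    return files[:max_files]

# Demo with a throwaway directory containing three small files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "c.txt"):
        (Path(tmp) / name).write_text("sample text")
    batch = gather_files(tmp, max_files=2)
    print([f.name for f in batch])  # → ['a.txt', 'b.txt']
```

Sorting before truncating keeps runs reproducible even when the filesystem returns entries in arbitrary order.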
## Running Examples

```bash
# Install in development mode
pip install -e .

# Run basic example
python examples/basic_usage.py

# Run advanced example
python examples/advanced_usage.py --input-dir ../Dataset --output my_tokenizer.json
```
## Example: Creating a Custom Tokenizer

```python
from nexforgetokenizer import build_tokenizer

# Build a tokenizer with custom settings
build_tokenizer(
    input_dir="../Dataset",
    output_path="my_tokenizer.json",
    vocab_size=30000,  # Smaller vocabulary for a specific domain
    min_frequency=3,   # Only include tokens appearing at least 3 times
    max_files=1000,    # Limit the number of files to process
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)
```
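The output file is plain JSON, so a build can be sanity-checked without extra dependencies. The structure below (a top-level `model.vocab` mapping, as in the Hugging Face `tokenizers` JSON format) is an assumption about the file layout, not a documented guarantee; the snippet writes a tiny stand-in file so it is self-contained:

```python
import json
import tempfile
from pathlib import Path

# Stand-in tokenizer file; in practice point this at my_tokenizer.json.
# The "model"/"vocab" layout is an assumed format, not guaranteed.
sample = {"model": {"vocab": {"[PAD]": 0, "[UNK]": 1, "the": 2, "token": 3}}}
path = Path(tempfile.gettempdir()) / "demo_tokenizer.json"
path.write_text(json.dumps(sample))

data = json.loads(path.read_text())
vocab = data["model"]["vocab"]
print(len(vocab))        # → 4 (vocabulary size)
print("[UNK]" in vocab)  # → True (special token present)
```

Checking that the special tokens landed in the vocabulary is a quick way to catch a misconfigured `special_tokens` argument.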
## Best Practices

1. **For General Use**
   - Use the default settings (40k vocab, min_frequency=2)
   - Process all files in your dataset
   - Test with the built-in test suite
2. **For Specialized Domains**
   - Adjust the vocabulary size to match domain complexity
   - Raise min_frequency to produce a smaller vocabulary
   - Test with domain-specific files
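The effect of `min_frequency` on vocabulary size can be estimated before a full build by counting token occurrences in a sample of your corpus. A rough sketch using whitespace tokens (the real tokenizer's segmentation will differ, so treat the numbers as a ballpark):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat slept"
counts = Counter(corpus.split())

def vocab_at(min_frequency):
    """Number of distinct tokens kept at a given frequency threshold."""
    return sum(1 for c in counts.values() if c >= min_frequency)

print(vocab_at(1))  # → 6 distinct tokens
print(vocab_at(2))  # → 2 ("the" appears 3x, "cat" 2x)
```

Plotting `vocab_at` over a few thresholds on a real sample shows how aggressively each `min_frequency` value prunes rare tokens.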
## Need Help?

- Check the [main README](../README.md) for basic usage
- Review the test cases in `Test_tokenizer/`
- Open an issue on GitHub for support

## License

MIT License - See [LICENSE](../LICENSE) for details.