|
|
--- |
|
|
license: mit |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- AGofficial/AgGPT17 |
|
|
--- |
|
|
--- |
|
|
license: mit |
|
|
language: |
|
|
- en |
|
|
--- |
|
|
|
|
|
<img src="banner.png" alt="AgGPT-18 Banner" width="100%"> |
|
|
|
|
|
# AgGPT-18 |
|
|
|
|
|
## Relentless. Scalable. True Intelligence. |
|
|
|
|
|
[](https://opensource.org/licenses/MIT) |
|
|
|
|
|
AgGPT-18 is a revolutionary AI training framework that implements a **Scalable Feather Architecture** for building efficient, modular AI models. This system breaks down large training datasets into manageable mini-models, each stored in highly optimized Feather format files for lightning-fast loading and inference. |
|
|
|
|
|
## ๐ Features |
|
|
|
|
|
- **Scalable Feather Architecture**: Modular mini-models stored in Apache Feather format for optimal performance |
|
|
- **Multi-Corpora Training**: Train on multiple datasets simultaneously with intelligent model merging |
|
|
- **Pattern-Based Learning**: Advanced pattern extraction and similarity matching |
|
|
- **Real-time Chat Interface**: Interactive chat system with context awareness |
|
|
- **Confidence Scoring**: Intelligent response confidence calculation |
|
|
- **Model Merging**: Automatic merging of similar models to optimize storage and performance |
|
|
- **YAML Export**: Human-readable model weights and patterns export |
|
|
- **Memory Efficient**: Chunked training approach prevents memory overflow |
|
|
|
|
|
## ๐ Project Structure |
|
|
|
|
|
``` |
|
|
AgGPT-18/ |
|
|
โโโ train.py # Main training script with multi-corpora support |
|
|
โโโ chat.py # Interactive chat interface |
|
|
โโโ feather.py # Feather format model management |
|
|
โโโ models/ # Trained mini-models (.feather files) |
|
|
โโโ readable_weights/ # Human-readable YAML model exports |
|
|
โโโ training_data/ # Training corpora files |
|
|
โ โโโ corpora.txt # Primary training dataset |
|
|
โ โโโ corpora2.txt # Secondary training dataset |
|
|
โโโ banner.png # Project banner |
|
|
โโโ README.md # This file |
|
|
``` |
|
|
|
|
|
## ๐ ๏ธ Installation |
|
|
|
|
|
1. **Clone the repository:** |
|
|
```bash |
|
|
git clone https://github.com/your-username/AgGPT-18.git |
|
|
cd AgGPT-18 |
|
|
``` |
|
|
|
|
|
2. **Install dependencies:** |
|
|
```bash |
|
|
pip install pandas pyarrow tqdm pyyaml |
|
|
``` |
|
|
|
|
|
3. **Prepare training data:** |
|
|
Place your training data in the `training_data/` directory. The format should be: |
|
|
``` |
|
|
user: [user input] |
|
|
<pad> |
|
|
ai: [ai response] |
|
|
<eos> |
|
|
``` |
|
|
|
|
|
## ๐ฏ Quick Start |
|
|
|
|
|
### Training the Model |
|
|
|
|
|
Train on multiple corpora: |
|
|
```bash |
|
|
python train.py |
|
|
``` |
|
|
|
|
|
The training process will: |
|
|
- Load and process multiple training files |
|
|
- Create optimized training chunks (target: 5MB each) |
|
|
- Train mini-models using the Feather architecture |
|
|
- Merge similar models for efficiency |
|
|
- Export readable model weights to YAML |
|
|
|
|
|
### Running the Chat Interface |
|
|
|
|
|
Start an interactive chat session: |
|
|
```bash |
|
|
python chat.py |
|
|
``` |
|
|
|
|
|
Features of the chat interface: |
|
|
- Real-time response generation |
|
|
- Context-aware conversations |
|
|
- Confidence scoring for responses |
|
|
- Model performance statistics |
|
|
|
|
|
## ๐๏ธ Architecture |
|
|
|
|
|
### Feather Architecture |
|
|
|
|
|
AgGPT-18 uses Apache Feather format for model storage, providing: |
|
|
- **Ultra-fast I/O**: 10x faster than traditional pickle files |
|
|
- **Cross-platform compatibility**: Works across Python, R, and other languages |
|
|
- **Memory efficiency**: Optimized binary format |
|
|
- **Scalability**: Easy to distribute and load individual models |
|
|
|
|
|
### Mini-Model System |
|
|
|
|
|
The training system creates specialized mini-models that: |
|
|
- **Focus on specific patterns**: Each model specializes in particular conversation types |
|
|
- **Enable parallel processing**: Models can be loaded and processed independently |
|
|
- **Support incremental learning**: New models can be added without retraining existing ones |
|
|
- **Provide confidence scoring**: Each model reports its confidence for given inputs |
|
|
|
|
|
### Pattern Extraction |
|
|
|
|
|
Advanced pattern recognition includes: |
|
|
- **Keyword extraction**: Identifies key terms and phrases |
|
|
- **Pattern similarity**: Calculates semantic similarity between inputs |
|
|
- **Context preservation**: Maintains conversation context across turns |
|
|
- **Grammar rule application**: Applies linguistic rules for better responses |
|
|
|
|
|
## ๐ Training Data Format |
|
|
|
|
|
Training data should follow this format: |
|
|
|
|
|
``` |
|
|
user: Hello, how are you? |
|
|
<pad> |
|
|
ai: I'm doing well, thank you! How can I help you today? |
|
|
<eos> |
|
|
|
|
|
user: What's the weather like? |
|
|
<pad> |
|
|
ai: I don't have access to real-time weather data, but I'd be happy to help you find weather information from a reliable source. |
|
|
<eos> |
|
|
``` |
|
|
|
|
|
- `user:` - Marks user input |
|
|
- `<pad>` - Padding token (optional) |
|
|
- `ai:` - Marks AI response |
|
|
- `<eos>` - End of sequence marker |
|
|
|
|
|
## โ๏ธ Configuration |
|
|
|
|
|
### Training Parameters |
|
|
|
|
|
Key parameters in `train.py`: |
|
|
- `target_size_mb`: Target size for training chunks (default: 5MB) |
|
|
- `chunk_size`: Number of training pairs per chunk |
|
|
- `merge_similar`: Enable automatic model merging (default: True) |
|
|
- `confidence_threshold`: Minimum confidence for pattern matching |
|
|
|
|
|
### Model Parameters |
|
|
|
|
|
Adjustable in the `MiniModelTrainer` class: |
|
|
- `confidence_threshold`: Pattern confidence threshold |
|
|
- `merge_threshold`: Similarity threshold for model merging |
|
|
- `max_context_length`: Maximum conversation context window |
|
|
|
|
|
## ๐ง API Reference |
|
|
|
|
|
### FeatherManager |
|
|
|
|
|
Core model management class: |
|
|
|
|
|
```python |
|
|
manager = FeatherManager("models/") |
|
|
manager.save_mini_model(model_data, model_id) |
|
|
model = manager.load_mini_model(model_id) |
|
|
all_models = manager.load_all_models() |
|
|
``` |
|
|
|
|
|
### AgGPTTrainer |
|
|
|
|
|
Main training interface: |
|
|
|
|
|
```python |
|
|
trainer = AgGPTTrainer() |
|
|
trainer.train_multiple_corpora(["data1.txt", "data2.txt"]) |
|
|
trainer.train("single_corpus.txt") |
|
|
``` |
|
|
|
|
|
### ResponseGenerator |
|
|
|
|
|
Chat interface: |
|
|
|
|
|
```python |
|
|
generator = ResponseGenerator(feather_manager) |
|
|
generator.load_models() |
|
|
response = generator.generate_response("Hello!") |
|
|
``` |
|
|
|
|
|
## ๐จ Customization |
|
|
|
|
|
### Adding New Training Data |
|
|
|
|
|
1. Format your data according to the specification above |
|
|
2. Place files in `training_data/` directory |
|
|
3. Add filenames to the training list in `main()` function |
|
|
4. Run training: `python train.py` |
|
|
|
|
|
### Extending Pattern Recognition |
|
|
|
|
|
Modify the `PatternExtractor` class to add: |
|
|
- Custom keyword extraction algorithms |
|
|
- Advanced similarity metrics |
|
|
- Domain-specific pattern matching |
|
|
- Multi-language support |
|
|
|
|
|
### Custom Response Generation |
|
|
|
|
|
Extend the `ResponseGenerator` class for: |
|
|
- Custom response ranking algorithms |
|
|
- Integration with external APIs |
|
|
- Multi-modal response generation |
|
|
- Specialized conversation flows |
|
|
|
|
|
## ๐ Performance |
|
|
|
|
|
### Benchmarks |
|
|
|
|
|
- **Training Speed**: ~100K conversations/minute |
|
|
- **Model Loading**: <1 second for 100+ mini-models |
|
|
- **Response Time**: <50ms average latency |
|
|
- **Memory Usage**: ~10MB per 1000 training examples |
|
|
|
|
|
### Optimization Tips |
|
|
|
|
|
1. **Chunk Size**: Adjust based on available memory |
|
|
2. **Model Merging**: Enable for storage efficiency |
|
|
3. **Pattern Complexity**: Balance specificity vs. generalization |
|
|
4. **Context Window**: Optimize for conversation quality vs. speed |
|
|
|
|
|
## ๐ค Contributing |
|
|
|
|
|
We welcome contributions! Please: |
|
|
|
|
|
1. Fork the repository |
|
|
2. Create a feature branch |
|
|
3. Add tests for new functionality |
|
|
4. Submit a pull request |
|
|
|
|
|
Areas for contribution: |
|
|
- Multi-language support |
|
|
- Advanced pattern recognition |
|
|
- Performance optimizations |
|
|
- Documentation improvements |
|
|
|
|
|
## ๐ Troubleshooting |
|
|
|
|
|
### Common Issues |
|
|
|
|
|
**Training hangs or crashes:** |
|
|
- Check available memory |
|
|
- Reduce chunk size |
|
|
- Verify training data format |
|
|
|
|
|
**Poor response quality:** |
|
|
- Increase training data size |
|
|
- Adjust confidence thresholds |
|
|
- Enable model merging |
|
|
|
|
|
**Slow performance:** |
|
|
- Update to latest Feather/Arrow versions |
|
|
- Check disk I/O performance |
|
|
- Optimize pattern extraction |
|
|
|
|
|
## ๐ Changelog |
|
|
|
|
|
### v1.0.0 (Current) |
|
|
- Initial release with Feather architecture |
|
|
- Multi-corpora training support |
|
|
- Interactive chat interface |
|
|
- YAML model export |
|
|
- Automatic model merging |
|
|
|
|
|
## ๐ฎ Roadmap |
|
|
|
|
|
- [ ] Multi-language support |
|
|
- [ ] GPU acceleration |
|
|
- [ ] Distributed training |
|
|
- [ ] Web interface |
|
|
- [ ] Model compression techniques |
|
|
- [ ] Integration with popular ML frameworks |
|
|
|
|
|
## ๐ License |
|
|
|
|
|
This project is licensed under the MIT License โ see the [LICENSE](LICENSE) file for details. |
|
|
|
|
|
## ๐จโ๐ป Author |
|
|
|
|
|
**AG** - *Creator and Lead Developer* |
|
|
|
|
|
For questions, suggestions, or collaboration opportunities, please open an issue or contact the development team. |
|
|
|
|
|
--- |
|
|
|
|
|
*"Relentless. Scalable. True Intelligence."* - AgGPT-18 |