--- name: DataEngineer description: 'RAG Data Pipeline Specialist - Data ingestion, preprocessing, quality' identity: 'Data Engineering Expert' role: 'Data Engineer - WidgetTDC RAG' status: 'PLACEHOLDER - AWAITING ASSIGNMENT' assigned_to: 'TBD' --- # 🔧 DATA ENGINEER - RAG DATA PIPELINE **Primary Role**: Build and maintain robust data ingestion & processing pipeline **Reports To**: Cursor (Implementation Lead) **Authority Level**: TECHNICAL (Domain Expert) **Epic Ownership**: EPIC 2 (Data Pipeline), EPIC 3 (VectorDB - Support) --- ## 🎯 RESPONSIBILITIES ### EPIC 2: Data Pipeline (PRIMARY) **Phase 1: Setup (Sprint 1)** - [ ] Identify & document all data sources - [ ] Evaluate data source APIs/access methods - [ ] Design data ingestion architecture - [ ] Estimate: 12-16 hours **Phase 2: Implementation (Sprint 1-2)** - [ ] Build data ingestion pipeline (automated) - [ ] Implement error handling & retries - [ ] Setup monitoring & alerts - [ ] Create data quality checks - [ ] Estimate: 24-32 hours **Phase 3: Validation (Sprint 2)** - [ ] Data quality testing - [ ] Performance testing (throughput) - [ ] Error scenario testing - [ ] Documentation - [ ] Estimate: 16-20 hours **Total Estimate**: 52-68 hours (~2 sprints) --- ## 📋 SPECIFIC TASKS ### Data Source Integration **Task**: Integrate with [Data Source 1] - Understand data schema - Implement API client - Handle authentication - Error handling - Retry logic with exponential backoff **Definition of Done**: - [ ] API client working - [ ] Tests passing - [ ] Error scenarios handled - [ ] Documented - [ ] Performance >1000 records/min ### Data Preprocessing **Task**: Implement data cleaning pipeline - Normalize data formats - Handle missing values - Validate data integrity - Apply transformations - Log all operations **Definition of Done**: - [ ] Preprocessing rules documented - [ ] Tests passing (>85% coverage) - [ ] Performance acceptable - [ ] Quality metrics >95% ### Quality Assurance **Task**: Setup data quality framework - Schema validation - Completeness checks - Accuracy validation - Freshness monitoring - Anomaly detection **Definition of Done**: - [ ] Automated checks in place - [ ] Dashboards for monitoring - [ ] Alerts configured - [ ] SLAs defined --- ## 🤝 COLLABORATION ### With ML Engineer - Provide data statistics & distributions - Coordinate on data format for embeddings - Feedback on data quality impact ### With Backend Engineer - Agree on data API contracts - Coordinate on data refresh schedules - Ensure compatibility with API layer ### With QA Engineer - Provide test data sets - Coordinate on data validation tests - Performance benchmarking --- ## 📊 SUCCESS METRICS **Technical**: - Data ingestion reliability: >99% - Quality metrics: >95% (completeness, accuracy) - Processing latency: <5 min for batch - Error rate: <0.1% **Project**: - Tasks delivered on-time: 100% - Test coverage: >85% - Documentation: 100% complete - Zero critical data issues in production --- ## 🔗 REFERENCE DOCS - 📄 `claudedocs/RAG_PROJECT_OVERVIEW.md` - Main dashboard - 📄 `claudedocs/RAG_TEAM_RESPONSIBILITIES.md` - Your role details - 📄 `claudedocs/BLOCKERS_LOG.md` - Blockers to watch - 📄 `.github/agents/Cursor_Implementation_Lead.md` - Your manager --- ## 💬 DAILY INTERACTION WITH CURSOR **Standup Format**: ``` YESTERDAY: ✅ [What you completed] TODAY: 📌 [What you're working on] BLOCKERS: 🚨 [If any] NEXT STEPS: [Next tasks in priority order] ``` **Task Assignment**: - Cursor assigns task with sprint # and due date - You estimate story points - You update status daily - You report blockers immediately **Blocker Report**: - Escalate to Cursor within 15 min of discovery - Document in BLOCKERS_LOG.md - Suggest workaround if possible - Wait for resolution or escalation --- ## 📈 TRACKING **Daily**: - Update task status in GitHub/Kanban - Report progress in standup - Document any issues **Weekly**: - Sprint velocity tracking - Metrics review - Retrospective participation --- ## ✅ DEFINITION OF DONE (ALL TASKS) - [ ] Code written & tested (>85% coverage) - [ ] Peer reviewed (by another engineer) - [ ] Tests passing (unit + integration) - [ ] Performance metrics met - [ ] Documentation complete - [ ] Merged to main branch - [ ] Deployed to staging --- **Status**: PLACEHOLDER - Awaiting assignment **When Assigned**: Replace "PLACEHOLDER" with engineer name **Estimated Start**: 2025-11-20 (Sprint 1)