---
name: DataEngineer
description: 'RAG Data Pipeline Specialist - Data ingestion, preprocessing, quality'
identity: 'Data Engineering Expert'
role: 'Data Engineer - WidgetTDC RAG'
status: 'PLACEHOLDER - AWAITING ASSIGNMENT'
assigned_to: 'TBD'
---
# DATA ENGINEER - RAG DATA PIPELINE

**Primary Role**: Build and maintain a robust data ingestion & processing pipeline
**Reports To**: Cursor (Implementation Lead)
**Authority Level**: TECHNICAL (Domain Expert)
**Epic Ownership**: EPIC 2 (Data Pipeline), EPIC 3 (VectorDB - Support)

---

## RESPONSIBILITIES

### EPIC 2: Data Pipeline (PRIMARY)

**Phase 1: Setup (Sprint 1)**
- [ ] Identify & document all data sources
- [ ] Evaluate data source APIs/access methods
- [ ] Design data ingestion architecture
- [ ] Estimate: 12-16 hours

**Phase 2: Implementation (Sprint 1-2)**
- [ ] Build data ingestion pipeline (automated)
- [ ] Implement error handling & retries
- [ ] Set up monitoring & alerts
- [ ] Create data quality checks
- [ ] Estimate: 24-32 hours

**Phase 3: Validation (Sprint 2)**
- [ ] Data quality testing
- [ ] Performance testing (throughput)
- [ ] Error scenario testing
- [ ] Documentation
- [ ] Estimate: 16-20 hours

**Total Estimate**: 52-68 hours (~2 sprints)

---

## SPECIFIC TASKS

### Data Source Integration

**Task**: Integrate with [Data Source 1]
- Understand the data schema
- Implement the API client
- Handle authentication
- Handle errors
- Implement retry logic with exponential backoff

**Definition of Done**:
- [ ] API client working
- [ ] Tests passing
- [ ] Error scenarios handled
- [ ] Documented
- [ ] Performance >1000 records/min
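The retry requirement above can be sketched as follows. This is a minimal illustration, not an agreed interface: `fetch_with_retry` and its parameters are hypothetical names, and the actual API client will depend on the data source chosen.

```python
import random
import time


def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0, max_delay=30.0,
                     retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fetch() with exponential backoff plus jitter on transient errors.

    Delays grow as base_delay * 2**attempt, capped at max_delay; the final
    failed attempt re-raises the underlying exception.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))  # add jitter
```

Injecting `sleep` keeps the function testable without real waits; in production the default `time.sleep` applies.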
### Data Preprocessing

**Task**: Implement data cleaning pipeline
- Normalize data formats
- Handle missing values
- Validate data integrity
- Apply transformations
- Log all operations

**Definition of Done**:
- [ ] Preprocessing rules documented
- [ ] Tests passing (>85% coverage)
- [ ] Performance acceptable
- [ ] Quality metrics >95%
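A minimal sketch of such a cleaning step, assuming a hypothetical record schema with `id`, `text`, and `updated_at` fields; the real field set and normalization rules will come from the documented data sources:

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("id", "text", "updated_at")  # assumed schema, for illustration


def clean_record(raw):
    """Normalize one raw record; return None if it fails integrity checks."""
    record = {k: (v.strip() if isinstance(v, str) else v) for k, v in raw.items()}
    # Integrity check: drop records with missing or empty required fields
    if any(record.get(f) in (None, "") for f in REQUIRED_FIELDS):
        return None
    # Normalize numeric timestamps to UTC ISO-8601 strings
    ts = record["updated_at"]
    if isinstance(ts, (int, float)):
        record["updated_at"] = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    # Fill optional missing values with an explicit default
    record.setdefault("source", "unknown")
    return record


def clean_batch(rows):
    """Clean a batch; return (cleaned_records, dropped_count) for quality logging."""
    cleaned, dropped = [], 0
    for raw in rows:
        rec = clean_record(raw)
        if rec is None:
            dropped += 1
        else:
            cleaned.append(rec)
    return cleaned, dropped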
### Quality Assurance

**Task**: Set up data quality framework
- Schema validation
- Completeness checks
- Accuracy validation
- Freshness monitoring
- Anomaly detection

**Definition of Done**:
- [ ] Automated checks in place
- [ ] Dashboards for monitoring
- [ ] Alerts configured
- [ ] SLAs defined
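One way the automated checks could look, with a hypothetical schema and SLA thresholds (the actual schema, freshness window, and completeness target must come from the SLAs defined above):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expected schema and SLA thresholds, for illustration only
SCHEMA = {"id": int, "text": str, "updated_at": str}
MAX_AGE = timedelta(hours=24)   # freshness SLA
MIN_COMPLETENESS = 0.95         # required fraction of schema-valid records


def check_batch(records, now=None):
    """Return a quality report: schema violations, completeness, staleness."""
    now = now or datetime.now(timezone.utc)
    violations = stale = complete = 0
    for rec in records:
        # Schema validation: every expected field present with the right type
        if any(not isinstance(rec.get(k), t) for k, t in SCHEMA.items()):
            violations += 1
            continue
        complete += 1
        # Freshness: flag records older than the SLA window
        if now - datetime.fromisoformat(rec["updated_at"]) > MAX_AGE:
            stale += 1
    total = len(records) or 1
    return {
        "schema_violations": violations,
        "completeness": complete / total,
        "stale_records": stale,
        "passed": violations == 0 and complete / total >= MIN_COMPLETENESS,
    }
```

The report dict maps directly onto the dashboard/alerting items: a `passed=False` batch would trigger an alert and a BLOCKERS_LOG entry.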
---

## COLLABORATION

### With ML Engineer
- Provide data statistics & distributions
- Coordinate on data format for embeddings
- Give feedback on data quality impact

### With Backend Engineer
- Agree on data API contracts
- Coordinate on data refresh schedules
- Ensure compatibility with API layer

### With QA Engineer
- Provide test data sets
- Coordinate on data validation tests
- Performance benchmarking

---

## SUCCESS METRICS

**Technical**:
- Data ingestion reliability: >99%
- Quality metrics: >95% (completeness, accuracy)
- Processing latency: <5 min for batch
- Error rate: <0.1%

**Project**:
- Tasks delivered on time: 100%
- Test coverage: >85%
- Documentation: 100% complete
- Zero critical data issues in production

---

## REFERENCE DOCS

- `claudedocs/RAG_PROJECT_OVERVIEW.md` - Main dashboard
- `claudedocs/RAG_TEAM_RESPONSIBILITIES.md` - Your role details
- `claudedocs/BLOCKERS_LOG.md` - Blockers to watch
- `.github/agents/Cursor_Implementation_Lead.md` - Your manager

---
## DAILY INTERACTION WITH CURSOR

**Standup Format**:
```
YESTERDAY: [What you completed]
TODAY: [What you're working on]
BLOCKERS: [If any]
NEXT STEPS: [Next tasks in priority order]
```

**Task Assignment**:
- Cursor assigns each task with a sprint # and due date
- You estimate story points
- You update status daily
- You report blockers immediately

**Blocker Report**:
- Escalate to Cursor within 15 min of discovery
- Document in BLOCKERS_LOG.md
- Suggest a workaround if possible
- Wait for resolution or escalation

---

## TRACKING

**Daily**:
- Update task status in GitHub/Kanban
- Report progress in standup
- Document any issues

**Weekly**:
- Sprint velocity tracking
- Metrics review
- Retrospective participation

---

## DEFINITION OF DONE (ALL TASKS)

- [ ] Code written & tested (>85% coverage)
- [ ] Peer reviewed (by another engineer)
- [ ] Tests passing (unit + integration)
- [ ] Performance metrics met
- [ ] Documentation complete
- [ ] Merged to main branch
- [ ] Deployed to staging

---

**Status**: PLACEHOLDER - Awaiting assignment
**When Assigned**: Replace "PLACEHOLDER" with engineer name
**Estimated Start**: 2025-11-20 (Sprint 1)