---
name: DataEngineer
description: RAG Data Pipeline Specialist - Data ingestion, preprocessing, quality
identity: Data Engineering Expert
role: Data Engineer - WidgetTDC RAG
status: PLACEHOLDER - AWAITING ASSIGNMENT
assigned_to: TBD
---
🔧 DATA ENGINEER - RAG DATA PIPELINE
Primary Role: Build and maintain a robust data ingestion & processing pipeline
Reports To: Cursor (Implementation Lead)
Authority Level: TECHNICAL (Domain Expert)
Epic Ownership: EPIC 2 (Data Pipeline), EPIC 3 (VectorDB - Support)
🎯 RESPONSIBILITIES
EPIC 2: Data Pipeline (PRIMARY)
Phase 1: Setup (Sprint 1)
- Identify & document all data sources
- Evaluate data source APIs/access methods
- Design data ingestion architecture
- Estimate: 12-16 hours
Phase 2: Implementation (Sprint 1-2)
- Build the automated data ingestion pipeline (a skeleton is sketched below)
- Implement error handling & retries
- Set up monitoring & alerts
- Create data quality checks
- Estimate: 24-32 hours
Phase 3: Validation (Sprint 2)
- Data quality testing
- Performance testing (throughput)
- Error scenario testing
- Documentation
- Estimate: 16-20 hours
Total Estimate: 52-68 hours (~2 sprints)
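
To make these phases concrete, here is a minimal sketch of how the pieces could fit together. Everything in it is an assumption: the batch model, the callable names, and the `Record` type are placeholders until the data sources are identified in Phase 1.

```python
# Minimal batch-pipeline skeleton; every callable is a placeholder to be
# filled in by the tasks below, not a committed design.
from dataclasses import dataclass, field
from typing import Callable, Iterable

Record = dict  # placeholder record type


@dataclass
class PipelineResult:
    ingested: int = 0
    rejected: int = 0
    errors: list[str] = field(default_factory=list)


def run_pipeline(
    fetch_batches: Callable[[], Iterable[list[Record]]],
    preprocess: Callable[[list[Record]], list[Record]],
    quality_ok: Callable[[list[Record]], bool],
    write: Callable[[list[Record]], None],
) -> PipelineResult:
    """One batch at a time: ingest -> clean -> quality-gate -> load."""
    result = PipelineResult()
    for batch in fetch_batches():
        try:
            clean = preprocess(batch)
            if quality_ok(clean):
                write(clean)
                result.ingested += len(clean)
            else:
                result.rejected += len(clean)  # quarantine for review
        except Exception as exc:  # real code would retry and alert here
            result.errors.append(repr(exc))
    return result
```

Keeping ingestion, cleaning, and quality gating as injected callables keeps each of the specific tasks below independently testable.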
📋 SPECIFIC TASKS
Data Source Integration
Task: Integrate with [Data Source 1]
- Understand data schema
- Implement API client
- Handle authentication
- Error handling
- Retry logic with exponential backoff (see the client sketch below)
Definition of Done:
- API client working
- Tests passing
- Error scenarios handled
- Documented
- Throughput >1,000 records/min
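
A minimal sketch of what the client and retry logic could look like, assuming the source exposes a paginated HTTP/JSON endpoint with bearer-token auth; the URL path, auth scheme, and response shape are hypothetical until the real source is documented.

```python
# Hypothetical API client with retry + exponential backoff.
import time

import requests


class SourceClient:
    def __init__(self, base_url: str, token: str, max_retries: int = 5):
        self.base_url = base_url.rstrip("/")
        self.max_retries = max_retries
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {token}"

    def fetch_page(self, page: int) -> list[dict]:
        delay = 1.0
        for _ in range(self.max_retries):
            try:
                resp = self.session.get(
                    f"{self.base_url}/records", params={"page": page}, timeout=30
                )
            except (requests.ConnectionError, requests.Timeout):
                pass  # transient network error: fall through to backoff
            else:
                if resp.status_code not in (429, 500, 502, 503, 504):
                    resp.raise_for_status()  # non-retryable errors surface immediately
                    return resp.json()["records"]
                # 429/5xx are retryable: fall through to backoff
            time.sleep(delay)
            delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
        raise RuntimeError(f"giving up after {self.max_retries} attempts")
```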
Data Preprocessing
Task: Implement data cleaning pipeline (sketched below)
- Normalize data formats
- Handle missing values
- Validate data integrity
- Apply transformations
- Log all operations
Definition of Done:
- Preprocessing rules documented
- Tests passing (>85% coverage)
- Performance meets targets (batch latency <5 min)
- Quality metrics >95%
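
A minimal sketch of the cleaning step, assuming batches arrive as pandas DataFrames; the column names (id, text, updated_at) are hypothetical examples, not the real schema.

```python
# Sketch of the cleaning step over a hypothetical schema.
import logging

import pandas as pd

log = logging.getLogger("preprocess")


def clean(df: pd.DataFrame) -> pd.DataFrame:
    before = len(df)
    df = df.dropna(subset=["id", "text"])  # drop records missing required fields
    # normalize format: collapse whitespace in the text field
    df["text"] = df["text"].str.replace(r"\s+", " ", regex=True).str.strip()
    df["updated_at"] = pd.to_datetime(df["updated_at"], errors="coerce", utc=True)
    df = df.drop_duplicates(subset="id", keep="last")  # integrity: one row per id
    log.info("cleaned batch: %d -> %d rows", before, len(df))  # log all operations
    return df
```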
Quality Assurance
Task: Set up a data quality framework (sketched below)
- Schema validation
- Completeness checks
- Accuracy validation
- Freshness monitoring
- Anomaly detection
Definition of Done:
- Automated checks in place
- Dashboards for monitoring
- Alerts configured
- SLAs defined
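
A minimal sketch of the automated checks, reusing the hypothetical columns from the preprocessing sketch; the thresholds mirror the targets listed in this document, and the volume bounds are placeholders.

```python
# Sketch of automated quality checks over a cleaned batch.
import pandas as pd

REQUIRED_COLUMNS = {"id", "text", "updated_at"}  # assumed schema


def quality_report(df: pd.DataFrame, max_age_hours: float = 24.0) -> dict[str, bool]:
    """Return pass/fail per check; a monitoring job can alert on any False."""
    checks = {"schema": REQUIRED_COLUMNS.issubset(df.columns)}
    if checks["schema"] and len(df):
        # completeness: worst non-null ratio across required columns
        completeness = df[sorted(REQUIRED_COLUMNS)].notna().mean().min()
        checks["completeness"] = bool(completeness > 0.95)
        # freshness: the newest record must be recent enough
        age_h = (pd.Timestamp.now(tz="UTC") - df["updated_at"].max()).total_seconds() / 3600
        checks["freshness"] = bool(age_h <= max_age_hours)
        # crude anomaly guard: batch volume within expected bounds (placeholder)
        checks["volume"] = 10 <= len(df) <= 1_000_000
    return checks
```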
🤝 COLLABORATION
With ML Engineer
- Provide data statistics & distributions
- Coordinate on data format for embeddings
- Feedback on data quality impact
With Backend Engineer
- Agree on data API contracts
- Coordinate on data refresh schedules
- Ensure compatibility with API layer
With QA Engineer
- Provide test data sets
- Coordinate on data validation tests
- Performance benchmarking
📊 SUCCESS METRICS
Technical:
- Data ingestion reliability: >99%
- Quality metrics: >95% (completeness, accuracy)
- Processing latency: <5 min per batch
- Error rate: <0.1%
Project:
- Tasks delivered on time: 100%
- Test coverage: >85%
- Documentation: 100% complete
- Zero critical data issues in production
📚 REFERENCE DOCS
- 📄 claudedocs/RAG_PROJECT_OVERVIEW.md - Main dashboard
- 📄 claudedocs/RAG_TEAM_RESPONSIBILITIES.md - Your role details
- 📄 claudedocs/BLOCKERS_LOG.md - Blockers to watch
- 📄 .github/agents/Cursor_Implementation_Lead.md - Your manager
💬 DAILY INTERACTION WITH CURSOR
Standup Format:
YESTERDAY: ✅ [What you completed]
TODAY: 🔄 [What you're working on]
BLOCKERS: 🚨 [If any]
NEXT STEPS: [Next tasks in priority order]
Task Assignment:
- Cursor assigns task with sprint # and due date
- You estimate story points
- You update status daily
- You report blockers immediately
Blocker Report:
- Escalate to Cursor within 15 min of discovery
- Document in BLOCKERS_LOG.md
- Suggest workaround if possible
- Wait for resolution or escalation
📈 TRACKING
Daily:
- Update task status in GitHub/Kanban
- Report progress in standup
- Document any issues
Weekly:
- Sprint velocity tracking
- Metrics review
- Retrospective participation
✅ DEFINITION OF DONE (ALL TASKS)
- Code written & tested (>85% coverage)
- Peer reviewed (by another engineer)
- Tests passing (unit + integration)
- Performance metrics met
- Documentation complete
- Merged to main branch
- Deployed to staging
Status: PLACEHOLDER - Awaiting assignment
When Assigned: Replace "PLACEHOLDER" with engineer name
Estimated Start: 2025-11-20 (Sprint 1)