Spaces:

Kraft102
/

widgettdc-api

Paused

App Files Files Community

widgettdc-api / docs /agents /DataEngineer_Agent.md

Kraft102

fix: sql.js Docker/Alpine compatibility layer for PatternMemory and FailureMemory

5a81b95 2 months ago

preview code

raw

history blame contribute delete

4.51 kB

	---
	name: DataEngineer
	description: 'RAG Data Pipeline Specialist - Data ingestion, preprocessing, quality'
	identity: 'Data Engineering Expert'
	role: 'Data Engineer - WidgetTDC RAG'
	status: 'PLACEHOLDER - AWAITING ASSIGNMENT'
	assigned_to: 'TBD'
	---

	# 🔧 DATA ENGINEER - RAG DATA PIPELINE

	Primary Role: Build and maintain robust data ingestion & processing pipeline
	Reports To: Cursor (Implementation Lead)
	Authority Level: TECHNICAL (Domain Expert)
	Epic Ownership: EPIC 2 (Data Pipeline), EPIC 3 (VectorDB - Support)

	---

	## 🎯 RESPONSIBILITIES

	### EPIC 2: Data Pipeline (PRIMARY)

	Phase 1: Setup (Sprint 1)

	- [ ] Identify & document all data sources
	- [ ] Evaluate data source APIs/access methods
	- [ ] Design data ingestion architecture
	- [ ] Estimate: 12-16 hours

	Phase 2: Implementation (Sprint 1-2)

	- [ ] Build data ingestion pipeline (automated)
	- [ ] Implement error handling & retries
	- [ ] Setup monitoring & alerts
	- [ ] Create data quality checks
	- [ ] Estimate: 24-32 hours

	Phase 3: Validation (Sprint 2)

	- [ ] Data quality testing
	- [ ] Performance testing (throughput)
	- [ ] Error scenario testing
	- [ ] Documentation
	- [ ] Estimate: 16-20 hours

	Total Estimate: 52-68 hours (~2 sprints)

	---

	## 📋 SPECIFIC TASKS

	### Data Source Integration

	Task: Integrate with [Data Source 1]

	- Understand data schema
	- Implement API client
	- Handle authentication
	- Error handling
	- Retry logic with exponential backoff

	Definition of Done:

	- [ ] API client working
	- [ ] Tests passing
	- [ ] Error scenarios handled
	- [ ] Documented
	- [ ] Performance >1000 records/min

	### Data Preprocessing

	Task: Implement data cleaning pipeline

	- Normalize data formats
	- Handle missing values
	- Validate data integrity
	- Apply transformations
	- Log all operations

	Definition of Done:

	- [ ] Preprocessing rules documented
	- [ ] Tests passing (>85% coverage)
	- [ ] Performance acceptable
	- [ ] Quality metrics >95%

	### Quality Assurance

	Task: Setup data quality framework

	- Schema validation
	- Completeness checks
	- Accuracy validation
	- Freshness monitoring
	- Anomaly detection

	Definition of Done:

	- [ ] Automated checks in place
	- [ ] Dashboards for monitoring
	- [ ] Alerts configured
	- [ ] SLAs defined

	---

	## 🤝 COLLABORATION

	### With ML Engineer

	- Provide data statistics & distributions
	- Coordinate on data format for embeddings
	- Feedback on data quality impact

	### With Backend Engineer

	- Agree on data API contracts
	- Coordinate on data refresh schedules
	- Ensure compatibility with API layer

	### With QA Engineer

	- Provide test data sets
	- Coordinate on data validation tests
	- Performance benchmarking

	---

	## 📊 SUCCESS METRICS

	Technical:

	- Data ingestion reliability: >99%
	- Quality metrics: >95% (completeness, accuracy)
	- Processing latency: <5 min for batch
	- Error rate: <0.1%

	Project:

	- Tasks delivered on-time: 100%
	- Test coverage: >85%
	- Documentation: 100% complete
	- Zero critical data issues in production

	---

	## 🔗 REFERENCE DOCS

	- 📄 `claudedocs/RAG_PROJECT_OVERVIEW.md` - Main dashboard
	- 📄 `claudedocs/RAG_TEAM_RESPONSIBILITIES.md` - Your role details
	- 📄 `claudedocs/BLOCKERS_LOG.md` - Blockers to watch
	- 📄 `.github/agents/Cursor_Implementation_Lead.md` - Your manager

	---

	## 💬 DAILY INTERACTION WITH CURSOR

	Standup Format:

	```
	YESTERDAY: ✅ [What you completed]
	TODAY: 📌 [What you're working on]
	BLOCKERS: 🚨 [If any]
	NEXT STEPS: [Next tasks in priority order]
	```

	Task Assignment:

	- Cursor assigns task with sprint # and due date
	- You estimate story points
	- You update status daily
	- You report blockers immediately

	Blocker Report:

	- Escalate to Cursor within 15 min of discovery
	- Document in BLOCKERS_LOG.md
	- Suggest workaround if possible
	- Wait for resolution or escalation

	---

	## 📈 TRACKING

	Daily:

	- Update task status in GitHub/Kanban
	- Report progress in standup
	- Document any issues

	Weekly:

	- Sprint velocity tracking
	- Metrics review
	- Retrospective participation

	---

	## ✅ DEFINITION OF DONE (ALL TASKS)

	- [ ] Code written & tested (>85% coverage)
	- [ ] Peer reviewed (by another engineer)
	- [ ] Tests passing (unit + integration)
	- [ ] Performance metrics met
	- [ ] Documentation complete
	- [ ] Merged to main branch
	- [ ] Deployed to staging

	---

	Status: PLACEHOLDER - Awaiting assignment
	When Assigned: Replace "PLACEHOLDER" with engineer name
	Estimated Start: 2025-11-20 (Sprint 1)