- ai-agents
- data-engineering
- automation
- productivity
- multi-agent-systems
From Firefighting to Building: How AI Agents Restored Team Productivity
Original URL: https://engineering.grab.com/from-firefighting-to-building
Introduction
Grab's Analytics Data Warehouse (ADW) team was spending 40% of their time (approximately 2 days per week) on repetitive tasks like answering data definition questions, tracing data sources, running quality checks, and handling basic enhancement requests. This created a significant bottleneck in operations. To address this, the team implemented a multi-agent AI system that autonomously handles simpler questions and collaboratively addresses complex requests, reclaiming significant engineering bandwidth.
The Problem
The team was overwhelmed by "quick questions" that required:
- Answering the same questions about data definitions
- Tracing data sources and troubleshooting
- Running quality checks to verify data integrity
- Handling basic enhancement requests
These questions consumed nearly half of the team's bandwidth, diverting high-value engineering time from complex challenges to repetitive investigative work.
Solution Overview
The team deployed a multi-agent AI system using:
- FastAPI & LangGraph: For handling requests and managing complex state and cyclical logic
- Redis & PostgreSQL: For caching, real-time sessions, and storing conversation history
- Specialized platforms: Hubble (metadata management), Genchi (data quality observability), and Lighthouse (pipeline health monitoring)
Multi-Agent Architecture
The system routes requests through two main pathways:
Enhancement Pathway
For requests like adding new columns or changing aggregation logic:
1. User creates a JIRA request
2. Enhancement Agent analyzes requirements and gathers context
3. Agent creates a merge request (MR) with suggested code
4. Engineer reviews the MR
5. If valid, agent runs changes in test environment
6. Engineer reviews results and merges if tests pass
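The six steps above alternate between agent work and engineer sign-off. One way to picture that is as a small state machine in which two stages can only be advanced by a human. The sketch below is illustrative only: the `Stage` names, the `advance` function, and the `"agent"`/`"engineer"` actor strings are assumptions and do not come from the original system.

```python
from enum import Enum, auto


class Stage(Enum):
    JIRA_CREATED = auto()   # step 1: user files the request
    ANALYZED = auto()       # step 2: agent has gathered context
    MR_OPENED = auto()      # step 3: agent has proposed code
    MR_REVIEWED = auto()    # step 4: engineer accepted the MR
    TESTED = auto()         # step 5: agent ran the change in test
    MERGED = auto()         # step 6: engineer merged
    REJECTED = auto()


# Hypothetical rule: leaving these stages requires an engineer (steps 4 and 6).
HUMAN_GATES = {Stage.MR_OPENED, Stage.TESTED}

NEXT_STAGE = {
    Stage.JIRA_CREATED: Stage.ANALYZED,
    Stage.ANALYZED: Stage.MR_OPENED,
    Stage.MR_OPENED: Stage.MR_REVIEWED,
    Stage.MR_REVIEWED: Stage.TESTED,
    Stage.TESTED: Stage.MERGED,
}


def advance(stage: Stage, actor: str, approved: bool = True) -> Stage:
    """Move a request one step forward, enforcing the human review gates."""
    if stage not in NEXT_STAGE:
        raise ValueError(f"{stage.name} is a terminal stage")
    if stage in HUMAN_GATES:
        if actor != "engineer":
            raise PermissionError(f"only an engineer can advance past {stage.name}")
        if not approved:
            return Stage.REJECTED
    return NEXT_STAGE[stage]
```

In this sketch the agent can never merge on its own: calling `advance(Stage.TESTED, "agent")` raises `PermissionError`, which mirrors the requirement that an engineer reviews both the MR and the test results.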
Investigation Pathway
For questions about data anomalies:
1. Classifier: Parses questions, detects violations, and determines needed agents
2. Data Agent: Investigates data, validates schemas, and retrieves samples
3. Code Search Agent: Traces column transformations through the codebase
4. On-call Agent: Monitors production systems and checks for incidents
5. Summarizer: Combines responses into a coherent narrative
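The classify, fan-out, summarize flow above can be sketched in plain Python. This is a minimal illustration, not the team's LangGraph implementation: the agent functions return canned strings, and the keyword table in `ROUTING_KEYWORDS` is an assumed stand-in for whatever classification logic the real system uses.

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for the real agents; each returns a short finding.
def data_agent(question: str) -> str:
    return "data: schema validated, samples retrieved"


def code_search_agent(question: str) -> str:
    return "code: column transformation traced"


def oncall_agent(question: str) -> str:
    return "on-call: no open incidents"


AGENTS: Dict[str, Callable[[str], str]] = {
    "data": data_agent,
    "code_search": code_search_agent,
    "on_call": oncall_agent,
}

# Assumed keyword rules; a real classifier would likely be model-driven.
ROUTING_KEYWORDS = {
    "data": ("null", "duplicate", "row", "value", "schema"),
    "code_search": ("column", "derived", "calculated", "logic"),
    "on_call": ("late", "delay", "failed", "incident"),
}


def classify(question: str) -> List[str]:
    """Pick the agents whose keywords appear in the question."""
    q = question.lower()
    selected = [
        name for name, words in ROUTING_KEYWORDS.items()
        if any(word in q for word in words)
    ]
    return selected or ["data"]  # fall back to the data agent


def summarize(findings: List[str]) -> str:
    """Combine per-agent findings into one response."""
    return "; ".join(findings)


def investigate(question: str) -> str:
    findings = [AGENTS[name](question) for name in classify(question)]
    return summarize(findings)
```

A question such as "Why is the revenue column late today?" matches the code search and on-call keywords, so only those two agents run before the summarizer merges their findings.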
Challenges and Solutions
Context Management
- Challenge: Excessive context accumulation causing performance degradation
- Solution: Implemented intelligent context pruning using token tracking and RAG techniques
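To make the token-tracking half of this concrete, here is a minimal sketch of budget-based pruning. It covers only the token budget, not retrieval (RAG), and it rests on assumptions: messages are dicts with `role` and `content` keys, and `estimate_tokens` counts whitespace-separated words as a crude substitute for a real tokenizer.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def prune_context(messages: List[Message], budget: int) -> List[Message]:
    """Keep system messages plus the newest messages that fit in the budget."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    used = sum(estimate_tokens(m["content"]) for m in system)
    kept: List[Message] = []
    # Walk from newest to oldest and stop at the first message that overflows.
    for message in reversed(others):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost

    # Restore chronological order for the retained messages.
    return system + list(reversed(kept))
```

Dropping the oldest turns first is only one possible policy. Summarizing or retrieving older turns on demand would be closer to the RAG approach the team mentions.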
Tool Usage Optimization
- Challenge: Excessive tool usage degrading efficiency
- Solution: Streamlined tool descriptions and outputs to focus only on relevant information
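One simple way to keep tool outputs focused is to pass the agent only an allow-list of fields and to truncate long values. The helper below is a hypothetical sketch of that idea; the function name, the `keep` parameter, and the 200-character default are assumptions, not details from the original system.

```python
from typing import Any, Dict, Iterable


def compact_tool_output(
    raw: Dict[str, Any], keep: Iterable[str], max_chars: int = 200
) -> Dict[str, Any]:
    """Return only the allow-listed fields, truncating long string values."""
    compact: Dict[str, Any] = {}
    for key in keep:
        if key not in raw:
            continue
        value = raw[key]
        if isinstance(value, str) and len(value) > max_chars:
            value = value[:max_chars] + "..."
        compact[key] = value
    return compact
```

For a metadata lookup, for example, the agent might only need the table name and owner, so debug payloads and full DDL text never enter its context.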
Risk Mitigation
- Challenge: AI agents with database access pose security risks
- Solution: Implemented multiple safety layers:
  - Input classification for PII and out-of-scope queries
  - SQL validation before execution
  - Timeout protection for queries
  - Controls for code generation
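The first three layers can be sketched with the standard library alone. Everything here is an assumption made for illustration: the regular expressions are deliberately naive, SQLite stands in for the actual warehouse, and the function names are hypothetical. A keyword filter like this will also reject harmless queries that mention a blocked word in a string literal, so it is best read as a last line of defense alongside read-only credentials rather than as a complete validator.

```python
import re
import sqlite3
import time

# Assumed patterns; a production classifier would cover far more cases.
PII_PATTERN = re.compile(
    r"\b(passport|phone number|home address|email address)\b", re.I
)
FORBIDDEN_SQL = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant)\b", re.I
)


def classify_input(question: str) -> str:
    """Flag questions that appear to ask for personal data."""
    return "blocked_pii" if PII_PATTERN.search(question) else "allowed"


def validate_sql(sql: str) -> bool:
    """Accept a single read-only statement; reject everything else."""
    statement = sql.strip().rstrip(";")
    if ";" in statement:  # more than one statement
        return False
    if not re.match(r"(?i)(select|with)\b", statement):
        return False
    return FORBIDDEN_SQL.search(statement) is None


def run_with_timeout(conn: sqlite3.Connection, sql: str, seconds: float):
    """Run a validated query, aborting it once the deadline passes."""
    if not validate_sql(sql):
        raise ValueError("query rejected by validation")
    deadline = time.monotonic() + seconds
    # SQLite calls the handler every 1000 VM steps; a truthy return aborts.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 0)  # remove the handler
```

With SQLite, a query that exceeds the deadline is interrupted and surfaces as `sqlite3.OperationalError`. Other databases expose their own statement-timeout settings, which would replace the progress handler used here.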
Trust and Quality
- Challenge: Ensuring reliable responses while maintaining speed
- Solution: Implemented human-in-the-loop review with options to approve, reject, refine, re-route, or annotate responses
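The five reviewer options map naturally onto a small dispatch function. The sketch below is an illustration under assumptions: the `Draft` fields, the status strings, and the `apply_review` function are hypothetical and are not taken from the original system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ReviewAction(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    REFINE = "refine"
    REROUTE = "re-route"
    ANNOTATE = "annotate"


@dataclass
class Draft:
    answer: str
    agent: str
    status: str = "pending"
    notes: List[str] = field(default_factory=list)


def apply_review(
    draft: Draft, action: ReviewAction, detail: Optional[str] = None
) -> Draft:
    """Apply a reviewer's decision to an agent-generated draft."""
    if action is ReviewAction.APPROVE:
        draft.status = "sent"
    elif action is ReviewAction.REJECT:
        draft.status = "discarded"
    elif action is ReviewAction.REFINE:
        # Send the draft back to the same agent with reviewer feedback.
        draft.status = "needs_revision"
        if detail:
            draft.notes.append(detail)
    elif action is ReviewAction.REROUTE:
        # Hand the question to a different agent; detail names the target.
        if not detail:
            raise ValueError("re-route needs a target agent")
        draft.agent = detail
        draft.status = "pending"
    elif action is ReviewAction.ANNOTATE:
        # Attach a note without changing the draft's status.
        if detail:
            draft.notes.append(detail)
    return draft
```

Keeping reviewer notes on the draft gives later refinement passes something concrete to work from, which fits the continuous-improvement goal described in the takeaways.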
Results
The multi-agent system yielded transformative results:
- Automated resolution of the majority of standard user inquiries
- Order-of-magnitude reduction in issue resolution time
- Reclaimed several full-time equivalents of engineering bandwidth
- Shifted team focus from reactive support to proactive, high-value work
Key Takeaways
- Multi-Agent Architecture: Specialized agents outperform generalists by mastering specific domains
- Strategic Human Oversight: Building trust through transparency and continuous improvement
- Focus on Augmentation: Automating repetitive tasks while humans handle complex decisions
This case study demonstrates how AI agents can transform overwhelmed data engineering teams into focused, productive units that deliver higher value to their organizations.