
From Firefighting to Building: How AI Agents Restored Team Productivity

Original URL: https://engineering.grab.com/from-firefighting-to-building

Introduction

Grab's Analytics Data Warehouse (ADW) team was spending 40% of their time (approximately 2 days per week) on repetitive tasks like answering data definition questions, tracing data sources, running quality checks, and handling basic enhancement requests. This created a significant bottleneck in operations. To address this, the team implemented a multi-agent AI system that autonomously handles simpler questions and collaboratively addresses complex requests, reclaiming significant engineering bandwidth.

The Problem

The team was overwhelmed by "quick questions" that required:
- Answering the same questions about data definitions
- Tracing data sources and troubleshooting
- Running quality checks to verify data integrity
- Handling basic enhancement requests

These questions consumed roughly 40% of the team's bandwidth, diverting high-value engineering time from complex challenges to repetitive investigative work.

Solution Overview

The team deployed a multi-agent AI system using:
- FastAPI & LangGraph: For handling requests and managing complex state and cyclical logic
- Redis & PostgreSQL: For caching, real-time sessions, and storing conversation history
- Specialized platforms: Hubble (metadata management), Genchi (data quality observability), and Lighthouse (pipeline health monitoring)

Multi-Agent Architecture

The system routes requests through two main pathways:

Enhancement Pathway

For requests like adding new columns or changing aggregation logic:
1. User creates a JIRA request
2. Enhancement Agent analyzes requirements and gathers context
3. Agent creates a merge request (MR) with suggested code
4. Engineer reviews the MR
5. If valid, agent runs changes in test environment
6. Engineer reviews results and merges if tests pass
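The enhancement steps above can be read as a small state machine with two points where an engineer has to decide. The following is a minimal sketch under that reading, not the team's implementation; the names `Stage`, `HUMAN_GATES`, and `EnhancementRequest` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    REQUESTED = auto()    # 1. JIRA request created
    ANALYZED = auto()     # 2. agent analyzed requirements and gathered context
    MR_OPENED = auto()    # 3. agent opened a merge request
    MR_APPROVED = auto()  # 4. engineer judged the MR valid
    TESTED = auto()       # 5. agent ran the change in the test environment
    MERGED = auto()       # 6. engineer merged after reviewing results
    REJECTED = auto()     # engineer declined at one of the gates


# Stages that cannot be left without an explicit engineer decision (assumption).
HUMAN_GATES = {Stage.MR_OPENED, Stage.TESTED}

NEXT = {
    Stage.REQUESTED: Stage.ANALYZED,
    Stage.ANALYZED: Stage.MR_OPENED,
    Stage.MR_OPENED: Stage.MR_APPROVED,
    Stage.MR_APPROVED: Stage.TESTED,
    Stage.TESTED: Stage.MERGED,
}


@dataclass
class EnhancementRequest:
    ticket: str
    stage: Stage = Stage.REQUESTED
    history: list = field(default_factory=list)

    def advance(self, approved_by_engineer=None):
        """Move to the next stage; gated stages need an engineer's yes or no."""
        if self.stage not in NEXT:
            raise ValueError(f"{self.stage.name} is a terminal stage")
        if self.stage in HUMAN_GATES and approved_by_engineer is None:
            raise PermissionError("an engineer decision is required here")
        self.history.append(self.stage)
        if self.stage in HUMAN_GATES and not approved_by_engineer:
            self.stage = Stage.REJECTED
        else:
            self.stage = NEXT[self.stage]
        return self.stage
```

Keeping the gates in one set makes it easy to see that the agent can never merge on its own: both the MR review and the final merge require a human decision.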

Investigation Pathway

For questions about data anomalies:
1. Classifier: Parses questions, detects violations, and determines needed agents
2. Data Agent: Investigates data, validates schemas, and retrieves samples
3. Code Search Agent: Traces column transformations through the codebase
4. On-call Agent: Monitors production systems and checks for incidents
5. Summarizer: Combines responses into a coherent narrative
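The investigation flow above follows a classify, fan out, summarize shape. The sketch below illustrates that shape in plain Python rather than LangGraph so it runs without dependencies. The keyword-based classifier and the canned agent replies are stand-ins invented for this example; the article does not describe how the real classifier or agents work internally.

```python
import re
from typing import Callable, Dict, List


# Stub agents: each returns a canned finding. Real agents would query
# data platforms, the codebase, and monitoring systems.
def data_agent(question: str) -> str:
    return "data: schema validated, sample rows retrieved"


def code_search_agent(question: str) -> str:
    return "code: column traced to its transformation"


def on_call_agent(question: str) -> str:
    return "on-call: no open incidents found"


AGENTS: Dict[str, Callable[[str], str]] = {
    "data": data_agent,
    "code_search": code_search_agent,
    "on_call": on_call_agent,
}

# Hypothetical routing keywords; a production classifier would likely be an LLM.
KEYWORDS = {
    "data": {"null", "missing", "duplicate", "spike", "rows"},
    "code_search": {"column", "derived", "calculated", "logic"},
    "on_call": {"delayed", "late", "failed", "incident"},
}


def classify(question: str) -> List[str]:
    """Pick the agents whose keywords appear as whole words in the question."""
    words = set(re.findall(r"[a-z_]+", question.lower()))
    selected = [name for name, keys in KEYWORDS.items() if words & keys]
    return selected or ["data"]  # fall back to the data agent


def summarize(findings: List[str]) -> str:
    """Combine per-agent findings into one response."""
    return "; ".join(findings)


def investigate(question: str) -> str:
    findings = [AGENTS[name](question) for name in classify(question)]
    return summarize(findings)
```

For example, `investigate("Why is the revenue column late today?")` would consult the code search and on-call stubs and skip the data stub, since only their keywords match.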

Challenges and Solutions

Context Management

Challenge: Excessive context accumulation causing performance degradation
Solution: Implemented intelligent context pruning using token tracking and RAG techniques
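One simple form of token-tracked pruning is to keep the system prompt and then as many of the most recent messages as fit within a budget. The sketch below shows only that idea and leaves out the RAG side entirely. The four-characters-per-token estimate and the `prune_context` name are assumptions for illustration; a real system would use the model's own tokenizer.

```python
from typing import Dict, List

Message = Dict[str, str]  # {"role": ..., "content": ...}


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about four characters per token."""
    return max(1, len(text) // 4)


def prune_context(messages: List[Message], budget: int) -> List[Message]:
    """Keep system messages, then the newest messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    used = sum(estimate_tokens(m["content"]) for m in system)
    kept: List[Message] = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(message)
        used += cost

    return system + kept[::-1]  # restore chronological order
```

Dropping from the oldest end keeps the most recent exchange intact, which is usually what the next agent step depends on.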

Tool Usage Optimization

Challenge: Excessive tool usage degrading efficiency
Solution: Streamlined tool descriptions and outputs to focus only on relevant information
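As a rough illustration of trimming tool outputs, the sketch below keeps only the fields an agent asked for and caps the number of rows, while reporting how many rows were left out. The function name and the output shape are hypothetical; the article does not specify how the team's tools format their results.

```python
from typing import Any, Dict, List


def compact_tool_output(
    rows: List[Dict[str, Any]], keep_fields: List[str], max_rows: int = 5
) -> Dict[str, Any]:
    """Return a reduced view of a tool result for inclusion in a prompt."""
    trimmed = [
        {key: row[key] for key in keep_fields if key in row}
        for row in rows[:max_rows]
    ]
    return {"rows": trimmed, "omitted_rows": max(0, len(rows) - max_rows)}
```

Reporting the omitted count lets the agent know the result was truncated, so it can ask for more only when it actually needs to.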

Risk Mitigation

Challenge: AI agents with database access pose security risks
Solution: Implemented multiple safety layers:
- Input classification for PII and out-of-scope queries
- SQL validation before execution
- Timeout protection for queries
- Controls for code generation
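Two of these layers, SQL validation and query timeouts, can be sketched with the standard library alone. The example below uses SQLite purely so it is runnable; the article does not say which database engine or validation rules the team uses. The keyword deny-list is a deliberately conservative assumption (it would also reject a harmless query whose string literal contains a word like "update"), and `validate_sql` and `run_with_timeout` are hypothetical names.

```python
import re
import sqlite3
import time

# Assumed deny-list of write and DDL keywords.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|attach|pragma)\b",
    re.IGNORECASE,
)


def validate_sql(sql: str) -> str:
    """Allow a single read-only statement; raise ValueError otherwise."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("only a single statement is allowed")
    if not re.match(r"(select|with)\b", statement, re.IGNORECASE):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(statement):
        raise ValueError("statement contains a forbidden keyword")
    return statement


def run_with_timeout(conn: sqlite3.Connection, sql: str, timeout_s: float = 2.0):
    """Validate, then execute, aborting the query once the deadline passes."""
    statement = validate_sql(sql)
    deadline = time.monotonic() + timeout_s
    timed_out = False

    def check_deadline() -> int:
        nonlocal timed_out
        if time.monotonic() > deadline:
            timed_out = True
            return 1  # a non-zero return tells SQLite to abort the query
        return 0

    # SQLite calls the handler every 1000 virtual machine instructions.
    conn.set_progress_handler(check_deadline, 1000)
    try:
        return conn.execute(statement).fetchall()
    except sqlite3.DatabaseError:
        if timed_out:
            raise TimeoutError(f"query exceeded {timeout_s}s") from None
        raise
    finally:
        conn.set_progress_handler(None, 1000)  # remove the handler
```

Validation runs before any connection work, so a rejected statement never reaches the database. A warehouse engine would typically enforce the timeout through its own session or statement settings rather than a client-side hook.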

Trust and Quality

Challenge: Ensuring reliable responses while maintaining speed
Solution: Implemented human-in-the-loop review with options to approve, reject, refine, re-route, or annotate responses
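The five review options can be modeled as an enum applied to a draft response. The sketch below is one possible interpretation: the article names the options but not their exact effects, so the status values and what each action does to the draft are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ReviewAction(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    REFINE = "refine"
    REROUTE = "re-route"
    ANNOTATE = "annotate"


@dataclass
class DraftResponse:
    text: str
    agent: str
    status: str = "pending_review"
    notes: List[str] = field(default_factory=list)


def apply_review(
    draft: DraftResponse, action: ReviewAction, detail: str = ""
) -> DraftResponse:
    """Apply a reviewer's decision; `detail` carries feedback or a target agent."""
    if action is ReviewAction.APPROVE:
        draft.status = "sent"
    elif action is ReviewAction.REJECT:
        draft.status = "discarded"
    elif action is ReviewAction.REFINE:
        # Assumed: feedback goes back to the same agent for another pass.
        draft.notes.append(f"refine: {detail}")
        draft.status = "needs_revision"
    elif action is ReviewAction.REROUTE:
        # Assumed: the question is handed to a different agent.
        draft.agent = detail
        draft.status = "needs_revision"
    elif action is ReviewAction.ANNOTATE:
        # Assumed: a note is attached without changing the review status.
        draft.notes.append(detail)
    return draft
```

Recording refine feedback and annotations on the draft itself gives a trail that can later be used to evaluate and improve the agents.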

Results

The multi-agent system yielded transformative results:
- Automated resolution of the majority of standard user inquiries
- Order-of-magnitude reduction in issue resolution time
- Reclaimed several full-time equivalents of engineering bandwidth
- Shifted team focus from reactive support to proactive, high-value work

Key Takeaways

Multi-Agent Architecture: Specialized agents outperform generalists by mastering specific domains
Strategic Human Oversight: Building trust through transparency and continuous improvement
Focus on Augmentation: Automating repetitive tasks while humans handle complex decisions

This case study demonstrates how AI agents can transform overwhelmed data engineering teams into focused, productive units that deliver higher value to their organizations.