Skip to content
  • ai-agents
  • data-engineering
  • automation
  • productivity
  • multi-agent-systems

From Firefighting to Building: How AI Agents Restored Team Productivity

Original URL: https://engineering.grab.com/from-firefighting-to-building

Introduction

Grab's Analytics Data Warehouse (ADW) team was spending 40% of their time (approximately 2 days per week) on repetitive tasks like answering data definition questions, tracing data sources, running quality checks, and handling basic enhancement requests. This created a significant bottleneck in operations. To address this, the team implemented a multi-agent AI system that autonomously handles simpler questions and collaboratively addresses complex requests, reclaiming significant engineering bandwidth.

The Problem

The team was overwhelmed by "quick questions" that required: - Answering the same questions about data definitions - Tracing data sources and troubleshooting - Running quality checks to verify data integrity - Handling basic enhancement requests

These questions consumed nearly half of the team's bandwidth, diverting high-value engineering time from complex challenges to repetitive investigative work.

Solution Overview

The team deployed a multi-agent AI system using: - FastAPI & LangGraph: For handling requests and managing complex state and cyclical logic - Redis & PostgreSQL: For caching, real-time sessions, and storing conversation history - Specialized platforms: Hubble (metadata management), Genchi (data quality observability), and Lighthouse (pipeline health monitoring)

Multi-Agent Architecture

The system routes requests through two main pathways:

Enhancement Pathway

For requests like adding new columns or changing aggregation logic: 1. User creates a JIRA request 2. Enhancement Agent analyzes requirements and gathers context 3. Agent creates a merge request with suggested code 4. Engineer reviews the MR 5. If valid, agent runs changes in test environment 6. Engineer reviews results and merges if tests pass

Investigation Pathway

For questions about data anomalies: 1. Classifier: Parses questions, detects violations, and determines needed agents 2. Data Agent: Investigates data, validates schemas, and retrieves samples 3. Code Search Agent: Traces column transformations through the codebase 4. On-call Agent: Monitors production systems and checks for incidents 5. Summarizer: Combines responses into a coherent narrative

Challenges and Solutions

Context Management

  • Challenge: Excessive context accumulation causing performance degradation
  • Solution: Implemented intelligent context pruning using token tracking and RAG techniques

Tool Usage Optimization

  • Challenge: Excessive tool usage degrading efficiency
  • Solution: Streamlined tool descriptions and outputs to focus only on relevant information

Risk Mitigation

  • Challenge: AI agents with database access pose security risks
  • Solution: Implemented multiple safety layers:
  • Input classification for PII and out-of-scope queries
  • SQL validation before execution
  • Timeout protection for queries
  • Controls for code generation

Trust and Quality

  • Challenge: Ensuring reliable responses while maintaining speed
  • Solution: Implemented human-in-the-loop review with options to approve, reject, refine, re-route, or annotate responses

Results

The multi-agent system yielded transformative results: - Automated resolution of the majority of standard user inquiries - Order-of-magnitude reduction in issue resolution time - Reclaimed several full-time equivalents of engineering bandwidth - Shifted team focus from reactive support to proactive, high-value work

Key Takeaways

  1. Multi-Agent Architecture: Specialized agents outperform generalists by mastering specific domains
  2. Strategic Human Oversight: Building trust through transparency and continuous improvement
  3. Focus on Augmentation: Automating repetitive tasks while humans handle complex decisions

This case study demonstrates how AI agents can transform overwhelmed data engineering teams into focused, productive units that deliver higher value to their organizations.