Conversation

@saahil-mehta commented Oct 21, 2025

Add Monitoring & Alerting for Agent Deployments

Overview

Relates to #142. This PR adds monitoring and alerting infrastructure to the agent-starter-pack, giving users production-grade observability out of the box. The implementation is platform-aware, agent-aware, and fully configurable through user prompts during project creation.

What's New

User-Facing Features

1. Optional Email Alert Notifications

  • Interactive prompt during agent-starter-pack create asks for an email address
  • If provided, alerts are delivered via email + Cloud Console
  • If skipped, alerts are console-only (no email noise for dev environments)
  • The user sees a clear confirmation of their choice

2. Configurable Alert Thresholds
All thresholds are exposed as Terraform variables with sensible defaults:

  • Latency alerts: P95 threshold (default: 3000ms)
  • Error rate alerts: Error count per 5-min window (default: 10 errors)
  • Retriever latency alerts (Agentic RAG only): P99 threshold (default: 10000ms)
  • Agent error rate (Cloud Run): Errors per second (default: 0.5/sec)

Users can customise these in deployment/terraform/dev/vars/env.tfvars after project creation, as in the sketch below.
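
For illustration, the threshold entries in env.tfvars might look like the following. The variable names here are hypothetical (check variables.tf for the exact ones); only the default values above are confirmed:

  # Hypothetical variable names -- consult variables.tf for the real ones.
  latency_threshold_ms           = 3000   # P95 request latency (ms)
  error_count_threshold          = 10     # errors per 5-minute window
  retriever_latency_threshold_ms = 10000  # P99 retriever latency (Agentic RAG)
  agent_error_rate_threshold     = 0.5    # errors/sec (Cloud Run)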

Infrastructure Added

Universal Log-Based Metrics (All Agents, All Platforms)

  1. Agent Operation Count: Tracks all agent operations with operation type labels
  2. Agent Error Count by Category: Categorised errors (LLM_FAILURE, TOOL_FAILURE, RETRIEVER_FAILURE, etc.)
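
As a rough sketch, the categorised error counter could be defined as a log-based metric along these lines; the filter and label names are assumptions, not the exact contents of monitoring.tf:

  resource "google_logging_metric" "agent_error_count" {
    name   = "agent_error_count"
    # Assumed convention: structured logs carry an error_category field.
    filter = "severity>=ERROR AND jsonPayload.error_category:*"

    metric_descriptor {
      metric_kind = "DELTA"
      value_type  = "INT64"
      labels {
        key         = "error_category"
        value_type  = "STRING"
        description = "e.g. LLM_FAILURE, TOOL_FAILURE, RETRIEVER_FAILURE"
      }
    }

    label_extractors = {
      "error_category" = "EXTRACT(jsonPayload.error_category)"
    }
  }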

Agentic RAG-Specific Metrics

  1. Retriever Latency Distribution: P50/P95/P99 retrieval performance with histogram buckets
  2. Document Count Distribution: Number of documents retrieved per call
  3. Retriever Latency Alert: Fires when P99 > threshold (default 10s)
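
A latency distribution that supports P50/P95/P99 queries could be modelled roughly as follows; the log field name and bucket layout are illustrative:

  resource "google_logging_metric" "agent_retriever_latency" {
    name            = "agent_retriever_latency"
    filter          = "jsonPayload.retriever_latency_ms:*"  # assumed log field
    value_extractor = "EXTRACT(jsonPayload.retriever_latency_ms)"

    metric_descriptor {
      metric_kind = "DELTA"
      value_type  = "DISTRIBUTION"
      unit        = "ms"
    }

    # Exponential buckets spanning 1 ms to roughly 18 hours.
    bucket_options {
      exponential_buckets {
        num_finite_buckets = 26
        growth_factor      = 2
        scale              = 1
      }
    }
  }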

Agent Engine (Reasoning Engine) Platform

  • Latency Alert: P95 request latency monitoring using native platform metrics
  • Error Rate Alert: Fires when log-based error count exceeds threshold in 5-min window
  • Dashboard: five to seven charts depending on agent type, including:
    • Request count (requests/sec)
    • Request latency (P50/P95/P99)
    • CPU allocation
    • Memory allocation
    • Agent errors by category
    • Retriever latency (Agentic RAG only)
    • Documents retrieved per call (Agentic RAG only)

Cloud Run Platform

  • Latency Alert: P95 request latency using Cloud Run native metrics
  • 5xx Error Rate Alert: Monitors 5xx response codes
  • Agent Error Alert: Log-based agent errors with rate threshold

Technical Details

Terraform Structure

  • New file: deployment/terraform/dev/monitoring.tf (757 lines)
  • New file: deployment/terraform/monitoring.tf (prod equivalent)
  • Modified: deployment/terraform/dev/variables.tf (added 4 monitoring variables)
  • Modified: deployment/terraform/dev/vars/env.tfvars (added default threshold values)
  • Modified: deployment/terraform/dev/apis.tf (added monitoring.googleapis.com)

Python CLI Integration

  • agent_starter_pack/cli/commands/create.py: Added interactive email prompt
  • agent_starter_pack/cli/utils/template.py: Threaded alert_notification_email through template processing
  • tests/cli/commands/test_create.py: Updated test mocks to handle new prompt

Smart Templating

  • Uses Jinja2 conditionals to render appropriate resources based on:
    • cookiecutter.deployment_target (agent_engine vs cloud_run)
    • cookiecutter.agent_name (agentic_rag gets extra retriever metrics)
  • Notification channel only created if email provided (using Terraform count)
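
A condensed sketch of the template logic (resource bodies elided; apart from alert_notification_email, which matches the CLI change above, the names are illustrative):

  {% if cookiecutter.deployment_target == "cloud_run" %}
  resource "google_monitoring_alert_policy" "cloud_run_5xx" {
    # ... Cloud Run-only 5xx alert ...
  }
  {% endif %}

  # Notification channel exists only when an email was supplied.
  resource "google_monitoring_notification_channel" "email" {
    count        = var.alert_notification_email != "" ? 1 : 0
    display_name = "Agent alert email"
    type         = "email"
    labels = {
      email_address = var.alert_notification_email
    }
  }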

Metric Design Decisions

Why log-based metrics for agent telemetry?

  • Platform-agnostic: Works on both Agent Engine and Cloud Run
  • Flexible: Can extract any JSON payload attribute from structured logs
  • Extensible: Users can add custom agent metrics by logging with the right labels
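
Concretely, any structured log entry whose jsonPayload carries the expected fields feeds these metrics; for example (field names match the assumed filters above):

  {"severity": "ERROR", "error_category": "TOOL_FAILURE", "message": "search tool timed out"}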

Why native metrics for platform SLOs?

  • Accuracy: Platform-provided metrics are the source of truth
  • Performance: No additional overhead from log processing
  • Consistency: Aligns with Google Cloud best practices

Alert auto-close: 30 minutes

  • Prevents alert fatigue from transient issues
  • Long enough to investigate without losing context
  • Configurable via alert_strategy.auto_close if users want different behaviour (see the sketch below)
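
Roughly, the setting sits on each alert policy like this; everything except alert_strategy is illustrative, and the threshold variable name is the hypothetical one from above:

  resource "google_monitoring_alert_policy" "agent_error_rate" {
    display_name = "Agent Error Rate"
    combiner     = "OR"

    conditions {
      display_name = "Error count above threshold"
      condition_threshold {
        filter          = "metric.type=\"logging.googleapis.com/user/agent_error_count\""
        comparison      = "COMPARISON_GT"
        threshold_value = var.error_count_threshold  # hypothetical name
        duration        = "300s"                     # the 5-minute window
        aggregations {
          alignment_period   = "300s"
          per_series_aligner = "ALIGN_SUM"
        }
      }
    }

    alert_strategy {
      auto_close = "1800s"  # 30 minutes
    }
  }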

Test Coverage

All tests pass:

  • ✅ 95/95 CLI tests (including new email prompt flow)
  • ✅ Ruff linting
  • ✅ Mypy type checking
  • ✅ Import ordering fixed

Migration Notes

Existing Projects

  • This is a template-only change; existing deployed agents are unaffected
  • Users can retrofit monitoring by:
    1. Copying the new monitoring.tf files
    2. Adding the monitoring variables
    3. Running terraform apply

New Projects

  • Zero additional effort required
  • Users just need to answer the email prompt during creation
  • Monitoring deploys automatically with the agent infrastructure

Example Usage

$ agent-starter-pack create my-agent

# ... after other prompts ...

Monitoring & Alerting Setup
Configure email notifications for production alerts (optional).
Email for alert notifications: you@example.com
✓ Alerts will be sent to: you@example.com

# Or skip it:
Email for alert notifications:
⚠ Email notifications disabled. Alerts will only appear in Cloud Console.

After deployment, users get:

  • Real-time dashboards in Cloud Monitoring
  • Automatic alerts when thresholds are breached
  • Structured logs for debugging with labels.service_name filtering
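
For example, a Logs Explorer filter along these lines (service name illustrative) narrows the view to one agent's errors:

  labels.service_name="my-agent" AND severity>=ERROR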

Related Documentation

The monitoring infrastructure automatically creates:

  • Cloud Monitoring dashboard (named "Reasoning Engine - {project_name}")
  • Alert policies with descriptive documentation
  • Notification channel (email) if configured

Users can find their dashboard at:

https://console.cloud.google.com/monitoring/dashboards

Checklist

  • Added user-facing email prompt with clear feedback
  • Created comprehensive monitoring.tf for both dev and prod
  • Added configurable threshold variables with sensible defaults
  • Platform-specific alerts (Agent Engine vs Cloud Run)
  • Agent-specific metrics (Agentic RAG retriever monitoring)
  • Updated CLI tests with new prompt flow
  • All linting and type checking passes
  • Tested with both email provided and skipped scenarios

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @saahil-mehta, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the observability of agents deployed via the starter pack by integrating robust monitoring and alerting capabilities. It provides a foundational Terraform setup for tracking agent performance, identifying errors, and receiving timely notifications, ensuring better operational insights and reliability for deployed agents.

Highlights

  • Monitoring Infrastructure: Introduced new Terraform files (monitoring.tf) to establish comprehensive monitoring and alerting for deployed agents, including log-based metrics, alert policies, and a pre-configured Google Cloud Monitoring dashboard.
  • Service Enablement: Enabled the monitoring.googleapis.com service in apis.tf files to support the new monitoring features.
  • Configurable Alerts: Added new Terraform variables (variables.tf, vars/env.tfvars) to allow users to configure alert notification emails and customize thresholds for latency and error rates.
  • CLI Integration: Updated the create command in the CLI to interactively prompt users for an alert notification email during agent setup, streamlining the configuration process.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive monitoring and alerting solution for the agent starter pack. It adds log-based metrics, alert policies for key performance indicators like latency and error rates, and a pre-configured monitoring dashboard. The changes are applied to both dev and prod/staging environments. The CLI has also been updated to allow users to configure an email address for alert notifications.

The implementation is solid, but there is a significant amount of code duplication between the Terraform configurations for the different environments. I have added a comment suggesting a refactoring into a reusable Terraform module to improve maintainability. Other than that, the changes are well-executed and a valuable addition to the project.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive and well-structured monitoring and alerting capability for the agent starter pack. The changes are extensive, covering Terraform infrastructure for metrics, alerts, and dashboards, as well as updates to the Python CLI to support configuration during project creation. The implementation is robust, leveraging a smart combination of native platform metrics and flexible log-based metrics, with thoughtful considerations for different deployment targets and agent types. The code is of high quality, and the feature is a valuable addition. I have one suggestion to improve the robustness of the Terraform dependencies.

…etrics

Add conditional depends_on entries for agent_retriever_latency and agent_retriever_document_count metrics when agent_name is agentic_rag. This prevents potential race conditions during terraform apply where the dashboard could be created before the log-based metrics exist. The dashboard references these metrics by name in filter strings, so Terraform cannot automatically detect the dependency. Explicit depends_on ensures proper resource creation order.
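
A sketch of the suggested fix; the dashboard resource name and JSON source are illustrative, while the metric resource names follow the ones mentioned above:

  resource "google_monitoring_dashboard" "agent_dashboard" {
    dashboard_json = file("${path.module}/dashboard.json")  # illustrative

    depends_on = [
      google_logging_metric.agent_operation_count,
      google_logging_metric.agent_error_count,
      {% if cookiecutter.agent_name == "agentic_rag" %}
      google_logging_metric.agent_retriever_latency,
      google_logging_metric.agent_retriever_document_count,
      {% endif %}
    ]
  }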
@saahil-mehta (Author) commented

Prompt for monitoring email addition:
[screenshot: email-prompt]

Passing tests:
[screenshot: make-res]

@eliasecchig @allen-stephen

@allen-stephen (Collaborator) commented

/gcbrun
