The AI SRE Blueprint: Securely Automating Incident Response on Azure

May 9, 2026

It’s 2:47am. Your AKS node pool has exhausted its memory. Azure Monitor fires an alert. Your phone screams. You fumble for your laptop, connect to the cluster, run kubectl top pods, identify the offending deployment, and scale it down. Eleven minutes of groggy, reactive work for a problem the system could have diagnosed and fixed in 90 seconds.

The agent to do that job isn’t hard to build. Building it so that a compromised agent, a hallucinated action, or a reasoning error cannot take down the rest of production—that’s the actual engineering challenge. In 2026, we’re moving from chatbots to Agentic SREs—non-human identities capable of executing Azure Resource Manager calls and KQL queries at 3am without waking anyone. Security for these agents isn’t a footnote. It’s the primary design constraint.

This article walks through a “Zero Trust for AI” (ZT4AI) framework using Azure Durable Functions, Privileged Identity Management (PIM), and Teams Adaptive Cards—an architecture where the agent has just enough access to understand what went wrong, and needs your explicit sign-off to fix it.

AI SRE Orchestration Workflow

[Workflow diagram: (1) an Azure Monitor alert fires a webhook; (2–4) the Durable Functions SRE agent investigates in a ReAct loop as the read-only Observer, querying logs via KQL; (5) a Teams HITL Adaptive Card pauses execution for [APPROVE]; (6–8) PIM elevates the Remediator, the approved fix executes, and every step is written to the immutable log.]

1. The AI SRE Dual Identity Model

The most common security failure in agentic DevOps is granting an AI agent persistent Contributor access. That makes the agent’s Managed Identity the highest-value target in your subscription. Instead, use a Separation of Concerns model with two distinct identities.

Observer vs. Remediator

Identity   | Persistence | RBAC Roles                              | Primary Function
Observer   | Always-On   | Monitoring Reader, Log Analytics Reader | Diagnostic investigation, reading metrics/logs.
Remediator | JIT Only    | AKS Cluster Admin, Contributor          | Executing approved changes (scaling, restarts).

The agent runtime always operates as the Observer. It has enough permission to understand why an incident is happening but no permission to change anything. The Remediator identity only activates via Privileged Identity Management (PIM) for a 15-minute window after a human has approved a specific proposed action.

The defaults are designed for a demo. You’re not running a demo.

Prerequisite: The Remediator identity must have an Eligible assignment for the required roles at the target scope. The agent cannot “grant” itself a role; it can only “activate” an existing eligibility.
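As a rough sketch of how this split shows up in code, assuming two user-assigned managed identities attached to the Function App (the client IDs and helper names below are illustrative):

# Dual-identity credential split (sketch; client IDs are illustrative)
from azure.identity import ManagedIdentityCredential
from azure.monitor.query import LogsQueryClient

OBSERVER_CLIENT_ID = "<observer-uami-client-id>"
REMEDIATOR_CLIENT_ID = "<remediator-uami-client-id>"

# Observer: always-on, read-only (Monitoring Reader, Log Analytics Reader)
observer_credential = ManagedIdentityCredential(client_id=OBSERVER_CLIENT_ID)
logs_client = LogsQueryClient(observer_credential)  # every diagnostic tool uses this

# Remediator: constructed only inside an approved remediation step,
# after PIM has activated its eligible role (see section 4)
def get_remediator_credential() -> ManagedIdentityCredential:
    return ManagedIdentityCredential(client_id=REMEDIATOR_CLIENT_ID)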

2. Stateful Orchestration with Durable Functions

Incident response is inherently stateful. An agent must investigate, propose, wait for a human, and then execute. Azure Durable Functions (running in a VNet-integrated Function App) is the ideal runtime for this “Human Interaction” pattern.

The ReAct Reasoning Loop

The agent uses a ReAct (Reasoning + Acting) pattern. Given an alert, it iterates through a cycle: Thought -> Tool Call (e.g., get_top_pods_by_memory) -> Observation -> Repeat.

# Durable Functions Orchestrator: The Reasoning Loop
import azure.durable_functions as df

MAX_ITERATIONS = 10  # hard cap on reasoning steps

def orchestrator_function(context: df.DurableOrchestrationContext):
    alert_context = context.get_input()
    reasoning_history = []

    for _ in range(MAX_ITERATIONS):
        # 1. Ask the LLM for the next action
        proposal = yield context.call_activity("GetLlmProposal", {
            "history": reasoning_history,
            "alert": alert_context
        })

        if proposal["type"] == "TOOL_CALL":
            # 2. Execute a SAFE read-only tool (Observer identity)
            result = yield context.call_activity(proposal["tool_name"], proposal["args"])
            reasoning_history.append({"tool": proposal["tool_name"], "result": result})
        elif proposal["type"] == "REMEDIATION":
            # 3. Transition to Human-in-the-Loop
            return (yield context.call_sub_orchestrator("HumanApprovalWorkflow", proposal))

    # Reasoning budget exhausted: fail safe to No-Action
    return {"outcome": "NO_ACTION", "reason": "MAX_ITERATIONS reached"}

main = df.Orchestrator.create(orchestrator_function)

3. The Human-in-the-Loop Gate

For production workloads, autonomous action is a risk. We implement a hard gate using Teams Adaptive Cards. The orchestrator sends a card to the on-call engineer’s Teams channel and pauses execution using wait_for_external_event.

HITL Approval Flow via Teams

[Flow: (1) the agent proposes an action; (2) a Logic App/Bot sends the Adaptive Card; (3) the on-call engineer taps [APPROVE]; (4) the AI resumes orchestration.]

If the 15-minute timer expires before approval, the agent fails safe to “No-Action” and opens a manual P1 incident. An agent must never act autonomously on timeout.
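Here is a minimal sketch of that approval gate as a Durable Functions sub-orchestrator, using the documented timer/external-event race from the human-interaction pattern. The SendApprovalCard and OpenP1Incident activities, the ApprovalResponse event name, and the ExecuteRemediation sub-orchestrator are illustrative assumptions:

# Human approval gate (sketch): race the Teams response against a deadline
from datetime import timedelta
import azure.durable_functions as df

def human_approval_workflow(context: df.DurableOrchestrationContext):
    proposal = context.get_input()
    yield context.call_activity("SendApprovalCard", proposal)

    # Race the human response against a 15-minute durable timer
    deadline = context.current_utc_datetime + timedelta(minutes=15)
    timeout_task = context.create_timer(deadline)
    approval_task = context.wait_for_external_event("ApprovalResponse")

    winner = yield context.task_any([approval_task, timeout_task])
    if winner == approval_task:
        timeout_task.cancel()  # release the timer once a human has responded
        if approval_task.result == "APPROVE":
            return (yield context.call_sub_orchestrator("ExecuteRemediation", proposal))

    # Timeout or explicit rejection: fail safe and escalate to a human
    yield context.call_activity("OpenP1Incident", proposal)
    return {"outcome": "NO_ACTION"}

main = df.Orchestrator.create(human_approval_workflow)  # registered in its own function folder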

4. JIT Elevation via Privileged Identity Management (PIM)

Once an action is approved, the orchestrator elevates the Remediator identity using the Azure Resource Manager REST API to programmatically trigger a PIM activation.

# Activating a PIM role via the ARM REST API (Simplified)
$body = @{
    properties = @{
        principalId      = $RemediatorPrincipalId
        roleDefinitionId = "$Scope/providers/Microsoft.Authorization/roleDefinitions/$ContributorId"
        requestType      = "SelfActivate"
        justification    = "Incident Remediation: $IncidentId"
        scheduleInfo     = @{
            expiration = @{ type = "AfterDuration"; duration = "PT15M" }  # match the 15-minute window
        }
    }
}
$Uri = "https://management.azure.com/$Scope/providers/Microsoft.Authorization" +
       "/roleAssignmentScheduleRequests/$($RequestGuid)?api-version=2020-10-01"
Invoke-RestMethod -Method Put -Uri $Uri `
    -Headers @{ Authorization = "Bearer $ArmToken" } -ContentType "application/json" `
    -Body ($body | ConvertTo-Json -Depth 5)  # default -Depth of 2 truncates the nested hashtables

Write permission exists only for as long as the remediation takes. Every elevation is recorded in the Entra ID audit logs, giving you a secondary layer of non-repudiation.

5. Tool Library Safety and Validation

The LLM does not call Azure APIs directly. It calls “Tools” (activity functions) that you define. Security comes from what tools exist, not from what the system prompt says.

Tool Scoping Principles

  1. Absence is Security: If you don’t want the agent to delete resources, don’t include a delete_resource tool. A tool that doesn’t exist cannot be called.
  2. Input Validation: Each activity function must validate LLM-generated arguments. A tool to scale_deployment needs hard-coded bounds (e.g., min: 1, max: 10) and must reject any operation targeting the kube-system namespace (see the sketch after this list).
  3. Parameterized KQL: Never let the LLM generate free-form KQL. Use named templates where the model only provides parameters (e.g., podName).
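A minimal sketch of principle 2 as an activity function, assuming the official Kubernetes Python client; the bounds, the forbidden-namespace set, and the argument shape are illustrative:

# Guarded scale_deployment tool (sketch; bounds and client setup are illustrative)
from kubernetes import client, config

MIN_REPLICAS, MAX_REPLICAS = 1, 10
FORBIDDEN_NAMESPACES = {"kube-system"}

def scale_deployment(args: dict) -> dict:
    namespace = args["namespace"]
    name = args["name"]
    replicas = int(args["replicas"])

    # Hard-coded bounds: reject anything the LLM proposes outside them
    if namespace in FORBIDDEN_NAMESPACES:
        raise ValueError(f"scaling in namespace '{namespace}' is forbidden")
    if not MIN_REPLICAS <= replicas <= MAX_REPLICAS:
        raise ValueError(f"replicas must be in [{MIN_REPLICAS}, {MAX_REPLICAS}]")

    config.load_incluster_config()  # or load_kube_config() outside the cluster
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name, namespace, body={"spec": {"replicas": replicas}})
    return {"scaled": name, "namespace": namespace, "replicas": replicas}

Note that the validation runs inside the activity function, not in the prompt: even a jailbroken model cannot push the deployment past the bounds the code enforces.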

6. The Immutable Action Log

To pass a SOC 2 audit, you need to prove that the AI agent’s actions were authorized and auditable. Every decision, LLM “Chain of Thought,” and approval record gets written to append-only Azure Blob Storage.

Because the container uses a time-based retention policy with allowProtectedAppendWrites: true, even a fully compromised agent runtime can append new entries but cannot modify or delete the historical record of its actions.

// Bicep: Immutable Audit Log Container
resource immutabilityPolicy 'Microsoft.Storage/storageAccounts/blobServices/containers/immutabilityPolicies@2023-01-01' = {
  parent: auditContainer
  name: 'default'
  properties: {
    immutabilityPeriodSinceCreationInDays: 365
    allowProtectedAppendWrites: true // Mandatory for append logs
    state: 'Unlocked' // lock the policy after validation; while Unlocked, a storage admin can still remove it
  }
}
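On the write side, a minimal sketch of the agent’s audit writer, appending one JSON line per reasoning step to an append blob in that container (the account URL, container name, and entry shape are assumptions):

# Appending an audit entry (sketch; names and shapes are illustrative)
import json
from azure.storage.blob import BlobClient

def append_audit_entry(incident_id: str, entry: dict, credential) -> None:
    blob = BlobClient(
        account_url="https://<audit-account>.blob.core.windows.net",
        container_name="audit",
        blob_name=f"{incident_id}.jsonl",
        credential=credential,
    )
    if not blob.exists():
        blob.create_append_blob()  # append blobs pair with protected append writes
    blob.append_block(json.dumps(entry) + "\n")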

Key Takeaways

  1. Agency Needs Guardrails: Security comes from tool design and identity boundaries, not system prompt instructions.
  2. Split Your Identities: Use a read-only Observer for investigation and a JIT-elevated Remediator for approved fixes.
  3. Human-in-the-Loop is Mandatory: Always pause for human approval on Teams before state-changing production actions.
  4. Audit the “Thought”: Log the agent’s internal reasoning (CoT) to immutable storage for post-incident reviews.
  5. Fail Safe: If a human doesn’t respond or the agent reaches its reasoning limit, default to “No-Action.”

The Privacy-First AI DevOps Cluster is now complete. By integrating these eight layers—from Private Link and Managed Identity to Agentic SRE guardrails—you have built a zero-trust, enterprise-grade ecosystem where AI acts as a secure productivity multiplier.
