A hands-on sample showing how to use AWS DevOps Agent to investigate Spark job failures on Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS). You deploy an Amazon EMR on Amazon EKS environment with two custom MCP servers, inject real Spark failures (OOM, schema errors, data skew, bad Amazon S3 paths), and use the agent to find the root cause with natural language prompts.
🚀 Quick Start: You'll need an Amazon EMR on EKS cluster first (use your own, or deploy one in Step 0). Then run ./deploy.sh to set up the rest in ~15-20 minutes.
- Walkthrough
- What You'll Learn
- Solution Architecture
- MCP Server Connectivity Options
- Getting Started
- AWS DevOps Agent Configuration
- Fault Injection Labs
- Sample Spark Jobs
- Cost
- Cleanup
- Security Considerations
- Troubleshooting
- Additional Resources
- Deploy the Amazon EMR on EKS environment with Spark History Server and two custom MCP servers (Runbook MCP, Spark History Server MCP)
- Inject a fault by submitting a broken PySpark job
- Investigate with AWS DevOps Agent. It looks at Amazon CloudWatch Logs, Spark execution data (SHS MCP), and operational runbooks (Runbook MCP) to find what went wrong
- Fix by running the rollback script to submit a working job
- How to build and deploy custom MCP servers for AWS DevOps Agent
- How AWS DevOps Agent pulls together Spark execution data, Amazon CloudWatch Logs, and runbooks to diagnose OOM, data skew, schema errors, and config problems
⚠️ Disclaimer: This repo includes scripts that intentionally break Spark jobs to demo AWS DevOps Agent. Don't run these in production. See Security Considerations before adapting for production.
The SHS MCP server runs inside the Amazon EKS cluster on a private subnet. AWS DevOps Agent reaches it through a VPC Lattice Private Connection — no public internet exposure.
This workshop deploys the following components:
| Component | What It Does | Where It Runs |
|---|---|---|
| Amazon EMR on EKS | Runs Spark jobs on Kubernetes. Spark event logs go to Amazon S3. stdout/stderr and per-container logs go to Amazon CloudWatch Logs (/emr-on-eks/dev). | Amazon EKS cluster (emr-eks-karpenter) |
| Amazon CloudWatch metrics (Amazon EMR on EKS) | AWS automatically publishes Amazon EMR on EKS service-level metrics (job run counts, container state) under the AWS/EMRContainers namespace. | Amazon CloudWatch |
| Amazon Managed Service for Prometheus | Prometheus running on Amazon EKS scrapes pod and node metrics (CPU, memory, network) and ingests them into Amazon Managed Service for Prometheus. Enabled by default in the data-on-eks blueprint. | Amazon Managed Service for Prometheus workspace |
| Spark History Server (SHS) | Reads Spark event logs from Amazon S3 and serves a REST API with job execution details (stages, tasks, executors, shuffle metrics). | Amazon EKS pod (spark-history namespace) |
| SHS MCP Server (18 tools) | Wraps the SHS REST API as an MCP server so AWS DevOps Agent can query Spark execution data. Runs as a sidecar alongside SHS. | Amazon EKS pod with Nginx sidecar for TLS + API key auth |
| Runbook MCP Server (3 tools) | Searches operational runbooks indexed in an Amazon Bedrock Knowledge Base. Returns step-by-step investigation guides for known failure patterns (what to check, in what order). | Amazon Bedrock AgentCore Runtime (managed HTTPS endpoint) |
| Amazon Bedrock Knowledge Base | Indexes 7 YAML runbooks (OOM, data skew, shuffle errors, scheduling failures, etc.) stored in Amazon S3, using Amazon OpenSearch Serverless as the vector store. | Amazon Bedrock + Amazon OpenSearch Serverless |
| AWS DevOps Agent | Connects to all the above sources (Amazon CloudWatch Logs and Amazon CloudWatch metrics built-in, SHS MCP via Private Connection, Runbook MCP via Amazon Bedrock AgentCore Runtime) and investigates Spark job failures with natural language. Amazon Managed Service for Prometheus metrics aren't directly queried by the agent in this sample. | AWS DevOps Agent console |
AWS DevOps Agent requires HTTPS with authentication for all MCP servers. This sample deploys MCP servers in two different ways and demonstrates how to integrate each with AWS DevOps Agent.
The Runbook MCP server is deployed to Amazon Bedrock AgentCore, which hosts it as a managed HTTPS endpoint. AWS DevOps Agent authenticates with an OAuth 2.0 Client Credentials token from Amazon Cognito.
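As a rough illustration of the OAuth 2.0 client-credentials exchange behind the Runbook MCP connection, the sketch below builds (but does not send) a token request of the kind Amazon Cognito accepts at its /oauth2/token endpoint. The domain, client ID, and client secret are placeholders, not values from this sample; the real ones come from the deployment output.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(token_url: str, client_id: str, client_secret: str,
                        scope: str = "") -> urllib.request.Request:
    """Build (but do not send) an OAuth 2.0 client-credentials token request."""
    # The client ID and secret travel in an HTTP Basic Authorization header.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    form = {"grant_type": "client_credentials"}
    if scope:
        form["scope"] = scope
    return urllib.request.Request(
        token_url,
        data=urllib.parse.urlencode(form).encode(),
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


# Placeholder values for illustration only.
request = build_token_request(
    "https://example-domain.auth.us-west-2.amazoncognito.com/oauth2/token",
    client_id="example-client-id",
    client_secret="example-client-secret",
)
print(request.get_method(), request.full_url)
```

Sending the request with urllib.request.urlopen(request) would return a JSON body containing an access token, which the caller presents as a bearer token on each HTTPS call to the MCP endpoint.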
The SHS MCP server runs as a pod inside the Amazon EKS cluster, exposed through an internal Network Load Balancer. Because it is not reachable from the internet, AWS DevOps Agent connects to it through an Amazon VPC Lattice Private Connection.
See the Solution Architecture diagram for the full connectivity picture.
| Tool | Install | Docs |
|---|---|---|
| AWS CLI | curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && unzip awscliv2.zip && sudo ./aws/install | Install AWS CLI |
| Terraform | sudo yum install -y yum-utils && sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo && sudo yum -y install terraform | Install Terraform |
| kubectl | curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && sudo install kubectl /usr/local/bin/ | Install kubectl |
| Helm | curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \| bash | Install Helm |
| jq | sudo yum install -y jq or sudo apt-get install -y jq | Install jq |
| openssl | Pre-installed on most Linux/macOS | OpenSSL |
| Node.js 18+ | curl -fsSL https://rpm.nodesource.com/setup_18.x \| sudo bash - && sudo yum install -y nodejs | Install Node.js |
| AgentCore CLI | sudo npm install -g @aws/agentcore typescript | AgentCore CLI |
Configure AWS credentials:
aws configure
Regions: The Terraform blueprint defaults to us-west-2. AWS DevOps Agent Spaces are available in us-east-1, us-west-2, ap-southeast-2, ap-northeast-1, eu-central-1, and eu-west-1, and can monitor resources in any region. Your EKS cluster can be in any region. Set AWS_REGION in config.env to match your EKS cluster region.
You need an Amazon EKS cluster with Amazon EMR on EKS configured. Pick one option:
Option A: Use your existing Amazon EMR on EKS cluster
Gather these values from your existing setup:
aws eks list-clusters --region $AWS_REGION
aws emr-containers list-virtual-clusters --region $AWS_REGION --states RUNNING \
--query "virtualClusters[].{Id:id,Name:name}" --output table
aws iam list-roles --query "Roles[?contains(RoleName,'emr-data-team-a')].Arn" --output text
Skip to Step 1: One-Click Deployment.
Option B: Deploy a new Amazon EMR on EKS cluster (~25-30 min)
Uses the data-on-eks emr-eks-karpenter blueprint:
git clone https://github.com/awslabs/data-on-eks.git
cd data-on-eks/analytics/terraform/emr-eks-karpenter
terraform init
terraform apply -auto-approve
# For a different region: terraform apply -auto-approve -var region=us-east-1
If Terraform fails partway through (usually Helm timing issues), run terraform destroy -auto-approve and try again.
Configure kubectl and verify:
export AWS_REGION=us-west-2
aws eks update-kubeconfig --name emr-eks-karpenter --region $AWS_REGION
kubectl get nodes
aws emr-containers list-virtual-clusters --region $AWS_REGION --states RUNNING \
--query "virtualClusters[].{Id:id,Name:name}" --output table
You should see nodes running and two virtual clusters (emr-data-team-a and emr-data-team-b).
git clone https://github.com/aws-samples/sample-devops-agent-emr-alertreduction.git
cd sample-devops-agent-emr-alertreduction
cp config.env.template config.env
# Edit config.env and set: EKS_CLUSTER_NAME, EMR_VIRTUAL_CLUSTER_ID,
# JOB_EXECUTION_ROLE_ARN, AWS_REGION (follow the commands in config.env)
./deploy.sh
The deploy.sh script runs these steps (~15-20 min):
- Patch the Amazon EMR execution role with scoped Amazon CloudWatch Logs + Amazon S3 permissions (see scripts/patch-emr-role.sh)
- Deploy the CloudFormation stack (Amazon S3 + Amazon OpenSearch Serverless + Amazon Bedrock Knowledge Base, hardened per infrastructure/template.yaml)
- Upload runbooks to Amazon S3 and sync to Amazon Bedrock Knowledge Base
- Deploy Spark History Server on Amazon EKS
- Deploy Runbook MCP to Amazon Bedrock AgentCore Runtime (Cognito + OAuth) and SHS MCP on Amazon EKS via Helm + internal NLB
See Security Considerations for IAM policy details and S3 hardening.
Manual Step-by-Step Deployment
If you prefer to run each step manually:
bash scripts/patch-emr-role.sh # Step 1: IAM patches
bash scripts/deploy-infra.sh # Step 2: CFN + Amazon Bedrock Knowledge Base + runbooks
bash scripts/deploy-shs.sh # Step 3: Spark History Server
bash scripts/deploy-mcp-server.sh # Step 4: Runbook MCP → Amazon Bedrock AgentCore
bash scripts/deploy-shs-mcp-private.sh # Step 5: SHS MCP via Helm + NLB
Check pods are running:
kubectl get pods -n spark-history
# Expected: spark-history-server (1/1) and shs-mcp-... (2/2)
Submit a baseline Spark job to confirm end-to-end flow:
bash fault-injection/rollback-submit-good-job.sh
# Should complete in ~50-90 seconds
Verify the job shows up in Spark History Server and CloudWatch Logs:
kubectl exec -n spark-history deploy/spark-history-server -- \
curl -s http://localhost:18080/api/v1/applications
aws logs describe-log-streams --log-group-name "/emr-on-eks/dev" --region $AWS_REGION \
--order-by LastEventTime --descending --limit 5 \
--query "logStreams[].logStreamName" --output table
Additional verification checks
Browse the SHS Web UI (optional):
kubectl port-forward -n spark-history deploy/spark-history-server 18080:18080
# Open http://localhost:18080 in your browser (Ctrl+C to stop)
Verify Amazon Bedrock Knowledge Base returns runbooks:
KB_ID=$(aws bedrock-agent list-knowledge-bases --region $AWS_REGION \
--query "knowledgeBaseSummaries[?starts_with(name,'dev-emr-runbooks-kb') && status=='ACTIVE'].knowledgeBaseId" --output text)
aws bedrock-agent-runtime retrieve --knowledge-base-id "$KB_ID" --region $AWS_REGION \
--retrieval-query '{"text": "spark OOM failure"}' \
--query "retrievalResults[0].content.text" --output text | head -5
Verify SHS MCP endpoint (retrieves the auto-generated API key from the Kubernetes Secret):
API_KEY=$(kubectl get secret shs-mcp-apikey -n spark-history \
-o jsonpath='{.data.api-key}' | base64 -d)
kubectl exec -n spark-history deploy/shs-mcp-mcp-apache-spark-history-server -c nginx-auth-proxy -- \
sh -c "curl -sk https://localhost:18889/mcp/ -H 'x-api-key: $API_KEY'"
Configure AWS DevOps Agent to access your EKS cluster and both MCP servers through the AWS console.
1. Print all the values you'll need:
bash scripts/show-setup-values.sh
This prints the NLB DNS, security group IDs, Cognito credentials, and everything else needed for the console.
2. Create the Agent Space: Follow docs/AGENT_SPACE_SETUP.md for step-by-step console instructions with commands to retrieve each value.
Once configured, your Agent Space should look like this (with both MCP servers registered). Click Operator access to launch the Web App:
Four labs that simulate real Spark job failures. Each inject script uploads a broken PySpark script to Amazon S3, submits it to Amazon EMR on EKS, waits for it to finish, and prints the job run ID.
cd fault-injection
chmod +x *.sh
| Lab | Failure | Inject | Rollback | Result |
|---|---|---|---|---|
| 1. OOM | OutOfMemoryError | ./inject-oom-failure.sh | ./rollback-submit-good-job.sh | ❌ FAILED (~2-4 min) |
| 2. Bad Column | AnalysisException | ./inject-bad-column.sh | ./rollback-submit-good-job.sh | ❌ FAILED (~60 sec) |
| 3. Data Skew | Severe partition skew | ./inject-data-skew.sh | ./rollback-submit-good-job.sh | ✅ COMPLETED (slow) |
| 4. Bad S3 Path | S3 Access Denied | ./inject-bad-s3path.sh | ./rollback-submit-good-job.sh | ❌ FAILED (~60 sec) |
Before opening the AWS DevOps Agent console, gather these three values:
| Value | Where to find it |
|---|---|
| <JOB_RUN_ID> | Printed by the inject script as [OK] Submitted: <id> (e.g., 000000037d4dkr1m57d). Also shown in Amazon EMR console under Amazon EMR on EKS → Virtual clusters → Job runs. |
| <VIRTUAL_CLUSTER_ID> | Run grep EMR_VIRTUAL_CLUSTER_ID config.env. Also in Amazon EMR console under Virtual clusters. |
| <REGION> | Run grep AWS_REGION config.env. The region where your EKS cluster is deployed. |
Example output from an inject script:
[INFO] Injecting OOM fault ...
[OK] Submitted: 000000037d4dkr1m57d ← this is the JOB_RUN_ID
[INFO] PENDING (15s)
...
[WARN] Job 000000037d4dkr1m57d: FAILED
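If you capture the inject script's output to a file or variable, the job run ID can be pulled out mechanically. The following is a minimal sketch that assumes only the two line formats shown in the example output above ([OK] Submitted: and [WARN] Job ...: STATE); the function name is illustrative, and other lines the scripts may print are not handled.

```python
import re

# Patterns match only the line formats shown in the example output.
SUBMITTED = re.compile(r"^\[OK\] Submitted: (\S+)", re.MULTILINE)
FINAL_STATE = re.compile(r"^\[WARN\] Job (\S+): (\w+)", re.MULTILINE)


def parse_inject_output(text: str) -> dict:
    """Extract the job run ID and, if present, the reported final state."""
    submitted = SUBMITTED.search(text)
    final = FINAL_STATE.search(text)
    return {
        "job_run_id": submitted.group(1) if submitted else None,
        "final_state": final.group(2) if final else None,
    }


sample = """[INFO] Injecting OOM fault ...
[OK] Submitted: 000000037d4dkr1m57d
[INFO] PENDING (15s)
[WARN] Job 000000037d4dkr1m57d: FAILED
"""
result = parse_inject_output(sample)
print(result)
```

The job_run_id value is what you substitute for <JOB_RUN_ID> in the investigation prompts.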
Then in the AWS DevOps Agent console:
- Open your Agent Space and click Operator access to launch the Web App
- Click Start Investigation
- Paste the Investigation details prompt from the lab below, with your three values substituted in
- Paste the Investigation starting point prompt
- Observe the root cause the agent arrives at, and the tools it used to get there (CloudWatch Logs, Amazon CloudWatch metrics, SHS MCP, Runbook MCP)
- Use the chat option to follow up — ask the agent how it reached the conclusion, which runbook steps it followed, or for more detail on any finding
The Web App lets you start an investigation and follow up with chat:
To check job status directly from the CLI:
aws emr-containers list-job-runs --virtual-cluster-id $EMR_VIRTUAL_CLUSTER_ID \
--region $AWS_REGION --max-results 3
After each lab, run ./rollback-submit-good-job.sh to submit a working baseline job.
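When the list-job-runs command is run with JSON output, its result can be summarised with a few lines of Python. This is a sketch under the assumption that the response carries a jobRuns list whose entries have id, name, and state fields; the sample payload below is made up for illustration, apart from the job run ID and application names reused from this README.

```python
import json
from collections import Counter


def summarise_job_runs(cli_json: str) -> Counter:
    """Count job runs per state from list-job-runs JSON output."""
    runs = json.loads(cli_json).get("jobRuns", [])
    return Counter(run.get("state", "UNKNOWN") for run in runs)


def failed_runs(cli_json: str) -> list:
    """Return (id, name) pairs for runs whose state is FAILED."""
    runs = json.loads(cli_json).get("jobRuns", [])
    return [(r.get("id"), r.get("name")) for r in runs if r.get("state") == "FAILED"]


# Illustrative payload; real output contains more fields per job run.
sample_output = json.dumps({"jobRuns": [
    {"id": "000000037d4dkr1m57d", "name": "example-oom-job", "state": "FAILED"},
    {"id": "example-run-id-0002", "name": "CustomerAnalytics", "state": "COMPLETED"},
    {"id": "example-run-id-0003", "name": "CustomerAnalytics-BAD-SKEW", "state": "COMPLETED"},
]})

print(dict(summarise_job_runs(sample_output)))
print(failed_runs(sample_output))
```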
Scenario: A Spark job uses crossJoin instead of a proper join on customer_id. This creates a cartesian product (500 × 10,000 = 5M rows), then calls collect() to pull everything to the driver. The driver runs out of memory.
Script: sample-jobs/customer_analytics_bad_oom.py
./inject-oom-failure.sh
# Job fails in ~2-4 minutes
Note: OOM jobs often crash hard enough that the Spark event log never gets flushed to S3. If the job doesn't appear in SHS, CloudWatch Logs will still have stderr/stdout.
Investigation details:
A recent Amazon EMR Spark job failed on virtual cluster <VIRTUAL_CLUSTER_ID> in <REGION>
with an OutOfMemoryError. Job run ID: <JOB_RUN_ID>.
Investigation starting point:
Find the root cause of this failure. Use Amazon CloudWatch Logs, Amazon CloudWatch metrics, Spark History
Server MCP, and Runbook MCP as needed. Explain which tools you used and why.
What to observe:
Look at the root cause the agent reports and the tools it used to get there. For this lab, you should see it:
- Pull the OOM stack trace from CloudWatch Logs (java.lang.OutOfMemoryError, SparkContext was shut down)
- Match the spark-oom-failure runbook via Runbook MCP
- Possibly call SHS MCP for stage details (may return little if the event log wasn't flushed before the crash)
- Conclude that crossJoin created a cartesian product and collect() pulled all rows to the driver
Try follow-up questions in the chat:
- "Which CloudWatch log streams had the most useful information?"
- "Walk me through the runbook steps you followed"
- "Why didn't SHS have complete data for this job?"
Key learnings:
- crossJoin plus collect() is a classic OOM pattern. Use explicit join conditions, and write to S3 instead of collecting.
- OOM jobs often don't appear in SHS because the event log flush needs a graceful shutdown. Fall back to CloudWatch Logs when that happens.
- Runbook MCP acts as a guided investigation. The agent picks a matching runbook, follows its steps, and correlates each step's findings with CloudWatch and SHS data.
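To make the row-count blow-up in this lab concrete, here is a small plain-Python stand-in (not PySpark, and not the code in customer_analytics_bad_oom.py) that compares a cross join with a join keyed on customer_id. The tiny datasets are invented for the illustration; only the 500 × 10,000 figure comes from the scenario.

```python
from itertools import product


def cross_join_count(left, right):
    """Every left row pairs with every right row: len(left) * len(right)."""
    return sum(1 for _ in product(left, right))


def keyed_join_count(left, right, key):
    """Rows pair only when their key values match (an inner equi-join)."""
    by_key = {}
    for row in right:
        by_key.setdefault(row[key], []).append(row)
    return sum(len(by_key.get(row[key], [])) for row in left)


# Invented miniature data: 5 customers, 100 transactions spread across them.
customers = [{"customer_id": i} for i in range(5)]
transactions = [{"customer_id": i % 5, "amount": 10 * i} for i in range(100)]

print(cross_join_count(customers, transactions))                 # 5 * 100 pairs
print(keyed_join_count(customers, transactions, "customer_id"))  # one pair per transaction

# The same arithmetic at the scale quoted in the scenario.
full_scale_rows = 500 * 10_000
print(f"{full_scale_rows:,} rows from the cross join")
```

The keyed join grows with the number of matching rows, while the cross join grows with the product of both inputs, which is why collecting its result to the driver exhausts memory.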
Scenario: A groupBy().agg() call references a column called transaction_fee that doesn't exist. Spark catches this at plan time (before any stages run) and throws an AnalysisException with suggested column names.
Script: sample-jobs/customer_analytics_bad_column.py
./inject-bad-column.sh
# Job fails in ~60 seconds
Investigation details:
A recent Amazon EMR Spark job failed on virtual cluster <VIRTUAL_CLUSTER_ID> in <REGION>
with an AnalysisException. Job run ID: <JOB_RUN_ID>.
Investigation starting point:
Find the root cause of this failure. Use Amazon CloudWatch Logs, Amazon CloudWatch metrics, Spark History
Server MCP, and Runbook MCP as needed. Explain which tools you used and why.
What to observe:
Look at the root cause the agent reports and the tools it used to get there. For this lab, you should see it:
- Pull the AnalysisException / UNRESOLVED_COLUMN.WITH_SUGGESTION from CloudWatch Logs
- Call SHS MCP and find that no stages ran (the error happened at plan time)
- Conclude that transaction_fee isn't a real column, and the error message suggests valid ones
Try follow-up questions in the chat:
- "Why were there no completed stages in SHS?"
- "What's the difference between an AnalysisException and a runtime error?"
- "How would this failure be caught in CI before submitting to Amazon EMR?"
Key learnings:
- Spark catches schema errors at plan time, so no stages execute. The error message even suggests valid column names.
- SHS shows no completed stages for plan-time failures.
- This is one of the most common Spark errors in production. Usually caused by schema evolution, typos, or a missing upstream transformation.
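One way to catch this class of error before a job is submitted is a lightweight column check in CI. The sketch below is an assumption about how such a check could look, not part of this sample: it compares the columns a job references against a known schema and offers close matches, loosely mirroring the suggestions Spark prints. The schema list is hypothetical; only transaction_fee and customer_id come from the lab.

```python
import difflib


def check_columns(referenced, schema):
    """Map each referenced column missing from the schema to close matches."""
    problems = {}
    for name in referenced:
        if name not in schema:
            problems[name] = difflib.get_close_matches(name, schema, n=3, cutoff=0.5)
    return problems


# Hypothetical schema for illustration; the real job's columns may differ.
schema = ["transaction_id", "customer_id", "amount", "transaction_date"]
problems = check_columns(["customer_id", "transaction_fee"], schema)

for name, suggestions in problems.items():
    print(f"unknown column {name!r}; close matches: {suggestions}")
```

A check like this only covers column names known ahead of time. Columns derived inside the job still need Spark's own analysis to validate.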
Scenario: A job processes 100K transactions where 95% are assigned to customer_id=1. One partition gets ~95K rows while others get a handful. The job completes but runs much slower than baseline. No error message, just bad performance. This is the hardest failure type to diagnose.
Script: sample-jobs/customer_analytics_bad_skew.py
./inject-data-skew.sh
# Job completes but takes longer than the ~50 second baseline
Investigation details:
A recent Amazon EMR Spark job completed on virtual cluster <VIRTUAL_CLUSTER_ID> in <REGION>
but took much longer than the baseline. Job run ID: <JOB_RUN_ID>.
Investigation starting point:
Find the root cause of why this job ran slowly. Use Amazon CloudWatch Logs, Amazon CloudWatch metrics,
Spark History Server MCP, and Runbook MCP as needed. Compare against the baseline job
(CustomerAnalytics). Explain which tools you used and why.
What to observe:
Look at the root cause the agent reports and the tools it used to get there. For this lab, you should see it:
- Call SHS MCP to get application details, stages, and per-task metrics (this is where the skew shows up)
- Identify the app as CustomerAnalytics-BAD-SKEW and spot the heavy partition imbalance (one partition with ~97% of rows)
- Match the spark-data-skew runbook via Runbook MCP for the investigation steps
- Conclude that the customer_id distribution is highly skewed, with most transactions concentrated on one key
Try follow-up questions in the chat:
- "Which SHS MCP tools did you call and what did each one return?"
- "Walk me through the runbook steps you followed"
- "What would the task distribution look like in a healthy job?"
Key learnings:
- Data skew has no error message. You need SHS task-level metrics (min/max/median duration) to spot it.
- collect_list() on a skewed key is dangerous. One partition ends up with a massive list.
- AQE helps with skewed joins: with spark.sql.adaptive.enabled=true, setting spark.sql.adaptive.skewJoin.enabled=true splits skewed join partitions automatically.
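The min/median/max comparison mentioned above can be sketched in a few lines. This example assumes you already have per-task durations for one stage (for instance, copied from SHS task metrics); the duration values and the threshold of 5 are arbitrary choices for the illustration, not numbers produced by this sample.

```python
from statistics import median


def skew_report(task_durations_ms, threshold=5.0):
    """Flag a stage as skewed when its slowest task far exceeds the median."""
    if not task_durations_ms:
        raise ValueError("no task durations supplied")
    slowest = max(task_durations_ms)
    typical = median(task_durations_ms)
    ratio = slowest / typical if typical > 0 else float("inf")
    return {
        "min": min(task_durations_ms),
        "median": typical,
        "max": slowest,
        "max_over_median": ratio,
        "skewed": ratio >= threshold,
    }


# Invented durations: an even stage, and one with a single straggler task.
healthy_stage = [410, 395, 430, 402, 418, 399, 425, 408]
skewed_stage = [40, 38, 45, 41, 39, 42, 44, 9500]

print(skew_report(healthy_stage)["skewed"])
print(skew_report(skewed_stage)["skewed"])
```

Comparing against the median rather than the mean keeps a single straggler from hiding itself by inflating the average.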
Scenario: A job reads from s3://this-bucket-does-not-exist-12345/, which is a non-existent bucket. AWS returns 403 Access Denied (not 404) and the job fails immediately.
Script: sample-jobs/customer_analytics_bad_s3path.py
./inject-bad-s3path.sh
# Job fails in ~60 seconds
Investigation details:
A recent Amazon EMR Spark job failed on virtual cluster <VIRTUAL_CLUSTER_ID> in <REGION>
when trying to read input data. Job run ID: <JOB_RUN_ID>.
Investigation starting point:
Find the root cause of this failure. Use Amazon CloudWatch Logs, Amazon CloudWatch metrics, Spark History
Server MCP, and Runbook MCP as needed. Explain which tools you used and why.
What to observe:
Look at the root cause the agent reports and the tools it used to get there. For this lab, you should see it:
- Pull the AmazonS3Exception: Access Denied (403) from CloudWatch Logs
- Call SHS MCP and find the job failed at the first stage with no completed stages
- Match the emr-job-submission-failure runbook via Runbook MCP
- Conclude the input S3 bucket doesn't exist (AWS returns 403 for non-existent buckets)
Try follow-up questions in the chat:
- "Why does AWS return 403 instead of 404 here?"
- "Walk me through the runbook steps you followed"
- "How would you distinguish a missing bucket from an actual permissions issue?"
Key learnings:
- AWS returns 403 (not 404) for non-existent buckets. Don't assume 403 means it's a permissions issue.
- Validate S3 paths before submitting jobs. A simple aws s3 ls check catches this early.
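A pre-submit check along those lines could be wrapped in a small helper. The sketch below is one possible shape, assuming the AWS CLI is installed and that any non-zero exit code from aws s3 ls means the path should not be used; the function names are illustrative. The demo at the end swaps in a stand-in for subprocess.run so it runs without AWS credentials.

```python
import subprocess
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split s3://bucket/key into (bucket, key), rejecting other URI forms."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def input_path_is_listable(uri: str, run=subprocess.run) -> bool:
    """Return True only if `aws s3 ls` on the path exits with code 0."""
    bucket, key = split_s3_uri(uri)
    result = run(["aws", "s3", "ls", f"s3://{bucket}/{key}"],
                 capture_output=True, text=True)
    return result.returncode == 0


# Stand-in for subprocess.run that simulates a failing `aws s3 ls` call.
class FakeResult:
    def __init__(self, returncode):
        self.returncode = returncode


def fake_failing_run(cmd, **kwargs):
    return FakeResult(255)  # any non-zero exit code is treated as a failure


print(input_path_is_listable("s3://this-bucket-does-not-exist-12345/",
                             run=fake_failing_run))
```

Calling input_path_is_listable(uri) without the run argument invokes the real CLI. The check does not tell a missing bucket apart from a permissions problem; it only stops the job from being submitted against an unusable path.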
All PySpark scripts are in sample-jobs/:
| Script | What It Does | Result |
|---|---|---|
| customer_analytics.py | Baseline: synthetic data, aggregations, window functions, joins, S3 writes (~261 lines) | ✅ COMPLETED (~50s) |
| customer_analytics_bad_oom.py | Replaces join("customer_id") with crossJoin + collect() on 5M rows | ❌ FAILED (OOM) |
| customer_analytics_bad_column.py | References non-existent column transaction_fee in groupBy().agg() | ❌ FAILED (AnalysisException) |
| customer_analytics_bad_skew.py | 95% of 100K transactions assigned to customer_id=1, collect_list() on skewed key | ✅ COMPLETED (slow) |
| customer_analytics_bad_s3path.py | Reads from non-existent S3 bucket | ❌ FAILED (S3 403) |
| Resource | Cost |
|---|---|
| EKS cluster (running) | |
| EKS cluster (idle, ASG=0) | ~$0.34/hr (control plane + AOSS) |
| Internal NLB | ~$0.02/hr |
| Lambda | Free tier |
Scale down when not in use:
bash scripts/scale-down.sh
Scale back up:
bash scripts/scale-up.sh
# Remove the AWS DevOps Agent layer (MCP servers, Amazon Bedrock Knowledge Base, Amazon OpenSearch Serverless, Amazon S3)
./destroy.sh
# Remove the EKS cluster (if you deployed it in Step 0)
cd data-on-eks/analytics/terraform/emr-eks-karpenter
terraform destroy -auto-approve
Note: Before running destroy.sh, manually delete the Private Connection and Agent Space from the AWS DevOps Agent console. These aren't managed by the script.
See docs/SECURITY_CONSIDERATIONS.md for the full security guide covering shared responsibility model, data classification, key management, API key handling, IAM access review, AI/ML security controls, third-party components, residual risks, and scan attestation.
| Issue | Symptom | Fix |
|---|---|---|
| SHS MCP pod not starting | ImagePullBackOff | Verify outbound internet from EKS nodes (NAT Gateway required) |
| SHS pod CrashLoopBackOff | ClassNotFoundException S3AFileSystem | Verify shs-deployment-v2.yaml is used (has init container for hadoop-aws JARs) |
| search_runbooks returns 0 results | AOSS network policy reset | Run bash scripts/scale-up.sh. It checks and fixes the AOSS network policy. |
| Runbook MCP auth fails | 401 from Amazon Bedrock AgentCore endpoint | Verify Cognito credentials (printed by deploy-mcp-server.sh) |
| KB sync shows failed docs | 7 documents failed | Expected. .gitkeep and schema.yaml aren't valid runbooks. |
| Private Connection not working | SHS MCP unreachable from Agent | Check the security group allows inbound from VPC Lattice ENIs on port 18889. Make sure all VPC CIDRs are covered in outbound rules. |
| OOM job not in SHS | Job failed but no app in SHS | Expected. OOM crashes prevent event log flush to S3. Use CloudWatch Logs instead. |
| Resource | Link |
|---|---|
| AWS DevOps Agent Documentation | User Guide |
| AWS DevOps Agent Console | Console |
| AWS DevOps Agent EKS Workshop | GitHub |
| Spark History Server MCP | GitHub |
| Amazon EMR on EKS Documentation | Developer Guide |
| data-on-eks Blueprints | GitHub |
This project is licensed under the MIT-0 License. See the LICENSE file.



