I get asked all the time: “How do I securely automate my EMR deployment so I only need to write the code once to create a repeatable, production-ready environment?” If you want EMR to feel boring in production (the best outcome), you need Infrastructure as Code (IaC) for the base platform and Configuration as Code (CaC) for the node-level runtime. Terraform and Ansible are the tools that answer this question. In this post I will walk you through a practical checklist for creating a secure, production-ready EMR environment, using Terraform for infrastructure and Ansible for configuration, that you can apply as a repeatable template when deploying your EMR pipelines.
The separation that keeps you sane
Use Terraform for cloud resources and relationships (VPC, subnets, IAM, S3, EMR security config). Use Ansible for OS/runtime configuration (packages, JVM tuning, custom libraries, logging agents).
If you blur the line, you get:
- Terraform trying to be a config manager (bad at it)
- Ansible trying to own cloud resources (possible, but slower and harder to reason about)
The result is a brittle platform. Keep the contract crisp.
Terraform best practices (IaC)
1) Lock state and never share a local state file
Use remote state + locking so two people don’t mutate prod simultaneously.
```hcl
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "emr/prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```
2) Pin provider + Terraform versions
Don’t let a random upgrade break prod on a Friday.
```hcl
terraform {
  required_version = "~> 1.7.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
```
3) Build a dedicated EMR VPC posture
Private subnets only for EMR nodes; use NAT for outbound. Put the EMR primary in a private subnet as well.
- No public IPs on core/task nodes
- Restrict inbound to corporate CIDRs or bastion/VPN
- Separate security groups for master/core/task
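As a sketch, a primary-node security group under those rules might look like this (resource names and the corporate CIDR are assumptions, and EMR attaches its own managed security groups on top of whatever you provide):

```hcl
# Locked-down EMR primary security group (names and CIDRs are assumptions)
resource "aws_security_group" "emr_primary" {
  name_prefix = "emr-primary-"
  vpc_id      = aws_vpc.emr.id

  # Inbound SSH only from the corporate VPN range
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.10.0.0/16"]
  }

  # Outbound egress flows through the NAT gateway
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```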
4) Use an EMR security configuration
Enforce encryption at rest and in transit as a baseline.
- EBS encryption via KMS
- S3 encryption (SSE-KMS) and bucket policies
- In-transit encryption for EMR
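One way to express that baseline is an `aws_emr_security_configuration` resource. The snippet below is a sketch: the KMS key, the certificate bundle location, and the exact JSON layout are assumptions you should verify against the EMR documentation for your release.

```hcl
resource "aws_emr_security_configuration" "baseline" {
  name = "emr-baseline"

  configuration = jsonencode({
    EncryptionConfiguration = {
      EnableAtRestEncryption    = true
      EnableInTransitEncryption = true
      AtRestEncryptionConfiguration = {
        S3EncryptionConfiguration = {
          EncryptionMode = "SSE-KMS"
          AwsKmsKey      = aws_kms_key.emr.arn # assumed key resource
        }
        LocalDiskEncryptionConfiguration = {
          EncryptionKeyProviderType = "AwsKms"
          AwsKmsKey                 = aws_kms_key.emr.arn
        }
      }
      InTransitEncryptionConfiguration = {
        TLSCertificateConfiguration = {
          CertificateProviderType = "PEM"
          S3Object                = "s3://company-emr-certs/certs.zip" # assumed bundle
        }
      }
    }
  })
}
```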
5) IAM: least privilege, separate roles
EMR is IAM-heavy. Split responsibilities:
- EMR Service Role: minimal EMR permissions
- EC2 Instance Profile Role: S3 access for data + logs only
- PassRole permissions locked down for pipelines
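The PassRole lockdown in particular is easy to get wrong. A hedged sketch (role names are assumptions): only the two EMR roles can be passed, and only to the services that actually need them.

```hcl
data "aws_iam_policy_document" "pass_emr_roles" {
  statement {
    sid     = "PassOnlyEmrRoles"
    actions = ["iam:PassRole"]

    # Only the two EMR roles defined elsewhere in this configuration
    resources = [
      aws_iam_role.emr_service.arn,
      aws_iam_role.emr_ec2.arn,
    ]

    # And only when passed to EMR or EC2 themselves
    condition {
      test     = "StringEquals"
      variable = "iam:PassedToService"
      values   = ["elasticmapreduce.amazonaws.com", "ec2.amazonaws.com"]
    }
  }
}
```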
6) Tag everything and enforce with policy
Tags are your billing, audit, and lifecycle system.
Minimum tags to enforce:
env,owner,cost_center,data_classification,retention
Use AWS Organizations SCPs or a CI policy check to enforce tagging.
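The AWS provider’s `default_tags` block gets you most of the way for free, since every taggable resource inherits these (the values here are placeholders):

```hcl
provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource the provider creates
  default_tags {
    tags = {
      env                 = "prod"
      owner               = "data-platform"
      cost_center         = "1234"
      data_classification = "internal"
      retention           = "90d"
    }
  }
}
```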
7) Use modules, not copy/paste
Build reusable modules for:
- VPC + subnets
- EMR cluster definition
- Logging + S3 buckets
- IAM roles + policies
Make modules boring and stable. Put “clever” logic in the pipeline, not in Terraform.
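Consuming those modules should read like a manifest, not a program. A sketch with hypothetical module paths and names:

```hcl
module "emr_batch" {
  source = "./modules/emr-cluster" # hypothetical internal module

  name            = "batch-etl"
  subnet_id       = module.vpc.private_subnet_ids[0]
  security_config = aws_emr_security_configuration.baseline.name
  log_uri         = "s3://company-emr-logs/batch-etl/"
}
```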
8) Make EMR ephemeral by default
For most workloads, EMR should be cluster-per-job, not “forever running.” This is similar to job clusters in Databricks that spin up, run the job, then terminate. This saves money in the long run by spending compute cycles on jobs only.
- Use `TERMINATE_AT_TASK_COMPLETION` for batch
- Keep state in S3, not HDFS
- Minimize snowflake configs that only exist on the cluster
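In the `aws_emr_cluster` resource this looks roughly like the following sketch (the release label and idle timeout are assumptions; instance groups and roles are omitted for brevity):

```hcl
resource "aws_emr_cluster" "batch" {
  name          = "batch-etl"
  release_label = "emr-7.1.0"
  applications  = ["Spark"]

  # Release nodes as their tasks finish rather than at the instance-hour
  scale_down_behavior = "TERMINATE_AT_TASK_COMPLETION"

  # Terminate the whole cluster after 15 idle minutes
  auto_termination_policy {
    idle_timeout = 900
  }

  # service_role, instance groups, subnet, and log_uri omitted for brevity
}
```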
9) Emit logs and metrics to S3 + CloudWatch
If you can’t debug without SSH, you’re not done.
- Enable EMR log URI in S3
- Stream YARN, Spark, and system logs to CloudWatch
- Collect bootstrap and step logs centrally
Ansible best practices (CaC)
1) Keep playbooks idempotent and versioned
Every run should converge to the same result.
- No ad-hoc shell scripts unless absolutely required
- Package install steps should be explicit and pinned
- Treat playbooks as versioned artifacts
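A minimal sketch of what “explicit and pinned” looks like in a task file (the package names and versions are assumptions for Amazon Linux):

```yaml
# Idempotent, pinned installs: every re-run converges to the same state
- name: Install pinned runtime packages
  ansible.builtin.yum:
    name:
      - java-17-amazon-corretto
      - collectd-5.12.0
    state: present
```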
2) Use Ansible roles per concern
Split by purpose:
- java_runtime
- spark_conf
- hadoop_conf
- monitoring_agents
- security_hardening
That keeps reviews focused and avoids “one monster playbook.”
3) Prefer templates for config files
Render configs explicitly so changes are reviewable.
```yaml
- name: Render spark-defaults.conf
  ansible.builtin.template:
    src: spark-defaults.conf.j2
    dest: /etc/spark/conf/spark-defaults.conf
    owner: root
    group: root
    mode: "0644"
```
4) Use Ansible for bootstrap or AMI bake, not SSHing in later
Two reliable patterns:
- EMR bootstrap actions that call Ansible locally
- Custom AMI built with Packer + Ansible
Avoid the “log in and tweak” trap. If it isn’t in code, it doesn’t exist.
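Wired into Terraform, the bootstrap pattern is a one-liner on the cluster resource. The script path and args below are assumptions; the script itself would install Ansible and run the playbook locally on each node.

```hcl
resource "aws_emr_cluster" "batch" {
  # ...cluster definition as before...

  bootstrap_action {
    name = "ansible-converge"
    path = "s3://company-emr-bootstrap/run-ansible.sh" # hypothetical script
    args = ["--checkout", "v1.4.2"]                    # pinned playbook tag
  }
}
```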
5) Store secrets out-of-band
Never put credentials in playbooks.
- Use AWS Secrets Manager or SSM Parameter Store
- Inject at runtime via environment variables or lookup plugins
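With the `amazon.aws` collection installed, a lookup plugin pulls the value at runtime so it never lands in the repo (the secret name is hypothetical):

```yaml
- name: Fetch the metastore password from Secrets Manager at runtime
  ansible.builtin.set_fact:
    metastore_password: "{{ lookup('amazon.aws.aws_secret', 'emr/prod/metastore') }}"
  no_log: true # keep the value out of Ansible's own output
```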
6) Validate configuration before rolling
Use ansible-lint and a CI pipeline that runs dry-run checks.
- Run with `--check` for safe previews
- Add `--diff` for readability in review
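A sketch of that CI gate in GitHub Actions syntax (the job, path, and inventory names are assumptions):

```yaml
lint-and-check:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: pip install ansible ansible-lint
    - run: ansible-lint playbooks/
    # Dry run against staging; --diff makes the review readable
    - run: ansible-playbook playbooks/site.yml --check --diff -i inventory/staging
```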
EMR security hardening checklist
If you want “production ready” to mean “auditable and boring,” these are the basics:
- Private subnets only for EMR
- No inbound SSH except via bastion or SSM Session Manager
- KMS encryption for EBS and S3
- S3 bucket policies locked to EMR instance role
- VPC endpoints for S3, CloudWatch, and STS where possible
- CloudTrail + GuardDuty on the account
- Centralized log retention (S3 lifecycle + CloudWatch retention)
A simple “golden path” workflow
1) Terraform provisions the VPC, IAM, S3 buckets, and EMR security config
2) Packer + Ansible builds a hardened AMI (or Ansible runs via bootstrap)
3) Terraform launches EMR clusters with that baseline
4) CI runs terraform plan and ansible-lint on every change
5) Production applies require review + approval
Final thought
You’re not just building a cluster, you’re building a repeatable, auditable system. Terraform defines the platform; Ansible makes the runtime consistent. The win is not just security. It’s predictability: every EMR environment looks the same, behaves the same, and can be torn down and rebuilt without drama.
In my next post, I will walk you through a reference repo I use to deploy my personal EMR clusters, using Terraform and Ansible.
Thank you for reading.
Cheers!
Jason