I get asked all the time: “How do I securely automate my EMR deployment so I only need to write the code once to create a repeatable, production-ready environment?” If you want EMR to feel boring in production (the best outcome), you need Infrastructure as Code (IaC) for the base platform and Configuration as Code (CaC) for the node-level runtime. Terraform and Ansible are the tools that answer this question. In this post I will walk you through a practical checklist for creating a secure, production-ready EMR environment, using Terraform for infrastructure and Ansible for configuration, that you can apply as a repeatable template when deploying your EMR pipelines.
The separation that keeps you sane
Use Terraform for cloud resources and relationships (VPC, subnets, IAM, S3, EMR security config). Use Ansible for OS/runtime configuration (packages, JVM tuning, custom libraries, logging agents).
If you blur the line, you get:
- Terraform trying to be a config manager (bad at it)
- Ansible trying to own cloud resources (possible, but slower and harder to reason about)
The result is a brittle platform. Keep the contract crisp.
Terraform best practices (IaC)
1) Lock state and never share a local state file
Use remote state + locking so two people don’t mutate prod simultaneously.
```hcl
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "emr/prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```
2) Pin provider + Terraform versions
Don’t let a random upgrade break prod on a Friday.
```hcl
terraform {
  required_version = "~> 1.7.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
```
3) Build a dedicated EMR VPC posture
Private subnets only for EMR nodes; use NAT for outbound. Put the EMR primary in a private subnet as well.
- No public IPs on core/task nodes
- Restrict inbound to corporate CIDRs or bastion/VPN
- Separate security groups for master/core/task
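As a sketch, a primary-node security group under those rules might look like this (resource names and the corporate CIDR are assumptions, and EMR attaches its own managed security groups on top of whatever you provide):

```hcl
# Locked-down EMR primary security group (names and CIDRs are assumptions)
resource "aws_security_group" "emr_primary" {
  name_prefix = "emr-primary-"
  vpc_id      = aws_vpc.emr.id

  # Inbound SSH only from the corporate VPN range
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.10.0.0/16"]
  }

  # Outbound egress flows through the NAT gateway
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```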
4) Use an EMR security configuration
Enforce encryption at rest and in transit as a baseline.
- EBS encryption via KMS
- S3 encryption (SSE-KMS) and bucket policies
- In-transit encryption for EMR
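One way to express that baseline is an `aws_emr_security_configuration` resource. The snippet below is a sketch: the KMS key, the certificate bundle location, and the exact JSON layout are assumptions you should verify against the EMR documentation for your release.

```hcl
resource "aws_emr_security_configuration" "baseline" {
  name = "emr-baseline"

  configuration = jsonencode({
    EncryptionConfiguration = {
      EnableAtRestEncryption    = true
      EnableInTransitEncryption = true
      AtRestEncryptionConfiguration = {
        S3EncryptionConfiguration = {
          EncryptionMode = "SSE-KMS"
          AwsKmsKey      = aws_kms_key.emr.arn # assumed key resource
        }
        LocalDiskEncryptionConfiguration = {
          EncryptionKeyProviderType = "AwsKms"
          AwsKmsKey                 = aws_kms_key.emr.arn
        }
      }
      InTransitEncryptionConfiguration = {
        TLSCertificateConfiguration = {
          CertificateProviderType = "PEM"
          S3Object                = "s3://company-emr-certs/certs.zip" # assumed bundle
        }
      }
    }
  })
}
```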
5) IAM: least privilege, separate roles
EMR is IAM-heavy. Split responsibilities:
- EMR Service Role: minimal EMR permissions
- EC2 Instance Profile Role: S3 access for data + logs only
- PassRole permissions locked down for pipelines
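The PassRole lockdown in particular is easy to get wrong. A hedged sketch (role names are assumptions): only the two EMR roles can be passed, and only to the services that actually need them.

```hcl
data "aws_iam_policy_document" "pass_emr_roles" {
  statement {
    sid     = "PassOnlyEmrRoles"
    actions = ["iam:PassRole"]

    # Only the two EMR roles defined elsewhere in this configuration
    resources = [
      aws_iam_role.emr_service.arn,
      aws_iam_role.emr_ec2.arn,
    ]

    # And only when passed to EMR or EC2 themselves
    condition {
      test     = "StringEquals"
      variable = "iam:PassedToService"
      values   = ["elasticmapreduce.amazonaws.com", "ec2.amazonaws.com"]
    }
  }
}
```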
6) Tag everything and enforce with policy
Tags are your billing, audit, and lifecycle system.
Minimum tags to enforce:
env,owner,cost_center,data_classification,retention
Use AWS Organizations SCPs or a CI policy check to enforce tagging.
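The AWS provider’s `default_tags` block gets you most of the way for free, since every taggable resource inherits these (the values here are placeholders):

```hcl
provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource the provider creates
  default_tags {
    tags = {
      env                 = "prod"
      owner               = "data-platform"
      cost_center         = "1234"
      data_classification = "internal"
      retention           = "90d"
    }
  }
}
```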
7) Use modules, not copy/paste
Build reusable modules for:
- VPC + subnets
- EMR cluster definition
- Logging + S3 buckets
- IAM roles + policies
Make modules boring and stable. Put “clever” logic in the pipeline, not in Terraform.
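Consuming those modules should read like a manifest, not a program. A sketch with hypothetical module paths and names:

```hcl
module "emr_batch" {
  source = "./modules/emr-cluster" # hypothetical internal module

  name            = "batch-etl"
  subnet_id       = module.vpc.private_subnet_ids[0]
  security_config = aws_emr_security_configuration.baseline.name
  log_uri         = "s3://company-emr-logs/batch-etl/"
}
```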
8) Make EMR ephemeral by default
For most workloads, EMR should be cluster-per-job, not “forever running.” This is similar to job clusters in Databricks that spin up, run the job, then terminate. This saves money in the long run by spending compute cycles on jobs only.
- Use `TERMINATE_AT_TASK_COMPLETION` for batch
- Keep state in S3, not HDFS
- Minimize snowflake configs that only exist on the cluster
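In the `aws_emr_cluster` resource this looks roughly like the following sketch (the release label and idle timeout are assumptions; instance groups and roles are omitted for brevity):

```hcl
resource "aws_emr_cluster" "batch" {
  name          = "batch-etl"
  release_label = "emr-7.1.0"
  applications  = ["Spark"]

  # Release nodes as their tasks finish rather than at the instance-hour
  scale_down_behavior = "TERMINATE_AT_TASK_COMPLETION"

  # Terminate the whole cluster after 15 idle minutes
  auto_termination_policy {
    idle_timeout = 900
  }

  # service_role, instance groups, subnet, and log_uri omitted for brevity
}
```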
9) Emit logs and metrics to S3 + CloudWatch
If you can’t debug without SSH, you’re not done.
- Enable EMR log URI in S3
- Stream YARN, Spark, and system logs to CloudWatch
- Collect bootstrap and step logs centrally
Ansible best practices (CaC)
1) Keep playbooks idempotent and versioned
Every run should converge to the same result.
- No ad-hoc shell scripts unless absolutely required
- Package install steps should be explicit and pinned
- Treat playbooks as versioned artifacts
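A minimal sketch of what “explicit and pinned” looks like in a task file (the package names and versions are assumptions for Amazon Linux):

```yaml
# Idempotent, pinned installs: every re-run converges to the same state
- name: Install pinned runtime packages
  ansible.builtin.yum:
    name:
      - java-17-amazon-corretto
      - collectd-5.12.0
    state: present
```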
2) Use Ansible roles per concern
Split by purpose:
- java_runtime
- spark_conf
- hadoop_conf
- monitoring_agents
- security_hardening
That keeps reviews focused and avoids “one monster playbook.”
3) Prefer templates for config files
Render configs explicitly so changes are reviewable.
```yaml
- name: Render spark-defaults.conf
  ansible.builtin.template:
    src: spark-defaults.conf.j2
    dest: /etc/spark/conf/spark-defaults.conf
    owner: root
    group: root
    mode: "0644"
```
4) Use Ansible for bootstrap or AMI bake, not SSHing in later
Two reliable patterns:
- EMR bootstrap actions that call Ansible locally
- Custom AMI built with Packer + Ansible
Avoid the “log in and tweak” trap. If it isn’t in code, it doesn’t exist.
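Wired into Terraform, the bootstrap pattern is a one-liner on the cluster resource. The script path and args below are assumptions; the script itself would install Ansible and run the playbook locally on each node.

```hcl
resource "aws_emr_cluster" "batch" {
  # ...cluster definition as before...

  bootstrap_action {
    name = "ansible-converge"
    path = "s3://company-emr-bootstrap/run-ansible.sh" # hypothetical script
    args = ["--checkout", "v1.4.2"]                    # pinned playbook tag
  }
}
```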
5) Store secrets out-of-band
Never put credentials in playbooks.
- Use AWS Secrets Manager or SSM Parameter Store
- Inject at runtime via environment variables or lookup plugins
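With the `amazon.aws` collection installed, a lookup plugin pulls the value at runtime so it never lands in the repo (the secret name is hypothetical):

```yaml
- name: Fetch the metastore password from Secrets Manager at runtime
  ansible.builtin.set_fact:
    metastore_password: "{{ lookup('amazon.aws.aws_secret', 'emr/prod/metastore') }}"
  no_log: true # keep the value out of Ansible's own output
```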
6) Validate configuration before rolling
Use ansible-lint and a CI pipeline that runs dry-run checks.
- Run with `--check` for safe previews
- Add `--diff` for readability in review
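A sketch of that CI gate in GitHub Actions syntax (the job, path, and inventory names are assumptions):

```yaml
lint-and-check:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: pip install ansible ansible-lint
    - run: ansible-lint playbooks/
    # Dry run against staging; --diff makes the review readable
    - run: ansible-playbook playbooks/site.yml --check --diff -i inventory/staging
```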
EMR security hardening checklist
If you want “production ready” to mean “auditable and boring,” these are the basics:
- Private subnets only for EMR
- No inbound SSH except via bastion or SSM Session Manager
- KMS encryption for EBS and S3
- S3 bucket policies locked to EMR instance role
- VPC endpoints for S3, CloudWatch, and STS where possible
- CloudTrail + GuardDuty on the account
- Centralized log retention (S3 lifecycle + CloudWatch retention)
A simple “golden path” workflow
1) Terraform provisions the VPC, IAM, S3 buckets, and EMR security config
2) Packer + Ansible builds a hardened AMI (or Ansible runs via bootstrap)
3) Terraform launches EMR clusters with that baseline
4) CI runs terraform plan and ansible-lint on every change
5) Production applies require review + approval
Final thought
You’re not just building a cluster, you’re building a repeatable, auditable system. Terraform defines the platform; Ansible makes the runtime consistent. The win is not just security. It’s predictability: every EMR environment looks the same, behaves the same, and can be torn down and rebuilt without drama.
In my next post, I will walk you through a reference repo I use to deploy my personal EMR clusters, using Terraform and Ansible.
Thank you for reading.
Cheers!
Jason