I get asked all the time: “How do I securely automate my EMR deployment so I only need to write the code once to create a repeatable, production-ready environment?” If you want EMR to feel boring in production (the best outcome), you need Infrastructure as Code (IaC) for the base platform and Configuration as Code (CaC) for the node-level runtime. Terraform and Ansible are the tools I use to answer this question. In this post I will walk you through a practical checklist for building a secure, production-ready EMR environment, using Terraform for infrastructure and Ansible for configuration, that you can apply as a repeatable template to deploy your EMR pipelines.


The separation that keeps you sane

Use Terraform for cloud resources and relationships (VPC, subnets, IAM, S3, EMR security config). Use Ansible for OS/runtime configuration (packages, JVM tuning, custom libraries, logging agents).

If you blur the line, you get:

  • Terraform trying to be a config manager (bad at it)
  • Ansible trying to own cloud resources (possible, but slower and harder to reason about)

The result is a brittle platform. Keep the contract crisp.


Terraform best practices (IaC)

1) Lock state and never share a local state file

Use remote state + locking so two people don’t mutate prod simultaneously.

terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "emr/prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
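
The backend block assumes the bucket and lock table already exist. Here is a minimal sketch of those two resources, which I would keep in a separate bootstrap configuration. The resource labels are illustrative and the names simply mirror the backend settings. The S3 backend expects the lock table to have a string hash key named LockID.

```hcl
# State bucket; the name must match the backend "bucket" setting.
resource "aws_s3_bucket" "tf_state" {
  bucket = "company-terraform-state"
}

# Versioning lets you roll back to an earlier state file after a bad apply.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table; the S3 backend requires a string hash key called "LockID".
resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```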

2) Pin provider + Terraform versions

Don’t let a random upgrade break prod on a Friday.

terraform {
  required_version = "~> 1.7.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

3) Build a dedicated EMR VPC posture

Private subnets only for EMR nodes; use NAT for outbound. Put the EMR primary in a private subnet as well.

  • No public IPs on core/task nodes
  • Restrict inbound to corporate CIDRs or bastion/VPN
  • Separate security groups for primary/core/task
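
As a rough sketch of that posture, the snippet below defines a private subnet with no public IPs and two security groups, one for the primary node and one shared by core and task nodes. The variable names, CIDR range, and bastion group are assumptions for illustration. EMR adds its own rules to the groups you hand it, which is why revoke_rules_on_delete is set.

```hcl
variable "vpc_id" {
  type = string
}

# Hypothetical: security group of your bastion or VPN endpoint.
variable "bastion_sg_id" {
  type = string
}

resource "aws_subnet" "emr_private" {
  vpc_id                  = var.vpc_id
  cidr_block              = "10.20.1.0/24" # assumed range
  map_public_ip_on_launch = false          # no public IPs on EMR nodes
}

resource "aws_security_group" "emr_primary" {
  name_prefix            = "emr-primary-"
  vpc_id                 = var.vpc_id
  revoke_rules_on_delete = true # EMR-added rules would otherwise block deletion
}

resource "aws_security_group" "emr_core_task" {
  name_prefix            = "emr-core-task-"
  vpc_id                 = var.vpc_id
  revoke_rules_on_delete = true
}

# The only inbound rule defined here: SSH to the primary from the bastion group.
resource "aws_vpc_security_group_ingress_rule" "primary_ssh_from_bastion" {
  security_group_id            = aws_security_group.emr_primary.id
  referenced_security_group_id = var.bastion_sg_id
  ip_protocol                  = "tcp"
  from_port                    = 22
  to_port                      = 22
}
```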

4) Use an EMR security configuration

Enforce encryption at rest and in transit as a baseline.

  • EBS encryption via KMS
  • S3 encryption (SSE-KMS) and bucket policies
  • In-transit encryption for EMR
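
Here is a sketch of what that baseline can look like as a Terraform resource. The KMS key, the configuration name, and the certificate bundle path are placeholders I made up for illustration; the JSON keys follow the EMR security configuration format.

```hcl
resource "aws_kms_key" "emr" {
  description         = "EMR at-rest encryption (illustrative)"
  enable_key_rotation = true
}

resource "aws_emr_security_configuration" "baseline" {
  name = "emr-prod-baseline"

  configuration = jsonencode({
    EncryptionConfiguration = {
      EnableAtRestEncryption    = true
      EnableInTransitEncryption = true

      AtRestEncryptionConfiguration = {
        # EMRFS data written to S3 is encrypted with SSE-KMS
        S3EncryptionConfiguration = {
          EncryptionMode = "SSE-KMS"
          AwsKmsKey      = aws_kms_key.emr.arn
        }
        # Local disks and EBS volumes on the nodes
        LocalDiskEncryptionConfiguration = {
          EncryptionKeyProviderType = "AwsKms"
          AwsKmsKey                 = aws_kms_key.emr.arn
          EnableEbsEncryption       = true
        }
      }

      InTransitEncryptionConfiguration = {
        TLSCertificateConfiguration = {
          CertificateProviderType = "PEM"
          # Assumed location of a zipped PEM bundle you provide
          S3Object = "s3://company-emr-artifacts/certs/emr-certs.zip"
        }
      }
    }
  })
}
```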

5) IAM: least privilege, separate roles

EMR is IAM-heavy. Split responsibilities:

  • EMR Service Role: minimal EMR permissions
  • EC2 Instance Profile Role: S3 access for data + logs only
  • PassRole permissions locked down for pipelines
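
The split might be sketched like this. Role names and bucket names are assumptions, and the service role's own permissions policy is left out for brevity; the point is the separate trust policies, the narrowly scoped S3 access, and a PassRole statement limited to these two roles.

```hcl
# Trust policy for the EMR service role
data "aws_iam_policy_document" "emr_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["elasticmapreduce.amazonaws.com"]
    }
  }
}

# Trust policy for the EC2 instances in the cluster
data "aws_iam_policy_document" "ec2_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "emr_service" {
  name               = "emr-prod-service"
  assume_role_policy = data.aws_iam_policy_document.emr_assume.json
}

resource "aws_iam_role" "emr_ec2" {
  name               = "emr-prod-ec2"
  assume_role_policy = data.aws_iam_policy_document.ec2_assume.json
}

resource "aws_iam_instance_profile" "emr_ec2" {
  name = "emr-prod-ec2"
  role = aws_iam_role.emr_ec2.name
}

# Instance role: data and log buckets only (bucket names are assumed)
data "aws_iam_policy_document" "emr_ec2_s3" {
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::company-emr-data", "arn:aws:s3:::company-emr-logs"]
  }
  statement {
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::company-emr-data/*", "arn:aws:s3:::company-emr-logs/*"]
  }
}

resource "aws_iam_role_policy" "emr_ec2_s3" {
  name   = "emr-data-and-logs"
  role   = aws_iam_role.emr_ec2.id
  policy = data.aws_iam_policy_document.emr_ec2_s3.json
}

# Attach this document to the pipeline's role so it can pass only these two roles
data "aws_iam_policy_document" "pipeline_passrole" {
  statement {
    actions   = ["iam:PassRole"]
    resources = [aws_iam_role.emr_service.arn, aws_iam_role.emr_ec2.arn]
    condition {
      test     = "StringEquals"
      variable = "iam:PassedToService"
      values   = ["elasticmapreduce.amazonaws.com", "ec2.amazonaws.com"]
    }
  }
}
```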

6) Tag everything and enforce with policy

Tags are your billing, audit, and lifecycle system.

Minimum tags to enforce:

  • env, owner, cost_center, data_classification, retention

Use AWS Organizations SCPs or a CI policy check to enforce tagging.
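
On the Terraform side, one way to make the minimum tags hard to forget is a validated variable fed into the provider's default_tags block. This is only a sketch: the variable name is mine, and it complements an SCP or CI check rather than replacing one.

```hcl
variable "required_tags" {
  type        = map(string)
  description = "Tags applied to every resource created by this provider"

  # Fail the plan early if any of the minimum tags is missing.
  validation {
    condition = alltrue([
      for k in ["env", "owner", "cost_center", "data_classification", "retention"] :
      contains(keys(var.required_tags), k)
    ])
    error_message = "required_tags must include env, owner, cost_center, data_classification, and retention."
  }
}

provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = var.required_tags
  }
}
```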

7) Use modules, not copy/paste

Build reusable modules for:

  • VPC + subnets
  • EMR cluster definition
  • Logging + S3 buckets
  • IAM roles + policies

Make modules boring and stable. Put “clever” logic in the pipeline, not in Terraform.
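
A root configuration built this way ends up as little more than wiring. The module paths, inputs, and outputs below are hypothetical; they only show the shape, not a published module.

```hcl
# All module sources, inputs, and outputs here are illustrative.
module "network" {
  source   = "./modules/emr-network"
  vpc_cidr = "10.20.0.0/16"
}

module "iam" {
  source      = "./modules/emr-iam"
  data_bucket = "company-emr-data"
  log_bucket  = "company-emr-logs"
}

module "emr_cluster" {
  source = "./modules/emr-cluster"

  name                 = "nightly-batch"
  subnet_id            = module.network.private_subnet_id
  service_role_arn     = module.iam.service_role_arn
  instance_profile_arn = module.iam.instance_profile_arn
}
```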

8) Make EMR ephemeral by default

For most workloads, EMR should be cluster-per-job, not “forever running.” This is similar to job clusters in Databricks, which spin up, run the job, then terminate. It saves money in the long run because you only pay for compute while jobs are actually running.

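  • For batch, set keep_job_flow_alive_when_no_steps = false so the cluster terminates when its steps finish (TERMINATE_AT_TASK_COMPLETION is a scale-down behavior that controls how nodes are removed during a resize, not a cluster shutdown setting)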
  • Keep state in S3, not HDFS
  • Minimize snowflake configs that only exist on the cluster
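
A cluster-per-job definition could look roughly like the sketch below. The release label, instance types, bucket paths, and variable names are assumptions; the important parts are keep_job_flow_alive_when_no_steps = false and a step that fails the whole cluster rather than leaving it idle.

```hcl
variable "emr_service_role_arn" {
  type = string
}

variable "instance_profile_arn" {
  type = string
}

variable "private_subnet_id" {
  type = string
}

resource "aws_emr_cluster" "batch" {
  name          = "nightly-batch"
  release_label = "emr-7.1.0" # assumed release
  applications  = ["Spark"]
  service_role  = var.emr_service_role_arn
  log_uri       = "s3://company-emr-logs/nightly-batch/"

  # Shut the cluster down once all steps have completed
  keep_job_flow_alive_when_no_steps = false
  termination_protection            = false

  ec2_attributes {
    subnet_id        = var.private_subnet_id
    instance_profile = var.instance_profile_arn
  }

  master_instance_group {
    instance_type = "m5.xlarge"
  }

  core_instance_group {
    instance_type  = "m5.xlarge"
    instance_count = 2
  }

  step {
    name              = "nightly-spark-job"
    action_on_failure = "TERMINATE_CLUSTER"

    hadoop_jar_step {
      jar  = "command-runner.jar"
      args = ["spark-submit", "--deploy-mode", "cluster", "s3://company-emr-artifacts/jobs/nightly.py"]
    }
  }
}
```

Inputs and outputs live in S3, so nothing of value is lost when the cluster goes away.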

9) Emit logs and metrics to S3 + CloudWatch

If you can’t debug without SSH, you’re not done.

  • Enable EMR log URI in S3
  • Stream YARN, Spark, and system logs to CloudWatch
  • Collect bootstrap and step logs centrally
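
The storage side of this can be sketched in Terraform as a log bucket with a lifecycle rule plus a CloudWatch log group with a retention period. The bucket name, log group name, and retention values are assumptions. Shipping YARN and Spark logs into the log group still needs an agent on the nodes, which is a job for the Ansible side.

```hcl
resource "aws_s3_bucket" "emr_logs" {
  bucket = "company-emr-logs" # target of the cluster's log_uri
}

resource "aws_s3_bucket_lifecycle_configuration" "emr_logs" {
  bucket = aws_s3_bucket.emr_logs.id

  rule {
    id     = "expire-old-emr-logs"
    status = "Enabled"

    filter {} # apply to every object in the bucket

    expiration {
      days = 90 # assumed retention
    }
  }
}

resource "aws_cloudwatch_log_group" "emr_yarn" {
  name              = "/emr/prod/yarn"
  retention_in_days = 30 # assumed retention
}
```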

Ansible best practices (CaC)

1) Keep playbooks idempotent and versioned

Every run should converge to the same result.

  • No ad-hoc shell scripts unless absolutely required
  • Package install steps should be explicit and pinned
  • Treat playbooks as versioned artifacts

2) Use Ansible roles per concern

Split by purpose:

  • java_runtime
  • spark_conf
  • hadoop_conf
  • monitoring_agents
  • security_hardening

That keeps reviews focused and avoids “one monster playbook.”

3) Prefer templates for config files

Render configs explicitly so changes are reviewable.

- name: Render spark-defaults.conf
  template:
    src: spark-defaults.conf.j2
    dest: /etc/spark/conf/spark-defaults.conf
    owner: root
    group: root
    mode: "0644"

4) Use Ansible for bootstrap or AMI bake, not SSHing in later

Two reliable patterns:

  • EMR bootstrap actions that call Ansible locally
  • Custom AMI built with Packer + Ansible

Avoid the “log in and tweak” trap. If it isn’t in code, it doesn’t exist.

5) Store secrets out-of-band

Never put credentials in playbooks.

  • Use AWS Secrets Manager or SSM Parameter Store
  • Inject at runtime via environment variables or lookup plugins
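
For a runtime lookup to work, the instance role needs read access to the secrets and nothing else. A minimal Terraform sketch, assuming the secrets live under an /emr/prod/ path in SSM Parameter Store and that the variable names below are yours to define:

```hcl
variable "region" {
  type = string
}

variable "account_id" {
  type = string
}

# Hypothetical: name of the EC2 instance role used by the cluster nodes.
variable "emr_ec2_role_name" {
  type = string
}

data "aws_iam_policy_document" "read_emr_params" {
  statement {
    actions = ["ssm:GetParameter", "ssm:GetParametersByPath"]
    resources = [
      "arn:aws:ssm:${var.region}:${var.account_id}:parameter/emr/prod/*"
    ]
  }
}

resource "aws_iam_role_policy" "read_emr_params" {
  name   = "read-emr-parameters"
  role   = var.emr_ec2_role_name
  policy = data.aws_iam_policy_document.read_emr_params.json
}
```

Parameters encrypted with a customer-managed KMS key also need kms:Decrypt on that key. The secret values themselves stay out of Terraform so they never land in the state file.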

6) Validate configuration before rolling

Use ansible-lint and a CI pipeline that runs dry-run checks.

  • --check for safe previews
  • --diff for readability in review

EMR security hardening checklist

If you want “production ready” to mean “auditable and boring,” these are the basics:

  • Private subnets only for EMR
  • No inbound SSH except via bastion or SSM Session Manager
  • KMS encryption for EBS and S3
  • S3 bucket policies locked to EMR instance role
  • VPC endpoints for S3, CloudWatch, and STS where possible
  • CloudTrail + GuardDuty on the account
  • Centralized log retention (S3 lifecycle + CloudWatch retention)
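
The VPC endpoint item is easy to express in Terraform. In this sketch the variable names are assumptions; S3 uses a gateway endpoint attached to the private route tables, while CloudWatch Logs, CloudWatch metrics, and STS use interface endpoints.

```hcl
variable "region" {
  type = string
}

variable "vpc_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

variable "private_route_table_ids" {
  type = list(string)
}

# Hypothetical: a security group allowing HTTPS from the EMR subnets.
variable "endpoint_sg_id" {
  type = string
}

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

resource "aws_vpc_endpoint" "interface" {
  for_each = toset(["logs", "monitoring", "sts"])

  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [var.endpoint_sg_id]
  private_dns_enabled = true
}
```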

A simple “golden path” workflow

1) Terraform provisions the VPC, IAM, S3 buckets, and EMR security config
2) Packer + Ansible builds a hardened AMI (or Ansible runs via bootstrap)
3) Terraform launches EMR clusters with that baseline
4) CI runs terraform plan and ansible-lint on every change
5) Production applies require review + approval


Final thought

You’re not just building a cluster, you’re building a repeatable, auditable system. Terraform defines the platform; Ansible makes the runtime consistent. The win is not just security. It’s predictability: every EMR environment looks the same, behaves the same, and can be torn down and rebuilt without drama.

In my next post, I will walk you through a reference repo I use to deploy my personal EMR clusters, using Terraform and Ansible.

Thank you for reading.

Cheers!

Jason

Jason Rich

