
In my previous post I covered Terraform + Ansible best practices for a secure, production-ready EMR environment. This post is the practical follow-up: a walkthrough of the reference repo I use to deploy my personal EMR clusters.

The example repo for this post is:

GitHub repo: https://github.com/jrich8573/emr-deploy-iac

The goal of the repo is intentionally narrow and boring:

  • Terraform provisions the EMR cluster (and optionally a VPC).
  • Terraform outputs the EMR primary node hostname.
  • Ansible targets that node and performs post-provisioning configuration using roles.

If you want to copy the approach, you can fork the repo structure and replace the “app” role with your own Spark job runner, notebook tooling, observability agent, or whatever you install on your personal clusters.


Repo structure (what lives where)

At the top level, there are only two working directories:

  • terraform/: everything that creates AWS resources
  • ansible/: everything that configures the EMR node after it exists

Here’s the important structure:

emr-deploy-iac/
  terraform/
    backend.tf
    versions.tf
    variables.tf
    terraform.tfvars.example
    main.tf
    outputs.tf
  ansible/
    ansible.cfg
    inventory.ini
    playbooks/site.yml
    roles/
      bootstrap/...
      app/...

This separation is deliberate. Terraform does “cloud wiring”. Ansible does “machine configuration”.


Terraform walkthrough: what it builds

The main entrypoint is terraform/main.tf. There are three big ideas inside:

1) Optional network creation (create_vpc)

The repo supports two modes:

  • Bring your own network (create_vpc = false): you provide vpc_id and subnet_id.
  • Create a VPC (create_vpc = true): Terraform creates a VPC, public subnets, private subnets, and NAT so EMR can egress.

The “guardrail” is explicit:

  • When create_vpc = false, variables.tf expects you to provide vpc_id and subnet_id.
  • main.tf enforces that with a lifecycle.precondition so you fail fast if you forget.
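
In shape, that guardrail looks something like this (a sketch, not necessarily the repo's exact expression):

resource "aws_emr_cluster" "this" {
  # ...cluster config...

  lifecycle {
    precondition {
      condition     = var.create_vpc || (var.vpc_id != "" && var.subnet_id != "")
      error_message = "create_vpc is false, so vpc_id and subnet_id must be set."
    }
  }
}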

Practical note for personal clusters:

  • If you want to SSH from your laptop directly, make sure your EMR primary node is actually reachable (public subnet + routing, or a bastion/SSM if private). The repo supports private subnets (recommended), but private networking means you need an access pattern that matches it.

2) SSH key pair + master security group

Terraform creates an EC2 key pair:

  • aws_key_pair.emr_key reads public_key_path and creates/uses key_pair_name.

And it creates a dedicated EMR master SG with SSH ingress:

  • aws_security_group.emr_master uses allowed_ssh_cidrs

Important: the default in variables.tf is permissive (0.0.0.0/0). That’s fine for a quick personal experiment, but if you keep clusters around for more than a coffee break, lock it down.
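
Together, those two resources look roughly like this (a sketch; I'm assuming the repo resolves the VPC ID through a local, which may differ):

resource "aws_key_pair" "emr_key" {
  key_name   = var.key_pair_name
  public_key = file(var.public_key_path)
}

resource "aws_security_group" "emr_master" {
  name_prefix = "${var.name_prefix}-emr-master-"
  vpc_id      = local.vpc_id # assumption: a local that picks the created or provided VPC

  ingress {
    description = "SSH"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = var.allowed_ssh_cidrs
  }
}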

3) EMR cluster resource

The cluster itself is aws_emr_cluster.this. The “personal cluster” defaults are straightforward:

  • EMR release: emr-6.15.0
  • Applications: Hadoop, Spark
  • Instance types: m5.xlarge master and core
  • Core nodes: 2

The repo expects you to supply IAM roles:

  • emr_service_role_arn (service role, e.g. EMR_DefaultRole)
  • emr_instance_profile_arn (instance profile, e.g. EMR_EC2_DefaultRole)
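
Putting those pieces together, the cluster resource is shaped roughly like this (the instance-setting variable names are my assumptions, not necessarily the repo's):

resource "aws_emr_cluster" "this" {
  name          = "${var.name_prefix}-emr"
  release_label = var.emr_release_label # default: emr-6.15.0
  applications  = ["Hadoop", "Spark"]
  service_role  = var.emr_service_role_arn

  ec2_attributes {
    subnet_id                         = local.subnet_id # assumption, as above
    key_name                          = aws_key_pair.emr_key.key_name
    instance_profile                  = var.emr_instance_profile_arn
    emr_managed_master_security_group = aws_security_group.emr_master.id
    # EMR also expects a managed SG for core/task nodes; elided here
  }

  master_instance_group {
    instance_type = var.master_instance_type # default: m5.xlarge
  }

  core_instance_group {
    instance_type  = var.core_instance_type # default: m5.xlarge
    instance_count = var.core_instance_count # default: 2
  }
}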

Terraform: what you edit first (terraform.tfvars)

The repo provides terraform/terraform.tfvars.example. The “minimum viable” values look like this:

name_prefix              = "personal"
create_vpc               = false
vpc_id                   = "vpc-xxxxxxxx"
subnet_id                = "subnet-xxxxxxxx"
key_pair_name            = "personal-emr"
public_key_path          = "~/.ssh/id_rsa.pub"
emr_service_role_arn     = "arn:aws:iam::123456789012:role/EMR_DefaultRole"
emr_instance_profile_arn = "arn:aws:iam::123456789012:instance-profile/EMR_EC2_DefaultRole"
allowed_ssh_cidrs        = ["0.0.0.0/0"]

For a personal cluster, the variables I touch the most are:

  • allowed_ssh_cidrs (lock to your home IP/CIDR)
  • emr_release_label (when I intentionally upgrade)
  • core_instance_count (when I want more parallelism)
  • core_instance_type (when I want cheaper/faster)
  • tags (so I can find + delete things easily)

Running Terraform (the cluster comes first)

From the repo root:

cd terraform
terraform init
terraform apply

Key outputs are defined in terraform/outputs.tf, especially:

  • emr_cluster_id
  • emr_master_public_dns

You can pull the primary node hostname like this:

terraform output -raw emr_master_public_dns
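
If you want to wire that output straight into the Ansible inventory, a quick shell sketch (assuming you're happy to overwrite inventory.ini):

MASTER_DNS=$(terraform output -raw emr_master_public_dns)
printf '[emr_master]\n%s ansible_user=hadoop\n' "$MASTER_DNS" > ../ansible/inventory.ini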

Remote state is intentionally not “on by default” in this repo. terraform/backend.tf contains a commented S3 backend block you can enable when you’re ready.

For personal clusters, I still prefer remote state if:

  • I might run the repo from more than one laptop, or want to bring others in to help with the work
  • I want locking so I don’t accidentally double-apply, which could lead to:
  1. State corruption / state drift: both applies read the same “old” state, then each writes updates. The later writer can overwrite parts of state from the earlier run, leaving Terraform’s state inconsistent with what actually exists in AWS.
  2. Race conditions creating resources: both applies may attempt to create/update the same resources; you’ll see flaky errors like “already exists”, “resource in use”, or intermittent failures.
  3. Duplicate or conflicting infrastructure: depending on how resources are named, you can accidentally create duplicates (or partially create them), then spend time cleaning up.
  4. Harder recovery: once the state is wrong, fixes often require careful imports/state surgery, or destroying/rebuilding.
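
When you're ready to enable it, the uncommented backend block looks roughly like this (bucket, key, and table names are placeholders you choose):

terraform {
  backend "s3" {
    bucket         = "your-tf-state-bucket" # placeholder
    key            = "emr-deploy-iac/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # placeholder; this is what gives you locking
    encrypt        = true
  }
}

After uncommenting, re-run terraform init (with -migrate-state if you already have local state to move).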

Ansible walkthrough: how post-provisioning works

Once the cluster exists, Ansible is responsible for “turning it into my cluster”.

Inventory targeting

ansible/inventory.ini has a single group:

[emr_master]
emr-master.example.com ansible_user=hadoop

You replace emr-master.example.com with the Terraform output and keep ansible_user=hadoop (EMR default SSH user).

Playbook entrypoint

ansible/playbooks/site.yml is intentionally short:

  • hosts: emr_master
  • roles:
    • bootstrap
    • app
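
As YAML, that's approximately this (become: true is my assumption, since the roles install packages):

- name: Configure the EMR primary node
  hosts: emr_master
  become: true # assumption: package installs need root
  roles:
    - bootstrap
    - app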

This makes it easy to reason about what happens on the cluster: baseline first, then your “app layer”.

Bootstrap role (baseline “quality of life”)

The bootstrap role does a few pragmatic things:

  • Installs useful packages (git, tmux, jq, htop, python3-pip)
  • Creates standard directories (/opt/emr, /var/log/emr-deploy)
  • Installs common Python tooling (boto3, awscli)
  • Drops a profile helper script at /etc/profile.d/emr-bootstrap.sh

It also writes a marker file (/etc/emr-deploy-iac) so you can quickly confirm the node was configured.
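
Abridged, the tasks look something like this (module choices and task names are mine, not necessarily the repo's):

# roles/bootstrap/tasks/main.yml (abridged sketch)
- name: Install quality-of-life packages
  ansible.builtin.package:
    name: [git, tmux, jq, htop, python3-pip]
    state: present

- name: Create standard directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    mode: "0755"
  loop:
    - /opt/emr
    - /var/log/emr-deploy

- name: Write marker file
  ansible.builtin.copy:
    dest: /etc/emr-deploy-iac
    content: "configured by emr-deploy-iac\n"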

App role (your personalization hook)

The app role is the “extension point”:

  • Creates an app directory and config directory under /opt/emr/apps/...
  • Renders an environment file (from templates/app.env.j2)
  • Optionally clones a repo (if app_repo_url is set)
  • Optionally installs a systemd unit (if app_command is set)

In other words:

  • Set app_repo_url and it will pull your code onto the EMR primary node.
  • Set app_command and it becomes a managed service.
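
The conditional clone, for example, is roughly this shape (the when: guard and the app_name variable are my assumptions):

- name: Clone the app repo
  ansible.builtin.git:
    repo: "{{ app_repo_url }}"
    dest: "/opt/emr/apps/{{ app_name | default('app') }}" # app_name is an assumption
    version: "{{ app_repo_version | default('main') }}"
  when: app_repo_url is defined and (app_repo_url | length > 0)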

For personal clusters, this pattern is a clean way to install:

  • a lightweight “job runner” wrapper
  • helpers for submitting Spark steps
  • a tiny internal API that triggers EMR steps

Running Ansible (after Terraform)

From the repo root:

cd ansible
ansible-playbook -i inventory.ini playbooks/site.yml

To customize without editing role defaults, pass overrides at runtime. Example:

ansible-playbook -i inventory.ini playbooks/site.yml \
  -e app_repo_url="https://github.com/yourname/your-emr-tools.git" \
  -e app_repo_version="main" \
  -e app_env='{"APP_ENV":"personal","APP_PLACEHOLDER":"set-me"}'

The “personal cluster” lifecycle (how I actually use this)

For personal work, my loop is:

  1. terraform apply (cluster up)
  2. ansible-playbook ... (bootstrap + my tools)
  3. Run experiments / jobs
  4. terraform destroy (cluster down)

The repo is designed to make steps 1–2 boring and repeatable so you can spend time on the work that matters (Spark code, data layout, tuning) instead of redoing setup.


Final thought

The best personal platform is the one you can recreate from scratch without thinking.

Terraform gets you a consistent EMR “base”. Ansible gets you a consistent “you-shaped” runtime. Once you have both, your clusters become disposable, and disposable clusters are the path to being fast and safe.

Thank you for reading.

Cheers!

Jason
