In my previous post I covered Terraform + Ansible best practices for a secure, production-ready EMR environment. In this post, I’m going to do the practical follow-up: a walkthrough of the reference repo I use to deploy my personal EMR clusters.
The example repo for this post is:
GitHub repo: https://github.com/jrich8573/emr-deploy-iac
The goal of the repo is intentionally narrow and boring:
- Terraform provisions the EMR cluster (and optionally a VPC).
- Terraform outputs the EMR primary node hostname.
- Ansible targets that node and performs post-provisioning configuration using roles.
If you want to copy the approach, you can fork the repo structure and replace the “app” role with your own Spark job runner, notebook tooling, observability agent, or whatever you install on your personal clusters.
Repo structure (what lives where)
At the top level, there are only two working directories:
- terraform/: everything that creates AWS resources
- ansible/: everything that configures the EMR node after it exists
Here’s the important structure:
emr-deploy-iac/
  terraform/
    backend.tf
    versions.tf
    variables.tf
    terraform.tfvars.example
    main.tf
    outputs.tf
  ansible/
    ansible.cfg
    inventory.ini
    playbooks/site.yml
    roles/
      bootstrap/...
      app/...
This separation is deliberate. Terraform does “cloud wiring”. Ansible does “machine configuration”.
Terraform walkthrough: what it builds
The main entrypoint is terraform/main.tf. There are three big ideas inside:
1) Optional network creation (create_vpc)
The repo supports two modes:
- Bring your own network (create_vpc = false): you provide vpc_id and subnet_id.
- Create a VPC (create_vpc = true): Terraform creates a VPC, public subnets, private subnets, and NAT so EMR can egress.
The “guardrail” is explicit:
- When create_vpc = false, variables.tf expects you to provide vpc_id and subnet_id, and main.tf enforces that with a lifecycle.precondition so you fail fast if you forget.
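As a rough idea of the pattern (an illustrative sketch, not the repo's exact code; the resource and variable names may differ):

# Illustrative fail-fast guardrail (preconditions need Terraform 1.2+)
resource "aws_emr_cluster" "this" {
  # ... cluster configuration ...

  lifecycle {
    precondition {
      condition     = var.create_vpc || (var.vpc_id != "" && var.subnet_id != "")
      error_message = "When create_vpc = false, you must set vpc_id and subnet_id."
    }
  }
}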
Practical note for personal clusters:
- If you want to SSH from your laptop directly, make sure your EMR primary node is actually reachable (public subnet + routing, or a bastion/SSM if private). The repo supports private subnets (recommended), but private networking means you need an access pattern that matches it.
2) SSH key pair + master security group
Terraform creates an EC2 key pair:
aws_key_pair.emr_key reads public_key_path and creates/uses key_pair_name.
And it creates a dedicated EMR master SG with SSH ingress:
aws_security_group.emr_master uses allowed_ssh_cidrs.
Important: the default in variables.tf is permissive (0.0.0.0/0). That’s fine for a quick personal experiment, but if you keep clusters around for more than a coffee break, lock it down.
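A minimal sketch of what those two resources can look like (names and details here are illustrative; check the repo for the real definitions):

# Illustrative sketch of the key pair and master security group
resource "aws_key_pair" "emr_key" {
  key_name   = var.key_pair_name
  public_key = file(var.public_key_path)
}

resource "aws_security_group" "emr_master" {
  name_prefix = "${var.name_prefix}-emr-master-"
  vpc_id      = var.vpc_id

  ingress {
    description = "SSH to the EMR primary node"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = var.allowed_ssh_cidrs
  }
}

Locking it down is a one-line change in terraform.tfvars, e.g. allowed_ssh_cidrs = ["203.0.113.10/32"] for a single home IP.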
3) EMR cluster resource
The cluster itself is aws_emr_cluster.this. The “personal cluster” defaults are straightforward:
- EMR release: emr-6.15.0
- Applications: Hadoop, Spark
- Instance types: m5.xlarge master and core
- Core nodes: 2
The repo expects you to supply IAM roles:
- emr_service_role_arn (service role, e.g. EMR_DefaultRole)
- emr_instance_profile_arn (instance profile, e.g. EMR_EC2_DefaultRole)
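Put together, the cluster resource looks roughly like this (a simplified sketch; the repo's version carries more settings, and variable names such as master_instance_type are assumptions here):

# Simplified sketch of the EMR cluster resource
resource "aws_emr_cluster" "this" {
  name          = "${var.name_prefix}-emr"
  release_label = var.emr_release_label    # e.g. "emr-6.15.0"
  applications  = ["Hadoop", "Spark"]
  service_role  = var.emr_service_role_arn

  ec2_attributes {
    subnet_id                         = var.subnet_id
    key_name                          = aws_key_pair.emr_key.key_name
    emr_managed_master_security_group = aws_security_group.emr_master.id
    instance_profile                  = var.emr_instance_profile_arn
  }

  master_instance_group {
    instance_type = var.master_instance_type   # m5.xlarge by default
  }

  core_instance_group {
    instance_type  = var.core_instance_type    # m5.xlarge by default
    instance_count = var.core_instance_count   # 2 by default
  }
}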
Terraform: what you edit first (terraform.tfvars)
The repo provides terraform/terraform.tfvars.example. The “minimum viable” values look like this:
name_prefix = "personal"
create_vpc = false
vpc_id = "vpc-xxxxxxxx"
subnet_id = "subnet-xxxxxxxx"
key_pair_name = "personal-emr"
public_key_path = "~/.ssh/id_rsa.pub"
emr_service_role_arn = "arn:aws:iam::123456789012:role/EMR_DefaultRole"
emr_instance_profile_arn = "arn:aws:iam::123456789012:instance-profile/EMR_EC2_DefaultRole"
allowed_ssh_cidrs = ["0.0.0.0/0"]
For a personal cluster, the variables I touch the most are:
- allowed_ssh_cidrs (lock to your home IP/CIDR)
- emr_release_label (when I intentionally upgrade)
- core_instance_count (when I want more parallelism)
- core_instance_type (when I want cheaper/faster)
- tags (so I can find + delete things easily)
Running Terraform (the cluster comes first)
From the repo root:
cd terraform
terraform init
terraform apply
Key outputs are defined in terraform/outputs.tf, especially:
- emr_cluster_id
- emr_master_public_dns
You can pull the primary node hostname like this:
terraform output -raw emr_master_public_dns
State management (optional, but recommended)
Remote state is intentionally not “on by default” in this repo. terraform/backend.tf contains a commented S3 backend block you can enable when you’re ready.
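The enabled version looks roughly like this (bucket, key, region, and lock table are placeholders; point them at resources you actually own):

# Example S3 backend for remote state with DynamoDB locking
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"       # placeholder
    key            = "emr-deploy-iac/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"                  # placeholder, enables locking
    encrypt        = true
  }
}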
For personal clusters, I still prefer remote state if:
- I might run the repo from more than one laptop, or want to bring in collaborators later
- I want locking so I don’t accidentally double-apply, which could lead to:
- State corruption / state drift: both applies read the same “old” state, then each writes updates. The later writer can overwrite parts of state from the earlier run, leaving Terraform’s state inconsistent with what actually exists in AWS.
- Race conditions creating resources: both applies may attempt to create/update the same resources; you’ll see flaky errors like “already exists”, “resource in use”, or intermittent failures.
- Duplicate or conflicting infrastructure: depending on how resources are named, you can accidentally create duplicates (or partially create them), then spend time cleaning up.
- Harder recovery: once the state is wrong, fixes often require careful imports/state surgery, or destroying/rebuilding.
Ansible walkthrough: how post-provisioning works
Once the cluster exists, Ansible is responsible for “turning it into my cluster”.
Inventory targeting
ansible/inventory.ini has a single group:
[emr_master]
emr-master.example.com ansible_user=hadoop
You replace emr-master.example.com with the Terraform output and keep ansible_user=hadoop (EMR default SSH user).
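If you'd rather not paste the hostname by hand, a small shell helper can do it (run from the repo root; this is just one way to wire the two tools together):

# Pull the primary node DNS from Terraform and regenerate the inventory
MASTER_DNS=$(terraform -chdir=terraform output -raw emr_master_public_dns)

cat > ansible/inventory.ini <<EOF
[emr_master]
${MASTER_DNS} ansible_user=hadoop
EOF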
Playbook entrypoint
ansible/playbooks/site.yml is intentionally short:
- hosts: emr_master
  roles:
    - bootstrap
    - app
This makes it easy to reason about what happens on the cluster: baseline first, then your “app layer”.
Bootstrap role (baseline “quality of life”)
The bootstrap role does a few pragmatic things:
- Installs useful packages (git, tmux, jq, htop, python3-pip)
- Creates standard directories (/opt/emr, /var/log/emr-deploy)
- Installs common Python tooling (boto3, awscli)
- Drops a profile helper script at /etc/profile.d/emr-bootstrap.sh
It also writes a marker file (/etc/emr-deploy-iac) so you can quickly confirm the node was configured.
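As a rough sketch of what those tasks can look like (task names and module choices here are illustrative, not copied from the repo):

# roles/bootstrap/tasks/main.yml (illustrative sketch)
- name: Install baseline packages
  ansible.builtin.package:
    name: [git, tmux, jq, htop, python3-pip]
    state: present
  become: true

- name: Create standard directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    mode: "0755"
  loop:
    - /opt/emr
    - /var/log/emr-deploy
  become: true

- name: Drop the configured marker
  ansible.builtin.copy:
    dest: /etc/emr-deploy-iac
    content: "configured by emr-deploy-iac\n"
  become: true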
App role (your personalization hook)
The app role is the “extension point”:
- Creates an app directory and config directory under /opt/emr/apps/...
- Renders an environment file (from templates/app.env.j2)
- Optionally clones a repo (if app_repo_url is set)
- Optionally installs a systemd unit (if app_command is set)
In other words:
- Set app_repo_url and it will pull your code onto the EMR primary node.
- Set app_command and it becomes a managed service.
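Conceptually, the conditional part of the role looks something like this (an assumed sketch; the unit name, paths, and conditions are illustrative, not the repo's exact tasks):

# roles/app/tasks/main.yml (illustrative sketch of the optional steps)
- name: Clone the app repo when a URL is provided
  ansible.builtin.git:
    repo: "{{ app_repo_url }}"
    dest: /opt/emr/apps/app
    version: "{{ app_repo_version | default('main') }}"
  when: app_repo_url | default('') | length > 0

- name: Install a systemd unit when a command is provided
  ansible.builtin.template:
    src: app.service.j2
    dest: /etc/systemd/system/emr-app.service
  become: true
  when: app_command | default('') | length > 0

- name: Enable and start the service
  ansible.builtin.systemd:
    name: emr-app
    enabled: true
    state: restarted
    daemon_reload: true
  become: true
  when: app_command | default('') | length > 0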
For personal clusters, this pattern is a clean way to install:
- a lightweight “job runner” wrapper
- helpers for submitting Spark steps
- a tiny internal API that triggers EMR steps
Running Ansible (after Terraform)
From the repo root:
cd ansible
ansible-playbook -i inventory.ini playbooks/site.yml
To customize without editing role defaults, pass overrides at runtime. Example:
ansible-playbook -i inventory.ini playbooks/site.yml \
-e app_repo_url="https://github.com/yourname/your-emr-tools.git" \
-e app_repo_version="main" \
-e app_env='{"APP_ENV":"personal","APP_PLACEHOLDER":"set-me"}'
The “personal cluster” lifecycle (how I actually use this)
For personal work, my loop is:
1. terraform apply (cluster up)
2. ansible-playbook ... (bootstrap + my tools)
3. Run experiments / jobs
4. terraform destroy (cluster down)
The repo is designed to make steps 1–2 boring and repeatable so you can spend time on the work that matters (Spark code, data layout, tuning) instead of redoing setup.
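If you want that loop to be one command in each direction, a tiny wrapper script is enough (illustrative, not part of the repo; it reuses the inventory trick shown earlier):

#!/usr/bin/env bash
# up.sh: provision the cluster, then run the Ansible baseline (illustrative)
set -euo pipefail

terraform -chdir=terraform apply -auto-approve

MASTER_DNS=$(terraform -chdir=terraform output -raw emr_master_public_dns)
cat > ansible/inventory.ini <<EOF
[emr_master]
${MASTER_DNS} ansible_user=hadoop
EOF

(cd ansible && ansible-playbook -i inventory.ini playbooks/site.yml)

# Tear down later with: terraform -chdir=terraform destroy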
Final thought
The best personal platform is the one you can recreate from scratch without thinking.
Terraform gets you a consistent EMR “base”. Ansible gets you a consistent “you-shaped” runtime. Once you have both, your clusters become disposable, and disposable clusters are the path to being fast and safe.
Thank you for reading.
Cheers!
Jason