NADEBlg!
Featured
Databricks Apps Review: What It Is, What It Isn’t, and How to Roll It Out Safely
Databricks Apps is one of those platform features that looks deceptively simple: “deploy an app next to the data.” In practice, it changes how teams deliver internal tools (dashboards, forms, RAG UIs, lightweight workflows) because you can ship an application inside the Databricks security and governance boundary without standing up...
Walkthrough: Deploying My Personal EMR Clusters with Terraform + Ansible (Reference Repo)
In my previous post I covered Terraform + Ansible best practices for a secure, production-ready EMR environment. In this post, I’m going to do the practical follow-up: a walkthrough of the reference repo I use to deploy my personal EMR clusters. The example repo for this post is: GitHub repo:...
Terraform + Ansible Best Practices for a Secure, Production-Ready AWS EMR Environment
I get asked all the time: “How do I securely automate my EMR deployment so I only need to write the code once to create a repeatable, production-ready environment?” If you want EMR to feel boring in production (the best outcome), you need Infrastructure as Code (IaC) for the...
How One Skewed Join Key Turned a 12-Minute Spark Job into a 2-Hour Job (and How to Fix It Without Resizing the Cluster)
In my last post, I showed how to diagnose skew in Spark UI quickly. In this post, I want to show the real-world version: one skewed join key took a job that normally ran in ~12 minutes and pushed it past 2 hours. The punchline: you don’t need to resize...
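A minimal sketch of the salting idea behind that kind of fix, in plain Python so it runs without a cluster: the hot key on the big side is spread across `SALT_BUCKETS` synthetic keys, and the small side is replicated once per salt value so the join still matches. The table names, key values, and `SALT_BUCKETS` fan-out are illustrative assumptions, not values from the post.

```python
import random
from collections import Counter

random.seed(0)    # seeded so the demo is reproducible
SALT_BUCKETS = 8  # assumed fan-out; in Spark, tune toward your shuffle parallelism

# Simulated fact table: one "hot" customer id dominates the join key.
fact = [("cust_42", i) for i in range(10_000)] + [(f"cust_{i}", i) for i in range(100)]

# Small dimension side of the join (illustrative names, not from the post).
dim = {"cust_42": "Hot Customer", "cust_7": "Normal Customer"}

# Salt the big side: the hot key is spread across SALT_BUCKETS synthetic keys.
salted_fact = [(f"{k}#{random.randrange(SALT_BUCKETS)}", v) for k, v in fact]

# Replicate the small side once per salt value so every salted key still matches.
salted_dim = {f"{k}#{s}": name for k, name in dim.items() for s in range(SALT_BUCKETS)}

# Rows per join key before vs. after salting (a proxy for per-task work).
before = Counter(k for k, _ in fact)
after = Counter(k for k, _ in salted_fact)
print("largest key before:", before.most_common(1))
print("largest key after: ", after.most_common(1))
```

In PySpark the same idea is usually expressed by adding a random salt column to the large DataFrame and exploding a salt range on the small one before joining; on Spark 3.x, enabling `spark.sql.adaptive.skewJoin.enabled` can handle much of this automatically.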
Diagnose Skew, Spill, and Too Many Small Tasks in Spark UI (EMR) in Under 10 Minutes
In my last post, I promised a Spark UI walkthrough you can actually use under pressure. This is that post. The goal is simple: open the Spark UI (on EMR) and identify whether your job is suffering from skew (a few tasks do all the work), spill (not enough memory,...
PySpark Best Practices on AWS EMR
If you’ve ever asked “Why is my PySpark job slow on EMR?” the honest answer is usually: it’s not one thing. It’s a handful of small decisions that compound—cluster sizing, file layout, shuffle tuning, join strategy, and the never-ending battle with small files on S3. This post is my “battle...
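As a starting point for the shuffle-tuning part of that list, here is a sketch of the Spark 3.x Adaptive Query Execution flags in `spark-defaults.conf` form; treat these as defaults to verify against your Spark/EMR version, not tuned recommendations.

```
# Adaptive Query Execution: recompute shuffle partitioning at runtime
spark.sql.adaptive.enabled                      true
# Coalesce many small shuffle partitions into fewer, larger ones
spark.sql.adaptive.coalescePartitions.enabled   true
# Split oversized (skewed) partitions during sort-merge joins
spark.sql.adaptive.skewJoin.enabled             true
```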
Using Referential Integrity
I am not sure about you, but tax season is a busy time of year for my teams. With that, I have jumped into the mix to assist with code reviews, PR approvals, and branch merging in order to free up some of my Senior Data Engineers to do more...
Competitive Advantage
I recently started reading Tomasz Tunguz and Frank Bien’s Winning with Data: Transform Your Culture, Empower Your People, and Shape the Future. For many of us in the data management field, whether in Data Engineering, Business Intelligence, Data Architecture, Database Administration, or even Software Engineering, understanding and extending the usage of...
Regular
Welcome to NADEBlg!
Today, I am formally announcing the release of Not Another Data Engineering Blog! (NADEBlg!). I have thought about the best ways to approach this new endeavor, what I want to write about and discuss, and, most of all, the audience I am trying to reach. Before my first...