NADEBlg!
Featured
Databricks Apps Review: What It Is, What It Isn’t, and How to Roll It Out Safely
Databricks Apps is one of those platform features that looks deceptively simple: “deploy an app next to the data.” In practice, it changes how teams deliver internal tools (dashboards, forms, RAG UIs, lightweight workflows) because you can ship an application inside the Databricks security and governance boundary without standing up...
Walkthrough: Deploying My Personal EMR Clusters with Terraform + Ansible (Reference Repo)
In my previous post I covered Terraform + Ansible best practices for a secure, production-ready EMR environment. In this post, I’m going to do the practical follow-up: a walkthrough of the reference repo I use to deploy my personal EMR clusters. The example repo for this post is: GitHub repo:...
Terraform + Ansible Best Practices for a Secure, Production-Ready AWS EMR Environment
I get asked all the time: “How do I securely automate my EMR deployment so I only need to write the code once to create a repeatable, production-ready environment?” If you want EMR to feel boring in production (the best outcome), you need Infrastructure as Code (IaC) for the...
How One Skewed Join Key Turned a 12-Minute Spark Job into a 2-Hour Job (and How to Fix It Without Resizing the Cluster)
In my last post, I showed how to diagnose skew in Spark UI quickly. In this post, I want to show the real-world version: one skewed join key took a job that normally ran in ~12 minutes and pushed it past 2 hours. The punchline: you don’t need to resize...
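A minimal sketch of the salting idea behind that kind of fix, in plain Python so it runs without a cluster: the hot key on the big side is spread across `SALT_BUCKETS` synthetic keys, and the small side is replicated once per salt value so the join still matches. The table names, key values, and `SALT_BUCKETS` fan-out are illustrative assumptions, not values from the post.

```python
import random
from collections import Counter

random.seed(0)    # seeded so the demo is reproducible
SALT_BUCKETS = 8  # assumed fan-out; in Spark, tune toward your shuffle parallelism

# Simulated fact table: one "hot" customer id dominates the join key.
fact = [("cust_42", i) for i in range(10_000)] + [(f"cust_{i}", i) for i in range(100)]

# Small dimension side of the join (illustrative names, not from the post).
dim = {"cust_42": "Hot Customer", "cust_7": "Normal Customer"}

# Salt the big side: the hot key is spread across SALT_BUCKETS synthetic keys.
salted_fact = [(f"{k}#{random.randrange(SALT_BUCKETS)}", v) for k, v in fact]

# Replicate the small side once per salt value so every salted key still matches.
salted_dim = {f"{k}#{s}": name for k, name in dim.items() for s in range(SALT_BUCKETS)}

# Rows per join key before vs. after salting (a proxy for per-task work).
before = Counter(k for k, _ in fact)
after = Counter(k for k, _ in salted_fact)
print("largest key before:", before.most_common(1))
print("largest key after: ", after.most_common(1))
```

In PySpark the same idea is usually expressed by adding a random salt column to the large DataFrame and exploding a salt range on the small one before joining; on Spark 3.x, enabling `spark.sql.adaptive.skewJoin.enabled` can handle much of this automatically.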
Diagnose Skew, Spill, and Too Many Small Tasks in Spark UI (EMR) in Under 10 Minutes
In my last post, I promised a Spark UI walkthrough you can actually use under pressure. This is that post. The goal is simple: open the Spark UI (on EMR) and identify whether your job is suffering from skew (a few tasks do all the work), spill (not enough memory,...
PySpark Best Practices on AWS EMR
If you’ve ever asked “Why is my PySpark job slow on EMR?” the honest answer is usually: it’s not one thing. It’s a handful of small decisions that compound—cluster sizing, file layout, shuffle tuning, join strategy, and the never-ending battle with small files on S3. This post is my “battle...
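As a starting point for the shuffle-tuning part of that list, here is a sketch of the Spark 3.x Adaptive Query Execution flags in `spark-defaults.conf` form; treat these as defaults to verify against your Spark/EMR version, not tuned recommendations.

```
# Adaptive Query Execution: recompute shuffle partitioning at runtime
spark.sql.adaptive.enabled                      true
# Coalesce many small shuffle partitions into fewer, larger ones
spark.sql.adaptive.coalescePartitions.enabled   true
# Split oversized (skewed) partitions during sort-merge joins
spark.sql.adaptive.skewJoin.enabled             true
```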
Using Referential Integrity
I am not sure about you, but tax season is a busy time of year for my teams. With that, I have jumped into the mix to assist with code reviews, PR approvals, and branch merging in order to free up some of my Senior Data Engineers to do more...
Competitive Advantage
I recently started reading Tomasz Tunguz and Frank Bien’s Winning with Data: Transform Your Culture, Empower Your People, and Shape the Future. For many of us in the data management field, whether in Data Engineering, Business Intelligence, Data Architecture, Database Administration, or even Software Engineering, understanding and extending the usage of...
Regular
Welcome to NADEBlg!
Today, I am formally announcing the release of Not Another Data Engineering Blog! (NADEBlg!). I have thought about the best ways to approach this new endeavor, what I want to write about and discuss, and, most of all, the audience I am trying to reach. Before my first...