
Apache Spark Architecture

A Beginner-Friendly Guide to How Spark Really Works

Ever wondered how Spark processes millions of records in seconds? It’s not magic — it’s clever architecture. Let’s break it down together, step by step, using a fun real-world analogy that will make everything click!

PART 1: Understanding Spark with a Simple Analogy

The Marble Counting Story

Before we dive into technical terms, let’s understand the core idea of Spark through a simple story you’ll never forget.

The Marble Analogy

Imagine an instructor has a huge bag of marbles divided into pouches. Their job is to count all the marbles. Instead of counting everything alone, the instructor divides the work — assigning pouches to different groups (A, B, C, D). Each group counts their own pouches locally. Then a collector (G) gathers all local counts and reports the grand total back to the instructor.

Here’s how this maps to Spark:

Marble Story → Spark World:

  • Instructor → Driver: assigns and coordinates all the work
  • Groups (A, B, C, D) → Executors: do the actual processing
  • Counting marbles → Tasks: individual units of work
  • Local count + global count → 2 Stages: divided by a Shuffle
  • G collecting counts → Shuffle: data moves between executors
  • Writing the count on paper → Data persisted between stages

Key Insight: Whenever a Shuffle happens in a Spark job, the job gets divided into Stages. Shuffle is the boundary between stages!
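The two-stage pattern in the story can be sketched in plain Python (a toy simulation, not Spark code): each "executor" counts its own pouches locally in stage 1, then a single collector sums the partial counts in stage 2, after the "shuffle".

```python
# Toy simulation of the marble story: stage 1 = local counts, stage 2 = global sum.
pouches = [[3, 5], [2, 2, 4], [7], [1, 6]]  # 4 "partitions" of marbles

# Stage 1: each "executor" counts its own pouch independently (no data movement).
local_counts = [sum(pouch) for pouch in pouches]

# "Shuffle": the partial results travel to one place (collector G).
# Stage 2: the collector computes the grand total.
grand_total = sum(local_counts)

print(local_counts)  # [8, 8, 7, 7]
print(grand_total)   # 30
```

Notice that only the tiny partial counts cross the shuffle boundary, never the raw marbles; that is exactly why Spark tries to do as much work as possible before shuffling.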

PART 2: Core Components of Spark

Driver — The Brain of the Operation

The Driver is like your instructor. It’s the central coordinator that:

  • Maintains all information required by Executors
  • Receives the job from the user
  • Analyzes and breaks the job into Stages and Tasks
  • Assigns Tasks to Executors
  • Tracks the progress of every single task

Executors — The Workers

Executors are JVM (Java Virtual Machine) processes that run on cluster machines. Think of them as the groups counting marbles.

Real World Parallel

Imagine you have 3 Executors each with 2 Cores = 6 total Cores. All 6 Cores can run in PARALLEL, handling 6 tasks at the same time. One Core = One Task at a time. More cores = more parallel work = faster results!
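That 3-executors-by-2-cores setup could be requested like this at submit time (a hedged sketch: the script name is a placeholder, and `--num-executors` applies to YARN-style resource managers):

```shell
# Hypothetical submit command: 3 executors x 2 cores = 6 parallel tasks.
spark-submit \
  --num-executors 3 \
  --executor-cores 2 \
  --executor-memory 4g \
  my_job.py
```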

Component → What It Does:

  • Driver: coordinates the entire job, creates the execution plan
  • Executor: JVM process that runs tasks on cluster machines
  • Core: a single processing unit; runs ONE task at a time
  • Task: smallest unit of work, applied to one data partition
  • Stage: group of tasks that run without a shuffle in between

Partitions — Spark’s Secret Weapon

To allow Executors to work in parallel, Spark breaks your input data into smaller chunks called Partitions.

Think of it like this: If you have 1 big pizza and 4 people, you slice it into 4 pieces so everyone can eat at the same time. Spark slices your data into partitions so all cores can process simultaneously!

Example: 4 data partitions → 4 tasks run in parallel → up to 4x faster processing!

PART 3: How Spark Plans Your Job

Lazy Evaluation — Spark Waits Before Acting

Here’s something surprising about Spark — it doesn’t actually do anything when you write transformations. It waits!

The Bread Analogy

Imagine ordering bread at a bakery. You say ‘brown bread, sliced, toasted’. The baker doesn’t start immediately — they WAIT until you actually pay (the action). Why? Because by knowing the full order upfront, they can plan the most efficient way to make it. Spark works the same way!

This is called Lazy Evaluation:

  • Transformations are recorded but NOT executed
  • When an Action is called (like .show(), .count()), Spark executes everything
  • This allows Spark’s optimizer to plan the most efficient execution
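Python generators show the same idea in miniature (a plain-Python analogy, not Spark code): the function body does not run when you build the pipeline, only when you consume it.

```python
# Plain-Python analogy for lazy evaluation (not Spark code).
log = []

def transform(values):
    # A generator: this body does NOT run when transform() is called.
    for v in values:
        log.append(v)        # side effect so we can observe when work happens
        yield v * 2

lazy = transform([1, 2, 3])  # "transformation": recorded, nothing executed
print(log)                   # [] -- no work done yet

result = list(lazy)          # "action": now the whole pipeline runs
print(result)                # [2, 4, 6]
print(log)                   # [1, 2, 3]
```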

Narrow vs Wide Transformations

Not all transformations are created equal. Spark classifies them into two types:

Narrow Transformations:

  • No shuffle needed
  • Each partition is processed independently
  • Fast: work stays within the executor
  • Examples: select(), filter(), withColumn(), drop()

Wide Transformations:

  • Require a shuffle (data movement)
  • Data must be grouped across partitions
  • Slower: they create a new Stage boundary
  • Examples: groupBy(), join(), cube(), rollup(), agg()

Pro Tip: Wide transformations cause Shuffles which create Stage boundaries. Minimize wide transformations where possible to improve performance!
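The difference can be sketched in plain Python over a list of "partitions" (a toy model, not Spark's implementation): a narrow transformation handles each partition on its own, while a wide one like groupBy must first bring matching keys together from all partitions.

```python
from collections import defaultdict

partitions = [[("NY", 1), ("LA", 2)], [("NY", 3), ("SF", 4)]]

# Narrow (filter-like): each partition is processed independently;
# partition boundaries are preserved and no data moves between them.
narrow = [[(k, v) for (k, v) in part if v > 1] for part in partitions]

# Wide (groupBy-like): rows with the same key may live in different
# partitions, so they must be brought together first -- the "shuffle".
groups = defaultdict(list)
for part in partitions:
    for k, v in part:
        groups[k].append(v)
wide = {k: sum(vs) for k, vs in groups.items()}

print(narrow)  # [[('LA', 2)], [('NY', 3), ('SF', 4)]]
print(wide)    # {'NY': 4, 'LA': 2, 'SF': 4}
```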

Spark’s 2-Phase Planning Process

When you submit a Spark job, it goes through two planning phases before any data is touched:

Phase 1: Logical Planning

  • Your code starts as an Unresolved Logical Plan
  • Spark validates column names and table names against the Catalog
  • Creates a Resolved Logical Plan
  • Catalyst Optimizer steps in and optimizes the plan
  • Final output: Optimized Logical Plan (the Logical DAG)

Phase 2: Physical Planning

  • Spark generates multiple Physical Plans based on cluster configuration
  • Each plan is run against a Cost Model
  • Best Physical Plan is selected
  • Plan is sent to the cluster for execution
  • Executors run the plan against data partitions

DAG — The Execution Roadmap

DAG = Directed Acyclic Graph — It’s the logical execution plan of a Spark job. Think of it as your GPS route — Spark follows this map to process your data from start to finish.

PART 4: DataFrames & Immutability

What Makes DataFrames Special?

Spark DataFrames are immutable — meaning once created, you CANNOT change them. But you CAN create a new DataFrame from an existing one with your changes applied.

The Stamp Analogy

Think of a DataFrame like a stamped document. You can’t change the original stamp, but you can make a photocopy and stamp THAT copy differently. Each transformation creates a new ‘photocopy’ — the original stays intact.

This immutability is what makes Spark fault-tolerant. If something fails, Spark can always recreate the data from the original source by replaying the transformations!
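The copy-on-transform idea can be shown in plain Python (a toy model; `with_column` is a made-up stand-in for DataFrame.withColumn): the transformation returns a new collection and leaves the original untouched.

```python
def with_column(rows, name, fn):
    """Toy stand-in for a DataFrame transformation: returns NEW rows,
    never mutates the input."""
    return [{**row, name: fn(row)} for row in rows]

df = [{"salary": 100}, {"salary": 200}]
df2 = with_column(df, "bonus", lambda r: r["salary"] * 0.1)

print(df)   # [{'salary': 100}, {'salary': 200}]  -- original intact
print(df2)  # [{'salary': 100, 'bonus': 10.0}, {'salary': 200, 'bonus': 20.0}]
```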

PART 5: How Spark Reads & Writes Data

Writing Data — The Partition File System

When Spark writes data, here’s exactly what happens:

  • Input file has 4 partitions → 4 tasks execute in parallel
  • Each Core processes one partition
  • Each task writes its own output file (part-1, part-2, part-3, part-4)
  • A folder is created (e.g., emp.csv/) containing all part files

So if you open an output folder and see multiple files like part-00000, part-00001 etc. — that’s normal! Each file = one task’s output. Total files = number of partitions.
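The write pattern can be sketched in plain Python (a toy model: the filenames simply mimic Spark's part-file convention): each "task" writes exactly one file into the output folder.

```python
import os
import tempfile

partitions = [["a", "b"], ["c"], ["d", "e"], ["f"]]       # 4 partitions
out_dir = os.path.join(tempfile.mkdtemp(), "emp.csv")     # Spark outputs a FOLDER
os.makedirs(out_dir)

# Each "task" writes one part file for its own partition.
for i, part in enumerate(partitions):
    with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
        f.write("\n".join(part))

print(sorted(os.listdir(out_dir)))
# ['part-00000', 'part-00001', 'part-00002', 'part-00003']
```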

Joining Data — The Shuffle Problem

Joining two tables in Spark is more complex than it seems. Here’s the challenge:

Suppose you have Sales and City tables. Data is read in partitions across executors. But the same City ID might be in different executors — and you can’t join data that’s in different places!

Spark’s Solution: SHUFFLE — move data so that matching City IDs land on the same executor. This is why joins cause shuffles and create Stage boundaries.
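A toy hash shuffle in plain Python (not Spark internals) shows the idea: rows from both tables are rerouted so every row with the same city ID lands on the same target "executor", which can then join purely locally.

```python
NUM_EXECUTORS = 2

def route(rows):
    """Toy shuffle: send each row to executor hash(city_id) % NUM_EXECUTORS."""
    targets = [[] for _ in range(NUM_EXECUTORS)]
    for row in rows:
        targets[hash(row[0]) % NUM_EXECUTORS].append(row)
    return targets

sales = [(101, "order-a"), (102, "order-b"), (101, "order-c")]
cities = [(101, "Delhi"), (102, "Mumbai")]

shuffled_sales, shuffled_cities = route(sales), route(cities)

# After the shuffle, each executor joins ONLY its own slice:
# matching city IDs are guaranteed to be co-located.
joined = []
for s_part, c_part in zip(shuffled_sales, shuffled_cities):
    city_names = dict(c_part)
    joined += [(cid, order, city_names[cid]) for cid, order in s_part]

print(sorted(joined))
# [(101, 'order-a', 'Delhi'), (101, 'order-c', 'Delhi'), (102, 'order-b', 'Mumbai')]
```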

Bucketing — A Smarter Way to Join

Bucketing is Spark’s way to avoid shuffle on repeated joins. Here’s how it works:

  • Pre-group data into a fixed number of buckets using a hash function (Murmur)
  • Same key value always goes to the same bucket
  • When joining bucketed tables, Spark reads matching buckets together — no shuffle needed!

Important Rule: Both tables must have the SAME number of buckets for shuffle-free joins to work!
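Same-key-same-bucket can be sketched with a simple hash in Python (Spark really uses Murmur3; the built-in hash() stands in for it here): because both tables were bucketed with the same function and the same bucket count, bucket i of one table only ever needs bucket i of the other.

```python
NUM_BUCKETS = 4

def bucket_of(key):
    # Spark uses Murmur3 hashing; Python's hash() stands in for it here.
    return hash(key) % NUM_BUCKETS

# Bucket both tables up front (this is the one-time bucketing cost).
orders = {b: [] for b in range(NUM_BUCKETS)}
for city_id, order in [(101, "o1"), (102, "o2"), (105, "o3")]:
    orders[bucket_of(city_id)].append((city_id, order))

cities = {b: [] for b in range(NUM_BUCKETS)}
for city_id, name in [(101, "Delhi"), (102, "Mumbai"), (105, "Pune")]:
    cities[bucket_of(city_id)].append((city_id, name))

# Join only matching buckets: bucket i of orders with bucket i of cities.
# No data needs to move between buckets -- no shuffle.
joined = []
for b in range(NUM_BUCKETS):
    names = dict(cities[b])
    joined += [(cid, o, names[cid]) for cid, o in orders[b]]

print(sorted(joined))
# [(101, 'o1', 'Delhi'), (102, 'o2', 'Mumbai'), (105, 'o3', 'Pune')]
```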

PART 6: Spark Memory Management

Understanding Executor Memory

When you request memory for an executor (say 512 MB), here’s the reality of how it gets divided:

Memory Layer → Details:

  • Requested Memory (512 MB): what you ask for
  • JVM Usable (89% = ~455 MB): the JVM takes 11% for internal processes like Garbage Collection
  • Reserved Memory (300 MB): Spark sets aside 300 MB for its own internals
  • Usable Memory (~155 MB): what is left for your actual work
  • User Memory (40% of usable): data structures, UDFs, custom code
  • Spark Unified Memory (60% of usable): Storage + Execution, managed by Spark

Minimum Memory Rule: You must request at least 1.5x the Reserved Memory (300 MB). So minimum executor memory = 450 MB!
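The arithmetic above, worked through in Python (using the article's example numbers; exact fractions vary by Spark version and configuration):

```python
requested_mb = 512
jvm_usable = requested_mb * 0.89      # JVM keeps ~11% for itself (e.g., GC)
usable = jvm_usable - 300             # minus Spark's 300 MB reserved memory
user_memory = usable * 0.40           # UDFs, custom data structures
unified_memory = usable * 0.60        # storage + execution, managed by Spark

print(round(jvm_usable))              # ~456
print(round(usable))                  # ~156
print(round(unified_memory))          # ~93

# Minimum-memory rule: at least 1.5x the 300 MB reserved memory.
min_executor_mb = 1.5 * 300           # 450 MB
```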

Spark Unified Memory — Storage vs Execution

Spark’s Unified Memory is shared between two areas with NO hard boundary:

Storage Memory:

  • Used to cache (persist) data
  • Lower priority
  • Can borrow free space from Execution
  • CANNOT evict Execution Memory

Execution Memory:

  • Used for computations: aggregations, shuffles, hash tables
  • Higher priority: can evict Storage
  • Can forcefully evict Storage blocks, down to the storageFraction limit (default 50%)

Think of it like a shared office desk: Execution gets priority. Storage can use free space, but has to give it up when Execution needs it!
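These proportions are controlled by two real Spark settings, shown here with their default values (a config sketch, not something you usually need to change):

```properties
spark.memory.fraction         0.6   # unified (storage + execution) share of usable heap
spark.memory.storageFraction  0.5   # portion of unified memory protected from eviction
```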

Memory Spillage — When Things Get Tight

When Spark runs out of memory, it spills data to disk. This is called spillage — and it’s expensive because:

  • In-memory data = Deserialized Java Objects (fast to use)
  • On-disk data = Serialized byte streams (slow to read/convert)
  • Every spill = costly read/write + deserialization overhead

Spillage is a WARNING sign! It means your partitions are too large or your memory is too small. Always investigate and fix spills for better performance.

Out of Memory (OOM) Errors

OOM errors happen when Spark simply cannot fit data into memory even with spilling. Common causes:

  • One partition is much larger than the others (e.g., a 25 MB partition in a 20 MB executor)
  • repartition() or skewed joins generating too much shuffle data
  • Cross joins, or explode() on nested arrays, multiplying the data size
  • Compressed files (e.g., Parquet with Snappy or Gzip) expanding massively once decompressed in memory

On-Heap vs Off-Heap Memory

On-Heap Memory:

  • Default: managed by the JVM
  • Subject to Garbage Collection (GC)
  • Enabled by default
  • Best for most use cases

Off-Heap Memory:

  • Optional: managed directly by the OS
  • No GC overhead; you manage it manually
  • Disabled by default; must be configured
  • Better for large caches, avoiding GC pauses
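Off-heap memory is enabled with two real Spark settings (the 2g size is just an illustrative value; it must be set explicitly whenever off-heap is enabled):

```properties
spark.memory.offHeap.enabled  true
spark.memory.offHeap.size     2g
```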

Garbage Collection (GC) — The Memory Janitor

Garbage Collection (GC) is the JVM’s process of cleaning up objects no longer in use. While essential, it can cause problems:

  • Long GC cycles can pause your Spark program completely
  • You’ll see ‘GC overhead limit exceeded’ in logs
  • In severe cases, can lead to OOM errors

Fix: Reduce the number of long-lived objects in your code, use off-heap memory for large caches, or tune GC settings for your workload.

QUICK REFERENCE GLOSSARY

Driver Central coordinator — breaks job into stages/tasks, assigns to executors
Executor JVM process on cluster nodes — actually runs the tasks
Core Single processing unit — handles exactly ONE task at a time
Task Smallest unit of work — processes one data partition
Stage Group of tasks separated by shuffle boundaries
Shuffle Moving data across executors — triggers stage boundary
Partition A chunk of data — enables parallel processing
DAG Directed Acyclic Graph — the full execution plan map
Lazy Eval Transformations wait for an Action before executing
Spillage Writing memory data to disk when RAM is full — slow!
OOM Out of Memory error — data won’t fit even with spilling
Bucketing Pre-partitioning data to avoid shuffle during joins

Summary: Spark’s power comes from distributed parallel processing. The Driver plans, Executors execute, Partitions enable parallelism, and the DAG maps the entire journey. Understanding memory management and shuffle is the key to writing efficient Spark code!

Author

Shivam Kumar Singh
