Apache Spark Architecture
A Beginner-Friendly Guide to How Spark Really Works
Ever wondered how Spark processes millions of records in seconds? It’s not magic — it’s clever architecture. Let’s break it down together, step by step, using a fun real-world analogy that will make everything click!
PART 1: Understanding Spark with a Simple Analogy
The Marble Counting Story
Before we dive into technical terms, let’s understand the core idea of Spark through a simple story you’ll never forget.
The Marble Analogy
Imagine an instructor has a huge bag of marbles divided into pouches. Their job is to count all the marbles. Instead of counting everything alone, the instructor divides the work — assigning pouches to different groups (A, B, C, D). Each group counts their own pouches locally. Then a collector (G) gathers all local counts and reports the grand total back to the instructor.
Here’s how this maps to Spark:
| Marble Story | Spark World |
| --- | --- |
| Instructor | Driver — assigns and coordinates all work |
| Groups (A, B, C, D) | Executors — do the actual processing |
| Counting marbles | Tasks — individual units of work |
| Local count + Global count | 2 Stages — divided by a Shuffle |
| G collecting counts | Shuffle — data moves between executors |
| Writing count on paper | Data persisted between stages |
Key Insight: Whenever a Shuffle happens in a Spark job, the job gets divided into Stages. Shuffle is the boundary between stages!
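The marble story can be sketched in a few lines of plain Python (no Spark needed). The names here are illustrative only: each "pouch" is counted locally in stage 1, then the local counts cross the shuffle boundary and are summed in stage 2.

```python
# Each pouch is a list of marbles; 4 pouches go to 4 "groups".
pouches = [[1] * 5, [1] * 3, [1] * 7, [1] * 4]

# Stage 1: each group counts its own pouch independently (no data movement)
local_counts = [len(pouch) for pouch in pouches]

# Shuffle boundary: the local results travel to one place ("G")
# Stage 2: the collector sums the local counts into the grand total
grand_total = sum(local_counts)

print(local_counts)  # [5, 3, 7, 4]
print(grand_total)   # 19
```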
PART 2: Core Components of Spark
Driver — The Brain of the Operation
The Driver is like your instructor. It’s the central coordinator that:
- Maintains all information required by Executors
- Receives the job from the user
- Analyzes and breaks the job into Stages and Tasks
- Assigns Tasks to Executors
- Tracks the progress of every single task
Executors — The Workers
Executors are JVM (Java Virtual Machine) processes that run on cluster machines. Think of them as the groups counting marbles.
Real World Parallel
Imagine you have 3 Executors each with 2 Cores = 6 total Cores. All 6 Cores can run in PARALLEL, handling 6 tasks at the same time. One Core = One Task at a time. More cores = more parallel work = faster results!
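The arithmetic above is worth working through once. With made-up numbers (3 executors, 2 cores each, 10 partitions), tasks run in "waves" of at most 6 at a time:

```python
import math

executors = 3
cores_per_executor = 2
total_cores = executors * cores_per_executor  # 6 tasks can run at once

partitions = 10  # one task per partition
waves = math.ceil(partitions / total_cores)

print(total_cores)  # 6
print(waves)        # 2 -> 6 tasks run first, then the remaining 4
```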
| Component | What It Does |
| --- | --- |
| Driver | Coordinates the entire job, creates execution plan |
| Executor | JVM process that runs tasks on cluster machines |
| Core | A single processing unit — runs ONE task at a time |
| Task | Smallest unit of work applied to one data partition |
| Stage | Group of tasks that run without a shuffle in between |
Partitions — Spark’s Secret Weapon
To allow Executors to work in parallel, Spark breaks your input data into smaller chunks called Partitions.
Think of it like this: If you have 1 big pizza and 4 people, you slice it into 4 pieces so everyone can eat at the same time. Spark slices your data into partitions so all cores can process simultaneously!
Example: 4 data partitions → 4 tasks run in parallel → up to 4x faster than processing everything serially!
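Here is a loose illustration of that idea in plain Python, using a thread pool instead of Spark's scheduler: the data is sliced into partitions, and one "task" per partition runs in parallel.

```python
# Illustrative only: not Spark, just the partition-then-parallelize idea.
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, n):
    """Slice data into n roughly equal chunks (ceiling-sized)."""
    size = -(-len(data) // n)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_partition(part):
    return sum(part)  # one "task": sum its own partition

data = list(range(1, 101))              # 1..100
parts = split_into_partitions(data, 4)  # 4 partitions -> 4 tasks
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(process_partition, parts))

print(partial)       # [325, 950, 1575, 2200]
print(sum(partial))  # 5050
```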
PART 3: How Spark Plans Your Job
Lazy Evaluation — Spark Waits Before Acting
Here’s something surprising about Spark — it doesn’t actually do anything when you write transformations. It waits!
The Bread Analogy
Imagine ordering bread at a bakery. You say ‘brown bread, sliced, toasted’. The baker doesn’t start immediately — they WAIT until you actually pay (the action). Why? Because by knowing the full order upfront, they can plan the most efficient way to make it. Spark works the same way!
This is called Lazy Evaluation:
- Transformations are recorded but NOT executed
- When an Action is called (like .show(), .count()), Spark executes everything
- This allows Spark’s optimizer to plan the most efficient execution
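Python generators give a loose analogy for this: they record work but produce nothing until a terminal step pulls results, much like Spark defers transformations until an action fires.

```python
# "Transformations": nothing is computed yet, just a recipe
data = range(1, 1_000_000)
doubled = (x * 2 for x in data)
filtered = (x for x in doubled if x % 3 == 0)

# "Action": only now do any elements actually get produced,
# and only as many as the action needs
first_five = [next(filtered) for _ in range(5)]
print(first_five)  # [6, 12, 18, 24, 30]
```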
Narrow vs Wide Transformations
Not all transformations are created equal. Spark classifies them into two types:
| Narrow Transformation | Wide Transformation |
| --- | --- |
| No shuffle needed | Requires shuffle (data movement) |
| Each partition processed independently | Data must be grouped across partitions |
| Fast — stays within executor | Slower — creates new Stage boundary |
| select(), filter(), withColumn(), drop() | groupBy(), join(), cube(), rollup(), agg() |
Pro Tip: Wide transformations cause Shuffles which create Stage boundaries. Minimize wide transformations where possible to improve performance!
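The difference can be sketched without Spark at all. Below, a narrow step (filter) touches each partition independently, while a wide step (group-by-key) must first move rows between partitions; the names and data are made up for illustration.

```python
from collections import defaultdict

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: filter within each partition; no data crosses partitions
narrow = [[(k, v) for k, v in part if v > 1] for part in partitions]

# Wide: to group by key, rows with the same key must be co-located,
# so everything is "shuffled" by key before aggregating
shuffled = defaultdict(list)
for part in partitions:
    for k, v in part:
        shuffled[k].append(v)
wide = {k: sum(vs) for k, vs in shuffled.items()}

print(narrow)  # [[('b', 2)], [('a', 3), ('c', 4)]]
print(wide)    # {'a': 4, 'b': 2, 'c': 4}
```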
Spark’s 2-Phase Planning Process
When you submit a Spark job, it goes through two planning phases before any data is touched:
Phase 1: Logical Planning
- Your code starts as an Unresolved Logical Plan
- Spark validates column names and table names against the Catalog
- Creates a Resolved Logical Plan
- Catalyst Optimizer steps in and optimizes the plan
- Final output: Optimized Logical Plan (the Logical DAG)
Phase 2: Physical Planning
- Spark generates multiple Physical Plans based on cluster configuration
- Each plan is run against a Cost Model
- Best Physical Plan is selected
- Plan is sent to the cluster for execution
- Executors run the plan against data partitions
DAG — The Execution Roadmap
DAG = Directed Acyclic Graph — It’s the logical execution plan of a Spark job. Think of it as your GPS route — Spark follows this map to process your data from start to finish.
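A DAG can be pictured as a plain dict where each step lists its dependencies; a topological sort then yields a valid execution order, which is the essence of what Spark's planner walks. The step names below are invented for illustration (Python 3.9+ for `graphlib`).

```python
from graphlib import TopologicalSorter

# Each step maps to the steps it depends on
dag = {
    "read_csv":  [],
    "filter":    ["read_csv"],
    "select":    ["filter"],
    "group_by":  ["select"],    # wide step -> stage boundary in Spark
    "write_out": ["group_by"],
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['read_csv', 'filter', 'select', 'group_by', 'write_out']
```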
PART 4: DataFrames & Immutability
What Makes DataFrames Special?
Spark DataFrames are immutable — meaning once created, you CANNOT change them. But you CAN create a new DataFrame from an existing one with your changes applied.
The Stamp Analogy
Think of a DataFrame like a stamped document. You can’t change the original stamp, but you can make a photocopy and stamp THAT copy differently. Each transformation creates a new ‘photocopy’ — the original stays intact.
This immutability is what makes Spark fault-tolerant. If something fails, Spark can always recreate the data from the original source by replaying the transformations!
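Immutability plus lineage, in miniature: each "transformation" below returns a new tuple and the recipe is kept, so the result can always be rebuilt from the source, just as Spark replays transformations after a failure. All of this is plain Python, not Spark's actual lineage machinery.

```python
source = (3, 1, 2)
lineage = [
    lambda d: tuple(sorted(d)),          # "transformation" 1
    lambda d: tuple(x * 10 for x in d),  # "transformation" 2
]

result = source
for step in lineage:
    result = step(result)  # each step yields a NEW tuple

print(source)  # (3, 1, 2) -> the original stays intact
print(result)  # (10, 20, 30)

# "Fault recovery": replay the same lineage from the source
replayed = source
for step in lineage:
    replayed = step(replayed)
assert replayed == result
```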
PART 5: How Spark Reads & Writes Data
Writing Data — The Partition File System
When Spark writes data, here’s exactly what happens:
- Input file has 4 partitions → 4 tasks execute in parallel
- Each Core processes one partition
- Each task writes its own output file (part-1, part-2, part-3, part-4)
- A folder is created (e.g., emp.csv/) containing all part files
So if you open an output folder and see multiple files like part-00000, part-00001 etc. — that’s normal! Each file = one task’s output. Total files = number of partitions.
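That folder layout is easy to mimic in plain Python, which makes it click: the "output file" is really a directory, and each partition's task writes its own part file (names and data below are invented).

```python
import os
import tempfile

partitions = [["alice,30"], ["bob,25"], ["cara,41"], ["dan,37"]]

out_dir = os.path.join(tempfile.mkdtemp(), "emp.csv")  # a FOLDER, not a file
os.makedirs(out_dir)

for i, part in enumerate(partitions):
    # one task -> one file, named like Spark's part-0000N outputs
    with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
        f.write("\n".join(part))

print(sorted(os.listdir(out_dir)))
# ['part-00000', 'part-00001', 'part-00002', 'part-00003']
```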
Joining Data — The Shuffle Problem
Joining two tables in Spark is more complex than it seems. Here’s the challenge:
Suppose you have Sales and City tables. Data is read in partitions across executors. But the same City ID might be in different executors — and you can’t join data that’s in different places!
Spark’s Solution: SHUFFLE — move data so that matching City IDs land on the same executor. This is why joins cause shuffles and create Stage boundaries.
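A hand-rolled version of that shuffle shows the trick: route rows from both tables by key (simple modulo stands in for Spark's hash here) so matching city IDs land in the same "executor", where the join then happens locally. Table contents are made up.

```python
sales = [(1, 100), (2, 250), (1, 75)]   # (city_id, amount)
cities = [(1, "Pune"), (2, "Delhi")]    # (city_id, name)

n = 2  # pretend we have 2 executors
shuffled = [{"sales": [], "cities": []} for _ in range(n)]

# The "shuffle": same city_id -> same bucket, from BOTH tables
for cid, amt in sales:
    shuffled[cid % n]["sales"].append((cid, amt))
for cid, name in cities:
    shuffled[cid % n]["cities"].append((cid, name))

# Each "executor" now joins only its own rows, no cross-talk needed
joined = []
for bucket in shuffled:
    lookup = dict(bucket["cities"])
    for cid, amt in bucket["sales"]:
        joined.append((cid, lookup[cid], amt))

print(sorted(joined))
# [(1, 'Pune', 75), (1, 'Pune', 100), (2, 'Delhi', 250)]
```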
Bucketing — A Smarter Way to Join
Bucketing is Spark’s way to avoid shuffle on repeated joins. Here’s how it works:
- Pre-group data into a fixed number of buckets using a hash function (Murmur3)
- Same key value always goes to the same bucket
- When joining bucketed tables, Spark reads matching buckets together — no shuffle needed!
Important Rule: Both tables must have the SAME number of buckets for shuffle-free joins to work!
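Why the bucket counts must match is easy to see in miniature: with the same hash and the same number of buckets, a key lands in the same bucket on both sides, so bucket i of one table only ever needs bucket i of the other. (Spark uses Murmur3; Python's built-in `hash` stands in here purely for illustration.)

```python
def bucket_of(key, num_buckets):
    return hash(key) % num_buckets

keys = [101, 202, 303, 404]

# Same bucket count on both sides: keys always line up
same = all(bucket_of(k, 8) == bucket_of(k, 8) for k in keys)

# Mismatched bucket counts: alignment is not guaranteed
mixed = all(bucket_of(k, 8) == bucket_of(k, 4) for k in keys)

print(same)   # True  -> shuffle-free join is possible
print(mixed)  # False here -> buckets don't line up, shuffle needed
```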
PART 6: Spark Memory Management
Understanding Executor Memory
When you request memory for an executor (say 512 MB), here’s the reality of how it gets divided:
| Memory Layer | Details |
| --- | --- |
| Requested Memory (512 MB) | What you ask for |
| JVM Usable (89% = ~455 MB) | JVM takes 11% for internal processes like Garbage Collection |
| Reserved Memory (300 MB) | Spark sets aside 300 MB for its own internals |
| Usable Memory (~155 MB) | What’s left for your actual work |
| User Memory (40%) | For data structures, UDFs, custom code |
| Spark Unified Memory (60%) | For Storage + Execution (managed by Spark) |
Minimum Memory Rule: You must request at least 1.5x the Reserved Memory (300 MB). So minimum executor memory = 450 MB!
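Working through the layer math once makes the table concrete. The 11% JVM overhead figure and the 300 MB reserve follow the numbers above; this is back-of-envelope arithmetic, not a Spark API.

```python
requested_mb = 512
jvm_usable = requested_mb * 0.89      # ~455 MB after JVM overhead
reserved = 300                        # Spark's fixed reserve
usable = jvm_usable - reserved        # ~155 MB left for your job

user_memory = usable * 0.40           # data structures, UDFs, custom code
unified_memory = usable * 0.60        # storage + execution, managed by Spark

print(int(jvm_usable))     # 455
print(int(usable))         # 155
print(int(user_memory))    # 62
print(int(unified_memory)) # 93

# Minimum-request rule: at least 1.5x the 300 MB reserve
minimum_request = 1.5 * reserved
print(int(minimum_request))  # 450
```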
Spark Unified Memory — Storage vs Execution
Spark’s Unified Memory is shared between two areas with NO hard boundary:
| Storage Memory | Execution Memory |
| --- | --- |
| Used to cache (persist) data | Used for computations: aggregations, shuffles, hash tables |
| Lower priority | Higher priority — can evict Storage |
| Can borrow from Execution if free | Can forcefully evict Storage blocks |
| CANNOT evict Execution Memory | Can evict Storage up to storageFraction limit (default 50%) |
Think of it like a shared office desk: Execution gets priority. Storage can use free space, but has to give it up when Execution needs it!
Memory Spillage — When Things Get Tight
When Spark runs out of memory, it spills data to disk. This is called spillage — and it’s expensive because:
- In-memory data = Deserialized Java Objects (fast to use)
- On-disk data = Serialized byte streams (slow to read/convert)
- Every spill = costly read/write + deserialization overhead
Spillage is a WARNING sign! It means your partitions are too large or your memory is too small. Always investigate and fix spills for better performance.
Out of Memory (OOM) Errors
OOM errors happen when Spark simply cannot fit data into memory even with spilling. Common causes:
- One partition is much larger than others (e.g., 25 MB in a 20 MB executor)
- Repartitioning or skewed joins that generate too much shuffle data
- Cross joins or Explode on nested arrays multiply data size
- Compressed files (e.g., Parquet with Snappy or Gzip codecs) expand massively once decompressed in memory
On-Heap vs Off-Heap Memory
| On-Heap Memory | Off-Heap Memory |
| --- | --- |
| Default — managed by JVM | Optional — managed by OS directly |
| Subject to Garbage Collection (GC) | No GC overhead — you manage it manually |
| Enabled by default | Disabled by default — must be configured |
| Best for most use cases | Better for large caches to avoid GC pauses |
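Off-heap memory has to be switched on explicitly. A minimal spark-submit sketch is below; the two conf keys are real Spark settings, but the 2g size and the `my_job.py` script name are placeholders, not recommendations:

```shell
spark-submit \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=2g \
  my_job.py
```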
Garbage Collection (GC) — The Memory Janitor
Garbage Collection (GC) is the JVM’s process of cleaning up objects no longer in use. While essential, it can cause problems:
- Long GC cycles can pause your Spark program completely
- You’ll see ‘GC overhead limit exceeded’ in logs
- In severe cases, can lead to OOM errors
Fix: Reduce the number of long-lived objects in your code, use off-heap memory for large caches, or tune GC settings for your workload.
QUICK REFERENCE GLOSSARY
| Term | Meaning |
| --- | --- |
| Driver | Central coordinator — breaks job into stages/tasks, assigns to executors |
| Executor | JVM process on cluster nodes — actually runs the tasks |
| Core | Single processing unit — handles exactly ONE task at a time |
| Task | Smallest unit of work — processes one data partition |
| Stage | Group of tasks separated by shuffle boundaries |
| Shuffle | Moving data across executors — triggers stage boundary |
| Partition | A chunk of data — enables parallel processing |
| DAG | Directed Acyclic Graph — the full execution plan map |
| Lazy Eval | Transformations wait for an Action before executing |
| Spillage | Writing memory data to disk when RAM is full — slow! |
| OOM | Out of Memory error — data won’t fit even with spilling |
| Bucketing | Pre-partitioning data to avoid shuffle during joins |
Summary: Spark’s power comes from distributed parallel processing. The Driver plans, Executors execute, Partitions enable parallelism, and the DAG maps the entire journey. Understanding memory management and shuffle is the key to writing efficient Spark code!