
Apache Spark Architecture

A Beginner-Friendly Guide to How Spark Really Works

Ever wondered how Spark processes millions of records in seconds? It’s not magic — it’s clever architecture. Let’s break it down together, step by step, using a fun real-world analogy that will make everything click!

PART 1: Understanding Spark with a Simple Analogy

The Marble Counting Story

Before we dive into technical terms, let’s understand the core idea of Spark through a simple story you’ll never forget.

The Marble Analogy

Imagine an instructor has a huge bag of marbles divided into pouches. Their job is to count all the marbles. Instead of counting everything alone, the instructor divides the work — assigning pouches to different groups (A, B, C, D). Each group counts their own pouches locally. Then a collector (G) gathers all local counts and reports the grand total back to the instructor.

Here’s how this maps to Spark:

Marble Story → Spark World:

  • Instructor → Driver: assigns and coordinates all the work
  • Groups (A, B, C, D) → Executors: do the actual processing
  • Counting marbles → Tasks: individual units of work
  • Local count + global count → 2 Stages: divided by a Shuffle
  • G collecting counts → Shuffle: data moves between executors
  • Writing the count on paper → Data persisted between stages

Key Insight: Whenever a Shuffle happens in a Spark job, the job gets divided into Stages. Shuffle is the boundary between stages!
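The two-stage pattern in the story can be sketched in plain Python (a toy simulation, not Spark code): each "executor" counts its own pouches locally in stage 1, then a single collector sums the partial counts in stage 2, after the "shuffle".

```python
# Toy simulation of the marble story: stage 1 = local counts, stage 2 = global sum.
pouches = [[3, 5], [2, 2, 4], [7], [1, 6]]  # 4 "partitions" of marbles

# Stage 1: each "executor" counts its own pouch independently (no data movement).
local_counts = [sum(pouch) for pouch in pouches]

# "Shuffle": the partial results travel to one place (collector G).
# Stage 2: the collector computes the grand total.
grand_total = sum(local_counts)

print(local_counts)  # [8, 8, 7, 7]
print(grand_total)   # 30
```

Notice that only the tiny partial counts cross the shuffle boundary, never the raw marbles; that is exactly why Spark tries to do as much work as possible before shuffling.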

PART 2: Core Components of Spark

Driver — The Brain of the Operation

The Driver is like your instructor. It’s the central coordinator that:

  • Maintains all information required by Executors
  • Receives the job from the user
  • Analyzes and breaks the job into Stages and Tasks
  • Assigns Tasks to Executors
  • Tracks the progress of every single task

Executors — The Workers

Executors are JVM (Java Virtual Machine) processes that run on cluster machines. Think of them as the groups counting marbles.

Real World Parallel

Imagine you have 3 Executors each with 2 Cores = 6 total Cores. All 6 Cores can run in PARALLEL, handling 6 tasks at the same time. One Core = One Task at a time. More cores = more parallel work = faster results!
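That 3-executors-by-2-cores setup could be requested like this at submit time (a hedged sketch: the script name is a placeholder, and `--num-executors` applies to YARN-style resource managers):

```shell
# Hypothetical submit command: 3 executors x 2 cores = 6 parallel tasks.
spark-submit \
  --num-executors 3 \
  --executor-cores 2 \
  --executor-memory 4g \
  my_job.py
```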

Component → What It Does:

  • Driver: coordinates the entire job, creates the execution plan
  • Executor: JVM process that runs tasks on cluster machines
  • Core: a single processing unit; runs ONE task at a time
  • Task: smallest unit of work, applied to one data partition
  • Stage: group of tasks that run without a shuffle in between

Partitions — Spark’s Secret Weapon

To allow Executors to work in parallel, Spark breaks your input data into smaller chunks called Partitions.

Think of it like this: If you have 1 big pizza and 4 people, you slice it into 4 pieces so everyone can eat at the same time. Spark slices your data into partitions so all cores can process simultaneously!

Example: 4 data partitions → 4 tasks run in parallel → up to 4x faster processing!

PART 3: How Spark Plans Your Job

Lazy Evaluation — Spark Waits Before Acting

Here’s something surprising about Spark — it doesn’t actually do anything when you write transformations. It waits!

The Bread Analogy

Imagine ordering bread at a bakery. You say ‘brown bread, sliced, toasted’. The baker doesn’t start immediately — they WAIT until you actually pay (the action). Why? Because by knowing the full order upfront, they can plan the most efficient way to make it. Spark works the same way!

This is called Lazy Evaluation:

  • Transformations are recorded but NOT executed
  • When an Action is called (like .show(), .count()), Spark executes everything
  • This allows Spark’s optimizer to plan the most efficient execution
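Python generators show the same idea in miniature (a plain-Python analogy, not Spark code): the function body does not run when you build the pipeline, only when you consume it.

```python
# Plain-Python analogy for lazy evaluation (not Spark code).
log = []

def transform(values):
    # A generator: this body does NOT run when transform() is called.
    for v in values:
        log.append(v)        # side effect so we can observe when work happens
        yield v * 2

lazy = transform([1, 2, 3])  # "transformation": recorded, nothing executed
print(log)                   # [] -- no work done yet

result = list(lazy)          # "action": now the whole pipeline runs
print(result)                # [2, 4, 6]
print(log)                   # [1, 2, 3]
```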

Narrow vs Wide Transformations

Not all transformations are created equal. Spark classifies them into two types:

Narrow Transformations:

  • No shuffle needed
  • Each partition is processed independently
  • Fast: work stays within the executor
  • Examples: select(), filter(), withColumn(), drop()

Wide Transformations:

  • Require a shuffle (data movement)
  • Data must be grouped across partitions
  • Slower: they create a new Stage boundary
  • Examples: groupBy(), join(), cube(), rollup(), agg()

Pro Tip: Wide transformations cause Shuffles which create Stage boundaries. Minimize wide transformations where possible to improve performance!
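The difference can be sketched in plain Python over a list of "partitions" (a toy model, not Spark's implementation): a narrow transformation handles each partition on its own, while a wide one like groupBy must first bring matching keys together from all partitions.

```python
from collections import defaultdict

partitions = [[("NY", 1), ("LA", 2)], [("NY", 3), ("SF", 4)]]

# Narrow (filter-like): each partition is processed independently;
# partition boundaries are preserved and no data moves between them.
narrow = [[(k, v) for (k, v) in part if v > 1] for part in partitions]

# Wide (groupBy-like): rows with the same key may live in different
# partitions, so they must be brought together first -- the "shuffle".
groups = defaultdict(list)
for part in partitions:
    for k, v in part:
        groups[k].append(v)
wide = {k: sum(vs) for k, vs in groups.items()}

print(narrow)  # [[('LA', 2)], [('NY', 3), ('SF', 4)]]
print(wide)    # {'NY': 4, 'LA': 2, 'SF': 4}
```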

Spark’s 2-Phase Planning Process

When you submit a Spark job, it goes through two planning phases before any data is touched:

Phase 1: Logical Planning

  • Your code starts as an Unresolved Logical Plan
  • Spark validates column names and table names against the Catalog
  • Creates a Resolved Logical Plan
  • Catalyst Optimizer steps in and optimizes the plan
  • Final output: Optimized Logical Plan (the Logical DAG)

Phase 2: Physical Planning

  • Spark generates multiple Physical Plans based on cluster configuration
  • Each plan is run against a Cost Model
  • Best Physical Plan is selected
  • Plan is sent to the cluster for execution
  • Executors run the plan against data partitions

DAG — The Execution Roadmap

DAG = Directed Acyclic Graph — It’s the logical execution plan of a Spark job. Think of it as your GPS route — Spark follows this map to process your data from start to finish.

PART 4: DataFrames & Immutability

What Makes DataFrames Special?

Spark DataFrames are immutable — meaning once created, you CANNOT change them. But you CAN create a new DataFrame from an existing one with your changes applied.

The Stamp Analogy

Think of a DataFrame like a stamped document. You can’t change the original stamp, but you can make a photocopy and stamp THAT copy differently. Each transformation creates a new ‘photocopy’ — the original stays intact.

This immutability is what makes Spark fault-tolerant. If something fails, Spark can always recreate the data from the original source by replaying the transformations!
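The copy-on-transform idea can be shown in plain Python (a toy model; `with_column` is a made-up stand-in for DataFrame.withColumn): the transformation returns a new collection and leaves the original untouched.

```python
def with_column(rows, name, fn):
    """Toy stand-in for a DataFrame transformation: returns NEW rows,
    never mutates the input."""
    return [{**row, name: fn(row)} for row in rows]

df = [{"salary": 100}, {"salary": 200}]
df2 = with_column(df, "bonus", lambda r: r["salary"] * 0.1)

print(df)   # [{'salary': 100}, {'salary': 200}]  -- original intact
print(df2)  # [{'salary': 100, 'bonus': 10.0}, {'salary': 200, 'bonus': 20.0}]
```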

PART 5: How Spark Reads & Writes Data

Writing Data — The Partition File System

When Spark writes data, here’s exactly what happens:

  • Input file has 4 partitions → 4 tasks execute in parallel
  • Each Core processes one partition
  • Each task writes its own output file (part-1, part-2, part-3, part-4)
  • A folder is created (e.g., emp.csv/) containing all part files

So if you open an output folder and see multiple files like part-00000, part-00001 etc. — that’s normal! Each file = one task’s output. Total files = number of partitions.
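The write pattern can be sketched in plain Python (a toy model: the filenames simply mimic Spark's part-file convention): each "task" writes exactly one file into the output folder.

```python
import os
import tempfile

partitions = [["a", "b"], ["c"], ["d", "e"], ["f"]]       # 4 partitions
out_dir = os.path.join(tempfile.mkdtemp(), "emp.csv")     # Spark outputs a FOLDER
os.makedirs(out_dir)

# Each "task" writes one part file for its own partition.
for i, part in enumerate(partitions):
    with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
        f.write("\n".join(part))

print(sorted(os.listdir(out_dir)))
# ['part-00000', 'part-00001', 'part-00002', 'part-00003']
```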

Joining Data — The Shuffle Problem

Joining two tables in Spark is more complex than it seems. Here’s the challenge:

Suppose you have Sales and City tables. Data is read in partitions across executors. But the same City ID might be in different executors — and you can’t join data that’s in different places!

Spark’s Solution: SHUFFLE — move data so that matching City IDs land on the same executor. This is why joins cause shuffles and create Stage boundaries.
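A toy hash shuffle in plain Python (not Spark internals) shows the idea: rows from both tables are rerouted so every row with the same city ID lands on the same target "executor", which can then join purely locally.

```python
NUM_EXECUTORS = 2

def route(rows):
    """Toy shuffle: send each row to executor hash(city_id) % NUM_EXECUTORS."""
    targets = [[] for _ in range(NUM_EXECUTORS)]
    for row in rows:
        targets[hash(row[0]) % NUM_EXECUTORS].append(row)
    return targets

sales = [(101, "order-a"), (102, "order-b"), (101, "order-c")]
cities = [(101, "Delhi"), (102, "Mumbai")]

shuffled_sales, shuffled_cities = route(sales), route(cities)

# After the shuffle, each executor joins ONLY its own slice:
# matching city IDs are guaranteed to be co-located.
joined = []
for s_part, c_part in zip(shuffled_sales, shuffled_cities):
    city_names = dict(c_part)
    joined += [(cid, order, city_names[cid]) for cid, order in s_part]

print(sorted(joined))
# [(101, 'order-a', 'Delhi'), (101, 'order-c', 'Delhi'), (102, 'order-b', 'Mumbai')]
```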

Bucketing — A Smarter Way to Join

Bucketing is Spark’s way to avoid shuffle on repeated joins. Here’s how it works:

  • Pre-group data into a fixed number of buckets using a hash function (Murmur)
  • Same key value always goes to the same bucket
  • When joining bucketed tables, Spark reads matching buckets together — no shuffle needed!

Important Rule: Both tables must have the SAME number of buckets for shuffle-free joins to work!
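Same-key-same-bucket can be sketched with a simple hash in Python (Spark really uses Murmur3; the built-in hash() stands in for it here): because both tables were bucketed with the same function and the same bucket count, bucket i of one table only ever needs bucket i of the other.

```python
NUM_BUCKETS = 4

def bucket_of(key):
    # Spark uses Murmur3 hashing; Python's hash() stands in for it here.
    return hash(key) % NUM_BUCKETS

# Bucket both tables up front (this is the one-time bucketing cost).
orders = {b: [] for b in range(NUM_BUCKETS)}
for city_id, order in [(101, "o1"), (102, "o2"), (105, "o3")]:
    orders[bucket_of(city_id)].append((city_id, order))

cities = {b: [] for b in range(NUM_BUCKETS)}
for city_id, name in [(101, "Delhi"), (102, "Mumbai"), (105, "Pune")]:
    cities[bucket_of(city_id)].append((city_id, name))

# Join only matching buckets: bucket i of orders with bucket i of cities.
# No data needs to move between buckets -- no shuffle.
joined = []
for b in range(NUM_BUCKETS):
    names = dict(cities[b])
    joined += [(cid, o, names[cid]) for cid, o in orders[b]]

print(sorted(joined))
# [(101, 'o1', 'Delhi'), (102, 'o2', 'Mumbai'), (105, 'o3', 'Pune')]
```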

PART 6: Spark Memory Management

Understanding Executor Memory

When you request memory for an executor (say 512 MB), here’s the reality of how it gets divided:

Memory Layer → Details:

  • Requested Memory (512 MB): what you ask for
  • JVM Usable (89% = ~455 MB): the JVM takes 11% for internal processes like Garbage Collection
  • Reserved Memory (300 MB): Spark sets aside 300 MB for its own internals
  • Usable Memory (~155 MB): what is left for your actual work
  • User Memory (40% of usable): data structures, UDFs, custom code
  • Spark Unified Memory (60% of usable): Storage + Execution, managed by Spark

Minimum Memory Rule: You must request at least 1.5x the Reserved Memory (300 MB). So minimum executor memory = 450 MB!
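The arithmetic above, worked through in Python (using the article's example numbers; exact fractions vary by Spark version and configuration):

```python
requested_mb = 512
jvm_usable = requested_mb * 0.89      # JVM keeps ~11% for itself (e.g., GC)
usable = jvm_usable - 300             # minus Spark's 300 MB reserved memory
user_memory = usable * 0.40           # UDFs, custom data structures
unified_memory = usable * 0.60        # storage + execution, managed by Spark

print(round(jvm_usable))              # ~456
print(round(usable))                  # ~156
print(round(unified_memory))          # ~93

# Minimum-memory rule: at least 1.5x the 300 MB reserved memory.
min_executor_mb = 1.5 * 300           # 450 MB
```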

Spark Unified Memory — Storage vs Execution

Spark’s Unified Memory is shared between two areas with NO hard boundary:

Storage Memory:

  • Used to cache (persist) data
  • Lower priority
  • Can borrow free space from Execution
  • CANNOT evict Execution Memory

Execution Memory:

  • Used for computations: aggregations, shuffles, hash tables
  • Higher priority: can evict Storage
  • Can forcefully evict Storage blocks, down to the storageFraction limit (default 50%)

Think of it like a shared office desk: Execution gets priority. Storage can use free space, but has to give it up when Execution needs it!
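These proportions are controlled by two real Spark settings, shown here with their default values (a config sketch, not something you usually need to change):

```properties
spark.memory.fraction         0.6   # unified (storage + execution) share of usable heap
spark.memory.storageFraction  0.5   # portion of unified memory protected from eviction
```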

Memory Spillage — When Things Get Tight

When Spark runs out of memory, it spills data to disk. This is called spillage — and it’s expensive because:

  • In-memory data = Deserialized Java Objects (fast to use)
  • On-disk data = Serialized byte streams (slow to read/convert)
  • Every spill = costly read/write + deserialization overhead

Spillage is a WARNING sign! It means your partitions are too large or your memory is too small. Always investigate and fix spills for better performance.

Out of Memory (OOM) Errors

OOM errors happen when Spark simply cannot fit data into memory even with spilling. Common causes:

  • One partition is much larger than the others (e.g., a 25 MB partition in a 20 MB executor)
  • repartition() or skewed joins generating too much shuffle data
  • Cross joins, or explode() on nested arrays, multiplying the data size
  • Compressed files (e.g., Parquet with Snappy or Gzip) expanding massively once decompressed in memory

On-Heap vs Off-Heap Memory

On-Heap Memory:

  • Default: managed by the JVM
  • Subject to Garbage Collection (GC)
  • Enabled by default
  • Best for most use cases

Off-Heap Memory:

  • Optional: managed directly by the OS
  • No GC overhead; you manage it manually
  • Disabled by default; must be configured
  • Better for large caches, avoiding GC pauses
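Off-heap memory is enabled with two real Spark settings (the 2g size is just an illustrative value; it must be set explicitly whenever off-heap is enabled):

```properties
spark.memory.offHeap.enabled  true
spark.memory.offHeap.size     2g
```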

Garbage Collection (GC) — The Memory Janitor

Garbage Collection (GC) is the JVM’s process of cleaning up objects no longer in use. While essential, it can cause problems:

  • Long GC cycles can pause your Spark program completely
  • You’ll see ‘GC overhead limit exceeded’ in logs
  • In severe cases, can lead to OOM errors

Fix: Reduce the number of long-lived objects in your code, use off-heap memory for large caches, or tune GC settings for your workload.

QUICK REFERENCE GLOSSARY

Driver Central coordinator — breaks job into stages/tasks, assigns to executors
Executor JVM process on cluster nodes — actually runs the tasks
Core Single processing unit — handles exactly ONE task at a time
Task Smallest unit of work — processes one data partition
Stage Group of tasks separated by shuffle boundaries
Shuffle Moving data across executors — triggers stage boundary
Partition A chunk of data — enables parallel processing
DAG Directed Acyclic Graph — the full execution plan map
Lazy Eval Transformations wait for an Action before executing
Spillage Writing memory data to disk when RAM is full — slow!
OOM Out of Memory error — data won’t fit even with spilling
Bucketing Pre-partitioning data to avoid shuffle during joins

Summary: Spark’s power comes from distributed parallel processing. The Driver plans, Executors execute, Partitions enable parallelism, and the DAG maps the entire journey. Understanding memory management and shuffle is the key to writing efficient Spark code!

Author

Shivam Kumar Singh
