[← BACK_TO_INDEX]
May 21, 20265 min read

How Database Indexing Works under the Hood: B-Trees and Hash Indexes Made Simple

Database Indexing

Imagine walking into a massive library containing 10 million books. You want to find a book named “The Cyberpunk Handbook”.

If the library had no organization, you would have to start at the first shelf of the first row and look at every single cover, one by one. This is a Full Table Scan. It is slow, tedious, and would take weeks.

Instead, you walk over to the Library Catalog (or search system). You type in the title, and it instantly gives you an exact coordinate: Row 42, Shelf B, Book 12. You walk directly there and grab it.

In the world of databases, this catalog is called an Index.

Database indexes make queries blazing fast. But how do they work under the hood? Here at Curious Folk, we want to make indexing crystal clear. In this article, we’ll explore the cost of unindexed queries, deep dive into the two most common index data structures—B-Trees and Hash Indexes—and see how they dramatically improve SQL query times.


The Problem: The Dreaded O(N) Scan

When you create a simple database table, the rows are written to disk blocks in the order they are inserted (like a pile of loose sheets).

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    age INT
);

If you have 10 million rows, and you run this query:

SELECT * FROM users WHERE age = 29;

Without an index, the database engine must load every single disk block containing the users table into memory and check if age = 29. In computer science terms, this is O(n) Time Complexity (Linear Time), where the query duration increases proportionally with the size of the table.

To avoid this, we create an index on the age column:

CREATE INDEX idx_users_age ON users(age);

By doing this, the database creates a separate, highly structured helper file that acts as our library catalog.


1. B-Trees: The Swiss Army Knife of Indexes

The default index type in almost every relational database (PostgreSQL, MySQL, Oracle, SQLite) is the B-Tree (Balanced Tree).

A B-Tree is a self-balancing search tree designed to work efficiently with system storage (disk blocks). Unlike a binary search tree (where each node has at most 2 children), a B-Tree node is wide and can have hundreds of children.

Structure of a B-Tree

                  [ Root Node: 50 ]
                 /                 \
       [ Branch Node: 20 ]       [ Branch Node: 80 ]
       /        |        \       /        |        \
  [10 | 15]  [30 | 40]  [..]  [60 | 70]  [85 | 95]  [..]  <-- Leaf Nodes (Points to Row ID)
  1. Root Node: The entry point of the search.
  2. Internal/Branch Nodes: Intermediate nodes that direct the search path.
  3. Leaf Nodes: The bottom-most layer. These nodes store the actual index value (e.g., age 29) along with a pointer (Row ID) to the physical location of the row on disk.

The Search Process

If we search for age = 40:

  1. Start at the Root: Compare 40 to 50. Since 40 < 50, go to the left child node.
  2. At the Branch Node (20): Compare 40. Since 40 > 20, go to the right branch pointer.
  3. Arrive at Leaf Node [30 | 40]: We find the key 40, follow its pointer directly to the disk block, and retrieve the row.

With 10 million records, a binary tree would require ~24 comparisons (steps). A B-Tree with a fan-out of 100 children per node requires only 3 to 4 disk lookups! This turns our query from O(n) into O(log n) (Logarithmic Time).

Why B-Trees are Disk-Block Friendly

Databases read data from disks in fixed chunks called Blocks or Pages (typically 4KB to 8KB). Loading a block from a physical disk is slow. B-Trees are structured so that a single B-Tree node fits exactly inside a disk page. When the database reads a node, it gets 100+ keys and child pointers in a single, fast read operation, minimizing slow input/output (I/O) delays.


2. Hash Indexes: Instant Lookups

While B-Trees are the standard, some databases support Hash Indexes.

A Hash Index uses a Hash Table data structure. It takes the indexed column value, runs it through a cryptographic or mathematical hashing function, and maps it directly to a specific slot (bucket) containing the Row ID.

  Key (Age)      Hash Function       Buckets / Slots       Physical Row ID
  [ 29 ] ------> hash(29) % 5 ----> [ Slot 0 ] ----------> Row ID: #4012
  [ 35 ] ------> hash(35) % 5 ----> [ Slot 1 ] ----------> Row ID: #2155
  [ 18 ] ------> hash(18) % 5 ----> [ Slot 2 ] ----------> Row ID: #9887

The Power of Hash Indexes:

  • Instant Search: It achieves O(1) Time Complexity (Constant Time). Whether your table has 10 rows or 10 billion, finding a specific value takes exactly the same amount of time.

The Limitations of Hash Indexes (Why B-Trees Win):

Despite the O(1) speed, Hash Indexes are rarely the default because:

  • No Range Queries: You cannot search for range limits:
    SELECT * FROM users WHERE age > 25;
    Since hash values are scattered randomly, the index cannot find values “greater than” another key. B-Trees keep keys sorted, making range queries incredibly fast.
  • No Sorting: Hash indexes cannot speed up ORDER BY operations.
  • Collisions: If two different keys produce the same hash value, they end up in the same slot (a collision), requiring a slower list scan to resolve.

Database Indexing in Action (SQL)

Let’s look at how database indexes affect execution plans using PostgreSQL.

1. The Setup (No Index)

Suppose we have a payments table with 1,000,000 transactions. We want to search for transaction #854002.

EXPLAIN ANALYZE 
SELECT * FROM payments WHERE transaction_code = 'TX-854002';

Output:

Seq Scan on payments  (cost=0.00..18450.00 rows=1 width=45) (actual time=48.212..78.950 rows=1 loops=1)
  Filter: (transaction_code = 'TX-854002'::text)
  Rows Removed by Filter: 999999
Planning Time: 0.082 ms
Execution Time: 79.110 ms

Notice Seq Scan (Sequential/Full Table Scan). The database scanned all 1,000,000 rows and took 79.1 ms.

2. The Setup (With Index)

Now, let’s create a B-Tree index on transaction_code.

CREATE INDEX idx_payments_code ON payments(transaction_code);

Let’s run the exact same query:

EXPLAIN ANALYZE 
SELECT * FROM payments WHERE transaction_code = 'TX-854002';

Output:

Index Scan using idx_payments_code on payments  (cost=0.42..8.44 rows=1 width=45) (actual time=0.045..0.047 rows=1 loops=1)
  Index Cond: (transaction_code = 'TX-854002'::text)
Planning Time: 0.120 ms
Execution Time: 0.065 ms

The database engine swapped to an Index Scan using our new index idx_payments_code. The execution time went from 79.11 ms to 0.065 ms—over 1200x faster!


The Catch: There is No Free Lunch

If indexes are so fast, why don’t we index every single column in our database?

Creating an index comes with two major costs:

  1. Storage Overhead: Every index is a separate data structure stored on disk. If you have 5 indexes on a table, you might end up using more disk space for the indexes than the actual data.
  2. Write Performance Penalty: Every time you perform an INSERT, UPDATE, or DELETE, the database must not only modify the table rows but also recalculate and write to all affected index trees. Over-indexing makes updates and inserts very slow.

Summary Strategy Checklist

Use this quick checklist to decide when to index:

  • DO Index:
    • Columns frequently used in WHERE clauses (e.g., email, status).
    • Columns used in table joins (JOIN keys, Foreign Keys).
    • Columns used for ordering/sorting (ORDER BY columns).
  • DO NOT Index:
    • Tables with small datasets (a Sequential Scan on 100 rows is faster than searching an index tree).
    • Columns with very low cardinality (e.g., a boolean is_active where 95% of rows are true).
    • Tables with heavy write loads where read performance is not a bottleneck.

Understanding these structural trade-offs is the difference between a sluggish application and a high-performance system that scales smoothly to millions of users! 🚀