A technical guide to database indexes: B-tree internals, composite index column ordering, covering indexes, partial indexes, the write cost of over-indexing, EXPLAIN ANALYZE interpretation, and the common indexing mistakes that degrade production performance.
An index is one of those database concepts that seems simple on the surface — "it makes queries faster" — until you run into the production incident caused by an index missing, or the one caused by an index that shouldn't exist. This guide covers how indexes actually work, why they cost something to maintain, and the decisions that determine whether an index helps or hurts.
The phone book analogy is accurate: a database index is a separate data structure that maps column values to the physical locations of the rows that contain those values. Without an index, finding all users with email = 'alice@example.com' requires reading every row in the users table — a sequential scan. With an index on email, Postgres can jump directly to the relevant rows.
But the phone book analogy only goes so far. Understanding the actual data structure — the B-tree — explains the behavior that matters for query optimization.
The default Postgres index type is a B-tree (balanced tree). A B-tree is a sorted, self-balancing tree structure. The leaf nodes contain the indexed column values and pointers to the corresponding heap pages (where the actual row data lives). Internal nodes contain separator keys that guide tree traversal.
                  [50]
                 /    \
            [25]        [75]
           /    \      /    \
    [10,20] [30,40] [60,70] [80,90]
To find all rows where id = 30: start at the root, compare against the separator keys, traverse to the correct leaf, find the heap pointer, fetch the row. Lookup cost grows only logarithmically with table size — because each B-tree page holds hundreds of keys, a table with 10 million rows has a B-tree only 3-4 levels deep. Without an index, finding those rows requires reading all 10 million.
For range queries (WHERE created_at > '2026-01-01'), the B-tree is also efficient: find the leaf node for the lower bound, then scan leaf nodes sequentially in sort order until the upper bound. Leaf nodes are linked for exactly this purpose.
For equality and range queries on the indexed column, B-trees are essentially optimal. They don't help with pattern matching (LIKE '%foo%' — leading wildcard), full-text search, or nearest-neighbor queries. Those use different index types: GIN (full-text, trigram pattern matching) and GiST (nearest-neighbor, geometric data). BRIN serves a separate niche: very large tables whose physical row order correlates with a column, such as append-only time series.
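As a sketch of one of those alternatives (the products table and name column are hypothetical): the pg_trgm extension, which ships with Postgres contrib, plus a GIN index makes leading-wildcard LIKE searches indexable.

```sql
-- Trigram support for pattern matching (Postgres contrib extension)
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN index over trigrams of the name column
CREATE INDEX idx_products_name_trgm
    ON products USING gin (name gin_trgm_ops);

-- This query can now use the index despite the leading wildcard,
-- which a plain B-tree index on name could not serve:
SELECT * FROM products WHERE name LIKE '%widget%';
```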
Postgres doesn't blindly use an index when one exists. The query planner estimates the cost of several execution plans and picks the cheapest one. Cost is measured in arbitrary units (roughly proportional to I/O operations).
For a sequential scan, the cost is approximately (pages * seq_page_cost) + (rows * cpu_tuple_cost).
For an index scan, the cost is approximately (index_pages * random_page_cost) + (matching_rows * cpu_index_tuple_cost) + (matching_rows * random_page_cost).
The critical factor: random_page_cost vs seq_page_cost. Random I/O (jumping around disk for index scans) is historically more expensive than sequential I/O. In Postgres, random_page_cost defaults to 4 and seq_page_cost defaults to 1. On SSDs, random reads are far cheaper — many teams lower random_page_cost to 1.1 to get more accurate planner estimates.
This explains why the planner sometimes ignores an index even when one exists: if an index scan would touch 30% of the table's rows, the sequential scan (which can read pages in order, amortizing I/O) is genuinely cheaper.
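The cost settings above can be inspected and adjusted directly; a sketch, using the SSD adjustment mentioned earlier:

```sql
-- Inspect current planner cost settings
SHOW random_page_cost;  -- default: 4
SHOW seq_page_cost;     -- default: 1

-- Lower random_page_cost cluster-wide (requires superuser;
-- picked up after a config reload)
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

-- Or set it per-session to test the effect on a specific plan
SET random_page_cost = 1.1;
```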
Indexes are not free. Every index is a separate data structure that must stay consistent with the table it indexes.
Write amplification: An INSERT into a table with 5 indexes requires 6 write operations: the row itself plus one entry per index. A DELETE requires marking the row dead in the heap, leaving dead index entries to be cleaned up later. An UPDATE that changes an indexed column disqualifies the cheap "HOT" (heap-only tuple) path and forces new entries in every index on the table, not just the one whose column changed. High write throughput on a heavily-indexed table degrades noticeably.
Storage: Indexes consume disk space. A B-tree index on a UUID column in a table with 10 million rows occupies roughly 200-400MB depending on index fill factor and bloat. On write-heavy tables with many partial deletes, this grows.
Index bloat: When rows are deleted or updated, the old index entries aren't immediately removed — they're marked as dead tuples. VACUUM reclaims them, but between VACUUM runs (or if VACUUM can't keep up with write rate), the index grows with dead entries that must be traversed and skipped. A bloated index is slower than a lean one.
Lock during creation: CREATE INDEX without CONCURRENTLY takes a SHARE lock that blocks writes (reads still proceed) for the duration of the build. On a production table, always use CREATE INDEX CONCURRENTLY.
A composite index covers multiple columns. A (last_name, first_name) index lets you find all Smiths efficiently, and within Smiths, find 'John Smith' efficiently.
The critical rule: a composite index can only be used from left to right. An index on (a, b, c) can serve queries that filter on a, on a and b, or on a, b, and c. It cannot serve queries that filter only on b, only on c, or on b and c without a. This is because the index is sorted by a first, then by b within each a value, then by c within each b value. Without fixing a, the b values are scattered throughout the index — there's no contiguous range to scan.
Practical consequence: if some queries filter by user_id alone and others by both user_id and status, an index on (user_id, status) serves both patterns. An index on (status, user_id) still serves the combined query (both columns are fixed by equality), but it cannot serve the user_id-only queries.
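A minimal sketch of this pattern, assuming an orders table with user_id and status columns (the status value is illustrative):

```sql
-- One composite index serves both query shapes
CREATE INDEX idx_orders_user_status ON orders (user_id, status);

-- Uses the index: leftmost column fixed
SELECT * FROM orders WHERE user_id = 'usr_abc123';

-- Uses the index: both columns fixed
SELECT * FROM orders
WHERE user_id = 'usr_abc123' AND status = 'shipped';

-- Cannot use it efficiently: leftmost column missing
SELECT * FROM orders WHERE status = 'shipped';
```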
When choosing column order in composite indexes, put equality-filtered columns before range-filtered columns: (user_id, created_at) for queries like WHERE user_id = ? AND created_at > ? — the planner finds all rows for that user, then scans the date range within that set.

An index "covers" a query when all the columns the query needs are present in the index itself. In that case, Postgres can return results directly from the index without touching the heap at all — an "index-only scan."
-- Query
SELECT user_id, email FROM users WHERE tenant_id = 'abc' AND active = true;
-- Covering index
CREATE INDEX idx_users_tenant_active ON users (tenant_id, active) INCLUDE (email, user_id);

The INCLUDE clause (available since Postgres 11) adds non-key columns to the index leaf pages without including them in the sort key. They can't be used for filtering, but they eliminate the heap lookup when every column the SELECT needs is present in the index.
Index-only scans are significantly faster on large tables with cold heap pages because they avoid random I/O to the heap entirely. They also depend on the visibility map: recently modified pages still require heap visibility checks, so a well-vacuumed table matters. The cost: the index is larger, since it stores the included column data.
A partial index only covers rows matching a WHERE condition. This is powerful for tables with a predictable distribution of values.
-- Only index unprocessed jobs
CREATE INDEX idx_jobs_pending ON jobs (created_at) WHERE status = 'pending';

If your jobs table has 10 million rows but only 500 are pending at any time, this index is tiny — it covers 500 rows instead of 10 million. Queries that filter WHERE status = 'pending' can use this index; queries that don't include that condition cannot.
Common uses: filtering out soft-deleted records, indexing only active or unprocessed items, indexing only rows belonging to a particular environment (WHERE env = 'production').
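Sketches of the first and last patterns above (table, column, and index names are illustrative):

```sql
-- Soft deletes: index only live rows, and enforce uniqueness
-- only among them (deleted rows may reuse an email)
CREATE UNIQUE INDEX idx_users_email_live
    ON users (email)
    WHERE deleted_at IS NULL;

-- Environment filter: index only production rows
CREATE INDEX idx_events_created_prod
    ON events (created_at)
    WHERE env = 'production';
```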
A boolean column has two values. If your active column is true for 90% of rows, an index on it has poor selectivity. Filtering WHERE active = false finds 10% of rows and can benefit from an index, but filtering WHERE active = true still touches 90% of the table, which a sequential scan handles faster. Standard B-tree indexes on booleans rarely help, and they always cost write overhead.
Partial indexes handle this better: index only the minority values if you query them specifically.
Tables under a few thousand rows are often faster to scan sequentially. The overhead of an index lookup (traversing the B-tree, following heap pointers) exceeds the cost of just reading the whole table. Postgres's planner usually makes the right call here automatically, but having unnecessary indexes still adds write overhead.
If a table receives thousands of inserts per second (event logs, metrics, audit trails), adding indexes meaningfully reduces write throughput. Profile the write impact before indexing. Consider whether partial indexes (covering only the most-queried subset) offer a better tradeoff.
EXPLAIN ANALYZE executes the query and shows the actual execution plan with timing. Understanding it is the most practical skill for index debugging.
Here's a before/after example — a query on a 5-million-row orders table filtering by user_id:
Before (no index on user_id):
EXPLAIN ANALYZE
SELECT * FROM orders WHERE user_id = 'usr_abc123';
Seq Scan on orders (cost=0.00..187432.00 rows=47 width=312)
(actual time=0.842..2341.187 rows=47 loops=1)
Filter: (user_id = 'usr_abc123'::text)
Rows Removed by Filter: 4999953
Planning Time: 0.112 ms
Execution Time: 2341.293 ms
The key signals: Seq Scan (reading every row), Rows Removed by Filter: 4999953 (scanned 5 million rows to find 47), execution time 2.3 seconds.
After (index on user_id):
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);

EXPLAIN ANALYZE
SELECT * FROM orders WHERE user_id = 'usr_abc123';
Index Scan using idx_orders_user_id on orders
(cost=0.56..412.33 rows=47 width=312)
(actual time=0.041..0.387 rows=47 loops=1)
Index Cond: (user_id = 'usr_abc123'::text)
Planning Time: 0.198 ms
Execution Time: 0.421 ms
Index Scan — used the index. Cost dropped from 187,432 to 412. Execution time from 2341ms to 0.4ms. The planner estimated 47 rows and found 47 — statistics are accurate.
What to look for in EXPLAIN ANALYZE:
| Signal | What it means |
|---|---|
| Seq Scan with high row count | Candidate for indexing |
| Rows Removed by Filter >> rows returned | Index would help |
| Index Scan but still slow | Index exists but may be bloated or poorly chosen |
| Bitmap Heap Scan | Index used but result set moderately large; heap access batched |
| cost=X..Y where Y is high | Expensive plan |
| Actual rows >> estimated rows | Stale statistics; run ANALYZE |
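One practical addition: the BUFFERS option reports shared-buffer hits versus actual disk reads per plan node, which distinguishes a genuinely I/O-bound scan from one served entirely from cache. Reusing the earlier query:

```sql
-- BUFFERS adds lines like "Buffers: shared hit=12 read=340"
-- to each plan node; high "read" counts mean real disk I/O
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 'usr_abc123';
```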
Dead index entries accumulate over time, especially on frequently updated tables. Detect bloat:
-- Index sizes and usage counters: a starting point for
-- spotting oversized or rarely-used indexes
SELECT
schemaname,
tablename,
indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;

For detailed bloat analysis, the pgstattuple extension gives accurate dead tuple counts. An index with more than 20-30% dead tuples is worth rebuilding.
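A sketch of the pgstattuple approach, reusing the index name from the earlier example:

```sql
-- pgstattuple ships with Postgres contrib
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Low avg_leaf_density or high leaf_fragmentation
-- suggests the index is worth rebuilding
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('idx_orders_user_id');
```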
-- Rebuild without locking writes
REINDEX INDEX CONCURRENTLY idx_orders_user_id;  -- Postgres 12+

Indexing foreign keys — or not. Postgres does not automatically create indexes on foreign key columns. An unindexed foreign key means that checking referential integrity on delete/update of the parent row scans the entire child table. On large tables, this is catastrophic. Always index foreign keys.
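A sketch of the fix, assuming a hypothetical order_items table referencing orders:

```sql
-- The constraint indexes the referenced side (orders.id, via its
-- primary key) but creates nothing on the referencing side
ALTER TABLE order_items
    ADD CONSTRAINT fk_order_items_order
    FOREIGN KEY (order_id) REFERENCES orders (id);

-- Add the child-side index yourself, so deleting a parent row
-- doesn't scan all of order_items
CREATE INDEX CONCURRENTLY idx_order_items_order_id
    ON order_items (order_id);
```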
Indexing every column in a WHERE clause individually. Separate indexes on (user_id) and (status) don't combine well for WHERE user_id = ? AND status = ?. Postgres can do a bitmap index scan that ANDs the two results, but a composite (user_id, status) index is strictly better for that query pattern.
Forgetting CONCURRENTLY on production. CREATE INDEX without CONCURRENTLY takes a lock that blocks writes for the duration. On a table receiving active traffic, this means downtime. Always use CREATE INDEX CONCURRENTLY in production.
Functions in the WHERE clause. WHERE lower(email) = ? does not use a plain index on email — the function call prevents index use. Fix: create an expression index — CREATE INDEX ON users (lower(email)) — or, better, normalize the data at write time and store it lowercased.
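A sketch of the expression-index fix (index name is illustrative); the query must use the same expression the index was built on:

```sql
CREATE INDEX idx_users_email_lower ON users (lower(email));

-- Can use the index: expression matches
SELECT * FROM users WHERE lower(email) = 'alice@example.com';

-- Cannot use it: different expression (and no plain index
-- on email exists here)
SELECT * FROM users WHERE email = 'alice@example.com';
```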
Not running ANALYZE after bulk loads. Postgres's query planner relies on statistics gathered by ANALYZE. After bulk inserting millions of rows, the planner's estimates may be wildly off until ANALYZE runs. Run it explicitly after bulk operations: ANALYZE orders;.
If your application is running slow and you're not sure where the problem lives — missing indexes, N+1 queries, table bloat, connection pool exhaustion — a Hunchbite technical audit can identify the issues and prioritize the fixes.
Call +91 90358 61690 · Book a free call · Contact form
If this guide resonated with your situation, let's talk. We offer a free 30-minute discovery call — no pitch, just honest advice on your specific project.