Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Data Structure

A data structure is the in-memory arrangement of values that decides which of a program’s operations are cheap and which are expensive; the word is what lets a team specify the arrangement they want before the agent picks one for them.

Concept

Vocabulary that names a phenomenon.

What It Is

A data structure is the way a running program arranges values in memory so that the operations the program needs to perform on those values are fast enough to be practical. A million customer records can be held as an unordered pile; they can be held as a sorted array; they can be held as a hash map keyed by customer ID; they can be held as a tree indexed by signup date. The records are the same records in every case. The structure is the choice about what is cheap and what is expensive to do with them.

The word covers two layers that practitioners keep separate at their peril:

  • Abstract data type (ADT). The contract: what operations are defined on the values, and what those operations promise. “A set supports add, remove, and contains, and never holds two equal elements.” “A queue supports enqueue and dequeue, and dequeue returns elements in the order they were enqueued.” The ADT names what the structure does without saying how.
  • Concrete representation. The implementation: the bytes, pointers, and arrays that actually live in memory and that the ADT operations run against. A set can be implemented as a hash table (O(1) average contains), as a sorted array (O(log n) contains, O(n) add), as a balanced binary tree (O(log n) for both), as a Bloom filter (O(1) contains with false positives, no remove). Same ADT contract; very different cost profiles and very different bugs when used carelessly.

The standard library of every modern language ships an opinionated set of these structures so a working engineer doesn’t implement them from scratch. The names cluster into a small canonical family that every practitioner should be able to recognize on sight:

  • Array / list. Ordered sequence indexed by position. Cheap to read at a known index, cheap to iterate, expensive to search and to insert in the middle. The default container; the fallback when nothing else has been chosen on purpose.
  • Hash map / dictionary / associative array. Map from keys to values. Cheap to look up, insert, and delete by key on average; no ordering; performance falls apart under a bad hash function or pathological keys.
  • Set. Unordered collection of unique values. Cheap “is this present?” lookup. Usually a hash map behind the scenes; sometimes a sorted set when ordered iteration matters.
  • Tree. Values arranged in a parent-child hierarchy. Binary search trees give O(log n) sorted operations; B-trees back most disk-resident indexes; tries make prefix lookup cheap; suffix trees make substring search cheap.
  • Queue and stack. Ordering disciplines for processing. Queue is first-in-first-out (the line at a coffee shop); stack is last-in-first-out (the plates in a sink). Both can sit on top of an array or a linked list; the discipline is what makes them useful.
  • Heap / priority queue. Repeatedly hand back the smallest (or largest) element. The structure that makes Dijkstra’s algorithm tractable and makes “process the highest-priority job next” cheap.
  • Graph. Nodes connected by edges, possibly directed, possibly weighted. The structure that names “who depends on what,” “who follows whom,” “which page links to which.”

The classical vocabulary for what these structures cost is Big-O notation, a worst-case bound on how an operation’s time or memory grows as the structure’s size grows: O(1) means the cost doesn’t change with size, O(log n) grows very slowly, O(n) grows linearly, O(n log n) is the typical sorting bound, O(n²) is the cost of a nested scan and the most common quietly-fatal performance bug. Big-O is the shared vocabulary that lets two engineers argue about a choice without first agreeing on a benchmark.

For agentic coding the gap that matters is the gap between the ADT a prompt names (“a list of seen items”) and the concrete representation the agent reaches for. An agent told to “track which items we’ve seen” will, by default, reach for an array and a linear scan. That code passes a unit test on ten items. It runs in a few seconds on ten thousand. It locks up an interactive endpoint on a million. The agent isn’t being careless; it’s producing correct code for a contract the prompt named at the ADT level and not at the representation level. The vocabulary closes the gap by giving the prompt the words it needs: use a set, use a hash map keyed by ID, use a heap if you need the smallest one.

Why It Matters

A team that doesn’t have working data-structure vocabulary will accept the structures the agent picks by default, and the agent’s defaults are biased toward whatever the language’s syntax makes easiest to write, usually a list. That bias is invisible until the dataset grows.

The cost of getting it wrong is rarely a wrong answer; it’s a working answer that becomes unworkable. A spell-checker that scans a 100,000-word dictionary for every word in a 10,000-word document is doing a billion comparisons; the same checker backed by a hash set is doing ten thousand. A duplicate-detection routine that nests two loops is doing n² work; the same routine backed by a set is doing n work. A “find the top ten” routine that sorts the whole list and slices the head is doing n log n work; the same routine backed by a heap is doing n log 10. None of these are bugs; they are different points on the same cost curve, and the curve is the topic the team needs vocabulary for.

Naming the structure is also what makes the code readable. A reader who sees a Set<UserId> learns more from the type than they would from a paragraph of comments: that the collection holds unique users, that order doesn’t matter, that membership is the cheap operation. A reader who sees a PriorityQueue<Job> knows the next thing the loop pulls out is the most urgent one. A reader who sees List<Customer> learns almost nothing — the type is the language’s default and signals only that someone needed a container.

For agentic workflows the discipline shifts. The agent reads the codebase faster than the team does and pattern-matches off whatever it sees. A codebase that consistently uses Set where uniqueness matters and Map where keyed lookup matters teaches the agent, through example rather than instruction, to do the same. A codebase that uses lists for everything teaches the agent the opposite. The team’s vocabulary, embodied in the type signatures the team commits, is the prompt the agent reads on every cycle.

How to Recognize It

You’re looking at a data-structure question whenever the same set of values is going to be touched by more than one kind of operation, and the operations have different cost profiles. The signs in order of frequency:

  • A loop inside a loop that’s doing a search. The classical n² shape. “For each item in A, find the matching item in B.” The fix is to put one of the lists into a hash map keyed by the join field and do the lookup in the inner step. The agent’s helpful “let me iterate over both” code is this shape every time it ships.
  • A for loop scanning a list to test membership. “Is this user in the allowed list?” If the list is fixed and the test happens more than once, the structure should be a set. The list-scan is the agent’s default because if x in list is shorter to type than the equivalent with a set.
  • A repeated “find the smallest” or “find the largest.” If the answer is needed once, scan the list. If the answer is needed repeatedly as new items arrive, the structure is a heap or priority queue.
  • An iteration that has to be in sorted order. If the order is fixed at load time, sort once and iterate. If insertions and reads interleave and both need to see sorted order, the structure is a sorted set or a balanced tree, not a list that’s resorted on every insert.
  • A lookup table that’s checked at hot-path frequency. Configuration, feature flags, lookup-by-id queries — anything checked thousands of times per request. The structure is a hash map populated once at startup, not a database hit on every call.
  • A workflow that has to know “what came before this in topological order.” Dependency resolution, build graphs, page-link analysis. The structure is a graph; the algorithm is topological sort or breadth-first traversal. Don’t try to model it with nested lists.
  • A “remember the last N things” buffer. Recent items, sliding windows, rate-limit accounting. The structure is a ring buffer or a deque, not a list with periodic slicing.

A few signs that the wrong structure is in play right now:

  • The code is correct, the tests pass, and one user complaint or one production load test reveals that the operation takes seconds when it should take milliseconds. The bug is structural; no amount of micro-optimization inside the loop will recover the order of magnitude that the wrong shape is costing.
  • The code uses a list and the comments around it spend three sentences explaining how lookups are done. The comments are doing the work that a better type signature should be doing for free.
  • The agent’s commit message says “implemented in O(n²); should be fine for current scale.” The agent named the shape; the team accepted the shape; the structure should have been changed in the same commit.

Warning

A list isn’t a default; it’s a choice. The default is whichever structure makes the operations the code actually performs cheap. The team that treats list as the default and reaches for other structures only when a problem appears is the team that ships n² code and re-implements it under deadline pressure.

How It Plays Out

A back-office team builds an export feature that joins two tables in memory: ten thousand orders against fifty thousand customers, matching on customer_id. The agent writes the natural-looking version — for each order, scan the customer list, find the match, emit a row. The export runs in development against a small dataset, passes review, and ships. The first time it runs against production data it takes four minutes; the user closes the tab. The fix is fifteen lines: load the customer list into a Map<CustomerId, Customer> once at the start of the export, then in the per-order loop do a constant-time lookup. The export now runs in under a second. The team hasn’t changed the algorithm or the data; they’ve changed which structure holds the customers, and the new structure makes the join the right shape: O(n) instead of O(n × m).

A platform team adds a rate limiter to an API. The first cut keeps a list of the timestamps of recent requests per user, trims expired entries on every call, and counts the survivors. Under low load it works. Under flash-traffic load, when each user has thousands of requests in the window, the per-call list trim becomes the slowest operation in the entire request pipeline, and the rate limiter, which exists to protect the system, becomes the bottleneck protecting the system from itself. The replacement structure is a ring buffer of fixed size per user, or a counter bucketed by short time windows; both let the per-call cost be constant regardless of how many requests are in the window. The team had picked the structure the prompt suggested (“a list of recent timestamps”); the structure the workload needed was named more cheaply once the workload was visible.

A coding agent is asked to “deduplicate this stream of events as they arrive and forward each unique one downstream.” The agent reaches for a list (seen = [], then if event.id not in seen: seen.append(event.id); forward(event)) and runs the new code through the test suite. The tests pass on a thousand events. The system goes to staging with ten million events and falls over: the per-event in seen check has gone from microseconds to seconds as the list grew, and the whole pipeline has stalled. The fix is two tokens: seen = set() instead of seen = []. Same code shape, same intent, same correctness; the list-vs-set choice is the entire performance story. What the agent missed wasn’t algorithmic insight; it was the vocabulary distinction between list of seen things (the agent’s default) and set of seen things (the structure the workload demanded). The prompt that names the structure produces the structure on the first try.

Example Prompt

“This function tracks which event IDs have already been processed so the same event is never forwarded twice. Use a Set<String> rather than a list so the membership check is constant-time even when the set has grown to millions of entries. Add a comment naming the structure choice and the rationale so the next reader (and the next agent) sees why.”

Consequences

Treating the data structure as a deliberate vocabulary choice (which container, with which operations, at which cost) rather than as whatever the language’s syntax made easiest to type, changes what the team’s design conversations are about. The team stops debugging “the export is slow” and starts asking, of each container in the code, which operations does this support cheaply, which does it support expensively, and which ones does the workload actually exercise? That question has answers in the canonical-structures vocabulary above; the previous question only has profiler output.

Benefits. A team that picks structures on purpose ships code that performs predictably at the scales the team designs for, and the type signatures themselves carry the intent: a Set says unique, a Map says keyed lookup, a Heap says priority, a Graph says relationships. Code review becomes faster because the structure is the design; new contributors and coding agents pattern-match off the existing types and produce code in the same shape. Performance problems become easier to diagnose because the structural costs are visible in the type signatures rather than buried in the loops. The team’s mental model becomes precise enough that “use a set” is a one-line review comment that closes a class of bug.

Liabilities. Every additional structure costs something to learn and to maintain. A codebase that uses six containers where one would do is harder to read, not easier. The structures the standard library ships are not free of footguns: hash maps degenerate under adversarial keys, trees rebalance under contention, sets that hold mutable objects break when the objects mutate after insertion, ring buffers lose data silently when they overflow. The discipline isn’t “use the most exotic structure”; it’s “use the structure whose cost profile matches the workload, and no more exotic than that.” A Set where a List would do is gratuitous; a List where a Set is needed is a performance trap. Picking right requires knowing both the workload and the structures’ cost profiles, and the time the team invests in that knowledge is the time the team gets back at every performance review.

For agentic workflows the consequence is sharp. An agent will reach for the structure the prompt names; if the prompt names only the values (“a collection of seen IDs”), the agent will reach for the language default, which is almost always a list. A team that wants performant agent output makes the structure part of the codebase’s vocabulary: committed types, named helpers, examples in adjacent files the agent will read on its way to the file it’s editing. That’s the prompt the agent actually reads. The discipline isn’t to write longer instructions to the agent; it’s to make the codebase teach the agent the structure the team would have chosen, by example, on every read.

Sources

  • Donald Knuth’s The Art of Computer Programming, Volume 1: Fundamental Algorithms (Addison-Wesley, 1968) established the systematic treatment of arrays, linked lists, trees, and the cost analysis that became the field’s shared vocabulary. The canonical-structures cost table used here descends directly from Knuth’s framing.
  • Niklaus Wirth’s Algorithms + Data Structures = Programs (Prentice-Hall, 1976) made the equation in the title the field’s discipline: a program is the algorithm and the structure together, and changing either changes the other. The article’s “name the structure to name the cost” framing is Wirth’s equation read in reverse.
  • Thomas Cormen, Charles Leiserson, Ronald Rivest, and Clifford Stein’s Introduction to Algorithms (commonly called CLRS; MIT Press, multiple editions since 1990) is the modern reference textbook; its Big-O treatment and per-structure cost tables are what most working engineers learned the vocabulary from. The article’s heap, priority queue, and graph framings follow CLRS conventions.
  • The Java Collections Framework documentation and Python’s collections module are the canonical, in-language references practitioners reach for daily; the choice-vocabulary used here — “reach for a set when you need uniqueness, reach for a map when you need keyed lookup” — is the working practitioner’s distilled summary of those library docs.