Data Model

A data model is the conceptual inventory of what a system knows: the nouns, their attributes, and how they connect, named once so every other layer of the system can agree on what it’s talking about.

Concept

Vocabulary that names a phenomenon.

“All models are wrong, but some are useful.” — George Box

Understand This First

Requirement — the data model reflects what the system is required to know.

A data model names the entities the system tracks, the attributes each entity carries, and the relationships between entities. It sits at the architectural level: above any particular database or programming language, but below product-level decisions about what the system does.

For a bookstore application, the data model says there are books, authors, and orders. It says a book has a title and a price. It says an author can write many books, and an order contains one or more books. It does not say whether the data lives in PostgreSQL, MongoDB, or a JSON file on disk; it does not say whether Book is a Python class, a Go struct, or a TypeScript interface. The data model captures meaning. The storage and code that follow capture mechanism.

The term gets used three ways in practice, and the layers are worth keeping separate because conflating them is where bugs come from:

Conceptual data model. Nouns, attributes, relationships, named in the vocabulary of the problem domain. This is what a product manager and a backend engineer can argue about together. It says what exists, not how it’s stored. A whiteboard with boxes and arrows is usually enough.
Logical data model. The conceptual model expressed in the form of whatever paradigm will store it: relational tables and foreign keys, document collections and embedded sub-documents, graph nodes and edges. Datatypes appear here, but specific column lengths and indexes don’t. This is what you sketch before you write the migration.
Physical data model. The logical model committed to a specific engine. Postgres column types, indexes, partitioning strategy, denormalization choices made for query performance. This is the level the database administrator reads.
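
One way to see the three layers side by side is to write the bookstore model at each level. The sketch below is illustrative, not a prescribed layout. The Author name attribute, the one-author-per-book rule, the order_item join table, the price stored as integer cents, and the choice of SQLite (picked only so the example runs without a server) are all assumptions added for the example.

```python
import sqlite3

# Conceptual: domain vocabulary only. What exists, not how it is stored.
CONCEPTUAL = {
    "entities": {"Book": ["title", "price"], "Author": ["name"], "Order": []},
    "relationships": [
        ("Author", "writes", "Book", "one-to-many"),     # assumed: one author per book
        ("Order", "contains", "Book", "many-to-many"),   # assumed: a book appears in many orders
    ],
}

# Logical: the same model in relational form. Datatypes and keys appear,
# but no engine-specific types, lengths, or indexes yet.
LOGICAL = {
    "author": {"id": "integer", "name": "text"},
    "book": {"id": "integer", "title": "text", "price": "decimal",
             "author_id": "integer, references author.id"},
    "orders": {"id": "integer"},
    "order_item": {"order_id": "integer, references orders.id",
                   "book_id": "integer, references book.id"},
}

# Physical: committed to one engine, including storage and index choices.
PHYSICAL_DDL = """
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE book (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    price_cents INTEGER NOT NULL CHECK (price_cents >= 0),
    author_id INTEGER NOT NULL REFERENCES author(id)
);
CREATE TABLE orders (id INTEGER PRIMARY KEY);
CREATE TABLE order_item (
    order_id INTEGER NOT NULL REFERENCES orders(id),
    book_id INTEGER NOT NULL REFERENCES book(id),
    PRIMARY KEY (order_id, book_id)
);
CREATE INDEX idx_book_author ON book(author_id);
"""

def build_physical():
    """Create the physical model in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(PHYSICAL_DDL)
    return conn

conn = build_physical()
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

Only the last block changes if the team moves to a different engine. The first block changes only when the team's understanding of the bookstore changes.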

The data model also has neighbors that often get blurred into it. A Schema is the physical model rendered as DDL the database can enforce; a Data Structure is an in-memory shape used by code that runs on top of the model; a Domain Model is the broader business-rule layer that includes behavior, invariants, and workflows, not just data shape. The data model is the part of all of these that answers one question: what does the system know about the world?

Why It Matters

A team without a shared data model accumulates quiet disagreement. The product manager talks about customers; the developer writes a User class; the marketing analyst calls them accounts; the support agent looks at contacts in the CRM. They are all referring to roughly the same entity, but the word collisions hide real differences. Does a customer exist before they have placed an order? Can one user manage several accounts? Does a contact get archived or deleted? The questions sit unanswered until a feature ships, behaves wrong, and the team finally argues out which word means which thing, usually by reading code or running queries against production.

The data model is what shortcuts that argument. When the team names its entities, attributes, and relationships once, with care, the rest of the system gets to refer back to a single answer. The schema reflects the model. The API contract reflects the model. The product copy reflects the model. New engineers learn the model on day one and the vocabulary becomes load-bearing across hiring, design review, and onboarding.

For agentic workflows the discipline tightens. An agent writes code quickly, and it takes its vocabulary from the codebase it is reading. If the codebase has named entities (clear class names, well-typed columns, an entities.md doc, a domain glossary), the agent will pattern-match on that vocabulary and produce code that respects it. If the codebase has three half-built models (one in the schema, one in the ORM, one in the API serializers), the agent will write code that’s coherent inside whichever model the prompt happened to surface and incoherent with the other two. The team won’t see the drift until a feature ships that updates the schema’s customer row but leaves the API serializer’s client payload unchanged, and a downstream consumer breaks. Naming the model is what gives the agent something to be consistent with.

The model is also where the team’s product clarity gets honest. A vague product brief along the lines of “we need to track customers and their stuff” survives until the model has to be drawn. The moment somebody asks “what’s an attribute of a customer and what’s a separate entity?” the team has to decide. Is mailing address a column on the customer row, an embedded value, or a separate Address entity that the customer has many of? The answer depends on whether addresses get reused, whether they get historical versioning, whether two customers can share one. Forcing the question is the value; the box-and-arrow diagram is the artifact.
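
The mailing-address question can be made concrete by writing two of the candidate shapes next to each other. This is a minimal sketch with hypothetical class and field names (CustomerA, CustomerB, Address, street, city), not a recommendation for either option. It shows only that the two shapes behave differently once two customers share an address.

```python
from dataclasses import dataclass, field

# Option A: the mailing address is an attribute of the customer.
# Each customer carries a private copy, so nothing is shared or versioned.
@dataclass
class CustomerA:
    name: str
    mailing_address: str

# Option B: Address is its own entity. A customer has many addresses,
# and two customers may point at the same one.
@dataclass
class Address:
    address_id: int
    street: str
    city: str

@dataclass
class CustomerB:
    name: str
    addresses: list[Address] = field(default_factory=list)

# With option A, correcting one customer's address leaves the other untouched.
a1 = CustomerA("Alice", "12 Elm St, Springfield")
a2 = CustomerA("Bob", "12 Elm St, Springfield")
a1.mailing_address = "14 Elm St, Springfield"

# With option B, a correction to the shared entity is seen by both customers.
shared = Address(1, "12 Elm St", "Springfield")
b1 = CustomerB("Alice", [shared])
b2 = CustomerB("Bob", [shared])
shared.street = "14 Elm St"
```

Which behavior is right depends on the answers the paragraph above asks for: whether addresses are reused, versioned, or shared.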

How to Recognize It

You are looking at a data-model question whenever two people in the same conversation use different words for the same thing, the same word for different things, or describe the system’s content with hedges instead of nouns. Specific signs:

Vocabulary drift. Two services call the same row by different names — users in one schema, accounts in another, members in the API. Or the same name covers two different shapes — Order means “a cart in progress” in the storefront and “a completed sale” in the warehouse. Neither side is wrong; the model was never written down.

Ad-hoc relationships. A foreign key gets added to support one feature, then a second feature is built against the implicit relationship without anyone updating the model anywhere. Six months later nobody can answer “what does an Order belong to?” without reading the migration history.

Schema-as-spec. The team has no document describing the data model. Asked to explain it, an engineer opens the database and reads the table list. This is a tell: the model is being inferred from its implementation, which means the implementation is the model, which means every storage decision is also a modeling decision and the team can’t tell them apart.

Boundary fights. A new feature has to decide whether something is “part of” the customer entity or “linked to” the customer entity, and the decision keeps flipping. The instinct is to argue about ORM design; the actual question is whether the model has the right entities.

The “what counts as one of these” question. Two engineers argue about whether a returned-and-replaced item is one Order or two; whether a free trial is a Subscription with status: trial or its own entity; whether a deleted user is a row with deleted_at set or no row at all. These are modeling questions wearing implementation clothes. They surface when the model isn’t explicit about lifecycle.

Agent code that “works” but feels off. An agent asked to add a feature writes code that touches three tables, and the resulting payload feels strangely shaped — fields that should be one thing are two, or two things are mashed into one. The code is internally consistent; what’s wrong is that the agent inferred a model from the parts of the codebase it read, and that inferred model doesn’t match the one the team carries in their heads.

How It Plays Out

A team building a recipe-sharing app sits down for thirty minutes before writing any code. They list the entities: Recipe, Ingredient, User, Rating. They sketch the relationships: a User creates Recipes; a Recipe has Ingredients (with quantity and unit); a User can leave a Rating on a Recipe (one rating per user per recipe). They argue briefly about whether Ingredient is its own entity or a list of strings on the recipe, and decide on a separate entity because they want shopping-list features later. The whole exercise costs thirty minutes and a marker. Six months in, when a new engineer joins, that diagram is the first thing she reads, and her first PR uses the right names.
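
The recipe team's whiteboard could be transcribed roughly as follows. The entity names and the one-rating-per-user-per-recipe rule come from the scenario. Everything else is an assumption made for the sketch: the field names, the RecipeIngredient link that carries quantity and unit, the 1 to 5 score range, the choice to let a second rating replace the first, and the shopping_list helper.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class User:
    user_id: int
    name: str

@dataclass(frozen=True)
class Ingredient:
    ingredient_id: int
    name: str

@dataclass(frozen=True)
class RecipeIngredient:
    """Links a Recipe to an Ingredient and carries quantity and unit."""
    ingredient: Ingredient
    quantity: float
    unit: str

@dataclass
class Recipe:
    recipe_id: int
    title: str
    author: User
    ingredients: list[RecipeIngredient] = field(default_factory=list)

class RatingBook:
    """Stores ratings keyed by (user, recipe), so each pair holds at most one."""

    def __init__(self):
        self._scores = {}

    def rate(self, user, recipe, score):
        if not 1 <= score <= 5:  # assumed scale
            raise ValueError("score must be between 1 and 5")
        # Assumed policy: a repeat rating overwrites the earlier one.
        self._scores[(user.user_id, recipe.recipe_id)] = score

    def score(self, user, recipe):
        return self._scores[(user.user_id, recipe.recipe_id)]

    def count_for(self, recipe):
        return sum(1 for (_, rid) in self._scores if rid == recipe.recipe_id)

def shopping_list(recipes):
    """Total quantity per (ingredient name, unit) across several recipes."""
    totals = {}
    for recipe in recipes:
        for item in recipe.ingredients:
            key = (item.ingredient.name, item.unit)
            totals[key] = totals.get(key, 0) + item.quantity
    return totals

alice = User(1, "Alice")
flour = Ingredient(1, "flour")
bread = Recipe(1, "Bread", alice, [RecipeIngredient(flour, 500, "g")])
pizza = Recipe(2, "Pizza", alice, [RecipeIngredient(flour, 300, "g")])

ratings = RatingBook()
ratings.rate(alice, bread, 3)
ratings.rate(alice, bread, 5)  # same user, same recipe: still one rating
```

The shopping_list helper is the payoff of treating Ingredient as its own entity: quantities can be summed across recipes because both recipes refer to the same flour, which a list of free-text strings would not guarantee.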

A platform team migrating from a monolith to services discovers, halfway through the cutover, that the monolith treats workspace and organization as synonyms in some endpoints and as distinct entities in others. The migration stalls for three weeks while the team figures out which usages meant which, writes a data-model document that names Organization (the legal entity that pays the bill), Workspace (the collaboration container that users join), and the one-to-many relationship between them, then renames every endpoint, column, and metric to match. The cost is bearable but real; the cost they avoided is shipping the migration with the ambiguity baked in and discovering it from a billing bug six quarters later.

A coding agent is asked to add “team plans” to an existing SaaS application. The codebase has a User table with a plan column and no Team concept. The agent reads the schema, the API serializers, and the billing module, then writes a migration that adds a Team table, a team_id foreign key on User, and an is_team_admin boolean. The endpoints route by team_id. The tests pass. A week later the team realizes the agent’s model says a user belongs to one team, but the product brief says a user can belong to several teams in different roles. The agent inferred a many-to-one model from the existing one-plan-per-user pattern, because nothing in the codebase named the alternative. The fix is two days of migration and a rewrite of half the agent’s endpoints. What the team should have done first is spend twenty minutes writing the new entities and relationships down and putting them in the prompt: Team has many Users via Memberships, with a role per membership; a User has many Teams via the same Memberships. The agent would have produced the correct shape on the first try.
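
The corrected shape from this scenario, with Team and User joined through Memberships and a role on each membership, might look like the schema below. It is a sketch under assumptions: the table and column names, the two role values, and the use of SQLite are illustrative and are not taken from the application described.

```python
import sqlite3

DDL = """
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE);
CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- The membership row is what makes the relationship many-to-many,
-- and it is where the per-team role lives.
CREATE TABLE memberships (
    user_id INTEGER NOT NULL REFERENCES users(id),
    team_id INTEGER NOT NULL REFERENCES teams(id),
    role TEXT NOT NULL CHECK (role IN ('admin', 'member')),  -- assumed role set
    PRIMARY KEY (user_id, team_id)
);
"""

def teams_for_user(conn, user_id):
    """Return (team name, role) pairs for one user, ordered by team name."""
    rows = conn.execute(
        """
        SELECT t.name, m.role
        FROM memberships AS m
        JOIN teams AS t ON t.id = m.team_id
        WHERE m.user_id = ?
        ORDER BY t.name
        """,
        (user_id,),
    )
    return rows.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript(DDL)
conn.execute("INSERT INTO users (id, email) VALUES (1, 'dana@example.com')")
conn.execute("INSERT INTO teams (id, name) VALUES (1, 'Platform'), (2, 'Design')")
conn.execute("INSERT INTO memberships VALUES (1, 1, 'admin'), (1, 2, 'member')")

result = teams_for_user(conn, 1)
print(result)
```

Compared with a team_id column and an is_team_admin boolean on the user row, this shape lets one user hold a different role in each team, which is what the product brief in the scenario required.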

Example Prompt

“Before writing any migration or endpoint code, sketch the data model for the new feature. List the entities, their attributes, and the relationships between them. Explicitly state the cardinality of each relationship (one-to-one, one-to-many, many-to-many) and what owns the foreign key. I’ll review the model before you generate the schema, API, or tests.”

Consequences

Benefits. A team that has named its data model gains a shared vocabulary across product, engineering, design, and operations. Code review gets faster because there’s a reference point for “what should exist.” New engineers ramp up on the model before they touch a line of code, which means their first contributions use the right names. Refactors stop being archeological digs because the model is documented separately from any one implementation of it. Agentic workflows benefit disproportionately: an agent given the model in the prompt produces code that respects it; an agent left to infer the model from a tangled codebase produces code that respects whatever fraction of the model it happened to see.

The model also makes change-cost legible. A proposed feature that adds an attribute to an existing entity is small; one that introduces a new entity is medium; one that changes the cardinality of an existing relationship is large and likely needs a migration plan. Without a model, every feature looks small at the start and surprises the team when the implementation reveals the actual scope. With a model, the team can read the diff against the diagram and price the work honestly.
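
The pricing rule in the paragraph above can be sketched as a diff over two versions of a model. The dictionary representation, the price_change function name, and the example entities are assumptions for illustration. The sketch covers only the three cases the paragraph names and returns "unclassified" for anything else.

```python
def price_change(old, new):
    """Return 'large', 'medium', 'small', or 'unclassified' for a model diff."""
    # Large: an existing relationship changed cardinality.
    for pair, cardinality in old["relationships"].items():
        if pair in new["relationships"] and new["relationships"][pair] != cardinality:
            return "large"
    # Medium: the new model introduces an entity the old one lacked.
    if set(new["entities"]) - set(old["entities"]):
        return "medium"
    # Small: an existing entity gained an attribute.
    for name, attrs in old["entities"].items():
        if set(new["entities"].get(name, ())) - set(attrs):
            return "small"
    return "unclassified"

current = {
    "entities": {"User": {"email", "plan"}, "Team": {"name"}},
    "relationships": {("User", "Team"): "many-to-one"},
}
add_attribute = {
    "entities": {"User": {"email", "plan", "display_name"}, "Team": {"name"}},
    "relationships": {("User", "Team"): "many-to-one"},
}
add_entity = {
    "entities": {"User": {"email", "plan"}, "Team": {"name"}, "Invoice": {"total"}},
    "relationships": {("User", "Team"): "many-to-one"},
}
change_cardinality = {
    "entities": {"User": {"email", "plan"}, "Team": {"name"}},
    "relationships": {("User", "Team"): "many-to-many"},
}

for label, proposal in [
    ("add attribute", add_attribute),
    ("add entity", add_entity),
    ("change cardinality", change_cardinality),
]:
    print(label, "->", price_change(current, proposal))
```

The checks run from most to least expensive, so a proposal that both adds an attribute and changes a cardinality is priced by its largest component.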

Liabilities. Models cost effort to maintain. As the product evolves, the model has to evolve with it, and a stale model is worse than no model because it actively misleads — newcomers and agents read it as authoritative and ship code that contradicts the current schema. The discipline of updating the model alongside the schema migration is real work the team has to budget for.

Models can also be applied too rigidly, at the wrong moment. A team building a prototype to learn what the right entities are will draw the model wrong on the first try. Clinging to that first draft after the prototype has taught the team something new uses the model to block the very learning the prototype was built for. The discipline is to draw the model lightly when uncertainty is high, redraw it when the product learns something, and treat it as a living document, not as a contract carved at the moment of greatest ignorance.

Finally, the model can become a place to hide product confusion. A team that can’t decide whether customer and user are the same thing is sometimes really saying that the business hasn’t decided who the product serves. The data model surfaces that question, but doesn’t answer it. Treat a stalled modeling discussion as a signal to escalate to the product owner, not as a problem to be solved with cleverer ER diagrams.

Sources

Peter Chen’s “The Entity-Relationship Model: Toward a Unified View of Data” (ACM Transactions on Database Systems, 1976) introduced the entity-relationship vocabulary used in this article — entities, attributes, relationships, cardinality — and established the separation between conceptual modeling and physical storage that the layered framing here depends on. The 50-year-old paper is still the cleanest statement of the concept.
Eric Evans’s Domain-Driven Design (Addison-Wesley, 2003) developed the case for naming the model in the vocabulary of the problem domain, and gave the field the term ubiquitous language for the discipline of using one set of words across product, code, and storage. The “vocabulary drift” diagnosis in this article descends directly from that framing.
Martin Fowler’s Patterns of Enterprise Application Architecture (Addison-Wesley, 2002) cataloged the patterns by which a logical model gets rendered into code and storage (Data Mapper, Active Record, Identity Field, Foreign Key Mapping) and is a standard reference for the step from model to implementation that the layered framing above relies on.
The agent-specific framing — that a codebase whose data model is named in the prompt produces materially better agent output than one whose model the agent has to infer — is part of the working literature on coding agents. The practitioner conversation around production-grade coding agents converges on the operational rule used here: the codebase’s named vocabulary is the agent’s working surface, and explicit modeling pays for itself many times over once an agent is in the loop.
