Data Modeling Considerations for MongoDB Applications

On this page

Overview
Data Modeling Decisions
Operational Considerations
Data Modeling Patterns and Examples

Overview

Data in MongoDB has a flexible schema. Collections do not enforce document structure. This means that:

documents in the same collection do not need to have the same set of fields or structure, and
common fields in a collection’s documents may hold different types of data.

Each document only needs to contain the fields relevant to the entity or object that it represents. In practice, most documents in a collection share a similar structure. Schema flexibility means that you can model your documents in MongoDB so that they closely resemble and reflect application-level objects.

As in all data modeling, when developing data models (i.e. schema designs) for MongoDB, you must consider the inherent properties and requirements of the application objects and the relationships between application objects. MongoDB data models must also reflect:

how data will grow and change over time, and
the kinds of queries your application will perform.

These considerations and requirements force developers to make a number of multi-factored decisions when modeling data, including:

normalization and de-normalization.

These decisions reflect the degree to which the data model should store related pieces of data in a single document, or whether it should instead describe relationships using references between documents.
indexing strategy.
representation of data in arrays in BSON.

Although a number of data models may be functionally equivalent for a given application, different data models may have significant impacts on MongoDB and application performance.

This document provides a high-level overview of these data modeling decisions and factors. In addition, consider the Data Modeling Patterns and Examples section, which provides more concrete examples of all the discussed patterns.

Data Modeling Decisions

Data modeling decisions involve determining how to structure the documents to model the data effectively. The primary decision is whether to embed or to use references.

Embedding

To de-normalize data, store two related pieces of data in a single document.

Operations within a document are less expensive for the server than operations that involve multiple documents.

In general, use embedded data models when:

you have “contains” relationships between entities. See Model Embedded One-to-One Relationships Between Documents.
you have one-to-many relationships where the “many” objects always appear with or are viewed in the context of their parent documents. See Model Embedded One-to-Many Relationships Between Documents.

Embedding provides the following benefits:

generally better performance for read operations.
the ability to request and retrieve related data in a single database operation.

Embedding related data in documents can lead to situations where documents grow after creation. Document growth can impact write performance and lead to data fragmentation. Furthermore, documents in MongoDB must be smaller than the maximum BSON document size. For larger documents, consider using GridFS.
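
As a rough illustration of an embedded model, the sketch below uses a plain JavaScript object rather than a live database. The patron document and its field names (name, addresses, street, city) are hypothetical and only meant to show a "contains" relationship in which the related data travels with its parent.

```javascript
// Hypothetical patron document with its addresses embedded as an array
// of sub-documents. All names and values here are illustrative assumptions.
const patron = {
  _id: "joe",
  name: "Joe Bookreader",
  addresses: [
    { street: "123 Fake Street", city: "Faketon" },
    { street: "1 Some Other Street", city: "Boston" }
  ]
};

// Because the addresses live inside the patron document, retrieving the
// patron retrieves them as well; no follow-up lookup is needed.
function citiesFor(doc) {
  return doc.addresses.map((address) => address.city);
}

console.log(citiesFor(patron));
```

Note that every address added later makes the patron document larger, which is the growth concern described above.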

Referencing

To normalize data, store references between two documents to indicate a relationship between the data represented in each document.

In general, use normalized data models:

when embedding would result in duplication of data but would not provide sufficient read performance advantages to outweigh the implications of the duplication.
to represent more complex many-to-many relationships.
to model large hierarchical data sets. See Model Tree Structures in MongoDB.

Referencing provides more flexibility than embedding; however, to resolve the references, client-side applications must issue follow-up queries. In other words, using references requires more round trips to the server.

See Model Referenced One-to-Many Relationships Between Documents for an example of referencing.
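
The cost of resolving references can be sketched without a server. In the example below, two in-memory arrays stand in for collections and a counter stands in for round trips; the collection names, field names (such as publisher_id), and helper functions are all hypothetical.

```javascript
// Two in-memory arrays stand in for collections. Names are assumptions.
const publishers = [
  { _id: "oreilly", name: "O'Reilly Media" }
];
const books = [
  { _id: 1, title: "MongoDB: The Definitive Guide", publisher_id: "oreilly" },
  { _id: 2, title: "50 Tips and Tricks for MongoDB Developers", publisher_id: "oreilly" }
];

// Counts simulated round trips to the "server".
let lookups = 0;

// Returns the first matching document, or null when nothing matches.
function findOne(collection, predicate) {
  lookups += 1;
  return collection.find(predicate) || null;
}

// Resolving the reference takes a second lookup after fetching the book.
function bookWithPublisher(bookId) {
  const book = findOne(books, (b) => b._id === bookId);
  if (book === null) return null;
  const publisher = findOne(publishers, (p) => p._id === book.publisher_id);
  return { title: book.title, publisher: publisher ? publisher.name : null };
}
```

In this sketch the publisher data is stored once and shared by both books, at the price of a second lookup each time a book is displayed together with its publisher.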

Atomicity

MongoDB only provides atomic operations on the level of a single document. [1] As a result, the need for atomic operations influences the decision to use embedded or referenced relationships when modeling data for MongoDB.

Embed fields that need to be modified together atomically in the same document. See Model Data for Atomic Operations for an example of atomic updates within a single document.

[1]	Document-level atomic operations include all operations within a single MongoDB document record: operations that affect multiple sub-documents within that single record are still atomic.
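
As a minimal sketch of why related fields are kept together, the function below builds the filter and update documents for a single-document update that changes two fields at once. The book fields (available, checkout) and the function name are hypothetical; the resulting objects are the kind of arguments one might pass to an update call such as db.books.update(filter, update), where the books collection is also an assumption.

```javascript
// Builds the arguments for one single-document update.
// Field names (available, checkout) are illustrative assumptions.
function buildCheckout(bookId, patronId, when) {
  return {
    // Match the book only while at least one copy is still available.
    filter: { _id: bookId, available: { $gt: 0 } },
    // Both modifications target the same document, so they are applied
    // together as one atomic operation.
    update: {
      $inc: { available: -1 },
      $push: { checkout: { by: patronId, date: when } }
    }
  };
}

const checkoutOp = buildCheckout(123456789, "abc", new Date());
console.log(JSON.stringify(checkoutOp.update));
```

If the available count and the checkout list lived in separate documents, the two changes would need two operations and could not be applied atomically.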

Operational Considerations

In addition to normalization and de-normalization concerns, a number of other operational factors help shape data modeling decisions in MongoDB. These factors include:

data lifecycle management,
number of collections,
indexing requirements,
sharding, and
managing document growth.

These factors have implications for database and application performance as well as future maintenance and development costs.

Data Lifecycle Management

Data modeling decisions should also take data lifecycle management into consideration.

The Time to Live or TTL feature of collections expires documents after a period of time. Consider using the TTL feature if your application requires some data to persist in the database for a limited period of time.

Additionally, if your application only uses recently inserted documents, consider Capped Collections. Capped collections provide first-in-first-out (FIFO) management of inserted documents and are optimized to support operations that insert and read documents based on insertion order.
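
The expiry rule behind TTL can be approximated in a few lines. The sketch below describes a TTL index as a plain object (a key on a date field plus the expireAfterSeconds option) and checks whether a document would be old enough to expire. The field name createdAt, the one-hour lifetime, and the isExpired helper are assumptions made for illustration; the server removes expired documents in the background, so actual deletion is not immediate.

```javascript
// Specification for a TTL index: one date field plus expireAfterSeconds.
// The field name and the one-hour lifetime are illustrative assumptions.
const ttlIndex = {
  keys: { createdAt: 1 },
  options: { expireAfterSeconds: 3600 }
};

// Approximates the expiry rule: a document becomes eligible for removal
// once its indexed date is more than expireAfterSeconds in the past.
function isExpired(doc, now) {
  const ageSeconds = (now.getTime() - doc.createdAt.getTime()) / 1000;
  return ageSeconds > ttlIndex.options.expireAfterSeconds;
}
```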

Large Number of Collections

In certain situations, you might choose to store information in several collections rather than in a single collection.

Consider a sample collection logs that stores log documents for various environments and applications. The logs collection contains documents of the following form:

{ log: "dev", ts: ..., info: ... }
{ log: "debug", ts: ..., info: ... }

If the total number of documents is low, you may group documents into collections by type. For logs, consider maintaining distinct log collections, such as logs.dev and logs.debug. The logs.dev collection would contain only the documents related to the dev environment.

Generally, having a large number of collections has no significant performance penalty and results in very good performance. Distinct collections are very important for high-throughput batch processing.
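
The routing from a log document to its per-type collection can be sketched as follows. The helper names and the sample ts and info values are hypothetical placeholders; only the log field and the logs.dev and logs.debug names come from the text above.

```javascript
// Derives the target collection name from the document's log field,
// e.g. { log: "dev" } maps to "logs.dev".
function collectionFor(doc) {
  return "logs." + doc.log;
}

// Groups documents by target collection name.
function groupByCollection(docs) {
  const groups = {};
  for (const doc of docs) {
    const name = collectionFor(doc);
    if (!groups[name]) groups[name] = [];
    groups[name].push(doc);
  }
  return groups;
}

// Sample documents; the ts and info values are made-up placeholders.
const sampleLogs = [
  { log: "dev", ts: 1, info: "started" },
  { log: "debug", ts: 2, info: "cache miss" },
  { log: "dev", ts: 3, info: "stopped" }
];

console.log(Object.keys(groupByCollection(sampleLogs)));
```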

When using models that have a large number of collections, consider the following behaviors:

Each collection has a certain minimum overhead of a few kilobytes.
Each index, including the index on _id, requires at least 8KB of data space.

A single <database>.ns file stores all meta-data for each database. Each index and collection has its own entry in the namespace file, and MongoDB places limits on the size of namespace files.

Because of limits on namespaces, you may wish to know the current number of namespaces in order to determine how many additional namespaces the database can support, as in the following example:

db.system.namespaces.count()

The <database>.ns file defaults to 16 MB. To change the size of the <database>.ns file, pass a new size in megabytes to the --nssize option (--nssize <new size MB>) when starting the server.

The --nssize option sets the size for new <database>.ns files. For existing databases, after starting up the server with --nssize, run the db.repairDatabase() command from the mongo shell.
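
When planning a model with many collections, the namespace count and the minimum index footprint can be estimated up front. The sketch below applies the two rules stated above (one namespace entry per collection and per index, and at least 8KB per index) to a hypothetical list of planned collections. The function names and the planned layout are assumptions, and the estimate ignores any system collections the database maintains itself.

```javascript
// Each collection is one namespace, and each of its indexes is another.
function countNamespaces(collections) {
  return collections.reduce((total, c) => total + 1 + c.indexes.length, 0);
}

// Lower bound on index space, using the 8KB minimum per index.
function minIndexBytes(collections) {
  const bytesPerIndex = 8 * 1024;
  return collections.reduce(
    (total, c) => total + c.indexes.length * bytesPerIndex,
    0
  );
}

// Hypothetical planned layout; index names are illustrative only.
const planned = [
  { name: "logs.dev", indexes: ["_id_"] },
  { name: "logs.debug", indexes: ["_id_", "ts_1"] }
];

console.log(countNamespaces(planned), minIndexBytes(planned));
```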

Indexes

Create indexes to support common queries. Generally, indexes and index use in MongoDB correspond to indexes and index use in relational databases: build indexes on fields that appear often in queries and for all operations that return sorted results. MongoDB automatically creates a unique index on the _id field.

As you create indexes, consider the following behaviors of indexes:

Each index requires at least 8KB of data space.
Adding an index has some negative performance impact for write operations. For collections with a high write-to-read ratio, indexes are expensive, as each insert must add keys to each index.
Collections with a high proportion of read operations to write operations often benefit from additional indexes. Indexes do not affect un-indexed read operations.

See Indexing Strategies for more information on determining indexes. Additionally, the MongoDB database profiler may help identify inefficient queries.

Sharding

Sharding allows users to partition a collection within a database to distribute the collection’s documents across a number of mongod instances or shards.

The shard key determines how MongoDB distributes data among shards in a sharded collection. Selecting the proper shard key has significant implications for performance.

See Sharded Cluster Overview for more information on sharding and the selection of the shard key.

Document Growth

Certain updates to documents can increase the document size, such as pushing elements to an array and adding new fields. If the document size exceeds the allocated space for that document, MongoDB relocates the document on disk. This internal relocation can be both time and resource consuming.

Although MongoDB automatically provides padding to minimize the occurrence of relocations, you may still need to manually handle document growth. Refer to Pre-Aggregated Reports Use Case Study for an example of the pre-allocation approach to handling document growth.
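
One way to picture pre-allocation is to create a document with every field it will ever need already present, so that later updates overwrite values instead of adding fields. The sketch below builds a hypothetical daily statistics document with 24 zeroed hourly counters; the function name, the _id format, and the field names are assumptions and are not taken from the referenced case study.

```javascript
// Builds a daily stats document with every hourly counter already present.
// The document shape and the _id format are illustrative assumptions.
function preallocateDay(page, day) {
  const hourly = {};
  for (let hour = 0; hour < 24; hour += 1) {
    hourly[String(hour)] = 0;
  }
  return { _id: day + "/" + page, page: page, day: day, hourly: hourly };
}

const statsDoc = preallocateDay("index.html", "2013-06-01");
console.log(Object.keys(statsDoc.hourly).length);
```

Under this assumption, an update that later increments one of the existing counters changes a value in place rather than enlarging the document.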

Data Modeling Patterns and Examples

The following documents provide overviews of various data modeling patterns and common schema design considerations:

For more information and examples of real-world data modeling, consider the following external resources:

Schema Design by Example
Walkthrough MongoDB Data Modeling
Document Design for MongoDB
Dynamic Schema Blog Post
MongoDB Data Modeling and Rails
Ruby Example of Materialized Paths
Sean Cribbs Blog Post, which was the source for much of the Model Tree Structures in MongoDB content.
