Navigation

Product Catalog

Overview

This document describes the basic patterns and principles for designing an E-Commerce product catalog system using MongoDB as a storage engine.

Problem

Product catalogs must have the capacity to store many differed types of objects with different sets of attributes. These kinds of data collections are quite compatible with MongoDB’s data model, but many important considerations and design decisions remain.

Solution

For relational databases, there are several solutions that address this problem, each with a different performance profile. This section examines several of these options and then describes the preferred MongoDB solution.

SQL and Relational Data Models

Concrete Table Inheritance

One approach, in a relational model, is to create a table for each product category. Consider the following example SQL statement for creating database tables:

CREATE TABLE `product_audio_album` (
    `sku` char(8) NOT NULL,
    ...
    `artist` varchar(255) DEFAULT NULL,
    `genre_0` varchar(255) DEFAULT NULL,
    `genre_1` varchar(255) DEFAULT NULL,
    ...,
    PRIMARY KEY(`sku`))
...
CREATE TABLE `product_film` (
    `sku` char(8) NOT NULL,
    ...
    `title` varchar(255) DEFAULT NULL,
    `rating` char(8) DEFAULT NULL,
    ...,
    PRIMARY KEY(`sku`))
...

This approach has limited flexibility for two key reasons:

  • You must create a new table for every new category of products.
  • You must explicitly tailor all queries for the exact type of product.

Single Table Inheritance

Another relational data model uses a single table for all product categories and adds new columns anytime you need to store data regarding a new type of product. Consider the following SQL statement:

CREATE TABLE `product` (
    `sku` char(8) NOT NULL,
    ...
    `artist` varchar(255) DEFAULT NULL,
    `genre_0` varchar(255) DEFAULT NULL,
    `genre_1` varchar(255) DEFAULT NULL,
    ...
    `title` varchar(255) DEFAULT NULL,
    `rating` char(8) DEFAULT NULL,
    ...,
    PRIMARY KEY(`sku`))

This approach is more flexible than concrete table inheritance: it allows single queries to span different product types, but at the expense of space.

Multiple Table Inheritance

Also in the relational model, you may use a “multiple table inheritance” pattern to represent common attributes in a generic “product” table, with some variations in individual category product tables. Consider the following SQL statement:

CREATE TABLE `product` (
    `sku` char(8) NOT NULL,
    `title` varchar(255) DEFAULT NULL,
    `description` varchar(255) DEFAULT NULL,
    `price`, ...
    PRIMARY KEY(`sku`))

CREATE TABLE `product_audio_album` (
    `sku` char(8) NOT NULL,
    ...
    `artist` varchar(255) DEFAULT NULL,
    `genre_0` varchar(255) DEFAULT NULL,
    `genre_1` varchar(255) DEFAULT NULL,
    ...,
    PRIMARY KEY(`sku`),
    FOREIGN KEY(`sku`) REFERENCES `product`(`sku`))
...
CREATE TABLE `product_film` (
    `sku` char(8) NOT NULL,
    ...
    `title` varchar(255) DEFAULT NULL,
    `rating` char(8) DEFAULT NULL,
    ...,
    PRIMARY KEY(`sku`),
    FOREIGN KEY(`sku`) REFERENCES `product`(`sku`))
...

Multiple table inheritance is more space-efficient than single table inheritance and somewhat more flexible than concrete table inheritance. However, this model does require an expensive JOIN operation to obtain all relevant attributes relevant to a product.

Entity Attribute Values

The final substantive pattern from relational modeling is the entity-attribute-value schema where you would create a meta-model for product data. In this approach, you maintain a table with three columns, e.g. entity_id, attribute_id, value, and these triples describe each product.

Consider the description of an audio recording. You may have a series of rows representing the following relationships:

Entity Attribute Value
sku_00e8da9b type Audio Album
sku_00e8da9b title A Love Supreme
sku_00e8da9b ... ...
sku_00e8da9b artist John Coltrane
sku_00e8da9b genre Jazz
sku_00e8da9b genre General
... ... ...

This schema is totally flexible:

  • any entity can have any set of any attributes.
  • New product categories do not require any changes to the data model in the database.

The downside for these models, is that all nontrivial queries require large numbers of JOIN operations that results in large performance penalties.

Avoid Modeling Product Data

Additionally some e-commerce solutions with relational database systems avoid choosing one of the data models above, and serialize all of this data into a BLOB column. While simple, the details become difficult to access for search and sort.

Non-Relational Data Model

Because MongoDB is a non-relational database, the data model for your product catalog can benefit from this additional flexibility. The best models use a single MongoDB collection to store all the product data, which is similar to the single table inheritance relational model. MongoDB’s dynamic schema means that each document need not conform to the same schema. As a result, the document for each product only needs to contain attributes relevant to that product.

Schema

At the beginning of the document, the schema must contain general product information, to facilitate searches of the entire catalog. Then, a details sub-document that contains fields that vary between product types. Consider the following example document for an album product.

{
  sku: "00e8da9b",
  type: "Audio Album",
  title: "A Love Supreme",
  description: "by John Coltrane",
  asin: "B0000A118M",

  shipping: {
    weight: 6,
    dimensions: {
      width: 10,
      height: 10,
      depth: 1
    },
  },

  pricing: {
    list: 1200,
    retail: 1100,
    savings: 100,
    pct_savings: 8
  },

  details: {
    title: "A Love Supreme [Original Recording Reissued]",
    artist: "John Coltrane",
    genre: [ "Jazz", "General" ],
        ...
    tracks: [
      "A Love Supreme Part I: Acknowledgement",
      "A Love Supreme Part II - Resolution",
      "A Love Supreme, Part III: Pursuance",
      "A Love Supreme, Part IV-Psalm"
    ],
  },
}

A movie item would have the same fields for general product information, shipping, and pricing, but have different details sub-document. Consider the following:

{
  sku: "00e8da9d",
  type: "Film",
  ...,
  asin: "B000P0J0AQ",

  shipping: { ... },

  pricing: { ... },

  details: {
    title: "The Matrix",
    director: [ "Andy Wachowski", "Larry Wachowski" ],
    writer: [ "Andy Wachowski", "Larry Wachowski" ],
    ...,
    aspect_ratio: "1.66:1"
  },
}

Note

In MongoDB, you can have fields that hold multiple values (i.e. arrays) without any restrictions on the number of fields or values (as with genre_0 and genre_1) and also without the need for a JOIN operation.

Operations

For most deployments the primary use of the product catalog is to perform search operations. This section provides an overview of various types of queries that may be useful for supporting an e-commerce site. All examples in this document use the Python programming language and the PyMongo driver for MongoDB, but you can implement this system using any language you choose.

Find Albums by Genre and Sort by Year Produced

Querying

This query returns the documents for the products of a specific genre, sorted in reverse chronological order:

query = db.products.find({'type':'Audio Album',
                          'details.genre': 'Jazz'})
query = query.sort([('details.issue_date', -1)])

Indexing

To support this query, create a compound index on all the properties used in the filter and in the sort:

db.products.ensure_index([
    ('type', 1),
    ('details.genre', 1),
    ('details.issue_date', -1)])

Note

The final component of the index is the sort field. This allows MongoDB to traverse the index in the sorted order to preclude a slow in-memory sort.

Find Products Sorted by Percentage Discount Descending

While most searches will be for a particular type of product (e.g album, movie, etc.,) in some situations you may want to return all products in a certain price range, or discount percentage.

Querying

To return this data use the pricing information that exists in all products to find the products with the highest percentage discount:

query = db.products.find( { 'pricing.pct_savings': {'$gt': 25 })
query = query.sort([('pricing.pct_savings', -1)])

Indexing

To support this type of query, you will want to create an index on the pricing.pct_savings field:

db.products.ensure_index('pricing.pct_savings')

Since MongoDB can read indexes in ascending or descending order, the order of the index does not matter.

Note

If you want to perform range queries (e.g. “return all products over $25”) and then sort by another property like pricing.retail, MongoDB cannot use the index as effectively in this situation.

The field that you want to select a range, or perform sort operations, must be the last field in a compound index in order to avoid scanning an entire collection. Using different properties within a single combined range query and sort operation requires some scanning which will limit the speed of your query.

Find Movies Based on Starring Actor

Querying

Use the following query to select documents within the details of a specified product type (i.e. Film) of product (a movie) to find products that contain a certain value (i.e. a specific actor in the details.actor field,) with the results sorted by date descending:

query = db.products.find({'type': 'Film',
                          'details.actor': 'Keanu Reeves'})
query = query.sort([('details.issue_date', -1)])

Indexing

To support this query, you may want to create the following index.

db.products.ensure_index([
    ('type', 1),
    ('details.actor', 1),
    ('details.issue_date', -1)])

This index begins with the type field and then narrows by the other search field, where the final component of the index is the sort field to maximize index efficiency.

Find Movies with a Particular Word in the Title

Regardless of database engine, in order to retrieve this information the system will need to scan some number of documents or records to satisfy this query.

Querying

MongoDB supports regular expressions within queries. In Python, you can use the “re” module to construct the query:

import re
re_hacker = re.compile(r'.*hacker.*', re.IGNORECASE)

query = db.products.find({'type': 'Film', 'title': re_hacker})
query = query.sort([('details.issue_date', -1)])

MongoDB provides a special syntax for regular expression queries without the need for the re module. Consider the following alternative which is equivalent to the above example:

query = db.products.find({
    'type': 'Film',
    'title': {'$regex': '.*hacker.*', '$options':'i'}})
query = query.sort([('details.issue_date', -1)])

The $options operator specifies a case insensitive match.

Indexing

The indexing strategy for these kinds of queries is different from previous attempts. Here, create an index on { type: 1, details.issue_date: -1, title: 1 } using the following command at the Python/PyMongo console:

db.products.ensure_index([
    ('type', 1),
    ('details.issue_date', -1),
    ('title', 1)])

This index makes it possible to avoid scanning whole documents by using the index for scanning the title rather than forcing MongoDB to scan whole documents for the title field. Additionally, to support the sort on the details.issue_date field, by placing this field before the title field, ensures that the result set is already ordered before MongoDB filters title field.

Scaling

Sharding

Database performance for these kinds of deployments are dependent on indexes. You may use sharding to enhance performance by allowing MongoDB to keep larger portions of those indexes in RAM. In sharded configurations, select a shard key that allows mongos to route queries directly to a single shard or small group of shards.

Since most of the queries in this system include the type field, include this in the shard key. Beyond this, the remainder of the shard key is difficult to predict without information about your database’s actual activity and distribution. Consider that:

  • details.issue_date would be a poor addition to the shard key because, although it appears in a number of queries, no queries were selective by this field.
  • you should include one or more fields in the detail document that you query frequently, and a field that has quasi-random features, to prevent large unsplittable chunks.

In the following example, assume that the details.genre field is the second-most queried field after type. Enable sharding using the following shardCollection operation at the Python/PyMongo console:

>>> db.command('shardCollection', 'product', {
...     key : { 'type': 1, 'details.genre' : 1, 'sku':1 } })
{ "collectionsharded" : "details.genre", "ok" : 1 }

Note

Even if you choose a “poor” shard key that requires mongos to broadcast all to all shards, you will still see some benefits from sharding, because:

  1. Sharding makes a larger amount of memory available to store indexes, and
  2. MongoDB will parallelize queries across shards, reducing latency.

Read Preference

While sharding is the best way to scale operations, some data sets make it impossible to partition data so that mongos can route queries to specific shards. In these situations mongos sends the query to all shards and then combines the results before returning to the client.

In these situations, you can add additional read performance by allowing mongos to read from the secondary instances in a replica set by configuring read preference in your client. Read preference is configurable on a per-connection or per-operation basis. In PyMongo, set the read_preference argument.

The SECONDARY property in the following example, permits reads from a secondary (as well as a primary) for the entire connection .

conn = pymongo.MongoClient(read_preference=pymongo.SECONDARY)

Conversely, the SECONDARY_ONLY read preference means that the client will only send read operation only to the secondary member

conn = pymongo.MongoClient(read_preference=pymongo.SECONDARY_ONLY)

You can also specify read_preference for specific queries, as follows:

results = db.product.find(..., read_preference=pymongo.SECONDARY)

or

results = db.product.find(..., read_preference=pymongo.SECONDARY_ONLY)