Navigation
This version of the documentation is archived and no longer supported.

Map Relational Schemas to MongoDB

Overview

Business intelligence tools connect to a data source and, given a fixed tabular schema, allow users to visually explore their data. MongoDB uses a flexible schema, so some business intelligence tools cannot use MongoDB as a native data source.

To use MongoDB with a relational business intelligence tool such as Tableau, you must define a relational mapping for your MongoDB data. MongoDB Connector for BI provides several methods for creating and managing a relational schema.

BI Connector’s proxy server, mongosqld, has startup options which determine how it handles schema management. These are covered in detail in the mongosqld usage documentation.

Cached Sampling

New in version 2.3:

When mongosqld starts up, it can sample documents on a MongoDB instance to generate a relational representation of the schema, which it caches in memory. This is the default mode, which mongosqld uses unless you specify a schema file with the --schema option or a schema database with the --sampleSource option.

Note

If you have authentication enabled, ensure that your MongoDB user has the required permissions for cached sampling.

You can configure mongosqld to automatically resample on a fixed schedule with the --sampleRefreshIntervalSecs option. If you do not specify a resampling interval, mongosqld uses its initial schema for as long as the process runs.

If you need to manually edit your schema to ensure correct representation of your data, you can use a schema file instead.

See Sampling Mode Reference Chart for more information on sampling modes.

User Permissions for Cached Sampling

If your MongoDB instance uses authentication and you wish to use cached sampling, your BI Connector instance must also use authentication. The admin user that connects to MongoDB via the mongosqld program must have the following privileges:

Alternatively, create a user with the built-in readAnyDatabase role to grant all of the required privileges:

use admin

db.createUser(
  {
    user: "<username>",
    pwd: "<password>",
    roles: [
            { "role": "readAnyDatabase", "db": "admin"  }
           ]
  }
)

Note

Be aware of all privileges included with the readAnyDatabase role before granting it to a user.

To sample all namespaces, start mongosqld without the --sampleNamespaces option.

mongosqld --auth --mongo-username <username> --mongo-password <password>

Persist a Schema in MongoDB

The --sampleSource option directs mongosqld to use a particular database for schema storage. mongosqld can either create a new database to use for schema storage or use an existing database.

If you specify an existing database which has been used previously for schema storage, the --sampleMode option determines whether mongosqld writes new schema data to the specified database or only reads from it. If you specify a database which doesn’t currently exist, you must set --sampleMode to write.

The --sampleRefreshIntervalSecs option determines how frequently mongosqld resamples data to update the schema.

To force a one-time update of the schema, use the FLUSH SAMPLE command from your SQL client.

See Sampling Mode Reference Chart for more information on sampling modes.

User Permissions for Persisted Schemas

When using a persisted schema, the required user permissions are similar to those for cached sampling, with the addition of a required write permission on the specified schema database. To create a user with the necessary permissions to run mongosqld with a persisted schema, execute the following command in the mongo shell, with the placeholder values <username>, <password>, and <dbname> filled in with your own values:

db.getSiblingDB("admin").createUser(
  {
    user: "<username>",
    pwd: "<password>",
    roles: [
             { "role": "readAnyDatabase", "db": "admin"  },
             { "role": "readWrite", "db": "<dbname>"  }
           ]
  }
)

Resample Schema Data with FLUSH SAMPLE

When the mongosqld process starts it creates a schema, either from a schema file or by sampling data from a MongoDB instance. If the data in your MongoDB instance changes shape significantly with new fields or collections, you may wish to regenerate the schema BI Connector uses. You can regenerate the schema either by restarting mongosqld or by issuing the FLUSH SAMPLE command from within the MySQL shell.

The --sampleRefreshIntervalSecs option provides another way for mongosqld to regenerate the schema by resampling data at a user-specified interval.

Note

You cannot use the FLUSH SAMPLE command if mongosqld starts with a schema file via the --schema option.

Managing Schema with MongoDB Views

Another way to control the data visible to BI Connector without using a .drdl file is to read from a view. Views allow you to take a source collection and choose which fields to display, change the names of fields, sort data, and perform any kind of transformation that the aggregation pipeline can accomplish.

You can use the mongo shell method createView to define a view on a collection and then use that view as a specified namespace when starting mongosqld.

Examples

Consider a collection named salaries in the employees database. By creating a view on the salaries collection that excludes all personally identifiable information, BI Connector can safely facilitate data analysis on the view without compromising the private data in the source collection.

The following command starts mongosqld with data from a view named salaries_redacted in the employees database:

mongosqld --sampleNamespaces "employees.salaries_redacted"

Another use case for managing your schema with a MongoDB view is if your working data set includes a sparsely populated but important field which may be excluded in the BI Connector sampling process.

Consider a collection named sparseCollection which contains one field which occurs in every document and another field which occurs only occasionally. A portion of the collection might look like this:

{ "_id" : 0, "mainField": 247 }
{ "_id" : 1, "mainField": 71 }
{ "_id" : 2, "mainField": 981 }
{ "_id" : 3, "mainField": 712 }
{ "_id" : 4, "mainField": 781, "sparseField": 1000 }
{ "_id" : 5, "mainField": 256 }
{ "_id" : 6, "mainField": 910 }

To ensure that sparseField is included when BI Connector creates its schema, create a view in which sparseField is populated with zeroes in every document in which it does not occur.

db.runCommand (
  {
    create: "denseCollection",
    viewOn: "sparseCollection",
    pipeline: [
      {
        $addFields: {
           rareField: { $ifNull: [ "$rareField", 0 ] }
        }
      }
    ]
  }
)

To use the view as a data source when starting mongosqld, specify it with the --sampleNamespaces option.

mongosqld --sampleNamespaces "myDatabase.denseCollection"

Note

BI Connector does not recognize geospatial fields when reading from a view.

Load a Schema from a DRDL File

The --schema option allows you to specify a schema file for mongosqld to use.

The schema generation tool mongodrdl outputs a schema file in Document Relational Definition Language (DRDL) format, but you should verify that the schema it proposes matches your data requirements. Examine both the table structure and field types to ensure that you can generate your intended reports.

It is possible to manually edit the schema definition files to perform the following actions:

  • Add aggregation pipeline stages
  • Add fields that mongodrdl did not discover within the subset of documents that it sampled
  • Remove fields
  • Remove tables
  • Rename tables, so long as the collection field remains the same
  • Rename fields using the sqlname option

Important

To apply any changes you make to your DRDL files, you must restart mongosqld.

Geospatial Data

If a collection contains a 2d or 2dsphere geospatial index, BI Connector maps the indexed field into an array of numeric longitude-latitude coordinates.

Example

Given the following collection:

db.points.createIndex( { pos : "2dsphere" } )
db.points.insert({
    pos : { type: "Point", coordinates: [ -73.97, 40.77 ] },
    name: "Central Park",
    category : "Parks"
})

BI Connector generates the following schema:

schema:
- db: test
  tables:
  - table: points
    collection: points
    pipeline: []
    columns:
    - Name: _id
      MongoType: bson.ObjectId
      SqlName: _id
      SqlType: varchar
    - Name: category
      MongoType: string
      SqlName: category
      SqlType: varchar
    - Name: name
      MongoType: string
      SqlName: name
      SqlType: varchar
    - Name: pos.coordinates
      MongoType: geo.2darray
      SqlName: pos.coordinates
      SqlType: numeric[]

Note

BI Connector does not recognize geospatial fields when reading from a view.