Data Lake Configuration

The Atlas Data Lake configuration is a JSON document that contains the mappings between your data stores and Data Lake. Data Lake supports S3 buckets, Atlas clusters, and publicly accessible URLs as data stores. To run queries against your data, you must define mappings between Data Lake and your S3 bucket, Atlas cluster, and HTTP data stores.

Important

Information in your storage configuration is visible internally at MongoDB and stored as operational data to monitor and improve the performance of Atlas Data Lake. We therefore recommend that you do not include personally identifiable information (PII) in your configurations.

You can define mappings between your S3, Atlas cluster, and HTTP data stores and Data Lake in the storage configuration to run federated queries against your data.

Example

For the preceding sample S3, Atlas cluster, and HTTP data stores, the Data Lake configuration for federated queries resembles the following:

{
  "stores" : [
    {
      "name" : "datacenter-alpha",
      "provider" : "s3",
      "region" : "us-east-1",
      "bucket" : "datacenter-alpha",
      "additionalStorageClasses" : [
        "STANDARD_IA"
      ],
      "prefix" : "/metrics",
      "delimiter" : "/"
    },
    {
      "name" : "atlasClusterStore",
      "provider" : "atlas",
      "clusterName" : "myDataCenter",
      "projectId" : "5e2211c17a3e5a48f5497de3"
    },
    {
      "name" : "httpStore",
      "provider" : "http",
      "allowInsecure" : false,
      "urls" : [
        "https://www.datacenter-hardware.com/data.json",
        "https://www.datacenter-software.com/data.json"
      ],
      "defaultFormat" : ".json"
    }
  ],
  "databases" : [
    {
      "name" : "datacenter-metrics",
      "collections" : [
        {
          "name" : "inventory",
          "dataSources" : [
            {
              "storeName" : "datacenter-alpha",
              "path" : "/hardware/{date date}"
            },
            {
              "storeName" : "atlasClusterStore",
              "database" : "metrics",
              "collection" : "hardware"
            },
            {
              "storeName" : "httpStore",
              "allowInsecure" : false,
              "urls" : [
                "https://www.datacenter-metrics.com/data.json"
              ],
              "defaultFormat" : ".json"
            }
          ]
        }
      ]
    }
  ]
}

Important

If the database in the storage configuration contains collections from S3, Atlas, and HTTP data stores, the query results might contain data from all the data stores.
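
To see ahead of time which kinds of data stores can contribute to a collection's query results, you can inspect the configuration locally. The following Python sketch is illustrative only: the helper name providers_by_collection is hypothetical, the sample dictionary is a trimmed copy of the example configuration, and nothing here calls Atlas.

```python
def providers_by_collection(config):
    """Map each "<database>.<collection>" to the sorted providers of its data sources."""
    # Look up the provider of every store by its name.
    provider_of = {store["name"]: store["provider"] for store in config["stores"]}
    result = {}
    for database in config["databases"]:
        for collection in database["collections"]:
            key = f'{database["name"]}.{collection["name"]}'
            providers = {
                provider_of[source["storeName"]]
                for source in collection["dataSources"]
            }
            result[key] = sorted(providers)
    return result


# Trimmed copy of the example configuration (only the fields this sketch needs).
config = {
    "stores": [
        {"name": "datacenter-alpha", "provider": "s3"},
        {"name": "atlasClusterStore", "provider": "atlas"},
        {"name": "httpStore", "provider": "http"},
    ],
    "databases": [
        {
            "name": "datacenter-metrics",
            "collections": [
                {
                    "name": "inventory",
                    "dataSources": [
                        {"storeName": "datacenter-alpha"},
                        {"storeName": "atlasClusterStore"},
                        {"storeName": "httpStore"},
                    ],
                }
            ],
        }
    ],
}

print(providers_by_collection(config))
```

A collection that lists more than one provider, such as inventory here, is one whose query results might combine data from several data stores.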

The Data Lake configuration has the following format:

stores
The stores object defines each data store associated with the Data Lake. The data store captures files in an S3 bucket, documents in an Atlas cluster, or files stored at publicly accessible URLs. Data Lake can only access data stores defined in the stores object.
databases
The databases object defines the mapping between each data store defined in stores and MongoDB collections in the databases.
stores

Array of objects where each object represents a data store to associate with the Data Lake. The data store captures files in an S3 bucket, documents in an Atlas cluster, or files stored at publicly accessible URLs. A Data Lake can only access data stores defined in the stores object.

stores.[n].name

Name of the data store. The databases.[n].collections.[n].dataSources.[n].storeName field references this value as part of mapping configuration.

stores.[n].provider

Defines where the data is stored. Value can be one of the following:

  • s3 for an AWS S3 bucket.
  • atlas for a collection in an Atlas cluster.
  • http for data in files hosted at publicly accessible URLs.
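
As a rough illustration of the provider values listed above, the following Python sketch checks a parsed storage configuration for unsupported providers before you apply it. The function name find_invalid_providers and the store named otherStore are hypothetical, and the check is a local sanity test, not part of any Atlas tooling.

```python
import json

# Provider values documented for stores.[n].provider.
SUPPORTED_PROVIDERS = {"s3", "atlas", "http"}


def find_invalid_providers(config):
    """Return (store name, provider) pairs whose provider is not supported."""
    invalid = []
    for store in config.get("stores", []):
        provider = store.get("provider")
        if provider not in SUPPORTED_PROVIDERS:
            invalid.append((store.get("name"), provider))
    return invalid


# Hypothetical sample; "ftp" is deliberately not one of the documented values.
sample = json.loads("""
{
  "stores": [
    {"name": "datacenter-alpha", "provider": "s3"},
    {"name": "atlasClusterStore", "provider": "atlas"},
    {"name": "otherStore", "provider": "ftp"}
  ]
}
""")

print(find_invalid_providers(sample))
```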
databases

Array of objects where each object represents a database, its collections, and, optionally, any views on the collections. Each database can have multiple collections and views objects.

databases.[n].name

Name of the database to which Data Lake maps the data contained in the data store.

databases.[n].collections

Array of objects where each object represents a collection and the data sources that map to a data store defined in the stores array.

databases.[n].collections.[n].name

Name of the collection to which Data Lake maps the data contained in each databases.[n].collections.[n].dataSources.[n].storeName. Each object in the array represents the mapping between the collection and an object in the stores array.

databases.[n].collections.[n].dataSources

Array of objects where each object represents a data store defined in the stores array to map to the collection.

databases.[n].collections.[n].dataSources.[n].storeName

Name of a data store to map to the <collection>. Must match the name of an object in the stores array.
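
Because every storeName must match the name of a store, a mistyped name is an easy mistake to make. The following Python sketch shows one way to catch it locally. The function find_unknown_store_names is hypothetical, the sample contains a deliberate typo, and the sketch compares names exactly, including letter case, which is an assumption rather than documented behavior.

```python
def find_unknown_store_names(config):
    """Return (database, collection, storeName) triples that match no defined store."""
    defined = {store["name"] for store in config.get("stores", [])}
    unknown = []
    for database in config.get("databases", []):
        for collection in database.get("collections", []):
            for source in collection.get("dataSources", []):
                store_name = source.get("storeName")
                if store_name not in defined:
                    unknown.append(
                        (database.get("name"), collection.get("name"), store_name)
                    )
    return unknown


# Sample with a deliberate typo: "atlasClusterstore" (lowercase "s").
config = {
    "stores": [
        {"name": "datacenter-alpha", "provider": "s3"},
        {"name": "atlasClusterStore", "provider": "atlas"},
    ],
    "databases": [
        {
            "name": "datacenter-metrics",
            "collections": [
                {
                    "name": "inventory",
                    "dataSources": [
                        {"storeName": "datacenter-alpha", "path": "/hardware/{date date}"},
                        {
                            "storeName": "atlasClusterstore",
                            "database": "metrics",
                            "collection": "hardware",
                        },
                    ],
                }
            ],
        }
    ],
}

print(find_unknown_store_names(config))
```

An empty list means every data source refers to a store that the configuration defines.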

databases.[n].views

Array of objects where each object represents an aggregation pipeline on a collection. To learn more about views, see Views.

databases.[n].views.[n].name

Name of the view.

databases.[n].views.[n].source

Name of the source collection for the view.

databases.[n].views.[n].pipeline

Aggregation pipeline stage(s) to apply to the source collection.
