Data Lake Configuration File

Overview

The Atlas Data Lake configuration file uses the JSON format. It defines mappings between your data stores and Data Lake. Data Lake supports S3 buckets and Atlas clusters as data stores. The following sections describe the Data Lake configuration for each type of data store.

Consider an S3 bucket datacenter-alpha containing data collected from a datacenter:

|--metrics
   |--hardware

The /metrics/hardware path stores JSON files with metrics derived from the datacenter hardware, where each filename is the UNIX timestamp in milliseconds of the 24-hour period covered by that file:

/hardware/1564671291998.json

The following configuration file:

  • Defines a data store on the datacenter-alpha S3 bucket in the us-east-1 AWS region. The prefix restricts the data store to data files in the metrics folder path only.
  • Maps files from the hardware folder to a MongoDB database datacenter-alpha-metrics and collection hardware. The configuration mapping includes parsing logic for capturing the timestamp encoded in each filename.
{
  "stores" : [
    {
      "name" : "datacenter-alpha",
      "provider" : "s3",
      "region" : "us-east-1",
      "bucket" : "datacenter-alpha",
      "prefix" : "/metrics",
      "delimiter" : "/"
    }
  ],
  "databases" : [
    {
      "name" : "datacenter-alpha-metrics",
      "collections" : [
        {
          "name" : "hardware",
          "dataSources" : [
            {
              "storeName" : "datacenter-alpha",
              "path" : "/hardware/{date date}"
            }
          ]
        }
      ]
    }
  ]
}

Atlas Data Lake parses the S3 bucket datacenter-alpha and processes all files under /metrics/hardware/. The collection uses the path parsing syntax to map the filename to the date field, an ISO 8601 date, in each document. If a matching date field does not exist in a document, Data Lake adds it.

Users connected to the Data Lake can use the MongoDB Query Language and supported aggregations to analyze data in the S3 bucket through the datacenter-alpha-metrics.hardware collection.
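
For example, the following aggregation pipeline, shown here as extended JSON, filters on the date field that Data Lake derives from each filename and counts the matching documents. The cutoff date is illustrative, not part of the example above:

[
  { "$match" : { "date" : { "$gte" : { "$date" : "2019-08-01T00:00:00Z" } } } },
  { "$group" : { "_id" : null, "count" : { "$sum" : 1 } } }
]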

note

To use Atlas as a data store, Data Lake requires an M10 or higher cluster.

Consider an M10 or higher Atlas cluster named myDataCenter containing data in the metrics.hardware collection. The metrics.hardware collection contains JSON documents with metrics derived from the hardware in a datacenter. The following configuration file:

  • Specifies the Atlas cluster named myDataCenter in the specified project as a data store.
  • Maps documents from the metrics.hardware collection in the Atlas cluster to the dataCenter.inventory collection in the storage configuration.
{
  "stores" : [
    {
      "name" : "atlasClusterStore",
      "provider" : "atlas",
      "clusterName" : "myDataCenter",
      "projectID" : "5e2211c17a3e5a48f5497de3"
    }
  ],
  "databases" : [
    {
      "name" : "dataCenter",
      "collections" : [
        {
          "name" : "inventory",
          "dataSources" : [
            {
              "storeName" : "atlasClusterStore",
              "database" : "metrics",
              "collection" : "hardware"
            }
          ]
        }
      ]
    }
  ]
}

Atlas Data Lake maps all the documents in the metrics.hardware collection to the dataCenter.inventory collection in the storage configuration.

Users connected to the Data Lake can use the MongoDB Query Language and supported aggregations to analyze data in the Atlas cluster through the dataCenter.inventory collection. Queries go through Atlas Data Lake first; therefore, aggregation queries that your Atlas cluster supports but Atlas Data Lake does not will fail. To learn more about supported and unsupported commands in Data Lake, see Supported MongoDB Commands.
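
For example, a pipeline built from common stages such as $match and $group runs against the dataCenter.inventory collection like any other collection. The status and type field names below are illustrative:

[
  { "$match" : { "status" : "active" } },
  { "$group" : { "_id" : "$type", "count" : { "$sum" : 1 } } }
]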

important

If the database in the storage configuration contains collections from both S3 and Atlas data stores, query results might contain data from both data stores.

Configuration File Format

The Data Lake configuration file has one of the following formats, depending on the type of data store.

For an S3 data store:

{
  "stores" : [
    {
      "name" : "<string>",
      "provider": "<string>",
      "region" : "<string>",
      "bucket" : "<string>",
      "prefix" : "<string>",
      "includeTags": <boolean>,
      "delimiter": "<string>"
    }
  ],
    "databases" : [
      {
        "name" : "<string>",
        "collections" : [
          {
            "name" : "<string>",
            "dataSources" : [
              {
                "storeName" : "<string>",
                "path" : "<string>",
                "defaultFormat" : "<string>"
              }
            ]
          }
        ],
        "views" : [
          {
            "name" : "<string>",
            "source" : "<string>",
            "pipeline" : "<string>"
          }
        ]
      }
    ]
  }
{
  "stores" : [
    {
      "name" : "<string>",
      "provider": "<string>",
      "clusterName": "<string>",
      "projectId": "<string>"
    }
  ],
    "databases" : [
      {
        "name" : "<string>",
        "collections" : [
          {
            "name" : "<string>",
            "dataSources" : [
              {
                "storeName" : "<string>",
                "database" : "<string>",
                "collection" : "<string>"
              }
            ]
          }
        ],
        "views" : [
          {
            "name" : "<string>",
            "source" : "<string>",
            "pipeline" : "<string>"
          }
        ]
      }
    ]
  }
stores
The stores object defines each data store associated with the Data Lake. Data Lake can only access data stores defined in the stores object.
databases
The databases object defines the mapping between each data store defined in stores and the MongoDB collections in the databases.

stores

For an S3 data store:

"stores" : [
  {
    "name" : "<string>",
    "provider" : "<string>",
    "region" : "<string>",
    "bucket" : "<string>",
    "prefix" : "<string>",
    "delimiter" : "<string>",
    "includeTags": <boolean>
  }
]

For an Atlas cluster data store:
"stores" : [
  {
    "name" : "<string>",
    "provider" : "<string>",
    "clusterName" : "<string>",
    "projectId": "<string>"
  }
]
stores

The stores object defines an array of data stores associated with a Data Lake. A data store captures files in an S3 bucket or documents in an Atlas cluster. An Atlas Data Lake can only access data stores defined in the stores object.

stores.[n].name

Name of the data store. The databases.[n].collections.[n].dataSources.[n].storeName field references this value as part of mapping configuration.

stores.[n].provider

Defines where the data is stored. Value can be one of the following:

  • s3 for an AWS S3 bucket.
  • atlas for a collection in an Atlas cluster.
stores.[n].region

Name of the AWS region in which the S3 bucket is hosted. For a list of valid region names, see Amazon Web Services (AWS).

stores.[n].bucket

Name of the AWS S3 bucket. Must exactly match the name of an S3 bucket which Data Lake can access given the configured AWS IAM credentials.

stores.[n].prefix

Optional. Prefix Data Lake applies when searching for files in the S3 bucket.

For example, consider an S3 bucket metrics with the following structure:

metrics
  |--hardware
  |--software
     |--computed

The data store prepends the value of prefix to the databases.[n].collections.[n].dataSources.[n].path to create the full path for files to ingest. Setting the prefix to /software restricts any databases objects that use this data store to subpaths of /software only.

If omitted, Data Lake searches all files from the root of the S3 bucket.
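
For example, the following store definition restricts Data Lake to the /software folder of the metrics bucket from the example above. The store name and region are illustrative:

"stores" : [
  {
    "name" : "metricsStore",
    "provider" : "s3",
    "region" : "us-east-1",
    "bucket" : "metrics",
    "prefix" : "/software"
  }
]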

stores.[n].delimiter

Optional. The delimiter that separates databases.[n].collections.[n].dataSources.[n].path segments in the data store. Data Lake uses the delimiter to efficiently traverse S3 buckets with a hierarchical directory structure. You can specify any character supported in S3 object keys as the delimiter, for example, an underscore (_), a plus sign (+), or multiple characters such as a double underscore (__).

If omitted, defaults to "/".
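
For example, if the object keys in a bucket use double underscores instead of slashes, a hypothetical key scheme such as metrics__hardware__1564671291998.json, you might define the store as follows. The store name and region are illustrative:

"stores" : [
  {
    "name" : "flatMetricsStore",
    "provider" : "s3",
    "region" : "us-east-1",
    "bucket" : "metrics",
    "delimiter" : "__"
  }
]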

stores.[n].includeTags

Optional. Determines whether or not to use S3 tags on the files in the given path as additional partition attributes. Valid values are true and false.

If omitted, defaults to false.

If set to true, Data Lake does the following:

  • Adds the S3 tags as additional partition attributes.
  • Adds new top-level BSON elements that associate each tag with each document generated from the tagged files (see the sketch below).
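
For example, if a file carries an S3 tag with key environment and value prod, a document generated from that file might look like the following. The tag and the other field values here are illustrative:

{
  "deviceId" : "rack-12",
  "cpuUsage" : 0.82,
  "environment" : "prod"
}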

warning

If set to true, Data Lake processes the files for additional partition attributes by making extra calls to S3 to get the tags. This behavior might impact performance.

stores.[n].clusterName

Name of the Atlas cluster on which the store is based. The cluster must be an M10 or higher cluster and must exist in the same project as your Data Lake. The source field on the data partition is the name of the Atlas cluster.

stores.[n].projectId

Unique identifier of the project that contains the Atlas cluster on which the store is based.

databases

For an S3 data store:

"databases" : [
  {
    "name" : "<string>",
    "collections" : [
      {
        "name" : "<string>",
        "dataSources" : [
          {
            "storeName" : "<string>",
            "defaultFormat" : "<string>",
            "path" : "<string>"
          }
        ]
      }
    ],
    "views" : [
      {
        "name" : "<string>",
        "source" : "<string>",
        "pipeline" : "<string>"
      }
    ]
  }
]

For an Atlas cluster data store:
"databases" : [
  {
    "name" : "<string>",
    "collections" : [
      {
        "name" : "<string>",
        "dataSources" : [
          {
            "storeName" : "<string>",
            "database" : "<string>",
            "collection" : "<string>"
          }
        ]
      }
    ]
  }
]
databases

Array of objects where each object represents a database, its collections, and, optionally, any views on the collections. Each database can have multiple collections and views objects.

databases.[n].name

Name of the database to which Data Lake maps the data contained in the data store.

databases.[n].collections

Array of objects where each object represents a collection and data sources that map to a stores data store.

databases.[n].collections.[n].name

Name of the collection to which Data Lake maps the data contained in each databases.[n].collections.[n].dataSources.[n].storeName. Each object in the array represents the mapping between the collection and an object in the stores array.

You can generate collection names dynamically from file paths by specifying * for the collection name and the collectionName() function in the databases.[n].collections.[n].dataSources.[n].path field. See Generate Dynamic Collection Names from File Path for examples.

For Atlas data stores, you can generate collection names dynamically by specifying * for the collection name and omitting the databases.[n].collections.[n].dataSources.[n].collection field.
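
For example, the following S3 mapping, which reuses the metrics store name from the earlier examples, creates one collection per top-level folder in the data store, so files under /hardware map to a hardware collection and files under /software map to a software collection. This is a minimal sketch; see Generate Dynamic Collection Names from File Path for authoritative examples:

"collections" : [
  {
    "name" : "*",
    "dataSources" : [
      {
        "storeName" : "metricsStore",
        "path" : "/{collectionName()}"
      }
    ]
  }
]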

databases.[n].collections.[n].dataSources

Array of objects where each object represents a stores data store to map with the collection.

databases.[n].collections.[n].dataSources.[n].storeName

Name of a data store to map to the <collection>. Must match the stores.[n].name of an object in the stores array.

databases.[n].collections.[n].dataSources.[n].path

Controls how Atlas Data Lake searches for and parses files in the databases.[n].collections.[n].dataSources.[n].storeName before mapping them to the <collection>. Data Lake prepends the stores.[n].prefix to the path to build the full path to search within. Specify / to capture all files and folders from the prefix path.

For example, consider an S3 bucket metrics with the following structure:

metrics
|--hardware
|--software
   |--computed

A path of / directs Data Lake to search all files and folders in the metrics bucket.

A path of /hardware directs Data Lake to search only that path for files to ingest.

If the stores.[n].prefix is /software, a path of /computed directs Data Lake to search for files only in the path /software/computed.

Appending the * wildcard character to the path directs Data Lake to include all files and folders from that point in the path. For example, /software/computed* would match files like /software/computed-detailed, /software/computedArchive, and /software/computed/errors.
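
For example, reusing the metrics store name from above:

"dataSources" : [
  {
    "storeName" : "metricsStore",
    "path" : "/software/computed*"
  }
]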

databases.[n].collections.[n].dataSources.[n].path supports additional syntax for parsing filenames, including:

  • Generating document fields from filenames.
  • Using regular expressions to control field generation.
  • Setting boundaries for bucketing filenames by timestamp.

See Path Syntax Examples for more information.
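
For example, a path such as the following, where the deviceId field name is illustrative, populates a string field deviceId from each folder name and a date field from each filename:

"path" : "/{deviceId string}/{date date}"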

note

When specifying the databases.[n].collections.[n].dataSources.[n].path, use the delimiter specified in stores.[n].delimiter.

databases.[n].collections.[n].dataSources.[n].defaultFormat

Optional. Specifies the default format Data Lake assumes if it encounters a file without an extension while searching the databases.[n].collections.[n].dataSources.[n].storeName.

If omitted, Data Lake attempts to detect the file type by processing a few bytes of the file.

note

If your file format is CSV or TSV, you must include a header row in your data. See Comma-Separated and Tab-Separated Value Data Files for more information.

The following values are valid for the defaultFormat field:

.json, .json.gz, .bson, .bson.gz, .avro, .avro.gz, .tsv, .tsv.gz, .csv, .csv.gz, .parquet
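
For example, the following data source, reusing names from the metrics example, tells Data Lake to treat files without an extension under /hardware as JSON:

"dataSources" : [
  {
    "storeName" : "metricsStore",
    "path" : "/hardware",
    "defaultFormat" : ".json"
  }
]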

databases.[n].collections.[n].dataSources.[n].database

Name of the database on the Atlas cluster that contains the collection.

databases.[n].collections.[n].dataSources.[n].collection

Name of the collection in the Atlas cluster on which the Data Lake data store is based. Do not specify this field when creating a wildcard (*) collection.

databases.[n].views

Array of objects where each object represents an aggregation pipeline on a collection. To learn more about views, see Views.

databases.[n].views.[n].name

Name of the view.

databases.[n].views.[n].source

Name of the source collection for the view.

databases.[n].views.[n].pipeline

Aggregation pipeline stage(s) to apply to the databases.[n].views.[n].source collection.
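
For example, the following view definition, where the names and pipeline are illustrative, exposes only the active documents from a hardware source collection:

"views" : [
  {
    "name" : "activeHardware",
    "source" : "hardware",
    "pipeline" : "[{ \"$match\" : { \"status\" : \"active\" } }]"
  }
]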