Querying Your Data Lake

You can use the MongoDB Query Language (MQL) on Atlas Data Lake to query and analyze data on your data store. Atlas Data Lake supports most, but not all, of the standard MongoDB server commands. To learn more about the supported and unsupported MongoDB server commands and aggregation pipeline stages, see Supported MongoDB Commands.

You can run up to 30 simultaneous queries on your Data Lake against:

  • Data in your S3 bucket.
  • Documents in your MongoDB Atlas cluster.
  • Data in files hosted at publicly accessible URLs.
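
For example, once you connect to your Data Lake with mongosh or a driver, you can run standard MQL reads against any virtual collection. The following is a minimal sketch; the database name, collection name, and status field are hypothetical placeholders for your own configuration and data.

Example

// Hypothetical virtual database and collection names.
use sampleDB
db.sampleCollection.find({ status: "active" }).limit(5)

// Aggregation pipelines work the same way.
db.sampleCollection.aggregate([
  { $match: { status: "active" } },
  { $count: "activeDocuments" }
])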

Querying Data on S3

You can use Atlas Data Lake to query and analyze data on your cloud object store using the MongoDB Query Language (MQL). To query data on S3, your Data Lake storage configuration must contain settings that define:

  • Your S3 data store.
  • Data Lake virtual databases and collections that map to your data store.
Example
{
  "stores" : [
    {
      "name" : "<store-name>",
      "provider" : "s3",
      "region" : "<aws-region>",
      "bucket" : "<s3-bucket-name>",
      "prefix" : "<file-path-prefix>",
      "delimiter" : "<path-separator>"
    }
  ],
  "databases" : [
    {
      "name" : "<database-name>",
      "collections" : [
        {
          "name" : "<collection-name>",
          "dataSources" : [
            {
              "storeName" : "<store-name>",
              "path" : "<path-to-file>"
            }
          ]
        }
      ],
      "maxWildcardCollections" : <number-of-wildcard-collections>
    }
  ]
}

To learn more about these settings, see Data Lake Configuration.

Data Lake creates the virtual databases and collections you specified in your Data Lake configuration for the data in your S3 store. When you connect to your Data Lake and run queries, Data Lake processes your queries against the data and returns the query results.

If you deployed your Data Lake with an S3 bucket that grants both read and write permissions, or the AWS s3:PutObject permission, you can also save your query results in your S3 bucket using $out to S3.
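
For example, an aggregation pipeline can end with a $out stage that writes its results back to S3. The following is a minimal sketch; the filter field and the filename are hypothetical placeholders, and the full set of options is described in the $out to S3 reference.

Example

db.sampleCollection.aggregate([
  // Hypothetical filter; replace with your own query.
  { $match : { status : "complete" } },
  // Write the matching documents to your S3 bucket as JSON files.
  { $out : {
      s3 : {
        "bucket" : "<s3-bucket-name>",
        "region" : "<aws-region>",
        "filename" : "results/",
        "format" : { "name" : "json" }
      }
  } }
])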

If you successfully create or update an object in your S3 data store, Data Lake returns the latest version of that object for any subsequent read requests, and all list operations on the objects also reflect the change. If your query contains multiple stages, each stage receives the most recent data available from the data store as that stage is processed.

By default, Atlas Data Lake does not return documents in any specific order for queries on S3 data stores. Atlas Data Lake reads the partitions concurrently, and the order of the underlying storage responses determines which documents it returns first unless you define an order with $sort in your query. For example, if you run the same findOne() query twice, you could see different documents, and if you use $skip without $sort, different documents might be skipped each time.
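
To make the order deterministic, add a $sort stage before any $skip or $limit. A minimal sketch, assuming your documents have a unique _id field to sort on:

Example

db.s3Collection.aggregate([
  { $sort : { _id : 1 } },  // establish a stable order first
  { $skip : 10 },           // the same documents are now skipped on every run
  { $limit : 5 }
])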

Querying Data in Your Atlas Cluster

You can use Atlas Data Lake to query and analyze data in your Atlas cluster. To query data in your Atlas cluster, your Data Lake storage configuration must contain settings that define:

  • Your Atlas data store.
  • Data Lake virtual databases and collections that map to your data store.
Example
{
  "stores" : [
    {
      "name" : "<store-name>",
      "provider" : "atlas",
      "clusterName" : "<atlas-cluster-name>",
      "projectId" : "<atlas-project-ID>"
    }
  ],
  "databases" : [
    {
      "name" : "<database-name>",
      "collections" : [
        {
          "name" : "<collection-name>",
          "dataSources" : [
            {
              "storeName" : "<store-name>",
              "database" : "<atlas-database-name>",
              "collection" : "<atlas-collection-name>"
            }
          ]
        }
      ]
    }
  ]
}

To learn more about these settings, see Data Lake Configuration. You can create or update your Data Lake storage configuration for an Atlas cluster data store using the Visual Editor or the JSON Editor. For more information, see Deploy a Data Lake for an Atlas Cluster Data Store.

Data Lake automatically detects the file format and creates the virtual databases and collections you specified in your Data Lake configuration. When you connect to your Data Lake and run queries, Data Lake processes your queries against the data and returns the query results.

If you query a collection in Atlas Data Lake that is mapped to only one Atlas collection, Atlas Data Lake acts as a proxy and forwards your query to Atlas. When acting as a proxy, Atlas Data Lake doesn't scan data into its virtual collection to process the query, which improves performance and reduces cost. This optimization is not available for queries on Atlas Data Lake collections that are mapped to multiple Atlas collections.

Example

Consider the following Data Lake storage configuration:

{
  "stores" : [
    {
      "name" : "atlas-store",
      "provider" : "atlas",
      "clusterName" : "myCluster",
      "projectId" : "5e2211c17a3e5a48f5497de3"
    }
  ],
  "databases" : [
    {
      "name" : "atlas-db",
      "collections" : [
        {
          "name" : "foo",
          "dataSources" : [
            {
              "storeName" : "atlas-store",
              "database" : "myFooData",
              "collection" : "foo"
            }
          ]
        },
        {
          "name" : "barbaz",
          "dataSources" : [
            {
              "storeName" : "atlas-store",
              "database" : "myBarData",
              "collection" : "bar"
            },
            {
              "storeName" : "atlas-store",
              "database" : "myBazData",
              "collection" : "baz"
            }
          ]
        }
      ]
    }
  ]
}

For the above storage configuration, Atlas Data Lake acts as a proxy for queries on the foo collection and forwards them to Atlas. This performance and cost optimization is not available for queries on the barbaz collection because barbaz is mapped to multiple Atlas collections.
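
For illustration, the two queries below behave differently under this configuration. The qty filter is a hypothetical placeholder.

Example

// Forwarded to the Atlas cluster as a proxy query, because foo maps
// to exactly one Atlas collection (myFooData.foo):
db.getSiblingDB("atlas-db").foo.find({ qty : { $gt : 10 } })

// Scanned by Data Lake, because barbaz maps to two Atlas collections
// (myBarData.bar and myBazData.baz):
db.getSiblingDB("atlas-db").barbaz.find({ qty : { $gt : 10 } })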

You can also save your query results in your Atlas cluster using $out to Atlas.
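
For example, an aggregation pipeline can end with a $out stage that targets a collection in your Atlas cluster. The following is a minimal sketch; the filter is a hypothetical placeholder, and the complete option list is in the $out to Atlas reference.

Example

db.sampleCollection.aggregate([
  { $match : { year : 2021 } },  // hypothetical filter
  { $out : {
      "atlas" : {
        "projectId" : "<atlas-project-ID>",
        "clusterName" : "<atlas-cluster-name>",
        "db" : "<atlas-database-name>",
        "coll" : "<atlas-collection-name>"
      }
  } }
])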

If you successfully create or update a document in your collection on the Atlas cluster, Data Lake returns the latest version of that document for any subsequent read requests, and all list operations on the collection also reflect the change. If your query contains multiple stages, each stage receives the most recent data available from the data store as that stage is processed.

Querying Data at an HTTP or HTTPS URL

Note
Beta

Support for HTTP data stores is available as a Beta feature. The feature and the corresponding documentation may change at any time during the Beta stage.

You can use Atlas Data Lake to query and analyze data in files hosted at publicly accessible URLs. To query this data, your Data Lake storage configuration must contain settings that define:

  • Your HTTP data store.
  • Data Lake virtual databases and collections that map to your data store.
Example
{
  "stores" : [
    {
      "name" : "<store-name>",
      "provider" : "http",
      "urls" : ["<url>"],
      "defaultFormat" : "<string>",
      "allowInsecure" : <boolean>
    }
  ],
  "databases" : [
    {
      "name" : "<database-name>",
      "collections" : [
        {
          "name" : "<collection-name>",
          "dataSources" : [
            {
              "storeName" : "<store-name>",
              "urls" : ["<url>"],
              "defaultFormat" : "<string>",
              "allowInsecure" : <boolean>
            }
          ]
        }
      ]
    }
  ]
}

To learn more about these settings, see Data Lake Configuration. You can create or update your Data Lake storage configuration for an HTTP data store using the Visual Editor or the JSON Editor. For more information, see Deploy a Data Lake for an HTTP Data Store.

Data Lake creates the virtual databases and collections you specified in your Data Lake configuration for the data at your URLs. Data Lake also creates one partition for each URL in your collection. When you connect to your Data Lake and run queries, Data Lake processes your queries against the data and returns the query results.
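
Queries against an HTTP-backed collection use the same MQL as any other virtual collection; Data Lake fetches the file at each configured URL and parses it according to its format. A minimal sketch with hypothetical database and collection names:

Example

use httpDB
db.httpCollection.find().limit(10)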

Running Federated Queries

You can use Atlas Data Lake to query and analyze a unified view of data in your Atlas cluster, in your S3 bucket, and at your HTTP URLs. For federated queries, your Data Lake storage configuration must contain settings that define:

  • Your S3, Atlas, and HTTP data stores.
  • Data Lake virtual databases and collections that map to your S3, Atlas, and HTTP data stores.
Example
{
  "stores" : [
    {
      "name" : "<atlas-store-name>",
      "provider" : "atlas",
      "clusterName" : "<atlas-cluster-name>",
      "projectId" : "<atlas-project-ID>"
    },
    {
      "name" : "<s3-store-name>",
      "provider" : "s3",
      "region" : "<aws-region>",
      "bucket" : "<s3-bucket-name>",
      "prefix" : "<file-path-prefix>",
      "delimiter" : "<path-separator>"
    },
    {
      "name" : "<http-store-name>",
      "provider" : "http",
      "urls" : ["<url>"],
      "defaultFormat" : "<string>",
      "allowInsecure" : <boolean>
    }
  ],
  "databases" : [
    {
      "name" : "<database-name>",
      "collections" : [
        {
          "name" : "<collection-name>",
          "dataSources" : [
            {
              "storeName" : "<atlas-store-name>",
              "database" : "<atlas-database-name>",
              "collection" : "<atlas-collection-name>"
            },
            {
              "storeName" : "<s3-store-name>",
              "path" : "<path-to-file>"
            },
            {
              "storeName" : "<http-store-name>",
              "urls" : ["<url>"],
              "defaultFormat" : "<string>",
              "allowInsecure" : <boolean>
            }
          ]
        }
      ]
    }
  ]
}

To learn more about these settings, see Data Lake Configuration.

When you connect to your Data Lake and run federated queries, Data Lake combines data from your Atlas cluster, S3 bucket, and HTTP store in virtual databases and collections and returns a union of data in the results.
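
For example, a single aggregation against the federated collection operates on the union of documents from all three stores. The grouping fields below are hypothetical placeholders:

Example

// One query over the combined Atlas, S3, and HTTP data sources.
db.getCollection("<collection-name>").aggregate([
  { $group : { _id : "$region", total : { $sum : "$amount" } } },
  { $sort : { total : -1 } }
])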

Troubleshooting

Error: We are currently experiencing increased query processing wait times for Atlas Data Lake. Our Engineering team is investigating. Normal service will resume shortly, please try again.

Atlas Data Lake returns this error only when Atlas Data Lake can't execute queries because of resource contention. We recommend that you run your queries again.
