
Deploy a Data Lake for S3 Data Store

This page describes how to deploy a Data Lake for accessing data in your AWS S3 buckets.

Before you begin, you will need to:

1
2
3
  • For your first Data Lake, click Create a Data Lake.
  • For your subsequent Data Lakes, click Configure a New Data Lake.
4
  • For a guided experience, click Visual Editor.
  • To edit the raw JSON, click JSON Editor.
5
1
  1. Click Connect Data to choose your data store.
  2. Choose Amazon S3 to configure a Data Lake for data in AWS S3 buckets.

    Corresponds to stores.[n].provider JSON configuration setting.

2

You can select an existing AWS IAM role that Atlas is authorized for from the role selection dropdown list or choose Authorize an AWS IAM Role to authorize a new role.

If you selected an existing role that Atlas is authorized for, proceed to the next step to list your AWS S3 buckets.

If you are authorizing Atlas for an existing role or are creating a new role, complete the following steps before proceeding to the next step:

  1. Select Authorize an AWS IAM Role to authorize a new role or select an existing role from the dropdown and click Next.
  2. Use the AWS ARN and unique External ID in the Add Atlas to the trust relationships of your AWS IAM role section to add Atlas to the trust relationships of an existing or new AWS IAM role.

    In the Atlas UI, click and expand one of the following:

    • The Create New Role with the AWS CLI section shows how to use the ARN and the unique External ID to add Atlas to the trust relationships of a new AWS IAM role. Follow the steps in the Atlas UI to create the new role. To learn more, see Create New Role with the AWS CLI.

      When authorizing a new role, if you quit the Configure a New Data Lake workflow:

      • Before validating the role, Atlas will not create the Data Lake. You can go to the Atlas Integrations page to authorize a new role. You can resume the workflow when you have the AWS IAM role ARN.
      • After validating the role, Atlas will not create the Data Lake. However, the role is available in the role selection drop-down and can be used to create a Data Lake. You do not need to authorize the role again.
    • The Add Trust Relationships to an Existing Role section shows how to use the ARN and the unique External ID to add Atlas to the trust relationships of an existing AWS IAM role. Follow the steps in the Atlas UI to add Atlas to the trust relationships of the existing role. To learn more, see Add Trust Relationships to an Existing Role.
    Important

    If you modify your custom AWS role ARN in the future, ensure that the access policy of the role includes the appropriate access to the S3 resources for the Data Lake.

  3. Click Next.
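
The trust relationship described in this step can be sketched as an IAM trust policy document. The following is a minimal illustration in Python, not the exact document that Atlas displays: it assumes the standard AWS pattern of an sts:AssumeRole statement conditioned on sts:ExternalId, and the function name build_trust_policy and the placeholder values are hypothetical. Always copy the actual ARN and External ID from the Atlas UI.

```python
import json


def build_trust_policy(atlas_aws_arn: str, external_id: str) -> dict:
    """Build an IAM trust policy that lets one principal assume the role.

    Assumption: Atlas is trusted through the standard sts:AssumeRole
    action, restricted by the External ID shown in the Atlas UI.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": atlas_aws_arn},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"sts:ExternalId": external_id}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values; replace them with the values from the Atlas UI.
    policy = build_trust_policy("<atlas-aws-arn>", "<external-id>")
    print(json.dumps(policy, indent=2))
```

The printed JSON could then be saved to a file and supplied when creating or updating the role, following the commands that the Atlas UI provides.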
3
  1. Enter the name of your S3 bucket.

    Corresponds to stores.[n].bucket JSON configuration setting.

  2. Specify whether the bucket is Read-only or both Read and write.

    If you choose Read-only, Atlas can only query the bucket. To query the bucket and also save query results to it, choose Read and write. To save query results to your S3 bucket, the role policy that grants Atlas access to your AWS resources must include the s3:PutObject and s3:DeleteObject permissions in addition to the s3:ListBucket, s3:GetObject, s3:GetObjectVersion, and s3:GetBucketLocation permissions, which grant read access. See step 4 below to learn more about assigning an access policy to your AWS IAM role.

  3. Select the region of the S3 bucket.

    Corresponds to stores.[n].region JSON configuration setting.

  4. Optional. Specify a prefix that Data Lake should use when searching the files in the S3 bucket. If omitted, Data Lake does a recursive search for all files from the root of the S3 bucket.

    Corresponds to stores.[n].prefix JSON configuration setting.

  5. Click Next.
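
For readers using the JSON Editor, the settings in this step map to one entry in the stores array. The sketch below assembles such an entry in Python. It covers only the provider, bucket, region, and prefix settings named above; the name field, the "s3" provider value, and the function name build_s3_store are assumptions added for illustration and should be checked against the configuration reference.

```python
import json
from typing import Optional


def build_s3_store(name: str, bucket: str, region: str,
                   prefix: Optional[str] = None) -> dict:
    """Build one stores[n] entry for an S3 data store.

    Assumptions: the entry carries a "name" field and the provider
    value for Amazon S3 is "s3".
    """
    store = {
        "name": name,
        "provider": "s3",
        "bucket": bucket,
        "region": region,
    }
    # The prefix is optional; when it is omitted, the search starts at
    # the root of the bucket.
    if prefix:
        store["prefix"] = prefix
    return store


if __name__ == "__main__":
    # Hypothetical bucket and region values.
    config = {"stores": [build_s3_store("s3store", "my-bucket", "us-east-1")]}
    print(json.dumps(config, indent=2))
```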
4
  1. Follow the steps in the Atlas user interface to assign an access policy to your AWS IAM role.

    Your role policy for read-only or read and write access should look similar to the following:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:ListBucket",
            "s3:GetObject",
            "s3:GetObjectVersion",
            "s3:GetBucketLocation"
          ],
          "Resource": [
            <role arn>
          ]
        }
      ]
    }
  2. Click Next.
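
The difference between the read-only and the read and write policies comes down to the two extra write permissions listed in step 3. A minimal Python sketch of that difference follows. The function name build_access_policy is hypothetical, and the resource values are placeholders that you should replace with the values shown in the Atlas UI.

```python
import json

# Permissions that grant read access, as listed in step 3.
READ_ACTIONS = [
    "s3:ListBucket",
    "s3:GetObject",
    "s3:GetObjectVersion",
    "s3:GetBucketLocation",
]
# Additional permissions needed to save query results to the bucket.
WRITE_ACTIONS = ["s3:PutObject", "s3:DeleteObject"]


def build_access_policy(resources, read_write=False):
    """Build an access policy for read-only or read and write access."""
    actions = READ_ACTIONS + (WRITE_ACTIONS if read_write else [])
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": actions,
                "Resource": list(resources),
            }
        ],
    }


if __name__ == "__main__":
    # "<resource-arn>" is a placeholder, not a real ARN.
    policy = build_access_policy(["<resource-arn>"], read_write=True)
    print(json.dumps(policy, indent=2))
```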
5

For example:

s3://<bucket-name>/<path>/<to>/<files>/<filename>.<file-extension>

To add additional paths to data on your S3 bucket, click Add Data Source and enter the path. To learn more about paths, see Path Syntax.

Corresponds to databases.[n].collections.[n].dataSources.[n].path JSON configuration setting.
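
To show how an S3 location of the form above breaks down, the following Python sketch splits it into a bucket name and a path using the standard library. It assumes that no prefix is configured on the store, so the path is everything after the bucket name. The function name split_s3_uri and the sample URI are hypothetical.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri: str):
    """Split s3://<bucket-name>/<path> into (bucket, path).

    Assumption: the store has no prefix, so the returned path is
    relative to the root of the bucket.
    """
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parts.netloc, parts.path


if __name__ == "__main__":
    bucket, path = split_s3_uri("s3://my-bucket/path/to/files/data.json")
    print(bucket)  # my-bucket
    print(path)    # /path/to/files/data.json
```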

6
  1. (Optional) Click the icon for the:

    • Data Lake to specify a name for your Data Lake. Defaults to Data Lake[n].
    • Database to edit the database name. Defaults to Database[n].

      Corresponds to databases.[n].name JSON configuration setting.

    • Collection to edit the collection name. Defaults to Collection[n].

      Corresponds to databases.[n].collections.[n].name JSON configuration setting.

    You can click:

    • Create Database to add databases and collections.
    • The icon associated with the database to add collections to the database.
    • The icon associated with the database or collection to remove the database or collection.
  2. Drag and drop the data store to map with the collection.

    Corresponds to databases.[n].collections.[n].dataSources JSON configuration setting.
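
The mapping built in this step corresponds to the databases section of the JSON configuration. The Python sketch below assembles that section from the settings named above (databases.[n].name, collections.[n].name, and dataSources.[n].path). The storeName field, which links a data source to a store, is an assumption here, as are the function name build_databases and all sample values.

```python
import json


def build_databases(db_name, collection_name, store_name, paths):
    """Build the databases section that maps one collection to S3 paths.

    Assumption: each data source refers to its store through a
    "storeName" field.
    """
    data_sources = [{"storeName": store_name, "path": p} for p in paths]
    return [
        {
            "name": db_name,
            "collections": [
                {"name": collection_name, "dataSources": data_sources}
            ],
        }
    ]


if __name__ == "__main__":
    # Hypothetical names and paths, using the default naming pattern.
    databases = build_databases(
        "Database0", "Collection0", "s3store",
        ["/path/to/files/a.json", "/path/to/files/b.json"],
    )
    print(json.dumps({"databases": databases}, indent=2))
```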

7

To add Atlas or HTTP data stores for federated queries, see:

8