Optimize Data Lake Query Performance

The performance of your Atlas Data Lake is affected by the following factors:

  • The structure of your data in S3 and how you represent it in your Atlas Data Lake configuration.
  • The size of your data files.
  • The format and structure of your data files.

For easier management, ensure that your data is logically grouped into partitions. Atlas Data Lake uses the partitions that you create from the field values that you specify in your partition syntax. You can improve your Data Lake's performance by making sure that your partition structure maps to your query patterns and that the partition structure is defined in your databases.[n].collections.[n].dataSources.[n].path. For the partitions, choose fields that you query frequently, and order them from the most frequently queried field in the first position to the least frequently queried field in the last position.

The order of fields listed in the databases.[n].collections.[n].dataSources.[n].path is important in the same way as it is in Compound Indexes. The specified path corresponds to data that is partitioned first by the value of the first field, and then by the value of the next field, and so on.

Example

Consider a collection with the software, computer, and OS fields whose data in the S3 bucket named metrics is partitioned first by the software field, then by the computer field, and then by the OS field.

metrics
|--software
   |--computer
      |--OS

Atlas Data Lake uses the partitions for queries on these fields:

  • the software field,
  • the software field and the computer field,
  • the software field and the computer field and the OS field.

Atlas Data Lake can use the partitions to support a query on the software and OS fields. However, in this case, Atlas Data Lake is not as efficient as it would be if the query were on the software and computer fields only. Partitions are parsed in order; if a query omits a particular partition, Atlas Data Lake is less efficient in making use of any partitions that follow the omitted partition. Because a query on software and OS omits computer, Atlas Data Lake uses the software partition more efficiently than the OS partition to support this query.

Atlas Data Lake can't use the partitions to support queries on fields not specified in the databases.[n].collections.[n].dataSources.[n].path. Also, Atlas Data Lake can't use the partitions to support queries that include the following fields without the software field:

  • the computer field,
  • the OS field, or
  • the computer and OS fields.
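
The ordering rule above can be sketched in a few lines of Python. This is only an illustrative model of the "parsed in order" behavior, not Atlas Data Lake's actual query planner; the helper name leading_partition_fields is hypothetical.

```python
from typing import List, Set

# Partition fields in the order they appear in the path (from the example above).
PARTITION_ORDER = ["software", "computer", "OS"]


def leading_partition_fields(query_fields: Set[str],
                             order: List[str] = PARTITION_ORDER) -> List[str]:
    """Return the leading run of partition fields that the query filters on.

    Hypothetical helper: it models only the prefix rule described in the text.
    Partitions after the first omitted field may still be used, but less
    efficiently, so they are not counted here.
    """
    used = []
    for field in order:
        if field not in query_fields:
            break  # the first omitted partition ends the fully efficient prefix
        used.append(field)
    return used


if __name__ == "__main__":
    for fields in ({"software"}, {"software", "OS"}, {"computer", "OS"}):
        print(sorted(fields), "->", leading_partition_fields(fields))
```

In this model, a query on software and OS gets only software as its efficient prefix, and a query on computer and OS gets an empty prefix, which matches the cases listed above.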

You can use partitions to improve Data Lake performance by mapping them to partition attributes in your configuration. When you map your partition attributes (the parts of your S3 prefix that look like folders) to query attributes, Atlas Data Lake can selectively open only the files that contain data related to your query. This reduces the amount of time a query takes and decreases cost, because Data Lake reads and downloads fewer files from AWS.

Example

Consider an S3 bucket metrics with the following structure:

metrics
|--hardware
|--software
   |--computer
   |--phone

You can set a partition attribute for "metric type" by defining /metrics/{metric_type string}/* in your configuration. If you issue a query that contains {metric_type: software}, Data Lake processes only the files with the prefix /software and ignores files with the prefix /hardware.

You can then set a partition attribute for "software type" by defining /metrics/{metric_type string}/{software_type string} in your configuration. If you issue a query that contains {metric_type: software, software_type: computer}, Data Lake ignores files with the prefix /phone.

For more information on mapping partition attributes to a collection's databases.[n].collections.[n].dataSources.[n].path, see Path Syntax.
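
To make the pruning concrete, the following Python sketch matches object keys against a path template and keeps only the keys whose partition attributes agree with the query. It is a simplified model, not Data Lake's implementation: it handles only {name string} attributes and a trailing *, the trailing /* on the template and the file names are assumptions made for illustration, and the functions template_to_regex and select_keys are hypothetical.

```python
import re
from typing import Dict, List


def template_to_regex(template: str) -> "re.Pattern[str]":
    """Convert a path template into a regex with one named group per attribute.

    Hypothetical helper: supports only '{name string}' attributes and '*'.
    """
    parts = re.split(r"(\{\w+ string\}|\*)", template)
    pieces = []
    for part in parts:
        attr = re.fullmatch(r"\{(\w+) string\}", part)
        if attr:
            pieces.append(f"(?P<{attr.group(1)}>[^/]+)")  # one path segment
        elif part == "*":
            pieces.append(".*")  # anything below this prefix
        else:
            pieces.append(re.escape(part))  # literal text such as '/metrics/'
    return re.compile("".join(pieces))


def select_keys(keys: List[str], template: str, query: Dict[str, str]) -> List[str]:
    """Return the keys that a query would still need to read.

    Query fields that are not partition attributes cannot prune anything,
    so they are ignored here and the file is kept.
    """
    pattern = template_to_regex(template)
    selected = []
    for key in keys:
        match = pattern.fullmatch(key)
        if match is None:
            continue  # key is outside the configured path
        attrs = match.groupdict()
        if all(attrs[name] == value for name, value in query.items() if name in attrs):
            selected.append(key)
    return selected


# Assumed template (trailing '/*' added so file names below the folder match).
TEMPLATE = "/metrics/{metric_type string}/{software_type string}/*"

# Hypothetical object keys laid out like the example bucket.
KEYS = [
    "/metrics/hardware/rack/h1.json",
    "/metrics/software/computer/c1.json",
    "/metrics/software/computer/c2.json",
    "/metrics/software/phone/p1.json",
]

if __name__ == "__main__":
    print(select_keys(KEYS, TEMPLATE, {"metric_type": "software"}))
    print(select_keys(KEYS, TEMPLATE,
                      {"metric_type": "software", "software_type": "computer"}))
```

With these sample keys, the first query keeps the three keys under /software, and the second keeps only the two keys under /software/computer.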

Each file that Data Lake handles requires a certain amount of compute resources. If your data store contains many small data files, this per-file overhead adds up and can reduce performance. Conversely, very large data files are problematic because Data Lake then downloads and processes data that the query doesn't need.

For most use cases, a performant file size is 100 to 200 MB.
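
As a quick way to check a data store against this guideline, the sketch below classifies file sizes relative to the 100 to 200 MB range. The helper name classify_file_size and the sample sizes are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the guideline does not specify the unit precisely.

```python
from collections import Counter
from typing import Dict

MB = 1024 * 1024          # assumption: binary megabytes
TARGET_MIN = 100 * MB     # lower bound of the suggested range
TARGET_MAX = 200 * MB     # upper bound of the suggested range


def classify_file_size(size_bytes: int) -> str:
    """Label a file size relative to the suggested 100 to 200 MB range."""
    if size_bytes < TARGET_MIN:
        return "below range"
    if size_bytes > TARGET_MAX:
        return "above range"
    return "in range"


def summarize(sizes: Dict[str, int]) -> Counter:
    """Count how many files fall below, inside, or above the range."""
    return Counter(classify_file_size(size) for size in sizes.values())


if __name__ == "__main__":
    # Hypothetical file names and sizes in bytes.
    sample = {"a.json": 5 * MB, "b.json": 150 * MB, "c.json": 900 * MB}
    for name, size in sample.items():
        print(name, classify_file_size(size))
    print(dict(summarize(sample)))
```

Files reported as "below range" are candidates for merging, and files reported as "above range" are candidates for splitting.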

Atlas Data Lake supports several data file formats. You can improve performance by compressing certain file formats or by optimizing file contents for your queries.

When you compress data files, they take less time to download. The time saved on the download outweighs the extra work of decompressing the data before parsing it.
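
The size reduction is easy to observe with Python's standard gzip module. The sketch below is only an illustration: the records are made up, and newline-delimited JSON is used purely as sample text, so it says nothing about which formats Data Lake accepts in compressed form.

```python
import gzip
import json

# Hypothetical sample: 10,000 newline-delimited JSON records with repetitive keys.
records = [
    {"metric_type": "software", "software_type": "computer", "value": i}
    for i in range(10_000)
]
raw = "\n".join(json.dumps(record) for record in records).encode("utf-8")

# Compress in memory; gzip.compress returns bytes in the standard gzip format.
compressed = gzip.compress(raw)

if __name__ == "__main__":
    print(f"raw bytes:        {len(raw)}")
    print(f"compressed bytes: {len(compressed)}")
    print(f"ratio:            {len(compressed) / len(raw):.2f}")
```

Repetitive text like this compresses well, so fewer bytes have to be transferred; the actual ratio depends on the data.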

You can compress the following file formats using gzip:

Parquet, Avro, and ORC files contain metadata about the file itself so that an application can traverse the file contents in different ways. If you structure your data file to align with the queries you want to run, Atlas Data Lake can leverage this metadata to quickly jump to the right data.

Of these formats, Parquet files provide the best performance and space efficiency for Atlas Data Lake, because Atlas Data Lake is optimized to parse the row and column groups in Parquet files.
