Optimize Data Lake Query Performance

The performance of your Atlas Data Lake is affected by the following factors:

  • The structure of your data in S3 and how you represent it in your Atlas Data Lake configuration.
  • The size of your data files.
  • The format and structure of your data files.

For easier management, make sure that your data is logically grouped into partitions. You can leverage partitions to improve Data Lake performance by mapping them to partition attributes in your configuration.

You can improve Data Lake performance by ensuring that your partition structure matches your query patterns and is defined in your configuration. When you map partition attributes (the parts of your S3 prefix that look like folders) to query attributes, Data Lake can selectively open only the files that contain data related to your query. This reduces both query time and cost, since Data Lake reads and downloads fewer files from AWS.


Consider an S3 bucket metrics with the following structure:


You can set a partition attribute for "metric type" by defining /metrics/{metric_type string}/* in your configuration. If you issue a query that contains {metric_type: software}, Data Lake only processes the files with the prefix /software and ignores files with the prefix /hardware.

You can then set a partition attribute for "software type" by defining /metrics/{metric_type string}/{software_type string} in your configuration. If you issue a query that contains {metric_type: software, software_type: computer}, Data Lake ignores files with the prefix /phone.

For more information on mapping partition attributes to a collection path, see Path Syntax.
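The pruning effect described above can be illustrated with a minimal Python sketch. It is not how Data Lake is implemented. It only mimics the idea of matching object keys against a path template and skipping files whose partition attributes do not satisfy the query. The sample keys, the helper names template_to_regex and prune, and the trailing /* on the second template are assumptions made for illustration.

```python
import re

# Hypothetical object keys; the real bucket layout is not shown here.
SAMPLE_KEYS = [
    "/metrics/hardware/2019.json",
    "/metrics/software/computer/a.json",
    "/metrics/software/phone/b.json",
]

def template_to_regex(template):
    """Turn a template such as '/metrics/{metric_type string}/*' into a regex."""
    pattern = ""
    pos = 0
    # Match either '{name type}' placeholders or a '*' wildcard.
    for match in re.finditer(r"\{(\w+) \w+\}|\*", template):
        pattern += re.escape(template[pos:match.start()])
        if match.group(1):
            # A partition attribute captures exactly one path segment.
            pattern += "(?P<%s>[^/]+)" % match.group(1)
        else:
            # The wildcard matches the rest of the key.
            pattern += ".*"
        pos = match.end()
    pattern += re.escape(template[pos:])
    return re.compile("^" + pattern + "$")

def prune(keys, template, query):
    """Keep only keys whose partition attributes agree with the query."""
    regex = template_to_regex(template)
    selected = []
    for key in keys:
        m = regex.match(key)
        if m is None:
            continue
        attrs = m.groupdict()
        # Query fields that are not partition attributes do not restrict pruning.
        if all(attrs.get(name, value) == value for name, value in query.items()):
            selected.append(key)
    return selected

one_level = prune(SAMPLE_KEYS, "/metrics/{metric_type string}/*",
                  {"metric_type": "software"})
two_level = prune(SAMPLE_KEYS,
                  "/metrics/{metric_type string}/{software_type string}/*",
                  {"metric_type": "software", "software_type": "computer"})
print(one_level)
print(two_level)
```

In this sketch, the one-level template leaves both files under /software as candidates, while the two-level template narrows the candidates to the single file under /software/computer.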

Each file that Data Lake handles requires a certain amount of compute resources. If your data store contains many small data files, this per-file overhead compounds and can reduce performance. Conversely, very large data files are problematic because Data Lake then downloads and processes data that the query does not need.

For most use cases, a performant file size is 100 to 200 MB.
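As a rough way to audit a data store against this guideline, the following Python sketch classifies files by size and greedily groups undersized files into batches that could be merged. The file names and sizes are hypothetical, the helper names are illustrative, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
MB = 1024 * 1024          # assumption: binary megabytes
TARGET_MIN = 100 * MB     # lower bound of the suggested range
TARGET_MAX = 200 * MB     # upper bound of the suggested range

def classify(size_bytes):
    """Label a file as 'small', 'ok', or 'large' relative to the target range."""
    if size_bytes < TARGET_MIN:
        return "small"
    if size_bytes > TARGET_MAX:
        return "large"
    return "ok"

def plan_batches(files, limit=TARGET_MAX):
    """Greedily pack small files, in the order given, into batches no larger than limit."""
    batches, current, total = [], [], 0
    for name, size in files:
        if classify(size) != "small":
            continue  # only undersized files are merge candidates
        if current and total + size > limit:
            batches.append(current)
            current, total = [], 0
        current.append(name)
        total += size
    if current:
        batches.append(current)
    return batches

# Hypothetical inventory of (name, size in bytes) pairs.
inventory = [
    ("a.json", 60 * MB),
    ("b.json", 70 * MB),
    ("c.json", 150 * MB),
    ("d.json", 90 * MB),
    ("e.json", 300 * MB),
]
labels = {name: classify(size) for name, size in inventory}
batches = plan_batches(inventory)
print(labels)
print(batches)
```

Files labeled large would need to be split rather than merged. How to split or merge depends on the file format, so that step is left out of the sketch.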

Atlas Data Lake supports several data file formats. You can improve performance by compressing certain file formats or by optimizing file contents for your queries.

When you compress data files, they take less time to download. The time saved on download typically outweighs the additional cost of decompressing the data before parsing it.
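To get a feel for the size reduction, the following Python sketch compresses a small in-memory payload with the standard gzip module and compares the sizes. The records and the choice of newline-delimited JSON are hypothetical. Confirm that your file format supports gzip compression before applying this to real data.

```python
import gzip
import json

# Hypothetical sample records, serialized as newline-delimited JSON.
records = [
    {"metric_type": "software", "software_type": "computer", "value": i}
    for i in range(1000)
]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# Compress in memory; gzip.open(path, "wb") would write a .gz file instead.
compressed = gzip.compress(raw)

print("uncompressed bytes:", len(raw))
print("compressed bytes:  ", len(compressed))

# Decompressing returns the original bytes unchanged.
restored = gzip.decompress(compressed)
```

Repetitive data such as these records compresses well. The actual ratio for your files depends on their contents.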

You can compress the following file formats using gzip:

Parquet, Avro, and ORC files contain metadata about the file itself so that an application can traverse the file contents in different ways. If you structure your data file to align with the queries you want to run, Atlas Data Lake can leverage this metadata to quickly jump to the right data.

Of these formats, Parquet provides the best performance and space efficiency for Atlas Data Lake, because Data Lake is optimized to parse Parquet row and column groups.
