Partitions are a way to split data into smaller, easier-to-use chunks. They let you compute on individual slices of data while remaining flexible about how that data is read from or written to storage.
Consider an online store's order data. In a database, the order data might be stored as a single orders table, which contains multiple days' worth of orders. However, if the data were ingested into Amazon S3 as Parquet files, you could write a separate Parquet file for each day, with each day's file acting as one partition.
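To make the per-day split concrete, here is a minimal sketch in plain Python (standard library only, not the Dagster API). The sample rows, the `partition_by_day` and `partition_path` helpers, and the `orders/date=.../orders.parquet` key layout are all hypothetical and chosen only for illustration. No Parquet file is actually written.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows standing in for a single orders table.
ORDERS = [
    {"order_id": 1, "placed_at": "2024-01-01T09:15:00", "total": 20.0},
    {"order_id": 2, "placed_at": "2024-01-01T17:40:00", "total": 35.5},
    {"order_id": 3, "placed_at": "2024-01-02T08:05:00", "total": 12.25},
]


def partition_by_day(orders):
    """Group orders into daily partitions keyed by an ISO date string."""
    partitions = defaultdict(list)
    for order in orders:
        day = datetime.fromisoformat(order["placed_at"]).date().isoformat()
        partitions[day].append(order)
    return dict(partitions)


def partition_path(day):
    """Build a hypothetical per-day object key for the partition's file."""
    return f"orders/date={day}/orders.parquet"


daily_partitions = partition_by_day(ORDERS)
for day, rows in sorted(daily_partitions.items()):
    # One file per day: each day can be read or rewritten on its own.
    print(partition_path(day), len(rows))
```

Because each day maps to its own key, a single day can be reprocessed or moved to different storage without touching the rest of the data.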
Using partitions provides the following benefits:
- Cost efficiency: Process only the data that’s needed and gain granular control over slices. For example, you can store recent orders in hot storage and older orders in cheaper, cold storage.
- Speed up compute: Divide large datasets into smaller, more manageable parts to speed up queries.
- Scalability: As data grows, distribute it across multiple servers or storage systems or run multiple partitions at a time in parallel.
- Concurrent processing: Boost computational speed with parallel processing, significantly reducing the time and cost of data processing tasks.
- Speed up debugging: Test on an individual partition before trying to run larger ranges of data.
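The parallelism and debugging benefits listed above can be sketched in plain Python (standard library only, not the Dagster API). The `PARTITIONS` data and the `daily_revenue` and `run_partitions` helpers are hypothetical. A thread pool is used here only to keep the example small. A real deployment would more likely run each partition as a separate process or run.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical daily partitions: partition key -> order totals for that day.
PARTITIONS = {
    "2024-01-01": [20.0, 35.5],
    "2024-01-02": [12.25],
    "2024-01-03": [8.0, 4.0, 10.0],
}


def daily_revenue(partition_key):
    """Compute on one slice only; other partitions are never read."""
    return partition_key, sum(PARTITIONS[partition_key])


def run_partitions(partition_keys, max_workers=3):
    """Process the given partitions concurrently, one task per partition."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(daily_revenue, partition_keys))


# Debugging: test on a single partition first.
single = run_partitions(["2024-01-02"])

# Then process the full range, with partitions running in parallel.
everything = run_partitions(sorted(PARTITIONS))
```

The same function serves both cases. Only the list of partition keys changes, which is what makes it cheap to test on one slice before running a larger range.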
Partitions are supported for both Software-defined Assets and ops, but each concept uses them differently. Refer to the documentation for each concept for more info.
With partitions, you can:
- View runs by partition in the Dagster UI
- Define a schedule that fills in a partition each time it runs. For example, a job might run each day and process the data that arrived during the previous day.
- Launch backfills, which are sets of runs that each process a different partition. For example, after making a code change, you might want to run your job on all partitions instead of just one.
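The relationship between a schedule and a backfill can be sketched in plain Python (standard library only). This does not use Dagster's own schedule or backfill APIs. Every function name here is hypothetical, and `process_orders` is a placeholder for real work.

```python
from datetime import date, timedelta


def partition_for_scheduled_run(run_date):
    """A daily run fills in the partition for the previous day."""
    return (run_date - timedelta(days=1)).isoformat()


def partition_keys_between(start, end):
    """List daily partition keys from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def backfill(start, end, process):
    """Launch one run per partition in the range and collect the results."""
    return {key: process(key) for key in partition_keys_between(start, end)}


def process_orders(partition_key):
    # Hypothetical placeholder for the work done on one partition.
    return f"processed {partition_key}"


# A scheduled run on January 4 processes the data that arrived on January 3.
scheduled_key = partition_for_scheduled_run(date(2024, 1, 4))

# After a code change, re-run every partition in a range instead of just one.
results = backfill(date(2024, 1, 1), date(2024, 1, 3), process_orders)
```

In this sketch a schedule and a backfill call the same per-partition function. The schedule picks one key per tick, while the backfill enumerates many keys at once.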