Dagster supports data backfills for individual partitions or subsets of partitions. After defining a partitioned asset or job, you can launch a backfill, which submits runs to fill in multiple partitions at the same time.
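For example, a minimal daily-partitioned asset might look like the following sketch (the asset name and start date are illustrative, and it assumes a recent Dagster version that exposes AssetExecutionContext):

```python
from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset


@asset(partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"))
def events(context: AssetExecutionContext) -> None:
    # By default, a backfill launches one run per selected partition, and
    # each run sees a single partition key, e.g. "2023-01-01".
    context.log.info(f"Materializing partition {context.partition_key}")
```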
You can open the backfill modal to launch backfills for a partitioned asset using the "Materialize" button, either on the Asset detail page or when viewing a graph of assets. Backfills can also be launched for a selection of differently partitioned assets, as long as the roots share the same partitioning.
To observe the progress of an asset backfill, visit the backfill details page. You can reach it by clicking the notification that appears after launching a backfill, or by clicking "Overview" in the top navigation pane, then the "Backfills" tab, and then the ID of one of the backfills.
Backfill policies and single-run backfills (Experimental)
By default, if you launch a backfill that covers N partitions, Dagster will launch N separate runs, one for each partition. This works well when your code is single-threaded, because it avoids overwhelming a single process with large amounts of data. However, if you're using a parallel-processing engine like Spark or Snowflake, you often don't need Dagster to help with parallelism, so splitting the backfill into multiple runs just adds extra overhead.
Dagster supports backfills that execute as a single run covering a range of partitions, which you opt into by assigning the asset a backfill policy. For example, this allows you to execute the backfill as a single Snowflake query. After it completes, Dagster will track that all of those partitions have been filled.
Write your code so that it operates on partition ranges instead of single partitions. This means that if your code uses the partition_key context property, you'll need to update it to use one of the partition_time_window, partition_key_range, or partition_keys properties instead. Which one to use depends on whether it's most convenient for you to operate on start/end datetime objects, start/end partition keys, or a list of partition keys, as in the sketch below.
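Here's a rough sketch of a daily-partitioned asset configured for single-run backfills. The asset name and start date are illustrative, and the actual processing (e.g. the Snowflake query) is elided:

```python
from dagster import (
    AssetExecutionContext,
    BackfillPolicy,
    DailyPartitionsDefinition,
    asset,
)


@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"),
    backfill_policy=BackfillPolicy.single_run(),
)
def events(context: AssetExecutionContext) -> None:
    # A single backfill run may cover many partitions, so read the full
    # range instead of a single partition key.
    window = context.partition_time_window  # start/end datetime objects
    key_range = context.partition_key_range  # start/end partition keys
    context.log.info(
        f"Processing {key_range.start} through {key_range.end} "
        f"({window.start} to {window.end})"
    )
```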
If you are using an I/O manager to handle saving and loading your data, you'll need to ensure the I/O manager also operates on partition ranges. If you're using any of the built-in database I/O managers, like Snowflake, BigQuery, or DuckDB, this is supported out of the box.
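For example, assuming the dagster-duckdb-pandas integration is installed, wiring up the range-aware DuckDB I/O manager might look like this (the database path is illustrative, and events is the asset from the sketch above):

```python
from dagster import Definitions
from dagster_duckdb_pandas import DuckDBPandasIOManager

# The built-in database I/O managers understand partition ranges, so a
# single-run backfill that spans many partitions reads and writes the
# whole range at once rather than one partition at a time.
defs = Definitions(
    assets=[events],
    resources={"io_manager": DuckDBPandasIOManager(database="analytics.duckdb")},
)
```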
You can launch and monitor backfills of a job using the Partitions tab.
To launch a backfill, click the "Launch backfill" button at the top center of the Partitions tab. This opens the "Launch backfill" modal, which lets you select the set of partitions to launch the backfill over. A run will be launched for each partition.
You can click the button in the bottom right to submit the runs. What happens next depends on your run coordinator: with the default run coordinator, the modal will close after all the runs have been launched; with the queued run coordinator, it will close after all the runs have been queued.
After all the runs have been submitted, you'll be returned to the Partitions tab, with a filter applied for the runs in the backfill. The page refreshes periodically, letting you see how the backfill is progressing; boxes turn green or red as steps in the backfill runs succeed or fail.