DoFn.Setup: "Annotation for the method to use to prepare an instance for processing bundles of elements."
Uses the word "bundle", takes zero arguments.
DoFn.StartBundle: "Annotation for the method to use to prepare an instance for processing a batch of elements."
Uses the word "batch", takes zero or one arguments (StartBundleContext, a way to access PipelineOptions).
I need to initialize a library within the DoFn instance, then use that library for every element in the "batch" or "bundle". I wouldn't normally split hairs over these two words, but in a pipeline, is there a difference?
The lifecycle of a DoFn is as follows:
- Setup
- For each bundle (zero or more):
  - StartBundle
  - ProcessElement, once per element (zero or more)
  - FinishBundle
- Teardown

I.e. one instance of a DoFn can process many (zero or more) bundles, and within one bundle, it processes many (zero or more) elements.
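The call order above can be modelled in plain Java. This is only a sketch: `TracingFn` and `LifecycleDemo` are hypothetical names, the loop stands in for a runner, and none of it is Beam API. In a real DoFn the corresponding methods would carry the DoFn annotations instead of being called by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java stand-in for a DoFn (not Beam API); records each lifecycle call. */
class TracingFn {
    final List<String> calls = new ArrayList<>();

    void setup()                   { calls.add("Setup"); }
    void startBundle()             { calls.add("StartBundle"); }
    void processElement(String e)  { calls.add("ProcessElement(" + e + ")"); }
    void finishBundle()            { calls.add("FinishBundle"); }
    void teardown()                { calls.add("Teardown"); }
}

/** Hypothetical mini-runner that drives one instance through its lifecycle. */
class LifecycleDemo {
    static List<String> run(List<List<String>> bundles) {
        TracingFn fn = new TracingFn();
        fn.setup();                            // once per instance
        for (List<String> bundle : bundles) {
            fn.startBundle();                  // once per bundle
            for (String element : bundle) {
                fn.processElement(element);    // once per element
            }
            fn.finishBundle();                 // once per bundle
        }
        fn.teardown();                         // once per instance
        return fn.calls;
    }

    public static void main(String[] args) {
        // Two bundles; the second one is empty.
        List<List<String>> bundles =
            Arrays.asList(Arrays.asList("a", "b"), Collections.<String>emptyList());
        System.out.println(run(bundles));
        // prints [Setup, StartBundle, ProcessElement(a), ProcessElement(b),
        //         FinishBundle, StartBundle, FinishBundle, Teardown] on one line
    }
}
```

Note that the empty bundle still gets its StartBundle/FinishBundle pair, while Setup and Teardown each run exactly once for the instance.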
Both Setup/Teardown and StartBundle/FinishBundle are optional: any DoFn can be implemented without them, doing all the work in ProcessElement, but it will be inefficient. Both pairs of methods allow optimizations:
- StartBundle/FinishBundle tell you the allowed boundaries of batching: you are not allowed to batch across FinishBundle. FinishBundle must force a flush of your batch, and StartBundle must initialize or reset it. This is the only common use of these methods that I'm aware of. For a more general and rigorous explanation: a bundle is a unit of fault tolerance, and the runner assumes that by the time FinishBundle returns, you have completely performed all the work (outputting elements or performing side effects) associated with all elements seen in this bundle; work must not "leak" between bundles.
- Long-lived resources could also be managed in StartBundle/FinishBundle, but, unlike pending side effects or output, it is fine for such resources to persist between bundles. That's what Setup and Teardown are for.
- Costly initialization of a DoFn, e.g. parsing a config file, is also best done in Setup.

More concisely:

- Manage resources and costly initialization in Setup/Teardown.
- Manage batching in StartBundle/FinishBundle.

(Managing resources in bundle methods is inefficient; managing batching in Setup/Teardown is plain incorrect and will lead to data loss.)
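As a rough illustration of that split, here is a plain-Java model, not actual Beam code. `FakeClient`, `BatchingFn`, `MAX_BATCH`, and the hand-written driver loop are all hypothetical; in a real DoFn the lifecycle methods would be marked with the DoFn annotations and called by the runner.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a costly client library, e.g. one holding a connection. */
class FakeClient {
    final List<List<String>> sent = new ArrayList<>();
    boolean closed = false;

    void sendBatch(List<String> batch) { sent.add(new ArrayList<>(batch)); }
    void close() { closed = true; }
}

/** Plain-Java model of a batching DoFn (not Beam API). */
class BatchingFn {
    static final int MAX_BATCH = 2;   // assumed batch size, for illustration only
    FakeClient client;                // long-lived: survives across bundles
    List<String> batch;               // per-bundle: must never outlive a bundle

    void setup()       { client = new FakeClient(); }   // costly init, once per instance
    void startBundle() { batch = new ArrayList<>(); }   // initialize / reset the batch

    void processElement(String element) {
        batch.add(element);
        if (batch.size() >= MAX_BATCH) {
            flush();
        }
    }

    void finishBundle() { flush(); }        // forced flush: nothing leaks past the bundle
    void teardown()     { client.close(); } // release the resource, once per instance

    private void flush() {
        if (!batch.isEmpty()) {
            client.sendBatch(batch);
            batch.clear();
        }
    }
}

class BatchingDemo {
    public static void main(String[] args) {
        BatchingFn fn = new BatchingFn();
        fn.setup();
        for (String[] bundle : new String[][] {{"a", "b", "c"}, {"d"}}) {
            fn.startBundle();
            for (String element : bundle) {
                fn.processElement(element);
            }
            fn.finishBundle();
        }
        fn.teardown();
        System.out.println(fn.client.sent);   // prints [[a, b], [c], [d]]
    }
}
```

The partial batches `[c]` and `[d]` are sent only because finishBundle flushes. If the flush were moved to teardown, those elements would still be pending after their bundle was considered done, which is the data-loss case described above.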
The DoFn documentation was recently updated to make this more clear.