I'm trying to understand DVC. Most tutorials mention generating dvc.yaml by running the dvc run command.
But at the same time, the dvc.yaml format, which defines the DAG, is also well documented. The fact that it is YAML, and therefore human readable and writable, suggests that it is meant to be a DSL for specifying your data pipeline.
Can somebody clarify which is the better practice?
Writing dvc.yaml by hand, or letting it be generated by the dvc run command?
Or is it left to user's choice and there is no technical difference?
I'd recommend manual editing as the main route! (I believe that's officially recommended since DVC 2.0)
dvc stage add can still be very helpful for generating pipeline files programmatically, but it doesn't support all the features of dvc.yaml, for example setting vars values or defining foreach stages.
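To illustrate the kind of thing that needs manual editing, here is a minimal sketch of a hand-written dvc.yaml that uses both vars and a foreach stage. The script name, data paths, region names, and the epochs value are all hypothetical, chosen only for illustration:

```yaml
# Hand-written dvc.yaml (all file names and values below are made up)
vars:
  - epochs: 10            # a pipeline-level variable, referenced as ${epochs}

stages:
  train:
    foreach:              # expands into one stage per item: train@us, train@uk
      - us
      - uk
    do:
      cmd: python train.py --region ${item} --epochs ${epochs}
      deps:
        - train.py
        - data/${item}.csv
      outs:
        - models/${item}.pkl
```

With a file like this, dvc repro should run one training stage per region, with ${item} substituted for each list entry.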
Both, really.
Primarily, dvc run (or the newer dvc stage add followed by dvc exp run) is meant to manage your dvc.yaml file. For most (including casual) users, this is probably easiest and thus best. The format is guaranteed to be correct (similar to choosing between {git,dvc} config and directly modifying .{git,dvc}/config).
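As a rough sketch of what the generated route produces, the file below is approximately what a single dvc stage add call would write. The stage name, script, and paths are hypothetical, and the exact ordering of entries in the real output may differ:

```yaml
# Approximately what this command would write to dvc.yaml
# (hypothetical stage name and file names):
#   dvc stage add -n train -d train.py -d data/train.csv -o model.pkl python train.py
stages:
  train:
    cmd: python train.py
    deps:
      - data/train.csv
      - train.py
    outs:
      - model.pkl
```

Either way you end up with the same kind of file, so a stage created by the command can later be edited by hand.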
However, as you note, dvc.yaml is human-readable. This is intentional, so that more advanced users can manually edit the YAML (potentially short-circuiting some validation checks, or unlocking advanced functionality such as foreach stages).