Learn how to ingest data using Console and the PachCTL CLI.
You can upload data to an input repo in Pachyderm by using Console or the PachCTL CLI. Console offers an easy-to-use interface for quick drop-in uploads, while the PachCTL CLI provides more advanced options for uploading data by opening commits and transactions.
Atomic commits automatically open a commit, add data, and close the commit. They are typically useful when you only need to submit one file or directory. Otherwise, use manual commits to avoid a noisy commit history and to avoid overwhelming your pipeline with many jobs.
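As a rough illustration of an atomic commit, the sketch below builds a single `pachctl put file` command. When no commit is open on the branch, that one call opens a commit, adds the file, and closes the commit. The repo name `images`, the branch `master`, and the file `liberty.png` are placeholders, and the helper `atomic_put_command` is hypothetical. The command is only printed here, not executed.

```python
import shlex


def atomic_put_command(repo, branch, dest_path, source):
    """Build the argv for one `pachctl put file` call (hypothetical helper).

    Issued while no commit is open on the branch, this single command
    behaves as an atomic commit.
    """
    return ["pachctl", "put", "file", f"{repo}@{branch}:{dest_path}", "-f", source]


# Placeholder repo, branch, and file names.
cmd = atomic_put_command("images", "master", "/liberty.png", "liberty.png")
print(shlex.join(cmd))
```

To run it against a real cluster, you could pass `cmd` to `subprocess.run(cmd, check=True)`.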
Manual commits allow you to open a commit, add data, and close the commit when you are ready. They are typically useful when you need to submit multiple files or directories that are related. They also help your pipeline process your data faster by using only one job to process all of the data.
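The manual flow can be sketched as a sequence of three kinds of commands: `pachctl start commit`, one `pachctl put file` per file, and `pachctl finish commit`. The example below only assembles and prints that sequence. The repo `images`, the branch `master`, the file names, and the helper `manual_commit_commands` are all illustrative assumptions.

```python
import shlex


def manual_commit_commands(repo, branch, files):
    """Return the pachctl commands for one manual commit (hypothetical helper).

    `files` is a list of (destination_path, local_source) pairs that should
    all land in the same commit.
    """
    target = f"{repo}@{branch}"
    commands = [["pachctl", "start", "commit", target]]
    for dest_path, source in files:
        commands.append(["pachctl", "put", "file", f"{target}:{dest_path}", "-f", source])
    commands.append(["pachctl", "finish", "commit", target])
    return commands


# Placeholder file names: both files end up in a single commit.
commands = manual_commit_commands(
    "images", "master", [("/a.png", "a.png"), ("/b.png", "b.png")]
)
for command in commands:
    print(shlex.join(command))
```

Because the pipeline sees one finished commit rather than one commit per file, the related files are handled together instead of triggering a job each.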
If you have a large dataset and only want to upload a subset of it, you can add a metadata file containing a list of URLs or paths to the relevant data. Your pipeline code then retrieves the data from those locations, without the need to preload the whole dataset.
In this case, Pachyderm will not keep versions of the source file, but it will keep track of the provenance of the resulting output commits.
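A minimal sketch of pipeline code that follows such a metadata file is shown below. It assumes a plain-text manifest with one URL or path per line, where blank lines and lines starting with `#` are ignored. That layout, the function names, and the example path `/pfs/manifests/manifest.txt` are assumptions made for illustration, not a format that Pachyderm prescribes.

```python
from pathlib import Path
from urllib.request import urlopen


def read_manifest(manifest_path):
    """Return the URLs/paths listed in the manifest, one entry per line."""
    entries = []
    for line in Path(manifest_path).read_text().splitlines():
        line = line.strip()
        # Assumed convention: skip blank lines and '#' comments.
        if line and not line.startswith("#"):
            entries.append(line)
    return entries


def fetch(entry):
    """Read the bytes behind one entry, from HTTP(S) or the local filesystem."""
    if entry.startswith(("http://", "https://")):
        with urlopen(entry) as response:
            return response.read()
    return Path(entry).read_bytes()


def iter_data(manifest_path):
    """Yield (entry, bytes) lazily, so only the listed data is ever loaded."""
    for entry in read_manifest(manifest_path):
        yield entry, fetch(entry)
```

Inside a pipeline, the manifest would be read from the input repo mounted under `/pfs`, for example `iter_data("/pfs/manifests/manifest.txt")`, where `manifests` is a placeholder repo name.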
Transactions allow you to bundle multiple manual commits together and process them simultaneously in one job run. This is particularly useful for pipelines that require multiple inputs to be processed together instead of kicking off a job each time only one of the inputs has been updated.
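One way to picture a transaction is as a wrapper around the `start commit` calls for several repos. The sketch below assumes the commits are opened inside the transaction and that the data is added with `pachctl put file` afterwards, before each commit is finished. The repo names `data` and `parameters`, the file names, and the helper `transaction_commands` are placeholders, and the exact ordering should be checked against the PachCTL reference for your version.

```python
import shlex


def transaction_commands(branch, uploads):
    """Return pachctl commands that open several commits in one transaction.

    `uploads` maps a repo name to a list of (destination_path, local_source)
    pairs. Hypothetical helper; it builds commands without running them.
    """
    commands = [["pachctl", "start", "transaction"]]
    # Open one commit per input repo inside the transaction.
    for repo in uploads:
        commands.append(["pachctl", "start", "commit", f"{repo}@{branch}"])
    commands.append(["pachctl", "finish", "transaction"])
    # Assumption: files are added to the open commits after the transaction.
    for repo, files in uploads.items():
        for dest_path, source in files:
            commands.append(
                ["pachctl", "put", "file", f"{repo}@{branch}:{dest_path}", "-f", source]
            )
    for repo in uploads:
        commands.append(["pachctl", "finish", "commit", f"{repo}@{branch}"])
    return commands


# Placeholder repos and files.
commands = transaction_commands(
    "master",
    {"data": [("/d.csv", "d.csv")], "parameters": [("/p.json", "p.json")]},
)
for command in commands:
    print(shlex.join(command))
```

Opening both commits in the same transaction is what lets a pipeline that reads both repos treat the update as one change rather than two.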