How to version datasets?
When working on a project, it is not unusual that I change the datasets on which I train my models. How can I keep track of that in Neptune?
In many cases you can calculate a hash of your dataset. Even if you work with large image datasets, you usually have a smaller metadata file that points to the image paths.
If this is the case, you should:
Create a hashing function. For example:
    import hashlib

    def md5(fname):
        hash_md5 = hashlib.md5()
        with open(fname, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
        return hash_md5.hexdigest()
Create a Neptune context:
    import neptune

    ctx = neptune.Context()
Calculate the hash of your training data and send it to Neptune:
    ...
    TRAIN_FILEPATH = 'PATH/TO/TRAIN/DATA'
    train_hash = md5(TRAIN_FILEPATH)

    ctx.channel_send('train_data_version', train_hash)
    ...
Add a data version column to your project dashboard so that it shows the 'train_data_version' channel.
If your dataset is too large to hash quickly, consider reorganizing it so that a lightweight metadata file describes the data, and hash that file instead.
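One way to build such a metadata file is sketched below. It is only an illustration and not part of Neptune: the write_manifest helper, the tab-separated layout, and the example paths in the comments are all assumptions. The manifest records each file's relative path and size, so hashing it is fast, but it will not notice a file whose content changes while its size stays the same.

```python
import hashlib
import os


def write_manifest(data_dir, manifest_path):
    """Hypothetical helper: list every file under data_dir as 'path<TAB>size'.

    Keep manifest_path outside data_dir, otherwise the manifest would
    list itself the next time it is rebuilt.
    """
    entries = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            full_path = os.path.join(root, name)
            # Store paths relative to data_dir with forward slashes, so the
            # manifest does not depend on where the dataset is mounted.
            rel_path = os.path.relpath(full_path, data_dir).replace(os.sep, "/")
            entries.append((rel_path, os.path.getsize(full_path)))
    entries.sort()  # fixed order: the same files always give the same manifest
    with open(manifest_path, "w", encoding="utf-8", newline="\n") as f:
        for rel_path, size in entries:
            f.write("{}\t{}\n".format(rel_path, size))
    return manifest_path


def md5(fname):
    """Same chunked MD5 as in the hashing step above."""
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


# Example usage (paths are placeholders):
# train_hash = md5(write_manifest('PATH/TO/TRAIN/DATA', 'train_manifest.tsv'))
```

The resulting hash can be sent to Neptune exactly as before, with ctx.channel_send('train_data_version', train_hash). If size alone is too weak a signal for your data, the manifest could also carry a per-file checksum that is computed once when a file is added.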