How to version datasets?

Problem

When working on a project, I often change the datasets on which I train my models. How can I keep track of those changes in Neptune?

Solution

In many cases you can calculate a hash of your dataset. Even if you are working with a large image dataset, you usually have a smaller metadata file that points to the image paths, and you can hash that file instead.

If this is the case, follow these steps:

Step 1

Create a hashing function. For example:

import hashlib

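def md5(fname):
    """Return the MD5 hex digest of the file at `fname`."""
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        # Read in 4 KB chunks so large files never have to fit in memory.
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()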
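
If your training data is spread over several files rather than one, the same idea can be extended to a whole directory. The sketch below is only an illustration under that assumption: the name md5_dir is hypothetical and not part of Neptune. It walks a directory in sorted order and feeds every relative path and every file's content into a single MD5 digest.

```python
import hashlib
from pathlib import Path


def md5_dir(root):
    """Return one MD5 hex digest covering every file under `root`.

    Hypothetical helper for illustration; not part of Neptune.
    """
    root = Path(root)
    hash_md5 = hashlib.md5()
    # Sort so the digest does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Mix in the relative path so renaming or moving a file changes the digest.
        hash_md5.update(path.relative_to(root).as_posix().encode("utf-8"))
        hash_md5.update(b"\0")
        with open(path, "rb") as f:
            # Read in 4 KB chunks, as in the single-file function above.
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
    return hash_md5.hexdigest()
```

Because every file is read in full, this is practical for small and medium datasets; for very large ones, hashing a metadata file is likely the better fit.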

Step 2

Instantiate the Neptune Context object:

import neptune

ctx = neptune.Context()

Step 3

Calculate the hash of your training data and send it to Neptune:

...
TRAIN_FILEPATH = 'PATH/TO/TRAIN/DATA'
train_hash = md5(TRAIN_FILEPATH)

ctx.channel_send('train_data_version', train_hash)
...

Step 4

Add the train_data_version column to your project dashboard so that you can see which data version each experiment used.

Note

If your dataset is too large to hash quickly, consider reorganizing your data so that a lightweight metadata file (for example, a list of file paths) describes it, and hash that file instead.
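
One possible way to produce such a metadata file is sketched below. The function name write_manifest and the tab-separated "path, size" layout are assumptions made for this example, not a Neptune convention. The manifest lists every file under the data directory together with its size in bytes, so it stays small even when the files themselves are large.

```python
import os


def write_manifest(data_dir, manifest_path):
    """Write one 'relative/path<TAB>size_in_bytes' line per file, sorted by path.

    Hypothetical helper for illustration. Keep `manifest_path` outside
    `data_dir`, otherwise the manifest would end up listing itself.
    """
    rows = []
    for dirpath, _dirnames, filenames in os.walk(data_dir):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            # Use forward slashes so the manifest is identical across platforms.
            rel_path = os.path.relpath(full_path, data_dir).replace(os.sep, "/")
            rows.append((rel_path, os.path.getsize(full_path)))
    # Sort so the manifest does not depend on directory traversal order.
    rows.sort()
    with open(manifest_path, "w", encoding="utf-8", newline="\n") as f:
        for rel_path, size in rows:
            f.write(f"{rel_path}\t{size}\n")
```

The resulting manifest can be passed to the md5 function from Step 1 in place of the raw data. Note the trade-off: a manifest of paths and sizes detects added, removed, renamed, and resized files, but not an edit that leaves a file's size unchanged.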

See also