TensorFlow 2.0 - Dataset

This post is part of a series exploring some of the new features in TensorFlow 2.0, which I am currently using in my own projects. These posts are introductory guides and do not cover more advanced uses.

TensorFlow 2.0 introduces the concept of a Dataset. This high-level API allows you to load data in different formats such as images, NumPy arrays and pandas DataFrames.

Previously, in Keras, when we wanted to load a training dataset that was too big to fit into memory, we had to write a custom generator that iterated over the dataset in batches, which were fed into the model during training via methods such as fit_generator.

The issue with this approach is that it can be error-prone to set up. For instance, changes to the dataset structure mean changes to the generator, and the generator implementation itself can harbour bugs.
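As a rough sketch, the pre-Dataset workflow might look something like this (load_image, train_files and train_labels are hypothetical names used purely for illustration):

import numpy as np

# A hand-rolled generator that yields batches forever, to be consumed by
# fit_generator (since deprecated in favour of fit accepting generators).
def batch_generator(file_paths, labels, batch_size=32):
  while True:
    for start in range(0, len(file_paths), batch_size):
      batch_files = file_paths[start:start + batch_size]
      batch_labels = labels[start:start + batch_size]
      # load_image is a stand-in for whatever decoding/resizing is needed
      images = np.stack([load_image(f) for f in batch_files])
      yield images, np.asarray(batch_labels)

# model.fit_generator(batch_generator(train_files, train_labels),
#                     steps_per_epoch=len(train_files) // 32, epochs=3)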

A Dataset is a high-level construct in TF 2.0 which represents a collection of data or documents. It supports batching, caching and prefetching of data in the background. The dataset is not loaded into memory all at once but is streamed into the model as it is iterated over.

Using a Dataset generally follows these steps:

  • Create a dataset from input data

  • Apply transformations to preprocess the data

  • Iterate over the dataset and process its elements, e.g. in a training loop

Let’s go through each of the above stages in the pipeline.
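As a compact preview of the whole pipeline (the doubling step is just a placeholder transformation):

import tensorflow as tf

# 1. Create a dataset from input data
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
# 2. Apply a transformation to preprocess each element
dataset = dataset.map(lambda x: x * 2)
# 3. Iterate over the dataset, e.g. in a training loop
for element in dataset:
  print(element.numpy())  # 2, 4, 6, 8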

Creating a dataset

The easiest way to create a dataset is to use the from_tensor_slices method:

dataset = tf.data.Dataset.from_tensor_slices([1,2,3])
for ele in dataset:
  print(ele) # each element is a tf.Tensor

If we try to print each element of a dataset, we get a Tensor object back. To inspect the contents as plain values, we can call the as_numpy_iterator method, which returns an iterator that converts each element into a NumPy value:

for num in dataset.as_numpy_iterator():
  print(num)

To create a dataset from the files in a directory, we can use the list_files method, which accepts a file/glob matching pattern. For example, if we had a directory "/mydir/" containing Python files such as "/mydir/a.py" and "/mydir/b.py", it would produce the following:

dataset = tf.data.Dataset.list_files("/mydir/*.py")
files_list = list(dataset.as_numpy_iterator())
print(files_list) # => [b"/mydir/a.py", b"/mydir/b.py"] as byte strings, in shuffled order by default

The issue with the above approach is that the glob has to scan every filename in the path whenever it runs, so it is more efficient to produce the list of file names first and construct the dataset using from_tensor_slices.
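One way to do that is to resolve the glob up front with Python's built-in glob module (a sketch; the sorting is optional but keeps the order deterministic):

import glob
import tensorflow as tf

# resolve the glob once, outside the dataset
file_names = sorted(glob.glob("/mydir/*.py"))
dataset = tf.data.Dataset.from_tensor_slices(file_names)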

There are other creation methods, such as from_generator and from_tensors, which are outside the scope of this article. We will be using from_tensor_slices in a working example below.
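For reference, the key difference is that from_tensors treats its input as a single element, whereas from_tensor_slices slices it along the first dimension:

import tensorflow as tf

# from_tensors: one element containing the whole tensor
print(list(tf.data.Dataset.from_tensors([1, 2, 3]).as_numpy_iterator()))
# => [array([1, 2, 3], dtype=int32)]

# from_tensor_slices: one element per entry
print(list(tf.data.Dataset.from_tensor_slices([1, 2, 3]).as_numpy_iterator()))
# => [1, 2, 3]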

Applying transformations to the dataset

Now that we have a dataset of elements, the next step is to preprocess it. We can call the map method and pass it a function that processes each element.

For instance, we may want to resize each image and scale its pixel values as part of preprocessing.

# list_of_files is a collection of file paths...

def process_img(file_path):
  # read and decode the image
  img = tf.io.read_file(file_path)
  img = tf.image.decode_jpeg(img, channels=3)
  # convert to float32, which also scales pixel values to [0, 1]
  img = tf.image.convert_image_dtype(img, tf.float32)
  # resize the image
  img = tf.image.resize(img, (64, 64))
  return img

dataset = tf.data.Dataset.from_tensor_slices(list_of_files)
train_ds = dataset.map(process_img)

After mapping process_img over the dataset, train_ds will contain the preprocessed images. Note that the function is applied lazily, as the dataset is iterated.
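Because decoding and resizing can become a bottleneck, map also accepts a num_parallel_calls argument to process elements in parallel. A sketch of the same call with autotuned parallelism (in newer TF versions the constant is also exposed as tf.data.AUTOTUNE):

# let tf.data pick the level of parallelism for the map transformation
train_ds = dataset.map(process_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)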

Since map returns a dataset, we can chain multiple calls together, clarifying the sequence of operations:

def func1(x):
  return x * 2

def func2(x):
  return x ** 2

ds = tf.data.Dataset.from_tensor_slices([1,2,3])

new_ds = ds.map(func1).map(func2)

print(list(new_ds.as_numpy_iterator())) # => [4, 16, 36]

Iterating over the dataset

Before we can pass the dataset into a model for training, we need to set certain parameters on it, such as the batch size and the caching and prefetching options.

Using the image classification example above, we can do the following:

dataset = tf.data.Dataset.from_tensor_slices(list_of_files)
train_ds = dataset.map(process_img)
train_ds = train_ds.shuffle(buffer_size=1024).batch(64)

model.fit(train_ds, epochs=3)

The shuffle function randomly shuffles the elements of the dataset using a buffer of the given size. The batch function groups consecutive elements into batches, here of 64 elements each. Note that because the dataset is already batched, we don't have to set the batch_size argument in the fit call.
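To see what batch does on its own, here is a tiny illustration (the final batch is smaller than the rest unless drop_remainder=True is passed):

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5]).batch(2)
print(list(ds.as_numpy_iterator()))  # => batches [1 2], [3 4] and [5]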

We can also chain further functions such as cache, which caches the data in memory, or on the filesystem if the filename argument is set. This is extremely useful when training on large datasets.

Note that the first iteration of the training loop creates the cache, after which subsequent runs reuse the cached data in the same order. To randomize the data between iterations, call shuffle after cache.

For example:

train_ds = train_ds.cache("cache/mycache").shuffle(buffer_size=1024).batch(64)

When the training loop is restarted, the cache directory needs to be cleared, otherwise an exception is raised.
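Putting the pieces together, a typical input pipeline for the image example might look like the following sketch. The prefetch call overlaps data preparation with training, and AUTOTUNE lets tf.data choose the buffer sizes:

train_ds = (tf.data.Dataset.from_tensor_slices(list_of_files)
            .map(process_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
            .cache("cache/mycache")     # cache decoded images on disk
            .shuffle(buffer_size=1024)  # randomize order on each iteration
            .batch(64)
            .prefetch(tf.data.experimental.AUTOTUNE))  # overlap preprocessing and training

model.fit(train_ds, epochs=3)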

For most training scenarios, passing the dataset into model.fit will be sufficient. However, if you have a custom/manual training loop that iterates over the dataset across multiple epochs, you need to call repeat before batch so that the dataset can be iterated more than once.

# repeat() without an argument repeats the dataset indefinitely,
# so this loop will not terminate on its own
train_ds = train_ds.repeat().batch(64)

for ele in train_ds.as_numpy_iterator():
  print(ele)

To access the next batch of data manually, create an iterator from the dataset, either by calling as_numpy_iterator or by wrapping the dataset object in iter(), and call next on it to retrieve the next batch.
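For example, to pull individual batches from the train_ds defined above:

it = iter(train_ds)
first_batch = next(it)   # a batch of 64 preprocessed images
second_batch = next(it)  # the following batch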

For a working implementation, please refer to the following example of applying tf.data.Dataset to MNIST. The tf.data.Dataset API documentation has more details on the various functions, along with examples.

Happy Hacking!