Save and load datasets stored in HDF5 file format

This example demonstrates how to load data from a stored .h5 file and build a data input pipeline in TensorFlow/Keras.

Save Dataset to HDF5

First, we create a small temporary dataset from the default synthetic dataset, comprising 5 source cases with the cross-spectral matrix (CSM) as the input feature.

[4]:

import tensorflow as tf
from acoupipe.datasets.synthetic import DatasetSynthetic

# training dataset
d1 = DatasetSynthetic()

# save to .h5 file
d1.save_h5(features=["csm"], split="training", size=5, name="/tmp/tmp_dataset.h5")

100%|██████████| 5/5 [00:02<00:00,  2.05it/s]

Load Dataset from HDF5 File

The AcouPipe toolbox provides the LoadH5Dataset class to load datasets stored in HDF5 format. Each individual sample (source case) can be accessed through the h5f attribute of the class. The keys of the file are the sample indices:

[5]:

from acoupipe.loader import LoadH5Dataset

dataset_h5 = LoadH5Dataset(name="/tmp/tmp_dataset.h5")

print(list(dataset_h5.h5f.keys())) # sample indices are the keys of the dataset file

['0', '1', '2', '3', '4']
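The key listing above suggests a layout with one HDF5 group per sample index and one dataset per feature. The following is a minimal sketch of reading a single feature from such a layout with plain h5py. It does not use AcouPipe itself: the file path, the array shapes and the assumed group/dataset layout are illustrative assumptions, not confirmed details of the stored file.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in file: one group per sample index, one dataset per feature.
path = os.path.join(tempfile.mkdtemp(), "layout_sketch.h5")
with h5py.File(path, "w") as f:
    for idx in range(3):
        group = f.create_group(str(idx))
        group.create_dataset("csm", data=np.zeros((2, 4, 4), dtype=np.complex64))
        group.create_dataset("idx", data=idx)

# Read the 'csm' feature of the first sample back into memory.
with h5py.File(path, "r") as f:
    sample_keys = sorted(f.keys(), key=int)  # sample indices are stored as strings
    first_csm = f["0"]["csm"][()]            # [()] loads the whole dataset as an array

print(sample_keys)
print(first_csm.shape, first_csm.dtype)
```

If the file written by save_h5 follows the same layout, an expression of the form `dataset_h5.h5f["0"]["csm"][()]` would presumably return the CSM of the first sample, but this depends on the type of the h5f attribute.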

As with the generate method, the get_data method can be used to retrieve the stored data iteratively:

[6]:

for data in dataset_h5.get_data():
    print(f"index {data['idx']} includes features:", list(data.keys())) # keys are the names of the features

index 0 includes features: ['csm', 'idx', 'seeds']
index 1 includes features: ['csm', 'idx', 'seeds']
index 2 includes features: ['csm', 'idx', 'seeds']
index 3 includes features: ['csm', 'idx', 'seeds']
index 4 includes features: ['csm', 'idx', 'seeds']

Building a TensorFlow/Keras Dataset

With these definitions, a Python generator can be created that can be consumed by the TensorFlow Dataset API. Here, the dataset comprises the csm, idx and seeds features.

[7]:

data_generator = dataset_h5.get_dataset_generator()
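tf.data.Dataset.from_generator expects a callable that returns an iterator, not an already started generator object, so that the data can be traversed more than once. The sketch below illustrates that pattern in plain Python. The helper make_sample_generator and the toy samples are hypothetical and only mimic what get_dataset_generator presumably returns; they are not AcouPipe code.

```python
def make_sample_generator(samples, features):
    """Return a zero-argument callable that yields one dict per sample.

    Hypothetical helper for illustration only; not part of AcouPipe.
    """
    def sample_generator():
        for sample in samples:
            # Keep only the requested features of each sample.
            yield {name: sample[name] for name in features}
    return sample_generator


# Toy stand-in for stored samples (plain Python values instead of arrays).
toy_samples = [{"csm": [[1.0]], "idx": i, "seeds": [[0, i]]} for i in range(3)]
toy_generator = make_sample_generator(toy_samples, features=["csm", "idx"])

# Each call produces a fresh iterator, so the data can be traversed repeatedly.
first_pass = [sample["idx"] for sample in toy_generator()]
second_pass = [sample["idx"] for sample in toy_generator()]
print(first_pass, second_pass)
```

Returning a callable instead of a generator object is what allows a consumer to restart iteration, for example at the beginning of every training epoch.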

To build a TensorFlow Dataset, the output signature of the data must be known. For this dataset, it would look something like this:

[10]:

# provide the signature of the features
output_signature = {
    'csm': tf.TensorSpec(shape=(None, 64, 64), dtype=tf.complex64),
    'seeds': tf.TensorSpec(shape=(None, 2), dtype=tf.float32),
    'idx': tf.TensorSpec(shape=(), dtype=tf.int64),
}

tf_dataset = tf.data.Dataset.from_generator(
    generator=data_generator,
    output_signature=output_signature,
)

data = next(iter(tf_dataset))

print(f"index {data['idx']} includes features:", list(data.keys()))
for key, value in data.items():
    print(f"key: {key}, shape: {value.shape}")

index 0 includes features: ['csm', 'seeds', 'idx']
key: csm, shape: (65, 64, 64)
key: seeds, shape: (6, 2)
key: idx, shape: ()

Alternatively, the output signature can be retrieved with the get_output_signature method of the corresponding dataset class:

[11]:

d1 = DatasetSynthetic()

signature = d1.get_output_signature(features=["csm","idx","seeds"])

tf_dataset = tf.data.Dataset.from_generator(
    generator=data_generator,
    output_signature=signature,
)

data = next(iter(tf_dataset))

print(f"index {data['idx']} includes features:", list(data.keys()))
for key, value in data.items():
    print(f"key: {key}, shape: {value.shape}")

index 0 includes features: ['csm', 'idx', 'seeds']
key: csm, shape: (65, 64, 64)
key: idx, shape: ()
key: seeds, shape: (6, 2)
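Once a tf.data.Dataset exists, it is usually batched and prefetched before being passed to a Keras model. The following is a minimal sketch of that step using a toy generator in place of the AcouPipe data. The generator, the reduced array shapes and the batch size are assumptions chosen for illustration only.

```python
import numpy as np
import tensorflow as tf


def toy_generator():
    # Hypothetical stand-in for the AcouPipe generator: 4 samples of equal shape.
    for idx in range(4):
        yield {
            "csm": np.zeros((3, 8, 8), dtype=np.complex64),
            "idx": np.int64(idx),
        }


toy_signature = {
    "csm": tf.TensorSpec(shape=(None, 8, 8), dtype=tf.complex64),
    "idx": tf.TensorSpec(shape=(), dtype=tf.int64),
}

toy_dataset = tf.data.Dataset.from_generator(
    toy_generator, output_signature=toy_signature
)

# Group two samples per batch and overlap data preparation with consumption.
batched = toy_dataset.batch(2).prefetch(tf.data.AUTOTUNE)

batch = next(iter(batched))
print(batch["csm"].shape, batch["idx"].numpy())
```

Plain batch only works when all samples in a batch share the same shape. If a dimension marked as None in the signature varies between samples, padded_batch is one alternative to consider.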