Save and load datasets stored in HDF5 file format
This example demonstrates how to save a dataset to an .h5 file, load the stored data again, and build a data input pipeline in TensorFlow / Keras.
Save Dataset to HDF5
First, we create a small temporary dataset from the default synthetic dataset, comprising 5 source cases with the cross-spectral matrix (CSM) as the input feature.
[4]:
import tensorflow as tf
from acoupipe.datasets.synthetic import DatasetSynthetic
# training dataset
d1 = DatasetSynthetic()
# save to .h5 file
d1.save_h5(features=["csm"], split="training", size=5, name="/tmp/tmp_dataset.h5")
100%|██████████| 5/5 [00:02<00:00, 2.05it/s]
Load Dataset from HDF5 File
The AcouPipe toolbox provides the LoadH5Dataset class to load datasets stored in HDF5 format. Each individual sample (source case) can be accessed through the h5f attribute of the class. The keys of the file are the sample indices; the only stored input feature in this case is ‘csm’:
[5]:
from acoupipe.loader import LoadH5Dataset
dataset_h5 = LoadH5Dataset(name="/tmp/tmp_dataset.h5")
print(list(dataset_h5.h5f.keys())) # sample indices are the keys of the dataset file
['0', '1', '2', '3', '4']
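Note that the sample indices are stored as strings. With ten or more samples, string keys that come back in lexicographic order no longer match the numeric sample order. The following minimal sketch illustrates this with a hypothetical key list standing in for `list(dataset_h5.h5f.keys())`; whether the keys of a given file are actually returned in this order is an assumption here.

```python
# Hypothetical key list standing in for list(dataset_h5.h5f.keys()) on a
# file with 12 samples; string keys sort lexicographically by default.
keys = sorted(str(i) for i in range(12))
print(keys[:4])

# Sort by integer value to recover the numeric sample order.
ordered_keys = sorted(keys, key=int)
print(ordered_keys[:4])
```

Sorting with `key=int` is only needed when samples are addressed directly through their keys.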
As with the generate method, the get_data method can be used to iterate over the stored data:
[6]:
for data in dataset_h5.get_data():
    print(f"index {data['idx']} includes features:", list(data.keys())) # keys are the names of the features
index 0 includes features: ['csm', 'idx', 'seeds']
index 1 includes features: ['csm', 'idx', 'seeds']
index 2 includes features: ['csm', 'idx', 'seeds']
index 3 includes features: ['csm', 'idx', 'seeds']
index 4 includes features: ['csm', 'idx', 'seeds']
Building a TensorFlow/Keras Dataset
With these definitions, a Python generator can be created that the TensorFlow Dataset API can consume. Here, the dataset comprises the CSM, idx and seeds features.
[7]:
data_generator = dataset_h5.get_dataset_generator()
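`tf.data.Dataset.from_generator` expects a callable that returns a fresh iterator each time it is called, so that the data can be traversed more than once. The following is a minimal sketch of that pattern using only the standard library. The names `make_generator`, `fake_store` and `toy_generator` are hypothetical and are not part of AcouPipe; the in-memory dictionary merely stands in for the opened HDF5 file.

```python
def make_generator(store):
    """Return a zero-argument callable that yields one sample dict per key.

    `store` is any mapping from sample index (as a string) to a dict of
    features; here it stands in for the opened HDF5 file.
    """
    def generator():
        for key in sorted(store, key=int):
            sample = dict(store[key])  # copy so the store is not modified
            sample["idx"] = int(key)
            yield sample
    return generator


# Hypothetical in-memory stand-in for the stored dataset.
fake_store = {"0": {"csm": [[1.0]]}, "1": {"csm": [[2.0]]}}
toy_generator = make_generator(fake_store)

# Each call starts a new, independent pass over the data.
first_pass = [sample["idx"] for sample in toy_generator()]
second_pass = [sample["idx"] for sample in toy_generator()]
print(first_pass, second_pass)
```

Passing the callable rather than an already started generator object is what allows the data to be iterated repeatedly, for example over several training epochs.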
To build a TensorFlow Dataset, the output signature of the data must be known. For this dataset, it would look something like:
[10]:
# provide the signature of the features
output_signature = {
    'csm': tf.TensorSpec(shape=(None, 64, 64), dtype=tf.complex64),
    'seeds': tf.TensorSpec(shape=(None, 2), dtype=tf.float32),
    'idx': tf.TensorSpec(shape=(), dtype=tf.int64),
}
tf_dataset = tf.data.Dataset.from_generator(
    generator=data_generator,
    output_signature=output_signature,
)
data = next(iter(tf_dataset))
print(f"index {data['idx']} includes features:", list(data.keys()))
for key, value in data.items():
    print(f"key: {key}, shape: {value.shape}")
index 0 includes features: ['csm', 'seeds', 'idx']
key: csm, shape: (65, 64, 64)
key: seeds, shape: (6, 2)
key: idx, shape: ()
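A `None` entry in a signature shape leaves that dimension unconstrained, which is why a CSM of shape (65, 64, 64) is accepted by the spec (None, 64, 64). The sketch below imitates this matching rule in plain Python for illustration only. `matches_spec` is a hypothetical helper, not a TensorFlow or AcouPipe function, and TensorFlow performs its own check internally.

```python
def matches_spec(spec_shape, actual_shape):
    """Check an actual shape against a spec where None means "any size"."""
    if len(spec_shape) != len(actual_shape):
        return False  # the number of dimensions must agree
    return all(s is None or s == a for s, a in zip(spec_shape, actual_shape))


# Shapes taken from the output above.
csm_ok = matches_spec((None, 64, 64), (65, 64, 64))
seeds_ok = matches_spec((None, 2), (6, 2))
idx_ok = matches_spec((), ())
# A tensor with a different number of dimensions is rejected.
rank_mismatch = matches_spec((None, 64, 64), (64, 64))
print(csm_ok, seeds_ok, idx_ok, rank_mismatch)
```

Leaving the first dimension open keeps the signature valid when the number of frequencies or seeds differs between datasets, whereas the fixed dimensions still guard against, for example, a wrong number of microphone channels.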
Alternatively, the output signature can be retrieved from the dataset with the get_output_signature method of the corresponding Dataset class:
[11]:
d1 = DatasetSynthetic()
signature = d1.get_output_signature(features=["csm","idx","seeds"])
tf_dataset = tf.data.Dataset.from_generator(
    generator=data_generator,
    output_signature=signature,
)
data = next(iter(tf_dataset))
print(f"index {data['idx']} includes features:", list(data.keys()))
for key, value in data.items():
    print(f"key: {key}, shape: {value.shape}")
index 0 includes features: ['csm', 'idx', 'seeds']
key: csm, shape: (65, 64, 64)
key: idx, shape: ()
key: seeds, shape: (6, 2)