{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Save and load datasets stored in HDF5 file format\n",
    "================================================\n",
    "\n",
    "This example demonstrates how to save a dataset to an HDF5 (.h5) file, load it again, and build a \n",
    "data input pipeline for TensorFlow / Keras."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save Dataset to HDF5\n",
    "\n",
    "First, we create a small temporary dataset from the default synthetic dataset, comprising 5 source cases with the cross-spectral matrix (CSM) as the input feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|\u001b[38;2;31;119;180m██████████\u001b[0m| 5/5 [00:02<00:00,  2.05it/s]\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf\n",
    "from acoupipe.datasets.synthetic import DatasetSynthetic\n",
    "\n",
    "# training dataset\n",
    "d1 = DatasetSynthetic()\n",
    "\n",
    "# save to .h5 file\n",
    "d1.save_h5(features=[\"csm\"], split=\"training\", size=5, name=\"/tmp/tmp_dataset.h5\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Dataset from HDF5 File \n",
    "\n",
    "The AcouPipe toolbox provides the `LoadH5Dataset` class to load datasets stored in HDF5 format.\n",
    "Each individual sample (source case) can be accessed through the `h5f` attribute of the class. The sample indices serve as the keys of the file, and each sample holds the stored features ('csm' in this case):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['0', '1', '2', '3', '4']\n"
     ]
    }
   ],
   "source": [
    "from acoupipe.loader import LoadH5Dataset\n",
    "\n",
    "dataset_h5 = LoadH5Dataset(name=\"/tmp/tmp_dataset.h5\")\n",
    "\n",
    "print(list(dataset_h5.h5f.keys())) # sample indices are the keys of the dataset file"
   ]
  },
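  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate what such a file might look like on disk, the following minimal sketch writes and reads a small stand-in file with `h5py`. The layout used here (one group per sample index, one dataset per feature) is an assumption inferred from the printed keys, not a documented guarantee of AcouPipe, and the file name `layout_sketch.h5` is hypothetical.\n",
    "\n",
    "```python\n",
    "import os\n",
    "import tempfile\n",
    "\n",
    "import h5py\n",
    "import numpy as np\n",
    "\n",
    "# Hypothetical stand-in file; the layout (one group per sample index,\n",
    "# one dataset per feature) is an assumption, not AcouPipe's documented format.\n",
    "path = os.path.join(tempfile.mkdtemp(), 'layout_sketch.h5')\n",
    "with h5py.File(path, 'w') as f:\n",
    "    for idx in range(3):\n",
    "        group = f.create_group(str(idx))\n",
    "        group.create_dataset('csm', data=np.zeros((2, 4, 4), dtype=np.complex64))\n",
    "        group.create_dataset('idx', data=idx)\n",
    "\n",
    "# Read the file back: list the sample keys, then load one feature as a NumPy array.\n",
    "with h5py.File(path, 'r') as f:\n",
    "    sample_keys = list(f.keys())\n",
    "    csm = f['0']['csm'][()]\n",
    "\n",
    "print(sample_keys, csm.shape, csm.dtype)\n",
    "```\n",
    "\n",
    "Indexing a dataset with `[()]` reads its full content into memory, which is convenient for small features but may be costly for large ones."
   ]
  },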
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As with the `generate` method, the `get_data` method can be used to iterate over the stored data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "index 0 includes features: ['csm', 'idx', 'seeds']\n",
      "index 1 includes features: ['csm', 'idx', 'seeds']\n",
      "index 2 includes features: ['csm', 'idx', 'seeds']\n",
      "index 3 includes features: ['csm', 'idx', 'seeds']\n",
      "index 4 includes features: ['csm', 'idx', 'seeds']\n"
     ]
    }
   ],
   "source": [
    "for data in dataset_h5.get_data():\n",
    "    print(f\"index {data['idx']} includes features:\", list(data.keys())) # keys are the names of the features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building a TensorFlow/Keras Dataset \n",
    "\n",
    "With the loader defined above, a Python generator can be created that the TensorFlow Dataset API can consume. Here, each sample comprises the `csm`, `idx` and `seeds` features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_generator = dataset_h5.get_dataset_generator()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To build a TensorFlow Dataset, the output signature of the data must be known. For the features stored here, it can be written as:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "index 0 includes features: ['csm', 'seeds', 'idx']\n",
      "key: csm, shape: (65, 64, 64)\n",
      "key: seeds, shape: (6, 2)\n",
      "key: idx, shape: ()\n"
     ]
    }
   ],
   "source": [
    "# provide the signature of the features\n",
    "output_signature = {\n",
    "            'csm':  tf.TensorSpec(shape=(None,64,64), dtype=tf.complex64),\n",
    "            'seeds' : tf.TensorSpec(shape=(None,2), dtype=tf.float32),\n",
    "            'idx' : tf.TensorSpec(shape=(), dtype=tf.int64)\n",
    "            }\n",
    "\n",
    "tf_dataset = tf.data.Dataset.from_generator(\n",
    "            generator=data_generator,\n",
    "            output_signature=output_signature\n",
    "            )\n",
    "\n",
    "data = next(iter(tf_dataset))\n",
    "\n",
    "print(f\"index {data['idx']} includes features:\", list(data.keys()))\n",
    "for key, value in data.items():\n",
    "    print(f\"key: {key}, shape: {value.shape}\")"
   ]
  },
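  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A dataset built from a generator can be extended into an input pipeline like any other `tf.data.Dataset`. The following minimal sketch shows batching and prefetching. It uses a hypothetical `toy_generator` in place of the generator returned by `get_dataset_generator`, and it assumes that all samples share the same `csm` shape.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "\n",
    "# Hypothetical generator standing in for the HDF5-backed one;\n",
    "# it yields zero-filled samples with a fixed csm shape.\n",
    "def toy_generator():\n",
    "    for idx in range(5):\n",
    "        yield {\n",
    "            'csm': np.zeros((65, 64, 64), dtype=np.complex64),\n",
    "            'idx': np.int64(idx),\n",
    "        }\n",
    "\n",
    "toy_signature = {\n",
    "    'csm': tf.TensorSpec(shape=(None, 64, 64), dtype=tf.complex64),\n",
    "    'idx': tf.TensorSpec(shape=(), dtype=tf.int64),\n",
    "}\n",
    "\n",
    "toy_dataset = tf.data.Dataset.from_generator(toy_generator, output_signature=toy_signature)\n",
    "\n",
    "# Group samples into batches of two and prefetch upcoming batches\n",
    "# while the current one is being consumed.\n",
    "batched = toy_dataset.batch(2).prefetch(tf.data.AUTOTUNE)\n",
    "\n",
    "for batch in batched:\n",
    "    print(batch['idx'].numpy(), batch['csm'].shape)\n",
    "```\n",
    "\n",
    "If the first dimension of `csm` differed between samples, `batch` would fail to stack them, and `padded_batch` would be one option to consider instead."
   ]
  },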
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of being written by hand, the output signature can be obtained with the `get_output_signature` method of the corresponding dataset class (here `DatasetSynthetic`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "index 0 includes features: ['csm', 'idx', 'seeds']\n",
      "key: csm, shape: (65, 64, 64)\n",
      "key: idx, shape: ()\n",
      "key: seeds, shape: (6, 2)\n"
     ]
    }
   ],
   "source": [
    "d1 = DatasetSynthetic()\n",
    "\n",
    "signature = d1.get_output_signature(features=[\"csm\",\"idx\",\"seeds\"])\n",
    "\n",
    "tf_dataset = tf.data.Dataset.from_generator(\n",
    "            generator=data_generator,\n",
    "            output_signature=signature\n",
    "            )\n",
    "\n",
    "data = next(iter(tf_dataset))\n",
    "\n",
    "print(f\"index {data['idx']} includes features:\", list(data.keys()))\n",
    "for key, value in data.items():\n",
    "    print(f\"key: {key}, shape: {value.shape}\")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py39",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "8b84133aa5d27198834684dc5cf37286f31547fcb562f18c04d9e25d99e7281e"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}