{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# Preliminaries"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "## Downloading the demo dataset\n",
    "In this tutorial, we'll integrate the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/), a text classification dataset, with Bridge."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "TMP_NOTEBOOK_ROOT = Path(\"/tmp/bridge-ds/tutorials\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from bridge.utils import download_and_extract_archive\n",
    "\n",
    "download_and_extract_archive(\n",
    "    \"https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\", TMP_NOTEBOOK_ROOT / \"imdb\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "### File Tree\n",
    "\n",
    "After extracting, we can observe the following file structure:\n",
    "\n",
    "```\n",
    "├── README\n",
    "├── imdb.vocab\n",
    "├── imdbEr.txt\n",
    "├── test\n",
    "│   ├── labeledBow.feat\n",
    "│   ├── neg  [12500 entries]\n",
    "│   ├── pos  [12500 entries]\n",
    "│   ├── urls_neg.txt\n",
    "│   └── urls_pos.txt\n",
    "└── train\n",
    "    ├── labeledBow.feat\n",
    "    ├── neg  [12500 entries]\n",
    "    ├── pos  [12500 entries]\n",
    "    ├── unsup  [50000 entries]\n",
    "    ├── unsupBow.feat\n",
    "    ├── urls_neg.txt\n",
    "    ├── urls_pos.txt\n",
    "    └── urls_unsup.txt\n",
    "```\n",
    "\n",
    "In the next steps, we will learn how to load this dataset to BridgeDS."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "# DatasetProvider\n",
    "The recommended way to create Bridge Datasets is by using DatasetProviders. They implement a single method, `provider.build_dataset()`.\n",
    "\n",
    "Here is the outline:\n",
    "\n",
    "```python\n",
    "class YourDatasetProvider(DatasetProvider):\n",
    "    def __init__(self, *args,**kwargs):\n",
    "        \"\"\"\n",
    "        Load the original dataset. This usually means downloading the dataset from a source, storing samples in a list, etc.\n",
    "        Remember that in Bridge it's enough to store references to your data, not necessarily the actual data.\n",
    "        \"\"\"\n",
    "        super().__init__(dataset_dir, download)\n",
    "\n",
    "    def build_dataset(self, display_engine=None, cache_mechanisms=None):\n",
    "        \"\"\"\n",
    "        Convert the dataset from raw format into our own Dataset type.\n",
    "\n",
    "        Parameters:\n",
    "        - display_engine (DisplayEngine): The display engine to use for visualization.\n",
    "        - cache_mechanisms (Dict[str, CacheMechanism | None] | None): Cache mechanisms for different types of elements.\n",
    "        NOTE: Learn more about cache mechanisms and display engines in more advanced tutorials.\n",
    "        \"\"\"\n",
    "        # Implement dataset building logic here\n",
    "        pass\n",
    "```\n",
    "\n",
    "Let's start by writing the basic layout of the class, and the `__init__`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from bridge.primitives.dataset import SingularDataset\n",
    "from bridge.providers import DatasetProvider\n",
    "\n",
    "\n",
    "class LargeMovieReviewDataset(DatasetProvider):\n",
    "    dataset_url = \"https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\"\n",
    "\n",
    "    def __init__(self, root: str | os.PathLike, split: str = \"train\", download: bool = False):\n",
    "        root = Path(root)\n",
    "\n",
    "        if download:\n",
    "            if (root / \"aclImdb_v1.tar.gz\").exists():\n",
    "                print(\"Archive file aclImdb_v1.tar.gz already exists, skipping download.\")\n",
    "            else:\n",
    "                download_and_extract_archive(self.dataset_url, str(root))\n",
    "        self._split_root = root / \"aclImdb\" / split\n",
    "\n",
    "    def build_dataset(\n",
    "        self,\n",
    "        display_engine=None,\n",
    "        cache_mechanisms=None,\n",
    "    ) -> SingularDataset:\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "Now we can instantiate this provider and verify that it points to the right directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {},
   "outputs": [],
   "source": [
    "provider = LargeMovieReviewDataset(TMP_NOTEBOOK_ROOT / \"imdb\", split=\"train\", download=False)\n",
    "provider._split_root"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.listdir(provider._split_root)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10",
   "metadata": {},
   "source": [
    "The next step will be to implement `build_dataset()`, which will load the relevant metadata from this directory into a Bridge Dataset.\n",
    "\n",
    "Concretely, we will iterate over the directories and convert every text file into **two elements**: a text element, and a class label element. To get the convenient API where the text elements are _samples_ and the class elements are _annotations_, we will keep two separate lists for elements during this process, but we will ensure elements from the same sample _share a sample id_:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "from bridge.primitives.dataset.singular_dataset import SingularDataset\n",
    "from bridge.primitives.element.data.load_mechanism import LoadMechanism\n",
    "from bridge.primitives.element.element import Element\n",
    "from bridge.utils.data_objects import ClassLabel\n",
    "\n",
    "\n",
    "class LargeMovieReviewDataset(DatasetProvider):\n",
    "    dataset_url = \"https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\"\n",
    "\n",
    "    def __init__(self, root: str | os.PathLike, split: str = \"train\", download: bool = False):\n",
    "        root = Path(root)\n",
    "\n",
    "        if download:\n",
    "            if (root / \"aclImdb_v1.tar.gz\").exists():\n",
    "                print(\"Archive file aclImdb_v1.tar.gz already exists, skipping download.\")\n",
    "            else:\n",
    "                download_and_extract_archive(self.dataset_url, str(root))\n",
    "        self._split_root = root / \"aclImdb\" / split\n",
    "\n",
    "    def build_dataset(\n",
    "        self,\n",
    "        display_engine=None,\n",
    "        cache_mechanisms=None,\n",
    "    ) -> SingularDataset:\n",
    "        samples = []\n",
    "        annotations = []\n",
    "\n",
    "        class_dir_list = [d for d in list(self._split_root.iterdir()) if d.is_dir()]\n",
    "        for class_idx, class_dir in enumerate(sorted(class_dir_list)):\n",
    "            for textfile in class_dir.iterdir():\n",
    "                load_mechanism = LoadMechanism.from_url_string(str(textfile), \"text\")\n",
    "                text_element = Element(\n",
    "                    element_id=f\"text_{textfile.stem}\",\n",
    "                    sample_id=textfile.stem,\n",
    "                    etype=\"text\",\n",
    "                    load_mechanism=load_mechanism,\n",
    "                )\n",
    "                load_mechanism = LoadMechanism(ClassLabel(class_idx, class_dir.name), category=\"obj\")\n",
    "                label_element = Element(\n",
    "                    element_id=f\"label_{textfile.stem}\",\n",
    "                    sample_id=textfile.stem,\n",
    "                    etype=\"class_label\",\n",
    "                    load_mechanism=load_mechanism,\n",
    "                )\n",
    "                samples.append(text_element)\n",
    "                annotations.append(label_element)\n",
    "\n",
    "        return SingularDataset.from_lists(\n",
    "            samples, annotations, display_engine=display_engine, cache_mechanisms=cache_mechanisms\n",
    "        )"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "12",
   "metadata": {},
   "source": [
    "There's quite a bit of code here, so let's break it down a little:\n",
    "\n",
    "#### Iterating Class Dirs\n",
    "\n",
    "```python\n",
    "    class_dir_list = [d for d in list(self._split_root.iterdir()) if d.is_dir()] \n",
    "    for class_idx, class_dir in enumerate(sorted(class_dir_list)):\n",
    "        for textfile in class_dir.iterdir():\n",
    "```\n",
    "Create a nested loop, where for every class, we iterate on all samples of that class.\n",
    "\n",
    "\n",
    "#### Create Text Element\n",
    "```python\n",
    "load_mechanism = LoadMechanism.from_url_string(str(textfile), 'text')\n",
    "text_element = Element(\n",
    "    element_id=f\"text_{textfile.stem}\",\n",
    "    sample_id=textfile.stem,\n",
    "    etype='text',\n",
    "    load_mechanism=load_mechanism,\n",
    ")\n",
    "```\n",
    "\n",
    "* Create a LoadMechanism for our text file\n",
    "* Create the text element. This means defining a unique element id, a sample id, and using the load mechanism we just defined.\n",
    "\n",
    "### Create Class Element\n",
    "\n",
    "```python\n",
    "load_mechanism = LoadMechanism(ClassLabel(class_idx, class_dir.name), category='obj')\n",
    "label_element = Element(\n",
    "    element_id=f\"label_{textfile.stem}\",\n",
    "    sample_id=textfile.stem,\n",
    "    etype='class_label',\n",
    "    load_mechanism=load_mechanism,\n",
    ")\n",
    "```\n",
    "\n",
    "* The LoadMechanism in this case will simply store the class label we define in-memory, rather than a URL.\n",
    "* The element is defined with a different element id than the one above, but the same sample id, so we know they relate.\n",
    "\n",
    "NOTE: Each sample comprises of one text element and one label element, because we're doing a classification task. In the SingularDataset regime, these will be separated into `samples` (text) and `annotations` (class labels).\n",
    "\n",
    "### Wrapping up\n",
    "\n",
    "Let's create a Bridge Dataset and see what we've got:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13",
   "metadata": {},
   "outputs": [],
   "source": [
    "from bridge.display.basic import SimplePrints\n",
    "\n",
    "ds = LargeMovieReviewDataset(TMP_NOTEBOOK_ROOT / \"imdb\", split=\"train\", download=False).build_dataset(\n",
    "    display_engine=SimplePrints()\n",
    ")\n",
    "ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {},
   "outputs": [],
   "source": [
    "ds.samples.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15",
   "metadata": {},
   "outputs": [],
   "source": [
    "ds.annotations.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {},
   "outputs": [],
   "source": [
    "ds.annotations.data.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17",
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = ds.iget(0)\n",
    "sample.data  # SingularSample exposes the sample element data directly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18",
   "metadata": {},
   "outputs": [],
   "source": [
    "ds.select_samples(lambda samples, anns: samples.index[:2]).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19",
   "metadata": {},
   "source": [
    "We have an operational Bridge Dataset, which we can manipulate as we see fit. \n",
    "\n",
    "For example, observe the [tree](#File-Tree) above, and note that the `unsup` dir only exists in the training set, and not in the test set. This is because this \"class\" is not a class per se, but rather unlabeled data which is included in our archive.\n",
    "\n",
    "Let's use a simple selection to clear out all samples of this class from our dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = ds.select_samples(lambda samples, anns: anns[anns.data != \"unsup\"].index.get_level_values(\"sample_id\"))\n",
    "\n",
    "ds.annotations.data.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21",
   "metadata": {},
   "source": [
    "## In Summary\n",
    "1. To create our own custom datasets, it's recommended to use a **DatasetProvider**\n",
    "2. For SingularDatasets, we create two lists of elements - one for _samples_ and one for _annotations. For any other kind of Dataset, we will create a single list of elements.\n",
    "3. Elements have unique IDs across the Dataset, and share Sample IDs with Elements of the same Sample.\n",
    "4. Elements are the low-level object which contains raw data, by using a **LoadMechanism**.\n",
    "\n",
    "## Up Next\n",
    "In this tutorial, we've used a primitive DisplayEngine called **SimplePrints**. If you would prefer a more sophisticated one like the Panel one in previous tutorials, continue to the next tutorial where we learn how to create our own **DisplayEngine** for a text dataset. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}