{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Preliminaries" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Downloading the demo dataset\n", "In this tutorial, we'll integrate the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/), a text classification dataset, with Bridge." ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "TMP_NOTEBOOK_ROOT = Path(\"/tmp/bridge-ds/tutorials\")" ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "from bridge.utils import download_and_extract_archive\n", "\n", "download_and_extract_archive(\n", "    \"https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\", TMP_NOTEBOOK_ROOT / \"imdb\"\n", ")" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "### File Tree\n", "\n", "After extracting, we can observe the following file structure:\n", "\n", "```\n", "├── README\n", "├── imdb.vocab\n", "├── imdbEr.txt\n", "├── test\n", "│   ├── labeledBow.feat\n", "│   ├── neg [12500 entries]\n", "│   ├── pos [12500 entries]\n", "│   ├── urls_neg.txt\n", "│   └── urls_pos.txt\n", "└── train\n", "    ├── labeledBow.feat\n", "    ├── neg [12500 entries]\n", "    ├── pos [12500 entries]\n", "    ├── unsup [50000 entries]\n", "    ├── unsupBow.feat\n", "    ├── urls_neg.txt\n", "    ├── urls_pos.txt\n", "    └── urls_unsup.txt\n", "```\n", "\n", "In the next steps, we will learn how to load this dataset to BridgeDS." ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "# DatasetProvider\n", "The recommended way to create Bridge Datasets is by using DatasetProviders. 
They implement a single method, `provider.build_dataset()`.\n", "\n", "Here is the outline:\n", "\n", "```python\n", "class YourDatasetProvider(DatasetProvider):\n", "    def __init__(self, *args, **kwargs):\n", "        \"\"\"\n", "        Load the original dataset. This usually means downloading the dataset from a source, storing samples in a list, etc.\n", "        Remember that in Bridge it's enough to store references to your data, not necessarily the actual data.\n", "        \"\"\"\n", "        super().__init__(*args, **kwargs)\n", "\n", "    def build_dataset(self, display_engine=None, cache_mechanisms=None):\n", "        \"\"\"\n", "        Convert the dataset from raw format into our own Dataset type.\n", "\n", "        Parameters:\n", "        - display_engine (DisplayEngine): The display engine to use for visualization.\n", "        - cache_mechanisms (Dict[str, CacheMechanism | None] | None): Cache mechanisms for different types of elements.\n", "        NOTE: Learn more about cache mechanisms and display engines in more advanced tutorials.\n", "        \"\"\"\n", "        # Implement dataset building logic here\n", "        pass\n", "```\n", "\n", "Let's start by writing the basic layout of the class, and the `__init__`:" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "from bridge.primitives.dataset import SingularDataset\n", "from bridge.providers import DatasetProvider\n", "\n", "\n", "class LargeMovieReviewDataset(DatasetProvider):\n", "    dataset_url = \"https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\"\n", "\n", "    def __init__(self, root: str | os.PathLike, split: str = \"train\", download: bool = False):\n", "        root = Path(root)\n", "\n", "        if download:\n", "            if (root / \"aclImdb_v1.tar.gz\").exists():\n", "                print(\"Archive file aclImdb_v1.tar.gz already exists, skipping download.\")\n", "            else:\n", "                download_and_extract_archive(self.dataset_url, str(root))\n", "        self._split_root = root / \"aclImdb\" / split\n", "\n", "    def build_dataset(\n", "
self,\n", "        display_engine=None,\n", "        cache_mechanisms=None,\n", "    ) -> SingularDataset:\n", "        pass" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "Now we can instantiate this provider and verify that it points to the right directory:" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "provider = LargeMovieReviewDataset(TMP_NOTEBOOK_ROOT / \"imdb\", split=\"train\", download=False)\n", "provider._split_root" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "os.listdir(provider._split_root)" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "The next step will be to implement `build_dataset()`, which will load the relevant metadata from this directory into a Bridge Dataset.\n", "\n", "Concretely, we will iterate over the directories and convert every text file into **two elements**: a text element, and a class label element. To get the convenient API where the text elements are _samples_ and the class elements are _annotations_, we will keep two separate lists for elements during this process, but we will ensure elements from the same sample _share a sample id_:" ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "import os\n", "from pathlib import Path\n", "\n", "from bridge.primitives.dataset.singular_dataset import SingularDataset\n", "from bridge.primitives.element.data.load_mechanism import LoadMechanism\n", "from bridge.primitives.element.element import Element\n", "from bridge.utils.data_objects import ClassLabel\n", "\n", "\n", "class LargeMovieReviewDataset(DatasetProvider):\n", "    dataset_url = \"https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\"\n", "\n", "    def __init__(self, root: str | os.PathLike, split: str = \"train\", download: bool = False):\n", "        root = Path(root)\n", "\n", "        if download:\n", "            if (root / 
\"aclImdb_v1.tar.gz\").exists():\n", "                print(\"Archive file aclImdb_v1.tar.gz already exists, skipping download.\")\n", "            else:\n", "                download_and_extract_archive(self.dataset_url, str(root))\n", "        self._split_root = root / \"aclImdb\" / split\n", "\n", "    def build_dataset(\n", "        self,\n", "        display_engine=None,\n", "        cache_mechanisms=None,\n", "    ) -> SingularDataset:\n", "        samples = []\n", "        annotations = []\n", "\n", "        class_dir_list = [d for d in list(self._split_root.iterdir()) if d.is_dir()]\n", "        for class_idx, class_dir in enumerate(sorted(class_dir_list)):\n", "            for textfile in class_dir.iterdir():\n", "                load_mechanism = LoadMechanism.from_url_string(str(textfile), \"text\")\n", "                text_element = Element(\n", "                    element_id=f\"text_{textfile.stem}\",\n", "                    sample_id=textfile.stem,\n", "                    etype=\"text\",\n", "                    load_mechanism=load_mechanism,\n", "                )\n", "                load_mechanism = LoadMechanism(ClassLabel(class_idx, class_dir.name), category=\"obj\")\n", "                label_element = Element(\n", "                    element_id=f\"label_{textfile.stem}\",\n", "                    sample_id=textfile.stem,\n", "                    etype=\"class_label\",\n", "                    load_mechanism=load_mechanism,\n", "                )\n", "                samples.append(text_element)\n", "                annotations.append(label_element)\n", "\n", "        return SingularDataset.from_lists(\n", "            samples, annotations, display_engine=display_engine, cache_mechanisms=cache_mechanisms\n", "        )" ] }, { "attachments": {}, "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "There's quite a bit of code here, so let's break it down a little:\n", "\n", "#### Iterating Class Dirs\n", "\n", "```python\n", "class_dir_list = [d for d in list(self._split_root.iterdir()) if d.is_dir()]\n", "for class_idx, class_dir in enumerate(sorted(class_dir_list)):\n", "    for textfile in class_dir.iterdir():\n", "```\n", "This creates a nested loop: for every class directory, we iterate over all samples of that class.\n", "\n", "\n", "#### Create Text Element\n", "```python\n", "load_mechanism = LoadMechanism.from_url_string(str(textfile), 
'text')\n", "text_element = Element(\n", "    element_id=f\"text_{textfile.stem}\",\n", "    sample_id=textfile.stem,\n", "    etype='text',\n", "    load_mechanism=load_mechanism,\n", ")\n", "```\n", "\n", "* Create a LoadMechanism for our text file.\n", "* Create the text element. This means defining a unique element id, a sample id, and using the load mechanism we just defined.\n", "\n", "#### Create Class Element\n", "\n", "```python\n", "load_mechanism = LoadMechanism(ClassLabel(class_idx, class_dir.name), category='obj')\n", "label_element = Element(\n", "    element_id=f\"label_{textfile.stem}\",\n", "    sample_id=textfile.stem,\n", "    etype='class_label',\n", "    load_mechanism=load_mechanism,\n", ")\n", "```\n", "\n", "* The LoadMechanism in this case simply stores the class label we define in memory, rather than a URL.\n", "* The element is defined with a different element id than the one above, but the same sample id, so we know they belong to the same sample.\n", "\n", "NOTE: Each sample consists of one text element and one label element, because we're doing a classification task. 
In the SingularDataset regime, these will be separated into `samples` (text) and `annotations` (class labels).\n", "\n", "### Wrapping up\n", "\n", "Let's create a Bridge Dataset and see what we've got:" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "from bridge.display.basic import SimplePrints\n", "\n", "ds = LargeMovieReviewDataset(TMP_NOTEBOOK_ROOT / \"imdb\", split=\"train\", download=False).build_dataset(\n", " display_engine=SimplePrints()\n", ")\n", "ds" ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "ds.samples.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "ds.annotations.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "ds.annotations.data.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "sample = ds.iget(0)\n", "sample.data # SingularSample exposes the sample element data directly" ] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "ds.select_samples(lambda samples, anns: samples.index[:2]).show()" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "We have an operational Bridge Dataset, which we can manipulate as we see fit. \n", "\n", "For example, observe the [tree](#File-Tree) above, and note that the `unsup` dir only exists in the training set, and not in the test set. 
This is because this \"class\" is not a class per se, but rather unlabeled data that is included in our archive.\n", "\n", "Let's use a simple selection to clear out all samples of this class from our dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "ds = ds.select_samples(lambda samples, anns: anns[anns.data != \"unsup\"].index.get_level_values(\"sample_id\"))\n", "\n", "ds.annotations.data.value_counts()" ] }, { "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "## In Summary\n", "1. To create our own custom datasets, it's recommended to use a **DatasetProvider**.\n", "2. For SingularDatasets, we create two lists of elements: one for _samples_ and one for _annotations_. For any other kind of Dataset, we create a single list of elements.\n", "3. Elements have unique IDs across the Dataset, and share Sample IDs with Elements of the same Sample.\n", "4. Elements are the low-level objects that hold the raw data, which they access through a **LoadMechanism**.\n", "\n", "## Up Next\n", "In this tutorial, we've used a primitive DisplayEngine called **SimplePrints**. If you would prefer a more sophisticated one, like the Panel engine used in previous tutorials, continue to the next tutorial, where we learn how to create our own **DisplayEngine** for a text dataset." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }