{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Preliminaries" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "import holoviews as hv\n", "import hvplot.pandas # noqa\n", "import panel as pn\n", "\n", "hv.extension(\"bokeh\")\n", "pn.extension()\n", "# If in google colab, run hack that allows holoviews to work properly\n", "try:\n", " import google.colab # noqa\n", "\n", " def _render(self, **kwargs):\n", " hv.extension(\"bokeh\")\n", " return hv.Store.render(self)\n", "\n", " hv.core.Dimensioned._repr_mimebundle_ = _render\n", "except ModuleNotFoundError:\n", " pass\n", "\n", "TMP_NOTEBOOK_ROOT = Path(\"/tmp/bridge-ds/tutorials\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "# CacheMechanisms\n", "\n", "## Motivation\n", "Consider the following Dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "from bridge.providers.vision import Coco2017Detection\n", "\n", "root_dir = TMP_NOTEBOOK_ROOT / \"coco\"\n", "\n", "provider = Coco2017Detection(root_dir, split=\"val\", img_source=\"stream\")\n", "stream_ds = provider.build_dataset()\n", "stream_ds" ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "stream_ds.samples.head(3)" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "This Dataset has samples with url sources, which means we need to request them on each `sample.data` call, which is takes a long time:" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "%%timeit\n", "stream_ds.iget(0).data" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "One way to speed this up is to use a `CacheMechanism`: an object that, once `image_element.data` is called once, stores the data in a different location (e.g. a local file or in-memory). This action is transparent to the user but making subsequent `.data` calls significantly faster. \n", "\n", "In our scenario, we can assign a cache mechanism for every `etype`. The Dataset has two etypes:\n", "1. `'bbox'` - already stored in-memory, no need to re-cache them\n", "2. `'image'` - we want to cache them in the filesystem." ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "from bridge.primitives.element.data.cache_mechanism import CacheMechanism\n", "from bridge.primitives.element.data.uri_components import URIComponents\n", "\n", "root_dir = TMP_NOTEBOOK_ROOT / \"coco\"\n", "\n", "provider = Coco2017Detection(root_dir)\n", "stream_ds = provider.build_dataset(\n", " cache_mechanisms={\n", " \"image\": CacheMechanism(\n", " root_uri=URIComponents.from_str(str(TMP_NOTEBOOK_ROOT / \"my_local_cache\")),\n", " ),\n", " \"bbox\": None,\n", " },\n", ")\n", "stream_ds" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "NOTE: `cache_mechanism == None` means we don't cache anything and keep the original LoadMechanism. `cache_mechanism==CacheMechanism()` means we save to memory. for bboxes, they're already in-memory so there's no point in saving them again." ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "stream_ds.samples.head(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "stream_ds.iget(0).data\n", "stream_ds.samples.head(3)" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "See how the first sample's `data` column has changed to a local path?" ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "%%timeit\n", "stream_ds.iget(0).data" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "So now, subsequent loads of data will be a fraction of the original download-from-url scenario." ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "## CacheMechanism Roles\n", "The `CacheMechanism` object has two responsibilities:\n", "\n", "1. Use a `CacheMethod` to store the data to a certain location (disk, RAM, etc.) and to return a `LoadMechanism` which can load this data back:\n", "\n", "```python\n", "def store(\n", " self,\n", " element,\n", " data,\n", " as_category: str | None = None,\n", " should_update_elements: bool = False,\n", ") -> LoadMechanism:\n", " ...\n", "```\n", "\n", "2. Update the `ds.elements` table (of which `ds.samples` and `ds.annotations` are derived) when we call `element.data`, with the new LoadMechanism we got from `cache_mechanism.store()` (So the **TableAPI** will align with the new source)\n", "\n", "In fact, every element holds a reference to a CacheMechanism just like it holds a LoadMechanism. Using this knowledge, here is the actual code for `element.data`:\n", "\n", "```python\n", "@property\n", "def data(self) -> Any:\n", " data = self._load_mechanism.load_data()\n", " if self._cache_mechanism:\n", " new_load_mechanism = self._cache_mechanism.store_image(self.id, self.type, data)\n", " self._load_mechanism = new_load_mechanism\n", " return data\n", " return data\n", "\n", "```\n", "\n", "## CacheMechanisms and Transforms\n", "How does this relate back to transforms? Well, when we execute `sample.transform()`, here's what happens:\n", "1. We apply the transform to each element to get new data\n", "2. We _store_ this new data using a CacheMechanism\n", "3. We create a _new sample_ from the old one, but replace the LoadMechanisms for every element with the ones returned from this CacheMechanism.\n", "\n", "By default, `sample.transform()` saves outputs as variables in-memory. However, this doesn't scale for large datasets, so it's better to use something like we've used above, such as saving to path. This way, when we call `ds.transform_samples()`, the method will iterate over all samples, transform them, and save them. All while allowing us to treat this newly created Dataset just like the original one.\n", "\n", "In the following snippet, we will transform samples from COCO. We will limit the Dataset to a few samples because it is remote so most of the time is spent just downloading images:" ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "from torchvision.transforms import v2\n", "\n", "from bridge.display.vision import Panel\n", "from bridge.primitives.sample.transform.vision import TorchvisionV2Transform\n", "\n", "transform = TorchvisionV2Transform(\n", " [\n", " v2.RandomHorizontalFlip(p=1),\n", " ]\n", ")\n", "\n", "flipped_ds = stream_ds.select_samples(lambda samples, anns: samples.index[:20]).transform_samples(\n", " transform=transform,\n", " cache_mechanisms={\n", " \"image\": CacheMechanism(\n", " URIComponents.from_str(str(TMP_NOTEBOOK_ROOT / \"flipped\")),\n", " )\n", " },\n", " display_engine=Panel(bbox_format=\"xywh\"),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "flipped_ds.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "list(Path(TMP_NOTEBOOK_ROOT / \"flipped\").iterdir())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.13" } }, "nbformat": 4, "nbformat_minor": 5 }