Cache Mechanism#

Preliminaries#

Imports#

[1]:
from pathlib import Path

import holoviews as hv
import hvplot.pandas  # noqa
import panel as pn

hv.extension("bokeh")
pn.extension()
# If in google colab, run hack that allows holoviews to work properly
try:
    import google.colab  # noqa

    def _render(self, **kwargs):
        hv.extension("bokeh")
        return hv.Store.render(self)

    hv.core.Dimensioned._repr_mimebundle_ = _render
except ModuleNotFoundError:
    pass

TMP_NOTEBOOK_ROOT = Path("/tmp/bridge-ds/tutorials")

CacheMechanisms#

Motivation#

Consider the following Dataset:

[2]:
from bridge.providers.vision import Coco2017Detection

root_dir = TMP_NOTEBOOK_ROOT / "coco"

provider = Coco2017Detection(root_dir, split="val", img_source="stream")
stream_ds = provider.build_dataset()
stream_ds
Annotations file /tmp/bridge-ds/tutorials/coco/annotations/instances_val2017.json already exists, skipping download.
loading annotations into memory...
Done (t=0.53s)
creating index...
index created!
[2]:
Dataset: {'n_samples': 5000, 'n_bbox': 36781, 'n_image': 5000}
[3]:
stream_ds.samples.head(3)
[3]:
sample_id  element_id  element_type  data                                               category  license  file_name         coco_url                                           height  width  date_captured        flickr_url
139        139_img     image         http://images.cocodataset.org/val2017/00000000...  image     2.0      000000000139.jpg  http://images.cocodataset.org/val2017/00000000...  426.0   640.0  2013-11-21 01:34:01  http://farm9.staticflickr.com/8035/8024364858_...
285        285_img     image         http://images.cocodataset.org/val2017/00000000...  image     4.0      000000000285.jpg  http://images.cocodataset.org/val2017/00000000...  640.0   586.0  2013-11-18 13:09:47  http://farm8.staticflickr.com/7434/9138147604_...
632        632_img     image         http://images.cocodataset.org/val2017/00000000...  image     3.0      000000000632.jpg  http://images.cocodataset.org/val2017/00000000...  483.0   640.0  2013-11-20 21:14:01  http://farm2.staticflickr.com/1241/1243324748_...

This Dataset has samples with URL sources, which means every sample.data call has to request the image again, and that takes a long time:

[4]:
%%timeit
stream_ds.iget(0).data
95.7 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

One way to speed this up is to use a CacheMechanism: an object that, the first time image_element.data is called, stores the data in a different location (e.g. a local file or in memory). This is transparent to the user and makes subsequent .data calls significantly faster.
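
To make the idea concrete, here is a minimal, self-contained sketch of caching on first access. It is not bridge code: SlowSource and FileCachedItem are hypothetical names invented for this illustration, and the real CacheMechanism may be organized differently.

```python
import tempfile
from pathlib import Path


class SlowSource:
    """Hypothetical stand-in for a remote image URL; counts how often it is fetched."""

    def __init__(self, payload: bytes):
        self.payload = payload
        self.fetch_count = 0

    def load(self) -> bytes:
        self.fetch_count += 1  # a real source would issue an HTTP request here
        return self.payload


class FileCachedItem:
    """Hypothetical item that loads from the slow source once, then from a local file."""

    def __init__(self, source: SlowSource, cache_path: Path):
        self._source = source
        self._cache_path = cache_path

    @property
    def data(self) -> bytes:
        if self._cache_path.exists():
            return self._cache_path.read_bytes()  # fast path: local copy
        payload = self._source.load()  # slow path: first access only
        self._cache_path.parent.mkdir(parents=True, exist_ok=True)
        self._cache_path.write_bytes(payload)
        return payload


cache_dir = Path(tempfile.mkdtemp())
source = SlowSource(b"fake image bytes")
item = FileCachedItem(source, cache_dir / "0_img.bin")

first = item.data   # goes to the slow source and writes the cache file
second = item.data  # served from the cache file
print(source.fetch_count)  # the slow source was hit only once
```

The caller only ever touches item.data; where the bytes come from is an internal detail, which is the sense in which caching is transparent to the user.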

In our scenario, we can assign a cache mechanism for every etype. The Dataset has two etypes:

  1. 'bbox' - already stored in-memory, no need to re-cache them

  2. 'image' - we want to cache them in the filesystem.

[5]:
from bridge.primitives.element.data.cache_mechanism import CacheMechanism
from bridge.primitives.element.data.uri_components import URIComponents

root_dir = TMP_NOTEBOOK_ROOT / "coco"

provider = Coco2017Detection(root_dir)
stream_ds = provider.build_dataset(
    cache_mechanisms={
        "image": CacheMechanism(
            root_uri=URIComponents.from_str(str(TMP_NOTEBOOK_ROOT / "my_local_cache")),
        ),
        "bbox": None,
    },
)
stream_ds
Annotations file /tmp/bridge-ds/tutorials/coco/annotations/instances_train2017.json already exists, skipping download.
loading annotations into memory...
Done (t=13.56s)
creating index...
index created!
[5]:
Dataset: {'n_samples': 118287, 'n_bbox': 860001, 'n_image': 118287}

NOTE: cache_mechanism == None means we don’t cache anything and keep the original LoadMechanism. cache_mechanism == CacheMechanism() means we save to memory. The bboxes are already in memory, so there’s no point in saving them again.

[6]:
stream_ds.samples.head(3)
[6]:
sample_id  element_id  element_type  data                                               category  license  file_name         coco_url                                           height  width  date_captured        flickr_url
9          9_img       image         http://images.cocodataset.org/train2017/000000...  image     3.0      000000000009.jpg  http://images.cocodataset.org/train2017/000000...  480.0   640.0  2013-11-19 20:40:11  http://farm5.staticflickr.com/4026/4622125393_...
25         25_img      image         http://images.cocodataset.org/train2017/000000...  image     1.0      000000000025.jpg  http://images.cocodataset.org/train2017/000000...  426.0   640.0  2013-11-16 14:11:30  http://farm1.staticflickr.com/94/241612385_d9e...
30         30_img      image         http://images.cocodataset.org/train2017/000000...  image     4.0      000000000030.jpg  http://images.cocodataset.org/train2017/000000...  428.0   640.0  2013-11-24 03:32:32  http://farm4.staticflickr.com/3377/3573516590_...
[7]:
stream_ds.iget(0).data
stream_ds.samples.head(3)
[7]:
sample_id  element_id  element_type  data                                               category  license  file_name         coco_url                                           height  width  date_captured        flickr_url
9          9_img       image         /tmp/bridge-ds/tutorials/my_local_cache/9_img.jpg  image     3.0      000000000009.jpg  http://images.cocodataset.org/train2017/000000...  480.0   640.0  2013-11-19 20:40:11  http://farm5.staticflickr.com/4026/4622125393_...
25         25_img      image         http://images.cocodataset.org/train2017/000000...  image     1.0      000000000025.jpg  http://images.cocodataset.org/train2017/000000...  426.0   640.0  2013-11-16 14:11:30  http://farm1.staticflickr.com/94/241612385_d9e...
30         30_img      image         http://images.cocodataset.org/train2017/000000...  image     4.0      000000000030.jpg  http://images.cocodataset.org/train2017/000000...  428.0   640.0  2013-11-24 03:32:32  http://farm4.staticflickr.com/3377/3573516590_...

See how the first sample’s data column has changed to a local path?

[8]:
%%timeit
stream_ds.iget(0).data
62.6 ms ± 588 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So now, subsequent loads of data take a fraction of the time of the original download-from-url scenario (62.6 ms versus 95.7 ms per call in the runs above).

CacheMechanism Roles#

The CacheMechanism object has two responsibilities:

  1. Use a CacheMethod to store the data to a certain location (disk, RAM, etc.) and to return a LoadMechanism which can load this data back:

def store(
    self,
    element,
    data,
    as_category: str | None = None,
    should_update_elements: bool = False,
) -> LoadMechanism:
    ...
  2. Update the ds.elements table (from which ds.samples and ds.annotations are derived) when we call element.data, using the new LoadMechanism we got from cache_mechanism.store(), so that the TableAPI stays aligned with the new source.

In fact, every element holds a reference to a CacheMechanism just like it holds a LoadMechanism. Using this knowledge, here is, in essence, what element.data does:

@property
def data(self) -> Any:
    data = self._load_mechanism.load_data()
    if self._cache_mechanism:
        # Store the freshly loaded data and switch to the LoadMechanism that
        # reads the cached copy. The call follows the store() signature above.
        self._load_mechanism = self._cache_mechanism.store(self, data, should_update_elements=True)
    return data
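
The two responsibilities can be tried out in isolation with a toy model. The sketch below is an illustration only: ToyLoadMechanism, ToyCacheMechanism and ToyElement are hypothetical classes, a plain dict plays the role of the ds.elements table, and the "memory://" source string is made up. None of this is the bridge implementation.

```python
from typing import Any, Callable, Dict, Optional


class ToyLoadMechanism:
    """Hypothetical: wraps a zero-argument loader and records where it loads from."""

    def __init__(self, loader: Callable[[], Any], source: str):
        self._loader = loader
        self.source = source

    def load_data(self) -> Any:
        return self._loader()


class ToyCacheMechanism:
    """Hypothetical in-memory cache that also keeps a source table up to date."""

    def __init__(self, table: Dict[str, str]):
        self._memory: Dict[str, Any] = {}
        self._table = table  # plays the role of ds.elements

    def store(self, element: "ToyElement", data: Any) -> ToyLoadMechanism:
        element_id = element.id
        self._memory[element_id] = data  # responsibility 1: store the data
        source = f"memory://{element_id}"  # made-up source label
        self._table[element_id] = source  # responsibility 2: update the table
        return ToyLoadMechanism(lambda: self._memory[element_id], source)


class ToyElement:
    def __init__(
        self,
        element_id: str,
        load_mechanism: ToyLoadMechanism,
        cache_mechanism: Optional[ToyCacheMechanism] = None,
    ):
        self.id = element_id
        self._load_mechanism = load_mechanism
        self._cache_mechanism = cache_mechanism

    @property
    def data(self) -> Any:
        data = self._load_mechanism.load_data()
        if self._cache_mechanism:
            # Swap in the load mechanism that reads the cached copy.
            self._load_mechanism = self._cache_mechanism.store(self, data)
        return data


fetches = []


def slow_loader() -> str:
    fetches.append("fetched")  # record every trip to the "remote" source
    return "pixels"


table = {"9_img": "http://example.com/9.jpg"}
element = ToyElement(
    "9_img",
    ToyLoadMechanism(slow_loader, table["9_img"]),
    ToyCacheMechanism(table),
)

first = element.data   # loads through slow_loader, then caches
second = element.data  # loads from the in-memory copy
print(len(fetches), table["9_img"])
```

After the first access the slow loader is never called again, and the table entry points at the cached copy, mirroring how the data column changed to a local path earlier in this notebook.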

CacheMechanisms and Transforms#

How does this relate back to transforms? Well, when we execute sample.transform(), here’s what happens:

  1. We apply the transform to each element to get new data

  2. We store this new data using a CacheMechanism

  3. We create a new sample from the old one, but replace the LoadMechanisms for every element with the ones returned from this CacheMechanism.
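
The three steps above can be sketched with plain Python objects. This is a toy model under stated assumptions: ToyElement, ToySample, store_in_memory and transform_sample are hypothetical names, a callable stands in for a LoadMechanism, and a dict stands in for the cache.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class ToyElement:
    id: str
    load: Callable[[], Any]  # plays the role of a LoadMechanism


@dataclass(frozen=True)
class ToySample:
    elements: Tuple[ToyElement, ...]


def store_in_memory(memory: Dict[str, Any], element_id: str, data: Any) -> Callable[[], Any]:
    """Store data and return a loader that reads it back (toy cache)."""
    memory[element_id] = data
    return lambda: memory[element_id]


def transform_sample(
    sample: ToySample,
    transform: Callable[[Any], Any],
    memory: Dict[str, Any],
) -> ToySample:
    new_elements = []
    for element in sample.elements:
        new_data = transform(element.load())  # step 1: apply the transform
        new_load = store_in_memory(memory, element.id, new_data)  # step 2: store
        new_elements.append(replace(element, load=new_load))  # step 3: swap loader
    return ToySample(tuple(new_elements))


original = ToySample((ToyElement("9_img", lambda: [1, 2, 3]),))
memory: Dict[str, Any] = {}

# Reversing a list stands in for a horizontal flip.
flipped = transform_sample(original, lambda row: row[::-1], memory)
print(flipped.elements[0].load(), original.elements[0].load())
```

The original sample is left untouched; the new sample differs only in where its elements load their data from.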

By default, sample.transform() keeps its outputs in memory. However, this doesn’t scale to large datasets, so it’s better to do what we did above and save to a path. This way, when we call ds.transform_samples(), the method iterates over all samples, transforms them, and saves them, all while allowing us to treat the newly created Dataset just like the original one.

In the following snippet, we will transform samples from COCO. We limit the Dataset to a few samples because it is remote, so most of the time is spent just downloading images:

[9]:
from torchvision.transforms import v2

from bridge.display.vision import Panel
from bridge.primitives.sample.transform.vision import TorchvisionV2Transform

transform = TorchvisionV2Transform(
    [
        v2.RandomHorizontalFlip(p=1),
    ]
)

flipped_ds = stream_ds.select_samples(lambda samples, anns: samples.index[:20]).transform_samples(
    transform=transform,
    cache_mechanisms={
        "image": CacheMechanism(
            URIComponents.from_str(str(TMP_NOTEBOOK_ROOT / "flipped")),
        )
    },
    display_engine=Panel(bbox_format="xywh"),
)
[10]:
flipped_ds.show()
[10]:
[11]:
list(Path(TMP_NOTEBOOK_ROOT / "flipped").iterdir())
[11]:
[PosixPath('/tmp/bridge-ds/tutorials/flipped/30_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/49_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/36_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/34_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/81_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/78_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/86_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/72_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/25_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/42_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/9_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/89_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/92_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/64_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/73_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/71_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/61_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/74_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/94_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/77_img.jpg')]
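
A quick way to check that every selected sample was written is to collect the element ids from the cached file names. The helper below is a hypothetical convenience function, not part of bridge; it assumes, as in the listing above, that each cached file is named after its element id. The demo runs on a throwaway directory so it works without the COCO download.

```python
import tempfile
from pathlib import Path
from typing import List


def cached_element_ids(cache_dir: Path, suffix: str = ".jpg") -> List[str]:
    """Return the element ids (file stems) of all cached files, sorted by name."""
    return sorted(p.stem for p in cache_dir.iterdir() if p.suffix == suffix)


# Demo directory with two fake cache files and one unrelated file.
demo_dir = Path(tempfile.mkdtemp())
for name in ("9_img.jpg", "25_img.jpg", "notes.txt"):
    (demo_dir / name).write_bytes(b"")

ids = cached_element_ids(demo_dir)
print(ids)
```

Pointed at TMP_NOTEBOOK_ROOT / "flipped" after the cells above have run, the same call should return 20 ids, one per selected sample.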