Cache Mechanism#

Preliminaries#

Imports#

[1]:

from pathlib import Path

import holoviews as hv
import hvplot.pandas  # noqa
import panel as pn

hv.extension("bokeh")
pn.extension()
# If in google colab, run hack that allows holoviews to work properly
try:
    import google.colab  # noqa

    def _render(self, **kwargs):
        hv.extension("bokeh")
        return hv.Store.render(self)

    hv.core.Dimensioned._repr_mimebundle_ = _render
except ModuleNotFoundError:
    pass

TMP_NOTEBOOK_ROOT = Path("/tmp/bridge-ds/tutorials")


CacheMechanisms#

Motivation#

Consider the following Dataset:

[2]:

from bridge.providers.vision import Coco2017Detection

root_dir = TMP_NOTEBOOK_ROOT / "coco"

provider = Coco2017Detection(root_dir, split="val", img_source="stream")
stream_ds = provider.build_dataset()
stream_ds

Annotations file /tmp/bridge-ds/tutorials/coco/annotations/instances_val2017.json already exists, skipping download.
loading annotations into memory...
Done (t=0.53s)
creating index...
index created!

[2]:

Dataset: {'n_samples': 5000, 'n_bbox': 36781, 'n_image': 5000}

[3]:

stream_ds.samples.head(3)

[3]:

		element_type	data	category	license	file_name	coco_url	height	width	date_captured	flickr_url
sample_id	element_id
139	139_img	image	http://images.cocodataset.org/val2017/00000000...	image	2.0	000000000139.jpg	http://images.cocodataset.org/val2017/00000000...	426.0	640.0	2013-11-21 01:34:01	http://farm9.staticflickr.com/8035/8024364858_...
285	285_img	image	http://images.cocodataset.org/val2017/00000000...	image	4.0	000000000285.jpg	http://images.cocodataset.org/val2017/00000000...	640.0	586.0	2013-11-18 13:09:47	http://farm8.staticflickr.com/7434/9138147604_...
632	632_img	image	http://images.cocodataset.org/val2017/00000000...	image	3.0	000000000632.jpg	http://images.cocodataset.org/val2017/00000000...	483.0	640.0	2013-11-20 21:14:01	http://farm2.staticflickr.com/1241/1243324748_...

This Dataset's samples have URL sources, so every sample.data call has to request the image over the network, which takes a long time:

[4]:

%%timeit
stream_ds.iget(0).data

95.7 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

One way to speed this up is to use a CacheMechanism: an object that, the first time image_element.data is called, stores the data in a different location (e.g. a local file or memory). This is transparent to the user and makes subsequent .data calls significantly faster.
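The idea can be sketched without the library. The toy class below is hypothetical: ToyElement and slow_fetch are illustration names, not part of Bridge. It only shows the behaviour described above, where the first .data call pays the slow cost and later calls are served from the cache.

```python
from typing import Callable, Optional


class ToyElement:
    """Hypothetical element that caches its data in memory after the first load."""

    def __init__(self, loader: Callable[[], bytes]) -> None:
        self._loader = loader  # slow source, e.g. an HTTP request
        self._cached: Optional[bytes] = None

    @property
    def data(self) -> bytes:
        if self._cached is None:
            # First access: pay the slow cost once and remember the result.
            self._cached = self._loader()
        return self._cached


fetch_count = 0


def slow_fetch() -> bytes:
    """Stand-in for a remote download; counts how often it is invoked."""
    global fetch_count
    fetch_count += 1
    return b"fake image bytes"


element = ToyElement(slow_fetch)
first = element.data   # triggers slow_fetch
second = element.data  # served from the in-memory cache
```

The caller uses .data the same way in both cases, which is what "transparent to the user" means here.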

In our scenario, we can assign a cache mechanism to each etype. The Dataset has two etypes:

'bbox' - already stored in memory, so there is no need to cache it again
'image' - we want to cache these in the filesystem.

[5]:

from bridge.primitives.element.data.cache_mechanism import CacheMechanism
from bridge.primitives.element.data.uri_components import URIComponents

root_dir = TMP_NOTEBOOK_ROOT / "coco"

provider = Coco2017Detection(root_dir)
stream_ds = provider.build_dataset(
    cache_mechanisms={
        "image": CacheMechanism(
            root_uri=URIComponents.from_str(str(TMP_NOTEBOOK_ROOT / "my_local_cache")),
        ),
        "bbox": None,
    },
)
stream_ds

Annotations file /tmp/bridge-ds/tutorials/coco/annotations/instances_train2017.json already exists, skipping download.
loading annotations into memory...
Done (t=13.56s)
creating index...
index created!

[5]:

Dataset: {'n_samples': 118287, 'n_bbox': 860001, 'n_image': 118287}

NOTE: cache_mechanism == None means we don’t cache anything and keep the original LoadMechanism. cache_mechanism == CacheMechanism() means we save to memory. Bboxes are already in memory, so there’s no point in saving them again.

[6]:

stream_ds.samples.head(3)

[6]:

		element_type	data	category	license	file_name	coco_url	height	width	date_captured	flickr_url
sample_id	element_id
9	9_img	image	http://images.cocodataset.org/train2017/000000...	image	3.0	000000000009.jpg	http://images.cocodataset.org/train2017/000000...	480.0	640.0	2013-11-19 20:40:11	http://farm5.staticflickr.com/4026/4622125393_...
25	25_img	image	http://images.cocodataset.org/train2017/000000...	image	1.0	000000000025.jpg	http://images.cocodataset.org/train2017/000000...	426.0	640.0	2013-11-16 14:11:30	http://farm1.staticflickr.com/94/241612385_d9e...
30	30_img	image	http://images.cocodataset.org/train2017/000000...	image	4.0	000000000030.jpg	http://images.cocodataset.org/train2017/000000...	428.0	640.0	2013-11-24 03:32:32	http://farm4.staticflickr.com/3377/3573516590_...

[7]:

stream_ds.iget(0).data
stream_ds.samples.head(3)

[7]:

		element_type	data	category	license	file_name	coco_url	height	width	date_captured	flickr_url
sample_id	element_id
9	9_img	image	/tmp/bridge-ds/tutorials/my_local_cache/9_img.jpg	image	3.0	000000000009.jpg	http://images.cocodataset.org/train2017/000000...	480.0	640.0	2013-11-19 20:40:11	http://farm5.staticflickr.com/4026/4622125393_...
25	25_img	image	http://images.cocodataset.org/train2017/000000...	image	1.0	000000000025.jpg	http://images.cocodataset.org/train2017/000000...	426.0	640.0	2013-11-16 14:11:30	http://farm1.staticflickr.com/94/241612385_d9e...
30	30_img	image	http://images.cocodataset.org/train2017/000000...	image	4.0	000000000030.jpg	http://images.cocodataset.org/train2017/000000...	428.0	640.0	2013-11-24 03:32:32	http://farm4.staticflickr.com/3377/3573516590_...

See how the first sample’s data column has changed to a local path?

[8]:

%%timeit
stream_ds.iget(0).data

62.6 ms ± 588 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So now, subsequent loads of the data take about two-thirds of the time measured in the original download-from-URL scenario (62.6 ms versus 95.7 ms).
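As a quick check of the comparison, the snippet below computes the speedup from the two mean timings reported by the %%timeit cells above. It uses only those two numbers; the variable names are illustrative.

```python
# Mean timings reported by the two %%timeit cells above, in milliseconds.
uncached_ms = 95.7
cached_ms = 62.6

speedup = uncached_ms / cached_ms
time_saved_fraction = (uncached_ms - cached_ms) / uncached_ms

print(f"speedup: {speedup:.2f}x")                          # speedup: 1.53x
print(f"time saved per call: {time_saved_fraction:.0%}")   # time saved per call: 35%
```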

CacheMechanism Roles#

The CacheMechanism object has two responsibilities:

Use a CacheMethod to store the data to a certain location (disk, RAM, etc.) and to return a LoadMechanism which can load this data back:

def store(
    self,
    element,
    data,
    as_category: str | None = None,
    should_update_elements: bool = False,
) -> LoadMechanism:
    ...

Update the ds.elements table (from which ds.samples and ds.annotations are derived) with the new LoadMechanism returned by cache_mechanism.store() whenever element.data is called, so that the TableAPI stays aligned with the new source.
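The two responsibilities can be illustrated with a self-contained toy. Everything below is hypothetical: ToyFileLoad, ToyFileCache, the plain dict standing in for the elements table, and the example URL are assumptions made for the sketch, not Bridge's implementation.

```python
import tempfile
from pathlib import Path
from typing import Dict


class ToyFileLoad:
    """Hypothetical load mechanism that reads bytes back from a local file."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load_data(self) -> bytes:
        return self.path.read_bytes()


class ToyFileCache:
    """Hypothetical cache mechanism that writes data under a root directory."""

    def __init__(self, root: Path, elements_table: Dict[str, str]) -> None:
        self.root = root
        self.elements_table = elements_table  # element id -> current data source
        self.root.mkdir(parents=True, exist_ok=True)

    def store(self, element_id: str, data: bytes) -> ToyFileLoad:
        # Responsibility 1: persist the data and return a loader for it.
        path = self.root / f"{element_id}.bin"
        path.write_bytes(data)
        # Responsibility 2: point the table at the new source.
        self.elements_table[element_id] = str(path)
        return ToyFileLoad(path)


table = {"9_img": "http://example.invalid/9.jpg"}  # made-up remote source
with tempfile.TemporaryDirectory() as tmp:
    cache = ToyFileCache(Path(tmp) / "my_local_cache", table)
    loader = cache.store("9_img", b"fake image bytes")
    round_trip = loader.load_data()
    table_points_to_cache = table["9_img"] == str(loader.path)
```

After store() returns, the table entry points at the cached file, which mirrors how the data column switched from a URL to a local path in the cells above.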

In fact, every element holds a reference to a CacheMechanism just like it holds a LoadMechanism. Using this knowledge, here is, in essence, the code for element.data:

@property
def data(self) -> Any:
    # Load from the current source (for example, a remote URL).
    data = self._load_mechanism.load_data()
    if self._cache_mechanism:
        # Store the data and switch to the LoadMechanism that reads the cached copy.
        self._load_mechanism = self._cache_mechanism.store_image(self.id, self.type, data)
    return data

CacheMechanisms and Transforms#

How does this relate back to transforms? Well, when we execute sample.transform(), here’s what happens:

We apply the transform to each element to get new data
We store this new data using a CacheMechanism
We create a new sample from the old one, but replace the LoadMechanisms for every element with the ones returned from this CacheMechanism.
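The three steps above can be sketched with toy classes. The names ToyMemoryLoad, ToyMemoryCache, ToySample and horizontal_flip are hypothetical and only mimic the flow; images are modelled as nested lists so the example runs with the standard library alone.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List

Image = List[List[int]]  # toy image: rows of pixel values


class ToyMemoryLoad:
    """Hypothetical load mechanism that simply holds the data in memory."""

    def __init__(self, data: Image) -> None:
        self._data = data

    def load_data(self) -> Image:
        return self._data


class ToyMemoryCache:
    """Hypothetical cache mechanism that stores data in memory."""

    def store(self, data: Image) -> ToyMemoryLoad:
        return ToyMemoryLoad(data)


@dataclass(frozen=True)
class ToySample:
    loaders: Dict[str, ToyMemoryLoad]  # element id -> load mechanism

    def transform(self, fn: Callable[[Image], Image], cache: ToyMemoryCache) -> "ToySample":
        new_loaders = {}
        for element_id, loader in self.loaders.items():
            new_data = fn(loader.load_data())                # step 1: transform
            new_loaders[element_id] = cache.store(new_data)  # step 2: store
        return replace(self, loaders=new_loaders)            # step 3: new sample


def horizontal_flip(image: Image) -> Image:
    """Reverse every row, i.e. mirror the image left to right."""
    return [list(reversed(row)) for row in image]


original = ToySample({"9_img": ToyMemoryLoad([[1, 2, 3], [4, 5, 6]])})
flipped = original.transform(horizontal_flip, ToyMemoryCache())
```

The original sample is left untouched; the new sample differs only in the load mechanisms it holds.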

By default, sample.transform() keeps its outputs in memory. However, this doesn’t scale to large datasets, so it’s better to do what we did above and save to a path on disk. This way, when we call ds.transform_samples(), the method iterates over all samples, transforms them, and saves them, while still letting us treat the newly created Dataset just like the original one.

In the following snippet, we transform samples from COCO. We limit the Dataset to a few samples because the images are remote, so most of the time is spent just downloading them:

[9]:

from torchvision.transforms import v2

from bridge.display.vision import Panel
from bridge.primitives.sample.transform.vision import TorchvisionV2Transform

transform = TorchvisionV2Transform(
    [
        v2.RandomHorizontalFlip(p=1),
    ]
)

flipped_ds = stream_ds.select_samples(lambda samples, anns: samples.index[:20]).transform_samples(
    transform=transform,
    cache_mechanisms={
        "image": CacheMechanism(
            URIComponents.from_str(str(TMP_NOTEBOOK_ROOT / "flipped")),
        )
    },
    display_engine=Panel(bbox_format="xywh"),
)

[10]:

flipped_ds.show()

[10]:

[11]:

list(Path(TMP_NOTEBOOK_ROOT / "flipped").iterdir())

[11]:

[PosixPath('/tmp/bridge-ds/tutorials/flipped/30_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/49_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/36_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/34_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/81_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/78_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/86_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/72_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/25_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/42_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/9_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/89_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/92_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/64_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/73_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/71_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/61_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/74_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/94_img.jpg'),
 PosixPath('/tmp/bridge-ds/tutorials/flipped/77_img.jpg')]
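Path.iterdir() yields entries in arbitrary order, which is why the listing above is not sorted. A small sketch for listing cached files by sample id follows; it assumes the `<sample_id>_img.jpg` naming seen in the output, and the helper sample_id is a hypothetical name introduced here.

```python
import tempfile
from pathlib import Path


def sample_id(path: Path) -> int:
    """Parse the leading integer from names such as '30_img.jpg'."""
    return int(path.name.split("_", 1)[0])


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create a few empty files that follow the naming pattern.
    for name in ("30_img.jpg", "9_img.jpg", "25_img.jpg"):
        (root / name).touch()
    # Sort numerically by sample id rather than lexicographically by name.
    ordered_names = [p.name for p in sorted(root.iterdir(), key=sample_id)]
```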