Cache Mechanism#
Preliminaries#
Imports#
[1]:
from pathlib import Path
import holoviews as hv
import hvplot.pandas # noqa
import panel as pn
hv.extension("bokeh")
pn.extension()
# If in google colab, run hack that allows holoviews to work properly
try:
    import google.colab  # noqa
    def _render(self, **kwargs):
        hv.extension("bokeh")
        return hv.Store.render(self)
    hv.core.Dimensioned._repr_mimebundle_ = _render
except ModuleNotFoundError:
    pass
TMP_NOTEBOOK_ROOT = Path("/tmp/bridge-ds/tutorials")
CacheMechanisms#
Motivation#
Consider the following Dataset:
[2]:
from bridge.providers.vision import Coco2017Detection
root_dir = TMP_NOTEBOOK_ROOT / "coco"
provider = Coco2017Detection(root_dir, split="val", img_source="stream")
stream_ds = provider.build_dataset()
stream_ds
Annotations file /tmp/bridge-ds/tutorials/coco/annotations/instances_val2017.json already exists, skipping download.
loading annotations into memory...
Done (t=0.53s)
creating index...
index created!
[2]:
Dataset: {'n_samples': 5000, 'n_bbox': 36781, 'n_image': 5000}
[3]:
stream_ds.samples.head(3)
[3]:
| sample_id | element_id | element_type | data | category | license | file_name | coco_url | height | width | date_captured | flickr_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 139 | 139_img | image | http://images.cocodataset.org/val2017/00000000... | image | 2.0 | 000000000139.jpg | http://images.cocodataset.org/val2017/00000000... | 426.0 | 640.0 | 2013-11-21 01:34:01 | http://farm9.staticflickr.com/8035/8024364858_... |
| 285 | 285_img | image | http://images.cocodataset.org/val2017/00000000... | image | 4.0 | 000000000285.jpg | http://images.cocodataset.org/val2017/00000000... | 640.0 | 586.0 | 2013-11-18 13:09:47 | http://farm8.staticflickr.com/7434/9138147604_... |
| 632 | 632_img | image | http://images.cocodataset.org/val2017/00000000... | image | 3.0 | 000000000632.jpg | http://images.cocodataset.org/val2017/00000000... | 483.0 | 640.0 | 2013-11-20 21:14:01 | http://farm2.staticflickr.com/1241/1243324748_... |
This Dataset has samples with URL sources, which means we need to request them on each sample.data call, which takes a long time:
[4]:
%%timeit
stream_ds.iget(0).data
95.7 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One way to speed this up is to use a CacheMechanism: an object that, the first time image_element.data is called, stores the data in a different location (e.g. a local file or in memory). This is transparent to the user and makes subsequent .data calls significantly faster.
In our scenario, we can assign a cache mechanism for every etype. The Dataset has two etypes:
'bbox': already stored in memory, so there is no need to re-cache them
'image': we want to cache these in the filesystem
[5]:
from bridge.primitives.element.data.cache_mechanism import CacheMechanism
from bridge.primitives.element.data.uri_components import URIComponents
root_dir = TMP_NOTEBOOK_ROOT / "coco"
provider = Coco2017Detection(root_dir)
stream_ds = provider.build_dataset(
    cache_mechanisms={
        "image": CacheMechanism(
            root_uri=URIComponents.from_str(str(TMP_NOTEBOOK_ROOT / "my_local_cache")),
        ),
        "bbox": None,
    },
)
stream_ds
Annotations file /tmp/bridge-ds/tutorials/coco/annotations/instances_train2017.json already exists, skipping download.
loading annotations into memory...
Done (t=13.56s)
creating index...
index created!
[5]:
Dataset: {'n_samples': 118287, 'n_bbox': 860001, 'n_image': 118287}
NOTE: cache_mechanism == None means we don’t cache anything and keep the original LoadMechanism. cache_mechanism == CacheMechanism() means we save to memory. The bboxes are already in memory, so there’s no point in saving them again.
[6]:
stream_ds.samples.head(3)
[6]:
| sample_id | element_id | element_type | data | category | license | file_name | coco_url | height | width | date_captured | flickr_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 9_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000000009.jpg | http://images.cocodataset.org/train2017/000000... | 480.0 | 640.0 | 2013-11-19 20:40:11 | http://farm5.staticflickr.com/4026/4622125393_... |
| 25 | 25_img | image | http://images.cocodataset.org/train2017/000000... | image | 1.0 | 000000000025.jpg | http://images.cocodataset.org/train2017/000000... | 426.0 | 640.0 | 2013-11-16 14:11:30 | http://farm1.staticflickr.com/94/241612385_d9e... |
| 30 | 30_img | image | http://images.cocodataset.org/train2017/000000... | image | 4.0 | 000000000030.jpg | http://images.cocodataset.org/train2017/000000... | 428.0 | 640.0 | 2013-11-24 03:32:32 | http://farm4.staticflickr.com/3377/3573516590_... |
[7]:
stream_ds.iget(0).data
stream_ds.samples.head(3)
[7]:
| sample_id | element_id | element_type | data | category | license | file_name | coco_url | height | width | date_captured | flickr_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 9_img | image | /tmp/bridge-ds/tutorials/my_local_cache/9_img.jpg | image | 3.0 | 000000000009.jpg | http://images.cocodataset.org/train2017/000000... | 480.0 | 640.0 | 2013-11-19 20:40:11 | http://farm5.staticflickr.com/4026/4622125393_... |
| 25 | 25_img | image | http://images.cocodataset.org/train2017/000000... | image | 1.0 | 000000000025.jpg | http://images.cocodataset.org/train2017/000000... | 426.0 | 640.0 | 2013-11-16 14:11:30 | http://farm1.staticflickr.com/94/241612385_d9e... |
| 30 | 30_img | image | http://images.cocodataset.org/train2017/000000... | image | 4.0 | 000000000030.jpg | http://images.cocodataset.org/train2017/000000... | 428.0 | 640.0 | 2013-11-24 03:32:32 | http://farm4.staticflickr.com/3377/3573516590_... |
See how the first sample’s data column has changed to a local path?
[8]:
%%timeit
stream_ds.iget(0).data
62.6 ms ± 588 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So now, subsequent loads of the data read from the local cache and take less time than the original download-from-URL scenario (about 63 ms versus about 96 ms per call in the timings above).
CacheMechanism Roles#
The CacheMechanism object has two responsibilities:
Use a CacheMethod to store the data to a certain location (disk, RAM, etc.) and to return a LoadMechanism which can load this data back:
def store(
    self,
    element,
    data,
    as_category: str | None = None,
    should_update_elements: bool = False,
) -> LoadMechanism:
    ...
Update the ds.elements table (from which ds.samples and ds.annotations are derived) when we call element.data, with the new LoadMechanism we got from cache_mechanism.store(), so that the TableAPI aligns with the new source.
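The two responsibilities can be illustrated with a small, self-contained sketch. It does not use bridge's classes: `ToyFileLoader`, `ToyFileCache`, the `.bin` file naming, the made-up URL, and the plain dict standing in for the `ds.elements` table are all hypothetical, and the real CacheMechanism may work differently.

```python
import tempfile
from pathlib import Path


class ToyFileLoader:
    """Hypothetical stand-in for a LoadMechanism that reads bytes from a local file."""

    def __init__(self, path: Path):
        self.path = path

    def load_data(self) -> bytes:
        return self.path.read_bytes()


class ToyFileCache:
    """Hypothetical stand-in for a CacheMechanism that caches to a directory."""

    def __init__(self, root: Path, elements_table: dict):
        self.root = root
        self.elements_table = elements_table  # toy replacement for ds.elements

    def store(self, element_id: str, data: bytes) -> ToyFileLoader:
        # Responsibility 1: write the data somewhere and return a loader for it.
        self.root.mkdir(parents=True, exist_ok=True)
        path = self.root / f"{element_id}.bin"
        path.write_bytes(data)
        # Responsibility 2: point the table's "data" entry at the new source.
        self.elements_table[element_id]["data"] = str(path)
        return ToyFileLoader(path)


# Made-up URL, used only as a placeholder for a remote source.
elements = {"9_img": {"data": "http://example.com/9.jpg"}}
cache = ToyFileCache(Path(tempfile.mkdtemp()) / "my_local_cache", elements)
loader = cache.store("9_img", b"fake image bytes")

print(elements["9_img"]["data"])  # now a local path instead of the URL
print(loader.load_data())
```

After `store()` runs, the table entry and the returned loader both refer to the same local file, which is the alignment the second responsibility describes.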
In fact, every element holds a reference to a CacheMechanism just like it holds a LoadMechanism. Using this knowledge, here is the actual code for element.data:
@property
def data(self) -> Any:
    data = self._load_mechanism.load_data()
    if self._cache_mechanism:
        new_load_mechanism = self._cache_mechanism.store_image(self.id, self.type, data)
        self._load_mechanism = new_load_mechanism
        return data
    return data
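To see the load-mechanism swap in isolation, here is a minimal toy version of the same pattern that runs without bridge. All class names (`CountingRemoteLoader`, `InMemoryLoader`, `InMemoryCache`, `ToyElement`) are hypothetical, and the cache simply keeps the object in RAM. It is a sketch of the idea, not the library's implementation.

```python
class CountingRemoteLoader:
    """Hypothetical loader that pretends to download and counts its calls."""

    def __init__(self, payload):
        self.payload = payload
        self.calls = 0

    def load_data(self):
        self.calls += 1
        return self.payload


class InMemoryLoader:
    """Hypothetical loader that returns an object already held in RAM."""

    def __init__(self, data):
        self._data = data

    def load_data(self):
        return self._data


class InMemoryCache:
    """Hypothetical cache: 'storing' just wraps the data in an InMemoryLoader."""

    def store(self, element_id, data):
        return InMemoryLoader(data)


class ToyElement:
    def __init__(self, element_id, load_mechanism, cache_mechanism=None):
        self.id = element_id
        self._load_mechanism = load_mechanism
        self._cache_mechanism = cache_mechanism

    @property
    def data(self):
        data = self._load_mechanism.load_data()
        if self._cache_mechanism:
            # Swap in the loader returned by the cache for later calls.
            self._load_mechanism = self._cache_mechanism.store(self.id, data)
        return data


remote = CountingRemoteLoader(payload=[1, 2, 3])
element = ToyElement("9_img", remote, InMemoryCache())

first = element.data   # goes through the "remote" loader
second = element.data  # served by the in-memory loader
print(remote.calls)    # 1
```

The "remote" loader is hit only once because the first `.data` call replaces it with the loader returned by the cache.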
CacheMechanisms and Transforms#
How does this relate back to transforms? Well, when we execute sample.transform(), here’s what happens:
We apply the transform to each element to get new data
We store this new data using a CacheMechanism
We create a new sample from the old one, but replace the LoadMechanisms for every element with the ones returned from this CacheMechanism.
By default, sample.transform() keeps its outputs in memory. However, this doesn’t scale to large datasets, so it’s better to use a CacheMechanism like the one above, which saves to a path on disk. This way, when we call ds.transform_samples(), the method iterates over all samples, transforms them, and saves them, all while allowing us to treat the newly created Dataset just like the original one.
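The three steps listed above can be sketched with plain Python objects. This is an illustration under stated assumptions, not bridge's code: a sample is modeled as a dict mapping element ids to loaders, and `ListLoader`, `DictCache`, `flip_horizontally`, and `transform_sample` are hypothetical names.

```python
def flip_horizontally(image):
    """Toy transform: reverse every row of a nested-list 'image'."""
    return [list(reversed(row)) for row in image]


class ListLoader:
    """Hypothetical loader that returns data kept in memory."""

    def __init__(self, data):
        self._data = data

    def load_data(self):
        return self._data


class DictCache:
    """Hypothetical cache that keeps transformed data in a dict keyed by element id."""

    def __init__(self):
        self.storage = {}

    def store(self, element_id, data):
        self.storage[element_id] = data
        return ListLoader(data)


def transform_sample(sample, transform, cache):
    """Return a new sample; the input sample is left untouched."""
    new_sample = {}
    for element_id, loader in sample.items():
        new_data = transform(loader.load_data())        # step 1: transform
        new_loader = cache.store(element_id, new_data)  # step 2: store
        new_sample[element_id] = new_loader             # step 3: new sample, new loaders
    return new_sample


sample = {"9_img": ListLoader([[1, 2, 3], [4, 5, 6]])}
flipped = transform_sample(sample, flip_horizontally, DictCache())

print(flipped["9_img"].load_data())  # [[3, 2, 1], [6, 5, 4]]
print(sample["9_img"].load_data())   # [[1, 2, 3], [4, 5, 6]]
```

Swapping `DictCache` for a cache that writes to disk would change where the transformed data lives without changing `transform_sample` itself, which is the point of routing the output through a cache object.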
In the following snippet, we will transform samples from COCO. We limit the Dataset to a few samples because the images are remote, so most of the time is spent just downloading them:
[9]:
from torchvision.transforms import v2
from bridge.display.vision import Panel
from bridge.primitives.sample.transform.vision import TorchvisionV2Transform
transform = TorchvisionV2Transform(
    [
        v2.RandomHorizontalFlip(p=1),
    ]
)
flipped_ds = stream_ds.select_samples(lambda samples, anns: samples.index[:20]).transform_samples(
    transform=transform,
    cache_mechanisms={
        "image": CacheMechanism(
            URIComponents.from_str(str(TMP_NOTEBOOK_ROOT / "flipped")),
        )
    },
    display_engine=Panel(bbox_format="xywh"),
)
[10]:
flipped_ds.show()
[10]:
[11]:
list(Path(TMP_NOTEBOOK_ROOT / "flipped").iterdir())
[11]:
[PosixPath('/tmp/bridge-ds/tutorials/flipped/30_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/49_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/36_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/34_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/81_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/78_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/86_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/72_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/25_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/42_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/9_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/89_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/92_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/64_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/73_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/71_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/61_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/74_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/94_img.jpg'),
PosixPath('/tmp/bridge-ds/tutorials/flipped/77_img.jpg')]