Load Mechanism#

Preliminaries#

Imports#

[1]:

from pathlib import Path

import holoviews as hv
import panel as pn

from bridge.providers.vision import Coco2017Detection

hv.extension("bokeh")
pn.extension()

TMP_NOTEBOOK_ROOT = Path("/tmp/bridge-ds/tutorials")



Load Dataset#

[2]:

root_dir = TMP_NOTEBOOK_ROOT / "coco"

provider = Coco2017Detection(root_dir, split="val", img_source="stream")
ds = provider.build_dataset()
ds

Downloading annotations...
Downloading http://images.cocodataset.org/annotations/annotations_trainval2017.zip to /tmp/bridge-ds/tutorials/coco/annotations_trainval2017.zip
Extracting /tmp/bridge-ds/tutorials/coco/annotations_trainval2017.zip to /tmp/bridge-ds/tutorials/coco
loading annotations into memory...
Done (t=0.49s)
creating index...
index created!

[2]:

Dataset: {'n_samples': 5000, 'n_bbox': 36781, 'n_image': 5000}

LoadMechanism#

In this tutorial we will learn about the LoadMechanism, Bridge’s way of loading raw data from different sources.

A quick reminder: to access the raw data of an element, use the SampleAPI through sample.data or element.data. In the TableAPI, the data column usually (but not always) holds a reference to the data rather than the data itself:

[3]:

ds.samples.head(1)

[3]:

		element_type	data	category	license	file_name	coco_url	height	width	date_captured	flickr_url
sample_id	element_id
139	139_img	image	http://images.cocodataset.org/val2017/00000000...	image	2.0	000000000139.jpg	http://images.cocodataset.org/val2017/00000000...	426.0	640.0	2013-11-21 01:34:01	http://farm9.staticflickr.com/8035/8024364858_...

To access the data of a given element, read its element.data property. COCO has elements for images and for bboxes. Because COCO is a SingularDataset, every sample has one special element, in this case the image, whose data we can access directly with sample.data.

[4]:

sample = ds.iget(0)
print("img_data:", sample.data.shape, "\n")

bbox_elements = list(sample.elements["bbox"])
print(*[bb.data for bb in bbox_elements], sep="\n")

img_data: (426, 640, 3)

BoundingBox(class_name=64,coords=[236.98 142.51  24.7   69.5 ]
BoundingBox(class_name=72,coords=[  7.03 167.76 149.32  94.87]
BoundingBox(class_name=72,coords=[557.21 209.19  81.35  78.73]
BoundingBox(class_name=62,coords=[358.98 218.05  56.   102.83]
BoundingBox(class_name=62,coords=[290.69 218.    61.83  98.48]
BoundingBox(class_name=62,coords=[413.2  223.01  30.17  81.36]
BoundingBox(class_name=62,coords=[317.4  219.24  21.58  11.59]
BoundingBox(class_name=1,coords=[412.8  157.61  53.05 138.01]
BoundingBox(class_name=1,coords=[384.43 172.21  15.12  35.74]
BoundingBox(class_name=78,coords=[512.22 205.75  14.74  15.97]
BoundingBox(class_name=82,coords=[493.1  174.34  20.29 108.31]
BoundingBox(class_name=84,coords=[604.77 305.89  14.34  45.71]
BoundingBox(class_name=84,coords=[613.24 308.24  12.88  46.44]
BoundingBox(class_name=85,coords=[447.77 121.12  13.97  21.88]
BoundingBox(class_name=86,coords=[549.06 309.43  36.68  89.67]
BoundingBox(class_name=86,coords=[350.76 208.84  11.37  22.55]
BoundingBox(class_name=62,coords=[412.25 219.02   9.63  12.52]
BoundingBox(class_name=86,coords=[241.24 194.99  14.22  17.63]
BoundingBox(class_name=86,coords=[336.79 199.5    9.73  16.73]
BoundingBox(class_name=67,coords=[321.21 231.22 125.56  88.93]

Every element holds a LoadMechanism, an object responsible for loading that element's data from its source. Here, for images, element.data performs an HTTP request and loads the image from the response. For bboxes, which already exist in memory (note that we can see them directly in the annotations table), element.data simply returns the stored object.

The LoadMechanism is defined by two attributes, url_or_data and category:

[5]:

img_element = sample.element
print("Image element, loaded over HTTP:")
print("url_or_data:", img_element._load_mechanism.url_or_data)
print("category:", img_element._load_mechanism.category)
print()
print("Bbox elements, loaded from memory:")
print("url_or_data:", bbox_elements[0]._load_mechanism.url_or_data)
print("category:", bbox_elements[0]._load_mechanism.category)

Image element, loaded over HTTP:
url_or_data: http://images.cocodataset.org/val2017/000000000139.jpg
category: image

Bbox elements, loaded from memory:
url_or_data: BoundingBox(class_name=64,coords=[236.98 142.51  24.7   69.5 ]
category: obj

url_or_data - as its name suggests, holds either a URL that references the object (URL in the broad sense, including local paths, S3 paths, etc.) or the object itself, when we want to store it directly in memory.
category - a string that determines which logic is used to load the object. Should an image be loaded with PIL, or a text file with a plain with open()? This value decides. To find out which categories are supported, use list_registered_categories.
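To make the role of category concrete, here is a minimal, self-contained sketch of a category-based loader registry. It is not Bridge's implementation: register_category, list_categories and load are hypothetical names invented for this illustration, and only an obj and a text category are modeled.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List

# Hypothetical registry mapping a category name to a loader function.
# This is an illustration only, not how Bridge stores its categories.
_LOADERS: Dict[str, Callable[[Any], Any]] = {}


def register_category(name: str) -> Callable:
    """Decorator that registers a loader function under a category name."""
    def decorator(fn: Callable[[Any], Any]) -> Callable[[Any], Any]:
        _LOADERS[name] = fn
        return fn
    return decorator


@register_category("obj")
def _load_obj(url_or_data: Any) -> Any:
    # The object is already in memory, so it is returned unchanged.
    return url_or_data


@register_category("text")
def _load_text(url_or_data: Any) -> str:
    # Here url_or_data is treated as a local path to a UTF-8 text file.
    return Path(url_or_data).read_text(encoding="utf-8")


def list_categories() -> List[str]:
    """Return the registered category names in sorted order."""
    return sorted(_LOADERS)


def load(url_or_data: Any, category: str) -> Any:
    """Pick the loader for `category` and apply it to `url_or_data`."""
    try:
        loader = _LOADERS[category]
    except KeyError:
        raise ValueError(f"Unknown category: {category!r}") from None
    return loader(url_or_data)
```

In this sketch, load(some_object, "obj") hands back the stored object, while load("notes.txt", "text") reads the file at that path. The same url_or_data value can therefore mean different things depending on category, which is why both are needed.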

In summary#

Bridge loads data lazily, only when element.data is accessed.
A LoadMechanism takes url_or_data, which defines where to load from (or what to load), and category, which defines how to load it.
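
The lazy behavior summarized above can be illustrated with a small, self-contained sketch. SketchLoadMechanism, SketchElement and fake_fetch are hypothetical stand-ins, not Bridge classes, and the URL is a placeholder. The sketch reloads on every access; that is an assumption of this example and says nothing about whether Bridge caches loaded data.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class SketchLoadMechanism:
    """Hypothetical stand-in: stores where/what to load and how to load it."""
    url_or_data: Any
    category: str
    loader: Callable[[Any], Any]

    def load(self) -> Any:
        return self.loader(self.url_or_data)


class SketchElement:
    """Hypothetical element whose data is loaded only on attribute access."""

    def __init__(self, load_mechanism: SketchLoadMechanism) -> None:
        self._load_mechanism = load_mechanism

    @property
    def data(self) -> Any:
        # No loading happens until this property is read.
        return self._load_mechanism.load()


fetch_log: List[str] = []


def fake_fetch(url: str) -> str:
    # Stands in for an HTTP request; records each call instead of using the network.
    fetch_log.append(url)
    return f"<contents of {url}>"


element = SketchElement(
    SketchLoadMechanism("http://example.com/img.jpg", "image", fake_fetch)
)
print(len(fetch_log))  # 0: constructing the element fetched nothing
payload = element.data
print(len(fetch_log))  # 1: reading .data triggered the load
```

Building a dataset of thousands of such elements costs no requests up front; the cost is paid per element, at the moment its data is read.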