Load Mechanism#
Preliminaries#
Imports#
[1]:
from pathlib import Path
import holoviews as hv
import panel as pn
from bridge.providers.vision import Coco2017Detection
hv.extension("bokeh")
pn.extension()
TMP_NOTEBOOK_ROOT = Path("/tmp/bridge-ds/tutorials")
Load Dataset#
[2]:
root_dir = TMP_NOTEBOOK_ROOT / "coco"
provider = Coco2017Detection(root_dir, split="val", img_source="stream")
ds = provider.build_dataset()
ds
Downloading annotations...
Downloading http://images.cocodataset.org/annotations/annotations_trainval2017.zip to /tmp/bridge-ds/tutorials/coco/annotations_trainval2017.zip
Extracting /tmp/bridge-ds/tutorials/coco/annotations_trainval2017.zip to /tmp/bridge-ds/tutorials/coco
loading annotations into memory...
Done (t=0.49s)
creating index...
index created!
[2]:
Dataset: {'n_samples': 5000, 'n_bbox': 36781, 'n_image': 5000}
LoadMechanism#
In this tutorial we will learn about the LoadMechanism, Bridge’s way of loading raw data from different sources.
A quick reminder: to access the raw data within each element, we use the SampleAPI via sample.data / element.data. The data column in the TableAPI usually (but not always) contains a reference to the data rather than the data itself:
[3]:
ds.samples.head(1)
[3]:
| sample_id | element_id | element_type | data | category | license | file_name | coco_url | height | width | date_captured | flickr_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 139 | 139_img | image | http://images.cocodataset.org/val2017/00000000... | image | 2.0 | 000000000139.jpg | http://images.cocodataset.org/val2017/00000000... | 426.0 | 640.0 | 2013-11-21 01:34:01 | http://farm9.staticflickr.com/8035/8024364858_... |
When we want to access data for a given element, we need to call the element.data property. In COCO, we have elements for images and for bboxes. Because COCO is a SingularDataset, every sample has a special element, in this case the image, and we can access its data directly with sample.data.
[4]:
sample = ds.iget(0)
print("img_data:", sample.data.shape, "\n")
bbox_elements = [ann for ann in sample.elements["bbox"]]
print(*[bb.data for bb in bbox_elements], sep="\n")
img_data: (426, 640, 3)
BoundingBox(class_name=64,coords=[236.98 142.51 24.7 69.5 ]
BoundingBox(class_name=72,coords=[ 7.03 167.76 149.32 94.87]
BoundingBox(class_name=72,coords=[557.21 209.19 81.35 78.73]
BoundingBox(class_name=62,coords=[358.98 218.05 56. 102.83]
BoundingBox(class_name=62,coords=[290.69 218. 61.83 98.48]
BoundingBox(class_name=62,coords=[413.2 223.01 30.17 81.36]
BoundingBox(class_name=62,coords=[317.4 219.24 21.58 11.59]
BoundingBox(class_name=1,coords=[412.8 157.61 53.05 138.01]
BoundingBox(class_name=1,coords=[384.43 172.21 15.12 35.74]
BoundingBox(class_name=78,coords=[512.22 205.75 14.74 15.97]
BoundingBox(class_name=82,coords=[493.1 174.34 20.29 108.31]
BoundingBox(class_name=84,coords=[604.77 305.89 14.34 45.71]
BoundingBox(class_name=84,coords=[613.24 308.24 12.88 46.44]
BoundingBox(class_name=85,coords=[447.77 121.12 13.97 21.88]
BoundingBox(class_name=86,coords=[549.06 309.43 36.68 89.67]
BoundingBox(class_name=86,coords=[350.76 208.84 11.37 22.55]
BoundingBox(class_name=62,coords=[412.25 219.02 9.63 12.52]
BoundingBox(class_name=86,coords=[241.24 194.99 14.22 17.63]
BoundingBox(class_name=86,coords=[336.79 199.5 9.73 16.73]
BoundingBox(class_name=67,coords=[321.21 231.22 125.56 88.93]
Every element holds a LoadMechanism, an object responsible for loading data from different sources. In this case, for images, element.data will perform an HTTP request and load the image in the response. For bboxes, which already exist in-memory (note that we can see them directly in the annotations table), element.data will simply load the stored object.
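The lazy behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Bridge's implementation: ToyElement, fake_http_fetch, passthrough and the example.com URL are all hypothetical names introduced here, and the "HTTP request" is simulated so that the snippet runs without network access.

```python
from typing import Any, Callable, List

# Records every simulated fetch, so we can see when loading actually happens.
fetch_log: List[str] = []


def fake_http_fetch(url: str) -> bytes:
    """Hypothetical stand-in for an HTTP request; no network access is performed."""
    fetch_log.append(url)
    return b"<image bytes for " + url.encode() + b">"


def passthrough(obj: Any) -> Any:
    """Loader for objects that already live in memory: return them unchanged."""
    return obj


class ToyElement:
    """Toy element (not a Bridge class) that loads its data only on access."""

    def __init__(self, source: Any, loader: Callable[[Any], Any]) -> None:
        self._source = source  # a URL, or the object itself
        self._loader = loader  # how to turn the source into data

    @property
    def data(self) -> Any:
        # Loading is deferred until this property is read.
        return self._loader(self._source)


img = ToyElement("http://example.com/cat.jpg", fake_http_fetch)
bbox = ToyElement([236.98, 142.51, 24.7, 69.5], passthrough)

print("fetches after construction:", fetch_log)
_ = img.data  # the first access triggers the simulated fetch
print("fetches after img.data:", fetch_log)
print("bbox data:", bbox.data)
```

Constructing the elements performs no fetch; only reading img.data does. This toy reloads on every access, and whether Bridge caches loaded data is not something this sketch assumes either way.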
The LoadMechanism is defined by two variables:
[5]:
img_element = sample.element
print("Image element, loaded over HTTP:")
print("url_or_data:", img_element._load_mechanism.url_or_data)
print("category:", img_element._load_mechanism.category)
print()
print("Bbox elements, loaded from memory:")
print("url_or_data:", bbox_elements[0]._load_mechanism.url_or_data)
print("category:", bbox_elements[0]._load_mechanism.category)
Image element, loaded over HTTP:
url_or_data: http://images.cocodataset.org/val2017/000000000139.jpg
category: image
Bbox elements, loaded from memory:
url_or_data: BoundingBox(class_name=64,coords=[236.98 142.51 24.7 69.5 ]
category: obj
url_or_data, as its name suggests, holds either a URL that references the object (URL in the broad sense, including local paths, S3 paths, etc.) or the object itself, when we want to store it directly in memory.
category is a string that selects the logic used to load the object: for example, whether to read an image with PIL or a text file with a plain with open(). To find out which categories are supported, use list_registered_categories.
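To make the role of category concrete, here is a minimal sketch of a category registry that dispatches to a loader function. It is an assumption-laden illustration rather than Bridge's actual code: register_category, toy_load and toy_list_registered_categories are hypothetical helpers, and only two toy categories ("obj" and "text") are registered.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry: category string -> loader function.
_LOADERS: Dict[str, Callable[[Any], Any]] = {}


def register_category(category: str) -> Callable:
    """Decorator that registers a loader function under a category name."""
    def decorator(fn: Callable[[Any], Any]) -> Callable[[Any], Any]:
        _LOADERS[category] = fn
        return fn
    return decorator


@register_category("obj")
def _load_obj(url_or_data: Any) -> Any:
    # The object is already in memory, so it is returned unchanged.
    return url_or_data


@register_category("text")
def _load_text(url_or_data: Any) -> str:
    # Here url_or_data is treated as a local path to a text file.
    with open(url_or_data, encoding="utf-8") as f:
        return f.read()


def toy_list_registered_categories() -> List[str]:
    """Toy counterpart of a 'list the supported categories' helper."""
    return sorted(_LOADERS)


def toy_load(url_or_data: Any, category: str) -> Any:
    """Pick the loader by category, then apply it to url_or_data."""
    if category not in _LOADERS:
        raise ValueError(f"Unknown category: {category!r}")
    return _LOADERS[category](url_or_data)


print(toy_list_registered_categories())
print(toy_load({"label": "cat"}, "obj"))
```

An image category would follow the same pattern, with a loader that opens the URL or path using an imaging library instead of reading text.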
In summary#
Bridge loads data lazily, only when element.data is called.
The loading mechanism accepts url_or_data, which defines where to load from (or what to load), and category, which defines how to load it.