Table API#

Preliminaries#

Imports#

[1]:

from pathlib import Path

import holoviews as hv
import panel as pn

hv.extension("bokeh")
pn.extension()
# If in google colab, run hack that allows holoviews to work properly
try:
    import google.colab  # noqa

    def _render(self, **kwargs):
        hv.extension("bokeh")
        return hv.Store.render(self)

    hv.core.Dimensioned._repr_mimebundle_ = _render
except ModuleNotFoundError:
    pass

TMP_NOTEBOOK_ROOT = Path("/tmp/bridge-ds/tutorials")

%opts magic unavailable (pyparsing cannot be imported)
%compositor magic unavailable (pyparsing cannot be imported)

Loading a dataset#

The recommended way to create BridgeDS Dataset objects is through a DatasetProvider. Here we use the Coco2017Detection provider:

[2]:

from bridge.providers.vision import Coco2017Detection

root_dir = TMP_NOTEBOOK_ROOT / "coco"

provider = Coco2017Detection(root_dir, split="train", img_source="stream")
ds = provider.build_dataset()
ds

Annotations file /tmp/bridge-ds/tutorials/coco/annotations/instances_train2017.json already exists, skipping download.
loading annotations into memory...
Done (t=12.30s)
creating index...
index created!

[2]:

Dataset: {'n_samples': 118287, 'n_bbox': 860001, 'n_image': 118287}

Table API#

BridgeDS offers two complementary ways to view datasets: the Sample API and the Table API. This tutorial covers the latter.

The Table API can be described as:

A dataset can be viewed as a table where every row represents a single element. Elements have a unique element_id but share their sample_id with other elements from the same Sample. The element_id and sample_id columns serve as the table's multi-index.
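
To make the description above concrete, here is a minimal pandas-only sketch of such a table. It does not use BridgeDS itself; the toy rows and the element_id "25_0000001" are made up for illustration, and only the (sample_id, element_id) multi-index layout is taken from the text:

```python
import pandas as pd

# Toy rows: two samples, each with one image element and some bbox elements.
rows = [
    (9, "9_img", "image"),
    (9, "9_1038967", "bbox"),
    (9, "9_1039564", "bbox"),
    (25, "25_img", "image"),
    (25, "25_0000001", "bbox"),  # hypothetical element_id, for illustration only
]
table = pd.DataFrame(rows, columns=["sample_id", "element_id", "element_type"])
table = table.set_index(["sample_id", "element_id"])

# element_id is unique across the whole table; sample_id is shared by all
# elements that belong to the same sample.
assert table.index.get_level_values("element_id").is_unique

# Select every element of sample 9 through the sample_id index level.
elements_of_sample_9 = table.xs(9, level="sample_id")
print(elements_of_sample_9)

# Splitting rows by element_type gives one table of images and one of bboxes.
samples = table[table.element_type == "image"]
annotations = table[table.element_type == "bbox"]
```

Because both levels live in the index, rows can be grouped or looked up by sample_id without an extra column.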

Tables#

As in the previous tutorial, we semantically split the elements into two groups: ds.samples, which contains images, and ds.annotations, which contains bboxes:

[3]:

ds.samples.head()

[3]:

		element_type	data	category	license	file_name	coco_url	height	width	date_captured	flickr_url
sample_id	element_id
9	9_img	image	http://images.cocodataset.org/train2017/000000...	image	3.0	000000000009.jpg	http://images.cocodataset.org/train2017/000000...	480.0	640.0	2013-11-19 20:40:11	http://farm5.staticflickr.com/4026/4622125393_...
25	25_img	image	http://images.cocodataset.org/train2017/000000...	image	1.0	000000000025.jpg	http://images.cocodataset.org/train2017/000000...	426.0	640.0	2013-11-16 14:11:30	http://farm1.staticflickr.com/94/241612385_d9e...
30	30_img	image	http://images.cocodataset.org/train2017/000000...	image	4.0	000000000030.jpg	http://images.cocodataset.org/train2017/000000...	428.0	640.0	2013-11-24 03:32:32	http://farm4.staticflickr.com/3377/3573516590_...
34	34_img	image	http://images.cocodataset.org/train2017/000000...	image	6.0	000000000034.jpg	http://images.cocodataset.org/train2017/000000...	425.0	640.0	2013-11-18 16:32:48	http://farm5.staticflickr.com/4024/4599442031_...
36	36_img	image	http://images.cocodataset.org/train2017/000000...	image	3.0	000000000036.jpg	http://images.cocodataset.org/train2017/000000...	640.0	481.0	2013-11-18 06:56:10	http://farm8.staticflickr.com/7216/7200825264_...

[4]:

ds.annotations.head()

[4]:

		element_type	data	category	category_id	area	iscrowd
sample_id	element_id
9	9_1038967	bbox	BoundingBox(class_name=51,coords=[ 1.08 187.6...	obj	51.0	120057.13925	0.0
	9_1039564	bbox	BoundingBox(class_name=51,coords=[311.73 4.3...	obj	51.0	44434.75110	0.0
	9_1058555	bbox	BoundingBox(class_name=56,coords=[249.6 229.2...	obj	56.0	49577.94435	0.0
	9_1534147	bbox	BoundingBox(class_name=51,coords=[ 0. 13.5...	obj	51.0	24292.78170	0.0
	9_1913551	bbox	BoundingBox(class_name=55,coords=[376.2 40.3...	obj	55.0	2239.29240	0.0

Methods#

The Table API is built around Pandas DataFrames because of their simple and familiar API. The following sections showcase methods that perform different actions on Datasets. Where a method takes a callable, that callable receives the two DataFrames (samples, annotations) as arguments.

Filter#

Tables let us filter images or bboxes with familiar Pandas syntax. select_samples and select_annotations keep the rows for which the callable returns True. Note that when filtering samples, BridgeDS automatically filters out the corresponding annotations:

[5]:

print("Original dataset:")
print(ds, "\n")
print("Keep only images (and corresponding bboxes) where license < 3:")
print(ds.select_samples(lambda samples, anns: samples.license < 3), "\n")
print("Filter out all bboxes with iscrowd == 0. This leaves us with some empty images:")
print(ds.select_annotations(lambda samples, anns: anns.iscrowd != 0), "\n")
print("We can pipe both selectors to filter out the bboxes, and subsequently filter out empty images:")
print(
    ds.select_annotations(lambda samples, anns: anns.iscrowd != 0).select_samples(
        lambda samples, anns: samples.index.get_level_values("sample_id").isin(anns.index.get_level_values("sample_id"))
    )
)

Original dataset:
Dataset: {'n_samples': 118287, 'n_bbox': 860001, 'n_image': 118287}

Keep only images (and corresponding bboxes) where license < 3:
Dataset: {'n_samples': 49275, 'n_bbox': 363821, 'n_image': 49275}

Filter out all bboxes with iscrowd == 0. This leaves us with some empty images:
Dataset: {'n_samples': 118287, 'n_bbox': 10052, 'n_image': 118287}

We can pipe both selectors to filter out the bboxes, and subsequently filter out empty images:
Dataset: {'n_samples': 9115, 'n_bbox': 10052, 'n_image': 9115}
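
The behaviour seen in the output above can be mimicked on plain DataFrames. The sketch below is not the BridgeDS implementation: make_table and the two select_* functions are hypothetical stand-ins written only to illustrate the semantics, and the toy data is invented.

```python
import pandas as pd


def make_table(rows, columns):
    """Build a toy table indexed by (sample_id, element_id)."""
    frame = pd.DataFrame(rows, columns=["sample_id", "element_id", *columns])
    return frame.set_index(["sample_id", "element_id"])


samples = make_table([(9, "9_img", 3.0), (25, "25_img", 1.0), (30, "30_img", 4.0)], ["license"])
anns = make_table([(9, "9_a", 0.0), (9, "9_b", 1.0), (25, "25_a", 0.0)], ["iscrowd"])


def select_samples(samples, anns, predicate):
    """Hypothetical stand-in: keep samples where predicate is True, and only their annotations."""
    kept = samples[predicate(samples, anns)]
    kept_ids = kept.index.get_level_values("sample_id")
    return kept, anns[anns.index.get_level_values("sample_id").isin(kept_ids)]


def select_annotations(samples, anns, predicate):
    """Hypothetical stand-in: keep annotations where predicate is True; samples are untouched."""
    return samples, anns[predicate(samples, anns)]


# Keeping samples also drops the annotations of the removed samples.
s1, a1 = select_samples(samples, anns, lambda s, a: s.license < 3)

# Keeping only crowd annotations leaves some samples without any annotation.
s2, a2 = select_annotations(samples, anns, lambda s, a: a.iscrowd != 0)

# A second pass keeps only the samples that still have at least one annotation.
s3, a3 = select_samples(
    s2, a2,
    lambda s, a: s.index.get_level_values("sample_id").isin(a.index.get_level_values("sample_id")),
)
print(len(s1), len(a1), len(s2), len(a2), len(s3), len(a3))
```

The asymmetry is the point of the sketch: filtering samples cascades to annotations, while filtering annotations leaves the samples table unchanged until a second selection removes the empty ones.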

Assign#

We can assign new columns to either ds.samples or ds.annotations using familiar Pandas syntax. Let’s add an n_bboxes column that holds the number of bboxes in each sample:

[6]:

ds = ds.assign_samples(
    n_bboxes=lambda samples, anns: anns.groupby("sample_id")
    .size()
    .reindex(samples.index.get_level_values("sample_id"), fill_value=0)
    .values
)
ds.samples.head()

[6]:

		element_type	data	category	license	file_name	coco_url	height	width	date_captured	flickr_url	n_bboxes
sample_id	element_id
9	9_img	image	http://images.cocodataset.org/train2017/000000...	image	3.0	000000000009.jpg	http://images.cocodataset.org/train2017/000000...	480.0	640.0	2013-11-19 20:40:11	http://farm5.staticflickr.com/4026/4622125393_...	8.0
25	25_img	image	http://images.cocodataset.org/train2017/000000...	image	1.0	000000000025.jpg	http://images.cocodataset.org/train2017/000000...	426.0	640.0	2013-11-16 14:11:30	http://farm1.staticflickr.com/94/241612385_d9e...	2.0
30	30_img	image	http://images.cocodataset.org/train2017/000000...	image	4.0	000000000030.jpg	http://images.cocodataset.org/train2017/000000...	428.0	640.0	2013-11-24 03:32:32	http://farm4.staticflickr.com/3377/3573516590_...	2.0
34	34_img	image	http://images.cocodataset.org/train2017/000000...	image	6.0	000000000034.jpg	http://images.cocodataset.org/train2017/000000...	425.0	640.0	2013-11-18 16:32:48	http://farm5.staticflickr.com/4024/4599442031_...	1.0
36	36_img	image	http://images.cocodataset.org/train2017/000000...	image	3.0	000000000036.jpg	http://images.cocodataset.org/train2017/000000...	640.0	481.0	2013-11-18 06:56:10	http://farm8.staticflickr.com/7216/7200825264_...	2.0
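
The groupby / reindex chain used in the cell above is easier to follow on a tiny example. This is a pandas-only sketch with invented data; it only reproduces the counting logic, not assign_samples itself:

```python
import pandas as pd

samples = pd.DataFrame(
    {"sample_id": [9, 25, 30], "element_id": ["9_img", "25_img", "30_img"], "license": [3.0, 1.0, 4.0]}
).set_index(["sample_id", "element_id"])

# Sample 9 has two bboxes, sample 30 has one, sample 25 has none.
anns = pd.DataFrame(
    {"sample_id": [9, 9, 30], "element_id": ["9_a", "9_b", "30_a"], "iscrowd": [0.0, 0.0, 0.0]}
).set_index(["sample_id", "element_id"])

# Step 1: count annotations per sample. Samples without annotations are absent here.
counts = anns.groupby("sample_id").size()

# Step 2: align the counts with the row order of the samples table,
# filling in 0 for samples that have no annotations.
aligned = counts.reindex(samples.index.get_level_values("sample_id"), fill_value=0)

# Step 3: .values drops the index, so the counts are assigned by position.
samples = samples.assign(n_bboxes=aligned.values)
print(samples)
```

Without the reindex step, samples that have no annotations would be missing from the counts, and the result would not line up row by row with the samples table.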

Sorting#

We can sort the tables using familiar Pandas syntax:

[7]:

sorted_ds = ds.sort_samples("n_bboxes", ascending=False)
sorted_ds.samples.head()

[7]:

		element_type	data	category	license	file_name	coco_url	height	width	date_captured	flickr_url	n_bboxes
sample_id	element_id
157105	157105_img	image	http://images.cocodataset.org/train2017/000000...	image	2.0	000000157105.jpg	http://images.cocodataset.org/train2017/000000...	427.0	640.0	2013-11-25 14:07:08	http://farm8.staticflickr.com/7407/10068104546...	93.0
171270	171270_img	image	http://images.cocodataset.org/train2017/000000...	image	5.0	000000171270.jpg	http://images.cocodataset.org/train2017/000000...	640.0	640.0	2013-11-25 07:42:38	http://farm4.staticflickr.com/3793/10067342296...	80.0
122263	122263_img	image	http://images.cocodataset.org/train2017/000000...	image	3.0	000000122263.jpg	http://images.cocodataset.org/train2017/000000...	426.0	640.0	2013-11-25 07:42:36	http://farm8.staticflickr.com/7350/10283604383...	78.0
579329	579329_img	image	http://images.cocodataset.org/train2017/000000...	image	3.0	000000579329.jpg	http://images.cocodataset.org/train2017/000000...	480.0	640.0	2013-11-25 08:21:41	http://farm8.staticflickr.com/7288/9458101590_...	75.0
307238	307238_img	image	http://images.cocodataset.org/train2017/000000...	image	4.0	000000307238.jpg	http://images.cocodataset.org/train2017/000000...	427.0	640.0	2013-11-25 15:08:50	http://farm8.staticflickr.com/7325/9060312405_...	74.0
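
As a rough pandas analogy (assuming, without confirmation from the source, that sort_samples behaves like DataFrame.sort_values on the samples table), sorting changes which row comes first by position while every row keeps its (sample_id, element_id) label. The data below is invented for illustration:

```python
import pandas as pd

samples = pd.DataFrame(
    {
        "sample_id": [9, 25, 30],
        "element_id": ["9_img", "25_img", "30_img"],
        "n_bboxes": [8, 2, 93],
    }
).set_index(["sample_id", "element_id"])

sorted_samples = samples.sort_values("n_bboxes", ascending=False)

# Position 0 now points at a different row...
first_before = samples.index[0]
first_after = sorted_samples.index[0]
print(first_before, first_after)

# ...but the set of labels is unchanged; only their order differs.
same_labels = set(samples.index) == set(sorted_samples.index)
```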

Note that sorting the samples table also changes the positional index used by the Sample API (ds.iget), which dictates the order in which samples are displayed. The next cell shows the dataset ordered from the most bboxes per image to the fewest:

[8]:

sorted_ds.show()

[8]: