Table API#

Preliminaries#

Imports#

[1]:
from pathlib import Path

import holoviews as hv
import panel as pn

hv.extension("bokeh")
pn.extension()
# In Google Colab, patch the HoloViews repr hook so that plots render properly
try:
    import google.colab  # noqa

    def _render(self, **kwargs):
        hv.extension("bokeh")
        return hv.Store.render(self)

    hv.core.Dimensioned._repr_mimebundle_ = _render
except ModuleNotFoundError:
    pass

TMP_NOTEBOOK_ROOT = Path("/tmp/bridge-ds/tutorials")
%opts magic unavailable (pyparsing cannot be imported)
%compositor magic unavailable (pyparsing cannot be imported)

Loading a dataset#

The recommended way to create a BridgeDS Dataset object is through a DatasetProvider. Here we use the Coco2017Detection provider:

[2]:
from bridge.providers.vision import Coco2017Detection

root_dir = TMP_NOTEBOOK_ROOT / "coco"

provider = Coco2017Detection(root_dir, split="train", img_source="stream")
ds = provider.build_dataset()
ds
Annotations file /tmp/bridge-ds/tutorials/coco/annotations/instances_train2017.json already exists, skipping download.
loading annotations into memory...
Done (t=13.08s)
creating index...
index created!
[2]:
Dataset: {'n_samples': 118287, 'n_bbox': 860001, 'n_image': 118287}

Table API#

In BridgeDS, we use two complementary approaches to view datasets. We call them the Sample API and the Table API. This tutorial is about the latter.

The Table API can be described as:

A dataset can be viewed as a table where every row represents a single element. Elements have a unique element_id but share their sample_id with other elements from the same Sample. The element_id and sample_id columns serve as the table's multi-index.
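The description above can be illustrated with plain pandas. The following is a minimal sketch on toy data, not BridgeDS code: the element ids and the single `element_type` column are made up for illustration, and only the two index level names come from the text.

```python
import pandas as pd

# Toy rows (hypothetical ids): two samples, each with one image and some bboxes.
rows = [
    (9, "9_img", "image"),
    (9, "9_a", "bbox"),
    (9, "9_b", "bbox"),
    (25, "25_img", "image"),
    (25, "25_a", "bbox"),
]
table = pd.DataFrame(rows, columns=["sample_id", "element_id", "element_type"])

# sample_id and element_id together form the table's multi-index.
table = table.set_index(["sample_id", "element_id"]).sort_index()

# All elements of one sample share the first index level.
sample_9 = table.xs(9, level="sample_id")

# A single element is addressed by the full (sample_id, element_id) pair.
one_element = table.loc[(9, "9_img")]

print(sample_9)
print(one_element["element_type"])
```

Selecting by `sample_id` returns every element of that sample, while the full pair identifies exactly one row.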

Tables#

As in the previous tutorial, we split the elements into two semantic groups: ds.samples, which contains the images, and ds.annotations, which contains the bboxes:

[3]:
ds.samples.head()
[3]:
| sample_id | element_id | element_type | data | category | license | file_name | coco_url | height | width | date_captured | flickr_url |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9 | 9_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000000009.jpg | http://images.cocodataset.org/train2017/000000... | 480.0 | 640.0 | 2013-11-19 20:40:11 | http://farm5.staticflickr.com/4026/4622125393_... |
| 25 | 25_img | image | http://images.cocodataset.org/train2017/000000... | image | 1.0 | 000000000025.jpg | http://images.cocodataset.org/train2017/000000... | 426.0 | 640.0 | 2013-11-16 14:11:30 | http://farm1.staticflickr.com/94/241612385_d9e... |
| 30 | 30_img | image | http://images.cocodataset.org/train2017/000000... | image | 4.0 | 000000000030.jpg | http://images.cocodataset.org/train2017/000000... | 428.0 | 640.0 | 2013-11-24 03:32:32 | http://farm4.staticflickr.com/3377/3573516590_... |
| 34 | 34_img | image | http://images.cocodataset.org/train2017/000000... | image | 6.0 | 000000000034.jpg | http://images.cocodataset.org/train2017/000000... | 425.0 | 640.0 | 2013-11-18 16:32:48 | http://farm5.staticflickr.com/4024/4599442031_... |
| 36 | 36_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000000036.jpg | http://images.cocodataset.org/train2017/000000... | 640.0 | 481.0 | 2013-11-18 06:56:10 | http://farm8.staticflickr.com/7216/7200825264_... |
[4]:
ds.annotations.head()
[4]:
| sample_id | element_id | element_type | data | category | category_id | area | iscrowd |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 9 | 9_1038967 | bbox | BoundingBox(class_name=51,coords=[ 1.08 187.6... | obj | 51.0 | 120057.13925 | 0.0 |
| 9 | 9_1039564 | bbox | BoundingBox(class_name=51,coords=[311.73 4.3... | obj | 51.0 | 44434.75110 | 0.0 |
| 9 | 9_1058555 | bbox | BoundingBox(class_name=56,coords=[249.6 229.2... | obj | 56.0 | 49577.94435 | 0.0 |
| 9 | 9_1534147 | bbox | BoundingBox(class_name=51,coords=[ 0. 13.5... | obj | 51.0 | 24292.78170 | 0.0 |
| 9 | 9_1913551 | bbox | BoundingBox(class_name=55,coords=[376.2 40.3... | obj | 55.0 | 2239.29240 | 0.0 |

Methods#

The Table API exposes methods that take callables operating on Pandas DataFrames, chosen for their simple and familiar API. The following sections showcase methods that perform different actions on Datasets. Each callable receives the two tables as its arguments, (samples, annotations).
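Because the callables are ordinary functions of two DataFrames, they can be named, parameterized, and tested without a Dataset. The sketch below assumes the `(samples, annotations)` calling convention used by the lambdas in this tutorial; `license_below` is a hypothetical helper and the toy tables are made up.

```python
import pandas as pd


def license_below(threshold):
    """Build a predicate with the (samples, annotations) signature."""

    def predicate(samples, anns):
        # Return a boolean mask aligned with the samples table.
        return samples["license"] < threshold

    return predicate


# Toy stand-ins for ds.samples and ds.annotations (hypothetical values).
samples = pd.DataFrame(
    {"license": [3.0, 1.0, 4.0]},
    index=pd.Index([9, 25, 30], name="sample_id"),
)
anns = pd.DataFrame(
    {"iscrowd": [0.0, 0.0, 1.0]},
    index=pd.Index([9, 9, 30], name="sample_id"),
)

mask = license_below(3)(samples, anns)
print(mask.tolist())  # [False, True, False]
```

Under that assumption, a named predicate such as `license_below(3)` could be passed wherever the tutorial passes a lambda.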

Filter#

Tables let us filter images or bboxes using familiar Pandas syntax. Each selector keeps the rows for which the callable returns True. Note that when filtering samples, BridgeDS automatically drops the annotations that belong to the removed samples:

[5]:
print("Original dataset:")
print(ds, "\n")
print("Keep only images (and corresponding bboxes) where license < 3:")
print(ds.select_samples(lambda samples, anns: samples.license < 3), "\n")
print("Keep only bboxes with iscrowd != 0. This leaves us with some empty images:")
print(ds.select_annotations(lambda samples, anns: anns.iscrowd != 0), "\n")
print("We can chain both selectors to filter the bboxes and then drop the empty images:")
print(
    ds.select_annotations(lambda samples, anns: anns.iscrowd != 0).select_samples(
        lambda samples, anns: samples.index.get_level_values("sample_id").isin(anns.index.get_level_values("sample_id"))
    )
)
Original dataset:
Dataset: {'n_samples': 118287, 'n_bbox': 860001, 'n_image': 118287}

Keep only images (and corresponding bboxes) where license < 3:
Dataset: {'n_samples': 49275, 'n_bbox': 363821, 'n_image': 49275}

Keep only bboxes with iscrowd != 0. This leaves us with some empty images:
Dataset: {'n_samples': 118287, 'n_bbox': 10052, 'n_image': 118287}

We can chain both selectors to filter the bboxes and then drop the empty images:
Dataset: {'n_samples': 9115, 'n_bbox': 10052, 'n_image': 9115}
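To make the propagation from samples to annotations concrete, here is a pandas-only sketch of the behaviour seen in the counts above. It is not the BridgeDS implementation: `select_samples_sketch` is a hypothetical function, and the toy tables and element ids are invented.

```python
import pandas as pd


def select_samples_sketch(samples, anns, predicate):
    """Keep samples where predicate is True, then drop orphaned annotations."""
    kept_samples = samples[predicate(samples, anns)]
    kept_ids = kept_samples.index.get_level_values("sample_id")
    kept_anns = anns[anns.index.get_level_values("sample_id").isin(kept_ids)]
    return kept_samples, kept_anns


samples = pd.DataFrame(
    {"license": [3.0, 1.0, 4.0]},
    index=pd.MultiIndex.from_tuples(
        [(9, "9_img"), (25, "25_img"), (30, "30_img")],
        names=["sample_id", "element_id"],
    ),
)
anns = pd.DataFrame(
    {"iscrowd": [0.0, 0.0, 1.0]},
    index=pd.MultiIndex.from_tuples(
        [(9, "9_a"), (25, "25_a"), (25, "25_b")],
        names=["sample_id", "element_id"],
    ),
)

kept_samples, kept_anns = select_samples_sketch(
    samples, anns, lambda s, a: s["license"] < 3
)
# Only sample 25 has license < 3, so it and its two annotations remain.
print(len(kept_samples), len(kept_anns))
```

Filtering annotations works the other way around only if requested explicitly, which is why the chained example needs a second `select_samples` call to drop the empty images.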

Assign#

We can assign new columns to either ds.samples or ds.annotations using the familiar Pandas assign syntax. Let’s add an n_bboxes column that holds the number of bboxes in every sample:

[6]:
ds = ds.assign_samples(
    n_bboxes=lambda samples, anns: anns.groupby("sample_id")
    .size()
    .reindex(samples.index.get_level_values("sample_id"), fill_value=0)
    .values
)
ds.samples.head()
[6]:
| sample_id | element_id | element_type | data | category | license | file_name | coco_url | height | width | date_captured | flickr_url | n_bboxes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9 | 9_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000000009.jpg | http://images.cocodataset.org/train2017/000000... | 480.0 | 640.0 | 2013-11-19 20:40:11 | http://farm5.staticflickr.com/4026/4622125393_... | 8.0 |
| 25 | 25_img | image | http://images.cocodataset.org/train2017/000000... | image | 1.0 | 000000000025.jpg | http://images.cocodataset.org/train2017/000000... | 426.0 | 640.0 | 2013-11-16 14:11:30 | http://farm1.staticflickr.com/94/241612385_d9e... | 2.0 |
| 30 | 30_img | image | http://images.cocodataset.org/train2017/000000... | image | 4.0 | 000000000030.jpg | http://images.cocodataset.org/train2017/000000... | 428.0 | 640.0 | 2013-11-24 03:32:32 | http://farm4.staticflickr.com/3377/3573516590_... | 2.0 |
| 34 | 34_img | image | http://images.cocodataset.org/train2017/000000... | image | 6.0 | 000000000034.jpg | http://images.cocodataset.org/train2017/000000... | 425.0 | 640.0 | 2013-11-18 16:32:48 | http://farm5.staticflickr.com/4024/4599442031_... | 1.0 |
| 36 | 36_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000000036.jpg | http://images.cocodataset.org/train2017/000000... | 640.0 | 481.0 | 2013-11-18 06:56:10 | http://farm8.staticflickr.com/7216/7200825264_... | 2.0 |
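The groupby, size, and reindex chain used in the cell above can be followed step by step on toy data. This is a pandas-only sketch with made-up tables; it mirrors the expression passed to `assign_samples` but does not call BridgeDS.

```python
import pandas as pd

samples = pd.DataFrame(
    {"file_name": ["a.jpg", "b.jpg", "c.jpg"]},
    index=pd.MultiIndex.from_tuples(
        [(9, "9_img"), (25, "25_img"), (30, "30_img")],
        names=["sample_id", "element_id"],
    ),
)
anns = pd.DataFrame(
    {"category_id": [51.0, 51.0, 56.0]},
    index=pd.MultiIndex.from_tuples(
        [(9, "9_a"), (9, "9_b"), (30, "30_a")],
        names=["sample_id", "element_id"],
    ),
)

# Step 1: count annotation rows per sample_id.
# Samples without any bbox (here: 25) are absent from the result.
counts = anns.groupby("sample_id").size()

# Step 2: align the counts to the samples table; missing ids become 0.
sample_ids = samples.index.get_level_values("sample_id")
aligned = counts.reindex(sample_ids, fill_value=0)

# Step 3: pass the raw values so the column is assigned by position
# rather than by index alignment.
samples = samples.assign(n_bboxes=aligned.values)
print(samples["n_bboxes"].tolist())  # [2, 0, 1]
```

Without the `reindex(..., fill_value=0)` step, images that have no annotations would be missing from the counts and could not be assigned a value.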

Sorting#

We can sort the tables using familiar Pandas syntax:

[7]:
sorted_ds = ds.sort_samples("n_bboxes", ascending=False)
sorted_ds.samples.head()
[7]:
| sample_id | element_id | element_type | data | category | license | file_name | coco_url | height | width | date_captured | flickr_url | n_bboxes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 157105 | 157105_img | image | http://images.cocodataset.org/train2017/000000... | image | 2.0 | 000000157105.jpg | http://images.cocodataset.org/train2017/000000... | 427.0 | 640.0 | 2013-11-25 14:07:08 | http://farm8.staticflickr.com/7407/10068104546... | 93.0 |
| 171270 | 171270_img | image | http://images.cocodataset.org/train2017/000000... | image | 5.0 | 000000171270.jpg | http://images.cocodataset.org/train2017/000000... | 640.0 | 640.0 | 2013-11-25 07:42:38 | http://farm4.staticflickr.com/3793/10067342296... | 80.0 |
| 122263 | 122263_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000122263.jpg | http://images.cocodataset.org/train2017/000000... | 426.0 | 640.0 | 2013-11-25 07:42:36 | http://farm8.staticflickr.com/7350/10283604383... | 78.0 |
| 579329 | 579329_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000579329.jpg | http://images.cocodataset.org/train2017/000000... | 480.0 | 640.0 | 2013-11-25 08:21:41 | http://farm8.staticflickr.com/7288/9458101590_... | 75.0 |
| 307238 | 307238_img | image | http://images.cocodataset.org/train2017/000000... | image | 4.0 | 000000307238.jpg | http://images.cocodataset.org/train2017/000000... | 427.0 | 640.0 | 2013-11-25 15:08:50 | http://farm8.staticflickr.com/7325/9060312405_... | 74.0 |

Note that sorting the samples table also changes the positional index used by the Sample API (ds.iget), which dictates the order in which the samples are shown below. The next cell shows the dataset ordered from the most bboxes per image to the fewest:

[8]:
sorted_ds.show()
[8]:
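The difference between positional and label-based access after sorting can be seen with plain pandas. This is a sketch on a toy table, assuming that positional access follows row order as the note above describes for ds.iget; the three rows reuse sample ids from this tutorial with simplified columns.

```python
import pandas as pd

samples = pd.DataFrame(
    {"n_bboxes": [8, 2, 93]},
    index=pd.Index([9, 25, 157105], name="sample_id"),
)

# Before sorting, position 0 is the first row as loaded.
first_before = samples.index[0]

# Sorting reorders the rows; each label stays attached to its row.
sorted_samples = samples.sort_values("n_bboxes", ascending=False)
first_after = sorted_samples.index[0]

print(first_before, first_after)

# Label-based lookup is unaffected by the new order.
print(sorted_samples.loc[9, "n_bboxes"])
```

Position 0 now points at the sample with the most bboxes, while looking up a sample by its id still returns the same row.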