Table API#
Preliminaries#
Imports#
[1]:
from pathlib import Path
import holoviews as hv
import panel as pn
hv.extension("bokeh")
pn.extension()
# If in google colab, run hack that allows holoviews to work properly
try:
    import google.colab  # noqa

    def _render(self, **kwargs):
        hv.extension("bokeh")
        return hv.Store.render(self)

    hv.core.Dimensioned._repr_mimebundle_ = _render
except ModuleNotFoundError:
    pass
TMP_NOTEBOOK_ROOT = Path("/tmp/bridge-ds/tutorials")
Loading a dataset#
To create BridgeDS Dataset objects, we recommend using a DatasetProvider. Here we use the Coco2017Detection provider:
[2]:
from bridge.providers.vision import Coco2017Detection
root_dir = TMP_NOTEBOOK_ROOT / "coco"
provider = Coco2017Detection(root_dir, split="train", img_source="stream")
ds = provider.build_dataset()
ds
Annotations file /tmp/bridge-ds/tutorials/coco/annotations/instances_train2017.json already exists, skipping download.
loading annotations into memory...
Done (t=12.30s)
creating index...
index created!
[2]:
Dataset: {'n_samples': 118287, 'n_bbox': 860001, 'n_image': 118287}
Table API#
In BridgeDS, we use two complementary approaches to view datasets. We call them the Sample API and the Table API. This tutorial is about the latter.
The Table API can be described as:
A dataset can be viewed as a table where every row represents a single element. Elements have a unique element_id but share their sample_id with other elements from the same Sample. The element_id and sample_id columns serve as the table's multi-index.
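To make this layout concrete, here is a minimal sketch that builds a toy table with the same two-level index using plain pandas. The rows are made up for illustration and are not produced by BridgeDS:

```python
import pandas as pd

# Hypothetical toy rows: one image and two bboxes that belong to sample 9,
# plus one image that belongs to sample 25.
rows = [
    (9, "9_img", "image"),
    (9, "9_1038967", "bbox"),
    (9, "9_1039564", "bbox"),
    (25, "25_img", "image"),
]
table = pd.DataFrame(rows, columns=["sample_id", "element_id", "element_type"])
table = table.set_index(["sample_id", "element_id"])

# element_id is unique per row, while sample_id repeats across a sample's elements.
assert table.index.get_level_values("element_id").is_unique

# Select every element that belongs to sample 9.
elements_of_sample_9 = table.xs(9, level="sample_id")
print(elements_of_sample_9)
```

Selecting on the sample_id level returns all elements of one sample, which is the relationship the Table API relies on.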
Tables#
Like in the previous tutorial, we semantically split the elements into two groups: ds.samples containing images and ds.annotations containing bboxes:
[3]:
ds.samples.head()
[3]:
| sample_id | element_id | element_type | data | category | license | file_name | coco_url | height | width | date_captured | flickr_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 9_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000000009.jpg | http://images.cocodataset.org/train2017/000000... | 480.0 | 640.0 | 2013-11-19 20:40:11 | http://farm5.staticflickr.com/4026/4622125393_... |
| 25 | 25_img | image | http://images.cocodataset.org/train2017/000000... | image | 1.0 | 000000000025.jpg | http://images.cocodataset.org/train2017/000000... | 426.0 | 640.0 | 2013-11-16 14:11:30 | http://farm1.staticflickr.com/94/241612385_d9e... |
| 30 | 30_img | image | http://images.cocodataset.org/train2017/000000... | image | 4.0 | 000000000030.jpg | http://images.cocodataset.org/train2017/000000... | 428.0 | 640.0 | 2013-11-24 03:32:32 | http://farm4.staticflickr.com/3377/3573516590_... |
| 34 | 34_img | image | http://images.cocodataset.org/train2017/000000... | image | 6.0 | 000000000034.jpg | http://images.cocodataset.org/train2017/000000... | 425.0 | 640.0 | 2013-11-18 16:32:48 | http://farm5.staticflickr.com/4024/4599442031_... |
| 36 | 36_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000000036.jpg | http://images.cocodataset.org/train2017/000000... | 640.0 | 481.0 | 2013-11-18 06:56:10 | http://farm8.staticflickr.com/7216/7200825264_... |
[4]:
ds.annotations.head()
[4]:
| sample_id | element_id | element_type | data | category | category_id | area | iscrowd |
|---|---|---|---|---|---|---|---|
| 9 | 9_1038967 | bbox | BoundingBox(class_name=51,coords=[ 1.08 187.6... | obj | 51.0 | 120057.13925 | 0.0 |
| 9 | 9_1039564 | bbox | BoundingBox(class_name=51,coords=[311.73 4.3... | obj | 51.0 | 44434.75110 | 0.0 |
| 9 | 9_1058555 | bbox | BoundingBox(class_name=56,coords=[249.6 229.2... | obj | 56.0 | 49577.94435 | 0.0 |
| 9 | 9_1534147 | bbox | BoundingBox(class_name=51,coords=[ 0. 13.5... | obj | 51.0 | 24292.78170 | 0.0 |
| 9 | 9_1913551 | bbox | BoundingBox(class_name=55,coords=[376.2 40.3... | obj | 55.0 | 2239.29240 | 0.0 |
Methods#
The Table API exposes methods that take callables operating on Pandas DataFrames, because the DataFrame API is simple and familiar. The following sections showcase methods that perform different actions on Datasets. In each case, the callable receives the pair of DataFrames (samples, annotations).
Filter#
Using tables allows us to easily filter out images or bboxes using familiar Pandas syntax. Note that when filtering samples, BridgeDS automatically filters out corresponding annotations:
[5]:
print("Original dataset:")
print(ds, "\n")
print("Keep only images (and corresponding bboxes) where license < 3:")
print(ds.select_samples(lambda samples, anns: samples.license < 3), "\n")
print("Filter out all bboxes with iscrowd == 0. This leaves us with some empty images:")
print(ds.select_annotations(lambda samples, anns: anns.iscrowd != 0), "\n")
print("We can chain both selectors to filter out the bboxes, and subsequently filter out empty images:")
print(
    ds.select_annotations(lambda samples, anns: anns.iscrowd != 0).select_samples(
        lambda samples, anns: samples.index.get_level_values("sample_id").isin(anns.index.get_level_values("sample_id"))
    )
)
Original dataset:
Dataset: {'n_samples': 118287, 'n_bbox': 860001, 'n_image': 118287}
Keep only images (and corresponding bboxes) where license < 3:
Dataset: {'n_samples': 49275, 'n_bbox': 363821, 'n_image': 49275}
Filter out all bboxes with iscrowd == 0. This leaves us with some empty images:
Dataset: {'n_samples': 118287, 'n_bbox': 10052, 'n_image': 118287}
We can chain both selectors to filter out the bboxes, and subsequently filter out empty images:
Dataset: {'n_samples': 9115, 'n_bbox': 10052, 'n_image': 9115}
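The selection semantics seen in the counts above (rows are kept where the selector is True, and removing a sample also removes its annotations) can be approximated in plain pandas. The sketch below is only an illustration on made-up tables. The helpers toy_select_samples and toy_select_annotations are hypothetical stand-ins and do not reflect how BridgeDS implements its selectors:

```python
import pandas as pd

# Hypothetical toy tables, indexed like ds.samples and ds.annotations.
samples = pd.DataFrame(
    {"sample_id": [1, 2, 3], "element_id": ["1_img", "2_img", "3_img"], "license": [1, 4, 2]}
).set_index(["sample_id", "element_id"])
anns = pd.DataFrame(
    {
        "sample_id": [1, 1, 2, 3],
        "element_id": ["1_a", "1_b", "2_a", "3_a"],
        "iscrowd": [0, 1, 0, 0],
    }
).set_index(["sample_id", "element_id"])


def toy_select_samples(samples, anns, selector):
    """Keep samples where the selector is True, and only their annotations."""
    kept_samples = samples[selector(samples, anns)]
    kept_ids = kept_samples.index.get_level_values("sample_id")
    kept_anns = anns[anns.index.get_level_values("sample_id").isin(kept_ids)]
    return kept_samples, kept_anns


def toy_select_annotations(samples, anns, selector):
    """Keep annotations where the selector is True; samples are left untouched."""
    return samples, anns[selector(samples, anns)]


# Keeping samples with license < 3 also drops the annotation of sample 2.
s1, a1 = toy_select_samples(samples, anns, lambda s, a: s.license < 3)

# Keeping only crowd annotations leaves samples 2 and 3 empty...
s2, a2 = toy_select_annotations(samples, anns, lambda s, a: a.iscrowd != 0)

# ...so a second pass keeps only the samples that still own an annotation.
s3, a3 = toy_select_samples(
    s2,
    a2,
    lambda s, a: s.index.get_level_values("sample_id").isin(a.index.get_level_values("sample_id")),
)
print(len(s1), len(a1), len(s2), len(a2), len(s3), len(a3))
```

The last selector is the same expression used in the chained example: it compares the sample_id level of both tables to find samples that still have at least one annotation.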
Assign#
We can assign new columns to either ds.samples or ds.annotations using familiar syntax. Let’s assign the value n_bboxes to every sample:
[6]:
ds = ds.assign_samples(
    n_bboxes=lambda samples, anns: anns.groupby("sample_id")
    .size()
    .reindex(samples.index.get_level_values("sample_id"), fill_value=0)
    .values
)
ds.samples.head()
[6]:
| sample_id | element_id | element_type | data | category | license | file_name | coco_url | height | width | date_captured | flickr_url | n_bboxes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 9_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000000009.jpg | http://images.cocodataset.org/train2017/000000... | 480.0 | 640.0 | 2013-11-19 20:40:11 | http://farm5.staticflickr.com/4026/4622125393_... | 8.0 |
| 25 | 25_img | image | http://images.cocodataset.org/train2017/000000... | image | 1.0 | 000000000025.jpg | http://images.cocodataset.org/train2017/000000... | 426.0 | 640.0 | 2013-11-16 14:11:30 | http://farm1.staticflickr.com/94/241612385_d9e... | 2.0 |
| 30 | 30_img | image | http://images.cocodataset.org/train2017/000000... | image | 4.0 | 000000000030.jpg | http://images.cocodataset.org/train2017/000000... | 428.0 | 640.0 | 2013-11-24 03:32:32 | http://farm4.staticflickr.com/3377/3573516590_... | 2.0 |
| 34 | 34_img | image | http://images.cocodataset.org/train2017/000000... | image | 6.0 | 000000000034.jpg | http://images.cocodataset.org/train2017/000000... | 425.0 | 640.0 | 2013-11-18 16:32:48 | http://farm5.staticflickr.com/4024/4599442031_... | 1.0 |
| 36 | 36_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000000036.jpg | http://images.cocodataset.org/train2017/000000... | 640.0 | 481.0 | 2013-11-18 06:56:10 | http://farm8.staticflickr.com/7216/7200825264_... | 2.0 |
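The groupby, size, and reindex chain used in the callable is standard pandas. The following sketch runs the same steps on made-up tables (the data and the extra columns are assumptions for illustration) to show why the reindex with fill_value=0 is needed: samples without any annotation do not appear in the grouped counts.

```python
import pandas as pd

# Hypothetical toy tables: sample 3 has no annotations at all.
samples = pd.DataFrame(
    {"sample_id": [1, 2, 3], "element_id": ["1_img", "2_img", "3_img"], "width": [640, 640, 481]}
).set_index(["sample_id", "element_id"])
anns = pd.DataFrame(
    {"sample_id": [1, 1, 2], "element_id": ["1_a", "1_b", "2_a"], "area": [10.0, 20.0, 5.0]}
).set_index(["sample_id", "element_id"])

# Count annotations per sample; sample 3 is missing from this Series.
counts = anns.groupby("sample_id").size()

# Align the counts to the row order of the samples table, filling gaps with 0.
sample_ids = samples.index.get_level_values("sample_id")
n_bboxes = counts.reindex(sample_ids, fill_value=0).values

samples = samples.assign(n_bboxes=n_bboxes)
print(samples)
```

Taking .values at the end drops the index of the counts, so the result is assigned to the samples table purely by position. This is why the counts are first reindexed to the samples' own order.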
Sorting#
We can sort the tables using familiar Pandas syntax:
[7]:
sorted_ds = ds.sort_samples("n_bboxes", ascending=False)
sorted_ds.samples.head()
[7]:
| sample_id | element_id | element_type | data | category | license | file_name | coco_url | height | width | date_captured | flickr_url | n_bboxes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 157105 | 157105_img | image | http://images.cocodataset.org/train2017/000000... | image | 2.0 | 000000157105.jpg | http://images.cocodataset.org/train2017/000000... | 427.0 | 640.0 | 2013-11-25 14:07:08 | http://farm8.staticflickr.com/7407/10068104546... | 93.0 |
| 171270 | 171270_img | image | http://images.cocodataset.org/train2017/000000... | image | 5.0 | 000000171270.jpg | http://images.cocodataset.org/train2017/000000... | 640.0 | 640.0 | 2013-11-25 07:42:38 | http://farm4.staticflickr.com/3793/10067342296... | 80.0 |
| 122263 | 122263_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000122263.jpg | http://images.cocodataset.org/train2017/000000... | 426.0 | 640.0 | 2013-11-25 07:42:36 | http://farm8.staticflickr.com/7350/10283604383... | 78.0 |
| 579329 | 579329_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000579329.jpg | http://images.cocodataset.org/train2017/000000... | 480.0 | 640.0 | 2013-11-25 08:21:41 | http://farm8.staticflickr.com/7288/9458101590_... | 75.0 |
| 307238 | 307238_img | image | http://images.cocodataset.org/train2017/000000... | image | 4.0 | 000000307238.jpg | http://images.cocodataset.org/train2017/000000... | 427.0 | 640.0 | 2013-11-25 15:08:50 | http://farm8.staticflickr.com/7325/9060312405_... | 74.0 |
Note that sorting the samples table changes the positional index used by the Sample API (ds.iget), which dictates the order in which the samples are displayed below. The next cell shows the dataset ordered from the most bboxes per image to the fewest:
[8]:
sorted_ds.show()
[8]:
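The effect of sorting on positional access can be illustrated with plain pandas. This is a sketch on a made-up table, not a description of how ds.iget works internally: the index labels stay attached to their rows, but position 0 points to a different sample after the sort.

```python
import pandas as pd

# Hypothetical toy samples table with a precomputed n_bboxes column.
samples = pd.DataFrame(
    {
        "sample_id": [9, 25, 30],
        "element_id": ["9_img", "25_img", "30_img"],
        "n_bboxes": [2, 8, 5],
    }
).set_index(["sample_id", "element_id"])

# Index label of the row at position 0 before sorting.
first_before = samples.index[0]

# Sorting reorders the rows, so position 0 now refers to a different sample.
sorted_samples = samples.sort_values("n_bboxes", ascending=False)
first_after = sorted_samples.index[0]

print(first_before, "->", first_after)
```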