COCO EDA Demo#

Preliminaries#

Imports#

[1]:
from pathlib import Path

import holoviews as hv
import hvplot.pandas  # noqa
import pandas as pd
import panel as pn

hv.extension("bokeh")
pn.extension()
# If in google colab, run hack that allows holoviews to work properly
try:
    import google.colab  # noqa

    def _render(self, **kwargs):
        hv.extension("bokeh")
        return hv.Store.render(self)

    hv.core.Dimensioned._repr_mimebundle_ = _render
except ModuleNotFoundError:
    pass

TMP_NOTEBOOK_ROOT = Path("/tmp/bridge-ds/tutorials")

Loading a dataset#

The recommended way to create a BridgeDS Dataset object is through a DatasetProvider. Here we use the Coco2017Detection provider:

[2]:
from bridge.providers.vision import Coco2017Detection

root_dir = TMP_NOTEBOOK_ROOT / "coco"

provider = Coco2017Detection(root_dir)
ds = provider.build_dataset()
ds
Annotations file /tmp/bridge-ds/tutorials/coco/annotations/instances_train2017.json already exists, skipping download.
loading annotations into memory...
Done (t=12.56s)
creating index...
index created!
[2]:
Dataset: {'n_samples': 118287, 'n_bbox': 860001, 'n_image': 118287}

Real-life example: Exploratory Data Analysis on COCO#

In this demo we’ll perform a short step-by-step analysis on COCO, using different tools available in BridgeDS.

Assigning a column#

Let’s take a brief look at our samples and annotations:

[3]:
ds.samples.head()
[3]:
                       element_type  data                                               category  license  file_name         coco_url                                           height  width  date_captured        flickr_url
sample_id  element_id
9          9_img       image         http://images.cocodataset.org/train2017/000000...  image     3.0      000000000009.jpg  http://images.cocodataset.org/train2017/000000...  480.0   640.0  2013-11-19 20:40:11  http://farm5.staticflickr.com/4026/4622125393_...
25         25_img      image         http://images.cocodataset.org/train2017/000000...  image     1.0      000000000025.jpg  http://images.cocodataset.org/train2017/000000...  426.0   640.0  2013-11-16 14:11:30  http://farm1.staticflickr.com/94/241612385_d9e...
30         30_img      image         http://images.cocodataset.org/train2017/000000...  image     4.0      000000000030.jpg  http://images.cocodataset.org/train2017/000000...  428.0   640.0  2013-11-24 03:32:32  http://farm4.staticflickr.com/3377/3573516590_...
34         34_img      image         http://images.cocodataset.org/train2017/000000...  image     6.0      000000000034.jpg  http://images.cocodataset.org/train2017/000000...  425.0   640.0  2013-11-18 16:32:48  http://farm5.staticflickr.com/4024/4599442031_...
36         36_img      image         http://images.cocodataset.org/train2017/000000...  image     3.0      000000000036.jpg  http://images.cocodataset.org/train2017/000000...  640.0   481.0  2013-11-18 06:56:10  http://farm8.staticflickr.com/7216/7200825264_...
[4]:
ds.annotations.head()
[4]:
                       element_type  data                                               category  category_id  area          iscrowd
sample_id  element_id
9          9_1038967   bbox          BoundingBox(class_name=51,coords=[ 1.08 187.6...   obj       51.0         120057.13925  0.0
           9_1039564   bbox          BoundingBox(class_name=51,coords=[311.73 4.3...    obj       51.0         44434.75110   0.0
           9_1058555   bbox          BoundingBox(class_name=56,coords=[249.6 229.2...   obj       56.0         49577.94435   0.0
           9_1534147   bbox          BoundingBox(class_name=51,coords=[ 0. 13.5...      obj       51.0         24292.78170   0.0
           9_1913551   bbox          BoundingBox(class_name=55,coords=[376.2 40.3...    obj       55.0         2239.29240    0.0

Observe the annotations table: the class names (within the BoundingBox objects in the data column) are represented by numerical labels, which may impede readability during data analysis. To address this, we may choose to use a third-party file that maps these integer labels to their corresponding text labels.

[5]:
from urllib.request import urlopen

url = "https://raw.githubusercontent.com/amikelive/coco-labels/master/coco-labels-paper.txt"

classnames = urlopen(url).read().decode("utf-8").splitlines()
classnames = {i + 1: c for i, c in enumerate(classnames)}
print(classnames)
{1: 'person', 2: 'bicycle', 3: 'car', 4: 'motorcycle', 5: 'airplane', 6: 'bus', 7: 'train', 8: 'truck', 9: 'boat', 10: 'traffic light', 11: 'fire hydrant', 12: 'street sign', 13: 'stop sign', 14: 'parking meter', 15: 'bench', 16: 'bird', 17: 'cat', 18: 'dog', 19: 'horse', 20: 'sheep', 21: 'cow', 22: 'elephant', 23: 'bear', 24: 'zebra', 25: 'giraffe', 26: 'hat', 27: 'backpack', 28: 'umbrella', 29: 'shoe', 30: 'eye glasses', 31: 'handbag', 32: 'tie', 33: 'suitcase', 34: 'frisbee', 35: 'skis', 36: 'snowboard', 37: 'sports ball', 38: 'kite', 39: 'baseball bat', 40: 'baseball glove', 41: 'skateboard', 42: 'surfboard', 43: 'tennis racket', 44: 'bottle', 45: 'plate', 46: 'wine glass', 47: 'cup', 48: 'fork', 49: 'knife', 50: 'spoon', 51: 'bowl', 52: 'banana', 53: 'apple', 54: 'sandwich', 55: 'orange', 56: 'broccoli', 57: 'carrot', 58: 'hot dog', 59: 'pizza', 60: 'donut', 61: 'cake', 62: 'chair', 63: 'couch', 64: 'potted plant', 65: 'bed', 66: 'mirror', 67: 'dining table', 68: 'window', 69: 'desk', 70: 'toilet', 71: 'door', 72: 'tv', 73: 'laptop', 74: 'mouse', 75: 'remote', 76: 'keyboard', 77: 'cell phone', 78: 'microwave', 79: 'oven', 80: 'toaster', 81: 'sink', 82: 'refrigerator', 83: 'blender', 84: 'book', 85: 'clock', 86: 'vase', 87: 'scissors', 88: 'teddy bear', 89: 'hair drier', 90: 'toothbrush', 91: 'hair brush'}
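
The `i + 1` offset is needed because the label ids start at 1 (`person`), not 0. As a minimal sketch, assuming a hypothetical three-line stand-in for the downloaded file, the same mapping can also be built with `enumerate(..., start=1)`:

```python
# Hypothetical stand-in for the first lines of the downloaded label file.
label_lines = ["person", "bicycle", "car"]

# start=1 yields 1-based ids, equivalent to {i + 1: c for i, c in enumerate(...)}.
classnames_demo = dict(enumerate(label_lines, start=1))

print(classnames_demo)  # {1: 'person', 2: 'bicycle', 3: 'car'}
```

Both spellings produce the same dictionary; which one to use is a matter of taste.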

We can use ds.assign_annotations to replace our bounding box class labels with new ones:

[6]:
from bridge.utils.data_objects import BoundingBox, ClassLabel


def map_bbox_class_names(bbox, classnames):
    coords = bbox.coords
    class_idx = bbox.class_label.class_idx
    class_name = classnames[class_idx]
    return BoundingBox(coords, ClassLabel(class_idx, class_name))


ds = ds.assign_annotations(
    data=lambda samples, anns: anns.data.apply(lambda bbox: map_bbox_class_names(bbox, classnames))
)
ds.annotations.head()
[6]:
                       element_type  data                                               category  category_id  area          iscrowd
sample_id  element_id
9          9_1038967   bbox          BoundingBox(class_name=bowl,coords=[ 1.08 187...   obj       51.0         120057.13925  0.0
           9_1039564   bbox          BoundingBox(class_name=bowl,coords=[311.73 4...    obj       51.0         44434.75110   0.0
           9_1058555   bbox          BoundingBox(class_name=broccoli,coords=[249.6 ...  obj       56.0         49577.94435   0.0
           9_1534147   bbox          BoundingBox(class_name=bowl,coords=[ 0. 13...      obj       51.0         24292.78170   0.0
           9_1913551   bbox          BoundingBox(class_name=orange,coords=[376.2 ...    obj       55.0         2239.29240    0.0
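
The keyword-plus-callable pattern used above resembles pandas' own `DataFrame.assign`. The sketch below shows that analogy on a toy table. It illustrates the pandas behaviour only and makes no claim about how BridgeDS implements `assign_annotations`; the table contents and the `class_name` column are made up for illustration.

```python
import pandas as pd

# Toy stand-in for an annotations table (values are illustrative).
toy_anns = pd.DataFrame({"category_id": [51, 56, 55]})
id_to_name = {51: "bowl", 55: "orange", 56: "broccoli"}

# Each keyword names a column; the callable receives the table and returns its values.
renamed = toy_anns.assign(class_name=lambda df: df["category_id"].map(id_to_name))

print(renamed["class_name"].tolist())  # ['bowl', 'broccoli', 'orange']

# pandas' assign returns a new table and leaves the original untouched.
print(list(toy_anns.columns))  # ['category_id']
```

The tutorial's `ds = ds.assign_annotations(...)` rebinding follows the same habit of working with the returned object rather than relying on in-place changes.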

Another issue is that ds.samples.date_captured holds strings rather than pd.Timestamp values. Let’s fix that:

[7]:
print(ds.samples.date_captured.dtype)
ds = ds.assign_samples(date_captured=lambda samples, anns: pd.to_datetime(samples.date_captured))
print(ds.samples.date_captured.dtype)
object
datetime64[ns]
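
To see what the conversion buys us, here is a small pandas-only sketch on a toy column in the same string format as `date_captured` (the two values are copied from the samples table above; the variable names are illustrative). Once the column is a datetime type, the `.dt` accessor becomes available:

```python
import pandas as pd

# Toy column of timestamp strings in the same format as date_captured.
raw = pd.Series(["2013-11-19 20:40:11", "2013-11-16 14:11:30"])
parsed = pd.to_datetime(raw)

print(raw.dtype, "->", parsed.dtype)

# Datetime columns expose the .dt accessor, e.g. to extract the capture hour.
print(parsed.dt.hour.tolist())  # [20, 14]
```

The same accessor is what makes time-based grouping possible later on.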

This is a short example of where the Table API shines. Most frameworks and libraries implement some variant of our Sample API, which in practice means that assignment operations like these require iterating through the dataset with a nested loop:

for sample in samples:
    for annotation in sample:
        ...  # update the annotation

This is both slow and verbose.
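
To make the contrast concrete, here is a minimal sketch on a toy pandas table. The table, its ids, and the `area < 1000` rule are illustrative assumptions, not BridgeDS code; the point is only that one column-wise expression replaces the nested loop.

```python
import pandas as pd

# Toy annotations table indexed like ds.annotations: (sample_id, element_id).
toy_anns = pd.DataFrame(
    {"area": [120057.1, 2239.3, 512.0]},
    index=pd.MultiIndex.from_tuples(
        [(9, "9_a"), (9, "9_b"), (25, "25_a")],
        names=["sample_id", "element_id"],
    ),
)

# Loop style: visit each sample, then each annotation inside it.
looped = []
for _, sample_anns in toy_anns.groupby(level="sample_id"):
    for area in sample_anns["area"]:
        looped.append(area < 1000)

# Table style: a single column-wise expression over all annotations.
vectorized = toy_anns["area"] < 1000

print(looped == vectorized.tolist())  # True
```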

Plotting#

With our tables now in appropriate formats, let’s generate some basic plots to gain insights into our data.

Note: While our preferred plotting API is hvplot, Pandas Plotting remains a viable option, as are other options that support the Pandas API.

[8]:
plot = ds.annotations.data.apply(lambda bb: str(bb.class_label)).value_counts().hvplot.bar()

plot.opts(
    title="Class-histogram, COCO Train",
    width=900,
    xrotation=90,
    xlabel="class",
    ylabel="n_bboxes",
)
[8]:
[9]:
ds.samples.license.value_counts().hvplot.bar().opts(title="Image Licenses, Histogram")
[9]:
[10]:
(ds.samples.groupby(pd.Grouper(freq="D", key="date_captured")).size()).hvplot.bar().opts(
    xrotation=45, title="Date Captured Histogram, COCO Train"
)
[10]:
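
The date histogram relies on `pd.Grouper` to bin rows by calendar day. As a rough sketch of what that grouping returns, here it is on three toy timestamps (taken from the samples table above), using `"D"`, pandas' documented alias for daily frequency:

```python
import pandas as pd

toy_samples = pd.DataFrame(
    {
        "date_captured": pd.to_datetime(
            ["2013-11-16 14:11:30", "2013-11-18 06:56:10", "2013-11-18 16:32:48"]
        )
    }
)

# One bin per calendar day between the first and last timestamp.
per_day = toy_samples.groupby(pd.Grouper(freq="D", key="date_captured")).size()

print(per_day.tolist())  # [1, 0, 2]
```

Note that days without any samples still appear with a count of 0, so gaps in the capture dates remain visible in the bar plot.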
[11]:
ds.annotations.area.hvplot.density().opts(
    title="KDE of annotation area, COCO Train",
    xlabel="area (px^2)",
    ylabel="density",
    tools=[],
)
[11]:

Investigating a bbox with abnormally large area#

Observing the KDE plot, we notice an unnatural leftward squeezing. This behavior is likely due to hvplot setting the x-axis limits based on the minimum and maximum values present in the data. Could this suggest that one of our annotations has an area on the order of 8.0e+5 px^2?

[12]:
large_ann = ds.annotations.loc[ds.annotations.area.idxmax()]
large_ann
[12]:
element_type                                                 bbox
data            BoundingBox(class_name=dining table,coords=[  ...
category                                                      obj
category_id                                                  67.0
area                                                 787151.47665
iscrowd                                                       0.0
Name: (400410, 400410_391362), dtype: object

The area of this annotation is 787151, so it is indeed on the order of 8.0e+5.
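
The lookup above works because `idxmax` on a MultiIndex-ed column returns the full `(sample_id, element_id)` key. A minimal pandas sketch on a toy table (the element ids here are hypothetical):

```python
import pandas as pd

toy_anns = pd.DataFrame(
    {"area": [120057.1, 2239.3, 787151.5]},
    index=pd.MultiIndex.from_tuples(
        [(9, "9_a"), (9, "9_b"), (400410, "400410_a")],
        names=["sample_id", "element_id"],
    ),
)

# idxmax returns the index label of the largest value, here a tuple.
largest_key = toy_anns["area"].idxmax()
largest_row = toy_anns.loc[largest_key]

print(largest_key)          # (400410, '400410_a')
print(largest_row.name[0])  # 400410

# nlargest generalises this to the top-k annotations.
top2_samples = toy_anns["area"].nlargest(2).index.get_level_values("sample_id").tolist()
print(top2_samples)  # [400410, 9]
```

Inspecting a handful of the largest annotations rather than only the maximum can help tell a single outlier from a recurring pattern.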

We have now identified a specific sample, with id=400410, that warrants further examination. The ds.get and sample.show() methods from the Sample API let us visualize this sample.

(Reminder: ds.get and ds.iget serve as equivalents to df.loc and df.iloc, respectively, for single samples).

[13]:
sample_id = large_ann.name[0]  # MultiIndex loc causes the name to be tuples (<sample_id>,<element_id>)
ds.get(sample_id).show()
[13]:

You can also call ds.show() to visualize the entire dataset instead of a single sample. Use the slider to scroll through and view different samples from COCO, right in your notebook.

[14]:
ds.show()
[14]:

Sorting COCO dataset by bbox sizes#

As we saw in the previous section, the dining table annotation covers the entire image.

To assess the frequency of such occurrences, let’s display the samples in our dataset in descending order of annotation size.

To achieve this:

  1. Assign a new column to ds.samples representing the area value of its largest annotation.

  2. Sort the samples by this column.

  3. Run ds.show().

[15]:
def get_largest_area_annotation_per_sample(samples, anns):
    return (
        anns.sort_values("area", ascending=False)
        .groupby("sample_id")
        .area.first()
        .reindex(
            samples.index.get_level_values("sample_id")
        )  # without reindex, the areas may have a different sample order than our `ds.samples` index
        .values
    )


ds = ds.assign_samples(top_ann_area=get_largest_area_annotation_per_sample)
ds.samples.head()
[15]:
                       element_type  data                                               category  license  file_name         coco_url                                           height  width  date_captured        flickr_url                                         top_ann_area
sample_id  element_id
9          9_img       image         http://images.cocodataset.org/train2017/000000...  image     3.0      000000000009.jpg  http://images.cocodataset.org/train2017/000000...  480.0   640.0  2013-11-19 20:40:11  http://farm5.staticflickr.com/4026/4622125393_...  120057.13925
25         25_img      image         http://images.cocodataset.org/train2017/000000...  image     1.0      000000000025.jpg  http://images.cocodataset.org/train2017/000000...  426.0   640.0  2013-11-16 14:11:30  http://farm1.staticflickr.com/94/241612385_d9e...  19686.59795
30         30_img      image         http://images.cocodataset.org/train2017/000000...  image     4.0      000000000030.jpg  http://images.cocodataset.org/train2017/000000...  428.0   640.0  2013-11-24 03:32:32  http://farm4.staticflickr.com/3377/3573516590_...  47675.66290
34         34_img      image         http://images.cocodataset.org/train2017/000000...  image     6.0      000000000034.jpg  http://images.cocodataset.org/train2017/000000...  425.0   640.0  2013-11-18 16:32:48  http://farm5.staticflickr.com/4024/4599442031_...  92920.15370
36         36_img      image         http://images.cocodataset.org/train2017/000000...  image     3.0      000000000036.jpg  http://images.cocodataset.org/train2017/000000...  640.0   481.0  2013-11-18 06:56:10  http://farm8.staticflickr.com/7216/7200825264_...  97486.80810
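
The role of `reindex` in the function above is easiest to see on toy data. The sketch below uses a flat `sample_id` index and `groupby(...).max()` as a shorter variant of the sort-then-first approach (it should give the same result when `area` has no missing values); all values are illustrative.

```python
import pandas as pd

# Sample order differs from annotation order; sample 30 has no annotations.
sample_ids = pd.Index([25, 9, 30], name="sample_id")
toy_anns = pd.DataFrame({"sample_id": [9, 9, 25], "area": [100.0, 400.0, 50.0]})

# Largest area per sample, then aligned to the order of the samples table.
top_area = toy_anns.groupby("sample_id")["area"].max().reindex(sample_ids)

print(top_area.tolist())  # [50.0, 400.0, nan]
```

Without the `reindex`, the result would follow the groupby order (9, 25) and would be one row short, so assigning its `.values` to the samples table would misalign or fail. With it, samples that have no annotations receive NaN.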
[16]:
ds.sort_samples("top_ann_area", ascending=False).show()
[16]:

By scrolling the slider, we observe images with very large annotations on the left, followed by images with very small annotations, and then images without annotations on the right.

Filtering out images with large bboxes#

An alternative approach is to remove samples with bounding boxes that cover the majority of the image. We can accomplish this using ds.select_samples and ds.select_annotations, which, like ds.assign_samples and ds.assign_annotations, work with a Pandas-like API:

[17]:
print("Original dataset:", ds)
ds_smaller = ds.select_samples(lambda samples, anns: samples.top_ann_area < 1e5)
print("Filtered dataset:", ds_smaller)
ds_smaller.sort_samples("top_ann_area", ascending=False).show()
Original dataset: Dataset: {'n_samples': 118287, 'n_bbox': 860001, 'n_image': 118287}
Filtered dataset: Dataset: {'n_samples': 98264, 'n_bbox': 752062, 'n_image': 98264}
[17]:
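
One subtlety of the `< 1e5` filter: in pandas, any comparison against NaN evaluates to False. Assuming samples without annotations carry NaN in `top_ann_area` (as the `reindex` step earlier suggests), they are dropped by this filter as well. A small pandas-only sketch with illustrative values:

```python
import pandas as pd

# Sample 30 stands in for an image without annotations.
top_ann_area = pd.Series([120057.1, 19686.6, float("nan")], index=[9, 25, 30])

mask = top_ann_area < 1e5
print(mask.tolist())  # [False, True, False]

# To keep unannotated samples, combine the comparison with an explicit NaN check.
keep_unannotated = mask | top_ann_area.isna()
print(keep_unannotated.tolist())  # [False, True, True]
```

Whether unannotated images should survive the filter depends on the analysis; the point is to make that choice explicitly.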

For completeness, let’s plot the KDE from before on ds_smaller:

[18]:
ds_smaller.annotations.area.hvplot.density().opts(
    title="KDE of annotation area, COCO Train",
    xlabel="area (px^2)",
    ylabel="density",
    tools=[],
)
[18]:

As we can see, there’s still a leftward squeezing - although significantly less than before. We’ve gained some insight into the distribution of our bbox sizes, but there’s always more to do. Feel free to change the bbox area threshold to something even smaller, or plot this KDE for individual classes (rather than all of them), etc.
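
Another common way to deal with a heavy right tail, offered here as a suggestion rather than as part of the original analysis, is to plot the logarithm of the area instead of the raw values. A minimal sketch on illustrative numbers:

```python
import numpy as np
import pandas as pd

# Toy areas spanning several orders of magnitude, like the COCO annotations.
areas = pd.Series([50.0, 2239.3, 44434.8, 787151.5])

# log10 compresses the long tail so small and large boxes share one readable axis.
log_area = np.log10(areas)

print(log_area.round(2).tolist())
```

The resulting Series could then be passed to `.hvplot.density()` in the same way as the raw `area` column in the cells above, with the x-axis read as powers of ten.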