COCO EDA Demo#
Preliminaries#
Imports#
[1]:
from pathlib import Path
import holoviews as hv
import hvplot.pandas # noqa
import pandas as pd
import panel as pn
hv.extension("bokeh")
pn.extension()
# If in google colab, run hack that allows holoviews to work properly
try:
    import google.colab  # noqa

    def _render(self, **kwargs):
        hv.extension("bokeh")
        return hv.Store.render(self)

    hv.core.Dimensioned._repr_mimebundle_ = _render
except ModuleNotFoundError:
    pass
TMP_NOTEBOOK_ROOT = Path("/tmp/bridge-ds/tutorials")
Loading a dataset#
The recommended way to create BridgeDS Dataset objects is through a DatasetProvider. Here we use the Coco2017Detection provider:
[2]:
from bridge.providers.vision import Coco2017Detection
root_dir = TMP_NOTEBOOK_ROOT / "coco"
provider = Coco2017Detection(root_dir)
ds = provider.build_dataset()
ds
Annotations file /tmp/bridge-ds/tutorials/coco/annotations/instances_train2017.json already exists, skipping download.
loading annotations into memory...
Done (t=12.56s)
creating index...
index created!
[2]:
Dataset: {'n_samples': 118287, 'n_bbox': 860001, 'n_image': 118287}
Real-life example: Exploratory Data Analysis on COCO#
In this demo we’ll perform a short step-by-step analysis on COCO, using different tools available in BridgeDS.
Assigning a column#
Let’s take a brief look at our samples and annotations:
[3]:
ds.samples.head()
[3]:
| sample_id | element_id | element_type | data | category | license | file_name | coco_url | height | width | date_captured | flickr_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 9_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000000009.jpg | http://images.cocodataset.org/train2017/000000... | 480.0 | 640.0 | 2013-11-19 20:40:11 | http://farm5.staticflickr.com/4026/4622125393_... |
| 25 | 25_img | image | http://images.cocodataset.org/train2017/000000... | image | 1.0 | 000000000025.jpg | http://images.cocodataset.org/train2017/000000... | 426.0 | 640.0 | 2013-11-16 14:11:30 | http://farm1.staticflickr.com/94/241612385_d9e... |
| 30 | 30_img | image | http://images.cocodataset.org/train2017/000000... | image | 4.0 | 000000000030.jpg | http://images.cocodataset.org/train2017/000000... | 428.0 | 640.0 | 2013-11-24 03:32:32 | http://farm4.staticflickr.com/3377/3573516590_... |
| 34 | 34_img | image | http://images.cocodataset.org/train2017/000000... | image | 6.0 | 000000000034.jpg | http://images.cocodataset.org/train2017/000000... | 425.0 | 640.0 | 2013-11-18 16:32:48 | http://farm5.staticflickr.com/4024/4599442031_... |
| 36 | 36_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000000036.jpg | http://images.cocodataset.org/train2017/000000... | 640.0 | 481.0 | 2013-11-18 06:56:10 | http://farm8.staticflickr.com/7216/7200825264_... |
[4]:
ds.annotations.head()
[4]:
| sample_id | element_id | element_type | data | category | category_id | area | iscrowd |
|---|---|---|---|---|---|---|---|
| 9 | 9_1038967 | bbox | BoundingBox(class_name=51,coords=[ 1.08 187.6... | obj | 51.0 | 120057.13925 | 0.0 |
| 9 | 9_1039564 | bbox | BoundingBox(class_name=51,coords=[311.73 4.3... | obj | 51.0 | 44434.75110 | 0.0 |
| 9 | 9_1058555 | bbox | BoundingBox(class_name=56,coords=[249.6 229.2... | obj | 56.0 | 49577.94435 | 0.0 |
| 9 | 9_1534147 | bbox | BoundingBox(class_name=51,coords=[ 0. 13.5... | obj | 51.0 | 24292.78170 | 0.0 |
| 9 | 9_1913551 | bbox | BoundingBox(class_name=55,coords=[376.2 40.3... | obj | 55.0 | 2239.29240 | 0.0 |
Observe the annotations table: the class names (within the BoundingBox objects in the data column) are represented by numerical labels, which may impede readability during data analysis. To address this, we may choose to use a third-party file that maps these integer labels to their corresponding text labels.
[5]:
from urllib.request import urlopen
url = "https://raw.githubusercontent.com/amikelive/coco-labels/master/coco-labels-paper.txt"
classnames = urlopen(url).read().decode("utf-8").splitlines()
classnames = {i + 1: c for i, c in enumerate(classnames)}
print(classnames)
{1: 'person', 2: 'bicycle', 3: 'car', 4: 'motorcycle', 5: 'airplane', 6: 'bus', 7: 'train', 8: 'truck', 9: 'boat', 10: 'traffic light', 11: 'fire hydrant', 12: 'street sign', 13: 'stop sign', 14: 'parking meter', 15: 'bench', 16: 'bird', 17: 'cat', 18: 'dog', 19: 'horse', 20: 'sheep', 21: 'cow', 22: 'elephant', 23: 'bear', 24: 'zebra', 25: 'giraffe', 26: 'hat', 27: 'backpack', 28: 'umbrella', 29: 'shoe', 30: 'eye glasses', 31: 'handbag', 32: 'tie', 33: 'suitcase', 34: 'frisbee', 35: 'skis', 36: 'snowboard', 37: 'sports ball', 38: 'kite', 39: 'baseball bat', 40: 'baseball glove', 41: 'skateboard', 42: 'surfboard', 43: 'tennis racket', 44: 'bottle', 45: 'plate', 46: 'wine glass', 47: 'cup', 48: 'fork', 49: 'knife', 50: 'spoon', 51: 'bowl', 52: 'banana', 53: 'apple', 54: 'sandwich', 55: 'orange', 56: 'broccoli', 57: 'carrot', 58: 'hot dog', 59: 'pizza', 60: 'donut', 61: 'cake', 62: 'chair', 63: 'couch', 64: 'potted plant', 65: 'bed', 66: 'mirror', 67: 'dining table', 68: 'window', 69: 'desk', 70: 'toilet', 71: 'door', 72: 'tv', 73: 'laptop', 74: 'mouse', 75: 'remote', 76: 'keyboard', 77: 'cell phone', 78: 'microwave', 79: 'oven', 80: 'toaster', 81: 'sink', 82: 'refrigerator', 83: 'blender', 84: 'book', 85: 'clock', 86: 'vase', 87: 'scissors', 88: 'teddy bear', 89: 'hair drier', 90: 'toothbrush', 91: 'hair brush'}
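The mapping above relies on the label file listing one class name per line, in id order, with ids starting at 1. A minimal offline sketch of the same construction, using a three-line stand-in string instead of the downloaded file (the `text` variable is illustrative only):

```python
# Stand-in for the downloaded file: one class name per line, in id order
text = "person\nbicycle\ncar\n"

# COCO category ids start at 1, so enumeration starts at 1 as well
classnames = dict(enumerate(text.splitlines(), start=1))
print(classnames)  # {1: 'person', 2: 'bicycle', 3: 'car'}
```

Using `enumerate(..., start=1)` is equivalent to the `i + 1` offset in the cell above.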
We can use ds.assign_annotations to replace our bounding box class labels with new ones:
[6]:
from bridge.utils.data_objects import BoundingBox, ClassLabel
def map_bbox_class_names(bbox, classnames):
    coords = bbox.coords
    class_idx = bbox.class_label.class_idx
    class_name = classnames[class_idx]
    return BoundingBox(coords, ClassLabel(class_idx, class_name))


ds = ds.assign_annotations(
    data=lambda samples, anns: anns.data.apply(lambda bbox: map_bbox_class_names(bbox, classnames))
)
ds.annotations.head()
[6]:
| sample_id | element_id | element_type | data | category | category_id | area | iscrowd |
|---|---|---|---|---|---|---|---|
| 9 | 9_1038967 | bbox | BoundingBox(class_name=bowl,coords=[ 1.08 187... | obj | 51.0 | 120057.13925 | 0.0 |
| 9 | 9_1039564 | bbox | BoundingBox(class_name=bowl,coords=[311.73 4... | obj | 51.0 | 44434.75110 | 0.0 |
| 9 | 9_1058555 | bbox | BoundingBox(class_name=broccoli,coords=[249.6 ... | obj | 56.0 | 49577.94435 | 0.0 |
| 9 | 9_1534147 | bbox | BoundingBox(class_name=bowl,coords=[ 0. 13... | obj | 51.0 | 24292.78170 | 0.0 |
| 9 | 9_1913551 | bbox | BoundingBox(class_name=orange,coords=[376.2 ... | obj | 55.0 | 2239.29240 | 0.0 |
Another issue is that ds.samples.date_captured holds strings rather than pd.Timestamp values. Let’s fix that:
[7]:
print(ds.samples.date_captured.dtype)
ds = ds.assign_samples(date_captured=lambda samples, anns: pd.to_datetime(samples.date_captured))
print(ds.samples.date_captured.dtype)
object
datetime64[ns]
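The same conversion can be checked on a plain pandas Series, independently of BridgeDS. This is only a sketch: the two strings are copied from the samples table above, and the exact dtype names printed may differ between pandas versions.

```python
import pandas as pd

# Two timestamp strings in the same format as the `date_captured` column
raw = pd.Series(["2013-11-19 20:40:11", "2013-11-16 14:11:30"], name="date_captured")
parsed = pd.to_datetime(raw)

print(raw.dtype)
print(parsed.dtype)

# Once parsed, the `.dt` accessor exposes date components directly
print(parsed.dt.hour.tolist())
```

With real timestamps in place, date-based operations such as the per-day grouping used later in this notebook become possible.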
These two assignments are a short example of where the Table API shines. Most frameworks and libraries implement some variant of our Sample API, which in practice means that these assignment operations would require iterating through the dataset with a nested loop:
for sample in samples:
    for annotation in sample:
        <do...>
This approach is both slow and verbose.
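To make the contrast concrete, here is a small sketch in plain pandas rather than BridgeDS. The toy table, its ids, and the use of `category_id` as the mapped column are illustrative assumptions; the point is only that a single column-wise call replaces the nested loop.

```python
import pandas as pd

# Toy annotations table with a (sample_id, element_id) MultiIndex; all values are illustrative
anns = pd.DataFrame(
    {"category_id": [51, 51, 56, 1]},
    index=pd.MultiIndex.from_tuples(
        [(9, "9_a"), (9, "9_b"), (9, "9_c"), (25, "25_a")],
        names=["sample_id", "element_id"],
    ),
)
classnames = {1: "person", 51: "bowl", 56: "broccoli"}

# Table style: one vectorized call over the whole column
vectorized = anns["category_id"].map(classnames)

# Loop style: visit every sample, then every annotation inside it
looped = {}
for sample_id, sample_anns in anns.groupby(level="sample_id"):
    for idx, row in sample_anns.iterrows():
        looped[idx] = classnames[row["category_id"]]

print(vectorized.tolist())
```

Both variants produce the same labels; the loop just needs more code and more Python-level iterations to get there.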
Plotting#
With our tables now in appropriate formats, let’s generate some basic plots to gain insights into our data.
Note: While our preferred plotting API is hvplot, Pandas Plotting remains a viable option, as are other options that support the Pandas API.
[8]:
plot = ds.annotations.data.apply(lambda bb: str(bb.class_label)).value_counts().hvplot.bar()
plot.opts(
    title="Class-histogram, COCO Train",
    width=900,
    xrotation=90,
    xlabel="class",
    ylabel="n_bboxes",
)
[8]:
[9]:
ds.samples.license.value_counts().hvplot.bar().opts(title="Image Licenses, Histogram")
[9]:
[10]:
(ds.samples.groupby(pd.Grouper(freq="d", key="date_captured")).size()).hvplot.bar().opts(
    xrotation=45, title="Date Captured Histogram, COCO Train"
)
[10]:
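The per-day counts behind this histogram come from `pd.Grouper`, which bins a datetime column at a fixed frequency. A minimal sketch on made-up capture times (the uppercase alias `"D"` is used here for daily bins):

```python
import pandas as pd

# Illustrative capture times: two on the 16th, none on the 17th, one on the 18th
samples = pd.DataFrame(
    {
        "date_captured": pd.to_datetime(
            ["2013-11-16 14:11:30", "2013-11-16 18:00:00", "2013-11-18 16:32:48"]
        )
    }
)

# One row per calendar day between the first and last timestamp
per_day = samples.groupby(pd.Grouper(key="date_captured", freq="D")).size()
print(per_day)
```

Days without any samples still appear with a count of zero, which is why the histogram can show gaps between capture dates.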
[11]:
ds.annotations.area.hvplot.density().opts(
    title="KDE of annotation area, COCO Train",
    xlabel="area (px^2)",
    ylabel="density",
    tools=[],
)
[11]:
Investigating a bbox with abnormally large area#
Observing the KDE plot, we notice an unnatural leftward squeezing. This behavior is likely due to hvplot setting the x-axis limits based on the minimum and maximum values present in the data. Could this suggest that one of our annotations has an area on the order of 8.0e+5 px^2?
[12]:
large_ann = ds.annotations.loc[ds.annotations.area.idxmax()]
large_ann
[12]:
element_type bbox
data BoundingBox(class_name=dining table,coords=[ ...
category obj
category_id 67.0
area 787151.47665
iscrowd 0.0
Name: (400410, 400410_391362), dtype: object
The area of this annotation is 787151, which is indeed on the order of 8.0e+5.
We have now identified a specific sample, with id=400410, that warrants further examination. The ds.get and sample.show() methods from the Sample API let us visualize this sample.
(Reminder: ds.get and ds.iget are the equivalents of df.loc and df.iloc, respectively, for single samples.)
[13]:
sample_id = large_ann.name[0] # MultiIndex loc causes the name to be tuples (<sample_id>,<element_id>)
ds.get(sample_id).show()
[13]:
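The index handling in the last two cells can be reproduced on a toy table in plain pandas. The ids and areas below are made up; only the `idxmax`, `.loc`, and `.name` behavior is the point.

```python
import pandas as pd

# Toy annotations table; ids and areas are illustrative
anns = pd.DataFrame(
    {"area": [120.0, 9000.0, 35.5]},
    index=pd.MultiIndex.from_tuples(
        [(9, "9_a"), (25, "25_a"), (25, "25_b")],
        names=["sample_id", "element_id"],
    ),
)

largest_key = anns["area"].idxmax()  # a (sample_id, element_id) tuple
largest_row = anns.loc[largest_key]  # a Series whose .name is that same tuple
sample_id = largest_row.name[0]      # first level of the MultiIndex

print(largest_key, sample_id)
```

Because the table is indexed by two levels, the row label is a tuple, and its first element is the sample id to pass on to a per-sample lookup.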
You can also call ds.show() to visualize the entire dataset instead of a single sample. The slider lets you scroll freely through different COCO samples, right in your notebook.
[14]:
ds.show()
[14]:
Sorting COCO dataset by bbox sizes#
As we saw in the previous section, the dining table annotation covers the entire image.
To assess the frequency of such occurrences, let’s display the samples in our dataset in descending order of annotation size.
To achieve this:
1. Assign a new column to ds.samples representing the area of each sample's largest annotation.
2. Sort the samples by this column.
3. Run ds.show().
[15]:
def get_largest_area_annotation_per_sample(samples, anns):
    return (
        anns.sort_values("area", ascending=False)
        .groupby("sample_id")
        .area.first()
        .reindex(
            samples.index.get_level_values("sample_id")
        )  # without reindex, the areas may have a different sample order than our `ds.samples` index
        .values
    )
ds = ds.assign_samples(top_ann_area=get_largest_area_annotation_per_sample)
ds.samples.head()
[15]:
| sample_id | element_id | element_type | data | category | license | file_name | coco_url | height | width | date_captured | flickr_url | top_ann_area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 9_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000000009.jpg | http://images.cocodataset.org/train2017/000000... | 480.0 | 640.0 | 2013-11-19 20:40:11 | http://farm5.staticflickr.com/4026/4622125393_... | 120057.13925 |
| 25 | 25_img | image | http://images.cocodataset.org/train2017/000000... | image | 1.0 | 000000000025.jpg | http://images.cocodataset.org/train2017/000000... | 426.0 | 640.0 | 2013-11-16 14:11:30 | http://farm1.staticflickr.com/94/241612385_d9e... | 19686.59795 |
| 30 | 30_img | image | http://images.cocodataset.org/train2017/000000... | image | 4.0 | 000000000030.jpg | http://images.cocodataset.org/train2017/000000... | 428.0 | 640.0 | 2013-11-24 03:32:32 | http://farm4.staticflickr.com/3377/3573516590_... | 47675.66290 |
| 34 | 34_img | image | http://images.cocodataset.org/train2017/000000... | image | 6.0 | 000000000034.jpg | http://images.cocodataset.org/train2017/000000... | 425.0 | 640.0 | 2013-11-18 16:32:48 | http://farm5.staticflickr.com/4024/4599442031_... | 92920.15370 |
| 36 | 36_img | image | http://images.cocodataset.org/train2017/000000... | image | 3.0 | 000000000036.jpg | http://images.cocodataset.org/train2017/000000... | 640.0 | 481.0 | 2013-11-18 06:56:10 | http://farm8.staticflickr.com/7216/7200825264_... | 97486.80810 |
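The role of the reindex step is easier to see on a toy pair of tables in plain pandas. In this sketch the sample order deliberately differs from the annotation order, and one sample has no annotations; all ids and values are illustrative, and `groupby(...).max()` is swapped in as a shorter way to get the same per-sample maximum for non-missing areas.

```python
import pandas as pd

# Toy tables: sample 30 has no annotations, and samples are not in annotation order
samples = pd.DataFrame(
    {"file_name": ["b.jpg", "a.jpg", "c.jpg"]},
    index=pd.MultiIndex.from_tuples(
        [(25, "25_img"), (9, "9_img"), (30, "30_img")],
        names=["sample_id", "element_id"],
    ),
)
anns = pd.DataFrame(
    {"area": [120.0, 9000.0, 35.5]},
    index=pd.MultiIndex.from_tuples(
        [(9, "9_a"), (9, "9_b"), (25, "25_a")],
        names=["sample_id", "element_id"],
    ),
)

top_area = (
    anns.groupby("sample_id")["area"]
    .max()  # largest annotation area per sample
    .reindex(samples.index.get_level_values("sample_id"))  # align to the samples' order
)
samples["top_ann_area"] = top_area.values
print(samples)
```

After reindexing, each value lines up with its sample row, and the sample without annotations receives NaN instead of silently shifting the other values.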
[16]:
ds.sort_samples("top_ann_area", ascending=False).show()
[16]:
By scrolling the slider, we observe images with very large annotations on the left, followed by images with very small annotations, and then images without annotations on the right.
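This ordering matches how pandas sorts missing values: they are placed last by default, regardless of sort direction. The sketch below assumes that ds.sort_samples follows the semantics of pandas `sort_values`, which the observed ordering suggests but the source does not state; the values are made up.

```python
import pandas as pd

# Illustrative per-sample maximum areas; NaN stands for a sample without annotations
top_ann_area = pd.Series([35.5, float("nan"), 9000.0, 120.0], index=[25, 30, 9, 34])

ordered = top_ann_area.sort_values(ascending=False)
print(ordered.index.tolist())  # [9, 34, 25, 30]
```

The sample with NaN ends up at the end even though the sort is descending, which is why unannotated images appear at the far right of the slider.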
Filtering out images with large bboxes#
An alternative approach is to remove samples whose bounding boxes cover a large part of the image. We can do this with ds.select_samples and ds.select_annotations, which, like ds.assign_samples and ds.assign_annotations, work with a Pandas-like API:
[17]:
print("Original dataset:", ds)
ds_smaller = ds.select_samples(lambda samples, anns: samples.top_ann_area < 1e5)
print("Filtered dataset:", ds_smaller)
ds_smaller.sort_samples("top_ann_area", ascending=False).show()
Original dataset: Dataset: {'n_samples': 118287, 'n_bbox': 860001, 'n_image': 118287}
Filtered dataset: Dataset: {'n_samples': 98264, 'n_bbox': 752062, 'n_image': 98264}
[17]:
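One detail worth keeping in mind with a threshold like this: in pandas, a comparison against NaN evaluates to False, so samples without annotations would also be dropped by the condition. The sketch below illustrates this on made-up values, assuming ds.select_samples applies the boolean mask the way pandas boolean indexing would.

```python
import pandas as pd

# Illustrative per-sample maximum areas; NaN stands for a sample without annotations
top_ann_area = pd.Series([35.5, float("nan"), 9000.0, 250000.0], index=[25, 30, 9, 34])

mask_small = top_ann_area < 1e5                         # NaN compares as False
mask_small_or_empty = mask_small | top_ann_area.isna()  # also keep unannotated samples

print(mask_small.tolist())
print(mask_small_or_empty.tolist())
```

Whether unannotated images should be kept depends on the analysis; adding the `isna()` term is one way to retain them.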
For completeness, let’s plot the KDE from before on ds_smaller:
[18]:
ds_smaller.annotations.area.hvplot.density().opts(
    title="KDE of annotation area, COCO Train",
    xlabel="area (px^2)",
    ylabel="density",
    tools=[],
)
[18]:
As we can see, there is still some leftward squeezing, although significantly less than before. We’ve gained some insight into the distribution of our bbox sizes, but there’s always more to do. Feel free to lower the bbox area threshold further, or to plot this KDE for individual classes rather than for all of them.
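As a possible first step toward per-class plots, a per-class summary of areas can show which classes contribute the largest boxes. The sketch below uses a toy table in plain pandas; the `class_name` column is hypothetical here and would need to be derived from the annotations first, for example the way the class histogram cell converts each bounding box's class label to a string.

```python
import pandas as pd

# Toy annotations: class names and areas are illustrative only
anns = pd.DataFrame(
    {
        "class_name": ["bowl", "bowl", "dining table", "orange", "dining table"],
        "area": [120057.1, 44434.8, 787151.5, 2239.3, 51000.0],
    }
)

# Count, median, and maximum area for every class
per_class = anns.groupby("class_name")["area"].agg(["count", "median", "max"])
print(per_class.sort_values("max", ascending=False))
```

Classes with a large gap between median and maximum area are natural candidates for a closer look with a dedicated KDE.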