{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Preliminaries" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "import holoviews as hv\n", "import hvplot.pandas # noqa\n", "import pandas as pd\n", "import panel as pn\n", "\n", "hv.extension(\"bokeh\")\n", "pn.extension()\n", "# If in google colab, run hack that allows holoviews to work properly\n", "try:\n", " import google.colab # noqa\n", "\n", " def _render(self, **kwargs):\n", " hv.extension(\"bokeh\")\n", " return hv.Store.render(self)\n", "\n", " hv.core.Dimensioned._repr_mimebundle_ = _render\n", "except ModuleNotFoundError:\n", " pass\n", "\n", "TMP_NOTEBOOK_ROOT = Path(\"/tmp/bridge-ds/tutorials\")" ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "## Loading a dataset" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "To create BridgeDS Dataset objects, it's recommended to utilize a **DatasetProvider**. In this instance, we'll employ the Coco2017Detection provider:" ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "from bridge.providers.vision import Coco2017Detection\n", "\n", "root_dir = TMP_NOTEBOOK_ROOT / \"coco\"\n", "\n", "provider = Coco2017Detection(root_dir)\n", "ds = provider.build_dataset()\n", "ds" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "# Real-life example: Exploratory Data Analysis on COCO" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "In this demo we'll perform a short step-by-step analysis on COCO, using different tools available in BridgeDS." ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "## Assigning a column\n", "Let's take a brief look at our samples and annotations:" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "ds.samples.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "ds.annotations.head()" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "Observe the annotations table: the class names (within the BoundingBox objects in the `data` column) are represented by numerical labels, which may impede readability during data analysis. To address this, we may choose to use a third-party file that maps these integer labels to their corresponding text labels." ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "from urllib.request import urlopen\n", "\n", "url = \"https://raw.githubusercontent.com/amikelive/coco-labels/master/coco-labels-paper.txt\"\n", "\n", "classnames = urlopen(url).read().decode(\"utf-8\").splitlines()\n", "classnames = {i + 1: c for i, c in enumerate(classnames)}\n", "print(classnames)" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "We can use `ds.assign_annotations` to replace our bounding box class labels with new ones:" ] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "from bridge.utils.data_objects import BoundingBox, ClassLabel\n", "\n", "\n", "def map_bbox_class_names(bbox, classnames):\n", " coords = bbox.coords\n", " class_idx = bbox.class_label.class_idx\n", " class_name = classnames[class_idx]\n", " return BoundingBox(coords, ClassLabel(class_idx, class_name))\n", "\n", "\n", "ds = ds.assign_annotations(\n", " data=lambda samples, anns: anns.data.apply(lambda bbox: map_bbox_class_names(bbox, classnames))\n", ")\n", "ds.annotations.head()" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "Another issue is that `ds.samples.date_captured` is actually made of strings, instead of `pd.Timestamp`. Let's fix that:" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "print(ds.samples.date_captured.dtype)\n", "ds = ds.assign_samples(date_captured=lambda samples, anns: pd.to_datetime(samples.date_captured))\n", "print(ds.samples.date_captured.dtype)" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "This is a short example of where the Table API shines. Most frameworks and libraries implement some variant of our Sample API, which in practice would mean that to do these assignement operations they would have to iterate through the dataset using a nested loop:\n", "\n", "```\n", "for sample in samples:\n", " for annotation in sample:\n", " \n", "```\n", "\n", "Which is both slow and verbose." ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "## Plotting\n", "With our tables now in appropriate formats, let's generate some basic plots to gain insights into our data.\n", "\n", "Note: While our preferred plotting API is [hvplot](https://hvplot.holoviz.org/), [Pandas Plotting](https://pandas.pydata.org/docs/user_guide/visualization.html) remains a viable option, as are other options that support the Pandas API." ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "plot = ds.annotations.data.apply(lambda bb: str(bb.class_label)).value_counts().hvplot.bar()\n", "\n", "plot.opts(\n", " title=\"Class-histogram, COCO Train\",\n", " width=900,\n", " xrotation=90,\n", " xlabel=\"class\",\n", " ylabel=\"n_bboxes\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "ds.samples.license.value_counts().hvplot.bar().opts(title=\"Image Licenses, Histogram\")" ] }, { "cell_type": "code", "execution_count": null, "id": "21", "metadata": {}, "outputs": [], "source": [ "(ds.samples.groupby(pd.Grouper(freq=\"d\", key=\"date_captured\")).size()).hvplot.bar().opts(\n", " xrotation=45, title=\"Date Captured Histogram, COCO Train\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "ds.annotations.area.hvplot.density().opts(\n", " title=\"KDE of annotation area, COCO Train\",\n", " xlabel=\"area (px)\",\n", " ylabel=\"density\",\n", " tools=[],\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "23", "metadata": {}, "source": [ "## Investigating a bbox with abnormally large area\n", "\n", "Observing the KDE plot, we notice an unnatural leftward squeezing. This behavior is likely due to `hvplot` setting the x-axis limits based on the minimum and maximum values present in the data. Could this suggest that one of our annotations has an area on the order of 8.0e+5 px^2?" ] }, { "cell_type": "code", "execution_count": null, "id": "24", "metadata": {}, "outputs": [], "source": [ "large_ann = ds.annotations.loc[ds.annotations.area.idxmax()]\n", "large_ann" ] }, { "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "We can see the area of this annotation is 787151, so indeed in the order of 8.0e+5\n", "\n", "At this juncture, we've identified a specific sample with `id=400410` that warrants further examination. Utilizing the `ds.get` and `sample.show()` methods from the Sample API allows us to visualize this sample\n", "\n", "(Reminder: `ds.get` and `ds.iget` serve as equivalents to `df.loc` and `df.iloc`, respectively, _for single samples_)." ] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [], "source": [ "sample_id = large_ann.name[0] # MultiIndex loc causes the name to be tuples (,)\n", "ds.get(sample_id).show()" ] }, { "cell_type": "markdown", "id": "27", "metadata": {}, "source": [ "You can also call `ds.show()` to visualize the entire dataset instead of a single sample. You can freely scroll through using the slider and visualize different samples from the COCO, right in your notebook." ] }, { "cell_type": "code", "execution_count": null, "id": "28", "metadata": {}, "outputs": [], "source": [ "ds.show()" ] }, { "cell_type": "markdown", "id": "29", "metadata": {}, "source": [ "## Sorting COCO dataset by bbox sizes\n", "\n", "Like we've seen in the previous section, it's evident that the `dining table` annotation covers the entire image.\n", "\n", "To assess the frequency of such occurrences, let's display the samples in our dataset in descending order of annotation size.\n", "\n", "To achieve this:\n", "1. Assign a new column to `ds.samples` representing the area value of its largest annotation.\n", "2. Sort the samples by this column.\n", "3. Run `ds.show()`." ] }, { "cell_type": "code", "execution_count": null, "id": "30", "metadata": {}, "outputs": [], "source": [ "def get_largest_area_annotation_per_sample(samples, anns):\n", " return (\n", " anns.sort_values(\"area\", ascending=False)\n", " .groupby(\"sample_id\")\n", " .area.first()\n", " .reindex(\n", " samples.index.get_level_values(\"sample_id\")\n", " ) # without reindex, the areas may have a different sample order than our `ds.samples` index\n", " .values\n", " )\n", "\n", "\n", "ds = ds.assign_samples(top_ann_area=get_largest_area_annotation_per_sample)\n", "ds.samples.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "31", "metadata": {}, "outputs": [], "source": [ "ds.sort_samples(\"top_ann_area\", ascending=False).show()" ] }, { "cell_type": "markdown", "id": "32", "metadata": {}, "source": [ "By scrolling the slider, we observe images with very large annotations on the left, followed by images with very small annotations, and then images without annotations on the right." ] }, { "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "## Filtering out images with large bboxes\n", "An alternative approach is to remove samples with bounding boxes that cover the majority of the image. We can accomplish this using `ds.select_samples` and `ds.select_annotations`, which similarly to `ds.assign_samples` / `ds.assign_annotations`, work with a Pandas-like API:" ] }, { "cell_type": "code", "execution_count": null, "id": "34", "metadata": {}, "outputs": [], "source": [ "print(\"Original dataset:\", ds)\n", "ds_smaller = ds.select_samples(lambda samples, anns: samples.top_ann_area < 1e5)\n", "print(\"Filtered dataset:\", ds_smaller)\n", "ds_smaller.sort_samples(\"top_ann_area\", ascending=False).show()" ] }, { "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "For completeness, let's plot the KDE from before on `ds_smaller`:" ] }, { "cell_type": "code", "execution_count": null, "id": "36", "metadata": {}, "outputs": [], "source": [ "ds_smaller.annotations.area.hvplot.density().opts(\n", " title=\"KDE of annotation area, COCO Train\",\n", " xlabel=\"area (px)\",\n", " ylabel=\"density\",\n", " tools=[],\n", ")" ] }, { "cell_type": "markdown", "id": "37", "metadata": {}, "source": [ "As we can see, there's still a leftward squeezing - although significantly less than before. We've gained some insight into the distribution of our bbox sizes, but there's always more to do. Feel free to change the bbox area threshold to something even smaller, or plot this KDE for individual classes (rather than all of them), etc." ] } ], "metadata": { "kernelspec": { "display_name": "python3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 5 }