Dataset Provider#

Preliminaries#

Downloading the demo dataset#

In this tutorial, we’ll integrate the Large Movie Review Dataset, a text classification dataset, with Bridge.

[1]:

from pathlib import Path

TMP_NOTEBOOK_ROOT = Path("/tmp/bridge-ds/tutorials")

[2]:

from bridge.utils import download_and_extract_archive

download_and_extract_archive(
    "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", TMP_NOTEBOOK_ROOT / "imdb"
)

Downloading https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz to /tmp/bridge-ds/tutorials/imdb/aclImdb_v1.tar.gz
Extracting /tmp/bridge-ds/tutorials/imdb/aclImdb_v1.tar.gz to /tmp/bridge-ds/tutorials/imdb

File Tree#

After extracting, we can observe the following file structure:

├── README
├── imdb.vocab
├── imdbEr.txt
├── test
│   ├── labeledBow.feat
│   ├── neg  [12500 entries]
│   ├── pos  [12500 entries]
│   ├── urls_neg.txt
│   └── urls_pos.txt
└── train
    ├── labeledBow.feat
    ├── neg  [12500 entries]
    ├── pos  [12500 entries]
    ├── unsup  [50000 entries]
    ├── unsupBow.feat
    ├── urls_neg.txt
    ├── urls_pos.txt
    └── urls_unsup.txt
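The entry counts shown in the tree can be reproduced with a few lines of standard-library Python. The following is only a sketch: `summarize_tree` is a hypothetical helper, not part of Bridge, and it runs against a miniature stand-in directory so that it works without the real download.

```python
import tempfile
from pathlib import Path


def summarize_tree(root: Path) -> dict[str, int]:
    """Map each immediate subdirectory of `root` to the number of entries it contains."""
    return {
        child.name: sum(1 for _ in child.iterdir())
        for child in sorted(root.iterdir())
        if child.is_dir()
    }


# Build a miniature stand-in for aclImdb/train so the sketch runs without the archive.
with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train"
    for class_name, n_files in [("neg", 2), ("pos", 3), ("unsup", 1)]:
        (train / class_name).mkdir(parents=True)
        for i in range(n_files):
            (train / class_name / f"{i}_0.txt").write_text("review text")
    (train / "labeledBow.feat").write_text("")  # plain files are ignored
    demo_counts = summarize_tree(train)

print(demo_counts)
```

Pointing `summarize_tree` at the extracted `aclImdb/train` directory should report the counts listed in the tree.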

In the next steps, we will learn how to load this dataset into BridgeDS.

DatasetProvider#

The recommended way to create Bridge Datasets is by using DatasetProviders. They implement a single method, provider.build_dataset().

Here is the outline:

class YourDatasetProvider(DatasetProvider):
    def __init__(self, dataset_dir, download=False):
        """
        Load the original dataset. This usually means downloading the dataset from a source, storing samples in a list, etc.
        Remember that in Bridge it's enough to store references to your data, not necessarily the actual data.
        """
        super().__init__(dataset_dir, download)

    def build_dataset(self, display_engine=None, cache_mechanisms=None):
        """
        Convert the dataset from raw format into our own Dataset type.

        Parameters:
        - display_engine (DisplayEngine): The display engine to use for visualization.
        - cache_mechanisms (Dict[str, CacheMechanism | None] | None): Cache mechanisms for different types of elements.
        NOTE: Learn more about cache mechanisms and display engines in more advanced tutorials.
        """
        # Implement dataset building logic here
        pass

Let’s start by writing the basic layout of the class, and the __init__:

[3]:

import os

from bridge.primitives.dataset import SingularDataset
from bridge.providers import DatasetProvider


class LargeMovieReviewDataset(DatasetProvider):
    dataset_url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

    def __init__(self, root: str | os.PathLike, split: str = "train", download: bool = False):
        root = Path(root)

        if download:
            if (root / "aclImdb_v1.tar.gz").exists():
                print("Archive file aclImdb_v1.tar.gz already exists, skipping download.")
            else:
                download_and_extract_archive(self.dataset_url, str(root))
        self._split_root = root / "aclImdb" / split

    def build_dataset(
        self,
        display_engine=None,
        cache_mechanisms=None,
    ) -> SingularDataset:
        pass

Now we can instantiate this provider and verify that it points to the right directory:

[4]:

provider = LargeMovieReviewDataset(TMP_NOTEBOOK_ROOT / "imdb", split="train", download=False)
provider._split_root

[4]:

PosixPath('/tmp/bridge-ds/tutorials/imdb/aclImdb/train')

[5]:

os.listdir(provider._split_root)

[5]:

['urls_unsup.txt',
 'labeledBow.feat',
 'unsup',
 'urls_pos.txt',
 'pos',
 'unsupBow.feat',
 'urls_neg.txt',
 'neg']
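Note that the provider as written accepts any string for `split`, so a typo only surfaces later, when the directory is iterated. One possible guard is sketched below; `KNOWN_SPLITS` and `resolve_split_root` are hypothetical names introduced for illustration, and the list of splits is an assumption based on the file tree shown earlier.

```python
import tempfile
from pathlib import Path

# Assumption: the archive ships exactly these two splits, as in the file tree.
KNOWN_SPLITS = ("train", "test")


def resolve_split_root(root: Path, split: str) -> Path:
    """Return root/aclImdb/<split>, raising a descriptive error if it cannot be used."""
    if split not in KNOWN_SPLITS:
        raise ValueError(f"Unknown split {split!r}; expected one of {KNOWN_SPLITS}")
    split_root = Path(root) / "aclImdb" / split
    if not split_root.is_dir():
        raise FileNotFoundError(f"{split_root} does not exist; was the archive extracted?")
    return split_root


typo_error = ""
missing_dir_detected = False
with tempfile.TemporaryDirectory() as tmp:
    # Only the train split exists in this stand-in directory.
    (Path(tmp) / "aclImdb" / "train").mkdir(parents=True)
    found_name = resolve_split_root(Path(tmp), "train").name

    try:
        resolve_split_root(Path(tmp), "trian")  # misspelled split name
    except ValueError as err:
        typo_error = str(err)

    try:
        resolve_split_root(Path(tmp), "test")  # valid name, but not extracted
    except FileNotFoundError:
        missing_dir_detected = True
```

A check like this could be called from `__init__` in place of the plain path concatenation, if early failure is preferred.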

The next step will be to implement build_dataset(), which will load the relevant metadata from this directory into a Bridge Dataset.

Concretely, we will iterate over the class directories and convert every text file into two elements: a text element and a class label element. A SingularDataset treats the text elements as samples and the class label elements as annotations, so we collect them in two separate lists and give both elements of a review the same sample ID:

[6]:

import os
from pathlib import Path

from bridge.primitives.dataset.singular_dataset import SingularDataset
from bridge.primitives.element.data.load_mechanism import LoadMechanism
from bridge.primitives.element.element import Element
from bridge.utils.data_objects import ClassLabel


class LargeMovieReviewDataset(DatasetProvider):
    dataset_url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

    def __init__(self, root: str | os.PathLike, split: str = "train", download: bool = False):
        root = Path(root)

        if download:
            if (root / "aclImdb_v1.tar.gz").exists():
                print("Archive file aclImdb_v1.tar.gz already exists, skipping download.")
            else:
                download_and_extract_archive(self.dataset_url, str(root))
        self._split_root = root / "aclImdb" / split

    def build_dataset(
        self,
        display_engine=None,
        cache_mechanisms=None,
    ) -> SingularDataset:
        samples = []
        annotations = []

        class_dir_list = [d for d in list(self._split_root.iterdir()) if d.is_dir()]
        for class_idx, class_dir in enumerate(sorted(class_dir_list)):
            for textfile in class_dir.iterdir():
                load_mechanism = LoadMechanism.from_url_string(str(textfile), "text")
                text_element = Element(
                    element_id=f"text_{textfile.stem}",
                    sample_id=textfile.stem,
                    etype="text",
                    load_mechanism=load_mechanism,
                )
                load_mechanism = LoadMechanism(ClassLabel(class_idx, class_dir.name), category="obj")
                label_element = Element(
                    element_id=f"label_{textfile.stem}",
                    sample_id=textfile.stem,
                    etype="class_label",
                    load_mechanism=load_mechanism,
                )
                samples.append(text_element)
                annotations.append(label_element)

        return SingularDataset.from_lists(
            samples, annotations, display_engine=display_engine, cache_mechanisms=cache_mechanisms
        )

There’s quite a bit of code here, so let’s break it down a little:

class_dir_list = [d for d in list(self._split_root.iterdir()) if d.is_dir()]
for class_idx, class_dir in enumerate(sorted(class_dir_list)):
    for textfile in class_dir.iterdir():

Create a nested loop: for every class directory, we iterate over all the text files it contains.
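Sorting the directories before enumerating them is what makes the class indices deterministic, since directory listing order is not guaranteed (compare the `os.listdir` output earlier). A minimal sketch of the resulting mapping, using a hypothetical helper `class_index_map` and a stand-in directory rather than the real dataset:

```python
import tempfile
from pathlib import Path


def class_index_map(split_root: Path) -> dict[str, int]:
    """Assign indices to class directories in sorted (alphabetical) order."""
    class_dirs = sorted(d for d in split_root.iterdir() if d.is_dir())
    return {class_dir.name: idx for idx, class_dir in enumerate(class_dirs)}


with tempfile.TemporaryDirectory() as tmp:
    split_root = Path(tmp)
    # Create the directories in a deliberately unsorted order.
    for name in ("unsup", "pos", "neg"):
        (split_root / name).mkdir()
    (split_root / "urls_pos.txt").write_text("")  # files are skipped by is_dir()
    index_map = class_index_map(split_root)

print(index_map)
```

Because `unsup` sorts last, `neg` and `pos` receive the same indices in the train split as they would in the test split, which has no `unsup` directory.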

load_mechanism = LoadMechanism.from_url_string(str(textfile), 'text')
text_element = Element(
    element_id=f"text_{textfile.stem}",
    sample_id=textfile.stem,
    etype='text',
    load_mechanism=load_mechanism,
)

Create a LoadMechanism that points to our text file.
Create the text element. This means defining a unique element ID and a sample ID, and passing in the load mechanism we just defined.

load_mechanism = LoadMechanism(ClassLabel(class_idx, class_dir.name), category='obj')
label_element = Element(
    element_id=f"label_{textfile.stem}",
    sample_id=textfile.stem,
    etype='class_label',
    load_mechanism=load_mechanism,
)

The LoadMechanism in this case will simply store the class label we define in-memory, rather than a URL.
The element is defined with a different element id than the one above, but the same sample id, so we know they relate.
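The difference between the two load mechanisms, a reference that is read on demand versus a value held in memory, can be illustrated without Bridge. The classes below are hypothetical stand-ins written for this sketch; they are not Bridge's LoadMechanism and make no claim about how it is implemented.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass(frozen=True)
class LazyTextRef:
    """Hypothetical stand-in: remembers where the text lives and reads it only on demand."""

    path: Path

    def load(self) -> str:
        return self.path.read_text(encoding="utf-8")


@dataclass(frozen=True)
class InMemoryValue:
    """Hypothetical stand-in: holds a small object directly, with nothing to fetch."""

    value: Any

    def load(self) -> Any:
        return self.value


with tempfile.TemporaryDirectory() as tmp:
    review_path = Path(tmp) / "3944_3.txt"
    review_path.write_text("A short review.", encoding="utf-8")

    text_ref = LazyTextRef(review_path)  # stores only the path; the file is not read yet
    label = InMemoryValue("neg")         # stores the value itself

    loaded_text = text_ref.load()        # the file is read here
    loaded_label = label.load()
```

Storing references keeps the dataset object small even with tens of thousands of reviews, which is the point made in the `__init__` docstring of the outline.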

NOTE: Each sample consists of one text element and one label element, because we’re doing a classification task. In the SingularDataset regime, these are separated into samples (text) and annotations (class labels).

Let’s create a Bridge Dataset and see what we’ve got:

[7]:

from bridge.display.basic import SimplePrints

ds = LargeMovieReviewDataset(TMP_NOTEBOOK_ROOT / "imdb", split="train", download=False).build_dataset(
    display_engine=SimplePrints()
)
ds

[7]:

Dataset: {'n_samples': 75000, 'n_class_label': 75000, 'n_text': 75000}

[8]:

ds.samples.head(3)

[8]:

		element_type	data	category
sample_id	element_id
3944_3	text_3944_3	text	/tmp/bridge-ds/tutorials/imdb/aclImdb/train/ne...	text
11343_4	text_11343_4	text	/tmp/bridge-ds/tutorials/imdb/aclImdb/train/ne...	text
11_3	text_11_3	text	/tmp/bridge-ds/tutorials/imdb/aclImdb/train/ne...	text

[9]:

ds.annotations.head(3)

[9]:

		element_type	data	category
sample_id	element_id
3944_3	label_3944_3	class_label	neg	obj
11343_4	label_11343_4	class_label	neg	obj
11_3	label_11_3	class_label	neg	obj

[10]:

ds.annotations.data.value_counts()

[10]:

data
unsup    50000
neg      12500
pos      12500
Name: count, dtype: int64

[11]:

sample = ds.iget(0)
sample.data  # SingularSample exposes the sample element data directly

[11]:

'The film My Name is Modesty is based around an episode that takes up about one page in the 10th modesty Blaise novel called Night of the Morningstar. It describes an incident in which the young Modesty (17 in the book, mid twenties in the film)asserts her leadership in a war over a casino. As this is set before the actual Blaise adventures her trusted sidekick Willi Garvin is not in the film. That is one of the main problems as the relationship between Blaise and Garvin was certainly always one of the fascinating aspects of the novels and the long running comic strip. The other problem is that the film is quite simply incredibly boring because it really is just one small episode blown up into a screenplay. The casting is okay but Alexandra Staden is not really convincing as the heroine and actually too old for the role to play the young Modesty. I get the impression that this film was a quick and dirty solution as not to lose the rights to the Blaise franchise.'

[12]:

ds.select_samples(lambda samples, anns: samples.index[:2]).show()

{
    "Sample ID 3944_3": {
        "Elements": {
            "etype=text": [
                {
                    "element_id": "text_3944_3",
                    "element_type": "text",
                    "sample_id": "3944_3",
                    "data": "/tmp/bridge-ds/tutorials/imdb/aclImdb/train/neg/3944_3.txt",
                    "category": "text",
                    "is_example": true
                }
            ],
            "etype=class_label": [
                {
                    "element_id": "label_3944_3",
                    "element_type": "class_label",
                    "sample_id": "3944_3",
                    "data": "neg",
                    "category": "obj",
                    "is_example": false
                }
            ]
        }
    }
}

{
    "Sample ID 11343_4": {
        "Elements": {
            "etype=text": [
                {
                    "element_id": "text_11343_4",
                    "element_type": "text",
                    "sample_id": "11343_4",
                    "data": "/tmp/bridge-ds/tutorials/imdb/aclImdb/train/neg/11343_4.txt",
                    "category": "text",
                    "is_example": true
                }
            ],
            "etype=class_label": [
                {
                    "element_id": "label_11343_4",
                    "element_type": "class_label",
                    "sample_id": "11343_4",
                    "data": "neg",
                    "category": "obj",
                    "is_example": false
                }
            ]
        }
    }
}

We have an operational Bridge Dataset, which we can manipulate as we see fit.

For example, observe the tree above, and note that the unsup dir only exists in the training set, and not in the test set. This is because this “class” is not a class per se, but rather unlabeled data which is included in our archive.

Let’s use a simple selection to clear out all samples of this class from our dataset:

[13]:

ds = ds.select_samples(lambda samples, anns: anns[anns.data != "unsup"].index.get_level_values("sample_id"))

ds.annotations.data.value_counts()

[13]:

data
neg    12500
pos    12500
Name: count, dtype: int64
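If the unlabeled reviews are never needed, an alternative is to skip the `unsup` directory while building the dataset, so that its 50,000 elements are never created. This is a sketch of that variation, not the tutorial's method; `LABELED_CLASSES` and `labeled_class_dirs` are hypothetical names, and the set of labeled directories is an assumption based on the file tree.

```python
import tempfile
from pathlib import Path

# Assumption: only these directories hold labeled reviews, as in the file tree.
LABELED_CLASSES = ("neg", "pos")


def labeled_class_dirs(split_root: Path, classes=LABELED_CLASSES) -> list[Path]:
    """Return the class directories to iterate over, sorted, skipping everything else."""
    return sorted(d for d in split_root.iterdir() if d.is_dir() and d.name in classes)


with tempfile.TemporaryDirectory() as tmp:
    split_root = Path(tmp)
    for name in ("unsup", "pos", "neg"):
        (split_root / name).mkdir()
    kept_names = [d.name for d in labeled_class_dirs(split_root)]

print(kept_names)
```

Filtering after the build, as done above with `select_samples`, keeps the provider general; filtering at build time trades that generality for less work up front.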

In Summary#

To create our own custom datasets, it’s recommended to use a DatasetProvider.
For SingularDatasets, we create two lists of elements: one for samples and one for annotations. For any other kind of Dataset, we create a single list of elements.
Elements have unique IDs across the Dataset, and share Sample IDs with Elements of the same Sample.
Elements are the low-level objects that hold the raw data, which they access through a LoadMechanism.
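The ID rules summarized here can be checked mechanically before the lists are handed to a dataset. The sketch below uses a hypothetical `ElementRecord` tuple that carries only the ID fields; it is a simplified stand-in, not Bridge's Element class.

```python
from collections import Counter
from typing import NamedTuple


class ElementRecord(NamedTuple):
    """Hypothetical, simplified stand-in holding only the fields relevant to the check."""

    element_id: str
    sample_id: str
    etype: str


def check_pairing(samples, annotations):
    """Return a list of problems; an empty list means the IDs are consistent."""
    problems = []

    # Rule 1: element IDs must be unique across the whole dataset.
    id_counts = Counter(e.element_id for e in [*samples, *annotations])
    problems += [f"duplicate element_id: {eid}" for eid, n in id_counts.items() if n > 1]

    # Rule 2: every annotation must share a sample ID with some sample.
    sample_ids = {e.sample_id for e in samples}
    problems += [
        f"annotation {a.element_id} has no sample {a.sample_id}"
        for a in annotations
        if a.sample_id not in sample_ids
    ]
    return problems


stems = ["3944_3", "11343_4"]
samples = [ElementRecord(f"text_{s}", s, "text") for s in stems]
annotations = [ElementRecord(f"label_{s}", s, "class_label") for s in stems]
ok_problems = check_pairing(samples, annotations)

# An annotation whose sample ID matches no sample is reported.
orphan = ElementRecord("label_0_0", "0_0", "class_label")
bad_problems = check_pairing(samples, annotations + [orphan])
```

The `text_` and `label_` prefixes used in the provider are what keep element IDs unique even though both elements of a review are derived from the same file stem.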

Up Next#

In this tutorial, we’ve used a primitive DisplayEngine called SimplePrints. If you would prefer a more sophisticated one like the Panel one in previous tutorials, continue to the next tutorial where we learn how to create our own DisplayEngine for a text dataset.