trw.datasets

Package Contents

Classes

DatasetChunked

Chunked dataset to enable larger-than-RAM datasets to be processed

Functions

create_mnist_datasset(batch_size=1000, root=None, transforms=None, nb_workers=5, data_processing_batch_size=200, normalize_0_1=False)

create_cifar10_dataset(batch_size=300, root=None, transform_train=None, transform_valid=None, nb_workers=2, data_processing_batch_size=None, normalize_0_1=True)

create_segmentation_voc2012_dataset(batch_size=40, root=None, transform_train=default_voc_transforms(), transform_valid=default_voc_transforms(), nb_workers=2)

Create the VOC2012 segmentation dataset

chunk_samples(root, base_name, samples, nb_samples_per_chunk=50, write_fn=write_pickle_simple, header_fn=None)

Split the cases into chunks that can be loaded quickly, with a header for random access within each chunk

read_pickle_simple_one(file)

Read a single sample from a chunk

write_pickle_simple(file, case)

Write each case, a dictionary of (feature_name, feature_value) pairs, to the file

read_whole_chunk(chunk_path, read_fn=read_pickle_simple_one)

Read the whole chunk at once

create_chunk_sequence(root, base_name, nb_chunks, chunk_start=0, nb_workers=0, max_jobs_at_once=None, sampler=trw.train.SamplerRandom(batch_size=1))

Create an asynchronously loaded sequence of chunks

create_chunk_reservoir(root, base_name, nb_chunks, max_reservoir_samples, min_reservoir_samples, chunk_start=0, nb_workers=1, input_sampler=trw.train.SamplerRandom(batch_size=1), reservoir_sampler=None, maximum_number_of_samples_per_epoch=None, max_jobs_at_once=None)

Create a reservoir of asynchronously loaded chunks

chunk_name(root, base_name, chunk_id)

Name of the data chunk

_read_whole_chunk_sequence(batch)

create_fake_symbols_datasset(nb_samples, image_shape, dataset_name, shapes_fn, ratio_valid=0.2, nb_classes_at_once=None, global_scale_factor=1.0, normalize_0_1=True, noise_fn=functools.partial(_noisy, noise_type='poisson'), max_classes=None, batch_size=64, background=255)

Create artificial 2D data for classification and segmentation problems

_random_location(image_shape, figure_shape)

_random_color()

_add_shape(imag, mask, shape, shapes_added, scale_factor, color, min_overlap_distance=30)

_create_image(shape, objects, nb_classes_at_once=None, max_classes=None, background=255)

Create a single image with randomly placed shapes, the corresponding segmentation mask and shape information

_noisy(image, noise_type)

Add noise of the given type to a numpy float image in range [0..255]

create_fake_symbols_3d_datasset(nb_samples, image_shape, ratio_valid=0.2, nb_classes_at_once=None, global_scale_factor=1.0, normalize_0_1=True, noise_fn=functools.partial(_noisy, noise_type='poisson'), shapes_fn=default_shapes_3d, max_classes=None, batch_size=64, background=255, dataset_name='fake_symbols_3d')

Create artificial 3D data for classification and segmentation problems

trw.datasets.create_mnist_datasset(batch_size=1000, root=None, transforms=None, nb_workers=5, data_processing_batch_size=200, normalize_0_1=False)
trw.datasets.create_cifar10_dataset(batch_size=300, root=None, transform_train=None, transform_valid=None, nb_workers=2, data_processing_batch_size=None, normalize_0_1=True)
trw.datasets.create_segmentation_voc2012_dataset(batch_size=40, root=None, transform_train=default_voc_transforms(), transform_valid=default_voc_transforms(), nb_workers=2)

Create the VOC2012 segmentation dataset

Parameters
  • batch_size – the number of samples per batch

  • root – the root of the dataset

  • transform_train – the transform to apply on each batch of data of the training data

  • transform_valid – the transform to apply on each batch of data of the validation data

  • nb_workers – the number of worker processes used to pre-process the batches

Returns

a dict of datasets containing the voc2012 dataset with train and valid splits.
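
A minimal usage sketch (hedged: the nested dict access below assumes the datasets structure described by the return value above, and the feature names inside a batch are not documented here, so they are inspected rather than assumed):

    import trw.datasets

    # Build the VOC2012 segmentation dataset with the documented defaults.
    datasets = trw.datasets.create_segmentation_voc2012_dataset(batch_size=40, nb_workers=2)

    # Iterate the training split of the voc2012 dataset.
    for batch in datasets['voc2012']['train']:
        # `batch` is a dict of features; print the keys to discover them.
        print(sorted(batch.keys()))
        break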

trw.datasets.chunk_samples(root, base_name, samples, nb_samples_per_chunk=50, write_fn=write_pickle_simple, header_fn=None)

Split the cases into chunks that can be loaded quickly. Additionally, create a header containing the file position of each case within a chunk to enable random access to the cases of a chunk

The header is extremely small, so it can be loaded in memory for very large datasets

Parameters
  • root – the root directory where the chunked cases will be exported

  • base_name – the base name of each chunk

  • samples – the cases. Must be a list of dictionaries of (feature_name, feature_value)

  • nb_samples_per_chunk – the maximum number of cases per chunk

  • write_fn – defines how the cases are serialized

  • header_fn – a function f(sample)->dict that will be used to populate a header

Returns

the number of chunks
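
A minimal sketch of chunking an in-memory list of cases (the output directory, base name and feature names are hypothetical):

    import os
    import numpy as np
    import trw.datasets

    # Each case is a dictionary of (feature_name, feature_value), as required above.
    samples = [
        {'image': np.random.rand(1, 28, 28).astype(np.float32), 'label': i % 10}
        for i in range(200)
    ]

    os.makedirs('/tmp/chunks', exist_ok=True)
    nb_chunks = trw.datasets.chunk_samples(
        root='/tmp/chunks',       # where the chunk files are exported
        base_name='demo',         # base name of each chunk file
        samples=samples,
        nb_samples_per_chunk=50,  # at most 50 cases per chunk
    )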

class trw.datasets.DatasetChunked(root, base_name, chunk_id, reader_one_fn=read_pickle_simple_one)

Bases: torch.utils.data.Dataset

Chunked dataset to enable larger-than-RAM datasets to be processed

The idea is to have a very large dataset split into chunks. Each chunk contains N samples. Chunks are loaded in two parts:

  • the sample data: a binary file containing N samples. The samples are only loaded when requested

  • the header: this is loaded when the dataset is instantiated; it contains the header description (e.g., the file offset of each sample) and custom attributes

Each sample within a chunk can be independently loaded

__len__(self)
__getitem__(self, item)
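
A minimal sketch reading back one chunk written by chunk_samples (the paths mirror the hypothetical example above):

    import trw.datasets

    # Only the chunk header is loaded here; samples are read lazily on indexing.
    dataset = trw.datasets.DatasetChunked(root='/tmp/chunks', base_name='demo', chunk_id=0)
    print(len(dataset))  # number of cases stored in this chunk
    case = dataset[0]    # dict of (feature_name, feature_value) for the first case
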
trw.datasets.read_pickle_simple_one(file)

Read a single sample from a chunk

trw.datasets.write_pickle_simple(file, case)

Write each case, a dictionary of (feature_name, feature_value) pairs, to the file

trw.datasets.read_whole_chunk(chunk_path, read_fn=read_pickle_simple_one)

Read the whole chunk at once
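
A hedged round-trip sketch for the low-level serialization helpers, assuming `file` is an open binary file object and using a hypothetical path:

    import trw.datasets

    path = '/tmp/chunks/demo_cases.pkl'  # hypothetical path

    with open(path, 'wb') as f:
        trw.datasets.write_pickle_simple(f, {'value': 42})
        trw.datasets.write_pickle_simple(f, {'value': 43})

    with open(path, 'rb') as f:
        first_case = trw.datasets.read_pickle_simple_one(f)  # read one case

    all_cases = trw.datasets.read_whole_chunk(path)  # read every case of the chunk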

trw.datasets.create_chunk_sequence(root, base_name, nb_chunks, chunk_start=0, nb_workers=0, max_jobs_at_once=None, sampler=trw.train.SamplerRandom(batch_size=1))

Create an asynchronously loaded sequence of chunks

Parameters
  • root – the directory where the chunks are stored

  • base_name – the basename of the chunks

  • nb_chunks – the number of chunks to load

  • chunk_start – the starting chunk

  • nb_workers – the number of workers dedicated to load the chunks

  • max_jobs_at_once – the maximum number of jobs allocated at once

  • sampler – the sampler of the chunks to be loaded

Returns

a sequence
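
A minimal sketch building an asynchronous sequence over the hypothetical chunks written above (the batch layout depends on the configured sampler):

    import trw.datasets

    sequence = trw.datasets.create_chunk_sequence(
        root='/tmp/chunks',
        base_name='demo',
        nb_chunks=4,   # number of chunks to cycle through
        nb_workers=1,  # one background worker loads the chunks
    )

    for batch in sequence:
        pass  # consume asynchronously loaded chunk data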

trw.datasets.create_chunk_reservoir(root, base_name, nb_chunks, max_reservoir_samples, min_reservoir_samples, chunk_start=0, nb_workers=1, input_sampler=trw.train.SamplerRandom(batch_size=1), reservoir_sampler=None, maximum_number_of_samples_per_epoch=None, max_jobs_at_once=None)

Create a reservoir of asynchronously loaded chunks

Parameters
  • root – the directory where the chunks are stored

  • base_name – the basename of the chunks

  • nb_chunks – the number of chunks to load

  • max_reservoir_samples – the size of the reservoir

  • min_reservoir_samples – the minimum number of samples to be loaded before the sequence starts

  • chunk_start – the starting chunk

  • nb_workers – the number of workers dedicated to load the chunks

  • input_sampler – the sampler of the chunks to be loaded

  • reservoir_sampler – the sampler used for the reservoir

  • maximum_number_of_samples_per_epoch – the maximum number of samples generated before the sequence is interrupted

  • max_jobs_at_once – maximum number of jobs in the input queue

Returns

a sequence
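
A minimal sketch of a reservoir that keeps a bounded number of samples in memory (the sizes and paths are illustrative):

    import trw.datasets

    reservoir = trw.datasets.create_chunk_reservoir(
        root='/tmp/chunks',
        base_name='demo',
        nb_chunks=4,
        max_reservoir_samples=500,  # upper bound on samples kept in memory
        min_reservoir_samples=100,  # wait until at least 100 samples are loaded
        nb_workers=2,
    )

    for batch in reservoir:
        pass  # consume samples drawn from the reservoir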

trw.datasets.chunk_name(root, base_name, chunk_id)

Name of the data chunk

Parameters
  • root – the folder where the chunk is stored

  • base_name – the chunk base name

  • chunk_id – the id of the chunk

Returns

the path to the chunk
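
For reference, a short sketch locating a chunk on disk (arguments mirror the hypothetical example above):

    import trw.datasets

    path = trw.datasets.chunk_name(root='/tmp/chunks', base_name='demo', chunk_id=0)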

trw.datasets._read_whole_chunk_sequence(batch)
trw.datasets.create_fake_symbols_datasset(nb_samples, image_shape, dataset_name, shapes_fn, ratio_valid=0.2, nb_classes_at_once=None, global_scale_factor=1.0, normalize_0_1=True, noise_fn=functools.partial(_noisy, noise_type='poisson'), max_classes=None, batch_size=64, background=255)

Create artificial 2D data for classification and segmentation problems

This dataset randomly creates shapes at random locations and with random colors, together with a segmentation map.

Parameters
  • nb_samples – the number of samples to be generated

  • image_shape – the shape of an image [height, width]

  • ratio_valid – the ratio of samples to be used for the validation split

  • nb_classes_at_once – the number of classes to be included in each sample. If None, all the classes will be included

  • global_scale_factor – the scale of the shapes to generate

  • noise_fn – a function to create noise in the image

  • shapes_fn – the function to create the different shapes

  • normalize_0_1 – if True, the data will be normalized (i.e., image & position will be in range [0..1])

  • max_classes – the total number of classes available

  • batch_size – the size of the batch for the dataset

  • background – the background value of the sample (before normalization if normalize_0_1 is True)

  • dataset_name – the name of the returned dataset

Returns

a dict containing the dataset named dataset_name with train and valid splits and the features image, mask, classification, <shape_name>_center

trw.datasets._random_location(image_shape, figure_shape)
trw.datasets._random_color()
trw.datasets._add_shape(imag, mask, shape, shapes_added, scale_factor, color, min_overlap_distance=30)
trw.datasets._create_image(shape, objects, nb_classes_at_once=None, max_classes=None, background=255)
Parameters
  • shape – the shape of an image [height, width]

  • nb_classes_at_once – the number of classes to be included in each sample. If None, all the classes will be included

  • max_classes – the maximum number of classes to be used. If None, all classes can be used, else a random subset

Returns

image, mask and shape information

trw.datasets._noisy(image, noise_type)
Parameters
  • image – a numpy image (float) in range [0..255]

  • noise_type – the type of noise. Must be one of:

      * 'gauss': Gaussian-distributed additive noise

      * 'poisson': Poisson-distributed noise generated from the data

      * 's&p': replaces random pixels with 0 or 1

      * 'speckle': multiplicative noise using out = image + n*image, where n is uniform noise with specified mean & variance

Returns

noisy image
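
A hedged sketch building a noise function the same way as the documented default noise_fn argument (functools.partial over _noisy); the image shape is illustrative:

    import functools
    import numpy as np
    import trw.datasets

    # Same construction as the documented default noise_fn.
    noise_fn = functools.partial(trw.datasets._noisy, noise_type='poisson')

    image = np.full((64, 64), 128.0, dtype=np.float32)  # float image in [0..255]
    noisy_image = noise_fn(image)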

trw.datasets.create_fake_symbols_3d_datasset(nb_samples, image_shape, ratio_valid=0.2, nb_classes_at_once=None, global_scale_factor=1.0, normalize_0_1=True, noise_fn=functools.partial(_noisy, noise_type='poisson'), shapes_fn=default_shapes_3d, max_classes=None, batch_size=64, background=255, dataset_name='fake_symbols_3d')

Create artificial 3D data for classification and segmentation problems

This dataset randomly creates shapes at random locations and with random colors, together with a segmentation map.

Parameters
  • nb_samples – the number of samples to be generated

  • image_shape – the shape of a 3D image [depth, height, width]

  • ratio_valid – the ratio of samples to be used for the validation split

  • nb_classes_at_once – the number of classes to be included in each sample. If None, all the classes will be included

  • global_scale_factor – the scale of the shapes to generate

  • noise_fn – a function to create noise in the image

  • shapes_fn – the function to create the different shapes

  • normalize_0_1 – if True, the data will be normalized (i.e., image & position will be in range [0..1])

  • max_classes – the total number of classes available

  • batch_size – the size of the batch for the dataset

  • background – the background value of the sample (before normalization if normalize_0_1 is True)

  • dataset_name – the name of the returned dataset

Returns

a dict containing the dataset named dataset_name (fake_symbols_3d by default) with train and valid splits and the features image, mask, classification, <shape_name>_center
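
A minimal sketch using the 3D variant, which provides a default shapes_fn; the image shape layout ([depth, height, width]) is an assumption, and the feature name image comes from the documented return value:

    import trw.datasets

    datasets = trw.datasets.create_fake_symbols_3d_datasset(
        nb_samples=100,
        image_shape=[32, 32, 32],  # assumed [depth, height, width]
        batch_size=16,
    )

    for batch in datasets['fake_symbols_3d']['train']:
        images = batch['image']  # other documented features: mask, classification, ...
        break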