trw.datasets.chunked_dataset

Module Contents

Classes

DatasetChunked

Chunked datasets to enable larger-than-RAM datasets to be processed

Functions

write_pickle_simple(file, case)

Write each case (a dictionary of (feature_name, feature_value)) to a file

read_pickle_simple_one(file)

Read a single sample from a chunk

chunk_name(root, base_name, chunk_id)

Name of the data chunk

chunk_samples(root, base_name, samples, nb_samples_per_chunk=50, write_fn=write_pickle_simple, header_fn=None)

Split the cases into chunks that can be loaded quickly and create a per-chunk header for random access

read_whole_chunk(chunk_path, read_fn=read_pickle_simple_one)

Read the whole chunk at once

_read_whole_chunk_sequence(batch)

create_chunk_sequence(root, base_name, nb_chunks, chunk_start=0, nb_workers=0, max_jobs_at_once=None, sampler=trw.train.SamplerRandom(batch_size=1))

Create an asynchronously loaded sequence of chunks

create_chunk_reservoir(root, base_name, nb_chunks, max_reservoir_samples, min_reservoir_samples, chunk_start=0, nb_workers=1, input_sampler=trw.train.SamplerRandom(batch_size=1), reservoir_sampler=None, maximum_number_of_samples_per_epoch=None, max_jobs_at_once=None)

Create a reservoir of chunks loaded asynchronously

trw.datasets.chunked_dataset.write_pickle_simple(file, case)

Write each case (a dictionary of (feature_name, feature_value)) to a file

trw.datasets.chunked_dataset.read_pickle_simple_one(file)

Read a single sample from a chunk
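
The two functions above define the per-case serialization contract used by chunk_samples and the chunk readers: write_fn(file, case) appends one case to an open binary file, and read_fn(file) reads one case back from the current position. A minimal sketch of such a pair, assuming plain pickle serialization (an illustration of the contract, not the library's exact implementation):

    import pickle

    def write_case_sketch(file, case):
        # 'case' is a dictionary of (feature_name, feature_value); append it
        # to the already opened binary file at the current position
        pickle.dump(case, file)

    def read_case_sketch(file):
        # read back exactly one case from the current file position
        return pickle.load(file)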

trw.datasets.chunked_dataset.chunk_name(root, base_name, chunk_id)

Name of the data chunk

Parameters
  • root – the folder where the chunk is stored

  • base_name – the chunk base name

  • chunk_id – the id of the chunk

Returns

the path to the chunk

trw.datasets.chunked_dataset.chunk_samples(root, base_name, samples, nb_samples_per_chunk=50, write_fn=write_pickle_simple, header_fn=None)

Split the cases into chunks that can be loaded quickly. Additionally, create a header containing the file position of each case within a chunk, enabling random access to the cases of a chunk

The header is extremely small, so it can be kept in memory even for very large datasets

Parameters
  • root – the root directory where the chunked cases will be exported

  • base_name – the base name of each chunk

  • samples – the cases. Must be a list of dictionaries of (feature_name, feature_value)

  • nb_samples_per_chunk – the maximum number of cases per chunk

  • write_fn – defines how the cases are serialized

  • header_fn – a function f(sample)->dict that will be used to populate a header

Returns

the number of chunks
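
For illustration, a hedged sketch of exporting a small in-memory dataset to chunks using the signature above; the output directory, base name and sample contents are hypothetical:

    import numpy as np
    from trw.datasets.chunked_dataset import chunk_samples

    # a list of dictionaries of (feature_name, feature_value), as required above
    samples = [
        {'image': np.random.rand(3, 32, 32).astype(np.float32), 'label': i % 10}
        for i in range(200)
    ]

    nb_chunks = chunk_samples(
        root='/tmp/my_chunks',        # hypothetical output directory
        base_name='train',            # hypothetical chunk base name
        samples=samples,
        nb_samples_per_chunk=50)      # 200 samples -> 4 chunks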

trw.datasets.chunked_dataset.read_whole_chunk(chunk_path, read_fn=read_pickle_simple_one)

Read the whole chunk at once
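
A short usage sketch combining chunk_name and read_whole_chunk, reusing the hypothetical directory and base name of the chunking example above:

    from trw.datasets.chunked_dataset import chunk_name, read_whole_chunk

    # locate chunk 0 on disk and load all of its samples at once
    path = chunk_name(root='/tmp/my_chunks', base_name='train', chunk_id=0)
    samples = read_whole_chunk(path)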

trw.datasets.chunked_dataset._read_whole_chunk_sequence(batch)

trw.datasets.chunked_dataset.create_chunk_sequence(root, base_name, nb_chunks, chunk_start=0, nb_workers=0, max_jobs_at_once=None, sampler=trw.train.SamplerRandom(batch_size=1))

Create an asynchronously loaded sequence of chunks

Parameters
  • root – the directory where the chunks are stored

  • base_name – the base name of the chunks

  • nb_chunks – the number of chunks to load

  • chunk_start – the starting chunk

  • nb_workers – the number of workers dedicated to load the chunks

  • max_jobs_at_once – the maximum number of jobs allocated at once

  • sampler – the sampler of the chunks to be loaded

Returns

a sequence
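
A minimal usage sketch, assuming the four hypothetical chunks exported earlier; the exact content of each yielded batch depends on the chunk serialization and the sampler:

    from trw.datasets.chunked_dataset import create_chunk_sequence

    sequence = create_chunk_sequence(
        root='/tmp/my_chunks',   # hypothetical directory containing the chunks
        base_name='train',
        nb_chunks=4,
        nb_workers=2)            # load the chunks asynchronously in background workers

    for batch in sequence:
        # process the loaded chunk data
        ...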

trw.datasets.chunked_dataset.create_chunk_reservoir(root, base_name, nb_chunks, max_reservoir_samples, min_reservoir_samples, chunk_start=0, nb_workers=1, input_sampler=trw.train.SamplerRandom(batch_size=1), reservoir_sampler=None, maximum_number_of_samples_per_epoch=None, max_jobs_at_once=None)

Create a reservoir of chunks loaded asynchronously

Parameters
  • root – the directory where the chunks are stored

  • base_name – the base name of the chunks

  • nb_chunks – the number of chunks to load

  • max_reservoir_samples – the size of the reservoir

  • min_reservoir_samples – the minimum number of samples to be loaded before the sequence can start

  • chunk_start – the starting chunk

  • nb_workers – the number of workers dedicated to load the chunks

  • input_sampler – the sampler of the chunks to be loaded

  • reservoir_sampler – the sampler used for the reservoir

  • maximum_number_of_samples_per_epoch – the maximum number of samples generated before the sequence is interrupted

  • max_jobs_at_once – maximum number of jobs in the input queue

Returns

a sequence
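
As with the plain chunk sequence, a hedged sketch of building a reservoir over the hypothetical chunks above; the reservoir keeps at most max_reservoir_samples samples and is filled asynchronously by the workers:

    from trw.datasets.chunked_dataset import create_chunk_reservoir

    sequence = create_chunk_reservoir(
        root='/tmp/my_chunks',        # hypothetical directory containing the chunks
        base_name='train',
        nb_chunks=4,
        max_reservoir_samples=100,    # upper bound on samples kept in memory
        min_reservoir_samples=20,     # wait for 20 samples before iteration starts
        nb_workers=1)

    for batch in sequence:
        ...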

class trw.datasets.chunked_dataset.DatasetChunked(root, base_name, chunk_id, reader_one_fn=read_pickle_simple_one)

Bases: torch.utils.data.Dataset

Chunked datasets to enable larger-than-RAM datasets to be processed

The idea is to have a very large dataset split into chunks. Each chunk contains N samples. Chunks are loaded in two parts:

  • the sample data: a binary file containing N samples. The samples are only loaded when requested

  • the header: this is loaded when the dataset is instantiated; it contains the header description (e.g., the file offset of each sample) and custom attributes

Each sample within a chunk can be independently loaded

__len__(self)
__getitem__(self, item)
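
Since DatasetChunked derives from torch.utils.data.Dataset, a single chunk can be wrapped in a standard DataLoader. A hedged sketch reusing the hypothetical chunk layout of the examples above; the structure of the items returned by __getitem__ is assumed to be the (feature_name, feature_value) dictionary written at chunking time:

    import torch.utils.data
    from trw.datasets.chunked_dataset import DatasetChunked

    # only the chunk header is read here; sample data is loaded lazily in __getitem__
    dataset = DatasetChunked(root='/tmp/my_chunks', base_name='train', chunk_id=0)

    loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
    for batch in loader:
        ...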