trw.train.sequence

Module Contents

Classes

Sequence

A Sequence defines how to iterate over the data as a sequence of small batches.

Functions

collate_dicts_pytorch(list_of_dicts)

Collate a list of dictionaries into a batch (i.e., a dictionary of values by feature name)

remove_nested_list(items)

Flatten a doubly nested list, where items is a single-element list containing a list

Attributes

is_windows_platform

logger

default_collate_list_of_dicts

trw.train.sequence.is_windows_platform
trw.train.sequence.logger
trw.train.sequence.collate_dicts_pytorch(list_of_dicts)

Collate a list of dictionaries into a batch (i.e., a dictionary of values by feature name)

Parameters

list_of_dicts – a list of dictionaries

Returns

a batch
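The behaviour can be sketched in pure Python. In this simplified version, plain lists stand in for the stacked torch.Tensor values that the real collate_dicts_pytorch produces:

```python
def collate_dicts(list_of_dicts):
    """Simplified sketch of collating a list of dictionaries into a batch:
    a dictionary of values keyed by feature name. The real function stacks
    the collected values into torch.Tensor rather than keeping plain lists."""
    batch = {}
    for d in list_of_dicts:
        for name, value in d.items():
            batch.setdefault(name, []).append(value)
    return batch

samples = [{'image': 1, 'label': 0}, {'image': 2, 'label': 1}]
batch = collate_dicts(samples)
# batch == {'image': [1, 2], 'label': [0, 1]}
```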

trw.train.sequence.remove_nested_list(items)

Flatten a doubly nested list, where items is a single-element list containing a list
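A minimal sketch consistent with the docstring, assuming items is a single-element list wrapping the actual list of samples:

```python
def remove_nested_list(items):
    """Sketch: unwrap a doubly nested list. items is expected to be a
    single-element list containing the real list; return the inner list."""
    assert len(items) == 1, 'expected a single-element list of lists'
    return items[0]

flat = remove_nested_list([[{'data': 1}, {'data': 2}]])
# flat == [{'data': 1}, {'data': 2}]
```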

trw.train.sequence.default_collate_list_of_dicts
class trw.train.sequence.Sequence(source_split)

A Sequence defines how to iterate over the data as a sequence of small batches.

To train a deep learning model, it is often necessary to split the original data into small chunks: computing the forward pass of the model on the whole dataset at once is too memory hungry, so the forward and backward passes are instead calculated on one small chunk of data at a time. This is the interface for batching a dataset.

Examples:

data = list(range(100))
sequence = SequenceArray({'data': data}).batch(10)
for batch in sequence:
    # do something with our batch
    pass

abstract __iter__(self)
Returns

An iterator of batches

collate(self, collate_fn=utilities.default_collate_fn, device=None)

Aggregate the input batch as a dictionary of torch.Tensor and move the data to the appropriate device

Parameters
  • collate_fn – the function to collate the input batch

  • device – the device the samples are sent to. If None, the default device is the CPU

Returns

a collated sequence of batches

map(self, function_to_run, nb_workers=0, max_jobs_at_once=None, worker_post_process_results_fun=None, queue_timeout=0.1, preprocess_fn=None, collate_fn=None)

Transform a sequence using a given function.

Note

The map may create more samples than the original sequence.

Parameters
  • function_to_run – the mapping function

  • nb_workers – the number of workers that will process the split. If 0, no workers will be created.

  • max_jobs_at_once – the maximum number of results that can be pushed into the result queue at once. If 0, there is no limit. If None, it is set to the number of workers

  • worker_post_process_results_fun – a function used to post-process the worker results (executed by the worker)

  • queue_timeout – the timeout used to pull results from the output queue

  • preprocess_fn – a function that will preprocess the batch just prior to sending it to the other processes

  • collate_fn – a function to collate each batch of data

Returns

a sequence of batches
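The mechanics of mapping, including the note that the mapped sequence may contain more samples than the source, can be sketched with a plain generator (map_sequence is an illustrative name, not the trw implementation; workers and queues are omitted):

```python
def map_sequence(batches, function_to_run):
    """Sketch of map: apply function_to_run to every batch. If the function
    returns a list, each element is yielded as its own batch, so the mapped
    sequence may contain more batches than the source."""
    for batch in batches:
        result = function_to_run(batch)
        if isinstance(result, list):
            yield from result  # the function split the batch into several
        else:
            yield result

# a mapping function that splits each batch into one batch per sample
def split(batch):
    return [{'data': [v]} for v in batch['data']]

source = [{'data': [1, 2]}, {'data': [3, 4]}]
mapped = list(map_sequence(source, split))
# 4 batches, one per sample: [{'data': [1]}, {'data': [2]}, ...]
```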

batch(self, batch_size, discard_batch_not_full=False, collate_fn=default_collate_list_of_dicts)

Group several batches of samples into a single batch

Parameters
  • batch_size – the number of samples of the batch

  • discard_batch_not_full – if True, batches that are not full are discarded

  • collate_fn – a function to collate the batches. If None, no collation is performed

Returns

a sequence of batches
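The grouping and the effect of discard_batch_not_full can be sketched as follows (batch_sequence is an illustrative name; collation is omitted):

```python
def batch_sequence(samples, batch_size, discard_batch_not_full=False):
    """Sketch of batch: group a stream of samples into lists of
    batch_size samples each."""
    chunk = []
    for sample in samples:
        chunk.append(sample)
        if len(chunk) == batch_size:
            yield chunk
            chunk = []
    if chunk and not discard_batch_not_full:
        yield chunk  # the last, possibly partial, batch

batches = list(batch_sequence(range(10), 3))
# 4 batches: [0, 1, 2], [3, 4, 5], [6, 7, 8], [9]
full_only = list(batch_sequence(range(10), 3, discard_batch_not_full=True))
# 3 batches: the partial [9] is discarded
```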

rebatch(self, batch_size, discard_batch_not_full=False, collate_fn=default_collate_list_of_dicts)

Normalize an input sequence with varying batch sizes into a sequence with a fixed batch size

Parameters
  • batch_size – the size of the batches created by this sequence

  • discard_batch_not_full – if True, the last batch will be discarded if not full

  • collate_fn – function to merge multiple batches
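The renormalization can be sketched by buffering samples across the incoming batches and re-emitting fixed-size groups (rebatch here is an illustrative sketch, not the trw implementation; collation is omitted):

```python
def rebatch(batches, batch_size, discard_batch_not_full=False):
    """Sketch of rebatch: flatten batches of varying sizes into a stream
    of samples, then regroup them into batches of exactly batch_size."""
    buffer = []
    for batch in batches:
        buffer.extend(batch)
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    if buffer and not discard_batch_not_full:
        yield buffer  # the last, possibly partial, batch

uneven = [[1], [2, 3, 4], [5, 6]]
even = list(rebatch(uneven, 2))
# [[1, 2], [3, 4], [5, 6]]
```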

async_reservoir(self, max_reservoir_samples, function_to_run, min_reservoir_samples=1, nb_workers=1, max_jobs_at_once=None, reservoir_sampler=sampler.SamplerSequential(), collate_fn=remove_nested_list, maximum_number_of_samples_per_epoch=None)

Create a sequence fed from a reservoir. The purpose of this sequence is to keep the GPU busy with batches of data at the expense of recycling previously processed samples.

Parameters
  • max_reservoir_samples – the maximum number of samples of the reservoir

  • function_to_run – the function to run asynchronously

  • min_reservoir_samples – the minimum number of samples the reservoir must contain before an output sequence can be created

  • nb_workers – the number of workers that will process function_to_run

  • max_jobs_at_once – the maximum number of jobs that can be pushed into the result list at once. If 0, there is no limit. If None, it is set to the number of workers

  • reservoir_sampler – a sampler that will be used to sample the reservoir or None if no sampling needed

  • collate_fn – a function to post-process the samples into a single batch. If None, return the items as they were in source_split

  • maximum_number_of_samples_per_epoch – the maximum number of samples to generate per epoch. When this maximum is reached, the reservoir is not emptied; the sequence is simply interrupted so that it can be restarted.
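The recycling behaviour can be illustrated with a synchronous simplification (reservoir_sequence is a hypothetical name; the real async_reservoir refreshes the reservoir from background workers rather than inline):

```python
import random

def reservoir_sequence(source, function_to_run, max_reservoir_samples,
                       samples_per_epoch, rng=random.Random(0)):
    """Synchronous sketch of a reservoir sequence: keep at most
    max_reservoir_samples processed samples, draw an epoch from them,
    and refresh one slot per step so old samples are recycled."""
    source = iter(source)
    # fill the reservoir with processed samples
    reservoir = [function_to_run(next(source))
                 for _ in range(max_reservoir_samples)]
    for _ in range(samples_per_epoch):
        yield rng.choice(reservoir)
        try:
            # recycle: replace a random slot with a freshly processed sample
            reservoir[rng.randrange(len(reservoir))] = function_to_run(next(source))
        except StopIteration:
            pass  # source exhausted: keep reusing the existing samples

epoch = list(reservoir_sequence(range(100), lambda x: x * x, 10, 5))
# 5 samples drawn from a reservoir of at most 10 processed samples
```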

fill_queue(self)

Fill the queue jobs of the current sequence

fill_queue_all_sequences(self)

Go through all the sequences and fill their input queue

__next__(self)
Returns

The next batch of data

next_item(self, blocking)
Parameters

blocking – if True, fetching the next element blocks the current thread until it is ready

Returns

The next batch of data

has_background_jobs(self)
Returns

True if this sequence has a background job to create the next element

has_background_jobs_previous_sequences(self)
Returns

the number of sequences that have background jobs currently running to create the next element

subsample(self, nb_samples)

Sub-sample a sequence to a fixed number of samples.

The purpose is to obtain a smaller sequence; this is particularly useful for exporting augmented samples.

Parameters

nb_samples – the number of samples to keep from the original sequence

Returns

a subsampled Sequence
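A minimal sketch of the sub-sampling (subsample here is an illustrative helper, not the trw implementation, and the sampling strategy is assumed to be uniform without replacement):

```python
import random

def subsample(samples, nb_samples, rng=random.Random(0)):
    """Sketch: keep nb_samples randomly chosen samples of the sequence."""
    return rng.sample(list(samples), nb_samples)

small = subsample(range(100), 10)
# a fixed-size selection of 10 of the original 100 samples
```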

subsample_uids(self, uids, uids_name, new_sampler=None)

Sub-sample a sequence to samples with specified UIDs.

Parameters
  • uids (list) – the UIDs to keep. If new_sampler preserves ordering, the samples of the resampled sequence follow the ordering of uids

  • uids_name (str) – the name of the UIDs

  • new_sampler (Sampler) – the sampler to be used for the subsampled sequence. If None, the existing sampler is reused

Returns

a subsampled Sequence
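The UID-based selection can be sketched as an index-then-filter pass over the batches (subsample_uids below is an illustrative sketch operating on collated batches, with per-sample UIDs stored under uids_name; samplers are omitted):

```python
def subsample_uids(batches, uids, uids_name):
    """Sketch: keep only the samples whose UID (stored under uids_name in
    each batch) is listed in uids, preserving the ordering of uids."""
    by_uid = {}
    for batch in batches:
        for i, uid in enumerate(batch[uids_name]):
            # extract sample i of the batch as a dict of single values
            by_uid[uid] = {name: values[i] for name, values in batch.items()}
    return [by_uid[uid] for uid in uids]

batches = [{'uid': ['a', 'b'], 'data': [1, 2]}, {'uid': ['c'], 'data': [3]}]
kept = subsample_uids(batches, ['c', 'a'], 'uid')
# kept == [{'uid': 'c', 'data': 3}, {'uid': 'a', 'data': 1}]
```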