trw.train.sequence

Module Contents

Classes

Sequence

A Sequence defines how to iterate the data as a sequence of small batches of data.

SequenceIterator

Functions

remove_nested_list(items)

Flatten a doubly nested list, where items is a single-element list containing a list

Attributes

logger

default_collate_list_of_dicts

trw.train.sequence.logger
trw.train.sequence.remove_nested_list(items)

Flatten a doubly nested list, where items is a single-element list containing a list

trw.train.sequence.default_collate_list_of_dicts
class trw.train.sequence.Sequence(source_split)

A Sequence defines how to iterate the data as a sequence of small batches of data.

To train a deep learning model, it is often necessary to split the original data into small chunks: storing the forward pass of the model for the entire dataset at once is memory hungry, so instead we calculate the forward and backward passes on one small chunk of data at a time. This is the interface for batching a dataset.

Examples:

data = list(range(100))
sequence = SequenceArray({'data': data}).batch(10)
for batch in sequence:
    # do something with our batch of 10 samples
    print(batch['data'])
abstract __iter__(self)
Returns

An iterator of batches

collate(self, collate_fn=default_collate_fn, device=None)

Aggregate the input batch as a dictionary of torch.Tensor and move the data to the appropriate device

Parameters
  • collate_fn – the function to collate the input batch

  • device – the device where to send the samples. If None, the default device is CPU

Returns

a collated sequence of batches
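A pure-Python sketch of what collation does, with plain lists standing in for torch.Tensor (the real collate_fn stacks values into tensors and moves them to the requested device):

```python
# Sketch of collating a list of per-sample dicts into a single batch dict.
# trw's default collate stacks values into torch.Tensor and moves them to
# the requested device; plain Python lists stand in for tensors here.

def collate_list_of_dicts(samples):
    """Merge [{'x': 1}, {'x': 2}] into {'x': [1, 2]}."""
    keys = samples[0].keys()
    return {key: [sample[key] for sample in samples] for key in keys}

batch = collate_list_of_dicts([{'x': 1, 'y': 10}, {'x': 2, 'y': 20}])
# batch == {'x': [1, 2], 'y': [10, 20]}
```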

map(self, function_to_run, nb_workers=0, max_jobs_at_once=None, queue_timeout=default_queue_timeout, collate_fn=None, max_queue_size_pin=None)

Transform a sequence using a given function.

Note

The map may create more samples than the original sequence.

Parameters
  • function_to_run – the mapping function

  • nb_workers – the number of workers that will process the split. If 0, no workers will be created.

  • max_jobs_at_once – the maximum number of results that can be pushed in the result queue at once. If 0, no limit. If None, it will be set equal to the number of workers

  • queue_timeout – the timeout used to pull results from the output queue

  • collate_fn – a function to collate each batch of data

  • max_queue_size_pin – the maximum number of batches prefetched. If None, defaults to a size based on the number of workers. This only controls the final queue size of the pin thread (the workers' queue can be set independently)

Returns

a sequence of batches
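The semantics can be sketched in plain Python; the duplicating lambda below is a hypothetical augmentation illustrating how a map may produce more batches than it receives:

```python
# Sketch of Sequence.map: apply function_to_run to every batch. A mapping
# function may return a single batch or a list of batches, which is how a
# map can create more samples than the original sequence.

def map_sequence(batches, function_to_run):
    for batch in batches:
        result = function_to_run(batch)
        if isinstance(result, list):
            yield from result  # one input batch expanded into several
        else:
            yield result

# hypothetical augmentation duplicating each batch
doubled = list(map_sequence([{'data': [1, 2]}], lambda batch: [batch, batch]))
# doubled contains 2 batches
```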

batch(self, batch_size, discard_batch_not_full=False, collate_fn=default_collate_list_of_dicts)

Group several batches of samples into a single batch

Parameters
  • batch_size – the number of samples of the batch

  • discard_batch_not_full – if True, batches that are not full are discarded

  • collate_fn – a function to collate the batches. If None, no collation performed

Returns

a sequence of batches
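A pure-Python sketch of the grouping behavior, with plain values standing in for trw samples:

```python
# Sketch of Sequence.batch: group consecutive samples into fixed-size
# batches, optionally discarding a trailing partial batch.

def batch_samples(samples, batch_size, discard_batch_not_full=False):
    group = []
    for sample in samples:
        group.append(sample)
        if len(group) == batch_size:
            yield group
            group = []
    if group and not discard_batch_not_full:
        yield group  # trailing batch that is not full

batches = list(batch_samples(range(10), 3))
# batches == [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```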

sub_batch(self, batch_size, discard_batch_not_full=False)

This sequence will split batches into smaller batches if the underlying sequence batch is too large.

It can be useful to manage very large tensors: unlike trw.train.SequenceReBatch, this class avoids concatenating tensors, an operation that can be costly since the tensors must be reallocated. In such cases it may be faster to work on smaller batches and avoid the concatenation cost.

Parameters
  • batch_size – the maximum size of a batch

  • discard_batch_not_full – if True, batches that do not have size batch_size will be discarded
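A sketch of the splitting behavior in plain Python, with lists standing in for batches of tensors:

```python
# Sketch of sub_batch: split each oversized batch into chunks of at most
# batch_size, without ever concatenating across batches (which would force
# tensor reallocation in the real implementation).

def sub_batch(batches, batch_size, discard_batch_not_full=False):
    for batch in batches:
        for start in range(0, len(batch), batch_size):
            chunk = batch[start:start + batch_size]
            if len(chunk) == batch_size or not discard_batch_not_full:
                yield chunk

chunks = list(sub_batch([[1, 2, 3, 4, 5]], 2))
# chunks == [[1, 2], [3, 4], [5]]
```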

rebatch(self, batch_size, discard_batch_not_full=False, collate_fn=default_collate_list_of_dicts)

Normalize a sequence to identical batch size given an input sequence with varying batch size

Parameters
  • batch_size – the size of the batches created by this sequence

  • discard_batch_not_full – if True, the last batch will be discarded if not full

  • collate_fn – function to merge multiple batches
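The normalization can be sketched in plain Python; note the contrast with sub_batch, since rebatch does merge across the original batch boundaries:

```python
# Sketch of rebatch: flatten batches of varying sizes and re-group them into
# batches of identical size, merging across the original batch boundaries
# (unlike sub_batch, which never merges).

def rebatch(batches, batch_size, discard_batch_not_full=False):
    buffer = []
    for batch in batches:
        buffer.extend(batch)
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    if buffer and not discard_batch_not_full:
        yield buffer  # last, possibly partial, batch

normalized = list(rebatch([[1], [2, 3, 4], [5]], 2))
# normalized == [[1, 2], [3, 4], [5]]
```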

max_samples(self, max_samples)
Virtually resize the sequence: iteration terminates once a given number of samples has been produced, and the next iteration restarts where the sequence stopped.

Parameters

max_samples – the number of samples this sequence will produce before stopping
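A minimal sketch of the resume-where-stopped behavior, using a plain iterator in place of a trw sequence:

```python
# Sketch of max_samples: each iteration yields at most max_samples items,
# then the next iteration resumes from where the previous one stopped.

class MaxSamples:
    def __init__(self, samples, max_samples):
        self.iterator = iter(samples)
        self.max_samples = max_samples

    def __iter__(self):
        for _ in range(self.max_samples):
            try:
                yield next(self.iterator)
            except StopIteration:
                return  # underlying sequence exhausted

sequence = MaxSamples(range(5), max_samples=2)
first_epoch = list(sequence)   # [0, 1]
second_epoch = list(sequence)  # [2, 3], resumed where it stopped
```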

async_reservoir(self, max_reservoir_samples, function_to_run, *, min_reservoir_samples=1, nb_workers=1, max_jobs_at_once=None, reservoir_sampler=sampler.SamplerSequential(), collate_fn=remove_nested_list, maximum_number_of_samples_per_epoch=None, max_reservoir_replacement_size=None)

Create a sequence backed by a reservoir of samples: workers asynchronously run function_to_run to fill the reservoir, and batches are drawn from it using reservoir_sampler.
Parameters
  • max_reservoir_samples – the maximum number of samples of the reservoir

  • function_to_run – the function to run asynchronously

  • min_reservoir_samples – the minimum number of samples the reservoir must contain before an output sequence can be created

  • nb_workers – the number of workers that will process function_to_run to fill the reservoir. Must be >= 1

  • max_jobs_at_once – the maximum number of jobs that can be started and stored by epoch by the workers. If 0, no limit. If None: set to the number of workers

  • reservoir_sampler – a sampler that will be used to sample the reservoir or None for sequential sampling of the reservoir

  • collate_fn – a function to post-process the samples into a single batch, or None if not to be collated

  • maximum_number_of_samples_per_epoch – the maximum number of samples that will be generated per epoch. If we reach this maximum, the sequence will be interrupted

  • max_reservoir_replacement_size – Specify the maximum number of samples replaced in the reservoir by epoch. If None, we will use the whole result queue. This can be useful to control explicitly how the reservoir is updated and depend less on the speed of hardware. Note that to have an effect, max_jobs_at_once should be greater than max_reservoir_replacement_size.
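A simplified, synchronous sketch of the reservoir bookkeeping (the real implementation fills the reservoir from a background worker result queue):

```python
# Simplified, synchronous sketch of the reservoir bookkeeping: new samples
# produced by the workers replace the oldest reservoir entries, and
# max_replacement caps how many entries are replaced per epoch.

def refresh_reservoir(reservoir, new_samples, max_reservoir_samples,
                      max_replacement=None):
    if max_replacement is not None:
        new_samples = new_samples[:max_replacement]
    for sample in new_samples:
        if len(reservoir) >= max_reservoir_samples:
            reservoir.pop(0)  # drop the oldest sample
        reservoir.append(sample)
    return reservoir

reservoir = refresh_reservoir([], list(range(10)), max_reservoir_samples=4)
# reservoir == [6, 7, 8, 9]
```

Capping the replacement size makes reservoir refresh independent of how fast the hardware produces new samples, as described for max_reservoir_replacement_size above.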

fill_queue(self)

Fill the job queue of the current sequence

fill_queue_all_sequences(self)

Go through all the sequences and fill their input queue

has_background_jobs(self)
Returns

True if this sequence has a background job to create the next element

has_background_jobs_previous_sequences(self)
Returns

the number of sequences that have background jobs currently running to create the next element

abstract subsample(self, nb_samples)

Sub-sample a sequence to a fixed number of samples.

The purpose is to obtain a smaller sequence; this is particularly useful for exporting samples (e.g., augmented samples).

Parameters

nb_samples – the number of samples desired in the subsampled sequence

Returns

a subsampled Sequence

abstract subsample_uids(self, uids, uids_name, new_sampler=None)

Sub-sample a sequence to samples with specified UIDs.

Parameters
  • uids (list) – the UIDs to keep. If new_sampler preserves ordering, the samples of the resampled sequence will follow the ordering of uids

  • uids_name (str) – the name of the UIDs

  • new_sampler (Sampler) – the sampler to be used for the subsampled sequence. If None, the existing sampler is re-used

Returns

a subsampled Sequence
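A sketch of the selection semantics on a single collated batch ('sample_uid' is a hypothetical UID column name, not an identifier from trw):

```python
# Sketch of subsample_uids semantics on a single collated batch: keep only
# the samples whose UID appears in uids, following the ordering of uids.

def subsample_uids(batch, uids, uids_name):
    index_of = {uid: i for i, uid in enumerate(batch[uids_name])}
    kept = [index_of[uid] for uid in uids if uid in index_of]
    return {name: [column[i] for i in kept] for name, column in batch.items()}

small = subsample_uids({'sample_uid': ['a', 'b', 'c'], 'x': [1, 2, 3]},
                       uids=['c', 'a'], uids_name='sample_uid')
# small == {'sample_uid': ['c', 'a'], 'x': [3, 1]}
```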

abstract close(self)

Close and clean the resources of the sequence
class trw.train.sequence.SequenceIterator

Bases: collections.abc.Iterator

abstract __next__(self)
Returns

The next batch of data

next_item(self, blocking)
Parameters

blocking – if True, block the current thread until the next element is ready

Returns

The next batch of data

close(self)

Special method to close and clean the resources of the sequence