trw.train.sequence¶
Module Contents¶
Classes¶
Sequence — A Sequence defines how to iterate the data as a sequence of small batches of data.
Functions¶
collate_dicts_pytorch — Collate a list of dictionaries into a batch (i.e., a dictionary of values by feature name)
remove_nested_list — Remove one level of nesting when items is a single-element list wrapping another list
Attributes¶
- trw.train.sequence.is_windows_platform¶
- trw.train.sequence.logger¶
- trw.train.sequence.collate_dicts_pytorch(list_of_dicts)¶
Collate a list of dictionaries into a batch (i.e., a dictionary of values by feature name)
- Parameters
list_of_dicts – a list of dictionaries
- Returns
a batch
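The collation can be illustrated with a minimal pure-Python sketch. The name collate_sketch is hypothetical and plain lists stand in for the torch.Tensor values the real function produces; only the dictionary-of-values-by-feature-name shape is taken from the documentation above.

```python
# Illustrative sketch of dictionary collation (not the library implementation):
# gather the per-sample values under their feature name. The real function
# returns torch.Tensor values; plain lists keep this sketch dependency-free.
def collate_sketch(list_of_dicts):
    if not list_of_dicts:
        return {}
    return {
        feature_name: [d[feature_name] for d in list_of_dicts]
        for feature_name in list_of_dicts[0]
    }

samples = [{'image': 1, 'label': 0}, {'image': 2, 'label': 1}]
batch = collate_sketch(samples)
# batch == {'image': [1, 2], 'label': [0, 1]}
```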
- trw.train.sequence.remove_nested_list(items)¶
Remove one level of nesting when items is a single-element list wrapping another list
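A minimal sketch of this behavior, assuming the docstring means collapsing a single-element list that wraps another list (the name remove_nested_list_sketch is illustrative, not the library code):

```python
# Illustrative sketch: when items is a one-element list wrapping another
# list, return the inner list; otherwise return items unchanged.
def remove_nested_list_sketch(items):
    if isinstance(items, list) and len(items) == 1 and isinstance(items[0], list):
        return items[0]
    return items

# [[1, 2, 3]] collapses to [1, 2, 3]; an already-flat list is untouched
```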
- trw.train.sequence.default_collate_list_of_dicts¶
- class trw.train.sequence.Sequence(source_split)¶
A Sequence defines how to iterate the data as a sequence of small batches of data.
To train a deep learning model, it is often necessary to split the original data into small chunks: storing the forward pass of the model for the whole dataset at once is memory hungry, so instead we calculate the forward and backward passes on a small chunk of data. This is the interface for batching a dataset.
Examples:
data = list(range(100))
sequence = SequenceArray({'data': data}).batch(10)
for batch in sequence:
    # do something with our batch
    pass
- abstract __iter__(self)¶
- Returns
An iterator of batches
- collate(self, collate_fn=utilities.default_collate_fn, device=None)¶
Aggregate the input batch as a dictionary of torch.Tensor and move the data to the appropriate device
- Parameters
collate_fn – the function to collate the input batch
device – the device to which the samples are sent. If None, the default device is the CPU
- Returns
a collated sequence of batches
- map(self, function_to_run, nb_workers=0, max_jobs_at_once=None, worker_post_process_results_fun=None, queue_timeout=0.1, preprocess_fn=None, collate_fn=None)¶
Transform a sequence using a given function.
Note
The map may create more samples than the original sequence.
- Parameters
function_to_run – the mapping function
nb_workers – the number of workers that will process the split. If 0, no workers will be created.
max_jobs_at_once – the maximum number of results that can be pushed in the result queue at once. If 0, no limit. If None, it will be set equal to the number of workers
worker_post_process_results_fun – a function used to post-process the worker results (executed by the worker)
queue_timeout – the timeout used to pull results from the output queue
preprocess_fn – a function that will preprocess the batch just prior to sending it to the other processes
collate_fn – a function to collate each batch of data
- Returns
a sequence of batches
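The synchronous (nb_workers=0) semantics can be sketched in pure Python. The names map_sketch and double_values are illustrative; the sketch only shows the point made in the note above, namely that a mapping function may return a list of batches and thus produce more samples than it received:

```python
# Sketch of synchronous mapping: run a function on every batch of the
# sequence. A mapping function may return a list of batches, i.e. the
# map may create more samples than the original sequence.
def map_sketch(batches, function_to_run):
    for batch in batches:
        result = function_to_run(batch)
        if isinstance(result, list):
            # one input batch expanded into several output batches
            yield from result
        else:
            yield result

def double_values(batch):
    return {'data': [v * 2 for v in batch['data']]}

out = list(map_sketch([{'data': [1, 2]}, {'data': [3]}], double_values))
# out == [{'data': [2, 4]}, {'data': [6]}]
```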
- batch(self, batch_size, discard_batch_not_full=False, collate_fn=default_collate_list_of_dicts)¶
Group several batches of samples into a single batch
- Parameters
batch_size – the number of samples of the batch
discard_batch_not_full – if True, batches that are not full are discarded
collate_fn – a function to collate the batches. If None, no collation performed
- Returns
a sequence of batches
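The grouping can be sketched with plain lists (batch_sketch is a hypothetical name; the real method works on dictionaries of features and applies collate_fn to each group):

```python
# Sketch of grouping a stream of samples into batches of `batch_size`;
# when discard_batch_not_full is True, the trailing partial batch is dropped.
def batch_sketch(samples, batch_size, discard_batch_not_full=False):
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        if discard_batch_not_full and len(chunk) < batch_size:
            return
        yield chunk

chunks = list(batch_sketch(list(range(25)), 10))
# batch sizes: 10, 10, 5; with discard_batch_not_full=True, only 10, 10
```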
- rebatch(self, batch_size, discard_batch_not_full=False, collate_fn=default_collate_list_of_dicts)¶
Normalize an input sequence with varying batch sizes into a sequence of identically sized batches
- Parameters
batch_size – the size of the batches created by this sequence
discard_batch_not_full – if True, the last batch will be discarded if not full
collate_fn – function to merge multiple batches
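The rebatching idea can be sketched as buffering samples from variable-size batches and re-emitting them in fixed-size groups (rebatch_sketch is a hypothetical name; plain lists stand in for batches of features):

```python
# Sketch of rebatching: flatten variable-size batches into a stream of
# samples, then re-group the stream into fixed-size batches.
def rebatch_sketch(batches, batch_size, discard_batch_not_full=False):
    buffer = []
    for batch in batches:
        buffer.extend(batch)
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    if buffer and not discard_batch_not_full:
        yield buffer  # trailing partial batch

out = list(rebatch_sketch([[1, 2, 3], [4], [5, 6, 7, 8]], 4))
# out == [[1, 2, 3, 4], [5, 6, 7, 8]]
```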
- async_reservoir(self, max_reservoir_samples, function_to_run, min_reservoir_samples=1, nb_workers=1, max_jobs_at_once=None, reservoir_sampler=sampler.SamplerSequential(), collate_fn=remove_nested_list, maximum_number_of_samples_per_epoch=None)¶
Create a sequence fed from a reservoir. The purpose of this sequence is to keep the GPU supplied with batches of data, at the cost of recycling previously processed samples.
- Parameters
max_reservoir_samples – the maximum number of samples of the reservoir
function_to_run – the function to run asynchronously
min_reservoir_samples – the minimum number of samples the reservoir must contain before an output sequence can be created
nb_workers – the number of workers that will process function_to_run
max_jobs_at_once – the maximum number of jobs that can be pushed in the result list at once. If 0, no limit. If None: set to the number of workers
reservoir_sampler – a sampler that will be used to sample the reservoir or None if no sampling needed
collate_fn – a function to post-process the samples into a single batch. If None, return the items as they were in source_split
maximum_number_of_samples_per_epoch – the maximum number of samples per epoch to generate. Reaching this maximum does not empty the reservoir; it simply interrupts the sequence so that it can be restarted.
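The reservoir idea can be illustrated with a small, dependency-free sketch. ReservoirSketch is a hypothetical class, not the library's implementation: it only shows the trade-off described above, where expensive-to-produce samples are kept in a bounded buffer and re-used (recycled) to form batches.

```python
import random

# Conceptual sketch of a sample reservoir: slow workers fill a bounded
# buffer, and batches are drawn from it, possibly re-using a sample many
# times before a fresh one replaces it.
class ReservoirSketch:
    def __init__(self, max_reservoir_samples):
        self.max_reservoir_samples = max_reservoir_samples
        self.samples = []

    def add(self, sample):
        if len(self.samples) < self.max_reservoir_samples:
            self.samples.append(sample)
        else:
            # overwrite an old sample so the reservoir stays bounded
            self.samples[random.randrange(self.max_reservoir_samples)] = sample

    def draw(self, n):
        # a sample may be drawn repeatedly before being replaced
        return [random.choice(self.samples) for _ in range(n)]
```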
- fill_queue(self)¶
Fill the queue jobs of the current sequence
- fill_queue_all_sequences(self)¶
Go through all the sequences and fill their input queue
- __next__(self)¶
- Returns
The next batch of data
- next_item(self, blocking)¶
- Parameters
blocking – if True, block the current thread until the next element is ready
- Returns
The next batch of data
- has_background_jobs(self)¶
- Returns
True if this sequence has a background job to create the next element
- has_background_jobs_previous_sequences(self)¶
- Returns
the number of sequences that have background jobs currently running to create the next element
- subsample(self, nb_samples)¶
Sub-sample a sequence to a fixed number of samples.
The purpose is to obtain a smaller sequence; this is particularly useful for exporting samples, e.g., to inspect augmentations.
- Parameters
nb_samples – the number of samples to keep from the original sequence
- Returns
a subsampled Sequence
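Subsampling can be sketched on a single batch represented as a dictionary of values by feature name (subsample_sketch is a hypothetical name; the real method returns a Sequence):

```python
import random

# Sketch of subsampling: keep nb_samples randomly chosen samples of a
# batch, selecting the same positions across every feature.
def subsample_sketch(batch, nb_samples):
    nb_available = len(next(iter(batch.values())))
    kept = random.sample(range(nb_available), nb_samples)
    return {name: [values[i] for i in kept] for name, values in batch.items()}

small = subsample_sketch({'data': list(range(100))}, 10)
# len(small['data']) == 10
```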
- subsample_uids(self, uids, uids_name, new_sampler=None)¶
Sub-sample a sequence to samples with specified UIDs.
- Parameters
uids (list) – the uids. If new_sampler keeps the ordering, then the samples of the resampled sequence should follow uids ordering
uids_name (str) – the name of the UIDs
new_sampler (Sampler) – the sampler to be used for the subsampled sequence. If None, the existing sampler is re-used
- Returns
a subsampled Sequence
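The UID-based selection can be sketched on a single batch (subsample_uids_sketch is a hypothetical name; plain lists stand in for tensors). As documented above, when the sampler keeps the ordering, the output follows the requested uids ordering:

```python
# Sketch of UID-based subsampling: keep only the samples whose value under
# the `uids_name` feature appears in `uids`, in the requested UID order.
def subsample_uids_sketch(batch, uids, uids_name):
    index_by_uid = {uid: i for i, uid in enumerate(batch[uids_name])}
    kept = [index_by_uid[uid] for uid in uids if uid in index_by_uid]
    return {name: [values[i] for i in kept] for name, values in batch.items()}

batch = {'uid': ['a', 'b', 'c'], 'data': [10, 20, 30]}
sub = subsample_uids_sketch(batch, ['c', 'a'], 'uid')
# sub == {'uid': ['c', 'a'], 'data': [30, 10]}
```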