cntk.io package¶
CNTK IO utilities.
-
Base64ImageDeserializer(filename, streams)[source]¶ Configures the image reader that reads base64-encoded images and corresponding labels from a file with lines of the form:
[sequenceId <tab>] <numerical label (0-based class id)> <tab> <base64 encoded image>
As with the ImageDeserializer, the sequenceId prefix is optional and can be omitted.
Parameters: filename (str) – file name of the input file that contains images and corresponding labels
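Such a line can be produced with the standard library alone. A minimal sketch (the bytes, label, and sequence id below are made up):

```python
import base64

# Hypothetical raw image bytes (in practice, the contents of a JPEG or PNG file)
image_bytes = b"\x89PNG-fake-image-data"
label = 2          # 0-based class id
seq_id = "seq0"    # optional sequenceId prefix

encoded = base64.b64encode(image_bytes).decode("ascii")
line = "%s\t%d\t%s" % (seq_id, label, encoded)
```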
-
CBFDeserializer(filename, streams={})[source]¶ Configures the CNTK binary-format deserializer.
Parameters: - filename (str) – file name containing the binary data
- streams – any dictionary-like object that contains a mapping from stream names to StreamDef objects. Each StreamDef object configures an input stream.
-
CTFDeserializer(filename, streams)[source]¶ Configures the CNTK text-format reader that reads text-based files with lines of the form:
[Sequence_Id] (Sample)+
where:
Sample=|Input_Name (Value )*
Parameters: - filename (str) – file name containing the text input
- streams – any dictionary-like object that contains a mapping from stream names to StreamDef objects. Each StreamDef object configures an input stream.
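As an illustration, a tiny input file in this format can be written with plain Python (a sketch; the stream names x and y and the file name are made up). The comment shows roughly how such a file would be wired up, assuming the usual StreamDefs/StreamDef pattern:

```python
import os
import tempfile

# Two sequences; each line is: sequence_id <tab> |stream values ...
lines = [
    "0\t|x 0.1 0.2 |y 0",
    "0\t|x 0.3 0.4",        # second sample of sequence 0 ('y' appears only once)
    "1\t|x 0.5 0.6 |y 1",
]
path = os.path.join(tempfile.gettempdir(), "tiny.ctf")
with open(path, "w") as f:
    f.write("\n".join(lines) + "\n")

# A matching reader configuration would look roughly like:
#   CTFDeserializer(path, StreamDefs(
#       x=StreamDef(field='x', shape=2),
#       y=StreamDef(field='y', shape=1)))

with open(path) as f:
    content = f.read().splitlines()
print(len(content))  # 3
```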
-
HTKFeatureDeserializer(streams)[source]¶ Configures the HTK feature reader that reads speech data from scp files.
Parameters: streams – any dictionary-like object that contains a mapping from stream names to StreamDef objects. Each StreamDef object configures a feature stream.
-
HTKMLFDeserializer(label_mapping_file, streams, phoneBoundaries=False)[source]¶ Configures an HTK label reader that reads speech labels in the HTK MLF (Master Label File) format.
Parameters: - label_mapping_file (str) – path to the label mapping file
- streams – any dictionary-like object that contains a mapping from stream names to StreamDef objects. Each StreamDef object configures a label stream.
- phoneBoundaries (bool, defaults to False) – whether phone boundaries should be considered (should be set to True for CTC training, False otherwise)
-
INFINITELY_REPEAT = 18446744073709551615¶ int – constant used to specify that a minibatch scheduling unit should equal the size of a full data sweep.
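The value is the largest unsigned 64-bit integer, i.e. effectively "no limit":

```python
INFINITELY_REPEAT = 18446744073709551615

# The constant equals UINT64_MAX
print(INFINITELY_REPEAT == 2**64 - 1)  # True
```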
-
ImageDeserializer(filename, streams)[source]¶ Configures the image reader that reads images and corresponding labels from a file with lines of the form:
<full path to image> <tab> <numerical label (0-based class id)>
or:
sequenceId <tab> path <tab> label
Parameters: filename (str) – file name of the map file that associates images to classes
-
LatticeDeserializer(lattice_index_file, streams)[source]¶ Configures a lattice deserializer.
Parameters: lattice_index_file (str) – path to the file containing the list of lattice TOC (table of contents) files
-
class MinibatchData(value, num_sequences, num_samples, sweep_end)[source]¶ Bases: cntk.cntk_py.MinibatchData, cntk.tensor.ArrayMixin
Holds a minibatch of input data. This is never created directly, but only returned by MinibatchSource instances.
-
as_sequences(variable=None)[source]¶ Converts the value of this minibatch instance to a sequence of NumPy arrays that have their masked entries removed.
Returns: a list of NumPy arrays if dense, otherwise a SciPy CSR array
-
end_of_sweep¶ Indicates whether the data in this minibatch comes from a sweep end or crosses a sweep boundary (and as a result includes data from different sweeps).
-
is_sparse¶ Whether the data in this minibatch is sparse.
-
mask¶ The mask object of the minibatch. In it, 2 marks the beginning of a sequence, 1 marks a sequence element as valid, and 0 marks it as invalid.
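For illustration, the per-sequence lengths can be recovered from such a mask by counting its non-zero entries (a stdlib-only sketch; the actual mask is returned as a NumPy array):

```python
# Hypothetical mask for a minibatch of two sequences padded to length 4:
# 2 = sequence start, 1 = valid element, 0 = invalid (padding)
mask = [
    [2, 1, 1, 1],  # sequence of length 4
    [2, 1, 0, 0],  # sequence of length 2
]

# Number of valid samples in each sequence
lengths = [sum(1 for m in row if m != 0) for row in mask]
print(lengths)  # [4, 2]
```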
-
num_samples¶ The number of samples in this minibatch.
-
num_sequences¶ The number of sequences in this minibatch.
-
shape¶ The shape of the data in this minibatch, as a tuple.
-
class MinibatchSource(deserializers, max_samples=cntk.io.INFINITELY_REPEAT, max_sweeps=cntk.io.INFINITELY_REPEAT, randomization_window_in_chunks=cntk.io.DEFAULT_RANDOMIZATION_WINDOW_IN_CHUNKS, randomization_window_in_samples=0, randomization_seed=0, trace_level=cntk.logging.get_trace_level(), multithreaded_deserializer=None, frame_mode=False, truncation_length=0, randomize=True, max_errors=0)[source]¶ Bases: cntk.cntk_py.MinibatchSource
Parameters: - deserializers (a single deserializer or a list) – deserializers to be used in the composite reader
- max_samples (int, defaults to cntk.io.INFINITELY_REPEAT) – The maximum number of input samples (not ‘label samples’) the reader can produce. After this number has been reached, the reader returns empty minibatches on subsequent calls to next_minibatch(). max_samples and max_sweeps are mutually exclusive; an exception will be raised if both have non-default values.
- max_sweeps (int, defaults to cntk.io.INFINITELY_REPEAT) – The maximum number of sweeps over the input dataset. After this number has been reached, the reader returns empty minibatches on subsequent calls to next_minibatch(). max_samples and max_sweeps are mutually exclusive; an exception will be raised if both have non-default values.
- randomization_window_in_chunks (int, defaults to cntk.io.DEFAULT_RANDOMIZATION_WINDOW_IN_CHUNKS) – size of the randomization window in chunks; a non-zero value enables randomization. randomization_window_in_chunks and randomization_window_in_samples are mutually exclusive; an exception will be raised if both have non-zero values.
- randomization_window_in_samples (int, defaults to 0) – size of the randomization window in samples; a non-zero value enables randomization. randomization_window_in_chunks and randomization_window_in_samples are mutually exclusive; an exception will be raised if both have non-zero values.
- randomization_seed (int, defaults to 0) – initial randomization seed value (incremented every sweep when the input data is re-randomized).
- trace_level (an instance of cntk.logging.TraceLevel) – the output verbosity level; defaults to the current logging verbosity level given by get_trace_level().
- multithreaded_deserializer (bool) – specifies whether deserialization should be done on a single thread or on multiple threads. Defaults to None, which is effectively “auto” (multithreading is disabled unless an ImageDeserializer is present in the deserializers list). False and True turn multithreading off/on explicitly.
- frame_mode (bool, defaults to False) – switches frame mode on or off. If frame mode is enabled, the input data is processed as individual frames, ignoring all sequence information (this option cannot be used for BPTT; an exception will be raised if frame mode is enabled and the truncation length is non-zero).
- truncation_length (int, defaults to 0) – truncation length in samples; a non-zero value enables truncation (only applicable for BPTT; cannot be used in frame mode, and an exception will be raised if frame mode is enabled and the truncation length is non-zero).
- randomize (bool, defaults to True) – enables or disables randomization; use randomization_window_in_chunks or randomization_window_in_samples to specify the randomization range.
- max_errors (int, defaults to 0) – maximum number of errors in the dataset to ignore
-
current_position¶ Gets/sets the current position in the minibatch source.
Parameters: - getter (Dictionary) – minibatch position on the global timeline.
- setter (Dictionary) – position returned by the getter
-
get_checkpoint_state()[source]¶ Gets the checkpoint state of the MinibatchSource.
Returns: A dict that has the checkpoint state of the MinibatchSource
-
next_minibatch(minibatch_size_in_samples, input_map=None, device=None, num_data_partitions=None, partition_index=None)[source]¶ Reads a minibatch that contains data for all input streams. The minibatch size is specified in terms of #samples and/or #sequences for the primary input stream; a value of 0 for #samples/#sequences means unspecified. If the size is specified in terms of both #sequences and #samples, the smaller of the two is taken. An empty map is returned when the MinibatchSource has no more data to return.
Parameters: - minibatch_size_in_samples (int) – number of samples to retrieve for the next minibatch. Must be > 0.
- input_map (dict) – mapping of Variable to StreamInformation which will be used to convert the returned data.
- device (DeviceDescriptor, defaults to None) – CNTK DeviceDescriptor
- num_data_partitions – used for distributed training; indicates into how many partitions the source should split the data.
- partition_index (int, defaults to None) – used for distributed training; indicates the partition from which to take data.
Returns: A mapping of StreamInformation to MinibatchData if input_map was not specified. Otherwise, the returned value will be a mapping of Variable to MinibatchData. When the maximum number of epochs/samples is exhausted, the return value is an empty dict.
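The num_data_partitions/partition_index pair is how distributed workers each read a disjoint slice of the data. The idea can be sketched in plain Python (an illustration of the concept only, not CNTK's actual chunk-level partitioning):

```python
def take_partition(sequences, num_partitions, partition_index):
    """Round-robin assignment of sequences to partitions (illustration only)."""
    return [s for i, s in enumerate(sequences)
            if i % num_partitions == partition_index]

data = ["seq0", "seq1", "seq2", "seq3", "seq4"]
print(take_partition(data, 2, 0))  # ['seq0', 'seq2', 'seq4']
print(take_partition(data, 2, 1))  # ['seq1', 'seq3']
```

Together, the two partitions cover every sequence exactly once, which is the property a distributed reader needs.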
-
restore_from_checkpoint(checkpoint)[source]¶ Restores the MinibatchSource state from the specified checkpoint.
Parameters: checkpoint (dict) – checkpoint to restore from
-
stream_info(name)[source]¶ Gets the description of the stream with the given name. Throws an exception if there is no stream, or more than one stream, with this name.
Parameters: name (str) – stream name to fetch
Returns: StreamInformation – the information for the given stream name.
-
stream_infos()[source]¶ Describes the streams this minibatch source produces.
Returns: A list of instances of StreamInformation
-
streams¶ Describes the streams ‘this’ minibatch source produces.
Returns: A dict mapping input names to instances of StreamInformation
-
class MinibatchSourceFromData(data_streams, max_samples=18446744073709551615)[source]¶ Bases: cntk.io.UserMinibatchSource
This wraps in-memory data as a CNTK MinibatchSource object (aka “reader”), used to feed the data into a TrainingSession.
Use this if your data is small enough to be loaded into RAM in its entirety, and is already sufficiently randomized.
While CNTK allows user code to iterate through minibatches by itself and feed data minibatch by minibatch through train_minibatch(), the standard way is to iterate through data using a MinibatchSource object. For example, the high-level TrainingSession interface, which manages a full training run including checkpointing and cross validation, operates on this level.
A MinibatchSource created as a MinibatchSourceFromData linearly iterates through the data provided by the caller as numpy arrays or scipy.sparse.csr_matrix objects, without randomization. The data is not copied, so if you want to modify the data while it is being read through a MinibatchSourceFromData, please pass a copy.
Example
>>> N = 5
>>> X = np.arange(3*N).reshape(N,3).astype(np.float32) # 5 rows of 3 values
>>> s = C.io.MinibatchSourceFromData(dict(x=X), max_samples=len(X))
>>> mb = s.next_minibatch(3) # get a minibatch of 3
>>> d = mb[s.streams['x']]
>>> d.data.asarray()
array([[ 0.,  1.,  2.],
       [ 3.,  4.,  5.],
       [ 6.,  7.,  8.]], dtype=float32)
>>> mb = s.next_minibatch(3) # note: only 2 left
>>> d = mb[s.streams['x']]
>>> d.data.asarray()
array([[  9.,  10.,  11.],
       [ 12.,  13.,  14.]], dtype=float32)
>>> mb = s.next_minibatch(3)
>>> mb
{}

>>> # example of a sparse input
>>> Y = np.array([i % 3 == 0 for i in range(N)], np.float32)
>>> import scipy.sparse
>>> Y = scipy.sparse.csr_matrix((np.ones(N,np.float32), (range(N), Y)), shape=(N, 2))
>>> s = C.io.MinibatchSourceFromData(dict(x=X, y=Y)) # also not setting max_samples -> will repeat
>>> mb = s.next_minibatch(3)
>>> d = mb[s.streams['y']]
>>> d.data.asarray().todense()
matrix([[ 0.,  1.],
        [ 1.,  0.],
        [ 1.,  0.]], dtype=float32)
>>> mb = s.next_minibatch(3) # at end only 2 sequences
>>> d = mb[s.streams['y']]
>>> d.data.asarray().todense()
matrix([[ 0.,  1.],
        [ 1.,  0.]], dtype=float32)

>>> # if we do not set max_samples, then it will start over once the end is hit
>>> mb = s.next_minibatch(3)
>>> d = mb[s.streams['y']]
>>> d.data.asarray().todense()
matrix([[ 0.,  1.],
        [ 1.,  0.],
        [ 1.,  0.]], dtype=float32)

>>> # values can also be GPU-side CNTK Value objects (if everything fits into the GPU at once)
>>> s = C.io.MinibatchSourceFromData(dict(x=C.Value(X), y=C.Value(Y)))
>>> mb = s.next_minibatch(3)
>>> d = mb[s.streams['y']]
>>> d.data.asarray().todense()
matrix([[ 0.,  1.],
        [ 1.,  0.],
        [ 1.,  0.]], dtype=float32)

>>> # data can be sequences
>>> import cntk.layers.typing
>>> XX = [np.array([1,3,2], np.float32), np.array([4,1], np.float32)] # 2 sequences
>>> YY = [scipy.sparse.csr_matrix(np.array([[0,1],[1,0],[1,0]], np.float32)), scipy.sparse.csr_matrix(np.array([[1,0],[1,0]], np.float32))]
>>> s = cntk.io.MinibatchSourceFromData(dict(xx=(XX, cntk.layers.typing.Sequence[cntk.layers.typing.tensor]), yy=(YY, cntk.layers.typing.Sequence[cntk.layers.typing.tensor])))
>>> mb = s.next_minibatch(3)
>>> mb[s.streams['xx']].data.asarray()
array([[ 1.,  3.,  2.]], dtype=float32)
>>> mb[s.streams['yy']].data.shape # getting sequences out is messy, so we only show the shape
(1, 3, 2)
Parameters: - data_streams – name-value pairs
- max_samples (int, defaults to cntk.io.INFINITELY_REPEAT) – The maximum number of samples the reader can produce. If inputs are sequences, and the different streams have different lengths, then each sequence counts with the maximum length. After this number has been reached, the reader returns empty minibatches on subsequent calls to next_minibatch().
Returns: An implementation of a cntk.io.MinibatchSource that will iterate through the data.
-
get_checkpoint_state()[source]¶ Gets the checkpoint state of the MinibatchSource.
Returns: A Dictionary that has the checkpoint state of the MinibatchSource
Return type: cntk.cntk_py.Dictionary
-
class StreamConfiguration(name, dim, is_sparse=False, stream_alias='', defines_mb_size=False)[source]¶ Bases: cntk.cntk_py.StreamConfiguration
Configuration of a stream in a text-format reader.
Parameters: - name (str) – name of this stream
- dim (int) – dimensions of this stream. A text-format reader reads data as flat arrays. If you need different shapes you can reshape() it later.
- is_sparse (bool, defaults to False) – whether the provided data is sparse (False by default)
- stream_alias (str, defaults to '') – name of the stream in the file
- defines_mb_size (bool, defaults to False) – whether this stream defines the minibatch size.
-
StreamDef(field=None, shape=None, is_sparse=False, transforms=None, context=None, scp=None, mlf=None, broadcast=None, defines_mb_size=False, max_sequence_length=65535)[source]¶ Configuration of a stream for use with the built-in deserializers. The meaning of some configuration keys depends mildly on the exact deserializer, and certain keys are meaningless for certain deserializers.
Parameters: - field (str, defaults to None) – this is the name of the stream
  - for CTFDeserializer the name is inside the CTF file
  - for ImageDeserializer the acceptable names are image or label
  - for HTKFeatureDeserializer and HTKMLFDeserializer only the default value of None is acceptable
- shape (int or tuple, defaults to None) – dimensions of this stream. HTKFeatureDeserializer, HTKMLFDeserializer, and CTFDeserializer read data as flat arrays. If you need different shapes you can reshape() it later.
- is_sparse (bool, defaults to False) – whether the provided data is sparse. False by default, unless mlf is provided.
- transforms (list, defaults to None) – list of transforms to be applied by the deserializer. Currently only ImageDeserializer supports transforms.
- context (tuple, defaults to None) – left and right context to consider when reading in HTK data. Only supported by HTKFeatureDeserializer.
- scp (str or list, defaults to None) – scp files for HTK data
- mlf (str or list, defaults to None) – mlf files for HTK data
- broadcast (bool, defaults to None) – whether the features in this stream should be broadcast to the whole sequence (useful e.g. for ivectors with HTK)
- defines_mb_size (bool, defaults to False) – whether this stream defines the minibatch size.
- max_sequence_length (int, defaults to 65535) – the upper limit on the length of consumed sequences. Sequences of larger size are skipped.
-
class StreamInformation(name, stream_id, storage_format, dtype, shape, defines_mb_size=False)[source]¶ Bases: cntk.cntk_py.StreamInformation
Stream information container that is used to describe streams when implementing a custom minibatch source through UserMinibatchSource.
Parameters: - name (str) – name of the stream
- stream_id (int) – unique ID of the stream
- storage_format (str) – ‘dense’ or ‘sparse’
- dtype (NumPy type) – data type
- shape (tuple) – shape of the elements
- defines_mb_size (bool, defaults to False) – whether this stream defines the minibatch size when there are multiple streams.
-
name¶
-
class UserDeserializer[source]¶ Bases: cntk.cntk_py.SwigDataDeserializer
UserDeserializer is a base class for all user-defined deserializers. To support deserialization of a custom format, implement the public methods of this class and pass an instance of it to MinibatchSource. A UserDeserializer is a plug-in to MinibatchSource for reading data in custom formats. Reading data through this mechanism provides the following benefits:
- randomization of data too large to fit into RAM, through CNTK’s chunked paging algorithm
- distributed reading - only chunks needed by a particular worker are requested
- composability of transforms (currently composability of user deserializers is not yet supported)
- transparent support of sequence/frame/truncated BPTT modes
- automatic chunk and minibatch prefetch
- checkpointing
The MinibatchSource uses the information provided by this class to build the timeline and move along it when the next minibatch is requested. The deserializer itself, however, is stateless.
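A structural sketch of such a deserializer in plain Python (illustration only: a real implementation subclasses cntk.io.UserDeserializer, returns StreamInformation objects from stream_infos(), and returns numpy/csr_matrix data from get_chunk(); the class and stream names here are made up):

```python
class ListChunkDeserializer:
    """Exposes an in-memory list of rows as fixed-size chunks (sketch only)."""

    def __init__(self, rows, chunk_size):
        self.rows = rows              # e.g. a list of feature vectors
        self.chunk_size = chunk_size

    def num_chunks(self):
        # How many chunks the source can page in independently
        return (len(self.rows) + self.chunk_size - 1) // self.chunk_size

    def stream_infos(self):
        # Sketch: a real implementation returns StreamInformation objects
        return [{"name": "features", "shape": (len(self.rows[0]),)}]

    def get_chunk(self, chunk_id):
        # Dictionary of stream name -> data of the chunk
        start = chunk_id * self.chunk_size
        return {"features": self.rows[start:start + self.chunk_size]}

d = ListChunkDeserializer([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], chunk_size=2)
print(d.num_chunks())    # 2
print(d.get_chunk(1))    # {'features': [[5.0, 6.0]]}
```

Because each chunk can be fetched independently by id, CNTK can randomize, prefetch, and distribute at chunk granularity while the deserializer itself stays stateless.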
-
get_chunk(chunk_id)[source]¶ Should return a dictionary of stream name -> data of the chunk, where data is a csr_matrix/numpy array in sample mode, or a list of csr_matrix/numpy arrays in sequence mode.
Parameters: chunk_id (int) – id of the chunk to be read, 0 <= chunk_id < num_chunks
Returns: dict containing the data
-
stream_infos()[source]¶ Should return a list of StreamInformation objects describing all streams exposed by the deserializer.
Returns: list of StreamInformation exposed by the deserializer
-
class UserMinibatchSource[source]¶ Bases: cntk.cntk_py.SwigMinibatchSource
Base class of all user minibatch sources.
-
get_checkpoint_state()[source]¶ Returns a dictionary describing the current state of the minibatch source. Needs to be overridden if the state of the minibatch source needs to be stored to, and later restored from, a checkpoint.
Returns: dictionary that can later be used with restore_from_checkpoint().
-
is_infinite()[source]¶ Should return true if the user has not specified any limit on the number of sweeps and samples.
-
next_minibatch(num_samples, number_of_workers, worker_rank, device=None)[source]¶ Function to be implemented by the user.
Parameters: - num_samples (int) – number of samples to return
- number_of_workers (int) – number of workers in total
- worker_rank (int) – worker for which the data is to be returned
- device (DeviceDescriptor, defaults to None) – the device descriptor that contains the type and id of the device on which the computation is performed. If None, the default device is used.
Returns: mapping of StreamInformation to MinibatchData
-
restore_from_checkpoint(state)[source]¶ Restores the state of the minibatch source from the given checkpoint.
Parameters: state (dict) – dictionary containing the state
-
stream_info(name)[source]¶ Gets the description of the stream with the given name. Throws an exception if there is no stream, or more than one stream, with this name.
-
stream_infos()[source]¶ Function to be implemented by the user.
Returns: list of StreamInformation instances
-
sequence_to_cntk_text_format(seq_idx, alias_tensor_map)[source]¶ Converts a list of NumPy arrays representing tensors of inputs into a format that is readable by CTFDeserializer.
Parameters: - seq_idx (int) – number of the current sequence
- alias_tensor_map (dict) – maps alias (str) to tensor (ndarray). Tensors are assumed to have dynamic axis.
Returns: String representation in CNTKTextReader format
Return type: str
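The produced text looks like the CTF lines shown under CTFDeserializer. A simplified stdlib-only sketch of the idea, for dense one-dimensional samples only (not the actual CNTK implementation, which also handles sparse and multi-axis tensors):

```python
def to_ctf_lines(seq_idx, alias_tensor_map):
    """Simplified sketch: format one sequence of dense 1-D samples as CTF text.

    alias_tensor_map maps an alias to a list of samples (one list of floats
    per time step). Shorter streams simply stop appearing on later lines.
    """
    num_steps = max(len(samples) for samples in alias_tensor_map.values())
    lines = []
    for t in range(num_steps):
        fields = []
        for alias, samples in sorted(alias_tensor_map.items()):
            if t < len(samples):
                values = " ".join(str(v) for v in samples[t])
                fields.append("|%s %s" % (alias, values))
        lines.append("%d\t%s" % (seq_idx, " ".join(fields)))
    return "\n".join(lines)

print(to_ctf_lines(0, {"x": [[1.0, 2.0], [3.0, 4.0]], "y": [[5.0]]}))
# 0	|x 1.0 2.0 |y 5.0
# 0	|x 3.0 4.0
```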