Class Avogadro::Io::Hdf5DataFormat#

class Hdf5DataFormat#

The Hdf5DataFormat class provides access to data stored in HDF5 files.

This class is intended to supplement an existing format reader/writer by providing the option to write large data to an HDF5 file store. The purpose is to keep text format files at a manageable size.

Author

Allison Vacanti

To use this class, open or create an HDF5 file with the openFile method, using the appropriate OpenMode for the intended operation. Data can be written to the file using the writeDataset methods and retrieved using the readDataset methods. When finished, call closeFile to release the file resources from the HDF5 library.
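The open → write → read → close sequence described above might look like the following sketch. It is illustrative only: the header path and file/dataset names are assumptions, error handling is reduced to a simple check, and the program must be linked against the Avogadro IO library. The calls themselves match the signatures documented below.

```cpp
// Illustrative sketch; header path and names are assumptions.
#include <avogadro/io/hdf5dataformat.h>

#include <string>
#include <vector>

using Avogadro::Io::Hdf5DataFormat;

int main()
{
  Hdf5DataFormat hdf5;

  // Create (or truncate) the HDF5 side-car file for writing.
  if (!hdf5.openFile("output.h5", Hdf5DataFormat::ReadWriteTruncate))
    return 1;

  // Write a flat array of doubles to an absolute HDF5 path.
  std::vector<double> energies{1.0, 2.5, 3.75};
  if (!hdf5.writeDataset("/molecule/energies", energies))
    return 1;

  // Read it back; the returned vector holds the dataset's dimensions
  // and is empty on error.
  std::vector<double> restored;
  std::vector<int> dims = hdf5.readDataset("/molecule/energies", restored);
  if (dims.empty())
    return 1;

  // Release the file resources held by the HDF5 library.
  return hdf5.closeFile() ? 0 : 1;
}
```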

A complete set of datasets available in an open file can be retrieved with the datasets() method, and the existence of a particular dataset can be tested with datasetExists(). removeDataset() can be used to unlink an existing dataset from the file, though this will not free any space on disk. The space occupied by an unlinked dataset may be reclaimed by new write operations, but only if they occur before the file is closed.

A convenient thresholding system is implemented to help the accompanying text format writer determine which data is “large” enough to be stored in HDF5. A size threshold (in bytes) may be set with the setThreshold() function (the default is 1KB). A data object may be passed to the exceedsThreshold method to see if the size of the data in the container exceeds the currently set threshold. If so, it should be written into the HDF5 file by writeDataset. If not, it should be serialized into the text file in a suitable format. The thresholding operations are optional; the threshold size does not affect the behavior of the read/write methods and is provided only for convenience.

Public Types

enum OpenMode#

Open modes for use with openFile().

Values:

enumerator ReadOnly#

Open an existing file in read-only mode. The file must exist.

enumerator ReadWriteTruncate#

Create a file in read/write mode, removing any existing file with the same name.

enumerator ReadWriteAppend#

Open a file in read/write mode. If the file exists, its contents will be preserved. If it does not, a new file will be created.

Public Functions

Hdf5DataFormat()#
~Hdf5DataFormat()#

Destructor. Closes any open file before freeing memory.

bool isOpen() const#
Returns:

true if a file is open.

bool openFile(const std::string &filename_, OpenMode mode = ReadWriteAppend)#

Open a file for use by this reader/writer.

Note

Only a single file may be opened at a time. Attempting to open multiple files without calling closeFile() will fail.

Parameters:
  • filename_ – Name of the file to open.

  • mode – OpenMode for the file. Default is ReadWriteAppend.

Returns:

true if the file is successfully opened/created by the HDF5 subsystem, false otherwise.

std::string filename() const#
Returns:

The name of the open file, or an empty string if no file is open.

bool closeFile()#

Close the file and reset the reader/writer. Another file may be opened after calling this function.

Returns:

true if the file is successfully released by the HDF5 subsystem.

void setThreshold(size_t bytes)#

Set the threshold size in bytes used by the exceedsThreshold functions. The threshold can be used to determine which data is considered “large enough” to be stored in HDF5 rather than in the accompanying format.

Parameters:

bytes – The size in bytes for the threshold. Default: 1KB.

size_t threshold() const#
Returns:

The current threshold size in bytes. Default: 1KB.

bool exceedsThreshold(size_t bytes) const#

Test if a data set is “large enough” to be stored in HDF5 format. If this function returns true, the number of bytes tested is larger than the threshold and the data should be written into the HDF5 file. If false, the data should be written into the accompanying format.

Parameters:

bytes – The size of the dataset in bytes

Returns:

true if the size exceeds the threshold set by setThreshold.

bool exceedsThreshold(const MatrixX &data) const#

Test if a data set is “large enough” to be stored in HDF5 format. If this function returns true, the size of the data in the object is larger than the threshold and should be written into the HDF5 file. If false, the data should be written into the accompanying format.

Parameters:

data – Data object to test.

Returns:

true if the size of the serializable data in data exceeds the threshold set by setThreshold.

bool exceedsThreshold(const std::vector<double> &data) const#

Test if a data set is “large enough” to be stored in HDF5 format. If this function returns true, the size of the data in the object is larger than the threshold and should be written into the HDF5 file. If false, the data should be written into the accompanying format.

Parameters:

data – Data object to test.

Returns:

true if the size of the serializable data in data exceeds the threshold set by setThreshold.

bool exceedsThreshold(const Core::Array<double> &data) const#

Test if a data set is “large enough” to be stored in HDF5 format. If this function returns true, the size of the data in the object is larger than the threshold and should be written into the HDF5 file. If false, the data should be written into the accompanying format.

Parameters:

data – Data object to test.

Returns:

true if the size of the serializable data in data exceeds the threshold set by setThreshold.

bool datasetExists(const std::string &path) const#

Test if the currently open file contains a dataset at the HDF5 absolute path path.

Parameters:

path – An absolute path into the HDF5 data.

Returns:

true if the object at path both exists and is a dataset, false otherwise.

bool removeDataset(const std::string &path) const#

Remove a dataset from the currently open file.

Warning

Removing datasets can be expensive in terms of file size: space freed by deletion cannot be reclaimed by HDF5 once the file is closed, and the file will not decrease in size as datasets are removed. For details, see http://www.hdfgroup.org/HDF5/doc/H5.user/Performance.html#Freespace.

Parameters:

path – An absolute path into the HDF5 data.

Returns:

true if the dataset exists and has been successfully removed.

std::vector<int> datasetDimensions(const std::string &path) const#

Find the dimensions of a dataset.

Parameters:

path – An absolute path into the HDF5 data.

Returns:

A vector containing the dimensionality of the data, major dimension first. If an error is encountered, an empty vector is returned.

bool writeDataset(const std::string &path, const MatrixX &data) const#

Write the data to the currently open file at the specified absolute HDF5 path.

Parameters:
  • path – An absolute path into the HDF5 data.

  • data – The data container to serialize to HDF5.

Returns:

true if the data is successfully written, false otherwise.

bool writeDataset(const std::string &path, const std::vector<double> &data, int ndims = 1, size_t *dims = nullptr) const#

Write the data to the currently open file at the specified absolute HDF5 path.

Note

Since std::vector is a flat container, the dimensionality data is only used to set up the dataset metadata in the HDF5 container. Omitting the dimensionality parameters will write a flat array.

Parameters:
  • path – An absolute path into the HDF5 data.

  • data – The data container to serialize to HDF5.

  • ndims – The number of dimensions in the data. Default: 1.

  • dims – The dimensionality of the data, major dimension first. Default: data.size().

Returns:

true if the data is successfully written, false otherwise.

bool writeDataset(const std::string &path, const Core::Array<double> &data, int ndims = 1, size_t *dims = nullptr) const#

Write the data to the currently open file at the specified absolute HDF5 path.

Note

Since this is a flat container, the dimensionality data is only used to set up the dataset metadata in the HDF5 container. Omitting the dimensionality parameters will write a flat array.

Parameters:
  • path – An absolute path into the HDF5 data.

  • data – The data container to serialize to HDF5.

  • ndims – The number of dimensions in the data. Default: 1.

  • dims – The dimensionality of the data, major dimension first. Default: data.size().

Returns:

true if the data is successfully written, false otherwise.

bool readDataset(const std::string &path, MatrixX &data) const#

Populate the data container data with the data found at the specified path in the currently open HDF5 file.

Parameters:
  • path – An absolute path into the HDF5 data.

  • data – The data container into which the HDF5 data will be deserialized. data will be resized to fit the dataset.

Returns:

true if the data is successfully read, false otherwise. If the read fails, the data object may be left in an unpredictable state.

std::vector<int> readDataset(const std::string &path, std::vector<double> &data) const#

Populate the data container data with the data found at the specified path in the currently open HDF5 file.

Parameters:
  • path – An absolute path into the HDF5 data.

  • data – The data container into which the HDF5 data will be deserialized. data will be resized to fit the dataset.

Returns:

A vector containing the dimensionality of the dataset, major dimension first. If an error occurs, an empty vector is returned and the data object may be left in an unpredictable state.

std::vector<int> readDataset(const std::string &path, Core::Array<double> &data) const#

Populate the data container data with the data found at the specified path in the currently open HDF5 file.

Parameters:
  • path – An absolute path into the HDF5 data.

  • data – The data container into which the HDF5 data will be deserialized. data will be resized to fit the dataset.

Returns:

A vector containing the dimensionality of the dataset, major dimension first. If an error occurs, an empty vector is returned and the data object may be left in an unpredictable state.

std::vector<std::string> datasets() const#

Traverse the currently open file and return a list of all dataset objects in the file.

Warning

The list is not cached internally and is recalculated on each call. This may be expensive for large HDF5 files, so external caching is recommended if this information is needed frequently.

Returns:

A list of datasets in the current file.