This document assumes a working familiarity with the UTF-8 encoding of Unicode.
Any reader who is unfamiliar with UTF-8 should read the Wikipedia UTF-8 article (https://en.wikipedia.org/wiki/UTF-8) before proceeding; it provides an excellent primer.
For our context, the most important UTF-8 concepts are:
More specific technical details will only become important if they affect the specifics of your application design or implementation.
H5Pset_char_encoding, which sets the character encoding used for object and attribute names.
For example, the following call sequence could be used to create a dataset with its name encoded with the UTF-8 character set:
    lcpl_id = H5Pcreate(H5P_LINK_CREATE);
    error   = H5Pset_char_encoding(lcpl_id, H5T_CSET_UTF8);
    dset_id = H5Dcreate2(group_id, "datos_ñ", dtype_id, dspace_id,
                         lcpl_id, H5P_DEFAULT, H5P_DEFAULT);
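If the character encoding of an object name is unknown, the combination of an H5Dget_create_plist call and an H5Pget_char_encoding call will reveal that information. The cset field of the H5L_info_t structure filled in by H5Lget_info also reports the encoding of a link name.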
H5Tset_cset, which sets the character encoding to be used in building a character datatype.
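For example, the following commands could be used to create an 8-byte, UTF-8 encoded string datatype for use in either an attribute or dataset. H5Tset_size takes the size in bytes as an integer, so this datatype holds eight characters only when each character occupies a single byte in UTF-8:
    utf8_8char_dtype_id = H5Tcopy(H5T_C_S1);
    error = H5Tset_cset(utf8_8char_dtype_id, H5T_CSET_UTF8);
    error = H5Tset_size(utf8_8char_dtype_id, 8);    /* size in bytes */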
If a character or string datatype’s character encoding is unknown, an H5Tget_cset call can be used to determine it.
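Because the size of a fixed-length string datatype is counted in bytes, text destined for such a datatype may need to be shortened on a character boundary rather than at an arbitrary byte. The following is a minimal sketch in plain C, not part of the HDF5 API. The helper name utf8_fit_bytes is hypothetical, and the code assumes its input is already valid, NUL-terminated UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an HDF5 function): returns the largest byte
 * count <= max_bytes that ends on a UTF-8 character boundary of s.
 * Assumes s is valid, NUL-terminated UTF-8. */
size_t utf8_fit_bytes(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    size_t n;

    if (len <= max_bytes)
        return len;                 /* the whole string fits */

    n = max_bytes;
    /* Back up while s[n] is a continuation byte (10xxxxxx), that is,
     * while cutting at n would split a multi-byte character. */
    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)
        n--;
    return n;
}
```

The returned count could then be used when copying text into a buffer that matches the datatype size, so that a multi-byte character is never cut in half.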
Programmers who are accustomed to using ASCII text without accommodating other text encodings will have to be aware of certain common issues as they begin using UTF-8.
Data from a UTF-8 encoded HDF5 datatype, in either a dataset or an attribute, that has been established within an HDF5 application should “just work” within the HDF5 portions of the application.
Be aware, however, of system or application limitations once data or other information has been extracted from an HDF5 file. The application or system must be designed to accommodate UTF-8 if the information is then used elsewhere in the application or system environment.
When working with Unicode text, one can no longer assume a 1:1 correspondence between the number of characters and the data storage requirement.
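The difference between character count and storage size can be illustrated with a short sketch in plain C. The helper name utf8_char_count is hypothetical and not part of HDF5; it assumes valid, NUL-terminated UTF-8 input and counts Unicode code points, which may still differ from what a user perceives as characters when combining marks are present.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an HDF5 function): counts the Unicode code
 * points in a valid, NUL-terminated UTF-8 string. Every code point
 * starts with exactly one byte that is not a continuation byte
 * (10xxxxxx), so counting those bytes counts the code points. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the dataset name "datos_ñ" used earlier, strlen reports 8 bytes while utf8_char_count reports 7, because ñ is encoded in UTF-8 as the two bytes 0xC3 0xB1. Storage sizes passed to HDF5 should be based on the byte count.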
Mac OS also generally handles UTF-8 correctly, but this must still be investigated.
Windows systems internally use a different Unicode encoding (UCS-2, discussed in this UTF-16 article).
What’s the appropriate thing to say about UTF-8 on Windows?
Do we know that “A carefully designed HDF5 application using UTF-8 encoding within an HDF5 file can be expected to function as expected”?
I have seen references implying that “Windows has the reputation of a somewhat schizophrenic approach to text handling.”
Have we seen a situation where Windows silently used UCS-2 or UTF-16 when UTF-8 had been specified in an HDF5 application?
Have we seen situations where HDF5 applications successfully use UTF-8 encoding on Windows?
For object and attribute names: H5Pset_char_encoding, H5Pget_char_encoding
For dataset and attribute datatypes: H5Tset_cset, H5Tget_cset
UTF-8 article on Wikipedia