What is that? Who took that picture? When was it taken? Where was it taken? How did they do it? Those questions are answered by metadata - the data about data.
Metadata captures the 'who, what, when, where, why, and how' that is used by Internet searches to filter and locate information and ultimately the corresponding data.
These data might be the actual video from which this image was grabbed, the biological or geological descriptions of what is seen, the captured data of the sensors on the vehicles, and the identification of the black 'smoke' seen billowing from this geological feature.
OER is committed to meeting its mandate to make the data it manages discoverable, accessible, and understandable to the general public, for the benefit of science and education and, ultimately, a better understanding of the world's ocean.
Metadata is the key that unlocks discoverability of these data.
How Does Metadata Fit into the Picture?
As part of the stewardship of data from OER-sponsored expeditions, the data must be delivered to the NOAA Archives, accompanied by metadata. The OER Integrated Product Team (IPT) for Data Management regularly archives data from OER-sponsored expeditions into two of the three NOAA National Data Centers, the National Oceanographic Data Center (NODC) in Silver Spring, MD, and the National Geophysical Data Center (NGDC) in Boulder, CO, and into the NOAA Central Library (NCL) in Silver Spring, MD. The NODC archives oceanographic, biological, and environmental data; the NGDC archives geological and geophysical data; and the NCL archives multimedia data and products: video, images, reports, and publications.
Metadata can be generated in a variety of formats, and each data repository may have its own preferences or requirements for the formats it will accept. For example, the NCL uses a Library of Congress metadata standard called MAchine-Readable Cataloging (MARC). The NOAA National Data Centers, of which NODC and NGDC are two, accept standards sanctioned by the Federal Geographic Data Committee (FGDC): the Content Standard for Digital Geospatial Metadata (CSDGM) and metadata standards from the International Organization for Standardization (ISO).
Regardless of the format, however, all of these metadata records can also be represented in an eXtensible Markup Language (XML) format. XML is a 'meta-language' used to represent information of many types and is readable by humans and by computers alike.
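As a rough illustration of why XML suits both people and programs, the following Python sketch writes a tiny record to XML and reads it back. The element names and values are hypothetical placeholders chosen for this example; they do not come from MARC, CSDGM, or ISO.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names and placeholder values, for illustration only.
record = {
    "who": "Example Expedition Team",
    "what": "Still image grabbed from dive video",
    "when": "2000-01-01",
    "where": "Example hydrothermal vent site",
}


def to_xml(fields):
    """Serialize a flat dict of metadata fields as a small XML document."""
    root = ET.Element("metadata")
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Read the fields back, as a program consuming the record would."""
    return {child.tag: child.text for child in ET.fromstring(text)}


xml_text = to_xml(record)
print(xml_text)            # readable by a person
print(from_xml(xml_text))  # and recoverable by a program
```

The same text can be opened in an editor by a person and parsed by software, which is what lets one record serve both audiences.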
For the OER Data Management Project, the IPT currently has several processes that produce the metadata that are needed to archive data to the NOAA Data Centers and the NOAA Central Library.
- MARC metadata templates exist for each type of data the NCL will archive. For some of the individual data items submitted to the Library, these templates are modified manually using XML editing software. For the Okeanos Explorer, the IPT has developed an application called the Video Metadata System (ViMS), which programmatically reads information from the video file names, embedded video metadata, and file properties to generate the MARC metadata automatically.
- Another application developed by the IPT, the Cruise Information Management System (CIMS), captures metadata for data-producing activities during the mission. CIMS can then programmatically produce CSDGM metadata in an XML format.
- The IPT has developed an application that converts folders of oceanographic and meteorological ASCII files from the integrated, system-monitoring NOAA Scientific Computing System (SCS) into archive-ready, compressed NetCDF files for archiving at NODC. Because the metadata is embedded in each file, no external metadata is required.
- Finally, the IPT is working to develop ISO metadata templates to be used in generating ISO collection-level and dataset-level metadata records. Currently this is a manual process using XML editing software and the finalized templates.
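To give a feel for the kind of automation described in the first item above, here is a minimal Python sketch that derives descriptive fields from a video file name. It is not the ViMS implementation: the naming convention, the `describe_video` function, and the output field names are all assumptions made for this example.

```python
import re
from datetime import datetime, timezone

# Hypothetical naming convention: <cruise>_VID_<UTC timestamp>Z_<camera>.<ext>
FILENAME_PATTERN = re.compile(
    r"^(?P<cruise>[A-Z]{2}\d{4})_VID_"
    r"(?P<stamp>\d{8}T\d{6})Z_"
    r"(?P<camera>[A-Za-z0-9]+)\.(?P<ext>\w+)$"
)


def describe_video(filename):
    """Derive descriptive fields from a video file name, or raise ValueError."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename}")
    start = datetime.strptime(match["stamp"], "%Y%m%dT%H%M%S").replace(
        tzinfo=timezone.utc
    )
    return {
        "cruise_id": match["cruise"],
        "camera": match["camera"],
        "format": match["ext"].lower(),
        "start_time": start.isoformat(),
        "title": f"{match['cruise']} video from {match['camera']}, {start:%Y-%m-%d}",
    }


# Placeholder file name that follows the hypothetical convention above.
fields = describe_video("EX0000_VID_20000101T120000Z_CAM1.mov")
print(fields)
```

Fields extracted this way could then be written into a metadata template, so that a person only reviews the record rather than typing it.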
A Case Study for ISO
The OER IPT recognizes that the metadata generation process needs to be streamlined. It is working on solutions that give it full flexibility for its metadata needs while doing the most to make OER data fully documented, discoverable, and accessible to everyone. For this reason, the ISO standard has been chosen as the preferred metadata format for the OER Data Management Project. ISO can hold much more comprehensive information about the data it describes, and once the ISO record has been developed, the other formats can be generated from it through programmatic transforms as they are needed. Most important, however, is ISO's ability to link metadata records together through a parent-child relationship (see Figure 1).
Figure 1: ISO Parent-Child Relationship: Child Records point up to their Parent Records
Furthermore, these parent-child relationships can be built in a tiered hierarchical structure. These capabilities provide the metadata with much more context and allow users to locate related metadata records during searches (see Figure 2).
Figure 2: ISO Hierarchical Example for OER. EX* = Okeanos Explorer, AO* = Announcement of Opportunity
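The tiered parent-child structure can be sketched in a few lines of Python. This is a simplified model, not ISO XML: the record identifiers, titles, and the `ancestors` and `children` helpers are hypothetical and exist only to show how child-to-parent pointers let a search move from one record to its related records.

```python
# Each record stores only its parent's identifier, mirroring how a child
# record points up to its parent. All identifiers here are hypothetical.
records = {
    "OER-program": {"title": "Top-level collection", "parent": None},
    "EX-cruise-A": {"title": "Hypothetical cruise", "parent": "OER-program"},
    "EX-cruise-A-ctd": {"title": "Hypothetical CTD dataset", "parent": "EX-cruise-A"},
    "EX-cruise-A-video": {"title": "Hypothetical video dataset", "parent": "EX-cruise-A"},
}


def ancestors(record_id, catalog):
    """Follow parent pointers upward and return the chain of ancestor ids."""
    chain = []
    seen = {record_id}
    parent = catalog[record_id]["parent"]
    while parent is not None:
        if parent in seen:
            raise ValueError(f"cycle detected at {parent}")
        chain.append(parent)
        seen.add(parent)
        parent = catalog[parent]["parent"]
    return chain


def children(record_id, catalog):
    """Find the records whose parent pointer names the given record."""
    return sorted(rid for rid, rec in catalog.items() if rec["parent"] == record_id)


print(ancestors("EX-cruise-A-ctd", records))
print(children("EX-cruise-A", records))
```

Because only the child stores the link, a new dataset record can be added later without editing its parent; finding a parent's children is then a search over the catalog.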
Additionally, the ISO capability to build and use component libraries of potentially reusable segments eases metadata maintenance (see Figure 3).
Figure 3: ISO Components: Any reusable metadata segments can be saved as components
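One way to picture a component library is as a lookup table of reusable segments that records refer to by identifier. The Python sketch below models that idea with plain dictionaries; the `{"ref": ...}` convention, the component names, and the `resolve` function are assumptions for illustration and are not the mechanism ISO metadata tools actually use.

```python
import copy

# Hypothetical component library: reusable segments keyed by an identifier.
COMPONENTS = {
    "contact-data-team": {
        "organisation": "Example Data Management Team",
        "role": "pointOfContact",
    },
    "platform-ship": {"platform": "Example research vessel"},
}


def resolve(node, library):
    """Return a copy of node with every {"ref": id} replaced by its component."""
    if isinstance(node, dict):
        if set(node) == {"ref"}:
            return copy.deepcopy(library[node["ref"]])
        return {key: resolve(value, library) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(item, library) for item in node]
    return node


# A record that reuses two components instead of repeating their content.
record = {
    "title": "Hypothetical dataset record",
    "contact": {"ref": "contact-data-team"},
    "acquisition": [{"ref": "platform-ship"}],
}

resolved = resolve(record, COMPONENTS)
print(resolved)
```

The maintenance benefit is that a change to a shared segment, such as a contact, is made once in the library and is picked up by every record that references it the next time the record is resolved.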
A Path Forward
Scientists know that metadata is an important component of their data. The sooner metadata is generated after data are collected, the better it will describe the data. However, generating metadata is considered tedious, time-consuming, and a drain on resources; in addition, there are no definitive tools, and templates are difficult to find.
The OER IPT is committed to easing the burden on scientists and increasing the percentage of OER-sponsored data collections in the archives. One way that the NCEI-based OER Data Management Team (OER DMT) is helping is by using research proposals and/or cruise plans during pre-cruise planning to develop data management plans and a collection-level ISO metadata record, which the project principals can review and provide input to before the cruise takes place.
Then, after the cruise and over a period of time, as data become ready for archiving, data products are generated, reports are written, and publications are published, the metadata for each of these can reference the universally unique identifier (UUID) of the collection-level metadata record generated during planning. The OER DMT is also building a library of templates for collection-level and dataset-level ISO metadata that can be downloaded, filled in, and submitted along with the corresponding data.
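The workflow above, in which later records reference the UUID of a collection-level record created during planning, can be sketched as follows. The template is a heavily simplified, hypothetical stand-in for an ISO template (a real record has many more elements and uses XML namespaces), and the function names are assumptions made for this example.

```python
import uuid
import xml.etree.ElementTree as ET
from string import Template
from xml.sax.saxutils import escape

# Hypothetical, heavily simplified template; not a real ISO record layout.
DATASET_TEMPLATE = Template(
    "<record>\n"
    "  <identifier>$dataset_id</identifier>\n"
    "  <parentIdentifier>$collection_id</parentIdentifier>\n"
    "  <title>$title</title>\n"
    "</record>"
)


def new_collection_id():
    """Create the identifier assigned to the collection-level record."""
    return str(uuid.uuid4())


def dataset_record(collection_id, title):
    """Fill the template for one dataset, pointing it at its collection."""
    uuid.UUID(collection_id)  # raises ValueError if not a well-formed UUID
    return DATASET_TEMPLATE.substitute(
        dataset_id=str(uuid.uuid4()),
        collection_id=collection_id,
        title=escape(title),  # keep characters such as & valid in XML
    )


def parent_of(record_xml):
    """Read back the collection identifier that a dataset record points to."""
    return ET.fromstring(record_xml).findtext("parentIdentifier")


collection_id = new_collection_id()  # generated once, during pre-cruise planning
report = dataset_record(collection_id, "Hypothetical cruise report")
video = dataset_record(collection_id, "Hypothetical dive video & stills")
print(report)
```

Every record produced after the cruise carries the same collection identifier, so a search that finds one of them can lead to the collection and, through it, to the rest.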