OPS-DC-SPEC-001
Data Center
Science Requirements
Steven Berukoff, Kevin Reardon, Robert Tawa
Data Center
16 June 2020
| Name | Date |
Approved By: Robert Tawa, Data Center PM | |
Approved By: Kevin Reardon, Tom Schad, Alexandra Tritschler, Friedrich Woeger (DKIST Science Team) | |
Approved By: Scott Wiant, Data Center Senior Computing Analyst | |
Released By: Thomas Rimmele, DKIST Director | |
REVISION SUMMARY:
Date: 12 Sep 2016
Revision: A
Changes: Initial formal document.
Date: 16 June 2020
Revision: B
Changes:
Deleted unused acronym, ATBD.
Deleted R1.02: ancillary data is no longer sent by the DHS to the DC.
Edited R1.03 to remove 8 hr. notice.
Deleted R1.04 Data Policy Enforcement. Requirement is unnecessary and there is no formal DKIST Data Policy.
Edited R1.05, R1.06, R1.07 to remove reference to non-existent DKIST data policy
Added R1.148 to add special access to proprietary data.
Deleted R1.11: the design of data transfer through Globus precludes the necessity of, and the ability to implement, this requirement.
Edited R1.15 requirement to remove reference to external data.
Deleted R1.17 as there is no longer any reason to involve the DQAC in DC operations nor any minimum time to keep data on disk
Deleted R1.18 redundant to requirement R1.17
Edited R1.20 to clarify deletion process
Edited R1.21 to remove document reference and “<<TBD>>”
Deleted R1.23: ancillary data is no longer sent by the DHS to the DC.
Edited R1.24 to remove “permanent and irrevocable”
Edited R1.25 to better reflect the rationale of the requirement
Added R1.146 to add capability lost with the editing of R1.25
Edited R1.26 for clarity
Added R1.147 to add capability related to R1.146
Deleted R1.29 as this requirement is impossible to implement and because loss of data is so improbable it is unnecessary
Edited R1.32 for clarity
Deleted R1.34 as it is a descendant of R1.32 – added to System Requirements.
Edited R1.35 to change name of requirement to “Storage of Calibration Parameters”. Aligned requirement and rationale to this goal.
Edited R1.36 Rationale to remove reference to languages and what is planned
Edited R1.38 Search Parameters – modified the list of search parameters and added explanatory text.
Calibration metadata has been deleted from the list of search parameters. The rationale for R1.38 is to generate a list of common search criteria, and searches based on calibration metadata are not common. The calibration history will be recorded in the FITS headers of each file, and that information is available in the ASDF file associated with each dataset. Using the user tools with the ASDF files, users will be able to search on *any* metadata in the files, including calibration metadata.
Investigator ID was deleted from the search terms. An Investigator ID is generally opaque even to the user it belongs to; it is used like a primary key in a database and is not something a general user can readily look up.
Modified “Target Type – Observed in DKIST Data” to be a goal rather than an explicitly provided search term. Target type could provide a very useful search tool for users to find observations of particular features or events of interest. However, it will not be possible to provide this option at the Data Center launch.
Edited R1.40 and R1.41 to clarify limits on the requirements
Deleted R1.43 Search Metadata from Non-DKIST Data Providers – this requirement was deleted as it is not within the first light scope.
Edited R1.45 to remove requirement of resolution.
Edited R1.46 to remove requirement to distribute intermediate calibration data
Edited R1.47 to clarify and remove “best effort basis”
Edited R1.50 to clarify non-repudiation in context of science data
Added R1.145 Level 0 Search Interface Requirements for DKIST Staff – Instrument Scientists and other DKIST staff will need access to level 0 data with a rudimentary interface for their work.
Deleted R1.53 Access Variability. Requirement is unnecessary and cannot be physically or financially met.
Edited R1.54 to disallow external (to DC) use of the DC computing infrastructure. Sandbox for DKIST scientists to be provided by IT.
Edited R1.55 Open Source Software for clarification.
Edited R1.57 to make IDL a choice rather than a requirement – as IDL does not have adequate tools that will allow the requirement to be met.
Edited R1.61 to remove the provision of services as those are an operations function.
Multiple – Replaced the word “embargo” throughout with variations of “proprietary data”.
Table of Contents
1. Overview
1.1 Purpose
1.2 Scope
1.3 Verification
1.4 Prioritization
1.5 Definitions
1.6 Acronyms
1.7 Applicable Documents
1.8 Reference Documents
2. Rate and Volume from DKIST
3. Data and Metadata Availability
4. Data Curation, Retention, and Association
5. Processing & Calibrations
6. Search & Distribution
7. Data Center Usage
8. Software, Systems, Support
Overview
Purpose
This document captures the requirements for the Data Center that prescribe how it supports the scientific objectives of the DKIST. It is, in concert with the Data Center Operational Concepts Definition (DC-OCD), the highest-level document from which design requirements flow. Most of the requirements herein inform and guide the definition of the Data Center systems (design) requirements, which are captured elsewhere and will be reviewed separately.
Scope
The requirements herein relate to the performance and function of the Data Center as driven by scientific necessity. They do not include functional aspects stemming from the conceptual or design development of the Data Center.
Verification
Included with each major numbered specification in this document is a requirement verification method. These verification methods specify the minimum standards of verification required to ensure that the individual requirements and specifications are met. The verification methods used below are those identified in the Systems Engineering Body of Knowledge (SEBoK) (RD[01]).
Inspection. Technique based on visual or dimensional examination of an element; the verification relies on the human senses or uses simple methods of measurement and handling. Inspection is generally non-destructive, and typically includes the use of sight, hearing, smell, touch, and taste, simple physical manipulation, mechanical and electrical gauging, and measurement. No stimuli (tests) are necessary. The technique is used to check properties or characteristics best determined by observation (e.g., paint color, weight, documentation, listing of code).
Analysis. Technique based on analytical evidence obtained without any intervention on the submitted element using mathematical or probabilistic calculation, logical reasoning (including the theory of predicates), modeling and/or simulation under defined conditions to show theoretical compliance. Mainly used where testing to realistic conditions cannot be achieved or is not cost-effective.
Demonstration. Technique used to demonstrate correct operation of the submitted element against operational and observable characteristics without using physical measurements (no or minimal instrumentation or test equipment). Demonstration is sometimes called 'field testing'. It generally consists of a set of tests selected by the supplier to show that the element response to stimuli is suitable or to show that operators can perform their assigned tasks when using the element. Observations are made and compared with predetermined/expected responses. Demonstration may be appropriate when requirements or specifications are given in statistical terms (e.g., mean time to repair, average power consumption, etc.).
Test. Technique performed on the submitted element by which functional, measurable characteristics, operability, supportability, or performance capability is quantitatively verified when subjected to controlled conditions that are real or simulated. Testing often uses special test equipment or instrumentation to obtain accurate quantitative data to be analyzed.
Verification by demonstration and test should be used wherever possible. In cases where inspection or analysis is the only verification, every effort should be made to verify these through proof test cases.
Prioritization
Each requirement of an element, function, or performance characteristic includes an assessment of its development priority, assigned to one of two levels. The following identifiers and meanings are used:
Required – Shall be achieved to meet the minimum objectives.
Goal – Effort should be undertaken to meet the specification.
Definitions
This document uses terms that require definition.
Early Operations – The time period beginning at the final acceptance of the DKIST into operations, which occurs at the conclusion of integration, test, commissioning and science verification activities, and extending four (4) years from the beginning of DKIST Operations. This definition is being used as a time interval affecting the Data Center, and may or may not be adopted by other DKIST operations elements.
Regular Operations – The period of DKIST Operations after Early Operations, extending for the DKIST operational lifetime.
Scientific Data – The set of qualitative and quantitative values that represent the outcome of a scientific measurement or collection process. For the purpose of this document, refers explicitly to the recorded numerical values resulting from collection, processing, and/or analysis of solar photons by DKIST facility instruments and supporting systems.
Scientific Metadata – The set of qualitative and/or quantitative values and/or text that describes other scientific data. For the purpose of this document, refers to such values or text that describes the direct context and environment of the collection process that is designed to collect solar photons.
Ancillary Data and Metadata – Data or metadata that are not directly related to the measurement process but that support scientific, operational or computational outcomes.
Data Set – An aggregation of data and metadata formatted and organized as intended for user consumption.
Data Store – The Data Center systems and supporting software that provide curatorial management capability of DKIST data and metadata.
Science Data Frame – The aggregation of scientific data and metadata resulting from a single output array recorded by a DKIST photo-sensitive detector, or from operations on multiple exposures by the detectors or other systems that generate an analogous output array, all stored in a structured, machine-readable format. This includes those exposures obtained for the explicit purpose of calibrating instrument performance (e.g., dark frames). For example, the DKIST Data Handling System produces science data frames in FITS file format in accordance with the specification of SPEC-0122, from binary data (termed fully processed accumulators), and alphanumeric metadata stored internally within the DHS.
Calibration Parameter – A numeric value used in performing one or more calibration calculations. Calibration parameters may be generated during the data acquisition process or created via data processing, and may change over time. Multiple associated parameters may be combined to create sets, e.g., for creating curves or matrices. Examples abound: gain, rms noise, linearity, Mueller matrix.
Mebibyte, Gibibyte, Tebibyte, Pebibyte – These terms refer to powers of 1024 used in the storage of binary (base-2) data. Specifically,
1 Mebibyte (MiB) = 1024^2 = 2^20 (1,048,576) bytes
1 Gibibyte (GiB) = 1024^3 = 2^30 (1,073,741,824) bytes
1 Tebibyte (TiB) = 1024^4 = 2^40 (1,099,511,627,776) bytes
1 Pebibyte (PiB) = 1024^5 = 2^50 (1,125,899,906,842,624) bytes
In particular, note that these all exceed the more common decimal units, e.g., “megabytes”, by some non-negligible percentage. For instance, one MiB is 4.86% larger than one MB and one PiB is nearly 12.6% larger than one PB.
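These overhead percentages follow directly from the ratios of the binary and decimal prefixes; the short Python sketch below reproduces them.

```python
# Check of the binary-vs-decimal prefix overhead quoted above.
for prefix, n in [("Mi", 2), ("Gi", 3), ("Ti", 4), ("Pi", 5)]:
    binary, decimal = 1024 ** n, 1000 ** n
    pct = 100 * (binary / decimal - 1)
    print(f"1 {prefix}B = {binary} bytes ({pct:.2f}% larger than 1 {prefix[0]}B)")
```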
Acronyms
Acronym | Meaning |
DKIST | Daniel K. Inouye Solar Telescope |
NISP | NSO Integrated Synoptic Program |
NSO | National Solar Observatory |
DHS | Data Handling System |
Applicable Documents
Ref | Title |
AD[01] | OPS-DC-SPEC-002, “Data Center Operational Concepts Document” |
Reference Documents
Ref | Title |
RD[01] | SEBOK Wiki, http://sebokwiki.org |
RD[02] | OPS-DC-TN-001, “Data Center Data Volume Estimate” |
Rate and Volume from DKIST
Annual Volume and Image Count
The Data Center shall support the receipt of 2.8 PiB of DKIST scientific and calibration data and metadata, via 1.9 × 10^8 science data frames and 5.0 × 10^10 non-unique metadata elements, per year, delivered from the DKIST Data Handling System.
Rationale: The rationale for the estimates of data and metadata rates is documented in detail in Technical Note OPS-DC-TN-001, “Data Center Data Volume Estimate” (RD[02]). In particular, the Technical Note derives the science data frame rate from a model combining facility usage and uptime, seeing, and instrument usage. The model then scales the science data frame estimate to the associated data volume, which accumulates to 2.8 Pebibytes per annum (about 8 Tebibytes per day) and 5.0 × 10^10 non-unique metadata elements received within the headers of the science data frames.
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD 4.1.4
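The annual-to-daily scaling in the rationale can be checked with simple arithmetic; the sketch below is an illustrative consistency check (the mean frame size is a derived figure, not a stated requirement).

```python
# Back-of-envelope consistency check of the rates in the rationale above.
annual_pib = 2.8
frames_per_year = 1.9e8

daily_tib = annual_pib * 1024 / 365                        # 1 PiB = 1024 TiB
mean_frame_mib = annual_pib * 1024 ** 3 / frames_per_year  # 1 PiB = 1024^3 MiB

print(f"~{daily_tib:.1f} TiB/day; mean frame size ~{mean_frame_mib:.0f} MiB")
```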
Ancillary Data and Metadata [Deleted]
Peak Daily Data Volume
The Data Center shall support receipt of data volumes of up to 60 TiB generated by the DKIST in a single day.
Rationale: The rationale for estimates of data and metadata rates is documented in detail in Technical Note OPS-DC-TN-001, “Data Center Data Volume Estimate” (RD[02]).
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 4.1.4
Data and Metadata Availability
Data Policy Enforcement [Deleted]
Open Data Access
The Data Center shall make available for search and download all non-proprietary scientific data and related descriptive scientific metadata to requesting users.
Rationale: The data resulting from experiments performed on the DKIST should not be held in private, unavailable repositories, but rather made available broadly to enhance its utility.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 2.1
R1.148 Proprietary Data Access
The Data Center shall make available for search and download proprietary scientific data and related descriptive scientific metadata to authorized requesting users.
Rationale: Data resulting from experiments performed on the DKIST that have been flagged as proprietary should be held in repositories unavailable to the public and accessible only to authorized users.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 2.1
Search Transparency
The Data Center shall provide search transparency of all scientific data and related descriptive metadata to requesting users, noting and enforcing restrictions for proprietary data.
Rationale: The Data Center should show users information about its existing holdings, in accord with the preceding “Open Data Access” requirement: all data can be made available to users except proprietary data, which will have some access restrictions.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 2.1
Search Access
The Data Center shall be capable of providing users with user- and group-specific access to data through its data and metadata search services.
Rationale: For the user community, specialized access to data and metadata search services can provide targeted utility of Data Center functions that enhances the usability of DKIST data. Specialized access – such as the availability of user-specific historical information – is achieved by associating users with their work, requiring the Data Center to build an authentication function to ensure user validity.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 7.1.2
Data and Metadata Availability Time
The Data Center shall make available for search and download 90% (annual average; goal 100%) of the data and metadata received from the DKIST Data Handling System, whose elements are defined in SPEC-0122, within ten days of receipt in Boulder.
Rationale: Data and metadata from experiments should be available quickly after the experimental data are acquired. These are received from the summit DHS whose elements are defined by SPEC-0122, ‘DKIST Data Model’.
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD-7.1.3
Data Download Status Notification
The Data Center shall provide a status notification to a user requesting download of data within fifteen seconds of request.
Rationale: Users need an expectation of when requested data will be available for download from the Data Center. A status notification provides the user with either an estimate of when the data will become available for download or a notice that the data is immediately available.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 7.1.2
Direct Data Download Availability Time
The Data Center shall provide direct data download availability within five business days after request.
Rationale: Users need an expectation of when requested data will be available for download from the Data Center. The Data Center needs to have a maximum timescale during which to provide the data to the user.
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD 7.1.2
Optimization of Data Download Availability [Deleted]
User Initiation of Data Transfer
The Data Center shall support distribution of datasets and metadata through user-initiated network transfers using common and existing transport mechanisms and protocols.
Rationale: Users will want to pull data from Data Center resources, and may also wish to “subscribe” to information or data that is generated and distributed by the Data Center.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD-8.5.4
Data Center Initiation of Data Transfer
The Data Center shall support distribution of datasets and metadata through Data Center-initiated network transfers using common and existing transport mechanisms and protocols.
Rationale: Users may wish to “subscribe” to information or data generated and distributed by the Data Center, which requires the Data Center to be capable of initiating transfers to them.
Priority: Goal
Verification: Demonstration
Requirement Origin: DC-OCD-8.5.4
Provide Documentation of Distributed Data
The Data Center shall provide full documentation regarding the format, organization, and definition of distributed data sets.
Rationale: Users must have a clear expectation on the structure and organization of distributed data resources. This may include description of metadata header elements, information about how the data is organized within the file(s), any formatting features of note, and how or if data is compressed.
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD-8.5.3
Data Curation, Retention, and Association
Data-Metadata Association
The Data Center shall maintain association of the content of data and metadata received from the DKIST Data Handling System with:
data and metadata resulting from Science Operations Planning and Monitoring;
data and metadata produced by Data Center processes, including calibration;
its descriptive data and related ancillary information, including scientific source, generational documentation, and software provenance;
for the duration of the lifetime of the data.
Rationale: Maintaining provenance is key in ensuring the production of stable products for community use, as well as the long-term utility of data. Such association is typically supported by one or more data models.
Priority: Required
Verification: Inspection/Demonstration
Requirement Origin: DC-OCD 5.3.1, 5.3.2
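As an illustration only, the kind of association this requirement implies might be modeled as follows; the record and its field names are hypothetical, not the Data Center's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DatasetAssociation:
    """Hypothetical minimal record tying a dataset to its provenance."""
    dataset_id: str
    proposal_id: str                 # from Science Operations Planning
    experiment_id: str
    input_frame_ids: List[str]       # science data frames received from the DHS
    calibration_software: str        # software provenance, e.g. a pipeline version
    quality_report_id: Optional[str] = None

record = DatasetAssociation(
    dataset_id="DS-000123",
    proposal_id="pid_1_23",
    experiment_id="eid_456",
    input_frame_ids=["VISP_2020_06_16T00_00_00_000.fits"],
    calibration_software="calib-pipeline 2.3.1",
)
```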
Resilience to Disaster
The Data Center shall ensure that localized catastrophic disasters do not result in permanent loss of data.
Rationale: In order to ensure the long-term safety of DKIST data requiring retention, the Data Center must not store its data in a single location, since, should that location suffer a catastrophic event, permanent data loss would result.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 10.3.7
Retention Enforcement [Deleted]
Data Value [Deleted]
Minimum Retention Time
The Data Center shall retain content of science data frames, without downsampling or otherwise reducing the natively-acquired image resolution, received from the DKIST Data Handling System for at minimum six months from time of acquisition.
Rationale: This ensures that the content – e.g., the image data and associated descriptive information – of science frames will be retained for a period of time allowing analysis. This content may not be stored internally in the form received from the summit, i.e., it may be reversibly disaggregated into smaller or different-size pieces. This also ensures a minimum lifetime for data in routine Operations.
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD 5.4.1
Data and Metadata from Summit Operations
The Data Center shall receive and retain all data and metadata adhering to the “DKIST Data Model”, documented in SPEC-0122, as received from the Summit facility for the lifetime of the project, unless directed for deletion by the DKIST Data Quality Assessment Committee.
Rationale: It is necessary to capture the record of each experiment and relevant facility operations performed on the DKIST summit, regardless of whether the science images are retained.
Priority: Required
Verification: Inspection/Demonstration
Requirement Origin: DC-OCD 5.4.1
Metadata from Operations Planning
The Data Center shall receive and retain for the lifetime of the project all metadata required to establish and maintain provenance of captured experiments to their proposal, planning, and execution.
Rationale: To enable the long-term association of the outputs of the measurement process to associated plans and proposals, it is necessary for the Data Center to capture and maintain metadata from Operations Planning.
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD 5.4.1
Metadata from Data Center Operations
The Data Center shall retain metadata associated with the computational processing and transformation of raw data into output data sets and/or products, for the lifetime of those outputs.
Rationale: Should the Data Center publish data sets for community consumption, it must also retain a record of the means of production (broadly, though incompletely, described as algorithms and their versions, software tools and their versions, and hardware types and versions) in order to provide the capability for explanation, inspection, and reproduction. Nominally, since the Data Center will distribute raw and calibrated data (L1) to the user community, it should also be capable of receiving feedback on those outputs, and associating some or all of that feedback (e.g., in the form of annotations or revisions of control parameters) with data and metadata stored in the Data Center.
Priority: Required
Verification: Inspection/Demonstration
Requirement Origin: DC-OCD 5.4.1
Retention of Ancillary Data [Deleted]
Deletion
The Data Center shall be capable of deleting scientific data and any associated scientific, engineering, analysis, and processing metadata in its data stores.
Rationale: To enforce limits on storage identified herein, the Data Center will delete data and metadata.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 5.4.3
Lossless Compression
The Data Center shall be capable of reducing - by using lossless compression algorithms - the size of science data holdings.
Rationale: Lossless compression techniques for integer and floating-point data are routinely used in astronomy, providing storage cost savings with no loss in scientific data quality. The selection and application of compression algorithms and parameters will be determined based on analysis of effects on sample data and evaluation of impact on storage and access.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 5.4.3
R1.146 Lossy Compression
The Data Center shall be capable of reducing - by using lossy compression algorithms - the size of science data holdings.
Rationale: Multiple techniques exist to decrease the stored size of data with acceptable loss in photometric or spatial resolution, and these will be employed selectively on DKIST data. Lossy compression techniques for integer and floating-point data are routinely used in astronomy, providing substantial storage savings at the cost of an acceptable loss in scientific data quality. The selection and application of image compression algorithms, parameters, and loss ratios will be determined based on analysis of effects on sample data and evaluation of impact on various scientific utilizations. Downsampling can be performed on raw and calibrated (L1) data, and the decision of which methods to apply to which data will largely be the responsibility of the DKIST DC and Science Team.
Priority: Required
Verification: Analysis
Requirement Origin: DC-OCD 5.4.3
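As one concrete illustration of the two preceding compression requirements, FITS tile compression (as implemented, for example, in astropy) supports lossless Rice coding of integer data and lossy quantization of floating-point data. A minimal sketch with synthetic data:

```python
import numpy as np
from astropy.io import fits

# Integer data: Rice tile compression is lossless.
ints = np.random.randint(0, 4096, size=(1024, 1024)).astype(np.int16)
lossless = fits.CompImageHDU(ints, compression_type="RICE_1")

# Floating-point data: quantization before Rice coding makes this lossy;
# quantize_level trades storage savings against photometric fidelity.
floats = np.random.normal(size=(1024, 1024)).astype(np.float32)
lossy = fits.CompImageHDU(floats, compression_type="RICE_1", quantize_level=16.0)

fits.HDUList([fits.PrimaryHDU(), lossless, lossy]).writeto(
    "compressed.fits", overwrite=True
)
```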
Retention Reporting
The Data Center shall be capable of reporting the availability of its retained data and metadata holdings.
Rationale: Not only should the Data Center perform retention of data and metadata, it should be capable of providing some reports of its holdings upon request.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 5.5.8
R1.147 Degradation Reporting
The Data Center shall be capable of reporting the status of past and pending deletion and lossy compression activities.
Rationale: Not only should the Data Center perform deletion and lossy compression of data and metadata, it should be capable of providing reports of those past and pending activities upon request.
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD 5.5.8
Data Loss Limits
The Data Center shall ensure that the non-directed, irretrievable loss of its holdings of scientific data does not exceed 0.1% annual average (goal 0.01%).
Rationale: The undirected loss of data may be caused by system defect or failure, a defect in implemented procedure, or natural causes (e.g., cosmic rays). The loss of too much data can affect the scientific integrity of specific images, data sets, and collections thereof; yet, a requirement allowing too little loss is overly burdensome technologically and financially.
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD 5.5.8
Reproduction of Data Center Output
The Data Center shall be capable of redistributing datasets previously created and distributed, unless relevant inputs, algorithms, code, and/or outputs have been deleted or deprecated through data reduction, downsampling or data deletion activities. In such cases, the Data Center shall reproduce data sets with reduced fidelity, reporting the difference from the original, or shall report that insufficient input information is available.
Rationale: Rather than statically store every data set produced, it is sensible to ensure that raw data, required inputs, relevant computing specifications (e.g., algorithm information), software versions, and hardware requirements remain available to recreate the product(s).
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 5.5.3
Loss Reporting [Deleted]
PROCESSING & Calibrations
Algorithm Extensibility
The Data Center shall provide data analysis and processing algorithms in a modular and extensible manner in order to accommodate changes and/or improvements to techniques applicable to DKIST Data.
Rationale: Changes in processing needs, whether driven by algorithmic, scientific, or technological changes, must be accommodated without undue impact on Data Center Operations.
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD 2.1, DC-OCD 6.3.1
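One common way to satisfy this kind of modularity is a registry of interchangeable processing steps; the following is a minimal sketch of the pattern, not the Data Center's actual design.

```python
# Registry of calibration steps as plain callables; names are illustrative.
CALIBRATION_STEPS = {}

def register_step(name):
    """Decorator that adds a calibration step to the registry."""
    def wrap(func):
        CALIBRATION_STEPS[name] = func
        return func
    return wrap

@register_step("dark_correction")
def dark_correction(frame, dark):
    return frame - dark

@register_step("flat_correction")
def flat_correction(frame, flat):
    return frame / flat

# A pipeline is then an ordered list of registered names, so new or improved
# algorithms can be added or swapped without changing the framework itself.
pipeline = ["dark_correction", "flat_correction"]
```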
Quality Framework
The Data Center shall define and maintain a standardized science data quality information framework including algorithmically-derived quality metrics for calibrated data (L1) and metadata.
Rationale: Utility of DKIST data is enhanced when users of data are provided quality metrics for the data they are searching for or using. For external users, data sets accompanied by a reporting of the “success” of a calibration aids in ensuring fitness for use. For internal users, such reporting aids in identifying opportunities to improve and extend existing calibration techniques.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 6.3.3
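For illustration, a standardized quality record might take a machine-readable shape such as the following; the schema and metric names are hypothetical.

```python
import json

# Illustrative shape of a per-dataset quality record; the actual framework
# will define its own schema and metrics.
quality_record = {
    "dataset_id": "DS-000123",
    "metrics": {
        "fried_parameter_cm": {"mean": 8.2, "min": 4.1, "max": 14.0},
        "rms_noise": 0.012,
    },
    "calibration_steps_completed": ["dark_correction", "flat_correction"],
    "warnings": [],
}
print(json.dumps(quality_record, indent=2))
```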
Calibrations
The Data Center shall calibrate science data sets gathered by the DKIST facility instruments, utilizing controlled implementations of documented algorithms.
Rationale: The primary output of the Data Center is calibrated (L1) DKIST data. Calibrations characterize and, where possible, remove known errors, artifacts, and inhomogeneities convolved with the solar signal by the optomechanical and thermal systems of the DKIST. Calibrations may also include characterization or removal of known artifacts introduced into the solar signal by the terrestrial atmosphere (see the following “Goal” requirement). The minimum calculations to perform will be detailed in the Calibrations Subsystem requirements documentation. The detailed specification of which techniques and algorithms will be applied to the DKIST data constitutes the “documented algorithms” of the requirement; these contain the rationale and justification for the intended calculations, the detailed logical algorithm flow, the mathematical operations to be performed, and references where available and/or applicable.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 2.1, DC-OCD 6
Image Reconstruction
The Data Center shall be capable of improving the quality of science data frames impacted by the terrestrial atmosphere through the controlled implementation of documented algorithms for image reconstruction.
Rationale: Many data sets will benefit from quality improvements enabled by the application of image reconstruction techniques to data. Current algorithms are computationally expensive and therefore likely prohibitive within the effort to deliver Data Center functions for DKIST first light. However, image reconstruction is sufficiently important that the Data Center should strive to include its development in its work plan should resources permit.
Priority: Goal
Verification: Demonstration
Requirement Origin: DC-OCD 6.1
Calculate Calibration Parameters [Deleted]
Storage of Calibration Parameters
The Data Center shall be capable of storing calibration parameters and associated values for the lifetime of the facility.
Rationale: The parameters used in performing calibration processing should be maintained as an archival resource akin to other metadata. Maintaining a store of some calibration parameters imposes a negligible impact on available storage while enabling the derivation of long-term performance measures relevant to calibrations, such as parameter drift.
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD 7.1.1, DC-OCD 5.3.1
Search & Distribution
Web and Script Search Access
The Data Center shall provide the capability for users to search and download data and metadata using network-enabled interfaces accessible from web browsers and select scripting languages.
Rationale: Users want to access data holdings through a web browser and through scripts in common languages. This should be supported by the creation of an API that can be implemented in numerous forms.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 7.1.2
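For illustration, scripted access through a published search API could look like the following sketch; the endpoint and parameter names are hypothetical.

```python
import requests

# Hypothetical endpoint and parameter names, shown only to illustrate
# scripted access through a published search API.
resp = requests.get(
    "https://api.example-datacenter.org/v1/datasets",
    params={"instrument": "VISP", "start_time": "2020-06-16T00:00:00"},
    timeout=10,
)
resp.raise_for_status()
for ds in resp.json()["datasets"]:
    print(ds["dataset_id"], ds["instrument"])
```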
Search Interface Help Documentation
The Data Center shall make available for viewing and download help documentation that provides explanation of the search interface functionality, including descriptions of parameters and valid ranges where appropriate.
Rationale: Informative documentation aiding the use of search interfaces will enable users to utilize the interfaces more effectively while decreasing the impact on DKIST staff in responding to help requests.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 10.3.8
Search Parameters
The Data Center shall provide the capability to search, singly and in combination, using the following categories: Temporal, Spatial, Spectral, Polarimetric, Instrument, requested Target Type, and Unique Identifiers. The Data Center has a goal to also enable searching by observed Target Type.
Temporal
Timestamp of data acquisition, and intervals thereof
Temporal sampling
Spatial
Polygonal bounding box
Spatial coordinates specified by coordinate entries consistent with the helioprojective Cartesian coordinate system
Spatial sampling
Spectral
Spectral band(s) contained in data
Spectral line(s) contained in data
Spectral sampling
Numeric search + controlled tag names (e.g., “Ca II”)
Polarimetric
Stokes I, Q, U, V
Polarimetric accuracy, if available
Instruments
Instrument(s) used
Instrument modes: Spectropolarimetric, spectroscopic, filtergraph
Unique identifiers
Proposal ID
Experiment ID
Target Type
Type requested in proposal (per configured vocabulary defined in the DKIST ICD 4.2/7.0 OCS to DSSC and the SPEC-0122 DKIST Data Model Specification)
Search Goals
Target Type
Observed in DKIST data
Rationale: This represents a common set of search criteria for many astronomical data search interfaces, and particularly for those providing access to spectropolarimetric data obtained through facility operation.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 7.1.1
Further definitions of Search Parameters:
Temporal sampling: We define temporal sampling as the frequency at which common observations have the same Data Set Parameters at the observed cadence within level 1 datasets. The meaning of cadence here can change slightly depending on the type of observation. E.g., in “field sampling” mode, mosaics would be created for the full scanned FOV, and in this case cadence would refer to the period between whole mosaics rather than between the constituent frames with the same Data Set Parameters (but different pointing). This is in line with SOLARNET definitions and with what solar physicists would expect. We expect temporal sampling to be consistent within a dataset.
Spectral sampling: Spectral sampling will be based on the wavelength plate scale for the dataset. If non-uniform wavelength sampling is used, then the average wavelength plate scale will be calculated.
Stokes I, Q, U, V: This will be a Boolean: either the observation contains polarization observations, or it does not.
Instrument modes: Spectropolarimetric, spectroscopic, filtergraph: Observations will be labelled with one of these three modes.
Target Type: This will be a configured vocabulary. An example vocabulary already in use can be found at https://www.lmsal.com/hek/VOEvent_Spec.html.
Ideally, we will be able to identify the observed features from the data itself; in lieu of that, the intended target for the requested experiment may make a good starting point, if easily determined.
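Combining these categories in a single search might look like the following illustrative sketch; the field names are hypothetical.

```python
# Illustrative combined search drawing on several of the categories above
# (temporal, spectral, polarimetric, instrument mode); field names are
# hypothetical.
query = {
    "start_time": "2020-06-16T00:00:00",
    "end_time": "2020-06-17T00:00:00",
    "spectral_line": "Ca II",                  # controlled tag name
    "has_stokes_quv": True,                    # Boolean, per the definition above
    "instrument_mode": "spectropolarimetric",  # one of three defined modes
}
```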
Query Frequency
The Data Center shall support at minimum 100 simultaneous queries against its available data holdings.
Rationale: The rule of thumb used here is that 2% of a user base will access the Data Center search facility simultaneously. This user base is estimated conservatively at 5000 individuals.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 7.1
Query Performance Time
The Data Center shall return an informational status response to a submitted query of its data holdings within two seconds of submission when 100 or fewer concurrent users are querying the available data.
Rationale: A user submitting a query request needs to be informed that their query has been received and is being processed.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 7.1.2, DC-OCD 8.5.4
Query Result Set Return Time
The Data Center shall return a query result set, defined as descriptive metadata and associated ancillary information, or a subset of no less than twenty-five items if the result set is large, to a submitted query within seven seconds, when 100 or fewer concurrent users are querying the data holdings.
Rationale: A user submitting a query request should be shown at least a preliminary list of results quickly.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 7.1.2, DC-OCD 8.5.4
Visual Presentation of Search Results
The Data Center shall be capable of utilizing simple preliminary result presentation techniques for the purpose of summarizing and/or presenting large and/or complex query result sets.
Rationale: For some data sets, simple visualization of result sets will be more effective in providing user access and subsetting capabilities than, say, tabular formats. Thus, the use of simple data plots, overlays of statistical summaries, browse images and movies, etc. will aid the user in identifying data of interest.
Priority: Required
Verification: Inspection/Demonstration
Requirement Origin: DC-OCD 7.1.2
Search Metadata from Non-DKIST Data Providers [deleted]
Science Data Set Definition
The Data Center shall distribute science data sets comprising, at minimum:
Calibrated (L1) Science Data, containing the output of one or more instrument actions driven by one or more scientific experiments and the acquisition of directly-related ancillary data (e.g., calibration parameters), and calibrated using predefined, quality-controlled means;
Science data quality information via a quality record, produced during data processing, and documented for digestion by end-users;
Revision-controlled structured metadata and ancillary information with full and structured attribute descriptions, including measurement activity and processing provenance, to include at a minimum processing and utility code versions, code execution status information, and reported quality;
Relevant calibration information, including parameters used and their validity and the calibration steps performed;
Related documentation, e.g., algorithms and code versions used, or relevant links to them.
Rationale: Data sets contain the scientific data and metadata documenting the observations performed and their context, the processing steps performed and their context, and suitable additional information such as data quality. Data quality information provides users with input into fitness-for-use considerations; reproducibility information allows users to recreate, and, hopefully, improve upon Data Center data processing techniques; and related documentation aids users in understanding what was done to the data. Relevant data quality information might include quantitative parameters such as Fried parameter, or process-driven parameters such as completion of intended calibration and the list of calibration steps used. The exact specification of the bulleted items listed above will be described in lower-level requirements.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 8.1.1, 8.2
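For illustration, the minimum components listed above could be packaged with a manifest along these lines; file names and keys are hypothetical.

```python
# Illustrative manifest of the minimum components of a distributed science
# data set; file names and keys are hypothetical.
dataset_manifest = {
    "l1_data": ["VISP_L1_0001.fits", "VISP_L1_0002.fits"],
    "quality_record": "quality.json",
    "metadata": "dataset.asdf",   # structured metadata, provenance, versions
    "calibration": {
        "parameters": "cal_params.json",
        "steps": ["dark_correction", "flat_correction"],
    },
    "documentation": ["https://example.org/algorithms/visp-l1"],
}
```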
Browse Images & Movies
The Data Center shall provide browse images and movies of calibrated (L1) data, when the required input data are available, to allow previewing and browsing of data sets or full science images prior to requesting download of full science data sets.
Rationale: Browse imagery available from major astronomical data archives typically uses JPEG, GIF, and PNG formats for distribution. Browse movies are often distributed in AVI and/or MPEG standards. These standards – which variously encompass digital encoding schemes, compression techniques, and file formatting – are easily viewable by the multimedia visualization applications in common use. Note that it is expected that the input data (e.g., the individual raw or calibrated images) necessary to create browse media may not always be available.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 8.3.2
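A minimal sketch of generating a PNG browse image from a calibrated frame; the file name and the location of the image data in the first FITS extension are assumptions.

```python
import matplotlib
matplotlib.use("Agg")          # headless rendering
import matplotlib.pyplot as plt
from astropy.io import fits

# Render a calibrated (L1) frame to a PNG browse image.
with fits.open("VISP_L1_0001.fits") as hdul:
    data = hdul[1].data        # assumes image data in the first extension

plt.imsave("browse.png", data, cmap="gray", origin="lower")
```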
Calibration Data Set
The Data Center shall temporarily store and utilize calibration data arrays for the express purpose of generating quality metrics and parameters that may be used to assess the quality of the produced data, as well as to perform trend analysis on fundamental properties that may affect data quality.
Rationale: Calibration information is often generated from facility input sources or is useful more broadly than for a single experiment. Making calibration information accessible enables its use in generating quality metrics and trend analyses. The calibration data array is primarily the derived calibration parameters, not the instrument outputs.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 8.3.3
Raw Data Sets Available on Request
The Data Center shall be capable of assembling and distributing raw data sets – including the calibration information acquired at the DKIST – upon authorized user request.
Rationale: The Data Center will make available raw data to users, which must be published in a usable form. These raw data must be accompanied by essential calibration information acquired at the DKIST in order for the data to be effectively used. Such calibration information would include, for instance, dark frames and flat fields acquired during Observing Programs.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 8.3.1
Metadata Set
The Data Center shall be capable of distributing collections of selected metadata for end-user consumption in structured machine-readable formats and accompanied by full and structured attribute descriptions for all metadata elements.
Rationale: The Data Center will enable users to search on various metadata fields, and should make available a way for users to download metadata of interest. An example might be seeing quality over some interval during which an experiment of interest was being executed.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 8.3.4
Data Set Standards
The Data Center shall distribute data sets adhering to predefined, commonly-used and supported standards for structured representation, file format, defined content, and archival.
Rationale: The use of standards lowers the barrier of entry for users to find value and utility in data sets produced by the DKIST. Defined standards also improve longevity for long-term datasets by instituting rules for use and verification capable of being implemented and enforced. The use of commonly-used and supported file formats optimally supports data usage by end-users. Most users of solar data have existing scripts and programs that accommodate commonly-used file formats such as FITS and its popular variants such as compressed FITS, which are accompanied by (some) standardization of definition and usage. Additionally, some DKIST instruments (notably the DL-NIRSP) may benefit from data representation in other formats such as HDF. The Data Center should utilize such standards, and ensure that the distributed data is accompanied by descriptive information for the format. Further, the selection of a few commonly-used file formats can provide a flexible representation of data within files, appropriate for the data and for the needs of users.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 8.5.3, 8.5.1
R1.145 Level 0 Search Interface Requirements for DKIST Staff
The Data Center shall provide a low-level interface to enable authorized DKIST staff to locate raw data sets for download.
Rationale: Authorized DKIST staff, such as instrument scientists, will require a means to locate and download raw data sets for troubleshooting, performance monitoring, or assisting with calibration errors.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 8.3.1
Data Center Usage
Non-Repudiative Data Source
The Data Center shall distribute checksum information with distributed data sets.
Rationale: While neither the DKIST nor the Data Center can logically enforce acceptable data usage, it can provide information to a user that allows the user to determine whether the data obtained are the same as that residing in the Data Center data stores. Such information can take different forms, such as file checksums which provide an integrity check for downloaded material. Such an approach is termed “non-repudiation”, i.e., that no user of the data can deny its authenticity since a verifiable test (i.e., the checksum) is distributed along with the data itself. This facilitates trusted consumption of data, by individuals using it directly or by institutions choosing to mirror some of this data, and provides an entry for the implemented system to verify the fidelity of any data set.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 5
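For illustration, a user-side integrity check against a distributed checksum might look like the following sketch; SHA-256 is an assumed choice of algorithm, and the expected value would come from the checksum manifest distributed with the data.

```python
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Stream a downloaded file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "..."  # illustrative placeholder from the distributed manifest
if sha256sum("VISP_L1_0001.fits") != expected:
    raise ValueError("Checksum mismatch: file differs from Data Center holdings")
```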
User Community
The Data Center shall be capable of supporting 5000 unique account-holding users requesting access to Data Center holdings.
Rationale: Conservative estimates for the number of potential individual users of the Data Center are based on approximate membership of national and international professional societies.
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD 7.1
Concurrent Usage
The Data Center shall be capable of supporting the concurrent usage of Data Center resources by at minimum 100 unique users with minimum bandwidth of 100 Megabits per second per user.
Rationale: The Data Center must serve the needs of multiple individuals simultaneously accessing its resources. “Resources” here are defined as data holdings, web pages, search interfaces, and available data analysis resources.
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD 7.1
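The aggregate bandwidth implied by this requirement follows directly:

```python
# Aggregate egress implied by 100 concurrent users at 100 Mbit/s each.
users, per_user_mbps = 100, 100
print(f"{users * per_user_mbps / 1000:.0f} Gbit/s aggregate")  # -> 10 Gbit/s
```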
Access Variability [Deleted]
Sandbox
The Data Center shall make available a managed computing resource for the purposes of addressing code or algorithm improvements, bugs, and failures, and of determining the feasibility or utility of feature requests or modifications.
Rationale: In order to increase the efficacy and improve the performance of DKIST data calibrations, it will be necessary for staff to utilize a centralized, commonly-available computing resource to contribute work to the Data Center effort.
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD 10.3.11
Software, Systems, Support
Open Source Software
The Data Center shall make publicly available for download the documentation, algorithms, and code used in calibration.
Rationale: To supplement an open data policy, open source provides not only transparency of process for the Data Center, but affords external users the ability to contribute to improvement and extension of Data Center methods.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 2.1
Open Documentation
The Data Center shall make available for download selected documentation for public consumption, including but not limited to those describing Data Center design, data model, and data curation mechanisms.
Rationale: To supplement an open data policy, open documentation provides not only transparency of process for the Data Center, but affords external users the ability to understand, improve, and extend Data Center methods.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 2.1
Data Access Tools
The Data Center shall make publicly available for download simple, documented tools, at minimum for IDL and/or Python, that can be used for accessing and viewing data and metadata distributed by the Data Center.
Rationale: To speed adoption and usage of DKIST data, simple tools should be provided that give users access to data and metadata and support their ability to perform analysis on the data.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 6.5.1
Data Access Through Published APIs
The Data Center shall make publicly available for viewing and download APIs and implementations thereof, in at minimum IDL and Python, for accessing the DKIST data stores.
Rationale: The majority of users of data utilize scripts to acquire data. In solar physics, the two predominant languages used are IDL and Python. Additional languages can be used by implementing the published API.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 7.1.2
Software and Document Version Control
The Data Center shall be capable of distributing version control information related to software and documentation that it maintains in publicly released repositories.
Rationale: Version controlled software provides users with stable feature sets with which to maintain optimized work environments.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 5.5.2
Security
The Data Center shall implement and enforce cybersecurity hardware and software controls and systems to ensure the integrity, confidentiality, and availability of its internal systems and stored data.
Rationale: Ensuring that deployed systems are guided by effective cybersecurity policies and practices helps prevent and mitigate damage from malevolent attack. The three “pillars of cybersecurity” are integrity (maintaining the accuracy and consistency of systems and data over their lifecycle), confidentiality (preventing unauthorized disclosure of information), and availability (ensuring that systems and security controls are functioning well).
Priority: Required
Verification: Inspection
Requirement Origin: DC-OCD 10.3.4
Help
The Data Center shall provide information and services to users to aid usage of its services and functionality.
Rationale: Some aspects of a helpdesk would provide information to users navigating available web services, reading through provided documentation, or parsing downloaded files. Other aspects would include the ability to report bugs in software, request features in software, describe errors or inconsistencies in downloaded data and metadata, and to provide generalized commentary on the quality of service that the Data Center is providing.
Priority: Required
Verification: Demonstration
Requirement Origin: DC-OCD 10.3.8