OPS-DC-SPEC-001

 

 

Data Center

Science Requirements

 

Steven Berukoff, Kevin Reardon, Robert Tawa

Data Center

 

16 June 2020

 

 

Name

Date

Approved By:

Robert Tawa

Data Center PM

 

Approved By:

Kevin Reardon

Tom Schad

Alexandra Tritschler

Friedrich Woeger

DKIST Science Team

 

Approved By:

Scott Wiant

Data Center Senior Computing Analyst

 

Released By:

Thomas Rimmele

DKIST Director

 

 

REVISION SUMMARY:

 

Date: 12 Sep 2016
Revision: A
Changes: Initial formal document.

Date: 16 June 2020

Revision: B

Changes:

Deleted unused acronym, ATBD.

Deleted R1.02 Ancillary data is no longer sent by the DHS to the DC

Edited R1.03 to remove 8 hr. notice.

Deleted R1.04 Data Policy Enforcement. Requirement is unnecessary and there is no formal DKIST Data Policy.

Edited R1.05, R1.06, R1.07 to remove reference to non-existent DKIST data policy

Added R1.148 to add special access to proprietary data.

Deleted R1.11 Design of Data transfer through Globus precludes necessity of and ability to implement this requirement

Edited R1.15 requirement to remove reference to external data.

Deleted R1.17 as there is no longer any reason to involve the DQAC in DC operations nor any minimum time to keep data on disk

Deleted R1.18 redundant to requirement R1.17

Edited R1.20 to clarify deletion process

Edited R1.21 to remove document reference and “<<TBD>>”

Deleted R1.23 Ancillary data is no longer sent by the DHS to the DC

Edited R1.24 to remove “permanent and irrevocable”

Edited R1.25 to better reflect the rationale of the requirement

Added R1.146 to add capability lost with the editing of R1.25

Edited R1.26 for clarity

Added R1.147 to add capability related to R1.146

Deleted R1.29 as this requirement is impossible to implement and because loss of data is so improbable it is unnecessary

Edited R1.32 for clarity

Deleted R1.34 as it is a descendant of R1.32 – Added to System Requirements.

Edited R1.35 to change name of requirement to “Storage of Calibration Parameters”. Aligned requirement and rationale to this goal.

Edited R1.36 Rationale to remove reference to languages and what is planned

Edited R1.38 Search Parameters – modified the list of search parameters and added explanatory text.

  • Calibration metadata has been deleted from the list of search parameters. The rationale for R1.38 is to generate a list of common search criteria. Searches based on calibration metadata are not common. The calibration history will be recorded in the FITS headers of each file and that information is available in the ASDF file associated with each dataset. Using the user tools, the ASDF files will enable users to search on *any* metadata in the files including calibration metadata.

  • Investigator ID was deleted from the search terms. Investigator ID is generally opaque to the user it belongs to; it is used like a primary key in a database and is not something a general user can readily look up.

  • Modified “Target Type – Observed in DKIST Data” to be a goal rather than an explicitly provided search term. Target type could provide a very useful search tool for users to find observations of particular features or events of interest. However, it will not be possible to provide this option at the Data Center launch.

Edited R1.40 and R1.41 to clarify limits on the requirements

Deleted R1.43 Search Metadata from Non-DKIST Data Providers – this requirement was deleted as it is not within the first light scope.

Edited R1.45 to remove requirement of resolution.

Edited R1.46 to remove requirement to distribute intermediate calibration data

Edited R1.47 to clarify and remove “best effort basis”

Edited R1.50 to clarify non-repudiation in context of science data

Added R1.145 Level 0 Search Interface Requirements for DKIST Staff – Instrument Scientists and other DKIST staff will need access to level 0 data with a rudimentary interface for their work.

Deleted R1.53 Access Variability. Requirement is unnecessary and cannot be physically or financially met.

Edited R1.54 to disallow external (to DC) use of the DC computing infrastructure. Sandbox for DKIST scientists to be provided by IT.

Edited R1.55 Open Source Software for clarification.

Edited R1.57 to make IDL a choice rather than a requirement – as IDL does not have adequate tools that will allow the requirement to be met.

Edited R1.61 to remove the provision of services as those are an operations function.

Multiple – Changed/removed the word “embargo” and replaced with some variation of “proprietary data”



Table of Contents

1. Overview

1.1 Purpose

1.2 Scope

1.3 Verification

1.4 Prioritization

1.5 Definitions

1.6 Acronyms

1.7 Applicable Documents

1.8 Reference Documents

2. Rate and Volume from DKIST

3. Data and Metadata Availability

4. Data Curation, Retention, and Association

5. Processing & Calibrations

6. Search & Distribution

7. Data Center Usage

8. Software, Systems, Support

Overview

Purpose

This document specifies requirements for the Data Center that prescribe how it supports the scientific objectives of the DKIST. Together with the Data Center Operational Concepts Definition (DC-OCD), it is the highest-level document from which design requirements flow. Most of these requirements inform and guide the definition of Data Center systems (design) requirements, which are captured elsewhere and will be reviewed separately.

Scope

The requirements herein relate to the performance and function of the Data Center as driven by scientific necessity. They do not include functional aspects stemming from the conceptual or design development of the Data Center.

Verification

Each major numbered specification listed in this document includes a requirement verification method. These verification methods specify the minimum standards of verification required to ensure that the individual requirements and specifications are met. The verification methods used below are those identified in the Systems Engineering Body of Knowledge (SEBoK) (RD[01]).

 

  • Inspection. Technique based on visual or dimensional examination of an element; the verification relies on the human senses or uses simple methods of measurement and handling. Inspection is generally non-destructive, and typically includes the use of sight, hearing, smell, touch, and taste, simple physical manipulation, mechanical and electrical gauging, and measurement. No stimuli (tests) are necessary. The technique is used to check properties or characteristics best determined by observation (e.g. - paint color, weight, documentation, listing of code, etc.).

  • Analysis. Technique based on analytical evidence obtained without any intervention on the submitted element using mathematical or probabilistic calculation, logical reasoning (including the theory of predicates), modeling and/or simulation under defined conditions to show theoretical compliance. Mainly used where testing to realistic conditions cannot be achieved or is not cost-effective.

  • Demonstration. Technique used to demonstrate correct operation of the submitted element against operational and observable characteristics without using physical measurements (no or minimal instrumentation or test equipment). Demonstration is sometimes called 'field testing'. It generally consists of a set of tests selected by the supplier to show that the element response to stimuli is suitable or to show that operators can perform their assigned tasks when using the element. Observations are made and compared with predetermined/expected responses. Demonstration may be appropriate when requirements or specifications are given in statistical terms (e.g., mean time to repair, average power consumption, etc.).

  • Test. Technique performed onto the submitted element by which functional, measurable characteristics, operability, supportability, or performance capability is quantitatively verified when subjected to controlled conditions that are real or simulated. Testing often uses special test equipment or instrumentation to obtain accurate quantitative data to be analyzed.

 

Verification by demonstration and test should be achieved wherever possible. In cases where inspection or analysis is the only verification, every effort should be explored to verify these through proof test cases.

Prioritization

Included for each requirement of an element, function, or performance characteristic, is an assessment of its priority of development into one of two phases. The following identifiers and meanings are used:

 

  • Required – Shall be achieved to meet the minimum objectives.

  • Goal – Effort should be undertaken to meet specification.

Definitions

This document uses the following terms, which require definition.

  • Early Operations – The time period beginning at the final acceptance of the DKIST into operations, which occurs at the conclusion of integration, test, commissioning and science verification activities, and extending four (4) years from the beginning of DKIST Operations. This definition is being used as a time interval affecting the Data Center, and may or may not be adopted by other DKIST operations elements.

  • Regular Operations – The period of DKIST Operations after Early Operations, extending for the DKIST operational lifetime.

  • Scientific Data – The set of qualitative and quantitative values that represent the outcome of a scientific measurement or collection process. For the purpose of this document, refers explicitly to the recorded numerical values resulting from collection, processing, and/or analysis of solar photons by DKIST facility instruments and supporting systems.

  • Scientific Metadata – The set of qualitative and/or quantitative values and/or text that describes other scientific data. For the purpose of this document, refers to such values or text that describes the direct context and environment of the collection process that is designed to collect solar photons.

  • Ancillary Data and Metadata – Data or metadata that are not directly related to the measurement process but that support scientific, operational or computational outcomes.

  • Data Set – An aggregation of data and metadata formatted and organized as intended for user consumption.

  • Data Store – The Data Center systems and supporting software that provide curatorial management capability of DKIST data and metadata.

  • Science Data Frame – The aggregation of scientific data and metadata resulting from a single output array recorded by a DKIST photo-sensitive detector, or from operations on multiple exposures by the detectors or other systems that generate an analogous output array, all stored in a structured, machine-readable format. This includes those exposures obtained for the explicit purpose of calibrating instrument performance (e.g., dark frames). For example, the DKIST Data Handling System produces science data frames in FITS file format in accordance with the specification of SPEC-0122, from binary data (termed fully processed accumulators), and alphanumeric metadata stored internally within the DHS.

  • Calibration Parameter – A numeric value used in performing one or more calibration calculations. Calibration parameters may be generated during the data acquisition process or created via data processing, and may change over time. Multiple associated parameters may be combined to create sets, e.g., for creating curves or matrices. Examples abound: gain, rms noise, linearity, Mueller matrix.

  • Mebibyte, Gibibyte, Tebibyte, Pebibyte – These terms refer to powers of 1024 used in storage of binary (base-2) data. Specifically,

    • 1 Mebibyte (MiB) = 1024^2 = 2^20 (1048576) bytes

    • 1 Gibibyte (GiB) = 1024^3 = 2^30 (1073741824) bytes

    • 1 Tebibyte (TiB) = 1024^4 = 2^40 (1099511627776) bytes

    • 1 Pebibyte (PiB) = 1024^5 = 2^50 (1125899906842624) bytes

In particular, note that these all exceed the more common decimal units (e.g., “megabytes”) by a non-negligible percentage. For instance, one MiB is about 4.86% larger than one MB and one PiB is nearly 12.6% larger than one PB.
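Note (illustrative, non-normative): the unit relationships above can be checked with a few lines of arithmetic. The following is a minimal sketch; the names `BINARY_UNITS`, `DECIMAL_UNITS`, and `excess_percent` are introduced here for illustration only and are not part of any DKIST interface.

```python
# Binary (IEC) prefixes are powers of 1024; decimal (SI) prefixes are powers of 1000.
BINARY_UNITS = {"MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4, "PiB": 1024**5}
DECIMAL_UNITS = {"MB": 1000**2, "GB": 1000**3, "TB": 1000**4, "PB": 1000**5}


def excess_percent(binary_bytes: int, decimal_bytes: int) -> float:
    """Percentage by which a binary unit exceeds its decimal counterpart."""
    return (binary_bytes / decimal_bytes - 1.0) * 100.0


# Pair each binary unit with the decimal unit of the same order and compare.
for (b_name, b_val), (d_name, d_val) in zip(BINARY_UNITS.items(), DECIMAL_UNITS.items()):
    pct = excess_percent(b_val, d_val)
    print(f"1 {b_name} = {b_val} bytes ({pct:.2f}% larger than 1 {d_name})")
```

The excess grows with each prefix step because the factor 1.024 is compounded once per power of 1024.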

Acronyms

Acronym    Meaning
DKIST      Daniel Ken Inouye Solar Telescope
NISP       National Integrated Synoptic Program
NSO        National Solar Observatory
DHS        Data Handling System

 

Applicable Documents

Ref       Title
AD[01]    OPS-DC-SPEC-002, “Data Center Operational Concepts Document”

 

Reference Documents

Ref       Title
RD[01]    SEBOK Wiki, http://sebokwiki.org
RD[02]    OPS-DC-TN-001, “Data Center Data Volume Estimate”

Rate and Volume from DKIST

Annual Volume and Image Count

The Data Center shall support the receipt of 2.8 PiB of DKIST scientific and calibration data and metadata, via 1.9 x 10^8 science data frames and 5.0 x 10^10 non-unique metadata elements, per year delivered from the DKIST Data Handling System.

Rationale: The rationale for estimates of data and metadata rates is documented in detail in Technical Note OPS-DC-TN-001, “Data Center Incoming Data Rate” (RD[02]). In particular, the Technical Note derives the science data frame rate from a model combining facility usage and uptime, seeing, and instrument usage. The model then scales the science data frame estimate to the associated data volume, which accumulates to 2.8 Pebibytes per annum (about 8 Tebibytes per day) and 5.0 x 10^10 non-unique metadata elements received within the headers of the science data frames.

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD 4.1.4
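Note (illustrative, non-normative): the annual figures in this requirement imply a few derived averages that are convenient for sizing. The sketch below assumes a 365-day year and uniform averaging, which are simplifications; the model in the Technical Note (RD[02]) remains the authoritative source. Variable names are introduced here for illustration only.

```python
PIB = 1024**5
TIB = 1024**4
MIB = 1024**2

# Figures taken from the requirement text.
annual_volume_bytes = 2.8 * PIB
frames_per_year = 1.9e8
metadata_elements_per_year = 5.0e10

# Derived averages (assumption: 365-day year, uniform distribution over the year).
daily_volume_tib = annual_volume_bytes / 365 / TIB
mean_frame_size_mib = annual_volume_bytes / frames_per_year / MIB
metadata_elements_per_frame = metadata_elements_per_year / frames_per_year

print(f"average daily volume : {daily_volume_tib:.2f} TiB")
print(f"mean frame size      : {mean_frame_size_mib:.1f} MiB")
print(f"metadata per frame   : {metadata_elements_per_frame:.0f} elements")
```

The daily average is consistent with the "about 8 Tebibytes per day" quoted in the rationale.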

 

Ancillary Data and Metadata [Deleted]

 

Peak Daily Data Volume

The Data Center shall support receipt of data volumes of 60 TiB that may be generated by the DKIST in a single day.

Rationale: The rationale for estimates of data and metadata rates is documented in detail in Technical Note OPS-DC-TN-001, “Data Center Incoming Data Rate” (RD[02]).

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 4.1.4
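Note (illustrative, non-normative): a peak daily volume can be translated into a sustained ingest rate once a receipt window is chosen. The requirement does not state a window, so the 24-hour figure used below is an assumption, and `sustained_rate_gbit_per_s` is a hypothetical helper.

```python
TIB = 1024**4


def sustained_rate_gbit_per_s(volume_tib: float, window_hours: float) -> float:
    """Average rate (decimal Gbit/s) needed to receive a volume within a window."""
    bits = volume_tib * TIB * 8
    seconds = window_hours * 3600
    return bits / seconds / 1e9


# Assumption: the 60 TiB peak day is received over a full 24 hours.
peak_rate = sustained_rate_gbit_per_s(60, 24)
print(f"60 TiB in 24 h needs about {peak_rate:.1f} Gbit/s sustained")
```

A shorter window scales the rate proportionally, for example halving the window doubles the required rate.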

 

 

 

 

 

 

Data and Metadata Availability

Data Policy Enforcement [Deleted]

 

Open Data Access

The Data Center shall make available for search and download all non-proprietary scientific data and related descriptive scientific metadata to requesting users.

Rationale: The data resulting from experiments performed on the DKIST should not be held in private, unavailable repositories, but rather made available broadly to enhance its utility.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 2.1

R1.148 Proprietary Data Access

The Data Center shall make available for search and download proprietary scientific data and related descriptive scientific metadata to authorized requesting users.

Rationale: Data resulting from experiments performed on the DKIST should be held in private repositories, unavailable to the public, when those data have been flagged as proprietary.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 2.1

 

 

Search Transparency

The Data Center shall provide search transparency of all scientific data and related descriptive metadata to requesting users, notating and enforcing restrictions for proprietary data.

Rationale: The Data Center should show users information about its existing holdings in accord with the preceding requirement "Open Data Access", which states that all data can be made available to users except for proprietary data, which will have some access restrictions.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 2.1
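Note (illustrative, non-normative): search transparency means that every holding appears in search results, while proprietary holdings are marked and their download is restricted. The sketch below shows one way to express that rule. `DatasetRecord`, `can_download`, `search_view`, the dataset identifiers, and the user names are all hypothetical and do not describe the Data Center's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    dataset_id: str
    proprietary: bool = False
    authorized_users: frozenset = frozenset()


def can_download(record: DatasetRecord, user: str) -> bool:
    """Non-proprietary data is open; proprietary data needs authorization."""
    return not record.proprietary or user in record.authorized_users


def search_view(records, user):
    """Every record is listed; proprietary status is noted, access is enforced."""
    return [
        {
            "dataset_id": r.dataset_id,
            "proprietary": r.proprietary,
            "downloadable": can_download(r, user),
        }
        for r in records
    ]


holdings = [
    DatasetRecord("DS-0001"),
    DatasetRecord("DS-0002", proprietary=True, authorized_users=frozenset({"pi_user"})),
]
public_view = search_view(holdings, "guest")
pi_view = search_view(holdings, "pi_user")
```

Both users see both data sets in their results; only the download flag differs.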

 

Search Access

The Data Center shall be capable of enabling users with user- and group-specific access to data through its data and metadata search services.

Rationale: For the user community, specialized access to data and metadata search services can provide targeted utility of Data Center functions that enhances the usability of DKIST data. Specialized access – such as the availability of user-specific historical information – is achieved by associating users with their work, requiring the Data Center to build an authentication function to ensure user validity.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 7.1.2

 

Data and Metadata Availability Time

The Data Center shall make available for search and download 90%, annual average, (goal 100%) of data and metadata received from the DKIST Data Handling System, whose elements are defined in SPEC-0122, within ten days of its receipt in Boulder.

Rationale: Data and metadata from experiments should be available quickly after the experimental data are acquired. These are received from the summit DHS whose elements are defined by SPEC-0122, ‘DKIST Data Model’.

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD-7.1.3
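Note (illustrative, non-normative): the availability figure in this requirement can be expressed as the fraction of received items that became available within ten days. The sketch below counts items; whether the 90% is measured by item count or by volume is not stated in the requirement, so counting is an assumption, as are the names used.

```python
from datetime import datetime, timedelta

AVAILABILITY_WINDOW = timedelta(days=10)  # "within ten days of its receipt"
REQUIRED_FRACTION = 0.90                  # required annual average


def fraction_available_on_time(records):
    """Fraction of received items made available within the window.

    `records` is a list of (received, available) datetime pairs;
    `available` is None for items that are not yet available.
    """
    if not records:
        raise ValueError("no records to assess")
    on_time = sum(
        1
        for received, available in records
        if available is not None and available - received <= AVAILABILITY_WINDOW
    )
    return on_time / len(records)


t0 = datetime(2020, 6, 16)
sample = [
    (t0, t0 + timedelta(days=2)),   # on time
    (t0, t0 + timedelta(days=10)),  # on time (exactly at the limit)
    (t0, t0 + timedelta(days=11)),  # late
    (t0, None),                     # not yet available
]
fraction = fraction_available_on_time(sample)
meets_requirement = fraction >= REQUIRED_FRACTION
```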

 

Data Download Status Notification

The Data Center shall provide a status notification to a user requesting download of data within fifteen seconds of request.

Rationale: Users need an expectation of when requested data will be available for download from the Data Center. A status notification gives the user either a notice of when the data will be available for download, or a notice that the data are immediately available for download.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 7.1.2

 

Direct Data Download Availability Time

The Data Center shall provide direct data download availability within five business days after request.

Rationale: Users need an expectation of when requested data will be available for download from the Data Center. The Data Center needs to have a maximum timescale during which to provide the data to the user.

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD 7.1.2
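Note (illustrative, non-normative): a deadline of five business days can be computed from the request date as sketched below. The sketch treats Monday through Friday as business days and ignores holidays, which is an assumption; `add_business_days` is a hypothetical helper.

```python
from datetime import date, timedelta


def add_business_days(start: date, business_days: int) -> date:
    """Return the date that lies `business_days` working days after `start`."""
    current = start
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday = 0 ... Friday = 4
            remaining -= 1
    return current


request_date = date(2020, 6, 19)  # a Friday
latest_availability = add_business_days(request_date, 5)
```

A request made on a Friday therefore has until the following Friday, since the intervening weekend does not count.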

 

Optimization of Data Download Availability [Deleted]

 

User Initiation of Data Transfer

The Data Center shall support distribution of datasets and metadata through user-initiated network transfers using common and existing transport mechanisms and protocols.

Rationale: Users will want to pull data from the Data Center resources, and as well may wish to “subscribe” to information or data that is generated by and distributed by the Data Center.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD-8.5.4

 

Data Center Initiation of Data Transfer

The Data Center shall support distribution of datasets and metadata through Data Center-initiated network transfers using common and existing transport mechanisms and protocols.

Rationale: Users will want to pull data from the Data Center resources, and as well may wish to “subscribe” to information or data that is generated by and distributed by the Data Center.

Priority: Goal

Verification: Demonstration

Requirement Origin: DC-OCD-8.5.4

 

Provide Documentation of Distributed Data

The Data Center shall provide full documentation regarding the format, organization, and definition of distributed data sets.

Rationale: Users must have a clear expectation on the structure and organization of distributed data resources. This may include description of metadata header elements, information about how the data is organized within the file(s), any formatting features of note, and how or if data is compressed.

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD-8.5.3

 

Data Curation, Retention, and Association

Data-Metadata Association

The Data Center shall maintain association of the content of data and metadata received from the DKIST Data Handling System with data and metadata

  • Resulting from Science Operations Planning and Monitoring;

  • Produced by Data Center processes including calibration;

with its descriptive data and related ancillary information, including scientific source, generational documentation, and software provenance, for the lifetime of the data.

Rationale: Maintaining provenance is key in ensuring the production of stable products for community use, as well as the long-term utility of data. Such association is typically supported by one or more data models.

Priority: Required

Verification: Inspection/Demonstration

Requirement Origin: DC-OCD 5.3.1, 5.3.2

 

Resilience to Disaster

The Data Center shall ensure that localized catastrophic disasters do not result in permanent loss of data.

Rationale: To ensure the long-term safety of DKIST data requiring retention, the Data Center must not store its data in a single location; should that location suffer a catastrophic event, permanent data loss would result.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 10.3.7

 

Retention Enforcement [Deleted]

 

Data Value [Deleted]

 

Minimum Retention Time

The Data Center shall retain content of science data frames, without downsampling or otherwise reducing the natively-acquired image resolution, received from the DKIST Data Handling System for a minimum of six months from time of acquisition.

Rationale: This ensures that the content – e.g., the image data and associated descriptive information – of science frames will be retained for a period of time allowing analysis. This content may not be stored internally in the form received from the summit, i.e., they may be reversibly disaggregated into smaller or different-size pieces. This also ensures a minimum lifetime for data in routine Operations.

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD 5.4.1

 

Data and Metadata from Summit Operations

The Data Center shall receive and retain all data and metadata adherent to the “DKIST Data Model”, documented in SPEC-0122, as received from the Summit facility for the lifetime of the project, unless directed for deletion by the DKIST Data Quality Assessment Committee.

Rationale: It is necessary to capture the record of each experiment and relevant facility operations performed on the DKIST summit, regardless of whether the science images are retained.

Priority: Required

Verification: Inspection/Demonstration

Requirement Origin: DC-OCD 5.4.1

 

Metadata from Operations Planning

The Data Center shall receive and retain for the lifetime of the project all metadata required to establish and maintain provenance of captured experiments to their proposal, planning, and execution.

Rationale: To enable the long-term association of the outputs of the measurement process to associated plans and proposals, it is necessary for the Data Center to capture and maintain metadata from Operations Planning.

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD 5.4.1

 

Metadata from Data Center Operations

The Data Center shall retain metadata associated with the computational processing and transformation of raw data into output data sets and/or products, for the lifetime of those outputs.

Rationale: Should the Data Center publish data sets for community consumption, it must also retain a record of the means by which they were produced (broadly but incompletely: algorithms and their versions, software tools and their versions, and hardware and its type and version), in order to provide the capability for explanation, inspection, and reproduction. Nominally, since the Data Center will distribute raw and calibrated data (L1 data) to the user community, it should also be capable of receiving feedback on those outputs, and associating some or all of that feedback (e.g., in the form of annotations or revisions of control parameters) to data and metadata stored in the Data Center.

Priority: Required

Verification: Inspection/Demonstration

Requirement Origin: DC-OCD 5.4.1

 

Retention of Ancillary Data [Deleted]

 

Deletion

The Data Center shall be capable of deleting scientific data and any associated scientific, engineering, analysis, and processing metadata in its data stores.

Rationale: To enforce limits on storage identified herein, the Data Center will delete data and metadata.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 5.4.3

 

Lossless Compression

The Data Center shall be capable of reducing - by using lossless compression algorithms - the size of science data holdings.

Rationale: Multiple techniques exist to decrease the stored size of data with acceptable loss in photometric or spatial resolution, and these will be employed selectively on DKIST data. Lossy compression techniques for integer and floating-point data are routinely used in astronomy to provide substantial storage savings at the cost of an acceptable loss in scientific data quality. The selection and application of image compression algorithms, parameters, and loss ratios will be determined based on analysis of effects on sample data and evaluation of impact on various scientific utilizations. Downsampling can be performed on raw and calibrated data (L1), and the choice of which methods are applied to which data will largely be the responsibility of the DKIST Science Team, acting through its Data Quality Assessment Committee.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 5.4.3
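Note (illustrative, non-normative): the defining property of lossless compression is that decompression restores the original bytes exactly. The sketch below demonstrates this round trip with Python's standard `zlib` module on a synthetic 16-bit array. The choice of `zlib` and the synthetic data are assumptions made for illustration; this document does not name a compression algorithm.

```python
import zlib
from array import array

# Synthetic 16-bit "frame": a slowly varying ramp standing in for detector counts.
frame = array("H", [(i // 64) % 4096 for i in range(256 * 256)])
raw = frame.tobytes()

# Compress, then decompress back into a 16-bit array.
compressed = zlib.compress(raw, level=6)
restored = array("H")
restored.frombytes(zlib.decompress(compressed))

# Lossless: every value is recovered exactly.
is_lossless = restored == frame
compression_ratio = len(raw) / len(compressed)
print(f"lossless: {is_lossless}, compression ratio: {compression_ratio:.1f}")
```

The achievable ratio depends strongly on the data; real detector frames containing noise compress far less than this smooth ramp.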

 

R1.146 Lossy Compression

The Data Center shall be capable of reducing - by using lossy compression algorithms - the size of science data holdings.

Rationale: Multiple techniques exist to decrease the stored size of data with acceptable loss in photometric or spatial resolution, and these will be employed selectively on DKIST data. Lossy compression techniques for integer and floating-point data are routinely used in astronomy to provide substantial storage savings at the cost of an acceptable loss in scientific data quality. The selection and application of image compression algorithms, parameters, and loss ratios will be determined based on analysis of effects on sample data and evaluation of impact on various scientific utilizations. Downsampling can be performed on raw and calibrated data (L1), and the choice of which methods are applied to which data will largely be the responsibility of the DKIST DC and Science Team.

Priority: Required

Verification: Analysis

Requirement Origin: DC-OCD 5.4.3
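Note (illustrative, non-normative): one simple form of lossy compression is uniform quantization, in which values are stored as integer multiples of a chosen step. The reconstruction error is then bounded by half the step, which is the kind of "loss ratio" an analysis on sample data would evaluate. The sketch below uses synthetic data and hypothetical helper names; it is not the method selected for DKIST data.

```python
import random


def quantize(values, step):
    """Map each value to the nearest integer multiple of `step`."""
    return [round(v / step) for v in values]


def dequantize(codes, step):
    """Reconstruct approximate values from the integer codes."""
    return [c * step for c in codes]


random.seed(0)
signal = [random.gauss(1000.0, 5.0) for _ in range(10_000)]  # synthetic samples

step = 0.5
codes = quantize(signal, step)
restored = dequantize(codes, step)

# The information discarded is bounded: no sample moves by more than step / 2.
max_error = max(abs(a - b) for a, b in zip(signal, restored))
print(f"max reconstruction error: {max_error:.3f} (bound {step / 2})")
```

The integer codes span a much smaller range than the original floating-point values and therefore compress well with a subsequent lossless stage.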

 

Retention Reporting

The Data Center shall be capable of reporting the availability of its retained data and metadata holdings.

Rationale: Not only should the Data Center perform retention of data and metadata, it should be capable of providing some reports of its holdings upon request.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 5.5.8

 

R1.147 Degradation Reporting

The Data Center shall be capable of reporting the status of past and pending deletion and lossy compression activities.

Rationale: Not only should the Data Center perform compression of data and metadata, it should be capable of providing some reports of its holdings upon request.

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD 5.5.8

 

 

Data Loss Limits

The Data Center shall ensure that the non-directed, irretrievable loss of its holdings of scientific data does not exceed 0.1% annual average (goal 0.01%).

Rationale: The undirected loss of data may be caused by system defect or failure, a defect in implemented procedure, or natural causes (e.g., cosmic rays). The loss of too much data can affect the scientific integrity of specific images, data sets, and collections thereof; yet, a requirement allowing too little loss is overly burdensome technologically and financially.

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD 5.5.8
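Note (illustrative, non-normative): the loss limits can be read as a budget relative to the holdings. The sketch below assumes, for the sake of a worked number, holdings equal to one year of intake (2.8 PiB); the requirement itself refers to holdings in general, so that figure and the helper `assess_loss` are assumptions.

```python
TIB = 1024**4
PIB = 1024**5

REQUIRED_MAX_LOSS = 0.001   # 0.1% annual average
GOAL_MAX_LOSS = 0.0001      # 0.01% goal


def assess_loss(bytes_lost: float, bytes_held: float) -> dict:
    """Compare an observed loss fraction against the required and goal limits."""
    if bytes_held <= 0:
        raise ValueError("holdings must be positive")
    fraction = bytes_lost / bytes_held
    return {
        "fraction": fraction,
        "meets_requirement": fraction <= REQUIRED_MAX_LOSS,
        "meets_goal": fraction <= GOAL_MAX_LOSS,
    }


holdings = 2.8 * PIB  # assumption: one year of intake
required_budget_tib = holdings * REQUIRED_MAX_LOSS / TIB
goal_budget_tib = holdings * GOAL_MAX_LOSS / TIB
result = assess_loss(bytes_lost=1 * TIB, bytes_held=holdings)
```

Under this assumption, losing 1 TiB in a year would satisfy the required limit but not the goal.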

 

Reproduction of Data Center Output

The Data Center shall be capable of redistributing datasets previously created and distributed, unless relevant inputs, algorithms, code, and/or outputs have been deleted or deprecated through data reduction, downsampling or data deletion activities. In such cases, the Data Center shall reproduce data sets with reduced fidelity, reporting the difference from the original, or shall report that insufficient input information is available.

Rationale: Rather than statically store every data set produced, it is sensible to ensure that raw data, required inputs, relevant computing specifications (e.g., algorithm information), software versions, and hardware requirements remain available to recreate the product(s).

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 5.5.3
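Note (illustrative, non-normative): the requirement distinguishes three outcomes when a previously distributed data set is requested again. The sketch below encodes that decision; the three boolean inputs are a simplification introduced here, and `Outcome` and `reproduction_outcome` are hypothetical names.

```python
from enum import Enum


class Outcome(Enum):
    FULL = "reproduced at original fidelity"
    REDUCED = "reproduced at reduced fidelity; difference from original reported"
    INSUFFICIENT = "insufficient input information available"


def reproduction_outcome(inputs_available: bool,
                         inputs_degraded: bool,
                         code_available: bool) -> Outcome:
    """Decide what the Data Center can deliver for a repeat request."""
    if not inputs_available or not code_available:
        return Outcome.INSUFFICIENT   # inputs or algorithms were deleted
    if inputs_degraded:
        return Outcome.REDUCED        # e.g., inputs were downsampled or lossy-compressed
    return Outcome.FULL


example = reproduction_outcome(inputs_available=True, inputs_degraded=True, code_available=True)
```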

 

Loss Reporting [Deleted]

Processing & Calibrations

Algorithm Extensibility

The Data Center shall provide data analysis and processing algorithms in a modular and extensible manner in order to accommodate changes and/or improvements to techniques applicable to DKIST Data.

Rationale: Changes in processing needs, whether driven by algorithmic, scientific, or technological changes, must be accommodated without undue impact on Data Center Operations.

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD 2.1, DC-OCD 6.3.1

 

Quality Framework

The Data Center shall define and maintain a standardized science data quality information framework including algorithmically-derived quality metrics for calibrated data (L1) and metadata.

Rationale: Utility of DKIST data is enhanced when users of data are provided quality metrics for the data they are searching for or using. For external users, data sets accompanied by a reporting of the “success” of a calibration aids in ensuring fitness for use. For internal users, such reporting aids in identifying opportunities to improve and extend existing calibration techniques.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 6.3.3

 

Calibrations

The Data Center shall calibrate science data sets gathered by the DKIST facility instruments, utilizing controlled implementations of documented algorithms.

Rationale: The primary output of the Data Center is calibrated (L1) DKIST data. Calibrations characterize, and where possible remove, known errors, artifacts, and inhomogeneities convolved with the solar signal by the optomechanical and thermal systems of the DKIST. Calibrations may also include characterization or removal of known artifacts introduced by the terrestrial atmosphere on the solar signal (see following “Goal” requirement). The minimum calculations to perform will be detailed in the Calibrations Subsystem requirements documentation. The detailed specification of which techniques and algorithms will be applied to the DKIST data constitutes the “documented algorithms” in the requirement. These documented algorithms contain the rationale and justification for the intended calculations, detailed logical algorithm flow, mathematical operations that need to be performed, and references where available and/or applicable.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 2.1, DC-OCD 6

 

Image Reconstruction

The Data Center shall be capable of improving the quality of science data frames impacted by the terrestrial atmosphere through the controlled implementation of documented algorithms for image reconstruction.

Rationale: Many data sets will benefit from quality improvements enabled by the application of image reconstruction techniques to data. Current algorithms are computationally expensive and therefore likely prohibitive within the effort to deliver Data Center functions to meet DKIST first light. However, image reconstruction is sufficiently important that the Data Center should strive to include its development in its workplan should resources permit.

Priority: Goal

Verification: Demonstration

Requirement Origin: DC-OCD 6.1

 

Calculate Calibration Parameters [Deleted]

 

Storage of Calibration Parameters

The Data Center shall be capable of storing calibration parameters and associated values for the lifetime of the facility.

Rationale: The parameters used in performing calibration processing should be maintained as an archival resource akin to other metadata. Maintaining a store of some calibration parameters imposes a negligible impact on available storage while enabling the derivation of long-term performance measures relevant to calibrations, such as parameter drift.

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD 7.1.1, DC-OCD 5.3.1

 

 

Search & Distribution

Web and Script Search Access

The Data Center shall provide the capability for users to search and download data and metadata using network-enabled interfaces accessible from web browsers and select scripting languages.

Rationale: Users want to access data holdings through a web browser and through scripts in common languages. This should be supported by the creation of an API that can be implemented in numerous forms.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 7.1.2

 

Search Interface Help Documentation

The Data Center shall make available for viewing and download help documentation that provides explanation of the search interface functionality, including descriptions of parameters and valid ranges where appropriate.

Rationale: Informative documentation aiding the use of search interfaces will enable users to utilize the interfaces more effectively while decreasing the impact on DKIST staff in responding to help requests.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 10.3.8

 

Search Parameters

The Data Center shall provide the capability to search, singly and in combination, using the following categories: Temporal, Spatial, Spectral, Polarimetric, Instrument, requested Target Type, and Unique Identifiers. The Data Center has a goal to also enable searching by observed Target Type.

 

 

Temporal

  • Timestamp of data acquisition, and intervals thereof

  • Temporal sampling

Spatial

  • Polygonal bounding box

  • Spatial coordinates specified by coordinate entries consistent with the helioprojective Cartesian coordinate system

  • Spatial sampling

Spectral

  • Spectral band(s) contained in data

  • Spectral line(s) contained in data

  • Spectral sampling

  • Numeric search + controlled tag names (e.g., “Ca II”)

Polarimetric

  • Stokes I, Q, U, V

  • Polarimetric accuracy, if available

Instruments

  • Instrument(s) used

  • Instrument modes: Spectropolarimetric, spectroscopic, filtergraph

Unique identifiers

  • Proposal ID

  • Experiment ID

Target Type

Type requested in proposal (per configured vocabulary defined in the DKIST ICD 4.2/7.0 OCS to DSSC and the SPEC-0122 DKIST Data Model Specification)

 

Search Goals

 

Target Type

Observed in DKIST data

 

Rationale: This represents a common set of search criteria for many astronomical data search interfaces, and particularly for those providing access to spectropolarimetric data obtained through facility operation.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 7.1.1
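Searching "singly and in combination" can be illustrated with a minimal Python sketch in which every criterion is optional and a record matches only if it satisfies all criteria that are set. The record and query field names below are hypothetical, cover only a subset of the categories listed above, and are not taken from any Data Center interface definition.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    dataset_id: str
    start_time: datetime
    end_time: datetime
    instrument: str
    spectral_lines: Tuple[str, ...]
    has_polarimetry: bool
    proposal_id: str


@dataclass(frozen=True)
class SearchQuery:
    """Each criterion is optional; None means 'do not filter on this'."""
    time_from: Optional[datetime] = None
    time_to: Optional[datetime] = None
    instrument: Optional[str] = None
    spectral_line: Optional[str] = None
    has_polarimetry: Optional[bool] = None
    proposal_id: Optional[str] = None


def matches(record, query):
    """Return True when the record satisfies every criterion that is set."""
    # Temporal: keep records whose acquisition interval overlaps the window.
    if query.time_from is not None and record.end_time < query.time_from:
        return False
    if query.time_to is not None and record.start_time > query.time_to:
        return False
    if query.instrument is not None and record.instrument != query.instrument:
        return False
    if (query.spectral_line is not None
            and query.spectral_line not in record.spectral_lines):
        return False
    if (query.has_polarimetry is not None
            and record.has_polarimetry != query.has_polarimetry):
        return False
    if query.proposal_id is not None and record.proposal_id != query.proposal_id:
        return False
    return True


def search(records, query):
    """Filter an in-memory list of records; a stand-in for a real query engine."""
    return [record for record in records if matches(record, query)]
```

An empty `SearchQuery()` returns every record, a query with one field set searches on that category alone, and setting several fields combines them with a logical AND.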

 

Further definitions of Search Parameters:

 

Temporal sampling: We define temporal sampling as the frequency at which observations with the same Data Set Parameters recur within level 1 datasets, i.e., the observed cadence. The meaning of cadence here can change slightly depending on the type of observation. For example, in “field sampling” mode, mosaics would be created for the full scanned FOV, and in this case cadence would refer to the period between whole mosaics rather than between the constituent frames with the same Data Set Parameters (but different pointing). This is in line with SOLARNET definitions and with what solar physicists would expect. We expect temporal sampling to be consistent within a dataset.

Spectral sampling: Spectral sampling will be based on the wavelength plate scale for the dataset. If non-uniform wavelength sampling is used, then the average wavelength plate scale will be calculated.

 

Stokes I, Q, U, V: This will be a Boolean. Either the observation contains polarization measurements or it does not.

 

Instrument modes: Spectropolarimetric, spectroscopic, filtergraph: Observations will be labelled with one of these three modes.

 

Target:

  • Type will be a configured vocabulary. An example vocabulary already in use can be found at https://www.lmsal.com/hek/VOEvent_Spec.html.

    • Ideally, we will be able to identify the observed features from the data itself, but in lieu of that, the intended target for the requested experiment may make a good starting point, if easily determined.
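The temporal and spectral sampling definitions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, "wavelength plate scale" is read as the wavelength step between adjacent spectral samples, and the median is used for cadence only as one robust choice, since the text states merely that temporal sampling is expected to be consistent within a dataset.

```python
from datetime import datetime
from statistics import median


def temporal_sampling_seconds(start_times):
    """Median interval, in seconds, between consecutive start times.

    For "field sampling" observations, pass the start time of each whole
    mosaic rather than of each constituent frame.
    """
    ordered = sorted(start_times)
    if len(ordered) < 2:
        raise ValueError("need at least two start times")
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return median(gaps)


def average_spectral_sampling(wavelengths):
    """Average wavelength step between adjacent samples.

    The mean of the successive differences of a sorted sequence equals
    (last - first) / (count - 1), so this also covers non-uniform sampling.
    """
    ordered = sorted(wavelengths)
    if len(ordered) < 2:
        raise ValueError("need at least two wavelength samples")
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)
```

For uniformly sampled data, `average_spectral_sampling` simply returns the constant step.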

 

Query Frequency

The Data Center shall support at minimum 100 simultaneous queries against its available data holdings.

Rationale: The rule of thumb used here is 2% of a user base will access the Data Center search facility simultaneously. This user base is estimated conservatively at 5000 individuals.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 7.1
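The figure of 100 simultaneous queries follows directly from the rule of thumb in the rationale. A minimal Python sketch of that arithmetic is given below; the function name is hypothetical, and the percentage is passed as an integer so that the result is computed exactly.

```python
def expected_simultaneous_users(user_base, percent_active):
    """Number of users assumed to be querying at the same time.

    Integer arithmetic is used so that no floating-point rounding occurs.
    """
    return user_base * percent_active // 100


# 2% of an estimated user base of 5000 individuals
SIMULTANEOUS_QUERIES = expected_simultaneous_users(5000, 2)  # 100
```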

 

Query Performance Time

The Data Center shall return an informational status response to a submitted query of its data holdings within two seconds of submission when 100 or fewer concurrent users are querying the available data.

 

Rationale: A user submitting a query request needs to be informed that their query has been received and is being processed.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 7.1.2, DC-OCD 8.5.4

 

Query Result Set Return Time

The Data Center shall return a query result set, defined as descriptive metadata and associated ancillary information, or a subset of no less than twenty-five items if the result set is large, to a submitted query within seven seconds, when 100 or fewer concurrent users are querying the data holdings.

Rationale: A user submitting a query request should be shown at least a preliminary list of results quickly.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 7.1.2, DC-OCD 8.5.4
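One way to satisfy the "subset of no less than twenty-five items" clause is to return a first page immediately and flag whether more results remain. The Python sketch below illustrates the idea only; the function name and the shape of the returned dictionary are assumptions rather than part of any defined interface.

```python
MIN_ITEMS = 25  # smallest permitted subset when the result set is large


def first_response(result_ids, page_size=MIN_ITEMS):
    """Return the first page of a result set plus summary information."""
    if page_size < MIN_ITEMS:
        raise ValueError("page_size must be at least %d" % MIN_ITEMS)
    page = list(result_ids[:page_size])
    return {
        "items": page,
        "total": len(result_ids),
        # True when the whole result set fits in this first response
        "complete": len(page) == len(result_ids),
    }
```

A small result set is returned whole, while a large one yields at least twenty-five items together with the total count, so that the remainder can be fetched afterwards.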

 

Visual Presentation of Search Results

The Data Center shall be capable of utilizing simple preliminary result presentation techniques for the purpose of summarizing and/or presenting large and/or complex query result sets.

Rationale: For some data sets, simple visualization of result sets will be more effective in providing user access and subsetting capabilities than, say, tabular formats. Thus, the use of simple data plots, overlays of statistical summaries, browse images and movies, etc. will aid the user in identifying data of interest.

Priority: Required

Verification: Inspection/Demonstration

Requirement Origin: DC-OCD 7.1.2

 

Search Metadata from Non-DKIST Data Providers [Deleted]

 

Science Data Set Definition

The Data Center shall distribute science data sets that comprise, at minimum,

  • Calibrated (L1) Science Data, containing the output of one or more instrument actions driven by one or more scientific experiments and the acquisition of directly-related ancillary data (e.g., calibration parameters), and calibrated using predefined, quality-controlled means;

  • Science data quality information via a quality record, produced during data processing, and documented for digestion by end-users;

  • Revision-controlled structured metadata and ancillary information with full and structured attribute descriptions, including measurement activity and processing provenance, to include at a minimum processing and utility code versions, code execution status information, and reported quality;

  • Relevant calibration information, including parameters used and their validity and the calibration steps performed;

  • Related documentation, e.g., algorithms and code versions used, or relevant links to them.

Rationale: Data sets contain the scientific data and metadata documenting the observations performed and their context, the processing steps performed and their context, and suitable additional information such as data quality. Data quality information provides users with input into fitness-for-use considerations; reproducibility information allows users to recreate, and, hopefully, improve upon Data Center data processing techniques; and related documentation aids users in understanding what was done to the data. Relevant data quality information might include quantitative parameters such as Fried parameter, or process-driven parameters such as completion of intended calibration and the list of calibration steps used. The exact specification of the bulleted items listed above will be described in lower-level requirements.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 8.1.1, 8.2

 

Browse Images & Movies

The Data Center shall provide browse images and movies of calibrated (L1) data, when required input data are available, to allow previewing and browsing of data sets or full science images prior to requesting download of full science data sets.

Rationale: Browse imagery available from major astronomical data archives is typically distributed in JPEG, GIF, and PNG formats. Browse movies are often distributed in AVI and/or MPEG formats. These standards, which variously encompass digital encoding schemes, compression techniques, and file formatting, are readily handled by the multimedia visualization applications in common use. Note that it is expected that the input data (e.g., the individual raw or calibrated images) necessary to create browse media may not always be available.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 8.3.2

 

Calibration Data Set

The Data Center shall temporarily store and utilize calibration data arrays for the express purpose of generating quality metrics and parameters that may be used to assess the quality of the produced data as well as to perform trend analysis on fundamental properties that may affect data quality.

Rationale: Calibration information is often generated from facility input sources or is useful more broadly than for a single experiment. Making calibration information accessible enables its use to generate quality metrics and trend analyses. The calibration data array consists primarily of the derived calibration parameters, not the instrument outputs.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 8.3.3

 

Raw Data Sets Available on Request

The Data Center shall be capable of assembling and distributing raw data sets, including the calibration information acquired at the DKIST, upon authorized user request.

Rationale: The Data Center will make available raw data to users, which must be published in a usable form. These raw data must be accompanied by essential calibration information acquired at the DKIST in order for the data to be effectively used. Such calibration information would include, for instance, dark frames and flat fields acquired during Observing Programs.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 8.3.1

 

 

Metadata Set

The Data Center shall be capable of distributing collections of selected metadata for end-user consumption in structured machine-readable formats and accompanied by full and structured attribute descriptions for all metadata elements.

Rationale: The Data Center will enable users to search on various metadata fields, and should make available a way for users to download metadata of interest. An example might be seeing quality over some interval during which an experiment of interest was being executed.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 8.3.4

 

Data Set Standards

The Data Center shall distribute data sets adhering to predefined, commonly-used and supported standards for structured representation, file format, defined content, and archival.

Rationale: The use of standards lowers the barrier of entry for users to find value and utility in data sets produced by the DKIST. Defined standards also improve longevity for long-term datasets by instituting rules for use and verification capable of being implemented and enforced. The use of commonly-used and supported file formats optimally supports data usage by end-users. Most users of solar data have existing scripts and programs that accommodate commonly-used file formats such as FITS and its popular variants such as compressed FITS, which are accompanied by (some) standardization of definition and usage. Additionally, some DKIST instruments (notably the DL-NIRSP) may benefit from data representation in other formats such as HDF. The Data Center should utilize such standards, and ensure that the distributed data is accompanied by descriptive information for the format. Further, the selection of a few commonly-used file formats can provide a flexible representation of data within files, appropriate for the data and for the needs of users.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 8.5.3, 8.5.1

 

R1.145 Level 0 Search Interface Requirements for DKIST Staff

The Data Center shall provide a low-level interface to enable authorized DKIST staff to locate raw data sets for download.

Rationale: Authorized DKIST staff, such as instrument scientists, will require a means to locate and download raw data sets for troubleshooting, performance monitoring, or assisting with calibration errors.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 8.3.1

 

 

 

Data Center Usage

Non-Repudiative Data Source

The Data Center shall distribute checksum information with distributed data sets.

Rationale: While neither the DKIST nor the Data Center can logically enforce acceptable data usage, the Data Center can provide information that allows a user to determine whether the data obtained are the same as those residing in the Data Center data stores. Such information can take different forms, such as file checksums, which provide an integrity check for downloaded material. Such an approach is termed “non-repudiation”, i.e., no user of the data can deny its authenticity since a verifiable test (i.e., the checksum) is distributed along with the data itself. This facilitates trusted consumption of data, by individuals using it directly or by institutions choosing to mirror some of this data, and provides an entry point for the implemented system to verify the fidelity of any data set.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 5
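The integrity check described in the rationale can be sketched in Python using the standard `hashlib` module. The requirement does not name a checksum algorithm, so the use of SHA-256 here is an assumption, as are the function names.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, published_checksum):
    """Return True when the local file matches the distributed checksum."""
    return file_sha256(path) == published_checksum.strip().lower()
```

A user would call `verify_download` with the path of a downloaded file and the checksum distributed alongside it; a result of `False` indicates that the local copy differs from the one held by the Data Center.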

 

User Community

The Data Center shall be capable of supporting 5000 unique account-holding users requesting access to Data Center holdings.

Rationale: Conservative estimates for the number of potential individual users of the Data Center are based on approximate membership of national and international professional societies.

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD 7.1

 

Concurrent Usage

The Data Center shall be capable of supporting the concurrent usage of Data Center resources by at minimum 100 unique users with minimum bandwidth of 100 Megabits per second per user.

Rationale: The Data Center must serve the needs of multiple individuals simultaneously accessing its resources. “Resources” here are defined as data holdings, web pages, search interfaces, and available data analysis resources.

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD 7.1
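The aggregate outbound bandwidth implied by this requirement can be worked out with a short Python sketch. It assumes the simplest case, in which all users draw the minimum rate at the same time, and it ignores protocol overhead; the names are hypothetical.

```python
MIN_CONCURRENT_USERS = 100
MIN_MBIT_PER_SECOND_PER_USER = 100


def aggregate_bandwidth_mbit(users, per_user_mbit):
    """Total bandwidth, in Mbit/s, when every user draws the per-user rate."""
    return users * per_user_mbit


# 100 users x 100 Mbit/s = 10000 Mbit/s, i.e. 10 Gbit/s
TOTAL_MBIT_PER_SECOND = aggregate_bandwidth_mbit(
    MIN_CONCURRENT_USERS, MIN_MBIT_PER_SECOND_PER_USER
)
```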

 

Access Variability [Deleted]

 

Sandbox

The Data Center shall make available a managed computing resource for the purpose of addressing code or algorithm improvements, bugs, and failures, and determining the feasibility or utility of feature requests or modifications.

Rationale: In order to increase the efficacy and improve the performance of DKIST data calibrations, it will be necessary for staff to utilize a centralized, commonly-available computing resource to contribute work to the Data Center effort.

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD 10.3.11

 

Software, Systems, Support

Open Source Software

The Data Center shall make publicly available for download the documentation, algorithms, and code used in calibration.

Rationale: To supplement an open data policy, open source not only provides transparency of process for the Data Center, but also affords external users the ability to contribute to improvement and extension of Data Center methods.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 2.1

 

Open Documentation

The Data Center shall make available for download selected documentation for public consumption, including but not limited to those describing Data Center design, data model, and data curation mechanisms.

Rationale: To supplement an open data policy, open documentation not only provides transparency of process for the Data Center, but also affords external users the ability to contribute to improvement and extension of Data Center methods.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 2.1

 

Data Access Tools

The Data Center shall make publicly available for download simple, documented tools, at minimum for IDL and/or Python, that can be used for accessing and viewing data and metadata distributed by the Data Center.

Rationale: To speed adoption and usage of DKIST data, simple tools should be provided that give users access to data and metadata and support their ability to perform analysis on data.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 6.5.1

 

Data Access Through Published APIs

The Data Center shall make publicly available for viewing and download APIs and implementations thereof, in at minimum IDL and Python, for accessing the DKIST data stores.

Rationale: The majority of users of data utilize scripts to acquire data. In solar physics, the two predominant languages used are IDL and Python. Additional languages can be used by implementing the published API.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 7.1.2

 

Software and Document Version Control

The Data Center shall be capable of distributing version control information related to software and documentation that it maintains in publicly released repositories.

Rationale: Version-controlled software provides users with stable feature sets with which to maintain optimized work environments.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 5.5.2

 

Security

The Data Center shall implement and enforce cybersecurity hardware and software controls and systems to ensure the integrity, confidentiality, and availability of its internal systems and stored data.

Rationale: Ensuring that deployed systems are guided by effective cybersecurity policies and practices helps prevent and mitigate damage from malevolent attack. The three “pillars of cybersecurity” are integrity (maintaining the accuracy and consistency of systems and data over their lifecycle), confidentiality (preventing unauthorized disclosure of information), and availability (ensuring that systems and security controls are functioning well).

Priority: Required

Verification: Inspection

Requirement Origin: DC-OCD 10.3.4

 

Help

The Data Center shall provide information and services to users to aid usage of its services and functionality.

Rationale: Some aspects of a helpdesk would provide information to users navigating available web services, reading through provided documentation, or parsing downloaded files. Other aspects would include the ability to report bugs in software, request features in software, describe errors or inconsistencies in downloaded data and metadata, and to provide generalized commentary on the quality of service that the Data Center is providing.

Priority: Required

Verification: Demonstration

Requirement Origin: DC-OCD 10.3.8

 
