03 - Science Data Processing SA Process

Related Page: 03 - Data Processing Management (Composite Application)


Scope

The scope of the Science Data Processing system extends from the preparatory steps that organize input data and associate it with the necessary processing algorithms (Recipes), through processing, to the storage of output data. The preparatory process begins with the identification of Processing Candidates that are ready, followed by the assembly of Input Datasets and the selection of a Recipe to apply. Processing is then performed in an automated fashion for the nominal case, or a ticket is generated to have someone execute the processing manually. Finally, the output data is ingested.


Goals

  • Notify PIs of processed data becoming available for their proposal(s)
  • Provide an infrastructure for the execution of implemented algorithms capable of transforming raw DKIST data into processed Level 1 data
  • Facilitate automated processing
  • Facilitate manual processing by definition
  • Facilitate manual processing by exception
  • Automatically schedule processing jobs when they are "ready"
  • Create versions for processed datasets based upon the usage of Input Frames, Parameter Values, task order, code versions, and a record of manual influence
  • Package processed data appropriately for distribution, including relevant Parameter Values and algorithm documentation links
  • Capture the derived Fried parameter for the quality framework of a processed frame
  • Produce non-repudiation information (checksums) to include with processed frames
  • Ingest processed data
  • Generate browsable media for processed datasets
  • Generate a quality report for processed datasets
  • Map resource availability to the processing that needs to be done based upon priority
  • Provide visibility into the status of work associated with confirmed complete data from the summit
  • Enable the discovery of data frames based upon processing provenance data for internal (data center) use
  • Provide an infrastructure for the reprocessing of implemented algorithms capable of transforming raw DKIST data into processed Level 1 data

Key Concept: When to Start Processing

Processing can begin when all of the frames associated with the OPs in a Processing Candidate have arrived. This requires 3 pieces of information:

  • Processing Candidates: Contains all Observing Program As Run IDs that are needed to calibrate the observe OP.
  • Transfer Manifest: Contains the number of frames to expect by Observing Program As Run ID.
  • Science Frame Count: Incremented by 1 for each Science Frame ingested, grouped by Observing Program As Run ID.

These data are stored in the Metadata Store for a cron-based aggregation.

A 13-day backstop also exists in case there are issues with reconciling the planned vs. received counts.
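
A minimal sketch of this readiness check follows. The function and field names are illustrative assumptions, not the Metadata Store schema:

    from datetime import datetime, timedelta, timezone

    BACKSTOP = timedelta(days=13)  # the 13-day backstop described above

    def candidate_is_ready(
        candidate_op_ids: list,   # Observing Program As Run IDs in the Processing Candidate
        expected_counts: dict,    # Transfer Manifest: expected frames per As Run ID
        ingested_counts: dict,    # Science Frame Count: ingested frames per As Run ID
        first_frame_received: datetime,
    ) -> bool:
        """True when every OP's frames have arrived, or the backstop has elapsed."""
        all_arrived = all(
            op_id in expected_counts
            and ingested_counts.get(op_id, 0) >= expected_counts[op_id]
            for op_id in candidate_op_ids
        )
        backstop_elapsed = datetime.now(timezone.utc) - first_frame_received > BACKSTOP
        return all_arrived or backstop_elapsed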


Key Concept: Recipe Run

A Recipe Run is used to track the execution of processing/calibration work. It contains references to Recipe information (enabling the determination of which algorithm to run), Input Dataset information (for determining algorithm inputs), and finally the Processing Candidate it is working toward fully calibrating. While the Recipe Run has an explicit status, the Recipe Instance and Processing Candidate also have an implicit status based upon the statuses of the Recipe Runs associated with them.
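
The sketch below shows one possible shape for these references and one possible roll-up rule for the implicit status; the class, field, and status names are assumptions for illustration:

    from dataclasses import dataclass
    from enum import Enum

    class RunStatus(Enum):
        QUEUED = "queued"
        RUNNING = "running"
        COMPLETE = "complete"
        FAILED = "failed"

    @dataclass
    class RecipeRun:
        recipe_id: str         # which algorithm (Recipe) to run
        input_dataset_id: str  # where the algorithm inputs come from
        candidate_id: str      # the Processing Candidate being calibrated
        status: RunStatus      # the run's explicit status

    def implicit_status(runs: list) -> RunStatus:
        """Roll the explicit statuses of associated Recipe Runs up into an
        implicit status for a Recipe Instance or Processing Candidate."""
        statuses = {run.status for run in runs}
        if RunStatus.FAILED in statuses:
            return RunStatus.FAILED
        if statuses == {RunStatus.COMPLETE}:
            return RunStatus.COMPLETE
        if RunStatus.RUNNING in statuses:
            return RunStatus.RUNNING
        return RunStatus.QUEUED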


Key Concept: Algorithm Documentation Access

The headers of processed data contain links to the documentation and code for the version of the workflow that was used to produce it.
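
As an illustration only, assuming hypothetical keyword names and URLs (the actual header schema defines the real ones):

    from astropy.io import fits

    header = fits.Header()
    # Hypothetical keywords linking a frame to its workflow version's docs and code.
    header["DOCURL"] = "https://example.org/workflow-docs/v1.2.3"
    header["CODEURL"] = "https://example.org/workflow-code/v1.2.3"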



Key Concept: Algorithm Documentation Factoring

The algorithms are implemented as workflows, which may share tasks between them (e.g., dark correction). The documentation should guide users through these relationships without duplicating the underlying algorithm docs.
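
A sketch of this factoring, with hypothetical task and workflow names:

    def dark_correction(frames):
        """Shared task: its algorithm documentation lives in one place and is
        referenced by every workflow that includes the task."""
        return frames

    def gain_correction(frames):
        return frames

    def speckle_reconstruction(frames):
        return frames

    # Two workflows factored from a common pool of tasks.
    WORKFLOW_A = [dark_correction, gain_correction]
    WORKFLOW_B = [dark_correction, gain_correction, speckle_reconstruction]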



Key Concept: Recipe Run Configuration

Recipe Runs have dynamic configuration dictionaries that are used by a processing pipeline to do things such as set which bucket the results of a pipeline will be saved into. The run configuration consists of the default configuration for a Recipe merged with any run-specific configuration specified at Recipe Run creation time, with new values overriding defaults where they exist. The actual configuration is stored with the run record to provide tracing to the "as run" configuration, which is independent of modifications that may be made to the defaults over time.
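
Because the merge is a dictionary overlay, it can be sketched in a few lines; the keys and values here are illustrative:

    def run_configuration(recipe_defaults: dict, run_overrides: dict) -> dict:
        """Merge a Recipe's default configuration with run-specific values;
        overrides win where both define a key."""
        return {**recipe_defaults, **run_overrides}

    defaults = {"destination_bucket": "data", "tile_size": 64}
    as_run = run_configuration(defaults, {"destination_bucket": "scratch"})
    # as_run == {"destination_bucket": "scratch", "tile_size": 64}
    # as_run is what gets stored with the run record.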

Manage Recipes

The Manage Recipes process is responsible for data management of the defining elements of a Recipe. This includes the creation and update of Recipes and their associated workflow specifications, instruments, parameters, and valid parameter values.
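
One possible shape for these defining elements, with hypothetical class and field names:

    from dataclasses import dataclass, field

    @dataclass
    class Parameter:
        name: str
        valid_values: list  # the values this parameter is allowed to take

    @dataclass
    class Recipe:
        recipe_id: str
        workflow_spec: str  # reference to the workflow specification
        instrument: str
        parameters: list = field(default_factory=list)  # list of Parameter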



Processing Scheduling

The Processing Scheduling system is responsible for the planning and management of calibration work. This includes (see the sketch after this list):

  • Identification of Processing Candidates that are ready for processing (automated)
  • Assembly of Input Datasets and associated algorithms (currently manual with software support; automation is planned once real-world experience has been gained)
  • Scheduling of resources for automated processing jobs (automated)
  • Generation of service desk tickets for manual processing jobs (automated)
  • Management of jobs currently being processed or scheduled for resource allocation (manual with software support)
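
The fork between automated scheduling and ticket generation can be sketched as follows; all function names are hypothetical:

    def schedule_automated_job(candidate_id):
        print(f"scheduled automated run for {candidate_id}")

    def create_service_desk_ticket(candidate_id):
        print(f"opened manual-processing ticket for {candidate_id}")

    def dispatch(candidate_id, is_ready, recipe_is_automated):
        """Route a Processing Candidate once it is ready."""
        if not is_ready:
            return "waiting"
        if recipe_is_automated:
            schedule_automated_job(candidate_id)  # resources allocated by priority
            return "scheduled"
        create_service_desk_ticket(candidate_id)  # manual-by-definition path
        return "ticketed"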



Input Data Assembly

The Input Data Assembly process is responsible for creating a record of all Frames and Parameter Values that are necessary as inputs to an algorithm planned for execution.
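
A hypothetical example of such a record, with illustrative identifiers and fields:

    input_dataset = {
        "input_dataset_id": "IDS-0001",                # illustrative identifier
        "frames": ["FRAME-0001", "FRAME-0002"],        # every input Frame reference
        "parameter_values": {"wavelength_nm": 630.2},  # pinned Parameter Values
        "recipe_id": "RCP-0001",                       # algorithm planned for execution
    }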




Automated Processing Execution

The Automated Processing system is responsible for the execution of Recipe Runs that were scheduled for automated processing. These Recipe Runs all follow a similar pattern (see the sketch after this list):

  • Executes business management functions such as data retrieval and updating Recipe Run status
  • Performs calibration steps
  • Executes business management functions such as processed data ingest and updating Recipe Run status
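
A sketch of this pattern, with stubbed business functions and an illustrative task list:

    def retrieve_input_data(run):
        print("retrieving input dataset")

    def ingest_processed_data(run):
        print("ingesting processed data")

    def set_run_status(run, status):
        print(f"recipe run status -> {status}")

    def execute_recipe_run(run):
        # Business management: fetch inputs and mark the run in progress.
        retrieve_input_data(run)
        set_run_status(run, "RUNNING")
        # Science: apply the calibration steps in order.
        for task in run["calibration_tasks"]:
            task(run)
        # Business management: ingest outputs and mark the run complete.
        ingest_processed_data(run)
        set_run_status(run, "COMPLETE")

    execute_recipe_run({"calibration_tasks": [lambda run: print("dark correction")]})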



Manual Processing Execution

The Manual Processing Execution system is responsible for handling two separate types of processing requests. Manual by Definition requests are those that are planned to be processed manually due to a component that has not yet been automated. Manual by Exception jobs are those that were attempted in an automated fashion but failed. The resolution of this second class can result in a manual process, a simple retry, or an adjustment to the planning information followed by a retry.
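
The three resolution paths for the exception case can be captured in a simple enumeration; the names are illustrative:

    from enum import Enum

    class ExceptionResolution(Enum):
        PROCESS_MANUALLY = "finish the job as a manual process"
        RETRY = "simple retry"
        ADJUST_AND_RETRY = "adjust the planning information, then retry"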



Catalog Processed Data

The Catalog Processed Data system is responsible for the ingest of processed data, as well as the generation and ingest of metadata to support the discovery and data management of processed data. These metadata include:

  • Dataset Inventory
  • Frame Inventory
  • ASDF records
  • Quality Report
  • Object Inventory
  • Browse Movies



Reprocessing

The Reprocessing system is primarily an assembly of other Science Data Processing capabilities. In cases such as an improved algorithm, reprocessing will be necessary. When such a circumstance arises, activities are required to discover the datasets that need reprocessing and to harvest their processing planning information. This planning information is copied, updated to use the new plan (algorithm, inputs, or both), and re-executed. At the conclusion of the execution, it is the responsibility of the DQAC to determine whether both datasets must continue to be curated or, if not, which one should be discarded.
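
A sketch of the copy-and-update step, assuming a dictionary-shaped plan record with hypothetical keys:

    import copy
    from typing import Optional

    def plan_reprocessing(original_plan: dict,
                          new_recipe_id: Optional[str] = None,
                          new_input_dataset_id: Optional[str] = None) -> dict:
        """Copy harvested planning information and update the algorithm,
        the inputs, or both, for re-execution."""
        new_plan = copy.deepcopy(original_plan)
        if new_recipe_id is not None:
            new_plan["recipe_id"] = new_recipe_id
        if new_input_dataset_id is not None:
            new_plan["input_dataset_id"] = new_input_dataset_id
        return new_plan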