05 - Distribution SA Process

Related Pages: 05 - Science Data Distribution Management (Composite Application)


Scope

The scope of the Distribution process encompasses the preparation for and execution of data downloads. The preparation can be triggered via the Search and Discover process for processed data or the Ops Support process for raw data. Regardless of the triggering process, the download process is the same.


Goals

  • Distribute data via a network to authenticated and, where needed (e.g., for embargoed data), authorized users based upon search results
  • Limit concurrent downloads per user to prevent monopolization of bandwidth
  • Enable secure and check-pointed downloads

Key Concept: Endpoints

Data is moved from one endpoint to another. In the download process, the source endpoint is the data center and the destination is one the user chooses. Neither the source nor the destination endpoint needs to be the computer the user is using to trigger the download. The Globus transfer service enables this process through its endpoint software: the endpoint and cloud software handle the necessary authorization and authentication, and the data flows directly between the source and destination.
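
As an illustration only (not the Data Center's actual implementation), the sketch below shows an endpoint-to-endpoint transfer using the Globus Python SDK (v3-style calls). The endpoint UUIDs, client ID, and paths are placeholders.

```python
import globus_sdk

# Placeholder IDs/paths: the real source endpoint is operated by the Data
# Center; the destination endpoint is whichever endpoint the user chooses.
SOURCE_ENDPOINT = "source-endpoint-uuid"
DEST_ENDPOINT = "destination-endpoint-uuid"
CLIENT_ID = "registered-native-app-client-id"

# Authenticate the user (Native App flow); the endpoint software on each side
# then performs the transfer without the data passing through this machine.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow(requested_scopes=globus_sdk.TransferClient.scopes.all)
print("Log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Auth code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

# Describe the transfer: one dataset directory, checksummed so interrupted
# transfers can be resumed (check-pointed downloads).
tdata = globus_sdk.TransferData(
    tc, SOURCE_ENDPOINT, DEST_ENDPOINT,
    label="DKIST dataset download",
    sync_level="checksum",
)
tdata.add_item("/data/pid_1_23/ABCDE/", "/~/dkist/ABCDE/", recursive=True)

task = tc.submit_transfer(tdata)
print("Submitted transfer task:", task["task_id"])
```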


Key Concept: ASDF

For each dataset produced by the DKIST Data Center, a single associated ASDF file will be produced that allows examination of all header metadata in the dataset without downloading the full collection of FITS files that contain the science data arrays. This ASDF file will be delivered in a format compliant with the ASDF standard, will be leveraged by the DKIST User Tools, and will be made available for download by DKIST users. The ASDF file enables the User Tools to support metadata exploration as well as filtering/slicing that reduces the number of FITS files actually downloaded from a dataset.
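
A minimal sketch, using the asdf Python library, of what metadata-only inspection could look like; the file name and tree keys are illustrative assumptions rather than the documented DKIST ASDF layout.

```python
import asdf

# Open the dataset's ASDF file: only the metadata tree is read; the science
# data arrays stay in their (not yet downloaded) FITS files.
with asdf.open("ABCDE.asdf") as af:          # file name is a placeholder
    tree = af.tree
    print("Top-level keys:", list(tree.keys()))

    # Key names below are illustrative assumptions about the layout.
    headers = tree.get("headers")            # per-FITS-file header metadata
    files = tree.get("fileuris")             # names of the FITS files
    if headers is not None and files is not None:
        print(f"{len(files)} FITS files described without downloading any")
```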


Key Concept: Endpoint ACL Topology

To support DKIST authorization requirements while using Globus to facilitate data transfer, a pattern for applying ACLs is needed. The pattern must fit within Globus limits and DKIST authorization requirements, which fall into two broad categories: raw and processed data.

For raw data, the pattern is simple in that the ACLs are broadly applied and can be captured near the root (e.g., the bucket) of the data tree. There are exceptions, but they are expected to be infrequent and temporary. 

For processed data, the pattern is more complex. ACLs cannot be broadly applied at the root because the most common ACL (all_authenticated_users) is permissive and would preclude more restrictive rules lower in the tree. The primary need for more restrictive permissions comes from the requirement to support embargoing data for an as-yet-undetermined amount of time. Also unknown, but more bounded (on the order of 10 - 100 per year), is the number of proposals expected to be embargoed. ACLs on embargoed proposals would grant read rights to the investigators on the proposal. A sketch of both patterns follows the constraint summary below.

Constraint Summary:

  • Proposals can be embargoed
    • applies to processed (data bucket) and unprocessed (raw bucket) data
    • embargoed proposals can only be accessed by the proposal's investigators
      • exceptions can be made for members of an authorized agents group
  • Raw data is generally unavailable
    • exceptions can be made in the form of an approved request (embargo rule evaluation is part of the approval process)
    • exceptions can be made for members of the authorized agents group
  • Processed data metadata is available independent of embargo
  • ACL evaluation is additive, i.e., if a user has access anywhere in the tree, they will have access there and below
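
To make the two patterns concrete, here is a hedged sketch of how they might map onto Globus access rules via the Python SDK's classic Transfer ACL calls; the endpoint IDs, group ID, identity ID, and proposal paths are placeholders, not Data Center values.

```python
import globus_sdk

def apply_acl_topology(tc: globus_sdk.TransferClient) -> None:
    """Illustrative ACL layout; `tc` is an authenticated TransferClient
    acting for the Data Center (see the transfer sketch above)."""
    RAW_ENDPOINT = "raw-bucket-endpoint-uuid"      # placeholder
    DATA_ENDPOINT = "data-bucket-endpoint-uuid"    # placeholder

    # Raw bucket: one broad rule near the root, readable only by the
    # authorized-agents group; per-request exceptions are handled separately.
    tc.add_endpoint_acl_rule(RAW_ENDPOINT, {
        "DATA_TYPE": "access",
        "principal_type": "group",
        "principal": "authorized-agents-group-uuid",   # placeholder
        "path": "/",
        "permissions": "r",
    })

    # Data bucket: per-proposal rules. A non-embargoed proposal gets the
    # permissive all_authenticated_users rule on its own directory ...
    tc.add_endpoint_acl_rule(DATA_ENDPOINT, {
        "DATA_TYPE": "access",
        "principal_type": "all_authenticated_users",
        "principal": "",
        "path": "/pid_1_23/",                          # placeholder proposal
        "permissions": "r",
    })
    # ... while an embargoed proposal instead gets read rules only for its
    # investigators (because evaluation is additive, no broad rule may sit
    # at the root above it).
    tc.add_endpoint_acl_rule(DATA_ENDPOINT, {
        "DATA_TYPE": "access",
        "principal_type": "identity",
        "principal": "investigator-identity-uuid",     # placeholder
        "path": "/pid_1_99/",                          # placeholder proposal
        "permissions": "r",
    })
```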


Key Concept: ACL Timeline


Manage Embargo Authorizations

The Manage Embargo Authorizations process is responsible for setting and updating ACLs on the endpoint to reflect the current embargo state.
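
As a sketch only (the embargo-state source, identifiers, and paths are hypothetical), one way this process could reconcile Globus access rules with the current embargo state using the SDK's ACL listing, creation, and deletion calls:

```python
import globus_sdk

def reconcile_embargo_acls(tc: globus_sdk.TransferClient,
                           endpoint_id: str,
                           embargoed: dict[str, set[str]]) -> None:
    """Bring endpoint ACLs in line with the current embargo state.

    `embargoed` maps a proposal directory (e.g. "/pid_1_99/") to the identity
    UUIDs of its investigators; it is a hypothetical input produced elsewhere
    in the Data Center. Simplified: restoring broad access when an embargo
    expires is not shown.
    """
    rules = list(tc.endpoint_acl_list(endpoint_id))
    have = {(r["path"], r["principal_type"], r["principal"]) for r in rules}

    for rule in rules:
        if rule["path"] in embargoed:
            # Proposal is (still) embargoed: remove any permissive rule so
            # that only investigator-specific rules remain on this subtree.
            if rule["principal_type"] == "all_authenticated_users":
                tc.delete_endpoint_acl_rule(endpoint_id, rule["id"])
        elif rule["principal_type"] == "identity":
            # Proposal is no longer embargoed: investigator rules can go.
            tc.delete_endpoint_acl_rule(endpoint_id, rule["id"])

    # Ensure each embargoed proposal grants read access to its investigators.
    for path, investigators in embargoed.items():
        for identity in investigators:
            if (path, "identity", identity) in have:
                continue  # rule already present; avoid duplicates
            tc.add_endpoint_acl_rule(endpoint_id, {
                "DATA_TYPE": "access",
                "principal_type": "identity",
                "principal": identity,
                "path": path,
                "permissions": "r",
            })
```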



User Download

The User Download process is responsible for distributing the data that was requested. A user can execute this process via the web, the User Tools, or a custom library that accesses Data Center APIs. Via the web, the workflow is a simple download of the files the user is authorized to access. For the User Tools, the workflow involves first downloading the ASDF file to facilitate further filtering/slicing, followed by the targeted (per FITS file) retrieval of data within the dataset.
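
To illustrate the User Tools style of workflow (with assumed names, not the actual User Tools API), the sketch below filters on the ASDF metadata and then transfers only the matching FITS files; the tree keys, the "STOKES" header keyword, and the directory layout are placeholders.

```python
import asdf
import globus_sdk

def download_selected_fits(tc: globus_sdk.TransferClient,
                           source_endpoint: str,
                           dest_endpoint: str,
                           asdf_path: str = "ABCDE.asdf") -> str:
    """Transfer only the FITS files whose headers match a filter.

    `tc` is an authenticated TransferClient (see the endpoint sketch above).
    The ASDF tree keys and the "STOKES" header keyword are assumptions made
    for illustration, not the documented DKIST layout.
    """
    with asdf.open(asdf_path) as af:
        headers = af.tree["headers"]    # per-file header records (assumed key)
        files = af.tree["fileuris"]     # matching FITS file names (assumed key)
        wanted = [f for f, h in zip(files, headers) if h.get("STOKES") == "I"]

    tdata = globus_sdk.TransferData(
        tc, source_endpoint, dest_endpoint, label="Targeted FITS retrieval"
    )
    for name in wanted:
        # Placeholder dataset directory layout on both endpoints.
        tdata.add_item(f"/data/pid_1_23/ABCDE/{name}", f"/~/dkist/ABCDE/{name}")
    return tc.submit_transfer(tdata)["task_id"]
```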