06 - Operational Support SA Process

Related Page: Composite Application, 06 - Operations Support Management


Scope

The scope of the Operational Support processes covers a variety of capabilities that surround the core functions of the Data Center. These processes can be broken down into three categories:

  • User Management: User Registration, User Authorization Management, and Processed Data Availability Notification all relate to user-facing functions.
  • System Management: Data Holding Audit, Monitor System Health, and Poison Message Management are focused on internal system health.
  • Help Desk Management: Raw Data Search, DQAC Reduction, Service Desk Management, and Code and Algorithm Documentation Distribution contain functionality that will be executed by a multi-tier help desk.



Goals

  • Provide user registration management capabilities
  • Facilitate the management of user authorizations
  • Provide visibility into Data Center health status
  • Facilitate the management of health status data
  • Facilitate the auditing of data holdings
  • Support the tracking and execution of DQAC data reduction activities
  • Support the annotation of lost or degraded data holdings
  • Facilitate service management through help desk tickets
  • Support the following types of service requests:
      • Requests for uncalibrated raw data
      • Data review due to reprocessing
      • Manual processing by definition
      • Manual processing by exception
      • Ops Tools OP association data ingest failure (update post processing)
      • Processing Jobs without Recipe
      • Generic help request
  • Notify Investigators of processed data availability
  • Support the removal of deprecated datasets
  • Support raw data discovery by users authorized by the Data Center
  • Support raw data discovery based upon approved requests
  • Store document, code, and tool artifacts
  • Provide a search capability for documents, code, and tool artifacts that includes relevant metadata, such as document version
  • Maintain the validity of links to artifacts included in datasets
  • Enable the retrieval of stored artifacts

Key Concept: Algorithm Document Distribution

The documentation on algorithms is embedded in the code itself as markup that can be processed into a documentation website (e.g., https://docs.sunpy.org/en/stable/). These documents are versioned along with the code when modifications are built and released. The output Frames of a Processing Run are tagged with provenance information that includes the version of the code as well as the relevant sections of the documentation (e.g., the DAG name). 
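
A minimal sketch of the provenance tagging described above; the field names, the tag_frame helper, and the URL scheme are illustrative assumptions, not a defined Data Center interface:

from dataclasses import dataclass


@dataclass
class Provenance:
    """Provenance attached to each output Frame of a Processing Run (hypothetical structure)."""
    code_version: str       # Version of the released code that produced the Frame
    dag_name: str           # DAG name identifying the relevant documentation section
    documentation_url: str  # Link into the documentation website versioned with the code


def tag_frame(frame: dict, code_version: str, dag_name: str) -> dict:
    """Attach provenance to a Frame so its documentation can always be located."""
    # Assumed URL scheme: the docs site is versioned alongside each code release
    url = f"https://docs.example.org/{code_version}/dags/{dag_name}"
    frame["provenance"] = Provenance(code_version, dag_name, url)
    return frame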


Key Concept: Auditing Book to Floor and Floor to Book

The concept of auditing the inventory of science objects is very similar to cycle/physical counting in a warehouse. One must verify that what the Inventory records say exists (the book) is what is actually in the Science Object Store (the floor); this catches discrepancies where objects are recorded but missing. To catch objects that are present but unrecorded, one must start at the floor (i.e., the Science Object Store) and verify that each object has a record in the Inventory (the book).
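
Both directions reduce to set comparisons between the Inventory records and the store contents. A minimal sketch, assuming object identifiers are directly comparable across the two systems (the function and key names are illustrative only):

def audit(book_ids: set, floor_ids: set) -> dict:
    """Compare the Inventory (book) against the Science Object Store (floor)."""
    return {
        # Book to floor: recorded in the Inventory but missing from the store
        "missing_from_floor": book_ids - floor_ids,
        # Floor to book: present in the store but unknown to the Inventory
        "missing_from_book": floor_ids - book_ids,
    }


# Example usage
discrepancies = audit(book_ids={"a", "b", "c"}, floor_ids={"b", "c", "d"})
# {'missing_from_floor': {'a'}, 'missing_from_book': {'d'}}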



Key Concept: Audit Sample Size Determination

Source information can be found here: https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/find-sample-size/

Cochran's Formula
"""
Calculate a sample size given population and desired confidence
"""
from scipy.stats import norm


def calculate_sample_size(population: int, confidence_level: float, confidence_interval) -> int:
    """
    Calculate the sample size using Cochran's sample size formula.
    https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/find-sample-size/

    :param population: Integer representing the size of the population (e.g. 1000)

    :param confidence_level: Float representing the decimal percentage confidence in the sample
    result being representative of the population (e.g. .95 for 95% confidence)

    :param confidence_interval: Float representing the decimal percentage error acceptable on the
    result (e.g. .05 for +- 5%)

    :return: Integer representing the number of samples required to achieve the confidence level / interval (e.g. 278)

    Sample Size
    ss =	Z^2 * (p) * (1-p)
            -----------------
                   c^2
    Where:

    Z = Z value (e.g. 1.96 for 95% confidence level)
    p = percentage picking a choice, expressed as decimal
    (.5 used for sample size needed)
    c = confidence interval, expressed as decimal
    (e.g., .04 = ±4)


    Correction for Finite Population
                   ss
                --------
    new ss =
                1+	ss-1
                --------
                  pop


    Where: pop = population

    Example:
    >>>calculate_sample_size_v2(1000, .95, .05)
    278

	Example Answer Explained:
    If you sampled 278 elements of the 1000 element population and found 10% loss then you would be 95%
    confident that the population loss rate was between 5% and 15%
    """

    # TODO type check
    # TODO range check

    p = .5  # Worst case (i.e. largest sample size needed) probability where half the files are missing
    confidence_level += (1 - confidence_level)/2  # Update confidence level to include the entire left tail
    Z = norm.ppf(confidence_level)  # Determine Z score based upon confidence level

    ss = Z**2 * p * (1-p) / confidence_interval**2  # Determine sample size for infinite population

    sample_size = ss/(1 + (ss-1) / population) # Correct for finite population

    return int(sample_size)
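
Note that the finite-population correction saturates quickly, so even very large holdings require only a few hundred samples at this confidence level and interval:

# The required sample size grows slowly with population at 95% confidence, +-5% interval
calculate_sample_size(1_000, .95, .05)      # 278
calculate_sample_size(1_000_000, .95, .05)  # 384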


User Registration

The User Registration process is responsible for generating a user account in Globus. A Globus user account is required to use Globus's advanced transfer functionality.



User Authorization Management

The User Authorization Management process is primarily responsible for updates to the list of DKIST Authorized Agents. DKIST Authorized Agents are people authorized by the DKIST Director to be exempt from embargo restrictions for the purpose of instrument analyses.

Approved tickets relating to authorization changes for embargoed proposal group membership are also executed here.



Data Holding Audit

The Data Holding Audit process is responsible for the floor to book, book to floor, and object integrity verifications that occur on an ongoing basis.



Service Desk Management

The Service Desk Management process is responsible for managing the work tickets in the ticketing system. Tickets are routed according to their type; in some cases (e.g., a Raw Data search), other processes must be executed in order to resolve a ticket. This process is intended as a guideline for the capabilities a Data Center component of a larger DKIST-wide Help Desk would need, and as such, it would form just one part of the DKIST-wide Service Desk Management process.
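
A minimal sketch of routing by ticket type, using service request types from the Goals list; the type keys, queue names, and registry itself are illustrative assumptions, not the actual DKIST ticketing configuration:

# Hypothetical mapping of ticket type to the process or queue that resolves it
TICKET_ROUTES = {
    "raw_data_request": "raw_data_search",       # Resolved via the Raw Data Search process
    "dqac_reduction": "dqac_reduction",          # Resolved via the DQAC Reduction process
    "manual_processing_by_definition": "operations_queue",
    "manual_processing_by_exception": "operations_queue",
    "generic_help_request": "help_desk_tier_1",
}


def route_ticket(ticket_type: str) -> str:
    """Return the queue or process responsible for resolving a ticket."""
    try:
        return TICKET_ROUTES[ticket_type]
    except KeyError:
        # Unrecognized types fall through to a human triage queue
        return "help_desk_tier_1"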



Raw Data Search

The Raw Data Search process is responsible for fulfilling two types of Raw Data searches. The first type covers requests from users outside of DKIST Operations, which require prior approval; if such a request is approved, the requested data is collated and prepared for distribution by support personnel. The second type serves users within DKIST Operations, who are provided an API for discovering Raw Data by Observing Program ID. Because the API places load on systems critical to data processing, its search criteria are deliberately limited.
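
For the internal case, a sketch of what such a constrained API call might look like; the endpoint URL and parameter name are assumptions for illustration, not the actual interface:

import requests


def find_raw_frames(observing_program_id: str) -> list:
    """Discover Raw Data for a single Observing Program (hypothetical internal API).

    The search criteria are deliberately limited to the Observing Program ID to
    bound the load placed on systems critical to data processing.
    """
    response = requests.get(
        "https://datacenter.example.org/api/raw-data",  # hypothetical endpoint
        params={"observing_program_id": observing_program_id},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()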



DQAC Reduction

The DQAC Reduction process is responsible for executing data reduction activities within the Data Center. The management of the request surrounding the execution is handled by the Service Desk Management process. Removal of data requires maintaining any associated inventory information.
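
A sketch of the invariant removal must preserve: the Science Object Store and the Inventory change together, or the discrepancy is surfaced. The store and inventory interfaces are hypothetical:

def remove_data(object_id: str, store, inventory) -> None:
    """Remove an object while keeping the Inventory consistent with the store.

    store and inventory are hypothetical interfaces; a removal that updates one
    but not the other creates exactly the book/floor discrepancies the Data
    Holding Audit process exists to catch.
    """
    store.delete(object_id)
    try:
        inventory.remove(object_id)
    except Exception as exc:
        # Surface the inconsistency rather than leaving a silent book/floor mismatch
        raise RuntimeError(
            f"{object_id} deleted from the store but still recorded in the Inventory"
        ) from exc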



Processed Data Availability Notification

The Processed Data Availability Notification process is responsible for notifying Investigators when data for their Proposal has been processed. Notifications are delivered in digest form. These processing events can occur over a long timescale, as reprocessing events also trigger notification.
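
A minimal sketch of digest assembly, grouping processing events by Proposal so each Investigator receives one notification; the event fields are illustrative assumptions:

from collections import defaultdict


def build_digests(events: list) -> dict:
    """Group processed-data events by Proposal so each Investigator gets one digest.

    Each event is assumed to carry a 'proposal_id' and a dataset reference;
    reprocessing events flow through the same path as first-time processing.
    """
    digests = defaultdict(list)
    for event in events:
        digests[event["proposal_id"]].append(event)
    return dict(digests)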



Monitor System Health

The Monitor System Health process is responsible for providing the capabilities to inspect the run-time operations of the Data Center. This includes logs, system telemetry, and Data Center custom events. This information degrades in utility over time, so it is not retained forever. The culling process does, however, allow aggregations of select events to be retained, providing monitoring information over timescales longer than the retained history.
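
A sketch of the retention idea: raw events are culled after a window, but a pre-computed aggregate survives them. The event shape and the thirty-day window are assumptions:

from collections import Counter
from datetime import datetime, timedelta, timezone


def cull_events(events: list, retention_days: int = 30) -> tuple:
    """Drop events older than the retention window, keeping per-type counts.

    The aggregate (a per-event-type Counter here) preserves long-timescale
    monitoring information after the raw events themselves are discarded.
    Each event is assumed to carry a 'timestamp' datetime and an 'event_type'.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    expired = [e for e in events if e["timestamp"] < cutoff]
    retained = [e for e in events if e["timestamp"] >= cutoff]
    aggregate = Counter(e["event_type"] for e in expired)
    return retained, aggregate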

Code and Algorithm Documentation Distribution

The Code and Algorithm Documentation Distribution process is responsible for making the algorithm documentation and user libraries created by the Data Center searchable and available for display.



Poison Message Management

The Poison Message Management process is responsible for reviewing, and taking corrective action on, unprocessable messages that were on the Interservice Bus.
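
A sketch of how messages typically become candidates for this review: after repeated processing failures they are parked rather than redelivered. The process and poison_queue interfaces and the retry limit are hypothetical; real brokers (e.g., RabbitMQ dead-letter exchanges) provide equivalent mechanics:

MAX_ATTEMPTS = 3


def handle_message(message: dict, process, poison_queue) -> None:
    """Route messages that repeatedly fail processing to a poison queue for review."""
    try:
        process(message)
    except Exception as exc:
        message["attempts"] = message.get("attempts", 0) + 1
        if message["attempts"] >= MAX_ATTEMPTS:
            # Park the message for human review and corrective action
            poison_queue.put({"message": message, "error": repr(exc)})
        else:
            raise  # Let the bus redeliver the message for another attempt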