06 - Operational Support SA Process
Related Page | Link |
---|---|
Composite Application | 06 - Operations Support Management |
Scope
The Operational Support processes cover a variety of capabilities that surround the core functions of the Data Center. These processes fall into three categories:
- User Management: User Registration, User Authorization Management, and Processed Data Availability Notification all relate to user-facing functions.
- System Management: Data Holding Audit, Monitor System Health, and Poison Message Management are focused on internal system health.
- Help Desk Management: Raw Data Search, DQAC Reduction, Service Desk Management, and Code and Algorithm Documentation Distribution contain functionality that will be executed by a multi-tier help desk.
Goals
Goal |
---|
Provide user registration management capabilities |
Facilitate the management of user authorizations |
Provide visibility into Data Center health status |
Facilitate the management of health status data |
Facilitate the auditing of data holdings |
Support the tracking and execution of DQAC data reduction activities |
Support the annotation of lost or degraded data holdings |
Facilitate service management through help desk tickets |
Support the following types of service requests: |
Notify Investigators of processed data availability |
Support the removal of deprecated datasets |
Support raw data discovery by users authorized by the data center |
Support raw data discovery based upon approved requests |
Store document, code and tool artifacts |
Provide a search capability for documents, code, and tool artifacts that include relevant metadata, such as document version |
Maintain validity of links to artifacts included in datasets |
Enable the retrieval of stored artifacts |
Key Concept: Algorithm Document Distribution
The documentation on algorithms is embedded in the code itself as markup that can be processed into a documentation website (e.g., https://docs.sunpy.org/en/stable/). These documents are versioned along with the code when modifications are built and released. The output Frames of a Processing Run are tagged with provenance information that includes the version of the code as well as the relevant sections of the documentation (e.g., the DAG name).
Key Concept: Auditing Book to Floor and Floor to Book
Auditing the inventory of science objects is very similar to cycle (physical) counting in a warehouse. A book-to-floor audit verifies that what Inventory (the book) says is held is actually present in the Science Object Store (the floor). This catches objects that are recorded but missing. A floor-to-book audit starts at the Science Object Store and verifies that each object found there has a record in Inventory. This catches objects that are held but unrecorded.
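The two audit directions can be sketched as a pair of set differences. This is a minimal illustration, assuming both Inventory and the Science Object Store can be listed as collections of object identifiers; the function name `audit_holdings` and the result keys are hypothetical, not part of the Data Center design.

```python
def audit_holdings(inventory_ids, object_store_ids):
    """Compare inventory records (the book) with stored objects (the floor)."""
    book = set(inventory_ids)
    floor = set(object_store_ids)
    return {
        # Book to floor: recorded in inventory but absent from the store.
        "missing_from_floor": sorted(book - floor),
        # Floor to book: present in the store but absent from inventory.
        "missing_from_book": sorted(floor - book),
    }


result = audit_holdings(["a", "b", "c"], ["b", "c", "d"])
print(result)
```

A full listing of both sides is rarely practical at scale, which is why the audit is typically run on a sample.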
Key Concept: Audit Sample Size Determination
Source information can be found here: https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/find-sample-size/
```python
"""
Calculate a sample size given population and desired confidence
"""
import math

from scipy.stats import norm


def calculate_sample_size(population: int, confidence_level: float, confidence_interval: float) -> int:
    """
    Calculate the sample size using Cochran's sample size formula.
    https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/find-sample-size/

    :param population: Integer representing the size of the population (e.g. 1000)
    :param confidence_level: Float representing the decimal percentage confidence in the sample
        result being representative of the population (e.g. .95 for 95% confidence)
    :param confidence_interval: Float representing the decimal percentage error acceptable on
        the result (e.g. .05 for +- 5%)
    :return: Integer representing the number of samples required to achieve the confidence
        level / interval (e.g. 278)

    Sample Size

             Z^2 * (p) * (1-p)
        ss = -----------------
                    c^2

    Where:
        Z = Z value (e.g. 1.96 for 95% confidence level)
        p = percentage picking a choice, expressed as decimal (.5 used for sample size needed)
        c = confidence interval, expressed as decimal (e.g., .04 = ±4)

    Correction for Finite Population

                        ss
        new ss = ----------------
                      ss - 1
                 1 + --------
                        pop

    Where:
        pop = population

    Example:
        >>> calculate_sample_size(1000, .95, .05)
        278

    Example Answer Explained:
        If you sampled 278 elements of the 1000 element population and found 10% loss then you
        would be 95% confident that the population loss rate was between 5% and 15%
    """
    # TODO type check
    # TODO range check
    p = .5  # Worst case (i.e. largest sample size needed) probability where half the files are missing
    # Convert the two-sided confidence level into the cumulative probability that
    # includes the entire left tail (e.g. .95 -> .975)
    confidence_level += (1 - confidence_level) / 2
    Z = norm.ppf(confidence_level)  # Determine Z score based upon confidence level
    ss = Z**2 * p * (1 - p) / confidence_interval**2  # Determine sample size for infinite population
    sample_size = ss / (1 + (ss - 1) / population)  # Correct for finite population
    # Round up: truncating would return 277 here, one sample short of the documented 278
    return math.ceil(sample_size)
```
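As a check on the docstring example, the arithmetic for a population of 1000, a 95% confidence level, and a ±5% confidence interval works out as follows. Values are rounded, and the symbol Φ⁻¹ (the inverse standard normal CDF) is notation introduced here for what `norm.ppf` computes.

```latex
\begin{align*}
Z &= \Phi^{-1}(0.975) \approx 1.96 \\
ss &= \frac{Z^{2}\, p\,(1-p)}{c^{2}}
    = \frac{1.96^{2} \times 0.5 \times 0.5}{0.05^{2}} \approx 384.16 \\
ss_{\text{new}} &= \frac{ss}{1 + \dfrac{ss - 1}{pop}}
    = \frac{384.16}{1 + \dfrac{383.16}{1000}} \approx 277.7
\end{align*}
```

Rounding up to the next whole sample gives 278, the value quoted in the docstring example. Truncating instead would give 277.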
User Registration
The User Registration process is responsible for the generation of a user account in Globus. A Globus user account is required in order to leverage the advanced transfer functionality within Globus.
User Authorization Management
The User Authorization Management process is primarily responsible for updates to the list of DKIST Authorized Agents. DKIST Authorized Agents are people who were authorized by the DKIST director to be exempt from embargo restrictions for the purpose of instrument analyses.
Approved tickets that change the membership of embargoed proposal groups are also executed here.
Data Holding Audit
The Data Holding Audit process is responsible for the floor-to-book, book-to-floor, and object integrity verifications that occur on an ongoing basis.
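An object integrity verification could look like the sketch below. It assumes that inventory records a checksum for each object and that SHA-256 is the algorithm. The source states neither, and `verify_object` is a hypothetical helper.

```python
import hashlib


def verify_object(data: bytes, recorded_checksum: str) -> bool:
    """Return True when the object's SHA-256 digest matches the recorded value."""
    return hashlib.sha256(data).hexdigest() == recorded_checksum


payload = b"example frame bytes"
# Stands in for the checksum that inventory would hold (assumption).
recorded = hashlib.sha256(payload).hexdigest()

intact = verify_object(payload, recorded)
corrupted = verify_object(payload + b"\x00", recorded)
print(intact, corrupted)
```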
Service Desk Management
The Service Desk Management process is responsible for managing the work tickets in the ticketing system. There are multiple types of tickets which get routed according to their type. In some cases (e.g., a Raw Data search), other processes must be executed in order to resolve a ticket. This process is intended as a guideline for the capabilities a Data Center component of a larger DKIST-wide Help Desk would need to perform, and as such, would be just a part of the DKIST-wide Service Desk Management process.
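Routing by ticket type can be sketched as a lookup table with a fallback queue. The ticket type names and queue names below are hypothetical placeholders chosen for illustration. The actual ticket types are defined by the ticketing system, not by this sketch.

```python
from collections import defaultdict

# Hypothetical ticket types and queue names, used only for illustration.
ROUTES = {
    "raw_data_search": "raw-data-search",
    "dqac_reduction": "dqac-reduction",
    "authorization_change": "user-authorization",
}
DEFAULT_QUEUE = "general-support"


def route_tickets(tickets):
    """Group ticket ids by the queue their type maps to."""
    queues = defaultdict(list)
    for ticket in tickets:
        # Unknown ticket types fall back to the default queue.
        queue = ROUTES.get(ticket["type"], DEFAULT_QUEUE)
        queues[queue].append(ticket["id"])
    return dict(queues)


routed = route_tickets([
    {"id": 1, "type": "raw_data_search"},
    {"id": 2, "type": "password_reset"},
    {"id": 3, "type": "dqac_reduction"},
])
print(routed)
```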
Raw Data Search
The Raw Data Search process is responsible for fulfilling two types of Raw Data searches. The first type is requested by users outside of DKIST Operations and requires prior approval. If an external Raw Data search request is approved, support personnel collate the requested data and prepare it for distribution. For the second type (i.e., users within DKIST Operations), an API is provided for discovering Raw Data by Observing Program ID. Because the API puts load on systems critical to data processing, its search criteria are deliberately limited.
DQAC Reduction
The DQAC Reduction process is responsible for the execution of data reduction activities within the Data Center. The request that surrounds each execution is managed by the Service Desk Management process. When data is removed, any associated inventory information must also be maintained.
Processed Data Availability Notification
The Processed Data Availability Notification system is responsible for notifying Investigators when data for their Proposal has been processed. The notifications are delivered in digest form. Because reprocessing also triggers a notification, these processing events can occur over a long timescale.
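Digest delivery can be sketched as grouping processing events per Investigator before sending one message. The event fields (`investigator`, `proposal_id`), the message wording, and the function name are assumptions made for this example.

```python
from collections import defaultdict


def build_digests(events):
    """Collapse processing events into one digest message per investigator."""
    grouped = defaultdict(set)
    for event in events:
        # A set deduplicates repeated events for the same proposal.
        grouped[event["investigator"]].add(event["proposal_id"])
    return {
        investigator: "Processed data is available for: " + ", ".join(sorted(proposals))
        for investigator, proposals in grouped.items()
    }


digests = build_digests([
    {"investigator": "pi@example.org", "proposal_id": "P-002"},
    {"investigator": "pi@example.org", "proposal_id": "P-001"},
    {"investigator": "pi@example.org", "proposal_id": "P-001"},  # e.g. a reprocessing event
])
print(digests)
```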
Monitor System Health
The Monitor System Health process is responsible for providing the capabilities to inspect the run-time operations of the Data Center. This includes logs, system telemetry, and Data Center custom events. This information loses utility over time, so it is not retained indefinitely. The culling process does, however, retain aggregations of select events, which provide monitoring information over timescales longer than the retained history.
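The culling idea can be sketched as follows: raw events older than a retention window are dropped, but a per-day count of each event name is kept. The seven-day window, the event shape, and the choice of daily counts as the aggregation are all assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def cull_events(events, now, retention):
    """Split events into retained raw events and daily counts of culled ones."""
    cutoff = now - retention
    retained = [e for e in events if e["time"] >= cutoff]
    # Aggregate culled events by (day, event name) so long-term trends survive.
    aggregates = Counter(
        (e["time"].date().isoformat(), e["name"])
        for e in events
        if e["time"] < cutoff
    )
    return retained, dict(aggregates)


now = datetime(2024, 1, 31)
events = [
    {"time": datetime(2024, 1, 1, 8), "name": "frame_ingested"},
    {"time": datetime(2024, 1, 1, 9), "name": "frame_ingested"},
    {"time": datetime(2024, 1, 30, 9), "name": "frame_ingested"},
]
retained, aggregates = cull_events(events, now, timedelta(days=7))
print(len(retained), aggregates)
```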
Code and Algorithm Documentation Distribution
The Code and Algorithm Documentation Distribution process is responsible for returning, in response to search queries, the algorithm documentation and user libraries created by the Data Center.
Poison Message Management
The Poison Message Management process is responsible for the review and corrective action associated with un-processable messages that were on the Interservice Bus.
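One common way messages end up needing this kind of review is a consumer that sets a message aside after repeated failures. The sketch below shows that pattern with an in-memory list standing in for the bus. The retry limit, the `consume` function, and the sample handler are hypothetical, and the source does not say how the Interservice Bus identifies poison messages.

```python
def consume(messages, handler, max_attempts=3):
    """Process messages; park any that fail max_attempts times for review."""
    processed, poison = [], []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                processed.append(message)
                break
            except Exception as error:
                if attempt == max_attempts:
                    # Keep the message and the last error for the reviewer.
                    poison.append({"message": message, "error": str(error)})
    return processed, poison


def handler(message):
    """Hypothetical handler that rejects messages without a frame_id."""
    if "frame_id" not in message:
        raise ValueError("missing frame_id")


processed, poison = consume([{"frame_id": 7}, {"bad": True}], handler)
print(processed, poison)
```

Recording the error alongside the parked message gives the reviewer the context needed to choose a corrective action.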