Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • generated and tested an emergency procedure to recover the monitors in a ceph cluster after the loss of all monitors. The next ceph upgrade is planned to be done during normal DC operations in all environments including prod. Although this issue is not expected to occur, if it does occur then I need to be ready to handle it because prod ceph and everything that uses it - including transfers from the summit - will be blocked.

  • Implemented parallel paging in the processing-preparation-worker to reduce load time. The load time was reduced by an order of magnitude with the solution. This query was taking 30-40 seconds per page, and erve frames so just the observe component takes more than half an hour. Multiple pages could be requested/collated concurrently since the majority of the time is being taken up in the metadata-store-api fulfilling the requestwere taking hours just to load.

  • Resolved a problem on the production OpsTools stack that prevented experiments from being sent. Troubleshooting was difficult because the issue did not occur on the development/test stack, and it was unclear why it was occuring on the production stack.