Update details about current filesystem and OnDemand issues

OnDemand: OnDemand is periodically losing its connection to our gateway nodes, which makes home and scratch unavailable. We are still investigating the cause.

Home directories: The home file system underwent diagnostics from 6/24 to 6/28, which slowed logins and general use of the HPCC. We restarted our backup process after the scan ended on the evening of 6/28, and users may see pauses as the file system catches up.

New OS: We upgraded our operating system to Ubuntu 22.04 in mid-June. This included a reinstallation of all software modules. Please read our documentation for more details about the upgrade, and contact us if you are having issues it does not cover.

Please click the title of this post for more detailed information and our planned timeline. Updated 7/10; see the end of this post.

The Problem:

Although many jobs are still running successfully, the current system instabilities result from an unfortunate confluence of multiple issues.

Home Directories: First and foremost are instabilities in our home directory filesystem, which are making access to the HPCC via SSH and OnDemand intermittent. Home is experiencing performance issues for two reasons:

1. After the migration to our new file system failed in May, a small number of files prevented us from resuming the migration to the new hardware. After multiple discussions with the vendor, IBM provided a process to identify the failed files. We started that process last week; it caused a significant performance penalty and took much longer than expected to run. It started Wednesday and ended late Friday.
2. To start the file check process, offsite backup replication had to be disabled. When backups are restarted, the system needs to take a snapshot of every fileset and scan for changes. Each snapshot requires the entire system to pause (for up to a minute) to ensure that the filesystem is consistent across all 1,000 nodes. We restarted the replication after the scan ended Friday evening, and users may continue to see pauses as the file system catches up. The vendor anticipates that these pauses may continue for a couple of days, but we acknowledge that this is only a rough estimate and may be unreliable given our previous estimates.

Our immediate goal is to fix the issues with the file migration and move accounts to the new file system. This will require working with the vendor to analyze the data we collected and identify the underlying problem. We may need to run another system diagnostic if this first one did not catch the problem, but we want to avoid the problems we had last week and are working with the vendor to identify ways to make it less painful.

OnDemand: The OnDemand server is periodically experiencing a communication error between it and the rest of the gateway nodes. When this happens, home or the scratch system becomes temporarily unavailable in OnDemand.

These communication instabilities may also cause a user’s OnDemand session to be improperly disconnected. These disconnects can leave “stale” cookies in the local browser, and users may need to clear their OnDemand browser cookies before they can connect again. The cookie issue and the communication issues can produce similar error messages, and the exact error will vary with the user’s computer and browser version.

Although we have been able to eliminate many potential sources of the problem, the root cause of the communication errors between OnDemand and the gateway servers is not yet clear. ICER has some short-term “fixes” that require our manual intervention, but we are still debugging to identify a long-term solution.

New OS: Last week we also started a major migration of compute nodes to the new Ubuntu operating system. This is a long-overdue upgrade and will significantly improve the long-term stability and reliability of the system. Unfortunately, as with any major upgrade, there is a long list of issues and bugs that will need to be addressed.

Although the new OS is not the cause of the home directory filesystem issues, its changeover has complicated the debugging process.

Timeline:

Right now (week of July 1st, 2024): The system should finish its resynchronization process in the next few days, which should result in a much more stable system in the short term. We will continuously monitor the system while we work with the vendor to review the diagnostic data and debug the problems.

It is unclear how long it will take the vendor to get back to us with a fix to their file system migration process. If the vendor is able to find a solution to the home directory issues this week, we would likely avoid another “live” migration and instead schedule migration downtime. That downtime would not happen for at least another 2 weeks, to allow the scheduler to empty out.

If the vendor is unable to identify the problem, they may ask us to rerun the system diagnostic. If this is required, we are trying to identify ways to ensure that the system will remain stable during the diagnosis.

August: Water cooling is being added to the MSU Data Center to allow for more high-power compute systems.

Fall: A new CPU and GPU cluster will be installed and connected to the new water cooling system.

Spring Semester: Installation of a new high-speed file system optimized for large numbers of small files. This new system will help us optimize workflows based on file types and significantly improve performance across all of our file systems.

Workarounds:

We realize it can be extremely frustrating to debug problems on an unstable system. It is often difficult to know whether a problem is short term, long term, a known system issue, something you need to report, or something wrong with your own workflow. Please contact us if you need help.

Although we have a number of system monitors and tests, they do not always pick up the scale of the problems. We encourage everyone to submit a ticket (http://contact.icer.msu.edu) when they are experiencing a problem. This ensures we know that there are issues, and we may be able to suggest a temporary solution or workaround.

The ICER Research Consultants have been working with individuals and groups to identify ways to work around all of these issues. These fixes are often workflow-dependent. Please reach out to us if you would like help with your workflow. When possible, we will try to document the most common of these workarounds in lab notebooks on our documentation page:

Lab Notebook: https://docs.icer.msu.edu/2024-07-01_LabNotebook_OnDemandWorkaround/

Updates:

7/10/24:

The main OnDemand issue was resolved last week. We have seen some “Proxy Connection Errors” when Slurm is under significant load; reloading the page should resolve the issue.

The home file system has been stable. We are waiting on the vendor to analyze the logs to determine how to resolve the issue with the migration process.

The new OS migration is continuing.