Posts

Jul 1, 2025
Scratch Filesystem Errors

Due to an issue with the scratch filesystem, users may notice intermittent errors when attempting to write data to their scratch directory. These errors will state that the scratch filesystem or device is out of space. The ICER system administration team is actively working with our storage vendor to resolve these errors and we will post additional updates as soon as we have more information.
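For scripts that write to scratch, one way to tolerate these intermittent failures is to retry only when the error is an out-of-space error. The following is a minimal Python sketch, not an ICER recommendation. It assumes the errors surface to Python as an OSError with errno ENOSPC or EDQUOT. The function name, attempt count, and delay are illustrative.

```python
import errno
import time

# errno values treated as "out of space". This is an assumption about how
# the scratch errors surface to Python, not something confirmed by ICER.
OUT_OF_SPACE = {errno.ENOSPC, errno.EDQUOT}


def retry_on_out_of_space(operation, attempts=3, delay_seconds=30.0, sleep=time.sleep):
    """Call operation(), retrying only when it fails with an out-of-space error."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as exc:
            # Re-raise anything that is not an out-of-space error,
            # and give up after the final attempt.
            if exc.errno not in OUT_OF_SPACE or attempt == attempts:
                raise
            sleep(delay_seconds)
```

Wrap the write step of your workflow in a function and pass it as `operation`. Errors unrelated to space, such as permission errors, are raised immediately instead of being retried.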
Jun 19, 2025
Old Home Directory Maintenance Beginning Monday 6/23/25

On Monday, June 23, 2025, we will begin consolidating all data remaining on the old home filesystem to make room for additional research space storage. While many of you who were moved to the new filesystem in the past several weeks still have access to this data, you should not be using the old home filesystem for any running jobs or active workflows.
Jun 19, 2025
Security Patch Applied on 6/19/25

On 19 June, 2025 a security patch was applied to the system. Although no user impact is expected, please open a ticket using our contact page https://contact.icer.msu.edu/contact if you encounter any issues.
Jun 10, 2025
Home migration - RESOLVED

RESOLVED: As of June 10th, 8:43AM, all users except those with running jobs have been migrated. Users with running jobs will be migrated when their jobs complete. Please see the details in this post for information on moving files to new research spaces.
Jun 2, 2025
Minor Network Configuration Change - RESOLVED

RESOLVED 6/9/2025 - This maintenance was completed the morning of June 9, 2025.
May 23, 2025
Minor SLURM Update on 05/29/25

On Thursday, May 29th, we will be deploying a minor update to the SLURM scheduling software. This update patches a security vulnerability and will allow us to re-enable buy-in account coordinator functionality. Running and queued jobs should not be affected. A brief interruption to client commands may be experienced (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.
May 11, 2025
HPCC Scheduled Downtime - Resolved on 5/14/2025 5:30pm

Update 5:30 PM 5/14 - Current status:
- gs21 has been returned to service, and all held jobs have been released. We have some open issues to address with the vendor but the system should operate as expected.
- research/ufs24 should be performing as normal. We are continuing to investigate to identify a root cause.
May 5, 2025
SLURM Database Outage at 10:00AM on 5/8/25 - RESOLVED

RESOLVED: The database upgrade completed without issue and the database is back online.
May 5, 2025
Contact Form Outage - RESOLVED

UPDATE: 5/6/2025 8:30 PM - Contact form maintenance is completed and services have been restored.
Apr 17, 2025
Gateway ssh host key updates - RESOLVED

UPDATE: 4/17/2025 7:00 AM - Gateway node ssh host identification keys have been changed, please see documentation listed below.
Apr 16, 2025
2025-04-18 amd24 water cooling work

IPF will be performing work to test the water cooling system on Friday, April 18th, 2025. We do not anticipate that it will impact users, but there is a possibility that workloads may be affected. We will update this blog when the window starts and ends, and if any disruption is noticed.
Apr 15, 2025
Updates to Intel MPI - UPDATED

UPDATE 4/17/2025 12:00 PM - There have been reports that Intel MPI no longer works using mpirun and mpiexec. While we are investigating the source of this issue, please switch to using the srun command as a workaround.
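As an illustration of the workaround, the sketch below rewrites a simple launch line from mpirun or mpiexec to srun. It is a hedged Python example, not an ICER tool. It assumes the launch line has at most one leading process-count flag (`-n` or `-np`), which it maps to srun's `-n`; other launcher options are passed through unchanged and may need manual review. The function name `to_srun` is hypothetical.

```python
import shlex

# Launchers named in the post, and the flags they commonly use for the
# process count. Mapping these to srun's -n (--ntasks) is an assumption
# that fits simple launch lines only.
MPI_LAUNCHERS = {"mpirun", "mpiexec"}
COUNT_FLAGS = {"-n", "-np"}


def to_srun(command):
    """Rewrite a simple 'mpirun -np N prog args' line to use srun."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in MPI_LAUNCHERS:
        return command  # not an mpirun/mpiexec line; leave it untouched
    rewritten = ["srun"]
    rest = tokens[1:]
    # Translate only a leading process-count flag; keep everything else.
    if len(rest) >= 2 and rest[0] in COUNT_FLAGS:
        rewritten += ["-n", rest[1]]
        rest = rest[2:]
    return shlex.join(rewritten + rest)
```

For example, `to_srun("mpirun -np 4 ./my_app input.dat")` gives `srun -n 4 ./my_app input.dat`. Inside a batch job, srun can also take the task count from the job's allocation, so the explicit flag is often unnecessary.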
Apr 1, 2025
Minor SLURM Update on 04/03/25 - RESOLVED

UPDATE: 4/3/2025 11:00 AM - The update completed without issue.
Mar 31, 2025
Gateway Node Operating System Upgrades - RESOLVED 4/1/2025

RESOLVED: 4/1/2025 Maintenance has been completed. If you experience issues logging into HPCC resources, please let us know by submitting a ticket through our Contact Forms.
Mar 20, 2025
Research Filesystem Performance

After resolving an unplanned hardware error earlier this morning, slower performance may be noticed on the Research Space filesystem as the filesystem catches up on disaster recovery snapshots. All hardware errors have been resolved, and the slower performance should resolve in a few hours. This post will be updated once performance has returned to normal.
Mar 17, 2025
Contact Form Update

The ICER Contact Form has been heavily revamped to better serve our ICER community. Documentation is available. Because the form is new, please let us know if you experience any problems.
Mar 13, 2025
Home and Research Space Storage Maintenance - RESOLVED 3/14/2025

RESOLVED: 3/14/2025 - Home and Research Space storage maintenance has been completed.
Mar 6, 2025
Job queuing issues - RESOLVED 3/6/25

RESOLVED The job queuing system has been restarted and jobs should be starting as normal. Please contact us at https://contact.icer.msu.edu/contact if you experience issues.

The system is currently experiencing issues queuing jobs, affecting OnDemand and regular job submission. We are investigating the issue and will update this post when we have more information.
Feb 27, 2025
OnDemand access to amd24 nodes - RESOLVED 2/28/25

RESOLVED: AMD24 nodes are now usable through OnDemand

It is not currently possible to use the new amd24 nodes with OnDemand due to a firewall configuration issue. This will be resolved by tomorrow pending testing.
Feb 19, 2025
Compute and development nodes down - RESOLVED 2/19/2025

UPDATE: 2/19/2025 11:00AM - Compute nodes and dev-intel16 have been restored to service. Note that dev-intel18 will remain offline due to the rearrangement of the intel18 cluster because of water cooling.
Feb 18, 2025
Deprecating 2022a modules on non-intel16 nodes - RESOLVED 2/24/2025

UPDATE: 2/24/2025 2:00PM - All affected modules have been removed from the main software library and are only available on intel16. Please follow the instructions in this post for managing this transition.
Feb 12, 2025
amd24 Cluster Beta Availability - RESOLVED 2/26/2025

UPDATE: AMD24 is in production.
Hardware information has been added to our documentation
Feb 11, 2025
Some GPU Jobs Affected by Scheduler Bug

An update last Thursday introduced a bug into our scheduling logic that was present through yesterday afternoon. This bug affected GPU jobs submitted with certain types of constraints. The bug resulted in these jobs getting a constraint that differed from that which was requested, potentially running them on incompatible hardware. All GPU and CPU hours consumed by these affected jobs have been refunded. We apologize for the inconvenience. If you have any questions, please contact us.
Jan 30, 2025
Intel 18 limited availability - Updated

The Intel 18 cluster, both general and buy-in nodes, will have limited availability starting at the end of next week, with limited to no availability the week starting Feb 10th. The downtime is part of a rearrangement plan to bring water cooling into the data center for the new cluster. Please contact us if this does not work for you and we can temporarily move you to a different buy-in node.

Update: 5 PM 2/25/2025- Most nodes have been returned to service. There are a few nodes that require additional cabling or diagnostics that will be completed by the end of the week. Users can check the status of their nodes with the node_status tool.
Jan 21, 2025
OnDemand Server Reboot - RESOLVED 1/22/25

RESOLVED: 1/22/2025 - Our OnDemand server has been successfully rebooted and is back online with more memory.
Jan 12, 2025
Transition to a new homedir

ICER is migrating to a new HPCC home directory system called VAST, an all flash system that will enable fast access to files in your home spaces, and a significantly better working environment overall. Along with the migration, we are increasing the default home space size from 50 GB to 100 GB which will be a hard limit going forward to ensure that usage of home spaces is aligned with their intended purpose (home space documentation). To help accommodate these changes, we are increasing the maximum free research space per principal investigator from 1TB to 3TB. This move toward more research space storage will also enhance collaboration amongst your team.

The process of moving all HPCC home directories to the new VAST system will be spread over time. Starting the week of January 20th, we will start the process of migrating users with less than 100GB of home directory usage to the VAST storage. This will require no HPCC usage during the migration, including scheduled jobs and interactive sessions. We will send users an individual notification ahead of time when their migration is scheduled to start and when their migration starts. Once the move to VAST is complete, they will again be notified and will automatically use the VAST system on their next login.

If you are already using less than 100GB of home space or you can get your usage below this limit before January 15th, you will be among the initial group of users migrated to the VAST system. We will reach out to users above 100GB with further tools and processes later this semester. These users will receive a usage report showing their home directory usage. PIs will also receive team and research group usage.

If you or your team need extra help with this change or want to opt out of the initial migration group, please reach out to the ICER team by opening a ticket in the following website:

Contact Form

ICER Documentation

Thanks for your patience during this transition.
Jan 7, 2025
Minor SLURM Update - RESOLVED 1/9/2025

RESOLVED: 1/9/2025 - All nodes are back online. Users affected by job failures will be contacted and refunded any used CPU or GPU hours.
Jan 2, 2025
Login issues - RESOLVED 1/2/2025

RESOLVED: All login gateways are now active.

We are aware of login issues with SSH. These are caused by an outage of a single login gateway. We are working to resolve the issue.

Workaround: connect via our OnDemand portal at https://ondemand.hpcc.msu.edu/
Dec 4, 2024
Winter Break Limited Coverage - RESOLVED 1/2/2025

RESOLVED: Support at ICER has returned to normal.
Dec 3, 2024
HPCC Scheduled Downtime - RESOLVED 12/19/2024

RESOLVED: Maintenance is complete, thank you for your patience. Job submissions will continue to run after 5PM on 12/19. Please note that as the intel14 cluster has been retired, the intel14 constraint must be removed from any jobs.
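To find job scripts that still request the retired cluster, a small scan for constraint directives can help. This is a hedged Python sketch: it assumes the constraint is set with `#SBATCH --constraint` or its short form `-C` inside the script, and it will not see constraints passed on the sbatch command line. The function name is hypothetical.

```python
import re

# Matches '#SBATCH --constraint=...', '#SBATCH --constraint ...',
# and the short form '#SBATCH -C ...'.
CONSTRAINT_LINE = re.compile(
    r"^\s*#SBATCH\s+(?:--constraint(?:=|\s+)|-C\s*)(\S+)"
)


def lines_with_constraint(script_text, feature="intel14"):
    """Return (line_number, line) pairs whose constraint mentions feature."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        match = CONSTRAINT_LINE.match(line)
        if not match:
            continue
        # Split the expression on operators, brackets, and quotes so that
        # 'intel14|intel16' or '[intel14&ib]' yields individual feature names.
        names = re.split(r"[&|,\[\]()*\"']+", match.group(1))
        if feature in names:
            hits.append((number, line))
    return hits
```

Running this over the text of each submission script lists the lines to edit. Removing only the `intel14` term is usually enough when the constraint names several clusters.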
Nov 19, 2024
Intel16 Cluster Currently Offline - RESOLVED 11/19/2024

RESOLVED: 11/19/2024 12:10PM - On 11/18/2024 ITS performed maintenance on a number of switches in the data center that required rebooting critical network infrastructure. After these reboots, several links connecting to the intel16 cluster did not recover. During this time, you may have also noticed brief pauses in OnDemand and on Gateway nodes. This morning we were able to work with ITS to re-establish connectivity to all intel16 nodes, and the intel16 cluster, along with all other nodes, are now back in production and running jobs via Slurm.
Oct 31, 2024
MATLAB License issue - RESOLVED 10/31/2024

RESOLVED: 10/31/2024 5:15PM - The issue is resolved on development and compute nodes.
Oct 31, 2024
Shared Module and Software Server Restart - RESOLVED 11/1/2024

RESOLVED: 11/1/2024 6:15 AM - The system restart is complete and all services should be online.
Oct 30, 2024
Shared Software File Server Restart - RESOLVED 12:50 10/30/2024

RESOLVED: 1250 10/30/2024 - The system restart is complete and all services should be online.
Oct 28, 2024
ICER Web Application Login Error - RESOLVED 10/29/2024

UPDATE: 10/29/2024 - Logins to RT, OpenOnDemand, and Contact forms look to be fully functional again. Values might be cached and you might need to clear your cache. You can test by opening a private browser window. Email general@rt.hpcc.msu.edu if you still experience problems.
Oct 25, 2024
ICER Contact Form UserID Information Lookup Error RESOLVED 10/29/2024

The ICER contact form is currently experiencing a technical error retrieving userID information for some MSU accounts. This error may result in your inability to log new account or new research space requests. While we continue to troubleshoot this error, please use the general contact form to submit your requests. This post will continue to be updated as we have more information.
Oct 24, 2024
Gateway Node Operating System Upgrades

Starting on Monday 10/28/2024 and over the next few weeks, we will be upgrading the operating systems on the gateway nodes. If you experience a timeout while attempting to connect to the HPCC during this time, please try again after a short delay or use our Open OnDemand instance. If you continue to have difficulty logging into HPCC resources, please let us know by submitting a ticket through our Contact Forms.
Oct 23, 2024
2024-10-24 Development node reboots - RESOLVED 2024-10-24 0715

RESOLVED: 10/24/2024 - All reboots are complete and the development nodes should be available. Please report any issues through our contact forms
Oct 17, 2024
2024-10-29 Nondisruptive firewall update

Between 7 PM and 9 PM on October 29th, ITS will perform updates to the ICER firewall. We do not anticipate any impact to users as the firewall is configured with full redundancy, but please open a ticket if you notice any issues.
Sep 27, 2024
File System Performance - RESOLVED 9/27/2024

RESOLVED: 9/27/2024 - ICER has completed the migration of data from the old home and research file system to the new file system. This should resolve the occasional slowdowns that have occurred since the start of the project this past spring. Home and research file system operations have returned to normal. This includes disaster recovery replication and our file system quota processes. Thank you for your patience during this transition.
Sep 26, 2024
'Illegal instruction (core dumped)' Errors - RESOLVED 10/14/2024

RESOLVED: 10/14/2024 - We have applied a fix that we believe has solved the issue. If you are still experiencing problems, please contact ICER support with a description and steps to reproduce the issue.
Sep 16, 2024
OnDemand Portal Update on Friday 9/20 - RESOLVED 9/23/24

At 9:00PM on Friday, September 20th, ICER’s OnDemand portal will undergo an update from version 3.0.1 to version 3.1.7. The most notable change to the portal following this update will be Globus integration. When browsing files in the updated OnDemand portal, a ‘Globus’ button will be available that will open the current directory inside of Globus. A full list of changes made by this update can be viewed here. If you have any questions about this update or encounter any issues with the OnDemand portal following the update, please contact us.
Sep 4, 2024
Development node dev-intel18 maintenance - RESOLVED

RESOLVED: The maintenance on dev-intel18 is complete as of 11:20 AM, September 6, 2024 and the node should be available for use.
Sep 3, 2024
Change to Loading Modules in SLURM Scripts

In one week, ICER will make a small change to the way modules are loaded in SLURM scripts. Please make sure that all SLURM scripts you submit load modules in scripts before you use them! For more information and also how this affects workflow managers like Nextflow and Snakemake, please see our documentation.
Aug 29, 2024
OnDemand and Contact form login issues - RESOLVED

RESOLVED: ITS has resolved the login issue and all systems are accessible as normal.
Aug 9, 2024
Filesystem Slowdown and User Creation Pause - RESOLVED

UPDATE 8/13/2024: The recovery processes have finished running, and Home filesystem performance has now returned to normal.
Aug 6, 2024
August 6, 2024: HPCC Scheduled Downtime and Transition of Remaining CentOS Nodes (Completed 8/6/2024)

Updates: 05:00PM - Upgrades are complete and we are in the process of moving the system to production. This process takes about 30 minutes. HPCC should be available by 5:30PM or shortly after. The Home and Research filesystems are a little slow while snapshots catch up. This should clear later this evening. If you notice problems, contact us.
Jul 22, 2024
RESOLVED 7/31/24 Scavenger Queue jobs not starting

The scavenger queue is operating normally now that the buyin node OS transition has been completed.
Jul 12, 2024
Data machine nodes not showing up in scontrol - UPDATED

On July 12th, it was discovered that the data machine nodes are not properly responding to diagnostic commands. However, these nodes are still available and scheduling jobs.
Jul 11, 2024
Rebuilding default OpenMPI, may cause login issues - RESOLVED

On July 11th, 2024 from 5:30-6:00PM Michigan time, we will be rebuilding the default OpenMPI module, OpenMPI/4.1.5-GCC-12.3.0. This will result in errors from the module system when logging in, as the module needs to be deleted to be rebuilt. This will not affect running jobs, and will be isolated to development nodes only. The rebuild should be complete by 6:00PM at which time this blog post will be updated.
Jul 1, 2024
Update details about current filesystem and OnDemand issues

OnDemand: OnDemand is periodically losing connection to our gateway nodes. This makes home and scratch unavailable. We are still investigating the cause.

Home directories: The home file system underwent diagnostics from 6/24-6/28. This caused slowdowns for logging in and using the HPCC. We have restarted our backup process after the scan ended 6/28 evening and users may see pauses as the file system catches up.

NewOS: We upgraded our operating system to Ubuntu 22.04 in mid June. This included a reinstallation of all software modules. Please read our documentation here for more details about the upgrade, and contact us if you are having issues not covered by this documentation.

Please click the title of this post for more detailed information and our planned timeline. Updated: 7/10 at the end.
Jun 27, 2024
Current system issues

We are aware of two issues affecting the system at this time: slow response to commands/slow login, and OnDemand scratch space missing. The system slowdowns are caused by diagnostics on the home filesystem as part of our upgrade to a new home filesystem. We do not currently have an estimate for when these diagnostics will complete. The OnDemand scratch space connection is also being diagnosed and addressed with our storage vendor. Please check back for updates as we have them.
Jun 24, 2024
Home filesystem issues affecting OnDemand

OnDemand functionality has been partially recovered. Users should be able to log in, connect, and access their home and research spaces, as well as interactive app sessions. Scratch remains unavailable at this time. Please report access issues at https://contact.icer.msu.edu/contact
Jun 17, 2024
Home filesystem issues affecting OnDemand - Resolved

At approximately 12:00 PM on 6/17/2024 we started experiencing an outage with the Home filesystem. This outage primarily affects OnDemand, but may be apparent on other nodes as well.
Jun 17, 2024
Compute Operating system upgrades (complete)

On 17 June, 2024 the primary operating system on HPCC resources is being changed from CentOS 7 to Ubuntu 22.04. Please review our operating system upgrade documentation for details.
May 16, 2024
Samba connectivity issues

UPDATE 3:45pm 5/16/24 Samba file sharing is now back online. Please submit a ticket at https://contact.icer.msu.edu/contact if you continue to experience issues.
May 13, 2024
Home filesystem issues - update

At approximately 6:15PM on 5/13/2024, users began reporting issues accessing their home directory on HPCC. We are aware of the issue and are working with our vendors to address it.
May 10, 2024
Home filesystem issues causing login problems

At approximately 11:10 AM on 5/10/2024 we experienced a transient outage while conducting upgrades and a hardware refresh of our Home filesystem. This outage may have caused login issues or stale file mounts. Services were restored after approximately 15 minutes and home directories should be available again. If you continue to experience issues with your home directory, please contact us.
May 6, 2024
System Reboots Thursday May 9

On Thursday May 9 the following systems will be rebooted from 10-12am:
May 3, 2024
Home filesystem issue UPDATED 5/3/2024 5:00PM

UPDATE (5/3/2024 5:00 pm) - The issue has been resolved and all services should be available. If you encounter any additional issues, please contact us.
May 2, 2024
Home filesystem issue causing login problems - UPDATED 5/2/2024 12:30 pm

UPDATE (5/2/2024 12:30 pm) - File system and connectivity issues have been resolved.
Mar 18, 2024
Scheduler Reboot at 10:00AM on 3/19/24

At 10:00AM on Tuesday, March 19th, the SLURM scheduling server will go offline for a reboot. This reboot is necessary to apply updates to the underlying hardware that hosts the scheduler. The scheduler is expected to be offline for roughly 15 minutes. During this time, jobs may not be submitted and scheduler specific client commands will not work (e.g. squeue, sbatch, etc). Running jobs will not be affected. If you have any questions about this outage, please contact us.
Mar 15, 2024
Scheduler Reboot at 10:00AM on 3/18/24

At 10:00AM on Monday, March 18th, the SLURM scheduling server will go offline for a reboot. This reboot is necessary to apply updates to the underlying hardware that hosts the scheduler. The scheduler is expected to be offline for roughly 15 minutes. During this time, jobs may not be submitted and scheduler specific client commands will not work (e.g. squeue, sbatch, etc). Running jobs will not be affected. If you have any questions about this outage, please contact us.
Mar 1, 2024
Scratch space not accessible via OnDemand

UPDATE (3/1/2024) - Access to scratch via OnDemand has been restored
Feb 1, 2024
VSCode updates will break access

This post applies to users of VS Code that SSH into the ICER HPCC from their own copy of VS Code.

Error message: “This machine does not meet Visual Studio Code Server’s prerequisites, expected either…: - find GLIBC >= v2.28.0 (but found v2.17.0 instead) for GNU environments”

Details: Microsoft recently updated Visual Studio Code to version 1.86, and it is no longer compatible with the operating system we use at ICER. The change is listed in the release notes at https://code.visualstudio.com/updates/v1_86#_engineering (scroll down to “Linux minimum requirements update”). Although we plan to upgrade our operating system this year, in the meantime there are two solutions to this incompatibility.

Solutions

1) Use our code server app in OnDemand (Interactive Apps -> Code Server (beta)). You can request compute nodes to work on for a specified amount of time and use VS Code in your browser.

2) Downgrade to the previous 1.85 version of VS Code and disable automatic updates. You can access the previous version here https://code.visualstudio.com/updates/v1_85 (see the Downloads section for a version for your PC or Mac)
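To see whether a given node meets the glibc requirement quoted in the error message, the version can be checked from Python on that node. This is a minimal sketch. The 2.28 threshold comes from the error message above, while the function names are hypothetical, and `platform.libc_ver()` reports the C library that the running Python interpreter is linked against, which is assumed here to match the system glibc.

```python
import platform

# Minimum glibc quoted in the VS Code error message above.
REQUIRED_GLIBC = (2, 28)


def parse_version(text):
    """Turn '2.17' or 'v2.28.0' into a tuple of ints, e.g. (2, 17)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def meets_glibc_requirement(found, required=REQUIRED_GLIBC):
    """Compare a glibc version string against the required minimum."""
    return parse_version(found) >= required


def local_glibc_version():
    """Return the glibc version seen by this Python, or None if unknown."""
    lib, version = platform.libc_ver()
    return version if lib == "glibc" and version else None
```

Run it on the remote node you connect to, not on your own computer. With the version from the error message, `meets_glibc_requirement("2.17")` is `False`, which matches the incompatibility described in this post.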
Jan 5, 2024
Minor SLURM Update on 01/11/24

On Thursday, January 11th, we will be deploying a minor update to the SLURM scheduling software. This update will bring ICER to the latest minor revision of SLURM 23.02. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.
Dec 13, 2023
Winter Break Limited Coverage

There will be limited coverage while MSU observes winter break from December 22, 2023 through January 2, 2024. The system will continue to run jobs and be monitored for emergency issues. Tickets will be sorted by priority on January 3 when our team returns to work after the holiday break. If you have any questions, please contact us
Dec 4, 2023
Retirement of dev-intel14 and dev-intel14-k20 on 12/14/23

On Thursday, December 14th, we will be retiring the dev-intel14 and dev-intel14-k20 nodes. After this date, the dev-intel14 and dev-intel14-k20 nodes will no longer be available for use as development nodes. Users should connect to the remaining active development nodes for any development node tasks. If you have any questions about this change, please contact us.
Nov 30, 2023
Minor SLURM Update on 12/05/23

On Tuesday, December 5th, we will be deploying a minor update to the SLURM scheduling software. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.
Nov 26, 2023
HPCC Scheduled Downtime - Completed

The HPCC will be unavailable on Wednesday, December 20th for our regularly scheduled maintenance. No jobs will run during this time. Jobs that will not be completed before December 20th will not begin until after maintenance is complete. For example, if you submit a four day job three days before the maintenance outage, your job will be postponed and will not begin to run until after maintenance is completed.
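The rule in the example above is simple date arithmetic: a job can start only if its requested walltime ends before the maintenance window begins. The Python sketch below illustrates it. The midnight start time for the window is a hypothetical value, since the post gives only the date, and the scheduler's exact comparison may differ.

```python
from datetime import datetime, timedelta


def finishes_before(start, walltime, maintenance_start):
    """True if a job started at `start` would end by `maintenance_start`."""
    return start + walltime <= maintenance_start


def longest_walltime(start, maintenance_start):
    """Longest walltime that still fits before maintenance (never negative)."""
    return max(maintenance_start - start, timedelta(0))


# Hypothetical window start: the post gives the date but not the hour.
MAINTENANCE_START = datetime(2023, 12, 20, 0, 0)
```

A four-day job submitted on December 17th fails the check, because `finishes_before(datetime(2023, 12, 17), timedelta(days=4), MAINTENANCE_START)` is `False`. Requesting three days or less would let it start before the outage.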
Nov 15, 2023
RT Ticketing system problem last night 11/15/23

The RT/Ticketing systems had problems after an upgrade last night. The time of the problem was from 9:00 pm 11-14-23 to 9:00 am 11-15-23. If you had problems during that timeframe please try again now. If you experience problems again please clear your browser cache. Thank You.
Nov 6, 2023
Minor SLURM Update on 11/09/23

On Thursday, November 9th, we will be deploying a minor update to the SLURM scheduling software. This update will improve the efficiency of our SLURM controller's application logs. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.
Oct 27, 2023
Jobs Now Always Automatically Requeued On Prolog Failure

As of Thursday, October 26th, jobs that fail to start due to a prolog script error will always be requeued.
Oct 24, 2023
Performance problem on home system - UPDATED 10/24/2023

UPDATE (10/24/2023) - The performance issues with the home directory system have now been resolved.
Oct 18, 2023
Minor SLURM Update on 10/23/23

On Monday, October 23rd, we will be deploying a minor update to the SLURM scheduling software. This update brings our installation to the latest release and includes many bug fixes. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.
Oct 10, 2023
Minor Singularity Update on 10/12/23

On Thursday, October 12th, we will be deploying a minor update to the Singularity container software. This update will bring the HPCC from version 3.11.4 to the latest 3.11.5. A handful of bug fixes and new features are available in this version. For a full list of changes, please refer to Singularity’s release notes on GitHub. If you have any questions about this update or you experience issues following this update, please contact us
Oct 2, 2023
HPCC Connectivity Issues - UPDATED 10/2/23

UPDATE (10/2/2023): We experienced an issue at 0930 this morning with home directories that prevented user logins. All services are now recovered and login should again be successful. Please let us know if you continue to experience issues.
Sep 28, 2023
Minor SLURM Update on 10/09/23

On Monday, October 9th, we will be deploying a minor update to the SLURM scheduling software. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.
Sep 20, 2023
Contact and Gateway Issues on 9/19/23

Due to a failure of a supporting service, gateway-02 and the contact forms were unavailable at around 6 PM this evening. Staff have restored these services.
Sep 18, 2023
Minor SLURM Update on 9/21/23

On Thursday, September 21st, we will be deploying a minor update to the SLURM scheduling software. This update is built against newer Nvidia drivers to support scheduling of multi-instance GPUs. If you have any questions about this update or you experience issues following this update, please contact us.
Aug 31, 2023
Performance problem on home system - resolved

UPDATE: 8/31/2023 The cause of the system slowdowns was identified on 8/29/2023 as jobs saturating the storage I/O. Please follow the lab notebook for details and best practices to prevent this from happening again.
Aug 16, 2023
Globus Restored to Service - 8/16/2023

8/16/2023:
Aug 15, 2023
HPCC Scheduled Downtime - UPDATED 8/15/2023

UPDATE (8/15/2023): All scheduled updates are completed for the 8/15/2023 summer maintenance.
Aug 2, 2023
HPCC Connectivity Issues

Update: Network problems in the data center were fixed by 3pm.
Stability of home directories and gateways was restored by 5:30pm. File a ticket if you notice any other issues. We will continue to monitor closely this evening.
Jul 26, 2023
Intermittent HPCC Performance Issues

We are experiencing sporadic episodes of slowness with logging in to the gateways and/or interactive work on the development nodes. We’re in the process of tracking down this issue. If you are experiencing this issue and/or have any other comments or questions, please feel free to file a ticket with us here: https://contact.icer.msu.edu/contact
Jul 11, 2023
Scheduler Outage on 7/25/23 at 6:00PM

Starting at 6:00PM on Tuesday, July 25th, the SLURM scheduler will go offline in order to perform a migration of its underlying compute resources. This migration is necessary to complete routine maintenance on those resources. This outage is expected to last up to 30 minutes. During this time, SLURM client commands (sbatch, squeue, etc.) will be unavailable and no new jobs will be started. Queued and running jobs will not be affected. If you have any questions about this outage, please contact us.
Jul 6, 2023
Minor SLURM Update on 7/10/23

On Monday, July 10th, we will be deploying a minor update to the SLURM scheduling software. This update contains a patch designed to address a bug experienced with some large jobs (>50 nodes) that causes job processes to persist past a job’s end time. If you have any questions about this update or you experience issues following this update, please contact us at https://contact.icer.msu.edu/.
Jun 29, 2023
Minor Singularity Update on 7/3/23

On Monday, July 3rd, we will be deploying a minor update to the Singularity container software. This update will bring the HPCC from version 3.11.2 to the latest 3.11.4. Several bug fixes and new features are available in this version. For a full list of changes, please refer to Singularity’s release notes on GitHub. If you have any questions about this update or you experience issues following this update, please contact us
Jun 27, 2023
Minor SLURM Update on 6/28/23

On Wednesday, June 28th, we will be deploying a minor update to the SLURM scheduling software. This update contains minor bug fixes and should not impact HPCC users. If you have any questions about this update or you experience issues following this update, please contact us at https://contact.icer.msu.edu/.
Jun 23, 2023
HPCC Connectivity Issues - UPDATED

UPDATE: Network connectivity has been restored, and all ICER services are operational.
Jun 21, 2023
Server Maintenance on June 22 at 5:30AM - UPDATED

UPDATE: This maintenance work is now complete.
Jun 15, 2023
Network Maintenance Planned for June 19, 2023 at 6:30PM - UPDATED

UPDATE: Scheduled HPCC network maintenance is now complete.
Jun 5, 2023
Email Delivery Delays - June 5, 2023 - UPDATED

UPDATE: Email to ICER is now functioning again without errors or delays.
Jun 1, 2023
Temporary Service Slowdown Possible - June 3, 2023

ICER users may notice slow network speeds from June 3 to June 7, 2023. In support of the XPRIZE competition, ICER will share significant HPCC resources during this timeframe, which may result in service slowdowns.
May 23, 2023
Scheduled Home Filesystem Update - Tuesday, May 30, 2023

On Tuesday, May 30th at 10am EDT, we will be performing a minor version upgrade of our home filesystem. This process will take approximately two hours. While we will be performing the update with the filesystem online, there is a possibility that the cluster may briefly lose connection to the storage. Please take this into consideration for any jobs which will be running during this time.
May 2, 2023
Local File Quotas Are Now Set for /tmp and /var on All Nodes

Beginning May 3, 2023, user quotas will be in place on all nodes for the /tmp and /var directories. Each user account will be limited to 95% of the total /tmp partition space available on a particular node and to 5GB on the /var partition. If a user account exceeds this quota, a 2-hour grace period will be allowed before the account is no longer able to write to the /tmp or /var directory.
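As a rough way to see how close an account is to the /tmp limit on a node, the Python sketch below totals the files owned by the current user under a directory and compares that with 95% of the filesystem size. This is only an approximation: the helper names are hypothetical, the sizes are apparent file sizes rather than allocated blocks, and the actual enforcement mechanism is not described in this post.

```python
import os
import shutil
import stat

TMP_QUOTA_FRACTION = 0.95  # from the announcement: 95% of the /tmp partition


def user_usage_bytes(root, uid=None):
    """Sum the apparent sizes of regular files under `root` owned by `uid`."""
    uid = os.getuid() if uid is None else uid
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # unreadable, or removed while scanning
            if st.st_uid == uid and stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total


def tmp_quota_bytes(root="/tmp", fraction=TMP_QUOTA_FRACTION):
    """Estimate the quota as a fraction of the filesystem that holds `root`."""
    return int(shutil.disk_usage(root).total * fraction)


if __name__ == "__main__":
    used = user_usage_bytes("/tmp")
    limit = tmp_quota_bytes("/tmp")
    print(f"{used} of {limit} bytes used ({used / limit:.1%})")
```

Note that shutil.disk_usage reports on whichever filesystem contains the path, so the estimate is only meaningful where /tmp is its own partition.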
Apr 3, 2023
Intel14 Nodes Now Dedicated to OnDemand

Intel14 nodes have been removed from general queues and repurposed. A combined total of 2468 CPU cores and 15.66TB of memory has been dedicated to running jobs submitted through ICER’s installation of Open OnDemand. Dedicating these resources will help to reduce the amount of time users have to wait to launch interactive jobs through OnDemand.
Mar 28, 2023
SLURM Node Updates on Thursday, March 30th

On Thursday, March 30th, at 10:00AM, SLURM clients will be updated to the latest version. This update will bring the node and user components of SLURM to the same version as our SLURM controller and database. Most client commands (e.g. squeue, sbatch, sacct) should work seamlessly through this update. New jobs can be queued as normal and running jobs should not be affected. During these updates, nodes will appear as offline and no new jobs will start. Please note that pending srun/salloc commands may fail to start after this update is complete. If you have a job submitted through srun/salloc that fails after this update, please contact us. We can boost the priority of your job after resubmission.
Mar 23, 2023
MPI Performance Issues Following SLURM Controller Update - Updated

UPDATE: We applied a patch from the software vendor that eliminates the performance issue.
Mar 21, 2023
Intel14 nodes to be removed from general queues - Updated

UPDATE: Intel14 nodes have been removed from general queues.
Mar 13, 2023
SLURM Scheduler Update at 5:00PM on 3/16/23 - Updated

UPDATE: The scheduler is back online and functioning normally.
Mar 10, 2023
SLURM Database Outage at 10:00AM on 3/9/23 - UPDATED

UPDATE: The database upgrade is complete. The sacct command will now function as expected.
Feb 1, 2023
Scratch purge of 45 day old files

Starting on February 15th, files on /mnt/scratch (/mnt/gs21) that have not been modified within the last 45 days will be deleted. Due to technical issues, this purge has not been running and older files have not been regularly removed from scratch/gs21. This issue has been fixed and automatic deletion will resume on February 15th. Users should ensure that any data older than 45 days on scratch/gs21 that they wish to save has been moved to persistent storage (home/research spaces or external storage).
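To see which files would fall under this policy, a minimal Python sketch like the following could be used. It assumes that "not modified" corresponds to a file's modification time (mtime), which this post does not state explicitly. The function name find_stale_files and the example path are illustrative only.

```python
import os
import time

PURGE_AGE_DAYS = 45  # threshold taken from the announcement


def find_stale_files(root, max_age_days=PURGE_AGE_DAYS, now=None):
    """Yield paths under `root` whose mtime is older than `max_age_days`.

    Assumption: the purge keys on modification time (mtime).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    yield path
            except OSError:
                # File vanished or is unreadable; skip it.
                continue


if __name__ == "__main__":
    # Hypothetical location; replace with your own scratch directory.
    scratch = os.path.join("/mnt/scratch", os.environ.get("USER", ""))
    for stale in find_stale_files(scratch):
        print(stale)
```

The printed list can then be reviewed before copying anything worth keeping to home or research space.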
Jan 5, 2023
HPCC Scheduled Downtime

UPDATE 1/5/2023: All updates were completed by 3pm on 1/4/2023. Globus had problems and was brought back online 1/5/2023. If you experience any problems, please contact us.
Dec 22, 2022
Resolved: Rsync gateway issues

RESOLVED 12/22/22: The issue with the rsync gateway is resolved and file transfers are fully functional.
Dec 12, 2022
Resolved: Rsync gateway issues

RESOLVED 12/13/22: The issue with the rsync gateway is resolved and file transfers are fully functional.
Dec 7, 2022
Winter Break Limited Coverage

There will be limited coverage while MSU observes winter break from December 23, 2022 through January 2, 2023. The system will continue to run jobs and will be monitored for emergency issues. Tickets will be sorted by priority on January 3 when our team returns to work after the holiday break. If you have any questions, please contact us.
Nov 18, 2022
New Limits on Scavenger Queue

We have implemented a new limit of 520 running jobs per user and 1000 submitted jobs per user in the scavenger queue. We have put this limit in place to ensure that the scheduler is able to evaluate all the jobs in the queue during its regular scheduling cycles. This matches our general queue limits. Please see our documentation for more information about our scheduler policy and scavenger queue. If you have any questions regarding this change, please contact us.
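For users who submit large batches, the limits above can be checked with a small Python sketch like the one below. It takes a list of SLURM job state names for a user's scavenger jobs and reports the remaining headroom. The function name is hypothetical, and it assumes that "submitted" counts every job still in the queue, both pending and running.

```python
from collections import Counter

MAX_RUNNING = 520     # per-user running-job limit from the announcement
MAX_SUBMITTED = 1000  # per-user submitted-job limit from the announcement


def scavenger_headroom(job_states):
    """Return how many more jobs may run or be submitted.

    `job_states` is an iterable of SLURM state names (e.g. "RUNNING",
    "PENDING"), one per scavenger job owned by the user.
    Assumption: every job still in the queue counts as "submitted".
    """
    counts = Counter(s.strip().upper() for s in job_states if s.strip())
    running = counts["RUNNING"]
    submitted = sum(counts.values())
    return {
        "running_left": max(0, MAX_RUNNING - running),
        "submitted_left": max(0, MAX_SUBMITTED - submitted),
    }
```

The state names could be collected from the output of `squeue -h -u $USER -o %T`. How to restrict that output to scavenger jobs depends on how the queue is configured and is not shown here.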
Nov 15, 2022
Resolved: Login issue - Stale file handle

We are currently experiencing a login issue with our gateway nodes that report /mnt/home/<username>/.bash_profile: Stale file handle. We are working to resolve this issue.
Nov 1, 2022
Scheduler Outage on November 1st at 8PM

On November 1st at 8PM the scheduler will be offline momentarily in order to add additional computing resources to the machine that hosts the scheduling software. If you have any questions or concerns regarding this outage, please contact us.
Oct 26, 2022
Resolved: Request Tracker rt.hpcc.msu.edu outage.

From about 4 AM to 9 AM this morning (10-26) RT was unavailable due to a configuration management issue. It has been resolved but please let us know if you have any issues.
Oct 12, 2022
Resolved: OnDemand failing when job is scheduled on a new acm node.

RESOLVED 10/14/2022: OnDemand Desktop works on the amd22 cluster now.
Oct 10, 2022
Service availability issues 10/10

At about 12:20 PM on October 10th, a bad git merge for our configuration management software caused old configurations to get pushed out to all nodes, which broke a number of services (including the contact forms and job submission on some nodes). This was reverted by 1:08 PM, but due to caching some nodes may have received this configuration through 2 PM. All nodes and services should be back to normal functionality by 3 PM on October 10th.
Oct 7, 2022
Resolved: Request Tracker and Contact Forms outage on 10/11

Update 10/11 8 AM: Maintenance on RT has completed. Please let us know if you have any issues.
Oct 4, 2022
HPCC Scratch filesystem issues - Resolved

The HPCC scratch filesystem is currently experiencing an issue. Users may have seen issues as early as 7:30 AM this morning. We are working to identify the cause and correct the issue and will post updates here as they become available.
Sep 27, 2022
Password logins to the rsync gateway will be disabled on 10/12/22

UPDATE 10/14: This has been implemented. Users using sshfs on Windows should contact the ICER help desk for help using public key authentication with rsync.hpcc.msu.edu.
Aug 31, 2022
New Scratch gs21 availability and gs18/ls15 retirement - UPDATED

We are excited to announce the general release of our new gs21 scratch system, now available at /mnt/gs21/scratch on all user systems, including gateways, development nodes, and the compute cluster. The new scratch system provides 3 PB of space for researchers and allows us to continue to maintain 50 TB quotas for our growing community. The new system also includes 200 TB of high-speed flash. You may begin to utilize the new scratch system immediately. Please read on for more information about the transition to this space.
Aug 31, 2022
File Transfer Service Network Migration - Resolved

UPDATE: The rsync service (rsync.hpcc.msu.edu) is available (8-31). A reminder that the rsync service node should only be used for file transfers.
Aug 17, 2022
Brief Scheduler Outage at 8:00PM 8/18/22 - UPDATED

On Thursday, August 18th, at 8:00PM, there will be a brief interruption in scheduling as we push an update to our SLURM configuration. We expect this outage to last roughly 30 minutes. During this outage, SLURM client commands will be unavailable (e.g. srun/salloc/sbatch). Running jobs should not be affected.
Aug 4, 2022
Boosts to Job Priority Being Offered to Users Affected by Scheduler Issue

Many running jobs were cancelled due to unforeseen complications with yesterday's SLURM configuration update. We are reaching out to affected users and offering boosts to job priority to make up for any lost productivity.
Aug 1, 2022
Brief Scheduler Outage at 8:00PM 8/3/22 - UPDATED

On Wednesday, August 3rd, at 8:00PM, there will be a brief interruption in scheduling as we push an update to our SLURM configuration. We expect this outage to last roughly 30 minutes. During this outage, SLURM client commands will be unavailable (e.g. srun/salloc/sbatch). Running jobs should not be affected.
Jul 29, 2022
Firewall Maintenance on August 9th

On Tuesday, August 9th, MSU ITS will be upgrading the ICER firewall between 10 PM and 2 AM. This should not impact any running jobs or access to the HPCC. Users may experience intermittent, minor delays during interactive use.
Jul 21, 2022
Minor SLURM Update on 7/28/22

On Wednesday, July 28th, we will be deploying a minor update to the SLURM scheduling software. This update contains minor bug fixes and should not impact HPCC users. If you have any questions about this update or you experience issues following this update, please contact us at https://contact.icer.msu.edu/.
Jul 18, 2022
HPCC performance issues - resolved

A performance issue was identified this morning with the home directory servers that caused ~30 second delays for access to files or directories. We identified a set of nodes that were causing the problem and restarted services as needed to resolve the issue at 12:30 pm 7/18/22.
Jul 1, 2022
HPCC offline - resolved

The HPCC is currently down due to a hardware failure and a failed failover. We are currently working with NetApp to resolve the issue. Users may have seen issues as soon as 2 PM, and the system has been fully down since about 3:30 PM.
Jun 3, 2022
Welcome to the new ICER Announcements Blog!

Hi! Welcome to the new ICER Announcements Blog. We have a new user documentation site at https://docs.icer.msu.edu. Please contact us if you have any questions.