Posts

  • Winter Break Limited Coverage

    There will be limited coverage while MSU observes winter break from December 24, 2024 through January 1, 2025. The system will continue to run jobs and be monitored for emergency issues. Tickets will be sorted by priority on January 2 when our team returns to work after the holiday break. If you have any questions, please contact us

  • HPCC Scheduled Downtime - RESOLVED 12/19/2024

    RESOLVED: Maintenance is complete, thank you for your patience. Job submissions will continue to run after 5PM on 12/19. Please note that as the intel14 cluster has been retired, the intel14 constraint must be removed from any jobs.
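
    As a rough aid for finding job scripts that still request the retired cluster, the sketch below scans a script's text for `#SBATCH` constraint lines that mention intel14. The function name, the sample script, and the exact way the constraint is spelled in your scripts are assumptions for illustration; it only reports matching lines and does not edit anything.

```python
import re

# Matches "#SBATCH --constraint=..." or "#SBATCH -C ..." directive lines.
# How the constraint is written in a given script is an assumption.
CONSTRAINT_LINE = re.compile(r"^\s*#SBATCH\s+(?:--constraint(?:=|\s+)|-C\s*)(\S+)")


def find_intel14_constraints(script_text):
    """Return (line_number, line) pairs whose constraint mentions intel14."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        match = CONSTRAINT_LINE.match(line)
        if match and "intel14" in match.group(1):
            hits.append((number, line.strip()))
    return hits


# Hypothetical job script used only to demonstrate the check.
sample = """#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --constraint=intel14
#SBATCH -c 4
srun ./my_program
"""

for number, line in find_intel14_constraints(sample):
    print(f"line {number}: {line}")
```

    Any line the check reports needs its intel14 constraint removed before the job is resubmitted.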

  • Intel16 Cluster Currently Offline - RESOLVED 11/19/2024

    RESOLVED: 11/19/2024 12:10PM - On 11/18/2024 ITS performed maintenance on a number of switches in the data center that required rebooting critical network infrastructure. After these reboots, several links connecting to the intel16 cluster did not recover. During this time, you may have also noticed brief pauses in OnDemand and on Gateway nodes. This morning we were able to work with ITS to re-establish connectivity to all intel16 nodes, and the intel16 cluster, along with all other nodes, is now back in production and running jobs via Slurm.

  • MATLAB License issue - RESOLVED 10/31/2024

    RESOLVED: 10/31/2024 5:15PM - The issue is resolved on development and compute nodes.

  • Shared Module and Software Server Restart - RESOLVED 11/1/2024

    RESOLVED: 11/1/2024 6:15 AM - The system restart is complete and all services should be online.

  • Shared Software File Server Restart - RESOLVED 12:50 10/30/2024

    RESOLVED: 1250 10/30/2024 - The system restart is complete and all services should be online.

  • ICER Web Application Login Error - RESOLVED 10/29/2024

    UPDATE: 10/29/2024 - Logins to RT, Open OnDemand, and the Contact forms look to be fully functional again. Values might be cached, so you might need to clear your browser cache. You can test by opening a private browser window. Email general@rt.hpcc.msu.edu if you still experience problems.

  • ICER Contact Form UserID Information Lookup Error RESOLVED 10/29/2024

    The ICER contact form is currently experiencing a technical error retrieving userID information for some MSU accounts. This error may result in your inability to log new account or new research space requests. While we continue to troubleshoot this error, please use the general contact form to submit your requests. This post will continue to be updated as we have more information.

  • Gateway Node Operating System Upgrades

    Starting on Monday 10/28/2024 and over the next few weeks, we will be upgrading the operating systems on the gateway nodes. If you experience a timeout while attempting to connect to the HPCC during this time, please try again after a short delay or use our Open OnDemand instance. If you continue to have difficulty logging into HPCC resources, please let us know by submitting a ticket through our Contact Forms.

  • 2024-10-24 Development node reboots - RESOLVED 2024-10-24 0715

    RESOLVED: 10/24/2024 - All reboots are complete and the development nodes should be available. Please report any issues through our contact forms

  • 2024-10-29 Nondisruptive firewall update

    Between 7 PM and 9 PM on October 29th, ITS will perform updates to the ICER firewall. We do not anticipate any impact to users as the firewall is configured with full redundancy, but please open a ticket if you notice any issues.

  • File System Performance - RESOLVED 9/27/2024

    RESOLVED: 9/27/2024 - ICER has completed the migration of data from the old home and research file system to the new file system. This should resolve the occasional slowdowns that have occurred since the start of the project this past spring. Home and research file system operations have returned to normal. This includes disaster recovery replication and our file system quota processes. Thank you for your patience during this transition.

  • 'Illegal instruction (core dumped)' Errors - RESOLVED 10/14/2024

    RESOLVED: 10/14/2024 - We have applied a fix that we believe has solved the issue. If you are still experiencing problems, please contact ICER support with a description and steps to reproduce the issue.

  • OnDemand Portal Update on Friday 9/20 - RESOLVED 9/23/24

    At 9:00PM on Friday, September 20th, ICER’s OnDemand portal will undergo an update from version 3.0.1 to version 3.1.7. The most notable change to the portal following this update will be Globus integration. When browsing files in the updated OnDemand portal, a ‘Globus’ button will be available that will open the current directory inside of Globus. A full list of changes made by this update can be viewed here. If you have any questions about this update or encounter any issues with the OnDemand portal following the update, please contact us.

  • Development node dev-intel18 maintenance - RESOLVED

    RESOLVED: The maintenance on dev-intel18 is complete as of 11:20 AM, September 6, 2024 and the node should be available for use.

  • Change to Loading Modules in SLURM Scripts

    In one week, ICER will make a small change to the way modules are loaded in SLURM scripts. Please make sure that all SLURM scripts you submit load modules in scripts before you use them! For more information and also how this affects workflow managers like Nextflow and Snakemake, please see our documentation.

  • OnDemand and Contact form login issues - RESOLVED

    RESOLVED: ITS has resolved the login issue and all systems are accessible as normal.

  • Filesystem Slowdown and User Creation Pause - RESOLVED

    UPDATE 8/13/2024: The recovery processes have finished running, and Home filesystem performance has now returned to normal.

  • August 6, 2024: HPCC Scheduled Downtime and Transition of Remaining CentOS Nodes (Completed 8/6/2024)

    Updates: 05:00PM - Upgrades are complete and we are in the process of moving the system to production. This process takes about 30 minutes. HPCC should be available by 5:30PM or shortly after. The Home and Research filesystems are a little slow while snapshots catch up. Those will clear later this evening. If you notice problems, contact us.

  • RESOLVED 7/31/24 Scavenger Queue jobs not starting

    The scavenger queue is operating normally now that the buyin node OS transition has been completed.

  • Data machine nodes not showing up in scontrol - UPDATED

    On July 12th, it was discovered that the data machine nodes are not properly responding to diagnostic commands. However, these nodes are still available and scheduling jobs.

  • Rebuilding default OpenMPI, may cause login issues - RESOLVED

    On July 11th, 2024 from 5:30-6:00PM Michigan time, we will be rebuilding the default OpenMPI module, OpenMPI/4.1.5-GCC-12.3.0. This will result in errors from the module system when logging in, as the module needs to be deleted to be rebuilt. This will not affect running jobs, and will be isolated to development nodes only. The rebuild should be complete by 6:00PM at which time this blog post will be updated.

  • Update details about current filesystem and OnDemand issues

    OnDemand: OnDemand is periodically losing connection to our gateway nodes. This makes home and scratch unavailable. We are still investigating the cause.

    Home directories: The home file system underwent diagnostics from 6/24-6/28. This caused slowdowns for logging in and using the HPCC. We restarted our backup process after the scan ended on the evening of 6/28, and users may see pauses as the file system catches up.

    New OS: We upgraded our operating system to Ubuntu 22.04 in mid-June. This included a reinstallation of all software modules. Please read our documentation here for more details about the upgrade, and contact us if you are having issues not covered by this documentation.

    Please click the title of this post for more detailed information and our planned timeline. Updated: 7/10 at the end.

  • Current system issues

    We are aware of two issues affecting the system at this time: slow response to commands/slow login, and OnDemand scratch space missing. The system slowdowns are caused by diagnostics on the home filesystem as part of our upgrade to a new home filesystem. We do not currently have an estimate for when these diagnostics will complete. The OnDemand scratch space connection is also being diagnosed and addressed with our storage vendor. Please check back for updates as we have them.

  • Home filesystem issues affecting OnDemand

    OnDemand functionality has been partially recovered. Users should be able to log in, connect, and access their home and research spaces, as well as interactive app sessions. Scratch remains unavailable at this time. Please report access issues at https://contact.icer.msu.edu/contact

  • Home filesystem issues affecting OnDemand - Resolved

    At approximately 12:00 PM on 6/17/2024 we started experiencing an outage with the Home filesystem. This outage primarily affects OnDemand, but may be apparent on other nodes as well.

  • Compute Operating system upgrades (complete)

    On 17 June, 2024 the primary operating system on HPCC resources is being changed from CentOS 7 to Ubuntu 22.04. Please review our operating system upgrade documentation for details.

  • Samba connectivity issues

    UPDATE 3:45pm 5/16/24 Samba file sharing is now back online. Please submit a ticket at https://contact.icer.msu.edu/contact if you continue to experience issues.

  • Home filesystem issues - update

    At approximately 6:15PM on 5/13/2024, users began reporting issues accessing their home directory on HPCC. We are aware of the issue and are working with our vendors to address it.

  • Home filesystem issues causing login problems

    At approximately 11:10 AM on 5/10/2024 we experienced a transient outage while conducting upgrades and a hardware refresh of our Home filesystem. This outage may have caused login issues or stale file mounts. Services were restored after approximately 15 minutes and home directories should be available again. If you continue to experience issues with your home directory, please contact us.

  • System Reboots Thursday May 9

    On Thursday May 9 the following systems will be rebooted from 10-12am:

  • Home filesystem issue UPDATED 5/3/2024 5:00PM

    UPDATE (5/3/2024 5:00 pm) - The issue has been resolved and all services should be available. If you encounter any additional issues, please contact us.

  • Home filesystem issue causing login problems - UPDATED 5/2/2024 12:30 pm

    UPDATE (5/2/2024 12:30 pm) - File system and connectivity issues have been resolved.

  • Scheduler Reboot at 10:00AM on 3/19/24

    At 10:00AM on Tuesday, March 19th, the SLURM scheduling server will go offline for a reboot. This reboot is necessary to apply updates to the underlying hardware that hosts the scheduler. The scheduler is expected to be offline for roughly 15 minutes. During this time, jobs may not be submitted and scheduler specific client commands will not work (e.g. squeue, sbatch, etc). Running jobs will not be affected. If you have any questions about this outage, please contact us.

  • Scheduler Reboot at 10:00AM on 3/18/24

    At 10:00AM on Monday, March 18th, the SLURM scheduling server will go offline for a reboot. This reboot is necessary to apply updates to the underlying hardware that hosts the scheduler. The scheduler is expected to be offline for roughly 15 minutes. During this time, jobs may not be submitted and scheduler specific client commands will not work (e.g. squeue, sbatch, etc). Running jobs will not be affected. If you have any questions about this outage, please contact us.

  • Scratch space not accessible via OnDemand

    UPDATE (3/1/2024) - Access to scratch via OnDemand has been restored

  • VSCode updates will break access

    This post applies to users of VS Code that SSH into the ICER HPCC from their own copy of VS Code.

    Error message: “This machine does not meet Visual Studio Code Server’s prerequisites, expected either…: - find GLIBC >= v2.28.0 (but found v2.17.0 instead) for GNU environments”

    Details

    Microsoft recently updated Visual Studio Code to version 1.86, and it is no longer compatible with the operating system we use at ICER. The change note that lists the change is here https://code.visualstudio.com/updates/v1_86#_engineering (scroll down to “Linux minimum requirements update”). Although we plan to upgrade our operating system this year, in the meantime there are two solutions to this incompatibility.

    Solutions

    1) Use our code server app in OnDemand (Interactive Apps -> Code Server (beta)). You can request compute nodes to work on for a specified amount of time, and use VS Code in your browser.

    2) Downgrade to the previous 1.85 version of VS Code and disable automatic updates. You can access the previous version here https://code.visualstudio.com/updates/v1_85 (see the Downloads section for a version for your PC or Mac)
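
    To see whether a given Linux system would hit this error, the glibc version can be compared against the 2.28 minimum quoted in the message. The following is a minimal sketch; the helper names are made up for illustration, and it relies on Python's `platform.libc_ver()`, which reports the C library the Python interpreter itself is linked against and may return empty strings on non-glibc systems.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.17' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_glibc_minimum(found, minimum="2.28"):
    """True when the found glibc version is at least the required minimum."""
    return parse_version(found) >= parse_version(minimum)


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        ok = meets_glibc_minimum(version)
        print(f"glibc {version}: meets the 2.28 minimum = {ok}")
    else:
        print("Could not determine a glibc version on this system.")
```

    With the version from the error message, `meets_glibc_minimum("2.17")` is false, which is why VS Code 1.86 refuses to connect.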

  • Minor SLURM Update on 01/11/24

    On Thursday, January 11th, we will be deploying a minor update to the SLURM scheduling software. This update will bring ICER to the latest minor revision of SLURM 23.02. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.

  • Winter Break Limited Coverage

    There will be limited coverage while MSU observes winter break from December 22, 2023 through January 2, 2024. The system will continue to run jobs and be monitored for emergency issues. Tickets will be sorted by priority on January 3 when our team returns to work after the holiday break. If you have any questions, please contact us

  • Retirement of dev-intel14 and dev-intel14-k20 on 12/14/23

    On Thursday, December 14th, we will be retiring the dev-intel14 and dev-intel14-k20 nodes. After this date, the dev-intel14 and dev-intel14-k20 nodes will no longer be available for use as development nodes. Users should connect to the remaining active development nodes for any development node tasks. If you have any questions about this change, please contact us.

  • Minor SLURM Update on 12/05/23

    On Tuesday, December 5th, we will be deploying a minor update to the SLURM scheduling software. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.

  • HPCC Scheduled Downtime - Completed

    The HPCC will be unavailable on Wednesday, December 20th for our regularly scheduled maintenance. No jobs will run during this time. Jobs that will not be completed before December 20th will not begin until after maintenance is complete. For example, if you submit a four day job three days before the maintenance outage, your job will be postponed and will not begin to run until after maintenance is completed.
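
    The four-day example above comes down to simple date arithmetic: a job can only start if its requested walltime ends before the maintenance window opens. The sketch below is a simplified model of that rule, not a description of the scheduler's internal logic; the function name, the year, and the midnight start time are assumptions.

```python
from datetime import datetime, timedelta


def can_start_before_maintenance(start, walltime, maintenance_start):
    """True when a job starting at `start` would finish before maintenance begins."""
    return start + walltime <= maintenance_start


# Assumed maintenance start: midnight on the maintenance day.
maintenance = datetime(2023, 12, 20, 0, 0)
three_days_before = maintenance - timedelta(days=3)

# A four-day job submitted three days before the outage has to wait.
print(can_start_before_maintenance(three_days_before, timedelta(days=4), maintenance))  # False
# A two-day job submitted at the same time can still run.
print(can_start_before_maintenance(three_days_before, timedelta(days=2), maintenance))  # True
```

    Requesting a walltime no longer than the job actually needs gives it the best chance of fitting in before the outage.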

  • RT Ticketing system problem last night 11/15/23

    The RT/Ticketing systems had problems after an upgrade last night. The time of the problem was from 9:00 pm 11-14-23 to 9:00 am 11-15-23. If you had problems during that timeframe please try again now. If you experience problems again please clear your browser cache. Thank You.

  • Minor SLURM Update on 11/09/23

    On Thursday, November 9th, we will be deploying a minor update to the SLURM scheduling software. This update will improve the efficiency of our SLURM controller's application logs. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.

  • Jobs Now Always Automatically Requeued On Prolog Failure

    As of Thursday, October 26th, jobs that fail to start due to a prolog script error will always be requeued.

  • Performance problem on home system - UPDATED 10/24/2023

    UPDATE (10/24/2023) - The performance issues with the home directory system have now been resolved.

  • Minor SLURM Update on 10/23/23

    On Monday, October 23rd, we will be deploying a minor update to the SLURM scheduling software. This update brings our installation to the latest release and includes many bug fixes. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.

  • Minor Singularity Update on 10/12/23

    On Thursday, October 12th, we will be deploying a minor update to the Singularity container software. This update will bring the HPCC from version 3.11.4 to the latest 3.11.5. A handful of bug fixes and new features are available in this version. For a full list of changes, please refer to Singularity’s release notes on GitHub. If you have any questions about this update or you experience issues following this update, please contact us

  • HPCC Connectivity Issues - UPDATED 10/2/23

    UPDATE (10/2/2023): We experienced an issue at 0930 this morning with home directories that prevented user logins. All services are now recovered and login should again be successful. Please let us know if you continue to experience issues.

  • Minor SLURM Update on 10/09/23

    On Monday, October 9th, we will be deploying a minor update to the SLURM scheduling software. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.

  • Contact and Gateway Issues on 9/19/23

    Due to a failure of a supporting service, gateway-02 and the contact forms were unavailable at around 6 PM this evening. Staff have restored these services.

  • Minor SLURM Update on 9/21/23

    On Thursday, September 21st, we will be deploying a minor update to the SLURM scheduling software. This update is built against newer Nvidia drivers to support scheduling of multi-instance GPUs. If you have any questions about this update or you experience issues following this update, please contact us.

  • Performance problem on home system - resolved

    UPDATE: 8/31/2023 The cause of the system slowdowns was identified on 8/29/2023 as jobs saturating the storage I/O. Please follow the lab notebook for details and best practices to prevent this from happening again.

  • Globus Restored to Service - 8/16/2023

    8/16/2023:

  • HPCC Scheduled Downtime - UPDATED 8/15/2023

    UPDATE (8/15/2023): All scheduled updates are completed for the 8/15/2023 summer maintenance.

  • HPCC Connectivity Issues

    Update: Network problems in the data center were fixed by 3pm.
    Stability of the home directories and gateways was restored by 5:30pm. File a ticket if you notice any other issues. We will continue to monitor closely this evening.

  • Intermittent HPCC Performance Issues

    We are experiencing sporadic episodes of slowness with logging in to the gateways and/or interactive work on the development nodes. We’re in the process of tracking down this issue. If you are experiencing this issue and/or have any other comments or questions, please feel free to file a ticket with us here: https://contact.icer.msu.edu/contact

  • Scheduler Outage on 7/25/23 at 6:00PM

    Starting at 6:00PM on Tuesday, July 25th, the SLURM scheduler will go offline in order to perform a migration of its underlying compute resources. This migration is necessary to complete routine maintenance on underlying compute resources. This outage is expected to last up to 30 minutes. During this time, SLURM client commands (sbatch, squeue, etc.) will be unavailable and no new jobs will be started. Queued and running jobs will not be affected. If you have any questions about this outage, please contact us.

  • Minor SLURM Update on 7/10/23

    On Monday, July 10th, we will be deploying a minor update to the SLURM scheduling software. This update contains a patch designed to address a bug experienced with some large jobs (>50 nodes) that causes job processes to persist past a job’s end time. If you have any questions about this update or you experience issues following this update, please contact us at https://contact.icer.msu.edu/.

  • Minor Singularity Update on 7/3/23

    On Monday, July 3rd, we will be deploying a minor update to the Singularity container software. This update will bring the HPCC from version 3.11.2 to the latest 3.11.4. Several bug fixes and new features are available in this version. For a full list of changes, please refer to Singularity’s release notes on GitHub. If you have any questions about this update or you experience issues following this update, please contact us

  • Minor SLURM Update on 6/28/23

    On Wednesday, June 28th, we will be deploying a minor update to the SLURM scheduling software. This update contains minor bug fixes and should not impact HPCC users. If you have any questions about this update or you experience issues following this update, please contact us at https://contact.icer.msu.edu/.

  • HPCC Connectivity Issues - UPDATED

    UPDATE: Network connectivity has been restored, and all ICER services are operational.

  • Server Maintenance on June 22 at 5:30AM - UPDATED

    UPDATE: This maintenance work is now complete.

  • Network Maintenance Planned for June 19, 2023 at 6:30PM - UPDATED

    UPDATE: Scheduled HPCC network maintenance is now complete.

  • Email Delivery Delays - June 5, 2023 - UPDATED

    UPDATE: Email to ICER is now functioning again without errors or delays.

  • Temporary Service Slowdown Possible - June 3, 2023

    ICER users may notice slow network speeds from June 3 to June 7, 2023. In support of the XPRIZE competition, ICER will share significant HPCC resources during this timeframe which may result in service slowdowns

  • Scheduled Home Filesystem Update - Tuesday, May 30, 2023

    On Tuesday, May 30th at 10am EDT, we will be performing a minor version upgrade of our home filesystem. This process will take approximately two hours. While we will be performing the update with the filesystem online, there is a possibility that the cluster may briefly lose connection to the storage. Please take this into consideration for any jobs which will be running during this time.

  • Local File Quotas Are Now Set for /tmp and /var on All Nodes

    Beginning May 3, 2023, user quotas will be in place on all nodes for the /tmp and /var directories. All user accounts will be limited to 95% of the total /tmp partition space that is available on a particular node, and a 5GB limit on the /var partition. If a user account exceeds this quota, a 2 hour grace period will be allowed before the user account is no longer able to write to the /tmp or /var directory.
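
    Because the /tmp limit is a percentage of each node's partition, the absolute figure differs from node to node. As a small worked example, the sketch below computes the limit for a hypothetical 100 GB /tmp partition; the function name, the partition size, and the use of decimal gigabytes are assumptions.

```python
TMP_PERCENT = 95             # share of the /tmp partition stated in the post
VAR_LIMIT_BYTES = 5 * 10**9  # 5GB limit on /var; decimal gigabytes assumed


def tmp_quota_bytes(partition_total_bytes):
    """Per-user /tmp limit, in bytes, for a partition of the given size."""
    return partition_total_bytes * TMP_PERCENT // 100


# Hypothetical node with a 100 GB /tmp partition.
print(tmp_quota_bytes(100 * 10**9))  # 95000000000
```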

  • Intel14 Nodes Now Dedicated to OnDemand

    Intel14 nodes have been removed from general queues and repurposed. A combined total of 2468 CPU cores and 15.66TB of memory has been dedicated to running jobs submitted through ICER’s installation of Open OnDemand. Dedicating these resources will help to reduce the amount of time users have to wait to launch interactive jobs through OnDemand.

  • SLURM Node Updates on Thursday, March 30th

    On Thursday, March 30th, at 10:00AM, SLURM clients will be updated to the latest version. This update will bring the node and user components of SLURM to the same version as our SLURM controller and database. Most client commands (e.g. squeue, sbatch, sacct) should work seamlessly through this update. New jobs can be queued as normal and running jobs should not be affected. During these updates, nodes will appear as offline and no new jobs will start. Please note that pending srun/salloc commands may fail to start after this update is complete. If you have a job submitted through srun/salloc that fails after this update, please contact us. We can boost the priority of your job after resubmission.

  • MPI Performance Issues Following SLURM Controller Update - Updated

    UPDATE: We applied a patch from the software vendor that eliminates the performance issue.

  • Intel14 nodes to be removed from general queues - Updated

    UPDATE: Intel14 nodes have been removed from general queues

  • SLURM Scheduler Update at 5:00PM on 3/16/23 - Updated

    UPDATE: The scheduler is back online and functioning normally.

  • SLURM Database Outage at 10:00AM on 3/9/23 - UPDATED

    UPDATE: The database upgrade is complete. The sacct command will now function as expected.

  • Scratch purge of 45 day old files

    Starting on February 15th, files on /mnt/scratch (/mnt/gs21) that have not been modified within the last 45 days will be deleted. Due to technical issues, this purge has not been running and older files have not been regularly removed from scratch/gs21. This issue has been fixed and automatic deletion will resume on February 15th. Users should ensure that any data older than 45 days on scratch/gs21 that they wish to save has been moved to persistent storage (home/research spaces or external storage.)
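
    To find data that would fall under this policy, one option is to list files whose modification time is more than 45 days old. The following is a minimal sketch under stated assumptions: the function name is hypothetical, it only reports paths and deletes nothing, and the purge itself may use different criteria than this simple modification-time check.

```python
import os
import time

PURGE_AGE_DAYS = 45  # age threshold stated in the post


def files_older_than(root, days=PURGE_AGE_DAYS, now=None):
    """Return paths under `root` whose modification time is more than `days` old."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return sorted(stale)
```

    Calling `files_older_than()` on your own scratch directory returns the candidates to copy to home, research, or external storage before the purge resumes.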

  • HPCC Scheduled Downtime

    Update 1/5/2023 All updates were completed by 3pm on 1/4/2023. Globus had problems and was brought back online 1/5/2023. If you experience any problems, please contact us

  • Resolved: Rsync gateway issues

    RESOLVED 12/22/22: The issue with the rsync gateway is resolved and file transfers are fully functional.

  • Resolved: Rsync gateway issues

    RESOLVED 12/13/22: The issue with the rsync gateway is resolved and file transfers are fully functional.

  • Winter Break Limited Coverage

    There will be limited coverage while MSU observes winter break from December 23, 2022 through January 2, 2023. The system will continue to run jobs and be monitored for emergency issues. Tickets will be sorted by priority on January 3 when our team returns to work after the holiday break. If you have any questions, please contact us.

  • New Limits on Scavenger Queue

    We have implemented a new limit of 520 running jobs per user and 1000 submitted jobs per user in the scavenger queue. We have put this limit in place to ensure that the scheduler is able to evaluate all the jobs in the queue during its regular scheduling cycles. This matches our general queue limits. Please see our documentation for more information about our scheduler policy and scavenger queue. If you have any questions regarding this change, please contact us.

  • Resolved: Login issue - Stale file handle

    We are currently experiencing a login issue with our gateway nodes that report /mnt/home/<username>/.bash_profile: Stale file handle. We are working to resolve this issue.

  • Scheduler Outage on November 1st at 8PM

    On November 1st at 8PM the scheduler will be offline momentarily in order to add additional computing resources to the machine that hosts the scheduling software. If you have any questions or concerns regarding this outage, please contact us.

  • Resolved: Request Tracker rt.hpcc.msu.edu outage.

    From about 4 AM to 9 AM this morning (10-26) RT was unavailable due to a configuration management issue. It has been resolved but please let us know if you have any issues.

  • Resolved: OnDemand failing when job is scheduled on a new amd node.

    RESOLVED 10/14/2022: OnDemand Desktop works on the amd22 cluster now.

  • Service availability issues 10/10

    At about 12:20 PM on October 10th, a bad git merge for our configuration management software caused old configurations to get pushed out to all nodes, which broke a number of services (including the contact forms and job submission on some nodes.) This was reverted by 1:08 PM, but due to caching some nodes may have received this configuration through 2 PM. All nodes and services should be back to normal functionality by 3 PM on October 10th.

  • Resolved: Request Tracker and Contact Forms outage on 10/11

    Update 10/11 8 AM: Maintenance on RT has completed. Please let us know if you have any issues.

  • HPCC Scratch filesystem issues - Resolved

    The HPCC scratch filesystem is currently experiencing an issue. Users may have seen issues as early as 7:30 AM this morning. We are working to identify the cause and correct the issue and will post updates here as they become available.

  • Password logins to the rsync gateway will be disabled on 10/12/22

    UPDATE: 10/14: This has been implemented. Users using sshfs on Windows should contact the ICER help desk for help using public key authentication with rsync.hpcc.msu.edu.

  • New Scratch gs21 availability and gs18/ls15 retirement - UPDATED

    We are excited to announce the general release of our new gs21 scratch system, now available at /mnt/gs21/scratch on all user systems, including gateways, development nodes, and the compute cluster. The new scratch system provides 3 PB of space for researchers and allows us to continue to maintain 50 TB quotas for our growing community. The new system also includes 200 TB of high-speed flash. You may begin to utilize the new scratch system immediately. Please read on for more information about the transition to this space.

  • File Transfer Service Network Migration - Resolved

    UPDATE: The rsync service (rsync.hpcc.msu.edu) is available (8-31). A reminder that the rsync service node should only be used for file transfers.

  • Brief Scheduler Outage at 8:00PM 8/18/22 - UPDATED

    On Thursday, August 18th, at 8:00PM, there will be a brief interruption in scheduling as we push an update to our SLURM configuration. We expect this outage to last roughly 30 minutes. During this outage, SLURM client commands will be unavailable (e.g. srun/salloc/sbatch). Running jobs should not be affected.

  • Boosts to Job Priority Being Offered to Users Affected by Scheduler Issue

    Many running jobs were cancelled due to unforeseen complications with yesterday's SLURM configuration update. We are reaching out to affected users and offering boosts to job priority to make up for any lost productivity.

  • Brief Scheduler Outage at 8:00PM 8/3/22 - UPDATED

    On Wednesday, August 3rd, at 8:00PM, there will be a brief interruption in scheduling as we push an update to our SLURM configuration. We expect this outage to last roughly 30 minutes. During this outage, SLURM client commands will be unavailable (e.g. srun/salloc/sbatch). Running jobs should not be affected.

  • Firewall Maintenance on August 9th

    On Tuesday, August 9th, MSU ITS will be upgrading the ICER firewall between 10 PM and 2 AM. This should not impact any running jobs or access to the HPCC. Users may experience intermittent, minor delays during interactive use.

  • Minor SLURM Update on 7/28/22

    On Wednesday, July 28th, we will be deploying a minor update to the SLURM scheduling software. This update contains minor bug fixes and should not impact HPCC users. If you have any questions about this update or you experience issues following this update, please contact us at https://contact.icer.msu.edu/.

  • HPCC performance issues - resolved

    A performance issue was identified this morning with the home directory servers that caused ~30 second delays for access to files or directories. We identified a set of nodes that were causing the problem and restarted services as needed to resolve the issue at 12:30 pm 7/18/22.

  • HPCC offline - resolved

    The HPCC is currently down due to a hardware failure and a failed failover. We are currently working with NetApp to resolve the issue. Users may have seen issues as soon as 2 PM, and the system has been fully down since about 3:30 PM.

  • Welcome to the new ICER Announcements Blog!

    Hi! Welcome to the new ICER Announcements Blog. We have a new user documentation site at https://docs.icer.msu.edu. Please contact us if you have any questions.
