Posts

  • Scheduler Reboot at 10:00AM on 3/19/24

    At 10:00AM on Tuesday, March 19th, the SLURM scheduling server will go offline for a reboot. This reboot is necessary to apply updates to the underlying hardware that hosts the scheduler. The scheduler is expected to be offline for roughly 15 minutes. During this time, jobs may not be submitted and scheduler-specific client commands (e.g. squeue, sbatch) will not work. Running jobs will not be affected. If you have any questions about this outage, please contact us.

  • Scheduler Reboot at 10:00AM on 3/18/24

    At 10:00AM on Monday, March 18th, the SLURM scheduling server will go offline for a reboot. This reboot is necessary to apply updates to the underlying hardware that hosts the scheduler. The scheduler is expected to be offline for roughly 15 minutes. During this time, jobs may not be submitted and scheduler-specific client commands (e.g. squeue, sbatch) will not work. Running jobs will not be affected. If you have any questions about this outage, please contact us.

  • Scratch space not accessible via OnDemand

    UPDATE (3/1/2024) - Access to scratch via OnDemand has been restored

  • VSCode updates will break access

    This post applies to users of VS Code that SSH into the ICER HPCC from their own copy of VS Code.

    Error message: “This machine does not meet Visual Studio Code Server’s prerequisites, expected either…: - find GLIBC >= v2.28.0 (but found v2.17.0 instead) for GNU environments”

    Details: Microsoft recently updated Visual Studio Code to version 1.86, and it is no longer compatible with the operating system we use at ICER. The change note that lists the change is here: https://code.visualstudio.com/updates/v1_86#_engineering (scroll down to “Linux minimum requirements update”). Although we plan to upgrade our operating system this year, in the meantime there are two solutions to this incompatibility.

    Solutions

    1) Use our code server app in OnDemand (Interactive Apps -> Code Server (beta)). You can request compute nodes to work on for a specified amount of time, and use VS Code in your browser.

    2) Downgrade to the previous 1.85 version of VS Code and disable automatic updates. You can access the previous version here: https://code.visualstudio.com/updates/v1_85 (see the Downloads section for a version for your PC or Mac).
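    After installing 1.85, automatic updates should be turned off so VS Code does not upgrade itself back to 1.86. A minimal sketch of the relevant entries in VS Code's user settings.json (the file's location varies by OS; the second setting is optional):

    ```jsonc
    {
      // Keep VS Code at 1.85 until the server-side OS upgrade
      "update.mode": "none",
      // Optional: some extension updates require a newer VS Code
      "extensions.autoUpdate": false
    }
    ```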

  • Minor SLURM Update on 01/11/24

    On Thursday, January 11th, we will be deploying a minor update to the SLURM scheduling software. This update will bring ICER to the latest minor revision of SLURM 23.02. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.

  • Winter Break Limited Coverage

    There will be limited coverage while MSU observes winter break from December 22, 2023 through January 2, 2024. The system will continue to run jobs and be monitored for emergency issues. Tickets will be sorted by priority on January 3 when our team returns to work after the holiday break. If you have any questions, please contact us.

  • Retirement of dev-intel14 and dev-intel14-k20 on 12/14/23

    On Thursday, December 14th, we will be retiring the dev-intel14 and dev-intel14-k20 nodes. After this date, the dev-intel14 and dev-intel14-k20 nodes will no longer be available for use as development nodes. Users should connect to the remaining active development nodes for any development node tasks. If you have any questions about this change, please contact us.

  • Minor SLURM Update on 12/05/23

    On Tuesday, December 5th, we will be deploying a minor update to the SLURM scheduling software. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.

  • HPCC Scheduled Downtime - Completed

    The HPCC will be unavailable on Wednesday, December 20th for our regularly scheduled maintenance. No jobs will run during this time. Jobs that will not be completed before December 20th will not begin until after maintenance is complete. For example, if you submit a four day job three days before the maintenance outage, your job will be postponed and will not begin to run until after maintenance is completed.

  • RT Ticketing system problem last night 11/15/23

    The RT/Ticketing systems had problems after an upgrade last night. The problem lasted from 9:00 PM on 11/14/23 to 9:00 AM on 11/15/23. If you had problems during that timeframe, please try again now. If you experience problems again, please clear your browser cache. Thank you.

  • Minor SLURM Update on 11/09/23

    On Thursday, November 9th, we will be deploying a minor update to the SLURM scheduling software. This update will improve the efficiency of our SLURM controllers’ application logs. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.

  • Jobs Now Always Automatically Requeued On Prolog Failure

    As of Thursday, October 26th, jobs that fail to start due to a prolog script error will always be requeued.

  • Performance problem on home system - UPDATED 10/24/2023

    UPDATE (10/24/2023) - The performance issues with the home directory system have now been resolved.

  • Minor SLURM Update on 10/23/23

    On Monday, October 23rd, we will be deploying a minor update to the SLURM scheduling software. This update brings our installation to the latest release and includes many bug fixes. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.

  • Minor Singularity Update on 10/12/23

    On Thursday, October 12th, we will be deploying a minor update to the Singularity container software. This update will bring the HPCC from version 3.11.4 to the latest 3.11.5. A handful of bug fixes and new features are available in this version. For a full list of changes, please refer to Singularity’s release notes on GitHub. If you have any questions about this update or you experience issues following this update, please contact us.

  • HPCC Connectivity Issues - UPDATED 10/2/23

    UPDATE (10/2/2023): We experienced an issue at 0930 this morning with home directories that prevented user logins. All services are now recovered and login should again be successful. Please let us know if you continue to experience issues.

  • Minor SLURM Update on 10/09/23

    On Monday, October 9th, we will be deploying a minor update to the SLURM scheduling software. Running and queued jobs should not be affected. No interruptions are expected to client command functionality (e.g. squeue, sbatch, sacct). If you have any questions about this update or you experience issues following this update, please contact us.

  • Contact and Gateway Issues on 9/19/23

    Due to a failure of a supporting service, gateway-02 and the contact forms were unavailable at around 6 PM this evening. Staff have restored these services.

  • Minor SLURM Update on 9/21/23

    On Thursday, September 21st, we will be deploying a minor update to the SLURM scheduling software. This update is built against newer Nvidia drivers to support scheduling of multi-instance GPUs. If you have any questions about this update or you experience issues following this update, please contact us.

  • Performance problem on home system - resolved

    UPDATE (8/31/2023): The cause of the system slowdowns was identified on 8/29/2023 as jobs saturating the storage I/O. Please follow the lab notebook for details and best practices to prevent this from happening again.

  • Globus Restored to Service - 8/16/2023

    8/16/2023:

  • HPCC Scheduled Downtime - UPDATED 8/15/2023

    UPDATE (8/15/2023): All scheduled updates are completed for the 8/15/2023 summer maintenance.

  • HPCC Connectivity Issues

    Update: Network problems in the data center were fixed by 3pm.
    Stability with home directories and gateways was restored by 5:30pm. File a ticket if you notice any other issues. We will continue to monitor closely this evening.

  • Intermittent HPCC Performance Issues

    We are experiencing sporadic episodes of slowness with logging in to the gateways and/or interactive work on the development nodes. We’re in the process of tracking down this issue. If you are experiencing this issue and/or have any other comments or questions, please feel free to file a ticket with us here: https://contact.icer.msu.edu/contact

  • Scheduler Outage on 7/25/23 at 6:00PM

    Starting at 6:00PM on Tuesday, July 25th, the SLURM scheduler will go offline in order to perform a migration of its underlying compute resources. This migration is necessary to complete routine maintenance on underlying compute resources. This outage is expected to last up to 30 minutes. During this time, SLURM client commands (sbatch, squeue, etc.) will be unavailable and no new jobs will be started. Queued and running jobs will not be affected. If you have any questions about this outage, please contact us.

  • Minor SLURM Update on 7/10/23

    On Monday, July 10th, we will be deploying a minor update to the SLURM scheduling software. This update contains a patch designed to address a bug experienced with some large jobs (>50 nodes) that causes job processes to persist past a job’s end time. If you have any questions about this update or you experience issues following this update, please contact us at https://contact.icer.msu.edu/.

  • Minor Singularity Update on 7/3/23

    On Monday, July 3rd, we will be deploying a minor update to the Singularity container software. This update will bring the HPCC from version 3.11.2 to the latest 3.11.4. Several bug fixes and new features are available in this version. For a full list of changes, please refer to Singularity’s release notes on GitHub. If you have any questions about this update or you experience issues following this update, please contact us.

  • Minor SLURM Update on 6/28/23

    On Wednesday, June 28th, we will be deploying a minor update to the SLURM scheduling software. This update contains minor bug fixes and should not impact HPCC users. If you have any questions about this update or you experience issues following this update, please contact us at https://contact.icer.msu.edu/.

  • HPCC Connectivity Issues - UPDATED

    UPDATE: Network connectivity has been restored, and all ICER services are operational.

  • Server Maintenance on June 22 at 5:30AM - UPDATED

    UPDATE: This maintenance work is now complete.

  • Network Maintenance Planned for June 19, 2023 at 6:30PM - UPDATED

    UPDATE: Scheduled HPCC network maintenance is now complete.

  • Email Delivery Delays - June 5, 2023 - UPDATED

    UPDATE: Email to ICER is now functioning again without errors or delays.

  • Temporary Service Slowdown Possible - June 3, 2023

    ICER users may notice slow network speeds from June 3 to June 7, 2023. In support of the XPRIZE competition, ICER will share significant HPCC resources during this timeframe, which may result in service slowdowns.

  • Scheduled Home Filesystem Update - Tuesday, May 30, 2023

    On Tuesday, May 30th at 10am EDT, we will be performing a minor version upgrade of our home filesystem. This process will take approximately two hours. While we will be performing the update with the filesystem online, there is a possibility that the cluster may briefly lose connection to the storage. Please take this into consideration for any jobs which will be running during this time.

  • Local File Quotas Are Now Set for /tmp and /var on All Nodes

    Beginning May 3, 2023, user quotas will be in place on all nodes for the /tmp and /var directories. All user accounts will be limited to 95% of the total /tmp partition space that is available on a particular node, and a 5GB limit on the /var partition. If a user account exceeds this quota, a 2 hour grace period will be allowed before the user account is no longer able to write to the /tmp or /var directory.
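    To see how close you are to these limits, you can check current usage on a node before the quotas take effect. A rough sketch using standard tools (output varies by node; other users' unreadable files are skipped):

    ```shell
    # Overall usage of the partitions the quotas apply to
    df -h /tmp /var

    # Your largest items under /tmp, sorted by size
    du -sh /tmp/* 2>/dev/null | sort -h | tail -n 5
    ```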

  • Intel14 Nodes Now Dedicated to OnDemand

    Intel14 nodes have been removed from general queues and repurposed. A combined total of 2468 CPU cores and 15.66TB of memory has been dedicated to running jobs submitted through ICER’s installation of Open OnDemand. Dedicating these resources will help to reduce the amount of time users have to wait to launch interactive jobs through OnDemand.

  • SLURM Node Updates on Thursday, March 30th

    On Thursday, March 30th, at 10:00AM, SLURM clients will be updated to the latest version. This update will bring the node and user components of SLURM to the same version as our SLURM controller and database. Most client commands (e.g. squeue, sbatch, sacct) should work seamlessly through this update. New jobs can be queued as normal and running jobs should not be affected. During these updates, nodes will appear as offline and no new jobs will start. Please note that pending srun/salloc commands may fail to start after this update is complete. If you have a job submitted through srun/salloc that fails after this update, please contact us. We can boost the priority of your job after resubmission.

  • MPI Performance Issues Following SLURM Controller Update - Updated

    UPDATE: We applied a patch from the software vendor that eliminates the performance issue.

  • Intel14 nodes to be removed from general queues - Updated

    UPDATE: Intel14 nodes have been removed from general queues

  • SLURM Scheduler Update at 5:00PM on 3/16/23 - Updated

    UPDATE: The scheduler is back online and functioning normally.

  • SLURM Database Outage at 10:00AM on 3/9/23 - UPDATED

    UPDATE: The database upgrade is complete. The sacct command will now function as expected.

  • Scratch purge of 45 day old files

    Starting on February 15th, files on /mnt/scratch (/mnt/gs21) that have not been modified within the last 45 days will be deleted. Due to technical issues, this purge has not been running and older files have not been regularly removed from scratch/gs21. This issue has been fixed and automatic deletion will resume on February 15th. Users should ensure that any data older than 45 days on scratch/gs21 that they wish to save has been moved to persistent storage (home/research spaces or external storage).
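    To find data at risk before the purge resumes, you can list files that have not been modified in the last 45 days. A rough sketch (the scratch path follows the post; the rsync destination is an illustrative example, adjust it to your own home or research space):

    ```shell
    # List files under your scratch directory untouched for more than 45 days
    find "/mnt/gs21/scratch/$USER" -type f -mtime +45

    # Example of copying a directory you want to keep to persistent storage:
    # rsync -av "/mnt/gs21/scratch/$USER/project/" "$HOME/project/"
    ```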

  • HPCC Scheduled Downtime

    Update (1/5/2023): All updates were completed by 3pm on 1/4/2023. Globus had problems and was brought back online on 1/5/2023. If you experience any problems, please contact us.

  • Resolved: Rsync gateway issues

    RESOLVED 12/22/22: The issue with the rsync gateway is resolved and file transfers are fully functional.

  • Resolved: Rsync gateway issues

    RESOLVED 12/13/22: The issue with the rsync gateway is resolved and file transfers are fully functional.

  • Winter Break Limited Coverage

    There will be limited coverage while MSU observes winter break from December 23, 2022 through January 2, 2023. The system will continue to run jobs and be monitored for emergency issues. Tickets will be sorted by priority on January 3 when our team returns to work after the holiday break. If you have any questions, please contact us.

  • New Limits on Scavenger Queue

    We have implemented a new limit of 520 running jobs per user and 1000 submitted jobs per user in the scavenger queue. We have put this limit in place to ensure that the scheduler is able to evaluate all the jobs in the queue during its regular scheduling cycles. This matches our general queue limits. Please see our documentation for more information about our scheduler policy and scavenger queue. If you have any questions regarding this change, please contact us.
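    To check where you stand relative to these limits, squeue can count your own jobs. A rough sketch using standard SLURM client options; the partition name `scavenger` is an assumption here, substitute the actual partition name from your site's documentation:

    ```shell
    # Your currently running jobs in the scavenger partition (limit: 520)
    squeue -u "$USER" -p scavenger -t RUNNING -h | wc -l

    # All of your submitted jobs there, running or pending (limit: 1000)
    squeue -u "$USER" -p scavenger -h | wc -l
    ```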

  • Resolved: Login issue - Stale file handle

    We are currently experiencing a login issue with our gateway nodes that report /mnt/home/<username>/.bash_profile: Stale file handle. We are working to resolve this issue.

  • Scheduler Outage on November 1st at 8PM

    On November 1st at 8PM the scheduler will be offline momentarily in order to add additional computing resources to the machine that hosts the scheduling software. If you have any questions or concerns regarding this outage, please contact us.

  • Resolved: Request Tracker rt.hpcc.msu.edu outage.

    From about 4 AM to 9 AM this morning (10-26) RT was unavailable due to a configuration management issue. It has been resolved but please let us know if you have any issues.

  • Resolved: Ondemand failing when job is scheduled on a new acm node.

    RESOLVED 10/14/2022: OnDemand Desktop now works on the amd22 cluster.

  • Service availability issues 10/10

    At about 12:20 PM on October 10th, a bad git merge for our configuration management software caused old configurations to get pushed out to all nodes, which broke a number of services (including the contact forms and job submission on some nodes). This was reverted by 1:08 PM, but due to caching some nodes may have received this configuration through 2 PM. All nodes and services should be back to normal functionality by 3 PM on October 10th.

  • Resolved: Request Tracker and Contact Forms outage on 10/11

    Update 10/11 8 AM: Maintenance on RT has completed. Please let us know if you have any issues.

  • HPCC Scratch filesystem issues - Resolved

    The HPCC scratch filesystem is currently experiencing an issue. Users may have seen issues as early as 7:30 AM this morning. We are working to identify the cause and correct the issue and will post updates here as they become available.

  • Password logins to the rsync gateway will be disabled on 10/12/22

    UPDATE (10/14): This has been implemented. Users using sshfs on Windows should contact the ICER help desk for help using public key authentication with rsync.hpcc.msu.edu.
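    For users who have not set up public key authentication before, the general workflow looks like the following. A rough sketch using standard OpenSSH tools (hostname from the post; ssh-copy-id needs some working login path, otherwise append the public key to ~/.ssh/authorized_keys on the HPCC through another gateway):

    ```shell
    # Generate an Ed25519 key pair if you do not already have one
    ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

    # Install the public key on the gateway (or append ~/.ssh/id_ed25519.pub
    # to ~/.ssh/authorized_keys on the HPCC yourself)
    ssh-copy-id -i ~/.ssh/id_ed25519.pub rsync.hpcc.msu.edu
    ```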

  • New Scratch gs21 availability and gs18/ls15 retirement - UPDATED

    We are excited to announce the general release of our new gs21 scratch system, now available at /mnt/gs21/scratch on all user systems, including gateways, development nodes, and the compute cluster. The new scratch system provides 3 PB of space for researchers and allows us to continue to maintain 50 TB quotas for our growing community. The new system also includes 200 TB of high-speed flash. You may begin to utilize the new scratch system immediately. Please read on for more information about the transition to this space.

  • File Transfer Service Network Migration - Resolved

    UPDATE: The rsync service (rsync.hpcc.msu.edu) is available (8-31). A reminder that the rsync service node should only be used for file transfers.

  • Brief Scheduler Outage at 8:00PM 8/18/22 - UPDATED

    On Thursday, August 18th, at 8:00PM, there will be a brief interruption in scheduling as we push an update to our SLURM configuration. We expect this outage to last roughly 30 minutes. During this outage, SLURM client commands will be unavailable (e.g. srun/salloc/sbatch). Running jobs should not be affected.

  • Boosts to Job Priority Being Offered to Users Affected by Scheduler Issue

    Many running jobs were cancelled due to unforeseen complications with yesterday’s SLURM configuration update. We are reaching out to affected users and offering boosts to job priority to make up for any lost productivity.

  • Brief Scheduler Outage at 8:00PM 8/3/22 - UPDATED

    On Wednesday, August 3rd, at 8:00PM, there will be a brief interruption in scheduling as we push an update to our SLURM configuration. We expect this outage to last roughly 30 minutes. During this outage, SLURM client commands will be unavailable (e.g. srun/salloc/sbatch). Running jobs should not be affected.

  • Firewall Maintenance on August 9th

    On Tuesday, August 9th, MSU ITS will be upgrading the ICER firewall between 10 PM and 2 AM. This should not impact any running jobs or access to the HPCC. Users may experience intermittent, minor delays during interactive use.

  • Minor SLURM Update on 7/28/22

    On Thursday, July 28th, we will be deploying a minor update to the SLURM scheduling software. This update contains minor bug fixes and should not impact HPCC users. If you have any questions about this update or you experience issues following this update, please contact us at https://contact.icer.msu.edu/.

  • HPCC performance issues - resolved

    A performance issue was identified this morning with the home directory servers that caused ~30 second delays for access to files or directories. We identified a set of nodes that were causing the problem and restarted services as needed to resolve the issue at 12:30 pm 7/18/22.

  • HPCC offline - resolved

    The HPCC is currently down due to a hardware failure and a failed failover. We are currently working with NetApp to resolve the issue. Users may have seen issues as soon as 2 PM, and the system has been fully down since about 3:30 PM.

  • Welcome to the new ICER Announcements Blog!

    Hi! Welcome to the new ICER Announcements Blog. We have a new user documentation site at https://docs.icer.msu.edu. Please contact us if you have any questions.

subscribe via RSS