Posts

  • SLURM Node Updates on Thursday, March 30th

    On Thursday, March 30th, at 10:00AM, SLURM clients will be updated to the latest version. This update will bring the node and user components of SLURM to the same version as our SLURM controller and database. Most client commands (e.g. squeue, sbatch, sacct) should work seamlessly through this update. New jobs can be queued as normal and running jobs should not be affected. During these updates, nodes will appear as offline and no new jobs will start. Please note that pending srun/salloc commands may fail to start after this update is complete. If you have a job submitted through srun/salloc that fails after this update, please contact us. We can boost the priority of your job after resubmission.
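
    If you want to verify the state of your jobs around the update window, the standard SLURM client commands below can help (a minimal sketch; the job ID 12345678 and the script name myjob.sb are placeholders for your own job and submission script):

        # List your pending and running jobs
        squeue -u $USER

        # Review the final state of a specific job (placeholder job ID)
        sacct -j 12345678 --format=JobID,JobName,State,ExitCode

        # Resubmit a failed interactive job as a batch job (placeholder script name)
        sbatch myjob.sb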

  • MPI Performance Issues Following SLURM Controller Update - Updated

    UPDATE: We applied a patch from the software vendor that eliminates the performance issue.

  • Intel14 nodes to be removed from general queues - Updated

    UPDATE: Intel14 nodes have been removed from the general queues.

  • SLURM Scheduler Update at 5:00PM on 3/16/23 - Updated

    UPDATE: The scheduler is back online and functioning normally.

  • SLURM Database Outage at 10:00AM on 3/9/23 - UPDATED

    UPDATE: The database upgrade is complete. The sacct command will now function as expected.

  • Scratch purge of 45 day old files

    Starting on February 15th, files on /mnt/scratch (/mnt/gs21) that have not been modified within the last 45 days will be deleted. Due to technical issues, this purge has not been running and older files have not been regularly removed from scratch/gs21. This issue has been fixed and automatic deletion will resume on February 15th. Users should ensure that any data older than 45 days on scratch/gs21 that they wish to save has been moved to persistent storage (home/research spaces or external storage).
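
    As a rough starting point, the commands below show one way to find files that would fall under the purge and copy them to persistent storage (a minimal sketch; it assumes your files live under /mnt/gs21/scratch/$USER, and important_results and /mnt/home/$USER/keep are placeholder names):

        # List files on scratch that have not been modified in the last 45 days
        find /mnt/gs21/scratch/$USER -type f -mtime +45

        # Copy anything you want to keep to your home directory (placeholder paths)
        mkdir -p /mnt/home/$USER/keep
        cp -a /mnt/gs21/scratch/$USER/important_results /mnt/home/$USER/keep/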

  • HPCC Scheduled Downtime

    Update 1/5/2023: All updates were completed by 3 PM on 1/4/2023. Globus experienced problems and was brought back online on 1/5/2023. If you experience any problems, please contact us.

  • Resolved: Rsync gateway issues

    RESOLVED 12/22/22: The issue with the rsync gateway is resolved and file transfers are fully functional.

  • Resolved: Rsync gateway issues

    RESOLVED 12/13/22: The issue with the rsync gateway is resolved and file transfers are fully functional.

  • Winter Break Limited Coverage

    There will be limited coverage while MSU observes winter break from December 23, 2022 through January 2, 2023. The system will continue to run jobs and will be monitored for emergency issues. Tickets will be sorted by priority on January 3 when our team returns to work after the holiday break. If you have any questions, please contact us.

  • New Limits on Scavenger Queue

    We have implemented a new limit of 520 running jobs per user and 1000 submitted jobs per user in the scavenger queue. We have put this limit in place to ensure that the scheduler is able to evaluate all the jobs in the queue during its regular scheduling cycles. This matches our general queue limits. Please see our documentation for more information about our scheduler policy and scavenger queue. If you have any questions regarding this change, please contact us.
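
    To see how close you are to these limits, you can count your own jobs in the queue (a minimal sketch; it assumes the scavenger queue appears as a partition named scavenger):

        # Count your running jobs in the scavenger partition
        squeue -u $USER -p scavenger -t RUNNING -h | wc -l

        # Count all of your submitted (pending and running) scavenger jobs
        squeue -u $USER -p scavenger -h | wc -l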

  • Resolved: Login issue - Stale file handle

    We are currently experiencing a login issue with our gateway nodes that report /mnt/home/<username>/.bash_profile: Stale file handle. We are working to resolve this issue.

  • Scheduler Outage on November 1st at 8PM

    On November 1st at 8PM the scheduler will be offline momentarily in order to add additional computing resources to the machine that hosts the scheduling software. If you have any questions or concerns regarding this outage, please contact us.

  • Resolved: Request Tracker rt.hpcc.msu.edu outage.

    From about 4 AM to 9 AM this morning (10/26), RT was unavailable due to a configuration management issue. It has been resolved, but please let us know if you have any issues.

  • Resolved: OnDemand failing when job is scheduled on a new amd22 node.

    RESOLVED 10/14/2022: OnDemand Desktop now works on the amd22 cluster.

  • Service availability issues 10/10

    At about 12:20 PM on October 10th, a bad git merge for our configuration management software caused old configurations to get pushed out to all nodes, which broke a number of services (including the contact forms and job submission on some nodes). This was reverted by 1:08 PM, but due to caching, some nodes may have received this configuration through 2 PM. All nodes and services should be back to normal functionality by 3 PM on October 10th.

  • Resolved: Request Tracker and Contact Forms outage on 10/11

    Update 10/11 8 AM: Maintenance on RT has completed. Please let us know if you have any issues.

  • HPCC Scratch filesystem issues - Resolved

    The HPCC scratch filesystem is currently experiencing an issue. Users may have seen issues as early as 7:30 AM this morning. We are working to identify the cause and correct the issue and will post updates here as they become available.

  • Password logins to the rsync gateway will be disabled on 10/12/22

    UPDATE 10/14: This has been implemented. Users who use sshfs on Windows should contact the ICER help desk for help setting up public key authentication with rsync.hpcc.msu.edu.
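
    If you have not set up public key authentication before, the general workflow with standard OpenSSH tooling looks like the sketch below (a minimal sketch; <username> is a placeholder, and because password logins are disabled on rsync.hpcc.msu.edu, the public key must be added to ~/.ssh/authorized_keys in your HPCC home directory through a login method that still works for you):

        # Generate a key pair on your local machine (accept the defaults, optionally set a passphrase)
        ssh-keygen -t ed25519

        # Show the public key so it can be added to ~/.ssh/authorized_keys on the HPCC
        cat ~/.ssh/id_ed25519.pub

        # Once the key is in place, connect to the rsync gateway with key authentication
        ssh <username>@rsync.hpcc.msu.edu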

  • New Scratch gs21 availability and gs18/ls15 retirement - UPDATED

    We are excited to announce the general release of our new gs21 scratch system, now available at /mnt/gs21/scratch on all user systems, including gateways, development nodes, and the compute cluster. The new scratch system provides 3 PB of space for researchers and allows us to continue to maintain 50 TB quotas for our growing community. The new system also includes 200 TB of high-speed flash. You may begin to utilize the new scratch system immediately. Please read on for more information about the transition to this space.

  • File Transfer Service Network Migration - Resolved

    UPDATE 8/31: The rsync service (rsync.hpcc.msu.edu) is available. As a reminder, the rsync service node should only be used for file transfers.

  • Brief Scheduler Outage at 8:00PM 8/18/22 - UPDATED

    On Thursday, August 18th, at 8:00PM, there will be a brief interruption in scheduling as we push an update to our SLURM configuration. We expect this outage to last roughly 30 minutes. During this outage, SLURM client commands will be unavailable (e.g. srun/salloc/sbatch). Running jobs should not be affected.

  • Boosts to Job Priority Being Offered to Users Affected by Scheduler Issue

    Many running jobs were cancelled due to unforeseen complications with yesterday's SLURM configuration update. We are reaching out to affected users and offering boosts to job priority to make up for any lost productivity.

  • Brief Scheduler Outage at 8:00PM 8/3/22 - UPDATED

    On Wednesday, August 3rd, at 8:00PM, there will be a brief interruption in scheduling as we push an update to our SLURM configuration. We expect this outage to last roughly 30 minutes. During this outage, SLURM client commands will be unavailable (e.g. srun/salloc/sbatch). Running jobs should not be affected.

  • Firewall Maintenance on August 9th

    On Tuesday, August 9th, MSU ITS will be upgrading the ICER firewall between 10 PM and 2 AM. This should not impact any running jobs or access to the HPCC. Users may experience intermittent, minor delays during interactive use.

  • Minor SLURM Update on 7/28/22

    On Wednesday, July 28th, we will be deploying a minor update to the SLURM scheduling software. This update contains minor bug fixes and should not impact HPCC users. If you have any questions about this update or you experience issues following this update, please contact us at https://contact.icer.msu.edu/.

  • HPCC performance issues - resolved

    A performance issue was identified this morning with the home directory servers that caused roughly 30-second delays when accessing files or directories. We identified a set of nodes that were causing the problem and restarted services as needed to resolve the issue at 12:30 PM on 7/18/22.

  • HPCC offline - resolved

    The HPCC is currently down due to a hardware failure and a failed failover. We are currently working with NetApp to resolve the issue. Users may have seen issues as early as 2 PM, and the system has been fully down since about 3:30 PM.

  • Welcome to the new ICER Announcements Blog!

    Hi! Welcome to the new ICER Announcements Blog. We have a new user documentation site at https://docs.icer.msu.edu. Please contact us if you have any questions.
