Posts

  • HPCC Scheduled Downtime

    The HPCC will be unavailable on Wednesday, January 4th for our regularly scheduled maintenance. No jobs will run during this time. Jobs that cannot be completed before January 4th will not begin until after maintenance is complete. For example, if you submit a four-day job three days before the maintenance outage, your job will be postponed and will not begin to run until after maintenance is completed. If you have any questions, please contact us.
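    The four-day example above comes down to simple date arithmetic, sketched below in Python. This is only an illustration, not how the scheduler is implemented: the function name is hypothetical, the outage is assumed to begin at 00:00 on January 4, 2023, and the job is assumed to be able to start the moment it is submitted.

```python
from datetime import datetime, timedelta


def can_start_before_maintenance(now, walltime, maintenance_start):
    """Return True if a job starting at `now` with the requested
    walltime would finish no later than the start of maintenance."""
    return now + walltime <= maintenance_start


# Assumed values for illustration only: the outage is taken to begin
# at 00:00 on January 4, 2023, and the job is submitted three days earlier.
maintenance_start = datetime(2023, 1, 4)
submit_time = maintenance_start - timedelta(days=3)

four_day_job = timedelta(days=4)
two_day_job = timedelta(days=2)

print(can_start_before_maintenance(submit_time, four_day_job, maintenance_start))  # False
print(can_start_before_maintenance(submit_time, two_day_job, maintenance_start))   # True
```

    In this sketch, a job is held when its requested walltime would overlap the outage, so requesting a shorter walltime that fits before January 4th would let it start sooner.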

  • Winter Break Limited Coverage

    There will be limited coverage while MSU observes winter break from December 23, 2022 through January 2, 2023. The system will continue to run jobs and will be monitored for emergency issues. Tickets will be sorted by priority on January 3, when our team returns to work after the holiday break. If you have any questions, please contact us.

  • New Limits on Scavenger Queue

    We have implemented a new limit of 520 running jobs per user and 1000 submitted jobs per user in the scavenger queue. We have put this limit in place to ensure that the scheduler is able to evaluate all the jobs in the queue during its regular scheduling cycles. This matches our general queue limits. Please see our documentation for more information about our scheduler policy and scavenger queue. If you have any questions regarding this change, please contact us.

  • Resolved: Login issue - Stale file handle

    We are currently experiencing a login issue with our gateway nodes that report /mnt/home/<username>/.bash_profile: Stale file handle. We are working to resolve this issue.

  • Scheduler Outage on November 1st at 8PM

    On November 1st at 8PM the scheduler will be offline momentarily in order to add additional computing resources to the machine that hosts the scheduling software. If you have any questions or concerns regarding this outage, please contact us.

  • Resolved: Request Tracker rt.hpcc.msu.edu outage.

    From about 4 AM to 9 AM this morning (10-26) RT was unavailable due to a configuration management issue. It has been resolved but please let us know if you have any issues.

  • Resolved: OnDemand failing when job is scheduled on a new acm node.

    RESOLVED 10/14/2022: OnDemand Desktop now works on the amd22 cluster.

  • Service availability issues 10/10

    At about 12:20 PM on October 10th, a bad git merge for our configuration management software caused old configurations to get pushed out to all nodes, which broke a number of services (including the contact forms and job submission on some nodes). This was reverted by 1:08 PM, but due to caching some nodes may have received this configuration through 2 PM. All nodes and services should be back to normal functionality by 3 PM on October 10th.

  • Resolved: Request Tracker and Contact Forms outage on 10/11

    Update 10/11 8 AM: Maintenance on RT has completed. Please let us know if you have any issues.

  • HPCC Scratch filesystem issues - Resolved

    The HPCC scratch filesystem is currently experiencing an issue. Users may have seen issues as early as 7:30 AM this morning. We are working to identify the cause and correct the issue and will post updates here as they become available.

  • Password logins to the rsync gateway will be disabled on 10/12/22

    UPDATE: 10/14: This has been implemented. Users using sshfs on Windows should contact the ICER help desk for help using public key authentication with rsync.hpcc.msu.edu.

  • New Scratch gs21 availability and gs18/ls15 retirement - UPDATED

    We are excited to announce the general release of our new gs21 scratch system, now available at /mnt/gs21/scratch on all user systems, including gateways, development nodes, and the compute cluster. The new scratch system provides 3 PB of space for researchers and allows us to continue to maintain 50 TB quotas for our growing community. The new system also includes 200 TB of high-speed flash. You may begin to utilize the new scratch system immediately. Please read on for more information about the transition to this space.

  • File Transfer Service Network Migration - Resolved

    UPDATE: The rsync service (rsync.hpcc.msu.edu) is available (8-31). A reminder that the rsync service node should only be used for file transfers.

  • Brief Scheduler Outage at 8:00PM 8/18/22 - UPDATED

    On Thursday, August 18th, at 8:00PM, there will be a brief interruption in scheduling as we push an update to our SLURM configuration. We expect this outage to last roughly 30 minutes. During this outage, SLURM client commands (e.g. srun/salloc/sbatch) will be unavailable. Running jobs should not be affected.

  • Boosts to Job Priority Being Offered to Users Affected by Scheduler Issue

    Many running jobs were cancelled due to unforeseen complications with yesterday's SLURM configuration update. We are reaching out to affected users and offering boosts to job priority to make up for any lost productivity.

  • Brief Scheduler Outage at 8:00PM 8/3/22 - UPDATED

    On Wednesday, August 3rd, at 8:00PM, there will be a brief interruption in scheduling as we push an update to our SLURM configuration. We expect this outage to last roughly 30 minutes. During this outage, SLURM client commands (e.g. srun/salloc/sbatch) will be unavailable. Running jobs should not be affected.

  • Firewall Maintenance on August 9th

    On Tuesday, August 9th, MSU ITS will be upgrading the ICER firewall between 10 PM and 2 AM. This should not impact any running jobs or access to the HPCC. Users may experience intermittent, minor delays during interactive use.

  • Minor SLURM Update on 7/28/22

    On Wednesday, July 28th, we will be deploying a minor update to the SLURM scheduling software. This update contains minor bug fixes and should not impact HPCC users. If you have any questions about this update or you experience issues following this update, please contact us at https://contact.icer.msu.edu/.

  • HPCC performance issues - resolved

    A performance issue was identified this morning with the home directory servers that caused delays of about 30 seconds when accessing files or directories. We identified a set of nodes that were causing the problem and restarted services as needed, resolving the issue at 12:30 PM on 7/18/22.

  • HPCC offline - resolved

    The HPCC is currently down due to a hardware failure and a failed failover. We are currently working with NetApp to resolve the issue. Users may have seen issues as soon as 2 PM, and the system has been fully down since about 3:30 PM.

  • Welcome to the new ICER Announcements Blog!

    Hi! Welcome to the new ICER Announcements Blog. We have a new user documentation site at https://docs.icer.msu.edu. Please contact us if you have any questions.
