Over the past two weeks we have experienced Oakley login node crashes, potentially caused by a Lustre bug.

Lustre is still offline; HPC systems are back up

Resolution: Resolved

Day One of the scheduled downtime has been completed, and HPC operations have resumed. As planned, Lustre work will extend into Day Two. Jobs using /fs/lustre or $PFSDIR cannot run until this work is completed, but all other jobs can run.

UPDATE: Performance problems with Lustre have prevented us from bringing up the filesystem. We are working on a resolution.

UPDATE: Lustre returned to service on the afternoon of July 12, 2014.
