Recorded on January 13, 2021.
- Kate Cahill
[Slide: "An introduction to OSC services, hardware, and environment"] So we're going to do an introduction to OSC services, hardware, and environment today. My name is Kate Cahill. As I said, I'm the education and training specialist at OSC, and I've been doing this for several years. My background is chemistry. So I was a computational chemist and I was working as a postdoc at Ohio State. And then I started working at OSC.
[Slide: "Outline"] So today, we're going to cover just a brief overview of what OSC is and some high performance computing concepts for new users to keep in mind. We'll go over the hardware that OSC offers to our users, how to get a new… how to get started as a new user or a new PI with a new project, the user environment on our clusters, some information about the software that's available on OSC systems, a brief introduction to batch processing – talking about how to submit jobs to our batch system – and then, like I said, we'll switch over to a browser and look at our OnDemand portal.
[Slide: "What is the Ohio Supercomputer Center?"] So what is the Ohio Supercomputer Center?
[Slide: "About OSC"] Well, OSC has been around since 1987, and we're part of the Ohio Department of Higher Education, so we're actually a state agency and we're a statewide high performance computing and computational science resource for all universities in Ohio. So we provide the high performance computing resources as well as expertise in doing different kinds of computational science for all higher ed institutions, and also commercial entities in Ohio as well can use us.
[Slide: "Service Catalog"] And so here are some of the services that OSC provides. You probably are aware of our clusters, so our main service is cluster computing, but we also provide different types of research data storage for different data needs, and education and training services – like workshops and support for educa…, you know, using HPC in education in different ways. We have a Web Software Development team that works to make our portal more efficient and more effective for our users, and our Scientific Software Development team works on keeping the software on our clusters up-to-date and effective and also supports users doing different things with the software that we have.
[Slide: "Client Services"] So you can see kind of what we've been up to in the past year: how many universities have been active with us. We have 47 companies as well that do computation using our resources. We also have universities outside of Ohio that contract to use our resources. We had over four thousand [4,419] clients, 64 courses that used HPC resources through us, several hundred new projects and active projects over the past year, as well as several training opportunities [462 trainees]. And then we also track publications, so we had almost two hundred publications that cited (that used) OSC as part of their research.
[Slide: "Fields of Study"] And you can see here [a pie graph], this is just sort of a breakdown of the areas of science and technology that use OSC resources. It's based on the information that we have – we may not have this for all of our jobs – but for the jobs where we do know this information, you can see that the large majority of our jobs are in natural sciences, with chemistry and biology being the large ones, and then engineering and technology of different types. But we have jobs that run for many different domains, so this is just a snapshot.
[Slide: "OSC Classroom Usage"] And then here, some details [a bar chart] about OSC classroom usage, so this is number of classes on the left by institution that have used OSC in the past year [19 universities overall in 2020], and then on the right this is the number of students per institution [7,600 students in 2020]. So you can see we have a lot of different types of courses at different size institutions all around Ohio.
[Slide: "HPC Concepts"] And so now I'm going to switch over to talk about the high performance computing concepts that can be important to sort of cover quickly, particularly for people who have never used high performance computing resources before. It's just good to have these concepts kind of clear at the beginning.
[Slide: "Why Use HPC?"] And so there are many different reasons why people need to use high performance computing. It serves a lot of different types of calculations. And certainly in recent years a lot of new types of calculations have been developed. So data-intensive or machine-learning-type applications are new and growing areas. But from the user perspective, you could come to use HPC resources because the job that you are running on your laptop or your PC is just taking too long. So you may have an analysis that takes several days or a week or more to run on your personal computer, and you want to see if a high performance computing resource will speed that process up for you. And so there are a lot of different types of hardware you can take advantage of to change how long your calculations might take. Or the calculations you may want to… you can run one on your PC, but you may want to run one hundred of these calculations, so you just need more processing ability. And then there's a question of data size. So if the data that you are working with or generating is too large to be stored or accessed effectively on a personal-computer-size device, you need more memory or storage to work with that. And so those resources are also part of what HPC can do. And for some people, it's that there's a specific software package that they can have access to or that will run effectively on an HPC system, one that we can support or that we can help you work with if we don't support it ourselves.
[Slide: "What is the difference between your laptop and a supercomputer?"] And it's important to kind of just recognize the difference between using an HPC resource and using a laptop. So just from a basic perspective, your laptop has a certain number of cores. It has a certain amount of memory and storage. A supercomputer will have ten thousand times or more that processing and data analysis ability. Another feature of a supercomputer is that it's a remote system. So you're not going to connect with it directly. You're not going to sit at a terminal and just type right into the supercomputer. You have to connect with it over a network. And so that changes how you interact with it. You need to learn the tools of how to connect remotely, and the type of network you use can change your experience. So if you have a slower network for some reason, because you're away from your office or traveling, that can change how you can interact with the system in some ways – it depends. And the other point to make is that this is a shared resource, so at any given time you'll have hundreds of users all accessing the system, and so that means that we have to use the system in a certain way so that everybody can share it effectively and that there's not any kind of slowdown or lag because you are not able to… because somebody is using the system improperly. So we'll talk about some of the things you have to keep in mind to be a good citizen of HPC.
[Slide: "HPC Terminology"] And then some terminology that's important to keep in mind also, it's just nice to have this clear. So we talk about a "compute node" on a cluster, and that is about equivalent to a high-end personal computer or desktop computer. So that would have a certain number of cores in it and a certain amount of memory and storage. That's kind of a unit of a supercomputer. The "cluster" is all of those nodes networked together. And so that's why we call it a "cluster", because it's a group of nodes. And that's also what we call a "supercomputer". And an "individual core": that's the CPU unit. So most computers these days have multiple cores, so you can talk about "quad core" or something like that to refer to a four-core system. And you'll see that in a supercomputer there are many, many cores in each node. And you do have to know the specifics of that so you can request your job correctly. And then the other term it's useful to be aware of is the GPU, which stands for "graphics processing unit". And that's another type of parallel processing unit that's available on supercomputers. It's just a separate multicore processor. And the type of calculation you want to do or the software that you want to access may affect whether GPUs are going to be useful to you, but if you can use them, they do provide a great deal of speedup to parallel calculations.
[Slide: "Memory"] And then just to talk very briefly about memory. The definition of memory that we're using is that it is a space to hold data that is being calculated on, so actively being used, as well as the instructions that you're giving to the machine to do your calculation. So it's just a place to kind of manage data that's being actively accessed for your calculation. And there are different types of memory that you have to be aware of when you're working with a supercomputer. So on a single node, if you think of a single node as like a single PC, you have your processor and your memory and your storage all together. So we consider that shared memory because it's shared across all the processors. Once you start working with multiple nodes, then you have memory that is distributed across those nodes, so the memory that's on Node 1 is separate from the memory that's on Node 2, and so you have to be aware of where your data is being kept. Is it accessible to all the CPUs that are going to do your calculations? These are just some considerations. As you work with the system and develop your jobs, you'll see how to take advantage of that and how to make sure that you're accounting for that in your computational instructions. And each core of a node would have an associated amount of memory. And so that can vary with different types of hardware. So you'll see we have some standard nodes and then we have some large memory nodes that'll provide just greater memory capacity for jobs that require that.
[Slide: "Storage"] Storage is – so we have memory, and storage is the other option – storage is storing data for longer-term usage. So it's data that's not necessarily being actively used in a calculation. And there are different types of storage for different needs. And so OSC has several different types of storage, from storage on the compute node that you access during your job, to scratch storage, to longer-term project storage that's available when needed, to archive storage. And so all those have different features that are optimized for those different uses. And as a user, you'll have access to most of those.
[Slide: "Structure of a Supercomputer"] So here is just a sort of quick overview [diagram of the parts of the supercomputer system] of what a supercomputer – how all the components of a supercomputer…. So as a user, you're going to access this system remotely. So you're going to use some tool to do that, either a terminal window or web portal, and you're going to log into the system. And when you do that, you're going to access these specialized nodes called "login nodes". And so these are not the main part of the system. These are just where you access the system directly to manage your files or to submit jobs and read your output and things like that. And these are shared nodes. This is where you'd have to be aware of any calculations you try and run, any memory you try and use. It can affect other users if you try and use too much of the resources on the login nodes. And then the main part of the cluster is the compute nodes, and so you can see they're represented here as these small boxes that are all linked together. So that is individual nodes making up a cluster. And there are different types of compute nodes. That's why you'll see that some of them are larger, or different colors would indicate different features. And I'll show you the specifics of that when we talk about our specific hardware. And so those are all networked together. And then we have our different data storage areas, and those are all accessible through the entire system. So it's just a general overview of a supercomputer, nonspecific.
[Slide: "Hardware Overview"] So, and now I can move on to what we have at OSC available specifically for our current clusters, but I'll just stop for a minute and see if anybody has any questions so far. So I see a question, "Has OSC been involved in COVID-19 research?" We certainly have. We've been a host to some COVID-19 data platforms so that people can store data and present information about COVID-19 data to the community. We've also supported research efforts by different research groups around the state. So we've had a separate effort to kind of support researchers who are doing COVID-19 research, so that we can give them specific attention. And you see that Wilbur shared a link [not displayed] to some of the activities that we've been doing to support people doing research on COVID-19.
So I see another question about signing in through the terminal on the computer using SSH. So it depends on what your credentials are, I guess, and what – I mean, you're probably using just the terminal app, so that probably would work OK. If you can post your question, Rohan, again, just to everyone or to Wilbur directly, he'll see if he can troubleshoot that for you, because it should work: SSH is one of the main ways to access the supercomputers.
[Slide: "Owens Compute Nodes"] So now I'm going to go on to our clusters and the details of our hardware, so I'm going to start – we have two clusters right now, Owens and Pitzer – so I'm going to start talking about Owens. So we've had Owens longer, and Owens has 648 standard nodes, and each of those nodes has 28 cores, and there's 128 GB of memory on each of those nodes. So as a user you can take advantage of a single node or you could request multiple nodes, but you'll have access to 28 cores per node and 128 GB of memory.
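The node figures above lend themselves to some quick arithmetic when sizing a job. The following is a minimal sketch in Python using only the numbers quoted in the talk; the function names are hypothetical, and splitting a node's memory evenly across its cores is an assumption made for illustration, not a statement of how the scheduler assigns memory.

```python
import math

# Figures quoted in the talk for Owens standard compute nodes.
OWENS_STANDARD_NODES = 648
CORES_PER_NODE = 28
MEMORY_PER_NODE_GB = 128


def nodes_needed(total_cores: int, cores_per_node: int = CORES_PER_NODE) -> int:
    """Smallest number of whole nodes that supplies total_cores cores."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    return math.ceil(total_cores / cores_per_node)


def memory_per_core_gb(memory_gb: float = MEMORY_PER_NODE_GB,
                       cores_per_node: int = CORES_PER_NODE) -> float:
    """Even share of a node's memory per core (an illustrative assumption)."""
    return memory_gb / cores_per_node


# Total cores across all standard Owens nodes: 648 * 28.
total_standard_cores = OWENS_STANDARD_NODES * CORES_PER_NODE
```

For example, a job that wants 100 cores would need `nodes_needed(100)` whole nodes, and `memory_per_core_gb()` gives a rough idea of how much memory each core can use before a large memory node becomes the better fit.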
[Slide: "Owens Data Analytics Nodes"] And then Owens also has data analytics nodes – so these are large memory nodes – there are 16 total available on Owens, and each of those nodes has 48 cores per node, just so you have a much greater parallelism per node. And then the memory available on each node is 1.5 TB. So these are made specifically for jobs that require more memory than is available on our standard compute nodes.
[Slide: "Owens GPU Nodes"] And Owens also has GPU nodes, so I mentioned the graphics processing units, so there are 160 standard compute nodes that also have a GPU available. And so that is the same 28 cores per node, 128 GB of memory, and then you also have access to a GPU on each node.
[Slide: "Owens Cluster Specifications"] And here's all that information put together, and you can find this schematic and the technical details of the Owens hardware on our website. I put the link here, but you can just go under "Services" > "Cluster Computing", and then click "Owens", and you can see all these details. And as I mentioned, I will put these slides up on our event page so you can access them and get all these links that are in here.
[Slide: "Pitzer Cluster Specifications"] And so our other cluster is Pitzer, and so Pitzer is a newer cluster and it's also just had an expansion. So this first view of Pitzer is the original Pitzer, just so it's a little less difficult to see. So Pitzer similarly has standard compute nodes. There are 224 standard nodes that have a total of 40 cores per node, and each node has 192 GB of memory. There are also GPU nodes on the original Pitzer, and those are slightly different GPUs, but for a similar purpose. So those are standard compute nodes, again, with the GPU available. And then for huge memory nodes – and these are set up differently – there are 80 cores per node and there's 3 TB of memory. So we found that people using Owens actually needed more memory on the large memory nodes, so we expanded to these nodes. Now, Pitzer has had a very recent expansion. So now you can see it's sort of like two clusters in one. So we have our original setup with the compute nodes, GPU nodes, and huge memory nodes. And now we've added another set of compute nodes, large memory nodes, and GPU nodes. And so these would be newer hardware, with a slightly different setup. You can see – so the expansion's on the bottom side – that we have 340 standard nodes, and each of those nodes has 48 cores per node, and the GPU nodes are the same standard node plus now two GPUs per node. And so there are other features on the new Pitzer cluster. So we have a lot of new software, I mean new hardware, available with the Pitzer expansion.
[Slide: "Login Nodes"] And just to remind you, as I mentioned, when you log into the system you're working on the login nodes – you're not accessing the compute nodes directly there – and so on the login nodes, you have to be aware of not interrupting anybody else's work. So the login nodes are mainly for file editing, and managing your files, and setting up your files and your jobs to submit to the batch system so that you can access the compute nodes. There are hard limits on activity on the login nodes: 20 minutes of CPU time. So if you start a process of any kind, like compiling a small file or something, if it runs for 20 minutes, the system will stop it. And you only have access to 1 GB of memory. So you really can't do large computing there. And if you do try and do that, it'll slow down other people's ability to access the system and, you know, the system will cut you off. So just try to avoid doing any serious computing on the login nodes – it's really just for setting up.
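As a rough illustration of those login-node limits, here is a small Python sketch that compares a task's expected CPU time and memory against the two figures from the talk. The function names are hypothetical, treating 1 GB as 2**30 bytes is an assumption, and the real enforcement is done by the system itself, not by anything like this script.

```python
from typing import Tuple

# Limits quoted in the talk for login nodes.
LOGIN_CPU_LIMIT_SECONDS = 20 * 60        # 20 minutes of CPU time
LOGIN_MEMORY_LIMIT_BYTES = 1 * 1024**3   # 1 GB, taken here as 2**30 bytes (assumption)


def fits_on_login_node(cpu_seconds: float, memory_bytes: int) -> bool:
    """True if an estimated task stays under both login-node limits."""
    return (cpu_seconds < LOGIN_CPU_LIMIT_SECONDS
            and memory_bytes < LOGIN_MEMORY_LIMIT_BYTES)


def current_process_usage() -> Tuple[float, int]:
    """CPU seconds and peak resident memory (bytes) of this Python process.

    Uses the standard-library `resource` module, which is Unix-only.
    """
    import resource
    import sys

    usage = resource.getrusage(resource.RUSAGE_SELF)
    cpu_seconds = usage.ru_utime + usage.ru_stime
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return cpu_seconds, usage.ru_maxrss * scale
```

A quick edit or a small compile passes this check easily; anything estimated to run past 20 CPU-minutes or to need more than about 1 GB belongs in a batch job on the compute nodes.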
[Slide: "Data Storage Systems"] So now I'll just go over the data storage systems that we have. So, as I mentioned, we have several different file systems. I'm going to focus on the four that you'll probably interact with the most. So we have a home directory. And so as a user on our system, everybody has a home directory storage space. And this is the main place for you to store your files. It is backed up daily. Generally, as a general research user, you would have 500 GB of storage space in your home directory, and you can use tilde plus your username (~username) as a reference if you are familiar with Linux paths. There's also the project directory. And so this is a space where a whole project, so the PI and all the users, can share an extra storage space together. That's the project storage space, and that's available by request, so if your project finds that the individual home directory space isn't sufficient for your work, you can request project storage space as well. And that will have a path that references your project number, so you will have to be aware of what your project number is, and I'll talk about what that means. And then the scratch storage space. This is temporary storage. It's not backed up. And this is available to all users, but it's expected that you'll only use it for the duration of the active jobs that you're running with whatever you're storing on the scratch system. So you're not leaving it there indefinitely. If you need it for a few weeks, that's what it's there for. It's optimized for very large files. So if you need to do calculations on large files without shifting them to the compute nodes, you can use scratch for that. But this is, as I said, not backed up. And there is a purge on scratch to keep it from getting too full. So after 90 or 120 days, any files that are unused but still there will get removed. And then the fourth file system we refer to is the temp directory. And so the reference there is the variable that you can use to reference it, and that's the storage that's on the compute nodes. So when you're running a job, you have access to a compute node or multiple compute nodes. You have access to the storage on that compute node, but only during your job. So you can transfer files there and then you can read and write all on the compute node. And that's the recommended way to use those so that everything is contained, and you're not relying on the network too much during your job. So you can read and write to the temp directory on the compute node. And then when your job ends, you have everything copied back to your home directory so that you have your results, because as soon as your job ends that storage is no longer available to you and it'll be wiped. So that's only available for that window when your job is running. And again, there's a link here where you can see more detail about the available file systems. And again, I'll be posting these slides on the event page so you all can have access to it. And so this is just the detail of those four different file systems with the quota for each of them. As you can see, home and project directories are backed up and not purged. Scratch is not backed up and there is a purge for that. And then compute is only available while your job is running.
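The copy-in, compute, copy-back pattern described for the compute-node temp directory can be sketched in a few lines of Python. This is only an illustration: the environment variable name `TMPDIR` is an assumption (the talk says a variable exists but does not name it), the function name is hypothetical, and the "calculation" is a placeholder that counts lines.

```python
import os
import shutil
import tempfile
from pathlib import Path


def run_in_job_scratch(input_file: Path, results_dir: Path) -> Path:
    """Copy input to node-local storage, work there, copy the result back."""
    # Assumption: the job-local directory is exposed through TMPDIR;
    # fall back to the system temp directory when it is not set.
    base = os.environ.get("TMPDIR") or tempfile.gettempdir()
    work_dir = Path(tempfile.mkdtemp(prefix="job_", dir=base))
    try:
        # Stage in: copy the input onto the node-local storage.
        local_input = work_dir / input_file.name
        shutil.copy2(input_file, local_input)

        # Placeholder "calculation": count the lines of the input file.
        with local_input.open() as handle:
            line_count = sum(1 for _ in handle)
        local_output = work_dir / "result.txt"
        local_output.write_text(f"{line_count}\n")

        # Stage out: copy results back before the job ends.
        results_dir.mkdir(parents=True, exist_ok=True)
        final_output = results_dir / local_output.name
        shutil.copy2(local_output, final_output)
        return final_output
    finally:
        # Node-local data does not survive the job, so clean up explicitly.
        shutil.rmtree(work_dir, ignore_errors=True)
```

The point of the design is that all reads and writes during the calculation hit local disk, and the network file systems are touched only twice: once to stage in and once to stage out.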
[Slide: "Getting Started at OSC"] So that was kind of all the details about our hardware. Now I'm going to talk about getting started at OSC with a new project or a new account and what you need to know as a new user. So let me know if you have any questions so far.
[Slide: "Who can get an OSC project?"] So academic projects are available at OSC to PIs (principal investigators) who are full-time faculty members or research scientists at an Ohio academic institution. Generally, that's the main person who runs the project. And then once a PI has a project, they can authorize whomever they like to work on that project with them. So they can invite students, postdocs, collaborators from other institutions, other research scientists to come and be on your project. But the PI is the manager of it. We also have classroom projects, as I mentioned. If you're teaching a course and you want to have access to OSC resources, those can be separate projects from your research project. And then commercial organizations can also have projects. And that's through purchasing time through our sales group.
[Slide: "Accounts and Projects at OSC"] And so just to be clear about how OSC uses the terms "project" and "account": a "project", as I said, is headed by a principal investigator and can include whatever other users that principal investigator chooses. And the project level is the tool we use for managing resources. So a project will have some amount of resources available to use, and all the users kind of charge against those resources, and access to the project directory and to software is usually given at a project level. An "account" is related to a single user. So this is where you have your username and password to access the HPC systems. And this is connected to a single email. So we ask that you don't share accounts, that each person has a separate account, just so that we can always communicate with whoever is using that account, in case we see some activity that we need to question or we have a suggestion for optimizing your jobs. We always can communicate with the person using the account. And you might be on multiple projects, but you'll have one account that can access all of those projects.
[Slide: "Usage Charges"] And I see a question about classroom usage: "If teaching a university class that has a large number of students, what happens if the initial resources are used up before the semester is finished?" And, yeah, that happens. So we do start with an initial project allocation for a classroom project at $500. And so that's for whatever compute and storage a class would need. But yeah, I mean, if you have a large course or you're expecting to have a lot of calculations, and you think that might not be enough, we can always provide more during the semester. So that's fully subsidized. And so the way we manage usage, we charge for core-hours, GPU-hours, and terabyte-months for storage, and so your project has an initial dollar balance and then whatever services you're using get charged against that. And we have some information on our website about what that looks like – these prices are still subsidized by the state, so they're definitely cheaper than a commercial option – so we're still trying to be kind of a more reasonable option than just going out and purchasing compute resources in the cloud or something like that.
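To make the charging model concrete, here is a minimal Python sketch of usage being charged against a project's dollar balance. The three rates are hypothetical placeholders chosen only so the arithmetic is easy to follow; they are not OSC's published prices, and the class and function names are likewise invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical rates for illustration only; NOT OSC's published prices.
RATE_PER_CORE_HOUR = 0.01
RATE_PER_GPU_HOUR = 0.10
RATE_PER_TB_MONTH = 2.00


@dataclass
class Usage:
    """Usage in the three units the talk says are charged."""
    core_hours: float = 0.0
    gpu_hours: float = 0.0
    tb_months: float = 0.0


def charge(usage: Usage) -> float:
    """Dollar cost of the given usage under the hypothetical rates."""
    return (usage.core_hours * RATE_PER_CORE_HOUR
            + usage.gpu_hours * RATE_PER_GPU_HOUR
            + usage.tb_months * RATE_PER_TB_MONTH)


def remaining_balance(initial_balance: float, usage: Usage) -> float:
    """Balance left after charging usage against the project's dollars."""
    return initial_balance - charge(usage)
```

With these made-up rates, a classroom project starting at the $500 mentioned above that used 10,000 core-hours, 100 GPU-hours, and 5 terabyte-months would be charged 100 + 10 + 10 = 120 dollars.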
[Slide: "Ohio Academic Projects"] And so for Ohio academic projects, standard projects receive a $1000 grant annually to cover OSC services. So as a PI, you can request and receive every year $1000 of OSC services and use that for whatever works for you. So that can be for compute, for storage, for whatever services we charge for. And you can make sure that you have a budget set so your students don't accidentally run up a large bill, so that there's a limit on the usage on the project if you want to stay within that thousand or within whatever your budget is. And as I mentioned, classroom projects are fully subsidized. And this can all be managed at our client portal, which is separate from our – so we have set up several different websites – so osc.edu is our main website that has our documentation and information about our resources. my.osc.edu is our client portal, and that is where you can log in as a user and add people to your project, change your budget, request a classroom project, check on usage in your project and things like that, and change your shell, actually.
[Slide: "Client Portal – my.osc.edu"] Here's just a view of the dashboard of the client portal, so you can see what your active projects are on the bottom, and then you have usage by project and usage by system, just so you can keep an eye on things. If you're not an active user, but you're managing some students, you can see what's going on in your project using this snapshot. But the client portal is where you make all those requests about budget and things like that.
[Slide: "Statewide Users Group - SUG"] And we have a statewide users group, so this is open to anybody who uses our systems: you can come and give advice and help us develop our services over the years. We have two current committees, the Software and Activities Committee and the Hardware and Operations Committee, so you can attend and learn about our plans and make suggestions for what you think OSC should do in the future. We usually have two in-person meetings at OSC that usually include a research fair where we have posters and flash talks. We've been doing the SUG meeting virtually, so it's been kind of pared down for the past couple of sessions. We're having a virtual SUG in March where you can attend meetings and get an OSC update and things like that, all virtually. But hopefully once we're back to in-person events, we can do the poster sessions and everything again.
[Slide: "Citing OSC"] And then there is a little information on our website about citing OSC, so definitely if you're going to publish anything, we would love to have a citation so we can track those and see what people are doing.
[Slide: "User Environment"] And so now I'll move on to actually accessing the system as a user and working with the environment. So, I'll pause for a minute and give people a chance to ask questions.
[Slide: "Linux Operating System"] All right, so, our user environment: we use the Linux operating system, which is widely used in HPC. Linux is mostly accessed through the command line, and so that's a way to interact with our clusters. You can choose your shell. So Bash is the default shell, but if you have a preference for other types of Linux shells, you can change your shell, and you have to do that at the client portal, so that would be the my.osc.edu website. You can change your shell there. It's open source software. There are a lot of tutorials available online to kind of learn the basics of doing some simple commands at the command line for Linux. Also, on our website, we have some links to some recommended Linux tutorials. But I'm also going to talk to you about using our OnDemand portal, where you may not have to do so much with the command line.
[Slide: "Connecting to an OSC Cluster"] But so the main ways to connect to our clusters: the classic way is using SSH, which means Secure Shell. And so, if you have a terminal program on your PC already, you would open your terminal window and then type in "ssh" and then your user ID – this is your HPC account ID – and then "@owens.osc.edu". So, if you're connecting to Owens, then you would use Owens, or switch that to Pitzer, and so you just put that command on the command line and that would send a communication to the cluster, and then you'd have to put in your password and you can access the clusters that way. The other option is to use a web-based portal. So, we have our OnDemand portal that you can log into and access the clusters directly through a web browser, so you don't have to be as familiar with the terminal or download any particular software to do that. And if you are going to use any software that has a graphical user interface, then you would want to use an X11 forwarding setup. There's some information on our website if you need to do that. But when you log into the terminal, you'd need to change your command to "ssh -X" to turn on that forwarding so you can run your graphical interface.
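The shape of the SSH command described here can be sketched as a small Python helper that builds the argument list. The function name is hypothetical, and the host pattern `<cluster>.osc.edu` is an assumption drawn from the spoken description; `-X` is the standard OpenSSH flag for X11 forwarding.

```python
from typing import List

# Assumption: cluster login hosts follow the pattern <cluster>.osc.edu.
KNOWN_CLUSTERS = ("owens", "pitzer")


def ssh_command(user_id: str, cluster: str, x11: bool = False) -> List[str]:
    """Build the ssh argument list for logging in to a cluster.

    user_id is the HPC account ID; x11=True adds -X for graphical programs.
    """
    cluster = cluster.lower()
    if cluster not in KNOWN_CLUSTERS:
        raise ValueError(f"unknown cluster: {cluster!r}")
    command = ["ssh"]
    if x11:
        command.append("-X")  # enable X11 forwarding
    command.append(f"{user_id}@{cluster}.osc.edu")
    return command
```

Joining the list with spaces gives the line you would type in a terminal; passing the list to `subprocess.run` would start the session, after which SSH prompts for the password as described.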
[Slide: "OSC OnDemand"] So, our OnDemand portal has several features. As I said, it's web-based, so you're not working necessarily at a terminal. You don't have to download any other software because it's available in any web browser, and you just need to be able to use your username and password to log in. And then you have access to all of our resources, and this includes file management, submitting jobs, and looking at some graphical interfaces. So, we have some apps where you can run tools like MATLAB or Jupyter Notebooks, Abaqus, ANSYS, COMSOL. There is also terminal access, so you can access our clusters through a terminal through our portal as well. You can see some of the details about that on our main website under OnDemand. You can look at some of the features, but I'll go through them in a little bit.
[Slide: "Transferring Files to and from the Cluster"] So file transfers: there are several different methods you can use to manage your files and transfer files from your local PC over to the cluster. So, there are tools in Linux called SFTP and SCP that you can use to transfer from the command line, and those work on Linux and Mac. If you use a Windows machine, there's software called FileZilla that can be used as well. That'll do some file transfers for you, and you can do that at the general SSH login node. We also have a file transfer server, so if you're transferring something large and you want to do it this way, you would connect to the sftp.osc.edu location and transfer your files that way. On OnDemand, you can transfer smaller files, so up to five gigabytes you can drag and drop using our file management tool, and I'll show you what that looks like. And then we also have Globus, which is another web-based tool that can also handle large file transfers. So, that can be really useful for large files or for a group of files you want to transfer all at once. Globus is a nice way to do that, kind of in the background, and we have information about how you can set up that tool on our website as well, so that's what that link is at the bottom.
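The guidance on which transfer tool to reach for can be summarized as a small decision helper. This Python sketch only encodes the rules of thumb from the talk; the function name is hypothetical, and reading "five gigabytes" as 5 * 10**9 bytes is an assumption.

```python
from typing import List

# "Up to five gigabytes" from the talk, taken here as 5 * 10**9 bytes (assumption).
ONDEMAND_DRAG_DROP_LIMIT_BYTES = 5 * 10**9


def suggest_transfer_methods(total_bytes: int, many_files: bool = False) -> List[str]:
    """Suggest transfer tools for a given amount of data.

    Small, simple transfers suit the OnDemand file manager or scp/sftp;
    large transfers or big groups of files suit the transfer server or Globus.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must not be negative")
    if total_bytes <= ONDEMAND_DRAG_DROP_LIMIT_BYTES and not many_files:
        return ["OnDemand file manager (drag and drop)",
                "scp or sftp through a login node"]
    return ["sftp to the file transfer server (sftp.osc.edu)", "Globus"]
```

A 200 MB input file lands in the first group, while a 50 GB dataset or a directory of many files is pointed at the transfer server or Globus.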
[Slide: "Using and Running Software at OSC"] So now I'm going to talk more about our software at OSC, and I see a question about the versions of Git that are installed on our systems. So, our software team manages software installs, and generally they keep us updated with the latest version available. So I would say send an email to OSC Help and ask, if there's a newer version of Git that might be more useful to you. Also, with open source tools, you can install your own versions in your local directory, and so we'll go over some of that in this section.
[Slide: "Software Maintained by OSC"] So, there are over one hundred and forty-five software packages maintained at OSC. For any software that you want to use, definitely go to our main website, osc.edu, and in the top menus, under Resources, Available Software, you'll see a list of all of the available software. You can browse that list or you can just do a search for the software you're interested in. You definitely want to start with the software page of whatever software you want to use, because that will have the details about what versions are available on what system and how to get access to the software, if it has to be requested in some cases. It also gives you examples of how to use it on the different systems, so this can be really helpful for just getting started and getting comfortable using the software there.
[Slide: "Third party applications"] So, some of the general programming software that's available: we have different types of compilers and debuggers and profilers, so if you want to optimize your code, there are some tools to do that. MPI libraries, Java, Python, R. Python and R are very, very popular tools, and there are lots of different versions and packages, and you can also install your own packages locally for R and Python.
[Slide: "Third party applications"] Parallel programming software: MPI libraries of various kinds, OpenMP, CUDA for GPU programming, and OpenCL and OpenACC.
[Slide: "Access to Licensed Software"] And so software licensing obviously is a really complicated area, but generally we try to get our software licensed for general academic usage. Sometimes software requires a signed license agreement, and so that's why you have to check the software page at OSC just to make sure what the access is, because for some software you may just need to request access, or you may need to sign a license agreement before we can let you use it. And those details are all on each software page.
[Slide: "OSC doesn't have the software you need?"] And if we don't have the software you need, you can certainly request it. You can send us some information and say, I think this software is really useful and there's a whole bunch of people in my department who would use this if you supported it. And we can consider adding some software to our system. If the software you want to use is open source, you can install it yourself in your home directory, so you can install it for yourself or for your group and just manage it locally. The link at the bottom of the slide here is a how-to on installing it yourself. And if you have a license for software that we support that you want to use, we can help you use that license at OSC. So that's something you can ask. Just send an email to OSC Help and we can help you with that.
[Slide: "Loading Software Environment"] And so once you are on the cluster and you want to access some software, we use software modules to manage software. This is just a way that we can manage the software, and then you can have access to it and make your environment work for the tools you want to use, just using these commands. So, some of the main commands you want to use: when you first log in, you can do a module list command and that will show you what software modules you have loaded. There's always a set of default modules that get loaded for everybody. So, you can see how we have Intel and just some general modules to set up your environment initially, and you can always change those. And then if you want to search for a module, you can do a module spider with a keyword, or a module avail, and see what software modules are available for a certain software package you're interested in. There'll be different modules for different versions, and so when you want to add software to your environment, you use module load and then the name of that software module. If you're not specific about the version, there'll be a default version that'll get loaded. You can also remove a software package from your environment by doing module unload and then the name of that software package. And then you can change versions, and that command is module swap, where you swap out one version of software for a different version. You just have to be careful about putting in the right versions you're interested in.
[Slide: "Batch Processing"] And I will point out that you do have to do that in your batch script as well, so you want your job to have the right software environment too.
So, now I just want to give you a quick overview of what it looks like to submit jobs to the batch system. So we'll talk about batch processing. Any questions so far? I see there have been some questions in the chat and those are being answered. So that's good.
[Slide: "Why do supercomputers use queuing?"] So, talking about batch processing, we use batch processing or queuing in supercomputers because we have a lot of resources and we need to use them in the most efficient way possible. So, say we have one hundred users all wanting to run jobs on our cluster. We have to make sure that the available resources are used as efficiently as possible so that everybody gets their work done faster, even if individually we have to wait for the queue to start our job. And so the batch system has a scheduler and a resource manager. You submit your job to the queue and the scheduler keeps track of all the jobs that are submitted. Once the resource manager has the resources available, your job will move into actively using the compute nodes. And then once your job completes, you will have the results copied back to your home directory. So it's the most efficient way to run a cluster and make sure that everybody gets their work done in a timely manner. But that means that you have to make all of your requests to the system in a job script and put all the relevant details in there so that your job runs effectively and you get the results you need.
[Slide: "Steps for Running a Job on the Compute Nodes"] And so these are the steps that you need to go through to run a job on the compute nodes. You need to make a batch script or a job script, and I'll show you what that looks like and the information you have to put in it. Then you submit that to the queue. The job waits in the queue. When the resources become available, the job runs, and then once the job is completed, your results will be available, copied back to your home directory.
[Slide: "Specifying Resources in a Job Script"] And so the resource requests you have to make in a job script can include the number of compute nodes you need, the number of cores per node that you want to use, and GPUs if you want to use them. You can specify memory, but it's not required. Memory is allocated proportionally to your core request. So, on Pitzer, the standard nodes of the original Pitzer have 40 cores, so if you requested one node and 20 cores on Pitzer, you would get half the memory of that node. And so you can think of your memory request as being implicit in your core request. So, if you're going to run a job where you want to use maybe 10 cores, but you're going to need the full amount of memory on that node, you should request the whole node, so request all 40 cores so that you have access to all of that memory as well. The other thing you have to request is walltime. So, if you think your job will take an hour to run, you might want to request two hours, just because you want to overestimate, so that if your job doesn't quite finish in one hour, it doesn't get stopped, because walltime is a hard limit. You only want to overestimate slightly, though. So, for a smaller job you can overestimate a little more, but for a larger job, if you think your job's going to take twenty-four hours, you might want to request 30 or something like that, just to give it a little extra time, but not so much that your job waits longer in the queue because you over-requested your walltime.
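The proportional memory rule can be sketched in a few lines of Python. This is only an illustration: the 40 cores per node figure is from the talk, while the 192 GB node memory default and both function names are assumptions, so check the cluster documentation for real values.

```python
import math

# 40 cores per standard original Pitzer node is stated in the talk.
CORES_PER_NODE = 40
# ASSUMPTION: total node memory in GB; the talk does not give this number.
NODE_MEMORY_GB = 192.0


def proportional_memory_gb(cores_requested: int,
                           cores_per_node: int = CORES_PER_NODE,
                           node_memory_gb: float = NODE_MEMORY_GB) -> float:
    """Memory share implied by a core request on a single node."""
    if not 1 <= cores_requested <= cores_per_node:
        raise ValueError("cores_requested must be between 1 and cores_per_node")
    return node_memory_gb * cores_requested / cores_per_node


def cores_needed_for_memory(memory_gb: float,
                            cores_per_node: int = CORES_PER_NODE,
                            node_memory_gb: float = NODE_MEMORY_GB) -> int:
    """Smallest core request whose proportional share covers memory_gb."""
    if not 0 < memory_gb <= node_memory_gb:
        raise ValueError("memory_gb must fit on one node")
    return math.ceil(memory_gb * cores_per_node / node_memory_gb)
```

Under these assumptions, a 20-core request gets half the node's memory, and a job that needs the full node's memory has to request all 40 cores even if it only computes on 10.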
[Slide: "Specifying Resources in a Job Script"] So, a question in the chat is: if a job is submitted to a single node but requests less than the full number of cores, can a second job run on the same node and use the other cores, and how does the memory get shared there? So, yeah, this is something you have to consider when you're going to request a job. If you only ask for one node and half the processors on that node, another job could run on that node at the same time, and the memory allocation will be relative to the number of cores. But if a job is trying to use more memory than it's supposed to use, you can have some sort of negative interaction between those two jobs as far as what memory is available. So, that can be a problem. So, yeah, something else you can get a sense of as you run your jobs is how much memory is necessary to run your job efficiently, and whether you just want to request the full node so that you have access to all that memory. As long as your request is correct for the resources that you need, you shouldn't run into any problems, but it can take a little while to get there, and it's kind of a trial and error process. And so the next thing you need to include in your job script is your project number. So, this is the project code. It usually starts with a P and has a four-digit code at the end. You need to include that in your job script so that the job can be accounted for in your project. And then some software requires a specific license request as well, so that has to be in the job script too. And so we'll show you what a job script looks like.
[Slide: "Batch Changes at OSC"] It is important to note that OSC has just switched over from Torque/Moab as our scheduler and resource manager to SLURM as our scheduler and resource manager. We just made this change a couple of months ago, and before this, users of Torque/Moab would use PBS scripts, which are just PBS-based job scripts. So, the variables look different, but the activity is the same between the old batch system and the new batch system. It's just the terminology that has changed. We have a compatibility layer active, so if you are already running jobs on our system using PBS scripts, they generally will still work. But we expect that anybody who's just starting out now will use SLURM as their main way to submit jobs. So, that's what we're going to talk about as far as job scripts from now on, and for current users, your job scripts should still work or you can start switching over to SLURM. We do have some information sessions monthly. The next one is scheduled for January 27. That's on the event calendar, so you can always sign up for that to get more information about how to use SLURM at OSC. And then we have the link to general SLURM information that can give you the keywords and commands that you might want to know. But we'll go over some of that right now.
[Slide: "Sample SLURM Batch Script"] So here is a sample batch script, and this is using SLURM. In a batch script, at the top, you always have your resource requests. SLURM requires that you put your shell statement right at the top, so that first line is required, and then you have all your SBATCH requests. So, these are all the resource requests we've been talking about: number of nodes, and number of cores per node, which SLURM calls tasks per node. You have to give the job a name, and then you have the account line. Account is what SLURM calls a project, so that's where you put your project code. And that's just a general project code there. You want to replace that with your specific project code. And then the rest of the job script is all the commands you need to run to manage your files, to run your calculations, and then to get your results back at the end. So, you want to set up your software environment. As I said, you need to make sure that the software is accessible by your job, so you want to load the software modules that you need and then copy your input files over to the compute node's directory, so that's the temp directory, and then you want to run your job. Here we're doing an MPI compile and then running that code and then copying the results back to your working directory, so you have your results at the end of the job. So, this all becomes a text file and this is your job script, and you give this text file a name and that's your job script name.
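The slide itself is not reproduced in this transcript, so the following Python sketch only assembles a script with the same general shape as the one described. The `#SBATCH` options and the `$TMPDIR` and `$SLURM_SUBMIT_DIR` variables are standard SLURM or cluster conventions, but the function name, the project code `PAS1234`, the module name, and the file names are placeholders assumed for illustration.

```python
def make_job_script(job_name: str, account: str, nodes: int = 1,
                    tasks_per_node: int = 40,
                    walltime: str = "01:00:00") -> str:
    """Build the text of a SLURM batch script shaped like the one described."""
    lines = [
        "#!/bin/bash",                          # shell statement must be first
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={walltime}",
        f"#SBATCH --account={account}",         # project code goes here
        "",
        "module load intel",                    # ASSUMPTION: module name
        "cp hello.c $TMPDIR",                   # hypothetical input file
        "cd $TMPDIR",
        "mpicc -O2 hello.c -o hello",           # MPI compile step
        "srun ./hello > hello_results",         # run on the allocated tasks
        "cp hello_results $SLURM_SUBMIT_DIR",   # copy results back
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # PAS1234 is a placeholder; replace it with your own project code.
    print(make_job_script("hello", "PAS1234"))
```

Writing the returned text to a file gives a job script that can then be submitted to the queue.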
[Slide: "Submit & Manage Batch Jobs"] And so when you have your job script, you use the command sbatch and the name of that job script, and that submits it to the queue. Once you've submitted a job successfully to the queue, you'll see, as on the bottom of this slide, a line that says Submitted batch job and then a code. That's your job ID. So, that's how the queue is going to label your job. If you happen to make a mistake and you want to cancel your job, you need that job ID, and then you can use the command scancel and that code to cancel that specific job. And if in general you just want to see what your jobs are doing in the queue, you can use the command squeue with the flag -u and then your username, and that'll give you information about all your jobs that are in the queue. You can see their status, whether they're waiting, whether they're on hold, or whether they're running. And when they've completed, they'll no longer be in the queue, so you can see that they're gone. You can also put a job on hold if you want to, in case you submitted a bunch of jobs and one of them is dependent on another one finishing. You can put that on hold and then release it using the scontrol hold and release commands. And so that's how you submit jobs, manage them, and check on the queue.
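If you drive these commands from a script, the job ID can be pulled out of the submission message and reused. Here is a minimal Python sketch, assuming the message has the form "Submitted batch job" followed by a number; the function names and the example username are hypothetical, while `scancel`, `squeue -u`, and `scontrol hold` / `scontrol release` are the commands named in the talk.

```python
import re


def parse_job_id(sbatch_output: str) -> str:
    """Extract the job ID from the line sbatch prints on success."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError("no job ID found in sbatch output")
    return match.group(1)


def management_commands(job_id: str, username: str) -> dict:
    """Command lines (as argument lists) for managing a submitted job."""
    return {
        "cancel": ["scancel", job_id],
        "status": ["squeue", "-u", username],
        "hold": ["scontrol", "hold", job_id],
        "release": ["scontrol", "release", job_id],
    }


if __name__ == "__main__":
    job_id = parse_job_id("Submitted batch job 123456\n")
    # "myuser" is a placeholder username.
    for name, argv in management_commands(job_id, "myuser").items():
        print(name, " ".join(argv))
```

The argument lists are in the form `subprocess.run` accepts, so on a login node they could be executed directly rather than printed.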
[Slide: "Scheduling Policies and Limits"] We do have policies and limits on our scheduling, just to manage the size of jobs that can get submitted generally to our clusters. So, we have a walltime limit for single-node jobs, which we call serial jobs, and that's one hundred and sixty-eight hours. So, for a standard single-node job, you can have up to a week of walltime. For jobs that use more than one node, we have a ninety-six hour limit, and we call those parallel jobs. And then per user, there's a limit of one hundred and twenty-eight running jobs, or two thousand forty processor cores in use at once, and you can have a thousand jobs in the batch system altogether. So, if you happen to be submitting a lot of jobs at once, you can have one hundred and twenty-eight running and up to a thousand in the batch system in total, and the group limits are similar. And that's just so that no one user and no one group can submit enough jobs to take over the whole system. These are the general guidelines we use. These are limits, but if there happens to be a reason why you need to do a larger job or a longer job, you can always request that as well. We can accommodate different types of jobs. But these are just the standard limits that we have.
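The walltime limits just described are simple enough to check before submitting. The following Python sketch encodes only the two numbers from the talk, 168 hours for single-node jobs and 96 hours for multi-node jobs; the function name is hypothetical, and the limits may have changed since the recording.

```python
# Limits as stated in the talk (January 2021); treat them as a snapshot.
SERIAL_WALLTIME_LIMIT_HOURS = 168    # single-node ("serial") jobs
PARALLEL_WALLTIME_LIMIT_HOURS = 96   # multi-node ("parallel") jobs


def walltime_within_limit(nodes: int, walltime_hours: float) -> bool:
    """Return True if the requested walltime fits the standard limit."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if nodes == 1:
        limit = SERIAL_WALLTIME_LIMIT_HOURS
    else:
        limit = PARALLEL_WALLTIME_LIMIT_HOURS
    return walltime_hours <= limit
```

A request that fails this check is not necessarily impossible, since larger or longer jobs can be requested, but it would need that separate request first.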
[Slide: "Waiting for Your Job to Run"] And so when you submit your job, it's sometimes hard to know how long it will take to start. It's always a question of system load. Generally, our systems are pretty busy, but we try to keep an eye on how fast things are going through, so we hope you don't have to wait too long. But it's also about what resources you request. So, that's why you want to make sure that your walltime isn't unnecessarily long and your core request isn't larger than you need. It can be as large as it needs to be, but you want to try to make it as reasonable as possible so that your job doesn't wait longer to start. And then if you're requesting specific resources like memory or GPUs, that can change how long your job will wait. But yeah, it's sometimes hard to tell, and it may vary.
[Slide: "Interactive Batch Jobs"] Another type of job that we do support is interactive jobs. Those are still submitted through the batch system, so the same resource limits apply, and they're useful for debugging or just for testing something out before you submit a general batch job. Our system isn't optimized for interactive batch jobs, so you still have to wait for that job to start, and you have to be there when it starts to actually take advantage of the resources. So, it can be a little difficult. Most jobs start fairly soon, but sometimes you have to wait a little while. And at the bottom of the slide you can see an example of how to submit an interactive batch job to the batch system. You do have to make the same requests for a number of nodes, a number of cores, and walltime.
[Slide: "Batch Queues"] And it's good to know that we have the two different clusters and their batch queues are completely separate, so you do have to make sure that you know which cluster you're submitting to. And then there are also a few nodes in each system that are reserved for short jobs, for debugging and testing things out, so you can make a smaller request and use the debug nodes.
[Slide: "Parallel Computing"] And just in general, with parallel computing, if you can take advantage of multiple nodes, that's a good thing, just because you want to use the resources effectively and get your work done as efficiently as possible, but it can take some time to make that actually efficient. So, just keep in mind that there could be other considerations beyond just asking for more cores or more nodes. There are tools like multi-threading and MPI that can allow you to take advantage of multiple nodes and multiple cores, but it can vary. And so it's something you're going to have to figure out as you go along, and we can help with specific questions or you can read the documentation that we have.
[Slide: "To Take Advantage of Parallel Computing"] And so always check with the software that you're going to use to see how it takes advantage of parallel computing and make sure that you have the right input for your jobs to do those things effectively.
[Slide: "https://ondemand.osc.edu/"] And so that was the main part of the presentation.
[Slide: "Resources to Get Your Questions Answered"] I'll just say we have some more resources for other ways to check on questions that may come up for you. So, we have FAQs on our main site that have a lot of important information. Our how-tos are really important for doing certain activities like installing software, installing Python and R packages and things like that, and setting up Globus. So, those are all on our website under 'how to.' We do have some tutorial materials, if you're interested. If you want to get started using our system and follow a tutorial for submitting jobs, we have tutorial material available on GitHub. We also have office hours every other Tuesday that you can sign up for. You can get the link to those from our event page and sign up for a time to talk to us through our office hours. We also have a question forum tool called Ask.ci that we're a member of, and that's a larger cyberinfrastructure resource, but we have a section on there. That's another place you can have discussions and ask questions. If you want to get updated information about our system, we have a Twitter feed, HPCNotices, where we post any information about problems or downtimes. Also, when you log into the system, you'll see the message of the day, which will give you any new information about changes to our systems that we need our users to know.