| Hello all, I was curious if anyone out there has experience using SMT on Power5+. IBM's documentation is not at all conclusive about the specific benefits, the disadvantages, and the conditions under which each is obtained. More specifically, I would expect SMT ON to benefit workload mixes with high thread concurrency but low contention for common thread resources (L1/L2/L3 and functional units). For instance, I would expect a cluster running a large number of distributed (and multithreaded) processes to benefit from SMT ON: twice as many h/w threads can be active at a time, obviating the need for costly thread switches. Has anyone had experience that points in that direction? If this is not the case, could the conditions (and the reason, if known) be stated? Thanks Michael Thomadakis |
| |||
| On 2007-04-04, Michael E. Thomadakis <miket@sc.tamu.edu> wrote: > > More specifically, I would expect that SMT ON would benefit workload > mixes with high thread concurrency but with low contention on common > thread resources (L1/L2/L3 and functional units). For instance, I would > expect a cluster with a large number of distributed (and multithreaded) > processes running to benefit from SMT ON: twice as many h/w threads can > be active at a time, obviating the need for costly thread-switch. I did some benchmarking on a P5+ cluster last fall, and was quite surprised that SMT=ON (with 1 task per virtual cpu, vs. 1 task per physical cpu with SMT=OFF) gave the best performance on all the benchmarks we did, even for 16-node (~250 task) MPI jobs. -jf |
| |||
| Jan-Frode Myklebust wrote: > On 2007-04-04, Michael E. Thomadakis <miket@sc.tamu.edu> wrote: >> More specifically, I would expect that SMT ON would benefit workload >> mixes with high thread concurrency but with low contention on common >> thread resources (L1/L2/L3 and functional units). For instance, I would >> expect a cluster with a large number of distributed (and multithreaded) >> processes running to benefit from SMT ON: twice as many h/w threads can >> be active at a time, obviating the need for costly thread-switch. > > I did some benchmarking on a P5+ cluster last fall, and was > quite surprised that SMT=ON (and 1 task per virtual cpu vs. > 1 task per physical cpu in SMT=OFF) was giving best performance > on all the benchmarks we did. Even for 16 node (~250 task) > mpi-jobs. > > > -jf Jan, thanks for your reply. We have a 40-node p5-575 cluster with 32GB DRAM/node. Nodes are attached to 2 planes of HPS. I was contending that SMT ON should benefit the cluster, since all of the cluster control processes (RSCT, GPFS, HPS assistance, etc.) are heavily multi-threaded and, not being compute-intensive, should benefit from SMT ON. Are there any public benchmarks that I can run to demonstrate the benefits of SMT? Thanks again, Michael |
| |||
| Try searching here for "smt": http://www-03.ibm.com/support/techdo...f/Web/Techdocs There is no special document for POWER5+, but I don't think SMT differs from POWER5. Cheers, Andy On Wed, 04 Apr 2007 20:05:11 -0500, "Michael E. Thomadakis" <miket@sc.tamu.edu> wrote: >Jan-Frode Myklebust wrote: >> On 2007-04-04, Michael E. Thomadakis <miket@sc.tamu.edu> wrote: >>> More specifically, I would expect that SMT ON would benefit workload >>> mixes with high thread concurrency but with low contention on common >>> thread resources (L1/L2/L3 and functional units). For instance, I would >>> expect a cluster with a large number of distributed (and multithreaded) >>> processes running to benefit from SMT ON: twice as many h/w threads can >>> be active at a time, obviating the need for costly thread-switch. >> >> I did some benchmarking on a P5+ cluster last fall, and was >> quite surprised that SMT=ON (and 1 task per virtual cpu vs. >> 1 task per physical cpu in SMT=OFF) was giving best performance >> on all the benchmarks we did. Even for 16 node (~250 task) >> mpi-jobs. >> >> >> -jf > >Jan thanks for your reply. We have a 40 node p5-575 cluster with 32GB >DRAM/node. Nodes are attached to 2 planes of HPS. I was contending that >SMT ON should benefit the cluster since all of the cluster control >processes (RSCT, GPFS, HPS assistance, etc.) are heavily multi-threaded >and by not being compute-intensive, they should benefit from SMT ON. > >Are there any public benchmarks that I can run to demonstrate the >benefits of SMT ? > >thanks again, >Michael -- Andreas Beckmann Andreas.Beckmann@muenster.de http://www.muenster.de/~andy |
| |||
| On 2007-04-05, Michael E. Thomadakis <miket@sc.tamu.edu> wrote: > > Jan thanks for your reply. We have a 40 node p5-575 cluster with 32GB > DRAM/node. Nodes are attached to 2 planes of HPS. I was contending that > SMT ON should benefit the cluster since all of the cluster control > processes (RSCT, GPFS, HPS assistance, etc.) are heavily multi-threaded > and by not being compute-intensive, they should benefit from SMT ON. I believe these cluster control processes will have a negligible impact (*). It's the applications you'll be running that will decide whether you should use SMT or not. We let the users decide on a job-by-job basis whether they want SMT by specifying this in their LoadLeveler scripts, and then have LL pre/post scripts turn SMT on/off. > > Are there any public benchmarks that I can run to demonstrate the > benefits of SMT ? Don't know, but it will probably be easy to demonstrate higher throughput by running 32 separate cpu-intensive tasks on a system with SMT=on compared to SMT=off. And then you could probably also experience the negative impact of SMT by running a 16-task MPI job on a system with SMT=on -- where you might not get all tasks running on separate physical cpus... * BTW: do make sure that there's nothing causing any load on a system that's idle. An idle system should show a load of 0.00 in "uptime". We had an "xmtrend" system process that was generating a load of about 0.05-0.10 on our nodes, and that was killing the parallel benchmark performance on fully loaded systems. -jf |
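For anyone who wants a quick first number, the throughput test suggested above could be sketched roughly as follows. This is only an illustration, not anything posted in the thread: the integer loop standing in for a "cpu intensive task", the names `WORK`, `node_is_idle` and `run_tasks`, and the 0.05 load threshold are all assumptions. The idea is to run the same script once with SMT on and once with SMT off and compare the tasks-per-second figures.

```python
import os
import subprocess
import sys
import time

# Hypothetical CPU-bound kernel executed by every task: a plain integer loop.
WORK = "t = 0\nfor i in range({n}):\n    t = (t + i * i) % 1000003\n"


def node_is_idle(threshold=0.05):
    """True if the 1-minute load average is at or below `threshold`.

    The threshold is an assumed cutoff; the thread only says that an idle
    node should show a load of 0.00 in "uptime".
    """
    load_1min = os.getloadavg()[0]
    return load_1min <= threshold


def run_tasks(n_tasks, iterations=2_000_000):
    """Start n_tasks independent processes, wait for all, return tasks/second.

    Interpreter start-up is included in the timing, so keep `iterations`
    large enough that the loop dominates.
    """
    code = WORK.format(n=iterations)
    start = time.perf_counter()
    procs = [subprocess.Popen([sys.executable, "-c", code]) for _ in range(n_tasks)]
    for proc in procs:
        if proc.wait() != 0:
            raise RuntimeError("a benchmark task failed")
    elapsed = time.perf_counter() - start
    return n_tasks / elapsed


if __name__ == "__main__":
    if not node_is_idle():
        print("warning: node is not idle, results may be noisy")
    # One task per CPU the OS reports; with SMT on this is expected to be
    # the number of logical CPUs, with SMT off the number of physical ones.
    n = os.cpu_count() or 1
    print(f"{n} tasks: {run_tasks(n):.2f} tasks/s")
```

A loop like this barely touches the caches or the memory system, so whatever it shows is at best a first-cut result for purely compute-bound work; a real application mix may behave quite differently.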
| |||
| In article <slrnf19kpo.524.mykleb@lc4eb6380248654.ibm.com>, Jan-Frode Myklebust <mykleb@no.ibm.com> writes: |> |> I believe these cluster control processes will have a negligible |> impact (*). It's the applications you'll be running that will decide |> if you should use SMT or not. We let the users decide on a job |> by job basis if they want SMT by specifying this in their LoadLeveler |> scripts, and then have LL pre/post scripts turn on/off SMT Don't you believe it :-( The way that they have an impact is by disturbing the scheduling; if your applications, communications methods and scheduler get on well together, there won't be a problem. But, if the aggregate system is a bit sensitive, you can get some very strange effects. This applies as much to SMT as it does to distributed memory clusters. In extremis, you can get deadlock, livelock and massive performance degradation (I observed a factor of over 1,000 on a POWER3 running AIX, and have seen pretty bad ones on other (non-IBM) systems). This is rare, but changing an apparently irrelevant configuration parameter is often the way to 'cure' it. Regards, Nick Maclaren. |
| |||
| On 2007-04-05, Nick Maclaren <nmm1@cus.cam.ac.uk> wrote: > >|> I believe these cluster control processes will have a negligible >|> impact (*). It's the applications you'll be running that will decide >|> if you should use SMT or not. We let the users decide on a job >|> by job basis if they want SMT by specifying this in their LoadLeveler >|> scripts, and then have LL pre/post scripts turn on/off SMT > > Don't you believe it :-( What I meant to say was that these "cluster control processes" will have a negligible impact on the decision to go for SMT on/off. It's the application that matters. But do of course get rid of unnecessary disturbances (as mentioned in the (*)). > The way that they have an impact is by disturbing the scheduling; if > your applications, communications methods and scheduler get on well > together, there won't be a problem. But, if the aggregate system is > a bit sensitive, you can get some very strange effects. > This applies as much to SMT as it does to distributed memory clusters. And doesn't it also apply as much to non-SMT ? -jf |
| |||
| In article <slrnf19o6f.594.mykleb@lc4eb6380248654.ibm.com>, Jan-Frode Myklebust <mykleb@no.ibm.com> writes: |> |> What I meant to say was that these "cluster control processes" |> will have a negligible impact on the decision to go for SMT on/off. |> It's the application that matters.. But do of course get rid of |> unnecessary disturbances (as mentioned in the (*)). That's what I say is only USUALLY the case. The point about enabling SMT (whatever form it takes) is that it is a very different and more fine-grained form of parallelism than almost entirely separate cores. In particular, it is much more likely to cause trouble to the basic communication/synchronisation primitives if a control process and an application one are scheduled on the same core than if they are not (and, without SMT, they can't be). This comes back to the old, old issue where some applications ran perfectly well if run on the bare iron, but misbehaved as soon as you put them under a monitor. Old problems never die - they merely go into hibernation. |> > The way that they have an impact is by disturbing the scheduling; if |> > your applications, communications methods and scheduler get on well |> > together, there won't be a problem. But, if the aggregate system is |> > a bit sensitive, you can get some very strange effects. |> |> > This applies as much to SMT as it does to distributed memory clusters. |> |> And doesn't it also apply as much to non-SMT ? You mean non-SMT shared memory? Of course. The POWER3 was, and so were two of the other systems on which I saw serious problems. It applies to ANY close-coupled, thread-based design. Regards, Nick Maclaren. |
| |||
| Andreas Beckmann wrote: > try to search here for "smt": > http://www-03.ibm.com/support/techdo...f/Web/Techdocs > > There is no special document for POWER5+. But I don't think there is a > difference to POWER5 with SMT. > > Cheers, > > Andy > > > On Wed, 04 Apr 2007 20:05:11 -0500, "Michael E. Thomadakis" > <miket@sc.tamu.edu> wrote: > >> Jan-Frode Myklebust wrote: >>> On 2007-04-04, Michael E. Thomadakis <miket@sc.tamu.edu> wrote: >>>> More specifically, I would expect that SMT ON would benefit workload >>>> mixes with high thread concurrency but with low contention on common >>>> thread resources (L1/L2/L3 and functional units). For instance, I would >>>> expect a cluster with a large number of distributed (and multithreaded) >>>> processes running to benefit from SMT ON: twice as many h/w threads can >>>> be active at a time, obviating the need for costly thread-switch. >>> I did some benchmarking on a P5+ cluster last fall, and was >>> quite surprised that SMT=ON (and 1 task per virtuall cpu vs. >>> 1 task per physical cpu in SMT=OFF) was giving best performance >>> on all the benchmarks we did. Even for 16 node (~250 task) >>> mpi-jobs. >>> >>> >>> -jf >> Jan thanks for your reply. We have a 40 node p5-575 cluster with 32GB >> DRAM/node. Nodes are attached to 2 planes of HPS. I was contending that >> SMT ON should benefit the cluster since all of the cluster control >> processes (RSCT, GPFS, HPS assistance, etc.) are heavily multi-threaded >> and by not being compute-intensive, they should benefit from SMT ON. >> >> Are there any public benchmarks that I can run to demonstrate the >> benefits of SMT ? >> >> thanks again, >> Michael > > > -- > Andreas Beckmann > Andreas.Beckmann@muenster.de > http://www.muenster.de/~andy Thanks Michael |
| ||||
| Jan-Frode Myklebust wrote: > On 2007-04-05, Michael E. Thomadakis <miket@sc.tamu.edu> wrote: >> Jan thanks for your reply. We have a 40 node p5-575 cluster with 32GB >> DRAM/node. Nodes are attached to 2 planes of HPS. I was contending that >> SMT ON should benefit the cluster since all of the cluster control >> processes (RSCT, GPFS, HPS assistance, etc.) are heavily multi-threaded >> and by not being compute-intensive, they should benefit from SMT ON. > > I believe these cluster control processes will have a negligible > impact (*). It's the applications you'll be running that will decide > if you should use SMT or not. We let the users decide on a job > by job basis if they want SMT by specifying this in their LoadLeveler > scripts, and then have LL pre/post scripts turn on/off SMT I believe that, under reasonable caveats and circumstances, SMT ON will benefit MPI code that makes many non-blocking I/O and aggregate/reduction calls with relatively small amounts of data per call. This is particularly true for MPI code that is compute intensive (CPUs at ~100% utilization) when I/O arrives or has to be checked for (with the dreadful polling). With SMT OFF, the HPS assist threads have to swap out the compute-intensive ones, which implies a save/restore of their state and the loss of some of their RSS in the cache(s). You may say this is negligible, but I have seen polling threads doing almost nothing but nanosleep() in a tight loop, waiting for data to come in, hundreds of times per second. I believe this rate of thread swapping is far beyond the annoyance level and is detrimental to the compute threads. With SMT ON, these I/O threads should not affect the compute threads much. I am in the process of putting together my own SMT benchmarking suite, but I was hoping to find an existing one to get some quick first-cut results. > >> Are there any public benchmarks that I can run to demonstrate the >> benefits of SMT ?
> > Don't know, but it will probably be easy to demonstrate higher > throughput by running 32 separate cpu intensive tasks on a system > with SMT=on compared to SMT=off. And then you could probably also > experience the negative impact of SMT by running a 16 task > MPI-job on a system with SMT=on -- where you might not get all > tasks running on separate physical cpus... > > * BTW: do make sure that there's nothing causing any load on a > system that's idle. An idle system should show a load of > 0.00 in "uptime". We had an "xmtrend" system process that > was generating a load of about 0.05-0.10 on our nodes, > and that was killing the parallel benchmark performance > on fully loaded systems. I've also seen these pesky 'xmtrend' processes running on some of the nodes at 100% CPU utilization (spinning over a failing syscall). I went ahead and killed the ones that were molesting the processors with a workload of > 10%... > > -jf Michael |
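To put a rough number on the thread-swapping argument in the post above (polling threads waking hundreds of times per second and displacing compute threads when SMT is off), here is a back-of-the-envelope sketch. Nothing in it was measured on a p5-575: the function name `lost_fraction` and every per-switch cost in the table are assumptions, and the cost is meant to include the cache refill after the compute thread is swapped back in.

```python
def lost_fraction(wakeups_per_sec, switch_cost_us):
    """Fraction of each second a compute thread loses to being swapped out.

    Every wakeup of the polling thread is counted as two switches: one to
    swap the compute thread out and one to swap it back in. `switch_cost_us`
    is an assumed all-in cost per switch in microseconds.
    """
    lost_us = wakeups_per_sec * 2 * switch_cost_us
    return min(lost_us / 1_000_000, 1.0)


if __name__ == "__main__":
    # Assumed wakeup rates (per second) and per-switch costs (microseconds).
    for rate in (100, 500, 1000):
        for cost in (5, 20, 50):
            share = lost_fraction(rate, cost)
            print(f"{rate:5d} wakeups/s at {cost:3d} us/switch: {share:.1%} lost")
```

The model ignores the time the polling thread itself runs and any knock-on delay in a tightly synchronized MPI job, where one slowed task can hold up all the others, so it only gives a lower bound on the kind of loss being described.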