
Mad props go out to IronBits for hosting the site, and to all the others who have kept this site going for the past couple of years!

Version 3.0 Preview
(updated 7/23)

July 5th brought us the first beta test of the version 3.0 client (dubbed v2.66). The new iteration of the SETI@Home client promised to bring the SETI crunchers of the world three major improvements over previous versions. The two most important deal with additional science that the client will perform on each work unit: version 3.0 will do pulse and triplet searches on the data, in addition to the gaussian signal fitting done in previous versions. These added calculations would significantly increase the processing time of a work unit unless the processing algorithms themselves were improved. Thus we bring in the third improvement, a new and improved Fast Fourier Transform (FFT) routine. In previous versions of the S@H client, the one sore spot for many was the inefficient FFT routines, which spawned several conflicting attempts to "patch" and improve them in the version 1.x and 2.x clients. The new, faster FFT promised to help alleviate the burden of the pulse and triplet searches. But alas, you can't have everything...it was projected that the version 3.0 client would run approximately 25% longer than the current 2.4 client.

v2.66 Beta #1
The 2.66 beta gave us the first look at what to expect from the final 3.0 version. The packaging was sure familiar enough. (Click on the thumbnail on the right to see a full size version of the client...v2.70 client shown.) On the top right there is a box for the work unit info: the point in the sky where the telescope was aimed when the data was recorded, the date, and the base frequency of the work unit. Just below that is the user info: name, WUs completed, and total computer time contributed. The bottom half of the client shows a running view of the data analysis of the work unit. Actually, that isn't completely true. It only shows the output from the FFT on the work unit, which is the transformed data that the actual gaussian, spike, pulse and triplet searches will be done on.

The major graphical change in the beta client is on the top left of the client window. This portion of the client window is shown below. Directly underneath the Data Analysis title is where the client shows what it is currently working on. It will state either "Computing Fast Fourier Transform", "Chirping Data", "Searching for Pulses", "Searching for Gaussians" or "Searching for Triplets". The status bar to the right of these messages shows the % completion of the current search. Below this line is where the "Best Results" are shown. The second and third lines show the details of the analysis results, and below them is a graphical representation of the result, shown as a continuous red line. When the client is showing the "Best Gaussian", it shows the raw data as a red line with the best gaussian fit overlaid as a white line. If it shows the best triplet, 3 short vertical white lines point out the detected triplet. As the work unit processes, the client alternates between the three different results. But a work unit may have no gaussian, no triplet, or no pulse detection...or none of the three. The client only shows results it has found, and alternates between them. If no results are found in the data, this area is blank. Finally, underneath the graphical data representation is the overall status bar. This shows the % completion of the entire work unit and the amount of CPU time used on it so far.

The release of the v2.66 beta was highly anticipated, but that anticipation quickly turned into disappointment for many. While analyzing my first work unit, it became apparent very quickly that this client needed some work. The 2.66 client crawled along at a snail's pace, and the estimated time to completion was way longer than anything processed with the version 2.0 client that I had been using. I let the work unit finish overnight....OK, it didn't finish overnight...it was well into the evening when the client finally finished. The completion time on a PIII 600E overclocked to 944MHz was a whopping 25 hours. This definitely did not fit the "25% longer" that was quoted on the alt.sci.seti newsgroup.

While the first work unit was still processing, a browse through the alt.sci.seti group showed that many others had seen the same problem. Needless to say, that was the first and last work unit that I processed with the v2.66 client. Within the next two days Eric Korpela announced that they had an idea of what was going wrong and were working on a fix. The problem turned out to be the client waiting and idling while the on screen graphics were updating. Back to the drawing board.

v2.70 Beta
After some work, the second beta, dubbed v2.70, was released on July 17. I downloaded the client, but held off for a bit before installing. I installed it before hitting the sack that night, and before I turned in...I could tell things were going to be a bit faster. OK, not a bit faster....A LOT faster. Visually, not much appeared to have changed with the new beta other than a minor graphics fix with the rotating results...but the big change was in the run times. About half an hour into the run, the estimated time to completion was under 4 hours. Granted, during this time I had upgraded from an 800MHz PIII to a 925MHz PIII, but this was definitely faster than a CPU upgrade would account for.

Run times for this new beta on the first 5 or so work units I completed ranged from 3:10 to 6:46. After completing a few more work units, the times fell into two categories. The majority of the run times were in the range of 3:20 - 4:00, and the smaller category was in the 6 - 7 hour range. I will touch on this in a bit.

Many of the people on the newsgroups finished their first work units with the beta...and there was much rejoicing. Of course everyone loved the faster times....almost all of them ran faster than the previous version 2.x clients (even faster than the CLI!). Apparently with the new beta, the overhead of drawing the graphics is totally eliminated when the GUI is minimized, and from comparing some other machines it appears that the newer client is more cache friendly. Times from CPUs with large caches didn't vary much from those of an equivalent speed CPU with a smaller cache. It almost seems that work unit times are more dependent on raw CPU speed and less dependent on CPU cache size and memory bandwidth. But it is early in the game right now, and this needs to be investigated further.

After running through some of the "slower" work units, I noticed something different in how they were processed. With the faster work units, the pulse detection seemed to happen mostly at the beginning of the run, while later on gaussian searches took over and there were no pulse searches. In all of the "slower" work units, no gaussians were detected in the final analysis, and during the run there didn't seem to be any gaussian searches performed at all. Instead, there were pulse searches throughout the entire run, not only at the beginning. I haven't heard back whether this is "normal" or a sign of some glitch in the matrix. Just a half hour ago Eric Korpela replied to one of my posts about slow work units with:

>Let me guess, angle_range is a small number? 0.1 or less?

Well, he was right....the angle_range on that work unit was 0.023. But he didn't say whether that was expected or not......

bulletin....bulletin....bulletin Late Breaking News! (don't you love it reading the newsgroups while you are writing?)

Eric just chimed in with an explanation. I will let him explain!

Under the old client these would have been "fast" work units, because gaussian finding isn't done on them.  It's a different situation on the new client, and we haven't yet decided what to do about it.

Let me explain the situation.  After the FFT's are done, the data is broken up into chunks the width of a beam for pulse finding, if that chunk is longer than 15 points, and shorter than 40961 points, pulse finding is done.  The problem is that length of those chunks is inversely proportional to the slew rate.  Slow slew=longer chunks.  At siderial slew rate, there are less than 16 points in a chunk for long FFTs (longer than 8k give or take).  At zero slew rate, you get more than 15 points at a 64K transform.  So for zero slew rate, we do a lot more pulse searches.  We also do longer ones.  The longest pulse array we search in a normal slew rate work unit is about 25000 points.  The longest we do in a zero slew work unit is 32768 points.  Longer arrays take longer to search.  Both of these add up to a longer run time.  Of course, the longer you watch the same point, the easier it is to detect a pulse, so the pulse search in these zero slew work units is more than twice as sensitive as the same search in a normal work unit. So the question is, is the additional time worth it?  

Eric

To paraphrase....at lower angle ranges, the antenna is "looking" at the same point in the sky longer. This enables them to do a more sensitive search of the area. This sensitive search takes longer. On the older clients (without the pulse detection) these work units did not contain gaussians, and therefore usually had shorter run times...but now, with the pulse searches, these work units will probably take longer.
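Eric's numbers can be checked with a little arithmetic. The sketch below applies his "longer than 15 points, shorter than 40961 points" rule to a zero slew work unit. Two things in it are assumptions and do not come from his post: that a work unit holds 2**20 samples, and that the client uses power-of-two FFT lengths from 8 to 131072. The function names are made up for illustration. Treating the whole work unit as a single beam-width chunk at zero slew is also just one reading of his explanation.

```python
# Rough sketch of the chunk-length rule for a zero slew work unit.
# ASSUMED (not from Eric's post): 2**20 samples per work unit and
# power-of-two FFT lengths from 8 to 131072.

MIN_POINTS = 16      # "longer than 15 points"
MAX_POINTS = 40960   # "shorter than 40961 points"
SAMPLES_PER_WU = 2 ** 20                       # assumed
FFT_LENGTHS = [2 ** k for k in range(3, 18)]   # assumed: 8 .. 131072


def pulse_search_done(points_in_chunk):
    """True if a chunk of this many points gets a pulse search."""
    return MIN_POINTS <= points_in_chunk <= MAX_POINTS


def zero_slew_chunks():
    """Map FFT length -> points per chunk when the telescope is not moving.

    With zero slew the chunk is taken to span the whole work unit, so each
    frequency bin yields SAMPLES_PER_WU / fft_length points.
    """
    return {n: SAMPLES_PER_WU // n for n in FFT_LENGTHS}


for fft_len, points in zero_slew_chunks().items():
    verdict = "pulse search" if pulse_search_done(points) else "skipped"
    print(f"FFT {fft_len:>6}: {points:>6} points per chunk -> {verdict}")
```

Under those assumptions the longest chunk that passes the rule is 32768 points and the 64K transform squeaks in with 16 points, which lines up with the figures Eric quotes for zero slew work units.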

Is it worth it? I guess it is up to them to decide!

I think I am going to end this part of the preview here. I will add things as they become available!

Update (7/23/00)
The previous SETI clients were memory bandwidth limited because the client's working set did not fit within the L2 cache of many CPUs. Xeons worked well because they had upwards of 1-2MB of cache. An equally clocked Katmai PIII would outperform a Coppermine because the Katmai had 512kb of cache compared to the Coppermine's 256kb. Rat Bastard recently did some benchmarks with the version 2.70 beta using equivalently clocked Katmai, Coppermine and Celeron CPUs. Here are the results:

CPU                  FSB     Speed   L1 cache  L2 cache  Time
PIII 550 Katmai      100MHz  550MHz  32kb      512kb     5:13
PIII 550 Coppermine  100MHz  550MHz  32kb      256kb     5:13
Celeron 366          100MHz  550MHz  32kb      128kb     5:41

Welp...it looks like the beta has a working set that fits completely into the Coppermine's 256kb L2 cache, but is still too large to fit into the Celeron's cache. What does this mean? It looks as if the client is no longer memory bandwidth limited on *any* PIII CPU. Memory tweaks will probably show no improvement in client times. But it doesn't stop there! The BX motherboard may not be SETI king anymore; the VIA Apollo Pro boards should now perform as well as the BX boards...with the added advantage of underclocking the memory and allowing for higher CPU speeds! Athlons, both classic and T-Bird, will probably perform as well as a PIII, and the Xeon will probably now lose ground to the PIII. The Celeron doesn't fare well...but what about its arch-enemy the Duron? Word is still out on this.

The Duron may be the best price/performance solution for version 3.0. But you say "ahem...the Duron only has 64kb of L2 cache...and only 128kb of L1 cache...it should stink also". Not necessarily. The one thing about the Duron is that its L2 cache is exclusive (not inclusive...oops, my bad). That means the two caches act like a total, giving 192kb of cache available for SETI. That would be halfway between the CuMine's and the Celeron's cache. The working set for SETI may fit into the Duron's combined cache. Only some benchmarking will tell....the Duron may actually crunch as well as the classic Athlon and the T-Bird! Anyone want to do some benchmarking? :-)
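The 192kb figure falls out of a very simple model, sketched here. It follows the text in treating all of L1 as one pool (in practice half of the Duron's 128kb L1 holds instructions rather than data), and the function name is just an illustration.

```python
def effective_cache_kb(l1_kb, l2_kb, exclusive):
    """Usable on-chip cache under the simple model described in the text.

    Exclusive: L1 and L2 hold different data, so their capacities add.
    Inclusive: L2 duplicates what is in L1, so L2 alone sets the total.
    """
    return l1_kb + l2_kb if exclusive else l2_kb


# Cache sizes (kb) as quoted in this article.
celeron = effective_cache_kb(32, 128, exclusive=False)
coppermine = effective_cache_kb(32, 256, exclusive=False)
duron = effective_cache_kb(128, 64, exclusive=True)

print("Celeron:", celeron, "Coppermine:", coppermine, "Duron:", duron)
```

This says nothing about how fast each cache level is, only how much data the chip can hold before going out to main memory.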

Update (7/30/00)
When I reported in the Version 3.0 preview that the new client would be more cache friendly, that appeared to be correct...but with some more benchmark times and further analysis, it appears that the client DOES still rely on memory bandwidth and cache size. The first question about the benchmark times that Rat Bastard reported (both a Katmai and a CuMine 550 turning in a 5:13) came from the SETI newsgroup, where a poster noted that IF the new client were cache independent, then the CuMine should be faster than the Katmai because of its full speed L2 cache. But that isn't happening. Lawrence Kirby tried to explain this by saying that the version 3.0 client probably uses a "recursive subdivision" in the FFT routine. What this means is that it either starts with small blocks and combines them into larger blocks, or does the opposite: starts with a whole chunk of data and splits it up into smaller blocks. Here is how he broke down the "stages", as he calls them, and how the data would be accessed:

Katmai                        Coppermine                    Mendocino (Celeron)
11 stages very fast L1 cache  11 stages very fast L1 cache  11 stages very fast L1 cache
5 stages slow L2 cache        4 stages fast L2 cache        3 stages fairly fast L2 cache
1 stage very slow memory      2 stages very slow memory     3 stages very slow memory

From the benchmark times, the full speed L2 cache and the 256 bit L2 bus compensate exactly for the extra hit on the slow memory that the CuMine must take compared to the Katmai (1/2 speed L2 cache and 64 bit L2 bus). The Celeron's slower time can be explained by a combination of its smaller L2 cache, its 64 bit L2 bus and the extra hits accessing system memory.
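For anyone wondering what "recursive subdivision" looks like in practice, here is a textbook recursive radix-2 FFT. It is only an illustration of the general technique Lawrence describes, not the routine the SETI client actually uses, and the function names are made up.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # split: half-size transform of the even samples
    odd = fft(x[1::2])    # split: half-size transform of the odd samples
    out = [0j] * n
    for k in range(n // 2):   # combine: one "stage" over a block of n points
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def stage_count(n):
    """Number of combine stages for an n-point transform (log2 of n)."""
    return n.bit_length() - 1


print(stage_count(8), [round(abs(v), 6) for v in fft([1, 0, 0, 0, 0, 0, 0, 0])])
```

The deepest levels of the recursion work on tiny blocks that stay in the fastest cache, while the last few combine passes sweep over blocks too large for any cache. That is where the split into L1, L2 and memory stages comes from.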

Where would this put the T-Bird and the Duron???
Based on the above info, the working set for the v3.0 client is still > 512kb...but the T-Bird has 128kb of L1 cache and 256kb of L2, and the Duron has 128kb of L1 and 64kb of L2 (the L2 caches run at clock speed). Because the caches on the T-Bird and Duron are exclusive, they act like a total, unlike the inclusive caches of the Intel CPUs. Based on Lawrence's post I can come up with these estimates (these very well may be totally, absolutely wrong :-) ):

Coppermine                    T-Bird                         Duron
11 stages very fast L1 cache  13 stages very fast L1 cache   13 stages very fast L1 cache
4 stages fast L2 cache        2 stages fairly fast L2 cache  1 stage fairly fast L2 cache
2 stages very slow memory     2 stages very slow memory      3 stages very slow memory
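One way to sanity-check the stage counts in both tables is a toy model where stage k works on blocks of 2**k points and is served by the smallest cache level that holds a whole block. The numbers below are assumptions picked because they happen to reproduce the tables, and neither Lawrence's post nor the SETI team confirms them: a 2**17-point transform (17 stages total), 8 bytes per point, and only the data half of each L1 counted (16kb on the Intel chips, 64kb on the AMD chips). Exclusive caches are modeled as L1 data plus L2, inclusive caches as L2 alone.

```python
BYTES_PER_POINT = 8   # assumed: one complex value as two 32-bit floats
TOTAL_STAGES = 17     # assumed: 2**17-point transform (11 + 5 + 1 stages)


def stage_split(l1_data_kb, l2_kb, exclusive):
    """Return (L1 stages, L2 stages, memory stages) for the toy model.

    Stage k combines blocks of 2**k points and is charged to the smallest
    level that can hold one whole block.
    """
    l1_bytes = l1_data_kb * 1024
    cache_kb = l1_data_kb + l2_kb if exclusive else l2_kb
    cache_bytes = cache_kb * 1024
    l1 = l2 = mem = 0
    for k in range(1, TOTAL_STAGES + 1):
        block_bytes = (2 ** k) * BYTES_PER_POINT
        if block_bytes <= l1_bytes:
            l1 += 1
        elif block_bytes <= cache_bytes:
            l2 += 1
        else:
            mem += 1
    return l1, l2, mem


# (L1 data kb, L2 kb, exclusive?) -- L1 data sizes are assumed, see above.
cpus = {
    "Katmai": (16, 512, False),
    "Coppermine": (16, 256, False),
    "Celeron": (16, 128, False),
    "T-Bird": (64, 256, True),
    "Duron": (64, 64, True),
}
for name, spec in cpus.items():
    print(f"{name:<10} L1/L2/memory stages: {stage_split(*spec)}")
```

The model only counts where each stage lands. It ignores cache speed and bus width, which the text argues matter just as much.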

I do want to remind you this is only a guess here. With the above table, how would the T-Bird and Duron perform compared to the CuMine? Both the T-Bird and the Duron would have the advantage of a couple more stages in the very fast L1 cache. But the L2 bus of the T-Bird and Duron is only 64 bit...plus they have the disadvantage of the memory bandwidth of the Athlon chipsets. For sure the Duron would run slower than the CuMine, but would the T-Bird? I would say no.

Well, I can actually say for sure(?) that the answer is no with the T-Bird, because I have some benchmarks! I have a couple of times sent to me by Tim Wilkens, who ran the benchmark work unit with the 2.70 beta on both a T-Bird and an Athlon Classic. I ran the benchmark on my machine with the beta also. Here is the meat:

CPU         Speed (MHz)  FSB (MHz)  MoBo       Time  Notes
PIII 650E   925          143        BE6 rev 2  3:21  3-2-3 mem. timing
T-Bird 950  1007         106        K7M        3:49  CAS2
Athlon 850  892          105        K7V        4:09  Cache at 2/5, CAS2
The CuMine is definitely kicking some ass. But there are some caveats here. Did you see how I slipped in a (?) when I said "for sure"? This may not really be that fair of a comparison. Yes, the PIII is kicking some serious ass, but take a look at that FSB setting. I believe that this ass kicking is more due to the PIII's cache running at a significantly higher clock speed, and its memory bandwidth totally blowing away that of the AMD machines. Because of this you can't really determine the cache dependencies of these different CPUs. A better comparison would be a PIII, T-Bird and Duron at similar multipliers and FSB settings. Ancala has done the benchmark on the beta with a 700MHz Duron and turned in a 5:07. Sometime in the next day or two I will clock my PIII back down to 650 and do a bench with that setting to see how the PIII compares to the Duron. Hey, it isn't exact....but what the hell! I will keep ya informed.
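Until same-clock numbers exist, one crude way to line these results up is to multiply each run time by the CPU clock, giving a rough "MHz-minutes per work unit" figure. This is only a back-of-the-envelope sketch: it assumes the Duron ran at a stock 700MHz, the helper names are made up, and the figure still folds in the FSB and memory differences mentioned above.

```python
def to_minutes(h_mm):
    """Convert a run time written as 'H:MM' into minutes."""
    hours, minutes = h_mm.split(":")
    return int(hours) * 60 + int(minutes)


def mhz_minutes(mhz, h_mm):
    """Clock speed times run time; lower means less work per clock tick."""
    return mhz * to_minutes(h_mm)


# (CPU, actual clock in MHz, benchmark time) as reported in the text.
results = [
    ("PIII 650E", 925, "3:21"),
    ("T-Bird 950", 1007, "3:49"),
    ("Athlon 850", 892, "4:09"),
    ("Duron 700", 700, "5:07"),   # assumed to be running at 700MHz
]
for cpu, mhz, run_time in results:
    print(f"{cpu:<11} {mhz:>5}MHz  {run_time}  {mhz_minutes(mhz, run_time):>7} MHz-minutes")
```

Lower is better, but this is no substitute for benchmarks at matching multipliers and FSB settings.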

-zAmboni