Panic Mode On (65) Server problems?

Message boards : Number crunching : Panic Mode On (65) Server problems?


kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1188575 - Posted: 26 Jan 2012, 16:37:29 UTC - in response to Message 1188574.  

You are missing my original point.....(now obscured since AP is filling the pipe again).
If the feeder and scheduler are working together properly, they are fully capable of saturating the bandwidth with MB tasks only, and that was nowhere near happening. And current MB work being issued does not appear to be a shorty storm anyway.


Oh, I'm sorry, I just wanted to put across that 'no tasks available' with ready-to-send high means an empty feeder; I didn't check the SSP or Cricket.

Why the feeder is empty is a different issue, and yes, if people have trouble getting tasks when the bandwidth is not maxed, that points at more severe problems than just being unable to keep up with demand.

Amen....

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1188575
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1188705 - Posted: 27 Jan 2012, 0:58:31 UTC
Last modified: 27 Jan 2012, 1:03:46 UTC

Hey - I just downloaded 21jn11ac.5207.7025.3.10.244 - seems like an ordinary MB job.

But the datafile is 2,794 kilobytes. So what's with that, then?

Answers on a postcard.....

Edit - the header looks normal, but then we get to

<data length=2839876 encoding="x-setiathome">

A normal WU has

<data length=354991 encoding="x-setiathome">

Spooky!
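For anyone wanting to check their own cache the same way, here is a minimal sketch (an illustration, not project code) that pulls the declared payload size out of a WU file, using the `<data length=... encoding="x-setiathome">` tag format quoted above:

```python
import re

# Matches the header line quoted in this thread, e.g.
#   <data length=354991 encoding="x-setiathome">
DATA_LEN_RE = re.compile(r'<data\s+length=(\d+)\s+encoding="x-setiathome">')

def data_length(wu_text: str) -> int:
    """Return the declared payload length of a WU file, or -1 if absent."""
    m = DATA_LEN_RE.search(wu_text)
    return int(m.group(1)) if m else -1

normal = '<data length=354991 encoding="x-setiathome">'
mega = '<data length=2839876 encoding="x-setiathome">'

print(data_length(normal))  # 354991
print(data_length(mega))    # 2839876 - roughly 8x the usual payload
```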
ID: 1188705
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1188712 - Posted: 27 Jan 2012, 1:19:34 UTC

Interesting indeed. Are all the other parameters the same (like FFTs or whatever it is that controls how much crunching there is)? Maybe it's just one of those oddballs at the end of a tape that can't be chopped up anymore.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1188712
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1188714 - Posted: 27 Jan 2012, 1:36:09 UTC - in response to Message 1188712.  
Last modified: 27 Jan 2012, 1:39:39 UTC

Interesting indeed. Are all the other parameters the same (like FFTs or whatever it is that controls how much crunching there is)? Maybe it's just one of those oddballs at the end of a tape that can't be chopped up anymore.

Too late to investigate tonight - I'm on UTC, and headed for bed.

Host has a three-day cache, so we have a bit of leeway. But I've made a safety copy anyway, and I'll have a look in the morning.

Edit again - looks like I may have two more of those dodgy WUs, from the same tape. They'll do wonders for the download bandwidth - not.
ID: 1188714
Profile Gatekeeper
Joined: 14 Jul 04
Posts: 887
Credit: 176,479,616
RAC: 0
United States
Message 1188715 - Posted: 27 Jan 2012, 1:38:22 UTC

I've got about 30 or so of the 2.7MB tasks on a couple of my boxes. Their estimated completion times seem to be the same as the other MB tasks', so if they actually take much longer, they will play havoc with the DCF.

Are these an unannounced drop of the "new" WUs?
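The DCF worry can be put in numbers. Below is a simplified model (my own sketch, not the actual BOINC client code) of how the duration correction factor reacts: it jumps up immediately when a task runs far longer than its estimate, then decays back slowly over many normal tasks:

```python
def update_dcf(dcf: float, actual: float, estimate: float) -> float:
    """Simplified model of BOINC's per-project duration correction factor:
    jump up at once on a long-running task, ease back down (10% of the
    gap per task) when tasks finish on or ahead of estimate."""
    ratio = actual / estimate
    if ratio > dcf:
        return ratio                      # long task: correct upward immediately
    return dcf + 0.1 * (ratio - dcf)      # short task: decay slowly

dcf = 1.0
# One 8x-size WU with a normal-size estimate drags every estimate up...
dcf = update_dcf(dcf, actual=8.0, estimate=1.0)   # dcf -> 8.0
# ...and it takes many normal-length tasks to work back down.
for _ in range(20):
    dcf = update_dcf(dcf, actual=1.0, estimate=1.0)
print(round(dcf, 2))  # 1.85 - still skewed after 20 normal tasks
```

The asymmetry (instant up, slow down) is why a single mis-estimated giant WU distorts the whole cache's estimates for a long while afterwards.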
ID: 1188715
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1188716 - Posted: 27 Jan 2012, 1:43:02 UTC - in response to Message 1188715.  

I've got about 30 or so of the 2.7MB tasks on a couple of my boxes. Their estimated completion times seem to be the same as the other MB tasks', so if they actually take much longer, they will play havoc with the DCF.

Are these an unannounced drop of the "new" WUs?

That's kind of what I was wondering when I heard about it a few minutes ago. Kind of like how, way back, the precision was doubled, making twice the computation while the data size stayed the same. There's been talk of increasing either the precision or the data size to make computation take longer. I think these WUs from that tape are just special.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1188716
Profile arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1188728 - Posted: 27 Jan 2012, 2:49:28 UTC

So far, I have 3 of them on my GTX560 machine and 3 on the other machine.

ID: 1188728
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1188732 - Posted: 27 Jan 2012, 3:18:37 UTC - in response to Message 1188728.  

I picked up a few of these today as well (all CUDA work).

Cheers.
ID: 1188732
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 1188733 - Posted: 27 Jan 2012, 3:33:48 UTC
Last modified: 27 Jan 2012, 3:34:06 UTC

I have to wait until maybe August 2012 before I can replace the half-dead EVGA GTX295 v2 card. Any other brand I have no problem with, as they generally don't come overclocked.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1188733
Cruncher-American

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1188738 - Posted: 27 Jan 2012, 3:56:19 UTC

I've got 25 of the King Size MBs on one machine, 0 on the other. If there are a lot of these being created, it is a sure way to eat up all the bandwidth, especially if they take as long to compute as the old 366K ones.

Score another one for the SETI Team.
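The bandwidth arithmetic here is easy to sketch. Sizes below come from this thread; the tasks-per-hour figure is a made-up assumption purely for illustration:

```python
# Rough bandwidth cost of normal vs. oversized MB workunits.
NORMAL_KB = 366          # typical MB workunit download (per this thread)
MEGA_KB = 2794           # the oversized 21jn11ac workunits
TASKS_PER_HOUR = 10_000  # hypothetical project-wide issue rate

# kilobytes * 8 = kilobits; / 3600 = kilobits per second; / 1000 = Mbit/s
normal_mbps = NORMAL_KB * TASKS_PER_HOUR * 8 / 3600 / 1000
mega_mbps = MEGA_KB * TASKS_PER_HOUR * 8 / 3600 / 1000

print(round(normal_mbps, 1))  # 8.1  Mbit/s
print(round(mega_mbps, 1))    # 62.1 Mbit/s at the same issue rate
```

At the same issue rate, the big WUs need roughly 7.6x the download bandwidth, so even a modest share of them in the mix crowds out everything else on the link.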
ID: 1188738
Profile Lint trap

Joined: 30 May 03
Posts: 871
Credit: 28,092,319
RAC: 0
United States
Message 1188757 - Posted: 27 Jan 2012, 5:09:44 UTC
Last modified: 27 Jan 2012, 5:30:41 UTC

The one comparison I am looking at shows the <subband_desc> <number> is 107 for the XL wu and just 87 for the normal sized MB. Of course the data_length is different; 2839876 for the XL and 354991 for the regular.

I've no idea what <subband_desc> or <number> are or how that might affect anything. Subband makes me think of radio. More radio data means...

Are these maybe GBT WUs?? (updated: Nope...they're from Arecibo)


I'm still looking to see if there might be any other significant diffs.

Lt

adding: Many changes in splitter_cfg and analysis_cfg also.
ID: 1188757
Profile Tim
Volunteer tester
Joined: 19 May 99
Posts: 211
Credit: 278,575,259
RAC: 0
Greece
Message 1188781 - Posted: 27 Jan 2012, 7:57:08 UTC - in response to Message 1188776.  

are they 23oc11ah.32500.10292.xxxxxxxx?

I don't find any 2.7M ones on my hard drive, but ... I saw in my results some took 3x more time.



I have about 15. All start with 21jn.....
ID: 1188781
Profile Belthazor
Volunteer tester
Joined: 6 Apr 00
Posts: 219
Credit: 10,373,795
RAC: 13
Russia
Message 1188809 - Posted: 27 Jan 2012, 9:24:19 UTC

I've got 18 WUs named 21jn11ac; all of them are the normal 365 KB.
ID: 1188809
Profile Fred E.
Volunteer tester

Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1188810 - Posted: 27 Jan 2012, 9:27:06 UTC

I ran one to see what it would do. It was on track for a normal runtime for GPU other-than-shorties, but exited with a -9 result overflow after 9.6 minutes. Hope they don't all do that. Even if they run for normal times, it won't help the bandwidth issue.

http://setiathome.berkeley.edu/result.php?resultid=2284335011
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.
ID: 1188810
Profile Belthazor
Volunteer tester
Joined: 6 Apr 00
Posts: 219
Credit: 10,373,795
RAC: 13
Russia
Message 1188811 - Posted: 27 Jan 2012, 9:31:19 UTC

As I remember, before his departure Matt said that there was some experiment on the splitter software. Maybe 21jn11ac is a testing tape?
ID: 1188811
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1188814 - Posted: 27 Jan 2012, 9:44:46 UTC - in response to Message 1188811.  

As I remember, before his departure Matt said that there was some experiment on the splitter software. Maybe 21jn11ac is a testing tape?


Here's what was said:
Another project I've been working on is to get the splitters (the programs that make workunits out of raw data) to become sensitive to VGC (voltage gain control) values available in the raw data headers so that we can avoid splitting areas with low VGC values (and therefore loud noise). In layman's terms: we're trying to set everything up to automatically reject noisy workunits before sending them out. We know one or two beams (out of fourteen) are sometimes flaky, and keeping those workunits out of the pipeline will help reduce network competition for downloads.

This should have been fairly straightforward, however during the course of testing we're finding more than one or two beams with various problems. More like 5 or 6. This may be for several different reasons, including bogus or misreported VGC values. This is on a front burner, with several parties involved here and at Arecibo.


Basically it's what I've been suggesting for those AP WUs that take 16MB of download bandwidth and error out in 20 seconds due to 100% blanking. Find a way server-side to find these before they get sent out. It's being worked on, but I don't think that is what is going on with these huge MB WUs.
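The server-side rejection Matt describes boils down to a filter over raw-data blocks before splitting. Here's a hedged sketch of the idea; the field names, the threshold value, and the data layout are all illustrative assumptions, not the real splitter code:

```python
# Sketch of VGC-based rejection: skip splitting raw-data blocks whose
# VGC (voltage gain control) value indicates loud noise, so noisy
# workunits never reach the download pipe at all.
VGC_THRESHOLD = 0.5  # hypothetical cutoff; low VGC here means loud noise

def splittable_blocks(blocks):
    """Yield only the raw-data blocks worth turning into workunits."""
    for block in blocks:
        if block["vgc"] >= VGC_THRESHOLD:
            yield block
        # else: rejected server-side, before any bandwidth is spent on it

tape = [
    {"beam": 0, "vgc": 0.9},
    {"beam": 3, "vgc": 0.2},  # flaky beam / misreported gain value
    {"beam": 6, "vgc": 0.7},
]
print([b["beam"] for b in splittable_blocks(tape)])  # [0, 6]
```

The same shape of filter would cover the 100%-blanked AP case mentioned above: compute the disqualifying property server-side, drop the block, and never pay the 16MB download for a task that dies in 20 seconds.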
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1188814
Profile ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1188876 - Posted: 27 Jan 2012, 16:07:19 UTC - in response to Message 1188814.  

I've been getting a lot of download errors today, but only on the big iron at work, it would seem. Nothing to do with the big MB WUs as far as I can tell. I'd guess it's something to do with our network, unless others are seeing it too -- a couple of days ago I could only get 30 kB/s upgrade downloads from CERN, where I normally get 2 MB/s or more.
ID: 1188876
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1188889 - Posted: 27 Jan 2012, 16:50:53 UTC - in response to Message 1188757.  

The one comparison I am looking at shows the <subband_desc> <number> is 107 for the XL wu and just 87 for the normal sized MB. Of course the data_length is different; 2839876 for the XL and 354991 for the regular.
...
adding: Many changes in splitter_cfg and analysis_cfg also.

Well, I've just done a comparison of

21jn11ac.5207.16432.3.10.154 (mega size WU)
21jn11ac.14219.16432.6.10.73 (normal size WU)

All the mega WUs seem to have .5207. as the second element of the name: those two match as closely as I can find in my cache for the other elements.

The mega WU was split from Beam 0, Polarisation 0; the normal one from Beam 1, Polarisation 1. That leads to minor differences in the telescope aiming point at identical coordinate times, which is to be expected - the seven separate antennae which make up our "multibeam" signal sit side-by-side at (or near) the telescope's focal point, so naturally they have a slightly different field of view - that's the whole idea.

There are different array_ellipse values, probably again because of the different beams in use; and the subband numbers differ - they are 154 and 73 respectively, as you'd expect from the final elements of the WU names.

Apart from that, there seem to be no significant differences at all - certainly nothing to explain the different data lengths. But I'll keep the WU files, just in case.
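A field-by-field comparison like the one described above can be sketched in a few lines: parse simple `<tag>value</tag>` pairs from two WU headers and report which fields differ. The tag names below are illustrative stand-ins, not the exact WU schema; the beam, polarisation, and subband values are the ones from this comparison:

```python
import re

def header_fields(text: str) -> dict:
    """Extract flat <tag>value</tag> pairs from a header string."""
    return dict(re.findall(r"<(\w+)>([^<]+)</\1>", text))

def diff_headers(a: str, b: str) -> dict:
    """Map each differing field name to its (a, b) value pair."""
    fa, fb = header_fields(a), header_fields(b)
    return {k: (fa.get(k), fb.get(k))
            for k in fa.keys() | fb.keys()
            if fa.get(k) != fb.get(k)}

mega = "<beam>0</beam><polarisation>0</polarisation><subband>154</subband>"
normal = "<beam>1</beam><polarisation>1</polarisation><subband>73</subband>"
print(diff_headers(mega, normal))  # beam, polarisation and subband all differ
```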
ID: 1188889
Kevin Olley

Joined: 3 Aug 99
Posts: 906
Credit: 261,085,289
RAC: 572
United Kingdom
Message 1188891 - Posted: 27 Jan 2012, 17:12:14 UTC - in response to Message 1188705.  

Hey - I just downloaded 21jn11ac.5207.7025.3.10.244 - seems like an ordinary MB job.

But the datafile is 2,794 kilobytes. So what's with that, then?

Answers on a postcard.....



Got 13 here, 12 for GPU and 1 for CPU.

I have made backup copies, if anyone's interested.

They should be processed tomorrow.


Kevin


ID: 1188891
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1188895 - Posted: 27 Jan 2012, 17:37:46 UTC - in response to Message 1188468.  

....I can tell you this.
My top rig is losing cache steadily. And if I am losing cache, a lot of other rigs are as well. Even though server status shows over 200k WUs ready to send, the scheduler/feeder combo is not getting the job done. Otherwise, bandwidth would be saturated right now.

Something is not happy in serverland right now.

The Cricket's hiccups this week (another one starting now) notwithstanding, my best rig is rebuilding its cache and is now within 10 of its limit.

(OTOH, the wifi here at work is having major fits today; this is the third time I've typed this message.)

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1188895



©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.