. . . IF You are Having Problems -

Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 621277 - Posted: 17 Aug 2007, 19:16:55 UTC




Suggestion: 'read this' from Matt:


First off, I should point out that the server status page isn't the most accurate thing in the world, especially now as I haven't yet converted any of this code to understand how the new multibeam splitters work (I've been busy). So please don't use the data on this particular web page to inspire panic - many splitters are running, and have been all night, even though the page shows none of them are running at all.



SETI NEWS from Matt . . . August 16, 2007


So here's the deal. Getting multibeam data out to the public is having its ups and downs. Thanks to some helpful poking and prodding from various users we uncovered a problem with the splitter causing it to generate workunits with bogus triplet thresholds. The result: about 50% of the workunits sent out were overflowing quickly and returning, creating network clogs on our already-overwhelmed servers. And about 2.5% of the workunits were sent out with impossibly low thresholds, causing clients to spin on ridiculously slow calculations. The mystery here is why these aren't also immediately overflowing (with such thresholds they should report a lot of garbage right away). This may have to do with when/where the client checks for overflow - it may take several hours to reach 0.001% done, but then the hope is these clients will finally be bursting with data and returning the results home.

This was actually a problem in beta that got fixed, but now somehow resurfaced, which is also a mystery. CVS out of sync? Some stupid code put in to check for config overrides on the command line? Unfortunately the splitter guru is on vacation, so we had to make our best attempt to understand the code and patch it ourselves. Jeff just did so and put the fixed version on line and we're watching the thresholds. So far so good.

Meanwhile, we're back to yesterday's problem of just not having enough throughput from the workunit file server, so that's the main bottleneck right now, and there's not much we can do about it except wait for the current artificial demand (caused by the excessive overflows) to die down and see if we catch up.

- Matt


Splitsville (Aug 16 2007)


Copyright © 2007 University of California
ID: 621277
tigerfood
Joined: 15 Nov 03
Posts: 1
Credit: 1,347,626
RAC: 0
Germany
Message 621457 - Posted: 17 Aug 2007, 23:22:07 UTC

Most WUs exit with a computational error, and some still take forever and go nowhere: ~120,000 s of CPU time for 0.05 credits? If you can't fix it, take it offline. Wasting resources that other projects could use for something sensible is not really the smartest strategy ever heard of.

I'll be glad to participate again when you know what the project is doing.
cu heiko
ID: 621457
Jim-R.
Volunteer tester
Joined: 7 Feb 06
Posts: 1494
Credit: 194,148
RAC: 0
United States
Message 621484 - Posted: 17 Aug 2007, 23:58:59 UTC - in response to Message 621457.  


These are work units that were split by the defective splitter and were already out in the "wild" before the fix. The splitter is fixed now, but we still have to get rid of the defective ones that are still around in the system. As these get crunched, or whatever, new ones will be split that are good. So every one of these that we can "dispose of" means one more "good one" that can be released.
Jim

Some people plan their life out and look back at the wealth they've had.
Others live life day by day and look back at the wealth of experiences and enjoyment they've had.
ID: 621484
Steven Gaber
Joined: 18 May 99
Posts: 47
Credit: 291,872
RAC: 0
United States
Message 621631 - Posted: 18 Aug 2007, 2:22:42 UTC - in response to Message 620807.  

I admit to being one of those fools who hasn't got a clue.

But my SETI has been doing weird stuff all day. The CPU time reads 5 hours, the progress is .005% and the time to completion is 7 hours and increasing, not decreasing, every second or two. That's a first for me.

I know SETI has no new work, but why should it be stuck at .005% yet using CPU time and adding more time to completion?

See-- I told you I had no clue. Enlightenment?
Steven Gaber

See the Work Unit Problem Thread; without looking, I assume you got one of these.

Andy


So we got a run of bad WUs?

Will it let me stop the one it's working on and go to the next one?
Steven Gaber


You can suspend the 'bad one' and it will switch to next in line.
Of the 10 units I got from that batch, one not started yet, only one has caused problems.

Andy


I aborted the WU my computer was stuck on yesterday, switched to a new one and it hasn't budged in two hours. No CPU time, no progress. I've let my present and past computers run 24/7 for several years processing SETI. After being inactive for two days, now almost 3, I was hoping it would start up again, but that doesn't seem to be happening. Ain't no reason to have it on for days at a time with nothin goin on. Will WUs start spontaneously? Might as well shut the box down, give it a rest. Steve Gaber


In addition to the one I aborted earlier, there are three unstarted WUs, each with the same Time to Completion -- 01:58:42. That's suspicious, ain't it? Should I abort all of them and hope for better WUs? Steve Gaber


All better now. Back to almost normal. Thanks for the commiseration. Steve Gaber
ID: 621631
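
For the stuck work unit described above (5 hours of CPU time at 0.005% progress), a naive linear extrapolation shows how far from finishing it really was. This is only a sketch: the function name is made up for illustration, and it is not how the BOINC client computes the time-to-completion figure it displays.

```python
def naive_remaining_hours(elapsed_hours: float, progress_percent: float) -> float:
    """Linear extrapolation of remaining run time from progress so far.

    Hypothetical helper, not part of BOINC. Assumes the work done so far
    is representative of the rest of the work unit.
    """
    if not 0 < progress_percent <= 100:
        raise ValueError("progress_percent must be in (0, 100]")
    fraction_done = progress_percent / 100.0
    # remaining = elapsed * (work left / work done)
    return elapsed_hours * (1.0 - fraction_done) / fraction_done


if __name__ == "__main__":
    # Figures quoted in the post: 5 hours of CPU time, 0.005% progress.
    remaining = naive_remaining_hours(5.0, 0.005)
    print(f"naive estimate: about {remaining:,.0f} hours remaining")
```

At that rate the unit would need on the order of 100,000 hours, which is why suspending or aborting it, as suggested in the thread, was the practical choice.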
JLDun
Volunteer tester
Joined: 21 Apr 06
Posts: 574
Credit: 196,101
RAC: 0
United States
Message 624245 - Posted: 22 Aug 2007, 6:27:28 UTC

This is more to get it out in the open that this is 'still happening' (though not as often, for me) since the widespread release of MultiBeam WUs:


8/22/2007 1:17:05 AM|SETI@home|[file_xfer] Started download of file 02mr07ah.25481.20931.9.5.159
8/22/2007 1:17:28 AM||Project communication failed: attempting access to reference site
8/22/2007 1:17:28 AM|SETI@home|[file_xfer] Temporarily failed download of 02mr07ah.25481.20931.9.5.159: system connect
8/22/2007 1:17:28 AM|SETI@home|Backing off 1 min 0 sec on download of file 02mr07ah.25481.20931.9.5.159
8/22/2007 1:17:30 AM||Access to reference site succeeded - project servers may be temporarily down.
... [Message for another project] ...
... [Message for another project] ...
8/22/2007 1:18:29 AM|SETI@home|[file_xfer] Started download of file 02mr07ah.25481.20931.9.5.159

ID: 624245
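
The log lines above follow a `timestamp|project|message` layout, so tallying failed downloads can be sketched in a few lines of Python. The function names are hypothetical, and the message prefix is copied from the log shown in this post; other BOINC client versions may word their messages differently.

```python
from collections import Counter

# Prefix copied from the log excerpt in the post above (an assumption
# about message wording, valid only for logs that look like that one).
FAILED_PREFIX = "[file_xfer] Temporarily failed download of "

SAMPLE_LOG = """\
8/22/2007 1:17:05 AM|SETI@home|[file_xfer] Started download of file 02mr07ah.25481.20931.9.5.159
8/22/2007 1:17:28 AM||Project communication failed: attempting access to reference site
8/22/2007 1:17:28 AM|SETI@home|[file_xfer] Temporarily failed download of 02mr07ah.25481.20931.9.5.159: system connect
8/22/2007 1:17:28 AM|SETI@home|Backing off 1 min 0 sec on download of file 02mr07ah.25481.20931.9.5.159
8/22/2007 1:17:30 AM||Access to reference site succeeded - project servers may be temporarily down.
... [Message for another project] ...
8/22/2007 1:18:29 AM|SETI@home|[file_xfer] Started download of file 02mr07ah.25481.20931.9.5.159
"""


def parse_line(line):
    """Split one log line into (timestamp, project, message).

    Returns None for lines that do not have the three-field layout.
    """
    parts = line.split("|", 2)
    if len(parts) != 3:
        return None
    timestamp, project, message = (p.strip() for p in parts)
    return timestamp, project, message


def count_failed_downloads(lines):
    """Count temporarily failed downloads per file name."""
    failures = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        message = parsed[2]
        if message.startswith(FAILED_PREFIX):
            # The file name runs up to the ": reason" suffix.
            filename = message[len(FAILED_PREFIX):].split(":", 1)[0].strip()
            failures[filename] += 1
    return failures


if __name__ == "__main__":
    for name, count in count_failed_downloads(SAMPLE_LOG.splitlines()).items():
        print(f"{name}: {count} failed attempt(s)")
```

Run against a longer stretch of the client's message log, a tally like this makes it easier to tell one stubborn file from a general connection problem.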
James Nelson
Volunteer tester
Joined: 23 Mar 02
Posts: 381
Credit: 4,806,382
RAC: 0
United States
Message 624276 - Posted: 22 Aug 2007, 9:09:01 UTC - in response to Message 624245.  
Last modified: 22 Aug 2007, 9:10:16 UTC



I believe we are all experiencing the same thing. I understand your frustration that, while the server is up and work is available, we are still having connection issues. I'm sure the project admins are aware and are working hard on repairing this.


ID: 624276
Mahoujin Tsukai
Volunteer tester
Joined: 21 Jul 07
Posts: 147
Credit: 2,204,402
RAC: 0
Singapore
Message 624894 - Posted: 23 Aug 2007, 16:10:46 UTC

I keep getting a few of these often (along with WUs that download successfully).

Is anyone having the same problem? What is wrong here?
ID: 624894
Jim-R.
Volunteer tester
Joined: 7 Feb 06
Posts: 1494
Credit: 194,148
RAC: 0
United States
Message 624896 - Posted: 23 Aug 2007, 16:15:24 UTC - in response to Message 624894.  
Last modified: 23 Aug 2007, 16:16:22 UTC



Seems like the servers are having issues again. I've been noticing this showing up here in the forums for about the last 24 hours. It was pointed out in another thread that there seems to be a bunch of short running work units being split which would increase the load on the system due to the faster turnaround times. I don't know if this is what is really happening, but others have reported similar problems so you're not alone.
Jim

Some people plan their life out and look back at the wealth they've had.
Others live life day by day and look back at the wealth of experiences and enjoyment they've had.
ID: 624896
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51527
Credit: 1,018,363,574
RAC: 1,004
United States
Message 625350 - Posted: 24 Aug 2007, 7:17:46 UTC - in response to Message 624896.  
Last modified: 24 Aug 2007, 7:18:10 UTC



Looks like the servers may be getting tied in knots again. The Cricket Graph shows steadily declining traffic. I would not expect that if there are a lot of hosts trying to get work and trying to complete pending downloads.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 625350
Jim-R.
Volunteer tester
Joined: 7 Feb 06
Posts: 1494
Credit: 194,148
RAC: 0
United States
Message 625371 - Posted: 24 Aug 2007, 8:25:29 UTC - in response to Message 625350.  
Last modified: 24 Aug 2007, 8:37:59 UTC



Yes, but if the system were choking down internally and not able to get the work units out the door it looks to me like that would cause a drop in network traffic also. I suspect that the same bottleneck that plagued us earlier is rearing its head again now that the shorter running WUs are being crunched. In other words, long running units going out, longer time between calls for new units and fewer returns in a certain time period, and the system can handle everything fine. Shorter running units going out, shorter time between calls for new units and more returns in a certain time period, and the system chokes.
I haven't looked at the graph yet, but I wouldn't be surprised to see a rise in network traffic just before the beginning of the decline.
(edit) yes a very sharp spike at 1:00 pm yesterday then a smaller peak around 2:30 followed by a sharp drop, then a slight increase of say 200-300 packets/sec with another sharp peak at 4:00 continuing at the slightly elevated level till about 9:00 when the plunge starts. Hope I've got these figures right, I don't have my glasses on at the moment! hehe
Jim

Some people plan their life out and look back at the wealth they've had.
Others live life day by day and look back at the wealth of experiences and enjoyment they've had.
ID: 625371
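
Jim's argument above boils down to simple arithmetic: if every host asks for a new work unit each time it finishes one, the request rate is roughly the number of hosts divided by the average run time, so shorter units mean more requests. The sketch below works that through. The host count and run times are invented purely for illustration; only the 50% share of fast-overflowing work units comes from Matt's note earlier in the thread, and the function names are hypothetical.

```python
def mean_runtime_hours(normal_hours: float, short_hours: float,
                       short_fraction: float) -> float:
    """Average run time when a fraction of work units finish early."""
    if not 0.0 <= short_fraction <= 1.0:
        raise ValueError("short_fraction must be between 0 and 1")
    return short_fraction * short_hours + (1.0 - short_fraction) * normal_hours


def requests_per_hour(hosts: int, runtime_hours: float) -> float:
    """Steady-state work requests per hour, one request per finished unit."""
    if runtime_hours <= 0:
        raise ValueError("runtime_hours must be positive")
    return hosts / runtime_hours


if __name__ == "__main__":
    hosts = 100_000       # invented figure, not a project statistic
    normal_hours = 4.0    # invented typical run time
    short_hours = 0.1     # invented run time for a quick overflow

    baseline = requests_per_hour(hosts, normal_hours)
    mixed = requests_per_hour(
        hosts, mean_runtime_hours(normal_hours, short_hours, 0.5))

    print(f"all normal units: {baseline:,.0f} requests/hour")
    print(f"50% short units:  {mixed:,.0f} requests/hour")
```

With these made-up inputs the request rate roughly doubles, which fits the pattern described in the thread: a server that copes with long-running units can choke once a large share of units return quickly.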
Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 628343 - Posted: 28 Aug 2007, 13:53:21 UTC - in response to Message 619245.  


Thanks to Each of You that Commented - iT was Much Appreciated . . .

UPDATE: Berkeley DID its thing - finished the 5.17's and then the Server sent me my 5.27's - w/ 5.10.13 and i UPDATED crunchR's 2.4v to the Folder and she's CRUNCHIN' & A-MUNCHIN' . . . THANKS again especially Mark ;)

~ THIS THREAD IS NOW OFFICIALLY CLOSED ~




simply leave your box connected to Berkeley and let it do its thing

- it 'corrects itself' all by itself . . . IF left alone

@ least it worked for me . . .


richard





BOINC Wiki . . .

Science Status Page . . .
ID: 628343


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.