Wall of Workunits (Apr 15 2008)


Message boards : Technical News : Wall of Workunits (Apr 15 2008)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1391
Credit: 74,079
RAC: 10
United States
Message 739536 - Posted: 15 Apr 2008, 22:24:02 UTC

As mentioned yesterday, the kind folks at Adaptec/SnapAppliance replaced our server. The leading theory for its failure is still localized to the ribbon cable connecting the faceplate to the motherboard, but they swapped out the whole thing anyway just to be safe. The RAID devices had to be massaged a bit and then spent all night resyncing. That wrapped up around 4am, but one of the RAID1 pairs needed to be resynced again. Once that finished, I tackled the usual Tuesday database compression/backup. Since that began early this week (no reason not to, since we were already offline), it completed around 12:30pm, and I started the public/beta projects. We'll be catching up for a while, I imagine.

The assimilator queue blossomed again, but this (I think) was mostly due to one of the four assimilators being stuck on one particular result where the uploaded file got garbled and therefore became un-parseable. I blew this result away and that one assimilator seems to have pushed through for now.

Jeff is trying to debug a new problem with the splitters: despite additional smarts/logic, some are failing mid-file, unable to find the radar blanking signal. But when we look at the file by hand, we see the signal (or at least where the signal should be). Insert sound of head scratching here. In any case, if there are fewer splitters running than normal, that's why.

Happy Tax Day, my U.S. compatriots.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Daniel Michel
Volunteer tester
Joined: 2 Feb 04
Posts: 14899
Credit: 1,343,133
RAC: 32
United States
Message 739541 - Posted: 15 Apr 2008, 22:32:23 UTC
Last modified: 15 Apr 2008, 22:34:18 UTC

From my end it seems like you guys are doing a swell job with the tools you have to work with...Now if only we could get a major band to do a benefit concert for SETI@home you could be outfitted with more of the good stuff you need...Even a fundraising concert by a bunch of lesser-known artists could raise some significant cash.

Again...thanks for keeping us updated.
____________


Proud to be TFFE

Greg
Joined: 12 Oct 07
Posts: 6
Credit: 1,031,943
RAC: 0
Australia
Message 739555 - Posted: 15 Apr 2008, 22:56:32 UTC

Speaking of a Wall of Workunits: I seem to have been allocated a huge slab while the download server was offline, before I realised there was something wrong and suspended my BOINC software. Is there any way to re-download these, or at least have them put back into circulation?

It's just that I don't like the idea of all those people waiting for credit until May 8 (when most of them expire) when I could have them processed within a few days.

Cheers,
Greg

RandyC
Joined: 20 Oct 99
Posts: 714
Credit: 1,704,345
RAC: 0
United States
Message 739561 - Posted: 15 Apr 2008, 23:08:55 UTC - in response to Message 739555.

The only way (currently) to handle this is to Detach that host. A Reset does not work. If you still have valid WUs you're crunching, set no-new-tasks, run the queue down, and then report them before doing the detach.

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 739563 - Posted: 15 Apr 2008, 23:10:52 UTC - in response to Message 739555.

What I've been doing, which seems to be working very well:

I only allow network activity for about 4 hours per day, during the evening in Berkeley.

I've got my connect interval set to less than 4 hours (0.1 seems good), and my "extra days" at about 3.

That way, the cache stays pretty full, and if the project is down for the evening, my systems aren't hammering Berkeley for work.
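
A quick sanity check on those numbers, as a minimal Python sketch. It assumes the connect interval is expressed in days (as the BOINC preferences present it) and that the network window is 4 hours per day; the variable names are made up for illustration and are not BOINC settings.

```python
HOURS_PER_DAY = 24

connect_interval_days = 0.1   # "connect interval" setting, in days
network_window_hours = 4.0    # hours per day that network activity is allowed

# Convert the connect interval to hours so it can be compared to the window.
connect_interval_hours = connect_interval_days * HOURS_PER_DAY
fits_in_window = connect_interval_hours < network_window_hours

print(f"Connect interval: {connect_interval_hours:.1f} h")
print(f"Shorter than the {network_window_hours:.0f} h window: {fits_in_window}")
```

So 0.1 days is about 2.4 hours, which is comfortably inside a 4-hour window.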

____________

Neil Blaikie
Volunteer tester
Joined: 17 May 99
Posts: 142
Credit: 6,643,590
RAC: 660
Canada
Message 739565 - Posted: 15 Apr 2008, 23:15:44 UTC
Last modified: 15 Apr 2008, 23:17:48 UTC

Good job again everyone at Berkeley. Thanks for the update Matt.

Off to enjoy the small amount of remaining evening sunshine here in Montreal and yes it will be with a nice cold beer!
____________

Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 0
United States
Message 739569 - Posted: 15 Apr 2008, 23:20:36 UTC


. . . Again Matt - Thanks for keeping us folks informed - and iT is Appreciated Sir

- THAT also goes out to each of the others @ Berkeley too . . .


____________
BOINC Wiki . . .

Science Status Page . . .

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,520
RAC: 119
Netherlands
Message 739734 - Posted: 16 Apr 2008, 7:55:25 UTC - in response to Message 739569.


Sounds like Pink Floyd. I'm glad everything is up and running again, including the RAID device. Thanks for the update, Matt.




____________

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 2003
Credit: 11,213,980
RAC: 13,914
United States
Message 740063 - Posted: 16 Apr 2008, 21:34:51 UTC

Umm, Matt... the client connection stats page is on the fritz again...
____________
.

John G
Joined: 29 Dec 01
Posts: 63
Credit: 10,142,278
RAC: 0
Canada
Message 740099 - Posted: 16 Apr 2008, 22:25:22 UTC - in response to Message 739555.

Ditto, Greg. I lost over 46 WUs today because of the problem. I had to reset the project, which I hate doing!

Cheers

Mr. Majestic
Volunteer tester
Joined: 26 Nov 07
Posts: 4752
Credit: 258,845
RAC: 0
United States
Message 740240 - Posted: 17 Apr 2008, 2:04:59 UTC

Thanks for keeping us informed Matt!
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5946
Credit: 62,407,355
RAC: 38,853
Australia
Message 740269 - Posted: 17 Apr 2008, 4:35:20 UTC - in response to Message 740099.

Ditto, Greg. I lost over 46 WUs today because of the problem. I had to reset the project, which I hate doing!
Cheers

Why?
Wait for them to be re-issued, they get crunched, you get credit.
____________
Grant
Darwin NT.

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 740590 - Posted: 17 Apr 2008, 20:24:25 UTC - in response to Message 740099.

I didn't lose any. BOINC kept retrying, and when things came back up, it picked up the relevant files.
____________

Greg
Joined: 12 Oct 07
Posts: 6
Credit: 1,031,943
RAC: 0
Australia
Message 742767 - Posted: 22 Apr 2008, 14:45:30 UTC - in response to Message 739555.

Thank you all for your assistance on this one! Detaching did the trick, and the WUs in question are dropping from my task list like flies. I took the hint from a few of you and increased my queue length a bit.

Thanks Matt and the team for keeping up the supply!

-Greg

Clyde C. Phillips, III
Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 742787 - Posted: 22 Apr 2008, 16:04:55 UTC

Would a wall of workunits be causing this kind of problem?

Task ID 820497706
Name 12mr08ah.19734.11115.10.8.88_2
Workunit 253396773
Created 19 Apr 2008 15:08:47 UTC
Sent 20 Apr 2008 10:30:09 UTC
Received 22 Apr 2008 15:58:14 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 2398546
Report deadline 14 May 2008 16:43:30 UTC
CPU time 27096.671875
stderr out <core_client_version>5.4.11</core_client_version>
<stderr_txt>
Optimized SETI@Home Enhanced application

Optimizers: Ben Herndon, Josef Segur, Alex Kan, Simon Zadra
Version: Windows SSE3 32-bit based on seti V5.15 'Ni!'
Rev: (R-2.4|xP|FFT:IPP_SSE3|Ben-Joe)
CPUID: 'Intel PD Pentium D (Presler)'
cpus: 1 cores: 2 threads: 1 cache: L1=16K L2=2048K L3=0K
features: mmx sse sse2 sse3
speed: 3412 MHz -- read megs/sec: L1=12564, L2=8307, RAM=4702

Work Unit Info
True angle range: 0.389240
Restarted at 78.07 percent.

Spikes Pulses Triplets Gaussians Flops
1 0 0 0 22417043295907

</stderr_txt>

Validate state Initial
Claimed credit 73.9615432793234
Granted credit 0
application version 5.27


It's taking almost three times the normal time to crunch.


____________

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 2003
Credit: 11,213,980
RAC: 13,914
United States
Message 742824 - Posted: 22 Apr 2008, 21:38:41 UTC - in response to Message 742787.

What is your normal CPU time for a ~74-credit WU? Nothing in that task looks out of the ordinary to me, except the "Restarted at 78 %" ... possibly the WU was interrupted for another, higher-priority WU, possibly for another project.
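
If you want to check several results for that "Restarted at" line without reading each one by eye, here is a minimal Python sketch. It assumes the stderr text looks like the excerpt from the task details posted above; the function name `restart_percentages` is hypothetical, not part of BOINC or the SETI@home application.

```python
import re

# Excerpt of the stderr text from the task details posted above.
stderr_txt = """\
Work Unit Info
True angle range: 0.389240
Restarted at 78.07 percent.

Spikes Pulses Triplets Gaussians Flops
1 0 0 0 22417043295907
"""

def restart_percentages(text):
    """Return every 'Restarted at N percent' value found in the text."""
    return [float(m) for m in re.findall(r"Restarted at ([\d.]+) percent", text)]

restarts = restart_percentages(stderr_txt)
print(f"Restarts found: {len(restarts)} -> {restarts}")
```

A result with no restarts returns an empty list, so long run times that come with restarts are easy to separate from long run times without them.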
____________
.

Clyde C. Phillips, III
Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 743193 - Posted: 23 Apr 2008, 18:21:58 UTC

About 10,000 seconds. An idea popped into my head. Maybe I'll go in and blow off all the cooling elements, fans, etc. Maybe the computer (just the one, not the other similar machine) is throttling back because its PD950 is getting a little too hot. I'll give that a try. There are more bad units today, too. There's no other project, just Seti at present. Thanks a lot.
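
To put a number on the slowdown, here is a minimal Python sketch using the CPU time from the task posted above (27096.671875 s) and the rough 10,000 s figure for a normal unit. The variable names are just for illustration, and the 10,000 s value is an estimate, not a measurement.

```python
# CPU time reported for the slow task (seconds), from the task details above.
slow_cpu_seconds = 27096.671875
# Rough CPU time for a normal unit on this machine (an estimate).
normal_cpu_seconds = 10_000.0

slowdown = slow_cpu_seconds / normal_cpu_seconds
print(f"Slowdown factor: {slowdown:.2f}x")
```

That works out to roughly 2.7 times the usual CPU time, in line with "almost three times".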
____________

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 2003
Credit: 11,213,980
RAC: 13,914
United States
Message 743560 - Posted: 24 Apr 2008, 14:21:04 UTC - in response to Message 743193.
Last modified: 24 Apr 2008, 14:25:42 UTC

Also look for anything else that might have resulted in your CPU being "throttled back" for heat: bad CPU heatsink fan, dead case fan(s), adjacent case fans blowing in opposite directions, major obstruction in airways, etc.
Don't forget the fan in your power supply!

(one of those has actually happened to me - the "dead case fan(s)")
____________
.

Clyde C. Phillips, III
Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 743640 - Posted: 24 Apr 2008, 18:36:00 UTC

Blowing out everything didn't help. I still found several bad results (long times with restarts) returned today, after blowing out the machine yesterday afternoon. All four fans are spinning at blur velocity. There can't be any adjacent case fans turning in opposite directions because there are no adjacent fans, and the machine had been crunching at normal speed up until recently. Some units are right now being done at normal speed. Maybe I could try loading a newer BOINC, but that'll almost certainly freeze SETI (Simon's cruncher). Maybe I could try Crunch3r's cruncher (if that's available) instead of Simon's.
____________

Clyde C. Phillips, III
Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 744109 - Posted: 25 Apr 2008, 18:55:39 UTC

I finally got SpeedFan installed on both computers. The errant computer's processor is at 84C, and the OK one is at only 64C. The system fan is rotating faster in the good machine, and the "CPU0" fan is at 0 RPM in the bad machine. It looked like all fans were turning there, but maybe there could be a hidden one somewhere. I guess it's a phone call to CyberPower at convenience. Thanks.
____________

Copyright © 2014 University of California