Panic Mode On (101) Server Problems?

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1739816 - Posted: 5 Nov 2015, 5:18:08 UTC
Last modified: 5 Nov 2015, 5:24:04 UTC

Drat. My lowly single-core machine picked up a re-send for one of the terminally-ill MBs. There goes the consecutive valid count.

http://setiathome.berkeley.edu/workunit.php?wuid=1954413233


edit: Question: if the auto-corr config values (as mentioned by Richard) are zero instead of the values they should be... then theoretically, couldn't one just open the WU in a hex editor and put those values back to something non-zero so it would crunch properly? Surely it's not that simple of a fix though...

edit2: I think I just understood from re-reading.. that's in the output result file that is from. So then it would still have to be something in the header for the WU itself that decides it can't run auto-corr, right?
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1739816 · Report as offensive
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1739821 - Posted: 5 Nov 2015, 5:57:54 UTC - in response to Message 1739815.  

Got to love the perversity of chance.

I've got 2 systems, a Core 2 Duo & an i7. Naturally the i7 can do a lot more work than the C2D.
With the present lack of work, the C2D gets work every 45min or so. The i7, every 2 (or more) hours.

Gotta love it. Like how my Pentium D just sucked down APs, but the quad core Xeon, nope ...
ID: 1739821 · Report as offensive
rob smith Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22202
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1739823 - Posted: 5 Nov 2015, 6:06:45 UTC

Well it looks as if the splitters are behaving themselves just now....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1739823 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1739826 - Posted: 5 Nov 2015, 6:20:03 UTC - in response to Message 1739823.  

Well it looks as if the splitters are behaving themselves just now....

I had been thinking that their splitting rate has been higher than it's been for some time (pretty much since the PFB splitters came in). Generally, with anything less than 5 splitters running, output is barely 27/s; with multiple splitters on the one channel, even less. So far all 7 splitters have been running on only 4 (even down to 2) files & still they're pumping out the work.
I just didn't want to tempt fate.



To add to the perversity of chance, now that so many caches are pretty much empty, 90% of the work I've been getting has been shorties.

Although there are some GPU WUs I'll keep an eye on.
04mr11ae

Estimated completion times for longer running GPU WUs are usually not much more than 35min. These ones are all around 45min.
Grant
Darwin NT
ID: 1739826 · Report as offensive
qbit
Volunteer tester
Joined: 19 Sep 04
Posts: 630
Credit: 6,868,528
RAC: 0
Austria
Message 1739840 - Posted: 5 Nov 2015, 8:29:41 UTC

It was a really nice flow lately, lots of APs gave me a nice RAC, but now it's over again. No APs, no MBs, lots of invalid tasks >>>>> had to power down my cruncher once again. Just wish this project would be a bit more stable.
ID: 1739840 · Report as offensive
Darth Beaver Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1739844 - Posted: 5 Nov 2015, 8:37:35 UTC - in response to Message 1739840.  

NX-01, you can probably blame Matt; he probably was the sucker left to do the programming changes. It's called passing the buck, hehehehehe.

Sorry Matt, couldn't resist that one :-)
ID: 1739844 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1739848 - Posted: 5 Nov 2015, 9:12:10 UTC - in response to Message 1739816.  

edit: Question: if the auto-corr config values (as mentioned by Richard) are zero instead of the values they should be... then theoretically, couldn't one just open the WU in a hex editor and put those values back to something non-zero so it would crunch properly? Surely it's not that simple of a fix though...

edit2: I think I just understood from re-reading.. that's in the output result file that is from. So then it would still have to be something in the header for the WU itself that decides it can't run auto-corr, right?

Yes, the autocorr settings I quoted were lifted from the downloaded data file before it was crunched. They could be edited between downloading and crunching, so that the proper analysis was done and reported.

But there are two flies in that ointment - one potential, one certain.

Potential: editing the WU data file would change its MD5 checksum. I think that's only checked as the download completes, but it might get checked when the task is launched as well (it probably should be). BOINC would be within its rights to reject the file for tampering.

Certain: unless you could be certain that your wingmate had also edited the data, your result would be different from all the others, and would fail validation.
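
To illustrate the potential problem, here is a minimal Python sketch of a checksum comparison. It is not BOINC's actual code: the sample bytes, the `example_setting` field, and the `file_matches` helper are all made up for illustration. It only shows that changing a single value in the data alters the MD5 digest, so any check against the originally recorded hash would fail.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hex string."""
    return hashlib.md5(data).hexdigest()


def file_matches(data: bytes, expected_md5: str) -> bool:
    """Hypothetical check: does the data still match the recorded hash?"""
    return md5_hex(data) == expected_md5


# Made-up stand-in for a workunit header; real WU files look different.
original = b"<example_setting>0</example_setting>"
recorded_md5 = md5_hex(original)  # hash as recorded when the download completed

# Simulate a hand edit that changes the zero to a non-zero value.
edited = original.replace(b">0<", b">131072<")

print(file_matches(original, recorded_md5))  # untouched data
print(file_matches(edited, recorded_md5))    # edited data
```

Whether and when the client repeats this kind of comparison after the download is exactly the open question above; the sketch assumes nothing about that.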
ID: 1739848 · Report as offensive
qbit
Volunteer tester
Joined: 19 Sep 04
Posts: 630
Credit: 6,868,528
RAC: 0
Austria
Message 1739857 - Posted: 5 Nov 2015, 9:36:43 UTC - in response to Message 1739844.  

NX-01, you can probably blame Matt; he probably was the sucker left to do the programming changes. It's called passing the buck, hehehehehe.

Sorry Matt, couldn't resist that one :-)

Don't they test changes on beta anymore before they go live on main?
ID: 1739857 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1739858 - Posted: 5 Nov 2015, 9:39:35 UTC - in response to Message 1739857.  

Don't they test changes on beta anymore before they go live on main?


Posted earlier in this very thread,
Just so you know we're working on the splitter problem - a new bit of splitter code was put into play yesterday. It was working well enough in beta, but apparently it still wasn't ready for prime time. We have some debugging and cleaning up to do but we'll be back soon enough with more workunits....

- Matt

Grant
Darwin NT
ID: 1739858 · Report as offensive
qbit
Volunteer tester
Joined: 19 Sep 04
Posts: 630
Credit: 6,868,528
RAC: 0
Austria
Message 1739864 - Posted: 5 Nov 2015, 10:52:17 UTC
Last modified: 5 Nov 2015, 10:55:13 UTC

Well, that's strange then.

BTW: Everything was running fine before, at least for me, so I wonder what problem they are trying to fix with the new code? (sorry if the answer is already in this thread somewhere).
ID: 1739864 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1739866 - Posted: 5 Nov 2015, 11:06:45 UTC - in response to Message 1739864.  
Last modified: 5 Nov 2015, 11:11:53 UTC

Well, that's strange then.

BTW: Everything was running fine before, at least for me, so I wonder what problem they are trying to fix with the new code? (sorry if the answer is already in this thread somewhere).

They've been working slowly behind the scenes for most of this year, preparing the entire processing chain (telescope --> recorder --> splitter --> application(s) --> validator --> assimilator) to handle observations made at the Green Bank observatory. The new splitters are dual-purpose, designed to handle either Arecibo or Green Bank data as required.

Edit - none of that is particularly new, I'm just repeating what Matt has posted in Technical News. See, for example, Jun 23 2015 and Aug 31 2015.
ID: 1739866 · Report as offensive
qbit
Volunteer tester
Joined: 19 Sep 04
Posts: 630
Credit: 6,868,528
RAC: 0
Austria
Message 1739869 - Posted: 5 Nov 2015, 11:18:42 UTC

Ok, thx Richard, hope they can fix everything soon.
ID: 1739869 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1739959 - Posted: 5 Nov 2015, 20:30:54 UTC - in response to Message 1739848.  
Last modified: 5 Nov 2015, 20:32:35 UTC

But there are two flies in that ointment - one potential, one certain.

Potential: editing the WU data file would change its MD5 checksum. I think that's only checked as the download completes, but it might get checked when the task is launched as well (it probably should be). BOINC would be within its rights to reject the file for tampering.

Certain: unless you could be certain that your wingmate had also edited the data, your result would be different from all the others, and would fail validation.

The MD5s are easy enough.. just make the change to the header, re-MD5 the file, put the new hash into client_state. If that MD5 is cross-checked with the scheduler upon contact, though (and I would hope that it is), then I can see that being a problem.

As far as wingmates.. I know there's hardly any guarantee that random wingmates would ever respond, and then it would be even less likely that if anyone does respond, they would know how to fix their WU the same as you did.

In the case of some of these WUs, you just need two of them out of the total of 10 to match. So if 6 or 7 of the wingmates never respond to PMs about it.. you just need one out of the total of 9 to respond and know how to do this.
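
The arithmetic above can be sketched in Python. This is a simplified illustration, not the project's actual validator: the `has_quorum` helper, the result labels, and the idea of comparing results by plain equality are all assumptions. It assumes a quorum of 2 matching results out of up to 10 tasks, as described above.

```python
from collections import Counter


def has_quorum(results, min_quorum=2):
    """Return True if at least min_quorum returned results agree.

    `results` holds one entry per task; None means the task never
    produced a usable result. Comparing by simple equality is a
    hypothetical simplification of real result validation.
    """
    counts = Counter(r for r in results if r is not None)
    return any(n >= min_quorum for n in counts.values())


# Ten tasks for one WU: eight with no usable result, two that agree.
print(has_quorum([None] * 8 + ["A", "A"]))

# Only one usable result out of ten: nothing to match it against.
print(has_quorum([None] * 9 + ["A"]))
```

The second case is the likely one here: a single result, however it was produced, has no partner to validate against.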



Of course, this is all hypothetical at best anyway, because I believe this totally falls into the category of tampering which is not only unethical but also prohibited.

I was just wondering if there was technically something that could be done on the client-side to fix these broken WUs, and I suppose I already got my answer... theoretically, yes; realistically, no.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1739959 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1739973 - Posted: 5 Nov 2015, 22:12:56 UTC
Last modified: 5 Nov 2015, 22:13:20 UTC

Caches are slowly filling, only another 300,000 to go.
Returned-per-hour is right up there, over 100,000/hr for the last 6 hours. Hopefully some of the new files will give a few more longer running WUs. Help reduce the load a bit.
Grant
Darwin NT
ID: 1739973 · Report as offensive
Starman
Joined: 15 May 99
Posts: 204
Credit: 81,351,915
RAC: 25
Canada
Message 1739984 - Posted: 5 Nov 2015, 23:27:34 UTC

I'm still getting an unusually high number of invalids.

Just me, or are others getting them as well?

Thanks
ID: 1739984 · Report as offensive
Brent Norman Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1739985 - Posted: 5 Nov 2015, 23:36:13 UTC

There are a lot of invalid tasks floating through the system due to a coding error, which I believe has been fixed.
ID: 1739985 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1739987 - Posted: 5 Nov 2015, 23:44:50 UTC - in response to Message 1739985.  

There are a lot of invalid tasks floating through the system due to a coding error, which I believe has been fixed.

I haven't noticed any errors on the WUs since they had a play with the splitter code to sort it out.
Although I've already had several _9s on my systems, those automatic error WUs will be floating around for months.
Grant
Darwin NT
ID: 1739987 · Report as offensive
Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1739988 - Posted: 5 Nov 2015, 23:50:06 UTC

I also got a lot more invalids today.

Most of them have no autocorr still.


With each crime and every kindness we birth our future.
ID: 1739988 · Report as offensive
Jeff Buck Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1739995 - Posted: 6 Nov 2015, 0:14:17 UTC - in response to Message 1739984.  

I'm still getting an unusually high number of invalids.

Just me, or are others getting them as well?

Thanks

Since the original splitter problem seems to have been fixed, everything you get from now on for those WUs will be resends, tasks _2 thru _9. Whereas a "good" WU only requires 2 hosts to put it to bed, these suckers need 5 times that many, all of it just wasted host processing. In the absence of any action by the admins to block the resends and stop wasting resources, a lot of those WUs will be circling the drain for many weeks to come.

After getting stuck with about 30 Invalids on my xw9400 in the initial wave, I've since managed to abort about 150 of those garbage tasks before they could run, freeing up a lot of processing time for actual productive work!
ID: 1739995 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1739996 - Posted: 6 Nov 2015, 0:14:37 UTC - in response to Message 1739988.  

I also got a lot more invalids today.

Most of them have no autocorr still.

Resends (or the dregs of your cache, depending on how fast you process work); it will take months to clear them all out.

And probably 90% of all your current Inconclusives will end up being Invalid as well.
Grant
Darwin NT
ID: 1739996 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.