Panic Mode On (101) Server Problems?
ChrisD Send message Joined: 25 Sep 99 Posts: 158 Credit: 2,496,342 RAC: 0 |
LOL, how many Commodore cassettes does it take to hold 50GB? Anyone still have a Commodore to verify? :) As far as I remember data was written at 1200 baud and each block was written twice for error correction. (Maybe I am wrong, it might have been 600 baud only. Anyone still have that manual?) At 50 bytes/sec a 60 min cassette will hold 175 kilobytes. Where can we store 285,715 cassettes :) :) ChrisD |
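A quick sanity check of the cassette arithmetic above. The framing-overhead and duplication factors are assumptions chosen to land on the post's 50 bytes/sec figure, so this is a back-of-envelope sketch, not a statement about the actual Datasette format:

```python
# Back-of-envelope check of the Commodore cassette arithmetic.
# Assumed: 1200 baud, ~12 bits per byte on tape (framing overhead,
# assumption), and each block written twice for error correction.

BAUD = 1200                  # nominal tape data rate (from the post)
BITS_PER_BYTE_ON_TAPE = 12   # framing/overhead factor (assumption)
DUPLICATION = 2              # each block written twice (from the post)

bytes_per_sec = BAUD / BITS_PER_BYTE_ON_TAPE / DUPLICATION   # 50 B/s
tape_seconds = 60 * 60                                       # 60-min cassette
tape_bytes = bytes_per_sec * tape_seconds                    # ~175 KiB

data_bytes = 50 * 10**9                                      # the 50 GB in question
cassettes = data_bytes / tape_bytes
print(f"{tape_bytes / 1024:.0f} KiB per cassette, {cassettes:,.0f} cassettes")
```

The post's 285,715 comes from rounding the capacity down to a flat 175,000 bytes before dividing; either way it is a few hundred thousand cassettes.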
WezH Send message Joined: 19 Aug 99 Posts: 576 Credit: 67,033,957 RAC: 95 |
...that was the event I was thinking about - totally frustrating, but at least this time the splitters are running out of tapes very rapidly.... Hmm... a 90-minute tape (45 minutes on each side) will hold on the order of 150 kilobytes on each side if no compression or fast loader is used. More than I can carry :D |
Mike Send message Joined: 17 Feb 01 Posts: 34255 Credit: 79,922,639 RAC: 80 |
Just noticed one of my latest downloads has an estimated run time of 10,776 hours. It had been running for 12 minutes with only 0.001% completed. I noticed that also. I'm still at work so can't look any closer atm. We had such an issue at beta not that long ago. All invalids have no autocorr section. With each crime and every kindness we birth our future. |
Mike Send message Joined: 17 Feb 01 Posts: 34255 Credit: 79,922,639 RAC: 80 |
This Task has autocorr but also ran 100% on CPU- http://setiathome.berkeley.edu/result.php?resultid=4496130911 With each crime and every kindness we birth our future. |
Gene Send message Joined: 26 Apr 99 Posts: 150 Credit: 48,393,279 RAC: 118 |
21no11aa batch of work throwing "triplets >30" kind of error. So far this morning, I've had 12 tasks end with computation error. They are all from the 21no11aa.994.18891.5.12.xx batch. Doesn't seem to matter whether they're for CPU or GPU. I picked one GPU failed work unit and reran it (in a "benchmark" sandbox) as a CPU task; it ended with the same stderr failure. There are 8 more of these in the work buffer. They only take a few seconds to exit so I'll just let them pass through in turn. Some wingmen I can find are also showing the same triplet count error, but the tasks are being reissued to reach a quorum. /EDIT: The autocorr count is missing in the stderr result. |
Matt Lebofsky Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Just so you know we're working on the splitter problem - a new bit of splitter code was put into play yesterday. It was working well enough in beta, but apparently it still wasn't ready for prime time. We have some debugging and cleaning up to do but we'll be back soon enough with more workunits.... - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66 |
It was working well enough in beta, but apparently it still wasn't ready for prime time That is why it is called the weekly outrage. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Thanks Matt... |
Darth Beaver Send message Joined: 20 Aug 99 Posts: 6728 Credit: 21,443,075 RAC: 3 |
Hello Houston, are you there!! Houston, Ozzie 1 here, are you reading us!!! Houston, hello!! We're having difficulty reading you, Houston, are you there!!!! As the crew start to panic: "What's happened down there? We haven't heard from them in hours," says one crew member. Another says, "Oh no, WW3 has started, that's why we can't hear them. Houston has been hit with a nuke, aaaaaaaaaaaaaaahhhhhhhhhhhhh, we're doomed!" Hope you can get things sorted soon. I'm out of GPU work and I'll be out of CPU work in a few more hours. Anyway, fingers crossed you can fix the problem soon |
Wild6-NJ Send message Joined: 4 Aug 99 Posts: 43 Credit: 100,336,791 RAC: 140 |
Hello Houston, are you there!! Houston, Ozzie 1 here, are you reading us!!! Houston, hello!! We're having difficulty reading you, Houston, are you there!!!! (Apologies to the vegans out there) |
Dr Grey Send message Joined: 27 May 99 Posts: 154 Credit: 104,147,344 RAC: 21 |
|
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Still, can't say we didn't try Yeah, that's the same thing that happened back in January. Even though the splitter gets fixed, if they don't do anything to block resends for those WUs, new tasks just keep getting created and sent back out again until the WU maxes out with 10 Invalids, doing nothing but wasting host resources along the way. Very irritating! Back in January and February, I managed to abort most of the ones I received that I could identify. I'll probably start doing it again shortly with these. The thing is, that earlier batch all came from one original file, as I recall, whereas this time there seem to be multiple source files. |
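The resend behaviour Jeff describes can be modelled as a toy loop: a workunit from a bad tape fails on every host, so the scheduler keeps reissuing it until a result limit is hit. The limit name and value (`max_total_results = 10`) mirror BOINC's workunit settings, but this is an illustration of the effect, not BOINC's actual scheduler code:

```python
# Toy model of workunit resends: a broken WU never validates, so new
# tasks keep getting created until the total-results cap is reached,
# wasting host time along the way.

def run_workunit(is_broken, max_total_results=10, quorum=2):
    """Return (valid_results, total_results_sent) for one workunit."""
    valid = total = 0
    while valid < quorum and total < max_total_results:
        total += 1            # scheduler issues a task to another host
        if not is_broken:
            valid += 1        # healthy WU: each result validates
    return valid, total

print(run_workunit(is_broken=False))   # healthy: quorum reached after 2 tasks
print(run_workunit(is_broken=True))    # broken: burns all 10 results, 0 valid
```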
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
Woke up this morning to find 55 Invalids. Notice also that the Server Status shows 5 splitters running, but the Splitter Status shows only 3, all working on the last remaining file. Work in progress has dropped by around 1 million. It's going to take a very long time to recover from this outage once the splitters are sorted out. Grant Darwin NT |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
I got my first batch from the new tape about half an hour ago. All shorties, and with the new reduced file size (which I suspect is deliberate - it doesn't seem to be a problem by itself). But at least the v7 processing seems to be working properly for this batch. I've also seen a couple of changes made to the splitters, to make it less likely they'll lose their configuration data, and to shut them down automatically if it all goes wrong. Time will tell. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
More odd WUs. 26oc11ac.1324.4157.12.12.89_1 26oc11ac.1324.4157.12.12.107_1 26oc11ac.1324.4157.12.12.65_0 26oc11ac.1324.4157.12.12.59_0 26oc11ac.1324.4157.12.12.71_1 26oc11ac.1324.4157.12.12.77_1 26oc11ac.1324.4157.12.12.83_1 26oc11ac.1324.4157.12.12.234_1 26oc11ac.1324.4157.12.12.113_1 26oc11ac.1324.4157.12.12.240_1 26oc11ac.1324.4157.12.12.47_0 26oc11ac.1324.4157.12.12.41_1 All shorties. They start running, % Progress counts up till they get to about 5%, then it resets to zero. Elapsed time continues to run, Progress just sits on 0%. Aborted all. Grant Darwin NT |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
At least you let them run long enough to display Find triplets Cuda kernel encountered too many triplets, or bins above threshold, reprocessing this PoT on CPU... in stderr.txt I gave Jason one of those (run to completion, so we could be sure it wasn't the "too many triplets" half of that information message), but he hasn't commented on the alternative threshold levels yet. Reporting pseudo-progress until the first checkpoint is standard for your v7.6.6 client. It's supposed to reassure you that something is happening. Edit: 26oc11ac? that tape was split some 14 hours ago, while you were asleep. Since then, we've had several hours without work, and now new tapes with new splitters. I'll reserve judgement until the morning. |
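The pseudo-progress behaviour Richard describes can be sketched as follows: until the science app writes its first checkpoint, the client displays a simulated, elapsed-time-based fraction; once a checkpoint reports a real fraction done, that takes over. This mimics the visible symptom (progress climbs, then snaps back to the checkpointed value); the 5% ceiling is an assumption taken from the symptom, and none of this is the actual BOINC client source:

```python
# Sketch of pre-checkpoint "pseudo-progress" display. Before the first
# checkpoint, show a simulated fraction that creeps toward a small cap;
# afterwards, show the app's real checkpointed fraction.

def displayed_progress(elapsed_s, estimated_s, checkpoint_fraction=None):
    if checkpoint_fraction is not None:
        return checkpoint_fraction        # real progress from the app
    cap = 0.05                            # assumed ~5% ceiling, per the symptom
    # exponential creep toward (but never reaching) the cap
    return cap * (1 - 0.5 ** (elapsed_s / (0.1 * estimated_s)))

# Climbs while no checkpoint exists...
print(displayed_progress(60, 720))
# ...then snaps to the (zero) checkpointed fraction, as Grant observed.
print(displayed_progress(400, 720, checkpoint_fraction=0.0))
```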
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
Reporting pseudo-progress until the first checkpoint is standard for your v7.6.6 client. It's supposed to reassure you that something is happening. It certainly does. Watching time ticking away with no progress being made is... unsettling. Especially since the day before I had a WU that ran for 30min with progress stuck at 0.001% and the estimated run time had climbed to 10,776 hours. WUs in question, 14jl11ac.12197.15609.3.12.158_0 14jl11ac.12197.15609.3.12.156_1 The next WUs I got ran OK, but were very, very, very short. 08ap11ae.30787.24607.9.12.242_1 08ap11ae.30787.24607.9.12.88_1 08ap11ae.30787.24607.9.12.248_0 2 to GPU, 1 to CPU. GPU estimated run times were under 3 min, took 1:43 (usual shorty estimate 12min) CPU estimated run time about 35min, 10% done in 3 min (usual shorty estimate 1hr 40m). Completed OK. Grant Darwin NT |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
Another couple of odd WUs. 16se11ab.25031.20517.5.12.238_3 23oc11ah.12765.24804.6.12.19_3 GPU WU, don't know what the estimated run times were, but they completed in just over 3min 30s. Usual time to completion for GPU shorties is 13-16min. The result of no autocorrelation? Even so, before it was introduced shortie WUs (running 2 at a time) took way longer than 3min 30 to process. Should be out of GPU work on this system in the next 30min or so. Grant Darwin NT |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
Just noticed some more anomalies, these ones VLARs. Usual runtime on this system is 4-4.5hrs. Estimated run times for these VLARs- 1hr 50min- 2hr 2min. Would normally take most of the day to get to them. Will suspend other work & see how they go. 16oc11aa.20967.14169.7.12.90.vlar_0 29ap11ad2518.14791.8.12.129.vlar_2 21no11aa.31868.24617.13.12.60.vlar_2 EDIT- All of the 16oc11aa WUs ran for 4 secs & then finished. Same with 26oc11ac & 26ap11ab WUs. 21no11aa, one completed after 2min 20s, others still running. Other WUs still running. And another of the running, but no Progress WUs. 13min and counting, 0.000% done, estimated time remaining- 23,664hrs & climbing. Aborted. 26ap11ab.9472.85225.14.12.42_2 Grant Darwin NT |
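The runaway "time remaining" figure follows from the stuck progress: with the fraction done pinned at (effectively) zero, any extrapolation of elapsed time over fraction diverges. A naive linear extrapolation is shown below; BOINC's real estimator also applies per-host correction factors, so the exact 23,664-hour figure isn't reproduced here:

```python
# Why "time remaining" explodes when progress is stuck near zero:
# a linear extrapolation of elapsed / fraction_done diverges.

def naive_remaining_hours(elapsed_min, fraction_done):
    if fraction_done <= 0:
        return float("inf")              # no progress: unbounded estimate
    total_min = elapsed_min / fraction_done
    return (total_min - elapsed_min) / 60

print(naive_remaining_hours(13, 0.0))        # stuck at 0.000% done
print(naive_remaining_hours(13, 0.00001))    # 0.001% done: still enormous
```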
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
Got to love the perversity of chance. I've got 2 systems, a Core 2 Duo & an i7. Naturally the i7 can do a lot more work than the C2D. With the present lack of work, the C2D gets work every 45min or so. The i7, every 2 (or more) hours. Grant Darwin NT |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.