Panic Mode On (104) Server Problems?

Message boards : Number crunching : Panic Mode On (104) Server Problems?


Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1848715 - Posted: 15 Feb 2017, 3:42:21 UTC - in response to Message 1848673.  

Can't someone write some code to prohibit a splitter from working on a file with another splitter on it?
...
Is that really hard? Or am I missing something?

I've suggested this as well, but I'm not sure the problem is quite as simple as the splitters crashing whenever more than one works on the same file. We may be making an invalid assumption.
For example, right now 22au08aa has three splitters running and 23no08ab has two, with two other files having one each. Everything is running fine, however, and we're getting about 35/sec output, which seems to be the normal maximum for the pfb splitters; it always looks like about 5/sec per pfb whenever I check.
So is it just coincidence that the crashes happen when a given tape has more than one splitter on it?
Dunno ...
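
For what it's worth, the kind of guard being suggested could be as simple as a per-file advisory lock that each splitter instance tries to take before touching a tape. This is only a hypothetical sketch, not the actual splitter code (as far as I know the real splitters coordinate through the science database):

```cpp
// Hypothetical sketch of a per-tape advisory lock; NOT the actual
// SETI@home splitter code. Each splitter instance tries to take an
// exclusive, non-blocking lock on <tape>.lock before splitting.
#include <sys/file.h>   // flock()
#include <fcntl.h>      // open()
#include <unistd.h>     // close()
#include <cstdio>

// Returns an fd holding the lock, or -1 if another splitter has the tape.
int acquire_tape_lock(const char* lockpath) {
    int fd = open(lockpath, O_CREAT | O_RDWR, 0644);
    if (fd < 0) return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {  // fail fast instead of waiting
        close(fd);
        return -1;
    }
    return fd;                                // held until close()/exit
}

int main() {
    int lock = acquire_tape_lock("/tmp/22au08aa.lock");  // example tape name
    if (lock < 0) {
        std::fprintf(stderr, "tape is busy, pick another file\n");
        return 1;
    }
    // ... split the tape here ...
    close(lock);   // releases the lock
    return 0;
}
```

One nice property of an advisory flock() is that the kernel drops it automatically if the process dies, so a crashed splitter can't leave a tape permanently claimed.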
ID: 1848715 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1848718 - Posted: 15 Feb 2017, 3:46:32 UTC - in response to Message 1848696.  

Why don't we start a fund for a robotic disk library? I used to do Tech Support on commercial tape backup robotic libraries and they aren't all that big. Should be able to find space for a robot at the CoLo or SSL or wherever the splitters are physically located. That would solve the issue of having to have a corporeal entity physically change out the disks as is done currently.

I believe it is just a matter of running a command to load the data from the storage array into the splitting queue.
When they receive a shipment of hard drives from one of the antennas they do have to copy the data to the storage array.
I don't recall the exact process for that, but I believe it is still transferred from the SSL to the CoLo over the 100Mb shared connection that the SSL has to the rest of the campus.
IIRC 2TB drives were what they asked for in the last hardware donation drive. So each drive would hold about 40 data sets (aka tapes) and a full case of 20 drives would be about 800 data sets.
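(Rough arithmetic, assuming the usual ~50 GB per Arecibo data file: 2 TB is roughly 2,000 GB, and 2,000 / 50 ≈ 40 data sets per drive, so a 20-drive case works out to about 20 × 40 = 800 data sets.)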
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1848718 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1848743 - Posted: 15 Feb 2017, 5:04:46 UTC - in response to Message 1848718.  


I believe it is just a matter of running a command to load the data from the storage array into the splitting queue.
When they receive a shipment of hard drives from one of the antennas they do have to copy the data to the storage array.
I don't recall the exact process for that, but I believe it is still transferred from the SSL to the CoLo over the 100Mb shared connection that the SSL has to the rest of the campus.
IIRC 2TB drives were what they asked for in the last hardware donation drive. So each drive would hold about 40 data sets (aka tapes) and a full case of 20 drives would be about 800 data sets.

I was just responding to kittyman's statement that either Eric or Jeff is responsible for ensuring new "tapes" (drives) are loaded to keep the splitters busy. It sounds like they can load new work remotely with a few simple commands, so no robot needed, I guess.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1848743 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1848746 - Posted: 15 Feb 2017, 5:43:22 UTC - in response to Message 1848715.  
Last modified: 15 Feb 2017, 5:43:50 UTC

we're getting about 35/sec output, which seems to be the normal maximum for the pfb splitters; it always looks like about 5/sec per pfb whenever I check.
So is it just coincidence that the crashes happen when a given tape has more than one splitter on it?

It's been an issue from around the time the PFB splitters first came on line. It used to be 1 splitter per file, and they were able to turn out WUs at sustained rates in excess of 55/s. When you've got multiple processes all trying to work on the one file, you're going to get Input/Output contention, and things will bog down. The more splitters on the one file, the greater the slowdown.
As to why splitters get stuck on certain files, could easily be due to file corruption or other issues.

If the past is any example, then with the PFB splitters limited to one per file we'd have no issues getting work, even with the lack of GBT data.


On my main system I ran out of work during the outage. I've managed to pick some up since then, but I will once again be out of work in an hour or two (unless some more comes along in the meantime). That doesn't appear too likely.
Grant
Darwin NT
ID: 1848746 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1848753 - Posted: 15 Feb 2017, 5:57:38 UTC - in response to Message 1848746.  
Last modified: 15 Feb 2017, 5:59:18 UTC

It's been an issue from around the time the PFB splitters first came on line. It used to be 1 splitter per file, and they were able to turn out WUs at sustained rates in excess of 55/s. When you've got multiple processes all trying to work on the one file, you're going to get Input/Output contention, and things will bog down. The more splitters on the one file, the greater the slowdown.
Makes sense. As I've mentioned in the past, at least on the surface this would seem to be an easy fix ...
As to why splitters get stuck on certain files, could easily be due to file corruption or other issues.

Again, that makes perfect sense, though one would think the software would have some form of watchdog timer to bail out of hangs. Again, not that tough a coding job.
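Just to illustrate the idea (a hypothetical sketch only, not how the splitters are actually structured): a monitor thread that aborts the process if the worker stops reporting progress, so a wedged splitter dies and can be restarted instead of sitting on a tape forever.

```cpp
// Hypothetical watchdog sketch; NOT actual SETI@home splitter code.
// The worker calls report_progress() after each workunit; the watchdog
// thread aborts the whole process if no progress is seen for timeout_s.
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <thread>

static std::atomic<long long> last_progress_s{0};

static long long now_s() {
    using namespace std::chrono;
    return duration_cast<seconds>(steady_clock::now().time_since_epoch()).count();
}

void report_progress() { last_progress_s = now_s(); }

void watchdog(int timeout_s) {
    for (;;) {
        std::this_thread::sleep_for(std::chrono::seconds(timeout_s));
        if (now_s() - last_progress_s > timeout_s)
            std::abort();   // hung: die noisily, let the usual restart logic recover
    }
}

int main() {
    report_progress();
    std::thread(watchdog, 300).detach();   // 5-minute stall limit (arbitrary)
    // ... main splitting loop, calling report_progress() after each workunit ...
    return 0;
}
```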
If the past is any example, then with the PFB splitters limited to one per file we'd have no issues getting work, even with the lack of GBT data.
Agreed. I'm basically back to full caches this evening after the outage, which is pretty great considering 4 of 5 boxes here were nearly empty, a couple for 6 hours or so.
On my main system I ran out of work during the outage. I've managed to pick some up since then, but I will once again be out of work in an hour or two (unless some more comes along in the meantime). That doesn't appear too likely.
Still leaves me wondering why you consistently get bad luck refilling, and I don't.
ID: 1848753 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1848755 - Posted: 15 Feb 2017, 6:05:03 UTC - in response to Message 1848746.  

we're getting about 35/sec output, which seems to be the normal maximum for the pfb splitters; it always looks like about 5/sec per pfb whenever I check.
So is it just coincidence that the crashes happen when a given tape has more than one splitter on it?

It's been an issue from around the time the PFB splitters first came on line. It used to be 1 splitter per file, and they were able to turn out WUs at sustained rates in excess of 55/s. When you've got multiple processes all trying to work on the one file, you're going to get Input/Output contention, and things will bog down. The more splitters on the one file, the greater the slowdown.
As to why splitters get stuck on certain files, could easily be due to file corruption or other issues.

If the past is any example, then with the PFB splitters limited to one per file we'd have no issues getting work, even with the lack of GBT data.


On my main system I ran out of work during the outage. I've managed to pick some up since then, but I will once again be out of work in an hour or two (unless some more comes along in the meantime). That doesn't appear too likely.


. . Same here; getting new work is like winning the lottery. If you're lucky, over time you get enough to keep ahead of the clearance rate, but I ran out yesterday on two machines and moved them both to Einstein. Mi_Burrito is back doing SETI and has enough work to get through the next few hours, and this machine (the i5) is struggling: last time I got about 30 or 40 WUs for the GPU but nothing for the CPU, so it was a good thing I could retask some to balance the load. But La_Bamba is too low and is staying on Einstein until the work situation improves. That seems better than turning it off ...

Stephen

:(
ID: 1848755 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1848756 - Posted: 15 Feb 2017, 6:08:28 UTC - in response to Message 1848753.  


On my main system I ran out of work during the outage. I've managed to pick some up since then, but I will once again be out of work in an hour or two (unless some more comes along in the meantime). That doesn't appear too likely.
Still leaves me wondering why you consistently get bad luck refilling, and I don't.

. . It isn't just Grant. Sometimes I get work consistently, and other times nothing comes in for hours. It really is a lottery for some of us.

Stephen

:(
ID: 1848756 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1848774 - Posted: 15 Feb 2017, 8:25:46 UTC
Last modified: 15 Feb 2017, 8:28:38 UTC

Possibly apropos of the mentions of stuck tapes, I stumbled on this gem while researching my own refactoring endeavours. The example points out a possible infinite loop in some of SETI's file-handling code that can be triggered by a minor glitch. I'm forwarding the details to Eric.

From The Ultimate Question of Programming, Refactoring, and Everything

20. The End-of-file (EOF) check may not be enough
The fragment is taken from SETI@home project. The error is detected by the following PVS-Studio diagnostic: V663 Infinite loop is possible. The 'cin.eof()' condition is insufficient to break from the loop. Consider adding the 'cin.fail()' function call to the conditional expression.
...
Explanation
The operation of reading data from a stream object is not as trivial as it may seem at first. When reading data from streams, programmers usually call the eof() method to check if the end of stream has been reached. This check, however, is not quite adequate as it is not sufficient and doesn't allow you to find out if any data reading errors or stream integrity failures have occurred, which may cause certain issues.
Note. The information provided in this article concerns both input and output streams. To avoid repetition, we'll only discuss one type of stream here.
This is exactly the mistake the programmer made in the code sample above: in the case of there being any data reading error, an infinite loop may occur as the eof() method will always return false. On top of that, incorrect data will be processed in the loop, as unknown values will be getting to the tmp variable.
To avoid issues like that, we need to use additional methods to check the stream status: bad(), fail().
...
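
To make the diagnostic concrete, the general shape of the bug and the usual fix look roughly like this (a generic illustration only; the actual SETI@home fragment is elided above):

```cpp
// Generic illustration of the V663 pattern; NOT the actual SETI@home code.
#include <iostream>

int main() {
    double tmp;

    // Buggy shape: if the stream enters a fail state (e.g. corrupted input),
    // eof() never becomes true, operator>> keeps failing, tmp is left with an
    // unknown value, and the loop spins forever.
    //
    // while (!std::cin.eof()) {
    //     std::cin >> tmp;
    //     /* ... process tmp ... */
    // }

    // Safer shape: test the extraction itself. The stream converts to false on
    // eof(), fail() or bad(), so both end-of-file and read errors end the loop.
    while (std::cin >> tmp) {
        // ... process tmp ...
    }
    if (std::cin.bad())
        std::cerr << "I/O error while reading input\n";
    return 0;
}
```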

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1848774 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1848775 - Posted: 15 Feb 2017, 8:31:47 UTC - in response to Message 1848774.  
Last modified: 15 Feb 2017, 8:33:35 UTC

20. The End-of-file (EOF) check may not be enough
...
On top of that, incorrect data will be processed in the loop, as unknown values will be getting to the tmp variable. To avoid issues like that, we need to use additional methods to check the stream status: bad(), fail().
...

Interesting. I wonder if that could also explain the tapes that, when they fail, end up with the end channels full of errors? After all, it doesn't make sense that those are true source-file errors when they pop up across the board and are otherwise seldom seen ... what are the odds?
ID: 1848775 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1848776 - Posted: 15 Feb 2017, 8:34:17 UTC - in response to Message 1848775.  
Last modified: 15 Feb 2017, 8:41:19 UTC

20. The End-of-file (EOF) check may not be enough
...
On top of that, incorrect data will be processed in the loop, as unknown values will be getting to the tmp variable. To avoid issues like that, we need to use additional methods to check the stream status: bad(), fail().
...

Interesting. I wonder if that could also explain the tapes that, when they fail, end up with the end channels full of errors?


With a codebase this complex, it'd certainly be one possibility among many. The code resides in a set of utility-library functions probably used by all the servers, so it could account for some amount of fragility under stress. Either way, refactoring to harden it won't hurt.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1848776 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1848837 - Posted: 15 Feb 2017, 15:15:21 UTC

Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.
ID: 1848837 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1848846 - Posted: 15 Feb 2017, 15:55:11 UTC - in response to Message 1848837.  

Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.

I have full caches now, which means many others should as well.
So, if the splitters are able to maintain present creation rate of around 38/second, RTS should start to build in the next hour or so.
That is MY prediction.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1848846 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1848854 - Posted: 15 Feb 2017, 16:36:16 UTC - in response to Message 1848846.  

We should hope.
ID: 1848854 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1848855 - Posted: 15 Feb 2017, 16:49:06 UTC - in response to Message 1848854.  

We should hope.

Indeed, LOL.
Hope springs eternal in the life of a Seti addict.

Meow!
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1848855 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1848870 - Posted: 15 Feb 2017, 18:38:18 UTC - in response to Message 1848846.  

Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.

I have full caches now, which means many others should as well.
So, if the splitters are able to maintain present creation rate of around 38/second, RTS should start to build in the next hour or so.
That is MY prediction.

Meow.

I've been bouncing off of the limits pretty much right after everything came back online after maintenance.
I guess some hosts just have better odds with work requests than others.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1848870 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1848894 - Posted: 15 Feb 2017, 21:45:17 UTC - in response to Message 1848846.  

Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.

I have full caches now, which means many others should as well.
So, if the splitters are able to maintain present creation rate of around 38/second, RTS should start to build in the next hour or so.
That is MY prediction.

Meow.


. . Well for the first time in several days my caches filled up this morning (9pm UTC), so I am hoping that is a sign that the beasty has been found and tamed. Time will tell.

Stephen

:)
ID: 1848894 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1848896 - Posted: 15 Feb 2017, 21:49:49 UTC - in response to Message 1848870.  

Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.

I have full caches now, which means many others should as well.
So, if the splitters are able to maintain present creation rate of around 38/second, RTS should start to build in the next hour or so.
That is MY prediction.

Meow.

I've been bouncing off of the limits pretty much right after everything came back online after maintenance.
I guess some hosts just have better odds with work requests than others.


. . MUCH better odds it would seem. I had to shut down one rig overnight due to lack of work. Another had run out when I got up this morning, but now they have filled up all in one go. No dribs and drabs, both caches just filled right up and I have not seen that happen for quite a few days. I am hoping that this is indeed the end of the famine. No more lingering in the lounge at the Einstein Bar and Grill ...

Stephen

:)
ID: 1848896 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1848897 - Posted: 15 Feb 2017, 21:51:34 UTC - in response to Message 1848894.  

Two of my computers had full caches early this morning, but my daily driver still hadn't reloaded. I had to go through the project-toggling exercise to get it to pick up tasks. What I find interesting is that the Windows 10 computer is able to fill out its cache within a few hours of the project coming back online. That's not the case with the Windows 7 machines, which always take 1-2 days to recover.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1848897 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1848920 - Posted: 15 Feb 2017, 23:59:02 UTC - in response to Message 1848897.  

Two of my computers had full caches early this morning, but my daily driver still hadn't reloaded. I had to go through the project-toggling exercise to get it to pick up tasks. What I find interesting is that the Windows 10 computer is able to fill out its cache within a few hours of the project coming back online. That's not the case with the Windows 7 machines, which always take 1-2 days to recover.


. . There seem to be strange patterns to this issue, but they are inconsistent, and the results vary greatly between computers and, it seems, locations.

Stephen

??
ID: 1848920 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1848929 - Posted: 16 Feb 2017, 0:53:06 UTC - in response to Message 1848897.  

Two of my computers had full caches early this morning, but my daily driver still hadn't reloaded. I had to go through the project-toggling exercise to get it to pick up tasks. What I find interesting is that the Windows 10 computer is able to fill out its cache within a few hours of the project coming back online. That's not the case with the Windows 7 machines, which always take 1-2 days to recover.

There doesn't seem to be anything specific to those that are having the issue. One of my machines is only allowed network traffic for a 2.5hr time period each day and it isn't having any issues maintaining its cache.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1848929 · Report as offensive