Panic Mode On (104) Server Problems?
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
Can't someone write some code to prohibit a splitter from working on a file that another splitter is already on? I've suggested this as well, but I'm not sure it's quite as simple as the splitters crashing whenever more than one is on the same file. We may be making an invalid assumption. For example, right now 22au08aa has three splitters running and 23no08ab has two, with two other files having one each. Everything is running fine, however, and we're getting about 35/sec output, which is the normal max for pfb. At least, it always seems to be 5/sec per pfb whenever I look. So is it just coincidence that the crashes happen when a given tape has more than one splitter on it? Dunno ... |
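[Editor's note: the "prohibit two splitters on one file" idea above can be sketched with an ordinary advisory file lock. This is purely illustrative, assuming the splitters run on the same POSIX host; `try_claim_tape()` is a made-up name, not the real splitter code.]

```c
/* One way to guarantee "one splitter per tape file": take an exclusive,
 * non-blocking advisory flock() on the data file before splitting it.
 * A second splitter that tries the same file gets -1 and moves on. */
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns an open fd holding an exclusive lock on the tape file,
 * or -1 if another splitter already holds it (or on open error). */
int try_claim_tape(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* LOCK_NB: fail immediately instead of queueing behind the owner. */
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        close(fd);          /* someone else is splitting this file */
        return -1;
    }
    return fd;              /* lock released automatically on close/exit */
}
```

Because the lock is tied to the open file descriptor, it also vanishes if the splitter crashes, so a stuck tape can't stay claimed forever.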
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Why don't we start a fund for a robotic disk library? I used to do Tech Support on commercial tape backup robotic libraries and they aren't all that big. Should be able to find space for a robot at the CoLo or SSL or wherever the splitters are physically located. That would solve the issue of having to have a corporeal entity physically change out the disks as is done currently. I believe it is just a matter of running a command to load the data from the storage array into the splitting queue. When they receive a shipment of hard drives from one of the antennas they do have to copy the data to the storage array. I don't recall the exact process for that, but I believe it is still transferred from the SSL to the CoLo over the 100Mb shared connection that the SSL has to the rest of the campus. IIRC 2TB drives were what they asked for in the last hardware donation drive. So each drive would hold about 40 data sets (aka tapes) and a full case of 20 drives would be about 800 data sets. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url] |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I was just responding to kittyman's statement that it was either Eric or Jeff responsible for ensuring new "tapes" or drives were loaded to keep the splitters busy. Sounds like they can load new work remotely by simple commands. So no robot needed I guess. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
we're getting about 35/sec output, which is normal max for pfb. At least, it always seems to be 5/sec per pfb whenever I look.

It's been an issue from around the time the PFB splitters first came on line. It used to be 1 splitter per file, and they were able to turn out WUs at sustained rates in excess of 55/s. When you've got multiple processes all trying to work on the one file, you're going to get input/output contention, and things will bog down. The more splitters on the one file, the greater the slowdown. As to why splitters get stuck on certain files, it could easily be due to file corruption or other issues. If the past is any example, then if the PFB splitters were only one per file, we'd have no issues getting work even with the lack of GBT data. On my main system I ran out of work during the outage. I've managed to pick some up since then, but will once again be out of work in an hour or 2 (unless some more comes along in the meantime). Doesn't appear too likely. Grant Darwin NT |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
It's been an issue from around the time the PFB splitters first came on line. It used to be 1 splitter per file, and they were able to turn out WUs at sustained rates in excess of 55/s. When you've got multiple processes all trying to work on the one file, you're going to get Input/Output contention, and things will bog down. The more splitters on the one file, the greater the slowdown.

Makes sense. As I've mentioned in the past, at least on the surface this would seem to be an easy fix ...

As to why splitters get stuck on certain files, could easily be due to file corruption or other issues.

Again, makes perfect sense, though one would think that the software might have some form of watchdog timer going to bail out on hangs. Again, not that tough a code job.

If the past is any example then if the PFB splitters would only be one per file, we'd have no issues getting work even with the lack of GBT data.

Agreed. I'm basically back to full caches this evening after the outage, which is pretty great considering 4 of 5 boxes here were nearly empty, a couple for 6 hours or so.

On my main system I ran out of work during the outage, I've managed to pick some up since then, but will once again be out of work in an hour or 2 (unless some more comes along in the meantime). Doesn't appear too likely.

Still leaves me wondering why you consistently get bad luck refilling, and I don't. |
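[Editor's note: the watchdog-timer idea mentioned above really is a small amount of code on POSIX. A minimal sketch using `alarm()`/`SIGALRM`; the names `watchdog_arm()`/`watchdog_pet()` and the 300-second timeout are assumptions for illustration, not anything from the actual splitter.]

```c
/* Watchdog sketch: arm a countdown when splitting starts, and reset
 * ("pet") it each time a workunit is emitted.  If no progress happens
 * for WATCHDOG_SECS, SIGALRM fires and the process bails out, freeing
 * the stuck tape for a retry or for flagging as bad. */
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

#define WATCHDOG_SECS 300   /* assumed timeout; tune to the real WU pace */

static void watchdog_fired(int sig)
{
    (void)sig;
    /* Only async-signal-safe calls are allowed in a handler. */
    write(STDERR_FILENO, "splitter hung, aborting\n", 24);
    _exit(EXIT_FAILURE);
}

void watchdog_arm(void)
{
    signal(SIGALRM, watchdog_fired);
    alarm(WATCHDOG_SECS);
}

/* Call after each workunit is emitted: restarts the countdown. */
void watchdog_pet(void)
{
    alarm(WATCHDOG_SECS);
}
```

A hang anywhere in the split loop, including one caused by a corrupt file, then turns into a clean exit instead of a stuck process.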
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
we're getting about 35/sec output, which is normal max for pfb. At least, it always seems to be 5/sec per pfb whenever I look.

. . Same here, getting new work is like winning the lottery. If you are lucky, over time you get enough to keep ahead of the clearance rate, but I ran out yesterday on two machines and moved them both to Einstein. Mi_Burrito is back doing SETI and has enough work to get through the next few hours, and this machine (the i5) is struggling; last time I got about 30 or 40 WUs for the GPU but nothing for the CPU, so it was a good thing I could retask some to balance the load. But La_Bamba is too low and is staying on Einstein until the work situation improves. That seems better than turning it off ... Stephen :( |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Still leaves me wondering why you consistently get bad luck refilling, and I don't.

. . It isn't just Grant. Sometimes I get work consistently and other times there is nothing coming in for hours. It really is a lottery for some of us. Stephen :( |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Possibly apropos to the mentions of stuck tapes, I stumbled on this gem while researching for my own refactoring endeavours. The example points out a possible infinite loop in some of SETI's file-handling code that can be triggered by a minor read glitch. Forwarding details to Eric. From The Ultimate Question of Programming, Refactoring, and Everything: 20. The End-of-file (EOF) check may not be enough "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
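[Editor's note: the pitfall that article describes is easy to show in a few lines. This is a generic illustration of the pattern, not the actual SETI source: on a low-level read error, `feof()` never becomes true, `getc()` keeps returning `EOF`, and a `while (!feof(f))` loop spins forever, which is one way a splitter could get "stuck" on a tape.]

```c
#include <stdio.h>

/* Buggy pattern: infinite loop if a read ERROR (not end-of-file) occurs,
 * because feof() only reports EOF, never I/O errors. */
long count_bytes_buggy(FILE *f)
{
    long n = 0;
    while (!feof(f)) {      /* stays false forever on an I/O error! */
        int c = getc(f);
        if (c != EOF)
            n++;
    }
    return n;
}

/* Fixed pattern: stop when getc() returns EOF for ANY reason, then use
 * ferror() to tell a real I/O error apart from a normal end of file. */
long count_bytes_fixed(FILE *f)
{
    long n = 0;
    int c;
    while ((c = getc(f)) != EOF)
        n++;
    if (ferror(f))
        return -1;          /* real I/O error, not end of file */
    return n;
}
```

Both versions terminate on a clean file; only the second one also terminates (and reports the failure) when the underlying read starts erroring out.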
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
20. The End-of-file (EOF) check may not be enough

Interesting. Wonder if that could also explain tapes that, when failing, end up with the end channels full of errors? After all, it doesn't make sense that those are true source-file errors when they pop up across the board and are otherwise seldom seen ... what are the odds? |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
20. The End-of-file (EOF) check may not be enough

With codebases this complex, it'd certainly be one possibility among many. The code resides in a utility-library set of functions probably used by all the servers, so it could certainly account for some amount of fragility under stress. Either way, refactoring to harden this won't hurt. |
betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66 |
Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted. |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.

I have full caches now, which means many others should as well. So, if the splitters are able to maintain the present creation rate of around 38/second, the RTS should start to build in the next hour or so. That is MY prediction. Meow. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66 |
We should hope. |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
We should hope. Indeed, LOL. Hope springs eternal in the life of a Seti addict. Meow! "Freedom is just Chaos, with better lighting." Alan Dean Foster |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.

I've been bouncing off of the limits pretty much right after everything came back online after maintenance. I guess some hosts just have better odds with work requests than others. |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.

. . Well, for the first time in several days my caches filled up this morning (9pm UTC), so I am hoping that is a sign that the beasty has been found and tamed. Time will tell. Stephen :) |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.

. . MUCH better odds, it would seem. I had to shut down one rig overnight due to lack of work. Another had run out when I got up this morning, but now they have filled up all in one go. No dribs and drabs; both caches just filled right up, and I have not seen that happen for quite a few days. I am hoping that this is indeed the end of the famine. No more lingering in the lounge at the Einstein Bar and Grill ... Stephen :) |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Two of my computers had full caches early this morning but my daily driver still hadn't reloaded. I had to go through the project-toggling exercise to get it to fetch tasks. What I find interesting is that the Windows 10 computer is able to fill out its cache within a few hours of the project coming back online. Not the case with the Windows 7 machines, which always take 1-2 days to recover. |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Two of my computers had full caches early this morning but my daily driver still hadn't reloaded. I had to go through the project toggling exercise to get it to get tasks. What I find interesting is that the Windows 10 computer is able to fill out its cache within a few hours of the project coming back online. Not the case with the Windows 7 machines which always take 1-2 days to recover.

. . There seem to be strange anomalous patterns to this issue, but they are inconsistent and the results vary greatly between computers and, it seems, locations. Stephen ?? |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Two of my computers had full caches early this morning but my daily driver still hadn't reloaded. I had to go through the project toggling exercise to get it to get tasks. What I find interesting is that the Windows 10 computer is able to fill out its cache within a few hours of the project coming back online. Not the case with the Windows 7 machines which always take 1-2 days to recover.

There doesn't seem to be anything specific to those that are having the issue. One of my machines is only allowed network traffic for a 2.5 hr period each day and it isn't having any issues maintaining its cache. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.