Panic Mode On (104) Server Problems?

Message boards : Number crunching : Panic Mode On (104) Server Problems?


Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1848715 - Posted: 15 Feb 2017, 3:42:21 UTC - in response to Message 1848673.  

Can't someone write some code to prohibit a splitter from working on a file with another splitter on it?
...
Is that really hard? Or am I missing something?

I've suggested this as well, but I'm not sure the problem is quite as simple as the splitters crashing whenever more than one works on the same file. We may be making an invalid assumption.
For example, right now 22au08aa has three splitters running and 23no08ab has two, with two other files having one each. Everything is running fine, however, and we're getting about 35/sec output, which seems to be the normal maximum for the pfb splitters; it always looks like about 5/sec per pfb whenever I check.
So is it just coincidence that the crashes happen when a given tape has more than one splitter on it?
Dunno ...
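
For what it's worth, the kind of guard being suggested could be as simple as a per-file advisory lock that each splitter instance tries to take before touching a tape. This is only a hypothetical sketch, not the actual splitter code (as far as I know the real splitters coordinate through the science database):

```cpp
// Hypothetical sketch of a per-tape advisory lock; NOT the actual
// SETI@home splitter code. Each splitter instance tries to take an
// exclusive, non-blocking lock on <tape>.lock before splitting.
#include <sys/file.h>   // flock()
#include <fcntl.h>      // open()
#include <unistd.h>     // close()
#include <cstdio>

// Returns an fd holding the lock, or -1 if another splitter has the tape.
int acquire_tape_lock(const char* lockpath) {
    int fd = open(lockpath, O_CREAT | O_RDWR, 0644);
    if (fd < 0) return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {  // fail fast instead of waiting
        close(fd);
        return -1;
    }
    return fd;                                // held until close()/exit
}

int main() {
    int lock = acquire_tape_lock("/tmp/22au08aa.lock");  // example tape name
    if (lock < 0) {
        std::fprintf(stderr, "tape is busy, pick another file\n");
        return 1;
    }
    // ... split the tape here ...
    close(lock);   // releases the lock
    return 0;
}
```

One nice property of an advisory flock() is that the kernel drops it automatically if the process dies, so a crashed splitter can't leave a tape permanently claimed.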
ID: 1848715 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1848718 - Posted: 15 Feb 2017, 3:46:32 UTC - in response to Message 1848696.  

Why don't we start a fund for a robotic disk library? I used to do Tech Support on commercial tape backup robotic libraries and they aren't all that big. Should be able to find space for a robot at the CoLo or SSL or wherever the splitters are physically located. That would solve the issue of having to have a corporeal entity physically change out the disks as is done currently.

I believe it is just a matter of running a command to load the data from the storage array into the splitting queue.
When they receive a shipment of hard drives from one of the antennas they do have to copy the data to the storage array.
I don't recall the exact process for that, but I believe it is still transferred from the SSL to the CoLo over the 100Mb shared connection that the SSL has to the rest of the campus.
IIRC 2TB drives were what they asked for in the last hardware donation drive. So each drive would hold about 40 data sets (aka tapes) and a full case of 20 drives would be about 800 data sets.
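(Rough arithmetic, assuming the usual ~50 GB per Arecibo data file: 2 TB is roughly 2,000 GB, and 2,000 / 50 ≈ 40 data sets per drive, so a 20-drive case works out to about 20 × 40 = 800 data sets.)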
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1848718 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1848743 - Posted: 15 Feb 2017, 5:04:46 UTC - in response to Message 1848718.  


I believe it is just a matter of running a command to load the data from the storage array into the splitting queue.
When they receive a shipment of hard drives from one of the antennas they do have to copy the data to the storage array.
I don't recall the exact process for that, but I believe it is still transferred from the SSL to the CoLo over the 100Mb shared connection that the SSL has to the rest of the campus.
IIRC 2TB drives were what they asked for in the last hardware donation drive. So each drive would hold about 40 data sets (aka tapes) and a full case of 20 drives would be about 800 data sets.

I was just responding to kittyman's statement that either Eric or Jeff is responsible for ensuring new "tapes" (drives) are loaded to keep the splitters busy. It sounds like they can load new work remotely with a few simple commands, so no robot needed, I guess.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1848743 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1848746 - Posted: 15 Feb 2017, 5:43:22 UTC - in response to Message 1848715.  
Last modified: 15 Feb 2017, 5:43:50 UTC

we're getting about 35/sec output, which seems to be the normal maximum for the pfb splitters; it always looks like about 5/sec per pfb whenever I check.
So is it just coincidence that the crashes happen when a given tape has more than one splitter on it?

It's been an issue from around the time the PFB splitters first came on line. It used to be 1 splitter per file, and they were able to turn out WUs at sustained rates in excess of 55/s. When you've got multiple processes all trying to work on the one file, you're going to get Input/Output contention, and things will bog down. The more splitters on the one file, the greater the slowdown.
As to why splitters get stuck on certain files, could easily be due to file corruption or other issues.

If the past is any example, then with the PFB splitters limited to one per file we'd have no issues getting work, even with the lack of GBT data.


On my main system I ran out of work during the outage. I've managed to pick some up since then, but I will once again be out of work in an hour or two (unless some more comes along in the meantime). That doesn't appear too likely.
Grant
Darwin NT
ID: 1848746 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1848753 - Posted: 15 Feb 2017, 5:57:38 UTC - in response to Message 1848746.  
Last modified: 15 Feb 2017, 5:59:18 UTC

It's been an issue from around the time the PFB splitters first came on line. It used to be 1 splitter per file, and they were able to turn out WUs at sustained rates in excess of 55/s. When you've got multiple processes all trying to work on the one file, you're going to get Input/Output contention, and things will bog down. The more splitters on the one file, the greater the slowdown.
Makes sense. As I've mentioned in the past, at least on the surface this would seem to be an easy fix ...
As to why splitters get stuck on certain files, could easily be due to file corruption or other issues.

Again, that makes perfect sense, though one would think the software would have some form of watchdog timer to bail out of hangs. Again, not that tough a coding job.
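Just to illustrate the idea (a hypothetical sketch only, not how the splitters are actually structured): a monitor thread that aborts the process if the worker stops reporting progress, so a wedged splitter dies and can be restarted instead of sitting on a tape forever.

```cpp
// Hypothetical watchdog sketch; NOT actual SETI@home splitter code.
// The worker calls report_progress() after each workunit; the watchdog
// thread aborts the whole process if no progress is seen for timeout_s.
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <thread>

static std::atomic<long long> last_progress_s{0};

static long long now_s() {
    using namespace std::chrono;
    return duration_cast<seconds>(steady_clock::now().time_since_epoch()).count();
}

void report_progress() { last_progress_s = now_s(); }

void watchdog(int timeout_s) {
    for (;;) {
        std::this_thread::sleep_for(std::chrono::seconds(timeout_s));
        if (now_s() - last_progress_s > timeout_s)
            std::abort();   // hung: die noisily, let the usual restart logic recover
    }
}

int main() {
    report_progress();
    std::thread(watchdog, 300).detach();   // 5-minute stall limit (arbitrary)
    // ... main splitting loop, calling report_progress() after each workunit ...
    return 0;
}
```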
If the past is any example, then with the PFB splitters limited to one per file we'd have no issues getting work, even with the lack of GBT data.
Agreed. I'm basically back to full caches this evening after the outage, which is pretty great considering 4 of 5 boxes here were nearly empty, a couple for 6 hours or so.
On my main system I ran out of work during the outage. I've managed to pick some up since then, but I will once again be out of work in an hour or two (unless some more comes along in the meantime). That doesn't appear too likely.
Still leaves me wondering why you consistently get bad luck refilling, and I don't.
ID: 1848753 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1848755 - Posted: 15 Feb 2017, 6:05:03 UTC - in response to Message 1848746.  

we're getting about 35/sec output, which seems to be the normal maximum for the pfb splitters; it always looks like about 5/sec per pfb whenever I check.
So is it just coincidence that the crashes happen when a given tape has more than one splitter on it?

It's been an issue from around the time the PFB splitters first came on line. It used to be 1 splitter per file, and they were able to turn out WUs at sustained rates in excess of 55/s. When you've got multiple processes all trying to work on the one file, you're going to get Input/Output contention, and things will bog down. The more splitters on the one file, the greater the slowdown.
As to why splitters get stuck on certain files, could easily be due to file corruption or other issues.

If the past is any example, then with the PFB splitters limited to one per file we'd have no issues getting work, even with the lack of GBT data.


On my main system I ran out of work during the outage. I've managed to pick some up since then, but I will once again be out of work in an hour or two (unless some more comes along in the meantime). That doesn't appear too likely.


. . Same here; getting new work is like winning the lottery. If you're lucky, over time you get enough to keep ahead of the clearance rate, but I ran out yesterday on two machines and moved them both to Einstein. Mi_Burrito is back doing SETI and has enough work to get through the next few hours, and this machine (the i5) is struggling: last time I got about 30 or 40 WUs for the GPU but nothing for the CPU, so it was a good thing I could retask some to balance the load. But La_Bamba is too low and is staying on Einstein until the work situation improves. That seems better than turning it off ...

Stephen

:(
ID: 1848755 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1848756 - Posted: 15 Feb 2017, 6:08:28 UTC - in response to Message 1848753.  


On my main system I ran out of work during the outage. I've managed to pick some up since then, but I will once again be out of work in an hour or two (unless some more comes along in the meantime). That doesn't appear too likely.
Still leaves me wondering why you consistently get bad luck refilling, and I don't.

. . It isn't just Grant. Sometimes I get work consistently, and other times nothing comes in for hours. It really is a lottery for some of us.

Stephen

:(
ID: 1848756 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1848774 - Posted: 15 Feb 2017, 8:25:46 UTC
Last modified: 15 Feb 2017, 8:28:38 UTC

Possibly apropos of the mentions of stuck tapes, I stumbled on this gem while researching my own refactoring endeavours. The example points out a possible infinite loop in some of SETI's file-handling code that can be triggered by a minor glitch. I'm forwarding the details to Eric.

From The Ultimate Question of Programming, Refactoring, and Everything

20. The End-of-file (EOF) check may not be enough
The fragment is taken from SETI@home project. The error is detected by the following PVS-Studio diagnostic: V663 Infinite loop is possible. The 'cin.eof()' condition is insufficient to break from the loop. Consider adding the 'cin.fail()' function call to the conditional expression.
...
Explanation
The operation of reading data from a stream object is not as trivial as it may seem at first. When reading data from streams, programmers usually call the eof() method to check if the end of stream has been reached. This check, however, is not quite adequate as it is not sufficient and doesn't allow you to find out if any data reading errors or stream integrity failures have occurred, which may cause certain issues.
Note. The information provided in this article concerns both input and output streams. To avoid repetition, we'll only discuss one type of stream here.
This is exactly the mistake the programmer made in the code sample above: in the case of there being any data reading error, an infinite loop may occur as the eof() method will always return false. On top of that, incorrect data will be processed in the loop, as unknown values will be getting to the tmp variable.
To avoid issues like that, we need to use additional methods to check the stream status: bad(), fail().
...
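
To make the diagnostic concrete, the general shape of the bug and the usual fix look roughly like this (a generic illustration only; the actual SETI@home fragment is elided above):

```cpp
// Generic illustration of the V663 pattern; NOT the actual SETI@home code.
#include <iostream>

int main() {
    double tmp;

    // Buggy shape: if the stream enters a fail state (e.g. corrupted input),
    // eof() never becomes true, operator>> keeps failing, tmp is left with an
    // unknown value, and the loop spins forever.
    //
    // while (!std::cin.eof()) {
    //     std::cin >> tmp;
    //     /* ... process tmp ... */
    // }

    // Safer shape: test the extraction itself. The stream converts to false on
    // eof(), fail() or bad(), so both end-of-file and read errors end the loop.
    while (std::cin >> tmp) {
        // ... process tmp ...
    }
    if (std::cin.bad())
        std::cerr << "I/O error while reading input\n";
    return 0;
}
```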

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1848774 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1848775 - Posted: 15 Feb 2017, 8:31:47 UTC - in response to Message 1848774.  
Last modified: 15 Feb 2017, 8:33:35 UTC

20. The End-of-file (EOF) check may not be enough
...
On top of that, incorrect data will be processed in the loop, as unknown values will be getting to the tmp variable. To avoid issues like that, we need to use additional methods to check the stream status: bad(), fail().
...

Interesting. I wonder if that could also explain the tapes that, when they fail, end up with the end channels full of errors? After all, it doesn't make sense that those are true source-file errors when they pop up across the board and are otherwise seldom seen ... what are the odds?
ID: 1848775 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1848776 - Posted: 15 Feb 2017, 8:34:17 UTC - in response to Message 1848775.  
Last modified: 15 Feb 2017, 8:41:19 UTC

20. The End-of-file (EOF) check may not be enough
...
On top of that, incorrect data will be processed in the loop, as unknown values will be getting to the tmp variable. To avoid issues like that, we need to use additional methods to check the stream status: bad(), fail().
...

Interesting. I wonder if that could also explain the tapes that, when they fail, end up with the end channels full of errors?


With a codebase this complex, it'd certainly be one possibility among many. The code resides in a set of utility-library functions probably used by all the servers, so it could account for some amount of fragility under stress. Either way, refactoring to harden it won't hurt.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1848776 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1848837 - Posted: 15 Feb 2017, 15:15:21 UTC

Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.
ID: 1848837 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1848846 - Posted: 15 Feb 2017, 15:55:11 UTC - in response to Message 1848837.  

Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.

I have full caches now, which means many others should as well.
So, if the splitters are able to maintain present creation rate of around 38/second, RTS should start to build in the next hour or so.
That is MY prediction.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1848846 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1848854 - Posted: 15 Feb 2017, 16:36:16 UTC - in response to Message 1848846.  

We should hope.
ID: 1848854 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1848855 - Posted: 15 Feb 2017, 16:49:06 UTC - in response to Message 1848854.  

We should hope.

Indeed, LOL.
Hope springs eternal in the life of a Seti addict.

Meow!
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1848855 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1848870 - Posted: 15 Feb 2017, 18:38:18 UTC - in response to Message 1848846.  

Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.

I have full caches now, which means many others should as well.
So, if the splitters are able to maintain present creation rate of around 38/second, RTS should start to build in the next hour or so.
That is MY prediction.

Meow.

I've been bouncing off of the limits pretty much right after everything came back online after maintenance.
I guess some hosts just have better odds with work requests than others.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1848870 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1848894 - Posted: 15 Feb 2017, 21:45:17 UTC - in response to Message 1848846.  

Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.

I have full caches now, which means many others should as well.
So, if the splitters are able to maintain present creation rate of around 38/second, RTS should start to build in the next hour or so.
That is MY prediction.

Meow.


. . Well for the first time in several days my caches filled up this morning (9pm UTC), so I am hoping that is a sign that the beasty has been found and tamed. Time will tell.

Stephen

:)
ID: 1848894 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1848896 - Posted: 15 Feb 2017, 21:49:49 UTC - in response to Message 1848870.  

Over 12 hrs since the outage ended and the RTS remains drained. This is what I predicted.

I have full caches now, which means many others should as well.
So, if the splitters are able to maintain present creation rate of around 38/second, RTS should start to build in the next hour or so.
That is MY prediction.

Meow.

I've been bouncing off of the limits pretty much right after everything came back online after maintenance.
I guess some hosts just have better odds with work requests than others.


. . MUCH better odds it would seem. I had to shut down one rig overnight due to lack of work. Another had run out when I got up this morning, but now they have filled up all in one go. No dribs and drabs, both caches just filled right up and I have not seen that happen for quite a few days. I am hoping that this is indeed the end of the famine. No more lingering in the lounge at the Einstein Bar and Grill ...

Stephen

:)
ID: 1848896 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1848897 - Posted: 15 Feb 2017, 21:51:34 UTC - in response to Message 1848894.  

Two of my computers had full caches early this morning, but my daily driver still hadn't reloaded. I had to go through the project-toggling exercise to get it to pick up tasks. What I find interesting is that the Windows 10 computer is able to fill out its cache within a few hours of the project coming back online. That's not the case with the Windows 7 machines, which always take 1-2 days to recover.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1848897 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1848920 - Posted: 15 Feb 2017, 23:59:02 UTC - in response to Message 1848897.  

Two of my computers had full caches early this morning, but my daily driver still hadn't reloaded. I had to go through the project-toggling exercise to get it to pick up tasks. What I find interesting is that the Windows 10 computer is able to fill out its cache within a few hours of the project coming back online. That's not the case with the Windows 7 machines, which always take 1-2 days to recover.


. . There seem to be strange patterns to this issue, but they are inconsistent, and the results vary greatly between computers and, it seems, locations.

Stephen

??
ID: 1848920 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1848929 - Posted: 16 Feb 2017, 0:53:06 UTC - in response to Message 1848897.  

Two of my computers had full caches early this morning, but my daily driver still hadn't reloaded. I had to go through the project-toggling exercise to get it to pick up tasks. What I find interesting is that the Windows 10 computer is able to fill out its cache within a few hours of the project coming back online. That's not the case with the Windows 7 machines, which always take 1-2 days to recover.

There doesn't seem to be anything specific to those that are having the issue. One of my machines is only allowed network traffic for a 2.5hr time period each day and it isn't having any issues maintaining its cache.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1848929 · Report as offensive