Panic Mode On (78) Server Problems?

Author	Message
kittyman Volunteer tester Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1302178 - Posted: 4 Nov 2012, 17:31:58 UTC - in response to Message 1302169.
"Anybody got any idea whether we're being haunted by Astropulse ghosts, to the same degree? I think it's getting to the point that I need to put out an urgent APB, reluctant though I am to do so on a Sunday - just to get a lid put on the situation, until they can look at it tomorrow or Tuesday. The question is - is stopping the MB splitters enough, or should I ask them to stop AP as well? With splitters stopped, and no new work entering the system, we could allow work fetch and fish for resends of the work we've already supposedly got."
I would just stop the AP splitters. MB was chugging along just fine until the AP work entered the picture again.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

Fred E. Volunteer tester Joined: 22 Jul 99 Posts: 768 Credit: 24,140,697 RAC: 0	Message 1302179 - Posted: 4 Nov 2012, 17:35:13 UTC
"The question is - is stopping the MB splitters enough, or should I ask them to stop AP as well? With splitters stopped, and no new work entering the system, we could allow work fetch and fish for resends of the work we've already supposedly got."
Numbers for AP are not as large, but I have 8 lost AP that were not sent when the scheduler gave me new work instead. "Results" out in the field are pretty big: 10,685,000 for MB and 137k for AP. Suggest they try to shut off both.
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

Fred J. Verster Volunteer tester Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0	Message 1302180 - Posted: 4 Nov 2012, 17:38:37 UTC - in response to Message 1302037.
One rig couldn't connect to the SETI servers and displayed the message: don't need a network connection ?! After reinstalling BOINC several times, my account settings in SETI were altered?! And in Malaria and Rosetta, too. My BOOT-drive (C:) was also shared. In too many cases, it's not a good idea to run multiple projects, on 1 host. Apologies to my wingmen for a few hundred MB WUs, which could not be uploaded due to this network issue and timed out.

fscheel Joined: 13 Apr 12 Posts: 73 Credit: 11,135,641 RAC: 0	Message 1302181 - Posted: 4 Nov 2012, 17:41:36 UTC
As Marcel said... "Just shoot up here amongst us, one of us has got to have some relief." :)

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1302186 - Posted: 4 Nov 2012, 17:54:21 UTC - in response to Message 1302178.
"I would just stop the AP splitters. MB was chugging along just fine until the AP work entered the picture again."
I sympathise with that assessment of the possible trigger, but I think we've got beyond that point now. The database is horrendously bloated - well over 10.5 million tasks supposedly 'out in the field', which is 50% more than usual. The biggest problem right now isn't communications - when you get resends, they come down quite smoothly - but the slow "thinking time" response of the scheduler. I don't think simply stopping AP production on its own will free up enough scheduler and database resources to stop the timeouts and the creation of new ghosts.
On the other hand, I agree stopping MB production is a drastic step, and it will impact loads of volunteers with hosts like my little one-core server - which has been plinking along exactly as designed, getting new tasks as needed (and rotating through three different projects). No surplus fat to live off there - one task in progress it says, and I can see it running now.
But my gut feeling is saying, quite strongly, that recovery from this problem is going to take an outage of some sort - and the sooner we start it, the shorter it will be.

kittyman Volunteer tester Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1302188 - Posted: 4 Nov 2012, 17:59:17 UTC - in response to Message 1302186.
"I would just stop the AP splitters. MB was chugging along just fine until the AP work entered the picture again."
"I sympathise with that assessment of the possible trigger, but I think we've got beyond that point now. The database is horrendously bloated - well over 10.5 million tasks supposedly 'out in the field', which is 50% more than usual. The biggest problem right now isn't communications - when you get resends, they come down quite smoothly - but the slow 'thinking time' response of the scheduler. I don't think simply stopping AP production on its own will free up enough scheduler and database resources to stop the timeouts and the creation of new ghosts. On the other hand, I agree stopping MB production is a drastic step, and it will impact loads of volunteers with hosts like my little one-core server - which has been plinking along exactly as designed, getting new tasks as needed (and rotating through three different projects). No surplus fat to live off there - one task in progress it says, and I can see it running now. But my gut feeling is saying, quite strongly, that recovery from this problem is going to take an outage of some sort - and the sooner we start it, the shorter it will be."
Well, I would not be the one to second guess your intuition, Richard. You seldom are off base.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304	Message 1302198 - Posted: 4 Nov 2012, 18:13:35 UTC - in response to Message 1302188.
There are going to be tens of thousands of timeouts when we run out of shorties & all the VLAR resends start going to NVidia GPUs.
Overnight I got a few resends on my systems, but it's still mostly timeouts. There's been the odd HTTP error, "couldn't reach Scheduler", and when it was reached, "Project has no tasks available". But mostly it's still timeouts. Been that way pretty much since a few hours after the last weekly outage, and I notice the database queries are still over 1,000/s where less than 800/s is the usual number.
Grant
Darwin NT

Highlander Joined: 5 Oct 99 Posts: 167 Credit: 37,987,668 RAC: 16	Message 1302206 - Posted: 4 Nov 2012, 18:24:54 UTC
Regarding AP ghosts: on my host http://setiathome.berkeley.edu/show_host_detail.php?hostid=5553346, which only does AP, there are no ghost units at all so far. My other host, which is set up as MB only, has some, around 100. And only for this host do I manually suspend network communication (opening it once a day for 2-3 hours), since I don't want to put extra load on the scheduler because of NNT.
- Performance is not a simple linear function of the number of CPUs you throw at the problem. -

Keith White Joined: 29 May 99 Posts: 392 Credit: 13,035,233 RAC: 22	Message 1302238 - Posted: 4 Nov 2012, 20:00:18 UTC
What I'm seeing now. This is about MB units, I don't do APs.
Now that the "ready to send" queue has dried up, I've been able to get scheduler requests completed even when allowing new tasks. Not getting any new units, but I would expect that with the ready queue running on vapors. However, Ghost Detector shows I currently have 196 ghost units that were assigned but never downloaded, and I should be getting those, eventually. All ATI, almost all shorties BTW.
Problem is that's nearly half of my "in progress" units. And the vast majority of the ones I do have are shorties. I'm guessing I'm down to a couple of days' worth of units locally right now. Que sera sera.
"Life is just nature's way of keeping meat fresh." - The Doctor

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1302257 - Posted: 4 Nov 2012, 20:41:00 UTC
I've just had a note back from Eric:
"I've stopped the splitters and doubled the httpd timeout... I think we're going to need to at least temporarily go back to restricting workunits in progress on a per host basis and per RPC basis, regardless of what complaints we get about people being unable to keep their hosts busy."
The splitters are already showing red/orange on the server status page, and 'ready to send' is as near zero as makes no difference (there'll always be a few errors and timeouts to resend). So I'm going to turn off NNT and see what happens - let's see if we can help get this beast back under control.

fscheel Joined: 13 Apr 12 Posts: 73 Credit: 11,135,641 RAC: 0	Message 1302262 - Posted: 4 Nov 2012, 20:48:04 UTC - in response to Message 1302257.
"I've just had a note back from Eric: I've stopped the splitters and doubled the httpd timeout... I think we're going to need to at least temporarily go back to restricting workunits in progress on a per host basis and per RPC basis, regardless of what complaints we get about people being unable to keep their hosts busy. The splitters are already showing red/orange on the server status page, and 'ready to send' is as near zero as makes no difference (there'll always be a few errors and timeouts to resend). So I'm going to turn off NNT and see what happens - let's see if we can help get this beast back under control."
I have NNT turned off on three machines with empty caches and lots of ghost tasks. So far all I get is "Project has no tasks available".

Fred E. Volunteer tester Joined: 22 Jul 99 Posts: 768 Credit: 24,140,697 RAC: 0	Message 1302265 - Posted: 4 Nov 2012, 20:52:09 UTC
I decided to try to get some of my lost tasks since the splitters are disabled and there is no new work available. However, I didn't get any of my 467 lost tasks. I got the "no tasks available" message on two attempts. Will have to see how this plays out. I'm back to NNT for now.
Database queries are still high, but I got fast responses on those requests: 7 and 8 seconds for work requests ain't bad. Bet Synergy would always run well without the splitting duties.
Need a pool to guess how many ghosts are out there, but guess we won't see the number.
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

rob smith Volunteer moderator Volunteer tester Joined: 7 Mar 03 Posts: 22189 Credit: 416,307,556 RAC: 380	Message 1302266 - Posted: 4 Nov 2012, 20:55:53 UTC
Lost tasks come back "automagically".....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

bill Joined: 16 Jun 99 Posts: 861 Credit: 29,352,955 RAC: 0	Message 1302273 - Posted: 4 Nov 2012, 21:11:25 UTC - in response to Message 1302180.
"In too many cases, it's not a good idea to run multiple projects, on 1 host."
What data do you have to support that assumption? And your personal experiences are too small a data point to be statistically significant. I offset them with mine of running 8 projects without having any of the problems you have.
At the moment I've been out of SETI GPU work for a while, running Einstein instead. I'm down to 56 CPU APs (about 8 days' worth) that are alternating with LHC and Rosetta. When the servers first started screwing up this weekend I went NNT on SETI. Call it foresight, experience, luck, whatever, but I have 0 ghosts and continuous work to do on my cruncher.
When the lab boys get this little snafu fixed I'll be waiting, because I have what seems to be in short supply around here: patience.

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1302275 - Posted: 4 Nov 2012, 21:12:58 UTC - in response to Message 1302266.
"Lost tasks come back 'automagically'....."
Not yet, they haven't. I'm with Fred and fscheel so far - really quick turnround on requests, but always "Project has no tasks available". I'll let them keep asking, and see what happens over the next few hours.

AllanB Joined: 2 Sep 12 Posts: 282 Credit: 425,090 RAC: 0	Message 1302276 - Posted: 4 Nov 2012, 21:13:11 UTC Last modified: 4 Nov 2012, 21:14:53 UTC
I had set NNT so as to run down my cache to switch off my machine. When I looked after it had finished crunching I had 20 or so ghosts, so I unset NNT for a while to try and see if I could get them. I now appear to have 329 tasks I haven't got. This is much more than my cache is set for, will I still get them?
PS I have received none so far as well.

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1302279 - Posted: 4 Nov 2012, 21:31:16 UTC - in response to Message 1302276.
"I had set NNT so as to run down my cache to switch off my machine. When I looked after it had finished crunching I had 20 or so ghosts, so I unset NNT for a while to try and see if I could get them. I now appear to have 329 tasks I haven't got. This is much more than my cache is set for, will I still get them?"
You should get them as you request work over the next few days, no more than 20 tasks per request. Your computer won't be suddenly overwhelmed with work, and if you don't get through them all in time, don't worry - they'll get sent to somebody else instead.
"PS I have received none so far as well."
One of my requests has got a resend now, but other machines are still dry. Never mind, the journey of a thousand WUs starts with a single crunch...
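For a rough sense of scale, the cap of 20 tasks per request mentioned above puts a floor on how many successful scheduler requests it takes to recover a given number of ghosts. Here is a minimal Python sketch, assuming the cap is exactly 20 and that every request comes back with a full batch (neither is guaranteed). The function name is made up for illustration and is not part of BOINC.

```python
def requests_needed(ghost_tasks: int, max_per_request: int = 20) -> int:
    """Lower bound on scheduler requests needed to recover all ghost tasks.

    Assumes each successful request returns a full batch of resends,
    which is an idealisation: in practice many requests time out or
    return "Project has no tasks available".
    """
    if ghost_tasks < 0 or max_per_request <= 0:
        raise ValueError("ghost_tasks must be >= 0 and max_per_request > 0")
    # Ceiling division without floating point.
    return -(-ghost_tasks // max_per_request)


if __name__ == "__main__":
    # 329 ghosts at 20 per request: 16 full batches plus one partial batch.
    print(requests_needed(329))
```

So 329 ghosts would take at least 17 successful requests, which fits with the "over the next few days" estimate once timeouts and empty replies are factored in.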

Fred E. Volunteer tester Joined: 22 Jul 99 Posts: 768 Credit: 24,140,697 RAC: 0	Message 1302280 - Posted: 4 Nov 2012, 21:34:16 UTC Last modified: 4 Nov 2012, 21:35:26 UTC
"I had set NNT so as to run down my cache to switch off my machine. When I looked after it had finished crunching I had 20 or so ghosts, so I unset NNT for a while to try and see if I could get them. I now appear to have 329 tasks I haven't got. This is much more than my cache is set for, will I still get them? PS I have received none so far as well."
When they start flowing again, you should get some until BOINC stops asking for work due to the cache setting. No request = no work. You may have to increase that setting, or just be patient and get them over a period of days.
Edit: Richard beat me to it.
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

Keith White Joined: 29 May 99 Posts: 392 Credit: 13,035,233 RAC: 22	Message 1302289 - Posted: 4 Nov 2012, 21:59:32 UTC
Well, there was a dip in the cricket graphs and about then I got 20 members of my Ghost Army downloaded. However, now that the cricket graphs are back up, I'm getting scheduler timeouts again as it's trying to get my ghosts and report newly done units. :(
"Life is just nature's way of keeping meat fresh." - The Doctor

AllanB Joined: 2 Sep 12 Posts: 282 Credit: 425,090 RAC: 0	Message 1302308 - Posted: 4 Nov 2012, 22:27:27 UTC
Just got two lots of 20!!

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.