Panic Mode On (109) Server Problems?

Author	Message
Stargate (SA) Volunteer tester Send message Joined: 4 Mar 10 Posts: 1854 Credit: 2,258,721 RAC: 0	Message 1908982 - Posted: 26 Dec 2017, 2:07:36 UTC - in response to Message 1908981. Last modified: 26 Dec 2017, 2:12:15 UTC Yup..same here getting dribs and drabs..I don't get many as only have one computer :/ ID: 1908982 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1908983 - Posted: 26 Dec 2017, 2:30:12 UTC - in response to Message 1908977. I'm thinking of doing the same and refill the cpu cache for later move to the gpus. But I am also very concerned about upsetting the APR and getting into the run time exceeded pitfall. I have never had one of those errors but I don't relish spending the time and energy on a task only to have it thrown out because some artificial time limit was exceeded. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1908983 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1908984 - Posted: 26 Dec 2017, 2:39:54 UTC - in response to Message 1908983. Yeah, I agree. The same thing happens when a task gets that blasted "finish file present too long" error. All that wasted processing because BOINC has that inexplicable time limit built in. Anyway, it looks like non-VLARs are flowing again, so I think I'll go grab some for my other two Linux boxes. (If nothing else, my earlier experiment cleared 600 Arecibo VLARs out of the RTS buffer. You're welcome! :^P) ID: 1908984 ·

betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66	Message 1908986 - Posted: 26 Dec 2017, 2:55:58 UTC - in response to Message 1908983. Last modified: 26 Dec 2017, 3:01:39 UTC But I am also very concerned about upsetting the APR and getting into the run time exceeded pitfall. APR is very broken, averaging the CPU run times with the GPU is insane. They should be calculated separately. This is not a problem with Seti; this is a flaw in Boinc. ID: 1908986 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13745 Credit: 208,696,464 RAC: 304	Message 1908987 - Posted: 26 Dec 2017, 3:03:29 UTC - in response to Message 1908986. APR is very broken, averaging the CPU run times with the GPU is insane. They should be calculated separately. Mine are, but I don't reschedule so they don't affect each other that way. Grant Darwin NT ID: 1908987 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1908988 - Posted: 26 Dec 2017, 3:12:55 UTC - in response to Message 1908987. Last modified: 26 Dec 2017, 3:13:39 UTC APR is very broken, averaging the CPU run times with the GPU is insane. They should be calculated separately. Mine are, but I don't reschedule so they don't affect each other that way. But the case we are talking about is not rescheduling per se, but bunkering for the outage. If you move tasks for whatever reason, you open yourself up to the issue of exceeding the task time limit if a task is run on a device other than the one it was sent to originally. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1908988 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1908989 - Posted: 26 Dec 2017, 3:13:46 UTC - in response to Message 1908986. APR is very broken, averaging the CPU run times with the GPU is insane. They should be calculated separately. As far as I know they are, but the calculation is based on the application that they're originally assigned to, not the app that they may actually end up running on. That's why rescheduling tends to get them out of whack. A task that was assigned to a CPU app with an estimated run time of, say, 2 hours, may run on a GPU in 5 minutes. That causes the scheduler to increase the APR and decrease the estimated run time for the CPU app and future tasks that are assigned to it. So, tasks that are assigned to the CPU, and then actually do run on the CPU, end up taking much longer than the estimates. When the disparity gets too great, "run time exceeded" errors result. ID: 1908989 ·
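The drift Jeff describes can be sketched with a toy model. This is only an illustration under stated assumptions: the function names, the task size, and the plain mean used for averaging are all hypothetical and are not taken from BOINC's source. Only the 2-hour CPU and 5-minute GPU run times come from the post above.

```python
# Toy model of APR drift after rescheduling.
# ASSUMPTIONS: the names, the task size, and the simple-mean averaging
# are illustrative only; BOINC's real bookkeeping is more involved.

def apr_gflops(task_gflop, elapsed_seconds):
    """Mean processing rate (GFLOPS) implied by a list of completed run times."""
    rates = [task_gflop / t for t in elapsed_seconds]
    return sum(rates) / len(rates)

def estimated_runtime(task_gflop, apr):
    """Estimated run time in seconds for a task of the given size."""
    return task_gflop / apr

TASK_GFLOP = 36_000.0      # hypothetical task size
CPU_SECONDS = 2 * 60 * 60  # 2 hours on the CPU (figure from the post)
GPU_SECONDS = 5 * 60       # 5 minutes on the GPU (figure from the post)

# All ten tasks assigned to the CPU app actually ran on the CPU.
honest = [CPU_SECONDS] * 10
# Half of the tasks assigned to the CPU app were moved to the GPU.
mixed = [CPU_SECONDS] * 5 + [GPU_SECONDS] * 5

apr_honest = apr_gflops(TASK_GFLOP, honest)
apr_mixed = apr_gflops(TASK_GFLOP, mixed)

est_honest = estimated_runtime(TASK_GFLOP, apr_honest)
est_mixed = estimated_runtime(TASK_GFLOP, apr_mixed)

print(f"CPU app APR without rescheduling: {apr_honest:.1f} GFLOPS, estimate {est_honest:.0f} s")
print(f"CPU app APR with rescheduling:    {apr_mixed:.1f} GFLOPS, estimate {est_mixed:.0f} s")
```

In this sketch the rescheduled results are credited to the CPU app, so its APR climbs and the estimate for the next genuine CPU task shrinks, even though that task still needs the full 2 hours.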

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1908990 - Posted: 26 Dec 2017, 3:15:26 UTC Yes, it appears the Arecibo VLAR storm is tapering off and I am seeing BLC tasks available again. Only the linux machine is still having troubles getting the gpu cache refilled so I can bunker them to the cpu for tomorrow. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1908990 ·

betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66	Message 1908991 - Posted: 26 Dec 2017, 3:30:32 UTC - in response to Message 1908987. Grant, Mine are, but I don't reschedule so they don't affect each other that way. how do you do that? ID: 1908991 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13745 Credit: 208,696,464 RAC: 304	Message 1908994 - Posted: 26 Dec 2017, 3:37:25 UTC - in response to Message 1908991. Grant, Mine are, but I don't reschedule so they don't affect each other that way. how do you do that? I don't reschedule. As Jeff pointed out, the calculation times are based on what hardware they were allocated to initially by the Scheduler. When you move them to a different application to process them, the estimated times aren't re-calculated & things end up getting messy. Grant Darwin NT ID: 1908994 ·

betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66	Message 1908995 - Posted: 26 Dec 2017, 4:03:32 UTC - in response to Message 1908994. Grant, I don't crunch enough Seti for that to be a problem here but over at Einstein it can be a major PITA for me. The GPU tasks running 2 @ a time take 30 min and the APR can be over 2hrs. The CPU tasks are averaged down to 1/3 or less of their actual time. That gets me into a situation where I get deadline problems on the CPU causing Boinc to limit GPU usage in order to try to allocate as much CPU as possible in order to meet the deadlines for the CPU tasks. I run a very small cache, 1.4 days. ID: 1908995 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13745 Credit: 208,696,464 RAC: 304	Message 1908997 - Posted: 26 Dec 2017, 4:42:15 UTC - in response to Message 1908995. Last modified: 26 Dec 2017, 4:43:02 UTC The GPU tasks running 2 @ a time take 30 min and the APR can be over 2hrs. ? APR (Average Processing Rate) is measured in GFLOPS, and yes running more than 1 WU at a time results in it indicating you're doing less work than you actually are. Actual WU processing times both CPU & GPU, at least for me here on Seti, are generally pretty close to the estimated time. Sounds like Einstein have an issue with how they determine estimated processing times. The GPU tasks running 2 @ a time take 30 min and the APR can be over 2hrs. The CPU tasks are averaged down to 1/3 or less of their actual time. That gets me into a situation where I get deadline problems on the CPU causing Boinc to limit GPU usage in order to try to allocate as much CPU as possible in order to meet the deadlines for the CPU tasks. Might be worth checking out the Einstein forums for some optimisation tips- here at Seti, for the SoG application 1 CPU core is required for each GPU WU being processed. For the older Nvidia applications it's not necessary, although they are much slower. Looking at your systems, if their GPU application also requires 1 CPU core to support each WU being crunched, then I can see you running into insufficient CPU resources on 2 of your systems. Here on Seti there generally isn't much point in using the Intel integrated GPUs to crunch, as the heat they produce, and cache contention with the CPU, actually results in less work being done than just running the CPU cores alone. I don't know how well the Einstein application performs on iGPUs- try the systems without them processing and see if things get worse, improve, or make no difference. If they improve or make no difference, then with those iGPUs disabled you won't run into low available CPU resources.
Grant Darwin NT ID: 1908997 ·
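Grant's point that running more than one WU at a time makes APR understate the real work rate can be shown with a little arithmetic. A minimal sketch, assuming APR is simply task size divided by that task's wall-clock time; the function names and the task size are hypothetical, and the 30-minute, two-at-a-time figures are the Einstein numbers quoted in the thread.

```python
# Per-task APR versus whole-device throughput when tasks run concurrently.
# ASSUMPTIONS: names and task size are illustrative; APR is modelled as
# task size divided by that task's own wall-clock time.

def per_task_apr(task_gflop, wall_seconds):
    """Rate (GFLOPS) credited to a single task from its own elapsed time."""
    return task_gflop / wall_seconds

def device_throughput(task_gflop, wall_seconds, concurrent_tasks):
    """Rate (GFLOPS) the device actually sustains across all concurrent tasks."""
    return concurrent_tasks * task_gflop / wall_seconds

TASK_GFLOP = 18_000.0   # hypothetical task size
WALL_SECONDS = 30 * 60  # each task takes 30 minutes (figure from the thread)
CONCURRENT = 2          # two tasks at a time (figure from the thread)

apr = per_task_apr(TASK_GFLOP, WALL_SECONDS)
throughput = device_throughput(TASK_GFLOP, WALL_SECONDS, CONCURRENT)

print(f"APR seen per task:      {apr:.1f} GFLOPS")
print(f"Real device throughput: {throughput:.1f} GFLOPS")
```

Under this model the reported APR is the real throughput divided by the number of tasks running at once, which matches the direction of the effect Grant describes.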

betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66	Message 1909002 - Posted: 26 Dec 2017, 5:12:44 UTC - in response to Message 1908997. Last modified: 26 Dec 2017, 5:15:39 UTC I don't use IGPUs I found them to be evil. Sounds like Einstein have an issue with how they determine estimated processing times. I don't know either but their times are very consistent, the variance is quite small. ID: 1909002 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1909009 - Posted: 26 Dec 2017, 7:25:59 UTC - in response to Message 1908976. Absolutely no problems whatsoever. Getting new task at every request. Day after day, week after week, when the project is online. Maybe you multi GPU/computer owners with hundreds of thousands in RAC just simply are getting too greedy :-) Maybe it's not the end of the world, if your RAC drops a bit eh? It's like in the real world, those with tons of money complains the most about taxes for example, or if something becomes more expensive. This project doesn't promise that there will be tasks available 24/7/365 . . No, no promises, but if they want volunteer processing power to process all the data coming in then they need to be able to disseminate that data to the hosts out there trying to process it for them. If there is no work coming out for a machine to process why have that machine powered up and online? If you want people to crunch for you, you have to get the work to them to crunch. Since they have repeatedly stated there is more data than the current horde of volunteers can process for them and they need more, how is that going to work if they cannot keep the work up to the ones they already have? It isn't a matter of them owing us the work, simply practicality, we cannot do the work if they cannot get it to us. Stephen <shrug> ID: 1909009 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1909011 - Posted: 26 Dec 2017, 7:31:35 UTC - in response to Message 1908977. Of course, one of the potential drawbacks to downloading Arecibo VLARs, or any other type of task to the CPU queue and then running them on the GPUs, is that the APR for those tasks will eventually climb to the point that the "Elapsed time exceeded" error starts to show up, so that has to be monitored. A real PITA but, if the choice is either to run Arecibo VLARs on my GPUs, or run out of work on the GPUs altogether, I'll take the first option. :^) . . Been there, and it sneaks up on you quicker than you might think. After just two weeks of doing that the APRs for the CPU looked OK at over 7 mins per task, and that is almost the length of time the Arecibo VHAR tasks were taking on the GPU, but when moved to the GPU Q the estimates dropped to mere seconds (which I failed to notice) and anything that took about 7 mins or longer timed out. Killed that option really quickly. Stephen :( ID: 1909011 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1909012 - Posted: 26 Dec 2017, 7:33:32 UTC - in response to Message 1908984. Yeah, I agree. The same thing happens when a task gets that blasted "finish file present too long" error. All that wasted processing because BOINC has that inexplicable time limit built in. Anyway, it looks like non-VLARs are flowing again, so I think I'll go grab some for my other two Linux boxes. (If nothing else, my earlier experiment cleared 600 Arecibo VLARs out of the RTS buffer. You're welcome! :^P) . . Thanks for falling on your sword :) Stephen :) ID: 1909012 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1909014 - Posted: 26 Dec 2017, 7:37:37 UTC - in response to Message 1908987. APR is very broken, averaging the CPU run times with the GPU is insane. They should be calculated separately. Mine are, but I don't reschedule so they don't affect each other that way. . . Yes, the problem is that CreditScrew or whatever, does not read the result file info fully, so it does not take note of which device actually did the processing, only to which device it was despatched. And therein lies the cross stream corruption of the APRs when re-scheduling from CPU to GPU, it happens the other way as well but that does not create time outs. Stephen :( ID: 1909014 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22216 Credit: 416,307,556 RAC: 380	Message 1909022 - Posted: 26 Dec 2017, 9:04:37 UTC Both the APR and CreditScrew calculation processes are server side. CreditScrew uses the APR to guess at the credit to be awarded in a rather convoluted manner. APR is a measure of expected run time against achieved run time, in FLOPs, and uses the assigned processor in its calculation. The value of APR is passed periodically back to your cruncher where it is used to calculate the expected run time in seconds for each task. Timeouts occur when you exceed that time by a factor of ten. Now since everything is based on the expected run time, rescheduling, or introducing a new processor to the cruncher, will have an impact on the APR and thus on credit awarded per task. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1909022 ·
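The timeout rule described above can be written as a small check. This is a hedged sketch: the function names, the task size, and the APR values are hypothetical, and only the factor of ten comes from the post. It is not BOINC client code.

```python
# Sketch of the "run time exceeded" check.
# ASSUMPTIONS: names, task size, and APR values are illustrative.
# The factor of ten is the one stated in the post above.

TIMEOUT_FACTOR = 10

def expected_runtime_seconds(task_gflop, apr_gflops):
    """Expected run time derived from the task size and the app's APR."""
    return task_gflop / apr_gflops

def exceeds_limit(actual_seconds, task_gflop, apr_gflops, factor=TIMEOUT_FACTOR):
    """True if the task ran longer than factor times its expected run time."""
    limit = factor * expected_runtime_seconds(task_gflop, apr_gflops)
    return actual_seconds > limit

TASK_GFLOP = 36_000.0    # hypothetical task size
ACTUAL_CPU_SECONDS = 7_200  # the task really needs 2 hours on the CPU

# With an APR that reflects the CPU's true speed, the task is well inside the limit.
ok_case = exceeds_limit(ACTUAL_CPU_SECONDS, TASK_GFLOP, apr_gflops=5.0)

# With an APR inflated by results that actually ran on a GPU, the same task times out.
bad_case = exceeds_limit(ACTUAL_CPU_SECONDS, TASK_GFLOP, apr_gflops=62.5)

print("true APR, timed out?    ", ok_case)
print("inflated APR, timed out?", bad_case)
```

With the inflated APR the expected run time falls to 576 seconds, so the limit is 5,760 seconds and a genuine 2-hour CPU run is thrown out, which is the failure the rescheduling posts in this thread warn about.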

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1909038 - Posted: 26 Dec 2017, 13:32:39 UTC . . Well here it is 00:30 AEDT (13:30 UTC) and still no outage. I guess it will be a late start and therefore a very late finish today. . . These rolling outage start times of late make it a little unnerving ... :( Stephen ?? ID: 1909038 ·

Brent Norman Volunteer tester Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835	Message 1909039 - Posted: 26 Dec 2017, 13:58:34 UTC Last modified: 26 Dec 2017, 13:59:47 UTC It all depends if someone left a full pot of coffee from the night before, or if they need to make a fresh pot :D I'm as full as I can get, so bring it on :) ID: 1909039 ·
