Panic Mode On (109) Server Problems?

Profile Stargate (SA)
Volunteer tester
Joined: 4 Mar 10
Posts: 1854
Credit: 2,258,721
RAC: 0
Australia
Message 1908982 - Posted: 26 Dec 2017, 2:07:36 UTC - in response to Message 1908981.  
Last modified: 26 Dec 2017, 2:12:15 UTC

Yup, same here, getting dribs and drabs. I don't get many as I only have one computer :/
ID: 1908982 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1908983 - Posted: 26 Dec 2017, 2:30:12 UTC - in response to Message 1908977.  

I'm thinking of doing the same and refilling the CPU cache for a later move to the GPUs. But I am also very concerned about upsetting the APR and getting into the run time exceeded pitfall. I have never had one of those errors, but I don't relish spending the time and energy on a task only to have it thrown out because some artificial time limit was exceeded.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1908983 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1908984 - Posted: 26 Dec 2017, 2:39:54 UTC - in response to Message 1908983.  

Yeah, I agree. The same thing happens when a task gets that blasted "finish file present too long" error. All that wasted processing because BOINC has that inexplicable time limit built in.

Anyway, it looks like non-VLARs are flowing again, so I think I'll go grab some for my other two Linux boxes.

(If nothing else, my earlier experiment cleared 600 Arecibo VLARs out of the RTS buffer. You're welcome! :^P)
ID: 1908984 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1908986 - Posted: 26 Dec 2017, 2:55:58 UTC - in response to Message 1908983.  
Last modified: 26 Dec 2017, 3:01:39 UTC

But I am also very concerned about upsetting the APR and getting into the run time exceeded pitfall.

APR is very broken, averaging the CPU run times with the GPU is insane. They should be calculated separately.
This is not a problem with Seti; this is a flaw in Boinc.
ID: 1908986 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13745
Credit: 208,696,464
RAC: 304
Australia
Message 1908987 - Posted: 26 Dec 2017, 3:03:29 UTC - in response to Message 1908986.  

APR is very broken, averaging the CPU run times with the GPU is insane. They should be calculated separately.

Mine are, but I don't reschedule so they don't affect each other that way.
Grant
Darwin NT
ID: 1908987 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1908988 - Posted: 26 Dec 2017, 3:12:55 UTC - in response to Message 1908987.  
Last modified: 26 Dec 2017, 3:13:39 UTC

APR is very broken, averaging the CPU run times with the GPU is insane. They should be calculated separately.

Mine are, but I don't reschedule so they don't affect each other that way.

But the case we are talking about is not rescheduling per se, but bunkering for the outage. If you move tasks for whatever reason, you open yourself up to the issue of exceeding the task time limit if they are run on a device other than the one they were sent to originally.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1908988 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1908989 - Posted: 26 Dec 2017, 3:13:46 UTC - in response to Message 1908986.  

APR is very broken, averaging the CPU run times with the GPU is insane. They should be calculated separately.

As far as I know they are, but the calculation is based on the application that they're originally assigned to, not the app that they may actually end up running on. That's why rescheduling tends to get them out of whack. A task that was assigned to a CPU app with an estimated run time of, say, 2 hours, may run on a GPU in 5 minutes. That causes the scheduler to increase the APR and decrease the estimated run time for the CPU app and future tasks that are assigned to it. So, tasks that are assigned to the CPU, and then actually do run on the CPU, end up taking much longer than the estimates. When the disparity gets too great, "run time exceeded" errors result.
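
A rough sketch of the effect Jeff describes, using his 2 hour and 5 minute figures. The averaging scheme here (a plain mean of per-task rates) and the task size are assumptions made for illustration only; they are not BOINC's actual server code.

```python
# Simplified model of the APR skew described above. The plain-mean averaging
# and all numbers are illustrative assumptions, not BOINC's real calculation.

TASK_FLOPS = 7.2e13  # hypothetical amount of work in one task


def apr_gflops(elapsed_seconds):
    """Mean processing rate over a list of reported run times, in GFLOPS."""
    rates = [TASK_FLOPS / t / 1e9 for t in elapsed_seconds]
    return sum(rates) / len(rates)


def estimate_seconds(apr):
    """Estimated run time for the next task at the given APR (in GFLOPS)."""
    return TASK_FLOPS / (apr * 1e9)


CPU_RUN = 2 * 3600  # task assigned to the CPU app and run on the CPU: 2 hours
GPU_RUN = 5 * 60    # task assigned to the CPU app but run on the GPU: 5 minutes

honest = [CPU_RUN] * 10                 # nothing rescheduled
skewed = [CPU_RUN] * 5 + [GPU_RUN] * 5  # half the CPU tasks moved to the GPU

honest_est = estimate_seconds(apr_gflops(honest))
skewed_est = estimate_seconds(apr_gflops(skewed))

print(f"CPU estimate without rescheduling: {honest_est / 60:.0f} min")
print(f"CPU estimate with rescheduling:    {skewed_est / 60:.1f} min")
```

In this toy model the estimate for the CPU app drops from 120 minutes to under 10, while a task that really does run on the CPU still takes the full 2 hours.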
ID: 1908989 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1908990 - Posted: 26 Dec 2017, 3:15:26 UTC

Yes, it appears the Arecibo VLAR storm is tapering off and I am seeing BLC tasks available again. Only the linux machine is still having troubles getting the gpu cache refilled so I can bunker them to the cpu for tomorrow.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1908990 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1908991 - Posted: 26 Dec 2017, 3:30:32 UTC - in response to Message 1908987.  

Grant,
Mine are, but I don't reschedule so they don't affect each other that way.
how do you do that?
ID: 1908991 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13745
Credit: 208,696,464
RAC: 304
Australia
Message 1908994 - Posted: 26 Dec 2017, 3:37:25 UTC - in response to Message 1908991.  

Grant,
Mine are, but I don't reschedule so they don't affect each other that way.
how do you do that?

I don't reschedule.
As Jeff pointed out, the calculation times are based on what hardware they were allocated to initially by the Scheduler. When you move them to a different application to process them, the estimated times aren't re-calculated & things end up getting messy.
Grant
Darwin NT
ID: 1908994 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1908995 - Posted: 26 Dec 2017, 4:03:32 UTC - in response to Message 1908994.  

Grant, I don't crunch enough Seti for that to be a problem here but over at Einstein it can be a major PITA for me. The GPU tasks running 2 @ a time take 30 min and the APR can be over 2hrs. The CPU tasks are averaged down to 1/3 or less of their actual time. That gets me into a situation where I get deadline problems on the CPU causing Boinc to limit GPU usage in order to try to allocate as much CPU as possible in order to meet the deadlines for the CPU tasks. I run a very small cache, 1.4 days.
ID: 1908995 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13745
Credit: 208,696,464
RAC: 304
Australia
Message 1908997 - Posted: 26 Dec 2017, 4:42:15 UTC - in response to Message 1908995.  
Last modified: 26 Dec 2017, 4:43:02 UTC

The GPU tasks running 2 @ a time take 30 min and the APR can be over 2hrs.

?
APR (Average Processing Rate) is measured in GFLOPS, and yes running more than 1 WU at a time results in it indicating you're doing less work than you actually are.
Actual WU processing times both CPU & GPU, at least for me here on Seti, are generally pretty close to the estimated time.
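
To make the point about running more than 1 WU at a time concrete, here is a small sketch. It assumes APR is simply the work in a task divided by that task's wall-clock run time; the task size and run times are made-up numbers, not measurements from any host.

```python
# Why the per-task APR understates a GPU that runs several WUs at once.
# Assumption: APR = task work / wall-clock time of that task. All numbers
# below are hypothetical.

TASK_FLOPS = 3.6e13  # hypothetical amount of work in one WU


def apparent_apr_gflops(elapsed_seconds):
    """Rate implied by a single task's elapsed time, in GFLOPS."""
    return TASK_FLOPS / elapsed_seconds / 1e9


def device_throughput_gflops(elapsed_seconds, concurrent_tasks):
    """Work the device really gets through when tasks overlap, in GFLOPS."""
    return concurrent_tasks * apparent_apr_gflops(elapsed_seconds)


one_at_a_time = apparent_apr_gflops(10 * 60)  # 1 WU alone: 10 minutes
two_at_a_time = apparent_apr_gflops(18 * 60)  # each of 2 WUs: 18 minutes
real_rate = device_throughput_gflops(18 * 60, 2)

print(f"APR running 1 WU:      {one_at_a_time:.1f} GFLOPS")
print(f"APR running 2 WUs:     {two_at_a_time:.1f} GFLOPS")
print(f"actual rate for 2 WUs: {real_rate:.1f} GFLOPS")
```

With these numbers the reported APR falls when two WUs share the GPU, even though the card is getting through more work overall.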

Sounds like Einstein have an issue with how they determine estimated processing times.

The GPU tasks running 2 @ a time take 30 min and the APR can be over 2hrs. The CPU tasks are averaged down to 1/3 or less of their actual time. That gets me into a situation where I get deadline problems on the CPU causing Boinc to limit GPU usage in order to try to allocate as much CPU as possible in order to meet the deadlines for the CPU tasks.

Might be worth checking out the Einstein forums for some optimisation tips- here at Seti, for the SoG application 1 CPU core is required for each GPU WU being processed. For the older Nvidia applications it's not necessary, although they are much slower.

Looking at your systems, if their GPU application also requires 1 CPU core to support each WU being crunched, then I can see you running into insufficient CPU resources on 2 of your systems.
Here on Seti there generally isn't much point in using the Intel integrated GPUs to crunch, as the heat they produce, and cache contention with the CPU, actually results in less work being done than just running the CPU cores alone.
I don't know how well the Einstein application performs on iGPUs- try the systems without them processing and see if things get worse, improve, or make no difference. If they improve or make no difference, then with those iGPUs disabled you won't run into low available CPU resources.
Grant
Darwin NT
ID: 1908997 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1909002 - Posted: 26 Dec 2017, 5:12:44 UTC - in response to Message 1908997.  
Last modified: 26 Dec 2017, 5:15:39 UTC

I don't use iGPUs; I found them to be evil.
Sounds like Einstein have an issue with how they determine estimated processing times.

I don't know either but their times are very consistent, the variance is quite small.
ID: 1909002 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1909009 - Posted: 26 Dec 2017, 7:25:59 UTC - in response to Message 1908976.  

Absolutely no problems whatsoever. Getting new task at every request.
Day after day, week after week, when the project is online.

Maybe you multi GPU/computer owners with hundreds of thousands in RAC just simply are getting too greedy :-)

Maybe it's not the end of the world, if your RAC drops a bit eh?
It's like in the real world, those with tons of money complains the most about taxes for example, or if something becomes more expensive.

This project doesn't promise that there will be tasks available 24/7/365


. . No, no promises, but if they want volunteer processing power to process all the data coming in then they need to be able to disseminate that data to the hosts out there trying to process it for them. If there is no work coming out for a machine to process, why have that machine powered up and online? If you want people to crunch for you, you have to get the work to them to crunch. Since they have repeatedly stated there is more data than the current horde of volunteers can process for them and they need more, how is that going to work if they cannot keep the work up to the ones they already have? It isn't a matter of them owing us the work, simply practicality: we cannot do the work if they cannot get it to us.

Stephen

<shrug>
ID: 1909009 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1909011 - Posted: 26 Dec 2017, 7:31:35 UTC - in response to Message 1908977.  


Of course, one of the potential drawbacks to downloading Arecibo VLARs, or any other type of task to the CPU queue and then running them on the GPUs, is that the APR for those tasks will eventually climb to the point that the "Elapsed time exceeded" error starts to show up, so that has to be monitored. A real PITA but, if the choice is either to run Arecibo VLARs on my GPUs, or run out of work on the GPUs altogether, I'll take the first option. :^)


. . Been there, and it sneaks up on you quicker than you might think. After just two weeks of doing that the APRs for the CPU looked OK at over 7 mins per task, and that is almost the length of time the Arecibo VHAR tasks were taking on the GPU, but when moved to the GPU Q the estimates dropped to mere seconds (which I failed to notice) and anything that took about 7 mins or longer timed out. Killed that option really quickly.

Stephen

:(
ID: 1909011 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1909012 - Posted: 26 Dec 2017, 7:33:32 UTC - in response to Message 1908984.  

Yeah, I agree. The same thing happens when a task gets that blasted "finish file present too long" error. All that wasted processing because BOINC has that inexplicable time limit built in.

Anyway, it looks like non-VLARs are flowing again, so I think I'll go grab some for my other two Linux boxes.

(If nothing else, my earlier experiment cleared 600 Arecibo VLARs out of the RTS buffer. You're welcome! :^P)



. . Thanks for falling on your sword :)

Stephen

:)
ID: 1909012 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1909014 - Posted: 26 Dec 2017, 7:37:37 UTC - in response to Message 1908987.  

APR is very broken, averaging the CPU run times with the GPU is insane. They should be calculated separately.

Mine are, but I don't reschedule so they don't affect each other that way.


. . Yes, the problem is that CreditScrew, or whatever, does not read the result file info fully, so it does not take note of which device actually did the processing, only to which device it was despatched. And therein lies the cross stream corruption of the APRs when re-scheduling from CPU to GPU. It happens the other way as well, but that does not create time outs.

Stephen

:(
ID: 1909014 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22216
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1909022 - Posted: 26 Dec 2017, 9:04:37 UTC

Both the APR and CreditScrew calculation processes are server side. CreditScrew uses the APR to guess at the credit to be awarded in a rather convoluted manner. APR is a measure of expected run time against achieved run time, in FLOPS, and uses the assigned processor in its calculation. The value of APR is passed periodically back to your cruncher where it is used to calculate the expected run time in seconds for each task. Timeouts occur when you exceed that time by a factor of ten.
Now, since everything is based on the expected run time, rescheduling, or introducing a new processor to the cruncher, will have an impact on the APR and thus on credit awarded per task.
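
The timeout rule in this post can be sketched in a few lines. The factor of ten is taken from the description above; the function names, task size and APR values are hypothetical and chosen only to show how an inflated APR turns an ordinary run into a timeout.

```python
# Sketch of the timeout rule as described in the post: a task fails once its
# run time exceeds ten times the expected run time derived from the APR.
# Names and numbers are hypothetical, not taken from the BOINC client.

TIMEOUT_FACTOR = 10   # "by a factor of ten", per the post
TASK_FLOPS = 4.0e13   # hypothetical amount of work in one task


def expected_seconds(apr_gflops):
    """Expected run time for one task at the given APR (in GFLOPS)."""
    return TASK_FLOPS / (apr_gflops * 1e9)


def times_out(actual_seconds, apr_gflops):
    """True if the actual run time breaks the ten-times-expected limit."""
    return actual_seconds > TIMEOUT_FACTOR * expected_seconds(apr_gflops)


actual = 7 * 60  # the task really takes 7 minutes on this device

# APR that matches the device: expected 400 s, limit 4000 s.
print(times_out(actual, apr_gflops=100))

# APR inflated by rescheduled tasks: expected 40 s, limit 400 s.
print(times_out(actual, apr_gflops=1000))
```

The same 7 minute run passes with the realistic APR and fails with the inflated one, which is the pattern reported earlier in the thread once estimates had dropped to mere seconds.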
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1909022 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1909038 - Posted: 26 Dec 2017, 13:32:39 UTC

. . Well here it is 00:30 AEDT (13:30 UTC) and still no outage. I guess it will be a late start and therefore a very late finish today.

. . These rolling outage start times of late make it a little unnerving ... :(

Stephen

??
ID: 1909038 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1909039 - Posted: 26 Dec 2017, 13:58:34 UTC
Last modified: 26 Dec 2017, 13:59:47 UTC

It all depends if someone left a full pot of coffee from the night before, or if they need to make a fresh pot :D
I'm as full as I can get, so bring it on :)
ID: 1909039 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.