Cannot get any work with 3 GPU, no queue size, one GPU sometimes idle

Joseph Stateson (Volunteer tester)
Message 1992304 - Posted: 2 May 2019, 17:06:35 UTC

====== Brought this over from the BOINC forum, as maybe the problem is on the SETI side? ======
I have noticed for some time that SETI has exactly one task running on each of the three GPUs, with no queue depth. Since the project generally has 100,000 or so tasks ready to send, something is wrong.

All my systems use the BAM! account manager, but it seems the preferences at BAM! are not used (they show 0.1 and 0.25, the same as the local client preferences according to BoincTasks).
I went to SETI and set the preferences there to a 0.25-day queue with 0.50 days additional (they used to be 0.1 and 0.25), just to see what happened.

I did an update, as that was required by the project, and the event log reported:

    3347 SETI@home 5/2/2019 10:53:52 AM update requested by user
    3348 SETI@home 5/2/2019 10:53:52 AM Sending scheduler request: Requested by user.
    3349 SETI@home 5/2/2019 10:53:52 AM Not requesting tasks: don't need (CPU: ; AMD/ATI GPU: )
    3350 SETI@home 5/2/2019 10:53:54 AM Scheduler request completed
    3351 SETI@home 5/2/2019 10:53:54 AM General prefs: from SETI@home (last modified 02-May-2019 10:53:54)
    3352 SETI@home 5/2/2019 10:53:54 AM Host location: none
    3353 SETI@home 5/2/2019 10:53:54 AM General prefs: using your defaults
    3354 5/2/2019 10:53:54 AM Reading preferences override file
    3355 5/2/2019 10:53:54 AM Preferences:
    3356 5/2/2019 10:53:54 AM max memory usage when active: 6139.56 MB
    3357 5/2/2019 10:53:54 AM max memory usage when idle: 11051.20 MB
    3358 5/2/2019 10:53:54 AM max disk usage: 116.17 GB
    3359 5/2/2019 10:53:54 AM max CPUs used: 20
    3360 5/2/2019 10:53:54 AM (to change preferences, visit a project web site or select Preferences in the Manager)



As far as I could tell, not only did the increase have no effect, but I actually lost a work unit: the update asked for data too soon, which (I am guessing) caused the project to impose a backoff. So an UPDATE needs to happen before the preferences take effect, and during an UPDATE the client asks for more data? Is this correct?

After 5-6 minutes (the backoff is 300 seconds, as I recall) I finally got an extra work unit, and all three of my RX 560s are busy.

However, what happened to the request for an additional buffer? Are the project preferences being overridden by the general client preferences? In any event, exactly one work unit per GPU is a queue of exactly ZERO. Where is the original 0.1 day, or the new 0.25?

What has control over preferences: the client, BAM!, or the project?

Maybe this should be asked over at SETI??

[EDIT]
I just changed the local (client) preferences and got the following, which indicates I need to go to the project website (which I did earlier):

    jysdualxeon

    3603 SETI@home 5/2/2019 11:42:31 AM General prefs: from SETI@home (last modified 02-May-2019 10:53:55)
    3604 SETI@home 5/2/2019 11:42:31 AM Host location: none
    3605 SETI@home 5/2/2019 11:42:31 AM General prefs: using your defaults
    3606 5/2/2019 11:42:31 AM Reading preferences override file
    3607 5/2/2019 11:42:31 AM Preferences:
    3608 5/2/2019 11:42:31 AM max memory usage when active: 6139.56 MB
    3609 5/2/2019 11:42:31 AM max memory usage when idle: 11051.20 MB
    3610 5/2/2019 11:42:31 AM max disk usage: 116.17 GB
    3611 5/2/2019 11:42:31 AM max CPUs used: 20
    3612 5/2/2019 11:42:31 AM (to change preferences, visit a project web site or select Preferences in the Manager)



In any event, nothing happened, though I did not lose a work unit because no update was actually done.


Here is an image from BOINC Manager (not BoincTasks). It shows only two tasks running, one GPU idle, and no queue.

Keith Myers (Volunteer tester)
Message 1992306 - Posted: 2 May 2019, 17:52:30 UTC

You must have some configuration conflict with resource allocation. The host has more GPU work for other projects and SETI only gets the last little slice of GPU allocation. Or you have mistakenly put a decimal point in the wrong place in your Preferences.

What does setting the sched_op_debug flag in the Logging options show for the work request? It will show the number of seconds of work requested for both CPU and GPU. You could also set work_fetch_debug and look at its more detailed report.

With even a 0.5-day work cache, you should get 100 tasks for each GPU and another 100 tasks for the CPU.
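
As a rough sanity check of that estimate (a minimal sketch; the ~430-second per-task runtime is an assumed figure, not something from the thread), the task count is just the buffered seconds divided by the estimated runtime of one task:

    # Rough estimate of how many tasks a given work cache corresponds to.
    # The per-task runtime is an assumption used only for illustration.
    SECONDS_PER_DAY = 86400

    def tasks_for_cache(cache_days: float, est_task_runtime_s: float) -> int:
        """Approximate number of tasks needed to fill cache_days of work."""
        return round(cache_days * SECONDS_PER_DAY / est_task_runtime_s)

    # 0.5-day cache at an assumed ~430 s per GPU task -> roughly 100 tasks,
    # in line with the "100 tasks for each GPU" estimate above.
    print(tasks_for_cache(0.5, 430))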
Joseph Stateson (Volunteer tester)
Message 1992310 - Posted: 2 May 2019, 18:25:52 UTC - in response to Message 1992306.  

You must have some configuration conflict with resource allocation. The host has more GPU work for other projects and SETI only gets the last little slice of GPU allocation. Or you have mistakenly put a decimal point in the wrong place in your Preferences.

What does setting the sched_op_debug flag in the Logging options show for the work request? It will show the number of seconds of work requested for both CPU and GPU. You could also set work_fetch_debug and look at its more detailed report.

With even a 0.5-day work cache, you should get 100 tasks for each GPU and another 100 tasks for the CPU.



OK, I set those debug flags.
Results are here
If the above does not work, remove the www. I have no idea which sites use which protocol. It would be nice if all BOINC projects upgraded to allow storage on the cloud like newer forums/communities.

I'm going to make a guess after looking at the chatter.

I have the resource share set to 0 because I want SETI to run behind all other GPU tasks.
There are no other GPU tasks on this system, nor do I plan on any, but that might change.

Maybe that is the problem?
Keith Myers (Volunteer tester)
Message 1992315 - Posted: 2 May 2019, 18:52:03 UTC - in response to Message 1992310.  

Yes, that is exactly the problem. When you set the resource share to 0 for a project, it will request exactly one task at a time, running it when the other projects are idle, and it will keep asking for only a single task until the other projects pick up work again. That way you don't get a ton of work from a backup project when your prime projects get work again, and have to crunch through unwanted secondary backup-project work before being able to request work from your prime project.

If you look at your work_fetch_debug output, you are asking for 0.75 days of work, which is 0.25 days plus 0.50 days of additional work.

[work_fetch] target work buffer: 21600.00 + 43200.00 sec

If you wanted to get more SETI work, you could set the host to another venue and then bump the project share up to 1 or 5 or something in relation to your prime projects' share. Also, I would remove the additional days of work by setting it to 0.01 days, which is the lowest BOINC will allow since it won't take zero as input. I would reduce your work cache even further, to maybe 0.1 day. That would only request 8640 seconds of work, which you could easily crunch through with the GPU in a short time.
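
For reference, the arithmetic behind that [work_fetch] line and the 8640-second figure is just the two cache preferences converted from days to seconds (a minimal sketch):

    # "store at least N days" plus "store up to an additional M days",
    # each converted to seconds, gives the target work buffer shown in the log.
    SECONDS_PER_DAY = 86400

    def target_work_buffer(min_days: float, extra_days: float) -> tuple[float, float]:
        return min_days * SECONDS_PER_DAY, extra_days * SECONDS_PER_DAY

    print(target_work_buffer(0.25, 0.50))  # -> (21600.0, 43200.0) sec, as logged above
    print(target_work_buffer(0.10, 0.01))  # -> (8640.0, 864.0) sec, the smaller cache suggested here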
Richard Haselgrove (Volunteer tester)
Message 1992317 - Posted: 2 May 2019, 19:03:52 UTC - in response to Message 1992310.  

Keith is right - Resource share 0 is a special case value meaning 'backup project', work to be fetched only when resources are idle and no other project can supply work. Using it here on a machine with 3 GPUs is especially problematic because of the enforced 5-minute delay between work requests: if you need work between those requests, you can't get it.
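
A back-of-envelope check of that point (a small sketch; the per-task runtime is an assumed figure): with one task per request and at least 300 seconds between requests, task supply cannot keep up with three GPUs:

    # Supply vs. demand for a resource-share-0 ("backup") project on a 3-GPU host.
    # One task arrives per scheduler request, and requests are >= 300 s apart.
    REQUEST_INTERVAL_S = 300        # enforced delay between work requests
    N_GPUS = 3
    EST_TASK_RUNTIME_S = 430.0      # assumed per-task GPU runtime, illustration only

    supply = 1 / REQUEST_INTERVAL_S         # tasks obtainable per second
    demand = N_GPUS / EST_TASK_RUNTIME_S    # tasks consumed per second
    print("GPUs will sit idle" if supply < demand else "supply keeps up")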
Joseph Stateson (Volunteer tester)
Message 1992318 - Posted: 2 May 2019, 19:45:29 UTC - in response to Message 1992315.  

Thanks Keith, Richard!

Yes, after changing the priority at BAM! and doing a sync, about 5 minutes later I got a boatload of tasks. I was unaware of the special meaning of "0", but I did know about the problem of low-priority projects ending up with a lot of WUs that never complete.

This is what I have been working on:

I have GPUs with extremely fast double-precision floating point (S9x00, HD79x0): they work best on Milkyway.

All my other GPUs (RX 5x0, GTX 1070) have superior single precision compared to the above AMD boards but really suck at double precision, typically a 1:16 ratio. They are a waste of electricity on Milkyway.

Milkyway and SETI go offline for maintenance regularly. I want priority on science projects, with fallback to non-science projects when those are offline. Not all projects have ATI apps; most have NVIDIA apps, so falling back to Asteroids is not possible on ATI systems (for example).

There is a problem with Milkyway in that they have work but do not supply it for some reason. It is some type of bug; a perfect example is HERE.
During those 10-15 minute gaps my secondary projects suck up work units, and if I set their priority too low there will be real problems later near their deadlines.

I am thinking that I cannot use BAM! or the BOINC client general preferences and need to use project preferences. I am not sure how to do this, or if it is even possible while keeping BAM! as the account manager. I am not sure if Milkyway even looks at project preferences like 0.1 and 0.25; I seem to get exactly 200 WUs for each GPU.

I cannot change what Milkyway is doing. The best I can do to avoid idle time is to fall back on SETI or Einstein on those double-precision AMD boards, so I will try a low resource share for SETI and Einstein there. My other GPUs do not run Milkyway, nor do I plan to, other than for getting statistics for various studies I am doing.

If you have any suggestions, let me know. Maybe there should be a wiki about this, and also about the cost (kWh) of running various projects.
Keith Myers (Volunteer tester)
Message 1992331 - Posted: 2 May 2019, 21:33:00 UTC - in response to Message 1992318.  

I am not sure if Milkyway even looks at project preferences like 0.1 and 0.25; I seem to get exactly 200 WUs for each GPU.

I think you may be correct that MW doesn't obey the normal BOINC convention of cache allotment. I think you just get 200 tasks for each GPU; how fast or slow the GPUs are, or the GFLOPS performance of the app on the task, doesn't factor into how much work is sent.

Right now MW@home is very broken since the server code update. The preference choices for controlling which work you get have been removed. There is the ongoing issue of fast clients going idle until all their work is reported before asking for more, plus the obvious misconfiguration of the feeder size.

But we have new scientists maintaining the project, and they are starting at the bottom of the learning curve. We will just have to have patience until they can get the project working correctly again.
Richard Haselgrove (Volunteer tester)
Message 1992333 - Posted: 2 May 2019, 22:32:07 UTC - in response to Message 1992331.  

I think you may be correct that MW doesn't obey the normal BOINC convention of cache allotment....
I'd be interested if you can stand that up.

Remember that what you ask for is a client decision; what you get is a server decision. They are independent, but should be correlated. The ideal is that you would see:

02/05/2019 23:20:56 | SETI@home | [sched_op] NVIDIA GPU work request: 7897.15 seconds; 0.00 devices
02/05/2019 23:20:59 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 7922 seconds
where 'received' exceeds 'request' by no more than one task's estimated runtime.
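
For anyone who wants to eyeball this from their own event log, here is a small sketch (the regular expressions match the two [sched_op] lines quoted above; the one-task runtime used for the comparison is an assumption) that pulls out the requested seconds and the estimated total received:

    # Compare requested vs. received GPU work from [sched_op] event-log lines.
    import re

    REQUEST_RE = re.compile(r"\[sched_op\] (\w+) GPU work request: ([\d.]+) seconds")
    RECEIVED_RE = re.compile(r"\[sched_op\] estimated total (\w+) GPU task duration: ([\d.]+) seconds")

    def check(log_text: str, one_task_estimate_s: float = 430.0) -> None:
        # one_task_estimate_s is an illustrative assumption, not a BOINC value.
        requested = {m.group(1): float(m.group(2)) for m in REQUEST_RE.finditer(log_text)}
        received = {m.group(1): float(m.group(2)) for m in RECEIVED_RE.finditer(log_text)}
        for vendor, req in requested.items():
            got = received.get(vendor, 0.0)
            over = got - req
            verdict = "OK" if over <= one_task_estimate_s else "more than one task over"
            print(f"{vendor}: requested {req:.0f} s, received ~{got:.0f} s ({verdict})")

    sample = (
        "02/05/2019 23:20:56 | SETI@home | [sched_op] NVIDIA GPU work request: 7897.15 seconds; 0.00 devices\n"
        "02/05/2019 23:20:59 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 7922 seconds\n"
    )
    check(sample)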
Keith Myers (Volunteer tester)
Message 1992337 - Posted: 2 May 2019, 22:57:21 UTC - in response to Message 1992333.  

I think you may be correct that MW doesn't obey the normal BOINC convention of cache allotment....
I'd be interested if you can stand that up.


I'd have to replicate my earlier test from back when the number of tasks allowed per GPU was set at 40 or 80. I set my cache allotment to 0.1 day plus 0.01 additional days and still received exactly 40 tasks. Changing to 4 days of cache changed nothing. The server configuration sets the maximum allowed per GPU. It was 40 a few years ago, then bumped to 80 a year ago, and then earlier this year it was bumped to 200 per GPU because all the users with fast ATI hosts were complaining that they crunched through the tasks too quickly.

https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4424&postid=68441
