queuing question

Message boards : Number crunching : queuing question
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,633,544
RAC: 265
United States
Message 1917687 - Posted: 8 Feb 2018, 18:16:42 UTC

I wonder about the wisdom of allowing clients to stash so many tasks in their queues at any one time (up to a week's worth, as I recall). Consider the following:
* Long queues necessarily increase the load on the servers, which right now hold about 5M placeholders in the various database tables, corresponding to the number of results in the field. That is a lot of unproductive overhead when perhaps a considerably smaller number could be maintained, to the benefit of the SETI project.
* Similarly degrading productivity are the roughly 4M results currently waiting for validation. Shorter client queues would seem to reduce this number as well.
* Note that actual run times for the v8's range roughly from 15 minutes to 2 hours depending on the client, estimated from my old Core2Quad rig and what the GPU-endowed people report. With the current result turnaround time being about 30 hours, the average task sits on the average client doing nothing 93% to 99% of the time.
* As a rule, SETI appears to keep up to six hours' worth of results ready to send. If the clients' caches were reduced, this number would have to increase by some uncertain amount.
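The idle-time figure in the third bullet is simple arithmetic; a quick Python sketch using the run times and 30-hour turnaround quoted above:

```python
# Back-of-envelope idle-time estimate, using the numbers from this post:
# ~30 h average turnaround, per-task run times from ~15 min to ~2 h.
TURNAROUND_H = 30.0

def idle_fraction(runtime_h, turnaround_h=TURNAROUND_H):
    """Fraction of the turnaround a task spends sitting idle on the client."""
    return 1.0 - runtime_h / turnaround_h

print(f"2 h CPU task:    {idle_fraction(2.0):.0%} idle")   # ~93%
print(f"15 min GPU task: {idle_fraction(0.25):.1%} idle")  # ~99%
```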

I realize that people don't want to run out of datasets to work on. Fine. But keeping those people happy seems to come at the cost of maintaining a dynamic and efficient backend at SETI. And in the end, the worry is largely moot: SETI is up "most" of the time, so the risk of running out of tasks is smaller than feared or observed, so far as I can ascertain. And most serious number crunchers seem to have backup projects that BOINC switches to automatically.

Comments?
Profile Zalster Special Project $250 donor
Volunteer tester

Joined: 27 May 99
Posts: 4734
Credit: 301,838,320
RAC: 307,771
United States
Message 1917694 - Posted: 8 Feb 2018, 19:07:29 UTC - in response to Message 1917687.  

I run quad GPUs. Turnaround time is 0.2 days. That means roughly every 5 hours I return 400 work units. Nothing ever sits for more than half a day (and only because SETI goes down for maintenance). If you restrict the amount of data allowed to go out, you are going to place a HUGE amount of demand on the servers as all the machines looking for data slam them to get work. It's like what happens right after we come back from maintenance: for the first few hours, the servers struggle to maintain flow. Once everyone gets their cache filled, the demand drops and the servers are more responsive. Remove the ability to fill people's caches and the servers will be tied up 24/7.
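Those figures are self-consistent, as a back-of-envelope check shows (the 100-tasks-per-GPU server limit is my assumption for what caps the cache at ~400 units):

```python
# Sanity check of the throughput in this post.  The 100-task-per-GPU
# server-side limit is an assumption about what caps the cache at ~400.
gpus = 4
tasks_per_gpu = 100
turnaround_days = 0.2                 # reported average turnaround

cache = gpus * tasks_per_gpu          # ~400 work units in flight
hours = turnaround_days * 24          # ~4.8 h, i.e. "roughly every 5 hours"
rate_per_hour = cache / hours         # ~83 work units returned per hour
print(cache, round(hours, 1), round(rate_per_hour))
```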

Validation times will also go UP, not down. With fewer work units going out, fewer are coming back. Instead of weeks for validation, SETI could be looking at months.

Run times vary by machine. Mine run 8 minutes; an older machine might take 2 hours or more. It depends on the equipment.

The answer isn't reducing capacity. Unfortunately, time, money, and manpower aren't things that are readily available to SETI.
Profile Keith Myers Special Project $250 donor
Volunteer tester

Joined: 29 Apr 01
Posts: 5564
Credit: 385,185,971
RAC: 1,029,916
United States
Message 1917696 - Posted: 8 Feb 2018, 20:06:05 UTC

If you are worried about turnaround time and the stress it puts on the servers, the quick and easy fix would be to shorten the task deadlines to something around the nominal return time we currently see. This has been discussed at length in another thread.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Profile Wiggo "Socialist"

Joined: 24 Jan 00
Posts: 14931
Credit: 193,582,312
RAC: 54,029
Australia
Message 1917714 - Posted: 8 Feb 2018, 22:00:38 UTC

Maybe a bit more homework on this subject should've been done before posting.

For starters, anyone with a rig that can store a week's worth of work has a slow rig.

There are hard limits of 100 CPU tasks and 100 tasks per GPU for every rig; even with those limits I'm very lucky to make it through the weekly outage without running out of work (I regularly turn to my backup projects before the end of an outage).
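Whether those limits cover an outage is easy to estimate. A sketch, where the core/GPU counts, per-task run times, and outage length are illustrative assumptions rather than any particular rig's figures:

```python
# Will the hard task limits last through the weekly outage?
# All numbers below are assumptions for illustration only.
cpu_limit = 100            # hard limit: CPU tasks per rig
gpu_limit = 100            # hard limit: tasks per GPU
cpu_cores, gpus = 4, 1     # assumed rig
cpu_task_h = 2.0           # assumed CPU run time per task
gpu_task_h = 0.25          # assumed GPU run time per task
outage_h = 6               # assumed maintenance window

cpu_buffer_h = cpu_limit * cpu_task_h / cpu_cores   # 50 h of CPU work
gpu_buffer_h = gpu_limit * gpu_task_h               # 25 h of work per GPU
print(cpu_buffer_h >= outage_h, gpu_buffer_h >= outage_h)
```

A fast GPU chewing through tasks in minutes shrinks that buffer sharply, which is why fast rigs run dry first.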

A better place to look for a solution would be stopping those people who join, suck up a full cache's worth of work, and then, on finding that their computers (especially those with usable GPUs) aren't responding the way they should, dump the program and leave all their remaining tasks to time out. Fixing that problem would certainly take a real load off the servers. ;-)

Cheers.
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,633,544
RAC: 265
United States
Message 1917750 - Posted: 8 Feb 2018, 23:09:52 UTC - in response to Message 1917714.  

Oh, yes. I forgot about the 100-task limits. Those need to play into it. But in your case, those limits sound small.

Yes, killing off zombies should help, and that could be attacked separately. I seem to remember there is a feature in BOINC that could help with this, but SETI never supported it. I can't remember the details, though.

Yes, shortening the turnaround time seems a good idea, possibly coupled with limiting the amount of work permitted to be cached. My guess is that shortening it gradually will have no effect on results received, up to a point. That point is close to where the new cut-off should be set.

I don't understand Zalster's comment. A modest reduction in maximum queue length, say from 10 days to 2 days, shouldn't hurt many clients, but it would cut down the load of work units waiting around on the clients and on the server. Perhaps that change could be tested on a willing client, to see whether reducing the local queue to two days changes the average output. One fly in the ointment might be the 100-task limit, I suppose. In Zalster's case, care would be needed to look at the results in terms of individual compute-node performance, because he has so many compute nodes!

My point is that the flow of work should have minimal latency, though what counts as minimal may require some empiricism. Buffering vast amounts of work on the clients on purpose makes no sense and probably harms the project by increasing overhead at the servers. Buying larger disks and faster servers isn't the answer, at least not the prudent answer, to my mind.
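Put quantitatively: by Little's law, the number of results in the field equals the return rate times the average turnaround, so cutting latency shrinks the database tables proportionally. A rough sketch with the numbers from this thread (the 10-hour target is purely hypothetical):

```python
# Little's law: results in flight = return rate x average turnaround.
in_field = 5_000_000         # ~5M results in the field (from this thread)
turnaround_h = 30.0          # current average turnaround (from this thread)

rate_per_h = in_field / turnaround_h          # implied return rate, ~167k/h
target_turnaround_h = 10.0                    # hypothetical reduced turnaround
print(round(rate_per_h * target_turnaround_h))  # ~1.67M in flight instead of 5M
```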
Profile Keith Myers Special Project $250 donor
Volunteer tester

Joined: 29 Apr 01
Posts: 5564
Credit: 385,185,971
RAC: 1,029,916
United States
Message 1917764 - Posted: 9 Feb 2018, 0:30:18 UTC - in response to Message 1917750.  

I believe you will find that the majority of fast crunchers have their cache workload set very low. I have mine set at 2 days, with 0.1 additional days of work cached. That is more than enough to keep the 100 tasks per CPU and per GPU at the limit. The small "additional days" setting forces BOINC to ask for replacement work at every scheduler check-in. We have plenty of trouble keeping the crunchers fed through a project maintenance day or an unscheduled outage, because of the fast turnaround time and throughput.
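Roughly how those two settings drive work requests (a simplification for illustration, not the actual BOINC client work-fetch algorithm, and `work_request_days` is a made-up name):

```python
# Simplified model of BOINC work fetch: ask for work when the cached
# estimated runtime drops below the "store at least" setting plus the
# "store additional" margin, and top back up to that level.
def work_request_days(cached_days, min_days=2.0, extra_days=0.1):
    """Days of work to request at a scheduler check-in (illustrative)."""
    target = min_days + extra_days
    if cached_days >= target:
        return 0.0
    return target - cached_days

print(work_request_days(2.5))   # cache full: 0.0
print(work_request_days(1.0))   # request ~1.1 days' worth
```

With a small `extra_days`, the cache hovers just under the target, so nearly every check-in triggers a small top-up request, matching the behavior described above.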
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 10345
Credit: 138,842,811
RAC: 84,024
Australia
Message 1917815 - Posted: 9 Feb 2018, 4:35:34 UTC - in response to Message 1917764.  
Last modified: 9 Feb 2018, 4:39:18 UTC

I believe you will find that the majority of fast crunchers have their cache work load set very low.

The faster the system, the less the cache setting means.
Only slow systems can hold a cache of work larger than 24 hours.

It would be nice to limit the maximum cache to, say, 4 days and then increase the server-side limits, so that the freed-up WUs can go to faster systems and those systems aren't out of work for as long each week. But I suspect the actual impact on the number in progress would be minimal.
Grant
Darwin NT



 
©2018 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.