Please rise the limits... just a little...


tbret (Project donor)
Volunteer tester
Joined: 28 May 99
Posts: 2676
Credit: 201,808,914
RAC: 502,776
United States
Message 1345736 - Posted: 12 Mar 2013, 6:46:13 UTC - in response to Message 1345723.

I believe the problem comes not from the number any one super cruncher can do in a day, but from how many of those will persist in the database waiting for their mate from a wingman with a much slower system.

The top host can, on average, crunch one assigned GPU unit in under a minute (I've seen it as low as 35 seconds), thanks to the number of GPUs and to multitasking multiple units on each GPU. In a day we are talking 1,400-2,500 units. My very low-end GPU can do one in a little less than an hour; it takes around 3.5 days to process 100 GPU units. Right now 15-25% of them per day are still waiting for their wingman when they are reported. Just imagine the percentage for a super cruncher.

It's the ones pending validation, as well as those assigned in the in-progress queue, that clog up the database lookups for super crunchers. I have roughly 25% of an average day's worth of work that's been pending for more than 30 days. What's the rate for someone who can crunch several hundred, if not several thousand, in a day? How many persist for weeks, filling up the database and slowing lookup times?

It's not that the super crunchers are the problem; they are just the ever-accelerating conveyor belt of bonbons that Lucy simply can't box fast enough.


A happy situation might be for the cache limits to go from 100 work units to 0.5- or 1-day caches, and for the timeout limits to be shortened. That would be a "validate or perish" situation for the database.

Oh, not to mention that it would cut the number of entries in the database for everyone who does fewer than 100 work units in a day (and for CPU work units, of which it would be difficult to do 100 in a day).
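
As a rough illustration of how a time-based cache changes the count (using only the per-unit times mentioned above; the numbers are a sketch, not project figures):

# Rough sketch: tasks held under a 0.5-day cache for hosts of different
# speeds, versus the current fixed 100-task limit.  Per-task times are
# the figures quoted above, purely illustrative.

CACHE_DAYS = 0.5
SECONDS_PER_DAY = 24 * 60 * 60

hosts = {
    "top multi-GPU host (~45 s/task)": 45,
    "low-end GPU (~1 hour/task)": 3600,
}

for name, secs_per_task in hosts.items():
    cache_tasks = CACHE_DAYS * SECONDS_PER_DAY / secs_per_task
    print(f"{name}: ~{cache_tasks:.0f} tasks in cache (fixed limit today: 100)")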

More connections? I doubt it.

Look at the production of the computers you find in the 960-1000th places in the "top computers" list. There are a LOT of computers making a LOT of connections to re-build a 100 work unit cache.

I *really* hope Matt is making headway in getting our super-duper servers into a building with a super-duper connection to the outside world.


rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8241
Credit: 54,280,698
RAC: 73,762
United Kingdom
Message 1345738 - Posted: 12 Mar 2013, 6:58:35 UTC

S@H uses multiple servers, so the load on the upload servers does not affect that on the download servers. The upload servers feed into the validators, which aren't affected by the poor performance of the download servers - S@H has always had a large pool of data awaiting validation, and it appears to manage that side of things quite well.
The download servers, on the other hand, are struggling to cope with demand. It's bound not to be a simple problem (apart from the lack of bandwidth) but a deep-rooted one that is taking a lot of effort to isolate and resolve. Contributors I can think of include the retry/back-off behaviour of clients, the imbalance between the two download servers (one is massively faster than the other), the "auto-sync" between the various time-outs and hand-overs (they all appear to be based on 5 minutes), and so on. Not to mention that there are a large number of different versions of BOINC out there, all with subtly different "approaches to the world". And finally there are the abusers, sorry, users....
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

tbret (Project donor)
Volunteer tester
Joined: 28 May 99
Posts: 2676
Credit: 201,808,914
RAC: 502,776
United States
Message 1345751 - Posted: 12 Mar 2013, 7:37:26 UTC - in response to Message 1345738.

S@H uses multiple servers, so the load on the upload servers does not affect that on the download servers. The upload servers feed into the validators, which aren't affected by the poor performance of the download servers - S@H has always had a large pool of data awaiting validation, and it appears to manage that side of things quite well.


Yeah, I don't really understand this whole argument Eric or Matt or both of them put forward that we had a database limitation that forced the 100-WU limit. I'm not arguing, I'm saying I don't understand.

Couldn't we go to a 2-day cache and crush the number of entries the db had to keep up with? I would think that going from 10 days to 2 days would be at least a 66% decrease in the size of *that* db.

I obviously don't understand the issue.

Sakletare
Joined: 18 May 99
Posts: 131
Credit: 20,831,551
RAC: 5,701
Sweden
Message 1345767 - Posted: 12 Mar 2013, 10:02:47 UTC - in response to Message 1345723.

I believe the problem comes not from the number any one super cruncher can do in a day, but from how many of those will persist in the database waiting for their mate from a wingman with a much slower system.

The project could save database space if the scheduler paired fast hosts with fast hosts and slow hosts with slow hosts. That way fewer work units would be left hanging around waiting for a wingman.

It would add a little more overhead, but perhaps the hardware can take it.
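
A minimal sketch of what that kind of pairing might look like, purely hypothetical: the "speed" value is just a stand-in for something like a host's recent average credit, and this is not the actual BOINC scheduler code.

# Hypothetical sketch: pair hosts of similar speed as wingmen, so a fast
# host's result isn't left waiting for weeks on a much slower partner.

from dataclasses import dataclass

@dataclass
class Host:
    host_id: int
    speed: float  # stand-in for recent average credit or tasks per day

def pair_by_speed(hosts):
    """Sort hosts by speed and pair neighbours, so wingmen finish at similar times."""
    ordered = sorted(hosts, key=lambda h: h.speed, reverse=True)
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]

if __name__ == "__main__":
    hosts = [Host(1, 2000), Host(2, 25), Host(3, 1800), Host(4, 30)]
    for a, b in pair_by_speed(hosts):
        print(f"host {a.host_id} ({a.speed}/day) <-> host {b.host_id} ({b.speed}/day)")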

juan BFB (Project donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 5125
Credit: 279,190,702
RAC: 446,247
Brazil
Message 1345799 - Posted: 12 Mar 2013, 12:41:57 UTC
Last modified: 12 Mar 2013, 12:57:11 UTC

Posted in the wrong thread:

I really don't believe that changing from 100 GPU WUs per host to 100 GPU WUs per GPU will "crash" the DB... and it would give the fastest crunchers enough WUs to get through the scheduled outages.

If the DB size is the problem, then why not simply decrease the 100 CPU WU limit to 50 WUs? That would make a big difference (100K users with 50 fewer WUs each = 5 million WUs!) compared with the probably few hundred users who have 2- or 3-GPU hosts (let's say 1K x 100 = 100K WUs). Not to mention, very few hosts could do 100 CPU WUs in 6 hours...

(edit)
Of course not all users actually hit the 100 CPU WU limit; most of the slow ones with small caches don't come near it.

The actual size of the DB, as it appears on the server page, is about 3.6 million results in the field, 2.9 million awaiting validation and 3.5 million awaiting purging, so the few hundred thousand extra results from a 100 WU/GPU limit can't make any real difference to the DB.
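
Putting rough numbers on that comparison (a back-of-envelope sketch using only the figures quoted above, not measured data):

# Back-of-envelope: cutting the CPU limit from 100 to 50 WUs across many
# hosts frees far more result rows than raising the GPU limit would add.
# All counts are the rough figures quoted above, purely illustrative.

cpu_hosts = 100_000                  # hosts that could each hold 50 fewer CPU WUs
rows_freed = cpu_hosts * 50          # about 5,000,000 rows freed

multi_gpu_hosts = 1_000              # hosts that might gain ~100 extra GPU WUs each
rows_added = multi_gpu_hosts * 100   # about 100,000 rows added

current_rows = 3_600_000 + 2_900_000 + 3_500_000  # in field + validation + purge

print(f"rows freed by the CPU cut:    {rows_freed:,}")
print(f"rows added by the GPU raise:  {rows_added:,}")
print(f"result rows in the DB today:  {current_rows:,}")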
____________

James Sotherden (Project donor)
Joined: 16 May 99
Posts: 8635
Credit: 32,458,989
RAC: 54,944
United States
Message 1345818 - Posted: 12 Mar 2013, 13:36:47 UTC

Who knows when, but at least all the different versions will have to be the new version 7 for anyone to crunch. Will it help? I hope so.
____________

Old James

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 70,902,949
RAC: 81,787
Argentina
Message 1345856 - Posted: 12 Mar 2013, 20:14:26 UTC - in response to Message 1345767.

Yeah, I don't really understand this whole argument Eric or Matt or both of them put forward that we had a database limitation that forced the 100-WU limit. I'm not arguing, I'm saying I don't understand.


I believe the problem comes not from the number any one super cruncher can do in a day, but from how many of those will persist in the database waiting for their mate from a wingman with a much slower system.

The project could save database space if the scheduler paired fast hosts with fast hosts and slow hosts with slow hosts. That way fewer work units would be left hanging around waiting for a wingman.

It would add a little more overhead, but perhaps the hardware can take it.

The issue, as I've understood what Matt said, is not a limitation of the DB per se, nor is it about the disk space needed; it's the time it takes the scheduler to get an answer from it. As the DB grows, the queries take longer to answer because there is more data to filter, and if that time gets too long the scheduler fails, losing the connection with the client and creating ghosts, which then make the queries even longer.

Disclaimer: I'm not trying to give a technically accurate answer, nor am I saying that the limits are the right fix for that issue... I'm just "translating" what Matt said into what I've understood...

Someone suggested the idea of reduced deadlines to cut the number of tasks in the hands of the DB, and (IIRC) Eric or someone else in the lab said it was a plausible and good idea... but I think they have not applied it.
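
A toy model of that feedback loop, just to illustrate the description above (the numbers, the linear query-time assumption and the 5-minute timeout are all invented for the sketch, not how the real scheduler works):

# Toy model: as result rows grow, scheduler queries slow down; once a
# query takes longer than the client's timeout, the tasks it assigned
# become "ghosts", which adds rows and slows the next query further.
# Every number here is made up for illustration.

rows = 8_000_000
TIMEOUT_S = 5 * 60            # assumed 5-minute client timeout
SECONDS_PER_ROW = 40e-6       # assumed: query time scales with table size
TASKS_PER_REQUEST = 100

for step in range(5):
    query_time = rows * SECONDS_PER_ROW
    if query_time > TIMEOUT_S:
        rows += TASKS_PER_REQUEST  # assigned tasks the client never heard about
        outcome = "timeout -> ghosts"
    else:
        outcome = "ok"
    print(f"step {step}: {rows:,} rows, query ~{query_time:.0f} s, {outcome}")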
____________

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 3988
Credit: 109,731,284
RAC: 132,377
United States
Message 1345872 - Posted: 12 Mar 2013, 20:45:54 UTC

I believe the larger work units, which are in the works, will be the better solution. However, it will take time to test and develop that stuff, and as far as I know the work came to a halt for some reason.
IIRC the value of "Results out in the field" was over 8,000,000 when everything started to get really bad. If a fixed number of results is an issue for the servers, adding code to stop work unit generation at that point would also be a good idea - for example, if the results out in the field hit 7,000,000, or whatever the magic number is, stop making work units until the count drops to 6,000,000.
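
Something like that high/low water mark might look like this (a sketch only; the thresholds and the function are assumptions, not the project's actual splitter code):

# Sketch of a high/low water mark on "results out in the field": stop
# creating new work once the count hits the high mark, and don't resume
# until it has fallen back to the low mark.  Thresholds are hypothetical.

HIGH_WATER = 7_000_000
LOW_WATER = 6_000_000

def splitter_allowed(results_in_field: int, currently_splitting: bool) -> bool:
    if currently_splitting:
        # Keep splitting until the high-water mark is reached.
        return results_in_field < HIGH_WATER
    # Once stopped, stay stopped until we are back down at the low-water mark.
    return results_in_field <= LOW_WATER

if __name__ == "__main__":
    splitting = True
    for count in (5_500_000, 6_900_000, 7_100_000, 6_500_000, 6_000_000):
        splitting = splitter_allowed(count, splitting)
        print(f"{count:,} in the field -> splitting = {splitting}")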
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8429
Credit: 47,795,779
RAC: 55,030
United Kingdom
Message 1345954 - Posted: 12 Mar 2013, 23:36:25 UTC - in response to Message 1345872.

I believe the larger work units, which are in the works, will be the better solution. However, it will take time to test and develop that stuff, and as far as I know the work came to a halt for some reason.
IIRC the value of "Results out in the field" was over 8,000,000 when everything started to get really bad. If a fixed number of results is an issue for the servers, adding code to stop work unit generation at that point would also be a good idea - for example, if the results out in the field hit 7,000,000, or whatever the magic number is, stop making work units until the count drops to 6,000,000.

Except remember that we've seen failures where the server status page fails to update - and whatever underlying data drives the SSP fails to get through as well. We already have checks in place that are supposed to inhibit the splitters when 'ready to send' reaches a high-water mark: when the SSP failed, the splitters went on splitting, and splitting, and splitting....

Unless you can make the cap on 'results in the field' more failsafe than that, we risk a runaway event taking us into dangerous territory - again.
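
One way to make that cap more failsafe might be to have the splitter check the count directly and fail closed when it can't be read, instead of trusting a status-page snapshot that may have stopped updating. A hypothetical sketch (the query function is a placeholder, not real server code):

# Sketch of a "fail closed" cap: if the in-the-field count can't be read
# (as when the data behind the SSP stopped updating), refuse to split
# rather than carrying on blindly.  Everything here is hypothetical.

HIGH_WATER = 7_000_000

def count_results_in_field():
    """Stand-in for a direct database count; returns None if it can't be read."""
    return None  # placeholder only - the real query is not shown here

def may_split() -> bool:
    count = count_results_in_field()
    if count is None:
        return False  # cap can't be verified, so don't risk a runaway
    return count < HIGH_WATER

if __name__ == "__main__":
    print(may_split())  # False: the placeholder count is unreadable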

Alaun
Joined: 29 Nov 05
Posts: 16
Credit: 5,196,377
RAC: 0
United States
Message 1345957 - Posted: 12 Mar 2013, 23:48:59 UTC

Easy to say, probably harder to do:

1. The project has been capped by bandwidth for quite a while. Move the servers to the colo facility and open the taps!

2. If the database has too many entries to keep up with, make the work units much, much larger! Have the GPUs work for several hours or a day on one task. Fewer files to move around, fewer database entries.

3. If the CPU crunchers are too slow to be wingmen now, that trend will only continue. Either separate the work units or stop sending them to CPUs at all! How much computing power do we have available, CPU vs GPU, anyway?
____________

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 317
Credit: 134,893,877
RAC: 226,350
Australia
Message 1345972 - Posted: 13 Mar 2013, 1:42:46 UTC

I believe point 2 is in progress, albeit slowly (stalled?).

For point 3, you shouldn't underestimate how valuable CPU results are. For starters, it is more time-efficient to send VLAR WUs to CPUs rather than GPUs, and, much more importantly, CPU results are generally more accurate than those from GPUs (related to the precision of floating-point calculations, I think). So it is very important that tasks continue to be sent to CPUs, especially for resolving inconclusive results from GPUs.
____________
Soli Deo Gloria

bill
Joined: 16 Jun 99
Posts: 859
Credit: 22,689,326
RAC: 18,258
United States
Message 1345975 - Posted: 13 Mar 2013, 1:47:57 UTC - in response to Message 1345972.

VLARs are faster on my GPU than on my CPU.

I think your brush may need trimming. It's too broad.

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 317
Credit: 134,893,877
RAC: 226,350
Australia
Message 1345979 - Posted: 13 Mar 2013, 1:55:37 UTC - in response to Message 1345975.

It's not that GPUs can't process VLAR WUs faster than CPUs - they can. The problem is that VLAR WUs on GPUs - especially NV GPUs - are drastically slower than regular MB WUs. Thus, in terms of opportunity cost, it makes sense to move VLAR WUs to the CPU, as has been done for a long time now.
____________
Soli Deo Gloria

bill
Joined: 16 Jun 99
Posts: 859
Credit: 22,689,326
RAC: 18,258
United States
Message 1345986 - Posted: 13 Mar 2013, 2:45:47 UTC - in response to Message 1345979.

And MBs are slower on CPUs than GPUs. VLARs were moved to CPUs mainly because they were crashing the older GPUs, more than because of the speed issue. Why should S@H care where I crunch VLARs as long as they validate?

Wiggo
Joined: 24 Jan 00
Posts: 6684
Credit: 92,097,341
RAC: 73,810
Australia
Message 1346000 - Posted: 13 Mar 2013, 4:43:52 UTC - in response to Message 1345986.

And MBs are slower on CPUs than GPUs. VLARs were moved to CPUs mainly because they were crashing the older GPUs, more than because of the speed issue. Why should S@H care where I crunch VLARs as long as they validate?

A few VLARs at a time don't worry my GTX 560 Ti or my GTX 550 Tis either, but under a heavy, constant VLAR storm all of mine will eventually stutter to a stop, just like the older cards with their memory clogged full, so they are definitely not worth doing on Nvidia GPUs.

Cheers.
____________

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 317
Credit: 134,893,877
RAC: 226,350
Australia
Message 1346014 - Posted: 13 Mar 2013, 6:00:09 UTC

I'm aware of the crashing issue on lower-end GPUs, but obviously the concept of opportunity cost has been completely lost. No matter - it's still the case that GPU calculations are not quite as precise as those made on the CPU.
____________
Soli Deo Gloria

bill
Joined: 16 Jun 99
Posts: 859
Credit: 22,689,326
RAC: 18,258
United States
Message 1346015 - Posted: 13 Mar 2013, 6:01:32 UTC - in response to Message 1346000.

And MBs are slower on CPUs than GPUs. VLARs were moved to CPUs mainly because they were crashing the older GPUs, more than because of the speed issue. Why should S@H care where I crunch VLARs as long as they validate?

A few VLARs at a time don't worry my GTX 560 Ti or my GTX 550 Tis either, but under a heavy, constant VLAR storm all of mine will eventually stutter to a stop, just like the older cards with their memory clogged full, so they are definitely not worth doing on Nvidia GPUs.

Cheers.


Mine have never stuttered to a stop, so THEY'RE DEFINITELY WORTH DOING ON MY NVIDIA GPUS, and I'm not the only one to have reported the same results.

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 317
Credit: 134,893,877
RAC: 226,350
Australia
Message 1346018 - Posted: 13 Mar 2013, 6:08:45 UTC

I don't suppose it's worth noting that even within the WUs marked as VLAR there's a fair spread of angle ranges. Before the server switched to sending VLAR WUs only to CPUs, I noticed that WUs at the upper end of the VLAR zone were only taking perhaps two or three times as long as a regular MB WU, while those towards the lower end were taking as long as ten times the normal run-time. I'm pretty sure this was among the reasons people chose to use the VLAR-kill version of the MB CUDA application, before the server eventually implemented VLAR detection.

But I tire of this debate. If you want to process VLAR WUs on your GPUs, go right ahead. There's no need to shout and carry on, arguing the point with discourtesy.
____________
Soli Deo Gloria

Keith White
Joined: 29 May 99
Posts: 370
Credit: 2,747,548
RAC: 2,146
United States
Message 1346033 - Posted: 13 Mar 2013, 6:54:36 UTC - in response to Message 1345751.
Last modified: 13 Mar 2013, 7:02:28 UTC

S@H uses multiple servers, so the load on the upload servers does not affect that on the download servers. The upload servers feed into the validators, which aren't affected by the poor performance of the download servers - S@H has always had a large pool of data awaiting validation, and it appears to manage that side of things quite well.


Yeah, I don't really understand this whole argument Eric or Matt or both of them put forward that we had a database limitation that forced the 100-WU limit. I'm not arguing, I'm saying I don't understand.

Couldn't we go to a 2-day cache and crush the number of entries the db had to keep up with? I would think that going from 10 days to 2 days would be at least a 66% decrease in the size of *that* db.

I obviously don't understand the issue.


As best as I understood the problem:

Some super crunchers literally had 10,000+ units in various states that needed to be searched through for a database operation (it might have been reporting completion). The search took longer than the connection timeout, so the report didn't complete. The super cruncher tried again, with the same result. Meanwhile, other users' hosts started failing to report because the super crunchers trying to report were overtaxing the database. It cascaded to the point where nearly everyone was experiencing timeouts when reporting. Upload and download rates don't matter if a finished unit can't be reported.

So, to shrink the number of units a super cruncher had - valid but not yet deleted, pending a wingman or a conclusive match, and in progress - down to a number that could easily be handled with little fear of the connection timing out, the fixed queue-size limits were introduced.

Again, that's at least how I understood the problem.

PS: and I imagine that those with slow machines, whose cache settings wouldn't even give them 100 units in progress, won't get 100 units, just the number that satisfies their days-based cache request.
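
A sketch of how that might work, as I read it (hypothetical code; the names are made up, and the 100-task figure is just the limit discussed in this thread):

# Sketch: the server grants the smaller of what the host's cache setting
# asks for and the fixed in-progress limit, so a slow host with a small
# cache never comes near 100 tasks anyway.  Names are illustrative only.

IN_PROGRESS_LIMIT = 100

def tasks_to_send(requested_by_cache: int, already_in_progress: int) -> int:
    headroom = max(0, IN_PROGRESS_LIMIT - already_in_progress)
    return min(requested_by_cache, headroom)

if __name__ == "__main__":
    print(tasks_to_send(requested_by_cache=12, already_in_progress=3))    # slow host: gets 12
    print(tasks_to_send(requested_by_cache=500, already_in_progress=80))  # fast host: capped at 20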
____________
"Life is just nature's way of keeping meat fresh." - The Doctor

bill
Joined: 16 Jun 99
Posts: 859
Credit: 22,689,326
RAC: 18,258
United States
Message 1346034 - Posted: 13 Mar 2013, 6:55:48 UTC - in response to Message 1346018.

"it's still the case that GPU calculations are not quite as precise as those made on the CPU."

In that case GPUs shouldn't be used at all, then.

I've done hundreds of VLARs on my GPUs at every angle range they've sent me. I crunch them there to let my old, slow CPU crunch the quicker MBs. That's how I judge my setup to be most efficient for the way I use it. I'm sure your mileage will vary. And I wasn't debating anything.

The problem is people giving out false information to other people who don't know it's not true. That you and others have not been able to run VLARs on NVIDIA GPUs does not make it true in all instances. If somebody wants to run VLARs that end up validating on their NVIDIA GPUs, then it's their concern and nobody else's. And if the information had been passed out in a way that said VLARs may or may not work on someone's NVIDIA GPU, but because of problems on older NVIDIA GPUs Seti@home has restricted them to CPUs only, nobody would ever hear a peep out of me about it.
