Please raise the limits... just a little...

tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1345751 - Posted: 12 Mar 2013, 7:37:26 UTC - in response to Message 1345738.  

S@H uses multiple servers, so the load on the upload servers does not affect that on the download servers. The upload servers feed into the validators, which aren't affected by the poor performance of the download servers - S@H has always had a large pool of data awaiting validation, and appears to manage that side of things quite well.


Yeah, I don't really understand this whole argument Eric or Matt (or both of them) put forward that we had a database limitation that forced the 100-WU limit. I'm not arguing, I'm saying I don't understand.

Couldn't we go to a 2-day cache and crush the number of entries the DB had to keep up with? I would think that going from 10 days to 2 days would be at least a 66% decrease in the size of *that* DB.

I obviously don't understand the issue.
Sakletare
Joined: 18 May 99
Posts: 132
Credit: 23,423,829
RAC: 0
Sweden
Message 1345767 - Posted: 12 Mar 2013, 10:02:47 UTC - in response to Message 1345723.  

I believe the problem comes not from the number any one super cruncher can do in a day, but from how many of those will persist in the database waiting for its mate from a wingman with a much slower system.

The project could save database space if the scheduler paired up fast hosts with fast hosts and slow hosts with slow hosts. That way fewer work units would be hanging around waiting for the wingman.

It would be a little more overhead but perhaps the hardware can take it.
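For illustration only, here is a minimal Python sketch of that pairing idea. It is not BOINC's actual scheduler code; the host/workunit structures, the tier thresholds and the function names are all invented assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Host:
    host_id: int
    avg_turnaround_days: float   # how long this host usually takes to return a task

@dataclass
class Workunit:
    wu_id: int
    first_host_tier: Optional[int] = None   # speed tier of the host holding the first copy

def speed_tier(host: Host) -> int:
    # Bucket hosts into coarse speed tiers by typical turnaround (thresholds invented).
    if host.avg_turnaround_days <= 1:
        return 0   # fast
    if host.avg_turnaround_days <= 4:
        return 1   # medium
    return 2       # slow

def pick_workunit(host: Host, pending: list) -> Optional[Workunit]:
    # Prefer a workunit whose first copy went to a host in the same tier,
    # so both results come back (and validate) at roughly the same time.
    if not pending:
        return None
    tier = speed_tier(host)
    same_tier = [wu for wu in pending if wu.first_host_tier == tier]
    unassigned = [wu for wu in pending if wu.first_host_tier is None]
    chosen = (same_tier or unassigned or pending)[0]
    if chosen.first_host_tier is None:
        chosen.first_host_tier = tier
    return chosen

The extra bookkeeping is just one small field per workunit, which is the sort of modest overhead the post above is talking about.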
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1345799 - Posted: 12 Mar 2013, 12:41:57 UTC
Last modified: 12 Mar 2013, 12:57:11 UTC

Posted in the wrong thread:

I really don't believe that changing from 100 GPU WUs per host to 100 GPU WUs per GPU will "crash" the DB... and it would give the fastest crunchers enough WUs to get through the scheduled outages.

If the DB size is the problem, then why not simply decrease the 100 CPU WU limit to 50 WUs? That would make a big difference (100k users with 50 fewer WUs each = 5MM WUs!) compared with the probably few hundred users who have hosts with 2 or 3 GPUs (let's say 1K x 100 = 100K WUs). Not to mention, very few hosts could do 100 CPU WUs in 6 hours...

(edit)
Of course, not all users really hit the 100 CPU WU limit; most of the slow ones with small caches don't go near it.

The current size of the DB, as shown on the server status page, is about 3.6MM results out in the field, 2.9MM waiting for validation and 3.5MM waiting for purging, so an increase of a few hundred thousand from a 100 WU/GPU limit can't make any real difference to the DB.
James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1345818 - Posted: 12 Mar 2013, 13:36:47 UTC

Who knows when, but at least all the different versions will have to be the new version 7 for anyone to crunch. Will it help? I hope so.

Old James
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1345856 - Posted: 12 Mar 2013, 20:14:26 UTC - in response to Message 1345767.  

Yeah, I don't really understand this whole argument Eric or Matt (or both of them) put forward that we had a database limitation that forced the 100-WU limit. I'm not arguing, I'm saying I don't understand.


I believe the problem comes not from the number any one super cruncher can do in a day, but from how many of those will persist in the database waiting for its mate from a wingman with a much slower system.

The project could save database space if the scheduler paired up fast hosts with fast hosts and slow hosts with slow hosts. That way fewer work units would be hanging around waiting for the wingman.

It would be a little more overhead but perhaps the hardware can take it.

The issue, as I've understood what Matt said, is not a limitation of the DB per se, nor does it have to do with the disk space needed; it's the time it takes for the scheduler to get an answer from it. As the DB grows, the queries take longer to answer because there is more data to filter, and if that time gets too long the scheduler fails, losing the connection with the client and creating ghosts, which then makes the queries even longer.

Disclaimer: I'm not trying to give a technically accurate answer, nor am I saying that the limits are the right fix for that issue... I'm just "translating" what Matt said into what I've understood...

Someone suggested the idea of reduced deadlines to reduce the number of tasks the DB has to track, and (IIRC) Eric or someone else in the lab said it was a plausible and good idea... but I think they have not applied it.
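A toy illustration of the feedback loop described above: query time grows with the number of rows, a timeout leaves ghosts behind, and the ghosts make the next query slower still. All numbers here are invented for the sake of the example; this is not a model of the real SETI@home database.

TIMEOUT_S = 5.0                 # assumed scheduler timeout waiting on the DB
QUERY_COST_PER_RESULT = 1e-6    # assume query time scales with rows in the result table

def simulate(results_in_db, requests):
    # Each request that times out orphans its tasks ('ghosts'), which stay in
    # the DB and make the next query slower; healthy requests clear results.
    for _ in range(requests):
        query_time = results_in_db * QUERY_COST_PER_RESULT
        if query_time > TIMEOUT_S:
            results_in_db += 100   # reply lost: roughly a cache-full of ghosts left behind
        else:
            results_in_db -= 50    # request succeeds and reports/clears some results
    return results_in_db

print(simulate(4_000_000, 1000))   # below the tipping point: the backlog shrinks
print(simulate(6_000_000, 1000))   # above it: ghosts pile up and it snowballs

Once the backlog crosses the tipping point, every failed report makes the next one more likely to fail, which is the runaway behaviour the per-host limits were meant to head off.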
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1345872 - Posted: 12 Mar 2013, 20:45:54 UTC

I believe the larger work units, which are in the works, will be the better solution. However, it will take time to test and develop that stuff, which as far as I know came to a halt for some reason.
IIRC the value of 'Results out in the field' was over 8,000,000 when everything started to get really bad. If a fixed number of results is an issue for the servers, adding code to stop work unit generation at that point would also be a good idea: say, if the results out in the field hit 7,000,000, or whatever the magic number is, stop making work units until the count drops back to 6,000,000.
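A minimal Python sketch of that high/low water-mark idea, assuming invented thresholds and an invented splitter interface (this is not actual SETI@home server code):

HIGH_WATER = 7_000_000   # stop creating new workunits once the field count reaches this
LOW_WATER = 6_000_000    # resume only after the backlog has drained back to this

class SplitterThrottle:
    def __init__(self):
        self.paused = False

    def allow_splitting(self, results_in_field):
        # Hysteresis: pause at HIGH_WATER, resume only below LOW_WATER, so the
        # splitters don't flap on and off around a single threshold.
        if results_in_field >= HIGH_WATER:
            self.paused = True
        elif results_in_field <= LOW_WATER:
            self.paused = False
        return not self.paused

The gap between the two thresholds is what keeps the splitters from oscillating every time the count crosses a single magic number.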
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
Richard Haselgrove
Volunteer tester

Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1345954 - Posted: 12 Mar 2013, 23:36:25 UTC - in response to Message 1345872.  

I believe the larger work units, which are in the works, will be the better solution. However, it will take time to test and develop that stuff, which as far as I know came to a halt for some reason.
IIRC the value of 'Results out in the field' was over 8,000,000 when everything started to get really bad. If a fixed number of results is an issue for the servers, adding code to stop work unit generation at that point would also be a good idea: say, if the results out in the field hit 7,000,000, or whatever the magic number is, stop making work units until the count drops back to 6,000,000.

Except remember that we've seen failures where the server status page fails to update - and whatever underlying data drives the SSP fails to get through as well. We already have checks in place that are supposed to inhibit the splitters when 'ready to send' reaches a high water mark: when the SSP failed, the splitters went on splitting, and splitting, and splitting....

Unless you can make the cap on 'results in the field' more failsafe than that, we risk a runaway event taking us into dangerous territory - again.
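One way to make such a cap fail-closed, in the spirit of the point above, might be to treat a missing or stale count as a reason to stop splitting. The freshness window, the cap value and the function below are assumptions for illustration, not anything the project actually runs:

import time

HIGH_WATER = 7_000_000    # same assumed cap as in the earlier sketch
MAX_AGE_S = 15 * 60       # treat the count as unusable if it is older than 15 minutes

def allow_splitting_failsafe(results_in_field, count_timestamp):
    # Fail closed: any doubt about the count means 'stop splitting'.
    now = time.time()
    if results_in_field is None or count_timestamp is None:
        return False                        # no data at all: stop
    if now - count_timestamp > MAX_AGE_S:
        return False                        # stale data (e.g. SSP not updating): stop
    return results_in_field < HIGH_WATER    # fresh data: apply the cap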
Alaun

Joined: 29 Nov 05
Posts: 18
Credit: 9,310,773
RAC: 0
United States
Message 1345957 - Posted: 12 Mar 2013, 23:48:59 UTC

Easy to say, probably harder to do:

1. The project has been capped by bandwidth for quite a while. Move the servers to the colo facility and open the taps!

2. If the database has too many entries to keep up, make the workunits much, much larger! Have the GPUs work for several hours or a day on one task. Fewer files to move around, fewer database entries.

3. If the CPU crunchers are too slow to be wingmen now, that trend will only continue. Either separate the workunits or stop sending them to CPUs at all! How much computing power do we have available, CPU vs. GPU, anyway?
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1345972 - Posted: 13 Mar 2013, 1:42:46 UTC

I believe point 2 is in progress, albeit slowly (stalled?).

For point 3, you shouldn't underestimate how valuable CPU results are. For starters, it is more time-efficient to send VLAR WUs to CPUs rather than GPUs, and much more importantly, CPU results are generally more accurate than those from GPUs (related to the precision of floating-point calculations, I think). So it is very important to have tasks continue to be sent to CPUs, especially for resolving inconclusive results from GPUs.
Soli Deo Gloria
bill

Joined: 16 Jun 99
Posts: 861
Credit: 29,352,955
RAC: 0
United States
Message 1345975 - Posted: 13 Mar 2013, 1:47:57 UTC - in response to Message 1345972.  

VLARs are faster on my GPU than on my CPU.

I think your brush may need trimming. It's too broad.
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1345979 - Posted: 13 Mar 2013, 1:55:37 UTC - in response to Message 1345975.  

It's not that GPUs can't process VLAR WUs faster than CPUs - they can. The problem is that VLAR WUs on GPUs - especially NV GPUs - are drastically slower than regular MB WUs. Thus, in terms of opportunity cost, it makes sense to move VLAR WUs to the CPU, as has been done for a long time now.
Soli Deo Gloria
bill

Joined: 16 Jun 99
Posts: 861
Credit: 29,352,955
RAC: 0
United States
Message 1345986 - Posted: 13 Mar 2013, 2:45:47 UTC - in response to Message 1345979.  

And MBs are slower on CPUs than GPUs. VLARs were moved to CPUs only because they were crashing the older GPUs, more so than because of the speed issue. Why should S@H care where I crunch VLARs as long as they validate?
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1346000 - Posted: 13 Mar 2013, 4:43:52 UTC - in response to Message 1345986.  

And MBs are slower on CPUs than GPUs. VLARs were moved to CPUs only because they were crashing the older GPUs, more so than because of the speed issue. Why should S@H care where I crunch VLARs as long as they validate?

A few VLARs at a time don't worry my GTX 560 Ti or my GTX 550 Tis either, but under a heavy, constant VLAR storm all of mine will eventually stutter to a stop, just like the older cards, with their memory clogged full - so they are definitely not worth doing on Nvidia GPUs.

Cheers.
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1346014 - Posted: 13 Mar 2013, 6:00:09 UTC

I'm aware of the crashing issue on lower-end GPUs. But obviously the concept of opportunity cost is completely lost. No matter, it's still the case that GPU calculations are not quite as precise as those made on the CPU.
Soli Deo Gloria
bill

Joined: 16 Jun 99
Posts: 861
Credit: 29,352,955
RAC: 0
United States
Message 1346015 - Posted: 13 Mar 2013, 6:01:32 UTC - in response to Message 1346000.  

And MBs are slower on CPUs than GPUs. VLARs were moved to CPUs only because they were crashing the older GPUs, more so than because of the speed issue. Why should S@H care where I crunch VLARs as long as they validate?

A few VLARs at a time don't worry my GTX 560 Ti or my GTX 550 Tis either, but under a heavy, constant VLAR storm all of mine will eventually stutter to a stop, just like the older cards, with their memory clogged full - so they are definitely not worth doing on Nvidia GPUs.

Cheers.


Mine have never stuttered to a stop so THEY'RE DEFINITELY WORTH DOING ON MY NVIDIA GPUS, and I'm not the only one to have reported the same results.
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1346018 - Posted: 13 Mar 2013, 6:08:45 UTC

I don't suppose it's worth noting that even within the WUs marked as VLAR there's a fair angle range difference among them. Before the server switched to sending VLAR WUs only to CPUs, I noticed that WUs in the upper end of the VLAR zone were only taking perhaps two or three times as long as a regular MB WU, while those towards the lower end were taking as long as ten times the normal run-time. I'm pretty sure this was among the reasons why people chose to use the VLAR-kill version of the MB CUDA application, before the server eventually implemented VLAR detection.

But I tire of this debate. If you want to process VLAR WUs on your GPUs, go right ahead. There's no need to shout and carry on, arguing the point with discourtesy.
Soli Deo Gloria
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 1346033 - Posted: 13 Mar 2013, 6:54:36 UTC - in response to Message 1345751.  
Last modified: 13 Mar 2013, 7:02:28 UTC

S@H uses multiple servers, so the load on the upload servers does not affect that on the download servers. The upload servers feed into the validators, which aren't affected by the poor performance of the download servers - S@H has always had a large pool of data awaiting validation, and appears to manage that side of things quite well.


Yeah, I don't really understand this whole argument Eric or Matt (or both of them) put forward that we had a database limitation that forced the 100-WU limit. I'm not arguing, I'm saying I don't understand.

Couldn't we go to a 2-day cache and crush the number of entries the DB had to keep up with? I would think that going from 10 days to 2 days would be at least a 66% decrease in the size of *that* DB.

I obviously don't understand the issue.


As best I understood the problem:

Some super crunchers literally had 10,000+ units in various states that needed to be searched through for a database operation (it might have been to report completion). The search took longer than the connection timeout, so the report didn't complete. The super cruncher tries again, same result. During all this, other users' hosts start failing to report because the super crunchers who keep retrying are overtaxing the database. It cascades to the point where nearly everyone is experiencing timeouts when reporting. Upload and download rates don't matter if a finished unit can't be reported.

So, to shrink the number of units a super cruncher had - valid but not yet deleted, pending on a wingman or a conclusive match, and in progress - down to a number that could be processed with little fear of timing out the connection, the fixed queue-size limits were introduced.
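For illustration, a minimal sketch of what a fixed per-host in-progress cap like that looks like; the limit value and the function are assumptions, not the actual BOINC scheduler code:

PER_HOST_LIMIT = 100   # assumed cap on unfinished tasks a single host may hold

def tasks_to_send(requested, in_progress, limit=PER_HOST_LIMIT):
    # Never let one host hold more than `limit` unfinished tasks,
    # no matter how large a cache it asks for.
    headroom = max(0, limit - in_progress)
    return min(requested, headroom)

# A host asking for 500 tasks with 80 already out gets only 20 more.
print(tasks_to_send(requested=500, in_progress=80))   # -> 20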

Again, that's at least how I understood the problem.

PS: I imagine those with slow machines, whose cache settings wouldn't even ask for 100 units in progress, won't get 100 units, just the number that satisfies their days-based cache request.
"Life is just nature's way of keeping meat fresh." - The Doctor
bill

Joined: 16 Jun 99
Posts: 861
Credit: 29,352,955
RAC: 0
United States
Message 1346034 - Posted: 13 Mar 2013, 6:55:48 UTC - in response to Message 1346018.  

"it's still the case that GPU calculations are not quite as precise as those made on the CPU."

In that case GPUs shouldn't be used at all then.

I've done hundreds of VLARs on my GPUs at every angle that they sent me.
I crunch them there to allow my old slow CPU to crunch the quicker MBs.
That's how I judge my setup to be most efficient as to how I use it.
I'm sure your mileage will vary. And I wasn't debating anything.

The problem is in people giving out false information to other people who don't know it's not true. That you and others have not been able to run VLARs on NVIDIA GPUs does not make it true in all instances. If somebody wants to run VLARs that end up validating on their NVIDIA GPUs, then it's their concern and nobody else's. And if the information had been passed out in a way that stated VLARs may or may not work for someone's NVIDIA GPU, but because of problems on older NVIDIA GPUs SETI@home has restricted them to CPUs only, nobody would ever hear a peep out of me about it.
trader
Volunteer tester

Joined: 25 Jun 00
Posts: 126
Credit: 4,968,173
RAC: 0
United States
Message 1346039 - Posted: 13 Mar 2013, 7:10:38 UTC - in response to Message 1346018.  

My two cents' worth on this topic...

I think an increase in the WUs allotted is a good idea, HOWEVER!! I think it should only be implemented for, and based on, systems and users that meet X criteria, where X is a numerical value based on how many WUs said user and machine have done and how long said user and machine have been crunching.

Example: a + b + c = x

a = number of years the user has been actively crunching (if you started crunching in 2000 but have only been actively crunching since 2008, a would be 5)

b = number of years (or months) the machine has been actively crunching

c = number of processors usable for crunching

This way only people with proven records get more work (a rough sketch of the arithmetic follows below). Anybody else think this is a good way to go?
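A rough Python sketch of that eligibility score, purely for illustration - the threshold and the idea of scaling the limit with the score are assumptions, not anything specified in the post beyond x = a + b + c:

def eligibility_score(years_user_active, years_host_active, usable_processors):
    # x = a + b + c as defined in the post above.
    return years_user_active + years_host_active + usable_processors

def task_limit(score, base_limit=100, threshold=10.0):
    # Hosts below the threshold keep the base limit; proven ones get more.
    if score < threshold:
        return base_limit
    return base_limit + int(score - threshold) * 10   # +10 tasks per point above the bar

# Hypothetical example: a = 5 (active since 2008), b = 3, c = 8 processors.
x = eligibility_score(5, 3, 8)   # x = 16
print(task_limit(x))             # -> 160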



I RTFM and it was WYSIWYG then i found out it was a PEBKAC error
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1346045 - Posted: 13 Mar 2013, 7:28:18 UTC - in response to Message 1346034.  
Last modified: 13 Mar 2013, 7:29:06 UTC

In that case GPUs shouldn't be used at all then.

~sigh~ This is not a fair conclusion to come to. GPUs are less precise than CPUs, but in most cases they are still 'good enough'. A problem arises when there are results returned to the server that match fairly closely, but not closely enough for the validator to make a call on their validity. Sometimes this may be due to the reduced precision a GPU may have compared with a CPU. This is all that I meant when I said that CPUs are still valuable in the science of this project.

I've done hundreds of VLARs on my GPUs at every angle that they sent me.
I crunch them there to allow my old slow CPU to crunch the quicker MBs.
That's how I judge my setup to be most efficient as to how I use it.

By 'quicker MBs' I'm guessing you mean the plain WUs not labelled as VLAR. This is certainly an interesting choice. There doesn't appear to be a substantial difference between the run-times for VLAR and non-VLAR MB WUs on a CPU (excluding 'shorties'), yet there is certainly a huge difference in run-times between VLAR and non-VLAR MB WUs on NV GPUs - at least for ones that you're not using. But if you choose to run the VLAR WUs on the GPU, that's certainly your right.

The problem is in people giving out false information to other people that don't know it's not true. That you and others have not been able to run VLARS on NVIDIA GPUs does not make it true in all instances.

Who says it is false information? I never said that I could not run VLAR WUs on NV GPUs, only that there is a severe performance penalty in doing so. It may well be that Fermi and Kepler GPUs don't suffer as much as previous NV generations. If so, then that is good news. But it's not fair to flat-out declare that this is false information. Certainly, a great many contributors - including myself - found it beneficial for stability and performance reasons to redirect VLAR WUs to the CPU (despite the hit to the APR count), before the server did this automatically, and some even went so far as to write scripts to automate the process.
Soli Deo Gloria