100 WU limit for GPUs is too low.

BoMbY

Joined: 3 Apr 99
Posts: 8
Credit: 759,919
RAC: 0
Germany
Message 1913214 - Posted: 15 Jan 2018, 16:45:43 UTC

As I just learned, there is a maximum of 100 WUs per compute device (or device class). With an average of about 4-5 minutes per WU, that's a maximum of about 8 hours of work I can store, and with the splitters down all the time recently, this really doesn't cut it.

I've already been out of work for two hours with the current outage.
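For reference, the arithmetic behind that 8-hour figure, using the midpoint of the quoted range:

[code]
# 100 cached WUs at ~4-5 minutes each:
wus, minutes_per_wu = 100, 4.5               # midpoint of the 4-5 minute range
print(f"{wus * minutes_per_wu / 60:.1f} hours of cached work")   # -> 7.5 hours
[/code]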
ID: 1913214
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913216 - Posted: 15 Jan 2018, 16:49:13 UTC - in response to Message 1913214.  
Last modified: 15 Jan 2018, 16:49:43 UTC

As I just learned, there is a maximum of 100 WUs per compute device (or device class). With an average of about 4-5 minutes per WU, that's a maximum of about 8 hours of work I can store, and with the splitters down all the time recently, this really doesn't cut it.

I've already been out of work for two hours with the current outage.

You are preaching to the choir.
Old news, my friend. This has been discussed ad infinitum here.
For many with faster GPUs, 100 WUs/GPU runs out much sooner than that.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1913216
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1913217 - Posted: 15 Jan 2018, 17:18:48 UTC - in response to Message 1913214.  
Last modified: 15 Jan 2018, 17:19:26 UTC

As I just learned, there is a maximum of 100 WUs per compute device (or device class). With an average of about 4-5 minutes per WU, that's a maximum of about 8 hours of work I can store, and with the splitters down all the time recently, this really doesn't cut it.

I've already been out of work for two hours with the current outage.

A cache of 100 tasks is only about an 8-hour cache for my dual E5-2670, and for my R9 390X as well. When we have shorties it is even less.
There are a few options to handle running out of work during an outage.
1) Have a backup project in BOINC.
2) Let your system(s) have a break when they run out of work.
3) Not run the project.
SETI@home classic workunits: 93,865 · CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1913217
Profile marsinph
Volunteer tester

Joined: 7 Apr 01
Posts: 172
Credit: 23,823,824
RAC: 0
Belgium
Message 1913219 - Posted: 15 Jan 2018, 17:24:59 UTC

For a few weeks/months now there have been server outages. We all know it.
The result is that a lot of crunchers are out of work after a few hours, or a day.

We all know there is a limit of 100 WUs per CPU and 100 per GPU, independent of the "10 days of work plus 10 days of additional work" cache settings.
Nothing changes it: still 100 for CPU and 100 for GPU.
I understand the limit, because some users download WUs and then do nothing more with them, leaving many WUs "out of time".

More and more of us have powerful computers and GPUs. A lot of us run through those WUs in less than 24 hours, so in case of an outage we are always out of WUs.

Why not increase the limit based on RAC, or (better, I think) on the average turnaround time???
To be clear: a basic (new) user gets a max of 100 WUs, based on 10 days.
Average turnaround of 1 day: 1,000 WUs.
Or an average turnaround of 5 days: 500 WUs.
ID: 1913219
BoMbY

Joined: 3 Apr 99
Posts: 8
Credit: 759,919
RAC: 0
Germany
Message 1913220 - Posted: 15 Jan 2018, 17:25:33 UTC

Why not make a simple rule and set the limit to at least enough WUs to last 24 hours? That would be a very simple calculation. Okay, maybe not on the first day a device is running, but maybe once it has been running for at least 99% of the time over the last X days? All the data should be readily available.
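A minimal sketch of that rule; the 24-hour target and 99% uptime test come from the post, but the function and every other value are illustrative assumptions, not actual scheduler code:

[code]
# Sketch of the proposed rule: cache up to 24 hours of work per device,
# but only once the device has been on >= 99% of the measurement window.
# Hypothetical helper, not existing BOINC/SETI@home code.

def device_task_cap(avg_secs_per_task, uptime_frac, base_cap=100,
                    target_secs=24 * 3600, min_uptime=0.99):
    """Per-device task limit under the proposed 24-hour rule."""
    if uptime_frac < min_uptime or avg_secs_per_task <= 0:
        return base_cap                  # new or part-time device: today's fixed limit
    return max(base_cap, round(target_secs / avg_secs_per_task))

# A GPU averaging 4.5 min/WU that was on 99.9% of the last X days:
print(device_task_cap(270, 0.999))       # -> 320 tasks, i.e. about 24 hours of work
[/code]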
ID: 1913220
Profile marsinph
Volunteer tester

Joined: 7 Apr 01
Posts: 172
Credit: 23,823,824
RAC: 0
Belgium
Message 1913256 - Posted: 15 Jan 2018, 21:24:35 UTC - in response to Message 1913217.  

As I just learned, there is a maximum of 100 WUs per compute device (or device class). With an average of about 4-5 minutes per WU, that's a maximum of about 8 hours of work I can store, and with the splitters down all the time recently, this really doesn't cut it.

I've already been out of work for two hours with the current outage.

A cache of 100 tasks is only about an 8-hour cache for my dual E5-2670, and for my R9 390X as well. When we have shorties it is even less.
There are a few options to handle running out of work during an outage.
1) Have a backup project in BOINC.
2) Let your system(s) have a break when they run out of work.
3) Not run the project.




HAL, so if I understand correctly, you are suggesting I stop running SETI. OK,
I will follow your proposal!!!
Consider my suggestion about increasing the max allowed WUs for very active users (like you).
ID: 1913256
Profile Mr. Kevvy · Crowdfunding Project Donor · Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 1913259 - Posted: 15 Jan 2018, 21:35:44 UTC - in response to Message 1913256.  
Last modified: 15 Jan 2018, 21:38:30 UTC

The work unit limit was deliberately set like this years ago by the project scientists, to avoid stressing the fragile Informix database upon which the entire project utterly depends with too many work units in progress at once and other similar "stressers". It's been discussed to death, and it isn't going to change in the foreseeable future. Also, they don't read this forum that we know of. So, best to make one's peace with it as the way it is.

By the way, I got about 2-3 hours of work units max cached when they were Arecibo ones... I get about 4-5 hours from the BLCs, which is a slight (and accidental) improvement. We're all in the same boat.
ID: 1913259
Al · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1913269 - Posted: 15 Jan 2018, 22:49:34 UTC - in response to Message 1913259.  
Last modified: 15 Jan 2018, 22:54:46 UTC

Is there a 'better' database to use than Informix? What does Google use, for example? SQL Server by MS? And is it practical to switch? Though I have heard of Informix for quite some time, I just Googled it and found that it is an IBM product. I guess I am old, but I thought that IBM had a reputation for doing big iron and solid DBs? Or has that ossified over the years?

*edit* Just took a closer look at the quick blurb on Google, and this caught my eye: Stable release: 12.10.xC7 / June 15, 2016. I can appreciate having something just work if it ain't broke, but I don't know, it seems to me that over 18 months between stable releases is a pretty long time in the compressed IT world we live in now. But then again, what do I know? ;-)

ID: 1913269
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1913285 - Posted: 16 Jan 2018, 2:13:04 UTC - in response to Message 1913269.  

Is there a 'better' database to use than Informix? What does Google use, for example? SQL Server by MS? And is it practical to switch? Though I have heard of Informix for quite some time, I just Googled it and found that it is an IBM product. I guess I am old, but I thought that IBM had a reputation for doing big iron and solid DBs? Or has that ossified over the years?

*edit* Just took a closer look at the quick blurb on Google, and this caught my eye: Stable release: 12.10.xC7 / June 15, 2016. I can appreciate having something just work if it ain't broke, but I don't know, it seems to me that over 18 months between stable releases is a pretty long time in the compressed IT world we live in now. But then again, what do I know? ;-)

It is the BOINC master database, which uses MySQL, rather than the science databases, which use Informix. The issue is more how SETI@home uses the database.
For all of the tasks assigned to hosts, the "results out in the field", the task information is stored in a single table.
So whenever tasks are assigned or reported, or a host does an update, that table is accessed.
For the hardware running the db, the limit of that table turns out to be around 11,000,000 rows. Beyond that it can no longer complete a query fast enough, which causes the db server to stop and the project to be offline for several hours, if not days, while recovery is run.

So the solutions are:
1) Have more hefty master and replica db servers.
2) Recode how the results sent to hosts are stored.
3) Limit the number of tasks sent to hosts.

#1 is a stopgap until that hardware hits its limit.
#2 is probably the best solution, but requires dedicated time that isn't available to the project.
#3 is the easiest to implement and can be adjusted if the db server gets near the limit again.
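To put rough numbers on that ceiling, a back-of-the-envelope sketch; the host and device counts are illustrative guesses, not actual SETI@home server statistics:

[code]
# How the per-device limit bounds the "results out in the field" table.
# Host/device counts are illustrative assumptions, NOT real server stats.

ROW_CEILING = 11_000_000    # approx. row count where queries get too slow (per above)

active_hosts   = 60_000     # hypothetical number of active hosts
avg_devices    = 1.5        # hypothetical average device classes (CPU/GPU) per host
per_device_cap = 100        # the limit under discussion

in_field = int(active_hosts * avg_devices * per_device_cap)
print(f"{in_field:,} tasks in field ({in_field / ROW_CEILING:.0%} of ceiling)")
# -> 9,000,000 tasks in field (82% of ceiling). A 500-task cap would mean
#    45,000,000 rows, which is why option 3 is the knob the project turns.
[/code]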
SETI@home classic workunits: 93,865 · CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1913285
Profile Gary Charpentier · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30608
Credit: 53,134,872
RAC: 32
United States
Message 1913295 - Posted: 16 Jan 2018, 4:36:02 UTC - in response to Message 1913285.  

Is there a 'better' database to use than Informix? What does Google use, for example? SQL Server by MS? And is it practical to switch? Though I have heard of Informix for quite some time, I just Googled it and found that it is an IBM product. I guess I am old, but I thought that IBM had a reputation for doing big iron and solid DBs? Or has that ossified over the years?

*edit* Just took a closer look at the quick blurb on Google, and this caught my eye: Stable release: 12.10.xC7 / June 15, 2016. I can appreciate having something just work if it ain't broke, but I don't know, it seems to me that over 18 months between stable releases is a pretty long time in the compressed IT world we live in now. But then again, what do I know? ;-)

It is the BOINC master database, which uses MySQL, rather than the science databases, which use Informix. The issue is more how SETI@home uses the database.
For all of the tasks assigned to hosts, the "results out in the field", the task information is stored in a single table.
So whenever tasks are assigned or reported, or a host does an update, that table is accessed.
For the hardware running the db, the limit of that table turns out to be around 11,000,000 rows. Beyond that it can no longer complete a query fast enough, which causes the db server to stop and the project to be offline for several hours, if not days, while recovery is run.

So the solutions are:
1) Have more hefty master and replica db servers.
2) Recode how the results sent to hosts are stored.
3) Limit the number of tasks sent to hosts.

#1 is a stopgap until that hardware hits its limit.
#2 is probably the best solution, but requires dedicated time that isn't available to the project.
#3 is the easiest to implement and can be adjusted if the db server gets near the limit again.

#3 is the opposite of the thread title. One other thing that could be done is to shorten the amount of time until a report is due. That should shorten the queue of workunits out in the bushes and also let more workunits validate faster. However, as SETI allows phones to crunch, I'm not sure how much shorter it can be before many of those slow-crunching phones would start missing deadlines. Perhaps some sort of dynamic deadline based on the reported FLOPS?
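If anyone wants to play with that idea, here is one possible sketch of a FLOPS-scaled deadline; the function, parameters, and numbers are all hypothetical, not anything the BOINC scheduler actually implements:

[code]
# Sketch of a dynamic deadline scaled to host speed (hypothetical policy).

def dynamic_deadline_days(wu_est_flops, host_flops, duty_cycle=0.25,
                          min_days=3.0, max_days=53.0, slack=10.0):
    """Deadline = slack * the time this host needs to crunch the WU.

    duty_cycle allows for hosts that are not on 24/7; min/max clamp the
    result to sane bounds. All values here are assumptions.
    """
    crunch_secs = wu_est_flops / (host_flops * duty_cycle)
    return min(max_days, max(min_days, slack * crunch_secs / 86_400))

print(dynamic_deadline_days(40e12, 2e9))     # slow phone-class CPU -> ~9.3 days
print(dynamic_deadline_days(40e12, 10e12))   # fast GPU -> clamped to 3.0 days
[/code]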
ID: 1913295
Profile marsinph
Volunteer tester

Joined: 7 Apr 01
Posts: 172
Credit: 23,823,824
RAC: 0
Belgium
Message 1913756 - Posted: 18 Jan 2018, 17:16:47 UTC - in response to Message 1913295.  

Hello Gary, also a nice idea about FLOPS, but don't forget that some of us run our computers only a few hours a day but with very powerful hardware, while some run less powerful machines 24/7 (myself included).
So it's not easy!

ID: 1913756
Profile Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913759 - Posted: 18 Jan 2018, 17:30:05 UTC - in response to Message 1913295.  

However, as SETI allows phones to crunch, I'm not sure how much shorter it can be before many of those slow-crunching phones would start missing deadlines. Perhaps some sort of dynamic deadline based on the reported FLOPS?

It has already been discussed in another thread. Android phones aren't that slow in returning tasks; just the opposite, in fact, because they are always on. The worst offenders are normal PCs with low usage, that is, ones only turned on for a few minutes a day/week/month.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913759
Profile marsinph
Volunteer tester

Joined: 7 Apr 01
Posts: 172
Credit: 23,823,824
RAC: 0
Belgium
Message 1913768 - Posted: 18 Jan 2018, 18:02:12 UTC

Hello,
As I wrote, why not a variable number of WUs based on the computer's average turnaround time?
That would increase the cache without overloading the system.

Basic starting point: a computer running 24/7.
For a beginner (RAC and/or credit: 0): 50 WUs (there are so many WUs out in the field waiting to be crunched).

Then:
take the current 100 WUs,
divide by the average turnaround in days,
multiply by 10 days of work (or whatever the settings say),
and divide by 2 to leave WUs for everyone (otherwise the splitters would never produce enough work for all).
In this case that gives 500 WUs.

The same powerful computer working 12 hours a day will have a turnaround of 2 days, so it will receive 250 WUs.

With this easy calculation it does not depend on the power of the computer.
A PC of power 10 with an average of 1 day will receive the same as one twice as powerful with an average of 2 days (perhaps it works 6 hours/day).
A PC half as powerful but with an average of 0.5 days is treated the same way, perhaps because it works 24/7.

With such a calculation, everyone gets the same cache time relative to their crunching: the more WUs you return, the bigger your cache.

Of course it does not need a calculation to ten or twenty decimals. Two decimals is enough.

For those with super powerful machines, no problem. And if one of them crashes, there will be more results "out of time".
OK, BUT!!! I really think that would be less of a problem than the thousands of computers holding 100 CPU and 100 GPU tasks that have never run any of them!!
For beginners, the limit will increase very quickly as they crunch and their turnaround improves.

The advantage of this approach: it reduces the number of WUs out in the field or waiting for validation. (I have WUs waiting for validation since October!!!)
I am sure a lot of you do too.
It would also reduce the servers' "waiting for validation" backlog of about 3,000,000 WUs!!!
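Transcribing that calculation into a sketch; the formula and the two worked examples come from the post above, while the function itself and the rounding are illustrative, not project code:

[code]
# marsinph's proposed calculation, as a hypothetical sketch.

def proposed_limit(avg_turnaround_days, base=100, cache_days=10, share_factor=2):
    """limit = base / turnaround * cache_days / share_factor (50 for beginners)."""
    if avg_turnaround_days <= 0:
        return 50                            # beginner with no history yet
    return round(base / avg_turnaround_days * cache_days / share_factor)

print(proposed_limit(1.0))   # 24/7 host, 1-day turnaround    -> 500 WUs
print(proposed_limit(2.0))   # same host on 12 h/day, 2 days  -> 250 WUs
[/code]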
ID: 1913768
Al · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1913839 - Posted: 18 Jan 2018, 23:42:31 UTC
Last modified: 18 Jan 2018, 23:43:29 UTC

Actually, TBH, the bigger issue with some of my machines is that the 100 WU limit for CPUs is more of a problem for me. This Tuesday, I noticed that the project went offline at 7:30ish, and by 10 my higher-core crunchers were empty of CPU tasks, but my GPUs seem to last till earlyish afternoon. Of course, having dual CPUs doesn't help matters, but if some of the suggestions above could be implemented, like raising the limit based upon some combo of number of tasks returned valid and number processed per day, that would help. Of course, if our database can't handle it at this time, which has been discussed, that'll have to wait till it is upgraded. Maybe one of these days we will do like Einstein is doing next week, bringing 'er down and (re)building it up. All that is needed is personnel and $, sadly... Say, maybe they can loan us some of their expertise and willing hands once they've completed their upgrade! ;-)

ID: 1913839
Profile Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913847 - Posted: 19 Jan 2018, 0:21:46 UTC
Last modified: 19 Jan 2018, 0:22:51 UTC

I agree; I think the 100-task-per-CPU limit needs to be urgently updated, even more so than the GPU limit. The GPU caches will always be exhausted first no matter how many tasks are allowed, but if you still had CPU tasks to work on, at least your host wouldn't go cold. With the increasing number of multi-core and multi-CPU hosts, the problem is just getting worse. Anybody with dual Xeons, Threadrippers, or Ryzen 7s is finding they are out of work only 4-6 hours after the project outage starts. They then have cold machines for the next 4-6 hours.

Maybe we should try and persuade the Einstein administrators to publish a post-mortem on their server upgrades.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913847
Profile Bill G · Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1913938 - Posted: 19 Jan 2018, 5:27:42 UTC - in response to Message 1913847.  

My TR runs out of CPU work in 3 hours, while my three slow GPUs usually make it through the maintenance period.
It will be interesting this coming week to see what happens. When I load work from Einstein onto the CPU, that work takes 11 hours per WU.

SETI@home classic workunits: 4,019
SETI@home classic CPU time: 34,348 hours
ID: 1913938
Profile Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913972 - Posted: 19 Jan 2018, 9:37:46 UTC - in response to Message 1913938.  

Yes, those N-body and Continuous Gravitational Wave CPU apps seem to take forever. I've stayed away from them so far.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913972
Profile tullio
Volunteer tester

Joined: 9 Apr 04
Posts: 8797
Credit: 2,930,782
RAC: 1
Italy
Message 1914215 - Posted: 20 Jan 2018, 14:41:20 UTC

I have finished two climateprediction.net tasks on the Windows 10 PC. Each took more than 5 days, running 24/7, and so far I have got no credit; credits there are granted only once a week or so. So don't blame CreditNew.
Tullio
ID: 1914215
Profile Kissagogo27 · Special Project $75 donor
Joined: 6 Nov 99
Posts: 715
Credit: 8,032,827
RAC: 62
France
Message 1914248 - Posted: 20 Jan 2018, 17:35:02 UTC - in response to Message 1914215.  

I have finished two climateprediction.net tasks on the Windows 10 PC. Each took more than 5 days, running 24/7, and so far I have got no credit; credits there are granted only once a week or so. So don't blame CreditNew.
Tullio


Created: 13 Jan 2018, 22:17:25 UTC
Sent: 14 Jan 2018, 13:21:32 UTC
Report deadline: 27 Dec 2018, 18:41:32 UTC
Received: 19 Jan 2018, 18:03:15 UTC

Curious deadline ;)
ID: 1914248
