Panic Mode On (109) Server Problems?

Message boards : Number crunching : Panic Mode On (109) Server Problems?
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 190,593,029
RAC: 14,702
United States
Message 1913877 - Posted: 19 Jan 2018, 1:45:56 UTC - in response to Message 1913859.  
Last modified: 19 Jan 2018, 1:52:26 UTC

> MW is fine, but IIRC to do something productive there you need an AMD GPU; NV cards have trouble with the DP used by MW.
>
> Has that changed?
>
> GPUGrid, with its long WU crunch times, is not really a project to use as a backup, IMHO.

Milkyway uses double-precision calculations.
Most GeForce GPUs are limited to DP performance of 1/32 of SP.
Radeon GPUs are limited to DP performance of 1/16 of SP.

So if both GPUs were 6000 GFLOPS in single precision, the GeForce would be about 188 GFLOPS DP and the Radeon 375 GFLOPS.

If you move to the workstation GPUs, they can have DP performance of up to 1/2 of SP, which is likely why they have four-digit price tags.
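The arithmetic above can be sketched in a few lines of Python. The 6000 GFLOPS figure and the DP:SP ratios are just the illustrative numbers from the post, not the specs of any particular card:

```python
# Rough double-precision throughput from single-precision peak, using the
# DP:SP ratios mentioned above (consumer GeForce 1/32, consumer Radeon 1/16,
# workstation cards up to 1/2). Illustrative numbers only.
def dp_gflops(sp_gflops, dp_sp_ratio):
    """Estimate peak DP GFLOPS given SP GFLOPS and the DP:SP ratio."""
    return sp_gflops * dp_sp_ratio

sp = 6000  # both hypothetical cards at 6000 GFLOPS single precision
print(dp_gflops(sp, 1 / 32))  # GeForce: 187.5 (~188)
print(dp_gflops(sp, 1 / 16))  # Radeon: 375.0
print(dp_gflops(sp, 1 / 2))   # workstation-class: 3000.0
```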
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!
ID: 1913877
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 6157
Credit: 440,531,011
RAC: 1,006,910
United States
Message 1913883 - Posted: 19 Jan 2018, 1:59:59 UTC - in response to Message 1913859.  

> MW is fine, but IIRC to do something productive there you need an AMD GPU; NV cards have trouble with the DP used by MW.
>
> Has that changed?
>
> GPUGrid, with its long WU crunch times, is not really a project to use as a backup, IMHO.

No, MW has no issue with Nvidia as long as the card can do double precision: probably any card newer than Kepler, avoiding the lowest tier of any family. Nvidia doesn't have the same degree of double-precision performance as ATI/AMD, but they still work fine. I do a Gamma Ray Binary Pulsar task in 190 seconds and get awarded 227 credits for it. The credit is static for all task types.

The longest-running GPUGrid GPU task I've run so far took 8 hours and was awarded 387,150 credits. The shortest task took 3 hours. The longest CPU task took 1 hour, and the shortest 20 minutes.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1913883
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 10530
Credit: 143,212,704
RAC: 79,012
Australia
Message 1913914 - Posted: 19 Jan 2018, 4:03:02 UTC

In-progress back up to around 5 million, received-last-hour back over 120k.
WU-awaiting-deletion climbing; splitter output dropped down to a lower level again. Will clearing awaiting-deletion fire up the splitters again?
We'll just have to wait and see!
Grant
Darwin NT
ID: 1913914
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 6157
Credit: 440,531,011
RAC: 1,006,910
United States
Message 1913916 - Posted: 19 Jan 2018, 4:11:09 UTC - in response to Message 1913914.  

> In-progress back up to around 5 million, received-last-hour back over 120k.
> WU-awaiting-deletion climbing; splitter output dropped down to a lower level again. Will clearing awaiting-deletion fire up the splitters again?
> We'll just have to wait and see!

NEWS at 10!
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1913916
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 10530
Credit: 143,212,704
RAC: 79,012
Australia
Message 1913943 - Posted: 19 Jan 2018, 6:17:52 UTC

Around 4 min 55 sec on my GTX 1070s for the current BLC_02s, now that we have some BLC_02s that aren't VLARs. About 50 sec quicker to crunch, but they do cause some noticeable system/display lag.
Grant
Darwin NT
ID: 1913943
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 10530
Credit: 143,212,704
RAC: 79,012
Australia
Message 1913965 - Posted: 19 Jan 2018, 9:07:18 UTC

And there we go.
The awaiting-deletion backlog has cleared, and the splitters are cranking out the WUs.
Grant
Darwin NT
ID: 1913965
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 6157
Credit: 440,531,011
RAC: 1,006,910
United States
Message 1913967 - Posted: 19 Jan 2018, 9:12:11 UTC - in response to Message 1913965.  
Last modified: 19 Jan 2018, 9:12:26 UTC

I'd say that is pretty convincing evidence that the two are directly linked. If you overlay the graphs, they are coincident.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1913967
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 10530
Credit: 143,212,704
RAC: 79,012
Australia
Message 1913974 - Posted: 19 Jan 2018, 9:46:21 UTC - in response to Message 1913967.  
Last modified: 19 Jan 2018, 9:46:49 UTC

> I'd say that is pretty convincing evidence that the two are directly linked. If you overlay the graphs, they are coincident.

Correlation isn't causation, but yeah: when returned-per-hour hits its present highs and work-in-progress gets right up there, the deleters & splitters certainly aren't able to both run at 100% at the same time. The splitters crank out the work, the deleter backlog grows; it gets to a certain point, then the splitters slow down and stay there till the delete backlog clears. And it's continued to occur after the weekly outage.
It's choking on its own I/O.
Grant
Darwin NT
ID: 1913974
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 6157
Credit: 440,531,011
RAC: 1,006,910
United States
Message 1914035 - Posted: 19 Jan 2018, 17:50:14 UTC - in response to Message 1913974.  

I'd say the administrators need to shorten the cron-job interval on the deleters' purge task so that we could maintain a higher average RTS buffer quantity. Or, if the purge is threshold-based, lower the threshold.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1914035
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 10530
Credit: 143,212,704
RAC: 79,012
Australia
Message 1914094 - Posted: 19 Jan 2018, 21:37:29 UTC - in response to Message 1914035.  

> I'd say the administrators need to shorten the cron-job interval on the deleters' purge task so that we could maintain a higher average RTS buffer quantity. Or, if the purge is threshold-based, lower the threshold.

I think it's just a question of I/O congestion.
The deleters run all the time; however, with the current rate of work return, and the rate of WU splitting required to keep that rate of return going, there's so much I/O contention that the deleters can't keep up. Eventually the I/O contention gets to the point where the output of the splitters falls away, but the deleters still can't keep up with the load, so the backlog continues to grow. Eventually the deleters are able to catch up & clear the backlog, and their reduced level of I/O then allows the splitters to crank back up again; till the delete backlog & load reach that trigger point & the splitters slow down again.
Rinse and repeat.
The combination of returned-per-hour, in-progress, awaiting-deletion & required splitter output results in a huge amount of I/O, more than the servers can actually sustain. So you end up with these moving trigger points where one function slows down and the other speeds up, then it slows down & the first one speeds up again. And back & forth they go.
That's my speculation based on minimal facts.
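Grant's speculated feedback loop can be sketched as a toy simulation: splitters and deleters share a fixed I/O budget, and once the deletion backlog passes a threshold the splitters throttle until the backlog drains. Every number here is invented purely for illustration; nothing below reflects the actual SETI@home server configuration:

```python
# Toy model (all numbers made up) of the I/O-contention feedback loop
# described above: splitters and deleters share a fixed I/O budget, and
# the splitters throttle while the deletion backlog is above a threshold.
IO_BUDGET = 100                  # arbitrary I/O units available per tick
SPLIT_FULL, SPLIT_SLOW = 70, 30  # splitter I/O use when full speed / throttled
DELETE_RATE_PER_IO = 1.5         # WUs deleted per I/O unit (made up)
ARRIVALS = 90                    # WUs entering awaiting-deletion per tick (made up)
THRESHOLD = 500                  # backlog level that throttles the splitters

backlog, throttled, history = 0.0, False, []
for tick in range(60):
    split_io = SPLIT_SLOW if throttled else SPLIT_FULL
    delete_io = IO_BUDGET - split_io  # deleters get whatever I/O is left over
    backlog = max(0.0, backlog + ARRIVALS - delete_io * DELETE_RATE_PER_IO)
    # Hysteresis: start throttling above THRESHOLD, stay throttled until clear.
    throttled = backlog > THRESHOLD if not throttled else backlog > 0
    history.append(round(backlog))

print(history)  # backlog climbs, the throttle kicks in, the backlog drains,
                # and the cycle starts again: back & forth they go.
```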
Grant
Darwin NT
ID: 1914094
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 6157
Credit: 440,531,011
RAC: 1,006,910
United States
Message 1914100 - Posted: 19 Jan 2018, 22:12:50 UTC - in response to Message 1914094.  
Last modified: 19 Jan 2018, 22:47:33 UTC


> I think it's just a question of I/O congestion.
> The deleters run all the time

OK, I'm sure I read in some other post in recent days that the deleters and purgers don't run continuously. Now I have to find that post.

[Edit] Found it. By Rob Smith Message 1913582
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1914100
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 12347
Credit: 127,038,533
RAC: 35,822
United Kingdom
Message 1914103 - Posted: 19 Jan 2018, 22:20:52 UTC - in response to Message 1914100.  

> > I think it's just a question of I/O congestion.
> > The deleters run all the time
> OK, I'm sure I read in some other post in recent days that the deleters and purgers don't run continuously. Now I have to find that post.
That's probably a difference between Main and Beta. Beta certainly doesn't purge the database continuously - Eric likes to keep older tasks visible for comparison and retrospective bug-hunting. Main, on the other hand, needs to clear the decks within 24 hours or we're swamped.
ID: 1914103
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 10530
Credit: 143,212,704
RAC: 79,012
Australia
Message 1914105 - Posted: 19 Jan 2018, 22:30:37 UTC - in response to Message 1914100.  

> OK, I'm sure I read in some other post in recent days that the deleters and purgers don't run continuously. Now I have to find that post.

Got me curious too.
When things have been working well, the number of WUs awaiting validation, assimilation and deletion is generally around 0, occasionally 1-3 (emphasis on when everything is working OK). So even if they don't run all the time, they run whenever there is something to do, which is pretty much all the time (especially with 145k results being returned per hour).

Looking at AP, where the return rate is less than 1 per minute at the moment, the WUs awaiting validation, assimilation & deletion sit around 1, with periods of 0 & a few periods of 2 or 3. It could be that they run all the time, and those values of 1-3 are just what's there at the moment the data is read, before the WU is processed. Or it could be as you say: they don't run all the time, only when there is work to be done.
Either way, it means the MB WU validators/deleters/assimilators are running (effectively) all the time with 40/s there to be processed, as the values there were usually at (or very close to) 0.
Grant
Darwin NT
ID: 1914105
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9506
Credit: 59,528,127
RAC: 11,426
United Kingdom
Message 1914179 - Posted: 20 Jan 2018, 7:17:21 UTC

Panic Mode On (110) Server Problems? Now open for business
ID: 1914179



 
©2018 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.