Panic Mode On (109) Server Problems?

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913852 - Posted: 19 Jan 2018, 0:40:27 UTC - in response to Message 1913850.  
Last modified: 19 Jan 2018, 0:46:25 UTC

Well, I found it necessary to make Post 58550. Read through to Post 58572, and note his titles.

Well, even Project Scientists and Developers are human; witness our own Eric K. and the recent spate of typos. I do remember them (MW) having issues initially with the n-body MT application, but they evidently sorted it out, and I haven't followed any of the threads since because, as I stated, I don't do MW CPU work. I haven't seen many posts about n-body issues other than host configuration questions.

And the MT documentation must be mostly stable and well understood by now, as the MT CPU app deployed at GPUGrid just this month by a student had a relatively easy startup. It was nice to find it obeyed the app_config core-usage setting, so it didn't hog all the cores on my Ryzen 1800X. I am using 4 cores to process those CPU tasks, leaving the other cores for SETI CPU tasks.
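
For anyone wanting to do the same, a minimal app_config.xml sketch along these lines should work; the app name below is a placeholder (check the project's applications list for the real one), and the explicit --nthreads line is an assumption that may not be needed:

[code]
<!-- Goes in the GPUGrid project directory, e.g. projects/www.gpugrid.net/ -->
<!-- "appname_placeholder" is hypothetical; substitute the actual app name. -->
<app_config>
    <app_version>
        <app_name>appname_placeholder</app_name>
        <plan_class>mt</plan_class>
        <avg_ncpus>4</avg_ncpus>          <!-- cap the MT app at 4 cores -->
        <cmdline>--nthreads 4</cmdline>   <!-- assumption: some MT apps also want this -->
    </app_version>
</app_config>
[/code]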
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913852
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1913859 - Posted: 19 Jan 2018, 1:09:44 UTC

MW is fine, but IIRC to do something productive there you need an AMD GPU; NV cards have trouble with the DP calculations MW uses.

Has that changed?

GPUGrid, with its long WU crunch times, is not really a project to use as a backup, IMHO.
ID: 1913859
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1913874 - Posted: 19 Jan 2018, 1:39:21 UTC - in response to Message 1913844.  

MW is the most set-and-forget project I have run. I never have to micromanage it at all.
I only keep a backup available on one of my crunch-only machines, just to make sure it maintains a little heat in the bedroom on chilly nights when SaH runs out of work. My first choice is Asteroids, but they're often out of work, too, so I added MilkyWay as a backup to the backup. The last time it ran on Windows was about 3 years ago. It worked fine. But that machine is now Linux, and when MilkyWay kicked in one night a couple months ago, it turned out to be a colossal waste of time. I don't remember how many tasks it ran, but when I checked the results the next day, I found that all but one of them had been marked Invalid. I think they all ran to completion without throwing any errors, but it was all just wasted electricity (except for the little bit of extra heat). I never did try to figure out what might have happened, just turned off MilkyWay and added Einstein for the next time that both SaH and Asteroids ran out.
ID: 1913874
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1913877 - Posted: 19 Jan 2018, 1:45:56 UTC - in response to Message 1913859.  
Last modified: 19 Jan 2018, 1:52:26 UTC

MW is fine, but IIRC to do something productive there you need an AMD GPU; NV cards have trouble with the DP calculations MW uses.

Has that changed?

GPUGrid, with its long WU crunch times, is not really a project to use as a backup, IMHO.

MilkyWay uses double-precision (DP) calculations.
Most GeForce GPUs are limited to DP performance of 1/32 of their single-precision (SP) rate.
Radeon GPUs are limited to DP performance of 1/16 of SP.

So if both GPUs were rated at 6,000 GFLOPS single-precision, the GeForce would manage about 188 GFLOPS DP and the Radeon 375 GFLOPS DP.

If you move to the workstation GPUs, they can have DP performance of up to 1/2 of SP, which is likely why they have four-digit price tags.
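
A quick sanity check of those figures (plain arithmetic only):

[code]
# DP throughput implied by an SP rating and a DP:SP ratio.
def dp_gflops(sp_gflops, dp_ratio):
    return sp_gflops * dp_ratio

print(dp_gflops(6000, 1 / 32))  # GeForce at 1/32: 187.5 (~188) GFLOPS DP
print(dp_gflops(6000, 1 / 16))  # Radeon at 1/16: 375.0 GFLOPS DP
[/code]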
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1913877
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913883 - Posted: 19 Jan 2018, 1:59:59 UTC - in response to Message 1913859.  

MW is fine, but IIRC to do something productive there you need an AMD GPU; NV cards have trouble with the DP calculations MW uses.

Has that changed?

GPUGrid, with its long WU crunch times, is not really a project to use as a backup, IMHO.

No, MW has no issue with Nvidia as long as the card can do double precision; probably any card newer than Kepler, or at least avoiding the lowest-end model of any family. Nvidia doesn't have the same degree of double-precision performance as ATI/AMD, but their cards still work fine. I do a Gamma Ray Binary Pulsar task in 190 seconds and get awarded 227 credits for it. The credit is static for all task types.

The longest-running GPUGrid GPU task I've run so far took 8 hours and was awarded 387,150 credits. The shortest took 3 hours. The longest CPU task took 1 hour and the shortest 20 minutes.
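
Rough credits-per-hour implied by those numbers (my own arithmetic, purely illustrative):

[code]
# Credit rates implied by the task times and awards quoted above.
mw_per_hour = 227 / (190 / 3600)   # MW GPU task: ~4,300 credits/hour
gpugrid_per_hour = 387_150 / 8     # long GPUGrid GPU task: ~48,400 credits/hour
print(round(mw_per_hour), round(gpugrid_per_hour))
[/code]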
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913883
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1913914 - Posted: 19 Jan 2018, 4:03:02 UTC

In progress back up to around 5 million, received-last-hour back over 120k.
WUs-awaiting-deletion climbing, and splitter output has dropped to a lower level again. Will clearing awaiting-deletion fire up the splitters again?
We'll just have to wait and see!
Grant
Darwin NT
ID: 1913914
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913916 - Posted: 19 Jan 2018, 4:11:09 UTC - in response to Message 1913914.  

In progress back up to around 5 million, received-last-hour back over 120k.
WUs-awaiting-deletion climbing, and splitter output has dropped to a lower level again. Will clearing awaiting-deletion fire up the splitters again?
We'll just have to wait and see!

NEWS at 10!
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913916
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1913943 - Posted: 19 Jan 2018, 6:17:52 UTC

Around 4 min 55 sec on my GTX 1070s for the current BLC_02s, now that we have some BLC_02s that aren't VLARs. About 50 sec quicker to crunch, but they do cause some noticeable system/display lag.
Grant
Darwin NT
ID: 1913943
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1913965 - Posted: 19 Jan 2018, 9:07:18 UTC

And there we go.
Awaiting-deletion backlog cleared, splitters crank out the WUs.
Grant
Darwin NT
ID: 1913965
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913967 - Posted: 19 Jan 2018, 9:12:11 UTC - in response to Message 1913965.  
Last modified: 19 Jan 2018, 9:12:26 UTC

I'd say that is pretty convincing evidence that the two are directly linked. If you overlay the graphs, they are coincident.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913967
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1913974 - Posted: 19 Jan 2018, 9:46:21 UTC - in response to Message 1913967.  
Last modified: 19 Jan 2018, 9:46:49 UTC

I'd say that is pretty convincing evidence that the two are directly linked. If you overlay the graphs, they are coincident.

Correlation isn't causation, but yeah, when returned-per-hour hits its present highs and work-in-progress gets right up there, the deleters & splitters certainly aren't able to both run at 100% at the same time: the splitters crank out the work, and the deleter backlog grows. It gets to a certain point, the splitters slow down, and they stay there till the delete backlog clears. And it's continued to occur after the weekly outage.
It's choking on its own I/O.
Grant
Darwin NT
ID: 1913974
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1914035 - Posted: 19 Jan 2018, 17:50:14 UTC - in response to Message 1913974.  

I'd say the administrators need to shorten the cron-job interval on the deleters' purge task so that we could maintain a higher average RTS buffer. Or, if the purge is threshold-based, lower the threshold.
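
If the purge really is cron-driven (pure speculation; projects often run these as continuous daemons from config.xml instead), it would be a one-line crontab change. The schedule, path and program name below are all hypothetical placeholders:

[code]
# Hypothetical example only: run the purge every 15 minutes instead of hourly.
*/15 * * * * /path/to/project/bin/purge_task_placeholder
[/code]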
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1914035
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1914094 - Posted: 19 Jan 2018, 21:37:29 UTC - in response to Message 1914035.  

I'd say the administrators need to shorten the cron-job interval on the deleters' purge task so that we could maintain a higher average RTS buffer. Or, if the purge is threshold-based, lower the threshold.

I think it's just a question of I/O congestion.
The deleters run all the time. However, with the current rate of work return, and the rate of WU splitting required to keep that rate of return going, there's so much I/O contention that the deleters can't keep up. Eventually the contention gets to the point where the output of the splitters falls away, but the deleters still can't keep up with the load, so the backlog continues to grow. Finally the deleters are able to catch up & clear the backlog, and their reduced level of I/O allows the splitters to crank back up again; till the delete backlog & load reach that trigger point & the splitters slow down again.
Rinse and repeat.
The combination of returned-per-hour, in-progress, awaiting-deletion & required splitter output results in a huge amount of I/O, more than the servers can actually sustain. So you end up with these moving trigger points where one function slows down and the other speeds up, then it slows down & the first one speeds up again. And back & forth they go.
That's my speculation based on minimal facts.
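
A toy model of that feedback loop (every number below is invented purely for illustration) reproduces the same sawtooth:

[code]
# Splitters and deleters contend for a fixed I/O budget (all values assumed).
IO_BUDGET = 100    # I/O units available per tick
RETURN_RATE = 60   # results returned per tick, each needing deletion
THROTTLE_AT = 500  # backlog level that slows the splitters

backlog, throttled = 0, False
for tick in range(120):
    if backlog > THROTTLE_AT:
        throttled = True       # contention slows the splitters
    elif backlog == 0:
        throttled = False      # backlog cleared: splitters crank back up
    splitter_io = 30 if throttled else 70
    deleter_io = IO_BUDGET - splitter_io   # deleters get whatever is left
    backlog = max(0, backlog + RETURN_RATE - deleter_io)
    print(tick, "slow" if throttled else "fast", backlog)
[/code]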
Grant
Darwin NT
ID: 1914094
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1914100 - Posted: 19 Jan 2018, 22:12:50 UTC - in response to Message 1914094.  
Last modified: 19 Jan 2018, 22:47:33 UTC


I think it's just a question of I/O congestion.
The deleters run all the time

OK, I'm sure I read in some other post in recent days that the deleters and purgers don't run continuously. Now I have to find that post.

[Edit] Found it. By Rob Smith Message 1913582
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1914100
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1914103 - Posted: 19 Jan 2018, 22:20:52 UTC - in response to Message 1914100.  

I think it's just a question of I/O congestion.
The deleters run all the time
OK, I'm sure I read in some other post in recent days that the deleters and purgers don't run continuously. Now I have to find that post.
That's probably a difference between Main and Beta. Beta certainly doesn't purge the database continuously - Eric likes to keep older tasks visible for comparison and retrospective bug-hunting. Main, on the other hand, needs to clear the decks within 24 hours or we're swamped.
ID: 1914103
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 1914105 - Posted: 19 Jan 2018, 22:30:37 UTC - in response to Message 1914100.  

OK, I'm sure I read in some other post in recent days that the deleters and purgers don't run continuously. Now I have to find that post.

Got me curious too.
Generally (when things have been working well), the number of WUs awaiting validation, assimilation and deletion sits around 0, occasionally 1-3 (emphasis on when everything is working OK). So even if they don't run all the time, they run whenever there is something to do, which is pretty much all the time (especially with 145k results being returned per hour).

Looking at AP, where the return rate is less than 1 per minute at the moment, the WUs awaiting validation, assimilation & deletion hover around 1, with periods of 0 & a few periods of 2 or 3. It could be that they run all the time, and those values of 1-3 are just whatever happens to be queued at the instant the data is read, before the WU is processed. Or it could be, as you say, that they don't run all the time, only when there is work to be done.
Either way, it means the MB WU validators/deleters/assimilators are running (effectively) all the time, with 40/s there to be processed, as the values there were usually at (or very close to) 0.
Grant
Darwin NT
ID: 1914105
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9958
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1914179 - Posted: 20 Jan 2018, 7:17:21 UTC

Panic Mode On (110) Server Problems? Now open for business
ID: 1914179