Panic Mode On (110) Server Problems?

Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1925273 - Posted: 18 Mar 2018, 23:49:48 UTC - in response to Message 1925250.  

If your CPU caches are full before your GPU caches, you're probably asking for too much work.


. . Hi Richard,

. . I can't say that I follow that line of thought ...

Stephen

? ?
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1925283 - Posted: 19 Mar 2018, 0:34:57 UTC - in response to Message 1925247.  

No, this is the old scheduler issue from a couple of Christmases past that still affects some hosts, most definitely mine. I am very familiar with the symptoms and the fixes needed. It is a separate issue from the schedulers issuing only Arecibo non-VLAR work from the RTS buffer to Nvidia hosts.

The scheduler is just hung up at the moment, and we will have to wait for the staff to fix things in the morning. They always do, but they never post what their fix was, which is what I am most interested in.
SETI@home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1925306 - Posted: 19 Mar 2018, 8:02:55 UTC - in response to Message 1925283.  

Well, I just opened up my three micro-managed crunchers for their breakfast, and all three got exactly what they requested at the first attempt, right up to their 'max tasks in progress' limit, and I don't imagine the staff intervened at all on a Sunday evening.
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1925309 - Posted: 19 Mar 2018, 8:56:33 UTC

It's the "your worrying too much" flag being set to give you something to worry about.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1925311 - Posted: 19 Mar 2018, 9:48:43 UTC

It's odd, but I don't think I have ever seen the problem being discussed.

Even when I had a couple of problem crunchers and was checking several times a day.

Now that I have "retired" all the old machines, I probably check a couple of times a day, usually before I go to bed and first thing in the morning, and I always see full caches. Currently, as one machine is new and the older one is now a full-time cruncher, I see rising RAC as well!

Both machines have reasonable GPUs, and the new one has a better CPU as well.
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1925345 - Posted: 19 Mar 2018, 16:13:55 UTC
Last modified: 19 Mar 2018, 17:07:40 UTC

The problem has been discussed ad nauseam here in Number Crunching; I don't know how you could say you had never seen it discussed. The problem goes back to Xmas 2016, when the staff attempted to fix the schedulers for an issue with ATI cards not getting work. It didn't work, and they ended up releasing a different application.

But the result of their actions was that some hosts with Nvidia cards stopped getting work, receiving only 'no work is available' messages, along with 'no AP tasks are available', even when there is plenty of work in the RTS buffer. The trigger is the project preference to accept additional work for other applications if no work is available for the selected applications. If you have that set along with AP work, you get the 'no work is available' message at each work request. Until you toggle off that setting and/or the AP setting and do a project Update to apply the changed preferences, you get squat and your caches plummet to zero.
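
For illustration only, here is a toy model of that symptom. This is not the real scheduler code; the function, flag names, and buffer counts are invented to show the reported behaviour.

# Toy model of the reported bug -- NOT the actual SETI@home scheduler.
# An affected host asks for AP with "accept work from other applications"
# enabled, and gets "no work is available" even though the ready-to-send
# (RTS) buffer holds plenty of MB work the fallback should have matched.
def buggy_scheduler_reply(selected_apps, accept_other_apps, rts_buffer):
    matches = [app for app in selected_apps if rts_buffer.get(app, 0) > 0]
    if matches:
        return f"sending {matches[0]} tasks"
    if accept_other_apps:
        # Expected: fall back to any application with RTS work.
        # Observed on affected hosts: the fallback never fires.
        return "no work is available"
    return "no work is available"

print(buggy_scheduler_reply(["astropulse"], True,
                            {"astropulse": 0, "multibeam": 600000}))
# -> "no work is available", despite 600,000 MB tasks ready to send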

You might have to do the Triple Update trick to wake up the schedulers too. Lots of others are affected and have commented on the issue many times, Grant being one of them.
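
A minimal sketch of the Triple Update trick, assuming boinccmd talking to a local BOINC client; the 65-second pause is an assumed back-off, not a documented value.

# Sketch of the "Triple Update" trick via boinccmd (local BOINC client).
import subprocess
import time

PROJECT = "http://setiathome.berkeley.edu/"

for _ in range(3):
    # Each update forces a scheduler request to the project.
    subprocess.run(["boinccmd", "--project", PROJECT, "update"], check=True)
    time.sleep(65)  # assumed pause to clear the client's request back-off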
SETI@home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1925348 - Posted: 19 Mar 2018, 16:28:54 UTC

Here you go. I just set this host to AP: yes, SETI@home v8: no, and 'If no work for selected applications is available, accept work from other applications?' = yes.
Up until now it's been receiving work; let's see how it works now: https://setiathome.berkeley.edu/results.php?hostid=7769537
It's been a while since I tried it...so, anything can happen.
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1925349 - Posted: 19 Mar 2018, 16:38:19 UTC - in response to Message 1925348.  
Last modified: 19 Mar 2018, 16:39:45 UTC

Task        Work unit   Sent
6496227687  2905174336  19 Mar 2018, 16:31:43 UTC
6496227760  2905174285  19 Mar 2018, 16:31:43 UTC
Both MB, NVidia GPU
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1925351 - Posted: 19 Mar 2018, 16:39:17 UTC

I don't know how you could say you had never seen it discussed


Ah, I think there's a little misunderstanding here; let me rephrase.

I have never experienced the problem currently being discussed.

Is that clearer?
Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1925352 - Posted: 19 Mar 2018, 16:41:56 UTC - in response to Message 1925345.  

I have actually seen very few problems from the ATI fallout lately. I have not changed my Yes, Yes, No other-work settings in ages.
The extent of my kicks is simply to suspend one task, wait for the 5-minute timer to expire, resume the task, and do a manual update.
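
(A rough scripted version of that kick, assuming boinccmd and a local client; the task name below is a placeholder, not a real one:)

# Suspend one task, wait out the 5-minute timer, resume it, then ask for work.
import subprocess
import time

PROJECT = "http://setiathome.berkeley.edu/"
TASK_NAME = "example_task_name"  # hypothetical; take a real name from `boinccmd --get_tasks`

subprocess.run(["boinccmd", "--task", PROJECT, TASK_NAME, "suspend"], check=True)
time.sleep(300)  # let the 5-minute request timer expire
subprocess.run(["boinccmd", "--task", PROJECT, TASK_NAME, "resume"], check=True)
subprocess.run(["boinccmd", "--project", PROJECT, "update"], check=True)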

But that doesn't help when the problem, as yesterday, is the server cache being stuffed full of Arecibo VLARs. I didn't run out of tasks, but I came very close to it. It's just the luck of the draw to get the non-VLARs when they become available.
Iona
Joined: 12 Jul 07
Posts: 790
Credit: 22,438,118
RAC: 0
United Kingdom
Message 1925354 - Posted: 19 Mar 2018, 16:46:30 UTC - in response to Message 1925345.  
Last modified: 19 Mar 2018, 16:47:57 UTC

I don't think that is what Bernie meant, Keith. What I am pretty sure he is saying is that he has not had (or seen) the problem being discussed. A slightly different wording perhaps, but the same meaning. Lucky Bernie!

Edit. No way this took several minutes to write, either!
Don't take life too seriously, as you'll never come out of it alive!
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1925356 - Posted: 19 Mar 2018, 17:05:38 UTC - in response to Message 1925354.  

Sorry about the misunderstanding. I am sensitive to the problem because I am greatly affected. Not everyone with Nvidia is affected, just some. We have discussed it at length and compared settings and hardware to try to understand why some hosts are affected and others are not. No insight has been gained yet. As I stated in my message, I am very familiar with the symptoms and have a bag of tricks to get the schedulers to acknowledge a task deficit and refill the caches. The one Brent mentioned is one of them, a variation on the 'ghost task recovery' protocol.
SETI@home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1925357 - Posted: 19 Mar 2018, 17:14:42 UTC - in response to Message 1925352.  

I have actually seen very few problems from the ATI fallout lately. I have not changed my Yes, Yes, No other-work settings in ages.
The extent of my kicks is simply to suspend one task, wait for the 5-minute timer to expire, resume the task, and do a manual update.

But that doesn't help when the problem, as yesterday, is the server cache being stuffed full of Arecibo VLARs. I didn't run out of tasks, but I came very close to it. It's just the luck of the draw to get the non-VLARs when they become available.

Yes, it's the luck of the draw when the RTS buffer has nothing but Arecibo VLARs to send. My luck is crap, as usual. I came very close to running out multiple times, with only a trickle of one task delivered for every 15 returned. Eventually the mass of Arecibo VLARs cleared out, and the usual mix of Arecibo non-VLARs and BLC tasks started to flow again.

It would be nice if they removed the old 'no Arecibo VLARs to Nvidia cards' restriction, or gave us a new setting in Preferences to allow that to occur.

So there must have been some sort of event in the splitters, with nothing coming out of the BLC tapes and only Arecibo filling the RTS buffers, to have created the pocket of Arecibo VLARs in the first place.
SETI@home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1925364 - Posted: 19 Mar 2018, 17:38:32 UTC - in response to Message 1925357.  

So there must have been some sort of event in the splitters, with nothing coming out of the BLC tapes and only Arecibo filling the RTS buffers, to have created the pocket of Arecibo VLARs in the first place.
I think we were running pretty close to the high-water mark for RTS yesterday (*). I think the rule for the high-water mark can be summed up as something like "finish whatever you're doing, then go take a break."

So the next question is: when they need to come back on-stream, who gets first dibs? Consciously or unconsciously, somebody comes first. It might be the index on a database table, in which case Arecibo was here first and is likely to have the low numbers. Or maybe A(recibo) comes before B(reakthrough) in the alphabet. I think any 'event' affecting the Breakthrough splitters is likely to be nothing more suspicious than that they weren't needed, and hence didn't get the 'back to work, lads' message until the last Arecibo tape was finished. Remember, Eric made extra Arecibo splitters available when he couldn't get data from Green Bank quickly enough, so they could probably keep up with demand with no RTS shortfall to fill on top.

* Confirmed by Haveland.
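
(That high-water rule reads like simple hysteresis; a toy sketch with invented thresholds, not the actual splitter code:)

# Toy model of high/low-water gating of the splitters.
# Splitters stop once RTS hits the high mark and stay idle until it
# drains to the low mark; in between, nothing changes (hysteresis).
HIGH_WATER = 600_000  # invented threshold
LOW_WATER = 500_000   # invented threshold

def splitters_should_run(rts_count, currently_running):
    if rts_count >= HIGH_WATER:
        return False  # "finish whatever you're doing, take a break"
    if rts_count <= LOW_WATER:
        return True   # "back to work, lads"
    return currently_running  # no change in between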
Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1925371 - Posted: 19 Mar 2018, 18:12:19 UTC - in response to Message 1925364.  
Last modified: 19 Mar 2018, 18:12:54 UTC

I have seen it quite a few times: the BLC splitters won't start back up when the RTS is full until the script for timed-out tasks runs, and then they fire back up again.

Yesterday, at the end of the VLAR flood, I got lucky and received ~150 resends right before the BLC splitters came back to life. I have often seen one of my computers get resends right before the BLCs restart.
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1925413 - Posted: 19 Mar 2018, 20:31:48 UTC - in response to Message 1925351.  

I don't know how you could say you had never seen it discussed


Ah, I think there's a little misunderstanding here; let me rephrase.

I have never experienced the problem currently being discussed.

Is that clearer?


. . Yep, I think it is. I was confused before as well :)

Stephen

:)
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1925418 - Posted: 19 Mar 2018, 20:38:24 UTC - in response to Message 1925357.  
Last modified: 19 Mar 2018, 20:44:39 UTC

I have actually seen very few problems from the ATI fallout lately. I have not changed my Yes, Yes, No other-work settings in ages.
The extent of my kicks is simply to suspend one task, wait for the 5-minute timer to expire, resume the task, and do a manual update.

But that doesn't help when the problem, as yesterday, is the server cache being stuffed full of Arecibo VLARs. I didn't run out of tasks, but I came very close to it. It's just the luck of the draw to get the non-VLARs when they become available.


It would be nice if they removed the old 'no Arecibo VLARs to Nvidia cards' restriction, or gave us a new setting in Preferences to allow that to occur.

So there must have been some sort of event in the splitters, with nothing coming out of the BLC tapes and only Arecibo filling the RTS buffers, to have created the pocket of Arecibo VLARs in the first place.


. . It seems there is a block of Blc01 tapes that are somehow stalled in the splitters, but the more recently mounted Blc02 tapes are splitting nicely, and I am currently seeing only Blc02 WUs. Maybe someone needs to give the splitters working on the Blc01 tapes a bit of a kick too :)

. . As to removing the bar on sending Arecibo VLARs to Nvidia cards, that could be a bit of a disaster: there are still a lot of older cards out there on rigs running the older CUDA apps. An option to accept them on a rig with Nvidia cards would be useful for people running more modern hardware and apps like SoG and the 'special sauce' variants, which can cope quite well with those VLAR tasks even if they are a little slow.

Stephen

? ?
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1925436 - Posted: 19 Mar 2018, 22:56:59 UTC - in response to Message 1925418.  

If there's enough 'ready to send' (which there is), it really doesn't matter which tape is currently active. Ever since Breakthrough Listen came online, it's seemed that the newest tapes are active first. Maybe that's an active decision to do the new stuff quickly, maybe it's a side effect of some other selection process. No matter. They all get done in the end.
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1925437 - Posted: 19 Mar 2018, 23:56:55 UTC - in response to Message 1925436.  

If there's enough 'ready to send' (which there is), it really doesn't matter which tape is currently active. Ever since Breakthrough Listen came online, it's seemed that the newest tapes are active first. Maybe that's an active decision to do the new stuff quickly, maybe it's a side effect of some other selection process. No matter. They all get done in the end.


. . It'll be right on the night, you reckon?

Stephen

:)
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1639
Credit: 12,921,799
RAC: 89
New Zealand
Message 1925469 - Posted: 20 Mar 2018, 7:50:44 UTC - in response to Message 1923837.  

See that six new 100GB tapes were loaded this morning. I wonder if these are the new "standard" size? Can anyone remember if the number of channels per tape is still the same?

Yes, they are the same: the 52 GB tapes were also 128 channels, the same as the 104.8/104.83 GB tapes.

The size of the tapes has doubled, but the number of tasks is, I think, still the same because there are still only 128 channels.
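
(Back-of-envelope, using the sizes quoted above and assuming 128 channels per tape:)

# Data per channel for the quoted tape sizes, 128 channels assumed for both.
for size_gb in (52.0, 104.8):
    print(f"{size_gb} GB tape -> {size_gb / 128:.2f} GB per channel")
# 52.0 GB tape -> 0.41 GB per channel
# 104.8 GB tape -> 0.82 GB per channel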