Panic Mode On (84) Server Problems?

Cruncher-American Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1384561 - Posted: 25 Jun 2013, 14:59:01 UTC - in response to Message 1384560.  

I think it's more of a feeder problem, since over 300,000 V7 are still ready to send.
I might be wrong on that.


The question then becomes: do we split a feeder or feed a splitter?
ID: 1384561 · Report as offensive
Profile Gary Charpentier Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30707
Credit: 53,134,872
RAC: 32
United States
Message 1384566 - Posted: 25 Jun 2013, 15:15:39 UTC - in response to Message 1384536.  

Right now, neither is getting work except for a (very) few APs overnight. I7-3820 is about to run out of work; Fermibox2 still has a bunch to do.

Being about to run out of work is normal with BOINC 7.x. Unless you change the "connect every" setting to a longer value, it drains the queue until it is nearly dry before it asks for more work.

E.g., at the default of 10 times a day, the queue is drained to 1/10 of a day's work. Set to every 2 days, the queue is drained to 2 days' work. Yeah, it kind of works backwards.

I don't suggest too big a value for the extra work setting either. We haven't had a long outage since they moved to the co-lo, and too big a number here simply runs you into the max tasks limit, so you don't get the extra work anyway.
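
For anyone who wants to see why it "works backwards", here is a rough Python sketch of the behaviour as described above. It is not the BOINC client's actual work-fetch code. The names `work_to_request`, `simulate`, `min_buffer_days`, `extra_days` and `server_cap_days` are made up for illustration, and expressing the server's max tasks limit in days of work is a simplifying assumption.

```python
def work_to_request(queued_days, min_buffer_days, extra_days, server_cap_days):
    """Return how many days of work to ask for (0.0 means no request).

    Assumed model: the client stays quiet until the queue drops below the
    "connect every" value, then asks for enough to reach that value plus
    the extra work setting, but never more than the server-side cap.
    """
    if queued_days >= min_buffer_days:
        return 0.0  # still above the low-water mark, so no request
    target = min(min_buffer_days + extra_days, server_cap_days)
    return max(target - queued_days, 0.0)


def simulate(min_buffer_days, extra_days, server_cap_days, days=3.0, step=0.05):
    """Drain the queue in small steps; return (lowest level seen, requests made)."""
    queued = min(min_buffer_days + extra_days, server_cap_days)
    lowest = queued
    requests = 0
    for _ in range(int(round(days / step))):
        queued = max(queued - step, 0.0)  # crunching uses up work
        lowest = min(lowest, queued)
        ask = work_to_request(queued, min_buffer_days, extra_days, server_cap_days)
        if ask > 0:
            queued += ask
            requests += 1
    return lowest, requests


if __name__ == "__main__":
    # Short "connect every" value: the queue gets close to empty before a refill.
    print(simulate(min_buffer_days=0.1, extra_days=0.5, server_cap_days=10.0))
    # Long "connect every" value: the queue is topped up while still nearly full.
    print(simulate(min_buffer_days=2.0, extra_days=0.5, server_cap_days=10.0))
```

In this toy model the low-water mark is the "connect every" value itself, so a small value lets the queue run nearly dry, and an extra work value that pushes the target past the cap buys nothing.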

ID: 1384566 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51469
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1384577 - Posted: 25 Jun 2013, 15:43:48 UTC

And I see the 'mystery bandwidth' has continued.
The kitties have been losing cache all night. Not a usual thingy since the move to the colo.
Hopefully they'll sort it all during today's outage.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1384577 · Report as offensive
Profile Fred E.
Volunteer tester

Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1384621 - Posted: 25 Jun 2013, 20:03:33 UTC

Guess something got fixed. My first request after maintenance brought me back to the limits. Hope it holds up.

And I see the 'mystery bandwidth' has continued.

Don't know what to think of that. Maybe they throttled back the Lab-to-CoLo transfers?
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.
ID: 1384621 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22241
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1384632 - Posted: 25 Jun 2013, 21:04:22 UTC

Oh dear, I see 20jn12ac is still hanging around, stuck with loads of errors :-(
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1384632 · Report as offensive
Cruncher-American Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1384650 - Posted: 25 Jun 2013, 22:05:08 UTC - in response to Message 1384566.  

@Gary - THANKS! That explains why I7-3820 (my new cruncher) wasn't even asking for WUs except when nearly empty. I changed the settings per your suggestion, and the machine promptly loaded up to the limit.
ID: 1384650 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 35012
Credit: 261,360,520
RAC: 489
Australia
Message 1384692 - Posted: 26 Jun 2013, 2:17:11 UTC - in response to Message 1384632.  

Oh dear, I see 20jn12ac is still hanging around, stuck with loads of errors :-(

It certainly isn't giving in, is it?

Cheers.
ID: 1384692 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 1384729 - Posted: 26 Jun 2013, 5:53:04 UTC - in response to Message 1384692.  
Last modified: 26 Jun 2013, 5:53:24 UTC

Splitters still having issues. Ready-to-send buffer continues to shrink with the splitters unable to produce enough work. Could run out of work yet again, all within 24 hours of last running out.
Grant
Darwin NT
ID: 1384729 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51469
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1384740 - Posted: 26 Jun 2013, 7:00:33 UTC - in response to Message 1384729.  

Splitters still having issues. Ready-to-send buffer continues to shrink with the splitters unable to produce enough work. Could run out of work yet again, all within 24 hours of last running out.

Well, that stuck dataset 20jn12ac is not helping anything. Tying up one MB splitter that is not producing WUs.

Eric said that Matt had restarted it. It apparently is still in terminal fail mode. I sent Eric another message, and we'll have to wait until tomorrow to see if they kick it again or just boot the sucker.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1384740 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1384744 - Posted: 26 Jun 2013, 7:15:21 UTC
Last modified: 26 Jun 2013, 7:22:46 UTC

Was just looking through my single-core machine's recently-reported v7 WUs and saw a cool one. 1265536732

The beginnings of an inconclusive train. I guess build 1846 and stock CPU app didn't quite agree with each other..? It's obvious cuda32 just completely got it wrong. It'll be interesting to see what happens when cuda50 gives it a whirl.

edit: and then I looked into it more.. the cuda32 attempt is a runaway machine. hostid 6721035. PMed to inform them and pointed to NC to ask for solutions for fixing it.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1384744 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1384759 - Posted: 26 Jun 2013, 8:34:53 UTC - in response to Message 1384744.  

Was just looking through my single-core machine's recently-reported v7 WUs and saw a cool one. 1265536732

The beginnings of an inconclusive train. I guess build 1846 and stock CPU app didn't quite agree with each other..? It's obvious cuda32 just completely got it wrong. It'll be interesting to see what happens when cuda50 gives it a whirl.

edit: and then I looked into it more.. the cuda32 attempt is a runaway machine. hostid 6721035. PMed to inform them and pointed to NC to ask for solutions for fixing it.

Wish we could figure out why some hosts don't print stderr... bloody annoying, that.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1384759 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 1384771 - Posted: 26 Jun 2013, 8:58:17 UTC - in response to Message 1384759.  


I only just noticed it, but it looks like we hit another record after today's outage.
691 Mb/s download traffic.
Grant
Darwin NT
ID: 1384771 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1384902 - Posted: 26 Jun 2013, 19:03:26 UTC - in response to Message 1384744.  

Was just looking through my single-core machine's recently-reported v7 WUs and saw a cool one. 1265536732

The beginnings of an inconclusive train. I guess build 1846 and stock CPU app didn't quite agree with each other..? It's obvious cuda32 just completely got it wrong. It'll be interesting to see what happens when cuda50 gives it a whirl.

edit: and then I looked into it more.. the cuda32 attempt is a runaway machine. hostid 6721035. PMed to inform them and pointed to NC to ask for solutions for fixing it.

And the cuda50 task came back and agreed with both CPU apps.

Looks like the runaway host was doing the thing the server is set to do.. send a few (thousand?) tasks to each type of app and see which works best. 32 obviously didn't work, 42 isn't working either, and 50 isn't showing any promise, either. Bad drivers, or a bad card is what I'm thinking. Or it could just be some other environment misconfiguration. Hard telling.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1384902 · Report as offensive
Profile Gary Charpentier Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30707
Credit: 53,134,872
RAC: 32
United States
Message 1384954 - Posted: 26 Jun 2013, 22:33:58 UTC - in response to Message 1384650.  

@Gary - THANKS! That explains why I7-3820 (my new cruncher) wasn't even asking for WUs except when nearly empty. I changed the settings per your suggestion, and the machine promptly loaded up to the limit.

Welcome.

I think a lot of people are getting caught by this and don't realize it.

ID: 1384954 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 1385024 - Posted: 27 Jun 2013, 5:02:52 UTC - in response to Message 1384954.  


Network traffic graph is interesting at the moment - it's got to be the flattest (while the system is running) that I can recall.
Grant
Darwin NT
ID: 1385024 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51469
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1385130 - Posted: 27 Jun 2013, 16:09:52 UTC - in response to Message 1385129.  

Isn't it time now to finally kick that 20jn12ac file? It's been sitting there, stuck for over a week now, holding up one splitter from doing useful splitting.

I poked Eric about it again....
Dunno if he's in the lab right now or not.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1385130 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13755
Credit: 208,696,464
RAC: 304
Australia
Message 1385308 - Posted: 28 Jun 2013, 6:19:19 UTC - in response to Message 1385024.  

Network traffic graph is interesting at the moment - it's got to be the flattest (while the system is running) that I can recall.

At least until the huge spike went off the 24hr graph. Still, while up & down, there aren't any huge spikes or dips in it.
Grant
Darwin NT
ID: 1385308 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22241
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1385708 - Posted: 29 Jun 2013, 6:38:49 UTC

And while they are loading some new tapes perhaps they can do something about poor old 20jn12ac which has been stuck at this:
20jn12ac 50.20 GB (13) (done)

For days.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1385708 · Report as offensive
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1385869 - Posted: 29 Jun 2013, 19:50:09 UTC - in response to Message 1385708.  

And while they are loading some new tapes perhaps they can do something about poor old 20jn12ac which has been stuck at this:
20jn12ac 50.20 GB (13) (done)

For days.

They have done something about it. There has not been a splitter running on that file for several days. Still listed on the page, but not being processed.
Donald
Infernal Optimist / Submariner, retired
ID: 1385869 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1385929 - Posted: 29 Jun 2013, 23:12:49 UTC

Good job, anonymous wingmate. I appreciate you aborting your AP on the CPU and making too many errors to validate. workunit 1266348384
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1385929 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.