Panic Mode On (114) Server Problems?

Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11416
Credit: 29,581,041
RAC: 66
United States
Message 1969424 - Posted: 9 Dec 2018, 1:47:02 UTC - in response to Message 1969415.  

Stephen, you should remember that Sten is from Sweden and could very well be suffering from lutefisk poisoning. That has been known to cause all sorts of irrational behavior.
Ten Stinkiest Foods In the World
Iru. ...
Doenjang. ...
Lutefisk. ...
Stinky Tofu (chòu dòufu) ...
Vieux Boulogne. ...
Surströmming. ...
Durian. This southeast Asian fruit, often used in smoothies or as the stuffing in sweet buns, is revered by some for its ripe, nutty, pungent flavor.

ID: 1969424
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1969432 - Posted: 9 Dec 2018, 2:34:20 UTC

I'm seeing some good signs... assimilation and validation are happening. It also looks like some splitting is happening, but this is a big hole to dig out of.
ID: 1969432
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1969433 - Posted: 9 Dec 2018, 2:35:06 UTC

Well, the splitters are running again, and at a good pace- we'll see how long they last this time.
Grant
Darwin NT
ID: 1969433
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1969434 - Posted: 9 Dec 2018, 2:35:41 UTC - in response to Message 1969424.  

Stephen, you should remember that Sten is from Sweden and could very well be suffering from lutefisk poisoning. That has been known to cause all sorts of irrational behavior.

He likes to go fishing- chuck out a lure & see what bites.
Grant
Darwin NT
ID: 1969434
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1969435 - Posted: 9 Dec 2018, 2:36:40 UTC - in response to Message 1969432.  

I'm seeing some good signs... assimilation and validation are happening. It also looks like some splitting is happening, but this is a big hole to dig out of.

+1. Crossing my fingers. If the servers stay functional, it is still going to be a very bumpy recovery.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1969435
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1969437 - Posted: 9 Dec 2018, 3:02:57 UTC

Well, my system that ran out of work is still out of work.
The one that still had some work has been able to pick up 155 WUs since the project came back to life.
Grant
Darwin NT
ID: 1969437
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1969440 - Posted: 9 Dec 2018, 3:58:37 UTC

Full power to the kibble core.
Onward, ye splitters, onward.

Meeeeeeeeeeow!!
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1969440
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1969442 - Posted: 9 Dec 2018, 4:17:30 UTC - in response to Message 1969437.  

The new TR seems to have picked up about 150 tasks. Nothing else on any of the other hosts.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1969442
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1969444 - Posted: 9 Dec 2018, 4:27:18 UTC - in response to Message 1969437.  
Last modified: 9 Dec 2018, 4:31:07 UTC

Well, my system that ran out of work is still out of work.
The one that still had some work has been able to pick up 155 WUs since the project came back to life.

The system that has work continues to pick up work every few requests.
The one without any work: still not a thing.


With any luck they've sorted out whatever went wrong after the last outage: the Splitters are splitting at a good sustained rate, work is going out, the Validators are clearing their backlog, the Deleters are cleaning up after them, and the Replica is catching up at a much faster rate than it had been.
*fingers crossed*
Grant
Darwin NT
ID: 1969444
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1969446 - Posted: 9 Dec 2018, 4:43:43 UTC
Last modified: 9 Dec 2018, 4:50:03 UTC

Finally picked up some work on the empty system. Hopefully it'll be enough to keep more work coming.



Edit: it looks like, for most users, systems that already have work are getting more before those with none get any.
In progress has increased by over 400k, but Received-last-hour has barely moved from its initial post-outage value of 53k up to its present 60k; it was over 140k before things fell over.
Grant
Darwin NT
ID: 1969446
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1969456 - Posted: 9 Dec 2018, 5:47:07 UTC - in response to Message 1969444.  

Well, my system that ran out of work is still out of work.
The one that still had some work has been able to pick up 155 WUs since the project came back to life.

The system that has work continues to pick up work every few requests.
The one without any work: still not a thing.


With any luck they've sorted out whatever went wrong after the last outage: the Splitters are splitting at a good sustained rate, work is going out, the Validators are clearing their backlog, the Deleters are cleaning up after them, and the Replica is catching up at a much faster rate than it had been.
*fingers crossed*


. . And toes ...

. . Finally got some work on the one machine still crunching. Time to fire the others back up.

Stephen

:)
ID: 1969456
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1969467 - Posted: 9 Dec 2018, 9:24:50 UTC

Still hit or miss getting work; "Project has no tasks available" is the usual response.
But then I had a look at work In progress, and we've managed to get back to the level we're usually at after the weekly outage...
It was a big hole to climb out of.
Grant
Darwin NT
ID: 1969467
Profile Stargate (SA)
Volunteer tester
Joined: 4 Mar 10
Posts: 1854
Credit: 2,258,721
RAC: 0
Australia
Message 1969473 - Posted: 9 Dec 2018, 12:22:39 UTC - in response to Message 1969415.  

Thank you, Stephen.

Stargate
ID: 1969473
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1969476 - Posted: 9 Dec 2018, 12:51:34 UTC - in response to Message 1969467.  

Still hit or miss getting work; "Project has no tasks available" is the usual response.
But then I had a look at work In progress, and we've managed to get back to the level we're usually at after the weekly outage...
It was a big hole to climb out of.


Well, I have a full belly at the moment, and I have an estimated 19 hrs. of M/W GPU work (72 tasks) to get rid of first. I've suspended Seti until that's finished, so you guys and gals can have fun picking up what's out there. When I crank Seti back up, things should be almost back to normal.


I don't buy computers, I build them!!
ID: 1969476
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1969628 - Posted: 9 Dec 2018, 20:35:31 UTC

Panic is officially over. The Replica is caught up. Assimilation and validation are happening in a timely manner and are all caught up. Splitting is going well, and the RTS queue is building.

I've been thinking (like an armchair quarterback) that it would be nice if they upped the RTS queue to be closer to a 6-hour reserve, like it used to be. Now it is only good for about 4 hours. I'm guessing that a bigger RTS queue could cause other problems, either from the extended time under high load needed to refill it or from db space issues.
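
As a back-of-the-envelope check, the arithmetic is simple (the dispatch rate below is the ~140k results/hour "Received last hour" figure mentioned earlier in the thread; the queue size is an assumed round number, not a server stat):

```python
# Rough estimate of how long the ready-to-send (RTS) queue lasts.
# Both numbers are illustrative assumptions, not server stats.
dispatch_rate = 140_000      # results sent per hour (pre-outage ballpark)
rts_queue = 560_000          # results ready to send (assumed)

print(f"reserve: {rts_queue / dispatch_rate:.1f} hours")     # -> 4.0
print(f"a 6 h reserve needs {6 * dispatch_rate:,} results")  # -> 840,000
```

At those rates a 6-hour reserve means roughly 50% more rows sitting in the database, which is where the db space worry comes in.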

I've also been thinking about the preference for handing out work to machines that already have work, as someone noted. Could it be that the machines that still had work after such a long outage were asking for a smaller amount of work at a time, and thus got served earlier? Maybe if the empty machines asked for a smaller amount of work to start off, they would get some sooner?
ID: 1969628
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1969648 - Posted: 9 Dec 2018, 21:28:59 UTC - in response to Message 1969628.  

I've also been thinking about the preference for handing out work to machines that already have work, as someone noted. Could it be that the machines that still had work after such a long outage were asking for a smaller amount of work at a time, and thus got served earlier? Maybe if the empty machines asked for a smaller amount of work to start off, they would get some sooner?


. . I think that is Richard's theory too...

Stephen

? ?
ID: 1969648
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1969651 - Posted: 9 Dec 2018, 21:31:48 UTC - in response to Message 1969628.  

Yes, the servers have been doing a kick arse job at recovery. It could be because we cleaned out all the tasks so the server could start fresh.

Actually, the slow machines that still have some tasks left are more likely to get work sooner than the fast computers. Slow computers will ask for more work after every completed WU, then again 5 minutes later, then after a 1 h backoff, then longer and longer.

Fast computers get stuck in that long backoff period since they have nothing left to report that would reset the timer.
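
Roughly, the cadence looks like this (a minimal sketch with a simple doubling backoff; the constants are made up for illustration, not the BOINC client's real values):

```python
# Toy model of the client-side work-fetch backoff described above.
FIRST_RETRY = 5 * 60      # retry ~5 minutes after the first empty reply
MAX_BACKOFF = 4 * 3600    # cap the wait at ~4 hours (assumed)

def after_empty_reply(backoff):
    """Server said 'no tasks available': wait longer before the next ask."""
    return FIRST_RETRY if backoff == 0 else min(backoff * 2, MAX_BACKOFF)

def after_task_report(backoff):
    """Reporting a completed WU triggers a request and resets the wait."""
    return 0

# A host that still has work keeps finishing and reporting tasks,
# so its backoff never grows past the first step:
b = 0
for _ in range(10):
    b = after_empty_reply(b)
    b = after_task_report(b)
print(f"host with work: asks again every ~{FIRST_RETRY // 60} min")

# An empty host has nothing to report, so the wait just grows:
b = 0
for _ in range(8):
    b = after_empty_reply(b)
print(f"empty host: now waiting {b / 3600:.0f} h between requests")
```

On that model a host with leftover work polls the scheduler every few minutes, while a dry host drifts out to multi-hour gaps, which matches what Grant and Keith reported above.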
ID: 1969651
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1969652 - Posted: 9 Dec 2018, 21:37:19 UTC - in response to Message 1969628.  

I've also been thinking about the preference for handing out work to machines that already have work, as someone noted. Could it be that the machines that still had work after such a long outage were asking for a smaller amount of work at a time, and thus got served earlier? Maybe if the empty machines asked for a smaller amount of work to start off, they would get some sooner?
Remember that (under BOINC) the servers don't 'hand out' work: they respond to requests for work.

I suspect that most of the time, when people report that the project isn't handing out work, what's really happening is that their client isn't asking for work. When BOINC asks for work but isn't given any, it goes into a sulk ("why bother?"). If you still have old work, every completed task clears that sulk; but once you run dry, the sulk continues. A single project update after the outage (if it gets work) clears the logjam, but without that, BOINC can wait hours before it tries again of its own accord.
ID: 1969652
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1969663 - Posted: 9 Dec 2018, 22:59:38 UTC - in response to Message 1969652.  

I believe I played with the cache settings once back in the day, asking for only a .25-day cache after a normal Tuesday outage instead of my normal 1-day cache. It didn't seem to make much difference to whether a work request succeeded or not. I still got the usual "no work is available" messages I normally do after an outage, as everyone bellies up to the scheduler feeding trough.
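
That fits how the cache setting works: it changes how much each request asks for, not how the scheduler decides whether to send anything. A much-simplified sketch (illustrative only, not BOINC's actual work-fetch code, which also weighs resource shares, device on-fractions, and per-project backoffs):

```python
# Simplified view of how the "store N days of work" preference sets
# the size of a scheduler work request. Illustration only.
SECONDS_PER_DAY = 86_400

def work_request_seconds(cache_days, buffered_days, instances):
    """Ask for enough seconds of work to refill every device instance."""
    shortfall = max(cache_days - buffered_days, 0.0)
    return shortfall * SECONDS_PER_DAY * instances

# An empty 2-GPU host with a 1-day cache asks for four times as much
# per request as it would with a .25-day cache:
print(work_request_seconds(1.0, 0.0, 2))    # -> 172800.0 seconds
print(work_request_seconds(0.25, 0.0, 2))   # -> 43200.0 seconds
```

Either way the scheduler fills each request from the same RTS pool, so a smaller ask mostly changes how much you get when you do get lucky, not how often you get lucky.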
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1969663
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1969695 - Posted: 10 Dec 2018, 1:59:28 UTC
Last modified: 10 Dec 2018, 1:59:59 UTC

Has anyone else noticed that it doesn't seem like credit is being awarded, or that it's being awarded VERY slowly? We've been back up and running all day, but RAC numbers are still in a nosedive.

I mean, I see validation numbers going up, but credit totals aren't.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1969695