Panic Mode On (107) Server Problems?

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1892025 - Posted: 27 Sep 2017, 18:44:29 UTC - in response to Message 1892024.  

I suspect, and it's only a suspicion, that the reason invoking the "ghost recovery" process is often successful in retrieving new tasks, even when no ghosts are present, is that a different timer is used, or at least a different, longer time interval. That "ghost recovery" process would, by necessity, require a database query in order to determine what tasks the server thinks are on hand for the requesting host. The results of that query then would have to be compared, task by task, against the tasks identified in the "<other_results>" section of the scheduler request, in order to see if any are missing and need to be resent. It would make sense to me (if making sense matters) that a longer response time might be allowed in order to accomplish that database retrieval and comparison, thus perhaps providing an extra cushion for normal scheduler operations.

I like your deduction. Makes sense to me that there is another timer mechanism in play for the "ghost recovery" protocol.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1892025 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1892080 - Posted: 27 Sep 2017, 23:37:23 UTC

Another consideration is that the maximum number of tasks sent per request can be set on the server.
Several years ago the max number of AP tasks per request was reduced so it would be harder for users to stockpile them.
It went from being able to get 100 at a time to ~7.
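
A minimal sketch of how a per-request cap like this could interact with a per-device in-progress limit. The function name, parameter names, and default values are hypothetical, chosen for illustration from the figures mentioned in this thread (~7 per request, 100 in progress per device); this is not the actual scheduler code.

```python
def tasks_to_send(requested, in_progress, max_per_request=7, max_in_progress=100):
    """Hypothetical helper: how many tasks one scheduler reply would carry.

    The reply is bounded by three things: what the client asked for,
    the per-request cap, and the room left under the in-progress limit.
    """
    room = max(0, max_in_progress - in_progress)
    return min(requested, max_per_request, room)


# A host asking for 50 tasks with an empty cache gets only the per-request cap.
print(tasks_to_send(50, 0))
# The same host with 97 tasks already on hand is limited by the remaining room.
print(tasks_to_send(50, 97))
```

Under these assumptions, a host that wants to stockpile has to make many separate requests, which is the effect described above.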
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1892080 · Report as offensive
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1892128 - Posted: 28 Sep 2017, 8:32:50 UTC - in response to Message 1892024.  

I suspect, and it's only a suspicion, that the reason invoking the "ghost recovery" process is often successful in retrieving new tasks, even when no ghosts are present, is that a different timer is used, or at least a different, longer time interval. That "ghost recovery" process would, by necessity, require a database query in order to determine what tasks the server thinks are on hand for the requesting host. The results of that query then would have to be compared, task by task, against the tasks identified in the "<other_results>" section of the scheduler request, in order to see if any are missing and need to be resent. It would make sense to me (if making sense matters) that a longer response time might be allowed in order to accomplish that database retrieval and comparison, thus perhaps providing an extra cushion for normal scheduler operations.


See, that's the thing: the "ghost recovery" procedure has the same timeouts as a normal scheduler request.
In its scheduler request, the client sends a list of the tasks that it is currently processing. The scheduler, using a count() SQL query against the db, sees that the host has reached the limit of 100 WUs per CPU + 100 WUs per GPU. When it sees this, the scheduler decides it is absolutely necessary to query the db. This is probably the statement that it would execute:
SELECT id, workunitid, name, hostid FROM result WHERE hostid='Your host id'

I've skimmed the code, so that SQL statement is a guess.
Parsing what it receives from that query takes up the majority of its time, which results in you recovering only 20 WUs. I'd hazard a guess that it times out.
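
To make the comparison step concrete, here is a rough sketch that runs the guessed query against a toy in-memory SQLite table and diffs the result against the task names the client reported. The table layout, the `find_ghosts` helper, and the sample rows are all assumptions for illustration; the real scheduler and its database schema are certainly more involved.

```python
import sqlite3


def find_ghosts(conn, host_id, reported_names):
    """Return names the server has on record for host_id that the client did not report."""
    rows = conn.execute(
        "SELECT id, workunitid, name, hostid FROM result WHERE hostid = ?",
        (host_id,),
    ).fetchall()
    server_names = {row[2] for row in rows}
    # Anything the server thinks is on the host but the client did not list is a "ghost".
    return sorted(server_names - set(reported_names))


# Toy stand-in for the result table (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE result (id INTEGER PRIMARY KEY, workunitid INTEGER, name TEXT, hostid INTEGER)"
)
conn.executemany(
    "INSERT INTO result (id, workunitid, name, hostid) VALUES (?, ?, ?, ?)",
    [(1, 10, "task_a", 42), (2, 11, "task_b", 42), (3, 12, "task_c", 42), (4, 13, "task_d", 7)],
)

# The client (host 42) reports task_a and task_c, so task_b is the ghost to resend.
print(find_ghosts(conn, 42, ["task_a", "task_c"]))
```

Even in this toy form, the cost grows with the number of tasks the server has on record for the host, which fits the guess that a host with a few hundred tasks could hit the timeout before everything is compared.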

Oh, and the server that the scheduler is running on is synergy.
According to the SSP page, its spec is: Intel Server (2 x hexa-core 2.53GHz Xeon, 96 GB RAM)
I'd hazard a guess that the specs of this server are:
Dual Intel Xeon E5649 2.53GHz with max turbo up to 2.93GHz on the LGA1366 socket
96GB of DDR3 1066 MHz memory
ID: 1892128 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1892310 - Posted: 29 Sep 2017, 5:27:31 UTC

Ah, we're back.
The forums went from being slower than a year of Sundays, to the Web site vanishing completely for a while there.
Grant
Darwin NT
ID: 1892310 · Report as offensive
Profile David@home
Volunteer tester
Joined: 16 Jan 03
Posts: 755
Credit: 5,040,916
RAC: 28
United Kingdom
Message 1892441 - Posted: 29 Sep 2017, 18:09:52 UTC

Getting "Project is temporarily shut down for maintenance" messages in my event log. All looks OK on the web site. Anybody else experiencing this?
ID: 1892441 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1892443 - Posted: 29 Sep 2017, 18:12:55 UTC - in response to Message 1892441.  

Yes, here too.
I thought I overslept by 4 days ;P
ID: 1892443 · Report as offensive
Iona
Joined: 12 Jul 07
Posts: 790
Credit: 22,438,118
RAC: 0
United Kingdom
Message 1892446 - Posted: 29 Sep 2017, 18:18:21 UTC - in response to Message 1892441.  

Yes, just noticed a few finished tasks, going nowhere. Tried to 'update'.....no luck. Been a while since we had server trouble on a Friday afternoon (here, that is)!
Don't take life too seriously, as you'll never come out of it alive!
ID: 1892446 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1892449 - Posted: 29 Sep 2017, 18:35:00 UTC - in response to Message 1892441.  
Last modified: 29 Sep 2017, 18:37:28 UTC

Getting "Project is temporarily shut down for maintenance" messages in my event log. All looks OK on the web site. Anybody else experiencing this?

Yes.....it's not just you.
I just noticed that all my rigs have not reported for about half an hour. Which is about the time that server message and backoff was sent out.

SSP looks OK as of this post. And I just hit the update button on my daily driver.......reported and got new tasks. So, it looks like the coast is clear again.
I'll just let the rest of my rigs time out by themselves.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1892449 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1892450 - Posted: 29 Sep 2017, 18:39:30 UTC

I had a 30-minute backoff on all machines and I just updated and started getting tasks again. So a short-lived event.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1892450 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1892598 - Posted: 30 Sep 2017, 10:40:38 UTC - in response to Message 1892450.  

I had a 30-minute backoff on all machines and I just updated and started getting tasks again. So a short-lived event.


. . Well I was blissfully asleep through that event, but right now the servers are very reluctant to send this rig {i5-6600 with 2 x 970s} any work. I had to tickle its tonsils very hard to refill my cache to only almost full; it is still 20 or so down and it won't send any more ...

Stephen

ID: 1892598 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1892602 - Posted: 30 Sep 2017, 12:17:26 UTC

Would appear to be a lack of nvidia GPU work being available.
My caches are down a bit too.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1892602 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1892607 - Posted: 30 Sep 2017, 13:49:48 UTC

Yes, seeing the same. Down about 250 tasks on the Linux cruncher and it sent (4) after the ghost recovery protocol. Must not have any Nvidia tasks.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1892607 · Report as offensive
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1892634 - Posted: 30 Sep 2017, 15:11:55 UTC - in response to Message 1892607.  

Yes, seeing the same. Down about 250 tasks on the Linux cruncher and it sent (4) after the ghost recovery protocol. Must not have any Nvidia tasks.


That or the feeder is not fast enough to refill the buffer
ID: 1892634 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1892644 - Posted: 30 Sep 2017, 15:33:57 UTC - in response to Message 1892607.  

It seems the RTS is full of Arecibo VLARs. The only CPU tasks I'm getting are Arecibo VLARs and the numbers on the Arecibo splitters aren't changing much even though some of the files are small. Someone needs to add some more Arecibo files and hope a splitter jumps on a new file.
ID: 1892644 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1892688 - Posted: 30 Sep 2017, 17:30:44 UTC - in response to Message 1892644.  

Yep, lots and lots of Arecibo VLARs. Even my machines are having difficulty getting new tasks, and my #1 cruncher actually got down to its last GPU task, the first time I can recall that happening. I just went ahead and rescheduled all the available guppis and non-VLAR Arecibo tasks from the CPU to the GPU queue. That freed up enough CPU queue space to be able to accept a bunch of Arecibo VLARs which, in turn, seemed to let a bunch of guppis and Arecibo non-VLAR tasks loose. It looks like that machine is slowly starting to rebuild the GPU queue but who knows how long that'll last. If I have to, I'll just move Arecibo VLARs over to the GPUs. I don't think they do all that badly with the Special App.
ID: 1892688 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1892700 - Posted: 30 Sep 2017, 20:14:42 UTC

The logjam of Arecibo VLARs may have finally broken up. I just got a 154 task slug of BLCs on the Ryzen system. Hope that occurs for the other crunchers.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1892700 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1892706 - Posted: 30 Sep 2017, 20:39:57 UTC - in response to Message 1892700.  

I was just thinking ... damn that Keith for saying there were not enough BLC files for his Ryzen.
I have more than enough to do my CPUs.
ID: 1892706 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 36826
Credit: 261,360,520
RAC: 489
Australia
Message 1892707 - Posted: 30 Sep 2017, 20:51:08 UTC

Not very many Arecibo VLARs here, but I do have heaps of GBT and some normal Arecibo work. ;-)

Cheers.
ID: 1892707 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34380
Credit: 79,922,639
RAC: 80
Germany
Message 1892708 - Posted: 30 Sep 2017, 20:56:56 UTC

Murphy's Law, maybe.
90% Arecibo on my CPU and 90% GBT on my GPU, whilst I would like it vice versa.


With each crime and every kindness we birth our future.
ID: 1892708 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13855
Credit: 208,696,464
RAC: 304
Australia
Message 1892710 - Posted: 30 Sep 2017, 21:01:31 UTC - in response to Message 1892708.  

Murphy's Law, maybe.
90% Arecibo on my CPU and 90% GBT on my GPU, whilst I would like it vice versa.

For me it's generally my slower system that gets mostly GBT work, and my faster system mostly Arecibo.
Grant
Darwin NT
ID: 1892710 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.