The Server Issues / Outages Thread - Panic Mode On! (118)

Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030501 - Posted: 2 Feb 2020, 12:26:41 UTC - in response to Message 2030478.  

And there is now a fix for the AMD RX 5000 card issues.
They can force only 'vanilla' hosts to upgrade their apps. So they can't really revert the triple validation kludge for overflow results before enough of the anonymous platform hosts have updated their apps to make the risk of a task getting sent to two bad hosts tiny enough to be acceptable.

Unless they can 'blacklist' AMD GPUs from receiving the _1 if the corresponding _0 was sent to one. But I don't think the system supports this, because if it did, they would have already done it instead of using this triple validation kludge - which isn't even 100% watertight, because there's still the risk of all three going to bad hosts.
ID: 2030501
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2030502 - Posted: 2 Feb 2020, 12:36:27 UTC
Last modified: 2 Feb 2020, 12:38:28 UTC

I am waiting and waiting to have the website confirm that I have a full cache.

Everything is running Seti@Home except for three weather forecast tasks from WCG.

Eyeballing it, it looks like I have a full set of CPU tasks and a less-than-full set of GPU tasks. But all the GPUs are engaged, and I think I may have 150 GPU tasks, so hopefully it will stay that way.

Apparently the Replica DB is "just a bit behind". It just reported I have 6 tasks in progress.

I know I have to take off my shoes to count past 10 but I am sure I have more than "6" :)

Here it is Sunday morning, and I/we? are finally getting a steady flow of tasks?

Tom
A proud member of the OFA (Old Farts Association).
ID: 2030502
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030505 - Posted: 2 Feb 2020, 12:41:50 UTC - in response to Message 2030495.  

Now, if the "Results returned and awaiting validation" were on the same graph as the "Results out in the field", for both MB & AP, it'd be perfect
Actually, one of the more interesting graphs would be the SUM of 'Results ready to send', 'Results out in the field', 'Results returned and awaiting validation' and 'Results waiting for db purging', for both MB & AP. That is, all eight fields in one sum.

This would be the number of results in the database - the value that Eric said has to be kept under 20 million to avoid the result table spilling out of RAM. It is now 18.9 million.
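
As a rough illustration of that sum (a minimal sketch only, not project code - the four field names come from the post, and the MB/AP numbers below are placeholders chosen just to land near the ~18.9 million figure, not real server readings):

# Placeholder sketch: add up the eight result-state fields from the Server
# Status page. The numbers are invented for illustration; substitute the
# current values from the status page.
RESULT_LIMIT = 20_000_000  # the ceiling Eric mentioned for the result table

fields = {                                                # (MB, AP) placeholders
    "Results ready to send":                    (600_000,     2_000),
    "Results out in the field":                 (5_500_000,  40_000),
    "Results returned and awaiting validation": (8_000_000,  30_000),
    "Results waiting for db purging":           (4_700_000,  20_000),
}

total = sum(mb + ap for mb, ap in fields.values())
print(f"Results in database: {total:,} ({total / RESULT_LIMIT:.0%} of the {RESULT_LIMIT:,} limit)")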

Those 71 ancient zombie S@Hv7 results appear to have finally been purged!
ID: 2030505
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030507 - Posted: 2 Feb 2020, 12:56:53 UTC - in response to Message 2030502.  
Last modified: 2 Feb 2020, 12:58:10 UTC

I am waiting and waiting to have the website confirm that I have a full cache.
Do what I did: Write a program that reads the client_state.xml and reports the number of tasks for CPU and GPU. That way you can easily see how full your queues are and you don't need the website for that, so it works even during the out(r)ages.

And the data will always be fresh, no matter how far behind the replica db is.
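
For anyone who wants to try the same thing, here is a minimal sketch (assuming BOINC's usual client_state.xml layout, where each <result> element carries a <plan_class> tag and GPU plan classes mention things like cuda/opencl/ati - the file path and the keyword list are assumptions to adjust for your own install):

# Minimal sketch: count tasks per resource from client_state.xml.
# Counts every result currently in the client state (queued, running or
# waiting to report), for all attached projects.
import xml.etree.ElementTree as ET

CLIENT_STATE = "/var/lib/boinc-client/client_state.xml"   # adjust for your install
GPU_KEYWORDS = ("cuda", "opencl", "ati", "nvidia", "gpu")  # heuristic, not exhaustive

cpu = gpu = 0
for result in ET.parse(CLIENT_STATE).getroot().iter("result"):
    plan_class = (result.findtext("plan_class") or "").lower()
    if any(key in plan_class for key in GPU_KEYWORDS):
        gpu += 1
    else:
        cpu += 1

print(f"CPU tasks: {cpu}, GPU tasks: {gpu}")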
ID: 2030507
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030508 - Posted: 2 Feb 2020, 13:07:21 UTC - in response to Message 2030507.  

I think BoincTasks can do that, as well.
ID: 2030508
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1861
Credit: 268,616,081
RAC: 1,349
United States
Message 2030512 - Posted: 2 Feb 2020, 13:42:54 UTC - in response to Message 2030508.  

I think BoincTasks can do that, as well.

Quite well, in fact.
ID: 2030512
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1861
Credit: 268,616,081
RAC: 1,349
United States
Message 2030513 - Posted: 2 Feb 2020, 13:43:39 UTC

And, at least for the moment, the floodgates appear to have opened.
ID: 2030513
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030523 - Posted: 2 Feb 2020, 14:54:04 UTC

Something has changed. The floodgates are wide open but the assimilation queue is still getting smaller.
ID: 2030523
Profile Chris904395093209d Project Donor
Volunteer tester

Joined: 1 Jan 01
Posts: 112
Credit: 29,923,129
RAC: 6
United States
Message 2030524 - Posted: 2 Feb 2020, 15:00:34 UTC

I'm not seeing the '71' under the S@H V7 column on the server status page. Did those finally get cleaned up in the dbase?
~Chris

ID: 2030524
Profile Kissagogo27 Special Project $75 donor
Joined: 6 Nov 99
Posts: 717
Credit: 8,032,827
RAC: 62
France
Message 2030525 - Posted: 2 Feb 2020, 15:03:31 UTC


02-Feb-2020 15:51:01 [SETI@home] Sending scheduler request: To fetch work.
02-Feb-2020 15:51:01 [SETI@home] Requesting new tasks for CPU and AMD/ATI GPU
02-Feb-2020 15:51:06 [SETI@home] Scheduler request completed: got 124 new tasks


UTC+1 ^^
ID: 2030525
Profile Mr. Kevvy (Crowdfunding Project Donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3866
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2030527 - Posted: 2 Feb 2020, 15:04:47 UTC - in response to Message 2030524.  

I'm not seeing the '71' under the S@H V7 column on the server status page. Did those finally get cleaned up in the dbase?


It appears they did... the purging queue has fallen by half, so work generation is back as the result table is well below 20M.
ID: 2030527
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2030529 - Posted: 2 Feb 2020, 15:09:49 UTC - in response to Message 2030527.  
Last modified: 2 Feb 2020, 15:44:09 UTC

I'm not seeing the '71' under the S@H V7 column on the server status page. Did those finally get cleaned up in the dbase?


It appears they did... the purging queue has fallen by half, so work generation is back as the result table is well below 20M.

Maybe it is time to start cutting the deadlines of the WUs, and to make some changes in the way the work is distributed, like sending the resends to the fastest hosts to clear them ASAP. Or we will be trapped in an endless loop of no new work each time the total reaches 20 MM.
ID: 2030529
Profile Mr. Kevvy (Crowdfunding Project Donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3866
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2030530 - Posted: 2 Feb 2020, 15:12:40 UTC - in response to Message 2030529.  

Or we will be trapped in an endless loop of no new work each time the total reaches 20 MM.


Possible explanation of why this has only been happening recently here.... Briefly: Quorum=3 for overflows coupled with BLC35 files which generate little except overflows.
ID: 2030530
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030536 - Posted: 2 Feb 2020, 15:51:20 UTC - in response to Message 2030529.  

Maybe it is time to start cutting the deadlines of the WUs, and to make some changes in the way the work is distributed, like sending the resends to the fastest hosts to clear them ASAP.
Again, NOT the fastest, but the ones with the shortest average turnaround time. A slow host with a tiny cache can return the result faster than a fast host with a huge spoofed cache.

One thing that could prevent this from happening again is if the system monitored the rate of overflows returned, and when any file being split exceeded some threshold, that file would be heavily throttled so that it continued being split but produced only a small percentage of all the workunits.

Or this could even happen without any monitoring if the different splitters split different files instead of all bunching up on the same file. So if some file (or a few files) produced an overflow storm, the storm would be diluted by all the other splitters splitting clean files. But I don't know how this would affect the splitter performance. Spreading out could be faster or slower than bunching up.
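
As a sketch of what that monitoring might look like (illustrative only - this is not existing SETI@home or BOINC code, and the window size, threshold and throttled share are invented numbers):

# Hypothetical sketch: throttle workunit generation for any file whose
# recently returned results are mostly overflows.
from collections import defaultdict, deque

WINDOW = 1000              # look at the last N results returned per file
OVERFLOW_THRESHOLD = 0.5   # throttle a file once >50% of its recent results overflow
THROTTLED_SHARE = 0.05     # a throttled file contributes at most 5% of new workunits

recent = defaultdict(lambda: deque(maxlen=WINDOW))  # file name -> recent overflow flags

def record_result(file_name, is_overflow):
    # called for every returned result (hypothetical hook in the validator path)
    recent[file_name].append(is_overflow)

def split_weight(file_name):
    # relative share of new workunits the splitters should take from this file
    results = recent[file_name]
    if not results:
        return 1.0
    overflow_rate = sum(results) / len(results)
    return THROTTLED_SHARE if overflow_rate > OVERFLOW_THRESHOLD else 1.0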
ID: 2030536
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2030560 - Posted: 2 Feb 2020, 21:43:58 UTC - in response to Message 2030536.  
Last modified: 2 Feb 2020, 21:44:42 UTC

Again, NOT the fastest, but the ones with the shortest average turnaround time. A slow host with a tiny cache can return the result faster than a fast host with a huge spoofed cache.

Sorry, the meaning was lost in the translation. For me, the fastest hosts are the ones with the shortest average turnaround time (less than 1 day). They could clear the WUs in very little time and help reduce the DB size. Obviously the WUs must be sent with a very short deadline (less than 3 days in this case).

The way it is done now, sending the WUs to any host (with a long deadline), just makes the DB size problem even worse.
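
Expressed as a sketch (purely hypothetical - this is not the real BOINC scheduler, just the policy described above, using the one-day turnaround and three-day deadline figures from the post):

# Hypothetical sketch of the resend policy described above.
from datetime import timedelta

MAX_TURNAROUND = timedelta(days=1)   # "fastest" = average turnaround under a day
RESEND_DEADLINE = timedelta(days=3)  # short deadline so lost resends recycle quickly

def pick_resend_host(candidates):
    # candidates: list of dicts with an 'avg_turnaround' timedelta (illustrative field)
    fast = [h for h in candidates if h["avg_turnaround"] <= MAX_TURNAROUND]
    return min(fast, key=lambda h: h["avg_turnaround"]) if fast else None

def deadline_for_resend(now):
    return now + RESEND_DEADLINE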
ID: 2030560
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1648
Credit: 12,921,799
RAC: 89
New Zealand
Message 2030566 - Posted: 2 Feb 2020, 21:58:36 UTC - in response to Message 2030508.  

I think BoincTasks can do that, as well.

I agree. It would be good if BoincTasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?
ID: 2030566
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 2030584 - Posted: 2 Feb 2020, 23:16:33 UTC

Better solution: if you can detect short tasks without running them, why not just abort them?
Can BoincTasks do this? Could the servers?
ID: 2030584
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19851
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2030586 - Posted: 2 Feb 2020, 23:38:11 UTC - in response to Message 2030584.  
Last modified: 3 Feb 2020, 0:23:35 UTC

Better solution: if you can detect short tasks without running them, why not just abort them?
Can BoincTasks do this? Could the servers?

The only known way is to run them, even if only for a short time - like the time taken on a 2060 GPU or better for a 'bomb' to be -9'ed.
We don't know how many tasks are sent/day but we do know how many are returned/hr.

Average tasks returned per hour * 24 * short-task time on GPU (seconds) / 86400 (seconds in a day) = GPUs needed
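
For example, with made-up numbers: at 100,000 tasks returned per hour and roughly 20 seconds per 'bomb' on such a GPU, that is 100,000 * 24 * 20 / 86,400 ≈ 556 GPUs.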
ID: 2030586
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2030593 - Posted: 3 Feb 2020, 0:20:11 UTC

Sun 02 Feb 2020 06:16:57 PM CST | SETI@home | Scheduler request completed: got 92 new tasks

Yum! Something to crunch ;)

Tom
A proud member of the OFA (Old Farts Association).
ID: 2030593
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030611 - Posted: 3 Feb 2020, 6:44:56 UTC

Looks like more trouble. About 30 minutes ago the website got very slow and the scheduler checked out:
Mon Feb 3 01:08:50 2020 | SETI@home | [sched_op] Starting scheduler request
Mon Feb 3 01:10:47 2020 | SETI@home | Scheduler request failed: HTTP internal server error
Mon Feb 3 01:10:47 2020 | SETI@home | [sched_op] Reason: Scheduler request failed
Mon Feb 3 01:13:08 2020 | SETI@home | Sending scheduler request: To report completed tasks.
Mon Feb 3 01:14:23 2020 | SETI@home | Scheduler request failed: Couldn't connect to server
Mon Feb 3 01:22:01 2020 | SETI@home | [sched_op] Starting scheduler request
Mon Feb 3 01:23:15 2020 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
Mon Feb 3 01:23:15 2020 | SETI@home | [sched_op] Reason: Scheduler request failed
Mon Feb 3 01:34:15 2020 | SETI@home | [sched_op] Starting scheduler request
Mon Feb 3 01:36:57 2020 | SETI@home | Scheduler request failed: HTTP internal server error
Mon Feb 3 01:36:57 2020 | SETI@home | [sched_op] Reason: Scheduler request failed
Just when everything was working well...
ID: 2030611