The Server Issues / Outages Thread - Panic Mode On! (118)

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13985
Credit: 208,696,464
RAC: 304
Australia
Message 2030612 - Posted: 3 Feb 2020, 7:09:49 UTC

Well, of all the problems I was expecting to occur, the Scheduler going MIA wasn't one of them.

And it appears it might have just come back to life - no more timeouts, HTTP errors, or "failure when receiving data from the peer" (I think every possible error has appeared at some stage).
Now it's back to "Project has no tasks available", but at least I can report everything that's accumulated since the Scheduler went AWOL earlier.
Grant
Darwin NT
ID: 2030612
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030615 - Posted: 3 Feb 2020, 7:31:36 UTC

Looks like the validators have been MIA too, not just the scheduler. The first successful scheduler contact made my RAC drop lower than the lowest point yesterday at the end of the dry period.
ID: 2030615
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030616 - Posted: 3 Feb 2020, 7:36:40 UTC

A few machines are starting to get Downloads again. Hopefully this will blow over quickly.
ID: 2030616
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13985
Credit: 208,696,464
RAC: 304
Australia
Message 2030617 - Posted: 3 Feb 2020, 7:46:51 UTC - in response to Message 2030615.  

Looks like the validators have been MIA too, not just the scheduler. The first successful scheduler contact made my RAC drop lower than the lowest point yesterday at the end of the dry period.
For a while there things were improving (steadily if slowly), but all the new work going out has caused the Validation backlog to increase again.
Grant
Darwin NT
ID: 2030617
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030619 - Posted: 3 Feb 2020, 7:55:46 UTC - in response to Message 2030617.  
Last modified: 3 Feb 2020, 8:55:29 UTC

For a while there things were improving (steadily if slowly), but all the new work going out has caused the Validation backlog to increase again.
The assimilation backlog was shrinking until two SSP updates ago, but over the last two updates it too has grown.

Here are the cumulative result counts for the last few days:

[image: stacked plot of cumulative result counts]
Each plotted line is the sum of its own variable plus all the variables below it, so the width of the band between a line and the one below it represents the value of that specific variable. The plots show that database purging has been primarily responsible for the reduction in database size, and that when the database ran out of purgeable results, the total result count started increasing again.
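For anyone who wants to reproduce this kind of chart, a stacked plot like the one described takes only a few lines of matplotlib (a sketch with made-up numbers, not the real SSP data):

```python
import matplotlib.pyplot as plt

days = [1, 2, 3]  # days of February (x-axis)
# Illustrative values in millions of results - NOT the real SSP figures:
waiting_purge = [2.5, 1.8, 1.2]
waiting_assimilation = [1.0, 1.1, 1.4]
waiting_validation = [6.1, 5.9, 6.3]

# stackplot draws each series as a band stacked on the ones below it,
# so band width = that variable's value and the top line = the total.
plt.stackplot(days, waiting_purge, waiting_assimilation, waiting_validation,
              labels=["waiting for purge", "waiting for assimilation",
                      "waiting for validation"])
plt.xlabel("day of February")
plt.ylabel("results (millions)")
plt.legend(loc="upper left")
plt.show()
```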

The results waiting for assimilation are an estimated value, because the SSP doesn't report that number separately. The estimate is based on two assumptions: that the SSP counts those results as waiting for validation, and that the average replication (number of results per workunit) is 2.2.

The numbers on the x-axis are days of February.
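In code form the estimate amounts to something like this (a sketch; the variable names are illustrative, and the 2.2 replication factor is the stated assumption rather than an SSP-reported value):

```python
AVG_REPLICATION = 2.2  # assumed average number of results per workunit

def estimated_assimilation_results(wus_waiting_assimilation):
    # The SSP reports workunits waiting for assimilation; convert that
    # to a result count using the assumed replication factor.
    return wus_waiting_assimilation * AVG_REPLICATION

def corrected_validation_backlog(results_waiting_validation,
                                 wus_waiting_assimilation):
    # Assumption: the SSP's waiting-for-validation count still includes
    # results whose workunit is really waiting for assimilation.
    return (results_waiting_validation
            - estimated_assimilation_results(wus_waiting_assimilation))
```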
ID: 2030619
Cruncher-American
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 2030627 - Posted: 3 Feb 2020, 10:20:58 UTC - in response to Message 2030586.  

I agree. It would be good if boinc tasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?


Then how could any other piece of s/w do this...just asking for a friend.
ID: 2030627
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030629 - Posted: 3 Feb 2020, 10:42:32 UTC - in response to Message 2030627.  

I agree. It would be good if boinc tasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?
Then how could any other piece of s/w do this...just asking for a friend.
Unfortunately, can't be done - consistently, at any rate.

That's what we're here for - finding the signals in the noise. The only way to do that is to run SETI's own software.

There are occasions when a whole group of tasks are 'similar' - like the recent run of BLC35 tasks. But it wasn't 100%, and there were tasks in there that needed running. The best we can hope for is that the powers that be provide enough workers in the SETI@Home labs to manage the tape splitting process more closely, so that when one of these self-similar groups appears, they can respond by distributing them gradually, amongst other types of work.
ID: 2030629
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2030634 - Posted: 3 Feb 2020, 12:00:57 UTC

I got up this morning and my Windows 10 box had shut down for some reason or other. When it does that I have to turn off the PSU before things will "reset" and then up it comes.

Got this when everything was up again:
2/3/2020 5:51:36 AM | SETI@home | Scheduler request completed: got 150 new tasks


Tom
A proud member of the OFA (Old Farts Association).
ID: 2030634
BetelgeuseFive
Volunteer tester
Joined: 6 Jul 99
Posts: 158
Credit: 17,117,787
RAC: 19
Netherlands
Message 2030636 - Posted: 3 Feb 2020, 12:16:02 UTC - in response to Message 2030629.  

I agree. It would be good if boinc tasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?
Then how could any other piece of s/w do this...just asking for a friend.
Unfortunately, can't be done - consistently, at any rate.

That's what we're here for - finding the signals in the noise. The only way to do that is to run SETI's own software.

There are occasions when a whole group of tasks are 'similar' - like the recent run of BLC35 tasks. But it wasn't 100%, and there were tasks in there that needed running. The best we can hope for is that the powers that be provide enough workers in the SETI@Home labs to manage the tape splitting process more closely, so that when one of these self-similar groups appears, they can respond by distributing them gradually, amongst other types of work.


But it should be possible to move resends to the top of the queue (or at least it used to be, back when all tasks were sent out as pairs: anything with a _2 or higher would be a resend).

Tom
ID: 2030636
Retvari Zoltan
Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2030638 - Posted: 3 Feb 2020, 12:47:23 UTC - in response to Message 2030277.  

My Inconclusive results are going up too, even though I've only had a handful of Tasks since last night. Last night I had a large number of Inconclusive results that said 'minimum quorum 1' and only listed a single Inconclusive host; I didn't see how a single Inconclusive host task could ever validate. Now it's very difficult to bring up my Inconclusive tasks list, but it seems those tasks are now listed as: https://setiathome.berkeley.edu/workunit.php?wuid=3862758806
minimum quorum 1
initial replication 3
Task        Computer  Sent                       Time reported              Status                   Run time (s)  CPU time (s)  Credit  Application
8495599283  1473578   31 Jan 2020,  5:02:48 UTC  31 Jan 2020, 21:47:15 UTC  Completed and validated  15.36         12.61         3.59    SETI@home v8 v8.20 (opencl_ati5_mac) x86_64-apple-darwin
8498611906  6796479    1 Feb 2020,  3:00:50 UTC   1 Feb 2020,  4:00:03 UTC  Completed and validated   4.10          1.93         3.59    SETI@home v8 v8.11 (cuda42_mac) x86_64-apple-darwin
8498669733  8673543    1 Feb 2020,  4:01:52 UTC   1 Feb 2020,  5:29:49 UTC  Completed and validated  15.11         13.09         3.59    SETI@home v8 v8.22 (opencl_nvidia_SoG)
So the single hosts are now triple hosts, but they are still just sitting there, a number of them showing one or two 'Completed, waiting for validation' hosts, and some with one or two Inconclusive hosts.
I have a couple of invalid tasks with minimum quorum = 1. Perhaps I have a lot of valid tasks as well with min.q.=1, but they are much harder to spot.
https://setiathome.berkeley.edu/workunit.php?wuid=3861384942
https://setiathome.berkeley.edu/workunit.php?wuid=3861339403
https://setiathome.berkeley.edu/workunit.php?wuid=3861247650
https://setiathome.berkeley.edu/workunit.php?wuid=3861247545
and so on...
https://setiathome.berkeley.edu/results.php?userid=5276&offset=0&show_names=0&state=5&appid=
ID: 2030638
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030639 - Posted: 3 Feb 2020, 12:52:34 UTC - in response to Message 2030636.  

But it should be possible to move resends to the top of the queue (or at least it used to be, back when all tasks were sent out as pairs: anything with a _2 or higher would be a resend).
I don't think this is easy for an external tool to do, except perhaps by modifying the deadlines of the tasks in client_state.xml to trick boinc into processing them in a hurry.

If you modified the boinc client itself, you could change the rules it uses to pick the next task to crunch, making it prioritize _2 and higher over _0 and _1.
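A minimal sketch of that deadline trick, assuming the usual client_state.xml layout in which each <result> carries a <name> and a Unix-epoch <report_deadline> (run it only while the client is stopped, and back the file up first; the _2-or-higher test is the rule of thumb mentioned above):

```python
import re
import time
import xml.etree.ElementTree as ET

STATE_FILE = "client_state.xml"  # adjust the path for your installation

def bump_resend_deadlines(hours=1):
    """Fake a near deadline on resend tasks so the client panics on them."""
    tree = ET.parse(STATE_FILE)
    fake_deadline = time.time() + hours * 3600
    for result in tree.getroot().iter("result"):
        name = result.findtext("name", default="")
        match = re.search(r"_(\d+)$", name)
        if match and int(match.group(1)) >= 2:  # _2 or higher => resend
            deadline = result.find("report_deadline")
            if deadline is not None and float(deadline.text) > fake_deadline:
                deadline.text = str(fake_deadline)
    tree.write(STATE_FILE)

if __name__ == "__main__":
    bump_resend_deadlines()
```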
ID: 2030639
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2030640 - Posted: 3 Feb 2020, 13:13:03 UTC - in response to Message 2030639.  
Last modified: 3 Feb 2020, 13:15:25 UTC

Or...

Instead of modifying the client itself, which isn't recommended because the devs constantly release new updates for it, you could build an external app like the rescheduler.

But instead of rescheduling WUs between GPU and CPU, you could rearrange the FIFO order in which the WUs are crunched, so they will be crunched in any order you choose. Obviously that only holds until panic mode is triggered by the client.

The question could be: why would you need to do that? Keep your WU cache sized so that your host crunches all of its WUs within a day, and you will help clear the DB faster.
ID: 2030640
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030643 - Posted: 3 Feb 2020, 13:34:33 UTC - in response to Message 2030640.  

But instead of rescheduling WUs between GPU and CPU, you could rearrange the FIFO order in which the WUs are crunched, so they will be crunched in any order you choose.
Does the order in which the results are listed in client_state.xml count? There's no field for queue position, so if the physical order doesn't count, then the only way to do this would be faking the deadlines or receive times.

Hacking the client would have the advantage that you wouldn't then need to periodically stop and restart the client to edit the client_state.xml. Every restart makes you lose on average 2.5 minutes of CPU progress and half a task of GPU progress.
ID: 2030643
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030655 - Posted: 3 Feb 2020, 16:20:20 UTC

I'm still getting quite a few Uploads going immediately into Retry. Changing BOINC versions doesn't help, and it's also happening with the stock Mac version of BOINC. Some run for 6 to 7 seconds and finish normally, while others go into Retry after just one second. The only fix I've found is to recompile BOINC with the minimum wait time set to 30 seconds instead of two minutes; that manages to clear them before they have a chance to pile up. It certainly appears to be something on the other end...
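For context, the behaviour TBar is tuning is the client's transfer-retry backoff. Conceptually it looks something like the sketch below - illustrative only; the real logic and constant names live in the BOINC client source:

```python
import random

MIN_WAIT = 30        # TBar's rebuilt value; the stock client waits ~120 s
MAX_WAIT = 4 * 3600  # illustrative upper cap

def next_retry_delay(n_failures):
    # Exponential backoff between a minimum and maximum wait, with
    # jitter so many hosts don't hammer the server in lockstep.
    delay = MIN_WAIT * 2 ** (n_failures - 1)
    return min(MAX_WAIT, delay) * random.uniform(0.5, 1.0)
```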
ID: 2030655
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2030667 - Posted: 3 Feb 2020, 18:26:33 UTC - in response to Message 2030643.  
Last modified: 3 Feb 2020, 19:02:21 UTC

But instead of rescheduling WUs between GPU and CPU, you could rearrange the FIFO order in which the WUs are crunched, so they will be crunched in any order you choose.
Does the order in which the results are listed in client_state.xml count? There's no field for queue position, so if the physical order doesn't count, then the only way to do this would be faking the deadlines or receive times.

They are processed in "First In First Out" order, unless one WU activates the "panic" switch because of its deadline.

Hacking the client would have the advantage that you wouldn't then need to periodically stop and restart the client to edit the client_state.xml. Every restart makes you lose on average 2.5 minutes of CPU progress and half a task of GPU progress.

In theory yes, but in the practical world the client is constantly updated with fixes and new features, so you would need to constantly recompile it. That is why a self-contained external program like the rescheduler works better in this case. What you could do is hack the client to automatically call that program from time to time. But you need to be aware that client_state.xml is read when the client program starts, as you probably know.

As for the crunching time lost: the CPU part is not a problem, since you can force the client to create a checkpoint for the CPU tasks. You can't for the GPU part, because of the way the optimized applications work; in that case you lose what has already been crunched and the WU restarts from zero.

Yes, you could change the deadline of the WUs to artificially force "panic mode", but that could cause you to lose a lot of WUs if they reach their deadline before being crunched, or if any server-side problem blocks uploads.

What I can't imagine is why somebody would need to change the crunching order. Just keep the WU cache low and it is all handled by the internal boinc schedule control.
ID: 2030667
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030671 - Posted: 3 Feb 2020, 19:01:49 UTC - in response to Message 2030667.  

What you could do is hack the client to automatically call that program from time to time.
In this case one could make the client save its state and suspend its operation while leaving the science apps running, run the program, then read the state back from the xml file and resume operation. This way you won't lose any progress except if some task gets finished while the client is still waiting for the program to do its job. And even in that case the only consequence would be the gpu or cpu idling for a second or two.
ID: 2030671
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2030673 - Posted: 3 Feb 2020, 19:09:02 UTC - in response to Message 2030671.  
Last modified: 3 Feb 2020, 19:59:59 UTC

What you could do is hack the client to automatically call that program from time to time.
In this case one could make the client save its state and suspend its operation while leaving the science apps running, run the program, then read the state back from the xml file and resume operation. This way you won't lose any progress except if some task gets finished while the client is still waiting for the program to do its job. And even in that case the only consequence would be the gpu or cpu idling for a second or two.

You could try, but AFAIK with the Linux Special Sauce you can't pause the crunching process or weird things can happen (petri posted that in the doc file). That is why you need to change the checkpoint timer when you run those apps.

What is safer IMHO is to exit the client completely (saving the CPU work done; forget about the GPU work), run the external program, and restart the client.
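A sketch of that safer stop-edit-restart cycle (assuming a Linux box where the client runs as the systemd service boinc-client; both the service name and the tool name vary by installation, and reorder_tasks.py is hypothetical):

```python
import subprocess

def run(cmd):
    # Run a command and raise on failure, so we never restart
    # the client on top of a half-edited state file.
    subprocess.run(cmd, check=True)

run(["systemctl", "stop", "boinc-client"])   # client checkpoints CPU tasks and exits
run(["python3", "reorder_tasks.py"])         # hypothetical external reordering tool
run(["systemctl", "start", "boinc-client"])  # client re-reads client_state.xml
```

(Run it with root privileges, since systemctl needs them.)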

Mod: Could you move this discussion to a more appropriate thread? Thanks.
ID: 2030673
W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 19841
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2030702 - Posted: 3 Feb 2020, 22:09:38 UTC - in response to Message 2030696.  

Certainly something is going on. I went out at midday, UK time, and the replica was over 14 hours behind at that time.

I also had 1400+ Valid tasks listed in my account; that has dropped to 800. If correct, then some serious assimilation, deletion and purging is going on, at last.
ID: 2030702
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030705 - Posted: 3 Feb 2020, 22:16:52 UTC - in response to Message 2030702.  

Another straw in the wind: I only had 38,638 new credits exported to BOINCstats yesterday. But I got 116,804 more added in the first half of today, and local records have another 178,197 so far today. That's well above my normal RAC.
ID: 2030705
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2030708 - Posted: 3 Feb 2020, 22:24:11 UTC

This is interesting. I looked at the validated tasks on a sample host of mine. One was 7 hours old, but the interesting part was the early overflow that was validated and awarded 0.25 credits.

But the workunit itself and its result file can't be pulled up - the page just says "can't find workunit". Normally validated results are viewable for 24 hours. Are they purging overflows as soon as they get validated now?

8505927055 3861538171 3 Feb 2020, 8:47:47 UTC 3 Feb 2020, 14:47:34 UTC Completed and validated 4.11 2.65 0.25 SETI@home v8 Anonymous platform (CPU)
SETI@home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2030708