Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13985 · Credit: 208,696,464 · RAC: 304

Well, of all the problems I was expecting to occur, the Scheduler going MIA wasn't one of them. And it appears it might have just come back to life: no longer timing out, returning HTTP errors, or failing while receiving data from the peer (I think every possible error has shown up at some stage). Now it's back to "Project has no tasks available", but at least I can report everything that's accumulated since the Scheduler went AWOL earlier.

Grant
Darwin NT

Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530

Looks like the validators have been MIA too, not just the scheduler. The first successful scheduler contact made my RAC drop lower than the lowest point yesterday at the end of the dry period.

TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768

A few machines are starting to get Downloads again. Hopefully this will blow over quickly.

Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13985 · Credit: 208,696,464 · RAC: 304

> Looks like the validators have been MIA too, not just the scheduler. The first successful scheduler contact made my RAC drop lower than the lowest point yesterday at the end of the dry period.

For a while there things were improving (steadily if slowly), but all the new work going out has caused the validation backlog to increase again.

Grant
Darwin NT

Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530

> For a while there things were improving (steadily if slowly), but all the new work going out has caused the Validation backlog to increase again.

The assimilation backlog was shrinking until two SSP updates ago, but on the last two updates it too has grown. Here are the cumulative result counts for the last few days:

[plot omitted: stacked cumulative result counts by state over the last few days]

Each plotted value is the sum of that value plus all the values below it, so the width of the band between a line and the one below it represents the value of that specific variable. The plots show that database purging has been primarily responsible for the database size reduction, and when the database ran out of purgeable results, the total result count started increasing again.

The results waiting for assimilation are an estimated value, because the SSP doesn't report them separately. The estimate is based on two assumptions: those results are counted as "waiting for validation" on the SSP, and the average replication (number of results per workunit) is 2.2. The numbers on the x-axis are days of February.

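As a rough illustration of the stacking and the assimilation estimate described above: the daily counts, field names and days in this sketch are placeholders, not real Server Status Page data.

```python
# Illustration only: the daily counts below are made up, not taken from the SSP.
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]  # days of February (x axis)

# One list of daily result counts per state (placeholder values).
counts = {
    "waiting for db purge":           [2.0e6, 1.6e6, 1.1e6, 0.7e6, 0.8e6],
    "waiting for assimilation (est)": None,  # filled in below
    "waiting for validation":         [1.2e6, 1.3e6, 1.4e6, 1.3e6, 1.5e6],
    "in progress / in the field":     [5.0e6, 5.0e6, 4.9e6, 4.9e6, 5.0e6],
}

# The SSP lumps results of assimilation-pending workunits in with "waiting for
# validation", so estimate them from the workunit count and the assumed
# average replication of 2.2 results per workunit.
AVG_REPLICATION = 2.2
wus_waiting_assimilation = [0.30e6, 0.28e6, 0.33e6, 0.38e6, 0.40e6]
counts["waiting for assimilation (est)"] = [
    wu * AVG_REPLICATION for wu in wus_waiting_assimilation
]

# Stack the series: each plotted line is its own values plus everything plotted
# below it, so the band width between adjacent lines shows that one variable.
running = [0.0] * len(days)
for label, values in counts.items():
    running = [r + v for r, v in zip(running, values)]
    plt.plot(days, running, label=label)

plt.xlabel("Day of February")
plt.ylabel("Results (stacked)")
plt.legend()
plt.show()
```
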
Cruncher-American · Joined: 25 Mar 02 · Posts: 1513 · Credit: 370,893,186 · RAC: 340

> I agree. It would be good if boinc tasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?

Then how could any other piece of s/w do this... just asking for a friend.

Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874

> I agree. It would be good if boinc tasks or another piece of software could push short tasks to the front of the queue. Does anybody know of any software that does this?
> Then how could any other piece of s/w do this... just asking for a friend.

Unfortunately, it can't be done - consistently, at any rate. That's what we're here for: finding the signals in the noise. The only way to do that is to run SETI's own software. There are occasions when a whole group of tasks are 'similar' - like the recent run of BLC35 tasks. But it wasn't 100%, and there were tasks in there that needed running. The best we can hope for is that the powers that be provide enough workers in the SETI@Home labs to manage the tape splitting process more closely, so that when one of these self-similar groups appears, they can respond by distributing them gradually, amongst other types of work.

Tom M · Joined: 28 Nov 02 · Posts: 5126 · Credit: 276,046,078 · RAC: 462

I got up this morning and my Windows 10 box had shut down for some reason or other. When it does that I have to turn off the PSU before things will "reset", and then up it comes. Got this when everything was up again:

2/3/2020 5:51:36 AM | SETI@home | Scheduler request completed: got 150 new tasks

Tom
A proud member of the OFA (Old Farts Association).

BetelgeuseFive · Joined: 6 Jul 99 · Posts: 158 · Credit: 17,117,787 · RAC: 19

> Unfortunately, it can't be done - consistently, at any rate.

But it should be possible to move resends to the top of the queue (or at least it used to be, back when all tasks were sent out as pairs: anything with a _2 or higher should be a resend).

Tom

Retvari Zoltan · Joined: 28 Apr 00 · Posts: 35 · Credit: 128,746,856 · RAC: 230

> My Inconclusive results are going up too, even though I've only had a handful of Tasks since last night. Last night I had a large number of Inconclusive results that said 'minimum quorum 1' and only listed a single Inconclusive host. I didn't see how a single Inconclusive host task could ever validate. Now, it's very difficult to bring up my Inconclusive tasks lists, but it seems those tasks are now listed as:
> https://setiathome.berkeley.edu/workunit.php?wuid=3862758806

I have a couple of invalid tasks with minimum quorum = 1. Perhaps I have a lot of valid tasks as well with min.q.=1, but they are much harder to spot.

https://setiathome.berkeley.edu/workunit.php?wuid=3861384942
https://setiathome.berkeley.edu/workunit.php?wuid=3861339403
https://setiathome.berkeley.edu/workunit.php?wuid=3861247650
https://setiathome.berkeley.edu/workunit.php?wuid=3861247545
and so on...
https://setiathome.berkeley.edu/results.php?userid=5276&offset=0&show_names=0&state=5&appid=

Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530

> But it should be possible to move resends to the top of the queue (or at least it used to be when all tasks were sent out as pairs: anything with a _2 or higher should be a resend).

I don't think this is easy to do for an external tool, except perhaps by modifying the deadlines of the tasks in client_state.xml to trick boinc into processing them in a hurry. If you modified the boinc client itself, then you could change the rules it uses to pick the next task to crunch, to make it prioritize _2s and higher over _0 and _1.

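A minimal sketch of that deadline trick, assuming the usual client_state.xml layout where each <result> block starts with its <name> and carries a <report_deadline> as a Unix timestamp. This is not an official or supported tool: stop the BOINC client first and keep the backup it writes.

```python
# Sketch only: pull resends (_2 and higher) forward by faking their report
# deadlines in client_state.xml. Run it only while the BOINC client is stopped.
import re
import shutil
import time

STATE_FILE = "client_state.xml"          # adjust to your BOINC data directory
FAKE_DEADLINE = time.time() + 6 * 3600   # pretend these are due in six hours

shutil.copy(STATE_FILE, STATE_FILE + ".bak")   # keep a backup

out_lines = []
current_is_resend = False
with open(STATE_FILE) as f:
    for line in f:
        # Remember whether the most recent <name> we saw ends in _2 or higher.
        m = re.search(r"<name>(\S+)</name>", line)
        if m:
            suffix = re.search(r"_(\d+)$", m.group(1))
            current_is_resend = bool(suffix and int(suffix.group(1)) >= 2)
        # Inside a resend's <result> block, shorten the report deadline.
        if current_is_resend and "<report_deadline>" in line:
            line = re.sub(
                r"<report_deadline>[0-9.]+</report_deadline>",
                "<report_deadline>%.0f</report_deadline>" % FAKE_DEADLINE,
                line,
            )
        out_lines.append(line)

with open(STATE_FILE, "w") as f:
    f.writelines(out_lines)
```
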
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799

Or... instead of modifying the client itself, which is not recommended because the devs constantly release new updates for it, you could build an external app like the rescheduler. But instead of rescheduling WUs between GPU and CPU, it would rearrange the FIFO order in which the WUs are crunched, so they get crunched in whatever order you choose - at least until panic mode is triggered by the client (see the sketch below). The question is: why would you need to do that? Keep your WU cache sized so that your host crunches all of its WUs within a day, and you will help clear the DB faster.
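A sketch of what such an external reorder tool might look like, assuming it simply rewrites the <result> blocks of client_state.xml with resends first. Whether the client actually honours the physical order is exactly the question raised in the next post, so treat this as an experiment: run it only while the client is stopped, and keep the backup.

```python
# Sketch only: rewrite client_state.xml with the <result> blocks reordered so
# resends (_2 and higher) come first. Run only while the BOINC client is
# stopped; whether the client honours this physical order is unverified.
import re
import shutil

STATE_FILE = "client_state.xml"          # adjust to your BOINC data directory
shutil.copy(STATE_FILE, STATE_FILE + ".bak")

with open(STATE_FILE) as f:
    text = f.read()

RESULT_BLOCK = re.compile(r"<result>.*?</result>\n?", re.DOTALL)
blocks = RESULT_BLOCK.findall(text)

def is_resend(block):
    # The first <name> inside a <result> block is the result name, e.g. ..._2
    m = re.search(r"<name>\S+_(\d+)</name>", block)
    return bool(m and int(m.group(1)) >= 2)

# Stable sort: resends first, everything else keeps its original order.
reordered = iter(sorted(blocks, key=lambda b: 0 if is_resend(b) else 1))

# Write the reordered blocks back in place of the originals.
text = RESULT_BLOCK.sub(lambda _: next(reordered), text)

with open(STATE_FILE, "w") as f:
    f.write(text)
```
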
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530

> But instead of rescheduling WUs between GPU and CPU you could rearrange the FIFO order the WUs are crunched in, so they will be crunched in the order you choose - any order.

Does the order in which the results are listed in client_state.xml count? There's no field for queue position, so if the physical order doesn't count, then the only way to do this would be faking the deadlines or receive times. Hacking the client would have the advantage that you wouldn't then need to periodically stop and restart the client to edit client_state.xml. Every restart makes you lose, on average, 2.5 minutes of CPU progress and half a task of GPU progress.

TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768

I'm still getting quite a few uploads going immediately into retry. Changing BOINC versions doesn't help, and it's also happening with the stock Mac version of BOINC. Some run for 6 to 7 seconds and finish normally, while others go into retry after just one second. The only help I've found is to recompile BOINC with the minimum wait time set to 30 seconds instead of two minutes; that manages to clear them before they have a chance to pile up. It certainly appears to be something on the other end...

juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799

> Does the order in which the results are listed in client_state.xml count? There's no field for queue position, so if the physical order doesn't count, then the only way to do this would be faking the deadlines or receive times.

They are processed in "first in, first out" order, unless a WU trips the "panic" switch because of its deadline.

> Hacking the client would have the advantage that you wouldn't then need to periodically stop and restart the client to edit the client_state.xml. Every restart makes you lose on average 2.5 minutes of CPU progress and half a task of GPU progress.

In theory yes, but in the real world the client is constantly updated with fixes and new features, so you would need to recompile it constantly. That is why a self-contained external program, like the rescheduler, works better in this case. What you could do is hack the client to call that program automatically from time to time, but be aware that client_state.xml is only read when the client starts, as you probably know.

About the crunching time lost: the CPU part is not a problem, since you can force the client to write a checkpoint. The GPU part you can't, because of the way the optimized apps work, so you lose whatever was already crunched and the WU restarts from zero. Yes, you could change the deadline of a WU to artificially force "panic mode", but that could cost you a lot of WUs if they reach their deadline before being crunched, or if there is any server-side upload problem.

What I can't imagine is why anybody needs to change the crunching order. Just keep the WU cache low and the internal BOINC scheduler takes care of everything.

Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530

> What you could do is hack the client to call that program automatically from time to time.

In this case one could make the client save its state and suspend its operation while leaving the science apps running, run the program, then read the state back from the xml file and resume operation. This way you won't lose any progress, except if some task gets finished while the client is still waiting for the program to do its job. And even in that case the only consequence would be the GPU or CPU idling for a second or two.

juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799

> In this case one could make the client save its state and suspend its operation while leaving the science apps running, run the program, then read the state back from the xml file and resume operation. This way you won't lose any progress, except if some task gets finished while the client is still waiting for the program to do its job.

You could try, but AFAIK with the Linux Special Sauce you can't pause the crunching process or weird things can happen (Petri posted that in the doc file). That is why you need to change the checkpoint timer when you run with those apps. What is safer, IMHO, is to exit the client completely (saving the CPU work done; forget about the GPU work), run the child program, and restart the client - a rough sketch follows below.

Mod: could you move this discussion to a more appropriate thread? Thanks.
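A minimal sketch of that exit-edit-restart cycle, assuming a Linux box where BOINC runs as the boinc-client systemd service and the external tool is a hypothetical reschedule.py in the BOINC data directory; adjust the commands if your client is started manually (e.g. use boinccmd --quit and then launch boinc directly).

```python
# Sketch only: cleanly stop the BOINC client, run an external tool that edits
# client_state.xml, then start the client again. The service name, paths and
# tool name are assumptions for this example.
import subprocess
import time

BOINC_DIR = "/var/lib/boinc-client"          # assumed BOINC data directory
RESCHEDULER = ["python3", "reschedule.py"]   # hypothetical external tool

# Stop the client; this lets it write a final client_state.xml and shut the
# science apps down (the GPU tasks will restart from zero, as noted above).
subprocess.run(["sudo", "systemctl", "stop", "boinc-client"], check=True)
time.sleep(10)  # give everything a moment to exit

# Run the external tool that rewrites client_state.xml.
subprocess.run(RESCHEDULER, cwd=BOINC_DIR, check=True)

# Start the client again; it re-reads client_state.xml on startup.
subprocess.run(["sudo", "systemctl", "start", "boinc-client"], check=True)
```
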
W-K 666 · Joined: 18 May 99 · Posts: 19841 · Credit: 40,757,560 · RAC: 67

Certainly something is going on. I went out at midday, UK time, and the replica was over 14 hours behind at that time. I also had 1400+ valid tasks listed in my account; that has dropped to 800. If correct, then some serious assimilation, deletion and purging is going on, at last.

Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874

Another straw in the wind: I only had 38,638 new credits exported to BOINCstats yesterday, but I got 116,804 more added in the first half of today, and local records show another 178,197 so far today. That's well above my normal RAC.

Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873

This is interesting. I looked at the validated tasks on a sample host of mine. This one is 7 hours old, but the interesting part is that it was an early overflow that was validated and awarded 0.25 credit, yet the workunit itself and its result file can't be pulled up ("can't find workunit"). Normally validated results are viewable for 24 hours. Are they purging overflows as soon as they get validated now?

Task 8505927055 · workunit 3861538171 · sent 3 Feb 2020, 8:47:47 UTC · reported 3 Feb 2020, 14:47:34 UTC · Completed and validated · run time 4.11 s · CPU time 2.65 s · credit 0.25 · SETI@home v8, anonymous platform (CPU)

Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
