The Server Issues / Outages Thread - Panic Mode On! (118)

Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22	Message 2024390 - Posted: 22 Dec 2019, 15:43:55 UTC Last modified: 22 Dec 2019, 15:52:49 UTC I run 2 stock slowish machines. Getting WUs is hit or miss. Looks like I got new WUs on the faster of the two about 3 hours ago. The slower machine I had running on NNT, but I have now set it to ask for work because I'm getting low. Can I trust any of the numbers in the status update? Or is it all old echoes of how things used to be (like astronomy itself)? How about results returned per hour? Is 124K an up-to-date number? Seems OK. Edit: Just caught up on the old archived panic thread, and I don't want Eric's posts to get lost in the change: https://setiathome.berkeley.edu/forum_thread.php?id=84416&postid=2024305#2024305 ID: 2024390 ·

Phil Burden Send message Joined: 26 Oct 00 Posts: 264 Credit: 22,303,899 RAC: 0	Message 2024392 - Posted: 22 Dec 2019, 15:54:16 UTC - in response to Message 2024390.
Quote: I run 2 stock slowish machines. Getting WUs is hit or miss. Looks like I got new WUs on the faster of the two about 3 hours ago. The slower machine I had running on NNT, but I have now set it to ask for work because I'm getting low. Can I trust any of the numbers in the status update? Or is it all old echoes of how things used to be (like astronomy itself)? How about results returned per hour? Is 124K an up-to-date number? Seems OK.
My understanding is that the status pages are driven from the replica database, and since that's currently 18 hours BEHIND the master, that's how old the data being displayed is ;-) But, like all things, I could be sooooooooooo wrong ;-) P. ID: 2024392 ·

Mr. Kevvy Volunteer moderator Volunteer tester Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319	Message 2024393 - Posted: 22 Dec 2019, 15:58:50 UTC - in response to Message 2024392.
Quote: My understanding is that the status pages are driven from the replica database
Rather defeats the entire definition of a status page to have it set up this way, but I would not be surprised if that was the case. ID: 2024393 ·

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2024406 - Posted: 22 Dec 2019, 16:40:54 UTC I put one of my systems on stock tasks. I just did this:
1. Close/exit BOINC (it was already out of work, all reported).
2. Rename app_info to app_info_bkp.
3. Start BOINC.
That was it. It downloaded tasks right away (nvidia_opencl_sah and nvidia_opencl_SoG). Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2024406 ·

arkayn Volunteer tester Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0	Message 2024408 - Posted: 22 Dec 2019, 16:59:41 UTC Copying Eric's message to this thread as well.
Debugging the server is virtually impossible. If anyone wants to help.... The setiathome_server branch is at https://github.com/BOINC/boinc/tree/setiathome_server/sched Something goes wrong in the function SCHED_SHMEM::no_work.
    bool SCHED_SHMEM::no_work(int pid) {
        if (!ready) return true;
        for (int i=0; i<max_wu_results; i++) {
            if (wu_results[i].state == WR_STATE_PRESENT) {
                wu_results[i].state = pid;
                return false;
            }
        }
        return true;
    }
This function works properly unless the requesting computer has anonymous platform apps, for which it always returns true. How could that be? I don't know despite additional 500 lines of debugging code. It's almost as if something else is pausing anonymous platform requests until the queue is empty. Well it's bed time now. :( ID: 2024408 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 2024410 - Posted: 22 Dec 2019, 17:15:56 UTC So here is my guess after traversing the spaghetti. In the sched_util.h file you have this comment and code: https://github.com/BOINC/boinc/blob/0ee5c54381b262627f14c147f5528ed93f9d7672/sched/sched_util.h#L39 It speaks of generating a "pseudo ID" for anonymous platform and defines DB_ID_TYPE. Then over in sched_shmem.h you get: https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.h#L129 which references that DB_ID_TYPE variable. Which eventually leads us back to the sched_shmem.cpp module which Eric referenced as where the code blows up on anonymous platform and returns true. https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.cpp#L283 https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.cpp#L290 https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.cpp#L304 https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.cpp#L333 all of which sections use that DB_ID_TYPE variable which eventually leads us back to the SCHED_SHMEM::no_work section. Is the problem that the existing 715 server code doesn't properly define or handle the "pseudo ID" that is generated for anonymous platform? Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 2024410 ·

JohnDK Volunteer tester Send message Joined: 28 May 00 Posts: 1222 Credit: 451,243,443 RAC: 1,127	Message 2024412 - Posted: 22 Dec 2019, 17:32:25 UTC Before today I edited the client_state.xml file to rename all cuda60 WUs to SoG, worked fine, but now I'm only getting cuda60 work. Guess the server now thinks cuda60 is a good choice :( ID: 2024412 ·

ML1 Volunteer moderator Volunteer tester Send message Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20	Message 2024413 - Posted: 22 Dec 2019, 17:33:40 UTC - in response to Message 2024408. Last modified: 22 Dec 2019, 18:00:09 UTC At a first glance, my suspicion would be to check the pid: Are 'pid's getting reused or rolling over? Or otherwise malformed? Are 'pid's somehow 'special' for anonymous? This is further suspicious in that: Is this a problem from the recent sudden big rise in live tasks and work units?... Is the integer for the pid overflowing?!?? Or has the database table for anonymous overflowed? OK, just some wild guesses before I follow up on Keith's comments :-) Keep searchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) ID: 2024413 ·

ML1 Volunteer moderator Volunteer tester Send message Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20	Message 2024419 - Posted: 22 Dec 2019, 17:59:22 UTC - in response to Message 2024410. Have the DB_ID_TYPE "id"s been changed across the versions/databases? Keep searchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) ID: 2024419 ·

ML1 Volunteer moderator Volunteer tester Send message Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20	Message 2024420 - Posted: 22 Dec 2019, 18:08:38 UTC - in response to Message 2024410. Last modified: 22 Dec 2019, 18:08:54 UTC From a very quick glance, note on: https://github.com/BOINC/boinc/blob/0ee5c54381b262627f14c147f5528ed93f9d7672/sched/sched_util.h#L39 there is "return appid*1000000 - avid". The one million is not that big a number if that returned (compound/combined?) result is to be unique wrt s@h users/tasks/wu...? Really, should not a wide hashing function or a structure be used to safely return such a result...? Keep searchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) ID: 2024420 ·
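Editor's sketch of the concern raised above, using only the formula quoted in the post ("appid*1000000 - avid"). The function names pseudo_id_wide and fits_in_int32 are hypothetical and do not come from the BOINC source, and the width of DB_ID_TYPE on the SETI@home server build is not established in this thread. The sketch computes the formula in 64 bits, then checks two things: whether the result would still fit in a signed 32-bit integer, and whether two different (appid, avid) pairs can map to the same value.

```cpp
#include <cstdint>
#include <limits>

// Formula as quoted in the post: appid*1000000 - avid.
// Computed in 64 bits so the result can be inspected before any narrowing.
// (Hypothetical helper name; not taken from sched_util.h.)
inline std::int64_t pseudo_id_wide(std::int64_t appid, std::int64_t avid) {
    return appid * 1000000 - avid;
}

// True if the value would still fit in a signed 32-bit integer.
inline bool fits_in_int32(std::int64_t v) {
    return v >= std::numeric_limits<std::int32_t>::min()
        && v <= std::numeric_limits<std::int32_t>::max();
}

// True if two distinct (appid, avid) pairs produce the same pseudo ID.
inline bool collides(std::int64_t appid_a, std::int64_t avid_a,
                     std::int64_t appid_b, std::int64_t avid_b) {
    const bool same_pair = (appid_a == appid_b) && (avid_a == avid_b);
    return !same_pair
        && pseudo_id_wide(appid_a, avid_a) == pseudo_id_wide(appid_b, avid_b);
}
```

Under these assumptions the formula stays within 32 bits only while appid is at most 2147, and it stays unique only while the magnitude of avid is below one million; for example, (appid 1, avid -1000000) and (appid 2, avid 0) both give 2000000. Whether either limit is reachable with the real app and app-version IDs is a separate question that the thread does not answer.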

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 2024422 - Posted: 22 Dec 2019, 18:24:23 UTC Thanks for the comments, Martin. I too wondered if the size of the database now is at the root of the problem. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 2024422 ·

Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2024424 - Posted: 22 Dec 2019, 18:25:38 UTC - in response to Message 2024368.
Quote: Very simple to switch from Anonymous platform to Stock even with the All-In-One. All you have to do is change the Names on the two files app_info.xml & app_config.xml to something as app_info1.xml & app_config1.xml, that will revert you to Stock. To change back to Anonymous platform rename the files to the original names app_info.xml & app_config.xml . That's All that needs to be done, Nothing Else...NADA.
It's not that simple in my experience. Or rather, it is that simple to get back to stock, but if you want to be able to restore your anonymous setup later, it is better to move or copy the anonymous apps out of the project folder. BOINC has a habit of deleting any file in the project folder it doesn't know what to do with. And sometimes even when it does! ID: 2024424 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 2024425 - Posted: 22 Dec 2019, 18:27:51 UTC Last modified: 22 Dec 2019, 18:39:45 UTC For those now running stock- how long is it taking for the Scheduler to respond? Are the occasional errors still occurring & "Project has no tasks available" responses even though the return rate is now very low? When I reverted one of my systems to stock for a while, it was still getting the occasional Scheduler error & "Project has no tasks available" messages, and Scheduler responses were taking 20-30 sec. Usual response time is 2-3 sec. Which all indicates that while there is a bug that results in Anonymous platform not getting any work, there is still some other issue resulting in the whole Scheduler response taking an excessively long time to occur. Edit- People have mentioned resends are occurring- didn't we have that disabled because it brought the database to its knees, with excessively long response times, when the database was only a fraction of its present size? How about we get that disabled again & see if that allows work to flow to Anonymous hosts? That would allow people to fix the buggy code that stops them from getting work under these circumstances at their convenience. Grant Darwin NT ID: 2024425 ·

Lazydude Volunteer tester Send message Joined: 17 Jan 01 Posts: 45 Credit: 96,158,001 RAC: 136	Message 2024426 - Posted: 22 Dec 2019, 18:28:59 UTC I got a whole lot of new tasks and some resends with anon platform. Is something more broken, or has it just not been announced yet that it's well on the way to being fixed? "Normal" response time on the "ALL TASKS for" page. ID: 2024426 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024428 - Posted: 22 Dec 2019, 18:41:06 UTC Keep looking, guys :-) Another team might start from the history. There's only one change to sched_shmem.cpp in the timescale that we're looking at:
"back end: add feature for assigning WUs to a particular version num"
That adds an app_version_num field to the workunit database table - one candidate for the update that made Eric say the old code couldn't be used any more. An anonymous platform request, using one of those negative "pseudo IDs" that Keith found, might barf when compared against a real version number in a task usability test? I'll need to look whether any of the files affected have separate handling sections for stock and anonymous platform - some I've seen in the past do. Then check if, in any case, one handler has been updated but the other not.
---
Meanwhile, Eric has picked up on the report I made to the server release manager, and replied with an indication of the areas he's looking at. I won't confuse matters by posting them here, but I'll keep an eye open for anything that might be coming in on that front. At this stage, they're just at the "The possibilities that come to my mind are ..." stage. ID: 2024428 ·

Eric B Send message Joined: 9 Mar 00 Posts: 88 Credit: 168,875,085 RAC: 762	Message 2024430 - Posted: 22 Dec 2019, 19:02:26 UTC - in response to Message 2024408. Last modified: 22 Dec 2019, 19:07:21 UTC
Quote: Debugging the server is virtually impossible. If anyone wants to help.... The setiathome_server branch is at https://github.com/BOINC/boinc/tree/setiathome_server/sched Something goes wrong in the function SCHED_SHMEM::no_work.
    bool SCHED_SHMEM::no_work(int pid) {
        if (!ready) return true;
        for (int i=0; i<max_wu_results; i++) {
            if (wu_results[i].state == WR_STATE_PRESENT) {
                wu_results[i].state = pid;
                return false;
            }
        }
        return true;
    }
This function works properly unless the requesting computer has anonymous platform apps, for which it always returns true. How could that be? I don't know despite additional 500 lines of debugging code. It's almost as if something else is pausing anonymous platform requests until the queue is empty. Well it's bed time now. :(
I guess my first question would be: Is it returning true because of "!ready"? Or is it falling through and returning the bottom true? If it's falling through, then either max_wu_results is less than or equal to zero, or wu_results[i].state is never equal to WR_STATE_PRESENT. Based on that analysis one can then decide what to look at next. ID: 2024430 ·
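Editor's sketch of the question asked above: which of the two "return true" paths is being taken? The struct below is a hypothetical stand-in for the scheduler's shared-memory segment, not the real SCHED_SHMEM class. It models only the three members used by the quoted function, and the numeric values of WR_STATE_EMPTY and WR_STATE_PRESENT are assumptions. The loop logic is the same as in the quoted no_work(), but the result says why no slot was handed out.

```cpp
#include <vector>

// Assumed values for illustration; the real constants live in the BOINC source.
const int WR_STATE_EMPTY = 0;
const int WR_STATE_PRESENT = 1;

// Hypothetical stand-in for one slot of the shared-memory work array.
struct WuResult {
    int state;
};

// Why no_work() returned what it did.
enum class NoWorkReason { FoundSlot, NotReady, NoPresentSlot };

// Hypothetical stand-in for SCHED_SHMEM with only the members the quoted
// function touches.
struct SchedShmemSketch {
    bool ready = false;
    int max_wu_results = 0;
    std::vector<WuResult> wu_results;

    // Same control flow as the quoted no_work(), with each exit path labeled.
    NoWorkReason no_work_reason(int pid) {
        if (!ready) return NoWorkReason::NotReady;
        for (int i = 0; i < max_wu_results; i++) {
            if (wu_results[i].state == WR_STATE_PRESENT) {
                wu_results[i].state = pid;  // claim the slot for this process
                return NoWorkReason::FoundSlot;
            }
        }
        // Reached when max_wu_results <= 0 or no slot is in the PRESENT state.
        return NoWorkReason::NoPresentSlot;
    }

    // Original boolean contract: true means "no work available".
    bool no_work(int pid) {
        return no_work_reason(pid) != NoWorkReason::FoundSlot;
    }
};
```

Logging the returned reason alongside whether the request is anonymous platform would show directly whether those requests hit the "not ready" exit or fall through the loop. It would not explain why the slots look empty to them, which is the open question in the thread.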

wujj123456 Send message Joined: 5 Sep 04 Posts: 40 Credit: 20,877,975 RAC: 219	Message 2024431 - Posted: 22 Dec 2019, 19:12:52 UTC - in response to Message 2024430.
Quote (nested): Debugging the server is virtually impossible. If anyone wants to help.... The setiathome_server branch is at https://github.com/BOINC/boinc/tree/setiathome_server/sched Something goes wrong in the function SCHED_SHMEM::no_work.
    bool SCHED_SHMEM::no_work(int pid) {
        if (!ready) return true;
        for (int i=0; i<max_wu_results; i++) {
            if (wu_results[i].state == WR_STATE_PRESENT) {
                wu_results[i].state = pid;
                return false;
            }
        }
        return true;
    }
This function works properly unless the requesting computer has anonymous platform apps, for which it always returns true. How could that be? I don't know despite additional 500 lines of debugging code. It's almost as if something else is pausing anonymous platform requests until the queue is empty. Well it's bed time now. :(
Quote: I guess my first question would be: Is it returning true because of "!ready"? Or is it falling through and returning the bottom true? If it's falling through, then either max_wu_results is less than or equal to zero, or wu_results[i].state is never equal to WR_STATE_PRESENT. Based on that analysis one can then decide what to look at next.
Pretty sure ready is true. It's set to true at the beginning of the feeder loop. https://github.com/BOINC/boinc/blob/setiathome_server/sched/feeder.cpp#L572 The only time it's set to false is in atexit(), which is when the program terminates. https://github.com/BOINC/boinc/blob/setiathome_server/sched/feeder.cpp#L170 https://github.com/BOINC/boinc/blob/setiathome_server/sched/feeder.cpp#L859 ID: 2024431 ·

Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22	Message 2024432 - Posted: 22 Dec 2019, 19:13:34 UTC I'm loving all the comments and snippets of code. Fantastic to see the community using their talents to help the project. Just wanted to mention some good news. For some reason the replica is catching up. Still a long way to catch up, but just happy the number of seconds is going down and not up! ID: 2024432 ·

Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2024433 - Posted: 22 Dec 2019, 19:17:08 UTC - in response to Message 2024425.
Quote: For those now running stock- how long is it taking for the Scheduler to respond? Are the occasional errors still occurring & "Project has no tasks available" responses even though the return rate is now very low?
The only thing that changed when I switched back to stock was that my client can now occasionally get some work. The great majority of the work requests still result in HTTP errors, timeouts or 'Project has no tasks available'. I got my queue full at some point today, but right now I have had such a long streak of errors or 'zero tasks' that I'm about 100 tasks short of the full queue. ID: 2024433 ·

Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22	Message 2024435 - Posted: 22 Dec 2019, 19:21:28 UTC - in response to Message 2024425.
Quote: For those now running stock- how long is it taking for the Scheduler to respond? Are the occasional errors still occurring & "Project has no tasks available" responses even though the return rate is now very low? When I reverted one of my systems to stock for a while, it was still getting the occasional Scheduler error & "Project has no tasks available" messages, and Scheduler responses were taking 20-30sec. Usual response time is 2-3 sec.
The response time to a request is very slow. It used to be so fast that I couldn't read quickly enough to keep up with the log; now it pauses for so long that I wonder if it is still doing something. 20-30 seconds sounds about right. I'm also only successful in getting new WUs about every 2ish hours. I will get a healthy amount, then nothing for another 2ish hours. I have the faster machine set to keep asking. On the slower machine I ask once or twice a day, since I'm getting a large batch (40-50 WUs, when my machine only does 50/day). ID: 2024435 ·
