The Server Issues / Outages Thread - Panic Mode On! (118)
Author | Message |
---|---|
Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
I run 2 stock slowish machines. Getting WUs is hit or miss. Looks like I got new WUs on the faster of the two about 3 hours ago. The slower machine I had running on NNT, but I have now set it to asking because I'm getting low. Can I trust any of the numbers in the status update? Or is it all old echoes of how things used to be (like astronomy itself)? How about results returned per hour? Is 124K a number that is up to date? Seems OK. Edit: Just caught up on the old archived panic thread, and don't want Eric's posts to get lost in the change: https://setiathome.berkeley.edu/forum_thread.php?id=84416&postid=2024305#2024305 |
Phil Burden Send message Joined: 26 Oct 00 Posts: 264 Credit: 22,303,899 RAC: 0 |
Unixchick wrote: "I run 2 stock slowish machines. Getting WUs is hit or miss. looks like I got new WUs on the faster of the two about 3 hours ago. The slower machine I had running on NNT, but have now set to asking because I'm getting low." My understanding is that the status pages are driven from the replica database, and since that's currently 18 hours BEHIND the master, that's how old the displayed data is ;-) But, like all things, I could be sooooooooooo wrong ;-) P. |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3797 Credit: 1,114,826,392 RAC: 3,319 |
|
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
I put one of my systems on stock tasks. I just did this: (1) close/exit BOINC (it was already out of work, all reported); (2) rename app_info to app_info_bkp; (3) start BOINC. That was it. It downloaded tasks right away (nvidia_opencl_sah and nvidia_opencl_SoG). Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
arkayn Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0 |
Copying Eric's message to this thread as well. Debugging the server is virtually impossible. If anyone wants to help.... The setiathome_server branch is at |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
So here is my guess after traversing the spaghetti. In the sched_util.h file you have this comment and code: https://github.com/BOINC/boinc/blob/0ee5c54381b262627f14c147f5528ed93f9d7672/sched/sched_util.h#L39 It speaks of generating a "pseudo ID" for anonymous platform and defines DB_ID_TYPE. Then over in sched_shmem.h you get: https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.h#L129 which references that DB_ID_TYPE variable. Which eventually leads us back to the sched_shmem.cpp module which Eric referenced as where the code blows up on anonymous platform and returns true. https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.cpp#L283 https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.cpp#L290 https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.cpp#L304 https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.cpp#L333 all of which sections use that DB_ID_TYPE variable which eventually leads us back to the SCHED_SHMEM::no_work section. Is the problem that the existing 715 server code doesn't properly define or handle the "pseudo ID" that is generated for anonymous platform? Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
JohnDK Send message Joined: 28 May 00 Posts: 1222 Credit: 451,243,443 RAC: 1,127 |
Before today I edited the client_state.xml file to rename all cuda60 WUs to SoG, worked fine, but now I'm only getting cuda60 work. Guess the server now thinks cuda60 is a good choice :( |
ML1 Send message Joined: 25 Nov 01 Posts: 20949 Credit: 7,508,002 RAC: 20 |
At a first glance, my suspicions would be to check the pid: Are 'pid's getting reused or rolling over? Or otherwise malformed? Are 'pid's somehow 'special' for anonymous? This is further suspicious in that: Is this a problem from the recent sudden big rise in live tasks and work units?... Is the integer for the pid overflowing?!?? Or has the database table for anonymous overflowed? OK, just some wild guesses before I follow up on Keith's comments :-) Keep searchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
ML1 Send message Joined: 25 Nov 01 Posts: 20949 Credit: 7,508,002 RAC: 20 |
Have the DB_ID_TYPE "id"s been changed across the versions/databases? Keep searchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
ML1 Send message Joined: 25 Nov 01 Posts: 20949 Credit: 7,508,002 RAC: 20 |
From a very quick glance, note on: https://github.com/BOINC/boinc/blob/0ee5c54381b262627f14c147f5528ed93f9d7672/sched/sched_util.h#L39 there is "return appid*1000000 - avid". The one million is not that big a number if that returned (compound/combined?) result is to be unique wrt s@h users/tasks/wu...? Really, should not a wide hashing function or a structure be used to safely return such a result...? Keep searchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
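To make the arithmetic in the post above concrete, here is a minimal Python model of the expression as quoted, `appid*1000000 - avid`. The function name `pseudo_id` and the sample values are hypothetical, and whether the BOINC C++ code uses exactly this expression is taken from the post rather than verified. The sketch only shows that the result stays unique while `avid` is below one million and stops being unique once it reaches that size.

```python
# Model of the expression quoted in the post: appid*1000000 - avid.
# pseudo_id and the sample values below are hypothetical, for illustration only.
def pseudo_id(appid: int, avid: int) -> int:
    return appid * 1_000_000 - avid

# While avid stays below one million, distinct (appid, avid) pairs
# map to distinct results.
small = {pseudo_id(a, v) for a in range(1, 4) for v in range(0, 1000)}

# Once avid reaches one million, two different pairs give the same value.
collision = pseudo_id(2, 1_000_000) == pseudo_id(1, 0)
```

Whether values that large can occur on the live server is a separate question that this sketch does not answer.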
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thanks for the comments, Martin. I too wondered if the size of the database now is at the root of the problem. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"Very simple to switch from Anonymous platform to Stock even with the All-In-One. All you have to do is change the names of the two files app_info.xml & app_config.xml to something like app_info1.xml & app_config1.xml; that will revert you to Stock. To change back to Anonymous platform, rename the files to the original names app_info.xml & app_config.xml." It's not that simple in my experience. Or rather, it is that simple to get back to stock, but if you want to be able to restore your anonymous setup later, it is better to move or copy the anonymous apps out of the project folder. BOINC has a habit of deleting any file in the project folder it doesn't know what to do with. And sometimes even when it does! |
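The two pieces of advice above (rename the config files to go stock, but first copy the anonymous apps somewhere outside the project folder) can be combined in a small script. This is a sketch under assumptions: the function name `switch_to_stock`, the `_bkp` suffix, and the directory layout are hypothetical, and the script does not stop or start BOINC, which the earlier post says should be done before and after the rename. The demo runs against a temporary directory rather than a real project folder.

```python
import shutil
import tempfile
from pathlib import Path

def switch_to_stock(project_dir: Path, backup_dir: Path) -> list:
    """Copy all files to backup_dir, then rename the anonymous-platform configs."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Keep a copy outside the project folder so nothing is lost
    # if the client later deletes files it does not recognise.
    for item in project_dir.iterdir():
        if item.is_file():
            shutil.copy2(item, backup_dir / item.name)
    renamed = []
    for name in ("app_info.xml", "app_config.xml"):
        src = project_dir / name
        if src.exists():
            src.rename(project_dir / (name + "_bkp"))  # hypothetical suffix
            renamed.append(name)
    return renamed

# Demo on a throwaway directory standing in for the project folder.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    project.mkdir()
    (project / "app_info.xml").write_text("<app_info/>")
    (project / "my_app.bin").write_text("binary")
    backup = Path(tmp) / "backup"
    renamed = switch_to_stock(project, backup)
    remaining = sorted(p.name for p in project.iterdir())
    saved = sorted(p.name for p in backup.iterdir())
```

Restoring the anonymous setup would then mean copying the saved files back while the client is stopped.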
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13834 Credit: 208,696,464 RAC: 304 |
For those now running stock: how long is it taking for the Scheduler to respond? Are the occasional errors still occurring & "Project has no tasks available" responses even though the return rate is now very low? When I reverted one of my systems to stock for a while, it was still getting the occasional Scheduler error & "Project has no tasks available" messages, and Scheduler responses were taking 20-30 sec. Usual response time is 2-3 sec. Which all indicates that while there is a bug that results in Anonymous platform not getting any work, there is still some other issue resulting in the whole Scheduler response taking an excessively long time to occur. Edit: People have mentioned resends are occurring. Didn't we have that disabled due to it bringing the database to its knees with excessively long response times when the database was only a fraction of its present size? How about we get that disabled again & see if that allows work to flow to Anonymous hosts, and that will allow people to fix the buggy code that stops them from getting work under these circumstances at their convenience? Grant Darwin NT |
Lazydude Send message Joined: 17 Jan 01 Posts: 45 Credit: 96,158,001 RAC: 136 |
I got a whole lot of new tasks and some resends with anon platform. Is something more broken, or is it just not yet announced that it is well on the way to being fixed? "Normal" response time on the "ALL TASKS for" page. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14672 Credit: 200,643,578 RAC: 874 |
Keep looking, guys :-) Another team might start from the history. There's only one change to sched_shmem.cpp in the timescale that we're looking at: "back end: add feature for assigning WUs to a particular version num". That adds an app_version_num field to the workunit database table, which is one candidate for the update that made Eric say the old code couldn't be used any more. An anonymous platform request, using one of those negative "pseudo IDs" that Keith found, might barf when compared against a real version number in a task usability test? I'll need to look at whether any of the files affected have separate handling sections for stock and anonymous platform; some I've seen in the past do. Then check whether, in any case, one handler has been updated but the other not. --- Meanwhile, Eric has picked up on the report I made to the server release manager, and replied with an indication of the areas he's looking at. I won't confuse matters by posting them here, but I'll keep an eye open for anything that might be coming in on that front. So far, they're just at the "The possibilities that come to my mind are ..." stage. |
Eric B Send message Joined: 9 Mar 00 Posts: 88 Credit: 168,875,085 RAC: 762 |
"Debugging the server is virtually impossible. If anyone wants to help.... The setiathome_server branch is at" I guess my first question would be: is it returning true because of "!ready", or is it falling through and returning the bottom true? If it's falling through, then either max_wu_results is less than zero or wu_results[i].state is never equal to WR_STATE_PRESENT. Based on that analysis one can then decide what to look at next. |
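The two ways of "returning true" described above can be told apart with a small diagnostic. The following is a Python model, not the BOINC C++ code: the class names, the constant values, and the function `why_no_work` are hypothetical, and the control flow is reconstructed only from the description in the post (an early return on `!ready`, then a scan of `wu_results` for `WR_STATE_PRESENT`, then a final return).

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical values; the real constants live in the BOINC C++ headers.
WR_STATE_EMPTY = 0
WR_STATE_PRESENT = 1

@dataclass
class Slot:
    state: int = WR_STATE_EMPTY

@dataclass
class SharedMem:
    ready: bool = False
    max_wu_results: int = 0
    wu_results: List[Slot] = field(default_factory=list)

def why_no_work(shm: SharedMem) -> Optional[str]:
    """Return the reason no work would be reported, or None if a slot is usable."""
    if not shm.ready:
        return "not ready"
    # A zero or negative max_wu_results makes this loop body never run.
    for i in range(shm.max_wu_results):
        if shm.wu_results[i].state == WR_STATE_PRESENT:
            return None
    return "fell through"

not_ready = why_no_work(SharedMem(ready=False))
empty = why_no_work(SharedMem(ready=True, max_wu_results=2,
                              wu_results=[Slot(), Slot()]))
zero = why_no_work(SharedMem(ready=True, max_wu_results=0))
ok = why_no_work(SharedMem(ready=True, max_wu_results=1,
                           wu_results=[Slot(WR_STATE_PRESENT)]))
```

Logging which branch was taken, in the same spirit, would be one way to answer the first question in the post.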
wujj123456 Send message Joined: 5 Sep 04 Posts: 40 Credit: 20,877,975 RAC: 219 |
"Debugging the server is virtually impossible. If anyone wants to help.... The setiathome_server branch is at" Pretty sure ready is true. It's set to true at the beginning of the feeder loop: https://github.com/BOINC/boinc/blob/setiathome_server/sched/feeder.cpp#L572 The only time it's set to false is in the atexit() handler, which runs when the program terminates: https://github.com/BOINC/boinc/blob/setiathome_server/sched/feeder.cpp#L170 https://github.com/BOINC/boinc/blob/setiathome_server/sched/feeder.cpp#L859 |
Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
I'm loving all the comments and snippets of code. Fantastic to see the community using their talents to help the project. Just wanted to mention some good news. For some reason the replica is catching up. Still a long way to catch up, but just happy the number of seconds is going down and not up! |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
Grant (SSSF) wrote: "For those now running stock- how long is it taking for the Scheduler to respond? Are the occasional errors still occurring & 'Project has no tasks available' responses even though the return rate is now very low?" The only thing that changed when I switched back to stock was that my client can now occasionally get some work. The great majority of the work requests still result in HTTP errors, timeouts or 'Project has no tasks available'. I got my queue full at some point today, but right now I have had such a long streak of errors or 'zero tasks' that I'm about 100 tasks short of the full queue. |
Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
Grant (SSSF) wrote: "For those now running stock- how long is it taking for the Scheduler to respond? Are the occasional errors still occurring & 'Project has no tasks available' responses even though the return rate is now very low?" The response time to a request is very slow. It used to be so fast that I couldn't read quickly enough to keep up with the log; now it pauses for so long that I wonder if it is still doing something. 20-30 seconds sounds about right. I'm also only successful in getting new WUs about every 2ish hours. I will get a healthy amount, then nothing for another 2ish hours. The faster machine I have set to keep asking; the slower machine I ask once or twice a day, since I'm getting a large batch (40-50 WUs, when my machine only does 50/day). |