The Server Issues / Outages Thread - Panic Mode On! (118)

Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2024390 - Posted: 22 Dec 2019, 15:43:55 UTC
Last modified: 22 Dec 2019, 15:52:49 UTC

I run 2 stock slowish machines. Getting WUs is hit or miss. Looks like I got new WUs on the faster of the two about 3 hours ago. I had the slower machine set to NNT (No New Tasks), but have now set it to ask for work because I'm getting low.

Can I trust any of the numbers in the status update? Or is it all old echoes of how things used to be (like astronomy itself)? How about results returned per hour? Is 124K an up-to-date number? It seems OK.

edit:
Just caught up on the old archived panic thread, and I don't want Eric's posts to get lost in the change:
https://setiathome.berkeley.edu/forum_thread.php?id=84416&postid=2024305#2024305
ID: 2024390 · Report as offensive
Phil Burden

Joined: 26 Oct 00
Posts: 264
Credit: 22,303,899
RAC: 0
United Kingdom
Message 2024392 - Posted: 22 Dec 2019, 15:54:16 UTC - in response to Message 2024390.  

I run 2 stock slowish machines. Getting WUs is hit or miss. Looks like I got new WUs on the faster of the two about 3 hours ago. I had the slower machine set to NNT (No New Tasks), but have now set it to ask for work because I'm getting low.

Can I trust any of the numbers in the status update? Or is it all old echoes of how things used to be (like astronomy itself)? How about results returned per hour? Is 124K an up-to-date number? It seems OK.


My understanding is that the status pages are driven from the replica database, and since that's currently 18 hours BEHIND the master, that's how old the data being displayed is ;-)

But, like all things, I could be sooooooooooo wrong ;-)

P.
ID: 2024392 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2024393 - Posted: 22 Dec 2019, 15:58:50 UTC - in response to Message 2024392.  

My understanding is that the status pages are driven from the replica database


Rather defeats the entire purpose of a status page to have it set up this way, but I would not be surprised if that was the case.
ID: 2024393 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024406 - Posted: 22 Dec 2019, 16:40:54 UTC

I put one of my systems on stock tasks.

I just did this:
1. Close/exit BOINC (it was already out of work, all reported)
2. Rename app_info to app_info_bkp
3. Start BOINC

That was it. It downloaded tasks right away (nvidia_opencl_sah and nvidia_opencl_SoG).
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024406 · Report as offensive
Profile arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 2024408 - Posted: 22 Dec 2019, 16:59:41 UTC

Copying Eric's message to this thread as well.

Debugging the server is virtually impossible. If anyone wants to help.... The setiathome_server branch is at

https://github.com/BOINC/boinc/tree/setiathome_server/sched

Something goes wrong in the function SCHED_SHMEM::no_work.
bool SCHED_SHMEM::no_work(int pid) {
    if (!ready) return true;
    for (int i=0; i<max_wu_results; i++) {
        if (wu_results[i].state == WR_STATE_PRESENT) {
            wu_results[i].state = pid;
            return false;
        }
    }
    return true;
}




This function works properly unless the requesting computer has anonymous platform apps, for which it always returns true. How could that be? I don't know, despite an additional 500 lines of debugging code. It's almost as if something else is pausing anonymous platform requests until the queue is empty. Well, it's bedtime now. :(


ID: 2024408 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2024410 - Posted: 22 Dec 2019, 17:15:56 UTC

So here is my guess after traversing the spaghetti.

In the sched_util.h file you have this comment and code:
https://github.com/BOINC/boinc/blob/0ee5c54381b262627f14c147f5528ed93f9d7672/sched/sched_util.h#L39
It speaks of generating a "pseudo ID" for anonymous platform and defines DB_ID_TYPE.

Then over in sched_shmem.h you get:
https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.h#L129
which references that DB_ID_TYPE type.

Which eventually leads us back to the sched_shmem.cpp module which Eric referenced as where the code blows up on anonymous platform and returns true.
https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.cpp#L283
https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.cpp#L290
https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.cpp#L304
https://github.com/BOINC/boinc/blob/94de79c362537587ce4297c42d973d4be07f4768/sched/sched_shmem.cpp#L333

All of these sections use that DB_ID_TYPE type,
which eventually leads us back to the SCHED_SHMEM::no_work section.

Is the problem that the existing 715 server code doesn't properly define or handle the "pseudo ID" that is generated for anonymous platform?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2024410 · Report as offensive
JohnDK Crowdfunding Project Donor*Special Project $250 donor
Volunteer tester
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 2024412 - Posted: 22 Dec 2019, 17:32:25 UTC

Before today I edited the client_state.xml file to rename all cuda60 WUs to SoG, which worked fine, but now I'm only getting cuda60 work. Guess the server now thinks cuda60 is a good choice :(
ID: 2024412 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20147
Credit: 7,508,002
RAC: 20
United Kingdom
Message 2024413 - Posted: 22 Dec 2019, 17:33:40 UTC - in response to Message 2024408.  
Last modified: 22 Dec 2019, 18:00:09 UTC

At first glance, my suspicion would be to check the pid:

Are 'pid's getting reused or rolling over? Or otherwise malformed?

Are 'pid's somehow 'special' for anonymous?

This is further suspicious in that: Is this a problem from the recent sudden big rise in live tasks and work units?...

Is the integer for the pid overflowing?!??

Or has the database table for anonymous overflowed?


OK, just some wild guesses before I follow up on Keith's comments :-)
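
One way to poke at the pid questions is to run the quoted function against a stand-in structure. The sketch below is a minimal mock, not the BOINC source: MockShmem, MockWuResult, count_reservations, the array size of 4, and the numeric values given to WR_STATE_EMPTY and WR_STATE_PRESENT are all assumptions made for illustration. Only the body of no_work follows the code Eric posted.

```cpp
#include <cassert>

// Assumed values for illustration only; check the real headers for the actual ones.
constexpr int WR_STATE_EMPTY = 0;
constexpr int WR_STATE_PRESENT = 1;

// Hypothetical stand-in for one entry of wu_results.
struct MockWuResult {
    int state = WR_STATE_EMPTY;
};

// Hypothetical stand-in for SCHED_SHMEM with a tiny fixed-size array.
struct MockShmem {
    static constexpr int max_wu_results = 4;
    bool ready = false;
    MockWuResult wu_results[max_wu_results];

    // Same logic as the quoted function: reserve the first PRESENT slot
    // by overwriting its state with the caller's pid.
    bool no_work(int pid) {
        if (!ready) return true;
        for (int i = 0; i < max_wu_results; i++) {
            if (wu_results[i].state == WR_STATE_PRESENT) {
                wu_results[i].state = pid;
                return false;
            }
        }
        return true;
    }
};

// Mark two slots PRESENT, then count how many consecutive calls with the
// given pid find work. The count is capped at max_calls so that a pid which
// never consumes a slot cannot loop forever.
int count_reservations(int pid, int max_calls = 10) {
    assert(max_calls > 0);
    MockShmem shm;
    shm.ready = true;
    shm.wu_results[0].state = WR_STATE_PRESENT;
    shm.wu_results[1].state = WR_STATE_PRESENT;
    int calls = 0;
    while (calls < max_calls && !shm.no_work(pid)) calls++;
    return calls;
}
```

With an ordinary pid such as 12345, count_reservations returns 2: each call consumes one slot and the third call reports no work. With a pid equal to the assumed WR_STATE_PRESENT value, the slot is never consumed and the count hits the cap. In this mock a pid collision therefore makes the function keep returning false rather than true, so under these assumptions it would not by itself reproduce the reported symptom.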


Keep searchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 2024413 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20147
Credit: 7,508,002
RAC: 20
United Kingdom
Message 2024419 - Posted: 22 Dec 2019, 17:59:22 UTC - in response to Message 2024410.  

Have the DB_ID_TYPE "id"s been changed across the versions/databases?


Keep searchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 2024419 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20147
Credit: 7,508,002
RAC: 20
United Kingdom
Message 2024420 - Posted: 22 Dec 2019, 18:08:38 UTC - in response to Message 2024410.  
Last modified: 22 Dec 2019, 18:08:54 UTC

From a very quick glance, note on:

https://github.com/BOINC/boinc/blob/0ee5c54381b262627f14c147f5528ed93f9d7672/sched/sched_util.h#L39

there is "return appid*1000000 - avid".

One million is not that big a number if the returned (compound/combined?) result has to be unique with respect to S@H users/tasks/WUs.

Really, shouldn't a wide hashing function or a structure be used to safely return such a result?
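
To put numbers on the "one million" concern, here is a small arithmetic sketch. It uses only the expression quoted above, appid*1000000 - avid. The function names pseudo_id, fits_int32 and first_overflowing_appid are made up for this example, and the idea that avid is zero or negative for anonymous platform is an assumption, not something confirmed from the source.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// The quoted expression, evaluated in 64 bits so that this sketch itself
// cannot overflow for the small inputs used here.
constexpr std::int64_t pseudo_id(std::int64_t appid, std::int64_t avid) {
    return appid * 1000000 - avid;
}

// True if a value still fits in a signed 32-bit integer.
constexpr bool fits_int32(std::int64_t v) {
    return v >= std::numeric_limits<std::int32_t>::min()
        && v <= std::numeric_limits<std::int32_t>::max();
}

// Smallest non-negative appid whose pseudo ID no longer fits in 32 bits.
std::int64_t first_overflowing_appid(std::int64_t avid) {
    // Assumption for this sketch: anonymous platform avid values are <= 0.
    assert(avid <= 0);
    std::int64_t appid = 0;
    while (fits_int32(pseudo_id(appid, avid))) appid++;
    return appid;
}
```

Two things fall out of the arithmetic. A 32-bit result only overflows once appid reaches 2148, which is far more applications than a project is likely to define. And two different (appid, avid) pairs only collide once the magnitude of avid reaches a million, for example pseudo_id(1, -1000000) equals pseudo_id(2, 0). Whether either limit is reachable in practice depends on the real ranges of appid and avid, which this sketch does not know.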


Keep searchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 2024420 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2024422 - Posted: 22 Dec 2019, 18:24:23 UTC

Thanks for the comments, Martin.

I too wondered if the size of the database now is at the root of the problem.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2024422 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2024424 - Posted: 22 Dec 2019, 18:25:38 UTC - in response to Message 2024368.  

Very simple to switch from Anonymous platform to Stock even with the All-In-One. All you have to do is change the names of the two files app_info.xml & app_config.xml to something like app_info1.xml & app_config1.xml; that will revert you to Stock. To change back to Anonymous platform, rename the files to the original names app_info.xml & app_config.xml.
That's All that needs to be done, Nothing Else...NADA.

It's not that simple in my experience. Or rather, it is that simple to get back to stock, but if you want to be able to restore your anonymous setup later, it is better to move or copy the anonymous apps out of the project folder. BOINC has a habit of deleting any file in the project folder that it doesn't know what to do with. And sometimes even when it does!
ID: 2024424 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 2024425 - Posted: 22 Dec 2019, 18:27:51 UTC
Last modified: 22 Dec 2019, 18:39:45 UTC

For those now running stock- how long is it taking for the Scheduler to respond? Are the occasional errors still occurring & "Project has no tasks available" responses even though the return rate is now very low?
When I reverted one of my systems to stock for a while, it was still getting the occasional Scheduler error & "Project has no tasks available" messages, and Scheduler responses were taking 20-30sec. Usual response time is 2-3 sec.

Which all indicates that while there is a bug that results in Anonymous platform not getting any work, there is still some other issue resulting in the whole Scheduler response taking an excessively long time to occur.


Edit-
People have mentioned resends are occurring. Didn't we have that disabled because it brought the database to its knees, with excessively long response times, when the database was only a fraction of its present size?
How about we get that disabled again and see if that allows work to flow to Anonymous hosts? That would let people fix the buggy code that stops them from getting work under these circumstances at their convenience.
Grant
Darwin NT
ID: 2024425 · Report as offensive
Lazydude
Volunteer tester

Joined: 17 Jan 01
Posts: 45
Credit: 96,158,001
RAC: 136
Sweden
Message 2024426 - Posted: 22 Dec 2019, 18:28:59 UTC

I got a whole lot of new tasks and some resends with anon platform.
Is something else broken, or is it well on the way to being fixed but not yet announced?
"Normal" response time on the "All tasks for" page.
ID: 2024426 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024428 - Posted: 22 Dec 2019, 18:41:06 UTC

Keep looking, guys :-)

Another team might start from the history. There's only one change to sched_shmem.cpp in the timescale that we're looking at: "back end: add feature for assigning WUs to a particular version num".

That adds an app_version_num field to the workunit database table, which is one candidate for the update that made Eric say the old code couldn't be used any more. Might an anonymous platform request using one of those negative "pseudo IDs" that Keith found barf when compared against a real version number in a task usability test?

I'll need to look whether any of the files affected have separate handling sections for stock and anonymous platform - some I've seen in the past do. Then check if, in any case, one handler has been updated but the other not.

---

Meanwhile, Eric has picked up on the report I made to the server release manager, and replied with an indication of the areas he's looking at. I won't confuse matters by posting them here, but I'll keep an eye open for anything that might be coming in on that front. So far, they're just at the "The possibilities that come to my mind are ..." stage.
ID: 2024428 · Report as offensive
Profile Eric B

Joined: 9 Mar 00
Posts: 88
Credit: 168,875,085
RAC: 762
United States
Message 2024430 - Posted: 22 Dec 2019, 19:02:26 UTC - in response to Message 2024408.  
Last modified: 22 Dec 2019, 19:07:21 UTC

Debugging the server is virtually impossible. If anyone wants to help.... The setiathome_server branch is at

https://github.com/BOINC/boinc/tree/setiathome_server/sched

Something goes wrong in the function SCHED_SHMEM::no_work.

bool SCHED_SHMEM::no_work(int pid) {
    if (!ready) return true;
    for (int i=0; i<max_wu_results; i++) {
        if (wu_results[i].state == WR_STATE_PRESENT) {
            wu_results[i].state = pid;
            return false;
        }
    }
    return true;
}


This function works properly unless the requesting computer has anonymous platform apps, for which it always returns true. How could that be? I don't know, despite an additional 500 lines of debugging code. It's almost as if something else is pausing anonymous platform requests until the queue is empty. Well, it's bedtime now. :(


I guess my first question would be:
Is it returning true because of "!ready", or is it falling through and returning the bottom true?
If it's falling through, then either max_wu_results is zero or negative, or
wu_results[i].state is never equal to WR_STATE_PRESENT.

Based on that analysis one can then decide what to look at next.
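
The three cases above could be told apart with a diagnostic variant that reports which path was taken. This is only a sketch on a mock structure: MockShmem, NoWorkReason, diagnose_no_work, make_shmem and the numeric values of the two state constants are hypothetical names and values chosen for the example, not taken from the BOINC source.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Assumed values for illustration only.
constexpr int WR_STATE_EMPTY = 0;
constexpr int WR_STATE_PRESENT = 1;

// One value per way the function can finish.
enum class NoWorkReason { Reserved, NotReady, EmptyRange, NonePresent };

// Hypothetical stand-in for the shared-memory segment.
struct MockShmem {
    bool ready = false;
    int max_wu_results = 0;
    std::vector<int> state;  // plays the role of wu_results[i].state
};

// Same control flow as no_work(), but each "return true" path gets its own
// reason, so a log line can show why no work was handed out.
NoWorkReason diagnose_no_work(MockShmem& shm, int pid) {
    // Guard against reading past the end of the mock array.
    assert(static_cast<int>(shm.state.size()) >= shm.max_wu_results);
    if (!shm.ready) return NoWorkReason::NotReady;
    if (shm.max_wu_results <= 0) return NoWorkReason::EmptyRange;
    for (int i = 0; i < shm.max_wu_results; i++) {
        if (shm.state[i] == WR_STATE_PRESENT) {
            shm.state[i] = pid;
            return NoWorkReason::Reserved;
        }
    }
    return NoWorkReason::NonePresent;
}

// Convenience builder for the mock.
MockShmem make_shmem(bool ready, std::vector<int> states) {
    MockShmem shm;
    shm.ready = ready;
    shm.max_wu_results = static_cast<int>(states.size());
    shm.state = std::move(states);
    return shm;
}
```

Logging the returned reason for stock and anonymous platform requests side by side would show whether the two kinds of request really take different paths inside this function, or whether the difference arises before it is called.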
ID: 2024430 · Report as offensive
wujj123456

Joined: 5 Sep 04
Posts: 40
Credit: 20,877,975
RAC: 219
China
Message 2024431 - Posted: 22 Dec 2019, 19:12:52 UTC - in response to Message 2024430.  

Debugging the server is virtually impossible. If anyone wants to help.... The setiathome_server branch is at

https://github.com/BOINC/boinc/tree/setiathome_server/sched

Something goes wrong in the function SCHED_SHMEM::no_work.

bool SCHED_SHMEM::no_work(int pid) {
    if (!ready) return true;
    for (int i=0; i<max_wu_results; i++) {
        if (wu_results[i].state == WR_STATE_PRESENT) {
            wu_results[i].state = pid;
            return false;
        }
    }
    return true;
}


This function works properly unless the requesting computer has anonymous platform apps, for which it always returns true. How could that be? I don't know, despite an additional 500 lines of debugging code. It's almost as if something else is pausing anonymous platform requests until the queue is empty. Well, it's bedtime now. :(


I guess my first question would be:
Is it returning true because of "!ready", or is it falling through and returning the bottom true?
If it's falling through, then either max_wu_results is zero or negative, or
wu_results[i].state is never equal to WR_STATE_PRESENT.

Based on that analysis one can then decide what to look at next.

Pretty sure ready is true. It's set to true at the beginning of the feeder loop: https://github.com/BOINC/boinc/blob/setiathome_server/sched/feeder.cpp#L572

The only time it's set to false is at exit, via atexit(), when the program terminates.
https://github.com/BOINC/boinc/blob/setiathome_server/sched/feeder.cpp#L170
https://github.com/BOINC/boinc/blob/setiathome_server/sched/feeder.cpp#L859
ID: 2024431 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2024432 - Posted: 22 Dec 2019, 19:13:34 UTC

I'm loving all the comments and snippets of code. Fantastic to see the community using their talents to help the project.

Just wanted to mention some good news. For some reason the replica is catching up. Still a long way to catch up, but just happy the number of seconds is going down and not up!
ID: 2024432 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2024433 - Posted: 22 Dec 2019, 19:17:08 UTC - in response to Message 2024425.  

For those now running stock- how long is it taking for the Scheduler to respond? Are the occasional errors still occurring & "Project has no tasks available" responses even though the return rate is now very low?
The only thing that changed when I switched back to stock was that my client can now occasionally get some work. The great majority of the work requests still result in HTTP errors, timeouts or 'Project has no tasks available'.

I got my queue full at some point today, but right now I have had such a long streak of errors or 'zero tasks' that I'm about 100 tasks short of a full queue.
ID: 2024433 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2024435 - Posted: 22 Dec 2019, 19:21:28 UTC - in response to Message 2024425.  

For those now running stock- how long is it taking for the Scheduler to respond? Are the occasional errors still occurring & "Project has no tasks available" responses even though the return rate is now very low?
When I reverted one of my systems to stock for a while, it was still getting the occasional Scheduler error & "Project has no tasks available" messages, and Scheduler responses were taking 20-30sec. Usual response time is 2-3 sec.


The response time to a request is very slow. It used to be so fast that I couldn't read quickly enough to keep up with the log; now it pauses for so long that I wonder if it is still doing something. 20-30 seconds sounds about right.
I'm also only successful in getting new WUs about every 2ish hours. I will get a healthy amount, then nothing for another 2ish hours. The faster machine I have set to keep asking; on the slower machine I ask once or twice a day, since I get a large batch (40-50 WUs, when my machine only does 50/day).
ID: 2024435 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.