The Server Issues / Outages Thread - Panic Mode On! (117)

Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024266 - Posted: 22 Dec 2019, 1:48:15 UTC - in response to Message 2024205.  

Anonymous Platform here.
Once or twice I have got a valid response (taking almost 2 min, usual response time 3 sec) which results in a "Project has no tasks available" response.
Edit- both systems have all work completed & reported, they're just trying for new work.


. . Exactly the same here, I was only able to report by using NNT (No New Tasks), but once work fetch is turned back on I get "no tasks".

Stephen

:(
ID: 2024266 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024269 - Posted: 22 Dec 2019, 2:42:16 UTC

. . I have tried reconfiguring the location for this machine to accept only CPU work and 'other work when requested work not available' but it still comes back with 'no tasks'.

. . I guess the servers as they are will not talk to anything they have identified as 'anonymous platform'.

Stephen

:(
ID: 2024269 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 2024278 - Posted: 22 Dec 2019, 3:35:04 UTC - in response to Message 2024269.  

I guess the servers as they are will not talk to anything they have identified as 'anonymous platform'.
Even running stock, it's hit & miss making contact with the Scheduler, and quite a few responses are "Project has no tasks available" when you do. But when you do get work, you get a lot of it.
But I gave up with running it as stock because when I got some SoG WUs the downloads errored out & after that I got almost nothing but CUDA42 work, with the odd CUDA50. With current hardware giving runtimes on par with 10-year-old hardware, it didn't make much sense to continue with that.
Grant
Darwin NT
ID: 2024278 · Report as offensive
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1639
Credit: 12,921,799
RAC: 89
New Zealand
Message 2024287 - Posted: 22 Dec 2019, 3:59:04 UTC - in response to Message 2024252.  

I'm looking into the problem. Grrrrr.....

Thanks Eric for looking into this on the weekend
ID: 2024287 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 2024288 - Posted: 22 Dec 2019, 4:02:03 UTC - in response to Message 2024287.  

I'm looking into the problem. Grrrrr.....

Thanks Eric for looking into this on the weekend

+1 !!
ID: 2024288 · Report as offensive
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 2024305 - Posted: 22 Dec 2019, 7:09:03 UTC
Last modified: 22 Dec 2019, 7:10:13 UTC

Debugging the server is virtually impossible. If anyone wants to help.... The setiathome_server branch is at

https://github.com/BOINC/boinc/tree/setiathome_server/sched

Something goes wrong in the function SCHED_SHMEM::no_work.

bool SCHED_SHMEM::no_work(int pid) {
    if (!ready) return true;
    for (int i=0; i<max_wu_results; i++) {
        if (wu_results[i].state == WR_STATE_PRESENT) {
            wu_results[i].state = pid;
            return false;
        }
    }
    return true;
}


This function works properly unless the requesting computer has anonymous platform apps, for which it always returns true. How could that be? I don't know, despite adding 500 lines of debugging code. It's almost as if something else is pausing anonymous platform requests until the queue is empty. Well it's bed time now. :(
@SETIEric@qoto.org (Mastodon)

ID: 2024305 · Report as offensive
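For anyone who wants to poke at the function Eric posted without a full server build, here is a minimal standalone sketch. The struct name SchedShmemModel, the helper make_segment, and the numeric values of the state constants are assumptions made for illustration; only the body of no_work follows the posted code. What it shows is that the function never looks at the requesting host: it can only return true if the segment is not ready, or if no slot is in WR_STATE_PRESENT at the moment of the call.

```cpp
#include <vector>

// Placeholder values for illustration; the real definitions live in the
// BOINC scheduler sources. Pids used with this model must differ from them.
constexpr int WR_STATE_EMPTY = 0;
constexpr int WR_STATE_PRESENT = 1;

struct WuResult {
    int state = WR_STATE_EMPTY;
};

// Simplified stand-in for the shared-memory segment (hypothetical name).
struct SchedShmemModel {
    bool ready = false;
    std::vector<WuResult> wu_results;

    // Same logic as the posted SCHED_SHMEM::no_work: reserve the first
    // PRESENT slot for this pid and report "there is work" (false).
    bool no_work(int pid) {
        if (!ready) return true;
        for (WuResult& r : wu_results) {
            if (r.state == WR_STATE_PRESENT) {
                r.state = pid;
                return false;
            }
        }
        return true;
    }
};

// Hypothetical helper: build a ready segment with total_slots slots, of which
// the first present_slots are PRESENT. Both arguments must be non-negative.
SchedShmemModel make_segment(int present_slots, int total_slots) {
    SchedShmemModel seg;
    seg.ready = true;
    seg.wu_results.resize(total_slots);
    for (int i = 0; i < present_slots && i < total_slots; i++) {
        seg.wu_results[i].state = WR_STATE_PRESENT;
    }
    return seg;
}
```

If the real code behaves like this model, "always true for anonymous platform" would have to come from the state of the array at call time, or from the path that leads to the call, which fits Eric's hunch that something else is involved.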
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2024307 - Posted: 22 Dec 2019, 7:37:15 UTC

Can't you just reload the previous server level 709 code instead of the level 715 code?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2024307 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 2024310 - Posted: 22 Dec 2019, 8:07:32 UTC - in response to Message 2024307.  

Good night Eric. Get some rest. Rested eyes are much more likely to see something than tired eyes. Thank you for taking the time to look at this. Hopefully one of the others here will have some ideas.
ID: 2024310 · Report as offensive
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 2024312 - Posted: 22 Dec 2019, 8:10:05 UTC - in response to Message 2024307.  

Unfortunately there's a database change that renders the 7.09 server inoperable. :(

Really going to bed this time.
@SETIEric@qoto.org (Mastodon)

ID: 2024312 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2024315 - Posted: 22 Dec 2019, 8:19:13 UTC
Last modified: 22 Dec 2019, 8:19:59 UTC

Angela has called ;-)

I've had a quick look at the bowl of spaghetti, sorry "code", that surrounds the lines that Eric posted. It is one of the many tangled bits, and it is quite probable that there is another process lurking around that sets one of the triggers that stops the delivery of tasks to anonymous applications - and I haven't got any of my track&trace tools with me.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2024315 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 2024317 - Posted: 22 Dec 2019, 8:38:59 UTC - in response to Message 2024305.  

It's almost as if something else is pausing anonymous platform requests until the queue is empty. Well it's bed time now. :(
It's not just Anonymous platform- stock client requests are also taking ages, it's just that Anonymous Platform systems get hit even harder resulting in the host getting nothing. The normal response time for a Scheduler response is 2-3sec. Ever since this issue began, Scheduler response times have been 30sec to 2min or so (Scheduler timeout response).

It's just that with the Stock application, while you still get some Scheduler error responses, they aren't as frequent. Nor are the "Project has no tasks available" messages if you do get a valid response, and when you do get work you get a lot of it with a single request.
So while many requests on a Stock system still result in no work, you do get enough good requests to keep your cache full.


While the fact that this delay issue stops Anonymous Platform systems from getting work is a bug in its own right, the delay is still hampering Stock systems in contacting the Scheduler & getting work when they do.
Sorting out what is causing the delay won't fix the underlying bug affecting Anonymous Platform systems, but at least it should get work flowing regularly again, for all systems.
Grant
Darwin NT
ID: 2024317 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2024318 - Posted: 22 Dec 2019, 8:55:10 UTC

The other day Eric posted this:
This problem may be affecting the rate at which the main project can handle results, so the validation and assimilation queues are getting large, which may affect the rate of work generation.


Eric, ever the master of understatement.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2024318 · Report as offensive
wujj123456

Joined: 5 Sep 04
Posts: 40
Credit: 20,877,975
RAC: 219
China
Message 2024319 - Posted: 22 Dec 2019, 8:59:46 UTC - in response to Message 2024312.  

Unfortunately there's a database change that renders the 7.09 server inoperable. :(

I also saw this from a different project just now: https://universeathome.pl/universe/forum_thread.php?id=486

I don't know how BOINC projects are operated, but if that's not a coincidence, there seem to be some changes forcing every project to upgrade now? Even if, in the worst case, everyone just dragged their feet on upgrades for months/years, it's still not very nice to set a deadline or force it to happen around major holidays...
ID: 2024319 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2024320 - Posted: 22 Dec 2019, 9:10:51 UTC

Each project has its own servers, administrators etc., so this is just a sad coincidence.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2024320 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024321 - Posted: 22 Dec 2019, 9:16:59 UTC - in response to Message 2024315.  

Some quick thoughts to go with the first coffee of the day.

1) Eric mentioned a database change meaning the old code couldn't be used. That means it was a deliberate upgrade, and everything everyone said about not doing that on a pre-holiday Friday needs hanging up in neon fairylights.

2) I think the excessive delays tie in with the re-enabling of 'Resend Lost Results'. Let's treat that as a separate problem. Maybe the upgrade put in a default configuration setting (easy), or maybe it broke the 'off' switch.

3) Eric has given us a code snippet to work from, and a symptom: 'always returns true for anonymous platform'. I looked at the code yesterday, and I looked at the history of changes to that and related files. Just two caught my eye: the addition of keyword support for Science United, and some code to allow tasks to be processed by a specific version number of the science application. Just maybe, one or both of those were added for stock apps, but not for anon plat? It'll give me something to read when I wake up...
ID: 2024321 · Report as offensive
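Richard's third point can be pictured with a toy model. Everything below is hypothetical: the Slot and Host structs, the ANY_VERSION sentinel, and the idea that an anonymous platform host carries no server-side version ids are assumptions for illustration, not taken from the BOINC sources. It only shows how a per-task version restriction written with stock apps in mind could leave an anonymous platform host with zero usable slots even though the array is full.

```cpp
#include <vector>

// Hypothetical sentinel: the slot may be run by any app version.
constexpr int ANY_VERSION = -1;

// Hypothetical work slot: present in the array, optionally pinned to one
// specific server-side app version id.
struct Slot {
    bool present;
    int required_app_version;
};

// Hypothetical host record. In this model a stock host lists the version ids
// the server assigned to it, while an anonymous platform host lists none.
struct Host {
    std::vector<int> server_version_ids;
};

// A slot is usable if it is present and either unpinned or pinned to a
// version id the host is known to have.
bool slot_usable(const Slot& slot, const Host& host) {
    if (!slot.present) return false;
    if (slot.required_app_version == ANY_VERSION) return true;
    for (int id : host.server_version_ids) {
        if (id == slot.required_app_version) return true;
    }
    return false;
}

// Count how many slots this host could actually be sent.
int usable_slots(const std::vector<Slot>& slots, const Host& host) {
    int n = 0;
    for (const Slot& s : slots) {
        if (slot_usable(s, host)) n++;
    }
    return n;
}
```

With every slot pinned, a host with an empty id list gets nothing no matter how full the array is, which would look from the outside like "no tasks available" for anonymous platform only. Whether anything like this exists in the real scheduler is exactly what would need checking in the change history.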
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024322 - Posted: 22 Dec 2019, 9:24:32 UTC - in response to Message 2024320.  

Each project has its own servers, administrators etc., so this is just a sad coincidence.
But they all use the same server code, from BOINC. There are occasional problems like security alerts which prompt a mass updating, but unfortunately the Universe administrator doesn't say WHY the server 'must be updated' or WHY it had to happen 'today'. I haven't seen any email traffic to that effect.
ID: 2024322 · Report as offensive
elec999 Project Donor

Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2024326 - Posted: 22 Dec 2019, 10:02:54 UTC

Anyone else not getting new work?
Sun 22 Dec 2019 04:51:43 AM EST | SETI@home | Sending scheduler request: To fetch work.
Sun 22 Dec 2019 04:51:43 AM EST | SETI@home | Requesting new tasks for CPU and NVIDIA GPU
Sun 22 Dec 2019 04:52:51 AM EST | SETI@home | Scheduler request completed: got 0 new tasks
Sun 22 Dec 2019 04:52:51 AM EST | SETI@home | Project has no tasks available

Sun 22 Dec 2019 04:51:42 AM EST | | Starting BOINC client version 7.16.3 for x86_64-pc-linux-gnu
Sun 22 Dec 2019 04:51:42 AM EST | | log flags: file_xfer, sched_ops, task
Sun 22 Dec 2019 04:51:42 AM EST | | Libraries: libcurl/7.65.3 OpenSSL/1.1.1c zlib/1.2.11 libidn2/2.2.0 libpsl/0.20.2 (+libidn2/2.0.5) libssh/0.9.0/openssl/zlib nghttp2/1.39.2 librtmp/2.3
Sun 22 Dec 2019 04:51:42 AM EST | | Data directory: /var/lib/boinc-client
Sun 22 Dec 2019 04:51:43 AM EST | | CUDA: NVIDIA GPU 0: GeForce RTX 2060 SUPER (driver version 440.36, CUDA version 10.2, compute capability 7.5, 4096MB, 3970MB available, 7311 GFLOPS peak)
Sun 22 Dec 2019 04:51:43 AM EST | | CUDA: NVIDIA GPU 1: GeForce GTX 1070 Ti (driver version 440.36, CUDA version 10.2, compute capability 6.1, 4096MB, 3968MB available, 8186 GFLOPS peak)
Sun 22 Dec 2019 04:51:43 AM EST | | CUDA: NVIDIA GPU 2: GeForce GTX 1050 Ti (driver version 440.36, CUDA version 10.2, compute capability 6.1, 4040MB, 3978MB available, 2138 GFLOPS peak)
Sun 22 Dec 2019 04:51:43 AM EST | | OpenCL: NVIDIA GPU 0: GeForce RTX 2060 SUPER (driver version 440.36, device version OpenCL 1.2 CUDA, 7979MB, 3970MB available, 7311 GFLOPS peak)
Sun 22 Dec 2019 04:51:43 AM EST | | OpenCL: NVIDIA GPU 1: GeForce GTX 1070 Ti (driver version 440.36, device version OpenCL 1.2 CUDA, 8120MB, 3968MB available, 8186 GFLOPS peak)
Sun 22 Dec 2019 04:51:43 AM EST | | OpenCL: NVIDIA GPU 2: GeForce GTX 1050 Ti (driver version 440.36, device version OpenCL 1.2 CUDA, 4040MB, 3978MB available, 2138 GFLOPS peak)
Sun 22 Dec 2019 04:51:43 AM EST | SETI@home | Found app_info.xml; using anonymous platform
Sun 22 Dec 2019 04:51:43 AM EST | | [libc detection] gathered: 2.30, Ubuntu GLIBC 2.30-0ubuntu2
Sun 22 Dec 2019 04:51:43 AM EST | | Host name: seti-AB350-Gaming
Sun 22 Dec 2019 04:51:43 AM EST | | Processor: 16 AuthenticAMD AMD Ryzen 7 2700X Eight-Core Processor [Family 23 Model 8 Stepping 2]
Sun 22 Dec 2019 04:51:43 AM EST | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
Sun 22 Dec 2019 04:51:43 AM EST | | OS: Linux Ubuntu: Ubuntu 19.10 [5.3.0-24-generic|libc 2.30 (Ubuntu GLIBC 2.30-0ubuntu2)]
ID: 2024326 · Report as offensive
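To put a number on the delay from a client log like the one above, the two timestamps around a scheduler request can be subtracted. A small sketch, assuming the caller has already cut out the HH:MM:SS field and that both stamps fall on the same day and the same side of noon; seconds_of_day and elapsed_seconds are names made up here.

```cpp
#include <cstdio>
#include <string>

// Parse "HH:MM:SS" into seconds since the start of the day.
// Returns -1 if the text does not have that shape or is out of range.
int seconds_of_day(const std::string& hms) {
    int h = 0, m = 0, s = 0;
    if (std::sscanf(hms.c_str(), "%d:%d:%d", &h, &m, &s) != 3) return -1;
    if (h < 0 || h > 23 || m < 0 || m > 59 || s < 0 || s > 59) return -1;
    return h * 3600 + m * 60 + s;
}

// Seconds between two log timestamps. Returns -1 on a parse failure or if
// the end stamp is earlier than the start stamp (no wrap-around handling).
int elapsed_seconds(const std::string& start, const std::string& end) {
    const int a = seconds_of_day(start);
    const int b = seconds_of_day(end);
    if (a < 0 || b < 0 || b < a) return -1;
    return b - a;
}
```

For the request above, elapsed_seconds("04:51:43", "04:52:51") gives 68, against the usual 2-3 seconds Grant mentioned.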
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024328 - Posted: 22 Dec 2019, 10:07:24 UTC - in response to Message 2024326.  

Anyone else not getting new work?
Try reading, instead of writing.
ID: 2024328 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024340 - Posted: 22 Dec 2019, 12:12:15 UTC - in response to Message 2024252.  

I'm looking into the problem. Grrrrr.....


. . Thanks Eric, and lotsa luck!

Stephen

. .
ID: 2024340 · Report as offensive
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2024341 - Posted: 22 Dec 2019, 12:13:51 UTC - in response to Message 2024305.  

Debugging the server is virtually impossible. If anyone wants to help.... The setiathome_server branch is at

https://github.com/BOINC/boinc/tree/setiathome_server/sched

Something goes wrong in the function SCHED_SHMEM::no_work.

bool SCHED_SHMEM::no_work(int pid) {
    if (!ready) return true;
    for (int i=0; i<max_wu_results; i++) {
        if (wu_results[i].state == WR_STATE_PRESENT) {
            wu_results[i].state = pid;
            return false;
        }
    }
    return true;
}


This function works properly unless the requesting computer has anonymous platform apps, for which it always returns true. How could that be? I don't know, despite adding 500 lines of debugging code. It's almost as if something else is pausing anonymous platform requests until the queue is empty. Well it's bed time now. :(


Until someone can figure out why and fix it, is it possible to hard code it to return false if the requesting computer has anonymous platform apps?

Barring that, does anyone have ideas on how to make the client side hide that it is running anonymous platform? All my PCs were running great until this happened, and I don't want to mess them up by going back to stock, especially given how slow it is.

If someone can show me it is really simple to switch to/from stock, I would be willing to try on my slowest box at least to be running something.
ID: 2024341 · Report as offensive
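On the idea of hard-coding a false return: judging only from the snippet Eric posted, returning false goes hand in hand with marking one slot with the caller's pid. Here is a minimal sketch, using a plain vector of states instead of the real shared-memory segment and leaving out the ready check. The names no_work_forced and reserved_slot are hypothetical, and the constant's value is a placeholder.

```cpp
#include <cstddef>
#include <vector>

// Placeholder value for illustration; pids must differ from it.
constexpr int WR_STATE_PRESENT = 1;

// The posted logic, applied to a plain vector of slot states.
bool no_work(std::vector<int>& states, int pid) {
    for (int& state : states) {
        if (state == WR_STATE_PRESENT) {
            state = pid;   // reserve this slot for the caller
            return false;  // "there is work"
        }
    }
    return true;
}

// Hypothetical override along the lines of the suggestion: always claim
// there is work when the host runs anonymous platform apps.
bool no_work_forced(std::vector<int>& states, int pid, bool anonymous_platform) {
    if (anonymous_platform) return false;  // nothing gets reserved here
    return no_work(states, pid);
}

// Index of the slot reserved for pid, or -1 if there is none.
int reserved_slot(const std::vector<int>& states, int pid) {
    for (std::size_t i = 0; i < states.size(); i++) {
        if (states[i] == pid) return static_cast<int>(i);
    }
    return -1;
}
```

In this model the forced version reports that work exists but reserves nothing, so whatever runs next would still have no slot to hand out. Whether the real scheduler behaves the same way depends on code outside the posted snippet.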


 