The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)

Ian&Steve C.
Joined: 28 Sep 99
Posts: 4263
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2024437 - Posted: 22 Dec 2019, 19:28:58 UTC

from an Anonymous Platform host:

Sun 22 Dec 2019 02:26:21 PM EST | SETI@home | [sched_op] Starting scheduler request
Sun 22 Dec 2019 02:26:21 PM EST | SETI@home | Sending scheduler request: Requested by user.
Sun 22 Dec 2019 02:26:21 PM EST | SETI@home | Requesting new tasks for NVIDIA GPU
Sun 22 Dec 2019 02:26:21 PM EST | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Sun 22 Dec 2019 02:26:21 PM EST | SETI@home | [sched_op] NVIDIA GPU work request: 2617920.00 seconds; 30.00 devices
Sun 22 Dec 2019 02:26:31 PM EST | SETI@home | Scheduler request completed: got 0 new tasks
Sun 22 Dec 2019 02:26:31 PM EST | SETI@home | [sched_op] Server version 715
Sun 22 Dec 2019 02:26:31 PM EST | SETI@home | Project has no tasks available
Sun 22 Dec 2019 02:26:31 PM EST | SETI@home | Project requested delay of 303 seconds
Sun 22 Dec 2019 02:26:31 PM EST | SETI@home | [sched_op] Deferring communication for 00:05:03
Sun 22 Dec 2019 02:26:31 PM EST | SETI@home | [sched_op] Reason: requested by project


The request completed quickly (10 seconds) but reported that no work was available. It didn't time out like it had been doing previously.
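
The 10-second figure can be read straight off the log timestamps. A minimal Python sketch, assuming the event-log layout shown above (timestamp, project and message separated by " | "); the helper names parse_line and request_duration are illustrative and not part of BOINC:

```python
from datetime import datetime

# Two lines copied from the log above; any lines in the same layout would work.
LOG = [
    "Sun 22 Dec 2019 02:26:21 PM EST | SETI@home | Sending scheduler request: Requested by user.",
    "Sun 22 Dec 2019 02:26:31 PM EST | SETI@home | Scheduler request completed: got 0 new tasks",
]

def parse_line(line):
    """Split one event-log line into (timestamp, project, message)."""
    stamp, project, message = (part.strip() for part in line.split(" | ", 2))
    # Drop the trailing zone abbreviation ("EST"): strptime's %Z does not
    # reliably parse it, and only the difference between stamps matters here.
    stamp = stamp.rsplit(" ", 1)[0]
    return datetime.strptime(stamp, "%a %d %b %Y %I:%M:%S %p"), project, message

def request_duration(lines):
    """Seconds between 'Sending scheduler request' and the reply, or None."""
    start = end = None
    for line in lines:
        when, _project, message = parse_line(line)
        if message.startswith("Sending scheduler request"):
            start = when
        elif message.startswith(("Scheduler request completed", "Scheduler request failed")):
            end = when
    if start is None or end is None:
        return None
    return (end - start).total_seconds()

print(request_duration(LOG))
```

For the two sample lines this prints 10.0. A request that never logs a reply yields None rather than a duration.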
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2024437 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14511
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024438 - Posted: 22 Dec 2019, 19:31:35 UTC - in response to Message 2024435.  

The response time to a request is very slow. It used to be so fast that I couldn't read quickly enough to keep up with the log; now it pauses for so long that I wonder if it is still doing anything. 20-30 seconds sounds about right.
I think the slow response time is purely because this glitch has also turned 'resend lost tasks' back on, when we have a huge number of tasks in the database.

I replicated our problem at LHC, and there I'm getting an 'internal server error' response in 1 second, and it's always 'internal server error' - never time out or no tasks available.
ID: 2024438 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14511
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024440 - Posted: 22 Dec 2019, 19:37:41 UTC - in response to Message 2024437.  

The request completed quickly (10 seconds) but reported that no work was available. It didn't time out like it had been doing previously.
Looking at your 30 GPU host (and we can see it quickly now - yay!), it has no tasks thought by the server to be in progress. That means the check for lost tasks can run very quickly, which will be a great help.

Or perhaps Eric has managed to turn off 'resend lost tasks' since I last heard - that would help, too.
ID: 2024440 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2024442 - Posted: 22 Dec 2019, 19:48:43 UTC
Last modified: 22 Dec 2019, 19:53:06 UTC

I know there are a lot of people trying to help who are more capable than me, with my old rusty C coding, but...

I have not read the entire server code, so I can't be sure, but I was unable to find where any value is assigned to avid & appid when running anonymous platform.

) {
    if (avid < 0) {
        return appid*1000000 - avid;
    }
    return avid;
}


So the result of
if (!ready) return true;

will always be true.

Who knows?
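
For readers following the fragment, its arithmetic can be tried out in isolation. A minimal Python sketch that mirrors only the visible body of the snippet; the function name fold_version_id and the sample values are hypothetical, since the original signature is cut off, and nothing here shows where the server assigns avid or appid:

```python
def fold_version_id(avid, appid):
    """Mirror of the quoted C body: a negative avid is folded into a positive id."""
    if avid < 0:
        return appid * 1000000 - avid
    return avid

# Hypothetical inputs, chosen only to exercise both branches.
print(fold_version_id(-2, 29))    # negative avid: 29 * 1000000 + 2
print(fold_version_id(1234, 29))  # non-negative avid is returned unchanged
```

With a negative avid the subtraction adds its magnitude to appid*1000000, so the result is always positive for a positive appid.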
ID: 2024442 · Report as offensive
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2024448 - Posted: 22 Dec 2019, 20:46:41 UTC
Last modified: 22 Dec 2019, 20:47:29 UTC

I'm running stock with a 1080ti and 2070 Super. Here's a job that just finished. Should it be running cuda60 or something better? 46+ minutes seems long even for stock.

Application SETI@home v8 8.01 (cuda60)
Name 20dc19aa.24041.21335.5.32.215.vlar
State Ready to report
Received Sun 22 Dec 2019 08:29:24 AM EST
Report deadline Thu 13 Feb 2020 01:29:05 PM EST
Resources 0.516 CPUs + 1 NVIDIA GPU
Estimated computation size 183,887 GFLOPs
CPU time 00:04:42
Elapsed time 00:46:49
Executable setiathome_8.01_x86_64-pc-linux-gnu__cuda60

Here is app_config and the error log is complaining about cuda90 (doesn't match any app versions).
<app_config>
 <project_max_concurrent>12</project_max_concurrent>
 <app_version>
   <app_name>setiathome_v8</app_name>
   <plan_class>cuda90</plan_class>
   <avg_ncpus>1.0</avg_ncpus>
   <ngpus>1.0</ngpus>
   <cmdline>-nobs</cmdline>
 </app_version>
</app_config>

ID: 2024448 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024449 - Posted: 22 Dec 2019, 21:16:29 UTC - in response to Message 2024365.  

Since I am using the All-In-One, I don't even have a stock to revert to. I'd need to archive the BOINC folder, download/install the detested Repository version, reconnect to SETI, download/install all the setup/apps/work including some ancient and slow CUDA50 that takes 10x as long to finish if it doesn't crash, then when this is fixed (which with my luck will happen exactly when I have completed this) wait for the work to complete, uninstall it, unpack the All-In-One back and hope for the best...
... on eight computers.
Or I could just connect to Einstein. Takes about ten seconds apiece. Much easier.


. . I only have 5 machines but I have been thinking much the same. Apart from this one running SoG under lunatics which I can change back to stock in less than half an hour from the time it runs out of work, on the others it would be much easier (and far less intimidating) to just change projects.

. . I may be being a little overdramatic, but this 'disastrous' change could be the death knell for SETI@home. It has the potential to drive most/all of the project's most productive volunteers to other projects or simply away. If these people are forced to shut down their machines for what is looking like an indefinite period, they will inevitably seek other hobbies or pastimes which may then preclude them from returning. We really do need assurances that this is not a long term problem due to some administrative BOINC issue over which the SETI guys have no control.

Stephen

please!
ID: 2024449 · Report as offensive
JohnDK Crowdfunding Project Donor*Special Project $250 donor
Volunteer tester
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 2024451 - Posted: 22 Dec 2019, 21:19:20 UTC
Last modified: 22 Dec 2019, 21:20:17 UTC


Here is app_config and the error log is complaining about cuda90 (doesn't match any app versions).
<app_config>
 <project_max_concurrent>12</project_max_concurrent>
 <app_version>
   <app_name>setiathome_v8</app_name>
   <plan_class>cuda90</plan_class>
   <avg_ncpus>1.0</avg_ncpus>
   <ngpus>1.0</ngpus>
   <cmdline>-nobs</cmdline>
 </app_version>
</app_config>

You're still using the anonymous app_config setup. Here's the one I use for running stock apps:

<app_config>
 <app_version>
   <app_name>setiathome_v8</app_name>
   <plan_class>opencl_nvidia_sah</plan_class>
   <avg_ncpus>1</avg_ncpus>
   <ngpus>1</ngpus>
   <cmdline>-sbs 256 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64</cmdline>
 </app_version>
 <app_version>
   <app_name>setiathome_v8</app_name>
   <plan_class>opencl_nvidia_SoG</plan_class>
   <avg_ncpus>1</avg_ncpus>
   <ngpus>1</ngpus>
   <cmdline>-sbs 256 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64</cmdline>
 </app_version>
 <app_version>
   <app_name>setiathome_v8</app_name>
   <plan_class>cuda60</plan_class>
   <avg_ncpus>1</avg_ncpus>
   <ngpus>1</ngpus>
   <cmdline>-sbs 256 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64</cmdline>
 </app_version>
</app_config>


Yes, cuda60 sucks. I don't know if one can force the server to send SoG or sah work.

I think the server is supposed to learn which app is the best and only send the right one. My main host doesn't get cuda60 work anymore, but my other host only gets cuda60 work :(
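
The complaint about cuda90 comes down to a plan class named in app_config that the host has no app version for. A minimal Python sketch of that check, assuming the list of plan classes the host really has is supplied by hand (here taken from the stock config above); unmatched_plan_classes is an illustrative helper, not a BOINC tool:

```python
import xml.etree.ElementTree as ET

# Copy of the app_config that triggered the complaint.
APP_CONFIG = """
<app_config>
 <project_max_concurrent>12</project_max_concurrent>
 <app_version>
   <app_name>setiathome_v8</app_name>
   <plan_class>cuda90</plan_class>
   <avg_ncpus>1.0</avg_ncpus>
   <ngpus>1.0</ngpus>
   <cmdline>-nobs</cmdline>
 </app_version>
</app_config>
"""

# Plan classes named in the stock config quoted above (supplied by hand).
STOCK_PLAN_CLASSES = {"opencl_nvidia_sah", "opencl_nvidia_SoG", "cuda60"}

def unmatched_plan_classes(app_config_xml, available):
    """Return plan classes named in app_config that are not in `available`."""
    root = ET.fromstring(app_config_xml.strip())
    configured = [el.text.strip() for el in root.iter("plan_class") if el.text]
    return [pc for pc in configured if pc not in available]

print(unmatched_plan_classes(APP_CONFIG, STOCK_PLAN_CLASSES))
```

An empty list would mean every plan class in the file has a counterpart; here cuda90 is reported as the odd one out.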
ID: 2024451 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024453 - Posted: 22 Dec 2019, 21:24:14 UTC - in response to Message 2024367.  

The LHC project at CERN now has the responsibility for testing and releasing new server code. Their website says: <snip>
But at every work request since then, the LHC server has responded 'internal server error'. It only happens when work is requested and a task is already running.
Bingo! We have a reproduction of the problem here, on an independent project, without all the congestion and delays. And that project is well resourced, and has a vested interest in getting the problem sorted. I'll be writing to the guys once I've got this posted.


. . But does that mean that if you request new work when there are NO tasks running, you could expect to get some? Because I am not seeing that here on any machine...

Stephen

?
ID: 2024453 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2024454 - Posted: 22 Dec 2019, 21:25:33 UTC

Any oldtimers here remember that one Christmas season that seti was very down... was it a month?? two??

I'm sorry that there is a group of you that aren't getting WUs. Eric has responded, and even tried last night to track down the problem. I'm amazed they have kept the system up for so many of us. They are trying to fix things on a weekend and a holiday season, and that is way more than I would ask for (but I'm grateful).
ID: 2024454 · Report as offensive
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2024457 - Posted: 22 Dec 2019, 21:31:30 UTC - in response to Message 2024451.  

Thanks, John. Yes, I was trying to do minimal changes to my config. Since I did backup the folder, I've put in your version and will see how it runs. Never thought I would be going back to it.
ID: 2024457 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024458 - Posted: 22 Dec 2019, 21:35:10 UTC - in response to Message 2024378.  

You can, if the platform, version and plan_class strings in app_info match the values for the stock tasks you have received. That's how the Lunatics installer worked: all known platform, version and plan_class combinations were covered in the supplied app_info files.


. . This raises a question for me Richard. How does S@H decide that a host is 'anonymous platform'? If we go back to stock and then edit app_info.xml to redirect all appropriate platforms to use the enhanced app won't that cause it to be classified as 'anonymous platform'?

Stephen

??
ID: 2024458 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024460 - Posted: 22 Dec 2019, 21:59:04 UTC - in response to Message 2024393.  

My understanding is that the status pages are driven from the replica database


Rather defeats the entire definition of a status page to have it set up this way, but I would not be surprised if that was the case.


. . But we were still seeing the status page while the replica database was offline ...

Stephen

? ?
ID: 2024460 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14511
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024461 - Posted: 22 Dec 2019, 22:06:49 UTC - in response to Message 2024458.  

If you have an app_info.xml file, and it's active - in place when BOINC was started - your reports to the server will say that you're running anonymous platform. That's the definition.

If you didn't have an app_info.xml file active when you started BOINC, you're running stock.

My comment was that if you prepare an app_info.xml file offline that matches the characteristics of the tasks you're running as stock, you can switch from stock to anonymous platform without losing work - I've done that today at LHC. I don't think you can ever switch back from anonymous platform to stock and keep existing cached work.
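
The definition above reduces to a single file check. A minimal Python sketch, assuming you pass in the names of the files that were in the project folder when BOINC started; platform_mode is a hypothetical helper and the file listings are examples:

```python
def platform_mode(project_files):
    """Classify a host from the file names in its project folder at BOINC start-up."""
    names = set(project_files)
    return "anonymous platform" if "app_info.xml" in names else "stock"

print(platform_mode(["app_info.xml", "app_config.xml"]))
print(platform_mode(["app_config.xml"]))
```

Only app_info.xml decides the outcome in this sketch; an app_config.xml on its own leaves the host classed as stock.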
ID: 2024461 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024463 - Posted: 22 Dec 2019, 22:17:51 UTC - in response to Message 2024448.  
Last modified: 22 Dec 2019, 22:18:36 UTC

I'm running stock with a 1080ti and 2070 Super. Here's a job that just finished. Should it be running cuda60 or something better? 46+ minutes seems long even for stock.


. . Cuda 60 is reallly sloooowww!

. . I had this problem when I moved my test machine into Beta, so I brought it back to main. It occurred to me later on that the Nvidia drivers I am using in Linux do NOT have OpenCL support, so the servers will never send SoG tasks. If you check and confirm that you have Nvidia drivers WITH OpenCL support, hopefully you should start to receive SoG WUs.

Stephen

:)
ID: 2024463 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024465 - Posted: 22 Dec 2019, 22:22:23 UTC - in response to Message 2024454.  
Last modified: 22 Dec 2019, 22:23:25 UTC

Any oldtimers here remember that one Christmas season that seti was very down... was it a month?? two??

I'm sorry that there is a group of you that aren't getting WUs. Eric has responded, and even tried last night to track down the problem. I'm amazed they have kept the system up for so many of us. They are trying to fix things on a weekend and a holiday season, and that is way more than I would ask for (but I'm grateful).


. . That talk of being down for 2 months has me wanting to set my hair on fire ...

. . And I am sure most/all of us appreciate the efforts made by the SETI HQ crew and especially Eric, but frustration is a very driving issue ... :(

Stephen

:(
ID: 2024465 · Report as offensive
Profile Siran d'Vel'nahr
Volunteer tester
Joined: 23 May 99
Posts: 7373
Credit: 44,181,323
RAC: 238
United States
Message 2024466 - Posted: 22 Dec 2019, 22:25:28 UTC - in response to Message 2024424.  

Very simple to switch from Anonymous platform to Stock even with the All-In-One. All you have to do is rename the two files app_info.xml & app_config.xml to something like app_info1.xml & app_config1.xml; that will revert you to Stock. To change back to Anonymous platform, rename the files to the original names app_info.xml & app_config.xml.
That's All that needs to be done, Nothing Else...NADA.
It's not that simple in my experience. Or rather, it is that simple to get back to stock, but if you want to be able to restore your anonymous setup later, then it is better to move or copy the anonymous apps out of the project folder. BOINC has a habit of deleting any file in the project folder it doesn't know what to do with. And sometimes even when it does!

Hi Ville,

I alleviated that by renaming my anonymous project folder with "anonymous_" attached to the front of the existing folder name before I restarted BOINC and reset the project. When this issue gets resolved, all I need to do is attach "stock_" to the front of the stock folder and remove the "anonymous_" from the other folder name, restart BOINC and reset the project. Of course others may say just delete the stock folder. Yeah, could do that, but if this happens again, this will save bandwidth on non-WU downloads. Maybe... I don't know, I won't know fer sure until I have to do it. ;)
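
The file-renaming approach quoted at the top of this post can be planned before anything is touched on disk. A minimal Python sketch that only computes the old and new names; rename_plan is a hypothetical helper, the "1" suffix follows the example names in the quote, and applying the plan (for instance with os.rename) is left to the reader:

```python
import os

CONTROL_FILES = ("app_info.xml", "app_config.xml")

def rename_plan(to_stock, suffix="1"):
    """Return (old, new) name pairs for switching to stock or back to anonymous platform."""
    plan = []
    for name in CONTROL_FILES:
        stem, ext = os.path.splitext(name)
        parked = f"{stem}{suffix}{ext}"  # e.g. app_info1.xml
        plan.append((name, parked) if to_stock else (parked, name))
    return plan

print(rename_plan(to_stock=True))
print(rename_plan(to_stock=False))
```

Given the caution above that BOINC may delete files it does not recognise, moving the parked files out of the project folder may be safer than renaming them in place.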

Have a great day! :)

Siran
CAPT Siran d'Vel'nahr XO - L L & P _\\//
USS Vre'kasht NCC-33187
Winders 10 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 2024466 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024468 - Posted: 22 Dec 2019, 22:29:17 UTC - in response to Message 2024461.  

hi Richard,

. . Yep, I am with you on the process for not trashing WUs when shifting from stock to 'anonymous platform', but I had the impression that the other party had hoped that we could continue to get work issued by being 'stock' but get it redirected to run with the special app.

Stephen

. .
ID: 2024468 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14511
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024470 - Posted: 22 Dec 2019, 22:35:06 UTC - in response to Message 2024468.  

Yes, you can do that with a carefully composed app_info. The application doesn't matter, but the task description - platform, version number, plan_class - does.
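
The matching rule can be written down as a set comparison. A minimal Python sketch, assuming each cached task and each app_info entry is reduced to a (platform, version, plan_class) triple; the helper name and the sample triples are illustrative (the cuda60 one is read off the executable name posted earlier in the thread, the others are made up):

```python
def uncovered_tasks(cached_tasks, app_info_entries):
    """Return cached task triples that no app_info entry describes."""
    covered = set(app_info_entries)
    return [task for task in cached_tasks if task not in covered]

# (platform, version, plan_class) triples; sample values for illustration only.
cached = [("x86_64-pc-linux-gnu", "8.01", "cuda60")]
app_info = [
    ("x86_64-pc-linux-gnu", "8.01", "cuda60"),
    ("x86_64-pc-linux-gnu", "8.01", "opencl_nvidia_SoG"),
]

print(uncovered_tasks(cached, app_info))
print(uncovered_tasks([("x86_64-pc-linux-gnu", "8.01", "cuda90")], app_info))
```

Which executable each entry points at plays no part in the comparison, in line with the remark that the application doesn't matter but the task description does.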
ID: 2024470 · Report as offensive
Profile AllenIN
Volunteer tester
Joined: 5 Dec 00
Posts: 292
Credit: 58,297,005
RAC: 311
United States
Message 2024471 - Posted: 22 Dec 2019, 22:39:48 UTC - in response to Message 2024468.  

I'm sure there are many of you that are on top of this but I thought I would just show what I am getting when I try to update. Yes, I'm running Anonymous.

12/22/2019 5:32:36 PM | SETI@home | Requesting new tasks for CPU and Intel GPU
12/22/2019 5:33:01 PM | SETI@home | Scheduler request failed: HTTP internal server error

Just thought the "HTTP internal server error" might be a clue to the problem.

Allen
ID: 2024471 · Report as offensive
Profile Siran d'Vel'nahr
Volunteer tester
Joined: 23 May 99
Posts: 7373
Credit: 44,181,323
RAC: 238
United States
Message 2024473 - Posted: 22 Dec 2019, 22:50:27 UTC - in response to Message 2024470.  

Yes, you can do that with a carefully composed app_info. The application doesn't matter, but the task description - platform, version number, plan_class - does.

Hi Richard,

My stock project folder does not have the app_info.xml file. My anonymous project folder does. Is this file just used for the anonymous platform? What will happen if I place that file in the stock project folder and restart BOINC, say after doing a NNT first? Will I hose BOINC to where I have to reinstall? Or will I go back to running cuda90?

Have a great day! :)

Siran
CAPT Siran d'Vel'nahr XO - L L & P _\\//
USS Vre'kasht NCC-33187
Winders 10 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 2024473 · Report as offensive


 