The Server Issues / Outages Thread - Panic Mode On! (118)

Profile arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 2024358 - Posted: 22 Dec 2019, 13:32:16 UTC

We need a new thread.

ID: 2024358 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024362 - Posted: 22 Dec 2019, 13:41:18 UTC - in response to Message 2024358.  

Argh - I have some news to pass on, and spent some time composing it - in the old thread. Hang on while I re-write it.
ID: 2024362 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2024363 - Posted: 22 Dec 2019, 13:42:39 UTC - in response to Message 2024358.  
Last modified: 22 Dec 2019, 14:01:50 UTC

> We need a new thread.


Like the scheduler needed some new code... lol.
ID: 2024363 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2024365 - Posted: 22 Dec 2019, 13:49:39 UTC - in response to Message 2024364.  
Last modified: 22 Dec 2019, 13:53:54 UTC

Since I am using the All-In-One, I don't even have a stock to revert to. I'd need to archive the BOINC folder, download/install the detested Repository version, reconnect to SETI, download/install all the setup/apps/work including some ancient and slow CUDA50 that takes 10x as long to finish if it doesn't crash, then when this is fixed (which with my luck will happen exactly when I have completed this) wait for the work to complete, uninstall it, unpack the All-In-One back and hope for the best...

... on eight computers.

Or I could just connect to Einstein. Takes about ten seconds apiece. Much easier.
ID: 2024365 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024367 - Posted: 22 Dec 2019, 13:57:35 UTC

The LHC project at CERN now has the responsibility for testing and releasing new server code. Their website says:

> Server upgrade
> Following a couple of weeks of tests in the LHC@home development project, we are upgrading our production server cluster to BOINC server release 1.2 this afternoon. During the update we will be running with slightly lower server capacity than usual.
> 30 Sep 2019, 11:56:49 UTC

Despite that "BOINC server release 1.2", their server reports '22/12/2019 12:51:47 | LHC@home | [sched_op] Server version 715', same as ours. I'll be having words about that.
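
For anyone who wants to check the reported server version across several projects, here is a minimal sketch that pulls it out of an Event Log line. The "date time | project | message" layout and the day-first date format are assumed from the single line quoted above; the function names are hypothetical, and logs from other locales may need a different date pattern.

```python
import re
from datetime import datetime

# Layout assumed from the one quoted line: "date time | project | message".
LINE_RE = re.compile(
    r"^(?P<stamp>[^|]+?)\s*\|\s*(?P<project>[^|]*?)\s*\|\s*(?P<message>.*)$"
)
VERSION_RE = re.compile(r"\[sched_op\] Server version (\d+)")


def parse_event_line(line):
    """Split one Event Log line into (timestamp, project, message), or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Day-first date, as in the quoted line; other locales will differ.
    stamp = datetime.strptime(match.group("stamp"), "%d/%m/%Y %H:%M:%S")
    return stamp, match.group("project"), match.group("message")


def server_version(line):
    """Return the server version the scheduler reported, or None."""
    parsed = parse_event_line(line)
    if parsed is None:
        return None
    found = VERSION_RE.search(parsed[2])
    return int(found.group(1)) if found else None


sample = "22/12/2019 12:51:47 | LHC@home | [sched_op] Server version 715"
print(server_version(sample))
```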

But I ran a test. I cleaned up my account, attached again, and got the statutory single initial task. Then I wrapped up their application in an app_info.xml file (losing that initial task to a few finger-fumbles along the way - we all do it!) and eventually got Anonymous Platform working properly.

First off, I got my initial task back as a 'resent lost task' - exactly as I should have done. That bit is installed and working.

But at every work request since then, the LHC server has responded 'internal server error'. It only happens when work is requested and a task is already running.

Bingo! We have a reproduction of the problem here, on an independent project, without all the congestion and delays. And that project is well resourced, and has a vested interest in getting the problem sorted. I'll be writing to the guys once I've got this posted.
ID: 2024367 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2024368 - Posted: 22 Dec 2019, 13:57:41 UTC - in response to Message 2024365.  
Last modified: 22 Dec 2019, 13:59:49 UTC

> Since I am using the All-In-One, I don't even have a stock to revert to. I'd need to archive the BOINC folder, download/install the detested Repository version, reconnect to SETI, download/install all the setup/apps/work including some ancient and slow CUDA50 that takes 10x as long to finish if it doesn't crash, then when this is fixed (which with my luck will happen exactly when I have completed this) wait for the work to complete, uninstall it, unpack the All-In-One back and hope for the best...

You aren't quite correct. It's Very simple to switch from Anonymous platform to Stock even with the All-In-One. All you have to do is rename the two files app_info.xml & app_config.xml to something like app_info1.xml & app_config1.xml; that will revert you to Stock. To change back to Anonymous platform, rename the files to their original names, app_info.xml & app_config.xml.
That's All that needs to be done, Nothing Else...NADA.
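
The rename can be scripted if you have several machines to switch. This is only a sketch: the function names are hypothetical, the project directory is whatever folder holds the two files on your install, and the demo runs against a temporary folder rather than a real BOINC directory. The client still has to be restarted afterwards for the change to take effect.

```python
import tempfile
from pathlib import Path

# The two files named in the post.
AP_FILES = ("app_info.xml", "app_config.xml")


def parked_name(name):
    """app_info.xml -> app_info1.xml, the naming pattern suggested above."""
    path = Path(name)
    return f"{path.stem}1{path.suffix}"


def switch_to_stock(project_dir):
    """Rename the anonymous-platform files so the client no longer sees them."""
    renamed = []
    for name in AP_FILES:
        source = Path(project_dir) / name
        if source.exists():
            source.rename(source.with_name(parked_name(name)))
            renamed.append(name)
    return renamed


def switch_to_anonymous(project_dir):
    """Restore the original names to return to anonymous platform."""
    restored = []
    for name in AP_FILES:
        parked = Path(project_dir) / parked_name(name)
        if parked.exists():
            parked.rename(parked.with_name(name))
            restored.append(name)
    return restored


if __name__ == "__main__":
    # Demo on a throwaway folder; point it at a real project folder at your own risk.
    with tempfile.TemporaryDirectory() as demo:
        for name in AP_FILES:
            (Path(demo) / name).write_text("<placeholder/>")
        print(switch_to_stock(demo))
        print(switch_to_anonymous(demo))
```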
ID: 2024368 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2024370 - Posted: 22 Dec 2019, 13:58:58 UTC - in response to Message 2024368.  

Thanks TBar... I had tried that on my largest machine but it didn't work. I probably fat fingered something (I think I neglected app_config). I'll try it again.
ID: 2024370 · Report as offensive
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2024371 - Posted: 22 Dec 2019, 14:03:47 UTC - in response to Message 2024368.  

>> Since I am using the All-In-One, I don't even have a stock to revert to. I'd need to archive the BOINC folder, download/install the detested Repository version, reconnect to SETI, download/install all the setup/apps/work including some ancient and slow CUDA50 that takes 10x as long to finish if it doesn't crash, then when this is fixed (which with my luck will happen exactly when I have completed this) wait for the work to complete, uninstall it, unpack the All-In-One back and hope for the best...
>
> You aren't quite correct. It's Very simple to switch from Anonymous platform to Stock even with the All-In-One. All you have to do is rename the two files app_info.xml & app_config.xml to something like app_info1.xml & app_config1.xml; that will revert you to Stock. To change back to Anonymous platform, rename the files to their original names, app_info.xml & app_config.xml.
> That's All that needs to be done, Nothing Else...NADA.


That is simple enough. If one does that, gets some tasks, can one then return these two files to their original names and run the tasks with the special sauce apps?
ID: 2024371 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024372 - Posted: 22 Dec 2019, 14:05:19 UTC - in response to Message 2024368.  

app_config shouldn't be a problem. You may get some warning messages about unrecognised applications, but you can ignore those. app_config can be used in either stock or AP mode.

More importantly, you need to Restart the client (so it forgets about app_info).

I always find I also need to reset the project - which deletes all files, with recent versions of BOINC. Hence my advice to take a backup...
ID: 2024372 · Report as offensive
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 2024373 - Posted: 22 Dec 2019, 14:09:07 UTC
Last modified: 22 Dec 2019, 14:14:14 UTC

I may be an old head, but why doesn't someone back out the changes that caused this catastrophe in the first place, bounce the servers, and then sandbox the new code until the bug is rectified?

At this moment I've 18 hrs (33 CPU tasks) left & 19 tasks that have not been able to report. Uploads have not been a problem. GPUs have been assigned to Milkyway.


I don't buy computers, I build them!!
ID: 2024373 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2024374 - Posted: 22 Dec 2019, 14:11:54 UTC - in response to Message 2024371.  
Last modified: 22 Dec 2019, 14:19:51 UTC

> That is simple enough. If one does that, gets some tasks, can one then return these two files to their original names and run the tasks with the special sauce apps?

The Tasks are assigned to a certain <platform>XXXXXXXXXXX</platform>, <version_num>XXX</version_num> and <plan_class>XXXXX</plan_class> in the file client_state.xml. Once the tasks are assigned in the client_state.xml you CAN NOT change any of those values without trashing the Tasks. So NO, you can't simply change from Stock to Anonymous platform IF You have EXISTING Tasks assigned in the client_state.xml file. The ONLY way you can change is to make SURE the 3 values in the client_state.xml file match the values in the app_info.xml file EXACTLY.
ID: 2024374 · Report as offensive
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2024377 - Posted: 22 Dec 2019, 14:22:14 UTC - in response to Message 2024374.  

Thanks, TBar. Once the T3500 finishes its small quota of stock, I'm going back to special sauce to wait for the fix. I need to get Einstein fired up to keep the house warm until then.
ID: 2024377 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024378 - Posted: 22 Dec 2019, 14:22:58 UTC - in response to Message 2024374.  

>> That is simple enough. If one does that, gets some tasks, can one then return these two files to their original names and run the tasks with the special sauce apps?
>
> The Tasks are assigned to a certain <platform>XXXXXXXXXXX</platform>, <version_num>XXX</version_num> and <plan_class>XXXXX</plan_class> in the file client_state.xml. Once the tasks are assigned in the client_state.xml you CAN NOT change any of those values without trashing the Tasks. So NO, you can't simply change from Stock to Anonymous platform IF You have EXISTING Tasks assigned in the client_state.xml file.

You can, if the platform, version and plan_class strings in app_info match the values for the stock tasks you have received. That's how the Lunatics installer worked: all known platform, version and plan_class combinations were covered in the supplied app_info files.
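
Before switching back, it may be worth checking which of your existing tasks would be covered. The sketch below compares the (platform, version_num, plan_class) triple of each task against the app_version entries in app_info. The XML layout shown (result elements in client_state.xml, app_version elements in app_info.xml) is an assumption about the file structure, the sample values are purely illustrative, and the function names are hypothetical; it works on inline strings so nothing real is touched.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed samples; real files carry many more elements.
CLIENT_STATE = """
<client_state>
  <result>
    <name>task_a</name>
    <platform>x86_64-pc-linux-gnu</platform>
    <version_num>800</version_num>
    <plan_class>opencl_nvidia_sah</plan_class>
  </result>
  <result>
    <name>task_b</name>
    <platform>x86_64-pc-linux-gnu</platform>
    <version_num>800</version_num>
    <plan_class>cuda60</plan_class>
  </result>
</client_state>
"""

APP_INFO = """
<app_info>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <platform>x86_64-pc-linux-gnu</platform>
    <version_num>800</version_num>
    <plan_class>opencl_nvidia_sah</plan_class>
  </app_version>
</app_info>
"""


def key(element):
    """The (platform, version_num, plan_class) triple of an element."""
    return tuple(
        (element.findtext(tag) or "").strip()
        for tag in ("platform", "version_num", "plan_class")
    )


def unmatched_tasks(client_state_xml, app_info_xml):
    """Names of tasks whose triple has no matching app_version in app_info."""
    covered = {key(v) for v in ET.fromstring(app_info_xml).iter("app_version")}
    return [
        r.findtext("name")
        for r in ET.fromstring(client_state_xml).iter("result")
        if key(r) not in covered
    ]


# Tasks listed here would be at risk if the client restarted with this app_info.
print(unmatched_tasks(CLIENT_STATE, APP_INFO))
```

An empty list would mean every task has a matching entry, which is the situation the Lunatics-style app_info files aimed for.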
ID: 2024378 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2024380 - Posted: 22 Dec 2019, 14:30:48 UTC - in response to Message 2024377.  

> Thanks, TBar. Once the T3500 finishes its small quota of stock, I'm going back to special sauce and wait for the fix. Need to get Einstein fired up to keep the house warm until then.

I really don't see the difference between running Stock tasks at SETI or Stock tasks at Einstein. All that is required to go from Anonymous platform to Stock is to change the name on Two files, NOTHING ELSE.
ID: 2024380 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2024382 - Posted: 22 Dec 2019, 14:37:51 UTC - in response to Message 2024373.  
Last modified: 22 Dec 2019, 15:24:20 UTC

> At this moment I've 18 hrs (33 CPU tasks) left & 19 tasks that have not been able to report...


Setting "No new tasks" and then hitting or allowing Update will resolve this. The cause is that the scheduler is now checking for and resending lost tasks on every connection, which is causing connects to fail, but only if work is requested.
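
The same two steps can be done from the command line with boinccmd, which is handy on headless boxes. A minimal sketch, assuming boinccmd is on the PATH and that PROJECT_URL matches the project URL your client shows; the function names are hypothetical and the script only prints the commands unless dry_run is turned off.

```python
import subprocess

# Assumption: the project URL exactly as your client lists it.
PROJECT_URL = "http://setiathome.berkeley.edu/"


def report_only_commands(project_url, boinccmd="boinccmd"):
    """Build the two calls: stop requesting work, then contact the server."""
    return [
        [boinccmd, "--project", project_url, "nomorework"],
        [boinccmd, "--project", project_url, "update"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


run(report_only_commands(PROJECT_URL))
```

Once the fix is in, the matching `allowmorework` operation turns work fetch back on.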

If that still doesn't resolve it, then edit cc_config.xml in your main BOINC folder and ensure there's a line <max_tasks_reported>##</max_tasks_reported> inside the <options> section. The largest possible value for ## is 255 (anything over that is scaled back to 255 without any warning). I would try maybe 50 at first.
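
For anyone editing this on several machines, here is a minimal sketch that adds or updates the setting in cc_config.xml text. It assumes the usual cc_config layout with an options section under the root; the function name is hypothetical, and it works on a string so you can inspect the result before writing anything back. The client has to re-read its config files (or be restarted) before the new value applies.

```python
import xml.etree.ElementTree as ET


def set_max_tasks_reported(cc_config_xml, value):
    """Return cc_config XML text with <max_tasks_reported> set to value."""
    root = ET.fromstring(cc_config_xml)
    options = root.find("options")
    if options is None:
        # No options section yet: create one under the root.
        options = ET.SubElement(root, "options")
    setting = options.find("max_tasks_reported")
    if setting is None:
        setting = ET.SubElement(options, "max_tasks_reported")
    setting.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical starting file with an empty options section.
before = "<cc_config><options></options></cc_config>"
print(set_max_tasks_reported(before, 50))
```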


Edit: I've also tried renaming app_config.xml and app_info.xml, resetting the project and restarting the client. On 2/3 machines it still doesn't work... No tasks available even with half a million in the RTS queue. :^( Some setting is persisting. I guess I detach and reattach next.
ID: 2024382 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024384 - Posted: 22 Dec 2019, 15:27:41 UTC - in response to Message 2024367.  

I have written to the Server Release Manager at CERN, under the title 'Internal server errors in BOINC server release 1.2' - which should get his attention. Copies to Eric and David so they're kept in the loop.

The server guy at CERN has a young family, so he won't be desperately keen to receive a report like this during the holiday season. Progress will be slow, but I'll report any feedback I get.
ID: 2024384 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024386 - Posted: 22 Dec 2019, 15:32:02 UTC - in response to Message 2024382.  

> Edit: I've also tried renaming app_config.xml and app_info.xml, resetting the project and restarting the client. On 2/3 machines it still doesn't work... No tasks available even with half a million in the RTS queue. :^( Some setting is persisting. I guess I detach and reattach next.

I think you have to restart the client, then reset the project - in that order - for it to work. Even then, new work isn't guaranteed while the server is being thrashed within an inch of its life, but provided no errors are being reported in the Event Log, it should come through eventually. Detaching certainly isn't necessary.
ID: 2024386 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2024387 - Posted: 22 Dec 2019, 15:34:25 UTC - in response to Message 2024382.  
Last modified: 22 Dec 2019, 15:36:11 UTC

> Edit: I've also tried renaming app_config.xml and app_info.xml, resetting the project and restarting the client. On 2/3 machines it still doesn't work... No tasks available even with half a million in the RTS queue. :^( Some setting is persisting. I guess I detach and reattach next.

How long did you wait for tasks? It can take 20 to 30 minutes to get the first tasks. Renaming the files Only, works perfectly fine on all my Macs and Linux machines. I guess I just have those special machines, Oh Well. Yes, you do have to restart BOINC as you do anytime you make a change to the app_info.xml....but I thought everyone knew that.
ID: 2024387 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2024388 - Posted: 22 Dec 2019, 15:36:51 UTC - in response to Message 2024387.  

Thanks... I guess I will just be patient. Hard to do as of late. :^p
ID: 2024388 · Report as offensive
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2024389 - Posted: 22 Dec 2019, 15:37:52 UTC - in response to Message 2024384.  

> I have written to the Server Release Manager at CERN, under the title 'Internal server errors in BOINC server release 1.2' - which should get his attention. Copies to Eric and David so they're kept in the loop.
>
> The server guy at CERN has a young family, so he won't be desperately keen to receive a report like this during the holiday season. Progress will be slow, but I'll report any feedback I get.


Thanks for the update and effort, Richard. The following is just me venting and not directed at anyone.

I would be sympathetic to the holiday part, but some yahoo decided to do this right before a weekend, and for some a two-week holiday. I've spent probably 4 hours of my vacation time paying attention to an issue which shouldn't exist for this hobby. The anonymous platform problem was apparently well-known on the beta and yet someone thought it was a good idea to shove this into production. If I did that on the industrial pilot unit PLCs for which I'm responsible, I'd be in trouble.

Even my T3500 stock was having trouble getting new jobs. I suspect the system is growing more unstable based on that and the growing replica lag. That's why I'm not going to chase it by converting my other boxes. I'll wait till they fix it and help Einstein in the meantime. *rant off*
ID: 2024389 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.