Panic Mode On (115) Server Problems?

mrchips
Joined: 12 Dec 04
Posts: 17
Credit: 26,590,842
RAC: 8
United States
Message 1983959 - Posted: 7 Mar 2019, 21:48:06 UTC

I pay with electric and computer usage
ID: 1983959
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1983975 - Posted: 7 Mar 2019, 23:04:26 UTC - in response to Message 1983959.  

I pay with electric and computer usage


. . We all contribute that. Some more than others. But do you really need that cup of coffee every day? $10 from each volunteer would really give the project the financial power to fix things in short order if they go bung.

Stephen

??
.
ID: 1983975
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984193 - Posted: 8 Mar 2019, 21:04:53 UTC

The upload server is having issues, it seems.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984193
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11451
Credit: 29,581,041
RAC: 66
United States
Message 1984194 - Posted: 8 Mar 2019, 21:15:46 UTC - in response to Message 1984193.  

Einstein has had that problem for a while; maybe it is contagious?
ID: 1984194
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984198 - Posted: 8 Mar 2019, 21:33:16 UTC - in response to Message 1984194.  
Last modified: 8 Mar 2019, 21:36:23 UTC

Einstein still does. It shouldn't be bothering Seti. I have
<max_file_xfers_per_project>8</max_file_xfers_per_project>
which should be enough to handle my 3 projects.
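
For anyone wanting to try the same thing: these two options live in the <options> section of cc_config.xml in the BOINC data directory. A minimal sketch with the values above (everything else left at its default; the comments are just annotation):

<cc_config>
  <options>
    <!-- total simultaneous file transfers across all projects -->
    <max_file_xfers>8</max_file_xfers>
    <!-- simultaneous file transfers allowed for any one project -->
    <max_file_xfers_per_project>8</max_file_xfers_per_project>
  </options>
</cc_config>

The client should pick the change up after "Read config files" in BOINC Manager, or after a client restart.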

[Edit] Maybe so. I forced some of the stuck Einstein uploads to clear at 100% and the stuck Seti uploads went instantly.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984198
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984211 - Posted: 8 Mar 2019, 22:36:18 UTC

Had to increase <max_file_xfers>N</max_file_xfers> to 16 to prevent the stalled Einstein uploads from impacting my other projects.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984211
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1984213 - Posted: 8 Mar 2019, 22:47:02 UTC - in response to Message 1984211.  

Had to increase <max_file_xfers>N</max_file_xfers> to 16 to prevent the stalled Einstein uploads from impacting my other projects.
Stalled uploads for one project don't (shouldn't) impact other projects. I had 20 Einstein tasks uploading - my limit is at 8 - when this happened:

08/03/2019 22:30:52 | SETI@home | [sched_op] NVIDIA GPU work request: 6298.76 seconds; 0.00 devices
08/03/2019 22:30:55 | SETI@home | Scheduler request completed: got 7 new tasks
08/03/2019 22:30:55 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 6621 seconds

Einstein upload failure should only affect Einstein work fetch.
ID: 1984213
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984214 - Posted: 8 Mar 2019, 22:54:52 UTC - in response to Message 1984213.  
Last modified: 8 Mar 2019, 23:04:48 UTC

Had to increase <max_file_xfers>N</max_file_xfers> to 16 to prevent the stalled Einstein uploads from impacting my other projects.
Stalled uploads for one project don't (shouldn't) impact other projects. I had 20 Einstein tasks uploading - my limit is at 8 - when this happened:

08/03/2019 22:30:52 | SETI@home | [sched_op] NVIDIA GPU work request: 6298.76 seconds; 0.00 devices
08/03/2019 22:30:55 | SETI@home | Scheduler request completed: got 7 new tasks
08/03/2019 22:30:55 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 6621 seconds

Einstein upload failure should only affect Einstein work fetch.

I agree in theory, but that is not what I am observing. I am getting stuck Seti downloads while the Einstein uploads are retrying. It doesn't make sense that Einstein would affect other projects, but as I understand the <max_file_xfers>N</max_file_xfers> parameter in cc_config.xml, it is the global number of simultaneous file transfers for all of BOINC. Since I run Seti, Einstein, MilkyWay and GPUGrid concurrently on 4 of 5 machines, I regularly exceed 8 simultaneous connections in the client when all projects are crunching, reporting and uploading. The docs say 8 is the default for that parameter, so I took a gamble and increased it to 16. It seems to have worked.

[Edit] I wasn't talking about work fetch in your example. I was talking about stalled downloads after a Seti work fetch. They were getting stalled and put into backoffs. Work fetch is fine for all projects, except of course for Einstein with its stalled uploads.
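
To put rough numbers on my reading of those two limits (an illustration, not gospel): with <max_file_xfers> at 8 and <max_file_xfers_per_project> also at 8, a single project with 8 stalled-but-still-active uploads can occupy every global transfer slot, leaving nothing for the other three projects. With the global limit raised to 16, Einstein can still tie up at most 8 slots and the other 8 stay free for Seti, MilkyWay and GPUGrid.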
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984214
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1984215 - Posted: 8 Mar 2019, 23:18:58 UTC - in response to Message 1984214.  
Last modified: 8 Mar 2019, 23:21:51 UTC

Even so, if you leave the <max_per_project> at 2, the Einstein stall should leave 6 comms slots free for the other projects to share - and even those two should only be blocked for the 60-second Einstein gateway timeout.

Edit - Gary says the Einstein upload restart worked about 10 minutes ago, and all mine had cleared by the time I looked.
ID: 1984215
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1984217 - Posted: 8 Mar 2019, 23:31:18 UTC - in response to Message 1984213.  
Last modified: 8 Mar 2019, 23:33:58 UTC

Had to increase <max_file_xfers>N</max_file_xfers> to 16 to prevent the stalled Einstein uploads from impacting my other projects.
Stalled uploads for one project don't (shouldn't) impact other projects. I had 20 Einstein tasks uploading - my limit is at 8 - when this happened:

08/03/2019 22:30:52 | SETI@home | [sched_op] NVIDIA GPU work request: 6298.76 seconds; 0.00 devices
08/03/2019 22:30:55 | SETI@home | Scheduler request completed: got 7 new tasks
08/03/2019 22:30:55 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 6621 seconds

Einstein upload failure should only affect Einstein work fetch.


. . But if you have your max transfers set to 8 and there are 8 stalled Einstein uploads, would that not prevent any further uploads or downloads for any project?

Stephen

? ?
ID: 1984217
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984223 - Posted: 9 Mar 2019, 0:14:56 UTC - in response to Message 1984217.  
Last modified: 9 Mar 2019, 0:31:43 UTC

Had to increase <max_file_xfers>N</max_file_xfers> to 16 to prevent the stalled Einstein uploads from impacting my other projects.
Stalled uploads for one project don't (shouldn't) impact other projects. I had 20 Einstein tasks uploading - my limit is at 8 - when this happened:

08/03/2019 22:30:52 | SETI@home | [sched_op] NVIDIA GPU work request: 6298.76 seconds; 0.00 devices
08/03/2019 22:30:55 | SETI@home | Scheduler request completed: got 7 new tasks
08/03/2019 22:30:55 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 6621 seconds

Einstein upload failure should only affect Einstein work fetch.


. . But if you have your max transfers set to 8 and there are 8 stalled Einstein uploads, would that not prevent any further uploads or downloads for any project?

Stephen

? ?

That is apparently what was happening. I have project transfers set to 8 because of my large download counts and many computers vying for bandwidth. The sooner one computer can get the 100-200 tasks on each request, the sooner the other computers can get access to the download pipe unencumbered. The default of 2 transfers leaves the connection in perpetual download, given the number of tasks I download and the number of hosts contacting all the project servers.

[Edit] And things go south fast when I need to upload a GPUGrid task. I only have a nominal 1 Mbps upload pipe, and it takes 20+ minutes to upload the big 80 MB result file for each finished task.
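
Rough arithmetic on that, assuming the nominal rate: 80 MB is about 640 megabits, so even at a full 1 Mbps the upload would take roughly 640 seconds, call it 11 minutes. The 20+ minutes I actually see suggests the effective upstream rate, with overhead and competing transfers, is closer to 0.5 Mbps.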
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984223
Profile Gary Charpentier Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 31357
Credit: 53,134,872
RAC: 32
United States
Message 1984232 - Posted: 9 Mar 2019, 2:06:13 UTC - in response to Message 1984213.  

Had to increase <max_file_xfers>N</max_file_xfers> to 16 to prevent the stalled Einstein uploads from impacting my other projects.
Stalled uploads for one project don't (shouldn't) impact other projects. I had 20 Einstein tasks uploading - my limit is at 8 - when this happened:

08/03/2019 22:30:52 | SETI@home | [sched_op] NVIDIA GPU work request: 6298.76 seconds; 0.00 devices
08/03/2019 22:30:55 | SETI@home | Scheduler request completed: got 7 new tasks
08/03/2019 22:30:55 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 6621 seconds

Einstein upload failure should only affect Einstein work fetch.

That is correct. Of course, until the transfer times out it does hold one of the slots. Depending on backoffs, it might appear as if another project is also being held, as I believe BOINC will retry failed downloads before it moves on to others, on the theory that they are closer to their timeouts. I'm not sure how it handles the situation if it also doesn't get a ping back from Google and goes into its not-connected-to-the-internet mode.
ID: 1984232
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984235 - Posted: 9 Mar 2019, 2:35:04 UTC - in response to Message 1984232.  

I had my timeouts set at 90 seconds, but I had my transfers set at 8 and 8, so there were not enough transfer slots allotted with 10 or more Einstein tasks trying to upload. Reconfiguring to 16 and 8 was enough to get things flowing again.
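
In cc_config.xml terms that is simply (same <options> section as before, everything else left at default):

    <max_file_xfers>16</max_file_xfers>
    <max_file_xfers_per_project>8</max_file_xfers_per_project>

so a single project can now tie up at most half of the available transfer slots.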
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984235
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1984259 - Posted: 9 Mar 2019, 9:33:34 UTC - in response to Message 1984217.  

. . But if you have your max transfers set to 8 and there are 8 stalled Einstein uploads, would that not prevent any further uploads or downloads for any project?

Stephen

? ?
No. An upload which is stalled (in either transfer backoff or project backoff) doesn't affect anything except the project concerned.

Every time a task completes, BOINC will try the uploads for that task (but not the others) just once, then go back into backoff. But if that new upload gets through, it'll start retrying the others.

It's only when uploads are 'active' (but possibly heading for a timeout) that they get in the way.
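
A concrete (hypothetical) example of that sequence: say three Einstein uploads are sitting in backoff when a fourth Einstein task finishes. The client tries the new task's upload once; if that single attempt fails it joins the others in backoff, but if it succeeds, the client starts retrying the three backed-off files as well. Files in backoff don't hold a transfer slot, so Seti and the other projects carry on regardless.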
ID: 1984259
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984287 - Posted: 9 Mar 2019, 15:58:46 UTC - in response to Message 1984259.  

So it was my repeatedly hitting retry on the stuck Einstein uploads, keeping them all "Active", that prevented the others from uploading or downloading and sent them into backoff.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984287
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1984294 - Posted: 9 Mar 2019, 16:28:28 UTC - in response to Message 1984287.  

Sounds plausible.
ID: 1984294
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1647
Credit: 12,921,799
RAC: 89
New Zealand
Message 1984647 - Posted: 12 Mar 2019, 2:36:01 UTC
Last modified: 12 Mar 2019, 2:37:03 UTC

Lots of data files are sitting complete at (128). It would be nice if they didn't load any more data after the outage, so we could burn through some of the smaller files. Just my thoughts, in an ideal world.

It would kinda be nice if we could have a feature that told us how many files we have processed in the last 24 hours. I am aware that you can work it out with basic maths, but you would need to check at the same time each day, and it is hard to get an accurate number if they have added more data. I am aware that this probably will not happen, as it would put more strain on the database.
ID: 1984647
Profile Brent Norman Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1984683 - Posted: 12 Mar 2019, 5:58:36 UTC - in response to Message 1984647.  

If I remember right, Eric said they do ~1.2 TB of data from Green Bank per day.
So 1.2 TB * 1024 GB/TB = 1228.8 GB, and 1228.8 GB / 52.39 GB per file ≈ 23.5 files per day.
My best guess there ...
ID: 1984683
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1647
Credit: 12,921,799
RAC: 89
New Zealand
Message 1984700 - Posted: 12 Mar 2019, 8:07:01 UTC - in response to Message 1984683.  

If I remember right, Eric said they do ~1.2 TB of data from Green Bank per day.
So 1.2 TB * 1024 GB/TB = 1228.8 GB, and 1228.8 GB / 52.39 GB per file ≈ 23.5 files per day.
My best guess there ...

Thank you for the information, very interesting. Now if only we could see which files have been processed before they are removed. I am aware, as I said in my previous post, that there are approximately 9 files sitting at (128); I guess in a way this is showing me they are complete. :) I have a feeling these may be stuck.
ID: 1984700
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1859
Credit: 268,616,081
RAC: 1,349
United States
Message 1984702 - Posted: 12 Mar 2019, 8:23:29 UTC - in response to Message 1984700.  

Thank you for the information, very interesting. Now if only we could see which files have been processed before they are removed. I am aware, as I said in my previous post, that there are approximately 9 files sitting at (128); I guess in a way this is showing me they are complete. :) I have a feeling these may be stuck.
No, it means the SSP (Server Status Page) is borked, as it has been intermittently for a while. If you look, it will also say that there are 28 GBT splitters running (channels in progress) when only 14 are provisioned. If you see that, pretty much everything else is possibly inaccurate as well. This seems to include failing to drop files from the page once they have been completely split.
No way to know for sure, but my suspicion is that when they redid the throttle process a while back, they failed to account for updating some portion of the SSP, as this seems to happen after the throttle kicks in.
ID: 1984702