Panic Mode On (63) Server problems?

Message boards : Number crunching : Panic Mode On (63) Server problems?
Grant (SSSF)
Message 1180920 - Posted: 27 Dec 2011, 9:28:02 UTC - in response to Message 1180912.  


And now uploads appear to be becoming rather iffy.
Grant
Darwin NT
Cosmic_Ocean
Message 1180947 - Posted: 27 Dec 2011, 15:30:00 UTC

Regarding stuck WUs that have been around for a long period of time...

Do those have to be manually kicked by one of the staff? I know it can be a painful and cumbersome task. I was thinking maybe we could make a thread listing as many as can be found, and then some sort of shell script could be put together once there's a list of task IDs.
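For illustration only, a minimal sketch of the collection side of that idea, written in Python rather than shell: it assumes the stuck IDs get pasted into a plain text file (the stuck_ids.txt name is made up) and it only de-duplicates and batches them; the actual server-side cancellation step would still be up to the staff and whatever tool they use.

# collect_stuck_ids.py -- gather workunit/task IDs pasted into a thread,
# de-duplicate them, and print comma-separated batches that the staff could
# feed to whatever server-side cancel tool they actually use.
# The file name and batch size are made-up examples.
import re
import sys

def extract_ids(text):
    # Anything that looks like a numeric task/WU ID (7 or more digits).
    return {int(m) for m in re.findall(r"\b(\d{7,})\b", text)}

def main(path="stuck_ids.txt", batch_size=50):
    with open(path) as f:
        ids = sorted(extract_ids(f.read()))
    for i in range(0, len(ids), batch_size):
        print(",".join(str(x) for x in ids[i:i + batch_size]))
    print(f"# {len(ids)} unique IDs total", file=sys.stderr)

if __name__ == "__main__":
    main(*sys.argv[1:2])

Run against a saved copy of such a listing thread, it would spit out ready-made batches of unique IDs.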
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
Richard1949
Message 1180953 - Posted: 27 Dec 2011, 16:00:50 UTC

"I doubt that SETI is playing favorites.

But on the other hand I have not seen anyone attempt to explain why different host computers can get all the work they want and others get nothing. Since the bandwidth is maxed out anyway there seems to be no interest in pursuing this problem."
--------------------------------------------------
That's what I thought... that they would send out work randomly. But now I wonder. Seems the same people over and over get all they want while others can't get so much as one WU.
Cosmic_Ocean
Message 1180958 - Posted: 27 Dec 2011, 17:04:15 UTC - in response to Message 1180953.  

I doubt that SETI is playing favorites.

But on the other hand I have not seen anyone attempt to explain why different host computers can get all the work they want and others get nothing. Since the bandwidth is maxed out anyway there seems to be no interest in pursuing this problem.

That's what I thought... that they would send out work randomly. But now I wonder. Seems the same people over and over get all they want while others can't get so much as one WU.

I've seen the same thing on my own network (not related to the HE issues). Single-core machine had no problem building a 10-day cache while the [at the time] quad-core machine was struggling to keep 2-3 days of cache. I know I'm not the only one that sees/saw things like that.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
kittyman
Message 1180961 - Posted: 27 Dec 2011, 17:29:06 UTC - in response to Message 1180958.  
Last modified: 27 Dec 2011, 17:47:26 UTC

I doubt that SETI is playing favorites.

But on the other hand I have not seen anyone attempt to explain why different host computers can get all the work they want and others get nothing. Since the bandwidth is maxed out anyway there seems to be no interest in pursuing this problem.

That's what I thought... that they would send out work randomly. But now I wonder. Seems the same people over and over get all they want while others can't get so much as one WU.

I've seen the same thing on my own network (not related to the HE issues). Single-core machine had no problem building a 10-day cache while the [at the time] quad-core machine was struggling to keep 2-3 days of cache. I know I'm not the only one that sees/saw things like that.

What is so odd about that? All things being equal, the quad core is going to do four times the amount of work that the single core would, so it would have to get and successfully download four times the amount of work just to stay even, much less build its cache. So when uploads, downloads, and work requests are not flowing well, it's going to be the first one to feel the pain.
Same goes for GPU hosting rigs... the faster they are right now, the worse off they are.

What makes it really tough on the big rigs is that when things are working OK, as in between shorty storms, they are not now allowed to build a large enough cache to carry them through the times when comms tighten up.

That's what really sux for us right now.
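Just to put rough numbers on that (every figure below is an illustrative assumption, not a measurement), here is the back-of-the-envelope version of why the faster host hurts more:

# Back-of-the-envelope: scheduler requests per day a host needs just to stay
# fed. Every number here is an illustrative assumption, not a measured value.
def requests_per_day(cores, hours_per_task, tasks_per_success, success_rate):
    tasks_needed = cores * 24.0 / hours_per_task          # tasks burned per day
    successes_needed = tasks_needed / tasks_per_success    # successful RPCs needed
    return successes_needed / success_rate                 # total RPCs needed

# A single-core box doing ~1 task/hour, ~95% of its requests answered:
print(round(requests_per_day(1, 1.0, 1, 0.95)))   # about 25 requests/day
# A quad-core box, same app, only ~10% of its requests answered:
print(round(requests_per_day(4, 1.0, 1, 0.10)))   # about 960 requests/day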
"Time is simply the mechanism that keeps everything from happening all at once."

Jon
Message 1180964 - Posted: 27 Dec 2011, 17:47:02 UTC - in response to Message 1180961.  


Same goes for GPU hosting rigs... the faster they are right now, the worse off they are.

What makes it really tough on the big rigs is that when things are working OK, as in between shorty storms, they are not now allowed to build a large enough cache to carry them through the times when comms tighten up.

That's what really sux for us right now.



You said it bro...
Jon
kittyman
Message 1180965 - Posted: 27 Dec 2011, 17:49:18 UTC - in response to Message 1180964.  


Same goes for GPU hosting rigs... the faster they are right now, the worse off they are.

What makes it really tough on the big rigs is that when things are working OK, as in between shorty storms, they are not now allowed to build a large enough cache to carry them through the times when comms tighten up.

That's what really sux for us right now.



You said it bro...

That's why I have been stumping for weeks now for the Admins and Devs to address the BOINC code problems, get them behind us, and get the dang limits lifted.
"Time is simply the mechanism that keeps everything from happening all at once."

kittyman
Message 1180967 - Posted: 27 Dec 2011, 17:51:38 UTC - in response to Message 1180966.  

Shouldn't we be in the middle of the usual Tuesday outage by now? Maybe they'll skip the outage this time, because staff is on leave during Christmas/New Year?

That could be the case, but I am not certain.
If they are on hiatus for the whole week, they may let things limp along as they are, apart from whatever can be addressed remotely.
Or possibly there will be an outage later in the week.

Just dunno for sure.
"Time is simply the mechanism that keeps everything from happening all at once."

Cosmic_Ocean
Message 1180970 - Posted: 27 Dec 2011, 17:55:03 UTC - in response to Message 1180961.  

What is so odd about that? All things being equal, the quad core is going to do four times the amount of work that the single core would, so it would have to get and successfully download four times the amount of work just to stay even, much less build its cache. So when uploads, downloads, and work requests are not flowing well, it's going to be the first one to feel the pain.
Same goes for GPU hosting rigs... the faster they are right now, the worse off they are.

What makes it really tough on the big rigs is that when things are working OK, as in between shorty storms, they are not now allowed to build a large enough cache to carry them through the times when comms tighten up.

That's what really sux for us right now.


I do agree; however, the part that I forgot was that the single-core machine would get at least one task about 95% of the time it asked for work. The quad-core machine would have about a 10% success rate. The slow machine would get its ~50 MBs in fewer than 10 requests, but the quad would have to ask for work 50+ times to get maybe 75.

Something else I'm pondering is whether there is any way to speed up the refill rate for the feeder. I've heard that it refills every two seconds. I wonder if that could be dropped to one second, if that's even possible. That might alleviate a lot of those "project has no tasks available" messages when the server status page shows 200,000+ results waiting to be assigned.
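If the feeder really does work that way, the arithmetic is simple enough. The 100-slot size and 2-second interval below are assumptions for illustration, not the project's actual settings:

# Rough ceiling on how many results the scheduler can hand out, assuming a
# shared-memory array of `slots` results refilled every `interval_sec` seconds.
# The 100-slot / 2-second figures are illustrative assumptions only.
def max_dispatch_per_hour(slots, interval_sec):
    return slots / interval_sec * 3600

print(max_dispatch_per_hour(100, 2.0))  # 180000.0 results/hour at best
print(max_dispatch_per_hour(100, 1.0))  # 360000.0 -- halving the interval doubles the ceiling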
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
kittyman
Message 1180971 - Posted: 27 Dec 2011, 17:57:35 UTC - in response to Message 1180970.  

What is so odd about that? All things being equal, the quad core is going to do four times the amount of work that the single core would, so it would have to get and successfully download four times the amount of work just to stay even, much less build its cache. So when uploads, downloads, and work requests are not flowing well, it's going to be the first one to feel the pain.
Same goes for GPU hosting rigs... the faster they are right now, the worse off they are.

What makes it really tough on the big rigs is that when things are working OK, as in between shorty storms, they are not now allowed to build a large enough cache to carry them through the times when comms tighten up.

That's what really sux for us right now.


I do agree; however, the part that I forgot was that the single-core machine would get at least one task about 95% of the time it asked for work. The quad-core machine would have about a 10% success rate. The slow machine would get its ~50 MBs in fewer than 10 requests, but the quad would have to ask for work 50+ times to get maybe 75.

Something else I'm pondering is whether there is any way to speed up the refill rate for the feeder. I've heard that it refills every two seconds. I wonder if that could be dropped to one second, if that's even possible. That might alleviate a lot of those "project has no tasks available" messages when the server status page shows 200,000+ results waiting to be assigned.

I think optimizing the scheduler is a moot point until such time as there is bandwidth available to support it. My view is that has to happen first, then scheduler or other server based bottlenecks can be addressed as they are identified. You can schedule all the work you want, but if the hosts cannot get it downloaded, it cannot be processed.

"Time is simply the mechanism that keeps everything from happening all at once."

Cosmic_Ocean
Message 1180972 - Posted: 27 Dec 2011, 18:05:24 UTC - in response to Message 1180971.  

I think optimizing the scheduler is a moot point until such time as there is bandwidth available to support it. My view is that has to happen first, then scheduler or other server based bottlenecks can be addressed as they are identified. You can schedule all the work you want, but if the hosts cannot get it downloaded, it cannot be processed.

That is true. And I've stated a few times that if we can get more bandwidth, it may create a whole new pile of problems all by itself by allowing more successful contacts to the database. It's one of those things that we'll just have to wait and see what happens and have some contingency plans lined up for some of the possible scenarios.

However, the good news is that with all of the enterprise-class networking equipment that is in place, we can get an actual gigabit link, but still rate-limit it to 100mbit, or 150mbit, whatever seems to allow the smoothest data transfer while keeping the database from getting DDoS'ed.
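For a rough sense of what those rate limits could carry (the workunit sizes below are approximate, and protocol overhead is ignored):

# Approximate workunit downloads per second at a given rate limit, ignoring
# protocol overhead. Sizes are rough: MB tasks ~366 KB, AP tasks ~8 MB.
def downloads_per_sec(link_mbit, wu_kb):
    return (link_mbit * 1_000_000 / 8) / (wu_kb * 1024)

for mbit in (100, 150, 1000):
    print(f"{mbit:>4} Mbit: {downloads_per_sec(mbit, 366):6.1f} MB WUs/s, "
          f"{downloads_per_sec(mbit, 8 * 1024):5.2f} AP WUs/s")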
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
kittyman
Message 1180974 - Posted: 27 Dec 2011, 18:09:30 UTC - in response to Message 1180972.  
Last modified: 27 Dec 2011, 18:09:53 UTC

I think optimizing the scheduler is a moot point until such time as there is bandwidth available to support it. My view is that has to happen first, then scheduler or other server based bottlenecks can be addressed as they are identified. You can schedule all the work you want, but if the hosts cannot get it downloaded, it cannot be processed.

That is true. And I've stated a few times that if we can get more bandwidth, it may create a whole new pile of problems all by itself by allowing more successful contacts to the database. It's one of those things that we'll just have to wait and see what happens and have some contingency plans lined up for some of the possible scenarios.

However, the good news is that with all of the enterprise-class networking equipment that is in place, we can get an actual gigabit link, but still rate-limit it to 100mbit, or 150mbit, whatever seems to allow the smoothest data transfer while keeping the database from getting DDoS'ed.


Well, if you peruse the information in the GPUUG fundraising thread, you will see that many hardware upgrades are well on their way to being completed, with more to come.

As far as I know, we still do not have a real path in place for upgrading the bandwidth, other than having the project's pleas fall on the deaf ears of the Berk IT admins.
"Time is simply the mechanism that keeps everything from happening all at once."

Josef W. Segur
Message 1181033 - Posted: 28 Dec 2011, 0:01:54 UTC - in response to Message 1180966.  

Shouldn't we be in the middle of the usual Tuesday outage by now? Maybe they'll skip the outage this time, because staff is on leave during Christmas/New Year?


Hmm, the UC Berkeley Academic Calendar shows Monday, Tuesday, Thursday, and Friday as "Academic and Administrative Holiday".
Joe
kittyman
Message 1181034 - Posted: 28 Dec 2011, 0:11:53 UTC - in response to Message 1181033.  

Shouldn't we be in the middle of the usual Tuesday outage by now? Maybe they'll skip the outage this time, because staff is on leave during Christmas/New Year?


Hmm, the UC Berkeley Academic Calendar shows Monday, Tuesday, Thursday, and Friday as "Academic and Administrative Holiday".
Joe

Ahhh....
So it looks like some of our indentured servants may be in the lab tomorrow for an outage party.
"Time is simply the mechanism that keeps everything from happening all at once."

Richard1949
Message 1181044 - Posted: 28 Dec 2011, 0:38:23 UTC

"I do agree, however the part that I forgot was that the single-core machine would get at least one task about 95% of the time it asked for work."
---------------------------------------------------
I can't even get anything for my single core machine.
Richard1949
Message 1181046 - Posted: 28 Dec 2011, 0:41:18 UTC

"Something else I'm pondering is if there is any way to speed up the refill rate for the feeder. I've heard that it fills up every two seconds. I wonder if that can be dropped to 1 second if it's even possible? That may alleviate a lot of those "project has no tasks available" messages when server status shows 200,000+ waiting to be assigned."
----------------------------------------------
I keep getting "not requesting any tasks."
Grant (SSSF)
Message 1181111 - Posted: 28 Dec 2011, 8:21:06 UTC


15min to download 1 WU is a bit of a PITA when it takes less than 3min to do 2.
Grant
Darwin NT
Wiggo
Message 1181112 - Posted: 28 Dec 2011, 8:56:51 UTC - in response to Message 1181111.  


15min to download 1 WU is a bit of a PITA when it takes less than 3min to do 2.

Personally, I still put the current problems down to the connection itself between the USA end of our undersea cable and HE, as using a proxy here quickly clears any backlogs that occur.

Cheers.
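For anyone who wants to check whether the proxy route really is faster on their own line, here is a quick sketch; the test URL and proxy address are placeholders you would swap for a real download link and your own proxy:

# Time the same download directly and through an HTTP proxy.
# The URL and proxy below are placeholders, not real endpoints.
import time
import urllib.request

TEST_URL = "http://downloads.example/some_workunit"   # placeholder download link
PROXY = "http://my.local.proxy:3128"                   # placeholder proxy

def timed_fetch(url, proxy=None):
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy} if proxy else {}))
    start = time.time()
    data = opener.open(url, timeout=60).read()
    return len(data), time.time() - start

for label, proxy in (("direct", None), ("proxied", PROXY)):
    size, secs = timed_fetch(TEST_URL, proxy)
    print(f"{label:8s}: {size / 1024:.0f} KB in {secs:.1f} s "
          f"({size / 1024 / secs:.0f} KB/s)")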
Gundolf Jahn
Message 1181113 - Posted: 28 Dec 2011, 9:05:51 UTC - in response to Message 1181046.  

I keep getting "not requesting any tasks."

And why would that be a server problem when your client doesn't ask for work?

Regards,
Gundolf
Cosmic_Ocean
Message 1181114 - Posted: 28 Dec 2011, 9:15:31 UTC - in response to Message 1181112.  
Last modified: 28 Dec 2011, 9:16:15 UTC


15min to download 1 WU is a bit of a PITA when it takes less than 3min to do 2.

Personally, I still put the current problems down to the connection itself between the USA end of our undersea cable and HE, as using a proxy here quickly clears any backlogs that occur.

Cheers.

Yeah, that certainly would appear to be that under-sea cable. I noticed in my messages tab last night that between "starting download" and "finished download" for an AP, 19 seconds elapsed (~430KB/sec). Of course it was a B3_P1 WU, so it took 24 seconds to error out once processing started. Go figure.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)