Panic Mode On (19) Server problems

Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 914351 - Posted: 5 Jul 2009, 15:51:37 UTC - in response to Message 914341.  

I got 200 very fast (15 min) MB work units. Each one is the same size as a normal work unit but was returned quickly as well, increasing the network traffic. Why does the splitter produce them?

Because the telescope was moving quickly when the raw data was recorded. That data needs to be looked at just like any other, but it takes less time because very little data is gathered from each position on the sky.

It is not a server problem; it's simply a limitation that comes with gathering data while other observers control the telescope motion.
                                                               Joe
ID: 914351

DPRGI - Luivul
Joined: 24 Jan 03
Posts: 17
Credit: 20,639,801
RAC: 0
Italy
Message 914364 - Posted: 5 Jul 2009, 16:23:54 UTC - in response to Message 914351.  

Thanks.

ID: 914364

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914374 - Posted: 5 Jul 2009, 16:54:41 UTC - in response to Message 914299.  


...

Relax and let the system work.

Ian

Posters on this topic can be split into 2 categories:

For those who simply complain about the performance of the system, I would agree with your sentiment. Others have an "engineering" bent to their nature (whether or not it is included in their job title) and are genetically incapable of relaxing when observing a system that seems to be performing sub-optimally. They have to poke it, tweak it, and try to understand how it works and how it can be made to work "better". I know this for a fact...

F.

Yep. Engineering..... The only part of me that is frustrated is that I have all this cash-earning work to do that I can't ignore, which keeps me from working directly on BOINC.
ID: 914374

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914375 - Posted: 5 Jul 2009, 16:55:52 UTC - in response to Message 914279.  


Warning totally off topic
Is that Molly helping you?

Back on topic
Best of luck on your next project.

Yes. She is the Inspector.... Always inspect your work.
ID: 914375

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914378 - Posted: 5 Jul 2009, 17:14:26 UTC - in response to Message 914294.  

It would be relatively simple for the SETI staff to keep things running smoothly at their end, simply by throttling the splitters and schedulers to keep the bandwidth and database size down to levels at which the system works 99.9% reliably.

Ian,

Part of what you're suggesting is controlling the rate at which work is returned by throttling the work going out, and that is simple if all of the work units take about the same amount of time.

The problem (as Joe mentioned indirectly) is that different studies generate different telescope movements, and that's why we're in a run of "shorties" right now.

If you have a run of "long" multibeam work units, followed by batches of ever shorter work units, and ending with shorties, you can get a situation where weeks of work going out wants to all come back in a day or two -- and there you go.

Throttling work outbound to allow for that kind of "worst case" inbound is difficult.

SETI can theoretically sort "tapes" by the study controlling the telescope and try to keep a mix, but some studies get more time than others, and, well, you get the idea.

One thing I've noticed is the "random backoff" -- I just triggered a set of uploads, and I got backoff times ranging from 1 minute to 3 1/2 hours, with an average on my little cruncher of two hours.

If all of the 180,000 active hosts average about 4 uploads waiting, and they all try each upload every couple of hours, that's 360,000 upload attempts per hour, or 6,000 per minute.

It seems to me that giving a project the ability to tune the random backoff would be very helpful. Raise the average back-off to eight hours and you're down to 1,500 per minute, and most of those are likely to be successful.
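In rough numbers (a back-of-the-envelope sketch of the arithmetic above, not how the BOINC client actually schedules retries):

def upload_attempts_per_minute(active_hosts, uploads_per_host, avg_backoff_hours):
    # Every pending upload retries roughly once per average backoff interval.
    pending = active_hosts * uploads_per_host
    return pending / avg_backoff_hours / 60

print(upload_attempts_per_minute(180_000, 4, 2))   # ~6,000 attempts per minute with a 2-hour average
print(upload_attempts_per_minute(180_000, 4, 8))   # ~1,500 attempts per minute with an 8-hour average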

Another thought is to treat each failed upload attempt as a failure for all of the pending uploads, so a single failure backs the whole set off.

At this point these are just ideas. They'd have to be tried.

I think they'd work better than throttling at the splitters, because they'd let the project(s) react to the actual load, and not try to predict the load next week based on what the splitters are doing on a Thursday.

-- Ned


ID: 914378

Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 914411 - Posted: 5 Jul 2009, 18:50:27 UTC - in response to Message 914378.  


One thing I've noticed is the "random backoff" -- I just triggered a set of uploads, and I got backoff times ranging from 1 minute to 3 1/2 hours, with an average on my little cruncher of two hours.

If all of the 180,000 active hosts average about 4 uploads waiting, and they all try each upload every couple of hours, that's 360,000 upload attempts per hour, or 6,000 per minute.

It seems to me that giving a project the ability to tune the random backoff would be very helpful.


There may be something like this in place already. Speaking as an amateur (even though I have engineer in my job title, it's a different kind): I have been forcing transfers every few hours for the last day, getting one or two or three of 10 to 15 through each time. When the unsuccessful transfers "randomly" set a wait time, those that have been there the longest get the longest wait (several hours), while new ones get a few minutes' wait.
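If it is something like the usual exponential backoff with random jitter, a minimal sketch (the base, cap, and doubling rule here are guesses for illustration, not values from the BOINC source) reproduces exactly that pattern -- transfers that have failed many times draw their wait from a much wider window than freshly failed ones:

import random

def next_backoff_seconds(failure_count, base=60, cap=4 * 3600):
    # Each consecutive failure roughly doubles the window, up to a cap,
    # and the actual wait is drawn at random from within that window.
    window = min(cap, base * 2 ** failure_count)
    return random.uniform(base, window)

for failures in (1, 3, 6, 10):
    print(failures, round(next_backoff_seconds(failures) / 60, 1), "minutes")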

ID: 914411

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 914414 - Posted: 5 Jul 2009, 18:58:38 UTC - in response to Message 914411.  

All the people who feel they must sit there and manually hit "retry now" (it's easy to get frustrated when you sit there all day doing that) are actually making the connection problems worse. The back-off period is there to ease the comms on the server, and forcing the comms to happen sooner, at the same time everyone else is doing it, only creates a distributed denial-of-service attack on the SETI servers.

This may be an excellent argument for removing any option of forcing communications within BOINC.
ID: 914414

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914425 - Posted: 5 Jul 2009, 19:33:15 UTC - in response to Message 914411.  


One thing I've noticed is the "random backoff" -- I just triggered a set of uploads, and I got backoff times ranging from 1 minute to 3 1/2 hours, with an average on my little cruncher of two hours.

If all of the 180,000 active hosts average about 4 uploads waiting, and they all try each upload every couple of hours, that's 360,000 upload attempts per hour, or 6,000 per minute.

It seems to me that giving a project the ability to tune the random backoff would be very helpful.


There may be something like this in place already. Speaking as an amateur (even though I have engineer in my job title, it's a different kind): I have been forcing transfers every few hours for the last day, getting one or two or three of 10 to 15 through each time. When the unsuccessful transfers "randomly" set a wait time, those that have been there the longest get the longest wait (several hours), while new ones get a few minutes' wait.

Yes, but if you have 20 work units pending, all 20 of them retry each time you force a transfer.

Those 20 transfers are adding load. I'm talking about reducing load to somewhere near the optimal level -- the point at which every upload attempt completes at near maximum speed.

Let's take as our example someone on a DSL line with 768k up.

When no one is uploading, his uploads go at 768k.

As total upload traffic increases but stays below about 80 Mbit/s on the link, his uploads still go at 768k.

Let traffic grow about 50% beyond that point, and his uploads still go through, but at something less than his peak speed, maybe 400k to 500k, due to congestion.

Go up dramatically from there, and the odds are good that he can't connect due to the number of others connecting. If he can connect, he may not be able to get enough frames through with all the competition for bandwidth.

Slow the clients way down, reduce the number of quick retries, and you can push back to somewhere between the second and third scenario above -- reduce the congestion and you increase throughput, and the higher throughput means everything gets caught up much faster.

As I write this, most of the bandwidth is wasted because it is divided across so many users that most slices are just too small.
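As a toy model of that slicing effect (every number here is invented for illustration, not measured):

LINK_KBIT = 100_000      # assumed shared uplink at the project
HOST_KBIT = 768          # the DSL uploader in the example above
MIN_USABLE_KBIT = 64     # assumed floor below which a transfer just times out

def useful_throughput(concurrent_uploads):
    # The link is shared equally; slices below the usable floor stall and
    # contribute nothing, so the useful total collapses past a point.
    slice_kbit = min(HOST_KBIT, LINK_KBIT / concurrent_uploads)
    return 0 if slice_kbit < MIN_USABLE_KBIT else slice_kbit * concurrent_uploads

for n in (100, 500, 1000, 2000):
    print(n, "uploads ->", int(useful_throughput(n)), "kbit/s of useful transfer")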
ID: 914425

zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65757
Credit: 55,293,173
RAC: 49
United States
Message 914465 - Posted: 5 Jul 2009, 20:49:20 UTC - in response to Message 914283.  

Next subject..........

New color.......

Next reality.........

WTF........

Where are the Fish?


Really. Where ARE the freakin' fish?

No hope........no numbers........no nothing............

Yes, I will crunch forever........because I must......or choose to


And I hope the rest of you will too............
Hon.........it's a lot.

I just got done having a problem with Verizon DSL (a line card went poof, I think; Verizon didn't say) for about 24 hours, and now I find S@H is doing this:

7/5/2009 1:47:05 PM SETI@home Temporarily failed upload of 01mr09aa.26781.24612.3.8.56_2_0: HTTP error
7/5/2009 1:47:05 PM SETI@home Backing off 4 min 28 sec on upload of 01mr09aa.26781.24612.3.8.56_2_0
7/5/2009 1:47:05 PM SETI@home Started upload of 20no08ag.15033.2118.11.8.109_0_0
7/5/2009 1:47:06 PM Internet access OK - project servers may be temporarily down.
7/5/2009 1:47:12 PM Project communication failed: attempting access to reference site


At least everything here is good, finally. And yeah, I'm trying to upload. :D
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 914465

Balveda*
Joined: 20 Oct 08
Posts: 310
Credit: 376,456
RAC: 0
Message 914466 - Posted: 5 Jul 2009, 20:49:54 UTC

Panic looks to be over, downloads are now flooding in! Including those lovely Cudas!
Balveda :D
ID: 914466

zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65757
Credit: 55,293,173
RAC: 49
United States
Message 914467 - Posted: 5 Jul 2009, 20:52:20 UTC - in response to Message 914466.  

Panic looks to be over, downloads are now flooding in! Including those lovely Cudas!
Balveda :D

Yeah, I post and everything goes from haywire to working. Wow. Oh well, it's working. My time in Verizon DSL Purgatory is over, I hope...
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 914467

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914468 - Posted: 5 Jul 2009, 20:53:01 UTC - in response to Message 914467.  

Panic looks to be over, downloads are now flooding in! Including those lovely Cudas!
Balveda :D

Yeah, I post and everything goes from haywire to working. Wow. Oh well, it's working. My time in Verizon DSL Purgatory is over, I hope...

This is the same Verizon that had Darth Vader as their celebrity spokesperson, right?
ID: 914468

zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65757
Credit: 55,293,173
RAC: 49
United States
Message 914471 - Posted: 5 Jul 2009, 20:55:28 UTC - in response to Message 914468.  

Panic looks to be over, downloads are now flooding in! Including those lovely Cudas!
Balveda :D

Yeah, I post and everything goes from haywire to working. Wow. Oh well, it's working. My time in Verizon DSL Purgatory is over, I hope...

This is the same Verizon that had Darth Vader as their celebrity spokesperson, right?

When was this? If it was before 2004, I don't know.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 914471

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914473 - Posted: 5 Jul 2009, 21:02:07 UTC - in response to Message 914471.  


This is the same Verizon that had Darth Vader as their celebrity spokesperson, right?

When was this? If It was previous to 2004, I don't know.

I haven't seen one for a while, but James Earl Jones was the voice of Darth Vader, and (later) pitchman for Verizon.
ID: 914473

Ianab
Volunteer tester
Joined: 11 Jun 08
Posts: 732
Credit: 20,635,586
RAC: 5
New Zealand
Message 914479 - Posted: 5 Jul 2009, 21:27:31 UTC

I think they'd work better than throttling at the splitters, because they'd let the project(s) react to the actual load, and not try to predict the load next week based on what the splitters are doing on a Thursday.


That would certainly help with the current upload problems, but many of the recent glitches have been other issues: database speed and reliability, etc.

But at the end of the day I think that most methods of throttling will be a bit pointless. People are annoyed because they can't upload and get new work units. Would they be any less annoyed if they could upload, and then the server kept saying "no new work" because it was being throttled to keep the uploads working?

Also, like you say, a simple throttling system would have to cater for a worst-case situation -- lots of short workunits coming online after a scheduled outage, maybe? Then when the system was running normally, it would be throttled to much less than its max throughput. And we don't want that.

The other option would be some sort of smart throttling system, but that would have to monitor several parameters, like network traffic (both ways), database size and response times. Chances are it would cause more outages than it fixed anyway.

Last thought: if any one bottleneck does get fixed, it's just going to highlight the next one.

Upgrade the network link -- the upload/download servers fold under peak load. Upgrade those servers -- the database runs out of disk space. Upgrade the disk space -- then the database servers crash trying to run the bigger database.

We have now spent half a million dollars that the project doesn't have.

Or we can accept that there will be some outages, and that the team is tweaking the hardware and the whole system to work as well as can reasonably be expected.

That doesn't mean we can't complain a little when things aren't running right, of course ;-)

Ian
ID: 914479

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914485 - Posted: 5 Jul 2009, 21:38:24 UTC - in response to Message 914479.  

I think they'd work better than throttling at the splitters, because they'd let the project(s) react to the actual load, and not try to predict the load next week based on what the splitters are doing on a Thursday.


That would certainly help with the current upload problems, but many of the recent glitches have been other issues: database speed and reliability, etc.

But at the end of the day I think that most methods of throttling will be a bit pointless. People are annoyed because they can't upload and get new work units. Would they be any less annoyed if they could upload, and then the server kept saying "no new work" because it was being throttled to keep the uploads working?

The theory is, unthrottled, we see load that exceeds the available resources, and when that happens, efficiency goes down.

If (as an example) 80% of the bandwidth is going to failed transactions, one could throttle and get that down to 20% waste (or less) -- and we'd actually get through the bad times 4 times faster.

Faster recovery = less frustration.
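The same point as a quick calculation (the 80%/20% figures are the hypothetical example above, not measurements):

wasted_now, wasted_throttled = 0.80, 0.20
speedup = (1 - wasted_throttled) / (1 - wasted_now)  # 0.8 / 0.2 = 4x faster recovery
print(speedup)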


Also, like you say, a simple throttling system would have to cater for a worst-case situation -- lots of short workunits coming online after a scheduled outage, maybe? Then when the system was running normally, it would be throttled to much less than its max throughput. And we don't want that.

The other option would be some sort of smart throttling system, but that would have to monitor several parameters, like network traffic (both ways), database size and response times. Chances are it would cause more outages than it fixed anyway.

I'm thinking of a "knob" that the project could adjust in (reasonably) real time. Take the recent events: Matt posted about this on Thursday, and the best he could do was say "it's probably best to wait it out" -- if there were a way to tell clients "please slow down", he'd be able to do that, and turn it back up when the problem was gone.

There needs to be a way to broadcast that information that doesn't rely on any one BOINC component or the current "wire" -- and there is: DNS.

I'm willing to take a shot at the code to do that. I think most of it is there (through libcurl).

A project could (for example) write a script that updated the DNS (TXT) record dynamically, adjusting up and down based on the cricket graphs -- but before you can automate the knob, you have to have the knob.
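Purely as an illustration of what such a knob could look like on the client side (a rough sketch, not anything that exists in BOINC -- the record name and "factor=" format are made up, and it uses the dnspython library rather than libcurl):

import dns.resolver  # dnspython, assumed to be installed

def read_throttle_factor(name="throttle.setiathome.berkeley.edu", default=1.0):
    # The project would publish a backoff multiplier as a TXT record,
    # e.g. "factor=4"; clients scale their upload backoff by it.
    try:
        for rdata in dns.resolver.resolve(name, "TXT"):
            txt = b"".join(rdata.strings).decode()
            if txt.startswith("factor="):
                return float(txt.split("=", 1)[1])
    except Exception:
        pass  # DNS unreachable or record missing: fall back to normal behaviour
    return default

backoff_seconds = 600 * read_throttle_factor()  # scale an ordinary 10-minute backoff

A project-side script could then rewrite that record from the cricket graphs, which is the "automate the knob" part.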

Last thought: if any one bottleneck does get fixed, it's just going to highlight the next one.

Upgrade the network link -- the upload/download servers fold under peak load. Upgrade those servers -- the database runs out of disk space. Upgrade the disk space -- then the database servers crash trying to run the bigger database.

We have now spent half a million dollars that the project doesn't have.

Or we can accept that there will be some outages, and that the team is tweaking the hardware and the whole system to work as well as can reasonably be expected.

That doesn't mean we can't complain a little when things aren't running right, of course ;-)

Ian

While I agree that solving one issue is likely to show others, I also think that each time we get just a little bit farther -- so I think it's worth doing.
ID: 914485

Nemesis
Joined: 14 Mar 07
Posts: 129
Credit: 31,295,655
RAC: 0
Canada
Message 914511 - Posted: 5 Jul 2009, 22:59:23 UTC

My 2 cents worth...

I have no work left to run -- none, zilch, nada... I have burned through my whole 4-day cache while trying either to get new WUs or to get the ones I have completed to finish uploading... The cruncher is hungry and not very happy.

I do have 30 WUs still trying to upload, and most of them have been trying for the last 2 days -- not a happy state of affairs, to say the least...
ID: 914511

BarryAZ
Joined: 1 Apr 01
Posts: 2580
Credit: 16,982,517
RAC: 0
United States
Message 914523 - Posted: 5 Jul 2009, 23:43:18 UTC - in response to Message 914269.  

OK -- I think I was responding to the comment wondering about the SETI project folks looking for new users, not a comment wondering about the BOINC group looking for new projects. At the BOINC group level, getting more projects makes a lot of sense. At the SETI project level, it isn't clear to me that pushing for additional users to support is in order at the moment (unless the focus is to get more users donating money and hardware).

As to interest in other projects -- fair point: if the only thing which strikes one's fancy is SETI, then other projects won't interest. That single-project focus was what drove SETI Classic. For some folks, the BOINC multi-project approach, with the attendant complications and foibles of the BOINC client in its scores of iterations over the years, resulted in a significant drop-off and certainly a shift in the nature of the overall user community.

Some folks are very much single-project folks (especially if they are SETI folks), some are not (particularly if they are not primarily SETI folks to begin with).



From a BOINC point of view, their users are projects, not individual crunchers.

It's also hard to sell someone like me on protein folding or climate prediction because they just don't stir the imagination.


ID: 914523

BarryAZ
Joined: 1 Apr 01
Posts: 2580
Credit: 16,982,517
RAC: 0
United States
Message 914525 - Posted: 5 Jul 2009, 23:54:19 UTC - in response to Message 914414.  

Ah yes, control those silly users <smile>.

Before doing that, I'd love to see the BOINC developers clean up existing open tickets on the BOINC client -- there are enough of those. And then of course fulfill the plan for ATI GPU support -- that would be very nice.

I take your point about the retry button when uploads are "stuck". I'd note that I don't encounter this with the other projects I work with, though. There was a time when I hit that retry button for uploads, but these days, instead of doing that, and given that I'm one of those multiple-project folks, I use a different approach: I temporarily suspend SETI to reduce the number of "stuck" uploads being generated for SETI until the existing stuck uploads clear. Since SETI is the ONLY project I work with that is in that kind of overload mode, this works for me.

Similarly, as I mentioned earlier in this thread, I've added projects and reduced the SETI share generally. This project doesn't really need extra user CPU cycles right now, and other BOINC projects seem to be handling their much smaller workload just fine.





This may be an excellent argument for removing any option of forcing communications within BOINC.


ID: 914525

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914533 - Posted: 6 Jul 2009, 0:17:21 UTC - in response to Message 914523.  

OK -- I think I was responding to the comment wondering about the SETI project folks looking for new users, not a comment wondering about the BOINC group looking for new projects.

I read your comment as being about SETI trying to attract more crunchers to BOINC in general, and I think that ignores the fundamental fact that SETI@Home exists for one purpose alone: to search for signals in their recorded data.

They use BOINC as their platform and recommend joining more projects because that helps even out the load, but that isn't their primary goal.

... and that's true of every project.

It doesn't make sense for SETI to use their resources to promote others.

ID: 914533