Panic Mode On (12) Server problems

Zydor

Joined: 4 Oct 03
Posts: 172
Credit: 491,111
RAC: 0
United Kingdom
Message 868075 - Posted: 22 Feb 2009, 15:43:07 UTC - in response to Message 868067.  

People are now going from single-core to multi-core, in half the cases combined with an NVIDIA card. I can't put a multiplier on that one, but I'll bet it's more than 3x.
............ This percentage will only increase
............ bandwidth will be an even bigger issue in the future.
............ But you have to start somewhere.


- I recently went from an FX60/7800GTX to a Phenom II/9800GTX (and because of that came back to SETI, using the 9800GTX for it while the CPUs crunch for ClimatePrediction), and overall my output went up 9x.

- The percentage will rocket, as a GPU will crunch up to 10x faster than a CPU in general terms. Until ATI comes out with something similar, NVIDIA will do "very nicely, thank you"; it was a good strategic marketing move.

- Bandwidth will be the biggest issue with the sheer quantity of WUs flowing.

- Absolutely, there will always be bottlenecks, bandwidth included, whatever is bought. However, to get a large-scale system moving, bandwidth is always the pole it revolves around: no matter what is upgraded elsewhere, it is a wasted resource unless the bandwidth can accommodate the increase. It's easy to make a case that sub-system X also has an effect, yadda yadda, but the common factor in every system enhancement is getting the information in and out, and in large-scale systems that means bandwidth.
Rob.B

Joined: 23 Jul 99
Posts: 157
Credit: 1,439,682
RAC: 0
United Kingdom
Message 868078 - Posted: 22 Feb 2009, 15:45:53 UTC - in response to Message 868029.  


Rob.B is absolutely right: loop until whenever. 'Whenever', in this context, is the daily quota, which doesn't distinguish between AP and MB. For instance, my quads (with one CUDA card each) have a daily quota of 900 tasks. If they were trashing AP (which they're not), I could request 7 gigabytes per day of AP tasks. And provided I uploaded and reported just 7 multibeam (e.g. CUDA) tasks per day, the quota would be reset to maximum. That's another safety which has been short-circuited by the multi-application model.


The quota reset after 7 units described in the quote could easily be improved upon. It's not a total solution, but it would limit the damage.

Instead of reinstating the whole quota after the "clean" return of a small number of WUs, stage the quota restore: five clean units get you 10% back, 10 clean units get you 20% back, and so on. With this in place, if a client begins to recover (or follows the quoted scenario) and then loses the plot again, the mean time to block is vastly reduced.

Not a solution, but better than the mess that's happening now.
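
For illustration, the staged restore proposed above could be sketched roughly as follows. This is not BOINC scheduler code: the function name `staged_quota`, its parameters, and the floor of one task per day are all assumptions made for the example.

```python
def staged_quota(max_quota: int, clean_streak: int,
                 step_tasks: int = 5, step_percent: int = 10) -> int:
    """Daily quota allowed after `clean_streak` consecutive clean results.

    Hypothetical rule: every `step_tasks` clean results restore another
    `step_percent` of the maximum quota, capped at 100%.
    """
    steps = clean_streak // step_tasks
    percent = min(100, steps * step_percent)
    # Assumption: always allow at least one task so a host can prove itself.
    return max(1, max_quota * percent // 100)


if __name__ == "__main__":
    # With a 900-task maximum, 7 clean results restore only 10% of the
    # quota, instead of resetting it to the maximum.
    for streak in (0, 5, 7, 10, 50):
        print(streak, staged_quota(900, streak))
```

Under these assumptions, a host that returns a handful of good tasks and then starts trashing work again has far less quota left to burn through before it is blocked.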
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 868080 - Posted: 22 Feb 2009, 15:56:10 UTC - in response to Message 868078.  


Rob.B is absolutely right: loop until whenever. 'Whenever', in this context, is the daily quota, which doesn't distinguish between AP and MB. For instance, my quads (with one CUDA card each) have a daily quota of 900 tasks. If they were trashing AP (which they're not), I could request 7 gigabytes per day of AP tasks. And provided I uploaded and reported just 7 multibeam (e.g. CUDA) tasks per day, the quota would be reset to maximum. That's another safety which has been short-circuited by the multi-application model.


The quota reset after 7 units described in the quote could easily be improved upon. It's not a total solution, but it would limit the damage.

Instead of reinstating the whole quota after the "clean" return of a small number of WUs, stage the quota restore: five clean units get you 10% back, 10 clean units get you 20% back, and so on. With this in place, if a client begins to recover (or follows the quoted scenario) and then loses the plot again, the mean time to block is vastly reduced.

Not a solution, but better than the mess that's happening now.

The way it actually works is one bad task = -1 quota, one good task = 2x current quota. I think it should be more along the lines of +2 and not 2x. That would greatly help the "damage control" efforts that the quota tries to do anyway.
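
To make the difference between the two rules concrete, here is a small sketch that counts how many good results it takes to climb back to the maximum under each. It only models the rules as stated in this post (bad task = -1, good task = 2x, or the suggested +2); the function names, the `mode` strings, and the floor of 1 are assumptions, and this is not the actual scheduler code.

```python
def apply_result(quota: int, ok: bool, max_quota: int, mode: str) -> int:
    """Return the new daily quota after one reported result.

    mode="double": a good task doubles the quota (the rule described above).
    mode="add":    a good task adds 2 (the suggested alternative).
    A bad task subtracts 1 in both modes.
    """
    if not ok:
        return max(1, quota - 1)  # assumed floor of one task per day
    if mode == "double":
        return min(max_quota, quota * 2)
    return min(max_quota, quota + 2)


def good_results_to_recover(start: int, max_quota: int, mode: str) -> int:
    """Count consecutive good results needed to get from `start` to the maximum."""
    quota, count = start, 0
    while quota < max_quota:
        quota = apply_result(quota, True, max_quota, mode)
        count += 1
    return count


if __name__ == "__main__":
    for mode in ("double", "add"):
        print(mode, good_results_to_recover(1, 900, mode))
```

Starting from a quota of 1 with a 900-task maximum, doubling needs only a handful of good results to recover fully, while the additive rule needs hundreds, which is the "damage control" effect being argued for.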
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
suki quin
Joined: 12 Oct 08
Posts: 81
Credit: 1,053,392
RAC: 0
United States
Message 868091 - Posted: 22 Feb 2009, 16:30:46 UTC - in response to Message 867946.  
Last modified: 22 Feb 2009, 16:41:15 UTC

Thank you littlegreenmanfrommars for posting this a while ago:
"Best advice in such a situation is to turn off network activity for a while.
(BOINC tool menu > select Network activity suspended)."
- I have wanted to know what to do in this situation since I joined.
Competition is so intense; I am happy to just slow down a little for the server's sake.
keep telescopic listening devices aimed at the Zenith of the Horizon
Rob.B

Joined: 23 Jul 99
Posts: 157
Credit: 1,439,682
RAC: 0
United Kingdom
Message 868099 - Posted: 22 Feb 2009, 16:51:29 UTC - in response to Message 868080.  

The way it actually works is one bad task = -1 quota, one good task = 2x current quota. I think it should be more along the lines of +2 and not 2x. That would greatly help the "damage control" efforts that the quota tries to do anyway.


Sounds good to me: block errant clients quickly, then restore quota as they once more prove themselves stable.
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 868103 - Posted: 22 Feb 2009, 16:56:46 UTC - in response to Message 867985.  

... a temporary stop to AP downloads, until the cause of the anomaly can be investigated and corrected. Once the runaway download train is brought under control, uploads will look after themselves.

After having a few more looks at Scarecrow's AP graphs throughout the day, this measure gets my vote. It's all those AP units being downloaded that's clogging up the pipe.

OK, had a night's sleep and I think I've found the problem - well, the next stage in the chain.

Have a look at WU 417685549. Downloaded seven times, mine is the only one which is running - every other copy failed because they couldn't download the executable file. All my recent AP allocations look like that, though this is the most extreme.

Eric needs to turn on the 'proxy server' distribution channel used when new MB executables threaten to clog the pipes - or AP distribution needs to be restricted to those who have manually downloaded and installed the new Lunatics r112 optimisation for Astropulse_v5 (plug!).


As far as I can tell, they have Coral Cache turned on for the AP v5 as this link still works. The problem is a lot of the antivirus apps mark the redirect as suspect and do not allow the download to happen.

Of course, then we get the task errors because there is no app to process the work.




 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.