Panic Mode On (74) Server problems?



Message boards : Number crunching : Panic Mode On (74) Server problems?

Author Message
msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37308
Credit: 499,207,246
RAC: 505,917
United States
Message 1229549 - Posted: 9 May 2012, 15:14:34 UTC

My guess is that we are again seeing some kind of scheduler/feeder limitation.
I agree that even with no AP using bandwidth, MB alone has shown the capability of fully saturating the bandwidth.

On the other hand, NOT saturating the bandwidth may actually be making better use of it......
____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 3570
Credit: 98,000,723
RAC: 79,184
United States
Message 1229581 - Posted: 9 May 2012, 16:33:38 UTC - in response to Message 1229549.

My guess is that we are again seeing some kind of scheduler/feeder limitation.
I agree that even with no AP using bandwidth, MB alone has shown the capability of fully saturating the bandwidth.

On the other hand, NOT saturating the bandwidth may actually be making better use of it......

My machines are no longer uploading/requesting tasks 1 or 2 at a time, as they seem to have filled to their cache settings. So we may be looking at a normal bandwidth graph again, which is how it would often look in the days before limits, sans AP or shorties.
Not to say all requests are being fulfilled, just that there are not so many transfers in progress to keep the bandwidth pegged 24/7.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37308
Credit: 499,207,246
RAC: 505,917
United States
Message 1230054 - Posted: 10 May 2012, 16:36:54 UTC
Last modified: 10 May 2012, 16:37:54 UTC

With the increased limits and the scheduler/feeder not having tasks available all the time....

The dang Boinc scheduler bug is kicking up again.

My #1 rig, not banging up against the limits anymore, is getting plenty of work for the GPU, but the scheduler is once again letting the CPUs go idle, not sending them a drop of work because the GPU cache is not full yet.
So the CPUs are twiddling their thumbs.

Dang it, DA....please quit starving the slower resources completely just because the fastest ones do not have their caches full!!!
____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 3570
Credit: 98,000,723
RAC: 79,184
United States
Message 1230066 - Posted: 10 May 2012, 17:00:42 UTC - in response to Message 1230054.

With the increased limits and the scheduler/feeder not having tasks available all the time....

The dang Boinc scheduler bug is kicking up again.

My #1 rig, not banging up against the limits anymore, is getting plenty of work for the GPU, but the scheduler is once again letting the CPUs go idle, not sending them a drop of work because the GPU cache is not full yet.
So the CPUs are twiddling their thumbs.

Dang it, DA....please quit starving the slower resources completely just because the fastest ones do not have their caches full!!!

I thought there was talk about that being corrected in the v7 client, but then there is the odd high/low work fetch system it uses.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37308
Credit: 499,207,246
RAC: 505,917
United States
Message 1230068 - Posted: 10 May 2012, 17:07:14 UTC - in response to Message 1230066.

With the increased limits and the scheduler/feeder not having tasks available all the time....

The dang Boinc scheduler bug is kicking up again.

My #1 rig, not banging up against the limits anymore, is getting plenty of work for the GPU, but the scheduler is once again letting the CPUs go idle, not sending them a drop of work because the GPU cache is not full yet.
So the CPUs are twiddling their thumbs.

Dang it, DA....please quit starving the slower resources completely just because the fastest ones do not have their caches full!!!

I thought there was talk about that being corrected in the v7 client, but then there is the odd high/low work fetch system it uses.


I don't believe this has ANYTHING to do with the Boinc client.
The host continually asks for GPU 'AND' CPU tasks, but is repeatedly ONLY sent GPU work.
____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

Alex Storey
Volunteer tester
Joined: 14 Jun 04
Posts: 533
Credit: 1,575,159
RAC: 476
Greece
Message 1230070 - Posted: 10 May 2012, 17:11:00 UTC - in response to Message 1230068.

I don't believe ANYTHING that has to do with the Boinc client.


There, I fixed it:)

red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,024,991
RAC: 0
United Kingdom
Message 1230072 - Posted: 10 May 2012, 17:13:57 UTC - in response to Message 1230068.
Last modified: 10 May 2012, 17:52:33 UTC

I thought there was talk about that being corrected in the v7 client, but then there is the odd high/low work fetch system it uses.

No, I have 7.0.25 on my QX6700 and it's got the same problem, so having V7 does not help with this server issue.

I would like to see a bigger fifo so fewer requests are needed to replenish the cache.

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37308
Credit: 499,207,246
RAC: 505,917
United States
Message 1230073 - Posted: 10 May 2012, 17:15:03 UTC - in response to Message 1230072.

I thought there was talk about that being corrected in the v7 client, but then there is the odd high/low work fetch system it uses.

No, I have 7.0.25 on my QX6700 and it's got the same problem.

It's not the client....
It's what the scheduler logic does with the client request.
____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,946,776
RAC: 13,604
United Kingdom
Message 1230078 - Posted: 10 May 2012, 17:33:06 UTC - in response to Message 1230073.

I thought there was talk about that being corrected in the v7 client, but then there is the odd high/low work fetch system it uses.

No, I have 7.0.25 on my QX6700 and it's got the same problem.

It's not the client....
It's what the scheduler logic does with the client request.

And by scheduler, Mark means the scheduler that runs on the server - that is indeed where this particular problem lies.

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37308
Credit: 499,207,246
RAC: 505,917
United States
Message 1230081 - Posted: 10 May 2012, 17:34:51 UTC - in response to Message 1230078.
Last modified: 10 May 2012, 17:51:47 UTC

I thought there was talk about that being corrected in the v7 client, but then there is the odd high/low work fetch system it uses.

No, I have 7.0.25 on my QX6700 and it's got the same problem.

It's not the client....
It's what the scheduler logic does with the client request.

And by scheduler, Mark means the scheduler that runs on the server - that is indeed where this particular problem lies.

Thank you, Richard.

Of my top 3 rigs, 2 are now running GPU only due to this bug.
The only reason the 3rd is not is that the CPU is running on cached AP work with the manually installed AP app. Otherwise, it would be in the same boat.
____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,024,991
RAC: 0
United Kingdom
Message 1230092 - Posted: 10 May 2012, 17:57:42 UTC - in response to Message 1230081.
Last modified: 10 May 2012, 18:00:21 UTC

If you stop BOINC and set a biggish duration_correction_factor you will just get CPU work for a while. The reason my 980X gets CPU WUs is that the DCF jumps to 6 when a slow GPU task finishes, and the system just asks for CPU WUs till it drops.

Wow, the 980X has just hit 4,000 WUs cached.
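
For readers unfamiliar with the mechanism red-ray is describing, here is a minimal sketch of how a duration_correction_factor (DCF) spike skews work fetch. This is illustrative Python only, not actual BOINC client code; the update rule and the numbers are simplified assumptions:

```python
# Illustrative sketch of BOINC-style duration correction (not real BOINC code).
# The client scales the server's raw runtime estimate by the host's DCF, which
# is (roughly) pulled up sharply by slow finishes and decays down only slowly.

def estimated_runtime(raw_estimate_hours, dcf):
    """Client-side estimate: raw server estimate scaled by the host DCF."""
    return raw_estimate_hours * dcf

def update_dcf(dcf, raw_estimate_hours, actual_hours):
    """Crude model of the asymmetric DCF update: jump up fast, drift down slowly."""
    ratio = actual_hours / raw_estimate_hours
    if ratio > dcf:
        return ratio                      # one slow task drags DCF straight up
    return dcf + 0.1 * (ratio - dcf)      # fast tasks only pull it down gradually

dcf = 1.0
# One pathologically slow GPU task (estimated 0.5 h, actual 3 h) spikes DCF to 6:
dcf = update_dcf(dcf, 0.5, 3.0)

# With DCF at 6, a GPU cache estimated at 50 raw hours now "looks like" 300 hours
# of work, so the client stops asking for GPU tasks until DCF decays again,
# and only the under-committed resource (the CPU) keeps requesting work.
inflated = estimated_runtime(50, dcf)
print(dcf, inflated)  # 6.0 300.0
```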

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37308
Credit: 499,207,246
RAC: 505,917
United States
Message 1230096 - Posted: 10 May 2012, 18:05:48 UTC - in response to Message 1230092.

If you stop BOINC and set a biggish duration_correction_factor you will just get CPU work for a while. The reason my 980X gets CPU WUs is that the DCF jumps to 6 when a slow GPU task finishes, and the system just asks for CPU WUs till it drops.

Wow, the 980X has just hit 4,000 WUs cached.

I have enough GPU work to last a bit, so I am going to do the 'uncheck use nvidia GPU' trick to get some CPU work flowing.

But that is a workaround, and should not be necessary.


____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1230097 - Posted: 10 May 2012, 18:06:25 UTC - in response to Message 1230072.

I would like to see a bigger fifo so fewer requests are needed to replenish the cache.


It's called the feeder.

The usual workaround is to disable the resource in the project prefs that is getting all the tasks, until the 'slower' has some sort of cache.

The other option would be to reduce cache, allow the slower resource to catch up and then gradually increase cache again.

It will eventually get sorted by itself, but if you have a large cache to fill, it may take quite a while until you have single resource requests again instead of double ones.
____________
I'm not the Pope. I don't speak Ex Cathedra!
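
The workaround LadyL describes can be sketched as follows. This is a hypothetical Python model, not real BOINC scheduler code; the function names and numbers are invented for illustration. The point is that a scheduler which fills the GPU part of a mixed request first can drain the available tasks before the CPU part is considered, while a CPU-only request (GPU unchecked in the project prefs) gets CPU work straight away:

```python
# Hypothetical model of the starvation pattern discussed in this thread
# (not actual BOINC code; names and numbers are made up for illustration).

def build_request(prefs, cpu_shortfall, gpu_shortfall):
    """A host asks for work for every enabled resource that has a shortfall."""
    req = {}
    if cpu_shortfall > 0:
        req["cpu_seconds"] = cpu_shortfall
    if prefs.get("use_gpu", True) and gpu_shortfall > 0:
        req["gpu_seconds"] = gpu_shortfall
    return req

def serve_request(req, feeder_tasks):
    """Toy scheduler that satisfies the GPU part of a request first."""
    sent = {"gpu": 0, "cpu": 0}
    gpu_want = req.get("gpu_seconds", 0)
    cpu_want = req.get("cpu_seconds", 0)
    while feeder_tasks and gpu_want > 0:    # GPU-first ordering is the problem
        gpu_want -= feeder_tasks.pop(0)
        sent["gpu"] += 1
    while feeder_tasks and cpu_want > 0:    # CPU only sees what is left over
        cpu_want -= feeder_tasks.pop(0)
        sent["cpu"] += 1
    return sent

# Ten 1-hour tasks available; the host wants ~11 GPU-hours and ~5.5 CPU-hours.
both = serve_request(build_request({"use_gpu": True}, 20000, 40000), [3600] * 10)
cpu_only = serve_request(build_request({"use_gpu": False}, 20000, 40000), [3600] * 10)
print(both, cpu_only)  # {'gpu': 10, 'cpu': 0} {'gpu': 0, 'cpu': 6}
```

With both resources enabled, the GPU request swallows everything and the CPUs sit idle; turn the GPU off in the prefs and the same host finally gets CPU work.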

Sten-Arne
Volunteer tester
Joined: 1 Nov 08
Posts: 3307
Credit: 16,307,047
RAC: 13,783
Sweden
Message 1230136 - Posted: 10 May 2012, 19:29:33 UTC

Who stole all APs? Or, who stole the AP splitters?
____________

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2204
Credit: 8,013,519
RAC: 4,280
United States
Message 1230235 - Posted: 10 May 2012, 21:43:44 UTC - in response to Message 1230136.

Who stole all APs? Or, who stole the AP splitters?

It may be that since AP_v505 is now down to 8 still out in the field, new tapes will be held until the last of those 505's comes in and they can do that DB kick thing (which may have already been done since "awaiting validation" isn't 10k+ anymore).

Or.. it could just be that there were so many tapes loaded up in the first place that we're now at that point where we have to sit around and wait for the MB splitters to catch up to get some new tapes loaded for AP to split.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

arkayn
Volunteer tester
Joined: 14 May 99
Posts: 3543
Credit: 46,151,801
RAC: 30,644
United States
Message 1230303 - Posted: 10 May 2012, 23:27:22 UTC - in response to Message 1230235.

Who stole all APs? Or, who stole the AP splitters?

It may be that since AP_v505 is now down to 8 still out in the field, new tapes will be held until the last of those 505's comes in and they can do that DB kick thing (which may have already been done since "awaiting validation" isn't 10k+ anymore).

Or.. it could just be that there were so many tapes loaded up in the first place that we're now at that point where we have to sit around and wait for the MB splitters to catch up to get some new tapes loaded for AP to split.


I am down to 18 from 65 a couple of days ago.
____________

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2204
Credit: 8,013,519
RAC: 4,280
United States
Message 1230426 - Posted: 11 May 2012, 6:05:02 UTC

Well, this is just starting to be almost slightly irritating. Because of the adjustments to the estimates, I ended up with like a 22-day AP-only cache, and therefore my average turnaround time was in the high teens. The result of this was that most of my wingmates were waiting for me, so I ended up with nearly every reported result being validated immediately.

But since there hasn't been new work going out and my cache is now down in the ~8-day range, I'm starting to pick up more and more pendings when I report. Oh well. That's the way it goes.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5564
Credit: 51,346,693
RAC: 40,582
Australia
Message 1230437 - Posted: 11 May 2012, 6:27:44 UTC


I still reckon something's not quite right.
I'm not getting as many "Project has no tasks available" messages as I was, but I'm still getting more than I usually do even when network traffic is maxed out. Given how (relatively) low the traffic has been, I would expect to get hardly any, if any, such messages when requesting work.
____________
Grant
Darwin NT.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,946,776
RAC: 13,604
United Kingdom
Message 1230442 - Posted: 11 May 2012, 6:33:22 UTC - in response to Message 1230437.


I still reckon something's not quite right.
I'm not getting as many "Project has no tasks available" messages as I was, but I'm still getting more than I usually do even when network traffic is maxed out. Given how (relatively) low the traffic has been, I would expect to get hardly any, if any, such messages when requesting work.

Well, the tasks are going out, because we're now over 5.5 million out in the field. I don't know how big that figure can be before the database starts slowing down...

red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,024,991
RAC: 0
United Kingdom
Message 1230458 - Posted: 11 May 2012, 8:24:49 UTC - in response to Message 1230437.
Last modified: 11 May 2012, 8:51:52 UTC

I still reckon something's not quite right.
I'm not getting as many "Project has no tasks available" messages as i was, but i'm still getting more than i usually do even when network traffic is maxed out. Given how (relatively) low the traffic has been i would expect to get hardly any, if any, such messages when requesting work.

Now that there are no limits, I expect many hosts are asking for and getting the entire feeder buffer. Getting WUs is going to be a problem till all the caches are full. I feel it would help a lot if the feeder had a bigger buffer.

I am puzzled as to why the Result average turnaround is dropping though.
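
The feeder bottleneck red-ray mentions can be modelled roughly like this. A toy Python sketch with an assumed slot count and simplified refill behaviour (not taken from the actual BOINC server code): a small fixed buffer sits between the database and the scheduler, and one host with an empty cache can drain it, so the next request sees "no tasks available" even though plenty of work exists in the database:

```python
from collections import deque

# Toy model of a BOINC-style feeder (illustrative only; the slot count and
# refill behaviour are assumptions, not taken from the real server code).

class Feeder:
    def __init__(self, slots=100):
        self.slots = slots
        self.buffer = deque()

    def refill(self, db_ready):
        """Periodically top the shared buffer up from the database."""
        while len(self.buffer) < self.slots and db_ready:
            self.buffer.append(db_ready.pop(0))

    def take(self, n):
        """A scheduler instance grabs up to n tasks for one host request."""
        got = []
        while self.buffer and len(got) < n:
            got.append(self.buffer.popleft())
        return got

db = list(range(1000))       # plenty of work sits ready in the database
feeder = Feeder(slots=100)
feeder.refill(db)

big = feeder.take(100)       # one hungry host with an empty cache drains it
small = feeder.take(4)       # the next host sees "no tasks available"
print(len(big), len(small))  # 100 0
```

A bigger buffer (or fewer giant requests while caches refill) would make the "no tasks available" replies rarer, which is exactly the change being wished for here.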


Copyright © 2014 University of California