Panic Mode On (74) Server problems?


msattler - meow!!
Volunteer tester
Joined: 9 Jul 00
Posts: 45135
Credit: 798,150,048
RAC: 103,159
United States
Message 1229549 - Posted: 9 May 2012, 15:14:34 UTC

My guess is that we are again seeing some kind of scheduler/feeder limitation.
I agree that even with no AP using bandwidth, MB alone has shown the capability of fully saturating the bandwidth.

On the other hand, NOT saturating the bandwidth may actually be making better use of it......
____________
The Seti all time #1 home contributor. With help from my kitties and kitty friends.

Have made a few friends in life.
Most were cats.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 5963
Credit: 153,347,086
RAC: 1,955
United States
Message 1229581 - Posted: 9 May 2012, 16:33:38 UTC - in response to Message 1229549.

My guess is that we are again seeing some kind of scheduler/feeder limitation.
I agree that even with no AP using bandwidth, MB alone has shown the capability of fully saturating the bandwidth.

On the other hand, NOT saturating the bandwidth may actually be making better use of it......

My machines are no longer uploading/requesting tasks 1 or 2 at a time, as they seem to have filled to their cache settings. So we may be looking at a normal bandwidth graph again, which is how it would often look in the days before limits, sans AP or shorties.
That's not to say all requests are being fulfilled, just that there are not so many transfers in progress to keep the bandwidth pegged 24/7.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

msattler - meow!!
Volunteer tester
Joined: 9 Jul 00
Posts: 45135
Credit: 798,150,048
RAC: 103,159
United States
Message 1230054 - Posted: 10 May 2012, 16:36:54 UTC
Last modified: 10 May 2012, 16:37:54 UTC

With the increased limits and the scheduler/feeder not having tasks available all the time....

The dang Boinc scheduler bug is kicking up again.

My #1 rig, not banging up against the limits anymore, is getting plenty of work for the GPU, but the scheduler is once again letting the CPUs go idle, not sending them a drop of work because the GPU cache is not full yet.
So the CPUs are twiddling their thumbs.

Dang it, DA....please quit starving the slower resources completely just because the fastest ones do not have their caches full!!!
____________
The Seti all time #1 home contributor. With help from my kitties and kitty friends.

Have made a few friends in life.
Most were cats.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 5963
Credit: 153,347,086
RAC: 1,955
United States
Message 1230066 - Posted: 10 May 2012, 17:00:42 UTC - in response to Message 1230054.

With the increased limits and the scheduler/feeder not having tasks available all the time....

The dang Boinc scheduler bug is kicking up again.

My #1 rig, not banging up against the limits anymore, is getting plenty of work for the GPU, but the scheduler is once again letting the CPUs go idle, not sending them a drop of work because the GPU cache is not full yet.
So the CPUs are twiddling their thumbs.

Dang it, DA....please quit starving the slower resources completely just because the fastest ones do not have their caches full!!!

I thought there was talk about that being corrected in the v7 client, but then there is the odd high/low work fetch system it uses.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

msattler - meow!!
Volunteer tester
Joined: 9 Jul 00
Posts: 45135
Credit: 798,150,048
RAC: 103,159
United States
Message 1230068 - Posted: 10 May 2012, 17:07:14 UTC - in response to Message 1230066.

With the increased limits and the scheduler/feeder not having tasks available all the time....

The dang Boinc scheduler bug is kicking up again.

My #1 rig, not banging up against the limits anymore, is getting plenty of work for the GPU, but the scheduler is once again letting the CPUs go idle, not sending them a drop of work because the GPU cache is not full yet.
So the CPUs are twiddling their thumbs.

Dang it, DA....please quit starving the slower resources completely just because the fastest ones do not have their caches full!!!

I thought there was talk about that being corrected in the v7 client, but then there is the odd high/low work fetch system it uses.


I don't believe this has ANYTHING to do with the Boinc client.
The host continually asks for GPU 'AND' CPU tasks, but is repeatedly ONLY sent GPU work.
____________
The Seti all time #1 home contributor. With help from my kitties and kitty friends.

Have made a few friends in life.
Most were cats.

Alex Storey
Volunteer tester
Joined: 14 Jun 04
Posts: 965
Credit: 1,937,099
RAC: 281
Greece
Message 1230070 - Posted: 10 May 2012, 17:11:00 UTC - in response to Message 1230068.

I don't believe ANYTHING that has to do with the Boinc client.


There, I fixed it:)

red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1230072 - Posted: 10 May 2012, 17:13:57 UTC - in response to Message 1230068.
Last modified: 10 May 2012, 17:52:33 UTC

I thought there was talk about that being corrected in the v7 client, but then there is the odd high/low work fetch system it uses.

No, I have 7.0.25 on my QX6700 and it's got the same problem, so having V7 does not help with this server issue.

I would like to see a bigger fifo so fewer requests are needed to replenish the cache.

msattler - meow!!
Volunteer tester
Joined: 9 Jul 00
Posts: 45135
Credit: 798,150,048
RAC: 103,159
United States
Message 1230073 - Posted: 10 May 2012, 17:15:03 UTC - in response to Message 1230072.

I thought there was talk about that being corrected in the v7 client, but then there is the odd high/low work fetch system it uses.

No, I have 7.0.25 on my QX6700 and it's got the same problem.

It's not the client....
It's what the scheduler logic does with the client request.
____________
The Seti all time #1 home contributor. With help from my kitties and kitty friends.

Have made a few friends in life.
Most were cats.

Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 10656
Credit: 77,563,917
RAC: 34,213
United Kingdom
Message 1230078 - Posted: 10 May 2012, 17:33:06 UTC - in response to Message 1230073.

I thought there was talk about that being corrected in the v7 client, but then there is the odd high/low work fetch system it uses.

No, I have 7.0.25 on my QX6700 and it's got the same problem.

It's not the client....
It's what the scheduler logic does with the client request.

And by scheduler, Mark means the scheduler that runs on the server - that is indeed where this particular problem lies.
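
The starvation described in this exchange can be illustrated with a toy model. This is only a sketch and not the actual BOINC server scheduler code: the function names, the request sizes, and the number of available tasks are all hypothetical, and requests are counted in tasks rather than in seconds of work. It assumes each ready task could go to either resource and compares "fill the fastest resource first" against a simple alternating split.

```python
def greedy_assign(available, request):
    """Give every available task to the first resource that still wants work.

    `request` maps resource name -> tasks wanted; dict order is treated as
    priority order (fastest resource first). Hypothetical model only.
    """
    sent = {name: 0 for name in request}
    for _ in range(available):
        for name in request:
            if sent[name] < request[name]:
                sent[name] += 1
                break
    return sent


def round_robin_assign(available, request):
    """Alternate between resources so each one with unmet demand gets a share."""
    sent = {name: 0 for name in request}
    remaining = available
    while remaining > 0:
        progressed = False
        for name in request:
            if remaining > 0 and sent[name] < request[name]:
                sent[name] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every request is satisfied
            break
    return sent


# Hypothetical host: large unfilled GPU cache, small CPU request,
# and only 20 tasks on hand when the request arrives.
request = {"gpu": 500, "cpu": 40}
print(greedy_assign(20, request))       # {'gpu': 20, 'cpu': 0}
print(round_robin_assign(20, request))  # {'gpu': 10, 'cpu': 10}
```

In the greedy case the CPU receives nothing for as long as the GPU request exceeds what is on hand, which matches the symptom reported above. The alternating split is just one possible alternative, not a description of any fix.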

msattler - meow!!
Volunteer tester
Joined: 9 Jul 00
Posts: 45135
Credit: 798,150,048
RAC: 103,159
United States
Message 1230081 - Posted: 10 May 2012, 17:34:51 UTC - in response to Message 1230078.
Last modified: 10 May 2012, 17:51:47 UTC

I thought there was talk about that being corrected in the v7 client, but then there is the odd high/low work fetch system it uses.

No, I have 7.0.25 on my QX6700 and it's got the same problem.

It's not the client....
It's what the scheduler logic does with the client request.

And by scheduler, Mark means the scheduler that runs on the server - that is indeed where this particular problem lies.

Thank you, Richard.

Of my top 3 rigs, 2 are now running GPU only due to this bug.
The only reason the 3rd is not is that the CPU is running on cached AP work with the manually installed AP app. Otherwise, it would be in the same boat.
____________
The Seti all time #1 home contributor. With help from my kitties and kitty friends.

Have made a few friends in life.
Most were cats.

red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1230092 - Posted: 10 May 2012, 17:57:42 UTC - in response to Message 1230081.
Last modified: 10 May 2012, 18:00:21 UTC

If you stop BOINC and set a biggish duration_correction_factor you will just get CPU work for a while. The reason my 980X gets CPU WUs is that the DCF jumps to 6 when a slow GPU finishes, and the system just asks for CPU WUs till it drops.

Wow, the 980X has just hit 4,000 WUs cached.

msattler - meow!!
Volunteer tester
Joined: 9 Jul 00
Posts: 45135
Credit: 798,150,048
RAC: 103,159
United States
Message 1230096 - Posted: 10 May 2012, 18:05:48 UTC - in response to Message 1230092.

If you stop BOINC and set a biggish duration_correction_factor you will just get CPU work for a while. The reason my 980X gets CPU WUs is that the DCF jumps to 6 when a slow GPU finishes, and the system just asks for CPU WUs till it drops.

Wow, the 980X has just hit 4,000 WUs cached.

I have enough GPU work to last a bit, so I am going to do the 'uncheck use nvidia GPU' trick to get some CPU work flowing.

But, that is a workaround, and should not be necessary.


____________
The Seti all time #1 home contributor. With help from my kitties and kitty friends.

Have made a few friends in life.
Most were cats.

LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1230097 - Posted: 10 May 2012, 18:06:25 UTC - in response to Message 1230072.

I would like to see a bigger fifo so fewer requests are needed to replenish the cache.


It's called the feeder.

The usual workaround is to disable, in the project prefs, the resource that is getting all the tasks, until the 'slower' one has some sort of cache.

The other option would be to reduce cache, allow the slower resource to catch up and then gradually increase cache again.

It will eventually get sorted by itself, but if you have a large cache to fill, it may take quite a while until you have single resource requests again instead of double ones.
____________
I'm not the Pope. I don't speak Ex Cathedra!

Tutankhamon "Communist"
Volunteer tester
Joined: 1 Nov 08
Posts: 5963
Credit: 37,224,457
RAC: 1,031
Sweden
Message 1230136 - Posted: 10 May 2012, 19:29:33 UTC

Who stole all the APs? Or, who stole the AP splitters?
____________
Too much hormone treated meat.
Too much Monsanto veggies.
Too old, and outdated constitution.

Yeah, you do have a "crazy" problem, no doubt about that...


Why defend a culture of death?

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2815
Credit: 10,554,541
RAC: 795
United States
Message 1230235 - Posted: 10 May 2012, 21:43:44 UTC - in response to Message 1230136.

Who stole all the APs? Or, who stole the AP splitters?

It may be that since AP_v505 is now down to 8 still out in the field, new tapes will be held until the last of those 505's comes in and they can do that DB kick thing (which may have already been done since "awaiting validation" isn't 10k+ anymore).

Or.. it could just be that there were so many tapes loaded up in the first place that we're now at that point where we have to sit around and wait for the MB splitters to catch up to get some new tapes loaded for AP to split.
____________
Linux laptop:
Current uptime: 1491d 05h 06m (as of 20160713_0202 UTC)

arkayn (Project Donor)
Volunteer tester
Joined: 14 May 99
Posts: 4034
Credit: 51,008,381
RAC: 45
United States
Message 1230303 - Posted: 10 May 2012, 23:27:22 UTC - in response to Message 1230235.

Who stole all the APs? Or, who stole the AP splitters?

It may be that since AP_v505 is now down to 8 still out in the field, new tapes will be held until the last of those 505's comes in and they can do that DB kick thing (which may have already been done since "awaiting validation" isn't 10k+ anymore).

Or.. it could just be that there were so many tapes loaded up in the first place that we're now at that point where we have to sit around and wait for the MB splitters to catch up to get some new tapes loaded for AP to split.


I am down to 18 from 65 a couple of days ago.
____________

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2815
Credit: 10,554,541
RAC: 795
United States
Message 1230426 - Posted: 11 May 2012, 6:05:02 UTC

Well this is just starting to be almost slightly irritating. Because of the adjustments to the estimates, I ended up with like a 22-day AP-only cache and therefore, my average turnaround time was in the high teens. The result of this was that most of my wingmates were waiting for me, so I ended up with nearly every reported result being validated immediately.

But since there hasn't been new work going out and my cache is now down in the ~8-day range, I'm starting to pick up more and more pendings when I report. Oh well. That's the way it goes.
____________
Linux laptop:
Current uptime: 1491d 05h 06m (as of 20160713_0202 UTC)

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 7106
Credit: 85,527,649
RAC: 13,045
Australia
Message 1230437 - Posted: 11 May 2012, 6:27:44 UTC


I still reckon something's not quite right.
I'm not getting as many "Project has no tasks available" messages as I was, but I'm still getting more than I usually do even when network traffic is maxed out. Given how (relatively) low the traffic has been, I would expect to get hardly any, if any, such messages when requesting work.
____________
Grant
Darwin NT.

Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 10656
Credit: 77,563,917
RAC: 34,213
United Kingdom
Message 1230442 - Posted: 11 May 2012, 6:33:22 UTC - in response to Message 1230437.


I still reckon something's not quite right.
I'm not getting as many "Project has no tasks available" messages as I was, but I'm still getting more than I usually do even when network traffic is maxed out. Given how (relatively) low the traffic has been, I would expect to get hardly any, if any, such messages when requesting work.

Well, the tasks are going out, because we're now over 5.5 million out in the field. I don't know how big that figure can be before the database starts slowing down...

red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1230458 - Posted: 11 May 2012, 8:24:49 UTC - in response to Message 1230437.
Last modified: 11 May 2012, 8:51:52 UTC

I still reckon something's not quite right.
I'm not getting as many "Project has no tasks available" messages as I was, but I'm still getting more than I usually do even when network traffic is maxed out. Given how (relatively) low the traffic has been, I would expect to get hardly any, if any, such messages when requesting work.

Now that there are no limits, I expect many hosts are asking for and getting the entirety of the feeder buffer. Getting WUs is going to be a problem till all the caches are full. I feel it would help a lot if the feeder could have a bigger buffer.

I am puzzled as to why the Result average turnaround is dropping though.
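
On the feeder-buffer point, a rough back-of-the-envelope sketch shows why a bigger buffer would mean fewer requests to fill a large cache. The 4,000-task figure is the cache size quoted earlier in the thread. The buffer sizes (100 and 400 slots) and the `share` parameter are purely hypothetical and are not taken from the project's server configuration.

```python
import math


def min_requests_to_fill(cache_deficit, feeder_slots, share=1.0):
    """Lower bound on scheduler requests needed to fetch `cache_deficit` tasks.

    feeder_slots: tasks the feeder buffer holds at once (hypothetical value).
    share: fraction of the buffer one request can actually obtain, since
           other hosts are draining the same buffer at the same time.
    """
    per_request = max(1, math.floor(feeder_slots * share))
    return math.ceil(cache_deficit / per_request)


print(min_requests_to_fill(4000, 100))              # 40
print(min_requests_to_fill(4000, 400))              # 10
print(min_requests_to_fill(4000, 100, share=0.25))  # 160
```

This is only a lower bound: it ignores refill timing, per-request limits, and any request that comes back with "Project has no tasks available".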


Copyright © 2016 University of California