holy cow! 20 timeouts



Message boards : Number crunching : holy cow! 20 timeouts

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11601
Credit: 14,350,917
RAC: 13,477
United States
Message 1283568 - Posted: 14 Sep 2012, 13:18:14 UTC

I know the explanation for short deadline timeouts, but I have a question about the explanation.

The explanation is that the host asks for work for, say, GPU, and the Scheduler responds by assigning a bunch of tasks for GPU, but the message never reaches the host so it can start downloading them. A few minutes later, the host again asks for work, but this time only for CPU, and it sends in the list of what it has on hand. The Scheduler looks at the list and says, "hey, I assigned this other bunch of work to you, but you don't have it, and I can't assign it again because you're not asking for GPU this time, so I have no choice but to time it out on you." Okay, fine. BUT..... if the host asked for GPU and (as far as it knows) didn't get any, why wouldn't it ask for GPU again the next time?

While typing the above, I began to wonder something... Does the list the host sends the Scheduler include everything that it knows to have been assigned, even if it hasn't been downloaded yet, or is it only what's been downloaded? If the latter, then download slowness could be at the root of many of the short timeouts we all seem to experience at one time or another. It could probably also be fixed with a fairly minor tweak of the code somewhere (says the guy who knows nothing about coding, but who is smart enough to know such a tweak might have consequences I don't see).

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4232
Credit: 115,871,811
RAC: 146,325
United States
Message 1283592 - Posted: 14 Sep 2012, 13:48:37 UTC

Last time I checked, the ghost task VLAR timeout was more the opposite of your description.

1) Host requests work for CPU or CPU & GPU.
2) Server assigns VLAR tasks to the CPU, but the host doesn't receive the list.
3) Host requests work for just GPU.
4) Server wants to send the tasks the host doesn't have.
5) The check for not allowing VLAR tasks on the GPU is tripped.
6) The tasks are marked as "Timed out - no response".
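
The steps above can be sketched as a toy model. This is only an illustration of the sequence as described in this thread, not the actual scheduler code: the `Task` class, the `handle_request` function, and the task names are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    is_vlar: bool
    status: str = "in progress"


def handle_request(assigned, reported_names, requested):
    """Toy model of steps 3-6: resend lost tasks or time them out.

    assigned       -- tasks the server believes the host holds
    reported_names -- task names the host listed in its request
    requested      -- resources the host asked for, e.g. {"GPU"}
    """
    resent = []
    for task in assigned:
        if task.name in reported_names:
            continue  # the host has this one, nothing to do
        # The task is a "ghost": assigned by the server, unknown to the host.
        if requested == {"GPU"} and task.is_vlar:
            # Assumed rule, taken from the thread: VLAR work is not sent to the GPU.
            task.status = "Timed out - no response"
        else:
            resent.append(task.name)
    return resent


# The host never received either task, then asks for GPU work only.
tasks = [Task("wu_1.vlar", True), Task("wu_2", False)]
resent = handle_request(tasks, reported_names=set(), requested={"GPU"})
print(resent)           # ['wu_2']
print(tasks[0].status)  # Timed out - no response
```

In this sketch the non-VLAR ghost is simply resent, while the VLAR ghost trips the GPU check and is timed out.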

I think there was some talk about trying to prevent this, but I don't know if that was just talk or not.

IIRC: When a host makes a request, it sends a list of all the tasks in its client_state.xml. So whether they are being processed, waiting, or in a transfer state, it will tell the server "Hey, I have these tasks".
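
As a rough sketch of that idea, the snippet below pulls every task name out of a state file regardless of what state the task is in, then compares the list with what the server thinks it assigned. The XML here is a heavily trimmed, hypothetical stand-in for client_state.xml (the `<state>` values in particular are invented), and `tasks_on_hand` and `ghosts` are names made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for client_state.xml.
SAMPLE = """
<client_state>
  <result><name>task_a_0</name><state>downloading</state></result>
  <result><name>task_b_1</name><state>waiting</state></result>
  <result><name>task_c_0</name><state>running</state></result>
</client_state>
"""


def tasks_on_hand(xml_text):
    """Return every task name in the state file, whatever its state."""
    root = ET.fromstring(xml_text.strip())
    return [result.findtext("name") for result in root.iter("result")]


def ghosts(server_assigned, xml_text):
    """Tasks the server assigned that the host does not list at all."""
    return sorted(set(server_assigned) - set(tasks_on_hand(xml_text)))


print(tasks_on_hand(SAMPLE))                     # ['task_a_0', 'task_b_1', 'task_c_0']
print(ghosts(["task_a_0", "task_d_0"], SAMPLE))  # ['task_d_0']
```

Note that the task still downloading is reported like any other, so under this model slow downloads alone would not turn a task into a ghost.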


____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8549
Credit: 50,348,611
RAC: 50,410
United Kingdom
Message 1283593 - Posted: 14 Sep 2012, 13:48:43 UTC - in response to Message 1283568.
Last modified: 14 Sep 2012, 14:07:10 UTC

It's more likely to be the other way round. Your computer asks for CPU tasks, and is assigned VLAR workunits (somebody was saying yesterday that there were a lot around at the moment).

If that message gets lost, and the next request is for GPU work, that's when the deadlines get fudged (because of the "don't send VLAR to nVidia" rule).

Apart from the 'no VLAR to NV' situation, there's nothing (except possibly your preferences) to stop a "lost" result being issued to a different computing resource the second time round - when VLARs were being issued to GPUs because of a bug recently, I published instructions for a technique of deliberately losing them and getting them resent to CPU instead. That's the situation you suggest might cause problems, but it worked well for those who tried it.

As regards the client telling the server about work it knows has been allocated, but not yet downloaded - I don't know. I'll take a look next time I have some downloads stuck. Murphy has decreed that I'm all green, just at the moment...

Edit - got some. Yes, confirming Hal's post - allocated but not yet downloaded do get reported to the server.

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 74,097,690
RAC: 68,371
Argentina
Message 1283623 - Posted: 14 Sep 2012, 14:41:45 UTC - in response to Message 1283568.

if the host asked for GPU and (as far as it knows) didn't get any, why wouldn't it ask for GPU again the next time?

As said before, it asks for CPU tasks, they are lost, and then, at least 5 minutes later, it asks again for more tasks...
Well, in those 5 minutes the GPU cache may reach the "request more work" trigger, so it asks for CPU and GPU. As the scheduler will try to fill the GPU cache first, it will try to assign the lost tasks (originally intended for the CPU) to the GPU...
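
A minimal sketch of that dual-request case, assuming (as described in this post, not verified against the real scheduler) that lost tasks are offered to the first resource being filled and that the GPU is filled first. The function name and the `(name, is_vlar)` tuples are hypothetical.

```python
def place_lost_tasks(lost, requested):
    """Toy model: offer each lost task to the first requested resource, GPU first.

    lost      -- list of (task_name, is_vlar) pairs
    requested -- resources in the request, e.g. {"CPU", "GPU"}
    """
    order = [res for res in ("GPU", "CPU") if res in requested]
    outcome = {}
    if not order:
        return outcome  # no work requested, nothing is placed
    first = order[0]
    for name, is_vlar in lost:
        if first == "GPU" and is_vlar:
            # Assumed rule from the thread: no VLAR on the GPU.
            outcome[name] = "Timed out - no response"
        else:
            outcome[name] = f"resent to {first}"
    return outcome


lost = [("wu_1.vlar", True), ("wu_2", False)]
print(place_lost_tasks(lost, {"CPU", "GPU"}))
# {'wu_1.vlar': 'Timed out - no response', 'wu_2': 'resent to GPU'}
print(place_lost_tasks(lost, {"CPU"}))
# {'wu_1.vlar': 'resent to CPU', 'wu_2': 'resent to CPU'}
```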

There is something else: if you are also attached to other projects, after the request fails to receive the needed CPU tasks, it might ask another project for CPU tasks while waiting out the 5 minute delay, and if it gets tasks then it may not need CPU tasks anymore.
____________

Sliver
Project donor
Joined: 18 May 11
Posts: 281
Credit: 7,189,694
RAC: 4,920
United States
Message 1283808 - Posted: 14 Sep 2012, 22:37:53 UTC

Check out all my timeouts. I've never experienced this before. Can someone explain to me what might be causing this all of a sudden, and what steps I can take to prevent it from happening in the future?
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8549
Credit: 50,348,611
RAC: 50,410
United Kingdom
Message 1283811 - Posted: 14 Sep 2012, 22:44:56 UTC - in response to Message 1283808.

Check out all my timeouts. I've never experienced this before. Can someone explain to me what might be causing this all of a sudden, and what steps I can take to prevent it from happening in the future?

Reposting with a HostID link - other volunteers aren't allowed to follow a UserID link.

Error tasks for computer 5967851

Look at all those VLARs, as previously discussed.

Sliver
Project donor
Joined: 18 May 11
Posts: 281
Credit: 7,189,694
RAC: 4,920
United States
Message 1283813 - Posted: 14 Sep 2012, 22:54:23 UTC - in response to Message 1283811.

Look at all those VLARs, as previously discussed.


I don't understand any of that jargon. Riddle me this simply: Is it something that I did or that shows that there is something wrong with my computer that I can fix/change?

____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8549
Credit: 50,348,611
RAC: 50,410
United Kingdom
Message 1283818 - Posted: 14 Sep 2012, 23:05:05 UTC - in response to Message 1283813.

Look at all those VLARs, as previously discussed.

I don't understand any of that jargon. Riddle me this simply: Is it something that I did or that shows that there is something wrong with my computer that I can fix/change?

Neither. It's a slightly quirky part of the SETI system. It wasn't caused by anything you did: there's nothing wrong with your computer: and there's nothing you can do to fix it.

Just hang on tight and enjoy the ride.

Sliver
Project donor
Joined: 18 May 11
Posts: 281
Credit: 7,189,694
RAC: 4,920
United States
Message 1283826 - Posted: 14 Sep 2012, 23:23:28 UTC - in response to Message 1283818.

Just hang on tight and enjoy the ride.


Thanks for helping me again, Richard :)

____________

bj
Joined: 11 Oct 00
Posts: 163
Credit: 50,429,507
RAC: 0
United States
Message 1283871 - Posted: 15 Sep 2012, 2:19:23 UTC

Did have almost 300. Went down to 152 and now back to 226 timeouts.

Like they say: hang in for the ride.


bj
____________

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 326
Credit: 144,704,788
RAC: 214,639
Australia
Message 1283895 - Posted: 15 Sep 2012, 3:22:53 UTC
Last modified: 15 Sep 2012, 3:23:46 UTC

Yep, I have a sharp increase in errors due to VLARs misdirected to GPUs as well.

Somewhat-related question: With the recent changes to the scheduler, are VLARs no longer sent to ATI GPUs now? I'm certainly not complaining - while VLARs don't suffer with ATI GPUs nearly as badly as NV GPUs, I do often get a drop in the responsiveness of the OS GUI. I just haven't noticed any VLARs assigned to the ATI GPUs over the past several weeks and I'm enjoying this change, if it was indeed a deliberate change in the scheduler code.
____________
Soli Deo Gloria

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4101
Credit: 33,140,318
RAC: 8,738
United Kingdom
Message 1283987 - Posted: 15 Sep 2012, 9:30:29 UTC - in response to Message 1283895.

Somewhat-related question: With the recent changes to the scheduler, are VLARs no longer sent to ATI GPUs now? I'm certainly not complaining - while VLARs don't suffer with ATI GPUs nearly as badly as NV GPUs, I do often get a drop in the responsiveness of the OS GUI. I just haven't noticed any VLARs assigned to the ATI GPUs over the past several weeks and I'm enjoying this change, if it was indeed a deliberate change in the scheduler code.

I noticed a couple of weeks ago, while looking through someone's ATI cache, that he didn't have any VLARs on the ATI, so I expect it's true. I don't know whether it was a deliberate change or not.

Claggy

disco_nnected
Volunteer tester
Joined: 19 Dec 06
Posts: 15
Credit: 7,125,464
RAC: 808
Croatia
Message 1284029 - Posted: 15 Sep 2012, 11:40:43 UTC

I've also been getting a bunch of timeouts over the last few days....

chromespringer
Project donor
Joined: 3 Dec 05
Posts: 269
Credit: 21,753,576
RAC: 47,512
United States
Message 1284142 - Posted: 15 Sep 2012, 17:02:47 UTC - in response to Message 1283808.

Check out all my timeouts. I've never experienced this before. Can someone explain to me what might be causing this all of a sudden, and what steps I can take to prevent it from happening in the future?

Samten, if you look at your task manager under Application, all errored tasks appear to be CPU applications, as do mine. They appear to time out shortly after they download.
____________

chromespringer
Project donor
Joined: 3 Dec 05
Posts: 269
Credit: 21,753,576
RAC: 47,512
United States
Message 1284153 - Posted: 15 Sep 2012, 17:18:08 UTC - in response to Message 1284142.

Check out all my timeouts. I've never experienced this before. Can someone explain to me what might be causing this all of a sudden, and what steps I can take to prevent it from happening in the future?

Samten, if you look at your task manager under Application, all errored tasks appear to be CPU applications, as do mine. They appear to time out shortly after they download.

All are "recent lost task"... I expect they hung around in limbo too long.
____________

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11601
Credit: 14,350,917
RAC: 13,477
United States
Message 1284781 - Posted: 17 Sep 2012, 13:39:05 UTC - in response to Message 1283623.

if the host asked for GPU and (as far as it knows) didn't get any, why wouldn't it ask for GPU again the next time?

As said before, it asks for CPU tasks, they are lost, and then, at least 5 minutes later, it asks again for more tasks...
Well, in those 5 minutes the GPU cache may reach the "request more work" trigger, so it asks for CPU and GPU. As the scheduler will try to fill the GPU cache first, it will try to assign the lost tasks (originally intended for the CPU) to the GPU...

Ah hah. That was the answer I was looking for (instead of just correcting me on the details). I forgot that on a dual request, it tries to fill GPU first.

There is something else: if you are also attached to other projects, after the request fails to receive the needed CPU tasks, it might ask another project for CPU tasks while waiting out the 5 minute delay, and if it gets tasks then it may not need CPU tasks anymore.

And I didn't think of that. Although in my case, I have only one other project, Einstein, and all of my computers have been shying away from it lately (one of the three has a couple of tasks due in three days and the others haven't contacted it since they reported their last ones about a week ago).

As to the list including undownloaded tasks, thanks to the guys who answered. It was just an idea.

BTW, I'm up to 53 timeouts now.

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


James Sotherden
Joined: 16 May 99
Posts: 8832
Credit: 34,892,261
RAC: 60,674
United States
Message 1285128 - Posted: 18 Sep 2012, 12:20:25 UTC - in response to Message 1284781.

if the host asked for GPU and (as far as it knows) didn't get any, why wouldn't it ask for GPU again the next time?

As said before, it asks for CPU tasks, they are lost, and then, at least 5 minutes later, it asks again for more tasks...
Well, in those 5 minutes the GPU cache may reach the "request more work" trigger, so it asks for CPU and GPU. As the scheduler will try to fill the GPU cache first, it will try to assign the lost tasks (originally intended for the CPU) to the GPU...

Ah hah. That was the answer I was looking for (instead of just correcting me on the details). I forgot that on a dual request, it tries to fill GPU first.

There is something else: if you are also attached to other projects, after the request fails to receive the needed CPU tasks, it might ask another project for CPU tasks while waiting out the 5 minute delay, and if it gets tasks then it may not need CPU tasks anymore.

And I didn't think of that. Although in my case, I have only one other project, Einstein, and all of my computers have been shying away from it lately (one of the three has a couple of tasks due in three days and the others haven't contacted it since they reported their last ones about a week ago).

As to the list including undownloaded tasks, thanks to the guys who answered. It was just an idea.

BTW, I'm up to 53 timeouts now.


I now have 81 timeouts. I can't even buy a download, let alone report, right now.
____________

Old James


Copyright © 2014 University of California