holy cow! 20 timeouts

Message boards : Number crunching : holy cow! 20 timeouts

David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1283568 - Posted: 14 Sep 2012, 13:18:14 UTC

I know the explanation for short deadline timeouts, but I have a question about the explanation.

The explanation is that the host asks for work for, say, GPU, and the Scheduler responds by assigning a bunch of tasks for GPU, but the message never reaches the host so it can start downloading them. A few minutes later, the host again asks for work, but this time only for CPU, and it sends in the list of what it has on hand. The Scheduler looks at the list and says, "hey, I assigned this other bunch of work to you, but you don't have it, and I can't assign it again because you're not asking for GPU this time, so I have no choice but to time it out on you." Okay, fine. BUT..... if the host asked for GPU and (as far as it knows) didn't get any, why wouldn't it ask for GPU again the next time?

While typing the above, I began to wonder something... Does the list the host sends the Scheduler include everything that it knows to have been assigned, even if it hasn't been downloaded yet, or is it only what's been downloaded? If the latter, then download slowness could be at the root of many of the short timeouts we all seem to experience at one time or another. It could probably also be fixed with a fairly minor tweak of the code somewhere (says the guy who knows nothing about coding, but who is smart enough to know such a tweak might have consequences I don't see).

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1283568 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1283592 - Posted: 14 Sep 2012, 13:48:37 UTC

Last time I checked, the ghost task VLAR timeout was more the opposite of your description.

1) Host requests work for CPU or CPU & GPU.
2) Server assigns VLAR tasks to the CPU, but the host doesn't receive the list.
3) Host requests work for just GPU.
4) Server wants to send the tasks the host doesn't have.
5) The check for not allowing VLAR tasks on the GPU is tripped.
6) The tasks are marked as "Timed out - no response".
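
The six steps above can be sketched roughly as follows. This is only an illustration of the sequence as described in this thread, not the actual scheduler code: the `Task` class, the `handle_request` function, and the task names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    is_vlar: bool
    state: str = "in progress"  # server-side status (hypothetical field)


def handle_request(assigned, reported_names, requested):
    """Resend tasks the host did not report, or time them out if they cannot be resent.

    `requested` is "cpu" or "gpu": the resource the host asks work for.
    """
    resent = []
    for task in assigned:
        if task.state != "in progress" or task.name in reported_names:
            continue  # the host has it, or it is no longer outstanding
        # Step 4: the server wants to resend this task the host does not have.
        if requested == "gpu" and task.is_vlar:
            # Steps 5-6: the no-VLAR-on-GPU check trips and the task is timed out.
            task.state = "Timed out - no response"
        else:
            resent.append(task.name)
    return resent


# Steps 1-2: two CPU tasks were assigned, but the reply never reached the host.
assigned = [Task("wu_1_vlar", True), Task("wu_2", False)]
# Step 3: the next request is for GPU work only, and the host reports no tasks.
resent = handle_request(assigned, reported_names=set(), requested="gpu")
print(resent)
print([t.state for t in assigned])
```

In this sketch the non-VLAR task is simply resent, while the VLAR one ends up as "Timed out - no response", which matches what shows up in the error lists.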

I think there was some talk about trying to prevent this, but I don't know if that was just talk or not.

IIRC: When a host makes a request, it sends the list of all tasks in its client_state.xml. So whether they are being processed, waiting, or in a transfer state, it tells the server, "Hey, I have these tasks."
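
As a rough sketch of why slow downloads alone should not create lost tasks, assuming the client reports every task it knows about regardless of status: the record layout below is a made-up stand-in, not the real client_state.xml schema, and `tasks_to_report` and `ghost_tasks` are hypothetical helper names.

```python
# Hypothetical stand-in for the client's task records.
client_tasks = [
    {"name": "wu_a", "status": "processing"},
    {"name": "wu_b", "status": "waiting"},
    {"name": "wu_c", "status": "downloading"},  # assigned, not yet downloaded
]


def tasks_to_report(tasks):
    """Report every task the client knows about, whatever its status."""
    return [t["name"] for t in tasks]


def ghost_tasks(server_assigned, reported):
    """Tasks the server assigned that the host did not mention."""
    return sorted(set(server_assigned) - set(reported))


reported = tasks_to_report(client_tasks)
# wu_d was assigned in a reply that never arrived, so the client has no record of it.
ghosts = ghost_tasks(["wu_a", "wu_b", "wu_c", "wu_d"], reported)
print(ghosts)
```

Here only `wu_d` counts as lost. The task still downloading (`wu_c`) is reported like any other.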


SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1283592 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1283593 - Posted: 14 Sep 2012, 13:48:43 UTC - in response to Message 1283568.  
Last modified: 14 Sep 2012, 14:07:10 UTC

It's more likely to be the other way round. Your computer asks for CPU tasks, and is assigned VLAR workunits (somebody was saying yesterday that there were a lot around at the moment).

If that message gets lost, and the next request is for GPU work, that's when the deadlines get fudged (because of the "don't send VLAR to nVidia" rule).

Apart from the 'no VLAR to NV' situation, there's nothing (except possibly your preferences) to stop a "lost" result being issued to a different computing resource the second time round - when VLARs were being issued to GPUs because of a bug recently, I published instructions for a technique of deliberately losing them and getting them resent to CPU instead. That's the situation you suggest might cause problems, but it worked well for those who tried it.

As regards the client telling the server about work it knows has been allocated, but not yet downloaded - I don't know. I'll take a look next time I have some downloads stuck. Murphy has decreed that I'm all green, just at the moment...

Edit - got some. Yes, confirming Hal's post - allocated but not yet downloaded do get reported to the server.
ID: 1283593 · Report as offensive
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1283623 - Posted: 14 Sep 2012, 14:41:45 UTC - in response to Message 1283568.  

if the host asked for GPU and (as far as it knows) didn't get any, why wouldn't it ask for GPU again the next time?

As said before, it asks for CPU tasks, they are lost, and then, at least 5 minutes later, it asks again for more tasks...
Well, in those 5 minutes the GPU cache may reach the "request more work" trigger, so it asks for both CPU and GPU. As the scheduler tries to fill the GPU cache first, it will try to assign the lost tasks (originally intended for the CPU) to the GPU...
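
A minimal sketch of that GPU-first behaviour on a dual request, assuming it works the way it is described in this thread. The function `place_lost_tasks` and its outcome strings are invented for illustration and are not taken from the scheduler source.

```python
def place_lost_tasks(lost, wants_cpu, wants_gpu):
    """Decide what happens to each lost task, trying the GPU first.

    `lost` is a list of (task_name, is_vlar) pairs.
    """
    outcome = {}
    for name, is_vlar in lost:
        if wants_gpu:
            # The GPU is filled first, so a VLAR task hits the no-VLAR-on-GPU rule
            # even though the same request also asked for CPU work.
            outcome[name] = "timed out" if is_vlar else "resent to GPU"
        elif wants_cpu:
            outcome[name] = "resent to CPU"
        else:
            outcome[name] = "still lost"
    return outcome


lost = [("wu_1_vlar", True), ("wu_2", False)]
dual = place_lost_tasks(lost, wants_cpu=True, wants_gpu=True)
cpu_only = place_lost_tasks(lost, wants_cpu=True, wants_gpu=False)
print(dual)
print(cpu_only)
```

Under these assumptions, the same lost VLAR task comes back safely on a CPU-only request but is timed out as soon as the request also asks for GPU work.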

There is something else: if you are also attached to other projects, then after the request fails to deliver the needed CPU tasks, the client might ask another project for CPU tasks while waiting out the 5-minute delay, and if it gets them it may not need CPU tasks anymore.
ID: 1283623 · Report as offensive
Profile Akio
Joined: 18 May 11
Posts: 375
Credit: 32,129,242
RAC: 0
United States
Message 1283808 - Posted: 14 Sep 2012, 22:37:53 UTC

Check out all my timeouts. I've never experienced this before. Can someone explain to me what might be causing this all of a sudden, and what steps I can take to prevent it from happening in the future?
ID: 1283808 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1283811 - Posted: 14 Sep 2012, 22:44:56 UTC - in response to Message 1283808.  

Check out all my timeouts. I've never experienced this before. Can someone explain to me what might be causing this all of a sudden, and what steps I can take to prevent it from happening in the future?

Reposting with a HostID link - other volunteers aren't allowed to follow a UserID link.

Error tasks for computer 5967851

Look at all those VLARs, as previously discussed.
ID: 1283811 · Report as offensive
Profile Akio
Joined: 18 May 11
Posts: 375
Credit: 32,129,242
RAC: 0
United States
Message 1283813 - Posted: 14 Sep 2012, 22:54:23 UTC - in response to Message 1283811.  

Look at all those VLARs, as previously discussed.


I don't understand any of that jargon. Riddle me this simply: Is it something that I did or that shows that there is something wrong with my computer that I can fix/change?

ID: 1283813 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1283818 - Posted: 14 Sep 2012, 23:05:05 UTC - in response to Message 1283813.  

Look at all those VLARs, as previously discussed.

I don't understand any of that jargon. Riddle me this simply: Is it something that I did or that shows that there is something wrong with my computer that I can fix/change?

Neither. It's a slightly quirky part of the SETI system. It wasn't caused by anything you did: there's nothing wrong with your computer: and there's nothing you can do to fix it.

Just hang on tight and enjoy the ride.
ID: 1283818 · Report as offensive
Profile Akio
Joined: 18 May 11
Posts: 375
Credit: 32,129,242
RAC: 0
United States
Message 1283826 - Posted: 14 Sep 2012, 23:23:28 UTC - in response to Message 1283818.  

Just hang on tight and enjoy the ride.


Thanks for helping me again, Richard :)

ID: 1283826 · Report as offensive
Profile bj

Joined: 11 Oct 00
Posts: 163
Credit: 50,429,507
RAC: 0
United States
Message 1283871 - Posted: 15 Sep 2012, 2:19:23 UTC

Did have almost 300. Went down to 152 and now back to 226 timeouts.

Like they say: hang in for the ride.


bj
ID: 1283871 · Report as offensive
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1283895 - Posted: 15 Sep 2012, 3:22:53 UTC
Last modified: 15 Sep 2012, 3:23:46 UTC

Yep, I have a sharp increase in errors due to VLARs misdirected to GPUs as well.

Somewhat-related question: With the recent changes to the scheduler, are VLARs no longer sent to ATI GPUs now? I'm certainly not complaining - while VLARs don't suffer with ATI GPUs nearly as badly as NV GPUs, I do often get a drop in the responsiveness of the OS GUI. I just haven't noticed any VLARs assigned to the ATI GPUs over the past several weeks and I'm enjoying this change, if it was indeed a deliberate change in the scheduler code.
Soli Deo Gloria
ID: 1283895 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1283987 - Posted: 15 Sep 2012, 9:30:29 UTC - in response to Message 1283895.  

Somewhat-related question: With the recent changes to the scheduler, are VLARs no longer sent to ATI GPUs now? I'm certainly not complaining - while VLARs don't suffer with ATI GPUs nearly as badly as NV GPUs, I do often get a drop in the responsiveness of the OS GUI. I just haven't noticed any VLARs assigned to the ATI GPUs over the past several weeks and I'm enjoying this change, if it was indeed a deliberate change in the scheduler code.

I noticed a couple of weeks ago, while looking through someone's ATI cache, that he didn't have any VLARs on the ATI, so I expect it's true. I don't know whether it was a deliberate change or not.

Claggy
ID: 1283987 · Report as offensive
disco_nnected Project Donor
Volunteer tester

Joined: 19 Dec 06
Posts: 16
Credit: 13,654,017
RAC: 66
Croatia
Message 1284029 - Posted: 15 Sep 2012, 11:40:43 UTC

I'm also getting a bunch of timeouts these last few days...
ID: 1284029 · Report as offensive
chromespringer
Joined: 3 Dec 05
Posts: 296
Credit: 55,183,482
RAC: 0
United States
Message 1284142 - Posted: 15 Sep 2012, 17:02:47 UTC - in response to Message 1283808.  

Check out all my timeouts. I've never experienced this before. Can someone explain to me what might be causing this all of a sudden, and what steps I can take to prevent it from happening in the future?

Samten, if you look at your task manager under application, all the errored tasks appear to be CPU applications, as do mine. They appear to time out shortly after they download.
ID: 1284142 · Report as offensive
chromespringer
Joined: 3 Dec 05
Posts: 296
Credit: 55,183,482
RAC: 0
United States
Message 1284153 - Posted: 15 Sep 2012, 17:18:08 UTC - in response to Message 1284142.  

Check out all my timeouts. I've never experienced this before. Can someone explain to me what might be causing this all of a sudden, and what steps I can take to prevent it from happening in the future?

Samten, if you look at your task manager under application, all the errored tasks appear to be CPU applications, as do mine. They appear to time out shortly after they download.

All are "recent lost task"... I expect they hung around in limbo too long.
ID: 1284153 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1284781 - Posted: 17 Sep 2012, 13:39:05 UTC - in response to Message 1283623.  

if the host asked for GPU and (as far as it knows) didn't get any, why wouldn't it ask for GPU again the next time?

As said before, it asks for CPU tasks, they are lost, and then, at least 5 minutes later, it asks again for more tasks...
Well, in those 5 minutes the GPU cache may reach the "request more work" trigger, so it asks for both CPU and GPU. As the scheduler tries to fill the GPU cache first, it will try to assign the lost tasks (originally intended for the CPU) to the GPU...

Ah hah. That was the answer I was looking for (instead of just correcting me on the details). I forgot that on a dual request, it tries to fill GPU first.

There is something else: if you are also attached to other projects, then after the request fails to deliver the needed CPU tasks, the client might ask another project for CPU tasks while waiting out the 5-minute delay, and if it gets them it may not need CPU tasks anymore.

And I didn't think of that. Although in my case, I have only one other project, Einstein, and all of my computers have been shying away from it lately (one of the three has a couple of tasks due in three days and the others haven't contacted it since they reported their last ones about a week ago).

As to the list including undownloaded tasks, thanks to the guys who answered. It was just an idea.

BTW, I'm up to 53 timeouts now.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1284781 · Report as offensive
Profile James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1285128 - Posted: 18 Sep 2012, 12:20:25 UTC - in response to Message 1284781.  

if the host asked for GPU and (as far as it knows) didn't get any, why wouldn't it ask for GPU again the next time?

As said before, it asks for CPU tasks, they are lost, and then, at least 5 minutes later, it asks again for more tasks...
Well, in those 5 minutes the GPU cache may reach the "request more work" trigger, so it asks for both CPU and GPU. As the scheduler tries to fill the GPU cache first, it will try to assign the lost tasks (originally intended for the CPU) to the GPU...

Ah hah. That was the answer I was looking for (instead of just correcting me on the details). I forgot that on a dual request, it tries to fill GPU first.

There is something else: if you are also attached to other projects, then after the request fails to deliver the needed CPU tasks, the client might ask another project for CPU tasks while waiting out the 5-minute delay, and if it gets them it may not need CPU tasks anymore.

And I didn't think of that. Although in my case, I have only one other project, Einstein, and all of my computers have been shying away from it lately (one of the three has a couple of tasks due in three days and the others haven't contacted it since they reported their last ones about a week ago).

As to the list including undownloaded tasks, thanks to the guys who answered. It was just an idea.

BTW, I'm up to 53 timeouts now.


I now have 81 timeouts. I can't even buy a download, let alone report, right now.

Old James
ID: 1285128 · Report as offensive
