To Whomever this concerns: Backoffs and error -6

Blurf
Volunteer tester
Joined: 2 Sep 06
Posts: 8964
Credit: 12,678,685
RAC: 0
United States
Message 887002 - Posted: 21 Apr 2009, 21:33:50 UTC

SJ,

I have sent your concerns directly to Eric and Matt for a response from them on this issue.


ID: 887002 · Report as offensive
Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 887004 - Posted: 21 Apr 2009, 21:34:53 UTC - in response to Message 887000.  

Sorry, but if you choose to use an unofficial App that kills VLAR's with an error condition, then, as far as I am concerned, that is your choice. Don't blame the project for the consequences.

F.
ID: 887004 · Report as offensive
Andy Williams
Volunteer tester
Joined: 11 May 01
Posts: 187
Credit: 112,464,820
RAC: 0
United States
Message 887007 - Posted: 21 Apr 2009, 21:38:40 UTC - in response to Message 887000.  

I had one WU on the gpu here take over 5 hours.


I just aborted three. One was over 11 1/2 hours, two were over ten hours. Nowhere near done.
--
Classic 82353 WU / 400979 h
ID: 887007 · Report as offensive
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 887010 - Posted: 21 Apr 2009, 21:42:30 UTC
Last modified: 21 Apr 2009, 21:43:30 UTC

First, you are using a development version of BOINC. Second, running SETI and making donations are voluntary. I guess that's 2 and 3. Fourth, the way it works is the way it works. Seti apps don't look at the angle range of WUs before assigning them to a CPU or GPU. They just assign them. If you have hanging WUs on your GPU you should try Raistmer's v.11 app. This app is supposed to eliminate the hanging GPU WUs. As I recall BOINC 6.6.20 already has the VLAR killer in it. I'm not sure how anyone would work around the -6 error, but VLARs are still a problem for a few folks.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 887010 · Report as offensive
Andy Williams
Volunteer tester
Joined: 11 May 01
Posts: 187
Credit: 112,464,820
RAC: 0
United States
Message 887014 - Posted: 21 Apr 2009, 21:51:45 UTC - in response to Message 887010.  
Last modified: 21 Apr 2009, 21:52:01 UTC

First, you are using a development version of BOINC.


I'm not sure who this is addressed to, but I personally am running a development version on one non-CUDA machine. 6.6.20 on all others. If the VLAR killer is in 6.6.20 (first time I've heard that), it clearly is buggy. I am already using Raistmer's apps on all 6.6.20 machines.
--
Classic 82353 WU / 400979 h
ID: 887014 · Report as offensive
Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 887015 - Posted: 21 Apr 2009, 21:55:57 UTC - in response to Message 887010.  

As I recall BOINC 6.6.20 already has the VLAR killer in it. I'm not sure how anyone would work around the -6 error, but VLARs are still a problem for a few folks.

Nope. VLAR killer is only in Raistmer's apps. 6.6.20 will feed you VLAR's if they are next in the queue when you ask.

F.
ID: 887015 · Report as offensive
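For readers unsure what a "VLAR killer" amounts to, here is a minimal Python sketch of the idea as described in this thread: the app itself checks the task's angle range and exits with an error instead of crunching. The threshold value and the function name are made up for illustration, and treating -6 as the exit code is an assumption based on the thread title. None of this is taken from Raistmer's actual code.

```python
VLAR_THRESHOLD = 0.05  # hypothetical cut-off, chosen only for illustration
VLAR_EXIT_CODE = -6    # assumed from the "error -6" in the thread title


def run_on_gpu(angle_range: float) -> int:
    """Return 0 if the task would be crunched, or the error code if killed."""
    if angle_range < VLAR_THRESHOLD:
        # Give up immediately rather than crunch for many hours.
        return VLAR_EXIT_CODE
    return 0  # normal processing would happen here


for ar in (0.01, 0.42):
    print(f"angle range {ar}: exit code {run_on_gpu(ar)}")
```

The point of the sketch is only that the decision sits in the application, after the task has already been handed out, which is why the result comes back to the server as an error.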
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 887020 - Posted: 21 Apr 2009, 22:04:58 UTC - in response to Message 887016.  


Actually, it's in the apps Raistmer makes (but not all); it's not in any version of BOINC.

Don't bother Fred W, Yer blocked.

He does have a point. The application is intentionally throwing errors, and what BOINC does on an error was known beforehand.

The effort to "fix" this could/should go into the application itself to find out why the CUDA card does these slowly and address it there, not by returning work because it doesn't pay well enough.

I understand Nvidia did the port, and they'd likely be the best ones to fix it.
ID: 887020 · Report as offensive
Andy Williams
Volunteer tester
Joined: 11 May 01
Posts: 187
Credit: 112,464,820
RAC: 0
United States
Message 887024 - Posted: 21 Apr 2009, 22:10:02 UTC - in response to Message 887020.  
Last modified: 21 Apr 2009, 22:11:33 UTC

The effort to "fix" this could/should go into the application itself to find out why the CUDA card does these slowly and address it there, not by returning work because it doesn't pay well enough.


The point is not the payment or credits. It's about using the proper tool for the job. Wasting 24 hours, or worse hanging the video card completely, on a VLAR is an enormous waste of computational resources.
--
Classic 82353 WU / 400979 h
ID: 887024 · Report as offensive
Blurf
Volunteer tester
Joined: 2 Sep 06
Posts: 8964
Credit: 12,678,685
RAC: 0
United States
Message 887040 - Posted: 21 Apr 2009, 22:39:56 UTC - in response to Message 887032.  

Backoffs as in "Communication deferred" on the Projects tab in BOINC, and it can be as much as 24 hours. You want to punish somebody? What's next, banishment for using the VLAR killer? Until the project can come up with a better idea that works, stop punishing the users for this.


SJ--please relax. Your concern has been escalated to the Admins and I am sure they will address it when they can.


ID: 887040 · Report as offensive
Moabiter
Joined: 9 Dec 02
Posts: 79
Credit: 215,029
RAC: 0
Germany
Message 887042 - Posted: 21 Apr 2009, 22:44:43 UTC

Hi there!
I'm not sure how many VLARs are around, but won't this problem fade away once they are all crunched?
How long will it take to deal with 2 months (Dec/Jan) of problem WUs?

In my opinion, yes, it is a waste of resources if the WU could be done 5 times faster.
Non-optimized SSEx Astropulse apps are 10 times slower, which is also a waste of resources (for today's CPUs).

I think that if the coders are working on future apps, they should not waste time 'optimizing' 'old' apps.
You need a new mainboard if you want more CPU power (mostly).

Being able to contribute to Seti more effectively as a small cruncher would be a big motivation boost.

Btw. OT: Can someone please point me to the Raistmer apps + .xml and any needed .dlls to run optimized AP + S@Home Enhanced, but without killing the VLARs?
While browsing the forums I only come across VLAR killers.

Thanks =)
While browsing Berkeley's homepage I came across this statement that makes me say:
"ME!, ME!, ME!, wants to know, what patch was used and is it available for windows?"
ID: 887042 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 887044 - Posted: 21 Apr 2009, 22:47:05 UTC - in response to Message 887020.  

He does have a point. The application is intentionally throwing errors, and what BOINC does on an error was known beforehand.

The effort to "fix" this could/should go into the application itself to find out why the CUDA card does these slowly and address it there, not by returning work because it doesn't pay well enough.

I understand Nvidia did the port, and they'd likely be the best ones to fix it.

I agree with you and Fred. The VLAR kill is merely a workround - and you might say a clumsy workround - for a problem which is proving surprisingly hard to fix. If you use a workround, you should be aware of the consequences, and be prepared to live with them.

As it happens, I had an email today from the nVidia developer who did a major part of the port. He said:

I am aware of the VLAR sluggishness and have spent some time trying different ways to resolve it given the current pulse detection algorithm with not much luck. The bottom line is that when using CUDA on the same GPU that is also used as the graphical display, it is difficult to prioritize one type of GPU client over another on the current HW architecture. SETI@home is not alone, this issue can affect any CUDA application that demands high amounts of time from CUDA.

And BTW, BOINC does not impose a communications backoff as a result of a single reported error. If you report 100 in a row, with no valid work at all, then quota limits will kick in to protect the project servers (which heaven knows are in a fragile enough state). But if you have valid work to return, backoff can always be overcome.
ID: 887044 · Report as offensive
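The rule Richard describes can be written down as a toy model in Python. This is only a sketch of the behaviour as stated in this thread (no deferral for a single error, a deferral of up to 24 hours after 100 errors in a row, cleared by returning valid work); it is not BOINC server code, and the class and method names are invented for illustration.

```python
from dataclasses import dataclass

ERROR_LIMIT = 100        # "100 in a row", as stated in the post above
MAX_DEFERRAL_HOURS = 24  # longest deferral mentioned in this thread


@dataclass
class HostState:
    """Hypothetical per-host bookkeeping; not a BOINC data structure."""
    consecutive_errors: int = 0

    def report(self, valid: bool) -> None:
        # A valid result clears the error streak; an error extends it.
        self.consecutive_errors = 0 if valid else self.consecutive_errors + 1

    def deferral_hours(self) -> int:
        # No deferral until the streak reaches the limit.
        if self.consecutive_errors >= ERROR_LIMIT:
            return MAX_DEFERRAL_HOURS
        return 0


host = HostState()
host.report(valid=False)
single_error_deferral = host.deferral_hours()  # one error on its own

for _ in range(ERROR_LIMIT):
    host.report(valid=False)
streak_deferral = host.deferral_hours()        # long unbroken error streak

host.report(valid=True)
after_valid_deferral = host.deferral_hours()   # valid work has been returned

print(single_error_deferral, streak_deferral, after_valid_deferral)
```

In this model a host that mixes VLAR-kill errors with ordinary valid results never builds a streak long enough to be deferred, which matches the claim that backoff can be overcome whenever there is valid work to return.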
Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 887055 - Posted: 21 Apr 2009, 23:05:10 UTC

The other point that occurs to me is: "Has SJ actually run out of work because of the 24 hour back-off?" (He obviously can't answer this himself since I am blacklisted.) As Richard says, if the VLAR killer returns 100+ consecutive VLARs then the back-off will be 24 hours, but only until a non-error is returned. There should be plenty of those well within that 24 hour period, so all it needs is a click on the "Update" button to report them, doesn't it?

F.
ID: 887055 · Report as offensive
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 887056 - Posted: 21 Apr 2009, 23:08:06 UTC - in response to Message 887022.  


Actually, it's in the apps Raistmer makes (but not all); it's not in any version of BOINC.

Don't bother Fred W, Yer blocked.

He does have a point. The application is intentionally throwing errors, and what BOINC does on an error was known beforehand.

The effort to "fix" this could/should go into the application itself to find out why the CUDA card does these slowly and address it there, not by returning work because it doesn't pay well enough.

I understand Nvidia did the port, and they'd likely be the best ones to fix it.

Pay, schmay. It just takes too damn long. Sure, it should be done in the app, but from what I've read, so far no one knows how to fix this. Not a clue. So reassign the dumb things and allow the WU to be aborted by the app in question. Seti has already had a lot of users depart for various reasons. Do they need more to depart forever before something positive on the subject is done?

I've got an AP work unit which has 266 hours on it so far. I think it'll finish sometime tomorrow.

It's not a problem, it is the perception of a problem.

My question is: is there a problem, or are people leaving because of the constant complaining?
ID: 887056 · Report as offensive
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 887059 - Posted: 21 Apr 2009, 23:12:49 UTC - in response to Message 887044.  


As it happens, I had an email today from the nVidia developer who did a major part of the port. He said:

I am aware of the VLAR sluggishness and have spent some time trying different ways to resolve it given the current pulse detection algorithm with not much luck. The bottom line is that when using CUDA on the same GPU that is also used as the graphical display, it is difficult to prioritize one type of GPU client over another on the current HW architecture. SETI@home is not alone, this issue can affect any CUDA application that demands high amounts of time from CUDA.

That adds an interesting question: is there a way to reduce the "graphics" demand on the card if the owner is mostly interested in crunching?

Setting the display to the minimum resolution and the minimum color depth might make a difference -- and would be an interesting experiment.

I wonder if some of the same kind of optimizations (rearranging execution order) that help the CPU apps would help the GPU apps.

ID: 887059 · Report as offensive
Moabiter
Joined: 9 Dec 02
Posts: 79
Credit: 215,029
RAC: 0
Germany
Message 887062 - Posted: 21 Apr 2009, 23:18:38 UTC - in response to Message 887059.  


As it happens, I had an email today from the nVidia developer who did a major part of the port. He said:

I am aware of the VLAR sluggishness and have spent some time trying different ways to resolve it given the current pulse detection algorithm with not much luck. The bottom line is that when using CUDA on the same GPU that is also used as the graphical display, it is difficult to prioritize one type of GPU client over another on the current HW architecture. SETI@home is not alone, this issue can affect any CUDA application that demands high amounts of time from CUDA.

That adds an interesting question: is there a way to reduce the "graphics" demand on the card if the owner is mostly interested in crunching?

Setting the display to the minimum resolution and the minimum color depth might make a difference -- and would be an interesting experiment.

I wonder if some of the same kind of optimizations (rearranging execution order) that help the CPU apps would help the GPU apps.

Reminds me of hybrid sli.
ID: 887062 · Report as offensive
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 887065 - Posted: 21 Apr 2009, 23:24:16 UTC - in response to Message 887058.  


Actually, it's in the apps Raistmer makes (but not all); it's not in any version of BOINC.

Don't bother Fred W, Yer blocked.

He does have a point. The application is intentionally throwing errors, and what BOINC does on an error was known beforehand.

The effort to "fix" this could/should go into the application itself to find out why the CUDA card does these slowly and address it there, not by returning work because it doesn't pay well enough.

I understand Nvidia did the port, and they'd likely be the best ones to fix it.

Pay, schmay. It just takes too damn long. Sure, it should be done in the app, but from what I've read, so far no one knows how to fix this. Not a clue. So reassign the dumb things and allow the WU to be aborted by the app in question. Seti has already had a lot of users depart for various reasons. Do they need more to depart forever before something positive on the subject is done?

I've got an AP work unit which has 266 hours on it so far. I think it'll finish sometime tomorrow.

It's not a problem, it is the perception of a problem.

My question is: is there a problem, or are people leaving because of the constant complaining?

I don't do AP, AP is and has been set for a good while to OFF.

Yer tryin to do an Apples to Oranges comparison, AP is not true Seti, Only Seti is Seti.

No, I'm comparing minutes to minutes.

As far as I can tell, your complaint is that some CUDA tasks take longer than others. If you don't care about credit, then it's just runtime.
ID: 887065 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 887067 - Posted: 21 Apr 2009, 23:26:02 UTC - in response to Message 887056.  

My question is: is there a problem, or are people leaving because of the constant complaining?

I doubt people are leaving just because of the complaining.

I would have sympathy with people who leave because they are confused by the plethora of threads, acronyms, in-jokes, technical jargon and conflicting advice - such as skildude's absurd suggestion earlier in this thread that "BOINC 6.6.20 already has the VLAR killer in it". No it doesn't, and never will do: BOINC versions don't have project-specific workrounds in them.

I would also have sympathy with a new user who leaves after receiving the stock AP application and an AP task as their first experience of SETI. Without calm, clear advice that the initial estimate is overstated by roughly 2.5 times - AP on a Core2 class processor is deliberately designed for a target TDCF of 0.4 [there, I'm doing the technical jargon already] - and that the new user should at least watch the 'To completion' estimate for an hour or so before clicking the 'abort' button, my reaction might be the same.

My understanding is that the next version of Astropulse - currently in testing at Beta - will be deployed here in a way which ensures that AP will not be issued as the first task to a newly-joined host or user. That should help.
ID: 887067 · Report as offensive
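The "roughly 2.5 times" figure follows directly from the target TDCF of 0.4 quoted above, since 1 / 0.4 = 2.5. A small Python sketch of that arithmetic, assuming a new host starts with a correction factor of 1.0 so that its first estimate is uncorrected; the 100-hour starting figure and the function names are hypothetical.

```python
TARGET_DCF = 0.4  # target TDCF quoted above for a Core2 class processor


def overstatement(dcf: float = TARGET_DCF) -> float:
    """How many times too long an uncorrected estimate is."""
    return 1.0 / dcf


def corrected_estimate(initial_hours: float, dcf: float = TARGET_DCF) -> float:
    """Scale an uncorrected runtime estimate by the correction factor."""
    return initial_hours * dcf


initial = 100.0  # hypothetical initial 'To completion' figure, in hours
print(f"overstated by {overstatement():.1f}x")
print(f"{initial:.0f} h shown at first, about {corrected_estimate(initial):.0f} h expected")
```

So a task that first shows 100 hours to completion would, under these assumptions, be expected to finish in about 40.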
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 887077 - Posted: 21 Apr 2009, 23:40:32 UTC - in response to Message 887067.  

My question is: is there a problem, or are people leaving because of the constant complaining?

I doubt people are leaving just because of the complaining.

I would have sympathy with people who leave because they are confused by the plethora of threads, acronyms, in-jokes, technical jargon and conflicting advice - such as skildude's absurd suggestion earlier in this thread that "BOINC 6.6.20 already has the VLAR killer in it". No it doesn't, and never will do: BOINC versions don't have project-specific workrounds in them.

I would also have sympathy with a new user who leaves after receiving the stock AP application and an AP task as their first experience of SETI. Without calm, clear advice that the initial estimate is overstated by roughly 2.5 times - AP on a Core2 class processor is deliberately designed for a target TDCF of 0.4 [there, I'm doing the technical jargon already] - and that the new user should at least watch the 'To completion' estimate for an hour or so before clicking the 'abort' button, my reaction might be the same.

My understanding is that the next version of Astropulse - currently in testing at Beta - will be deployed here in a way which ensures that AP will not be issued as the first task to a newly-joined host or user. That should help.

The part that bothers me most is the fact that people seem to have such wild expectations.

The project discloses a number of things right off the bat, like the fact that they will be out of work at times. We then see people complaining loudly that the project is out of work.

They build machines that crunch incredibly fast (and spend real money on them) and then complain that the project can't keep their cache full.

I understand the complaints, but the project is not to blame for the project doing what it said it would do.

I think it's all about expectations. If people see what they expect, they'll stay. If people come in with unreasonable expectations, they'll be disappointed and they'll leave.

I crunch beta (AP & MB) using their stock apps. AP 5.05 is a lot faster.
ID: 887077 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 887078 - Posted: 21 Apr 2009, 23:40:33 UTC - in response to Message 887059.  

As it happens, I had an email today from the nVidia developer who did a major part of the port. He said:

I am aware of the VLAR sluggishness and have spent some time trying different ways to resolve it given the current pulse detection algorithm with not much luck. The bottom line is that when using CUDA on the same GPU that is also used as the graphical display, it is difficult to prioritize one type of GPU client over another on the current HW architecture. SETI@home is not alone, this issue can affect any CUDA application that demands high amounts of time from CUDA.

That adds an interesting question: is there a way to reduce the "graphics" demand on the card if the owner is mostly interested in crunching?

Setting the display to the minimum resolution and the minimum color depth might make a difference -- and would be an interesting experiment.

I wonder if some of the same kind of optimizations (rearranging execution order) that help the CPU apps would help the GPU apps.

There are two problems with the VLAR tasks.

1) They take absolutely ages

2) If they are running on the GUI GPU, they cause sluggishness and lagging on the display and other applications using that display.

SJ's problem is with (1). I have every sympathy. I don't do VLAR on CUDA either - it's not worth it. But I use a more sophisticated workround - one developed, as it happens, by Fred W, who is too modest by half. SJ, you should learn who your friends are, and reconsider your decision not to listen to Fred W. You might learn something to your advantage.

Problem (1) is the hard one, and nobody seems to have a solution. Jason Gee is working on it - he found a reference today which may help: "WooHoo, Found it in Bailey's paper. "FFTs in External or Hierarchical Memory", David H. Bailey December 30, 1989. Ref: Journal of Supercomputing, vol. 4, no. 1 (March 1990), p. 23-35". But don't hold your breath.

The nVidia developer's concern - today - was with problem (2), the screen lag. He had a useful suggestion - make the new "don't use GPU while computer is in use" option in BOINC v6.6.20 apply only to GPU cards driving the user interface display - which I have forwarded to Eric and David. Watch this space, as they say.
ID: 887078 · Report as offensive
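The suggestion forwarded to Eric and David boils down to a simple rule: the "don't use GPU while computer is in use" preference should pause only the card that drives the display. A minimal Python sketch of that rule follows. The `Gpu` type, the function name and its parameters are hypothetical and do not correspond to anything in the BOINC client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical description of one card; not a BOINC structure."""
    name: str
    drives_display: bool


def may_crunch(gpu: Gpu, user_active: bool, suspend_when_in_use: bool) -> bool:
    """Proposed rule: the preference only pauses cards that drive the display."""
    if not suspend_when_in_use or not user_active:
        return True
    return not gpu.drives_display


cards = [Gpu("display card", True), Gpu("second card", False)]
for card in cards:
    allowed = may_crunch(card, user_active=True, suspend_when_in_use=True)
    print(f"{card.name}: {'crunch' if allowed else 'pause'}")
```

Under this rule a second card with no monitor attached keeps crunching while the user works, and only the display card is paused, which is what addresses the screen lag in problem (2).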
Moabiter
Joined: 9 Dec 02
Posts: 79
Credit: 215,029
RAC: 0
Germany
Message 887080 - Posted: 21 Apr 2009, 23:42:39 UTC - in response to Message 887066.  

From what I've read Hybrid SLI has been discontinued by Nvidia.

Really? Bummer... I was planning to get one when it's more developed.
Anyway, might an additional $20 non-CUDA GPU help?
Sadly, anyone who has a system set up like that without performance problems won't post here to verify my theory, right? =)
But buying another card would not be the best workaround.

And I don't know why people leave. What I know is that people seem to join when the movie "Contact" plays on TV. ^^ (maybe this is the case only in Germany)
While browsing Berkeley's homepage I came across this statement that makes me say:
"ME!, ME!, ME!, wants to know, what patch was used and is it available for windows?"
ID: 887080 · Report as offensive


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.