computation errors


mdawson
Joined: 21 May 99
Posts: 37
Credit: 1,632,056
RAC: 0
United States
Message 962978 - Posted: 13 Jan 2010, 12:30:51 UTC

All of a sudden I'm getting lots of computation errors associated with the CUDA files. They run for about 2 secs. and then BAM!, computation error. I've had a dozen or so in the last 24 hrs.

I have 2 GPUs in my system, although I don't know which one is at fault, if there is a fault. The EVGA Precision app says GPU temps are ~64-67 degrees centigrade. That seems to be in the range that others are experiencing. I'm not overclocking at all.

So why is this happening? Any ideas out there?
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8632
Credit: 51,509,827
RAC: 47,977
United Kingdom
Message 962981 - Posted: 13 Jan 2010, 12:53:44 UTC

VLAR tasks in the mix.
You run the VLAR killer application.
It kills them.

QED

Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 359,338
RAC: 33
Germany
Message 962982 - Posted: 13 Jan 2010, 12:56:34 UTC - in response to Message 962978.

That seems to be a batch of VLARs. I found

VLAR WU (AR: 0.010784 )detected... autokill initialised SETI@home error -6 Bad workunit header
in several of the erroneous tasks from Jan 13 (I only checked a few). Time for a rescheduler run? ;-)
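If you want to check your own failed tasks the same way, here is a minimal Python sketch that pulls the angle range out of a stderr line like the one quoted above. The pattern is modelled only on that single quoted line, so the wording used by other app builds is an assumption, and the function name is made up for illustration.

```python
import re

# Pattern modelled on the one stderr line quoted in this thread; other
# app builds may word the message differently (assumption).
VLAR_KILL_RE = re.compile(r"VLAR WU \(AR:\s*([0-9.]+)\s*\)\s*detected")

def killed_vlar_angle_range(stderr_text):
    """Return the angle range as a float if the text shows a VLAR autokill, else None."""
    match = VLAR_KILL_RE.search(stderr_text)
    return float(match.group(1)) if match else None

sample = ("VLAR WU (AR: 0.010784 )detected... autokill initialised "
          "SETI@home error -6 Bad workunit header")
print(killed_vlar_angle_range(sample))
print(killed_vlar_angle_range("some unrelated error"))
```

A result of None only means the text did not match this particular pattern; it does not prove the task failed for some other reason.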

Regards,
Gundolf
____________
Computers aren't everything in life. (Just kidding)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4433
Credit: 118,881,931
RAC: 138,546
United States
Message 962985 - Posted: 13 Jan 2010, 13:12:48 UTC - in response to Message 962978.

All of a sudden I'm getting lots of computation errors associated with the CUDA files. They run for about 2 secs. and then BAM!, computation error. I've had a dozen or so in the last 24 hrs.

I have 2 GPU's in my system, although I don't know which one is at fault, if there is a fault. The EVGA Precision app says GPU temps are ~64-67 degrees centigrade. That seems to be in the range that others are experiencing. I'm not overclocking at all.

So why is this happening? Any ideas out there?


Have you already done the standard steps of restarting BOINC and your computer? I had a computer that just does CPU tasks go wonky and eat up its whole cache with computation errors. Nothing looked wrong on it, but after rebooting it was OK again.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

gizbar
Joined: 7 Jan 01
Posts: 586
Credit: 21,087,774
RAC: 0
United Kingdom
Message 963016 - Posted: 13 Jan 2010, 18:36:49 UTC

I get that with one of my systems, but it is the CUDA card that goes funny (like mdawson, unlike HAL9000). It quite often happens if I miss rescheduling a VLAR, but sometimes it just goes funny. It is in my kids' system, a 9800GTX+. My other card (a GTX260) is fine.

I normally reschedule any VLAR tasks and then have to reboot the system, because even if I reschedule, it will fail or kill all subsequent tasks once one goes bad. I try to make sure I reschedule once a day or so now...

regards, Gizbar.
____________


A proud GPU User Server Donor!

mdawson
Joined: 21 May 99
Posts: 37
Credit: 1,632,056
RAC: 0
United States
Message 963019 - Posted: 13 Jan 2010, 19:09:21 UTC - in response to Message 963016.

Yes, I've rebooted recently. In fact, I've been rebooting a lot lately as I am still installing software on this new drive. 'Bout every couple of days or so I reboot.

Question: How would one identify a VLAR? The file names have been similar to what's in my cache now. Someone else mentioned rescheduling them. How do you schedule anything? I run SETI CUDA on my GPUs, and Einstein on my CPUs. Up until this incident, everything has worked fine.
____________

hiamps
Volunteer tester
Joined: 23 May 99
Posts: 4292
Credit: 72,971,319
RAC: 0
United States
Message 963028 - Posted: 13 Jan 2010, 19:24:18 UTC

Overnight I downloaded over 100 VLARs, so they are out in force. Yes, do check the rescheduler.
____________
Official Abuser of Boinc Buttons...
And no good credit hound!

gizbar
Joined: 7 Jan 01
Posts: 586
Credit: 21,087,774
RAC: 0
United Kingdom
Message 963033 - Posted: 13 Jan 2010, 19:35:45 UTC - in response to Message 963019.

Yes, I've rebooted recently. In fact, I've been rebooting a lot lately as I am still installing software on this new drive. 'Bout every couple of days or so I reboot.

Question: How would one identify a VLAR? The file names have been similar to what's in my cache now. Someone else mentioned rescheduling them. How do you schedule anything? I run SETI cuda on my GPU's, and Einstein on my CPU's. Up until this incident, everything has worked fine.


They are identified by the 'angle range' of the task (hoping someone will explain it better than I can!). I don't know how you can identify one before it runs. Some of us are using optimised SETI apps which kill off any VLAR task queued for the GPU. The reschedule app does exactly what it says on the tin: it takes the VLAR GPU tasks and re-allocates them to the CPU. The reason for this is that the GPU is very inefficient at processing VLAR tasks. Some people just let them be killed, and they get allocated to somebody else. Some people reschedule them so that they crunch all the work they have downloaded. The rescheduler can be run manually (I do) or set to run automatically, but I found that it sometimes doesn't restart BOINC correctly (it asks for a password).
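On spotting one before it runs: the angle range should be recorded in the workunit file itself, so in principle it can be read ahead of time. The Python sketch below shows the idea only. The tag name `true_angle_range`, the header fragment, and the 0.05 cutoff are all assumptions for illustration, so check a real workunit file and the apps' actual threshold before relying on any of them.

```python
import re

# Assumed tag name; verify it against a real workunit file.
AR_TAG_RE = re.compile(r"<true_angle_range>\s*([0-9.eE+-]+)\s*</true_angle_range>")

def angle_range_from_header(header_text):
    """Return the angle range found in a workunit header, or None if absent."""
    match = AR_TAG_RE.search(header_text)
    return float(match.group(1)) if match else None

def is_vlar(angle_range, cutoff):
    """True if the angle range is known and falls below the caller-supplied cutoff."""
    return angle_range is not None and angle_range < cutoff

# Made-up header fragment for illustration only.
header = "<workunit_header><true_angle_range>0.010784</true_angle_range></workunit_header>"
ar = angle_range_from_header(header)
# 0.05 is an illustrative cutoff, not necessarily the one the apps use.
print(ar, is_vlar(ar, cutoff=0.05))
```

The cutoff is passed in rather than hard-coded because the thread never states the value the optimised apps use.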

The optimised apps (pre-empting another question, lol!) have had the maths functions refined so that they run better on newer CPUs, but you have to choose the right version for your CPU and keep it updated when necessary. The stock app is designed to run on all CPUs, which may not be the quickest way to process a task but is guaranteed to work everywhere.

You can install the optimised version to run just the CUDA tasks if you want to, and let the 'VLAR kill' kill off the VLAR tasks, since your CPUs are running Einstein and so aren't available for SETI tasks.

Make any sense?

regards, Gizbar.

____________


A proud GPU User Server Donor!

mdawson
Joined: 21 May 99
Posts: 37
Credit: 1,632,056
RAC: 0
United States
Message 963040 - Posted: 13 Jan 2010, 19:47:12 UTC - in response to Message 963033.

Yeah, sorta. I do use an optimized app for SETI which is why my RAC is 3x higher than it used to be. I did see a SETI running on a CPU the other day, perhaps that was a VLAR.

So from what you are saying, there is another app that auto kills the VLARs and reschedules them for the CPU, correct?

Another thought just occurred to me. MS released some patches the other day. I installed them, but didn't immediately do a reboot. Perhaps those failed files are a result of that?
____________

gizbar
Joined: 7 Jan 01
Posts: 586
Credit: 21,087,774
RAC: 0
United Kingdom
Message 963047 - Posted: 13 Jan 2010, 20:00:50 UTC - in response to Message 963040.
Last modified: 13 Jan 2010, 20:02:33 UTC

Possibly.

There are 2 optimised apps - one for 32-bit and one for 64-bit. In here you can choose what to run: AP, MB, or MB CUDA, in any combo that you want. This includes the 'VLAR kill'. It just stops the task from running on the GPU, gives a computation error, and starts another one. If the next one is a VLAR, that gets killed too, and so on, until it gets one that it can do. This doesn't send work to the CPU.

The reschedule app is different again. It allows you to juggle your work around. If you have VLARs queued for the GPU, it can 'rebrand' them so that they are transferred to the CPU. In addition, if you are running low on work for the GPU or CPU, you can get it to transfer work from one to the other, but it should always keep the VLAR work on the CPU. As I said before, I have problems running it automatically, so I always run it manually.
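To make the juggling concrete, here is a rough Python model of the decision just described: move VLARs off the GPU first, then top up the GPU queue with non-VLAR CPU work if it is running low. The `Task` class, the `min_gpu_tasks` parameter and the task names are invented for illustration; the real rescheduler works on BOINC's own files, which this sketch does not touch.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    device: str   # "gpu" or "cpu"
    vlar: bool

def reschedule(tasks, min_gpu_tasks=0):
    """Rebrand GPU VLARs to the CPU, then top up the GPU with non-VLAR CPU tasks."""
    # Step 1: no VLAR stays queued for the GPU.
    for task in tasks:
        if task.vlar and task.device == "gpu":
            task.device = "cpu"
    # Step 2: if the GPU queue is short, borrow non-VLAR work from the CPU.
    gpu_count = sum(1 for t in tasks if t.device == "gpu")
    for task in tasks:
        if gpu_count >= min_gpu_tasks:
            break
        if task.device == "cpu" and not task.vlar:
            task.device = "gpu"
            gpu_count += 1
    return tasks

queue = [
    Task("wu_a", "gpu", vlar=True),
    Task("wu_b", "gpu", vlar=False),
    Task("wu_c", "cpu", vlar=False),
    Task("wu_d", "cpu", vlar=True),
]
for task in reschedule(queue, min_gpu_tasks=2):
    print(task.name, task.device)
```

Running the function a second time changes nothing, which matches the idea of rescheduling on a timer.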

The optimised apps and the reschedule app are available from the Lunatics website. Looking over the post from Gundolf Jahn, it looks like you have already found it. There are a lot of VLARs coming through at the moment. I rescheduled 77 today on one machine, and about 40 on my other... [That'll teach me to read the thread properly before sticking my beak in, lol!]

HTH, Gizbar.
____________


A proud GPU User Server Donor!

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4433
Credit: 118,881,931
RAC: 138,546
United States
Message 963049 - Posted: 13 Jan 2010, 20:02:41 UTC - in response to Message 963040.

Yeah, sorta. I do use an optimized app for SETI which is why my RAC is 3x higher than it used to be. I did see a SETI running on a CPU the other day, perhaps that was a VLAR.

So from what you are saying, there is another app that auto kills the VLARs and reschedules them for the CPU, correct?

Another thought just occurred to me. MS released some patches the other day. I installed them, but didn't immediately do a reboot. Perhaps those failed files are a result of that?


With Microsoft it could be. One of the automatic updates borked the antivirus on our lab's primary DHCP server. I went through uninstalling each update until I found which one it was, and I make sure to do manual updates from now on.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 359,338
RAC: 33
Germany
Message 963051 - Posted: 13 Jan 2010, 20:05:14 UTC - in response to Message 963040.

Yeah, sorta. I do use an optimized app for SETI which is why my RAC is 3x higher than it used to be. I did see a SETI running on a CPU the other day, perhaps that was a VLAR.

No, that was a task originally assigned to the CPU.

So from what you are saying, there is another app that auto kills the VLARs and reschedules them for the CPU, correct?

Yes, it's the one you are running! The errors you see are killed VLARs.

Another thought just occurred to me. MS released some patches the other day. I installed them, but didn't immediately do a reboot. Perhaps those failed files are a result of that?

I'm not sure, but I don't think so.

Regards,
Gundolf

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,923,290
RAC: 11,871
United States
Message 963060 - Posted: 13 Jan 2010, 20:33:23 UTC - in response to Message 963051.

So from what you are saying, there is another app that auto kills the VLARs and reschedules them for the CPU, correct?


Yes, it's the one you are running! The errors you see are killed VLARs


Not quite: the app you're running kills the VLARs, but it returns them to the SETI server to be sent back out to someone else. It is the rescheduler that, rather than killing them, sends them to your CPU to work on. The errors you see marked as -6 are the VLARKill at work; those tasks are sent back to SETI.

I'm running the rescheduler every two hours automagically, but I also have the VLARKill lying in wait to ambush any that accidentally get through. (This happened right after the last power outage. I had run out of CUDA work, and when we came back up a bunch were downloaded and a VLAR got started before I could run the rescheduler.)
____________


PROUD MEMBER OF Team Starfire World BOINC

Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 359,338
RAC: 33
Germany
Message 963065 - Posted: 13 Jan 2010, 20:42:01 UTC - in response to Message 963060.

So from what you are saying, there is another app that auto kills the VLARs and reschedules them for the CPU, correct?


Yes, it's the one you are running! The errors you see are killed VLARs


Not quite, the app you're running kills the VLARs but it returns them to the SETI server to be sent back out to someone else. It is the rescheduler that rather than kill them, sends them to your CPU to work on. The errors you see marked as -6 is the VLARKill at work, they are sent back to SETI.

That's quite the same as I wrote! On the machine where VLARkill runs, the tasks are killed, nothing else.

Regards,
Gundolf

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,923,290
RAC: 11,871
United States
Message 963077 - Posted: 13 Jan 2010, 21:04:33 UTC - in response to Message 963065.

Gundolf,

I was just trying to say it a little differently. With the rescheduler they get changed from 6.08 to 6.03 without being run on the GPU. With the VLARKill they first have to start on the GPU and then are stopped with a -6 error and sent back to SETI to be reissued to someone else. I read his comment as he was thinking there was an app that killed them on the GPU then rescheduled them to his CPU.

Hope that makes it a bit clearer, I'm starting to confuse myself! :-)
____________


PROUD MEMBER OF Team Starfire World BOINC

Mox
Joined: 8 Apr 09
Posts: 31
Credit: 372,650
RAC: 0
Message 963833 - Posted: 16 Jan 2010, 16:40:21 UTC - in response to Message 963077.

Who is VLAR, and why does he kill tasks!?

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,923,290
RAC: 11,871
United States
Message 963849 - Posted: 16 Jan 2010, 17:16:38 UTC - in response to Message 963833.

VLAR is Very Low Angle Range referring to the angle of the telescope in relation to the earth. For some reason these WUs are very hard on the GPUs. The Lunatics crew came up with a nifty little app that detects these VLAR WUs as soon as they start on a GPU, stops them and reports them back to the server as -6 errors. This is great if they get reassigned to someone on their CPU but runs the risk of the work not getting done if it gets sent out to too many people on their GPUs.
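To put a rough number on that risk, here is a small Python sketch. It assumes each host a task is sent to kills it independently with the same probability, and that the project gives up after a fixed number of failed attempts. The kill fractions and the attempt limit of 5 are made-up illustration values, not project settings.

```python
def chance_all_attempts_killed(kill_fraction, attempts):
    """Probability that every one of `attempts` independent hosts kills the task."""
    if not 0.0 <= kill_fraction <= 1.0:
        raise ValueError("kill_fraction must be between 0 and 1")
    return kill_fraction ** attempts

# Illustrative values only: fraction of hosts running the killer on a GPU.
for fraction in (0.1, 0.3, 0.5):
    print(fraction, chance_all_attempts_killed(fraction, attempts=5))
```

Under these assumptions the chance of a task never getting crunched stays small unless a large share of hosts are killing VLARs.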

Raistmer came up with another idea: detect VLARs before they start on the GPU and reschedule them to run on the CPU. This way the work gets done by the same person. Though the rescheduler has been improved by the guys at Lunatics, there are still a couple of small problems, so you sort of have to keep an eye on it. You can set it to run automagically, but sometimes it has a problem with forgetting the password after it shuts down to do the reschedule. When this happens you have to manually restart the client. As I understand it, they are working on a new one that should fix these little annoyances.
____________


PROUD MEMBER OF Team Starfire World BOINC

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8632
Credit: 51,509,827
RAC: 47,977
United Kingdom
Message 963852 - Posted: 16 Jan 2010, 17:34:16 UTC - in response to Message 963849.

VLAR is Very Low Angle Range referring to the angle of the telescope in relation to the earth...

Strictly speaking, it isn't anything to do with where the telescope is pointing in relation to the earth: it's the change in the aiming point of the telescope over the 107 seconds of radio recording contained within each task.

A high Angle Range means that the telescope was sweeping (relatively) quickly across the sky: a VLAR task means that the telescope was pointing steadily at a single point in the sky, with the movement of the telescope compensating for the turning of the earth.
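As a rough illustration of that definition, the Python sketch below measures the angle between where the telescope was aimed at the start and at the end of the 107 seconds. This end-to-end separation is a simplification chosen for the example; the project's own angle range calculation may well be done differently. The function name and the sample coordinates are made up.

```python
import math

SIDEREAL_DAY_S = 86164.1   # one rotation of the Earth, in seconds
TASK_LENGTH_S = 107.0      # recording length per task, as stated above

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle angle in degrees between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, which stays accurate for very small angles.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h)))

# A dish held still while the sky drifts past (declination 0 for simplicity).
drift_deg = 360.0 * TASK_LENGTH_S / SIDEREAL_DAY_S
print(round(angular_separation_deg(0.0, 0.0, drift_deg, 0.0), 3))

# A dish tracking a single point: start and end positions coincide.
print(angular_separation_deg(10.0, 20.0, 10.0, 20.0))
```

The first case comes out at a bit under half a degree, and the second at zero, which is the VLAR end of the scale.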

Mox
Joined: 8 Apr 09
Posts: 31
Credit: 372,650
RAC: 0
Message 963918 - Posted: 16 Jan 2010, 20:46:12 UTC

So is it better to use the VLARautoKill application instead of the nonAutoKill one?

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,923,290
RAC: 11,871
United States
Message 963927 - Posted: 16 Jan 2010, 21:19:46 UTC - in response to Message 963918.

Since the VLARs are hard on the GPUs, it is best to either kill them off or use the rescheduler to move them to your CPU. I have both because it is possible for a VLAR to sneak past the rescheduler. During one of the outages I ran out of GPU WUs, and when the server came back up it sent me a bunch of VLARs. One started on my GPU but the killer caught it. I got the rest moved over with the rescheduler.
____________


PROUD MEMBER OF Team Starfire World BOINC


Copyright © 2014 University of California