Computation Errors ATI GPU



Message boards : Number crunching : Computation Errors ATI GPU

awdorrin
Joined: 27 Sep 99
Posts: 26
Credit: 16,976,792
RAC: 35,565
United States
Message 1357903 - Posted: 17 Apr 2013, 22:13:46 UTC

For the past few days I have been getting a lot of errors on work units:

http://setiathome.berkeley.edu/results.php?hostid=4226847&offset=0&show_names=0&state=6&appid=

http://setiathome.berkeley.edu/result.php?resultid=2934756090

I'm not sure if I have something wrong with the new 7850 GPU I added to my system, or if these are work unit issues.

I'm also not really sure how to find out. I spent some time searching Google and this site for ideas and didn't have any luck.

Does anyone have any ideas about what my issue might be?

Thanks!


____________

Gatekeeper
Joined: 14 Jul 04
Posts: 887
Credit: 176,479,616
RAC: 0
United States
Message 1357906 - Posted: 17 Apr 2013, 22:51:24 UTC - in response to Message 1357903.


See here
____________

awdorrin
Joined: 27 Sep 99
Posts: 26
Credit: 16,976,792
RAC: 35,565
United States
Message 1357922 - Posted: 18 Apr 2013, 1:29:20 UTC - in response to Message 1357906.

I'm still running the 7.0.28 BOINC manager, not 7.0.33

Was this change pushed back somehow?

I looked at my app_info.xml and I do not have any 'flops' entries.

I was wondering if these time outs are a result of the different speeds of the three graphics cards I have in my system (a 7850 and two 5770 cards)?
____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,130,078
RAC: 38,303
Argentina
Message 1357930 - Posted: 18 Apr 2013, 2:57:23 UTC - in response to Message 1357922.

Yes. BOINC keeps only one average speed estimate for all the GPUs in a host, so if the GPUs are too far apart in relative speed, that average may be too fast for the slower cards (a task shouldn't fail if the GPU runs faster than expected). But in that case the errors should appear only on results from the 5770s...
If this is the case, there is nothing really effective to fix it, other than putting the different GPUs in different hosts. You can try setting the flops value slower than the APR to see whether things stabilize, but in the long run BOINC will start to reduce (on the server side) the estimated size of the tasks assigned to that host if it finishes them faster than expected, and then you will be in trouble again...
____________

awdorrin
Joined: 27 Sep 99
Posts: 26
Credit: 16,976,792
RAC: 35,565
United States
Message 1358300 - Posted: 18 Apr 2013, 22:17:06 UTC

I wouldn't have thought that the 5770s would be so much slower than the 7850 as to cause a problem like this. It seems like the BOINC client is throwing out the results only because it thinks they have run too long, without any consideration of the speed of the cards?

From the BOINC event log I saw:

ATI GPU 0: Pitcairn (CAL version 1.4.1741, 2048MB, 2008MB available, 4403 GFLOPS peak)
ATI GPU 1: Juniper (CAL version 1.4.1741, 1024MB, 991MB available, 2752 GFLOPS peak)
ATI GPU 2: Juniper (CAL version 1.4.1741, 1024MB, 991MB available, 2752 GFLOPS peak)

Is there really no way to use the <flops> setting to increase the allowable time the tasks run on the 5770 cards?

I don't have another system that I can move the 5770 cards into, and removing the 5770s from my system would result in losing 5504 GFLOPS of potential Seti crunching.

I have been researching the <flops> setting, but I don't quite understand how you calculate a good value.
____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,130,078
RAC: 38,303
Argentina
Message 1358308 - Posted: 18 Apr 2013, 23:11:27 UTC - in response to Message 1358300.

It is considering the speed of the cards, but not for each one independently; BOINC uses only an average speed.
If the average goes up because one card is much faster than the others, then this can happen...
This is a BOINC issue: the whole platform is coded on the assumption that if you have more than one GPU you will use only the fastest one, and if you choose to run the others, BOINC assumes they are of similar speed...

For the flops, you can use as a starting point the APR shown for each app on the host detail page...
I've looked there, and for the MB tasks it shows an APR of 697 (GFLOPS), while for AP it shows 343.

For a start, add the flops tag only for the app that is giving failures.
So the starting value for the flops should be less than 697000000000 (or 697e9) for MB and less than 343e9 for AP...
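
As a rough sketch of the unit conversion described above: the host detail page shows APR in GFLOPS, while the flops values quoted here are in plain operations per second. The helper name `apr_to_flops` and its `margin` parameter are made up for illustration, and the 0.8 margin in the usage lines is only an example of "less than the APR", not a recommended setting.

```python
def apr_to_flops(apr_gflops, margin=1.0):
    """Convert an APR figure in GFLOPS to a flops value in operations per second.

    margin is a hypothetical scaling factor in (0, 1]; a value below 1.0
    gives a flops value lower than the APR, as suggested in the post.
    """
    if not 0.0 < margin <= 1.0:
        raise ValueError("margin must be in the range (0, 1]")
    return apr_gflops * 1e9 * margin


if __name__ == "__main__":
    # APR values quoted in the post: 697 GFLOPS for MB, 343 GFLOPS for AP.
    print("MB upper bound:", apr_to_flops(697))
    print("AP upper bound:", apr_to_flops(343))
    # Example only: start 20% below the APR for the app that is failing.
    print("MB with 0.8 margin:", apr_to_flops(697, margin=0.8))
```

The result is only a starting point; the right value still has to be found by watching how new tasks behave.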
____________

awdorrin
Joined: 27 Sep 99
Posts: 26
Credit: 16,976,792
RAC: 35,565
United States
Message 1358697 - Posted: 19 Apr 2013, 22:39:46 UTC - in response to Message 1358308.

Setting the <flops> value to 350,000,000 seems to give me a time limit of 9057.8s, while 3,500,000,000 gave me 905.78s. I was getting timeouts after 2900s or 1800s.

The 9057.8s limit is probably higher than it needs to be, but I figured I'd see how it worked for a few days and then try to fine-tune further.

I figure a setting of 500,000,000 should give me around 6340s, if I'm understanding this correctly.
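
The figures above are consistent with the time limit scaling inversely with the flops value. A minimal sketch of that arithmetic, assuming strict inverse proportionality (which is only an approximation) and using the 350,000,000 and 9057.8s pair from this post as the reference point. The function names are hypothetical and are not part of BOINC.

```python
# Reference pair observed in the post: flops value and the resulting time limit.
REF_FLOPS = 350_000_000
REF_LIMIT_S = 9057.8


def estimate_time_limit(flops, ref_flops=REF_FLOPS, ref_limit_s=REF_LIMIT_S):
    """Estimate the time limit in seconds for a flops value, assuming limit * flops is constant."""
    return ref_limit_s * ref_flops / flops


def flops_for_limit(target_limit_s, ref_flops=REF_FLOPS, ref_limit_s=REF_LIMIT_S):
    """Inverse of estimate_time_limit: the flops value that would give the target limit."""
    return ref_limit_s * ref_flops / target_limit_s


if __name__ == "__main__":
    for flops in (350_000_000, 500_000_000, 3_500_000_000):
        print(flops, "->", round(estimate_time_limit(flops), 2), "s")
    # Flops values at which the limit would equal the observed timeout durations.
    for timeout_s in (1800, 2900):
        print(timeout_s, "s ->", round(flops_for_limit(timeout_s)))
```

Under this assumption, any flops value below the one returned for the longest observed run time would leave more headroom than the tasks that timed out had.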


____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,130,078
RAC: 38,303
Argentina
Message 1358751 - Posted: 20 Apr 2013, 2:06:30 UTC - in response to Message 1358697.

It's not linear, but it is close, so if you double the flops, the estimated time will be around half...

Keep in mind that work you have already received has a maximum allowed time that was assigned under a different speed, and for those tasks the flops value won't help. You should watch what happens with new tasks received after changing the flops... It may be easier to set the project to No New Tasks and wait until the cache is empty before changing the flops...

Then you will need to keep fine-tuning until you find the right value: not too fast for the slow GPUs (so they don't fail), but not so slow for the faster one that the estimates become very wrong (which speeds up the BOINC mechanism that adapts the length of the tasks)... Just be patient and don't change the flops several times a day. If there are no more timeouts, let it run for at least a week before changing the flops value again, and it will be better if you empty the cache first.


____________


Copyright © 2014 University of California