Computation Errors ATI GPU
Author | Message |
---|---|
awdorrin Send message Joined: 27 Sep 99 Posts: 71 Credit: 106,424,089 RAC: 261 |
For the past few days I have been getting a lot of errors on work units: http://setiathome.berkeley.edu/results.php?hostid=4226847&offset=0&show_names=0&state=6&appid= http://setiathome.berkeley.edu/result.php?resultid=2934756090 I'm not sure if I have something wrong with the new 7850 GPU I added to my system, or if these are work unit issues. I'm also not really sure how to find out; I spent some time searching Google and this site for ideas and didn't have any luck. Does anyone have any ideas about what my issue might be? Thanks! |
Gatekeeper Send message Joined: 14 Jul 04 Posts: 887 Credit: 176,479,616 RAC: 0 |
For the past few days I have been getting a lot of errors on work units: See here |
awdorrin Send message Joined: 27 Sep 99 Posts: 71 Credit: 106,424,089 RAC: 261 |
I'm still running the 7.0.28 BOINC manager, not 7.0.33. Was this change pushed back somehow? I looked at my app_info.xml and I do not have any 'flops' entries. I was wondering if these timeouts are a result of the different speeds of the three graphics cards I have in my system (a 7850 and two 5770 cards)? |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
Yes. BOINC keeps only one average speed estimate for all the GPUs in a host, so if the GPUs are too far apart in relative speed, that average may be too fast for the slower cards (a task shouldn't fail if the GPU goes faster than expected). In that case, though, the errors should only be on results from the 5770s... If this is the case, there is nothing really effective to fix it (other than putting the different GPUs in different hosts). You can try setting the flops value lower than the APR to see if things become stable, but in the long run BOINC will start to reduce (on the server side) the estimated size of the tasks assigned to that host if it is finishing them faster than expected, and then you will be in trouble again... |
awdorrin Send message Joined: 27 Sep 99 Posts: 71 Credit: 106,424,089 RAC: 261 |
I wouldn't have thought that the 5770s would be that much slower than the 7850 to cause a problem like this. Seems like the BOINC client is throwing out the results only because it thinks they have run too long, without any consideration of the speed of the cards? From the BOINC event log I saw: ATI GPU 0: Pitcairn (CAL version 1.4.1741, 2048MB, 2008MB available, 4403 GFLOPS peak); ATI GPU 1: Juniper (CAL version 1.4.1741, 1024MB, 991MB available, 2752 GFLOPS peak); ATI GPU 2: Juniper (CAL version 1.4.1741, 1024MB, 991MB available, 2752 GFLOPS peak). Is there really no way to use the <flops> setting to increase the allowable time the tasks run on the 5770 cards? I don't have another system that I can move the 5770 cards into, and removing the 5770s from my system would result in losing 5504 GFLOPS of potential SETI crunching. I have been researching the <flops> setting, but I don't quite understand how you calculate a good value. |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
It is considering the speed of the cards, but not for each one independently; BOINC uses only an average speed. If the average goes higher because one card is much faster than the others, then this could happen... This is a BOINC issue: the whole platform is coded on the assumption that if you have more than one GPU you will use only the fastest one, and if you choose to run the others then BOINC assumes they are of similar speed... For the flops, you can use as a starting point the APR shown for each app on the host detail page... I've looked there, and for the MB tasks it shows an APR of 697 (GFLOPS), while for the AP tasks it shows 343. For a start, add the flops tag only for the app that is giving failures. So the starting value for the flops should be less than 697000000000 (or 697e9) for MB and less than 343e9 for AP... |
awdorrin Send message Joined: 27 Sep 99 Posts: 71 Credit: 106,424,089 RAC: 261 |
Setting the <flops> value to 350,000,000 seems to give me a time limit of 9057.8s, while 3,500,000,000 gave me 905.78s. I was getting timeouts after 2900s or 1800s. The 9057.8s limit is probably too high, but I figured I'd see how it worked for a few days and then try to fine-tune further. I figure a setting of 500,000,000 should give me around 6340s, if I'm understanding this correctly. |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
It's not linear, but it is close, so if you double the flops, the estimated time will be around half... Keep in mind that work already received has a maximum allowed time that was assigned under another speed, and for those tasks the flops value won't help; you should see what happens with new tasks received after changing the flops... It may be easier to set the project to no new tasks and wait until the cache is empty before changing the flops... Then you will need to keep fine-tuning until you find a value that is not too fast for the slow GPUs (so they don't fail) but not so slow for the faster one that the estimates become very wrong (which speeds up the BOINC mechanism that adapts the length of the tasks)... Just be patient and don't try to change the flops several times a day. If there are no more timeouts, let it run for at least a week before changing the flops value again, and it will be better if you empty the cache first. |
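
The rough rule discussed in the thread (the time limit is approximately inversely proportional to the `<flops>` value) can be sketched in a few lines of Python. This is only an approximation: it assumes the limit scales exactly as 1/flops, which the thread says is close but not exact, and it uses the one data point reported above (350,000,000 flops giving a 9057.8s limit) as its reference. The function names are made up for illustration and are not part of BOINC.

```python
# Rough estimate only: assumes time limit is proportional to 1 / flops,
# calibrated from a single observation reported in this thread.
REF_FLOPS = 350e6       # <flops> value that was tried
REF_LIMIT_S = 9057.8    # time limit observed with that value, in seconds


def estimate_time_limit(flops, ref_flops=REF_FLOPS, ref_limit_s=REF_LIMIT_S):
    """Scale the observed time limit to a different <flops> value."""
    return ref_limit_s * ref_flops / flops


def flops_for_limit(target_limit_s, ref_flops=REF_FLOPS, ref_limit_s=REF_LIMIT_S):
    """Invert the same relation: which <flops> value gives roughly this limit?"""
    return ref_flops * ref_limit_s / target_limit_s


if __name__ == "__main__":
    for flops in (350e6, 500e6, 3.5e9):
        limit = estimate_time_limit(flops)
        print(f"flops={flops:.3g} -> time limit of about {limit:.0f} s")
```

With these assumptions, 500,000,000 works out to about 6340s and 3,500,000,000 to about 906s, matching the figures quoted above. Any value chosen this way only affects tasks received after the change, so it is a starting point for the fine-tuning described in the last post rather than a final answer.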