Upgrading Cuda App - BOINC Trashing Units

Terror Australis
Volunteer tester

Message 1018765 - Posted: 23 Jul 2010, 17:38:33 UTC
Last modified: 23 Jul 2010, 17:41:23 UTC

Just tried to install my GTX470. As usual with a change of app, BOINC immediately trashed all the units on the machine, including the completed units waiting to upload. It does this because it sees that the units have been assigned to a particular version of the app, and if it can't find that version (in this case V6.09) it deletes the WU files. It even deleted the GPU units I had moved to the CPU prior to fitting the new card. The app_info file is good; I triple-checked it before I started BOINC. Is there any workaround for this?

At least this time I remembered to make a backup of the BOINC and BOINC_DATA folders so that everything was there when I rolled back to the old cards :-S

T.A.

Tim Norton
Volunteer tester
Message 1018772 - Posted: 23 Jul 2010, 17:47:46 UTC

I have also been caught out by this and lost a few units. Fortunately they were only ones in my cache, not completed ones :)

It would be nice if BOINC actually asked you what you wanted to do rather than deleting everything, but that would mean being user friendly rather than functional.

I have not found a workaround other than backing up your directories, but it also helps to finish all the work you want to do and abort the rest before installing new hardware or an optimised app.

Sometimes you forget and leave a poor wingman hanging.

It would be good if you could mark a WU as "oops, sorry" so it could be sent out again quickly.

Not a real help, but my two pennies' worth :)
Tim

hiamps
Volunteer tester
Message 1018775 - Posted: 23 Jul 2010, 17:49:41 UTC

I always suspend boinc when updating...
Official Abuser of Boinc Buttons...
And no good credit hound!

Terror Australis
Volunteer tester
Message 1018785 - Posted: 23 Jul 2010, 18:00:36 UTC - in response to Message 1018775.  

hiamps wrote:
> I always suspend boinc when updating...


I did all the right things and suspended BOINC and networking, but the problem occurs at the initial startup. I think that when BOINC reads the client_state file, it compares what's in there with the app_info. If the app version for a WU listed in the client_state doesn't agree with the version in the app_info, BOINC says "bye, bye" to the WU, even if it's been completed.
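
The comparison described above can be approximated offline before BOINC is restarted. The following is only a sketch: it assumes client_state.xml lists each task as a `<result>` with `<name>`, `<wu_name>`, `<version_num>` and `<plan_class>`, and each `<workunit>` with `<name>` and `<app_name>`. That layout is an assumption and is not shown anywhere in this thread. The function names are made up for illustration, the platform is ignored, and the real client may apply further rules.

```python
import xml.etree.ElementTree as ET


def app_versions(app_info_xml):
    """Return the (app_name, version_num, plan_class) tuples declared in app_info."""
    root = ET.fromstring(app_info_xml)
    return {
        (av.findtext("app_name", "").strip(),
         av.findtext("version_num", "").strip(),
         av.findtext("plan_class", "").strip())
        for av in root.iter("app_version")
    }


def orphaned_results(client_state_xml, app_info_xml):
    """List result names whose app version has no matching <app_version> entry."""
    declared = app_versions(app_info_xml)
    state = ET.fromstring(client_state_xml)
    # Map workunit name -> app name (assumed client_state.xml layout).
    wu_app = {
        wu.findtext("name", "").strip(): wu.findtext("app_name", "").strip()
        for wu in state.iter("workunit")
    }
    orphans = []
    for res in state.iter("result"):
        key = (wu_app.get(res.findtext("wu_name", "").strip(), ""),
               res.findtext("version_num", "").strip(),
               res.findtext("plan_class", "").strip())
        if key not in declared:
            orphans.append(res.findtext("name", "").strip())
    return orphans


# Tiny hand-written inputs, not real BOINC files.
APP_INFO = """<app_info>
<app_version>
<app_name>setiathome_enhanced</app_name>
<version_num>610</version_num>
<plan_class>cuda_fermi</plan_class>
</app_version>
</app_info>"""

CLIENT_STATE = """<client_state>
<workunit><name>wu_a</name><app_name>setiathome_enhanced</app_name></workunit>
<workunit><name>wu_b</name><app_name>setiathome_enhanced</app_name></workunit>
<result><name>wu_a_0</name><wu_name>wu_a</wu_name>
<version_num>609</version_num><plan_class>cuda</plan_class></result>
<result><name>wu_b_1</name><wu_name>wu_b</wu_name>
<version_num>610</version_num><plan_class>cuda_fermi</plan_class></result>
</client_state>"""

print(orphaned_results(CLIENT_STATE, APP_INFO))  # prints ['wu_a_0']
```

Any name it prints belongs to a task that, on the theory above, would be at risk at the next startup. An empty list does not prove the tasks are safe.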

T.A.

TheFreshPrince a.k.a. BlueTooth76
Message 1018789 - Posted: 23 Jul 2010, 18:13:49 UTC
Last modified: 23 Jul 2010, 18:14:25 UTC

Before you make any changes:

- shutdown Boinc
- backup both Boinc folders
- disconnect from the internet
- try if a change works
- if it works; connect to the internet again
- if it doesnt work; delete both Boinc folders, restore the ones from the backup and try again

:)
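
The backup and restore steps in the list above could be scripted roughly as follows. This is a minimal sketch, assuming BOINC is already shut down; the function names are invented for illustration, and the folder locations in the usage note are assumptions, since the thread never states where the two folders live.

```python
import shutil
import time
from pathlib import Path


def backup_boinc(folders, dest_root):
    """Copy each BOINC folder into a fresh timestamped directory under dest_root."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(dest_root) / f"boinc-backup-{stamp}"
    copies = []
    for folder in map(Path, folders):
        dst = target / folder.name
        # copytree creates missing parent directories and fails if dst exists.
        shutil.copytree(folder, dst)
        copies.append(dst)
    return copies


def restore_boinc(copies, originals):
    """Delete each original folder and put the backed-up copy in its place."""
    for src, original in zip(copies, map(Path, originals)):
        if original.exists():
            shutil.rmtree(original)
        shutil.copytree(src, original)
```

A call might look like `copies = backup_boinc([r"C:\Program Files\BOINC", r"C:\ProgramData\BOINC"], r"D:\backups")`, followed by `restore_boinc(copies, [...])` with the same two folders if the change does not work. Both paths are placeholders to adjust for the actual install.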
Rig name: "x6Crunchy"
OS: Win 7 x64
MB: Asus M4N98TD EVO
CPU: AMD X6 1055T 2.8(1,2v)
GPU: 2x Asus GTX560ti
Member of: Dutch Power Cows

perryjay
Volunteer tester
Message 1018802 - Posted: 23 Jul 2010, 18:28:19 UTC

Guys, pardon this old non-Fermi character butting in but I think I might see a problem or two.

He is installing a new 470 card in place of "older" (probably non-Fermi) cards. He has an app_info with 6.09 but probably doesn't have an entry for 6.10 Fermi, so of course it will trash all his 6.09s. Try reading up on what needs to be done here: http://setiathome.berkeley.edu/forum_thread.php?id=60079


PROUD MEMBER OF Team Starfire World BOINC

Questor
Volunteer tester
Message 1018807 - Posted: 23 Jul 2010, 18:39:06 UTC - in response to Message 1018802.  
Last modified: 23 Jul 2010, 18:41:36 UTC

perryjay wrote:
> Guys, pardon this old non-Fermi character butting in but I think I might see a problem or two.
>
> He is installing a new 470 card in place of "older" (probably non-Fermi) cards. He has an app_info with 6.09 but probably doesn't have an entry for 6.10 Fermi, so of course it will trash all his 6.09s. Try reading up on what needs to be done here: http://setiathome.berkeley.edu/forum_thread.php?id=60079


As long as the new Fermi app is referenced in app_info.xml as 609, it won't matter: the workunit details will match the existing 609 WUs. OTOH, if he has put in an entry as 610 for the Fermi card in app_info and removed the 609 reference, then all 609 WUs will be trashed because no apps match.

Could you post app_info.xml? It will be easier to analyze what the problem might be. (An example <workunit> and <result> section for CPU and GPU would help as well.)

You also have to watch that platform entries match, especially between CPU and GPU when rescheduling. You might need a couple of very similar-looking entries in app_info to cope with the possible combinations, especially during the transition between the cards.


John.
GPU Users Group



Terror Australis
Volunteer tester
Message 1018833 - Posted: 23 Jul 2010, 19:37:39 UTC

Update.
With some app_info fiddling I've managed to stop it deleting V6.09 files and it is actually crunching GPU units on all cards.
But there is a bug in it where it doesn't see the AK_v8_win_SSSE3x.exe entries and so deletes the CPU files. This bug was in the original app_info file I copied from the Fermi thread. The AK_v8 file is in the SAH directory and, as far as I can tell, the entries for it are the same as in the original V6.09 app_info, which works. The error message is

"State file error missing application AK_v8_win_ssse3x.exe"

The file is posted below. One other bug is that it is trying to run 3 tasks on all cards not just the Fermi. What sort of entry do I need to limit the non-fermi cards to one unit at a time?

<app_info>
<app>
<name>setiathome_enhanced</name>
</app>
<file_info>
<file_name>AK_v8_win_SSSE3x.exe</file_name>
<executable/>
</file_info>
<file_info>
<name>setiathome_6.09_windows_intelx86__cuda23.exe</name>
<executable/>
</file_info>
<file_info>
<name>cudart.dll</name>
<executable/>
</file_info>
<file_info>
<name>cufft.dll</name>
<executable/>
</file_info>
<app_version>
<app_name>setiathome_enhanced</app_name>
<version_num>603</version_num>
<platform>windows_intelx86</platform>
<flops>6051935388.55510675</flops>
<file_ref>
<file_name>AK_v8_win_SSSE3x.exe</file_name>
<main_program/>
</file_ref>
</app_version>
<app_version>
<app_name>setiathome_enhanced</app_name>
<version_num>603</version_num>
<platform>windows_x86</platform>
<file_ref>
<file_name>AK_v8_win_SSSE3x.exe</file_name>
<main_program/>
</file_ref>
</app_version>
<app>
<name>setiathome_enhanced</name>
</app>
<file_info>
<name>libfftw3f-3-1-1a_upx.dll</name>
<executable/>
</file_info>
<file_info>
<name>setiathome_6.10_windows_intelx86__cuda_fermi.exe</name>
<executable/>
</file_info>
<file_info>
<name>cudart32_30_14.dll</name>
<executable/>
</file_info>
<file_info>
<name>cufft32_30_14.dll</name>
<executable/>
</file_info>
<app_version>
<app_name>setiathome_enhanced</app_name>
<version_num>610</version_num>
<avg_ncpus>0.200000</avg_ncpus>
<max_ncpus>0.200000</max_ncpus>
<flops>57462450464</flops>
<plan_class>cuda_fermi</plan_class>
<file_ref>
<file_name>setiathome_6.10_windows_intelx86__cuda_fermi.exe</file_name>
<main_program/>
</file_ref>
<file_ref>
<file_name>cudart32_30_14.dll</file_name>
</file_ref>
<file_ref>
<file_name>cufft32_30_14.dll</file_name>
</file_ref>
<file_ref>
<file_name>libfftw3f-3-1-1a_upx.dll</file_name>
</file_ref>
<app_version>
<app_name>setiathome_enhanced</app_name>
<version_num>609</version_num>
<platform>windows_intelx86</platform>
<avg_ncpus>0.05</avg_ncpus>
<max_ncpus>0.05</max_ncpus>
<flops>320000000000</flops>
<plan_class>cuda</plan_class>
<file_ref>
<file_name>setiathome_6.09_windows_intelx86__cuda23.exe</file_name>
<main_program/>
</file_ref>
<file_ref>
<file_name>cudart.dll</file_name>
</file_ref>
<file_ref>
<file_name>cufft.dll</file_name>
</file_ref>
<file_ref>
<file_name>libfftw3f-3-1-1a_upx.dll</file_name>
</file_ref>
<coproc>
<coproc>
<type>CUDA</type>
<count>0.33</count>
</coproc>
</app_version>
</app_info>

TIA
Brodo

Josef W. Segur
Volunteer developer
Volunteer tester
Message 1018840 - Posted: 23 Jul 2010, 20:06:27 UTC - in response to Message 1018833.  

Terror Australis wrote:
> Update.
> With some app_info fiddling I've managed to stop it deleting V6.09 files and it is actually crunching GPU units on all cards.
> But there is a bug in it where it doesn't see the AK_v8_win_SSSE3x.exe entries and so deletes the CPU files. This bug was in the original app_info file I copied from the Fermi thread. The AK_v8 file is in the SAH directory and, as far as I can tell, the entries for it are the same as in the original V6.09 app_info, which works. The error message is
>
> "State file error missing application AK_v8_win_ssse3x.exe"
>
> The file is posted below. One other bug is that it is trying to run 3 tasks on all cards not just the Fermi. What sort of entry do I need to limit the non-fermi cards to one unit at a time?
> ...
> TIA
> Brodo

Let's see, there's an extra <app_version> section for CPU, and a missing </app_version> tag for the cuda_fermi section, which led to the original trashing of cuda23 tasks. The following has stuff needing deletion in red and needed additions in green:

<app_info>
<app>
<name>setiathome_enhanced</name>
</app>
<file_info>
<file_name>AK_v8_win_SSSE3x.exe</file_name>
<executable/>
</file_info>
<file_info>
<name>setiathome_6.09_windows_intelx86__cuda23.exe</name>
<executable/>
</file_info>
<file_info>
<name>cudart.dll</name>
<executable/>
</file_info>
<file_info>
<name>cufft.dll</name>
<executable/>
</file_info>
<app_version>
<app_name>setiathome_enhanced</app_name>
<version_num>603</version_num>
<platform>windows_intelx86</platform>
<flops>6051935388.55510675</flops>
<file_ref>
<file_name>AK_v8_win_SSSE3x.exe</file_name>
<main_program/>
</file_ref>
</app_version>
<app_version>
<app_name>setiathome_enhanced</app_name>
<version_num>603</version_num>
<platform>windows_x86</platform>
<file_ref>
<file_name>AK_v8_win_SSSE3x.exe</file_name>
<main_program/>
</file_ref>
</app_version>

<app>
<name>setiathome_enhanced</name>
</app>
<file_info>
<name>libfftw3f-3-1-1a_upx.dll</name>
<executable/>
</file_info>
<file_info>
<name>setiathome_6.10_windows_intelx86__cuda_fermi.exe</name>
<executable/>
</file_info>
<file_info>
<name>cudart32_30_14.dll</name>
<executable/>
</file_info>
<file_info>
<name>cufft32_30_14.dll</name>
<executable/>
</file_info>
<app_version>
<app_name>setiathome_enhanced</app_name>
<version_num>610</version_num>
<avg_ncpus>0.200000</avg_ncpus>
<max_ncpus>0.200000</max_ncpus>
<flops>57462450464</flops>
<plan_class>cuda_fermi</plan_class>
<file_ref>
<file_name>setiathome_6.10_windows_intelx86__cuda_fermi.exe</file_name>
<main_program/>
</file_ref>
<file_ref>
<file_name>cudart32_30_14.dll</file_name>
</file_ref>
<file_ref>
<file_name>cufft32_30_14.dll</file_name>
</file_ref>
<file_ref>
<file_name>libfftw3f-3-1-1a_upx.dll</file_name>
</file_ref>
</app_version>
<app_version>
<app_name>setiathome_enhanced</app_name>
<version_num>609</version_num>
<platform>windows_intelx86</platform>
<avg_ncpus>0.05</avg_ncpus>
<max_ncpus>0.05</max_ncpus>
<flops>320000000000</flops>
<plan_class>cuda</plan_class>
<file_ref>
<file_name>setiathome_6.09_windows_intelx86__cuda23.exe</file_name>
<main_program/>
</file_ref>
<file_ref>
<file_name>cudart.dll</file_name>
</file_ref>
<file_ref>
<file_name>cufft.dll</file_name>
</file_ref>
<file_ref>
<file_name>libfftw3f-3-1-1a_upx.dll</file_name>
</file_ref>
<coproc>
<coproc>
<type>CUDA</type>
<count>0.33</count>
</coproc>
</app_version>
</app_info>
                                                                 Joe
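
A missing closing tag of the kind described in this post can be caught mechanically before BOINC ever reads the file. The sketch below simply runs the text through Python's strict XML parser; `check_well_formed` is a made-up helper name and the sample strings are cut-down stand-ins for the real file. The BOINC client reads app_info.xml with its own parser, so this is a stricter test than the client itself applies, and passing it does not guarantee the client will accept the file. A strict parser would also object to the doubled `<coproc>` opening tag near the end of the listing above.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses as XML, otherwise the parser's error text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


# Cut-down stand-in for the posted file: the 610 section is never closed.
BROKEN = """<app_info>
<app_version>
<version_num>610</version_num>
<app_version>
<version_num>609</version_num>
</app_version>
</app_info>"""

# Same text with the missing </app_version> added after the 610 entry.
FIXED = BROKEN.replace(
    "<version_num>610</version_num>\n",
    "<version_num>610</version_num>\n</app_version>\n",
)

print(check_well_formed(BROKEN))  # an error message naming a line and column
print(check_well_formed(FIXED))   # None
```

The reported line number points at where the parser noticed the mismatch, which is usually some way below the tag that is actually missing.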

Claggy
Volunteer tester
Message 1018854 - Posted: 23 Jul 2010, 20:39:58 UTC - in response to Message 1018833.  
Last modified: 23 Jul 2010, 20:46:46 UTC

In addition to Joe's points, I'd update BOINC to 6.10.58: it won't delete files that aren't in the app_info, and it should correctly state the Fermi's GFLOPS, as well as bringing numerous other fixes and improvements.

I'm not so sure about having both Fermi and Cuda23 apps in the app_info; I doubt that's going to work properly.

Claggy

Terror Australis
Volunteer tester
Message 1018860 - Posted: 23 Jul 2010, 21:01:56 UTC

Sorry Joe
Still getting this error
24/07/2010 06:16:55 SETI@home [error] State file error: missing application file AK_v8_win_SSSE3x.exe

And I copied your file twice just to make sure :-(

Any ideas about the...

<type>CUDA</type>
<count>0.33</count>

entry so it doesn't run multiple WUs on the 2xx series GPUs?

@Claggy: I don't think it's using the V6.09 app on any GPU. I get this at startup:
24/07/2010 06:16:55 NVIDIA GPU 0: GeForce GTX 470 (driver version 25896, CUDA version 3010, compute capability 2.0, 1280MB, 272 GFLOPS peak)
24/07/2010 06:16:55 NVIDIA GPU 1: GeForce GTS 250 (driver version 25896, CUDA version 3010, compute capability 1.1, 512MB, 471 GFLOPS peak)
24/07/2010 06:16:55 NVIDIA GPU 2: GeForce GTS 250 (driver version 25896, CUDA version 3010, compute capability 1.1, 512MB, 442 GFLOPS peak)

But work is suspended atm till I can upload and report the finished units. Then I'll play a bit more.

Regards
T.A,



Gundolf Jahn
Message 1018863 - Posted: 23 Jul 2010, 21:07:37 UTC - in response to Message 1018860.  

Terror Australis wrote:
> And I copied your file twice just to make sure :-(

And did you remove the red sections? ;-)

Gruß,
Gundolf

Terror Australis
Volunteer tester
Message 1018864 - Posted: 23 Jul 2010, 21:09:01 UTC - in response to Message 1018863.  

Gundolf Jahn wrote:
> Terror Australis wrote:
> > And I copied your file twice just to make sure :-(
>
> And did you remove the red sections? ;-)
>
> Gruß,
> Gundolf


Yes, I changed them to purple :-P

Claggy
Volunteer tester
Message 1018867 - Posted: 23 Jul 2010, 21:13:12 UTC - in response to Message 1018860.  
Last modified: 23 Jul 2010, 21:15:19 UTC

Terror Australis wrote:
> Sorry Joe
> Still getting this error
> 24/07/2010 06:16:55 SETI@home [error] State file error: missing application file AK_v8_win_SSSE3x.exe
>
> And I copied your file twice just to make sure :-(
>
> Any ideas about the...
>
> <type>CUDA</type>
> <count>0.33</count>
>
> entry so it doesn't run multiple WUs on the 2xx series GPUs?
>
> @Claggy: I don't think it's using the V6.09 app on any GPU. I get this at startup:
> 24/07/2010 06:16:55 NVIDIA GPU 0: GeForce GTX 470 (driver version 25896, CUDA version 3010, compute capability 2.0, 1280MB, 272 GFLOPS peak)
> 24/07/2010 06:16:55 NVIDIA GPU 1: GeForce GTS 250 (driver version 25896, CUDA version 3010, compute capability 1.1, 512MB, 471 GFLOPS peak)
> 24/07/2010 06:16:55 NVIDIA GPU 2: GeForce GTS 250 (driver version 25896, CUDA version 3010, compute capability 1.1, 512MB, 442 GFLOPS peak)
>
> But work is suspended atm till I can upload and report the finished units. Then I'll play a bit more.
>
> Regards
> T.A,




You're also missing the following from the <plan_class>cuda_fermi</plan_class> <app_version> section:

<coproc>
<type>CUDA</type>
<count>0.33</count>
</coproc>

Put it straight before the green </app_version>

Claggy

Terror Australis
Volunteer tester
Message 1018870 - Posted: 23 Jul 2010, 21:18:41 UTC - in response to Message 1018867.  

Claggy wrote:
> You're also missing the following from the <plan_class>cuda_fermi</plan_class> <app_version> section:
>
> <coproc>
> <type>CUDA</type>
> <count>0.33</count>
> </coproc>
>
> Put it straight before the green </app_version>
>
> Claggy


As an additional entry or move it up from the bottom ??

Claggy
Volunteer tester
Message 1018885 - Posted: 23 Jul 2010, 21:44:25 UTC - in response to Message 1018870.  

Terror Australis wrote:
> Claggy wrote:
> > You're also missing the following from the <plan_class>cuda_fermi</plan_class> <app_version> section:
> >
> > <coproc>
> > <type>CUDA</type>
> > <count>0.33</count>
> > </coproc>
> >
> > Put it straight before the green </app_version>
> >
> > Claggy
>
> As an additional entry or move it up from the bottom ??

An additional entry,

Claggy

Josef W. Segur
Volunteer developer
Volunteer tester
Message 1018920 - Posted: 23 Jul 2010, 23:41:59 UTC - in response to Message 1018860.  
Last modified: 24 Jul 2010, 0:01:20 UTC

Terror Australis wrote:
> Sorry Joe
> Still getting this error
> 24/07/2010 06:16:55 SETI@home [error] State file error: missing application file AK_v8_win_SSSE3x.exe
> ...
> Regards
> T.A,

Hmm, the 0.2 and later installers have the newer AK_v8b_win_SSSE3x.exe file. Make sure the file is there, double check whether it's the v8 or v8b version, and make the app_info.xml match.
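
Both checks suggested here (is the file present, and does app_info.xml name it consistently) can be sketched in a few lines. The sketch assumes, as most entries in the posted file suggest, that `<file_info>` declares a file with `<name>` while `<file_ref>` points at it with `<file_name>`; the function names and the sample text are invented for illustration. In the file posted earlier in the thread, the first `<file_info>` uses `<file_name>` rather than `<name>`, so this check would flag it. Whether that is what the client was complaining about is only a guess.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def undeclared_files(app_info_xml):
    """Files referenced by a <file_ref> but not declared by any <file_info><name>."""
    root = ET.fromstring(app_info_xml)
    declared = {fi.findtext("name", "").strip() for fi in root.iter("file_info")}
    referenced = {fr.findtext("file_name", "").strip() for fr in root.iter("file_ref")}
    return sorted(referenced - declared)


def missing_on_disk(app_info_xml, project_dir):
    """Referenced files that do not exist in project_dir."""
    root = ET.fromstring(app_info_xml)
    referenced = {fr.findtext("file_name", "").strip() for fr in root.iter("file_ref")}
    return sorted(n for n in referenced if not (Path(project_dir) / n).is_file())


# Cut-down sample: the first <file_info> uses <file_name>, the second uses <name>.
SAMPLE = """<app_info>
<file_info><file_name>AK_v8_win_SSSE3x.exe</file_name><executable/></file_info>
<file_info><name>cudart.dll</name><executable/></file_info>
<app_version>
<file_ref><file_name>AK_v8_win_SSSE3x.exe</file_name><main_program/></file_ref>
<file_ref><file_name>cudart.dll</file_name></file_ref>
</app_version>
</app_info>"""

print(undeclared_files(SAMPLE))  # prints ['AK_v8_win_SSSE3x.exe']
```

`missing_on_disk` takes the project directory as its second argument, so the same file can be checked against whichever folder actually holds the executables.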

Edit: As to running multiple instances on the Fermi and only single instances on the others, I doubt it's possible. Having the <count>0.33</count> only in the cuda_fermi app_version might possibly help, but with the mix of cards I think BOINC will launch any of the cuda tasks on any of the cards. The inability to deal with the details for different card types is why BOINC's default is to use only the most capable card and any which are a close match. The <use_all_gpus> option allows human judgement to override the default, but in this case you'll have to decide whether the increased productivity on the GTX 470 more than offsets the decreased productivity on the other cards.

Additional edit: It looks like the other cards don't have sufficient memory for multiple tasks, so IMO you'll either have to settle for single instances on the GTX 470 or remove the other cards.
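
To put rough numbers on the memory point: going by the behaviour reported earlier in the thread, a `<count>` of 0.33 means three tasks share one GPU, so each task gets about a third of the card's memory. The sketch below only does that arithmetic with the memory sizes from the startup log quoted in this thread. How much memory a task actually needs is not stated anywhere here, so no threshold is assumed, and the function names are made up.

```python
import math


def tasks_per_gpu(count):
    """Tasks sharing one GPU for a given <count> value (0.33 gives 3, 1 gives 1)."""
    # The small epsilon guards against floating-point results just below a whole number.
    return math.floor(1.0 / count + 1e-9)


def memory_share_mb(total_mb, count):
    """Even split of a card's memory across the tasks running on it."""
    return total_mb / tasks_per_gpu(count)


# Memory sizes taken from the startup log quoted in this thread.
CARDS_MB = {"GeForce GTX 470": 1280, "GeForce GTS 250": 512}

for name, total in CARDS_MB.items():
    share = memory_share_mb(total, 0.33)
    print(f"{name}: {tasks_per_gpu(0.33)} tasks, about {share:.0f} MB each")
```

With three tasks per card this works out to about 427 MB per task on the GTX 470 and about 171 MB per task on each GTS 250.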
                                                                 Joe

Terror Australis
Volunteer tester
Message 1019235 - Posted: 24 Jul 2010, 16:12:31 UTC

Well, I gave up trying to get that app_info working. There's a bug there somewhere that nobody was able to spot. I made another by taking my working V6.09 app_info and carefully going through it, substituting the V6.10 components for the originals. It worked straight off the bat. I'll post it in the Fermi thread for anyone who's interested.

I gave up on the idea of running the 470 and the 285 in the same box: while the 285 could handle 2 threads at once quite well, it choked on 3, taking around 45 minutes per unit compared to the 470's 20 minutes. With 1 or 2 threads they ran neck and neck.

Reasonably impressed with the 470 now that I've got it working and stable. Running 3 threads at stock speed it takes around 20 to 25 minutes per unit; in other words it's the equivalent of 3 GTS250s, and my machine with 3 GTS250s runs a slightly higher RAC than machines with 1 GTX295.
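
The timings above translate into throughput as follows. This is plain arithmetic on the figures quoted in this post (3 tasks at once, 20 to 25 minutes per unit on the 470, about 45 minutes per unit on the 285), not a benchmark, and `units_per_hour` is a name made up for the example.

```python
def units_per_hour(concurrent_tasks, minutes_per_unit):
    """Completed units per hour when several tasks run side by side."""
    return concurrent_tasks * 60.0 / minutes_per_unit


print(units_per_hour(3, 20))  # GTX 470, best case quoted: 9.0
print(units_per_hour(3, 25))  # GTX 470, worst case quoted: 7.2
print(units_per_hour(3, 45))  # GTX 285 with 3 tasks: 4.0
```

On these numbers the 470 finishes roughly twice as many units per hour as the 285 when both run three tasks at once.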

I found the way to stop units being trashed was to use Reschedule to move all the units to the CPU, then run S@NL-Fred's tool over the lot. After that you can use either Fred's tool or Reschedule 1.9 to move them back to the GPU. Don't ask me why it works, it just does :-)

Thanks to all those who offered help and advice along the way.

The Terror
