v7 cuda23 WUs getting ERR_TOO_MANY_EXITS


Jeff Buck
Joined: 11 Feb 00
Posts: 259
Credit: 30,651,002
RAC: 78,978
United States
Message 1379184 - Posted: 10 Jun 2013, 2:50:00 UTC

Yesterday, the scheduler sent 7 v7 cuda23 WUs to one of my machines (http://setiathome.berkeley.edu/results.php?hostid=6949656) that's been happily running cuda32, cuda42, and cuda50. All 7 promptly crapped out with "-226 (0xffffffffffffff1e) ERR_TOO_MANY_EXITS" as soon as they tried to run. The GPU is an NVIDIA 8600 GT with the latest driver (314.22) and had no problem with cuda23 under v6, having successfully processed over 1,100 of them in the 2 months prior to the v7 rollout. Following that group of 7, only cuda42 and cuda32 have downloaded and my card is happily chewing on those again.

Is there anything I should be doing at my end to ensure this doesn't happen again, or were these cuda23s just a bad lot, or perhaps a scheduler screw-up?

William (Project donor)
Volunteer tester
Joined: 14 Feb 13
Posts: 1580
Credit: 9,460,369
RAC: 7,261
Message 1379303 - Posted: 10 Jun 2013, 11:48:09 UTC

Nothing was printed to stderr, so the app didn't even start; it looks like some initialisation error.
BOINC obviously tried to get the task to run a few times and then gave up.
The app itself should be OK; maybe your DLLs are corrupted (wrong size).

You could check the size against the copies for v6 (named plain cudart and cufft). If it's wrong, a project reset should download fresh copies (or just copy and rename).

Besides, occasional hiccups can happen; only worry if it's continuous.
____________
A person who won't read has no advantage over one who can't read. (Mark Twain)

Jeff Buck
Message 1379425 - Posted: 10 Jun 2013, 16:28:27 UTC - in response to Message 1379303.

Thanks, William. I looked at the various cudart and cufft dll's and, although none are specifically identified as v6 vs. v7, the file creation dates seem to tell me that the cudart.dll (188 KB), cudart_23_win32.dll (279 KB), cufft.dll (380 KB) and cufft_23_win32.dll (8,435 KB) all came along with v6, while the cudart23_00_00x.dll (113 KB) and cufft23_00_00x.dll (1,138 KB) came with v7.

Clearly the file sizes for the apparent v7 dll's are very different from the ones I have for v6, but so are the file names. Are you saying that they should be the same? If so, why would there even be new dll's with new names if they're not v6/v7 specific? I don't think I should rush into a copy/rename without understanding more about whether there's actually anything wrong here or not.

By the way, since none of my other machines have downloaded any v7 cuda23 tasks, I can't do any sort of cross-machine file comparisons.

William
Message 1379437 - Posted: 10 Jun 2013, 16:51:50 UTC - in response to Message 1379425.

Ah, right: cudart23_00_00x.dll should be the same size as cudart_23_win32.dll, and cufft23_00_00x.dll should be the same size as cufft_23_win32.dll.

Your v7 copies are both too small. You can either reset and hope you get the right files this time, or copy the above (correct) v6 versions and rename them to their v7 counterparts.

We've had reports of wrong file sizes before.

Jeff Buck
Message 1379464 - Posted: 10 Jun 2013, 17:34:15 UTC - in response to Message 1379437.

Hmmmm. Since I don't currently have any more cuda23 tasks in the queue, it's not urgent and the nuclear option (project Reset) would seem to be a bit of overkill. I can probably just wait to see if any more cuda23 tasks download before I try the copy/rename approach.

However, just out of curiosity, I decided to try downloading the 2 v7 dll's directly (from boinc2.ssl.berkeley.edu/sah/download_fanout/) on my daily driver, and got the exact same shorter cudart (113 KB) and cufft (1,138 KB) files that were downloaded automatically with the cuda23 tasks. That would seem to indicate a server problem that might need to be looked into.

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8436
Credit: 47,953,571
RAC: 59,629
United Kingdom
Message 1379486 - Posted: 10 Jun 2013, 18:01:22 UTC - in response to Message 1379464.

It does indeed. The cuda23 versions of those two files should be at a minimum (*):

cudart.dll: 285,696 bytes
cufft.dll: 8,636,928 bytes

(*) possibly slightly larger, because the server copies are digitally signed.

If you feel confident with this, could you help with some diagnostics, please?

Find the file 'client_state.xml' in your BOINC data directory, and open it, read only, with notepad or some similar text-only tool.

Find the setiathome project section, and within that, one or more <app_version> sections. For this, you'll want to concentrate on the one which has

<app_name>setiathome_v7
<plan_class>cuda23

It should have file references for the cuda DLLs. Each <file_ref> should have

<file_name>cu***** - probably with some extra version numbers
<open_name>cu[dart|fft].dll - with no extra version numbers.

Please post both those file_ref sections, so we can check the server is set up right.

You might want to check whether the files named in the <file_name> lines have been downloaded, and have the sizes I gave above. If not, try downloading them manually from the fanout as you did before, and see what size you get.

Once you have the files with version numbers - which should be recognisably '23' in some form - you should delete every unversioned cudart.dll and cufft.dll you may have lying around - especially if they have that smaller file size. Try running the app again then.
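
For anyone who would rather script that lookup than scroll through the file by hand, here is a minimal sketch in Python. It assumes the `<app_version>` blocks look like the description above; the `SAMPLE_STATE` string and the `dll_file_refs` helper are hypothetical stand-ins for illustration, and a real client_state.xml is much larger and should only ever be read, never edited, while BOINC is running.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for client_state.xml (assumption: the
# real file holds many more elements, but <app_version> blocks look like this).
SAMPLE_STATE = """
<client_state>
  <app_version>
    <app_name>setiathome_v7</app_name>
    <plan_class>cuda23</plan_class>
    <file_ref>
      <file_name>cudart23_00_00x.dll</file_name>
      <open_name>cudart.dll</open_name>
      <copy_file/>
    </file_ref>
    <file_ref>
      <file_name>cufft23_00_00x.dll</file_name>
      <open_name>cufft.dll</open_name>
      <copy_file/>
    </file_ref>
  </app_version>
  <app_version>
    <app_name>setiathome_v7</app_name>
    <plan_class>cuda50</plan_class>
  </app_version>
</client_state>
"""


def dll_file_refs(state_xml, app_name="setiathome_v7", plan_class="cuda23"):
    """Return (file_name, open_name) pairs for the matching <app_version>."""
    root = ET.fromstring(state_xml)
    pairs = []
    for app_version in root.iter("app_version"):
        if app_version.findtext("app_name", "").strip() != app_name:
            continue
        if app_version.findtext("plan_class", "").strip() != plan_class:
            continue
        for ref in app_version.findall("file_ref"):
            open_name = ref.findtext("open_name")
            if open_name is not None:  # keep only refs that are renamed on copy
                pairs.append((ref.findtext("file_name"), open_name))
    return pairs


if __name__ == "__main__":
    for file_name, open_name in dll_file_refs(SAMPLE_STATE):
        print(f"{file_name} -> {open_name}")
```

To run it against a real installation, read the text of client_state.xml from the BOINC data directory and pass it to `dll_file_refs` in place of `SAMPLE_STATE`.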

Jeff Buck
Message 1379510 - Posted: 10 Jun 2013, 18:31:24 UTC - in response to Message 1379486.


Okay, here are the two file_ref sections:

<file_ref>
<file_name>cudart23_00_00x.dll</file_name>
<open_name>cudart.dll</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>cufft23_00_00x.dll</file_name>
<open_name>cufft.dll</open_name>
<copy_file/>
</file_ref>

Those are definitely the dll's that were downloaded with the first v7 cuda23 tasks and that I downloaded again manually just for comparison.

Speaking of comparison, although I don't have the expertise to really pick apart the dll's, I was curious to see if the cudart23_00_00x.dll really was just a truncated version of the cudart_23_win32.dll, as William's reply sort of implied, so I just eyeballed them side-by-side in a hex editor. They are definitely very different dll's, not just a complete one and a truncated one. They both appear to export the same functions, but one is just shorter than the other, though still complete.

Can't really try running the app again, though, unless some new cuda23 tasks come down the pipe. That GPU is happily gnawing on an AP task at the moment, with another AP in the queue. Wasn't able to get any more tasks this morning.

Richard Haselgrove
Message 1379515 - Posted: 10 Jun 2013, 18:54:44 UTC

I've just set up a machine to run as stock, and I'm seeing similar undersized DLL downloads with cuda50, too. Investigating.

Jeff Buck
Message 1379520 - Posted: 10 Jun 2013, 18:59:25 UTC - in response to Message 1379515.

Haven't had a problem with cuda50 (or cuda42, or cuda32, though I did have my one and only cuda22 crap out in similar fashion on another machine), just cuda23 under v7.

Claggy (Project donor)
Volunteer tester
Joined: 5 Jul 99
Posts: 4058
Credit: 32,803,628
RAC: 5,169
United Kingdom
Message 1379526 - Posted: 10 Jun 2013, 19:07:37 UTC - in response to Message 1379510.
Last modified: 10 Jun 2013, 19:20:06 UTC

Looking at my E8500/9800 GTX+ host, all the recently received cufft and cudart files are a lot smaller than their Beta equivalents. Eric compressed them, didn't he?
All the Cuda22/Cuda23/Cuda32/Cuda42 apps received so far have worked OK.

Seti Main
cudart22_00_00x.dll 114,176 bytes
cufft22_00_00x.dll 228,864 bytes
cudart23_00_00x.dll 115,200 bytes
cufft23_00_00x.dll 1,165,312 bytes
cudart32_32_16.dll 128,616 bytes
cufft32_32_16.dll 2,983,016 bytes
cudart32_42_9.dll 135,168 bytes
cufft32_42_9.dll 4,446,208 bytes

Seti Beta
cudart22_00_00x.dll 281,600 bytes
cufft22_00_00x.dll 1,175,552 bytes
cudart23_00_00x.dll 285,696 bytes
cufft23_00_00x.dll 8,636,928 bytes
cudart32_32_16.dll 384,616 bytes
cufft32_32_16.dll 28,551,272 bytes
cudart32_42_9.dll 446,976 bytes
cufft32_42_9.dll 29,353,984 bytes

Claggy

William
Message 1379527 - Posted: 10 Jun 2013, 19:08:28 UTC - in response to Message 1379520.

Cuda 22 is a lot more fragile than the later versions.

William
Message 1379528 - Posted: 10 Jun 2013, 19:09:56 UTC

Just occurred to me - is there anything in the Event Log for those tasks?

William
Message 1379530 - Posted: 10 Jun 2013, 19:11:15 UTC - in response to Message 1379526.

Yes, we've just realised he's using UPX compression.
Explains why my main folder was so much smaller than my beta one :D

Richard Haselgrove
Message 1379531 - Posted: 10 Jun 2013, 19:15:23 UTC

Yes, the files are compressed - they have a UPX header, which points to the Ultimate Packer for eXecutables.

The cuda50 I downloaded is running fine with the compressed DLLs. Jeff, I think you simply need to delete the unversioned files - they'll be cuda22, or even cuda20, and could well cause problems for a cuda23 main program.
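
A quick way to tell a compressed DLL from a damaged one is to look for that header. The sketch below is a heuristic only: it assumes the packed file still starts with the usual "MZ" Windows executable prefix and carries the `UPX!` marker or the `UPX0`/`UPX1` section names near the start. The function names and the 4096-byte probe size are my own choices for illustration, and a packer configured to hide its markers would not be caught.

```python
def looks_upx_packed(data: bytes) -> bool:
    """Heuristic check: Windows executable prefix plus UPX markers."""
    if not data.startswith(b"MZ"):
        return False  # not a Windows PE image at all
    has_magic = b"UPX!" in data
    has_sections = b"UPX0" in data and b"UPX1" in data
    return has_magic or has_sections


def file_looks_upx_packed(path, probe_bytes=4096):
    """Read only the start of the file; the markers are assumed to sit in the headers."""
    with open(path, "rb") as handle:
        return looks_upx_packed(handle.read(probe_bytes))
```

Calling `file_looks_upx_packed` on one of the small versioned DLLs in the project directory should return True if it really is UPX-compressed rather than truncated.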

Jeff Buck
Message 1379535 - Posted: 10 Jun 2013, 19:16:49 UTC - in response to Message 1379528.

Sorry, that machine has been rebooted since then and I have a fresh Event Log.

William
Message 1379538 - Posted: 10 Jun 2013, 19:18:32 UTC - in response to Message 1379535.

The Event Log is also kept on disk: look in the main BOINC data folder for the file stdoutdae.txt or stdoutdae.old.

Jeff Buck
Message 1379539 - Posted: 10 Jun 2013, 19:20:49 UTC - in response to Message 1379531.

As I mentioned, cuda50 hasn't been a problem for me either, just all 7 of the cuda23 tasks on that one machine and a single cuda22 on another (which does have similar 00x files for cuda22).

Richard Haselgrove
Message 1379544 - Posted: 10 Jun 2013, 19:30:39 UTC - in response to Message 1379539.

Yes, the file sizes turn out to be a red herring - sorry about that. Eric was one very smart jump ahead of us.

The cuda20/22/23 problem is a very much older one. All three of those main programs expect to find files named, exactly, cudart.dll and cufft.dll: but they are different files, and they have to be kept separate.

So Eric distributes them with different filenames, and copies them to the 'slot' (temporary working scratchpad) directory as the application starts, and renames them on the way. If the app finds the right dll in the slot, it should run.

BUT, under Windows, the application looks in the project directory first, and if it finds a DLL with the right name in there, it tries to use it - with no further checking. Deleting the unversioned files in the project directory should force the application to look at the next location on the list - the slot directory - and hopefully find the right one in there.
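
To see whether a project directory has any of those unversioned copies lying around, something like the following sketch could be used. It only lists candidates and deletes nothing; the helper name `unversioned_cuda_dlls` is hypothetical, and the directory to scan is passed on the command line because the location of the project folder differs between installations.

```python
from pathlib import Path

# Exact names the cuda20/22/23 main programs look for, per the explanation above.
UNVERSIONED = {"cudart.dll", "cufft.dll"}


def unversioned_cuda_dlls(project_dir):
    """List files named exactly cudart.dll or cufft.dll (case-insensitive)."""
    return sorted(
        path
        for path in Path(project_dir).iterdir()
        if path.is_file() and path.name.lower() in UNVERSIONED
    )


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for dll in unversioned_cuda_dlls(target):
        print(f"{dll}  ({dll.stat().st_size:,} bytes)")
```

Versioned files such as cudart23_00_00x.dll are deliberately ignored, since those are the ones that get copied into the slot directory and renamed.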

Jeff Buck
Message 1379547 - Posted: 10 Jun 2013, 19:34:42 UTC - in response to Message 1379538.

Found it. Here's the first portion:

09-Jun-2013 10:29:06 [SETI@home] Computation for task 10ja12ab.8138.10044.7.12.195_1 finished
09-Jun-2013 10:29:06 [SETI@home] Starting task 10ja12ab.16278.1046.10.12.95_1 using setiathome_v7 version 700 (cuda23) in slot 0
09-Jun-2013 10:29:07 [SETI@home] Task 10ja12ab.16278.1046.10.12.95_1 exited with zero status but no 'finished' file
09-Jun-2013 10:29:07 [SETI@home] If this happens repeatedly you may need to reset the project.
09-Jun-2013 10:29:07 [SETI@home] Restarting task 10ja12ab.16278.1046.10.12.95_1 using setiathome_v7 version 700 (cuda23) in slot 0
09-Jun-2013 10:29:08 [SETI@home] Task 10ja12ab.16278.1046.10.12.95_1 exited with zero status but no 'finished' file
09-Jun-2013 10:29:08 [SETI@home] If this happens repeatedly you may need to reset the project.
09-Jun-2013 10:29:08 [SETI@home] Started upload of 10ja12ab.8138.10044.7.12.195_1_0
09-Jun-2013 10:29:08 [SETI@home] Restarting task 10ja12ab.16278.1046.10.12.95_1 using setiathome_v7 version 700 (cuda23) in slot 0
09-Jun-2013 10:29:09 [SETI@home] Task 10ja12ab.16278.1046.10.12.95_1 exited with zero status but no 'finished' file
09-Jun-2013 10:29:09 [SETI@home] If this happens repeatedly you may need to reset the project.
09-Jun-2013 10:29:09 [SETI@home] Restarting task 10ja12ab.16278.1046.10.12.95_1 using setiathome_v7 version 700 (cuda23) in slot 0
09-Jun-2013 10:29:10 [SETI@home] Task 10ja12ab.16278.1046.10.12.95_1 exited with zero status but no 'finished' file
09-Jun-2013 10:29:10 [SETI@home] If this happens repeatedly you may need to reset the project.
09-Jun-2013 10:29:10 [SETI@home] Restarting task 10ja12ab.16278.1046.10.12.95_1 using setiathome_v7 version 700 (cuda23) in slot 0
09-Jun-2013 10:29:11 [SETI@home] Task 10ja12ab.16278.1046.10.12.95_1 exited with zero status but no 'finished' file
09-Jun-2013 10:29:11 [SETI@home] If this happens repeatedly you may need to reset the project.

I assume you don't really want to see all 2000+ lines! ;-)
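
Rather than reading through all of those lines, the restarts can be tallied per task. This is a minimal sketch that assumes the log lines have exactly the wording shown in the excerpt above; the `count_premature_exits` helper and the `SAMPLE` string are illustrative names of mine, not part of BOINC.

```python
import re
from collections import Counter

# Matches the message seen in the excerpt and captures the task name.
PREMATURE_EXIT = re.compile(
    r"Task (?P<task>\S+) exited with zero status but no 'finished' file"
)


def count_premature_exits(log_lines):
    """Count, per task, how often it exited without writing a 'finished' file."""
    counts = Counter()
    for line in log_lines:
        match = PREMATURE_EXIT.search(line)
        if match:
            counts[match.group("task")] += 1
    return counts


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = """\
09-Jun-2013 10:29:06 [SETI@home] Starting task 10ja12ab.16278.1046.10.12.95_1 using setiathome_v7 version 700 (cuda23) in slot 0
09-Jun-2013 10:29:07 [SETI@home] Task 10ja12ab.16278.1046.10.12.95_1 exited with zero status but no 'finished' file
09-Jun-2013 10:29:07 [SETI@home] If this happens repeatedly you may need to reset the project.
09-Jun-2013 10:29:07 [SETI@home] Restarting task 10ja12ab.16278.1046.10.12.95_1 using setiathome_v7 version 700 (cuda23) in slot 0
09-Jun-2013 10:29:08 [SETI@home] Task 10ja12ab.16278.1046.10.12.95_1 exited with zero status but no 'finished' file
"""

if __name__ == "__main__":
    for task, exits in count_premature_exits(SAMPLE.splitlines()).items():
        print(f"{task}: {exits} premature exits")
```

Pointing it at the lines of stdoutdae.txt or stdoutdae.old instead of `SAMPLE` gives one count per affected task.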

Claggy
Message 1379548 - Posted: 10 Jun 2013, 19:37:23 UTC - in response to Message 1379544.
Last modified: 10 Jun 2013, 19:40:49 UTC

Deleting the unversioned files in the project directory should force the application to look at the next location on the list - the slot directory - and hopefully find the right one in there.

But BOINC might re-download the unversioned DLLs if the Cuda (2.0) 6.08 app is still referenced; last time I checked, its files weren't versioned.

Claggy


Copyright © 2014 University of California