v7 cuda23 WUs getting ERR_TOO_MANY_EXITS
Author | Message |
---|---|
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Yesterday, the scheduler sent 7 v7 cuda23 WUs to one of my machines (http://setiathome.berkeley.edu/results.php?hostid=6949656) that's been happily running cuda32, cuda42, and cuda50. All 7 promptly crapped out with "-226 (0xffffffffffffff1e) ERR_TOO_MANY_EXITS" as soon as they tried to run. The GPU is an NVIDIA 8600 GT with the latest driver (314.22) and had no problem with cuda23 under v6, having successfully processed over 1,100 of them in the 2 months prior to the v7 rollout. Following that group of 7, only cuda42 and cuda32 have downloaded and my card is happily chewing on those again. Is there anything I should be doing at my end to ensure this doesn't happen again, or were these cuda23s just a bad lot, or perhaps a scheduler screw-up? |
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
Nothing was printed to stderr, so the app didn't even start; it looks like some initialisation error. BOINC obviously tried to get the task to run a few times and then gave up. The app itself should be OK; maybe your DLLs are corrupted (wrong size). You could check the size against the copies for v6 (named plain cudart and cufft). If it's wrong, a project reset should download fresh copies (or just copy and rename). Besides, occasional hiccups can happen; only worry if it's continuous. A person who won't read has no advantage over one who can't read. (Mark Twain) |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Thanks, William. I looked at the various cudart and cufft dll's and, although none are specifically identified as v6 vs. v7, the file creation dates seem to tell me that the cudart.dll (188 KB), cudart_23_win32.dll (279 KB), cufft.dll (380 KB) and cufft_23_win32.dll (8,435 KB) all came along with v6, while the cudart23_00_00x.dll (113 KB) and cufft23_00_00x.dll (1,138 KB) came with v7. Clearly the file sizes for the apparent v7 dll's are very different from the ones I have for v6, but so are the file names. Are you saying that they should be the same? If so, why would there even be new dll's with new names if they're not v6/v7 specific? I don't think I should rush into a copy/rename without understanding more about whether there's actually anything wrong here or not. By the way, since none of my other machines have downloaded any v7 cuda23 tasks, I can't do any sort of cross-machine file comparisons. |
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
Quote: "Thanks, William. I looked at the various cudart and cufft dll's and, although none are specifically identified as v6 vs. v7, the file creation dates seem to tell me that the cudart.dll (188 KB), cudart_23_win32.dll (279 KB), cufft.dll (380 KB) and cufft_23_win32.dll (8,435 KB) all came along with v6, while the cudart23_00_00x.dll (113 KB) and cufft23_00_00x.dll (1,138 KB) came with v7." Ah, right: cudart23_00_00x.dll should be the same size as cudart_23_win32.dll, and cufft23_00_00x.dll should be the same size as cufft_23_win32.dll. Your v7 copies are both too small. You can either reset and hope you get the right files this time, or copy and rename the (correct) v6 versions above to their v7 counterparts. We've had reports of wrong file sizes before. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Hmmmm. Since I don't currently have any more cuda23 tasks in the queue, it's not urgent and the nuclear option (project Reset) would seem to be a bit of overkill. I can probably just wait to see if any more cuda23 tasks download before I try the copy/rename approach. However, just out of curiosity, I decided to try downloading the 2 v7 dll's directly (from boinc2.ssl.berkeley.edu/sah/download_fanout/) on my daily driver, and got the exact same shorter cudart (113 KB) and cufft (1,138 KB) files that were downloaded automatically with the cuda23 tasks. That would seem to indicate a server problem that might need to be looked into. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Quote: "Hmmmm. Since I don't currently have any more cuda23 tasks in the queue, it's not urgent and the nuclear option (project Reset) would seem to be a bit of overkill. I can probably just wait to see if any more cuda23 tasks download before I try the copy/rename approach." It does indeed look like a server problem. The cuda23 versions of those two files should be, at a minimum (*): cudart.dll: 285,696 bytes; cufft.dll: 8,636,928 bytes. (*) Possibly slightly larger, because the server copies are digitally signed. If you feel confident with this, could you help with some diagnostics, please? Find the file 'client_state.xml' in your BOINC data directory and open it, read-only, with Notepad or some similar text-only tool. Find the setiathome project section and, within that, one or more <app_version> sections. For this, you'll want to concentrate on the one which has <app_name>setiathome_v7 and <plan_class>cuda23. It should have file references for the cuda DLLs. Each <file_ref> should have a <file_name>cu***** entry (probably with some extra version numbers) and an <open_name>cu[dart|fft].dll entry (with no extra version numbers). Please post both those file_ref sections, so we can check the server is set up right. You might want to check whether the files named in the <file_name> lines have been downloaded, and whether they have the sizes I gave above. If not, try downloading them manually from the fanout as you did before, and see what size you get. Once you have the files with version numbers (which should be recognisably '23' in some form), you should delete every unversioned cudart.dll and cufft.dll you may have lying around, especially if they have that smaller file size. Then try running the app again. |
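For anyone who would rather not hunt through client_state.xml by eye, the lookup Richard describes can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: it assumes the file parses as ordinary XML, it searches every <app_version> element wherever it sits in the file, and the SAMPLE text is a hypothetical, heavily trimmed stand-in built from file names mentioned in this thread, not a real client_state.xml. The function name cuda_file_refs is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for client_state.xml.
# A real file contains far more elements than this.
SAMPLE = """<client_state>
  <app_version>
    <app_name>setiathome_v7</app_name>
    <plan_class>cuda23</plan_class>
    <file_ref>
      <file_name>cudart23_00_00x.dll</file_name>
      <open_name>cudart.dll</open_name>
      <copy_file/>
    </file_ref>
    <file_ref>
      <file_name>cufft23_00_00x.dll</file_name>
      <open_name>cufft.dll</open_name>
      <copy_file/>
    </file_ref>
  </app_version>
  <app_version>
    <app_name>setiathome_v7</app_name>
    <plan_class>cuda42</plan_class>
    <file_ref>
      <file_name>cudart32_42_9.dll</file_name>
    </file_ref>
  </app_version>
</client_state>"""


def cuda_file_refs(xml_text, app_name="setiathome_v7", plan_class="cuda23"):
    """Return (file_name, open_name) pairs for the matching <app_version>."""
    root = ET.fromstring(xml_text)
    pairs = []
    # iter() walks the whole tree, so the nesting level of <app_version>
    # does not matter for this sketch.
    for app_version in root.iter("app_version"):
        if app_version.findtext("app_name", "").strip() != app_name:
            continue
        if app_version.findtext("plan_class", "").strip() != plan_class:
            continue
        for file_ref in app_version.iter("file_ref"):
            pairs.append((
                file_ref.findtext("file_name", "").strip(),
                file_ref.findtext("open_name", "").strip(),  # "" if absent
            ))
    return pairs


for file_name, open_name in cuda_file_refs(SAMPLE):
    print(f"{file_name} -> {open_name or '(no open_name)'}")
```

To try it on a real machine, read the file in read-only mode (for example `open(path, encoding="utf-8").read()`) and pass the text to cuda_file_refs; work on a copy if BOINC is running.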
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Okay, here are the two file_ref sections: <file_ref> <file_name>cudart23_00_00x.dll</file_name> <open_name>cudart.dll</open_name> <copy_file/> </file_ref> <file_ref> <file_name>cufft23_00_00x.dll</file_name> <open_name>cufft.dll</open_name> <copy_file/> </file_ref> Those are definitely the dll's that were downloaded with the first v7 cuda23 tasks and that I downloaded again manually just for comparison. Speaking of comparison, although I don't have the expertise to really pick apart the dll's, I was curious to see if the cudart23_00_00x.dll really was just a truncated version of the cudart_23_win32.dll, as William's reply sort of implied, so I just eyeballed them side-by-side in a hex editor. They are definitely very different dll's, not just a complete one and a truncated one. They both appear to export the same functions, but one is just shorter than the other, though still complete. Can't really try running the app again, though, unless some new cuda23 tasks come down the pipe. That GPU is happily gnawing on an AP task at the moment, with another AP in the queue. Wasn't able to get any more tasks this morning. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
I've just set up a machine to run as stock, and I'm seeing similar undersized DLL downloads with cuda50, too. Investigating. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Quote: "I've just set up a machine to run as stock, and I'm seeing similar undersized DLL downloads with cuda50, too. Investigating." Haven't had a problem with cuda50 (or cuda42, or cuda32, though I did have my one and only cuda22 crap out in similar fashion on another machine), just cuda23 under v7. |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Looking at my E8500/9800GTX+ host, all the recently received cufft and cudart files are a lot smaller than their Beta equivalents. Eric compressed them, didn't he? All the Cuda22/Cuda23/Cuda32/Cuda42 apps received so far have worked OK. Seti Main: cudart22_00_00x.dll 114,176 bytes; cufft22_00_00x.dll 228,864 bytes; cudart23_00_00x.dll 115,200 bytes; cufft23_00_00x.dll 1,165,312 bytes; cudart32_32_16.dll 128,616 bytes; cufft32_32_16.dll 2,983,016 bytes; cudart32_42_9.dll 135,168 bytes; cufft32_42_9.dll 4,446,208 bytes. Seti Beta: cudart22_00_00x.dll 281,600 bytes; cufft22_00_00x.dll 1,175,552 bytes; cudart23_00_00x.dll 285,696 bytes; cufft23_00_00x.dll 8,636,928 bytes; cudart32_42_9.dll 384,616 bytes; cufft32_32_16.dll 28,551,272 bytes; cudart32_42_9.dll 446,976 bytes; cufft32_42_9.dll 29,353,984 bytes. Claggy |
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
Quote: "I've just set up a machine to run as stock, and I'm seeing similar undersized DLL downloads with cuda50, too. Investigating." Cuda 22 is a lot more fragile than the later versions. |
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
Just occurred to me - is there anything in the Event Log for those tasks? |
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
Quote: "Looking at my E8500/9800GTX+ host, all the recently received cufft and cudart files are a lot smaller than their Beta equivalents. Eric compressed them, didn't he?" Yes, we've just realised he's using UPX compression. That explains why my main folder was so much smaller than my beta one :D |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Yes, the files are compressed - they have a UPX header, which points to the Ultimate Packer for eXecutables. The cuda50 I downloaded is running fine with the compressed DLLs. Jeff, I think you simply need to delete the unversioned files - they'll be cuda22, or even cuda20, and could well cause problems for a cuda23 main program. |
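A quick way to tell a compressed DLL from a damaged one is to look for the packer's markers near the start of the file. The sketch below is a heuristic only, not a definitive test: it assumes a standard, unmodified UPX build, which normally leaves the section names UPX0 and UPX1 and the magic string UPX! in the first few kilobytes of a Windows executable or DLL. The function name looks_upx_packed and the 4096-byte scan limit are choices made for this example.

```python
def looks_upx_packed(data: bytes, scan_limit: int = 4096) -> bool:
    """Heuristically report whether a Windows PE image looks UPX-packed.

    Only the first scan_limit bytes are examined. A modified or scrambled
    packer can remove these markers, so False does not prove the file is
    uncompressed.
    """
    head = data[:scan_limit]
    # Windows executables and DLLs start with the DOS "MZ" signature.
    if not head.startswith(b"MZ"):
        return False
    has_magic = b"UPX!" in head
    has_sections = b"UPX0" in head and b"UPX1" in head
    return has_magic or has_sections
```

Typical use would be something like `looks_upx_packed(open(path, "rb").read(4096))` on each cudart/cufft file in the project directory.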
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Quote: "Just occurred to me - is there anything in the Event Log for those tasks?" Sorry, that machine has been rebooted since then and I have a fresh Event Log. |
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
Quote: "Just occurred to me - is there anything in the Event Log for those tasks?" Look in the main data folder, in the file stdoutdae.txt or stdoutdae.old. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Quote: "Yes, the files are compressed - they have a UPX header, which points to the..." As I mentioned, cuda50 hasn't been a problem for me either, just all 7 of the cuda23 tasks on that one machine and a single cuda22 on another (which does have similar 00x files for cuda22). |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Quote: "Yes, the files are compressed - they have a UPX header, which points to the..." Yes, the file sizes turn out to be a red herring - sorry about that. Eric was one very smart jump ahead of us. The cuda20/22/23 problem is a very much older one. All three of those main programs expect to find files named, exactly, cudart.dll and cufft.dll, but they are different files, and they have to be kept separate. So Eric distributes them with different filenames, copies them to the 'slot' (temporary working scratchpad) directory as the application starts, and renames them on the way. If the app finds the right DLL in the slot, it should run. BUT, under Windows, the application looks in the project directory first, and if it finds a DLL with the right name in there, it tries to use it - with no further checking. Deleting the unversioned files in the project directory should force the application to look at the next location on the list - the slot directory - and hopefully find the right one in there. |
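Before deleting anything, it may help to list which unversioned copies are actually sitting in the project directory. The sketch below only reports them and deletes nothing. It assumes the caller supplies the path to the SETI@home project directory (the location varies by installation), and the function name find_unversioned_cuda_dlls is invented for this example.

```python
from pathlib import Path

# The exact names the cuda20/22/23 main programs look for.
UNVERSIONED = {"cudart.dll", "cufft.dll"}


def find_unversioned_cuda_dlls(project_dir):
    """Return the sorted names of unversioned CUDA DLLs found in project_dir.

    Names are compared case-insensitively because Windows treats
    CUDART.DLL and cudart.dll as the same file name.
    """
    return sorted(
        entry.name
        for entry in Path(project_dir).iterdir()
        if entry.is_file() and entry.name.lower() in UNVERSIONED
    )
```

Versioned files such as cudart23_00_00x.dll are ignored, since those are the copies BOINC renames into the slot directory.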
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Quote: "Just occurred to me - is there anything in the Event Log for those tasks?" Found it. Here's the first portion:
09-Jun-2013 10:29:06 [SETI@home] Computation for task 10ja12ab.8138.10044.7.12.195_1 finished
09-Jun-2013 10:29:06 [SETI@home] Starting task 10ja12ab.16278.1046.10.12.95_1 using setiathome_v7 version 700 (cuda23) in slot 0
09-Jun-2013 10:29:07 [SETI@home] Task 10ja12ab.16278.1046.10.12.95_1 exited with zero status but no 'finished' file
09-Jun-2013 10:29:07 [SETI@home] If this happens repeatedly you may need to reset the project.
09-Jun-2013 10:29:07 [SETI@home] Restarting task 10ja12ab.16278.1046.10.12.95_1 using setiathome_v7 version 700 (cuda23) in slot 0
09-Jun-2013 10:29:08 [SETI@home] Task 10ja12ab.16278.1046.10.12.95_1 exited with zero status but no 'finished' file
09-Jun-2013 10:29:08 [SETI@home] If this happens repeatedly you may need to reset the project.
09-Jun-2013 10:29:08 [SETI@home] Started upload of 10ja12ab.8138.10044.7.12.195_1_0
09-Jun-2013 10:29:08 [SETI@home] Restarting task 10ja12ab.16278.1046.10.12.95_1 using setiathome_v7 version 700 (cuda23) in slot 0
09-Jun-2013 10:29:09 [SETI@home] Task 10ja12ab.16278.1046.10.12.95_1 exited with zero status but no 'finished' file
09-Jun-2013 10:29:09 [SETI@home] If this happens repeatedly you may need to reset the project.
09-Jun-2013 10:29:09 [SETI@home] Restarting task 10ja12ab.16278.1046.10.12.95_1 using setiathome_v7 version 700 (cuda23) in slot 0
09-Jun-2013 10:29:10 [SETI@home] Task 10ja12ab.16278.1046.10.12.95_1 exited with zero status but no 'finished' file
09-Jun-2013 10:29:10 [SETI@home] If this happens repeatedly you may need to reset the project.
09-Jun-2013 10:29:10 [SETI@home] Restarting task 10ja12ab.16278.1046.10.12.95_1 using setiathome_v7 version 700 (cuda23) in slot 0
09-Jun-2013 10:29:11 [SETI@home] Task 10ja12ab.16278.1046.10.12.95_1 exited with zero status but no 'finished' file
09-Jun-2013 10:29:11 [SETI@home] If this happens repeatedly you may need to reset the project.
I assume you don't really want to see all 2000+ lines! ;-) |
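Rather than posting thousands of lines, a log like this can be boiled down to a per-task count of the "no 'finished' file" exits. The following is a minimal sketch, assuming the message wording matches the excerpt above exactly; the function name count_unfinished_exits is made up for this example, and the lines would come from stdoutdae.txt or stdoutdae.old.

```python
import re
from collections import Counter

# Matches the client message shown in the excerpt and captures the task name.
EXIT_RE = re.compile(r"Task (\S+) exited with zero status but no 'finished' file")


def count_unfinished_exits(lines):
    """Count, per task name, how often the task exited without a 'finished' file."""
    counts = Counter()
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

For example, `count_unfinished_exits(open("stdoutdae.txt", encoding="utf-8", errors="replace"))` returns a Counter whose `most_common()` method lists the worst-affected tasks first.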
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Quote: "Deleting the unversioned files in the project directory should force the application to look at the next location on the list - the slot directory - and hopefully find the right one in there." But BOINC might re-download them if the Cuda (2.0) 6.08 app is still referenced; last time I checked, they weren't versioned. Claggy |