OpenCL AstroPulse crash after processing completion - write here.

Message boards : Number crunching : OpenCL AstroPulse crash after processing completion - write here.

TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1338380 - Posted: 15 Feb 2013, 7:32:25 UTC - in response to Message 1338353.  

It appears that most of the people having a lot of these Errors are running completely stock Apps without an app_info file. Go back and look at the links I posted: all stock Hosts. So... how would I go back to pure Stock without causing much disruption? I have ATI 604 APs; those should be fine with the current setup, as I'm running the stock AP App right now. I have AP 601s, so I would need the Stock astropulse_6.01_windows_intelx86.exe. I have that; is there anything else that goes with that App? I also have 609 Cuda 23s and the stock setiathome_6.09_windows_intelx86__cuda23.exe. If I place those stock Apps back in the main project folder, remove the app_info file and the AP r557 & Lunatics_x41g_win32_cuda32.exe Apps, and start BOINC, am I back running stock?

I have a feeling I will start getting AstroPulse Errors when I go back to Stock, without an app_info file.
ID: 1338380 · Report as offensive
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1338400 - Posted: 15 Feb 2013, 8:48:06 UTC - in response to Message 1338380.  

There's no sense in doing that. I already know that the error exists. What I need to know now is whether the last build fixes that error or not. Running the old stock app will not help with that.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1338400 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1338405 - Posted: 15 Feb 2013, 9:07:09 UTC - in response to Message 1338400.  

Well, I've gone 9 days without an Error before. It could be longer with your Debug App. If there is a major difference between running Stock with many Errors, and running with an app_info file and having One, don't you think people would choose the One? If running Stock produces Errors, maybe someone could fix the way stock runs....

Just a thought.
ID: 1338405 · Report as offensive
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1338453 - Posted: 15 Feb 2013, 13:54:11 UTC - in response to Message 1338405.  

From the application's point of view, there is no difference whether the app is run as stock or under anonymous platform.
It's quite possible that the last build just doesn't produce those errors.
Try switching from the debug build to the rev 1766 opt one I posted in this thread.
It will speed up calculations at least. And if there are no such errors with r1766, then we will try to promote it to the new stock app instead of r1316.

SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1338453 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1338614 - Posted: 15 Feb 2013, 21:12:24 UTC - in response to Message 1338453.  

I'll let it run stock until the first group of CPU tasks finish, in about 18 hours, and then go back to running the Debug build. This time I'll raise the settings back up to normal, that should speed it up. I've been running 'Stock' for about 10 hours and still No errors. The last time I ran Stock GPU APs it was an Error Fest. That is strange.
ID: 1338614 · Report as offensive
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1338617 - Posted: 15 Feb 2013, 21:22:20 UTC - in response to Message 1338614.  

Let's make it clearer:
"stock" means rev1316 running w/o app_info.
What I want you to run now is rev 1766, not debug (the public link instead of the PMed one), running under app_info. I don't see how you can run this binary as "stock" (w/o app_info). BOINC will check the md5 sum and redownload rev1316 instead.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1338617 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1338626 - Posted: 15 Feb 2013, 21:38:39 UTC - in response to Message 1338617.  

I have the 0.40 Installer, and all my old files. When the current CPU tasks finish, I will make it the way it was yesterday. There are quite a few people running the Public r1766 build, I would just be one more. However, the Debug would be unique. Are you sure you don't want any more r1766 Debugging?
ID: 1338626 · Report as offensive
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1338956 - Posted: 16 Feb 2013, 21:01:27 UTC - in response to Message 1338626.  

All those "other people" still haven't reported ;)
And I want to know whether the last build fixes the issue more than I want to know where in the BOINC API the crash occurs. So yes, if you could detect a case where the app "tried" to crash but it was prevented (not by a BOINC-level restart but by the app itself; there will be a special message about that in stderr), then I think we can stop with this issue for now.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1338956 · Report as offensive
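
As a rough illustration of the idea in Message 1338956 (the app itself stopping a crash and leaving a note in stderr), here is a minimal C++ sketch. It is not the AstroPulse code: `guarded_cleanup` and the message text are hypothetical, and it only handles standard C++ exceptions. A hardware fault such as an access violation would need a platform-specific mechanism that this sketch does not cover.

```cpp
#include <cstdio>
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical helper: runs a cleanup step and reports, rather than
// propagates, any C++ exception it throws.
// Returns true if the cleanup finished without throwing.
bool guarded_cleanup(const std::function<void()>& cleanup) {
    try {
        cleanup();
        return true;
    } catch (const std::exception& e) {
        // The "special message": a fixed prefix that is easy to search for.
        std::fprintf(stderr,
                     "WARNING: exception during cleanup suppressed: %s\n",
                     e.what());
    } catch (...) {
        std::fprintf(stderr,
                     "WARNING: unknown exception during cleanup suppressed\n");
    }
    std::fflush(stderr);  // make sure the message reaches the log
    return false;
}
```

A caller could wrap its final teardown step, for example `guarded_cleanup([&] { release_resources(); })` with a hypothetical `release_resources`, and then search stderr for the warning text to see whether a crash was averted.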
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1338982 - Posted: 16 Feb 2013, 23:33:50 UTC - in response to Message 1338956.  
Last modified: 17 Feb 2013, 0:01:58 UTC

I'm back running the Public Build of r1766 under BOINC 7.0.45. Hopefully things will settle down now. When I switched back to my old app_info I didn't take into account that the Scheduler had given me two different classes of ATI APs. That wouldn't have been bad, except, the same Flop setting was suddenly way too low causing the tasks to be Timed-Out instead of resent. Whatever. I did finally get an Error just before leaving the Stock r1316. That would be about one Error in around 2 days with r1316, about normal. We'll see how r1766 does.
ID: 1338982 · Report as offensive
hbomber
Volunteer tester

Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1339006 - Posted: 17 Feb 2013, 0:31:03 UTC - in response to Message 1338956.  
Last modified: 17 Feb 2013, 0:32:28 UTC

> ... one will have special message about that in stderr)...

What does it look like?

As soon as I got the first crash with 1363 on the newly installed 5850, I switched to 1766 and have had no issues so far, except that it is a bit slower compared to 1363. Average completion time on low/no blanking tasks went from 32 minutes to 33:something minutes. Not a problem, imho, since it saves the whole WU from being trashed.
ID: 1339006 · Report as offensive
hbomber
Volunteer tester

Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1339091 - Posted: 17 Feb 2013, 8:17:09 UTC

I forgot to knock on wood when I wrote my previous post, and here it is: two crashed WUs tonight with r1766:

ap_09dc12ah_B2_P1_00386_20130216_18005.wu_0
ap_09dc12af_B6_P1_00017_20130216_17830.wu_1

Btw, it never happened on my NVidia devices.
ID: 1339091 · Report as offensive
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1339098 - Posted: 17 Feb 2013, 8:49:02 UTC

With unroll set to 16, it's no wonder.




With each crime and every kindness we birth our future.
ID: 1339098 · Report as offensive
hbomber
Volunteer tester

Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1339102 - Posted: 17 Feb 2013, 8:59:08 UTC
Last modified: 17 Feb 2013, 9:03:53 UTC

To crash AFTER processing? I don't think so.
Even the 5770 worked with unroll 16. It had another problem, but tests proved it wasn't caused by the unroll value.
ID: 1339102 · Report as offensive
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1339107 - Posted: 17 Feb 2013, 9:16:52 UTC - in response to Message 1339091.  
Last modified: 17 Feb 2013, 9:30:08 UTC

> I forgot to knock on wood when I wrote my previous post, and here it is: two crashed WUs tonight with r1766:
>
> ap_09dc12ah_B2_P1_00386_20130216_18005.wu_0
> ap_09dc12af_B6_P1_00017_20130216_17830.wu_1
>
> Btw, it never happened on my NVidia devices.


Thanks. Well, r1766 still can't catch the exception :(

EDIT: if you or anyone else wants to run the debug build of 1766 to locate the place of the error, please PM me and I will send a link to the debug pack.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1339107 · Report as offensive
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1339110 - Posted: 17 Feb 2013, 9:32:10 UTC - in response to Message 1338982.  

> I'm back running the Public Build of r1766 under BOINC 7.0.45. Hopefully things will settle down now. When I switched back to my old app_info I didn't take into account that the Scheduler had given me two different classes of ATI APs. That wouldn't have been bad, except, the same Flop setting was suddenly way too low causing the tasks to be Timed-Out instead of resent. Whatever. I did finally get an Error just before leaving the Stock r1316. That would be about one Error in around 2 days with r1316, about normal. We'll see how r1766 does.


hbomber outran you. Now it's better to stay with the r1766 debug build until I have something new to try. Sorry for the inconvenience.

SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1339110 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1339245 - Posted: 17 Feb 2013, 21:44:06 UTC - in response to Message 1339110.  

I just had a Restart, after just one day of running the Public 1766. I ran the 1766 Debug for over three days and had nothing.

ap_09dc12ah_B5_P1_00324_20130217_11126.wu_1

I'll reinstall the Debugger and use the same settings as now.
ID: 1339245 · Report as offensive
hbomber
Volunteer tester

Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1339381 - Posted: 19 Feb 2013, 15:52:51 UTC
Last modified: 19 Feb 2013, 15:54:32 UTC

ID: 1339381 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1339477 - Posted: 19 Feb 2013, 23:00:29 UTC - in response to Message 1339381.  

I think I finally got an Error with the Debug build running. Unfortunately, it appears the 7.0.45 'Restart Feature' and the Debugger don't play well. I got a BSOD instead of a report. I don't have BOINC set to run at login, so I was able to look through the slots before BOINC started again. There was nothing there of any value. Complete waste of time. I'll go back to 7.0.44, which doesn't have the 'Restart Feature', and try it again. That is, as soon as I get the machine working again. I think the BSOD did something to the ATI driver, things are not going very well. I'm uninstalling the driver at present, it's taking a very long time. If that ever finishes, I'll reinstall it and try again...
ID: 1339477 · Report as offensive
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1339520 - Posted: 20 Feb 2013, 3:15:54 UTC - in response to Message 1339477.  

> I think I finally got an Error with the Debug build running. Unfortunately, it appears the 7.0.45 'Restart Feature' and the Debugger don't play well. I got a BSOD instead of a report. I don't have BOINC set to run at login, so I was able to look through the slots before BOINC started again. There was nothing there of any value. Complete waste of time. I'll go back to 7.0.44, which doesn't have the 'Restart Feature', and try it again. That is, as soon as I get the machine working again. I think the BSOD did something to the ATI driver, things are not going very well. I'm uninstalling the driver at present, it's taking a very long time. If that ever finishes, I'll reinstall it and try again...


Can you recall any additional info about that BSoD?
What type of BSoD was it? You could run the WhoCrashed app to get more details from the stored memory minidump (it should be on by default).

SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1339520 · Report as offensive
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1339525 - Posted: 20 Feb 2013, 3:19:04 UTC - in response to Message 1339381.  

> We have candidate:
> http://setiathome.berkeley.edu/result.php?resultid=2841769852

Yeah, good catch!
I will provide a new debug build soon. This log showed that the crash could occur inside a class destructor. I will add some printfs to that destructor to see where it crashes.

SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1339525 · Report as offensive
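
The printf approach mentioned in Message 1339525 can be sketched as below. This is only an illustration under assumptions: `DeviceContext`, its two release steps, and the `steps_logged` counter are hypothetical stand-ins, not names from the AstroPulse source. Each message is flushed immediately, so the last line in stderr shows the last step that completed before a crash.

```cpp
#include <cstdio>

// Hypothetical class standing in for whatever object is destroyed at exit.
class DeviceContext {
public:
    // Counts trace lines; present only so the sketch can be checked.
    static int steps_logged;

    ~DeviceContext() {
        trace("destructor entered");
        release_buffers();
        trace("buffers released");
        release_queue();
        trace("queue released");
    }

private:
    // Write one trace line and flush at once, so it survives a later crash.
    static void trace(const char* step) {
        std::fprintf(stderr, "[~DeviceContext] %s\n", step);
        std::fflush(stderr);
        ++steps_logged;
    }

    void release_buffers() { /* placeholder: free device buffers here */ }
    void release_queue() { /* placeholder: free the command queue here */ }
};

int DeviceContext::steps_logged = 0;
```

If stderr ends with "buffers released" but never shows "queue released", the crash happened inside the second release step, which narrows the search to that call.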
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.