OpenCL AstroPulse crash after processing completion - write here.

Message boards : Number crunching : OpenCL AstroPulse crash after processing completion - write here.

Spectrum
Joined: 14 Jun 99
Posts: 468
Credit: 53,129,336
RAC: 0
Australia
Message 1335367 - Posted: 7 Feb 2013, 6:00:09 UTC - in response to Message 1335317.  

Hi TBar.

Thanks for the advice. I have added the app_info entry and will see how it goes.

ID: 1335367
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1335375 - Posted: 7 Feb 2013, 6:19:35 UTC - in response to Message 1335289.  

I have been getting a heap of these lately, is this what we are talking about?


Stderr output

<core_client_version>7.0.25</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>



No, that's a different issue (BOINC's own). The reason is in the message text: "Maximum elapsed time exceeded".

SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1335375
Spectrum
Joined: 14 Jun 99
Posts: 468
Credit: 53,129,336
RAC: 0
Australia
Message 1335415 - Posted: 7 Feb 2013, 9:02:02 UTC - in response to Message 1335375.  
Last modified: 7 Feb 2013, 9:03:08 UTC

Thanks for the reply, Raistmer. Are there any known fixes for this, as it's just wasting cycles?

TBar's idea didn't work; I'm still getting errors.
ID: 1335415
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1335420 - Posted: 7 Feb 2013, 9:20:51 UTC
Last modified: 7 Feb 2013, 9:23:30 UTC

I don't think you should be setting the flops count manually. Looking at your host's applications, the server has already determined a stable estimate of run times for AstroPulse on your ATI GPU. It may just be that your particular work unit had a high blanking percentage, which results in more work being done on the CPU than on the GPU and prolongs the processing time.

I have had the occasional AstroPulse WU which unfortunately had 99+% blanking. Instead of finishing immediately, as 100% blanked WUs do, it was processed entirely on the CPU and consequently hit the maximum-elapsed-time-exceeded error after something like five hours (I believe the time limit is 10x the normal expected run time, and since the WUs are normally processed in less than half an hour: 10 x half an hour = ~5 hours).
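
The arithmetic above can be written out as a short sketch. The 10x multiplier is only the figure recalled in this post, not a confirmed BOINC constant, and the function names are hypothetical, chosen for illustration.

```python
# Sketch of the time-limit arithmetic described in the post.
# ASSUMPTION: the limit is a fixed multiple of the expected run time;
# the value 10 is the poster's recollection, not a verified BOINC setting.
LIMIT_MULTIPLIER = 10


def elapsed_time_limit_hours(expected_hours, multiplier=LIMIT_MULTIPLIER):
    """Return the elapsed time (hours) after which a task would be aborted."""
    return expected_hours * multiplier


def would_time_out(actual_hours, expected_hours, multiplier=LIMIT_MULTIPLIER):
    """True if a task running for actual_hours exceeds the assumed limit."""
    return actual_hours > elapsed_time_limit_hours(expected_hours, multiplier)


# A WU that normally takes about half an hour on the GPU:
print(elapsed_time_limit_hours(0.5))  # 5.0
```

Under that assumption, a heavily blanked WU that falls back to the CPU and needs more than five hours would be cut off, which matches the behaviour described.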

Back to the topic: I had an AstroPulse WU crash with an odd error that I haven't seen before. But it's possible things just got really messed up for this one and it isn't a recurring or reproducible error.
Soli Deo Gloria
ID: 1335420
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1335423 - Posted: 7 Feb 2013, 10:35:58 UTC

Have you freed up CPU cores?

That's very important for heavily blanked WUs.



With each crime and every kindness we birth our future.
ID: 1335423
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1335559 - Posted: 7 Feb 2013, 19:47:11 UTC - in response to Message 1335415.  

Thanks for the reply, Raistmer. Are there any known fixes for this, as it's just wasting cycles?

TBar's idea didn't work; I'm still getting errors.

That's a strange error you're getting with the old Multibeam program. Of the pages of Errors, some actually worked. Worse yet, they worked fine in the past. I had a similar experience the last time I tried that program, although it was a different Error. It had also worked fine for me in the past. You could try updating your ATI driver, then updating BOINC; you never know. A newer Multibeam App for ATI was released a short while ago; you might try that as well. I haven't tried the newer Multibeam version, as I only use the Multibeam App when I can't receive any AstroPulses. Look here for the new ATI Multibeam App: OpenCL apps are available for download on Lunatics.
Good Luck.
ID: 1335559
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1336338 - Posted: 9 Feb 2013, 21:18:13 UTC

Had another restart this morning, first one in a couple days. I had to look in the log to find it;

ap_09dc12ab_B6_P1_00217_20130206_30237.wu_0

2/9/2013 8:46:47 AM |  | Starting BOINC client version 7.0.45 for windows_intelx86
...
2/9/2013 8:46:49 AM | SETI@home | Restarting task ap_28dc12aa_B0_P1_00333_20130128_15160.wu_0 using astropulse_v6 version 601 in slot 2
2/9/2013 8:46:49 AM | SETI@home | Restarting task ap_17dc12aa_B6_P0_00117_20130128_18735.wu_2 using astropulse_v6 version 601 in slot 4
2/9/2013 8:46:49 AM | SETI@home | Restarting task ap_17dc12ab_B5_P1_00004_20130201_24812.wu_0 using astropulse_v6 version 601 in slot 1
2/9/2013 8:46:49 AM | SETI@home | Restarting task ap_09dc12ab_B6_P1_00217_20130206_30237.wu_0 using astropulse_v6 version 604 (ati_opencl_100) in slot 0
2/9/2013 8:46:49 AM | SETI@home | Restarting task 15dc12ac.22337.228355.8.10.51_1 using setiathome_enhanced version 609 (cuda23) in slot 3
2/9/2013 8:46:49 AM | SETI@home | Sending scheduler request: To fetch work.
2/9/2013 8:46:49 AM | SETI@home | Requesting new tasks for NVIDIA and ATI
2/9/2013 8:46:58 AM | SETI@home | Scheduler request completed: got 0 new tasks
2/9/2013 8:46:58 AM | SETI@home | No tasks sent
2/9/2013 8:46:58 AM | SETI@home | No tasks are available for SETI@home Enhanced
2/9/2013 8:46:58 AM | SETI@home | No tasks are available for AstroPulse v6
2/9/2013 8:46:58 AM | SETI@home | This computer has reached a limit on tasks in progress
2/9/2013 8:46:58 AM | SETI@home | Project has no tasks available
2/9/2013 8:48:19 AM | SETI@home | Computation for task ap_09dc12ab_B6_P1_00217_20130206_30237.wu_0 finished
2/9/2013 8:48:19 AM | SETI@home | Starting task ap_09dc12ac_B1_P1_00165_20130206_01989.wu_1 using astropulse_v6 version 604 (ati_opencl_100) in slot 0
2/9/2013 8:48:21 AM | SETI@home | Started upload of ap_09dc12ab_B6_P1_00217_20130206_30237.wu_0_0
2/9/2013 8:48:25 AM | SETI@home | Finished upload of ap_09dc12ab_B6_P1_00217_20130206_30237.wu_0_0
2/9/2013 8:52:04 AM | SETI@home | Sending scheduler request: To fetch work.
2/9/2013 8:52:04 AM | SETI@home | Reporting 1 completed tasks
2/9/2013 8:52:04 AM | SETI@home | Requesting new tasks for ATI
2/9/2013 8:52:10 AM | SETI@home | Scheduler request completed: got 0 new tasks
2/9/2013 8:52:10 AM | SETI@home | No tasks sent
2/9/2013 8:52:10 AM | SETI@home | No tasks are available for SETI@home Enhanced
2/9/2013 8:52:10 AM | SETI@home | No tasks are available for AstroPulse v6
...


Another Success...
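
For anyone else hunting for these restarts, a short script can pull the relevant lines out of a saved event log. This is a minimal sketch that assumes only the `date time | project | message` layout seen in the excerpt above; the function name `find_restarts` and the sample list are hypothetical, and this is not part of BOINC.

```python
import re

# Matches event-log lines of the form "date time | project | message".
# ASSUMPTION: the layout is inferred from the log excerpt in this post only.
RESTART_RE = re.compile(
    r"^(?P<when>[^|]+?)\s*\|\s*(?P<project>[^|]*?)\s*\|\s*"
    r"Restarting task (?P<task>\S+) using (?P<app>\S+) version (?P<version>\d+)"
    r"(?: \((?P<plan_class>[^)]+)\))?"
)


def find_restarts(lines, plan_class=None):
    """Return (timestamp, task, plan_class) for each 'Restarting task' line.

    If plan_class is given, keep only tasks restarted with that plan class.
    """
    hits = []
    for line in lines:
        m = RESTART_RE.match(line)
        if not m:
            continue
        if plan_class is not None and m.group("plan_class") != plan_class:
            continue
        hits.append((m.group("when"), m.group("task"), m.group("plan_class")))
    return hits


# Three lines copied from the log above.
sample = [
    "2/9/2013 8:46:49 AM | SETI@home | Restarting task ap_28dc12aa_B0_P1_00333_20130128_15160.wu_0 using astropulse_v6 version 601 in slot 2",
    "2/9/2013 8:46:49 AM | SETI@home | Restarting task ap_09dc12ab_B6_P1_00217_20130206_30237.wu_0 using astropulse_v6 version 604 (ati_opencl_100) in slot 0",
    "2/9/2013 8:46:58 AM | SETI@home | No tasks sent",
]

for when, task, pc in find_restarts(sample, plan_class="ati_opencl_100"):
    print(when, task, pc)
```

Filtering on `ati_opencl_100` leaves only the OpenCL AstroPulse task, which is the one of interest in this thread. Note that every task running at the time shows a "Restarting task" line whenever the client itself starts, so a hit is only a candidate, not proof of an app crash.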
ID: 1336338
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1336356 - Posted: 9 Feb 2013, 21:37:03 UTC - in response to Message 1336338.  

I will provide a new build for this issue soon.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1336356
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1336572 - Posted: 10 Feb 2013, 9:58:02 UTC
Last modified: 10 Feb 2013, 10:10:31 UTC

Here is an updated build that will hopefully catch the exception and shut down gracefully (this is important for BOINC clients that do not re-run the task).

https://dl.dropbox.com/u/60381958/AP6_win_x86_SSE2_OpenCL_ATI_r1766.7z
https://dl.dropbox.com/u/60381958/AP6_win_x86_SSE2_OpenCL_NV_r1766.7z

Please continue to post cases of re-runs/restarts concerning this issue.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1336572
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1336645 - Posted: 10 Feb 2013, 15:24:07 UTC - in response to Message 1336572.  

Here is an updated build that will hopefully catch the exception and shut down gracefully (this is important for BOINC clients that do not re-run the task).

https://dl.dropbox.com/u/60381958/AP6_win_x86_SSE2_OpenCL_ATI_r1766.7z
https://dl.dropbox.com/u/60381958/AP6_win_x86_SSE2_OpenCL_NV_r1766.7z

Please continue to post cases of re-runs/restarts concerning this issue.

Thanks, I'll give it a go after the servers come back up. I get nervous making edits with so many unreported tasks. I already had to make an edit to switch the nVidia card to APs; I'll change to this build when I remove the nVidia AP edit.
ID: 1336645
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1336836 - Posted: 10 Feb 2013, 20:57:55 UTC
Last modified: 10 Feb 2013, 21:56:34 UTC

Back up and running the debug train to... wherever.

2/10/2013 3:38:36 PM |  | Starting BOINC client version 7.0.42 for windows_intelx86
2/10/2013 3:38:36 PM |  | OS: Microsoft Windows XP: Professional x86 Edition, Service Pack 3, (05.01.2600.00)
2/10/2013 3:38:36 PM |  | CUDA: NVIDIA GPU 0: GeForce 8800 GT (driver version 306.81, CUDA version 5.0, compute capability 1.1, 512MB, 467MB available, 504 GFLOPS peak)
2/10/2013 3:38:36 PM |  | CAL: ATI GPU 0: AMD Radeon HD 6800 series (Barts) (CAL version 1.4.1664, 1024MB, 1006MB available, 2976 GFLOPS peak)
2/10/2013 3:38:36 PM |  | OpenCL: NVIDIA GPU 0: GeForce 8800 GT (driver version 306.81, device version OpenCL 1.0 CUDA, 512MB, 467MB available, 504 GFLOPS peak)
2/10/2013 3:38:36 PM |  | OpenCL: ATI GPU 0: AMD Radeon HD 6800 series (Barts) (driver version CAL 1.4.1664, device version OpenCL 1.1 AMD-APP (851.4), 1024MB, 1006MB available, 2976 GFLOPS peak)
2/10/2013 3:38:36 PM |  | Version change (7.0.45 -> 7.0.42)
2/10/2013 3:39:20 PM | SETI@home | Restarting task ap_25jl12ac_B6_P0_00142_20130119_09915.wu_3 using astropulse_v6 version 601 in slot 2
2/10/2013 3:39:20 PM | SETI@home | Restarting task ap_02ja13ac_B5_P0_00380_20130129_23414.wu_1 using astropulse_v6 version 601 in slot 4
2/10/2013 3:39:20 PM | SETI@home | Restarting task 17dc12ac.19577.17220.12.10.101_0 using setiathome_enhanced version 609 (cuda23) in slot 0
2/10/2013 3:41:53 PM | SETI@home | task ap_27dc12ac_B1_P0_00309_20130208_09561.wu_0 resumed by user
2/10/2013 3:41:54 PM | SETI@home | Starting task ap_27dc12ac_B1_P0_00309_20130208_09561.wu_0 using astropulse_v6 version 604 (ati_opencl_100) in slot 1
....


First candidate; ap_27dc12ac_B1_P0_00309_20130208_09561.wu_0
Of course it just has to be one that is heavily blanked...
ID: 1336836
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1336871 - Posted: 10 Feb 2013, 22:48:17 UTC
Last modified: 10 Feb 2013, 22:58:49 UTC

This Guy just showed up in one of my latest Work-Units.

He has a few; Error tasks for computer 6846852

The other platform; Computer 6843287...
ID: 1336871
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1337061 - Posted: 11 Feb 2013, 15:39:33 UTC - in response to Message 1336836.  
Last modified: 11 Feb 2013, 15:42:42 UTC


First candidate; ap_27dc12ac_B1_P0_00309_20130208_09561.wu_0
Of course it just has to be one that is heavily blanked...

This one shows no signs of a restart.

EDIT: blanking is ~11%, not as heavy as it could be ;)
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1337061
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1337062 - Posted: 11 Feb 2013, 15:40:43 UTC - in response to Message 1336871.  

This Guy just showed up in one of my latest Work-Units.

He has a few; Error tasks for computer 6846852

He is running an old revision, so no additional info can be extracted.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1337062
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1337076 - Posted: 11 Feb 2013, 16:30:00 UTC - in response to Message 1337062.  

This Guy just showed up in one of my latest Work-Units.

He has a few; Error tasks for computer 6846852

He is running an old revision, so no additional info can be extracted.

None of them have had any Errors so far. The one I listed was the first one using your new App; if there were some way to sort them by time/date, that might be useful. All of them using the new App are also using BOINC 7.0.42. Maybe I should go back to BOINC 7.0.28; most people getting the Errors are using that version, and I got repeated Errors using the unroll 2 setting with 7.0.28.

This one was killing my ATI App, so I arranged for the nVidia card to run it: ap_16dc12ac_B6_P0_00338_20130208_22461.wu_1. Note the blanking. It would probably have timed out on the ATI App. Lots of Blanking going on...

More Errors;
Computer 5736754
Computer 6204844
ID: 1337076
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1337133 - Posted: 11 Feb 2013, 19:05:27 UTC - in response to Message 1335272.  

Something else impressive is how well Ubuntu 64-bit crunches CPU AstroPulses. My pieced-together Linux system is crunching better than a faster Xeon in both 64-bit OSX and 32-bit XP. The 2.8GHz Xeon takes just under 9 hours in OSX and over 10 hours in 32-bit XP. The 2.4GHz Xeon is doing it around the mid-eights in Ubuntu. We need a better CPU AstroPulse App for 32-bit Windows.

I think I've sorted this. The reason the 2.4GHz Intel processor is running better than a 2.8GHz Intel processor is that it's actually running at 3.01GHz. For some reason, when you place an Intel Xeon 3060 on an Intel DP43TF board and set the Intel BIOS to 'Automatic', the board overclocks the 3060 to 3.01GHz. Since many people were clocking the 3060 to 3.4GHz, I guess I shouldn't be concerned by Intel clocking their own component to 3.01. It seems to work fine; I just need a Full Sized ATX case for it. It will fit in an old Compaq EVO 510 case; I've had it in one before. Once I have it in a case, I'll add SETI video cards...
ID: 1337133
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1337657 - Posted: 13 Feb 2013, 6:45:40 UTC - in response to Message 1337062.  

I gave up on 7.0.42. It appears your bug finder has scared all the bugs away. I went back to 7.0.28, where all this started. Back then, using the stock App set at the default setting unroll 2, I was getting mostly Errors. This is the first one using your new App, ap_15dc12ac_B4_P1_00152_20130207_30944.wu_2. If you look in the Workunit you will find another one, Workunit 1166143526. No, I didn't plan it that way, it just happened...
ID: 1337657
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1338214 - Posted: 14 Feb 2013, 20:36:02 UTC

I think we've entered the Twilight Zone. Desperately seeking some evidence of a bug, I moved the original astropulse_6.04_windows_intelx86__opencl_ati_100.exe, astropulse_6.04_windows_intelx86__opencl_ati_100.pdb & AstroPulse_Kernels_r1316.cl from the 'oldApp_backup' folder and began running that App. Not a single bug in over a day. I can't think of any difference from the last time I ran the original Stock App, other than that I'm now running a 2GB RAM disk in the upper 6GB of RAM. I do still have the r1766 debug build in the project folder even though it's not being used. I'll try removing the r1766DB and the RAM disk and see what happens. Since I placed the r1766_debug_build in the project folder I haven't had any Errors.
ID: 1338214
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1338331 - Posted: 15 Feb 2013, 4:05:32 UTC - in response to Message 1338214.  
Last modified: 15 Feb 2013, 4:52:42 UTC

Well, I removed the Debug Build from the Project folder and nothing changed. There are a couple of other major differences from when I was receiving all the Errors with the Stock App. Like most other people receiving many of these Errors, I was running completely Stock, without an app_info file. Since using the app_info file I usually receive only one Error a day, sometimes one Error a week. Another observation is that the App currently seems to be using much more CPU time, even with Zero blanking. It used to use 10-20% CPU on lightly blanked tasks, whereas now it uses 30-50% CPU with light blanking. Apps not using an app_info file appear to use less CPU time, but the evidence is inconclusive. Here is an interesting task: Workunit 1167407957. Note the CPU times, and the number of other Errors the one host has. Also note how the one Host whose results were listed as 'Invalid' due to an Error actually had the Valid results...

shrugs...
ID: 1338331
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1338353 - Posted: 15 Feb 2013, 6:29:22 UTC - in response to Message 1338331.  

The debug build has no instruction-level CPU optimizations, so it will be slower and consume more CPU.
But if you are having trouble finding any crashes with it, could you try the last opt build I posted in this thread instead? At least we will know whether the workaround works, even without knowing where the crash occurs.

SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1338353

 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.