GPU AP's error out on one host

Fawkesguy
Volunteer tester
Joined: 8 Jan 01
Posts: 108
Credit: 188,578,766
RAC: 0
United States
Message 1732151 - Posted: 5 Oct 2015, 22:38:24 UTC
Last modified: 5 Oct 2015, 22:55:36 UTC

On one host, every AP task that hits my GPUs errors out with exit status "193 (0xc1) EXIT_SIGNAL" and:

INFO: can't open binary kernel file: /home/jpsoifer/BOINC/projects/setiathome.berkeley.edu/AstroPulse_Kernels_r2751.cl_GeForceGTX750Ti.bin_V7_TWIN_FFA_35511, continue with recompile...
terminate called after throwing an instance of 'std::logic_error'
what(): basic_string::_S_construct null not valid
SIGABRT: abort called

Any idea what's causing this and how I might fix it?

This is the host: http://setiathome.berkeley.edu/show_host_detail.php?hostid=7772630
ID: 1732151

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1732176 - Posted: 6 Oct 2015, 1:32:03 UTC - in response to Message 1732151.  
Last modified: 6 Oct 2015, 1:48:23 UTC

Try removing the two TUNE cmdline settings and see if that helps:
TUNE: kernel 1 now has workgroup size of (64,8,1)
TUNE: kernel 2 now has workgroup size of (128,8,1)

Have you had any APs work on a 750 Ti with those settings?
Also, Linux usually likes lower FFA numbers. I found my cards like:
FFA thread block override value:3072
FFA thread fetchblock override value:1536

Now that I think about it, I remember having a problem with the FFA thread block override value set to 6144 or above. The post is somewhere at Beta. So, you might try using:
FFA thread block override value:3072
FFA thread fetchblock override value:1536
On the 750s.
ID: 1732176

Fawkesguy
Volunteer tester
Joined: 8 Jan 01
Posts: 108
Credit: 188,578,766
RAC: 0
United States
Message 1732207 - Posted: 6 Oct 2015, 4:20:32 UTC - in response to Message 1732176.  
Last modified: 6 Oct 2015, 4:23:47 UTC

Hi TBar,

Thanks, I'll make those changes and see what happens.

This is what I've got now:

<cmdline>-unroll 12 -ffa_block 3072 -ffa_block_fetch 1536 -oclFFT_plan 256 16 256 -hp</cmdline>
ID: 1732207

Fawkesguy
Volunteer tester
Joined: 8 Jan 01
Posts: 108
Credit: 188,578,766
RAC: 0
United States
Message 1732343 - Posted: 6 Oct 2015, 15:45:44 UTC - in response to Message 1732207.  

ID: 1732343

Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1732351 - Posted: 6 Oct 2015, 20:14:06 UTC - in response to Message 1732343.  

You are overclocking your card; 750 Tis don't like to be pushed that hard.

You have 1254 MHz; I have 1150 MHz on mine.
ID: 1732351

petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1732354 - Posted: 6 Oct 2015, 20:21:54 UTC - in response to Message 1732343.  
Last modified: 6 Oct 2015, 20:22:17 UTC

Hi,

You should check that the file BOINC/projects/setiathome.berkeley.edu/AstroPulse_Kernels_r2751.cl exists, has read permissions, and is not empty.

You can set the permissions in xterm with
chmod ugo+r BOINC/projects/setiathome.berkeley.edu/AstroPulse_Kernels_r2751.cl
that gives read access to all.
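The three checks above (the file exists, is readable, and is not empty) could be sketched as a small shell function. The name `check_kernel_source` is made up for illustration and is not part of BOINC or the AstroPulse app; pass it whatever path your own install uses.

```shell
#!/bin/sh
# Hypothetical helper (not part of BOINC): report the state of a kernel
# source file. Prints one of: missing, unreadable, empty, ok
check_kernel_source() {
    f="$1"
    if [ ! -e "$f" ]; then
        echo "missing"        # the file is not there at all
    elif [ ! -r "$f" ]; then
        echo "unreadable"     # present, but this user cannot read it
    elif [ ! -s "$f" ]; then
        echo "empty"          # present and readable, but zero bytes
    else
        echo "ok"
    fi
}
```

Run it as the user that runs BOINC, for example `check_kernel_source BOINC/projects/setiathome.berkeley.edu/AstroPulse_Kernels_r2751.cl`. An `unreadable` result is the case the chmod command addresses; `missing` or `empty` means the file itself has to be supplied.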
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1732354

Fawkesguy
Volunteer tester
Joined: 8 Jan 01
Posts: 108
Credit: 188,578,766
RAC: 0
United States
Message 1732391 - Posted: 6 Oct 2015, 22:42:17 UTC - in response to Message 1732351.  
Last modified: 6 Oct 2015, 23:25:54 UTC

You are overclocking your card; 750 Tis don't like to be pushed that hard.

You have 1254 MHz; I have 1150 MHz on mine.


I don't overclock any of my cards. Those are EVGA 750 Ti SCs, which have the following factory clock speeds:

1176 MHz Base Clock

1255 MHz Boost Clock
ID: 1732391

Fawkesguy
Volunteer tester
Joined: 8 Jan 01
Posts: 108
Credit: 188,578,766
RAC: 0
United States
Message 1732393 - Posted: 6 Oct 2015, 22:47:57 UTC - in response to Message 1732354.  

Hi petri,

No, that file does not exist. Shouldn't it have been created automatically? On this host [http://setiathome.berkeley.edu/show_host_detail.php?hostid=7772567] I see AstroPulse_Kernels_r2751.cl_GeForceGTX970.bin_V7_TWIN_FFA_35511

Is it something that is created during the driver install? Should I try reinstalling the Nvidia drivers?



Hi,

You should check that the file BOINC/projects/setiathome.berkeley.edu/AstroPulse_Kernels_r2751.cl exists, has read permissions, and is not empty.

You can set the permissions in xterm with
chmod ugo+r BOINC/projects/setiathome.berkeley.edu/AstroPulse_Kernels_r2751.cl
that gives read access to all.
ID: 1732393

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1732397 - Posted: 6 Oct 2015, 22:58:50 UTC - in response to Message 1732393.  

No, it's something that is written and supplied by the application developer. You need it: it should have been supplied with the application.
ID: 1732397

Fawkesguy
Volunteer tester
Joined: 8 Jan 01
Posts: 108
Credit: 188,578,766
RAC: 0
United States
Message 1732411 - Posted: 6 Oct 2015, 23:24:58 UTC - in response to Message 1732397.  

I'm running the stock Linux AP OpenCL app, astropulse_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100. Shouldn't AstroPulse_Kernels_r2751.cl have been downloaded or created when I first received GPU AP work?
ID: 1732411

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1732423 - Posted: 7 Oct 2015, 0:48:10 UTC - in response to Message 1732411.  

If you were running stock, the file would have been downloaded; however, both of your 750 Ti hosts are running Anonymous platform. The other host is now giving the same error: http://setiathome.berkeley.edu/results.php?hostid=7772598&state=6&appid=. If you are missing AstroPulse_Kernels_r2751.cl on those hosts, you're going to have to supply it manually.
ID: 1732423

Fawkesguy
Volunteer tester
Joined: 8 Jan 01
Posts: 108
Credit: 188,578,766
RAC: 0
United States
Message 1732454 - Posted: 7 Oct 2015, 4:02:39 UTC - in response to Message 1732423.  

OK, where can I get the file? Do you have a link?

If you were running stock, the file would have been downloaded; however, both of your 750 Ti hosts are running Anonymous platform. The other host is now giving the same error: http://setiathome.berkeley.edu/results.php?hostid=7772598&state=6&appid=. If you are missing AstroPulse_Kernels_r2751.cl on those hosts, you're going to have to supply it manually.
ID: 1732454

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1732460 - Posted: 7 Oct 2015, 5:11:21 UTC - in response to Message 1732454.  

You should be able to use the .cl file from one of your other machines; they are basically the same, even across GPU vendors. The link from another post should still work, though: http://boinc2.ssl.berkeley.edu/beta/download/AstroPulse_Kernels_r2751.cl
Hmmm, change the ati to nvidia and this one works too, http://boinc2.ssl.berkeley.edu/beta/download/astropulse_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100
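Supplying the file by hand could look roughly like the sketch below. The URL is the one posted above and may no longer be live; `PROJECT_DIR` is an assumption taken from the path in the original error log, and `fetch_kernel_source` is a made-up helper name. It uses curl and skips the download when a non-empty copy is already in place.

```shell
#!/bin/sh
# URL as posted in this thread; it may no longer be reachable.
KERNEL_URL="http://boinc2.ssl.berkeley.edu/beta/download/AstroPulse_Kernels_r2751.cl"
# Assumed project directory (from the error log); override it for your install.
PROJECT_DIR="${PROJECT_DIR:-$HOME/BOINC/projects/setiathome.berkeley.edu}"

# Hypothetical helper: fetch the kernel source only if it is absent or empty,
# then make it readable by all users.
fetch_kernel_source() {
    dest="$PROJECT_DIR/$(basename "$KERNEL_URL")"
    if [ -s "$dest" ]; then
        echo "already present: $dest"
        return 0
    fi
    # -f: fail on HTTP errors, -sS: quiet but still report errors, -o: output file
    curl -fsS -o "$dest" "$KERNEL_URL" && chmod ugo+r "$dest"
}
```

Copying the .cl file over from a working machine with `cp` or `scp` achieves the same result without depending on the download link.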

It appears you have quite a few ghosts on those machines. You can recover them, 20 at a time, if you have cache space and want to jump through the hoops. You basically have to report a task twice to trigger a resend event; each event then sends 20 tasks. It might take a while with all those ghosts ;-)
ID: 1732460

Fawkesguy
Volunteer tester
Joined: 8 Jan 01
Posts: 108
Credit: 188,578,766
RAC: 0
United States
Message 1732464 - Posted: 7 Oct 2015, 5:33:53 UTC

Thank you so much! Let's see how things go with the kernel file, then maybe I'll try tackling the ghosts. :-)
ID: 1732464

Fawkesguy
Volunteer tester
Joined: 8 Jan 01
Posts: 108
Credit: 188,578,766
RAC: 0
United States
Message 1732592 - Posted: 7 Oct 2015, 17:43:16 UTC
Last modified: 7 Oct 2015, 17:45:51 UTC

Now I'm getting errors on a different host. All seemed fine through yesterday:

http://setiathome.berkeley.edu/result.php?resultid=4430016187

This started happening today:

http://setiathome.berkeley.edu/result.php?resultid=4430588341

I have no clue what's going on.
ID: 1732592

castor
Joined: 2 Jan 02
Posts: 13
Credit: 17,721,708
RAC: 0
Finland
Message 1732603 - Posted: 7 Oct 2015, 18:31:23 UTC

Was there by any chance a Linux kernel update recently, messing up the nvidia driver install?
ID: 1732603

petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1732606 - Posted: 7 Oct 2015, 18:42:11 UTC - in response to Message 1732592.  
Last modified: 7 Oct 2015, 18:43:36 UTC

Hi,
Your host 7772567 used to have 4 GPUs; now it has 3.

Have you tried a reboot? Maybe one of the GPUs has overheated or suffered some kind of crash.
ID: 1732606

Fawkesguy
Volunteer tester
Joined: 8 Jan 01
Posts: 108
Credit: 188,578,766
RAC: 0
United States
Message 1732620 - Posted: 7 Oct 2015, 19:10:32 UTC - in response to Message 1732606.  

Hi,
Your host 7772567 used to have 4 GPUs; now it has 3.

Have you tried a reboot? Maybe one of the GPUs has overheated or suffered some kind of crash.



I'm seeing 4. http://setiathome.berkeley.edu/show_host_detail.php?hostid=7772567

So strange.
ID: 1732620

Fawkesguy
Volunteer tester
Joined: 8 Jan 01
Posts: 108
Credit: 188,578,766
RAC: 0
United States
Message 1732621 - Posted: 7 Oct 2015, 19:11:34 UTC - in response to Message 1732603.  

Was there by any chance a Linux kernel update recently, messing up the nvidia driver install?


No, I haven't done any kernel updates.
ID: 1732621

petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1732622 - Posted: 7 Oct 2015, 19:15:15 UTC - in response to Message 1732620.  

Hi,
Your host 7772567 used to have 4 GPUs; now it has 3.

Have you tried a reboot? Maybe one of the GPUs has overheated or suffered some kind of crash.



I'm seeing 4. http://setiathome.berkeley.edu/show_host_detail.php?hostid=7772567

So strange.


<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
Not using ap_cmdline.txt-file, using commandline options.
DATA_CHUNK_UNROLL set to:18
oclFFT plan class overrides requested: global radix 256; local radix 16;  max workgroup size 256
FFA thread block override value:16384
FFA thread fetchblock override value:8192
TUNE: kernel 1 now has workgroup size of (64,8,1)
TUNE: kernel 2 now has workgroup size of (64,8,1)
Running on device number: 3
GPU not found: type=NVIDIA, opencl_device_index=3, device_num=-1
WARNING: boinc_get_opencl_ids failed with code -1
OpenCL platform detected: NVIDIA Corporation
WARNING: BOINC supplied wrong platform!
Number of OpenCL devices found : 3 
BOINC assigns slot on device #3.


Yes, it should have four. The application finds only three.

See the second-to-last line: Number of OpenCL devices found : 3

Strange...
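The mismatch in that log can be spotted mechanically. The sketch below reads a task's stderr text on stdin and compares the device number BOINC assigned with the device count the app reports. It assumes device numbers are 0-based (so with 3 devices found, the valid numbers are 0 to 2), and `check_device_index` is a made-up name, not a BOINC tool.

```shell
#!/bin/sh
# Hypothetical helper: read task stderr on stdin and compare the assigned
# device number with the number of OpenCL devices the app found.
# Assumption: device numbers are 0-based, so valid values are 0..(found-1).
check_device_index() {
    awk '
        /Running on device number:/      { dev = $NF }    # last field is the index
        /Number of OpenCL devices found/ { found = $NF }  # last field is the count
        END {
            if (dev == "" || found == "") { print "unknown"; exit }
            if (dev + 0 >= found + 0) print "out-of-range device " dev " of " found
            else print "ok device " dev " of " found
        }'
}
```

Fed the excerpt above, it would flag device 3 as out of range for 3 detected devices, which matches the "GPU not found" line in the same log.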
ID: 1732622
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.