Open Beta test: SoG for NVidia, Lunatics v0.45 - Beta6 (RC again)

Message boards : Number crunching : Open Beta test: SoG for NVidia, Lunatics v0.45 - Beta6 (RC again)

Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1817462 - Posted: 16 Sep 2016, 2:23:11 UTC - in response to Message 1817454.  



. . Hi Robert,

. . I had a look at your results and you have not thrown an invalid for over 30 hours, and there have been a lot of validated WUs in that time. You may have solved the problem already. What is the PSU rating on that HP box? I am guessing 280W (going on the fact that my HP Core(2) Duo has a 240W PSU). It may be simply that the GTX560 (which needs 150W for itself) needs more power than the PSU can provide. Have you considered trying the 560 in your other PC? Also, that GTX560 needs two 6-pin external power connectors (PCIe); does the HP have them?

Stephen

.

I've found that it usually takes at least 30 hours for invalids to be detected, so I plan to wait longer before deciding if the problem is fixed by shutting down almost all CPU workunits. I'm still getting suspicious spike results notices, still with no way to tell which SETI@home task they were from.

The computer with the 560 is an HP d5200t, with a 450W PSU. It took a few adapters to provide the two 6pin connectors - the PSU does not provide them directly. It still uses the original 3 GHz Q9650 CPU. I haven't found information on whether the CPU is soldered in place or in a socket. If it's in a socket, what other CPUs should work better in the same socket?

I've been careful not to order graphics boards listed as needing a PSU rated more than I have, except for a GTX 980 which is waiting for a new computer with a higher rated PSU.

My other computer has only a 300W PSU, so is that adequate for even trying the 560?


. . I'm glad that you are conservative. That delay you refer to is the lag between you completing a WU and your wingman doing the same. Many jobs are not validated until much longer than 30 hours later, but most do not take so long. The fact that you have 43 validated tasks since the last invalid indicates a high validation rate within 24 hours, and I feel that you have at least reduced the number of invalids.

. . An HP Core(2) Quad with a 450W PSU and an i7 with only a 300W PSU, that is the opposite of what I had imagined. And no, a 300W PSU is probably not sufficient for the GTX560. More's the pity, it would have been a good way to pin the issue on either the GPU itself or the HP rig it is in. Not sure where to go from here, it seems it would only be going over what has already been covered.

Stephen

. . Maybe one thing: you mentioned the adapters. I take it you are using two adapters getting power from 4 Molex connectors to feed the two 6-pin sockets on the GPU. Have you checked that they are not running hot? I have read about an issue where a bad connection in such an adapter caused it to overheat and melt the plastic. A bit scary. The reason you need two Molex per 6-pin is that the PCIe connector is rated at a higher current than the Molex is. If you are using a single Molex per 6-pin then resistance may be causing a voltage drop at that amperage. Just trying to think of anything that might cause problems.

Stephen

.
ID: 1817462 · Report as offensive
robertmiles
Volunteer tester

Joined: 16 Jan 12
Posts: 213
Credit: 4,117,756
RAC: 6
United States
Message 1817464 - Posted: 16 Sep 2016, 2:36:37 UTC - in response to Message 1817462.  

[snip]

. . I'm glad that you are conservative. That delay you refer to is the lag between you completing a WU and your wingman doing the same. Many jobs are not validated until much longer than 30 hours later, but most do not take so long. The fact that you have 43 validated tasks since the last invalid indicates a high validation rate within 24 hours, and I feel that you have at least reduced the number of invalids.

. . An HP Core(2) Quad with a 450W PSU and an i7 with only a 300W PSU, that is the opposite of what I had imagined. And no, a 300W PSU is probably not sufficient for the GTX560. More's the pity, it would have been a good way to pin the issue on either the GPU itself or the HP rig it is in. Not sure where to go from here, it seems it would only be going over what has already been covered.

Stephen

. . Maybe one thing: you mentioned the adapters. I take it you are using two adapters getting power from 4 Molex connectors to feed the two 6-pin sockets on the GPU. Have you checked that they are not running hot? I have read about an issue where a bad connection in such an adapter caused it to overheat and melt the plastic. A bit scary. The reason you need two Molex per 6-pin is that the PCIe connector is rated at a higher current than the Molex is. If you are using a single Molex per 6-pin then resistance may be causing a voltage drop at that amperage. Just trying to think of anything that might cause problems.

Stephen

.


The adapters are supplied from two separate Molexes. They are not melting the plastic. I'll check for overheating tomorrow. I'm using more than two adapters - the type needed to do it with fewer was not available.
ID: 1817464 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1817465 - Posted: 16 Sep 2016, 2:43:07 UTC - in response to Message 1817457.  

. . Hi Jim,

. . Even if it has a 325W PSU it would be below the needs of a GTX560; it is a 150W TDP card. The 750ti's are only 60W cards. Nvidia specify a 450W PSU for the 560; in practice you might get away with 350W plus, but not less. How many PCIe connectors are in your HP box? There are none in my HP 8000 elite (but it is SFF and only has a 240W PSU).

Stephen

.

NVidia's ratings are so conservative it's nuts. Add to that the fact that even in heavy duty crunching the cards seldom exceed 80% of TDP and there's a lot of headroom there. When you add together all the "worst case" ratings, the requirements end up totally out of proportion to reality.
The CMT has 2 x16 slots (second is only wired for x4) and 1 x1 slot. The way it's laid out the x16 slots are adjacent, so with dual slot GPUs only the x16 and x1 are usable. I use an x1 riser to support the second 750, as the 2.5gb vs. 5.0gb bus rate is irrelevant for what we're doing, especially now with SoG.
So, I'll never get a third GPU running in the CMT, too bad as I think the CPU would handle it well and there's probably enough room left on that 325w supply for a third 750.


. . OK I have the picture on your card setup. Personally I am too timid to use risers to access the other slots. But I was actually referring to external PCIe PSU connectors; the PSU in my 8000 has none, and I was wondering how many you have. And I suspect the 750ti's crunching at full steam might be using 80% of TDP or a little more (my GTX950 is running at 85% TDP). So 3 x 60W at 80% is about 150W. Considering PSUs are good for about 80% of their nominal value under sustained loads, that means 325W @ 80% less 150W for the 750ti's leaves just over 100W to run the C2Q and peripherals. I'd call that pushing the limits a bit :). My Pent-D rig with the 2 970s has a 650W PSU and is running at 450W +/- 30W crunching; it runs fairly warm and I would not want to push it very much harder. But I like being conservative.
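
For anyone who wants to redo that arithmetic with their own numbers, here is a minimal Python sketch. The two 80% figures are the rules of thumb from the paragraph above, not measured values, and the function name and parameters are made up for illustration.

```python
def psu_headroom_watts(psu_rating_w, gpu_tdp_w, gpu_count,
                       gpu_load=0.80, psu_derate=0.80):
    """Estimate the watts left for CPU and peripherals.

    gpu_load   - assumed fraction of TDP each GPU draws while crunching
    psu_derate - assumed fraction of the PSU rating usable under sustained load
    """
    usable = psu_rating_w * psu_derate           # e.g. 325 W * 0.80
    gpu_draw = gpu_tdp_w * gpu_count * gpu_load  # e.g. 3 * 60 W * 0.80
    return usable - gpu_draw


# The case discussed above: a 325 W PSU feeding three 60 W GTX 750 Ti cards.
headroom = psu_headroom_watts(325, 60, 3)
print(f"Estimated headroom: {headroom:.0f} W")
```

Without rounding the GPU draw up to 150W, the estimate comes out a few watts higher than the "just over 100W" quoted above, but the conclusion is the same: not much is left for the C2Q and the rest of the box.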

Stephen

.


. .
ID: 1817465 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1817467 - Posted: 16 Sep 2016, 2:53:34 UTC - in response to Message 1817464.  



The adapters are supplied from two separate Molexes. They are not melting the plastic. I'll check for overheating tomorrow. I'm using more than two adapters - the type needed to do it with fewer was not available.


. . Now I am confused, not sure what type of adapters you are using. But the main thing is they are there and not running hot :)

Stephen

.
ID: 1817467 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1856
Credit: 268,616,081
RAC: 1,349
United States
Message 1817477 - Posted: 16 Sep 2016, 3:49:01 UTC - in response to Message 1817465.  

. . OK I have the picture on your card setup. Personally I am too timid to use risers to access the other slots. But I was actually referring to external PCIe PSU connectors; the PSU in my 8000 has none, and I was wondering how many you have. And I suspect the 750ti's crunching at full steam might be using 80% of TDP or a little more (my GTX950 is running at 85% TDP). So 3 x 60W at 80% is about 150W. Considering PSUs are good for about 80% of their nominal value under sustained loads, that means 325W @ 80% less 150W for the 750ti's leaves just over 100W to run the C2Q and peripherals. I'd call that pushing the limits a bit :). My Pent-D rig with the 2 970s has a 650W PSU and is running at 450W +/- 30W crunching; it runs fairly warm and I would not want to push it very much harder. But I like being conservative.

Stephen

The EVGA 750tiSCs do not have an external power connector, they take what they need off the mobo (or, in the case of one using a riser, off a single Molex plugged into the riser card). That's one reason I really like that card and got so many. A PCIE slot is rated to supply 75w, so that's another place where I feel confident I'm not pushing the boundary any with this. At this point, the PS fan runs slowly, and the exhaust air doesn't even seem very warm, especially compared to the other boxes which tend to cook.
Interestingly enough, my other box in this class (Gigabyte mb, Xeon that's basically a C2Q, and 3x GTX750tiSCs) keeps its OCX 700w ps blowing pretty warm air. Makes me wonder just how over-rated that OCX ps is, and/or how under-rated this HP PS is.
As far as I can remember, the 8000CMT PS does not have any PCIE power connectors, too old for that.
I think we drifted a bit OT here. Sorry, ...
ID: 1817477 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1817508 - Posted: 16 Sep 2016, 8:24:35 UTC

It's worth checking whether the GPU shows any visual artefacts in GPU tests.
There may be an issue with a GPU memory chip.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1817508 · Report as offensive
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1817530 - Posted: 16 Sep 2016, 11:30:35 UTC

Has anyone seen this showing up on the notices tab before? It has happened a couple of times, but I can find no other info on it. Everything is running nicely and all of a sudden this pops up. It would be nice to at least see what task was causing the problem.

SETI@home: Notice from BOINC
Task postponed: CL file build failure
09/15/2016 23:59:21

This is what is in the event log --

09/15/2016 23:59:21 | SETI@home | task postponed 30.000000 sec:
09/15/2016 23:59:21 | SETI@home | task postponed 30.000000 sec:
09/15/2016 23:59:21 | SETI@home | Task postponed: CL file build failure


I don't buy computers, I build them!!
ID: 1817530 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1817533 - Posted: 16 Sep 2016, 12:07:58 UTC - in response to Message 1817530.  

09/15/2016 23:59:21 | SETI@home | Task postponed: CL file build failure

Which machine/project was this on?

There is a known problem with r3525 (only), but

1) That was only ever released (briefly) at the SETI Beta project, and has never been released to the main project, or via a Lunatics Installer, beta or otherwise.

2) The problem was confined to GTX 6xx and earlier GPUs: the only machine you have attached to the Main project (here) has dual GTX 750Ti GPUs, which should be unaffected by this problem.

So, more details, please.
ID: 1817533 · Report as offensive
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1817536 - Posted: 16 Sep 2016, 12:40:05 UTC - in response to Message 1817533.  

09/15/2016 23:59:21 | SETI@home | Task postponed: CL file build failure

Which machine/project was this on?

There is a known problem with r3525 (only), but

1) That was only ever released (briefly) at the SETI Beta project, and has never been released to the main project, or via a Lunatics Installer, beta or otherwise.

2) The problem was confined to GTX 6xx and earlier GPUs: the only machine you have attached to the Main project (here) has dual GTX 750Ti GPUs, which should be unaffected by this problem.

So, more details, please.


Hi Richard,

This is happening on SETI main on my 4770K, Win 7 (x64) machine, 2 x GTX750Ti @ 2 GB each, running Lunatics 0.45 beta -4 opencl_nvidia_SoG (r3500). GPUs are running 3 tasks each at 0.5 CPU for each task.

cmd_line.txt (-use_sleep -sbs 512 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp)


I don't buy computers, I build them!!
ID: 1817536 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34354
Credit: 79,922,639
RAC: 80
Germany
Message 1817541 - Posted: 16 Sep 2016, 13:02:44 UTC - in response to Message 1817536.  

09/15/2016 23:59:21 | SETI@home | Task postponed: CL file build failure

Which machine/project was this on?

There is a known problem with r3525 (only), but

1) That was only ever released (briefly) at the SETI Beta project, and has never been released to the main project, or via a Lunatics Installer, beta or otherwise.

2) The problem was confined to GTX 6xx and earlier GPUs: the only machine you have attached to the Main project (here) has dual GTX 750Ti GPUs, which should be unaffected by this problem.

So, more details, please.


Hi Richard,

This is happening on SETI main on my 4770K, Win 7 (x64) machine, 2 x GTX750Ti @ 2 GB each, running Lunatics 0.45 beta -4 opencl_nvidia_SoG (r3500). GPUs are running 3 tasks each at 0.5 CPU for each task.

cmd_line.txt (-use_sleep -sbs 512 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp)


Hi Cliff, you need to change to

-use_sleep -sbs 512 -spike_fft_thresh 2048 -tune 1 32 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 32 -oclfft_tune_cw 32 -hp

Removing -hp is another thing you should try.
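
The two command lines are long enough that the differences are easy to miss. A small Python sketch like the one below lists only the switches whose values changed. The two strings are copied from the posts above; the helper names are made up for illustration, and the parsing simply assumes every token starting with "-" is a switch.

```python
def parse_flags(cmdline):
    """Map each switch (token starting with '-') to the values that follow it."""
    flags = {}
    current = None
    for token in cmdline.split():
        if token.startswith("-"):
            current = token
            flags[current] = []
        elif current is not None:
            flags[current].append(token)
    return flags


def changed_flags(old, new):
    """Return {switch: (old_values, new_values)} for switches that differ."""
    a, b = parse_flags(old), parse_flags(new)
    return {f: (a.get(f), b.get(f))
            for f in sorted(set(a) | set(b)) if a.get(f) != b.get(f)}


OLD = ("-use_sleep -sbs 512 -spike_fft_thresh 4096 -tune 1 64 1 4 "
       "-oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 "
       "-oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp")
NEW = ("-use_sleep -sbs 512 -spike_fft_thresh 2048 -tune 1 32 1 4 "
       "-oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 "
       "-oclfft_tune_ls 512 -oclfft_tune_bn 32 -oclfft_tune_cw 32 -hp")

for flag, (before, after) in changed_flags(OLD, NEW).items():
    print(flag, " ".join(before), "->", " ".join(after))
```

Only four switches differ: -spike_fft_thresh, -tune, -oclfft_tune_bn and -oclfft_tune_cw. Everything else, including -hp, is the same in both lines.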


With each crime and every kindness we birth our future.
ID: 1817541 · Report as offensive
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1817544 - Posted: 16 Sep 2016, 13:21:12 UTC

Hi Cliff, you need to change to

-use_sleep -sbs 512 -spike_fft_thresh 2048 -tune 1 32 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 32 -oclfft_tune_cw 32 -hp

Removing -hp is another thing you should try.


Will try the new settings including removing the -hp and will get back to you.


I don't buy computers, I build them!!
ID: 1817544 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1817548 - Posted: 16 Sep 2016, 14:22:46 UTC - in response to Message 1817544.  

Hi Cliff, you need to change to

-use_sleep -sbs 512 -spike_fft_thresh 2048 -tune 1 32 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 32 -oclfft_tune_cw 32 -hp

Removing -hp is another thing you should try.

Will try the new settings including removing the -hp and will get back to you.

Cliff, thanks for providing the host and application details.

One thing that's still perplexing me: how long ago did you deploy r3500, and did the build error message start immediately? My understanding is that the 'CL file build' process only has to happen once, the first time the application is run. Looking at my own machine, I have an r3500 BIN file dated 02 September, and an r3528 BIN file dated 11 September - no sign of any rebuilding since then. So I was wondering what might have triggered these new messages?
ID: 1817548 · Report as offensive
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1817571 - Posted: 16 Sep 2016, 16:52:33 UTC - in response to Message 1817548.  

Hi Cliff, you need to change to

-use_sleep -sbs 512 -spike_fft_thresh 2048 -tune 1 32 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 32 -oclfft_tune_cw 32 -hp

Removing -hp is another thing you should try.

Will try the new settings including removing the -hp and will get back to you.

Cliff, thanks for providing the host and application details.

One thing that's still perplexing me: how long ago did you deploy r3500, and did the build error message start immediately? My understanding is that the 'CL file build' process only has to happen once, the first time the application is run. Looking at my own machine, I have an r3500 BIN file dated 02 September, and an r3528 BIN file dated 11 September - no sign of any rebuilding since then. So I was wondering what might have triggered these new messages?



R3500 was first deployed on 15 August. It ran for a couple of weeks with the cmd_line.txt supplied above before the error first showed up, but I didn't pay that close attention to it. It happened once or twice after that, but when it showed up last night, I decided to report it. What is r3528, and does it have anything to do with CUDA, as I don't run CUDA apps?


I don't buy computers, I build them!!
ID: 1817571 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1817576 - Posted: 16 Sep 2016, 17:03:17 UTC - in response to Message 1817571.  
Last modified: 16 Sep 2016, 17:04:43 UTC

but when it showed up last night, I decided to report it

Next time, also record (copy/paste) the task name so we can search for it and look at stderr.


What is r3528

The same "thing" as r3500 but newer
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1817576 · Report as offensive
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1817585 - Posted: 16 Sep 2016, 17:31:49 UTC - in response to Message 1817576.  

but when it showed up last night, I decided to report it

Next time, also record (copy/paste) the task name so we can search for it and look at stderr.


What is r3528

The same "thing" as r3500 but newer



Can't copy/paste what isn't there, as there is no indication what task was involved. I noticed the error message in the notice tab this morning, by then the suspected task was already gone. Will r3528 come out in beta -5 or do I need to do a stand-alone install?


I don't buy computers, I build them!!
ID: 1817585 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1817592 - Posted: 16 Sep 2016, 17:49:08 UTC - in response to Message 1817585.  

Will r3528 come out in beta -5 or do I need to do a stand-alone install?

Raistmer has found something 'worth a deeper look', which suggests r3528 won't be the end of the line. So it's not worth hanging on for a full final release - I'll try and get a Beta5 out tomorrow, in the hope we can catch all the bugs in one go if we all combine forces. (A bit late to start that on a Friday night, this side of the pond)
ID: 1817592 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1817593 - Posted: 16 Sep 2016, 17:49:34 UTC - in response to Message 1817585.  

Can't copy/paste what isn't there, as there is no indication what task was involved.

You can search your stdoutdae.txt and stdoutdae.old for:
CL file build failure

This will find the lines you already posted:
09/15/2016 23:59:21 | SETI@home | task postponed 30.000000 sec:
09/15/2016 23:59:21 | SETI@home | task postponed 30.000000 sec:
09/15/2016 23:59:21 | SETI@home | Task postponed: CL file build failure

Maybe a few lines above is the "[SETI@home] Starting task ..."
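
If searching by hand is tedious, the same lookup can be sketched in a few lines of Python. This is only a rough sketch: it assumes the start line contains the text "Starting task" as suggested above, the function name is made up, and the "Starting task" line in the sample data (including its task name) is invented so the example runs on its own.

```python
from collections import deque


def tasks_before_error(lines, needle="CL file build failure",
                       marker="Starting task", lookback=50):
    """For each line containing `needle`, return it together with the most
    recent earlier line containing `marker` (None if none is in range)."""
    recent = deque(maxlen=lookback)  # sliding window of preceding lines
    hits = []
    for raw in lines:
        line = raw.rstrip("\n")
        if needle in line:
            started = [prev for prev in recent if marker in prev]
            hits.append((line, started[-1] if started else None))
        recent.append(line)
    return hits


# Sample data: the first line is hypothetical; the three "postponed" lines
# are the ones already quoted in this thread.
sample = [
    "09/15/2016 23:59:10 | SETI@home | Starting task example_task_name_0",
    "09/15/2016 23:59:21 | SETI@home | task postponed 30.000000 sec:",
    "09/15/2016 23:59:21 | SETI@home | task postponed 30.000000 sec:",
    "09/15/2016 23:59:21 | SETI@home | Task postponed: CL file build failure",
]

for error_line, start_line in tasks_before_error(sample):
    print(error_line)
    print("  most recent start:", start_line)
```

To run it on the real logs, pass an open file instead of the sample list, for example the stdoutdae.txt in your BOINC data directory (the exact path depends on your installation). With several tasks running at once, the most recent start line is only a candidate, so check its timestamp against the error.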
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1817593 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1817595 - Posted: 16 Sep 2016, 17:52:51 UTC - in response to Message 1817576.  
Last modified: 16 Sep 2016, 17:55:27 UTC

Seems like it might be Task 5157268689.

I have a utility that can retrieve all task details for a host, which can then be searched. That's the only currently listed task of his that has anything with "CL file build" in it, and the time frame looks to be about right.
ID: 1817595 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1817598 - Posted: 16 Sep 2016, 18:00:20 UTC - in response to Message 1817595.  

OpenCL-kernels filename : MultiBeam_Kernels_r3500.cl
CL file build log on device GeForce GTX 750 Ti

INFO: can't build program from binary kernels, code 0 , recompiling from source...
Error : Building Program (binary, clBuildProgram):main kernels: not OK code -6
CL file build log on device GeForce GTX 750 Ti
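
A side note on the numbers in that log. If the "code" values are raw OpenCL status codes (an assumption, the application may use its own numbering), they can be decoded with the standard values from the OpenCL headers. The helper name below is made up for illustration.

```python
import re

# Standard OpenCL status codes 0 to -12, as defined in CL/cl.h.
OPENCL_STATUS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -7: "CL_PROFILING_INFO_NOT_AVAILABLE",
    -8: "CL_MEM_COPY_OVERLAP",
    -9: "CL_IMAGE_FORMAT_MISMATCH",
    -10: "CL_IMAGE_FORMAT_NOT_SUPPORTED",
    -11: "CL_BUILD_PROGRAM_FAILURE",
    -12: "CL_MAP_FAILURE",
}


def decode_codes(log_text):
    """Return (code, name) for every 'code N' that appears in the text."""
    return [(int(m), OPENCL_STATUS.get(int(m), "unknown"))
            for m in re.findall(r"code (-?\d+)", log_text)]


# The two log lines quoted above.
log = """INFO: can't build program from binary kernels, code 0 , recompiling from source...
Error : Building Program (binary, clBuildProgram):main kernels: not OK code -6"""

for code, name in decode_codes(log):
    print(code, name)
```

Under that assumption, -6 would be CL_OUT_OF_HOST_MEMORY rather than CL_BUILD_PROGRAM_FAILURE (-11), which would point at a resource problem on the host at that moment rather than a bad kernel source.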
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1817598 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1817604 - Posted: 16 Sep 2016, 18:12:06 UTC - in response to Message 1817598.  

OpenCL-kernels filename : MultiBeam_Kernels_r3500.cl
CL file build log on device GeForce GTX 750 Ti

INFO: can't build program from binary kernels, code 0 , recompiling from source...
Error : Building Program (binary, clBuildProgram):main kernels: not OK code -6
CL file build log on device GeForce GTX 750 Ti

But followed at the next attempt by

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 FMA3 SSE4.1 SSE4.2 AVX
OpenCL-kernels filename : MultiBeam_Kernels_r3500.cl
ar=0.012665 NumCfft=117119 NumGauss=0 NumPulse=47842204544 NumTriplet=60817138848
Currently allocated 585 MB for GPU buffers
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768

and from there on, it completed and was validated - with no sign of re-compiling, so the binary file was there all along.

It would be interesting if Cliff could search the log files BilBg suggested for

blc4_2bit_guppi_57449_43932_HIP78775_0013.24448.0.18.27.37.vlar

and post the whole history, from the first attempt at running to final completion. One possible thought: since Cliff has two identical GPUs in the host, if two copies of the app tried to start at nearly the same instant, might one suffer an access problem?
ID: 1817604 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.