Errors on Cuda Units with new server build


-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1064482 - Posted: 7 Jan 2011, 23:25:59 UTC
Last modified: 8 Jan 2011, 0:16:12 UTC

Up to almost 18 WUs without error on the GPUs now. Maybe it was heat, or maybe reseating the cards did something?!

*Edit*
Now up to 20 WUs without an error on the GPUs. My machine is eating WUs right now! Wow.

*Edit*
Account is now saying I've had 37 consecutive WUs without error.
Traveling through space at ~67,000mph!
ID: 1064482
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1064402 - Posted: 7 Jan 2011, 21:07:06 UTC

Small update, gentlemen. Before ordering a new power supply, I decided to pull out the 430 watt unit I had in the closet and go dual GPU for a while. Guess what? Still errors, except from the GTS 250 with the 8800 removed! So what does that leave me with?! Stalling WUs and errors still. Grr.....

So today I decided to try something different: put the 8800 back into the box and leave the side of the case off. Maybe it's a heat issue? Granted, the times I checked the heat on the cards, one was running in the 60s and the other in the mid 60s, so I don't think that's it. However, with the side off the case it has completed 6 work units, 3 on each video card, without a single error?! So what is going on!? I'm beginning to wonder if I got some malformed WUs. Is that even possible?
Traveling through space at ~67,000mph!
ID: 1064402
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1064105 - Posted: 7 Jan 2011, 0:09:54 UTC - in response to Message 1064102.

BeNt,
You don't say which of your machines are doing this. Could it be your AMD machine? They have been known to do that. If so, a reboot will get them started again or Lunatic's opt-app cures it most times.


Sorry, I guess I should have gone into more detail. The AMD machine has been replaced with the e8400 server. As you can see from the thread, it's a continuous battle to make this thing right. I have taken the machine down and fully removed the 8800GTS from it, as maybe it's still too much strain on the PSU; going to see what happens now. Ever since I restarted the machine, minus the other card, the work units have been progressing as normal.
Traveling through space at ~67,000mph!
ID: 1064105
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1064102 - Posted: 7 Jan 2011, 0:06:02 UTC - in response to Message 1064069.
Last modified: 7 Jan 2011, 0:06:25 UTC

BeNt,
You don't say which of your machines are doing this. Could it be your AMD machine? They have been known to do that. If so, a reboot will get them started again or Lunatic's opt-app cures it most times.


PROUD MEMBER OF Team Starfire World BOINC
ID: 1064102
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1064069 - Posted: 6 Jan 2011, 22:33:52 UTC

OK guys, any clue what would cause a WU to just stop?! I've just had another one time out, and I keep having to suspend WUs and get a new one going to keep things moving. This is really starting to get aggravating.
Traveling through space at ~67,000mph!
ID: 1064069
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1063991 - Posted: 6 Jan 2011, 16:16:54 UTC - in response to Message 1063966.
Last modified: 6 Jan 2011, 16:20:08 UTC


BTW Steve, really? 10 gauge 30 amp twist locks lines from your mains, really? You don't happen to have a time machine stuffed in there somewhere do you, because you have to be producing at least 1.21 jiggawatts with that kind of power. Insane but I can't bash if I had some extra funding I would probably build a computer lab onto my house lol.


The overkill seems to be working! I can't believe it is still climbing! :D

Steve


Don't forget to build your own generator, solar panels, wind turbines, etc. ;-)
(I can use 10KW/h at home, 3-phase; 3 x 234 V; 25 A, but don't want a war with my 'landlord')
ID: 1063991
SciManStev
Volunteer tester
Joined: 20 Jun 99
Posts: 6557
Credit: 121,090,076
RAC: 0
United States
Message 1063966 - Posted: 6 Jan 2011, 14:37:32 UTC - in response to Message 1063962.
Last modified: 6 Jan 2011, 14:38:08 UTC


BTW Steve, really? 10 gauge 30 amp twist locks lines from your mains, really? You don't happen to have a time machine stuffed in there somewhere do you, because you have to be producing at least 1.21 jiggawatts with that kind of power. Insane but I can't bash if I had some extra funding I would probably build a computer lab onto my house lol.


The overkill seems to be working! I can't believe it is still climbing! :D

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1063966
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1063962 - Posted: 6 Jan 2011, 14:16:14 UTC
Last modified: 6 Jan 2011, 14:16:54 UTC

Since 9:45 UTC yesterday (3:45 local) I haven't had any errors. One thing I am noticing now, however, is that the GPU will have a work unit that just stops. If I suspend it and a different one loads, it crunches away happily. One more issue to figure out. But this could also be from the power supply, who knows.

I've decided to go after the Corsair 850TX. It seems to be the best bang for the buck right now: it's only going to run about $130, and after rebate it could be $109. So I need to get that ordered to see if it really cures my problem. Thanks for all the help getting this worked out, guys, and especially for the suggestions and comments!

BTW Steve, really? 10 gauge 30 amp twist locks lines from your mains, really? You don't happen to have a time machine stuffed in there somewhere do you, because you have to be producing at least 1.21 jiggawatts with that kind of power. Insane but I can't bash if I had some extra funding I would probably build a computer lab onto my house lol.
Traveling through space at ~67,000mph!
ID: 1063962
SciManStev
Volunteer tester
Joined: 20 Jun 99
Posts: 6557
Credit: 121,090,076
RAC: 0
United States
Message 1063855 - Posted: 6 Jan 2011, 3:46:31 UTC
Last modified: 6 Jan 2011, 3:47:48 UTC

Just for the record.

Piggy:
1250 watt BFG PSU
2200 VA APC Smart UPS
30 amp twist lock direct line from the mains, 10 gauge
Overkill.

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1063855
Tim Norton
Volunteer tester
Joined: 2 Jun 99
Posts: 835
Credit: 33,540,164
RAC: 0
United Kingdom
Message 1063852 - Posted: 6 Jan 2011, 3:38:15 UTC - in response to Message 1063851.

My vote would be the Corsair, as it's only $10 more and you get 2 years extra warranty

also it's SLI certified, so designed to run with two cards - 4 PCIe connectors

As your machine boots up now and runs with two cards, even though crunching is error prone, I suggest you are not that far below what you need to run your setup

you would be giving the system another 270 watts of capacity to work with, which is almost enough to run the cards on their own

You might get away with a 750, but as Steve suggested I would play safe and go higher than you think, to give you a margin, and the PSU will run cooler and more efficiently

Good luck with the boss ok'ing the spend :)

PS: Just as an aside you could use the 420 watt with the 580 watt with a bit of research
Tim

ID: 1063852
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1063851 - Posted: 6 Jan 2011, 3:13:16 UTC
Last modified: 6 Jan 2011, 3:33:29 UTC

Steve, I generally do the same thing on my main rig. But the server is sort of an afterthought machine, built from unused parts from previous builds as I grow into a new gaming machine etc. It's merely a place I use as a cruncher and file storage machine. Occasionally I use it for video transcoding etc., but nothing super important. Hence, if I have to buy parts for it, I like to stay as cheap as possible. Before the 580 went in, it ran off an Antec Neo 430 watt for years. When I got the 8800 it needed a bit more power than my 7800 did, so I went to the 580. My 480 machine is running on a 750 watt PC&C supply, which I love. So needless to say, I have never bought a single part for the server. But then again, it's always been a rehash of older hardware which all worked together, until now with the dual video cards.

So now the debate for me looms. Upgrade my gaming machine to a better supply and put the 750 in the server, or simply buy a 750+ PSU for the server. I don't anticipate putting any additional hardware in it any time soon, so I just want what I need at the moment, as money is also a constraint, especially when you start talking about a PSU bigger than 850 watts. I guess the bottom line is I may end up having to retire the 8800 from service (trusty as it's always been). Guess it's time to get with the better half and work out a deal. ;)

As far as what I'm looking at are these:
Corsair 850TX 80+ Silver - $129.99 5 Year warranty
Seasonic SS-850HT 80+ Silver - $119.99 3 year warranty
What I would get but probably above my price range at this time:
PC&C Silencer 950 80+ Silver - $189.99 7 year warranty

Both the Corsair and Seasonic seem to have a single 12 V (70 A) rail, which is really good, but the PC&C has an 83.4 A single rail, not to mention 88% efficiency at full load! Between the two it's hard to pick. I know bill gives the Seasonic a thumbs up; does anyone else have an opinion, or a supply they'd recommend?
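As a rough sanity check on those rail figures, the power available on a single rail is just volts times amps. The sketch below applies that to the two amperages quoted above; `rail_watts` is a made-up helper name for illustration, and it ignores the other rails and any derating.

```python
def rail_watts(amps, volts=12.0):
    """Power available on a single rail: P = V * I."""
    return amps * volts

# Amperages as quoted in the post above.
for label, amps in [("Corsair / Seasonic", 70.0), ("PC&C Silencer 950", 83.4)]:
    print(f"{label}: {rail_watts(amps):.1f} W on the 12 V rail")
```

By that measure the 70 A units can deliver almost their whole 850 W rating on the 12 V rail, which is the rail the video cards draw from.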

I generally stay in the Antec / PC Power & Cooling circle of things, but this is a budget limited fix(read not much per the boss), so going with a $200+ supply is out of the question or I would go with a much larger supply. I love my PC&C 750 and can't help but wonder if it would power the server because I can get a new one of those for $129.99.

Obviously I may be hoping beyond my means, but I wish I could run this setup off a 750 psu. All the calculators online put me either at 470 watts or ~700 watts for my setup so I'm lost on the dual card psu debate.

*Edit*
After a bit of reading I'm really leaning towards the Seasonic. I have found out that PC&C outsources the production of their units to Seasonic as their OEM. I never knew that! Apparently they design the supplies and send the build order to Seasonic. So if they trust them, I'm sure I can too.

*Edit #2*
Scratch PC&C off my list from this point on, at least their MKII line. OCZ (they own PC&C now) is outsourcing that work to Sirfa, who make pretty much all the woefully mediocre power supplies ever built. There are even reports of hand-soldered capacitors on the end of the circuit boards inside to keep 12 V ripple in check. Blast, way to ruin a good name.....
Traveling through space at ~67,000mph!
ID: 1063851
bill
Joined: 16 Jun 99
Posts: 861
Credit: 29,352,955
RAC: 0
United States
Message 1063841 - Posted: 6 Jan 2011, 2:22:18 UTC - in response to Message 1063823.

Both of these have given me good service:

http://www.newegg.com/Product/Product.aspx?Item=N82E16817116012

NZXT HALE90-850-M 850W ATX 12V v2.2, EPS 12V v2.91 80 PLUS GOLD Certified Modular Active PFC Power Supply


http://www.newegg.com/Product/Product.aspx?Item=N82E16817151100

Seasonic SS-850HT 850W ATX12V v2.31,EPS12V v2.92 80Plus Silver Certified, Active PFC Power Supply - OEM
ID: 1063841
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1063839 - Posted: 6 Jan 2011, 2:12:31 UTC - in response to Message 1063823.

My first thought was a bare minimum of 750W, and I'm not all that sure about that, so if you can go higher I would. As for the reschedule tool, I would run it at least until the card levels out. It might be good to keep on running it, as there are still quite a few of the old unmarked VLARs out there. I've got mine set for every 6 hours, but it would depend on how you have your cache set up and how fast you get to new work in line. If we have another three-plus day outage, I should still have enough in reserve that any new work sent out would be checked by the reschedule tool before I got to it.


PROUD MEMBER OF Team Starfire World BOINC
ID: 1063839
SciManStev
Volunteer tester
Joined: 20 Jun 99
Posts: 6557
Credit: 121,090,076
RAC: 0
United States
Message 1063825 - Posted: 6 Jan 2011, 1:51:22 UTC - in response to Message 1063823.

I'm thinking on minimum with both cards crunching along with the cpu I would probably need a 750 at the lowest and an 850 at the highest? As always I appreciate your input on helping me figure this out.


Think more than you need at the moment. Overkill will last longer in the long run. When I used to repair electronics, I would always replace defective components with ones stronger than the original. That way, the problems I fixed were less likely to return.

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1063825
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1063823 - Posted: 6 Jan 2011, 1:38:00 UTC

Awesome, thanks for the replies guys! Sorry it took me a while to get back, I've got a lot on the bench today. As far as testing each card, I know they are both good because the 250 has been crunching for the last few months without issue, and the 8800 is what it replaced and crunched in the past. I did however take the 8800 out this morning, and all the errors seem to have gone away.

Jravin, wow, I can't believe I missed the x2 calculation on the cards! And the 80% efficiency! It's an SLI power supply that's about 2 years old, so I know it has some capacitor aging, but it is 80 Plus Bronze certified. With everything you have brought up considered, I bet the load is outsizing my 580 watt PSU and causing issues with the second card! Now things are becoming a bit clearer.

I have taken the time to properly set up Fred's tool on my server and will re-enable the second card after a proper testing period. It seems the last errored unit was at 9:45 UTC today. I think that's about the time I took the other card offline.

As far as mismatching the cards etc. they aren't so different especially considering they are both G92 series cards. The only real difference is one has more memory and a higher clock speed, along with a die shrink so it's using less power, merely a refresh not actually a new architecture(it's a 9800GTX rebranded). But it could be the issue, I'm totally not sure at this time. I assumed with two different cards the time was calculated dependent of the other card? Is there anything besides the rescheduler fix that will fix this without needing to keep the rescheduler working? Also how often should I tell the scheduler to check everything? Right now I have it setup for every 2 hours but should it be sooner?

Thanks for all the tips, suggestions, and ideas guys I really appreciate it. I've just never had issues with my crunchers like this and the only thing I have never done before is add a second video card into the mix of things. When it comes to power supply size what do you think would be reasonable for a dual video card machine? I'm thinking on minimum with both cards crunching along with the cpu I would probably need a 750 at the lowest and an 850 at the highest? As always I appreciate your input on helping me figure this out.
Traveling through space at ~67,000mph!
ID: 1063823
soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1063768 - Posted: 5 Jan 2011, 22:10:54 UTC - in response to Message 1063750.

I am getting about one every 2-3 days out of the GT 9600. Honestly the card is most likely approaching the end of its useful life cycle, as well as the AMD.

I will try a blowout and reseat soon, or if it gets much worse. Beyond that...
Well it might be almost time to start collecting parts for my next system. I do need two computers.
Janice
ID: 1063768
SciManStev
Volunteer tester
Joined: 20 Jun 99
Posts: 6557
Credit: 121,090,076
RAC: 0
United States
Message 1063750 - Posted: 5 Jan 2011, 20:50:54 UTC - in response to Message 1063746.

@ S^S,
This may sound strange, but I used to get those errors at a rate of 2 or 3 a day. That can be a memory error, so I was reluctant to overclock my GPU's memory. Finally I just did it, and all the 1 errors went away. Yours may have a different cause, but I am mentioning it as it was just strange.

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1063750
soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1063746 - Posted: 5 Jan 2011, 20:44:40 UTC - in response to Message 1063704.

possibly unrelated, but my 9600GT is occasionally generating 0x1 errors like the following:

Stderr output
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
setiathome_CUDA: Found 1 CUDA device(s):
Device 1: GeForce 9600 GT, 499 MiB, regsPerBlock 8192
computeCap 1.1, multiProcs 8
clockRate = 1625000
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce 9600 GT is okay
SETI@home using CUDA accelerated device GeForce 9600 GT
Priority of process raised successfully
Priority of worker thread raised successfully
size 8 fft, is a freaky powerspectrum
size 16 fft, is a cufft plan
size 32 fft, is a cufft plan
size 64 fft, is a cufft plan
size 128 fft, is a cufft plan
size 256 fft, is a freaky powerspectrum
size 512 fft, is a freaky powerspectrum
size 1024 fft, is a freaky powerspectrum
size 2048 fft, is a cufft plan
size 4096 fft, is a cufft plan
size 8192 fft, is a cufft plan
size 16384 fft, is a cufft plan
size 32768 fft, is a cufft plan
size 65536 fft, is a cufft plan
size 131072 fft, is a cufft plan

) _ _ _)_ o _ _
(__ (_( ) ) (_( (_ ( (_ (
not bad for a human... _)

Multibeam x32f Preview, Cuda 3.0

Work Unit Info:
...............
WU true angle range is : 0.420956
Cuda error 'cufftExecC2C' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_fft.cu' in line 102 : unknown error.
Cuda error 'cudaAcc_GetPowerSpectrum_kernel' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_PowerSpectrum.cu' in line 56 : unknown error.
Cuda error 'cudaAcc_GetPowerSpectrum_kernel' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_PowerSpectrum.cu' in line 56 : unknown error.
Cuda error 'cudaAcc_summax32_kernel' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 147 : unknown error.
Cuda error 'cudaAcc_summax32_kernel' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 147 : unknown error.
Cuda error 'cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, cudaAcc_NumDataPoints / fftlen * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost)' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 160 : unknown error.

</stderr_txt>
]]>
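For anyone comparing a batch of errored tasks, a dump like the one above can be boiled down to a count of which source file and line each CUDA error came from. This is only a minimal sketch: the regular expression is modeled on the "Cuda error ..." lines shown in this post, and `summarize_cuda_errors` is a hypothetical helper, not part of BOINC or the SETI@home application.

```python
import re
from collections import Counter

# Pattern modeled on the "Cuda error ..." lines in the stderr dump above.
CUDA_ERROR = re.compile(
    r"Cuda error '(?P<call>.+?)' in file '(?P<path>.+?)' "
    r"in line (?P<line>\d+) : (?P<msg>.+?)\.?$"
)

def summarize_cuda_errors(stderr_text):
    """Count CUDA errors by (source file name, line number, message)."""
    counts = Counter()
    for raw in stderr_text.splitlines():
        match = CUDA_ERROR.search(raw.strip())
        if match:
            source = match.group("path").rsplit("/", 1)[-1]
            counts[(source, int(match.group("line")), match.group("msg"))] += 1
    return counts

# Three lines copied from the dump above, plus one non-error line.
sample = """\
WU true angle range is : 0.420956
Cuda error 'cufftExecC2C' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_fft.cu' in line 102 : unknown error.
Cuda error 'cudaAcc_GetPowerSpectrum_kernel' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_PowerSpectrum.cu' in line 56 : unknown error.
Cuda error 'cudaAcc_GetPowerSpectrum_kernel' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_PowerSpectrum.cu' in line 56 : unknown error.
"""

summary = summarize_cuda_errors(sample)
for (source, line, msg), n in summary.most_common():
    print(f"{n} x {source}:{line} ({msg})")
```

In this dump the first failure is the FFT call, and everything after it fails with the same "unknown error", so the first entry is usually the one worth looking at.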

Janice
ID: 1063746
Area 51
Joined: 31 Jan 04
Posts: 965
Credit: 42,193,520
RAC: 0
United Kingdom
Message 1063704 - Posted: 5 Jan 2011, 17:42:08 UTC - in response to Message 1063664.

Man, I thought this was resolved. It appears today the issue is back, but now I'm getting -177 errors with the information listing "Unhandled Exception Detected...". I tried going back to crunching on only one GPU and the problem is still there. Really getting irritated, especially considering my other machine never gave any issues. Going to start running a memtest to see if it's the RAM.


-177 errors can be fixed with Fred's Rescheduler tool. There is a checkbox that can fix them.

Steve



Yeah I've used the rescheduler for a bit now. But it isn't fixing these issues. I'm getting all kinds of errors. In the last 2 days or so I've returned 8-10 bad units. I'm still thinking it's a power supply issue but I'm not sure.

*Edit*
I apologize, Steve, it isn't checked on my server machine. Not sure whether I want to stop the mem check and start crunching again to find out.

*Update on some research*

-177 - No clue still. Normally caused by gpu trying to process cpu tasks?

1 (0x1) error (Incorrect function) - Reportedly caused by out-of-date drivers. But I'm running 260.99 from Nvidia, and I verified both cards were using it as well.

-1073741819 (0xffffffffc0000005) / Access Violation (0xc0000005) - Cannot find any information about this error. This one occurred directly after I tried OCing my 8800.

-6 (0xfffffffffffffffa) (Bad Work Unit Header) - Reportedly this is mainly caused by something on the SETI@home server side or issues during transfer. I don't think this was caused by my computer but could be wrong. I think this one also may have come after my OC attempts. Not sure though.
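The signed numbers and the hex values in that list are the same exit codes written two ways: the hex form is the two's-complement view of the signed value. A minimal sketch of the conversion follows, with `as_unsigned_hex` as an illustrative name rather than anything from BOINC. The 32-bit form of -1073741819 is 0xc0000005, which is the Windows access violation status code.

```python
def as_unsigned_hex(code, bits=32):
    """Render a signed exit code as its two's-complement hex form."""
    return hex(code & ((1 << bits) - 1))

# Exit codes mentioned in the list above, shown in 32-bit and 64-bit form.
for code in (1, -6, -177, -1073741819):
    print(code, as_unsigned_hex(code), as_unsigned_hex(code, bits=64))
```

Searching for the 32-bit hex form tends to turn up more than searching for the long negative number.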

Seems most of my errored work units have been the -177 and 1(0x1) items. This is really smelling of an under powered set of video cards to me.

e8400 @ 3Ghz (no oc)
4GB Mushkin Blackline DDr2 800
GTS 250 1GB
8800 GTS 640MB
Assorted hard disks x 4
580 watt psu.

I don't think the PSU is enough still, but like I said I want to try any available routes before I have to spend $100+ for a 750-850 watt. And if you were buying a power supply for this machine, what size would you shoot for? I'm thinking a 750 would be enough, but I've never run dual cards. I'm figuring, on a rough estimate, ~200 watts for motherboard, RAM, processor and optical drive. About 170 each on the video cards and 80 watts or less for the drives. That would put me at about 450. Like I said I've got a 580 supply, but I'm thinking under load it may be spiking too high for the PSU and causing one of the cards to error out.



Thermaltake has a PSU sizing tool on their website:

http://www.thermaltake.outervision.com/

Never used it before, but it may be of some use to you......
ID: 1063704
Cruncher-American
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1063687 - Posted: 5 Jan 2011, 16:43:15 UTC - in response to Message 1063664.

I'm figuring, on a rough estimate, ~200 watts for motherboard, RAM, processor and optical drive. About 170 each on the video cards and 80 watts or less for the drives. That would put me at about 450. Like I said I've got a 580 supply, but I'm thinking under load it may be spiking too high for the PSU and causing one of the cards to error out.


200 + 2 * 170 + 80 = 620, not 450.

So it looks like you ARE overpowering your PSU.

And even if 450 were correct, given that the PSU is < 80% efficient, it would draw at least 450/.80 = 562.5 watts, so YES you need a bigger PSU.
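The arithmetic above can be written out as a short sketch. The wattages are the rough estimates quoted from the earlier post, the 80% efficiency figure is the one used in this reply, and the function names are made up for illustration.

```python
def total_load_watts(base, per_gpu, gpu_count, drives):
    """Sum the estimated draw of each part of the system."""
    return base + per_gpu * gpu_count + drives

def draw_at_efficiency(load_watts, efficiency):
    """Scale a load by PSU efficiency, as done in the reply above."""
    return load_watts / efficiency

# Estimates from the quoted post: 200 W base, 170 W per card, 80 W of drives.
load = total_load_watts(base=200, per_gpu=170, gpu_count=2, drives=80)
print(f"Estimated load: {load} W against a 580 W supply")
print(f"450 W at 80% efficiency: {draw_at_efficiency(450, 0.80):.1f} W")
```

The original estimate of 450 only adds up if one video card is counted; with both cards the total is 620.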

ID: 1063687
©2020 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.