Message boards :
Number crunching :
Errors on Cuda Units with new server build
| Author | Message |
|---|---|
|
-BeNt- Joined: 17 Oct 99 Posts: 1234 Credit: 10,116,112 RAC: 0
|
Up to almost 18 WUs without error on the GPUs now. Heat or maybe a reseat on the cards did something?! *Edit* Now up to 20 WUs without an error on the GPUs. My machine is eating WUs right now! Wow. *Edit* Account is now saying I've had 37 consecutive WUs without error. Traveling through space at ~67,000mph! |
|
-BeNt- Joined: 17 Oct 99 Posts: 1234 Credit: 10,116,112 RAC: 0
|
Small update, gentlemen. Before ordering a new power supply, I decided to pull out the 430 watt I had in the closet and go dual GPU for a while. Guess what? Still errors except from the GTS 250 with the 8800 removed! So what's that leave me with?! Stalling WUs and errors still. Grr..... So I decided today to try something different: put the 8800 back into the box and leave the side of the case off. Maybe it's a heat issue? Granted, the times I checked the heat on the cards, one was running in the 60's and the other in the mid 60's, so I don't think it is that. However, with the side off the case it has completed 6 work units, 3 on each video card, without a single error?! So what is going on!? Beginning to wonder if I got some malformed WUs now. Is that even possible? Traveling through space at ~67,000mph! |
|
-BeNt- Joined: 17 Oct 99 Posts: 1234 Credit: 10,116,112 RAC: 0
|
BeNt, Sorry, guess I should have gone into more detail. The AMD machine has been replaced with the e8400 server. As you can see by the thread, it's a continuous battle to make this thing right. I have taken the machine down and fully removed the 8800GTS from it, as maybe it's still too much strain on the PSU. Going to see what happens now. Ever since I've restarted the machine, minus the other card, the workunits are progressing as normal. Traveling through space at ~67,000mph! |
perryjay Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0
|
BeNt, You don't say which of your machines are doing this. Could it be your AMD machine? They have been known to do that. If so, a reboot will get them started again or Lunatic's opt-app cures it most times. PROUD MEMBER OF Team Starfire World BOINC |
|
-BeNt- Joined: 17 Oct 99 Posts: 1234 Credit: 10,116,112 RAC: 0
|
Ok guys any clue what would cause a WU to just stop?! I've just had another one time out and I keep having to suspend WU's and get a new one going to keep it moving. This is really starting to get aggravating. Traveling through space at ~67,000mph! |
Fred J. Verster Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0
|
Don't forget to build your own generator, solar panels, wind turbines, etc. ;-) (I can use 10KW/h at home, 3-phase; 3x 234V; 25A, but don't want a war with my 'landlord')
|
SciManStev Joined: 20 Jun 99 Posts: 6557 Credit: 121,090,076 RAC: 0
|
The overkill seems to be working! I can't believe it is still climbing! :D Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website |
|
-BeNt- Joined: 17 Oct 99 Posts: 1234 Credit: 10,116,112 RAC: 0
|
Since 9:45 UTC yesterday (3:45 local) I haven't had any errors. One thing I am noticing now, however, is that the GPU will have a work unit that just stops. If I suspend it and it loads a different one in, it crunches away happily. One more issue to figure out. But this could also be from the power supply, who knows. I've decided to go after the Corsair 850TX. It seems to be the best bang for the buck right now: it's only going to run about ~$130, and after rebate it could be $109. So I need to get that ordered to see if it really cures my problem. Thanks for all the help getting this worked out, guys, and especially for the suggestions and comments! BTW Steve, really? 10 gauge 30 amp twist lock lines from your mains, really? You don't happen to have a time machine stuffed in there somewhere, do you? Because you have to be producing at least 1.21 jiggawatts with that kind of power. Insane, but I can't bash. If I had some extra funding I would probably build a computer lab onto my house lol. Traveling through space at ~67,000mph! |
SciManStev Joined: 20 Jun 99 Posts: 6557 Credit: 121,090,076 RAC: 0
|
Just for the record. Piggy: 1250 watt BFG PSU; 2200 VA APC Smart UPS; 30 amp twist lock direct line from the mains, 10 gauge. Overkill. Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website |
Tim Norton Joined: 2 Jun 99 Posts: 835 Credit: 33,540,164 RAC: 0
|
My vote would be the Corsair, as it's only $10 more and you get 2 years extra warranty. It's also SLI certified, so designed to run with two cards (4 PCIe connectors). As your machine boots up now and runs with two cards, even though crunching is error prone, I suggest you are not that far below what you need to run your setup. You would be giving the system another 270 watts of capacity to work with, which is almost enough to run the cards on their own. You might get away with a 750, but as Steve suggested I would play safe and go higher than you think, to give you a margin, and the PSU will run cooler and more efficiently. Good luck with the boss ok'ing the spend :) PS: Just as an aside, you could use the 420 watt with the 580 watt with a bit of research. Tim
|
|
-BeNt- Joined: 17 Oct 99 Posts: 1234 Credit: 10,116,112 RAC: 0
|
Steve, I generally do the same thing on my main rig. But the server is sort of an afterthought machine, comprised of unused parts from previous builds as I grow into a new gaming machine etc. It's merely a place I use as a cruncher and file storage machine. Occasionally I use it for video transcoding etc., but nothing super important. Hence if I have to buy parts for it I like to stay as cheap as possible. Before the 580 went in it was running off an Antec Neo 430 watt for years. When I got the 8800 it needed a bit more power than my 7800 needed, so I went to the 580 etc. My 480 machine is running on a 750 watt PC&C supply, which I love. So needless to say I have never bought a single part for the server. But then again it's always been a rehash of older hardware which all worked together, until now with the dual video cards. So now the debate for me looms: upgrade my gaming machine to a better supply and put the 750 in the server, or simply buy a 750+ PSU for the server. I don't anticipate putting any additional hardware in it any time soon, so I just want what I need at the moment, as money is also a constraint, especially when you start talking about a PSU bigger than 850 watts. I guess bottom line is I may end up having to retire the 8800 from service (trusty as it's always been). Guess it's time to get with the better half and work out a deal. ;) As far as what I'm looking at, these are: Corsair 850TX, 80+ Silver, $129.99, 5 year warranty; Seasonic SS-850HT, 80+ Silver, $119.99, 3 year warranty. What I would get, but probably above my price range at this time: PC&C Silencer 950, 80+ Silver, $189.99, 7 year warranty. Both the Corsair and Seasonic seem to have a single 12V (70A) rail, which is really good, but the PC&C has an 83.4A single rail, not to mention 88% efficiency at full load! Between the two it's hard to pick. I know bill gives the Seasonic a thumbs up. Does anyone else have an opinion, or a supply they have an opinion on?
I generally stay in the Antec / PC Power & Cooling circle of things, but this is a budget limited fix (read: not much per the boss), so going with a $200+ supply is out of the question, or I would go with a much larger supply. I love my PC&C 750 and can't help but wonder if it would power the server, because I can get a new one of those for $129.99. Obviously I may be hoping beyond my means, but I wish I could run this setup off a 750 PSU. All the calculators online put me either at 470 watts or ~700 watts for my setup, so I'm lost on the dual card PSU debate. *Edit* After a bit of reading I'm really leaning towards the Seasonic. I have found out that PC&C outsource the production of their units to Seasonic as their OEM. I never knew that! Apparently they design the supplies and send the build order to them. So if they trust them, I'm sure I possibly can too. *Edit #2* Scratch PC&C off my plate from this point on, at least their MKII line. OCZ (they own PC&C now) is outsourcing that work to Sirfa, who makes all the woefully mediocre power supplies, pretty much ever built. There are even reports of hand soldered capacitors on the end of the circuit boards inside to keep 12V ripples in check. Blast, way to ruin a good name..... Traveling through space at ~67,000mph! |
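To put the three candidates above side by side, here is a minimal Python sketch. It only uses the prices, warranty lengths and 12V rail currents quoted in the post; the `Psu` class and its method names are hypothetical, made up for this illustration. The 12V rail capacity is simply volts times amps, and "price per warranty year" is just one rough way to weigh cost against warranty, not a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Psu:
    """One power supply candidate, using figures quoted in the thread."""
    name: str
    price_usd: float
    warranty_years: int
    rail_12v_amps: float

    def rail_12v_watts(self) -> float:
        # Power available on the 12 V rail: P = V * I
        return 12.0 * self.rail_12v_amps

    def price_per_warranty_year(self) -> float:
        # Rough cost-versus-warranty figure (an illustrative metric only)
        return self.price_usd / self.warranty_years


candidates = [
    Psu("Corsair 850TX", 129.99, 5, 70.0),
    Psu("Seasonic SS-850HT", 119.99, 3, 70.0),
    Psu("PC&C Silencer 950", 189.99, 7, 83.4),
]

for psu in candidates:
    print(
        f"{psu.name}: {psu.rail_12v_watts():.1f} W on the 12 V rail, "
        f"${psu.price_per_warranty_year():.2f} per warranty year"
    )
```

On these numbers the two 850 W units offer the same 12V capacity (70 A works out to 840 W), so the choice between them comes down to price and warranty.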
|
bill Joined: 16 Jun 99 Posts: 861 Credit: 29,352,955 RAC: 0
|
Both of these have given me good service: http://www.newegg.com/Product/Product.aspx?Item=N82E16817116012 NZXT HALE90-850-M 850W ATX 12V v2.2, EPS 12V v2.91 80 PLUS GOLD Certified Modular Active PFC Power Supply http://www.newegg.com/Product/Product.aspx?Item=N82E16817151100 Seasonic SS-850HT 850W ATX12V v2.31,EPS12V v2.92 80Plus Silver Certified, Active PFC Power Supply - OEM |
perryjay Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0
|
My first thought was a bare minimum of 750W and I'm not all that sure so if you can go with higher I would. As for the reschedule tool, I would run it at least until the card levels out. It might be good to keep on running it as there are still quite a few of the old unmarked VLARs out there. I've got mine set for every 6 hours but it would depend on how you have your cache set up and how fast you get to new work in line. If we have another three plus day outage I should still have enough in reserve that any new work sent out would be checked by the reschedule tool before I got to them. PROUD MEMBER OF Team Starfire World BOINC |
SciManStev Joined: 20 Jun 99 Posts: 6557 Credit: 121,090,076 RAC: 0
|
"I'm thinking on minimum with both cards crunching along with the cpu I would probably need a 750 at the lowest and an 850 at the highest? As always I appreciate your input on helping me figure this out." Think more than you need at the moment. Overkill will last longer in the long run. When I used to repair electronics, I would always replace defective components with ones stronger than the original. That way, the problems I fixed were less likely to return. Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website |
|
-BeNt- Joined: 17 Oct 99 Posts: 1234 Credit: 10,116,112 RAC: 0
|
Awesome, thanks for the replies guys! Sorry it took me a while to get back, I've got a lot on the bench today. As far as testing each card, I know they are both good because the 250 has been crunching for the last few months without issue, and the 8800 is what it replaced and crunched in the past. I did however take the 8800 out this morning and all the errors seem to have gone away. Jravin, wow, I can't believe I missed the x2 calculation on the cards! And the 80% efficiency! It's an SLI power supply that's about 2 years old, so I know it has some capacitor aging, but it is 80+ Bronze certified. With everything you have brought up considered, I bet under load it is outsizing my 580 watt PSU and causing issues with the second card! Now things are starting to become a bit clearer. I have taken the time to properly set up Fred's tool on my server and will re-enable the second card after I have a proper testing period. Seems the last errored unit was at 9:45 UTC today. I think that's about the time I took the other card offline. As far as mismatching the cards etc., they aren't so different, especially considering they are both G92 series cards. The only real difference is one has more memory and a higher clock speed, along with a die shrink so it's using less power: merely a refresh, not actually a new architecture (it's a 9800GTX rebranded). But it could be the issue, I'm totally not sure at this time. I assumed with two different cards the time was calculated dependent of the other card? Is there anything besides the rescheduler fix that will fix this without needing to keep the rescheduler working? Also, how often should I tell the scheduler to check everything? Right now I have it set up for every 2 hours, but should it be sooner? Thanks for all the tips, suggestions, and ideas guys, I really appreciate it. I've just never had issues with my crunchers like this, and the only thing I have never done before is add a second video card into the mix of things. 
When it comes to power supply size what do you think would be reasonable for a dual video card machine? I'm thinking on minimum with both cards crunching along with the cpu I would probably need a 750 at the lowest and an 850 at the highest? As always I appreciate your input on helping me figure this out. Traveling through space at ~67,000mph! |
soft^spirit Joined: 18 May 99 Posts: 6497 Credit: 34,134,168 RAC: 0
|
I am getting about one every 2-3 days out of the GT 9600. Honestly the card is most likely approaching the end of its useful life cycle, as well as the AMD. I will try a blowout and reseat soon, or if it gets much worse. Beyond that... Well it might be almost time to start collecting parts for my next system. I do need two computers. Janice |
SciManStev Joined: 20 Jun 99 Posts: 6557 Credit: 121,090,076 RAC: 0
|
@ S^S, This may sound strange, but I used to get those errors at a rate of 2 or 3 a day. That can be a memory error, so I was reluctant to overclock my GPU's memory. Finally I just did it, and all the 1 errors went away. Yours may have a different cause, but I am mentioning it as it was just strange. Steve Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. GPUUG Website |
soft^spirit Joined: 18 May 99 Posts: 6497 Credit: 34,134,168 RAC: 0
|
possibly unrelated, but my 9600GT is occasionally generating 0x1 errors like the following:
Stderr output
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
setiathome_CUDA: Found 1 CUDA device(s):
Device 1: GeForce 9600 GT, 499 MiB, regsPerBlock 8192 computeCap 1.1, multiProcs 8 clockRate = 1625000
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce 9600 GT is okay
SETI@home using CUDA accelerated device GeForce 9600 GT
Priority of process raised successfully
Priority of worker thread raised successfully
size 8 fft, is a freaky powerspectrum
size 16 fft, is a cufft plan
size 32 fft, is a cufft plan
size 64 fft, is a cufft plan
size 128 fft, is a cufft plan
size 256 fft, is a freaky powerspectrum
size 512 fft, is a freaky powerspectrum
size 1024 fft, is a freaky powerspectrum
size 2048 fft, is a cufft plan
size 4096 fft, is a cufft plan
size 8192 fft, is a cufft plan
size 16384 fft, is a cufft plan
size 32768 fft, is a cufft plan
size 65536 fft, is a cufft plan
size 131072 fft, is a cufft plan
) _ _ _)_ o _ _ (__ (_( ) ) (_( (_ ( (_ ( not bad for a human... _)
Multibeam x32f Preview, Cuda 3.0
Work Unit Info: ...............
WU true angle range is : 0.420956
Cuda error 'cufftExecC2C' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_fft.cu' in line 102 : unknown error.
Cuda error 'cudaAcc_GetPowerSpectrum_kernel' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_PowerSpectrum.cu' in line 56 : unknown error.
Cuda error 'cudaAcc_GetPowerSpectrum_kernel' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_PowerSpectrum.cu' in line 56 : unknown error.
Cuda error 'cudaAcc_summax32_kernel' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 147 : unknown error.
Cuda error 'cudaAcc_summax32_kernel' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 147 : unknown error.
Cuda error 'cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, cudaAcc_NumDataPoints / fftlen * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost)' in file 'd:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 160 : unknown error.
</stderr_txt>
]]>
Janice |
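When a task page contains a long run of these "Cuda error" lines, it can help to tally them by source file and line number to see where the failure cascade starts. The following is a rough Python sketch, assuming the stderr text uses exactly the line format shown in the post above; the regular expression and the `summarize_cuda_errors` helper are made up for this illustration and are not part of BOINC or the SETI@home application.

```python
import re
from collections import Counter

# Matches lines of the form seen in the post:
#   Cuda error '<call>' in file '<path>' in line <n> : <message>.
CUDA_ERROR = re.compile(
    r"Cuda error '(?P<call>.+?)' in file '(?P<file>[^']+)' "
    r"in line (?P<line>\d+) : (?P<msg>[^.]+)\."
)


def summarize_cuda_errors(stderr_text: str) -> Counter:
    """Count CUDA errors keyed by (source file name, line number, message)."""
    counts = Counter()
    for match in CUDA_ERROR.finditer(stderr_text):
        source = match.group("file").rsplit("/", 1)[-1]  # drop the directory
        counts[(source, int(match.group("line")), match.group("msg"))] += 1
    return counts


# Sample input rebuilt from the error lines quoted in the post.
SRC = "d:/[Projects]/Berkeley/seti_cuda/seti_boinc/client/cuda/"
sample = " ".join([
    f"Cuda error 'cufftExecC2C' in file '{SRC}cudaAcc_fft.cu' in line 102 : unknown error.",
    f"Cuda error 'cudaAcc_GetPowerSpectrum_kernel' in file '{SRC}cudaAcc_PowerSpectrum.cu' in line 56 : unknown error.",
    f"Cuda error 'cudaAcc_GetPowerSpectrum_kernel' in file '{SRC}cudaAcc_PowerSpectrum.cu' in line 56 : unknown error.",
    f"Cuda error 'cudaAcc_summax32_kernel' in file '{SRC}cudaAcc_summax.cu' in line 147 : unknown error.",
    f"Cuda error 'cudaAcc_summax32_kernel' in file '{SRC}cudaAcc_summax.cu' in line 147 : unknown error.",
    f"Cuda error 'cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, "
    f"cudaAcc_NumDataPoints / fftlen * sizeof(*dev_PowerSpectrumSumMax), "
    f"cudaMemcpyDeviceToHost)' in file '{SRC}cudaAcc_summax.cu' in line 160 : unknown error.",
])

for (source, line, msg), n in summarize_cuda_errors(sample).items():
    print(f"{source}:{line} {msg} x{n}")
```

In this log the first failure is the `cufftExecC2C` call; the later kernel and `cudaMemcpy` errors follow it, so the tally mostly shows how far the failure propagated.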
Area 51 Joined: 31 Jan 04 Posts: 965 Credit: 42,193,520 RAC: 0
|
"Man, thought this was resolved. It appears today the issue is back, but now I'm getting -177 errors with the information listing 'Unhandled Exception Detected...'. I tried going back to crunching on only one GPU and the problem is still there. Really getting irritated, especially considering my other machine never gave any issues. Going to start running a memtest to see if it's the RAM." Thermaltake have a PSU sizing tool on their website: http://www.thermaltake.outervision.com/ Never used it before, but it may be of some use to you......
|
Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340
|
"I'm figuring, on a rough estimate, ~200 watts for motherboard, RAM, processor and optical drive. About 170 each on the video cards and 80 watts or less for the drives. That would put me at about 450. Like I said I've got a 580 supply, but I'm thinking under load it may be spiking too high for the PSU and causing one of the cards to error out." 200 + 2 * 170 + 80 = 620, not 450. So it looks like you ARE overpowering your PSU. And even if 450 were correct, given that the PSU is < 80% efficient, it would draw at least 450/.80 = 562.5 watts, so YES you need a bigger PSU. |
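The arithmetic in this reply can be checked with a few lines of Python. This is only a sketch: the component wattages are the rough estimates from the quoted post, and the function names are hypothetical. One assumption worth labeling: PSU wattage ratings normally describe DC output, so the headroom check below compares the rating against the component load, while the efficiency figure is used only to estimate draw at the wall, as in the reply.

```python
def dc_load_watts(base_w: float, gpu_w: float, gpu_count: int, drives_w: float) -> float:
    """Total component load: base system + all video cards + drives."""
    return base_w + gpu_w * gpu_count + drives_w


def headroom_watts(psu_rating_w: float, load_w: float) -> float:
    """Spare capacity; a negative value means the estimate exceeds the rating."""
    return psu_rating_w - load_w


def wall_draw_watts(load_w: float, efficiency: float) -> float:
    """Power pulled from the outlet for a given load and PSU efficiency."""
    return load_w / efficiency


# Estimates from the thread: ~200 W base, ~170 W per card, ~80 W for drives.
load = dc_load_watts(base_w=200, gpu_w=170, gpu_count=2, drives_w=80)

print(f"Estimated load:       {load} W")
print(f"Headroom on 580 W:    {headroom_watts(580, load)} W")
print(f"Headroom on 850 W:    {headroom_watts(850, load)} W")
print(f"Wall draw, 450 W @80%: {wall_draw_watts(450, 0.80):.1f} W")
```

With both cards counted, the estimate comes to 620 W, which is 40 W over the 580 W rating, while an 850 W unit would leave 230 W of margin on the same estimate.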
©2020 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.