Panic Mode On (102) Server Problems?

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1782408 - Posted: 25 Apr 2016, 16:27:43 UTC I'm about 600 tasks short of a full boat here myself at the moment. Kitties are not getting their proper ration of kibbles. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1782408 ·

HAL9000 Volunteer tester Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57	Message 1782409 - Posted: 25 Apr 2016, 16:33:58 UTC - in response to Message 1782407. Last modified: 25 Apr 2016, 16:34:40 UTC It appears that the boys in the lab have done something to prevent downloads of (Nvidia) GPU WUs. If not all, then at least the GB ones and there are none available from Arecibo. Downloads have been working normal for me so I do not think I have done anything to prevent them from downloading. I have currently 100 CPU WUs in progress but only 80 GPU WUs despite asking for them. Edit: Guess I called a "no hitter". Just received a couple of guppi GPU WUs. Not sure why they do not seem to be available in quantity though. As we know, VLAR tasks are not sent to GPUs. GBT data is expected to be mostly VLARs. If we are also on data sets from Arecibo that generate mostly VLARs, then there will be few tasks for GPUs. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url] ID: 1782409 ·

Zombu2 Volunteer tester Send message Joined: 24 Feb 01 Posts: 1615 Credit: 49,315,423 RAC: 0	Message 1782431 - Posted: 25 Apr 2016, 18:58:31 UTC - in response to Message 1782396. Last modified: 25 Apr 2016, 19:00:15 UTC First thing I would do with that GPU is set it back to the Nvidia defaults: GTX 780 GPU Engine Specs: 2304CUDA Cores 863Base Clock (MHz) 900Boost Clock (MHz) Instead of what you've got it set to: GPU current clockRate = 1201 MHz the card is at factory default i got no idea why nobody reads it i have said it many times IT IS AT DEFAULT I came down with a bad case of i don't give a crap ID: 1782431 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22186 Credit: 416,307,556 RAC: 380	Message 1782433 - Posted: 25 Apr 2016, 19:05:44 UTC ...because you keep on having the same problem :-( The other thing you might consider is evicting the dust bunnies - they breed when we aren't looking. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1782433 ·

Zombu2 Volunteer tester Send message Joined: 24 Feb 01 Posts: 1615 Credit: 49,315,423 RAC: 0	Message 1782452 - Posted: 25 Apr 2016, 20:08:01 UTC - in response to Message 1782433. ...because you keep on having the same problem :-( The other thing you might consider is evicting the dust bunnies - they breed when we aren't looking. yeah dust bunnys do sneak in but this machine is blown out on a weekly basis so are all the other machines i have been running the msi burnin test now for 6 hours and no artifacts driver crashes or anything else for that matter card is running at a nice 60C I came down with a bad case of i don't give a crap ID: 1782452 ·

Jord Volunteer tester Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3	Message 1782472 - Posted: 25 Apr 2016, 21:28:11 UTC So... when you combine 4 of these guppi's, do you get a gouldfish then? ;-) ID: 1782472 ·

Zombu2 Volunteer tester Send message Joined: 24 Feb 01 Posts: 1615 Credit: 49,315,423 RAC: 0	Message 1782494 - Posted: 25 Apr 2016, 23:06:19 UTC Maybe a dead one with 3 eyes I came down with a bad case of i don't give a crap ID: 1782494 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1782526 - Posted: 26 Apr 2016, 2:00:38 UTC - in response to Message 1782234. All i can tell you is that none of the cards are overclocked they all on default Something's odd then. The defaults for the reference GTX780 are Base clock: 863MHz Boost clock: 900 MHz The EVGA 780 SC being a factory overclock are Base Clock: 967 MHZ Boost Clock: 1020 MHz So something has boosted your clock speed if it is running at 1201 MHz PrecisionX settings? PCIe bus overclock? EDIT- Beaten by TBar. What does GPUz have to say? maybe the seti client has wrong info on it or read it wrong dunno I think it's really hard to know for sure what those core clock readings really mean. One of the cards on my T7400 is an NVIDIA reference GTX780. It runs at default settings, no overclocking. Looking at the Graphics Card tab in GPU-Z the GPU Clock = 863 MHz and the Boost Clock = 902 MHz. However, looking at the Sensors tab, the GPU Core Clock speed is shown as 1019 MHz. That's the same as what shows up in the Stderr for a Cuda50 task. Precision X and Open Hardware Monitor also report the 1019 MHz value. I remember an exchange with Jason a couple of months ago that touched on this sort of discrepancy, and I don't think any conclusions were reached. Oh, and my GTX780 doesn't appear to be throwing any errors at all, even on MESSIER031 tasks, so I would think you do have a hardware problem to deal with. ID: 1782526 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1782527 - Posted: 26 Apr 2016, 2:21:20 UTC - in response to Message 1782526. Last modified: 26 Apr 2016, 2:37:00 UTC I think the suspicions of something triggering excessive boost clock (either some software issue, firmware/hardware, or manual setting in Precision etc.) are likely. A >20% overclock on those isn't out of the realm of possibility (factory, auto boost or not), though you'd be expecting watercooling, beefed up power, and increased GPU voltages to achieve it, rather than an out of the box boost clock to do it with fan cooling. The application reading is taken right in the middle of a computationally intense portion of code that is sufficient to trigger the (normal) boost functionality. That uses the same API as available to GPUz and Precision, so they should read the same peak (unless there is something wrong). The eventual boost value is determined by a complex array of GPU internal sensors (I think about 23 different metrics IIRC), and a curve in the firmware set by the manufacturer, with overrides by the monitoring/OC software (like Precision and others). Aside from the possible software/settings issues, it's possible the particular GPU model, being factory OC'd, has an aggressive/optimistic boost curve, one or more sensors is dicey, or some other element of the GPU is weak. It'd be impossible to know which, if any of these, would be to blame, other than just manually reducing the clock offset so that boost drops to factory or even reference card spec. If results/normal operation come good, then you can just say 'it was software, GPU manufacturer, or something else, but works now'. It is [very] unlikely the MHz reading is incorrect, so getting that frequency to drop to normal levels will be the thing to prove something has been forcing the clock inappropriately high (assuming it comes good). "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1782527 ·
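
For anyone wanting to put numbers on the overclock being discussed, here is a rough sketch (not from any poster) that compares the 1201 MHz reading against the GTX 780 clocks quoted earlier in the thread. The function name `overclock_percent` and the dictionary layout are made up for illustration; only the MHz figures come from the posts.

```python
def overclock_percent(observed_mhz, spec_mhz):
    """Percentage by which an observed clock exceeds a spec clock."""
    return (observed_mhz - spec_mhz) / spec_mhz * 100.0


# Figures quoted in this thread for the GTX 780 (MHz).
OBSERVED = 1201
SPECS = {
    "reference base": 863,
    "reference boost": 900,
    "EVGA SC base": 967,
    "EVGA SC boost": 1020,
}

for name, mhz in SPECS.items():
    pct = overclock_percent(OBSERVED, mhz)
    print(f"{name:16s} {mhz:5d} MHz -> +{pct:.1f}%")
```

Which baseline is the right one to compare against is a judgment call; the sketch just prints all four so the spread is visible.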

Brent Norman Volunteer tester Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835	Message 1782529 - Posted: 26 Apr 2016, 2:49:52 UTC My 750Ti's are factory clocked at 1228, someone told me they should be at 1200. It never runs invalids though. My problem is, when I run AP tasks (and using the computer) it frequently crashes :( If I'm not using the computer it runs all night with no problems. It is on my to do list to try down clocking during the next AP run. ID: 1782529 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1782530 - Posted: 26 Apr 2016, 3:01:55 UTC - in response to Message 1782529. Last modified: 26 Apr 2016, 3:08:04 UTC My 750Ti's are factory clocked at 1228, someone told me they should be at 1200. It never runs invalids though. My problem is, when I run AP tasks (and using the computer) it frequently crashes :( If I'm not using the computer it runs all night with no problems. It is on my to do list to try down clocking during the next AP run. Could still possibly be a similar situation: an aggressive/optimistic factory boost curve. The reasons factories do this relate to marketing/competition for what is sold as a gaming card, high volume in the 750ti case. (HPC Tesla compute devices are built, binned and specced differently, with no vendor tweaks allowed; IIRC still only by nVidia themselves.) Under stress (such as when using the host while crunching) every GPU+host will behave a bit differently. Assuming available application settings have already been explored (reducing pressure), one possible mechanism for AP or MB crashes under contention with the user/display could be GPU memory or PCIe saturation, inducing driver crashes through OS timeouts in the latter case, or excessive memory errors in the first case. If reducing the GPU memory+core boost offsets doesn't help, then sometimes a small voltage bump can be all that's needed (if temps/cooling and power allow). "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1782530 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 1782536 - Posted: 26 Apr 2016, 3:20:25 UTC - in response to Message 1782529. Last modified: 26 Apr 2016, 3:47:19 UTC My 750Ti's are factory clocked at 1228, someone told me they should be at 1200. It never runs invalids though. My problem is, when I run AP tasks (and using the computer) it frequently crashes :( If I'm not using the computer it runs all night with no problems. It is on my to do list to try down clocking during the next AP run. All depends on whose board and which flavor of 750ti. I have 4 machines, each with 2x EVGA 750ti. 6 are SCs, 2 are FTWs.
Machine  GPU      Max clock
1        SC(0)    1320
1        SC(1)    1333
2        FTW(0)   1360
2        SC(1)    1306
3        FTW(0)   1345
3        SC(1)    1320
4        SC(0)    1333
4        SC(1)    1333
Shrugs. Seems like these tend to be all over the map. Trying to tweak around with them seems pointless, even with the latest Precision x-16. Flaky stuff. But I will note I have not had any issues I can point back to a GPU card or clocking. BTW, the specs are here for the SC and FTW boards. The times I had issues with freezes and restarts, it was generally giving the GPUs more CPU to work with ... ID: 1782536 ·
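
To see how far "all over the map" actually is, the max clocks listed in the post above can be grouped by board flavor. This is only an illustrative sketch: the `READINGS` list is transcribed from the post, while `summarize` and the grouping by label prefix are assumptions made here, not anything the poster ran.

```python
from statistics import mean

# (machine, GPU label, max clock in MHz) as listed in the post above.
READINGS = [
    (1, "SC(0)", 1320), (1, "SC(1)", 1333),
    (2, "FTW(0)", 1360), (2, "SC(1)", 1306),
    (3, "FTW(0)", 1345), (3, "SC(1)", 1320),
    (4, "SC(0)", 1333), (4, "SC(1)", 1333),
]


def summarize(readings):
    """Group max clocks by board flavor (the label without its slot index)."""
    by_model = {}
    for _machine, label, mhz in readings:
        model = label.split("(")[0]
        by_model.setdefault(model, []).append(mhz)
    return {
        model: {"min": min(v), "max": max(v), "mean": round(mean(v), 1)}
        for model, v in by_model.items()
    }


for model, stats in summarize(READINGS).items():
    print(model, stats)
```

With only two FTW cards in the sample, the per-flavor averages are not much more than a curiosity.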

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1782539 - Posted: 26 Apr 2016, 3:42:51 UTC - in response to Message 1782536. Those clocks suggest the previous example 750ti isn't clocked all that high compared to many. The freeing of CPUs (effectively reducing contention) seems logical, as does looking at settings for the crashing AP app. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1782539 ·

Zombu2 Volunteer tester Send message Joined: 24 Feb 01 Posts: 1615 Credit: 49,315,423 RAC: 0	Message 1782547 - Posted: 26 Apr 2016, 4:28:53 UTC well the card worked great up until v8 got into my queue and that's when the whole shebang started so i'm more inclined to blame either boinc or the lunatics app I came down with a bad case of i don't give a crap ID: 1782547 ·

Brent Norman Volunteer tester Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835	Message 1782553 - Posted: 26 Apr 2016, 4:56:28 UTC - in response to Message 1782539. Last modified: 26 Apr 2016, 4:59:53 UTC @Jason Yea, I have tried shutting down cores and only running 1 AP task, and removing/changing the command line, without success. It only seems to freeze/lock up when I move the mouse. I did finally get my iGPU working (it was a stubborn bugger) so I want to try using that as my main display, and turn off the 750. That should relieve some strain. If that doesn't work then it's downclocking time, or maybe power as you suggested. It would be nice to have a reliable supply of APs for testing. It's just frustrating that it works great for the few hours I'm here, next time it crashes, then I forget what I changed before I ever see another task. So in the meantime, I have been trying to let them kick in at night. ID: 1782553 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 1782555 - Posted: 26 Apr 2016, 5:00:21 UTC - in response to Message 1782547. Last modified: 26 Apr 2016, 5:26:08 UTC I think there's no argument with the fact that V8 is more demanding of resources than V7 was. I know I had to dial it back a bit on my weakest machine. ID: 1782555 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1782557 - Posted: 26 Apr 2016, 5:10:21 UTC - in response to Message 1782547. I looked at your last AP and it shows the clock rate as 1019, which is about where it should be. I then went to the same time frame as that AP and found the cuda tasks around a "normal" clock rate. Looking ahead and backwards from that point it appears that when the machine is restarted it uses a Different clock rate. It will use that rate until it is again restarted. I looked at other 780s and noticed their rate was pretty consistent. So, the question is what is changing your clock rate after a reboot? Looking as far back as possible it seems it was working fine with Version 8 CUDA; Received 29 Feb 2016, 16:04:27 UTC, GPU current clockRate = 1123 MHz 1123/24 seems to be a consistent rate on a few machines. This is where the trouble begins, Received 19 Apr 2016, 3:23:31 UTC, GPU current clockRate = 1215 MHz That task began as 1123 and after a restart it was 1215. It continued as 1215 until it was restarted here at 1097; Received 21 Apr 2016, 19:35:13 UTC, GPU current clockRate = 1097 MHz Then it worked fine until it was restarted here, https://setiathome.berkeley.edu/result.php?resultid=4879467266 Until the next restart it was bad news while clocked at 1201, https://setiathome.berkeley.edu/result.php?resultid=4879629574 Here it was restarted at 1136, https://setiathome.berkeley.edu/result.php?resultid=4883014586 ID: 1782557 ·
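
The kind of check described in the post above (reading the reported clock out of each task's stderr and spotting the runs that came up too fast after a restart) could be sketched roughly as follows. The line format is taken from the readings quoted in the post; the function names and the 1150 MHz cut-off are hypothetical, chosen here only because it separates the readings described as fine from the ones described as trouble.

```python
import re

# Matches the clock line as it is quoted in the post above.
CLOCK_RE = re.compile(r"GPU current clockRate = (\d+) MHz")


def clock_rates(stderr_text):
    """Return every reported clock rate (MHz) found in the text, in order."""
    return [int(mhz) for mhz in CLOCK_RE.findall(stderr_text)]


def flag_high(readings, limit_mhz):
    """Return the readings that exceed a chosen limit."""
    return [mhz for mhz in readings if mhz > limit_mhz]


SAMPLE = """\
Received 29 Feb 2016, 16:04:27 UTC, GPU current clockRate = 1123 MHz
Received 19 Apr 2016, 3:23:31 UTC, GPU current clockRate = 1215 MHz
Received 21 Apr 2016, 19:35:13 UTC, GPU current clockRate = 1097 MHz
"""

rates = clock_rates(SAMPLE)
# 1150 MHz is an assumed threshold for illustration, not a vendor spec.
print(rates, flag_high(rates, limit_mhz=1150))
```

Fetching the stderr text for each result page is left out on purpose; the sketch only covers the parsing step.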

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 1782559 - Posted: 26 Apr 2016, 5:19:04 UTC Last modified: 26 Apr 2016, 5:27:08 UTC Sorry if I caused confusion. It looks like we have two different issues on two different boards being discussed. All the 750ti stuff I mentioned is moot in relation to the board in question, a 780. ID: 1782559 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1782581 - Posted: 26 Apr 2016, 6:13:51 UTC - in response to Message 1782557. @Tbar, To my knowledge (unless something changed), the AP OpenCL App does not use the NVAPI detection as does Seti Cuda MB, GPUz, and Precision-X, but instead standard figures reported before the application even initialises the device, so it's not a measurement. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1782581 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1782582 - Posted: 26 Apr 2016, 6:14:39 UTC - in response to Message 1782559. Sorry if I caused confusion. It looks like we have two different issues on two different boards being discussed. All the 750ti stuff I mentioned is moot in relation to the board in question, a 780. Yeah I twigged into that bit :) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1782582 ·
