Panic Mode On (102) Server Problems?

kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1782408 - Posted: 25 Apr 2016, 16:27:43 UTC

I'm about 600 tasks short of a full boat here myself at the moment.
Kitties are not getting their proper ration of kibbles.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1782408 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1782409 - Posted: 25 Apr 2016, 16:33:58 UTC - in response to Message 1782407.  
Last modified: 25 Apr 2016, 16:34:40 UTC

It appears that the boys in the lab have done something to prevent downloads of (Nvidia) GPU WUs. If not all, then at least the GB ones, and there are none available from Arecibo. Downloads have been working normally for me, so I do not think I have done anything to prevent them from downloading. I currently have 100 CPU WUs in progress but only 80 GPU WUs, despite asking for them.


Edit: Guess I called a "no hitter". Just received a couple of guppi GPU WUs. Not sure why they do not seem to be available in quantity though.

As we know, VLAR tasks are not sent to GPUs. GBT data is expected to be mostly VLARs. If we are also on data sets from Arecibo that generate mostly VLARs, then there will be few tasks for GPUs.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu)
ID: 1782409 · Report as offensive
Profile Zombu2
Volunteer tester

Joined: 24 Feb 01
Posts: 1615
Credit: 49,315,423
RAC: 0
United States
Message 1782431 - Posted: 25 Apr 2016, 18:58:31 UTC - in response to Message 1782396.  
Last modified: 25 Apr 2016, 19:00:15 UTC

First thing I would do with that GPU is set it back to the Nvidia defaults:

GTX 780 GPU Engine Specs:
  CUDA Cores: 2304
  Base Clock (MHz): 863
  Boost Clock (MHz): 900


Instead of what you've got it set to:
GPU current clockRate = 1201 MHz


The card is at factory default. I have no idea why nobody reads it; I have said it many times: IT IS AT DEFAULT.
I came down with a bad case of i don't give a crap
ID: 1782431 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22186
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1782433 - Posted: 25 Apr 2016, 19:05:44 UTC

...because you keep on having the same problem :-(


The other thing you might consider is evicting the dust bunnies - they breed when we aren't looking.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1782433 · Report as offensive
Profile Zombu2
Volunteer tester

Joined: 24 Feb 01
Posts: 1615
Credit: 49,315,423
RAC: 0
United States
Message 1782452 - Posted: 25 Apr 2016, 20:08:01 UTC - in response to Message 1782433.  

...because you keep on having the same problem :-(


The other thing you might consider is evicting the dust bunnies - they breed when we aren't looking.


Yeah, dust bunnies do sneak in, but this machine is blown out on a weekly basis, and so are all the other machines. I have been running the MSI burn-in test for 6 hours now with no artifacts, driver crashes, or anything else for that matter. The card is running at a nice 60C.
I came down with a bad case of i don't give a crap
ID: 1782452 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1782472 - Posted: 25 Apr 2016, 21:28:11 UTC

So... when you combine 4 of these guppi's, do you get a gouldfish then? ;-)
ID: 1782472 · Report as offensive
Profile Zombu2
Volunteer tester

Joined: 24 Feb 01
Posts: 1615
Credit: 49,315,423
RAC: 0
United States
Message 1782494 - Posted: 25 Apr 2016, 23:06:19 UTC

Maybe a dead one with 3 eyes
I came down with a bad case of i don't give a crap
ID: 1782494 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1782526 - Posted: 26 Apr 2016, 2:00:38 UTC - in response to Message 1782234.  

All I can tell you is that none of the cards are overclocked; they are all on default.


Something's odd then.

The defaults for the reference GTX780 are:
Base clock: 863 MHz
Boost clock: 900 MHz

The EVGA 780 SC, being a factory overclock, has:
Base clock: 967 MHz
Boost clock: 1020 MHz

So something has boosted your clock speed if it is running at 1201 MHz.
PrecisionX settings? PCIe bus overclock?

EDIT-
Beaten by TBar.
What does GPUz have to say?


Maybe the SETI client has wrong info on it or read it wrong, dunno.

I think it's really hard to know for sure what those core clock readings really mean. One of the cards on my T7400 is an NVIDIA reference GTX780. It runs at default settings, no overclocking. Looking at the Graphics Card tab in GPU-Z the GPU Clock = 863 MHz and the Boost Clock = 902 MHz. However, looking at the Sensors tab, the GPU Core Clock speed is shown as 1019 MHz. That's the same as what shows up in the Stderr for a Cuda50 task. Precision X and Open Hardware Monitor also report the 1019 MHz value. I remember an exchange with Jason a couple of months ago that touched on this sort of discrepancy, and I don't think any conclusions were reached.

Oh, and my GTX780 doesn't appear to be throwing any errors at all, even on MESSIER031 tasks, so I would think you do have a hardware problem to deal with.
ID: 1782526 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1782527 - Posted: 26 Apr 2016, 2:21:20 UTC - in response to Message 1782526.  
Last modified: 26 Apr 2016, 2:37:00 UTC

I think the suspicions of something triggering excessive boost clock (either some software issue, firmware/hardware, or a manual setting in Precision etc.) are likely.

A >20% overclock on those isn't out of the realm of possibility (factory, auto boost or not), though you'd be expecting watercooling, beefed up power, and increased GPU voltages to achieve it, rather than an out of the box boost clock to do it with fan cooling.

The application reading is taken right in the middle of a computationally intense portion of code that is sufficient to trigger the (normal) boost functionality. That uses the same API as available to GPUz and Precision, so they should read the same peak (unless there is something wrong)

The eventual boost value is determined by a complex array of GPU internal sensors (I think about 23 different metrics IIRC), and a curve in the firmware set by the manufacturer, with overrides by the monitoring/OC software (like Precision and others).

Aside from the possible software/settings issues, it's possible the particular GPU model, being factory OC'd, has an aggressive/optimistic boost curve, one or more sensors is dicey, or some other element of the GPU is weak. It'd be impossible to know which, if any of these, would be to blame, other than just manually reducing the clock offset so that boost drops to factory or even reference card spec. If results/normal operation come good, then you can just say 'it was software, GPU manufacturer, or something else, but works now'.

It is [very] unlikely the MHz reading is incorrect, so getting that frequency to drop to normal levels will be the thing to prove something has been forcing the clock inappropriately high. (assuming it comes good)
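To put numbers on the clocks mentioned in this thread, here is a minimal Python sketch that works out how far an observed clock sits above a rated boost clock. The figures (900 MHz reference boost, 1020 MHz EVGA SC boost, 1201 MHz observed) are the ones quoted earlier in the thread; the function name `percent_over` is only illustrative.

```python
def percent_over(observed_mhz, rated_mhz):
    """Return how far observed_mhz is above rated_mhz, as a percentage."""
    return (observed_mhz - rated_mhz) / rated_mhz * 100.0


# Clock figures quoted in the thread (MHz).
OBSERVED_MHZ = 1201
RATED_BOOST_MHZ = {"reference GTX 780": 900, "EVGA 780 SC": 1020}

for card, rated in RATED_BOOST_MHZ.items():
    print(f"{card}: {percent_over(OBSERVED_MHZ, rated):.1f}% above rated boost")
```

Measured against the reference boost clock, the 1201 MHz reading is well past the 20% mark; measured against the factory-overclocked card's boost clock, it is somewhat under it.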
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1782527 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1782529 - Posted: 26 Apr 2016, 2:49:52 UTC

My 750Ti's are factory clocked at 1228; someone told me they should be at 1200. It never runs invalids though.

My problem is, when I run AP tasks (and am using the computer) it frequently crashes :( If I'm not using the computer, it runs all night with no problems.

It is on my to-do list to try downclocking during the next AP run.
ID: 1782529 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1782530 - Posted: 26 Apr 2016, 3:01:55 UTC - in response to Message 1782529.  
Last modified: 26 Apr 2016, 3:08:04 UTC

My 750Ti's are factory clocked at 1228; someone told me they should be at 1200. It never runs invalids though.

My problem is, when I run AP tasks (and am using the computer) it frequently crashes :( If I'm not using the computer, it runs all night with no problems.

It is on my to-do list to try downclocking during the next AP run.



Could still possibly be a similar situation: an aggressive/optimistic factory boost curve. The reasons factories do this relate to marketing/competition for what is sold as a gaming card, high volume in the 750ti case. (HPC Tesla compute devices are built, binned and specced differently, with no vendor tweaks allowed; IIRC they are still made only by nVidia themselves.)

Under stress (such as when using the host while crunching) every GPU+host will behave a bit differently.

Assuming available application settings have already been explored (reducing pressure), one possible mechanism for AP or MB crashes under contention with the user/display could be GPU memory or PCIe saturation, inducing excessive memory errors in the first case, or driver crashes through OS timeouts in the latter.

If reducing the GPU memory+core boost offsets doesn't help, then sometimes a small voltage bump can be all that's needed (if temps/cooling and power allow).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1782530 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1782536 - Posted: 26 Apr 2016, 3:20:25 UTC - in response to Message 1782529.  
Last modified: 26 Apr 2016, 3:47:19 UTC

My 750Ti's are factory clocked at 1228; someone told me they should be at 1200. It never runs invalids though.

My problem is, when I run AP tasks (and am using the computer) it frequently crashes :( If I'm not using the computer, it runs all night with no problems.

It is on my to-do list to try downclocking during the next AP run.

All depends on whose board and which flavor of 750ti.
I have 4 machines, each with 2x EVGA 750ti. 6 are SCs, 2 are FTWs.
    Machine, GPU, Max clock
    1 , SC(0) ,1320
    1 , SC(1) ,1333
    2 , FTW(0),1360
    2 , SC(1) ,1306
    3 , FTW(0),1345
    3 , SC(1) ,1320
    4 , SC(0) ,1333
    4 , SC(1) ,1333


Shrugs. Seems like these tend to be all over the map. Trying to tweak around with them seems pointless, even with the latest Precision x-16. Flaky stuff. But I will note I have not had any issues I can point back to a GPU card or clocking.
BTW, the specs are here for the SC and FTW boards.
The times I had issues with freezes and restarts, it was generally giving the GPUs more CPU to work with ...
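As a rough illustration of how "all over the map" those readings are, here is a small Python sketch that groups the max clocks from the table above by board flavor and reports the range for each. The data is copied from the table; the helper name `summarize_by_model` is made up for this example.

```python
from collections import defaultdict

# (machine, GPU label, max clock in MHz), copied from the table in this post.
READINGS = [
    (1, "SC(0)", 1320),
    (1, "SC(1)", 1333),
    (2, "FTW(0)", 1360),
    (2, "SC(1)", 1306),
    (3, "FTW(0)", 1345),
    (3, "SC(1)", 1320),
    (4, "SC(0)", 1333),
    (4, "SC(1)", 1333),
]


def summarize_by_model(readings):
    """Group clocks by board flavor (SC, FTW) and return min, max and spread."""
    by_model = defaultdict(list)
    for _machine, label, mhz in readings:
        model = label.split("(")[0]  # "SC(0)" -> "SC"
        by_model[model].append(mhz)
    return {
        model: {"min": min(v), "max": max(v), "spread": max(v) - min(v)}
        for model, v in by_model.items()
    }


summary = summarize_by_model(READINGS)
for model, stats in sorted(summary.items()):
    print(model, stats)
```

With only two FTW boards in the sample, the per-flavor ranges are indicative at best.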


ID: 1782536 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1782539 - Posted: 26 Apr 2016, 3:42:51 UTC - in response to Message 1782536.  

Those clocks suggest the previous example 750ti isn't clocked all that high compared to many. Freeing CPUs (effectively reducing contention) seems logical, as does adjusting the settings for the crashing AP app.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1782539 · Report as offensive
Profile Zombu2
Volunteer tester

Joined: 24 Feb 01
Posts: 1615
Credit: 49,315,423
RAC: 0
United States
Message 1782547 - Posted: 26 Apr 2016, 4:28:53 UTC

Well, the card worked great up until v8 got into my queue, and that's when the whole shebang started.

So I'm more inclined to blame either BOINC or the Lunatics app.
I came down with a bad case of i don't give a crap
ID: 1782547 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1782553 - Posted: 26 Apr 2016, 4:56:28 UTC - in response to Message 1782539.  
Last modified: 26 Apr 2016, 4:59:53 UTC

@Jason

Yeah, I have tried shutting down cores and only running 1 AP task, and removing/changing the command line, without success.

It only seems to freeze/lock up when I move the mouse.

I did finally get my iGPU working (it was a stubborn bugger) so I want to try using that as my main display, and turn off the 750. That should relieve some strain.

If that doesn't work then it's downclocking time, or maybe power as you suggested.

It would be nice to have a reliable supply of APs for testing. It's just frustrating that it works great for the few hours I'm here, then crashes the next time, and I forget what I changed before I ever see another task.

So in the mean time, I have been trying to let them kick in at night.
ID: 1782553 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1782555 - Posted: 26 Apr 2016, 5:00:21 UTC - in response to Message 1782547.  
Last modified: 26 Apr 2016, 5:26:08 UTC

I think there's no argument with the fact that V8 is more demanding of resources than V7 was. I know I had to dial it back a bit on my weakest machine.
ID: 1782555 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1782557 - Posted: 26 Apr 2016, 5:10:21 UTC - in response to Message 1782547.  

I looked at your last AP and it shows the clock rate as 1019, which is about where it should be. I then went to the same time frame as that AP and found the cuda tasks around a "normal" clock rate. Looking ahead and backwards from that point, it appears that when the machine is restarted it uses a different clock rate. It will use that rate until it is again restarted. I looked at other 780s and noticed their rate was pretty consistent. So, the question is: what is changing your clock rate after a reboot?

Looking as far back as possible it seems it was working fine with Version 8 CUDA;
Received 29 Feb 2016, 16:04:27 UTC, GPU current clockRate = 1123 MHz
1123/24 seems to be a consistent rate on a few machines.
This is where the trouble begins, Received 19 Apr 2016, 3:23:31 UTC, GPU current clockRate = 1215 MHz
That task began as 1123 and after a restart it was 1215.
It continued as 1215 until it was restarted here at 1097;
Received 21 Apr 2016, 19:35:13 UTC, GPU current clockRate = 1097 MHz
Then it worked fine until it was restarted here, https://setiathome.berkeley.edu/result.php?resultid=4879467266
Until the next restart it was bad news while clocked at 1201, https://setiathome.berkeley.edu/result.php?resultid=4879629574
Here it was restarted at 1136, https://setiathome.berkeley.edu/result.php?resultid=4883014586
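For anyone wanting to do the same comparison across a batch of results, here is a minimal Python sketch that pulls the clock reading out of saved stderr text and flags the high ones. The `GPU current clockRate = ... MHz` line format is as quoted in this thread; the 1150 MHz threshold and the function names are assumptions chosen for illustration only, not a known limit for the card.

```python
import re

# Matches the stderr line format quoted in this thread.
CLOCK_RE = re.compile(r"GPU current clockRate\s*=\s*(\d+)\s*MHz")


def extract_clock_mhz(stderr_text):
    """Return the first reported clock rate in MHz, or None if absent."""
    match = CLOCK_RE.search(stderr_text)
    return int(match.group(1)) if match else None


def flag_high_clocks(readings, limit_mhz):
    """Return the (label, mhz) pairs whose clock exceeds limit_mhz."""
    return [(label, mhz) for label, mhz in readings
            if mhz is not None and mhz > limit_mhz]


# Received dates and stderr lines quoted in the post above.
SAMPLES = [
    ("29 Feb 2016", "GPU current clockRate = 1123 MHz"),
    ("19 Apr 2016", "GPU current clockRate = 1215 MHz"),
    ("21 Apr 2016", "GPU current clockRate = 1097 MHz"),
]

parsed = [(date, extract_clock_mhz(text)) for date, text in SAMPLES]
# 1150 MHz is a hypothetical cut-off for this example only.
suspect = flag_high_clocks(parsed, limit_mhz=1150)
print(suspect)
```

Sorting the flagged results by date alongside the host's restart times would show whether the high readings line up with reboots, as suggested above.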
ID: 1782557 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1782559 - Posted: 26 Apr 2016, 5:19:04 UTC
Last modified: 26 Apr 2016, 5:27:08 UTC

Sorry if I caused confusion. It looks like we have two different issues on two different boards being discussed. All the 750ti stuff I mentioned is moot in relation to the board in question, a 780.
ID: 1782559 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1782581 - Posted: 26 Apr 2016, 6:13:51 UTC - in response to Message 1782557.  

@TBar, to my knowledge (unless something changed), the AP OpenCL app does not use the NVAPI detection, as SETI CUDA MB, GPU-Z, and Precision-X do, but instead uses standard figures reported before the application even initialises the device, so it's not a measurement.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1782581 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1782582 - Posted: 26 Apr 2016, 6:14:39 UTC - in response to Message 1782559.  

Sorry if I caused confusion. It looks like we have two different issues on two different boards being discussed. All the 750ti stuff I mentioned is moot in relation to the board in question, a 780.


Yeah I twigged into that bit :)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1782582 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.