Posts by HAL9000


1) Message boards : Number crunching : Phantom Triplets (Message 1575798)
Posted 27 minutes ago by Profile HAL9000
Having something that consistently fails is indeed much nicer. Which is why Jason mentioned trying to force an error. With an error rate of 1 in 800 that would put it at 1 about every 16 days. You said most of the errors happen late at night, but have they happened on the same day(s)? It makes me think there could be some kind of large industrial plant powering up or down every 2 weeks, causing just enough of a fluctuation in the line voltage to make your GPU go a little nuts.

Ah, if only it was that consistent! Although it may average out to once every 16 days, the intervals have actually ranged from 2 days up to 35 days, the most recent interval being just 4 days. Not happening the same day or time, either, I'm afraid. Let's see, I have (in order) a Monday at ~1:35 AM, a Wednesday at ~4:40 AM, a Tuesday at ~6:05 PM, a Tuesday at ~10:50 PM, and a Saturday at ~3:20 PM. No industrial plants nearby, either. I live in a fairly rural area. That's not to say PG&E's electric service is entirely predictable, though. Sometimes they make California feel like a third world country, with random outages that seem to have no external cause (i.e., perfectly bright, sunny day or calm, clear night, no car meeting power pole, but suddenly, no juice). However, that's been true for as long as I've lived here (28+ years) and this thing with this one GPU is quite new. I'm keeping the suggestions on voltages, both yours and Jason's, in reserve for now, pending any worsening of the situation.
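
As a rough sanity check on the numbers in this exchange, here is a minimal sketch of the arithmetic. It assumes the invalids arrive independently at a constant average rate, so the gaps between them follow an exponential distribution; that model is an assumption for illustration, not something established in the thread. The 1-in-800 rate and the 16-day average come from the posts above; the tasks-per-day figure is only what those two numbers imply.

```python
import math

# Figures quoted in the thread.
ERROR_RATE = 1 / 800          # about one invalid per 800 tasks
MEAN_INTERVAL_DAYS = 16.0     # about one invalid every 16 days

# Throughput implied by the two figures above (derived, not stated in the thread).
tasks_per_day = 1 / (ERROR_RATE * MEAN_INTERVAL_DAYS)


def prob_gap_at_most(days, mean=MEAN_INTERVAL_DAYS):
    """P(gap <= days), assuming exponentially distributed gaps with this mean."""
    return 1 - math.exp(-days / mean)


def prob_gap_at_least(days, mean=MEAN_INTERVAL_DAYS):
    """P(gap >= days), assuming exponentially distributed gaps with this mean."""
    return math.exp(-days / mean)


print(f"implied tasks per day: {tasks_per_day:.0f}")
print(f"P(gap <= 2 days):  {prob_gap_at_most(2):.3f}")
print(f"P(gap >= 35 days): {prob_gap_at_least(35):.3f}")
```

Under that assumption, both a 2-day gap and a 35-day gap come out at better than a 1-in-10 chance each, so the irregular spacing reported here would not by itself rule out a purely random cause.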

I'd still like to try running a GPU memory test first, if I can find one similar to memtest86, just to test the memory, not stress test the whole GPU.

This seems to be one of few tools out there to test video memory.
http://mikelab.kiev.ua/index_en.php?page=PROGRAMS/vmt_en
2) Message boards : Number crunching : Phantom Triplets (Message 1575765)
Posted 2 hours ago by Profile HAL9000
About every 4-6 weeks the 8500 GT I'm using throws a fit and starts trashing work, despite it being in a lab with temperature, humidity, & power regulation. I tried a few things like shutting down the system weekly to see if it would help, but nothing I tried has helped so far.
I decided it was rare enough I didn't want to spend any more time looking into it.

Actually, I think if my 550Ti actually went off the rails like that, continuing to throw Invalids for an extended period after it got the first one, I'd understand it better, or at least be more understanding of it, than the way it just pharts once and then resumes as if there's nothing wrong. :^) At least then, if cleaning it or reseating it or increasing the fan speed, etc., didn't make the problem go away, I wouldn't feel hesitant to replace it. Under the present circumstances, though, I certainly wouldn't want to do that.

Speaking of replacing a card, a couple months ago I replaced an 8600 GT in my old IBM Thinkcentre with one of the (relatively) new ASUS GT 630 1GB cards. It's passively cooled and only has a maximum draw of 25w versus the 47w max for the 8600 GT. The actual power draw seems to be running only about 13w. The card only cost me $33.00 USD delivered (new, on eBay) and I figure it saves me over $4/month in electricity, so should pay for itself in about 8 months. Oh, and it provides about an 18% boost in production (at least as measured by Credits). Might be worth considering for your cranky 8500 GT.

Having something that consistently fails is indeed much nicer. Which is why Jason mentioned trying to force an error. With an error rate of 1 in 800 that would put it at 1 about every 16 days. You said most of the errors happen late at night, but have they happened on the same day(s)? It makes me think there could be some kind of large industrial plant powering up or down every 2 weeks, causing just enough of a fluctuation in the line voltage to make your GPU go a little nuts.

In my case the 8500 GT is in a work machine I use for testing. Hardware is kept the same in order to have consistent test platforms or for instances when something needs to be regressed. I did try a few years ago to get some GT 430's for several of the systems so they could correctly support Windows Aero, but that never happened. I had also made a proposal to replace all of the monitors in the lab that were using CRTs with LCDs. I even included the amount of time it would take to recoup the cost in electricity to pay for them. Sometimes our bean counters are not the brightest.
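
The payback reasoning in this post (the $33 GT 630 saving a bit over $4/month, and the CRT-to-LCD proposal) can be sketched as below. The $33 and $4 figures come from the quoted post; the 34 W difference is simply 47 W minus 13 W from the same post, and the electricity price is a hypothetical placeholder, not a figure from the thread.

```python
def payback_months(purchase_cost, monthly_saving):
    """Months until the purchase cost is recovered by the monthly saving."""
    return purchase_cost / monthly_saving


def monthly_energy_cost(watts, price_per_kwh, hours_per_day=24, days=30):
    """Cost of running a constant load of `watts` for one month."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh


# Figures from the quoted post: $33 card, a bit over $4/month saved.
print(f"payback: {payback_months(33.00, 4.00):.2f} months")

# Hypothetical electricity price (assumption, not from the thread).
PRICE_PER_KWH = 0.17
# 47 W (old card, rated max) minus 13 W (new card, observed) = 34 W.
print(f"saving for 34 W: ${monthly_energy_cost(34, PRICE_PER_KWH):.2f}/month")
```

Since the 47 W figure is a rated maximum rather than a measured draw, the 34 W difference is best read as an upper bound on the saving.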
3) Message boards : Number crunching : Phantom Triplets (Message 1575738)
Posted 4 hours ago by Profile HAL9000
Thanks for the insights, Jason. Thus far, I haven't tinkered with the voltage or clocks on that card at all. It's just running at all the defaults, with no overclocking. Probably, unless the frequency of this little anomaly increases significantly, I'll try to avoid altering any of those settings. The idea of actually trying to increase the Invalid rate in order to possibly get a clue to diagnose the existing low failure rate doesn't really appeal to me at the moment! ;^)

- The inherent susceptibility of 'consumer grade' hardware to soft error
http://en.wikipedia.org/wiki/Soft_error#Causes_of_soft_errors, especially,

IBM estimated in 1996 that one error per month per 256 MiB of ram was expected for a desktop computer


That's some really interesting stuff...definitely a scary door! I had no idea that "soft" errors were so prevalent and could be caused so easily. Alpha particles and cosmic rays and thermal neutrons, oh my! I can't wait for my bank to blame the next data breach on a thermal neutron.

Seriously, though, could a soft error possibly be consistent enough to cause the sort of rare, yet consistent, hiccup that I'm seeing where only Triplets (and lots of them) are being incorrectly identified where none apparently exist?

About every 4-6 weeks the 8500 GT I'm using throws a fit and starts trashing work, despite it being in a lab with temperature, humidity, & power regulation. I tried a few things like shutting down the system weekly to see if it would help, but nothing I tried has helped so far.
I decided it was rare enough I didn't want to spend any more time looking into it.
4) Message boards : Number crunching : Phantom Triplets (Message 1575686)
Posted 7 hours ago by Profile HAL9000
Is the 550 Ti one of the Ti cards prone to issues with power?
If it is you could log the voltages and see if there are any drops on the 12v supply that coincide with the running time window for any future invalid tasks.
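
A minimal sketch of that check, assuming the 12v readings have already been logged as (timestamp, volts) pairs by whatever monitoring tool is in use. The 11.4 V threshold is the usual ATX tolerance of 12 V minus 5%; the sample readings and the task window below are made up for illustration.

```python
from datetime import datetime

# 12 V rail minus the usual ATX 5 % tolerance.
ATX_12V_MIN = 11.4


def find_dips(samples, threshold=ATX_12V_MIN):
    """Return the (timestamp, volts) samples that fall below the threshold."""
    return [(t, v) for t, v in samples if v < threshold]


def dips_during(samples, start, end, threshold=ATX_12V_MIN):
    """Return the dips whose timestamp lies inside a task's run window."""
    return [(t, v) for t, v in find_dips(samples, threshold) if start <= t <= end]


# Hypothetical log entries and task window (not real data from the thread).
log = [
    (datetime(2014, 9, 20, 1, 30), 12.05),
    (datetime(2014, 9, 20, 1, 35), 11.21),
    (datetime(2014, 9, 20, 2, 10), 11.30),
]
task_start = datetime(2014, 9, 20, 1, 0)
task_end = datetime(2014, 9, 20, 2, 0)

for when, volts in dips_during(log, task_start, task_end):
    print(f"{when:%Y-%m-%d %H:%M} {volts:.2f} V")
```

The run window for an invalid task would come from the task's start and report times; a dip inside that window is only a hint worth following up, not proof of cause.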
5) Message boards : Number crunching : venting (Message 1575669)
Posted 8 hours ago by Profile HAL9000
Prevent the aliens or government from controlling or reading your thoughts......LOL

In a funny twist it would actually act as an antenna, making one more receptive to such mind controlling devices.
For a Faraday cage to work correctly it must completely surround the object. So a tinfoil suit is a much better idea. :)
6) Message boards : Number crunching : headless computer? (Message 1575158)
Posted 1 day ago by Profile HAL9000
tbret or anybody

Team Viewer seems cheap and easy. What is this about push the power button less than 4 seconds? Wouldn't you just shutdown over Team Viewer?

The default action for Windows is to start shutting down when the power button is pressed. This gives you a clean system shutdown.
If you hold the power button for more than 4 seconds, the ATX spec is for the PSU to shut off, which Windows will treat as an unexpected shutdown.
7) Message boards : Number crunching : headless computer? (Message 1575130)
Posted 1 day ago by Profile HAL9000
Headless Computer:

Is it without a monitor, keyboard and mouse? How do you get it setup? How do you shut it down? I assume it will just be running seti/lunatics/win8.1.

There are many options. A few are:
1) You could get an inexpensive KVM switch to go between your machines.
2) Get a network KVM device so you can remote into it and have full control over the whole machine just like a KVM, but across the network. However, these are often expensive.
3) Connect your current monitor and input devices. Set up the system and install remote software such as VNC when done.
4) Connect your current monitor and input devices. Set up the system, then control it remotely over the network with command line options.

I mostly do 3 or 4 myself.
Once you have the system set up & BOINC configured to launch on startup, you can control BOINC through BOINC Manager from your main system.
On most of my systems I don't even have BOINC Manager running, since I access them remotely via boinccmd when I need to do anything.

As far as shutting the system down, you can do that remotely via a Windows command line option. Open a command prompt and type shutdown /? to get the full help instructions.
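
As a rough illustration of what those command lines look like, here is a small sketch that only builds the argument lists and prints them; it does not run anything. The host name and password are hypothetical placeholders. The shutdown switches (/s, /r, /t, /m) and the boinccmd options (--host, --passwd, --get_state) are the standard ones, but check shutdown /? and the boinccmd help on your own systems, and note that a remote shutdown needs admin rights on the target machine.

```python
def shutdown_command(host=None, delay_seconds=60, restart=False):
    """Build a Windows `shutdown` command line as an argument list."""
    cmd = ["shutdown", "/r" if restart else "/s", "/t", str(delay_seconds)]
    if host:
        # /m \\computer targets a remote machine.
        cmd += ["/m", rf"\\{host}"]
    return cmd


def boinccmd_command(host, password, action="--get_state"):
    """Build a `boinccmd` command line that talks to a remote BOINC client."""
    return ["boinccmd", "--host", host, "--passwd", password, action]


# Hypothetical host name and password, for illustration only.
print(" ".join(shutdown_command("cruncher-02", delay_seconds=30)))
print(" ".join(boinccmd_command("cruncher-02", "example-password")))
```

Passing one of these lists to subprocess.run would execute it; printing first is a cheap way to check the command before pointing it at a real machine.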
8) Message boards : Number crunching : Different cache for AP/MB (Message 1575126)
Posted 1 day ago by Profile HAL9000
Is it possible to set different cache values for AP and MB? Like 10 days for AP and 1 day for MB?

Cache preferences are set globally for BOINC rather than per project or per application within a project.
9) Message boards : Number crunching : The ultimate build (Message 1575056)
Posted 1 day ago by Profile HAL9000
The GTX750Ti is a dual slot wide card, right?
Then the max is 4 cards/mobo.
Or low profile cards are available?

Instead of two GTX750Ti one e.g. GTX780 / maybe ~ same RAC output.

AFAIK, BOINC can manage up to 8 GPU chips (4x dual, or 8 separate cards).

You could use 14/20 slot chassis or use PCIe extensions to have the cards further apart. The main issue would be having enough PCIe lanes for all the cards. So a multi-socket MB would be a good start.
I think BOINC has always supported more than 8 GPUs. However, the scheduler wouldn't believe the client if it reported more than 8. It would only give it the # of tasks for 8 GPUs. I seem to recall that this limit was increased to a higher value. I want to say 40, but that may not be correct.
10) Message boards : Number crunching : Panic Mode On (89) Server Problems? (Message 1575047)
Posted 1 day ago by Profile HAL9000
I would love to see one computer download a gig worth of work. I wonder how long it would take to process, say a week on the latest hardware running 24/7?

Do you want a GB of MB work or AP work? A machine doing CPU & GPU can very easily have 1.6GB of AP when they are being produced. For MB, 1 GB would be around 2700 tasks, which was about the size of the 10 day queue on my 24 core server before the limits & when we were on MB v6.
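
For a feel of what the figure of about 2700 for 1 GB of MB work implies per task, here is a quick sketch of the division. It assumes a decimal gigabyte (10^9 bytes), which is an assumption; the post does not say which definition was meant.

```python
GB = 1_000_000_000      # decimal gigabyte (assumption)
MB_TASKS_PER_GB = 2700  # figure from the post

# Average size of one MB task implied by the figure above.
bytes_per_mb_task = GB / MB_TASKS_PER_GB


def tasks_in(total_bytes, bytes_per_task):
    """Whole number of tasks that fit in a download of `total_bytes`."""
    return int(total_bytes // bytes_per_task)


print(f"about {bytes_per_mb_task / 1000:.0f} KB per MB task")
print(f"{tasks_in(GB, bytes_per_mb_task)} tasks in 1 GB")
```

That works out to roughly 370 KB per MB task.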
11) Message boards : Number crunching : The ultimate build (Message 1574551)
Posted 2 days ago by Profile HAL9000
Something that might be good to include would be price range categories.
Perhaps separated by 500 increments.
$/£/€ 499.99 & under
$/£/€ 999.99 - 500.00
$/£/€ 1499.99 - 1000.00
$/£/€ 1999.99 - 1500.00
$/£/€ 2499.99 - 2000.00
$/£/€ 2999.99 - 2500.00
Then larger values as the amount increases
$/£/€ 3999.99 - 3000.00
$/£/€ 4999.99 - 4000.00
$/£/€ 5999.99 - 5000.00
$/£/€ 14999.99 - 10000.00
$/£/€ 19999.99 - 15000.00
$/£/€ 24999.99 - 20000.00
12) Message boards : Number crunching : GPU Wars 2014: Postponed to 2015? (Message 1574507)
Posted 2 days ago by Profile HAL9000
This is interesting..

http://www.pcworld.com/article/2686115/nvidia-unveils-its-all-new-geforce-gtx-980-and-gtx-970-graphics-processors.html

the TDP for the GeForce GTX 980 with 4GB is just 165 watts.


Another article says 195W. Either way, that's still a pretty good reduction

Despite the lower power consumption, Nvidia says the GTX 980 will deliver 5 teraflops of single-precision compute performance

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-980/specifications
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-970/specifications
13) Message boards : Number crunching : Max temp for getting the normal life out of a video card? (Message 1574476)
Posted 2 days ago by Profile HAL9000
I'm getting too hooked on this stuff. I just bought a dell with a r9 260 as my graphic card and I just swore the other day I wanted out of the dell hook up.
But I saw this pretty little thing for a nice price and I just had to do it.
Didn't even have a choice. That's it, no more stuff.

You might want some ointment. As it seems you have been bitten by the bug.
14) Message boards : Number crunching : Panic Mode On (89) Server Problems? (Message 1574473)
Posted 2 days ago by Profile HAL9000
Interesting stair-steps on the Crickets right now.


Yes, in Daily Graph. But when looking at Monthly Graph I see transfers are going back to "normal" and above it, no more AP frenzy???

Just my guess...

If you look at the monthly graph you can see our "normal" AP frenzy traffic. Which is often about 200Mb. The past few weeks there was an uptick in the inbound, first small and then larger, and outbound even went up over 400Mb at one point. It looks odd even on the yearly graph.
15) Message boards : Number crunching : ntpckr questions (Message 1574465)
Posted 2 days ago by Profile HAL9000
My understanding is.. they got a donated, spec-built server specifically for ntpckr to be run, but shortly after firing that process up, it just about crippled the science database, so it was turned off.

Then work (and more hardware donations) began on improving the I/O throughput for the database, and we now have the hardware for the database server that can handle immense I/O loads, and then it was discovered that the I/O limitations are the software itself (or the drivers for the hardware), which needs tweaking and fine-tuning, or for alternatives to be found and tested and eventually deployed.

This is the stage where I believe it still is. I believe it was generally decided that the database software was not particularly designed for a database as large as what we have, so we are in "uncharted territory" so to speak.


However, I have suggested once before that if ntpckr is such a load on the database, why not use a copy of the weekly back-up and put it directly on the ntpckr machine and let it chew through that for however long it takes for it to get "caught up," and then throw an incremental update to the database at it and it will chew through that in short order, maybe do one more incremental update, and from there, it should be able to handle "near real time" with minimal impact.

But.. I don't know if that's even a feasible option or not.

I think Matt had mentioned they were looking into splitting the DB into smaller chunks so that they could actually work with it.
16) Message boards : Number crunching : Max temp for getting the normal life out of a video card? (Message 1574393)
Posted 2 days ago by Profile HAL9000
I only run one task at a time. The temps are 58-64C. The fan on the r7 I leave at 100% since it runs fairly quiet. The 7770 fan is usually around 70%.

I think Hal told me to only run one task at a time on my cards, if I remember correctly, but it was kind of hectic back then. You get the problem where the gpu's get slowed down by those heavily blanked tasks and have to rely on the pace of the cpu's.

Even if it was someone else, for cards in this range it is probably good to run only 1 AP at a time. If the non-blanked tasks are keeping the load up pretty high, more than 1 isn't really a benefit. Also, when mixing GPUs you either have to set up for the slowest one or run multiple instances of BOINC.

I leave the fan on my HD6870 on auto. On a hot day with the windows open the fan spins up to about 55% from 40-45%, while keeping the GPU at the same temp as when the A/C is on. I would have to check the temp logs, but I think it stays well under 60ºC all of the time.
I leave the fan on auto to hopefully get as much life out of the fan as possible, because a dead fan is a fairly likely way to end up with a dead GPU.
17) Message boards : Number crunching : GPU Wars 2014: Postponed to 2015? (Message 1574268)
Posted 2 days ago by Profile HAL9000
I'm just wondering why they skipped the 800 series.

I think maybe to create a better separation between different segments, like the Radeon R3, R5, R9 series GPUs. Perhaps they will keep some as 700 series, then introduce new mid-range GPUs as the 800 series.
18) Message boards : Number crunching : Panic Mode On (89) Server Problems? (Message 1573015)
Posted 5 days ago by Profile HAL9000
Yes the blue cricket line certainly has been under an unusual load over the last week or so, but then my GPU's have been doing a lot of MB shorties lately too.

Cheers.

I wonder if they are doing something with one of their other servers in the colo. As I recall they moved over all of their servers from the closet. Which included other things besides just the SETI@home gear.
A solid 150Mb, in both directions, is a lot of data the past few days.
19) Message boards : Number crunching : Computer crashes a lot (Message 1572851)
Posted 5 days ago by Profile HAL9000
As a workaround until this is sorted out, you could configure the machine for auto logon. That way when it crashes it will log in and restart BOINC. Running control userpasswords2 is one way you can set up the system for auto logon.

The information WhoCrashed is giving you is located in the Windows Event log. You don't really need a 3rd party application to read it, but whichever way you prefer works.

That error doesn't point to anything specific. Kernel crashes can be tricky to pin down. If you look in the Windows Event log, are there any events prior to the crash that may show what the system was doing? Perhaps an anti-virus scan or something along those lines was running?
20) Message boards : Number crunching : Different setting for different GPUs? (Message 1572763)
Posted 5 days ago by Profile HAL9000

In another post Raistmer was mentioning it would be nice if we had a per GPU config and I made this suggestion. Perhaps one of the Lunatics forwarded it or a better idea to the BOINC devs.


I think that would be a great option if they could do it. Nice to know both of you were thinking already about that. Guess I'll wait and see if anything comes from that.

Zalster

If you are on the BOINC Alpha list you could mention it also. I would think that the more times something is mentioned the more likely it is to happen.



Copyright © 2014 University of California