Phantom Triplets
Author | Message |
---|---|
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
In the spirit of "You never know what you'll find when you start pulling on a loose thread", I just ran across an example of another problem that's been going on for a long time. I decided to start preemptively watching my "Inconclusive" tasks on that machine, to try to catch these phantom Triplets while the result file might still be available to somebody. I just checked the 4 new ones that appeared today and, while I didn't catch any phantom Triplets, I did find one WU where my task got marked Inconclusive while my first wingman's got immediately marked Invalid. Initially, I assumed that this was an example of the ongoing "-9 overflow with truncated Stderr" problem for which the fix still hasn't been implemented. However, on closer inspection, this wingman turned out to be one of those who is still trying to run v7 tasks through a v6 app, specifically: setiathome_enhanced 6.11 $Revision: 850 $ g++ (Ubuntu/Linaro 4.5.1-8ubuntu2~ppa1) 4.5.1 Of course, since the host has an Anonymous user, under the present setup the only people who could contact him to set him straight are the project admins. In the meantime, that host shows: Invalid (643) · Error (52) Not good! |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
In the spirit of "You never know what you'll find when you start pulling on a loose thread", I just ran across an example of another problem that's been going on for a long time. I decided to start preemptively watching my "Inconclusive" tasks on that machine, to try to catch these phantom Triplets while the result file might still be available to somebody. I just checked the 4 new ones that appeared today and, while I didn't catch any phantom Triplets, I did find one WU where my task got marked Inconclusive while my first wingman's got immediately marked Invalid. Initially, I assumed that this was an example of the ongoing "-9 overflow with truncated Stderr" problem for which the fix still hasn't been implemented. However, on closer inspection, this wingman turned out to be one of those who is still trying to run v7 tasks through a v6 app, specifically: It's a new host id for the host mentioned in this thread: Anonymous owner of computer 5940769 Claggy |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
I ran some GPU memory test programs today, first one called OCCT and then a couple from the Folding@Home Utilities page, MemtestG80 and MemtestCL. I wasn't terribly impressed by the OCCT program, but the other two seem to be reasonable facsimiles of Memtest86, adapted for the GPU. MemtestG80 is just for NVIDIA CUDA-enabled GPUs, while the MemtestCL can run on both NVIDIA and ATI Open-CL cards. None of them detected any errors on the GTX 550 Ti, even after many iterations and several hours. However, the programs seem to have some limitations in regard to the maximum amount of memory they can test. The max I could get MemtestG80 to look at was 680MB out of the 1024MB on the GPU, even though GPU-Z only reported 81MB being in use prior to running the test. MemtestCL, however, was able to test 924MB under the same conditions. The advantage of MemtestG80 is that it runs about 8-10 times faster than MemtestCL (which took about 2.5 hours to test the 924MB for 50 iterations of its 13 different test schemes). Of course, the absence of errors doesn't really prove that there isn't a weak bit lurking somewhere in there but, for now, I think I'll just let it ride. I've got BoincLogX running to capture Result files, so if the phantom Triplets show up again, perhaps there will be some more evidence available to help pin it down. |
shizaru Send message Joined: 14 Jun 04 Posts: 1130 Credit: 1,967,904 RAC: 0 |
That's not to say PG&E's electric service is entirely predictable, though. Sometimes they make California feel like a third world country, with random outages that seem to have no external cause (i.e., perfectly bright, sunny day or calm, clear night, no car meeting power pole, but suddenly, no juice). Thanx for inadvertently reminding me about this doc. Sat up until 5 in the morning watching the whole thing again right after posting (first time I saw it was many, many years ago). It's really good, it's one of those rare impartial ones (well as much as humanly possible anyway). I'll borrow Ebert's take on it: "This is not a political documentary. It is a crime story. No matter what your politics, Enron: The Smartest Guys in the Room will make you mad". And a bit sad I'd add. Don't want to spoil it for anyone in case anybody actually ever wants to watch it but I will say that apparently PG&E has no say in where & how their power is distributed. How & why Californian blackouts occur is painfully explained in great detail halfway through the film. (Since it's a documentary about Enron, it only focuses on the one company obviously but there are more "players" on the market apparently pulling the same stunts) http://www.pbs.org/independentlens/enron/film.html |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
I ran some GPU memory test programs today, first one called OCCT and then a couple from the Folding@Home Utilities page, MemtestG80 and MemtestCL. I wasn't terribly impressed by the OCCT program, but the other two seem to be reasonable facsimiles of Memtest86, adapted for the GPU. MemtestG80 is just for NVIDIA CUDA-enabled GPUs, while the MemtestCL can run on both NVIDIA and ATI Open-CL cards. Finally had another one of these show up this morning, about 17 days after the last one. It's WU #1608838137, which my host completed and reported at about 11:51 PM local time on October 5. My machine thinks it found 24 Triplets, whereas both wingmen found none. Unfortunately, when I checked the result file that BoincLogX captured, I found that it only contained the workunit_header information and none of the actual result data. Checking other result files captured around the same time, I found that some did contain result data and others didn't. I have to assume the BoincLogX monitoring interval is to blame, since I hadn't thought to change it from the 15 second default. Sigh... So now I've changed the monitoring interval to 5 seconds and will just have to wait and see if that does the trick the next time the phantom triplets show up (since I doubt if the problem will just go away on its own, even though it's pretty rare). |
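Why the monitoring interval matters can be sketched in a few lines. This is a minimal illustration and not how BoincLogX is actually implemented: `snapshot_changed` is a hypothetical helper, and the `*.sah` pattern and directory layout are assumptions. A watcher that only looks every N seconds can copy a file while it is still being written, and if the client removes the file before the next poll, that partial copy is all that remains.

```python
import shutil
from pathlib import Path


def snapshot_changed(src_dir, dst_dir, seen, pattern="*.sah"):
    """Copy files matching `pattern` whose size or mtime changed since the last call.

    `seen` maps file name -> (size, mtime_ns) from the previous poll and is
    updated in place. Returns the names copied during this poll.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src_dir.glob(pattern)):
        try:
            stat = path.stat()
            signature = (stat.st_size, stat.st_mtime_ns)
            if seen.get(path.name) == signature:
                continue  # unchanged since the previous poll
            shutil.copy2(path, dst_dir / path.name)
        except FileNotFoundError:
            continue  # file vanished between listing and copying
        seen[path.name] = signature
        copied.append(path.name)
    return copied
```

Calling this in a loop with `time.sleep(5)` rather than `time.sleep(15)` narrows the window in which a finished file can appear and disappear unseen, but it does not close that window entirely.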
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I ran some GPU memory test programs today, first one called OCCT and then a couple from the Folding@Home Utilities page, MemtestG80 and MemtestCL. I wasn't terribly impressed by the OCCT program, but the other two seem to be reasonable facsimiles of Memtest86, adapted for the GPU. MemtestG80 is just for NVIDIA CUDA-enabled GPUs, while the MemtestCL can run on both NVIDIA and ATI Open-CL cards. The result file truncation 'smells' a bit like the boincapi thread safety issues might be involved (similar to truncated/missing stderr cases). Whether that's just a symptom, or somehow acting as a cause, is a totally different, not easily answered, question I guess. Though the time between events is long, since you are running under anonymous platform could you switch to one of my commode.obj enabled builds? (They are intended to test one kind of workaround for boinc thread safety problems.) If the same extra triplet issues still appear (different cause) then at least stderr and result content should be complete with this build, improving the likelihood of a firmer diagnosis (such as a driver synchronisation issue lurking). If they stop appearing with this build, I would investigate potential host DPC (driver/software-interrupt) latency issues, which can manifest from obscure driver and software quality issues, chipset/RAID/SATA drivers for one possible suspect of many. Commode.obj enabled builds are provided for diagnostic purposes at: http://jgopt.org/download.html "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Finally had another one of these show up this morning, about 17 days after the last one. It's WU #1608838137, which my host completed and reported at about 11:51 PM local time on October 5. My machine thinks it found 24 Triplets, whereas both wingmen found none. Thanks for the suggestion, Jason. Actually, my thinking is that even though the result file that BoincLogX captured was truncated, the file that was actually uploaded was probably complete. Richard Haselgrove had warned me (when we were first researching the truncated Stderr / immediate invalid issue last year) that BoincLogX needed a fairly aggressive monitoring interval in order to ensure capturing the complete result file. Unfortunately, I had forgotten about that until this morning, and had left it at the default interval of 15 seconds. I've now lowered it to 5 seconds, which seemed to work well last year. Currently, the 3 machines that I have running Anonymous Platform have plain vanilla app_info files and I don't think I want to tinker with any of them until Richard releases the new installer for the AP v7 release and I get those machines updated. Once that's done, I'll look into your commode builds, especially since it looks like Eric is never going to bother to implement that simple server-side fix that Joe Segur came up with for the Immediate Invalid problem (which still occurs with depressing regularity, my latest one being WU 1609856673). |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Okay, I think I've got another one after only about 8 days this time. It's task 3786421579 and, while it is just in an Inconclusive state right now, has the look and smell of one that will go Invalid once the tie-breaking wingman reports, likely in a day or two. Where my host found 1 Pulse and 14 Triplets, my initial wingman found 1 Pulse and 0 Triplets. (Of course, I suppose it could be the wingman who got it wrong, but since his machine looks clean and he happens to be today's "User of the Day", I feel comfortable giving him the benefit of the doubt. :^)) Whether thanks to the shorter monitoring interval for BoincLogX, or to Jason's commode.obj build which I installed a couple days ago, it appears that I captured a complete result file this time. I thought perhaps some sort of pattern might jump out at me, but nothing grabs my untrained eye, so here's the results for all 14 triplets, in the hopes that something useful can be found that might point to a possible cause: <triplet> <peak_power>1701.0302734375</peak_power> <mean_power>0.001165299792774</mean_power> <time>2456904.1331839</time> <ra>9.7058914595861</ra> <decl>17.092461484168</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.0762</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0196608</period> </triplet> <triplet> <peak_power>1954.1756591797</peak_power> <mean_power>0.001165299792774</mean_power> <time>2456904.1331839</time> <ra>9.7058914595861</ra> <decl>17.092461484168</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.0762</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0131072</period> </triplet> <triplet> 
<peak_power>2168.2409667969</peak_power> <mean_power>0.001165299792774</mean_power> <time>2456904.133184</time> <ra>9.7058932889566</ra> <decl>17.092461483448</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.3419</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0196608</period> </triplet> <triplet> <peak_power>781.69622802734</peak_power> <mean_power>0.001165299792774</mean_power> <time>2456904.133184</time> <ra>9.7058932889566</ra> <decl>17.092461483448</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.3419</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0065536</period> </triplet> <triplet> <peak_power>2168.2409667969</peak_power> <mean_power>0.001165299792774</mean_power> <time>2456904.133184</time> <ra>9.705895118327</ra> <decl>17.092461482727</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.6076</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0131072</period> </triplet> <triplet> <peak_power>1035.8040771484</peak_power> <mean_power>0.001165299792774</mean_power> <time>2456904.133184</time> <ra>9.705895118327</ra> <decl>17.092461482727</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.6076</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0065536</period> </triplet> <triplet> <peak_power>2168.2409667969</peak_power> 
<mean_power>0.001165299792774</mean_power> <time>2456904.1331841</time> <ra>9.7058969476975</ra> <decl>17.092461482007</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.8733</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0065536</period> </triplet> <triplet> <peak_power>1702.4604492188</peak_power> <mean_power>0.0011643208563328</mean_power> <time>2456904.1331839</time> <ra>9.7058914595861</ra> <decl>17.092461484168</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.0762</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0196608</period> </triplet> <triplet> <peak_power>1955.8186035156</peak_power> <mean_power>0.0011643208563328</mean_power> <time>2456904.1331839</time> <ra>9.7058914595861</ra> <decl>17.092461484168</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.0762</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0131072</period> </triplet> <triplet> <peak_power>2170.0639648438</peak_power> <mean_power>0.0011643208563328</mean_power> <time>2456904.133184</time> <ra>9.7058932889566</ra> <decl>17.092461483448</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.3419</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0196608</period> </triplet> <triplet> <peak_power>782.35345458984</peak_power> <mean_power>0.0011643208563328</mean_power> 
<time>2456904.133184</time> <ra>9.7058932889566</ra> <decl>17.092461483448</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.3419</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0065536</period> </triplet> <triplet> <peak_power>2170.0639648438</peak_power> <mean_power>0.0011643208563328</mean_power> <time>2456904.133184</time> <ra>9.705895118327</ra> <decl>17.092461482727</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.6076</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0131072</period> </triplet> <triplet> <peak_power>1036.6749267578</peak_power> <mean_power>0.0011643208563328</mean_power> <time>2456904.133184</time> <ra>9.705895118327</ra> <decl>17.092461482727</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.6076</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0065536</period> </triplet> <triplet> <peak_power>2170.0639648438</peak_power> <mean_power>0.0011643208563328</mean_power> <time>2456904.1331841</time> <ra>9.7058969476975</ra> <decl>17.092461482007</decl> <q_pix>0</q_pix> <freq>1420805664.0625</freq> <detection_freq>1420807021.8733</detection_freq> <barycentric_freq>0</barycentric_freq> <fft_len>8</fft_len> <chirp_rate>40.543618462507</chirp_rate> <rfi_checked>0</rfi_checked> <rfi_found>0</rfi_found> <reserved>0</reserved> <period>0.0065536</period> </triplet> I'll be happy to bundle up the WU and result files and send them off to anyone who'd like to check them out in detail. |
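One way to hunt for a pattern is to let a script list which fields vary and which do not. The following is a minimal sketch, assuming the captured result file contains `<triplet>` blocks like the ones above; `parse_triplets` and `distinct_values` are illustrative names and not part of any SETI@home or BoincLogX tool.

```python
import re
from collections import defaultdict

# Each <triplet> block, and each simple <name>value</name> field inside it
TRIPLET_RE = re.compile(r"<triplet>(.*?)</triplet>", re.DOTALL)
FIELD_RE = re.compile(r"<(\w+)>([^<]*)</\1>")


def parse_triplets(text):
    """Return one dict per <triplet> block, mapping field name -> string value."""
    return [dict(FIELD_RE.findall(body)) for body in TRIPLET_RE.findall(text)]


def distinct_values(triplets):
    """Map each field name to the sorted list of distinct values seen."""
    values = defaultdict(set)
    for triplet in triplets:
        for name, value in triplet.items():
            values[name].add(value)
    return {name: sorted(vals) for name, vals in values.items()}


# Two abbreviated blocks in the same shape as the data posted above
sample = (
    "<triplet> <peak_power>1701.0302734375</peak_power> <fft_len>8</fft_len> "
    "<period>0.0196608</period> </triplet> "
    "<triplet> <peak_power>1954.1756591797</peak_power> <fft_len>8</fft_len> "
    "<period>0.0131072</period> </triplet>"
)
for name, vals in distinct_values(parse_triplets(sample)).items():
    print(name, len(vals), vals)
```

Run against the 14 triplets above, a listing like this would show a single value for `fft_len` and `chirp_rate` and two values for `mean_power`, which narrows down where to look.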
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
OK, a quick look through and those all manifested from the same chirp-fft pair <fft_len>8</fft_len> and have the same minuscule mean power, and calculated detection time. That makes them all part of the same 'event'. Based on the nature of that glitch 'looking' much like a numerical version of a graphical artefact, I would suspect a few things. Through age or manufacturing concerns that silicon may well be operating near capability in terms of frequency &/or heat. Presuming you've checked the latter, with no evidence of overheating etc., then I recommend either a ~0.05V core voltage bump, or a ~20MHz frequency drop. Either would IMO compensate for a combination of the age of the GPU, and mid range performance/price cards being pushed to the frequency limits (with some number of acceptable artefacts) from factory. Memory clock may also be an issue, though I would expect the nature to be more random with memory glitches than core loading. In this case there appears to be a quite specific borderline circuit. |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Okay, I think I've got another one after only about 8 days this time. It's task 3786421579 and, while it is just in an Inconclusive state right now, has the look and smell of one that will go Invalid once the tie-breaking wingman reports, likely in a day or two. Of the 24812 Triplets BoincLogX has saved in sah_boinc_triplets.csv on my Pentium-M host, the strongest has a peak power of 19.57. I think the 782 to 2170 peak power range of those 14 "found" by your processing is probably indicating corrupted data or something similar. If there really were Triplets that strong, IMO they should have been seen at lower chirp rates at less power and caused the task to exit with a result_overflow. Nevertheless, I did grab a copy of the WU directly from the download server and have started offline testing to be sure it isn't some extreme RFI or such, should be done when I awake again. I'll also look at the other values for those Triplets to see if I can spot anything else extraordinary. Joe |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Jason's advice is of course the right thing to do. For the record, offline processing of the WU with ATI OpenCL and CPU did not find any Triplets. I reduced those 14 false Triplets to tabular form, with time converted to seconds from the WU beginning and the difference between detection_freq and freq shown as chirp:

     peak_power         mean_power     time     chirp    period
--------------- ------------------ -------- --------- ---------
1701.0302734375 0.001165299792774  33.47136 1357.0137 0.0196608
1954.1756591797 0.001165299792774  33.47136 1357.0137 0.0131072
2168.2409667969 0.001165299792774  33.48000 1357.2794 0.0196608
 781.6962280273 0.001165299792774  33.48000 1357.2794 0.0065536
2168.2409667969 0.001165299792774  33.48000 1357.5451 0.0131072
1035.8040771484 0.001165299792774  33.48000 1357.5451 0.0065536
2168.2409667969 0.001165299792774  33.48864 1357.8108 0.0065536
1702.4604492188 0.0011643208563328 33.47136 1357.0137 0.0196608
1955.8186035156 0.0011643208563328 33.47136 1357.0137 0.0131072
2170.0639648438 0.0011643208563328 33.48000 1357.2794 0.0196608
 782.3534545898 0.0011643208563328 33.48000 1357.2794 0.0065536
2170.0639648438 0.0011643208563328 33.48000 1357.5451 0.0131072
1036.6749267578 0.0011643208563328 33.48000 1357.5451 0.0065536
2170.0639648438 0.0011643208563328 33.48864 1357.8108 0.0065536

That indicates there were actually only 7 Triplets, but they were in the 50% overlap between two successive arrays so each was reported twice. Since the mean power of the second array was slightly lower, the peak power ratios were slightly higher there. About the only thing which can be certain is that the cause was not a single bit flip: the peak power is the strongest of the 3 peaks within a reported triplet, so there are at least 5 peaks which were abnormal. Joe |
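Joe's reading of the table can be checked with a short script. The sketch below hard-codes the 14 rows he posted, pairs rows that share the same time, chirp offset, and period, and compares the peak-power ratio within each pair with the ratio of the two mean powers. The names `ROWS` and `pair_duplicates` are illustrative only.

```python
from collections import defaultdict

# (peak_power, mean_power, time_s, chirp_offset_hz, period), copied from Joe's table
ROWS = [
    (1701.0302734375, 0.001165299792774, 33.47136, 1357.0137, 0.0196608),
    (1954.1756591797, 0.001165299792774, 33.47136, 1357.0137, 0.0131072),
    (2168.2409667969, 0.001165299792774, 33.48000, 1357.2794, 0.0196608),
    (781.6962280273, 0.001165299792774, 33.48000, 1357.2794, 0.0065536),
    (2168.2409667969, 0.001165299792774, 33.48000, 1357.5451, 0.0131072),
    (1035.8040771484, 0.001165299792774, 33.48000, 1357.5451, 0.0065536),
    (2168.2409667969, 0.001165299792774, 33.48864, 1357.8108, 0.0065536),
    (1702.4604492188, 0.0011643208563328, 33.47136, 1357.0137, 0.0196608),
    (1955.8186035156, 0.0011643208563328, 33.47136, 1357.0137, 0.0131072),
    (2170.0639648438, 0.0011643208563328, 33.48000, 1357.2794, 0.0196608),
    (782.3534545898, 0.0011643208563328, 33.48000, 1357.2794, 0.0065536),
    (2170.0639648438, 0.0011643208563328, 33.48000, 1357.5451, 0.0131072),
    (1036.6749267578, 0.0011643208563328, 33.48000, 1357.5451, 0.0065536),
    (2170.0639648438, 0.0011643208563328, 33.48864, 1357.8108, 0.0065536),
]


def pair_duplicates(rows):
    """Group rows by (time, chirp offset, period); repeated reports share a key."""
    groups = defaultdict(list)
    for peak, mean, time_s, chirp, period in rows:
        groups[(time_s, chirp, period)].append((peak, mean))
    return groups


groups = pair_duplicates(ROWS)
print(len(groups), "distinct triplets")
for key, members in sorted(groups.items()):
    (peak_a, mean_a), (peak_b, mean_b) = members
    # If peak power is reported relative to the array's mean power,
    # the two ratios printed here should agree closely.
    print(key, round(peak_b / peak_a, 6), round(mean_a / mean_b, 6))
```

On these rows the grouping yields 7 distinct triplets, and both ratios come out close to 1.0008 for every pair, which fits the explanation that each triplet was reported once per overlapping array.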
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
OK, a quick look through and those all manifested from the same chirp-fft pair I had just composed this reply yesterday when the site crashed. So, ~22 hours later... Without understanding (or even trying to understand) the electronic intricacies of a GPU, a "quite specific borderline circuit" certainly sounds reasonable to me, inasmuch as it seems to be something that only manifests itself on rare occasions, and perhaps only under very specific conditions. In any event, I tried dropping the core clock by 20MHz (using Precision X) but found that it won't let me lower it below the 970MHz base frequency. Raising the voltage by 0.05V would apparently put it at the red line maximum of 1.15V, so I took a more conservative route (for now), and just raised it by 0.012V to 1.112V. That's still in the "orange" zone, but didn't seem to break anything so I'll let it stay there and wait to see what the future brings! Unfortunately, because this problem is so infrequent, I don't know how long a wait would be necessary to say for certain that it went away. It could lurk for many weeks. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Jason's advice is of course the right thing to do. Thanks for taking the time to make that test and that thorough analysis, Joe. While I won't pretend to understand the technical details, the end result does seem to point to a GPU issue at my end, although your note that "the cause was not a single bit flip" leaves me wondering what I might actually be facing here. As I noted in my earlier reply to Jason (which would have been posted about 22 hours ago had not the site crashed), I've boosted the GPU voltage slightly and will continue to monitor the box (which is currently out of GPU work, however, until enough backlogged tasks get uploaded to allow downloads to resume). |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
...leaves me wondering what I might actually be facing here... Basically the same as the white dot graphical artefacts you'd get overclocking a GPU for gaming, at which point backing off at least two 'notches' is the general corrective method. White dots in graphical glitches imply ~24-32 bits of saturation (bits flipped to on), so yeah many more than a single bit flip, though typically tied to single memory fetches. As with overclocking, it happens with 'gaming grade' GPUs sometimes from factory due to price/performance market pressures, and the non-critical nature of a rare white dot, when parts binning and setting clocks for mid range GPUs, where the competition is steepest. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Basically the same as the white dot graphical artefacts you'd get overclocking a GPU for gaming, at which point backing off at least two 'notches' is the general corrective method. White dots in graphical glitches imply ~24-32 bits of saturation (bits flipped to on), so yeah many more than a single bit flip, though typically tied to single memory fetches. Okay, thanks, Jason. Not being a gamer, I don't really have any experience with overclocking and artifacts ("white dot" or otherwise). I just run my cards at whatever clock rate they come with, so unless they're "factory" overclocked, they run at their design frequency. For that 550Ti, the base rate for the core clock seems to be 970MHz, and Precision X doesn't seem to want to let me drop it any further, so that's why I went with the voltage adjustment, though as I mentioned, I didn't want to push it to the maximum right away. Thanks again for all your advice and insight. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Well, I got another one already, task 3791593451, which found 24 triplets where the wingmen found none. So I guess my incremental voltage increase didn't have any effect. I've gone ahead now and bumped it up the full 0.05v per Jason's recommendation, to 1.150v, which appears to be the maximum for the card. At this point, I guess I don't mind if the card fries. That'll just be my excuse to buy another GTX 750Ti (which I figure would probably pay for itself in about 8 months in reduced electric costs). :^) |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
So much for my voltage experiment! I noticed that my RAC on that machine had been slowly declining for the past couple weeks. I didn't think too much about it, since RAC tends to be fairly volatile (no news there). However, I noticed this evening the slow decline had turned into a plummet over the weekend (down over 21% from 2 weeks ago). The machine still seemed to be running okay, with no new errors or invalids, so I thought that perhaps the commode build, which I installed on October 15, was taking longer to run, resulting in a reduced throughput. However, comparing MB run times by angle range before and after October 15 didn't seem to show any significant differences. Until I got to October 24, that is. Suddenly the run times quadrupled. That really got my attention, so I just went and took a closer look at the machine. I found that the GPU clock, which normally runs at 970MHz, was limping along at only 405MHz. And the voltage, which I had carefully increased to 1.15v on October 22, was now at 0.95v. Not good! I don't know if the increased voltage itself was the problem, or whether Precision X had introduced some other instability into the system. In any event, I disabled Precision X and rebooted the machine. It returned the clock to 970MHz but still maintained the voltage at 1.15v. I had to run Precision X again and tell it to reset all GPU defaults. Then I closed Precision X again, probably permanently. Hopefully, it will run smoothly from here on. I'll just have to live with the occasional phantom triplets, I think. It's definitely starting to feel like another GTX 750Ti in my future. |
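The before-and-after comparison described above can be sketched as follows. This is an illustration only: the sample records are made up, the 0.1-wide angle-range buckets are an arbitrary choice, and `median_by_bucket` is a hypothetical helper, not something BOINC or the SETI@home site provides.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def median_by_bucket(tasks, cutoff, bucket_width=0.1):
    """Return {bucket: (median_before, median_after)} of run times split at `cutoff`.

    `tasks` is an iterable of (completion_date, angle_range, run_seconds).
    Buckets without data on both sides of the cutoff are omitted.
    """
    before, after = defaultdict(list), defaultdict(list)
    for day, angle_range, seconds in tasks:
        bucket = int(angle_range / bucket_width)
        (before if day < cutoff else after)[bucket].append(seconds)
    return {
        b: (median(before[b]), median(after[b]))
        for b in sorted(before.keys() & after.keys())
    }


# Made-up sample data, for illustration only
tasks = [
    (date(2014, 10, 20), 0.42, 900), (date(2014, 10, 21), 0.44, 920),
    (date(2014, 10, 25), 0.43, 3700), (date(2014, 10, 26), 0.41, 3600),
]
for bucket, (old, new) in median_by_bucket(tasks, date(2014, 10, 24)).items():
    print(f"bucket {bucket}: {old:.0f}s -> {new:.0f}s (x{new / old:.1f})")
```

Comparing within angle-range buckets matters because run time depends on angle range, so a change in the mix of tasks could otherwise look like a slowdown. Moving the cutoff date day by day shows when the jump first appears.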
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
That sounds a lot like DPC latency issues resulting in driver failsafe. Is CUDA multibeam the only app that runs on there? (Different but related issue: I can't vouch for the thread safety of any app other than x41zc builds, lack of which can easily cause driver 'sticky downclocks'.) |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
That sounds a lot like DPC latency issues resulting in driver failsafe, Is Cuda multibeam the only app that runs on there ? (different but related issue) No, it runs AP tasks, too. It will run either 2 MBs or 1 MB + 1 AP on the GPU. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
That sounds a lot like DPC latency issues resulting in driver failsafe, Is Cuda multibeam the only app that runs on there ? (different but related issue) OK, you'll need to consult the AP app author. As per my edit, I can't vouch for the thread safety of any app other than x41zc builds, lack of which can easily cause driver 'sticky downclocks', alongside the possible system driver DPC issues. |