Phantom Triplets

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1575841 - Posted: 22 Sep 2014, 3:47:51 UTC

In the spirit of "You never know what you'll find when you start pulling on a loose thread", I just ran across an example of another problem that's been going on for a long time. I decided to start preemptively watching my "Inconclusive" tasks on that machine, to try to catch these phantom Triplets while the result file might still be available to somebody. I just checked the 4 new ones that appeared today and, while I didn't catch any phantom Triplets, I did find one WU where my task got marked Inconclusive while my first wingman's got immediately marked Invalid. Initially, I assumed that this was an example of the ongoing "-9 overflow with truncated Stderr" problem for which the fix still hasn't been implemented. However, on closer inspection, this wingman turned out to be one of those who is still trying to run v7 tasks through a v6 app, specifically:

setiathome_enhanced 6.11 $Revision: 850 $ g++ (Ubuntu/Linaro 4.5.1-8ubuntu2~ppa1) 4.5.1

Of course, since the host has an Anonymous user, under the present setup the only people who could contact him to set him straight are the project admins. In the meantime, that host shows: Invalid (643) · Error (52)

Not good!

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1575958 - Posted: 22 Sep 2014, 11:17:59 UTC - in response to Message 1575841.  
Last modified: 22 Sep 2014, 11:18:24 UTC

It's a new host id for the host mentioned in this thread:

Anonymous owner of computer 5940769

Claggy

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1576279 - Posted: 22 Sep 2014, 23:54:18 UTC

I ran some GPU memory test programs today: first one called OCCT, and then a couple from the Folding@Home Utilities page, MemtestG80 and MemtestCL. I wasn't terribly impressed by the OCCT program, but the other two seem to be reasonable facsimiles of Memtest86, adapted for the GPU. MemtestG80 is just for NVIDIA CUDA-enabled GPUs, while MemtestCL can run on both NVIDIA and ATI OpenCL cards.

None of them detected any errors on the GTX 550 Ti, even after many iterations and several hours. However, the programs seem to have some limitations in regard to the maximum amount of memory they can test. The max I could get MemtestG80 to look at was 680MB out of the 1024MB on the GPU, even though GPU-Z only reported 81MB being in use prior to running the test. MemtestCL, however, was able to test 924MB under the same conditions. The advantage of MemtestG80 is that it runs about 8-10 times faster than MemtestCL (which took about 2.5 hours to test the 924MB for 50 iterations of its 13 different test schemes).

Of course, the absence of errors doesn't really prove that there isn't a weak bit lurking somewhere in there but, for now, I think I'll just let it ride. I've got BoincLogX running to capture Result files, so if the phantom Triplets show up again, perhaps there will be some more evidence available to help pin it down.

shizaru
Volunteer tester
Joined: 14 Jun 04
Posts: 1130
Credit: 1,967,904
RAC: 0
Greece
Message 1576292 - Posted: 23 Sep 2014, 0:12:18 UTC - in response to Message 1575812.  
Last modified: 23 Sep 2014, 0:13:56 UTC

That's not to say PG&E's electric service is entirely predictable, though. Sometimes they make California feel like a third world country, with random outages that seem to have no external cause (i.e., perfectly bright, sunny day or calm, clear night, no car meeting power pole, but suddenly, no juice).


Prepare to be thoroughly depressed at the answer to that :)

Enron: The Smartest Guys in the Room (2005)

Well, although the Enron debacle has cost us ratepayers a whole lot of money (thanks to our brain-dead politicians), the (un)reliability issue is something that plagued us here long before those "smart guys" came along! :^)


Thanx for inadvertently reminding me about this doc. Sat up until 5 in the morning watching the whole thing again right after posting (first time I saw it was many, many years ago). It's really good, it's one of those rare impartial ones (well as much as humanly possible anyway). I'll borrow Ebert's take on it: "This is not a political documentary. It is a crime story. No matter what your politics, Enron: The Smartest Guys in the Room will make you mad". And a bit sad I'd add.

Don't want to spoil it for anyone in case anybody actually ever wants to watch it, but I will say that apparently PG&E has no say in where & how their power is distributed. How & why Californian blackouts occur is painfully explained in great detail halfway through the film.

(Since it's a documentary about Enron, it only focuses on the one company obviously but there are more "players" on the market apparently pulling the same stunts)

http://www.pbs.org/independentlens/enron/film.html

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1584040 - Posted: 9 Oct 2014, 17:46:41 UTC - in response to Message 1576279.  

Finally had another one of these show up this morning, about 17 days after the last one. It's WU #1608838137, which my host completed and reported at about 11:51 PM local time on October 5. My machine thinks it found 24 Triplets, whereas both wingmen found none.

Unfortunately, when I checked the result file that BoincLogX captured, I found that it only contained the workunit_header information and none of the actual result data. Checking other result files captured around the same time, I found that some did contain result data and others didn't. I have to assume the BoincLogX monitoring interval is to blame, since I hadn't thought to change it from the 15 second default. Sigh... So now I've changed the monitoring interval to 5 seconds and will just have to wait and see if that does the trick the next time the phantom triplets show up (since I doubt if the problem will just go away on its own, even though it's pretty rare).
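
For illustration only: one generic way a polling capture tool can avoid grabbing a half-written file is to wait until the file size stops changing between polls before copying it. The sketch below shows that idea under stated assumptions. It is not how BoincLogX works internally, and the function name and parameters (`copy_when_stable`, `interval`, `checks`, `timeout`) are made up for the example.

```python
import os
import shutil
import time


def copy_when_stable(src, dst, interval=5.0, checks=2, timeout=60.0):
    """Copy src to dst once its size is unchanged for `checks` consecutive polls.

    Returns True if a copy was made, False if the file disappeared
    (for example, uploaded and deleted) or the timeout expired first.
    All names and defaults here are hypothetical.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(src)
        except OSError:
            return False  # file is gone before we could capture it
        if size == last_size:
            stable += 1
            if stable >= checks:
                shutil.copyfile(src, dst)
                return True
        else:
            # Size changed since the last poll, so the writer is still active.
            stable = 0
            last_size = size
        time.sleep(interval)
    return False
```

The trade-off is the one described above: a longer interval makes it more likely that the file is either caught mid-write or already gone by the next poll, while a shorter interval costs a little more polling overhead.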

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1584231 - Posted: 9 Oct 2014, 21:51:09 UTC - in response to Message 1584040.  
Last modified: 9 Oct 2014, 22:00:40 UTC

The result file truncation 'smells' a bit like the boincapi thread safety issues might be involved (similar to the truncated/missing stderr cases). Whether that's just a symptom, or is somehow acting as a cause, is a totally different, not easily answered, question I guess.

Though the time between events is long, since you are running under anonymous platform, could you switch to one of my commode.obj enabled builds? (These are intended to test one kind of workaround for boinc thread safety problems.) If the same extra triplet issues still appear (different cause), then at least stderr and result content should be complete with this build, improving the likelihood of a firmer diagnosis (such as a driver synchronisation issue lurking).

If they stop appearing with this build, I would investigate potential host DPC (driver/software-interrupt) latency issues, which can manifest from obscure driver and software quality issues, with chipset/RAID/SATA drivers being one possible suspect of many.

Commode.obj enabled builds are provided for diagnostic purposes at:
http://jgopt.org/download.html
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1584358 - Posted: 10 Oct 2014, 3:06:28 UTC - in response to Message 1584231.  

Thanks for the suggestion, Jason. Actually, my thinking is that even though the result file that BoincLogX captured was truncated, the file that was actually uploaded was probably complete. Richard Haselgrove had warned me (when we were first researching the truncated Stderr / immediate invalid issue last year) that BoincLogX needed a fairly aggressive monitoring interval in order to ensure capturing all result files. Unfortunately, I had forgotten about that until this morning, and had left it at the default interval of 15 seconds. I've now lowered it to 5 seconds, which seemed to work well last year.

Currently, the 3 machines that I have running Anonymous Platform have plain vanilla app_info files and I don't think I want to tinker with any of them until Richard releases the new installer for the AP v7 release and I get those machines updated. Once that's done, I'll look into your commode builds, especially since it looks like Eric is never going to bother to implement that simple server-side fix that Joe Segur came up with for the Immediate Invalid problem (which still occurs with depressing regularity, my latest one being WU 1609856673).

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1588565 - Posted: 18 Oct 2014, 3:28:33 UTC

Okay, I think I've got another one, after only about 8 days this time. It's task 3786421579 and, while it is just in an Inconclusive state right now, it has the look and smell of one that will go Invalid once the tie-breaking wingman reports, likely in a day or two.

Where my host found 1 Pulse and 14 Triplets, my initial wingman found 1 Pulse and 0 Triplets. (Of course, I suppose it could be the wingman who got it wrong, but since his machine looks clean and he happens to be today's "User of the Day", I feel comfortable giving him the benefit of the doubt. :^))

Whether thanks to the shorter monitoring interval for BoincLogX, or to Jason's commode.obj build which I installed a couple days ago, it appears that I captured a complete result file this time. I thought perhaps some sort of pattern might jump out at me, but nothing grabs my untrained eye, so here's the results for all 14 triplets, in the hopes that something useful can be found that might point to a possible cause:

<triplet>
  <peak_power>1701.0302734375</peak_power>
  <mean_power>0.001165299792774</mean_power>
  <time>2456904.1331839</time>
  <ra>9.7058914595861</ra>
  <decl>17.092461484168</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.0762</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0196608</period>
</triplet>
<triplet>
  <peak_power>1954.1756591797</peak_power>
  <mean_power>0.001165299792774</mean_power>
  <time>2456904.1331839</time>
  <ra>9.7058914595861</ra>
  <decl>17.092461484168</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.0762</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0131072</period>
</triplet>
<triplet>
  <peak_power>2168.2409667969</peak_power>
  <mean_power>0.001165299792774</mean_power>
  <time>2456904.133184</time>
  <ra>9.7058932889566</ra>
  <decl>17.092461483448</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.3419</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0196608</period>
</triplet>
<triplet>
  <peak_power>781.69622802734</peak_power>
  <mean_power>0.001165299792774</mean_power>
  <time>2456904.133184</time>
  <ra>9.7058932889566</ra>
  <decl>17.092461483448</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.3419</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0065536</period>
</triplet>
<triplet>
  <peak_power>2168.2409667969</peak_power>
  <mean_power>0.001165299792774</mean_power>
  <time>2456904.133184</time>
  <ra>9.705895118327</ra>
  <decl>17.092461482727</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.6076</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0131072</period>
</triplet>
<triplet>
  <peak_power>1035.8040771484</peak_power>
  <mean_power>0.001165299792774</mean_power>
  <time>2456904.133184</time>
  <ra>9.705895118327</ra>
  <decl>17.092461482727</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.6076</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0065536</period>
</triplet>
<triplet>
  <peak_power>2168.2409667969</peak_power>
  <mean_power>0.001165299792774</mean_power>
  <time>2456904.1331841</time>
  <ra>9.7058969476975</ra>
  <decl>17.092461482007</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.8733</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0065536</period>
</triplet>
<triplet>
  <peak_power>1702.4604492188</peak_power>
  <mean_power>0.0011643208563328</mean_power>
  <time>2456904.1331839</time>
  <ra>9.7058914595861</ra>
  <decl>17.092461484168</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.0762</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0196608</period>
</triplet>
<triplet>
  <peak_power>1955.8186035156</peak_power>
  <mean_power>0.0011643208563328</mean_power>
  <time>2456904.1331839</time>
  <ra>9.7058914595861</ra>
  <decl>17.092461484168</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.0762</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0131072</period>
</triplet>
<triplet>
  <peak_power>2170.0639648438</peak_power>
  <mean_power>0.0011643208563328</mean_power>
  <time>2456904.133184</time>
  <ra>9.7058932889566</ra>
  <decl>17.092461483448</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.3419</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0196608</period>
</triplet>
<triplet>
  <peak_power>782.35345458984</peak_power>
  <mean_power>0.0011643208563328</mean_power>
  <time>2456904.133184</time>
  <ra>9.7058932889566</ra>
  <decl>17.092461483448</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.3419</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0065536</period>
</triplet>
<triplet>
  <peak_power>2170.0639648438</peak_power>
  <mean_power>0.0011643208563328</mean_power>
  <time>2456904.133184</time>
  <ra>9.705895118327</ra>
  <decl>17.092461482727</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.6076</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0131072</period>
</triplet>
<triplet>
  <peak_power>1036.6749267578</peak_power>
  <mean_power>0.0011643208563328</mean_power>
  <time>2456904.133184</time>
  <ra>9.705895118327</ra>
  <decl>17.092461482727</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.6076</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0065536</period>
</triplet>
<triplet>
  <peak_power>2170.0639648438</peak_power>
  <mean_power>0.0011643208563328</mean_power>
  <time>2456904.1331841</time>
  <ra>9.7058969476975</ra>
  <decl>17.092461482007</decl>
  <q_pix>0</q_pix>
  <freq>1420805664.0625</freq>
  <detection_freq>1420807021.8733</detection_freq>
  <barycentric_freq>0</barycentric_freq>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
  <rfi_checked>0</rfi_checked>
  <rfi_found>0</rfi_found>
  <reserved>0</reserved>
  <period>0.0065536</period>
</triplet>

I'll be happy to bundle up the WU and result files and send them off to anyone who'd like to check them out in detail.

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1588603 - Posted: 18 Oct 2014, 6:34:47 UTC

OK, a quick look through and those all manifested from the same chirp-fft pair
<fft_len>8</fft_len>
<chirp_rate>40.543618462507</chirp_rate>

and have the same minuscule mean power, and calculated detection time. That makes them all part of the same 'event'.
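
For anyone who wants to repeat that check on a captured result file, here is a minimal sketch that counts triplets per (fft_len, chirp_rate) pair. It assumes the `<triplet>` blocks look like the ones posted above; wrapping them in a synthetic `<signals>` root is my own workaround so the fragment parses as one XML document, and the function name is hypothetical. The sample is abbreviated to two triplets with only the fields the function reads.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_by_chirp_fft(xml_fragment):
    """Count <triplet> entries per (fft_len, chirp_rate) pair."""
    # A bare list of <triplet> elements has no single root, so add one.
    root = ET.fromstring("<signals>" + xml_fragment + "</signals>")
    counts = Counter()
    for triplet in root.iter("triplet"):
        key = (int(triplet.findtext("fft_len")),
               float(triplet.findtext("chirp_rate")))
        counts[key] += 1
    return counts


# Two of the posted triplets, trimmed to the relevant fields.
sample = """
<triplet>
  <peak_power>1701.0302734375</peak_power>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
</triplet>
<triplet>
  <peak_power>1954.1756591797</peak_power>
  <fft_len>8</fft_len>
  <chirp_rate>40.543618462507</chirp_rate>
</triplet>
"""

counts = count_by_chirp_fft(sample)
for (fft_len, chirp_rate), n in counts.items():
    print(f"fft_len={fft_len} chirp_rate={chirp_rate}: {n} triplet(s)")
```

If every reported triplet lands in a single bucket, as it does for the 14 posted here, that supports treating them as one event rather than 14 independent detections.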

Based on the nature of that glitch 'looking' much like a numerical version of a graphical artefact, I would suspect a few things. Through age or manufacturing concerns, that silicon may well be operating near capability in terms of frequency &/or heat. Presuming you've checked the latter and found no evidence of overheating etc., I recommend either a ~0.05V core voltage bump or a ~20MHz frequency drop. Either would IMO compensate for a combination of the age of the GPU, and mid range performance/price cards being pushed to the frequency limits (with some number of acceptable artefacts) from factory. Memory clock may also be an issue, though I would expect the nature to be more random with memory glitches than core loading. In this case there appears to be a quite specific borderline circuit.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1588616 - Posted: 18 Oct 2014, 7:16:50 UTC - in response to Message 1588565.  

Of the 24812 Triplets BoincLogX has saved in sah_boinc_triplets.csv on my Pentium-M host, the strongest has a peak power of 19.57. I think the 782 to 2170 peak power range of those 14 "found" by your processing is probably indicating corrupted data or something similar. If there really were Triplets that strong, IMO they should have been seen at lower chirp rates at less power and caused the task to exit with a result_overflow.
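
To put numbers on that comparison, here is a small sketch that divides each reported peak power by the 19.57 reference maximum quoted above. The peak values are copied from the posted result; the function and constant names are just for illustration.

```python
# Strongest peak power among the 24812 Triplets saved on the Pentium-M host.
REFERENCE_MAX = 19.57

# peak_power of the 14 reported Triplets, in posted order.
PEAKS = [
    1701.0302734375, 1954.1756591797, 2168.2409667969, 781.69622802734,
    2168.2409667969, 1035.8040771484, 2168.2409667969,
    1702.4604492188, 1955.8186035156, 2170.0639648438, 782.35345458984,
    2170.0639648438, 1036.6749267578, 2170.0639648438,
]


def ratios_to_reference(peaks, reference_max=REFERENCE_MAX):
    """Return each peak power as a multiple of the reference maximum."""
    return [p / reference_max for p in peaks]


ratios = ratios_to_reference(PEAKS)
print(f"weakest: {min(ratios):.1f}x reference, strongest: {max(ratios):.1f}x reference")
```

Even the weakest of the 14 comes out at roughly 40 times the strongest Triplet in that reference set, and the strongest at roughly 111 times.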

Nevertheless, I did grab a copy of the WU directly from the download server and have started offline testing to be sure it isn't some extreme RFI or such, should be done when I awake again. I'll also look at the other values for those Triplets to see if I can spot anything else extraordinary.
Joe

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1588832 - Posted: 19 Oct 2014, 16:15:27 UTC

Jason's advice is of course the right thing to do.

For the record, offline processing of the WU with ATI OpenCL and CPU did not find any Triplets.

I reduced those 14 false Triplets to tabular form, with time converted to seconds from the WU beginning and the difference between detection_freq and freq shown as chirp:

peak_power       mean_power          time      chirp      period
---------------  ------------------  --------  ---------  ---------
1701.0302734375  0.001165299792774   33.47136  1357.0137  0.0196608
1954.1756591797  0.001165299792774   33.47136  1357.0137  0.0131072
2168.2409667969  0.001165299792774   33.48000  1357.2794  0.0196608
 781.6962280273  0.001165299792774   33.48000  1357.2794  0.0065536
2168.2409667969  0.001165299792774   33.48000  1357.5451  0.0131072
1035.8040771484  0.001165299792774   33.48000  1357.5451  0.0065536
2168.2409667969  0.001165299792774   33.48864  1357.8108  0.0065536
1702.4604492188  0.0011643208563328  33.47136  1357.0137  0.0196608
1955.8186035156  0.0011643208563328  33.47136  1357.0137  0.0131072
2170.0639648438  0.0011643208563328  33.48000  1357.2794  0.0196608
 782.3534545898  0.0011643208563328  33.48000  1357.2794  0.0065536
2170.0639648438  0.0011643208563328  33.48000  1357.5451  0.0131072
1036.6749267578  0.0011643208563328  33.48000  1357.5451  0.0065536
2170.0639648438  0.0011643208563328  33.48864  1357.8108  0.0065536


That indicates there were actually only 7 Triplets, but they were in the 50% overlap between two successive arrays so each was reported twice. Since the mean power of the second array was slightly lower, the peak power ratios were slightly higher there. About the only thing which can be certain is that the cause was not a single bit flip: the peak power is the strongest of the 3 peaks within a reported triplet, so there are at least 5 peaks which were abnormal.
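
The pairing can be checked mechanically. The sketch below groups the 14 reports by (detection_freq minus freq, period) and compares the peak power ratio within each pair against the inverse ratio of the two mean powers. The values are copied from the posted result; reading peak_power as a ratio to the array's mean power is an assumption on my part, and the function name is hypothetical. The time column is left out because converting it needs the WU start time, which is not in the posted data.

```python
from collections import defaultdict

FREQ = 1420805664.0625  # <freq>, identical in all 14 reports

# (peak_power, mean_power, detection_freq, period), in posted order.
REPORTS = [
    (1701.0302734375, 0.001165299792774, 1420807021.0762, 0.0196608),
    (1954.1756591797, 0.001165299792774, 1420807021.0762, 0.0131072),
    (2168.2409667969, 0.001165299792774, 1420807021.3419, 0.0196608),
    (781.69622802734, 0.001165299792774, 1420807021.3419, 0.0065536),
    (2168.2409667969, 0.001165299792774, 1420807021.6076, 0.0131072),
    (1035.8040771484, 0.001165299792774, 1420807021.6076, 0.0065536),
    (2168.2409667969, 0.001165299792774, 1420807021.8733, 0.0065536),
    (1702.4604492188, 0.0011643208563328, 1420807021.0762, 0.0196608),
    (1955.8186035156, 0.0011643208563328, 1420807021.0762, 0.0131072),
    (2170.0639648438, 0.0011643208563328, 1420807021.3419, 0.0196608),
    (782.35345458984, 0.0011643208563328, 1420807021.3419, 0.0065536),
    (2170.0639648438, 0.0011643208563328, 1420807021.6076, 0.0131072),
    (1036.6749267578, 0.0011643208563328, 1420807021.6076, 0.0065536),
    (2170.0639648438, 0.0011643208563328, 1420807021.8733, 0.0065536),
]


def pair_up(reports, freq=FREQ):
    """Group reports that share the same frequency offset and period."""
    groups = defaultdict(list)
    for peak, mean, detection_freq, period in reports:
        offset = round(detection_freq - freq, 4)  # the "chirp" column
        groups[(offset, period)].append((peak, mean))
    return groups


groups = pair_up(REPORTS)
print(len(groups), "distinct Triplets")
for (offset, period), hits in sorted(groups.items()):
    (peak1, mean1), (peak2, mean2) = hits  # first array, second array
    print(offset, period, round(peak2 / peak1, 6), round(mean1 / mean2, 6))
```

Under that assumption the two ratios should agree closely in every pair, which is what you would expect if the same raw peaks were simply normalised against two slightly different array means.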
Joe

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1588835 - Posted: 19 Oct 2014, 16:32:59 UTC - in response to Message 1588603.  

I had just composed this reply yesterday when the site crashed. So, ~22 hours later...

Without understanding (or even trying to understand) the electronic intricacies of a GPU, a "quite specific borderline circuit" certainly sounds reasonable to me, inasmuch as it seems to be something that only manifests itself on rare occasions, and perhaps only under very specific conditions.

In any event, I tried dropping the core clock by 20MHz (using Precision X) but found that it won't let me lower it below the 970MHz base frequency. Raising the voltage by 0.05V would apparently put it at the red line maximum of 1.15V, so I took a more conservative route (for now), and just raised it by 0.012V to 1.112V. That's still in the "orange" zone, but didn't seem to break anything so I'll let it stay there and wait to see what the future brings! Unfortunately, because this problem is so infrequent, I don't know how long a wait would be necessary to say for certain that it went away. It could lurk for many weeks.
ID: 1588835 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1588841 - Posted: 19 Oct 2014, 16:42:17 UTC - in response to Message 1588832.  

Jason's advice is of course the right thing to do.

For the record, offline processing of the WU with ATI OpenCL and CPU did not find any Triplets.

I reduced those 14 false Triplets to tabular form, with time converted to seconds from the WU beginning and the difference between detected_freq and freq shown as chirp:

...

That indicates there were actually only 7 Triplets, but they were in the 50% overlap between two successive arrays so each was reported twice. Since the mean power of the second array was slightly lower, the peak power ratios were slightly higher there. About the only thing which can be certain is that the cause was not a single bit flip: the peak power is the strongest of the 3 peaks within a reported triplet, so there are at least 5 peaks which were abnormal.
                                                                  Joe

Thanks for taking the time to make that test and that thorough analysis, Joe. While I won't pretend to understand the technical details, the end result does seem to point to a GPU issue at my end, although your note that "the cause was not a single bit flip" leaves me wondering what I might actually be facing here.

As I noted in my earlier reply to Jason (which would have been posted about 22 hours ago had the site not crashed), I've boosted the GPU voltage slightly and will continue to monitor the box (which is currently out of GPU work, however, until enough backlogged tasks get uploaded to allow downloads to resume).
ID: 1588841 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1589048 - Posted: 20 Oct 2014, 1:11:28 UTC - in response to Message 1588841.  
Last modified: 20 Oct 2014, 1:22:34 UTC

...leaves me wondering what I might actually be facing here...


Basically the same as the white dot graphical artefacts you'd get overclocking a GPU for gaming, at which point backing off at least two 'notches' is the general corrective method. White dots in graphical glitches imply ~24-32 bits of saturation (bits flipped to on), so yeah, many more than a single bit flip, though typically tied to single memory fetches.

As with overclocking, it sometimes happens with 'gaming grade' GPUs straight from the factory. Price/performance market pressures and the non-critical nature of a rare white dot both influence parts binning and clock settings for mid range GPUs, where the competition is steepest.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1589048 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1589065 - Posted: 20 Oct 2014, 1:41:20 UTC - in response to Message 1589048.  

Basically the same as the white dot graphical artefacts you'd get overclocking a GPU for gaming, at which point backing off at least two 'notches' is the general corrective method. White dots in graphical glitches imply ~24-32 bits of saturation (bits flipped to on), so yeah many more than a single bit flip, though typically tied to single memory fetches.

As with overclocking, it sometimes happens with 'gaming grade' GPUs straight from the factory. Price/performance market pressures and the non-critical nature of a rare white dot both influence parts binning and clock settings for mid range GPUs, where the competition is steepest.

Okay, thanks, Jason. Not being a gamer, I don't really have any experience with overclocking and artifacts ("white dot" or otherwise). I just run my cards at whatever clock rate they come with, so unless they're "factory" overclocked, they run at their design frequency. For that 550Ti, the base rate for the core clock seems to be 970MHz, and Precision X doesn't seem to want to let me drop it any further, so that's why I went with the voltage adjustment, though as I mentioned, I didn't want to push it to the maximum right away.

Thanks again for all your advice and insight.
ID: 1589065 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1590418 - Posted: 22 Oct 2014, 22:56:50 UTC

Well, I got another one already, task 3791593451, which found 24 triplets where the wingmen found none. So I guess my incremental voltage increase didn't have any effect. I've gone ahead now and bumped it up the full 0.05v per Jason's recommendation, to 1.150v, which appears to be the maximum for the card. At this point, I guess I don't mind if the card fries. That'll just be my excuse to buy another GTX 750Ti (which I figure would probably pay for itself in about 8 months in reduced electric costs). :^)
ID: 1590418 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1592774 - Posted: 27 Oct 2014, 4:55:31 UTC

So much for my voltage experiment! I noticed that my RAC on that machine had been slowly declining for the past couple weeks. I didn't think too much about it, since RAC tends to be fairly volatile (no news there). However, I noticed this evening the slow decline had turned into a plummet over the weekend (down over 21% from 2 weeks ago).

The machine still seemed to be running okay, with no new errors or invalids, so I thought that perhaps the commode build, which I installed on October 15, was taking longer to run, resulting in a reduced throughput. However, comparing MB run times by angle range before and after October 15 didn't seem to show any significant differences. Until I got to October 24, that is. Suddenly the run times quadrupled. That really got my attention, so I just went and took a closer look at the machine. I found that the GPU clock, which normally runs at 970MHz, was limping along at only 405MHz. And the voltage, which I had carefully increased to 1.15v on October 22, was now at 0.95v. Not good!
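
For anyone wanting to automate that kind of before/after comparison, here is a minimal Python sketch. It assumes the tasks have already been filtered to a similar angle range, and every date and run time in it is made up for illustration (none of it comes from the host's real task list). The function names and the 2x threshold are likewise just assumptions for the sketch.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def daily_medians(tasks):
    """Return {day: median run time in seconds} for (day, run_time_s) pairs."""
    by_day = defaultdict(list)
    for day, run_time_s in tasks:
        by_day[day].append(run_time_s)
    return {day: median(times) for day, times in sorted(by_day.items())}


def flag_slow_days(tasks, baseline_days, factor=2.0):
    """List days whose median run time exceeds factor x the baseline median.

    baseline_days must be days that appear in tasks; factor is an
    arbitrary threshold chosen for this sketch.
    """
    medians = daily_medians(tasks)
    baseline = median(medians[d] for d in baseline_days)
    return [day for day, m in medians.items() if m > factor * baseline]


# Hypothetical run times (seconds) for tasks of a similar angle range.
tasks = [
    (date(2014, 10, 22), 600.0),
    (date(2014, 10, 22), 620.0),
    (date(2014, 10, 23), 610.0),
    (date(2014, 10, 24), 2400.0),
    (date(2014, 10, 24), 2500.0),
    (date(2014, 10, 25), 2450.0),
]

flagged = flag_slow_days(
    tasks, baseline_days=[date(2014, 10, 22), date(2014, 10, 23)]
)
print("Days with unusually long run times:", flagged)
```

Using the median rather than the mean keeps one odd task from skewing a day's figure, which matters when only a handful of tasks finish per day.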

I don't know if the increased voltage itself was the problem, or whether Precision X had introduced some other instability into the system. In any event, I disabled Precision X and rebooted the machine. It returned the clock to 970MHz but still maintained the voltage at 1.15v. I had to run Precision X again and tell it to reset all GPU defaults. Then I closed Precision X again, probably permanently. Hopefully, it will run smoothly from here on. I'll just have to live with the occasional phantom triplets, I think. It's definitely starting to feel like another GTX 750Ti in my future.
ID: 1592774 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1592781 - Posted: 27 Oct 2014, 4:59:45 UTC - in response to Message 1592774.  
Last modified: 27 Oct 2014, 5:03:38 UTC

That sounds a lot like DPC latency issues resulting in driver failsafe. Is Cuda multibeam the only app that runs on there? (Different but related issue: I can't vouch for the thread safety of any app other than x41zc builds, and a lack of thread safety can easily cause driver 'sticky downclocks'.)
ID: 1592781 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1592783 - Posted: 27 Oct 2014, 5:03:35 UTC - in response to Message 1592781.  

That sounds a lot like DPC latency issues resulting in driver failsafe. Is Cuda multibeam the only app that runs on there? (different but related issue)

No, it runs AP tasks, too. It will run either 2 MBs or 1 MB + 1 AP on the GPU.
ID: 1592783 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1592785 - Posted: 27 Oct 2014, 5:04:49 UTC - in response to Message 1592783.  

That sounds a lot like DPC latency issues resulting in driver failsafe. Is Cuda multibeam the only app that runs on there? (different but related issue)

No, it runs AP tasks, too. It will run either 2 MBs or 1 MB + 1 AP on the GPU.


OK, you'll need to consult the AP app author.
As per my edit, I can't vouch for the thread safety of any app other than x41zc builds; a lack of thread safety can easily cause driver 'sticky downclocks', alongside the possible system driver DPC issues.
ID: 1592785 · Report as offensive
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.