Posts by Jeff Buck


1) Message boards : Number crunching : Astropulse 601 - when to give up? (Message 1560949)
Posted 9 days ago by Profile Jeff Buck
I first noticed this problem over a year ago ...
Me too, with the opt-apps from Lunatics 0.41 (AP6_win_x86_SSE_CPU_r1797.exe).

Maybe this has something to do with the switch away from the Intel compilers. The "old" opti-apps from Lunatics 0.40 (ap_6.01r557_SSE2_331_AVX.exe) don't show this problem.

__W__

I experienced it on the stock AP app, astropulse_6.01_windows_intelx86.exe, so the Lunatics version wasn't an issue for me.
2) Message boards : Number crunching : Astropulse 601 - when to give up? (Message 1560399)
Posted 10 days ago by Profile Jeff Buck
If you're using the SETI preference "use CPU xx%", maybe for temperature reasons, set it back to 100% and use TThrottle from eFMer instead to throttle your CPU and GPU.

I second this recommendation. I first noticed this problem over a year ago and also used Process Explorer to zero in on the thread "wait state" problem. See 3 of my posts in various threads:

http://setiathome.berkeley.edu/forum_thread.php?id=72043&postid=1391036#1391036
http://setiathome.berkeley.edu/forum_thread.php?id=73394&postid=1448493#1448493
http://setiathome.berkeley.edu/forum_thread.php?id=73970&postid=1471695#1471695

After installing TThrottle last September, the problem completely disappeared.
3) Message boards : Number crunching : Astropulse 601 - when to give up? (Message 1549294)
Posted 34 days ago by Profile Jeff Buck
My daily driver is an AMD 4800+ which, at 2.5 GHz, is just a bit slower than yours. Back when I was running stock AP, it used to take between 57 and 76 hours, depending on what else I was doing on the machine and whether any throttling was going on for temperature control.
4) Message boards : Number crunching : The current top cruncher (Message 1549287)
Posted 34 days ago by Profile Jeff Buck
Well, Charles' peak RAC got to just above 12,500,000, but it's been dropping off lately too.

It looks like some of his Linux machines started dropping off from S@h about a week ago, like 7334442. Others more recently, such as 7334399. Each of the ones I looked at was still left with exactly 100 tasks "in progress", so it appears that he's stopping cold turkey and those tasks will just hang around until they time out. That's what's happening with 7309756, which last made contact on July 6 with 100 tasks "in progress". It looks like 52 of those have already timed out (shorties, I guess), but the other 48 aren't due until late August. Assuming he's got over 1,000 hosts running, that'll be over 100,000 WUs left in limbo for a while. Oh, well, I guess such a huge contribution to the project will be worth the extra overhead!
5) Message boards : Number crunching : The current top cruncher (Message 1544705)
Posted 43 days ago by Profile Jeff Buck
Well, the only one of the recent crop that I definitely know is his, 7309756, was run from June 12 to July 6, then appears to have stopped cold. All the other large blocks of hosts that I mentioned (speculated on) in my earlier post started up between July 6 and July 10 and are still going strong. The small number (7) that I captured last year, such as 7056463, just ran from 31 Jul 2013 to 11 Aug 2013.

He's certainly making a tremendous contribution to the project, but it's probably just short term. Hopefully, when he's done he'll wind down the tasks in his queues (or at least abandon them) rather than stop cold turkey like he appears to have done with 7309756, which still has 100 tasks in progress. Multiply that by several hundred hosts and there could be an awful lot of WUs left hanging until they time out.

Edit: BTW, in my earlier post I had identified a total of 501 Anonymous hosts that I had been paired with, in 4 different ranges of host IDs, that looked like likely candidates. I just rechecked my database and found that the number has increased to 658 probables (and a couple of the ranges have expanded slightly, too). Remember, these are only the hosts that I've been paired with as wingman, which is only a small portion of the ranges that I identified. Probably many of the other Anonymous hosts in those ranges can be identified as having similar configurations. For instance, I've been paired with 7335863 and 7335871, both 40-processor Linux boxes with an Anonymous owner. But if you look at the other 7 hosts between those two IDs, you'll find that they're all identical, too. Kind of mind-boggling!
6) Message boards : Number crunching : The current top cruncher (Message 1542776)
Posted 46 days ago by Profile Jeff Buck
Well, I found some circumstantial evidence in my database that seems to support the "mostly Mac" theory. I found that over 90% of the host IDs added to my DB between 7330981 and 7331846 are owned by Anonymous, 159 of them, all added to S@h on July 6. (A host gets added to my DB the first time it serves as a wingmate on a WU that one of my hosts received.) I checked a handful of them and all appeared to be 16-processor Darwin boxes, though I saw a couple of different Intel chips listed.

As I said, it's just circumstantial, but the addition of these machines seems to coincide with our rising star. I'm sure there are a lot more Anonymous machines in there that I just haven't happened to be paired with (yet), such as the one Juan mentioned.

Edit: Well, if I'd kept going I would have found another group of 84 added on July 7 & 8, between hosts 7333120 and 7333702. There are probably more, but I don't think I'll go looking any further. ;>)

Edit 2: Okay, I guess I lied. I couldn't resist. I've got another group of 24, but this time 40-processor Linux boxes, between 7334278 and 7334442, all added on July 8.

Edit 3: Definitely my last edit on this message. Between hosts 7335266 and 7335876, all added on July 10, my DB has snagged 234 more Anonymous hosts, and the handful I checked are all 40-processor Linux boxes. That appears to put Linux considerably ahead of OS X after all, but it's certainly an interesting mixture!
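
Purely as an illustration of the kind of range check described above (not Jeff's actual setup), a minimal Python sketch against a hypothetical SQLite table of wingmate hosts might look like this; the wingmates.db file name and the hosts/host_id/owner schema are assumptions:

import sqlite3

# Hypothetical schema: hosts(host_id INTEGER, owner TEXT, first_seen TEXT),
# with a row added the first time a wingmate host appears on one of my WUs.
conn = sqlite3.connect("wingmates.db")

def anonymous_hosts_in_range(lo, hi):
    """Count Anonymous-owned wingmate hosts with IDs in [lo, hi]."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM hosts "
        "WHERE host_id BETWEEN ? AND ? AND owner = 'Anonymous'",
        (lo, hi),
    )
    return cur.fetchone()[0]

# The host ID ranges called out in the post above.
for lo, hi in [(7330981, 7331846), (7333120, 7333702),
               (7334278, 7334442), (7335266, 7335876)]:
    print(lo, hi, anonymous_hosts_in_range(lo, hi))
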
7) Message boards : Number crunching : The current top cruncher (Message 1542725)
Posted 47 days ago by Profile Jeff Buck
Posts by Charles Long
Sounds like a bunch of Macs to me. Someone can't read that? He made 3 posts about getting all 16 cores on his 16-core OSX machines to work, then mentioned his data center machines that are also having the same problem. Seems pretty convincing to me.

From his first post: "This problem only exists with the OS X Seti client; the Linux clients that I am using run approximately one task per CPU."

"OS X Seti client" is singular, "Linux clients" is plural. Sounds to me like OS X is in the minority.

One of his current machines, that I happened to catch in my database before he activated the cloaking device, is a 40-processor Linux box, 7309756. He also appears to have done some similar stress-testing about a year ago. I've got 7 more of those machines in my DB, all similar to this 24-processor Linux box, 7056463. Perhaps they've recently upgraded those machines, leading to the new round of testing.
8) Message boards : Number crunching : Weirdness in BOINC Stats for Myself as SETI User (Message 1542387)
Posted 47 days ago by Profile Jeff Buck
112 Mark Lybeck 94,586,582 91,070 730,551 3,650,571 113,906 1
113 DeltaCTC@PJ 94,434,086 0 0 0 0
114 PhoenixdiGB 93,939,133 59,591 367,407 1,477,828 50,458 10
115 jravin 93,764,438 84,592 585,346 2,646,937 84,899 6
116 SameOldFatMan 93,496,780 2,058 20,317 106,682 3,261
117 Dick 92,912,069 8,531 60,232 174,137 6,848 143


In the above, I went from #114 to #115 today in total SETI credits, yet I do not see how anyone passed me. The guy just ahead of me (by ~175K) is earning fewer credits/day than I am, so he can't be the one.

Am I missing something?

Probably Charles Long, #90. He's rocketing up the charts!
9) Message boards : Number crunching : Panic Mode On (88) Server Problems? (Message 1542380)
Posted 47 days ago by Profile Jeff Buck
However part of the increase could be from this user which seems to be stress testing a freaking data center & chucking out 18,000,000 credits worth of work a day.

Okay, now I'm intrigued. And of course the computer list is hidden.

And at least one of their systems is an OSX machine with more than 8 cores, because all three posts they've ever made were in a thread asking how to get more than 8 tasks to run at a time in OSX.

I only noticed them because they showed up on one of the stat overtake pages for me with over a million RAC & I thought "that can't be right", but it seems it is.
Their first post did mention something about load testing their data centers IIRC.

Here's one of his machines, 7309756, that got into my database before he hid them. Looks like he ran S@H on it for about 4 weeks, then stopped cold on July 6. A lot of WUs successfully processed, which is terrific, but he might have left 100 in limbo if that machine doesn't connect again. Hope he doesn't do it that way for his whole data center.
10) Message boards : Number crunching : blanked AP tasks (Message 1507106)
Posted 132 days ago by Profile Jeff Buck
Since a "reminder" was called for a couple months ago, I think I'll resurrect this discussion with AP task 3497681666, which ran on a GTX 660 for over 3.5 hours (and used 2.5+ hours of CPU time). The Stderr shows:

percent blanked: 99.85

When an AP task such as 3497598158 with "percent blanked: 0.00" runs on the same GPU in under 46 minutes (less than 20 min. CPU time) and a 100% blanked AP task such as 3497681654 is disposed of in 3.23 seconds (1.03 sec. CPU time), it strikes me as utterly perverse for a task that's 99.85% blanked to burn up so many resources to process so little data. (And that task was followed shortly thereafter by task 3497681640 which was 96.68% blanked and sucked up another 3.2 hours of Run Time.)
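
To put rough numbers on that, here is a quick back-of-the-envelope calculation using only the run times quoted above (times are approximate):

# Run time spent per percentage point of unblanked data, using the rough
# figures quoted above (illustrative arithmetic only).
tasks = {
    "99.85% blanked": (3.5 * 3600, 100 - 99.85),  # over 3.5 hours for 0.15% of the data
    "0.00% blanked":  (46 * 60,    100 - 0.00),   # about 46 minutes for all of the data
}
for label, (run_time_s, unblanked_pct) in tasks.items():
    print(f"{label}: {run_time_s / unblanked_pct:,.0f} s per unblanked percentage point")
# Roughly 84,000 s per unblanked percentage point for the 99.85% task
# versus about 28 s for the unblanked one, a factor of about 3,000.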

There's GOT to be a better way! :^)
11) Message boards : Number crunching : "Zombie" AP tasks - still alive when BOINC should have killed them (Message 1497217)
Posted 156 days ago by Profile Jeff Buck
Despite the lack of interest in this problem, I figured I still should go ahead and test it out in my other configurations while the APs were available. I wanted to find out if the OS and GPU type had any bearing on the results.

My original tests were on NVIDIA GPUs running stock apps under Win 8.1. Here are the results from testing on three of my other machines:

6980751 (Win XP, NVIDIA GPUs, Lunatics apps): AP GPU task continues to run w/o BOINC, MB GPU and MB CPU tasks shut down properly
6979886 (Win Vista, NVIDIA GPU, Lunatics apps): AP GPU task continues to run w/o BOINC, MB GPU and MB CPU tasks shut down properly
6912878 (Win 7, ATI GPU, stock apps): AP GPU and MB GPU task continue to run w/o BOINC, MB CPU tasks shut down properly

So it appears that AP GPU tasks fail to shut down under every configuration that I can test and stock MB GPU tasks fail to shut down when running on an ATI GPU (at least under Win 7).
12) Message boards : Number crunching : "Zombie" AP tasks - still alive when BOINC should have killed them (Message 1496418)
Posted 157 days ago by Profile Jeff Buck
Sunday morning it occurred to me that there was one scenario that I had experienced during my first round of BOINC-on BOINC-off configuration file changes, prior to discovery of the free-range AP tasks. When I first changed the password in the gui_rpc_auth.cfg file, I had done it with only BOINC Manager shut down, not the BOINC client. When I then tried to restart the BOINC Manager, I got a "Connection Error" message stating, "Authorization failed connecting to running client. Make sure you start this program in the same directory as the client." Well, the directories really were the same, but apparently now the password wasn't. So then I shut the BOINC Manager down once again, this time telling it to stop the running tasks, as well. As far as I knew, it had done that successfully, despite the fact that it said it couldn't connect to the client, because the next restart of BOINC Manager did not experience any problems, connection or otherwise.

Well, I didn't have to wait as long as I feared to get an AP GPU task to test my theory with. And my initial test worked just as I thought it might. With BOINC Manager running but unable to connect to the client, which was controlling 7 MB CPU tasks, 8 MB GPU tasks and 1 AP GPU task, I exited BOINC Manager selecting the option to stop the running tasks. Judging by Process Explorer, it took at least 10 seconds for all the tasks to terminate, except for the one AP task, which continued to run all by itself:



When I restarted BOINC Manager, all the previously running tasks restarted, including a "phantom" AP task, while the original AP task continued to run standalone.



Note that the AP task shown as running under the BOINC Manager and client is not using any CPU resources, while the Zombie task is using 6.85%. The phantom task was only shown for about 35 seconds, until BOINC gave up trying to gain control of the lockfile. It vanished from Process Explorer and its status in BOINC Manager changed to "Waiting to run", at which point another MB task kicked off. Then, each time one of the MB tasks finished, BOINC tried again to run the AP task but kept running up against the lockfile issue. I watched that cycle repeat 3 times before shutting everything down and rebooting.

This would seem to indicate that AP GPU tasks may not recognize when the BOINC client shuts down, if the client wasn't running under the control of the BOINC Manager. Since the earlier reported occurrences of this phenomenon resulted from BOINC crashes, I would guess that there's something in the way the dominoes fall in one of those crashes that results in a similar scenario.

In my test, obviously, I used the BOINC Manager to shut down the client, even though they weren't connected (a bit disturbing in itself that this is possible, I think), but I would guess that manually shutting down the BOINC client even without BOINC Manager might have the same effect. However, I don't really know what the "cleanest" way would be to do that. (Task Manager or Process Explorer could kill it, but I'm afraid there might be collateral damage if I tried that.) I should probably leave that to one of the more experienced testers to try.
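
For what it's worth, BOINC does ship a command-line tool, boinccmd.exe, whose --quit option asks the running client to shut down without involving the Manager; whether that would reproduce the zombie behavior is untested here. A minimal sketch (the install path shown is the usual Windows default and may differ on other machines):

import subprocess

# boinccmd talks to the running client over the same GUI RPC channel the
# Manager uses, so it may need the password from gui_rpc_auth.cfg (via
# --passwd) if one is set. --quit asks the client to exit cleanly.
boinccmd = r"C:\Program Files\BOINC\boinccmd.exe"
subprocess.run([boinccmd, "--quit"], check=True)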

By the way, the AP task in my test, 3460974473 is now completed and awaiting validation, if anyone wants to check out the Stderr.
13) Message boards : Number crunching : Really old nVidia EOL! (Message 1496108)
Posted 158 days ago by Profile Jeff Buck
Thankfully, most of the 400 series cards are unaffected, except for the 405, which I'd never heard of before Nvidia mentioned it.

I've got a 405 on my daily driver, mainly because it only draws 25 W and my PSU is rated at just 250 W. It's been running just fine on the 314.22 driver and I don't expect I'll have any need to upgrade that.
14) Message boards : Number crunching : "Zombie" AP tasks - still alive when BOINC should have killed them (Message 1495261)
Posted 159 days ago by Profile Jeff Buck
In an earlier thread, 194 (0xc2) EXIT_ABORTED_BY_CLIENT - "finish file present too long", I had reported on some AP GPU tasks that continued to run even after BOINC had crashed. Last Friday afternoon I also found similar "Zombie" AP tasks that continued to run following a deliberate shutdown of the BOINC Manager and client.

Due to a complete hard drive failure on my T7400, host 7057115, the previous Sunday night, by Friday afternoon I had reloaded a replacement HD from the ground up, with OS (Win8.1), drivers, BOINC, and S@H. I was able to get the scheduler to resend all but 7 (VLARs) of the 140+ "lost" tasks for the host. Once they were all downloaded, I went about trying to reconfigure BOINC and S@H back to what I thought I remembered having on the old HD. As part of that process, I shut down and restarted BOINC several times, to pick up changes to the remote_hosts.cfg file, the gui_rpc_auth.cfg file, the firewall ports, etc.

Following what I think was the third BOINC shutdown, I happened to notice that the 3 GPUs were still under a partial load, and the CPUs were showing intermittent activity, as well. So I fired up Process Explorer and discovered 3 AP GPU tasks chugging along without benefit of BOINC, one on each GPU:



I then restarted BOINC, thinking that it would pick up the "undead" tasks, but Process Explorer still showed them running "outside the box". Notice, also, that there are 9 MB GPU tasks running, 3 per GPU. This would be normal if there weren't any AP GPU tasks running, as I try to run 3 MB tasks or 2 MB + 1 AP per GPU. Now it appeared that each GPU had 3 MB tasks running under BOINC's control along with 1 AP task with no supervision at all:



This time I tried exiting BOINC Manager without stopping the tasks running under the BOINC client and, as expected at this point, found that only boincmgr.exe had disappeared from Process Explorer:



When I restarted the BOINC Manager this time, I noticed something interesting in the task list:



Initially, there were 3 AP tasks and 6 MB tasks shown as "Running", with 3 additional MB tasks "Waiting to run". As I was puzzling over this, all 3 AP tasks changed to "Waiting to run" and the 3 MB tasks changed to "Running", as shown above. However, judging by Process Explorer, all 12 tasks appeared to be consuming resources. Checking the stderr.txt files for the AP tasks, I found 7 or 8 iterations of:

Running on device number: 0
DATA_CHUNK_UNROLL at default:2
DATA_CHUNK_UNROLL at default:2
16:32:27 (1464): Can't acquire lockfile (32) - waiting 35s
16:33:02 (1464): Can't acquire lockfile (32) - exiting
16:33:02 (1464): Error: The process cannot access the file because it is being used by another process. (0x20)


It appeared that BOINC might have been trying to run another instance of each AP task under its control, but couldn't because the tasks running outside its control had a grip on the lockfile.
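
For reference, the "(32)" and "(0x20)" in that Stderr are the Windows sharing-violation error ("The process cannot access the file because it is being used by another process"). The retry-then-exit pattern the log describes looks roughly like the sketch below; this is only an illustration of the pattern, not the actual AstroPulse or BOINC library code, and the exclusive-create lock is a stand-in for whatever locking the real app uses:

import os
import time

def acquire_slot_lock(path="boinc_lockfile", wait_seconds=35):
    """Try to take an exclusive lock on the slot's lockfile.

    If another instance already holds it (e.g. a task still running outside
    BOINC's control), wait once and then give up, mirroring the
    "Can't acquire lockfile (32) - waiting 35s / exiting" sequence above.
    """
    for attempt in range(2):
        try:
            # Creation fails if the file already exists (stand-in for a real lock).
            return os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR)
        except OSError as e:
            if attempt == 0:
                print(f"Can't acquire lockfile ({e.errno}) - waiting {wait_seconds}s")
                time.sleep(wait_seconds)
            else:
                print(f"Can't acquire lockfile ({e.errno}) - exiting")
                raise SystemExit(1)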

At this point, I figured the safest thing to do, before the AP tasks ended with possible "finish file" errors, was to just reboot the machine. This finally succeeded in bringing the AP tasks back under BOINC's control:



All 3 AP tasks eventually completed normally, were quickly validated and have already vanished from my task detail pages (although I can still provide complete Stderr output if anybody wants it).

Later on Friday evening, I decided to see if I could recreate the "Zombie" scenario with some additional AP tasks that were by then running on the GPUs. I tried various sequences of shutting down BOINC Manager, sometimes specifying that the running tasks should stop also, and sometimes letting the client continue to run. In no instance were any AP tasks left alive when I told BOINC to stop the running tasks, so that left me rather stumped.

Sunday morning it occurred to me that there was one scenario that I had experienced during my first round of BOINC-on BOINC-off configuration file changes, prior to discovery of the free-range AP tasks. When I first changed the password in the gui_rpc_auth.cfg file, I had done it with only BOINC Manager shut down, not the BOINC client. When I then tried to restart the BOINC Manager, I got a "Connection Error" message stating, "Authorization failed connecting to running client. Make sure you start this program in the same directory as the client." Well, the directories really were the same, but apparently now the password wasn't. So then I shut the BOINC Manager down once again, this time telling it to stop the running tasks, as well. As far as I knew, it had done that successfully, despite the fact that it said it couldn't connect to the client, because the next restart of BOINC Manager did not experience any problems, connection or otherwise.

It's possible, then, that this was the scenario that left the AP tasks running when BOINC thought that it had killed them. Unfortunately, by the time I came up with this theory on Sunday, AP GPU tasks were a distant memory and, unless I happen to snag an AP resend, it looks like it might be almost a week before I can actually test it out. I did recreate the mismatched password scenario on Sunday evening, but the only tasks running at the time were MB GPU, MB CPU and AP CPU. All shut down nicely when told politely to do so.

Anyway, for now this is just a theory of how these Zombie AP tasks might occur, but, unless somebody else still has an AP task running on an NVIDIA GPU and wants to take a crack at proving/disproving the theory, I'll just have to wait for the next round of APs.
15) Message boards : Number crunching : BOINC 7.2.42 - Reduced info in Event Log? (Message 1494856)
Posted 160 days ago by Profile Jeff Buck
The <cpu_sched>1</cpu_sched> log flag does the trick. I added it to cc_config this evening and the "Starting task" and "Restarting task" log entries pretty much reverted to their behavior in pre-7.2.42 BOINC, just with "[cpu_sched]" added to each message. Thanks for the help, Richard.
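
For anyone else wanting the same detail, a minimal cc_config.xml carrying that flag looks like the one this sketch writes; the path is the usual BOINC data directory on Vista and later, and you'd tell BOINC to re-read its config files (or restart it) afterwards:

from pathlib import Path

# Minimal cc_config.xml enabling the [cpu_sched] log flag discussed above.
# Adjust the path to your BOINC data directory and merge with any existing
# cc_config.xml rather than overwriting it.
Path(r"C:\ProgramData\BOINC\cc_config.xml").write_text(
    "<cc_config>\n"
    "  <log_flags>\n"
    "    <cpu_sched>1</cpu_sched>\n"
    "  </log_flags>\n"
    "  <options>\n"
    "  </options>\n"
    "</cc_config>\n"
)
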
16) Message boards : Number crunching : BOINC 7.2.42 - Reduced info in Event Log? (Message 1494694)
Posted 161 days ago by Profile Jeff Buck
Which version were you used to using?

All my other machines are on 7.2.33, so this seems to be a sudden change. Looks like that <cpu_sched>1</cpu_sched> might be what I need to get the app and slot info back. I'll try it this evening when my T7400 comes back up. (It's always shut down on weekday afternoons.)
17) Message boards : Number crunching : BOINC 7.2.42 - Reduced info in Event Log? (Message 1494670)
Posted 161 days ago by Profile Jeff Buck
On Friday I installed BOINC 7.2.42 on my host 7057115, not so much by choice but because I'd had to install fresh copies of everything on that machine due to a hard drive failure the previous Sunday night.

In the process of trying to research a problem that cropped up, with some AP tasks continuing to run when BOINC was shut down (and which I'll probably discuss in a new thread when I can get to it), I discovered that the Event Log in 7.2.42 seems to have done away with several pieces of very useful information. In 7.2.33, when a new task started, the log would show an entry similar to:

Starting task 17my13aa.23116.13155.438086664205.12.31_1 using setiathome_v7 version 700 (cuda42) in slot 1

In 7.2.42, the application and slot info have disappeared, leaving an entry looking like this:

Starting task 20my13ab.6061.3441.438086664200.12.14_1

On a machine that's only running one task at a time, like my old ThinkPad laptop, I suppose that's perfectly adequate. But on my T7400, which is usually running as many as 17 tasks at a time, trying to identify and follow task activity in the Event Log can be a real pain without that application and slot data. I also noticed that there were no entries in the log for task restarts, which also increases the difficulty of tracking task behavior. There might be other differences that I haven't been impacted by yet.

Does anybody know if these are intentional omissions by the BOINC developers, or if perhaps there's some new option that I need to set to turn this data back on? Under the circumstances, I'm not likely to upgrade my other machines to 7.2.42, and will probably see if I can downgrade on my T7400.
18) Message boards : Number crunching : 194 (0xc2) EXIT_ABORTED_BY_CLIENT - "finish file present too long" (Message 1494659)
Posted 161 days ago by Profile Jeff Buck
I'm not sure whether that's the same problem or not. It looks as if the program reached the normal completion point, but couldn't close down properly for some reason. So, it started again, and crashed on the restart.

Still, lots of lovely debug information logged, so I'll save that for Jason - looks like the replacement wingmate is a very fast host, so it may disappear too soon.

Got one last evening similar to Juan's, on a stock cuda42 task, 3453302106, if you want to grab the debugging info before it vanishes.

Based on the timing, I think I caused this one myself, by shutting down and restarting BOINC several times while testing a theory related to the "Zombie" AP tasks that started this thread in the first place. (I'll probably post more on that theory in a separate thread when I can.) It seems as if BOINC actually shuts down before the applications do, and this task must have managed to "finish" after BOINC was already terminated. It then restarted at 86.16% but with the "finish file" already present.
19) Message boards : Number crunching : Panic Mode On (87) Server Problems? (Message 1488713)
Posted 171 days ago by Profile Jeff Buck
Here's the sort of thing that'll hold down the RTS buffer if it happens very often. My T7400, 7057115, just blew through 37 tasks in a row from "tape" 27mr13aa that had perfectly legitimate -9 overflows (i.e., not a runaway rig). Total Run Time = 238.35 seconds, or about 6.4 seconds apiece! These tasks were shorties to begin with, but this was ridiculous. :^)

All 37 tasks had names starting with "27mr13aa.1310.10292.438086664203.12". (Two other tasks from the same tape, but from a different sequence, produced strange errors. That's a whole different issue, I think.)
20) Message boards : Number crunching : Worth running two WUs simultaneously on mainstream Kepler cards ? (Message 1487644)
Posted 173 days ago by Profile Jeff Buck
I have 2 boxes with GTX660s in them, but both are dedicated crunchers.

On 6980751 a single GTX660 is grouped with a GTX650 and two GTX640s. Because of the slower cards, I limit it to 2 WUs per GPU. That puts the GTX660 at about 92% with 2 Multibeam tasks running (but often much less than that with 1 MB and 1 Astropulse). I'm guessing that you might see a comparable load if you went with 2 per GPU.

On 7057115 two GTX660s are grouped with a GTX670. Originally, that machine only had the 2 GTX660s and I started running with 2 WUs per GPU, which also put the normal load at about 92-93%. When I increased it to 3 WUs per GPU, I got the load up to about 98-99%. Obviously not a great increase, but an increase nonetheless. The RAC also increased a comparable amount, about 5-6%, so not a lot lost to additional overhead. When I added the GTX670, I just kept running the 3 WUs per GPU setup.

Bottom line is, just experiment with your own setup. If your machine is not a dedicated cruncher, you may not actually want to max out the GPU load with SETI, but that probably depends on what else you use it for.
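
For readers wondering how the tasks-per-GPU count gets set: with the stock apps under BOINC 7.x, one common way is an app_config.xml in the project directory, roughly like the one this sketch writes (a gpu_usage of 0.33 lets three tasks share one GPU). The setiathome_v7 app name matches the Event Log entries quoted earlier; the path and the cpu_usage value are illustrative, and anonymous-platform setups such as Lunatics configure the count in app_info.xml instead.

from pathlib import Path

# Example app_config.xml for 3 Multibeam tasks per GPU with the stock apps.
# gpu_usage 0.33 -> three tasks share one GPU; cpu_usage is illustrative.
project_dir = Path(r"C:\ProgramData\BOINC\projects\setiathome.berkeley.edu")
(project_dir / "app_config.xml").write_text(
    "<app_config>\n"
    "  <app>\n"
    "    <name>setiathome_v7</name>\n"
    "    <gpu_versions>\n"
    "      <gpu_usage>0.33</gpu_usage>\n"
    "      <cpu_usage>0.2</cpu_usage>\n"
    "    </gpu_versions>\n"
    "  </app>\n"
    "</app_config>\n"
)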


