Posts by Jeff Buck


1) Message boards : Number crunching : Panic Mode On (89) Server Problems? (Message 1566785)
Posted 14 days ago by Profile Jeff Buck

If you had a Mac, ALL you would run on the GPUs would be APs, since there isn't an ATI MB app. It's down to 2 GPU tasks while it has downloaded 9 more CPU tasks. The GPUs are about to go idle again while it keeps getting CPU work.

Yeah, I don't have a Mac, and only one of my machines has an ATI card (an old 5450).

I just noticed 2 more APs downloaded, this time to my T7400, and both went to the GPU queue, so whatever problem you're experiencing, it isn't hitting here. Perhaps another Mac / ATI user will have to chime in. Certainly the AP RTS buffer is still running on empty, and the new tapes are throwing errors just like the last batch, so that isn't helping.
2) Message boards : Number crunching : Panic Mode On (89) Server Problems? (Message 1566779)
Posted 14 days ago by Profile Jeff Buck
Well, so far I've only gotten 1 AP task in the 3+ hours since they started loading new tapes, and it went to a GPU on my xw9400.
3) Message boards : Number crunching : Panic Mode On (89) Server Problems? (Message 1566768)
Posted 14 days ago by Profile Jeff Buck
Has anyone noticed the server is sending out CPU tasks even if your GPUs don't have any work? All 3 of my hosts are receiving CPU work even though it isn't needed. I have a Mac whose 3 GPUs are going idle while the server sends it CPU work. If I change the preferences to GPU only, the server doesn't send any work :-(

What I'm seeing over the last several days is that sometimes a work fetch for both CPU and GPU will only return one or more VLARs for the CPU, but nothing for the GPU. However, usually the next request will fill up the GPU queue, too. It seems as if when the scheduler only finds VLARs at the top of its RTS buffer, it sometimes doesn't dig very deep to find non-VLARs for the GPU if it's pressed for time. I haven't had any hosts stop getting GPU tasks altogether.
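
Purely as a toy model of that hypothesis (the names and the scan limit are invented; this is nothing like the real scheduler code), the failure mode would look something like this:

MAX_SCAN = 20  # pretend the scheduler only examines this many RTS entries per request

def fill_request(rts_buffer, want_cpu, want_gpu):
    cpu_tasks, gpu_tasks = [], []
    for task in rts_buffer[:MAX_SCAN]:
        if want_cpu and len(cpu_tasks) < want_cpu:
            cpu_tasks.append(task)   # VLARs are fine for the CPU
        elif want_gpu and not task["is_vlar"] and len(gpu_tasks) < want_gpu:
            gpu_tasks.append(task)   # GPUs only get non-VLARs
    # With a run of VLARs at the head of the buffer, the CPU request is
    # filled but gpu_tasks comes back empty, even though non-VLARs exist
    # deeper in the buffer -- matching the behavior described above.
    return cpu_tasks, gpu_tasks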

Edit: Ah, I see you're talking about AP-only hosts. Never mind! AP tasks are rare in my world.
4) Message boards : Number crunching : Panic Mode On (89) Server Problems? (Message 1566522)
Posted 14 days ago by Profile Jeff Buck
I see a thread over in Q&A, Is SETI@HOME down?, which appears to be related. That host is also still using BOINC 6.2.19.
5) Questions and Answers : Windows : Is SETI@HOME down? (Message 1566520)
Posted 14 days ago by Profile Jeff Buck
There appears to have been some server-side change which has affected hosts still using BOINC 6.2.19. Take a look at Message 1566331 and others around it in that thread. Best bet would appear to be to install a newer version of BOINC.
6) Message boards : Number crunching : Astropulse 601 - when to give up? (Message 1560949)
Posted 26 days ago by Profile Jeff Buck
I first noticed this problem over a year ago ...
Me too, with the opt-apps from Lunatics 0.41 (AP6_win_x86_SSE_CPU_r1797.exe).

Maybe this has something to do with the switch away from the Intel compilers. The "old" opti-apps from Lunatics 0.40 (ap_6.01r557_SSE2_331_AVX.exe) don't show this problem.

__W__

I experienced it on the stock AP app, astropulse_6.01_windows_intelx86.exe, so the Lunatics version wasn't an issue for me.
7) Message boards : Number crunching : Astropulse 601 - when to give up? (Message 1560399)
Posted 27 days ago by Profile Jeff Buck
If you're using the SETI preference "use CPU xx%", maybe for temperature reasons, set it back to 100% and use TThrottle from eFMer instead to throttle your CPU and GPU.

I second this recommendation. I first noticed this problem over a year ago and also used Process Explorer to zero in on the thread "wait state" problem. See 3 of my posts in various threads:

http://setiathome.berkeley.edu/forum_thread.php?id=72043&postid=1391036#1391036
http://setiathome.berkeley.edu/forum_thread.php?id=73394&postid=1448493#1448493
http://setiathome.berkeley.edu/forum_thread.php?id=73970&postid=1471695#1471695

After installing TThrottle last September, the problem completely disappeared.
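
For anyone wondering what TThrottle-style throttling amounts to: instead of BOINC's blunt "use at most X% of CPU" preference, it suspends and resumes the crunching processes in short cycles driven by measured temperature. A minimal sketch of that duty-cycle idea, assuming a psutil-style process handle and a hypothetical read_temp() sensor function (an illustration only, not TThrottle's actual code):

import time

TARGET_C = 70.0  # target core temperature, degrees C

def throttle_loop(proc, read_temp, period=1.0):
    # proc is assumed to offer psutil-style suspend()/resume()/is_running();
    # read_temp() is a hypothetical sensor read returning degrees C.
    duty = 1.0  # fraction of each period the worker is allowed to run
    while proc.is_running():
        temp = read_temp()
        # Nudge the duty cycle up when cool, down when hot.
        duty = max(0.1, min(1.0, duty + 0.05 * (TARGET_C - temp) / TARGET_C))
        proc.resume()
        time.sleep(period * duty)            # run phase
        if duty < 1.0:
            proc.suspend()
            time.sleep(period * (1 - duty))  # cool-down phase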
8) Message boards : Number crunching : Astropulse 601 - when to give up? (Message 1549294)
Posted 51 days ago by Profile Jeff Buck
My daily driver is an AMD 4800+ which, at 2.5 GHz, is just a bit slower than yours. Back when I was running stock AP, it used to take between 57 and 76 hours, depending on what else I was doing on the machine and whether any throttling was going on for temperature control.
9) Message boards : Number crunching : The current top cruncher (Message 1549287)
Posted 51 days ago by Profile Jeff Buck
Well Charles' peak RAC got to just above 12,500,000, but it's been dropping off just lately too.

It looks like some of his Linux machines started dropping off from S@h about a week ago, like 7334442. Others more recently, such as 7334399. Each of the ones I looked at was still left with exactly 100 tasks "in progress", so it appears that he's stopping cold turkey and those tasks will just hang around until they time out. That's what's happening with 7309756, which last made contact on July 6 with 100 tasks "in progress". It looks like 52 of those have already timed out (shorties, I guess), but the other 48 aren't due until late August. Assuming he's got over 1,000 hosts running, that'll be over 100,000 WUs left in limbo for a while. Oh, well, I guess such a huge contribution to the project will be worth the extra overhead!
10) Message boards : Number crunching : The current top cruncher (Message 1544705)
Posted 60 days ago by Profile Jeff Buck
Well, the only one of the recent crop that I definitely know is his, 7309756, was run from June 12 to July 6, then appears to have stopped cold. All the other large blocks of hosts that I mentioned (speculated on) in my earlier post started up between July 6 and July 10 and are still going strong. The small number (7) that I captured last year, such as 7056463, just ran from 31 Jul 2013 to 11 Aug 2013.

He's certainly making a tremendous contribution to the project, but it's probably just short term. Hopefully, when he's done he'll wind down the tasks in his queues (or at least abandon them) rather than stop cold turkey like he appears to have done with 7309756, which still has 100 tasks in progress. Multiply that by several hundred hosts and there could be an awful lot of WUs left hanging until they time out.

Edit: BTW, in my earlier post I had identified a total of 501 Anonymous hosts that I had been paired with, in 4 different ranges of host IDs, that looked like likely candidates. I just rechecked my database and found that the number has increased to 658 probables (and a couple of the ranges have expanded slightly, too). Remember, these are only the hosts that I've been paired with as wingman. That is only a small portion of the ranges that I identified. Probably, many of the other Anonymous hosts in those ranges can be identified as having similar configurations. For instance, I've been paired with 7335863 and 7335871, both 40-processor Linux boxes with an Anonymous owner. But if you look at the other 7 hosts between those two IDs, you'll find that they're all identical, too. Kind of mind boggling!
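
For anybody who wants to try the same kind of digging: all it takes is a local table of every wingmate you've been paired with, plus their owner and creation date. A minimal sketch, assuming a hypothetical SQLite table (the schema and file name are invented for illustration):

import sqlite3
from collections import Counter

# Hypothetical schema: hosts(host_id INTEGER, owner TEXT, created TEXT),
# populated with every wingmate seen on my hosts' workunits.
conn = sqlite3.connect("wingmen.db")
rows = conn.execute(
    """SELECT host_id, created FROM hosts
       WHERE owner = 'Anonymous' AND host_id BETWEEN ? AND ?
       ORDER BY host_id""",
    (7330981, 7335876),
).fetchall()

# Batches of hosts created on the same day stand out immediately.
by_date = Counter(created for _, created in rows)
for date, n in sorted(by_date.items()):
    print(f"{date}: {n} Anonymous hosts in range")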
11) Message boards : Number crunching : The current top cruncher (Message 1542776)
Posted 63 days ago by Profile Jeff Buck
Well, I found some circumstantial evidence in my database that seems to support the "mostly Mac" theory. I found that over 90% of the host IDs added to my DB between 7330981 and 7331846 are owned by Anonymous, 159 of them, all added to S@h on July 6. (A host gets added to my DB the first time it serves as a wingmate on a WU that one of my hosts received.) I checked a handful of them and all appeared to be 16-processor Darwin boxes, though I saw a couple different Intel chips listed.

As I said, it's just circumstantial, but the addition of these machines seems to coincide with our rising star. I'm sure there are a lot more Anonymous machines in there that I just haven't happened to be paired with (yet), such as the one Juan mentioned.

Edit: Well, if I'd kept going I would have found another group of 84 added on July 7 & 8, between hosts 7333120 and 7333702. There are probably more, but I don't think I'll go looking any further. ;>)

Edit 2: Okay, I guess I lied. I couldn't resist. I've got another group of 24, but this time 40-processor Linux boxes, between 7334278 and 7334442, all added on July 8.

Edit 3: Definitely my last edit on this message. Between hosts 7335266 and 7335876, all added on July 10, my DB has snagged 234 more Anonymous hosts, and the handful I checked are all 40-processor Linux boxes. That appears to put Linux considerably ahead of OS X, after all, but certainly an interesting mixture!
12) Message boards : Number crunching : The current top cruncher (Message 1542725)
Posted 63 days ago by Profile Jeff Buck
Posts by Charles Long
Sounds like a bunch of Macs to me. Someone can't read that? He made 3 posts about getting all 16 cores on his 16 core OSX machines to work. Then mentioned his Datacenter machines that are also having the same problem. Seems pretty convincing to me.

From his first post: "This problem only exists with the OS X Seti client; the Linux clients that I am using run approximately one task per CPU."

OS X Seti client is singular, Linux clients is plural. Sounds to me like OS X is in the minority.

One of his current machines, that I happened to catch in my database before he activated the cloaking device, is a 40-processor Linux box, 7309756. He also appears to have done some similar stress-testing about a year ago. I've got 7 more of those machines in my DB, all similar to this 24-processor Linux box, 7056463. Perhaps they've recently upgraded those machines, leading to the new round of testing.
13) Message boards : Number crunching : Weirdness in BOINC Stats for Myself as SETI User (Message 1542387)
Posted 64 days ago by Profile Jeff Buck
112 Mark Lybeck 94,586,582 91,070 730,551 3,650,571 113,906 1
113 DeltaCTC@PJ 94,434,086 0 0 0 0
114 PhoenixdiGB 93,939,133 59,591 367,407 1,477,828 50,458 10
115 jravin 93,764,438 84,592 585,346 2,646,937 84,899 6
116 SameOldFatMan 93,496,780 2,058 20,317 106,682 3,261
117 Dick 92,912,069 8,531 60,232 174,137 6,848 143


In the above, I went from #114 to #115 today in total SETI credits, yet I do not see how anyone passed me. The guy just ahead of me (by ~175K) is running fewer credits/day than me, so he can't be the one.

Am I missing something?

Probably Charles Long, #90. He's rocketing up the charts!
14) Message boards : Number crunching : Panic Mode On (88) Server Problems? (Message 1542380)
Posted 64 days ago by Profile Jeff Buck
However, part of the increase could be from this user, who seems to be stress-testing a freaking data center & chucking out 18,000,000 credits worth of work a day.

Okay, now I'm intrigued. And of course the computer list is hidden.

And at least one of their systems is more than an 8-core OSX machine, because all three posts they've ever done were in a thread asking how to get more than 8 to run at a time in OSX.

I only noticed them because they showed up on one of the stat overtake pages for me with over a million RAC & I thought "that can't be right", but it seems it is.
Their first post did mention something about load testing their data centers IIRC.

Here's one of his machines, 7309756, that got into my database before he hid them. Looks like he ran S@H on it for about 4 weeks, then stopped cold on July 6. A lot of WUs successfully processed, which is terrific, but he might have left 100 in limbo if that machine doesn't connect again. Hope he doesn't do it that way for his whole data center.
15) Message boards : Number crunching : blanked AP tasks (Message 1507106)
Posted 149 days ago by Profile Jeff Buck
Since a "reminder" was called for a couple months ago, I think I'll resurrect this discussion with AP task 3497681666, which ran on a GTX 660 for over 3.5 hours (and used 2.5+ hours of CPU time). The Stderr shows:

percent blanked: 99.85

When an AP task such as 3497598158 with "percent blanked: 0.00" runs on the same GPU in under 46 minutes (less than 20 min. CPU time) and a 100% blanked AP task such as 3497681654 is disposed of in 3.23 seconds (1.03 sec. CPU time), it strikes me as utterly perverse for a task that's 99.85% blanked to burn up so many resources to process so little data. (And that task was followed shortly thereafter by task 3497681640 which was 96.68% blanked and sucked up another 3.2 hours of Run Time.)

There's GOT to be a better way! :^)
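
Just to make the wish concrete, the kind of pre-check being asked for might look like this (entirely hypothetical; judging by these run times, the app only short-circuits at exactly 100% blanked):

BLANKED_CUTOFF = 95.0  # invented threshold, purely for illustration

def worth_processing(percent_blanked: float) -> bool:
    """Decide whether the unblanked remainder justifies a full search."""
    return percent_blanked < BLANKED_CUTOFF

# A 99.85% blanked task would be reported as effectively empty in
# seconds instead of occupying a GPU for 3.5 hours.
print(worth_processing(99.85))   # False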
16) Message boards : Number crunching : "Zombie" AP tasks - still alive when BOINC should have killed them (Message 1497217)
Posted 172 days ago by Profile Jeff Buck
Despite the lack of interest in this problem, I figured I still should go ahead and test it out in my other configurations while the APs were available. I wanted to find out if the OS and GPU type had any bearing on the results.

My original tests were on NVIDIA GPUs running stock apps under Win 8.1. Here are the results from testing on three of my other machines:

6980751 (Win XP, NVIDIA GPUs, Lunatics apps): AP GPU task continues to run w/o BOINC, MB GPU and MB CPU tasks shut down properly
6979886 (Win Vista, NVIDIA GPU, Lunatics apps): AP GPU task continues to run w/o BOINC, MB GPU and MB CPU tasks shut down properly
6912878 (Win 7, ATI GPU, stock apps): AP GPU and MB GPU tasks continue to run w/o BOINC, MB CPU tasks shut down properly

So it appears that AP GPU tasks fail to shut down under every configuration that I can test and stock MB GPU tasks fail to shut down when running on an ATI GPU (at least under Win 7).
17) Message boards : Number crunching : "Zombie" AP tasks - still alive when BOINC should have killed them (Message 1496418)
Posted 174 days ago by Profile Jeff Buck
Sunday morning it occurred to me that there was one scenario that I had experienced during my first round of BOINC-on BOINC-off configuration file changes, prior to discovery of the free-range AP tasks. When I first changed the password in the gui_rpc_auth.cfg file, I had done it with only BOINC Manager shut down, not the BOINC client. When I then tried to restart the BOINC Manager, I got a "Connection Error" message stating, "Authorization failed connecting to running client. Make sure you start this program in the same directory as the client." Well, the directories really were the same, but apparently now the password wasn't. So then I shut the BOINC Manager down once again, this time telling it to stop the running tasks, as well. As far as I knew, it had done that successfully, despite the fact that it said it couldn't connect to the client, because the next restart of BOINC Manager did not experience any problems, connection or otherwise.

Well, I didn't have to wait as long as I feared to get an AP GPU task to test my theory with. And my initial test worked just as I thought it might. With BOINC Manager running but unable to connect to the client, which was controlling 7 MB CPU tasks, 8 MB GPU tasks and 1 AP GPU task, I exited BOINC Manager selecting the option to stop the running tasks. Judging by Process Explorer, it took at least 10 seconds for all the tasks to terminate, except for the one AP task, which continued to run all by itself:

[screenshot: Process Explorer with the lone AP task still running]

When I restarted BOINC Manager, all the previously running tasks restarted, including a "phantom" AP task, while the original AP task continued to run standalone.

[screenshot: the restarted "phantom" AP task alongside the standalone original]

Note that the AP task shown as running under the BOINC Manager and client is not using any CPU resources, while the Zombie task is using 6.85%. The phantom task was only shown for about 35 seconds, until BOINC gave up trying to gain control of the lockfile. It vanished from Process Explorer and its status in BOINC Manager changed to "Waiting to run", at which point another MB task kicked off. Then, each time one of the MB tasks finished, BOINC tried again to run the AP task but kept running up against the lockfile issue. I watched that cycle repeat 3 times before shutting everything down and rebooting.

This would seem to indicate that AP GPU tasks may not recognize when the BOINC client shuts down, if the client wasn't running under the control of the BOINC Manager. Since the earlier reported occurrences of this phenomenon resulted from BOINC crashes, I would guess that there's something in the way the dominoes fall in one of those crashes that results in a similar scenario.
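
For context, BOINC science apps are normally supposed to notice a dead client on their own: the client sends periodic heartbeat messages, and an app that stops receiving them is expected to exit. A rough sketch of that watchdog idea (heavily simplified; the real mechanism lives in the BOINC API's shared-memory message channels, and the timeout value here is an assumption):

import time

HEARTBEAT_TIMEOUT = 30.0  # seconds without a heartbeat before giving up

def client_gone(last_heartbeat: float) -> bool:
    """True once the client has been silent long enough that the
    science app should assume it's dead and shut itself down."""
    return time.time() - last_heartbeat > HEARTBEAT_TIMEOUT

# The zombie behavior is consistent with a check like this never
# firing (or never being reached) in the AP app's GPU path.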

In my test, obviously, I used the BOINC Manager to shut down the client, even though they weren't connected (a bit disturbing in itself that this is possible, I think), but I would guess that manually shutting down the BOINC client even without BOINC Manager might have the same effect. However, I don't really know what the "cleanest" way would be to do that. (Task Manager or Process Explorer could kill it, but I'm afraid there might be collateral damage if I tried that.) I should probably leave that to one of the more experienced testers to try.

By the way, the AP task in my test, 3460974473 is now completed and awaiting validation, if anyone wants to check out the Stderr.
18) Message boards : Number crunching : Really old nVidia EOL! (Message 1496108)
Posted 174 days ago by Profile Jeff Buck
Thankfully most of the 400 series cards are unaffected, except for the 405, which I'd never heard of before Nvidia mentioned it.

I've got a 405 on my daily driver, mainly because it only draws 25w and my PSU is rated at just 250w. It's been running just fine on the 314.22 driver and I don't expect I'll have any need to upgrade that.
19) Message boards : Number crunching : "Zombie" AP tasks - still alive when BOINC should have killed them (Message 1495261)
Posted 176 days ago by Profile Jeff Buck
In an earlier thread, 194 (0xc2) EXIT_ABORTED_BY_CLIENT - "finish file present too long", I had reported on some AP GPU tasks that continued to run even after BOINC had crashed. Last Friday afternoon I also found similar "Zombie" AP tasks that continued to run following a deliberate shutdown of the BOINC Manager and client.

My T7400, host 7057115, had suffered a complete hard drive failure the previous Sunday night, and by Friday afternoon I had reloaded a replacement HD from the ground up, with OS (Win8.1), drivers, BOINC, and S@H. I was able to get the scheduler to resend all but 7 (VLARs) of the 140+ "lost" tasks for the host. Once they were all downloaded, I went about trying to reconfigure BOINC and S@H back to what I thought I remembered having on the old HD. As part of that process, I shut down and restarted BOINC several times, to pick up changes to the remote_hosts.cfg file, the gui_rpc_auth.cfg file, the firewall ports, etc.

Following what I think was the third BOINC shutdown, I happened to notice that the 3 GPUs were still under a partial load, and the CPUs were showing intermittent activity, as well. So I fired up Process Explorer and discovered 3 AP GPU tasks chugging along without benefit of BOINC, one on each GPU:

[screenshot: Process Explorer showing the 3 orphaned AP GPU tasks]

I then restarted BOINC, thinking that it would pick up the "undead" tasks, but Process Explorer still showed them running "outside the box". Notice, also, that there are 9 MB GPU tasks running, 3 per GPU. This would be normal if there weren't any AP GPU tasks running, as I try to run 3 MB tasks or 2 MB + 1 AP per GPU. Now it appeared that each GPU had 3 MB tasks running under BOINC's control along with 1 AP task with no supervision at all:

[screenshot: Process Explorer with 9 MB GPU tasks under BOINC plus the 3 standalone AP tasks]

This time I tried exiting BOINC Manager without stopping the tasks running under the BOINC client and, as expected at this point, found that only boincmgr.exe had disappeared from Process Explorer:

[screenshot: Process Explorer after exiting BOINC Manager only]

When I restarted the BOINC Manager this time, I noticed something interesting in the task list:

[screenshot: BOINC Manager task list]

Initially, there were 3 AP tasks and 6 MB tasks shown as "Running", with 3 additional MB tasks "Waiting to run". As I was puzzling over this, all 3 AP tasks changed to "Waiting to run" and the 3 MB tasks changed to "Running", as shown above. However, judging by Process Explorer, all 12 tasks appeared to be consuming resources. Checking the stderr.txt files for the AP tasks, I found 7 or 8 iterations of:

Running on device number: 0
DATA_CHUNK_UNROLL at default:2
DATA_CHUNK_UNROLL at default:2
16:32:27 (1464): Can't acquire lockfile (32) - waiting 35s
16:33:02 (1464): Can't acquire lockfile (32) - exiting
16:33:02 (1464): Error: The process cannot access the file because it is being used by another process. (0x20)


It appeared that BOINC might have been trying to run another instance of each AP task under its control, but couldn't because the tasks running outside its control had a grip on the lockfile.
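
That behavior matches how the slot lockfile is supposed to work: each running task holds an exclusive lock on a file in its slot directory, so a second instance of the same task can't start. A simplified, Windows-flavored sketch of the acquire-wait-exit pattern seen in the Stderr above (an illustration only, not the BOINC API's actual code; error 32 is the Windows sharing-violation code):

import msvcrt, os, sys, time

def acquire_slot_lock(path, retry_wait=35):
    # Open (or create) the lockfile, then try a non-blocking exclusive
    # lock on its first byte; a second instance of the task fails here
    # because the first instance still holds the lock.
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    for attempt in (1, 2):
        try:
            msvcrt.locking(fd, msvcrt.LK_NBLCK, 1)
            return fd  # hold the lock for the life of the process
        except OSError:
            if attempt == 1:
                print(f"Can't acquire lockfile - waiting {retry_wait}s")
                time.sleep(retry_wait)
    print("Can't acquire lockfile - exiting")
    os.close(fd)
    sys.exit(1)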

At this point, I figured the safest thing to do, before the AP tasks ended with possible "finish file" errors, was to just reboot the machine. This finally succeeded in bringing the AP tasks back under BOINC's control:

[screenshot: Process Explorer after the reboot, all AP tasks back under BOINC]

All 3 AP tasks eventually completed normally, were quickly validated and have already vanished from my task detail pages (although I can still provide complete Stderr output if anybody wants it).

Later on Friday evening, I decided to see if I could recreate the "Zombie" scenario with some additional AP tasks that were by then running on the GPUs. I tried various sequences of shutting down BOINC Manager, sometimes specifying that the running tasks should stop also, and sometimes letting the client continue to run. In no instance were any AP tasks left alive when I told BOINC to stop the running tasks, so that left me rather stumped.

Sunday morning it occurred to me that there was one scenario that I had experienced during my first round of BOINC-on BOINC-off configuration file changes, prior to discovery of the free-range AP tasks. When I first changed the password in the gui_rpc_auth.cfg file, I had done it with only BOINC Manager shut down, not the BOINC client. When I then tried to restart the BOINC Manager, I got a "Connection Error" message stating, "Authorization failed connecting to running client. Make sure you start this program in the same directory as the client." Well, the directories really were the same, but apparently now the password wasn't. So then I shut the BOINC Manager down once again, this time telling it to stop the running tasks, as well. As far as I knew, it had done that successfully, despite the fact that it said it couldn't connect to the client, because the next restart of BOINC Manager did not experience any problems, connection or otherwise.

It's possible, then, that this was the scenario that left the AP tasks running when BOINC thought that it had killed them. Unfortunately, by the time I came up with this theory on Sunday, AP GPU tasks were a distant memory and, unless I happen to snag an AP resend, it looks like it might be almost a week before I can actually test it out. I did recreate the mismatched password scenario on Sunday evening, but the only tasks running at the time were MB GPU, MB CPU and AP CPU. All shut down nicely when told politely to do so.

Anyway, for now this is just a theory of how these Zombie AP tasks might occur, but, unless somebody else still has an AP task running on an NVIDIA GPU and wants to take a crack at proving/disproving the theory, I'll just have to wait for the next round of APs.
20) Message boards : Number crunching : BOINC 7.2.42 - Reduced info in Event Log? (Message 1494856)
Posted 177 days ago by Profile Jeff Buck
The <cpu_sched>1</cpu_sched> log flag does the trick. I added it to cc_config this evening and the "Starting task" and "Restarting task" log entries pretty much reverted to their behavior in pre-7.2.42 BOINC, just with "[cpu_sched]" added to each message. Thanks for the help, Richard.
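
For anyone else who wants it, the flag goes inside the log_flags section of cc_config.xml in the BOINC data directory (then restart BOINC or have it re-read the config file):

<cc_config>
  <log_flags>
    <cpu_sched>1</cpu_sched>
  </log_flags>
</cc_config>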

