Posts by Jeff Buck


1) Message boards : Number crunching : blanked AP tasks (Message 1507106)
Posted 1 day ago by Profile Jeff Buck
Since a "reminder" was called for a couple months ago, I think I'll resurrect this discussion with AP task 3497681666, which ran on a GTX 660 for over 3.5 hours (and used 2.5+ hours of CPU time). The Stderr shows:

percent blanked: 99.85

When an AP task such as 3497598158 with "percent blanked: 0.00" runs on the same GPU in under 46 minutes (less than 20 min. CPU time) and a 100% blanked AP task such as 3497681654 is disposed of in 3.23 seconds (1.03 sec. CPU time), it strikes me as utterly perverse for a task that's 99.85% blanked to burn up so many resources to process so little data. (And that task was followed shortly thereafter by task 3497681640 which was 96.68% blanked and sucked up another 3.2 hours of Run Time.)

There's GOT to be a better way! :^)
2) Message boards : Number crunching : "Zombie" AP tasks - still alive when BOINC should have killed them (Message 1497217)
Posted 24 days ago by Profile Jeff Buck
Despite the lack of interest in this problem, I figured I still should go ahead and test it out in my other configurations while the APs were available. I wanted to find out if the OS and GPU type had any bearing on the results.

My original tests were on NVIDIA GPUs running stock apps under Win 8.1. Here are the results from testing on three of my other machines:

6980751 (Win XP, NVIDIA GPUs, Lunatics apps): AP GPU task continues to run w/o BOINC, MB GPU and MB CPU tasks shut down properly
6979886 (Win Vista, NVIDIA GPU, Lunatics apps): AP GPU task continues to run w/o BOINC, MB GPU and MB CPU tasks shut down properly
6912878 (Win 7, ATI GPU, stock apps): AP GPU and MB GPU task continue to run w/o BOINC, MB CPU tasks shut down properly

So it appears that AP GPU tasks fail to shut down under every configuration that I can test and stock MB GPU tasks fail to shut down when running on an ATI GPU (at least under Win 7).
3) Message boards : Number crunching : "Zombie" AP tasks - still alive when BOINC should have killed them (Message 1496418)
Posted 26 days ago by Profile Jeff Buck
Sunday morning it occurred to me that there was one scenario that I had experienced during my first round of BOINC-on BOINC-off configuration file changes, prior to discovery of the free-range AP tasks. When I first changed the password in the gui_rpc_auth.cfg file, I had done it with only BOINC Manager shut down, not the BOINC client. When I then tried to restart the BOINC Manager, I got a "Connection Error" message stating, "Authorization failed connecting to running client. Make sure you start this program in the same directory as the client." Well, the directories really were the same, but apparently now the password wasn't. So then I shut the BOINC Manager down once again, this time telling it to stop the running tasks, as well. As far as I knew, it had done that successfully, despite the fact that it said it couldn't connect to the client, because the next restart of BOINC Manager did not experience any problems, connection or otherwise.

Well, I didn't have to wait as long as I feared to get an AP GPU task to test my theory with. And my initial test worked just as I thought it might. With BOINC Manager running but unable to connect to the client, which was controlling 7 MB CPU tasks, 8 MB GPU tasks and 1 AP GPU task, I exited BOINC Manager selecting the option to stop the running tasks. Judging by Process Explorer, it took at least 10 seconds for all the tasks to terminate, except for the one AP task, which continued to run all by itself:



When I restarted BOINC Manager, all the previously running tasks restarted, including a "phantom" AP task, while the original AP task continued to run standalone.



Note that the AP task shown as running under the BOINC Manager and client is not using any CPU resources, while the Zombie task is using 6.85%. The phantom task was only shown for about 35 seconds, until BOINC gave up trying to gain control of the lockfile. It vanished from Process Explorer and its status in BOINC Manager changed to "Waiting to run", at which point another MB task kicked off. Then, each time one of the MB tasks finished, BOINC tried again to run the AP task but kept running up against the lockfile issue. I watched that cycle repeat 3 times before shutting everything down and rebooting.

This would seem to indicate that AP GPU tasks may not recognize when the BOINC client shuts down, if the client wasn't running under the control of the BOINC Manager. Since the earlier reported occurrences of this phenomenon resulted from BOINC crashes, I would guess that there's something in the way the dominoes fall in one of those crashes that results in a similar scenario.

In my test, obviously, I used the BOINC Manager to shut down the client, even though they weren't connected (a bit disturbing in itself that this is possible, I think), but I would guess that manually shutting down the BOINC client even without BOINC Manager might have the same effect. However, I don't really know what the "cleanest" way would be to do that. (Task Manager or Process Explorer could kill it, but I'm afraid there might be collateral damage if I tried that.) I should probably leave that to one of the more experienced testers to try.

By the way, the AP task in my test, 3460974473 is now completed and awaiting validation, if anyone wants to check out the Stderr.
4) Message boards : Number crunching : Really old nVidia EOL! (Message 1496108)
Posted 26 days ago by Profile Jeff Buck
Thankfully most of the 400 series cards are unaffected, except for the 405, which I'd never heard of before Nvidia mentioned it.

I've got a 405 on my daily driver, mainly because it only draws 25W and my PSU is rated at just 250W. It's been running just fine on the 314.22 driver and I don't expect I'll have any need to upgrade that.
5) Message boards : Number crunching : "Zombie" AP tasks - still alive when BOINC should have killed them (Message 1495261)
Posted 28 days ago by Profile Jeff Buck
In an earlier thread, 194 (0xc2) EXIT_ABORTED_BY_CLIENT - "finish file present too long", I had reported on some AP GPU tasks that continued to run even after BOINC had crashed. Last Friday afternoon I also found similar "Zombie" AP tasks that continued to run following a deliberate shutdown of the BOINC Manager and client.

A complete hard drive failure on my T7400, host 7057115, the previous Sunday night meant that by Friday afternoon I had reloaded a replacement HD from the ground up, with OS (Win8.1), drivers, BOINC, and S@H. I was able to get the scheduler to resend all but 7 (VLARs) of the 140+ "lost" tasks for the host. Once they were all downloaded, I went about trying to reconfigure BOINC and S@H back to what I thought I remembered having on the old HD. As part of that process, I shut down and restarted BOINC several times, to pick up changes to the remote_hosts.cfg file, the gui_rpc_auth.cfg file, the firewall ports, etc.

Following what I think was the third BOINC shutdown, I happened to notice that the 3 GPUs were still under a partial load, and the CPUs were showing intermittent activity, as well. So I fired up Process Explorer and discovered 3 AP GPU tasks chugging along without benefit of BOINC, one on each GPU:



I then restarted BOINC, thinking that it would pick up the "undead" tasks, but Process Explorer still showed them running "outside the box". Notice, also, that there are 9 MB GPU tasks running, 3 per GPU. This would be normal if there weren't any AP GPU tasks running, as I try to run 3 MB tasks or 2 MB + 1 AP per GPU. Now it appeared that each GPU had 3 MB tasks running under BOINC's control along with 1 AP task with no supervision at all:



This time I tried exiting BOINC Manager without stopping the tasks running under the BOINC client and, as expected at this point, found that only boincmgr.exe had disappeared from Process Explorer:



When I restarted the BOINC Manager this time, I noticed something interesting in the task list:



Initially, there were 3 AP tasks and 6 MB tasks shown as "Running", with 3 additional MB tasks "Waiting to run". As I was puzzling over this, all 3 AP tasks changed to "Waiting to run" and the 3 MB tasks changed to "Running", as shown above. However, judging by Process Explorer, all 12 tasks appeared to be consuming resources. Checking the stderr.txt files for the AP tasks, I found 7 or 8 iterations of:

Running on device number: 0
DATA_CHUNK_UNROLL at default:2
DATA_CHUNK_UNROLL at default:2
16:32:27 (1464): Can't acquire lockfile (32) - waiting 35s
16:33:02 (1464): Can't acquire lockfile (32) - exiting
16:33:02 (1464): Error: The process cannot access the file because it is being used by another process. (0x20)


It appeared that BOINC might have been trying to run another instance of each AP task under its control, but couldn't because the tasks running outside its control had a grip on the lockfile.
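For anyone curious about the mechanism, the acquire/retry/exit pattern in that Stderr can be sketched like this (an illustrative Python sketch only, not BOINC's actual implementation -- BOINC holds an OS-level lock on a lockfile in the slot directory, whereas the exclusive-create trick below just demonstrates the same "wait once, then give up" behavior):

```python
import os
import time

def acquire_lockfile(path, retry_wait=35):
    """Mimic the lockfile behavior seen in the Stderr above: try once,
    wait, try again, then exit. Exclusive file creation stands in for
    BOINC's real OS-level file lock."""
    for attempt in range(2):
        try:
            # O_EXCL makes creation fail if another process already holds it
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            return fd  # "lock" held until the file is closed and removed
        except FileExistsError:
            if attempt == 0:
                print(f"Can't acquire lockfile - waiting {retry_wait}s")
                time.sleep(retry_wait)
    print("Can't acquire lockfile - exiting")
    return None
```

So the second instance that BOINC launched waited 35 seconds, tried once more, and gave up -- exactly the message sequence in the log -- because the free-range task still had its grip on the file.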

At this point, I figured the safest thing to do, before the AP tasks ended with possible "finish file" errors, was to just reboot the machine. This finally succeeded in bringing the AP tasks back under BOINC's control:



All 3 AP tasks eventually completed normally, were quickly validated and have already vanished from my task detail pages (although I can still provide complete Stderr output if anybody wants it).

Later on Friday evening, I decided to see if I could recreate the "Zombie" scenario with some additional AP tasks that were by then running on the GPUs. I tried various sequences of shutting down BOINC Manager, sometimes specifying that the running tasks should stop also, and sometimes letting the client continue to run. In no instance were any AP tasks left alive when I told BOINC to stop the running tasks, so that left me rather stumped.

Sunday morning it occurred to me that there was one scenario that I had experienced during my first round of BOINC-on BOINC-off configuration file changes, prior to discovery of the free-range AP tasks. When I first changed the password in the gui_rpc_auth.cfg file, I had done it with only BOINC Manager shut down, not the BOINC client. When I then tried to restart the BOINC Manager, I got a "Connection Error" message stating, "Authorization failed connecting to running client. Make sure you start this program in the same directory as the client." Well, the directories really were the same, but apparently now the password wasn't. So then I shut the BOINC Manager down once again, this time telling it to stop the running tasks, as well. As far as I knew, it had done that successfully, despite the fact that it said it couldn't connect to the client, because the next restart of BOINC Manager did not experience any problems, connection or otherwise.

It's possible, then, that this was the scenario that left the AP tasks running when BOINC thought that it had killed them. Unfortunately, by the time I came up with this theory on Sunday, AP GPU tasks were a distant memory and, unless I happen to snag an AP resend, it looks like it might be almost a week before I can actually test it out. I did recreate the mismatched password scenario on Sunday evening, but the only tasks running at the time were MB GPU, MB CPU and AP CPU. All shut down nicely when told politely to do so.

Anyway, for now this is just a theory of how these Zombie AP tasks might occur, but, unless somebody else still has an AP task running on an NVIDIA GPU and wants to take a crack at proving/disproving the theory, I'll just have to wait for the next round of APs.
6) Message boards : Number crunching : BOINC 7.2.42 - Reduced info in Event Log? (Message 1494856)
Posted 29 days ago by Profile Jeff Buck
The <cpu_sched>1</cpu_sched> log flag does the trick. I added it to cc_config this evening and the "Starting task" and "Restarting task" log entries pretty much reverted to their behavior in pre-7.2.42 BOINC, just with "[cpu_sched]" added to each message. Thanks for the help, Richard.
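For anybody else who wants it, the flag goes in the <log_flags> section of cc_config.xml in the BOINC data directory (then restart the client, or tell it to re-read the config files):

```xml
<cc_config>
  <log_flags>
    <cpu_sched>1</cpu_sched>
  </log_flags>
</cc_config>
```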
7) Message boards : Number crunching : BOINC 7.2.42 - Reduced info in Event Log? (Message 1494694)
Posted 29 days ago by Profile Jeff Buck
Which version were you used to using?

All my other machines are on 7.2.33, so this seems to be a sudden change. Looks like that <cpu_sched>1</cpu_sched> might be what I need to get the app and slot info back. I'll try it this evening when my T7400 comes back up. (It's always shut down on weekday afternoons.)
8) Message boards : Number crunching : BOINC 7.2.42 - Reduced info in Event Log? (Message 1494670)
Posted 29 days ago by Profile Jeff Buck
On Friday I installed BOINC 7.2.42 on my host 7057115, not so much by choice but because I'd had to install fresh copies of everything on that machine due to a hard drive failure the previous Sunday night.

In the process of trying to research a problem that cropped up, with some AP tasks continuing to run when BOINC was shut down (and which I'll probably discuss in a new thread when I can get to it), I discovered that the Event Log in 7.2.42 seems to have done away with several pieces of very useful information. In 7.2.33, when a new task started, the log would show an entry similar to:

Starting task 17my13aa.23116.13155.438086664205.12.31_1 using setiathome_v7 version 700 (cuda42) in slot 1

In 7.2.42, the application and slot info have disappeared, leaving an entry looking like this:

Starting task 20my13ab.6061.3441.438086664200.12.14_1

On a machine that's only running one task at a time, like my old ThinkPad laptop, I suppose that's perfectly adequate. But on my T7400, which is usually running as many as 17 tasks at a time, trying to identify and follow task activity in the Event Log can be a real pain without that application and slot data. I also noticed that there were no entries in the log for task restarts, which further increases the difficulty of tracking task behavior. There might be other differences that haven't impacted me yet.
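In the meantime, the older-format lines are easy enough to pick apart if you're digging through saved logs. A quick sketch (the field layout is taken from the 7.2.33 example above; the group names are just my own labels):

```python
import re

# Matches the 7.2.33-style "Starting task" line quoted above.
STARTING = re.compile(
    r"Starting task (?P<task>\S+) using (?P<app>\S+) "
    r"version (?P<version>\d+) \((?P<plan_class>\w+)\) in slot (?P<slot>\d+)"
)

line = ("Starting task 17my13aa.23116.13155.438086664205.12.31_1 "
        "using setiathome_v7 version 700 (cuda42) in slot 1")
fields = STARTING.match(line).groupdict()
# e.g. fields["app"] is "setiathome_v7" and fields["slot"] is "1"
```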

Does anybody know if these are intentional omissions by the BOINC developers, or if perhaps there's some new option that I need to set to turn this data back on? Under the circumstances, I'm not likely to upgrade my other machines to 7.2.42, and will probably see if I can downgrade on my T7400.
9) Message boards : Number crunching : 194 (0xc2) EXIT_ABORTED_BY_CLIENT - "finish file present too long" (Message 1494659)
Posted 29 days ago by Profile Jeff Buck
I'm not sure whether that's the same problem or not. It looks as if the program reached the normal completion point, but couldn't close down properly for some reason. So, it started again, and crashed on the restart.

Still, lots of lovely debug information logged, so I'll save that for Jason - looks like the replacement wingmate is a very fast host, so it may disappear too soon.

Got one last evening similar to Juan's, on a stock cuda42 task, 3453302106, if you want to grab the debugging info before it vanishes.

Based on the timing, I think I caused this one myself, by shutting down and restarting BOINC several times while testing a theory related to the "Zombie" AP tasks that started this thread in the first place. (I'll probably post more on that theory in a separate thread when I can.) It seems as if BOINC actually shuts down before the applications do, and this task must have managed to "finish" after BOINC was already terminated. It then restarted at 86.16% but with the "finish file" already present.
10) Message boards : Number crunching : Panic Mode On (87) Server Problems? (Message 1488713)
Posted 40 days ago by Profile Jeff Buck
Here's the sort of thing that'll hold down the RTS buffer if it happens very often. My T7400, 7057115, just blew through 37 tasks in a row from "tape" 27mr13aa that had perfectly legitimate -9 overflows (i.e., not a runaway rig). Total Run Time = 238.35 seconds! These tasks were shorties to begin with, but this was ridiculous. :^)

All 37 tasks had names starting with "27mr13aa.1310.10292.438086664203.12". (Two other tasks from the same tape, but from a different sequence, produced strange errors. That's a whole different issue, I think.)
11) Message boards : Number crunching : Worth running two WUs simultaneously on mainstream Kepler cards ? (Message 1487644)
Posted 42 days ago by Profile Jeff Buck
I have 2 boxes with GTX660s in them, but both are dedicated crunchers.

On 6980751 a single GTX660 is grouped with a GTX650 and two GTX640s. Because of the slower cards, I limit it to 2 WUs per GPU. That puts the GTX660 at about 92% with 2 Multibeam tasks running (but often much less than that with 1 MB and 1 Astropulse). I'm guessing that you might see a comparable load if you went with 2 per GPU.

On 7057115 two GTX660s are grouped with a GTX670. Originally, that machine only had the 2 GTX660s and I started running with 2 WUs per GPU, which also put the normal load at about 92-93%. When I increased it to 3 WUs per GPU, I got the load up to about 98-99%. Obviously not a great increase, but an increase nonetheless. The RAC also increased a comparable amount, about 5-6%, so not a lot lost to additional overhead. When I added the GTX670, I just kept running the 3 WUs per GPU setup.

Bottom line is, just experiment with your own setup. If your machine is not a dedicated cruncher, you may not actually want to max out the GPU load with SETI, but that probably depends on what else you use it for.
12) Message boards : Number crunching : 194 (0xc2) EXIT_ABORTED_BY_CLIENT - "finish file present too long" (Message 1487613)
Posted 42 days ago by Profile Jeff Buck
Just for the record, I guess, here's another one. Pretty much the same circumstances. BOINC crashed and the last entry in the Event Log was the "Starting task" message for an AP GPU task at 21:54:44. That task, 3431345398, turned out to be 100% blanked (too much RFI) and called boinc_finish just 3 seconds later, at 21:54:47, but BOINC was already gone. Unfortunately, it was almost 11 hours later before I discovered the outage, deleted the boinc_finish_called file from the AP task's slot directory, and got that machine back in business. No tasks lost, just 11 hours of processing time. :-( For what it's worth, this AP task was running on a different GPU than the one involved in the previous crash on that machine.

Clearly there seems to be some connection between these BOINC crashes and the start of an AP GPU task but, despite these two happening within less than 24 hours, they still seem to be pretty rare. (These are the first two on that particular machine.) Between these two most recent crashes, at least 4 AP GPU tasks were successfully processed without any issues, another one has completed since the last crash, and 3 more are running without any problems right now, as I write this. So there must be some combination of circumstances that have to exist to trigger the crash, besides just the start of an AP GPU task. On the other machine where I've seen this happen, weeks or months pass between crashes, with dozens or even hundreds of AP tasks running without incident. It's certainly mystifying!
13) Message boards : Number crunching : 194 (0xc2) EXIT_ABORTED_BY_CLIENT - "finish file present too long" (Message 1487038)
Posted 43 days ago by Profile Jeff Buck
Discovered this morning that BOINC had crashed during the night on one of my machines (6980751). This is a different machine than the one where I've reported on this problem previously. However, as before, there was an AP running on a GPU at the time BOINC went down. When I checked the slot directory, sure enough, I found a boinc_finish_called file sitting there. I deleted that file before restarting BOINC and the AP restarted (at 96.40%) and finished normally. That task, 3430588916, is now waiting for validation.

I found essentially the same set of circumstances with this BOINC crash as I did with the one on the other machine that I reported in detail on in Message 1474737. That is to say, the last entry in the Event Log prior to the crash is the "Starting task" message for the AP GPU task at 3:00:11 AM. All other running tasks at the time of the crash (7 MB GPU, 4 MB CPU, 3 AP CPU) appear to have exited shortly thereafter, with a "No heartbeat from core client for 30 sec - exiting" message in the Stderr. However, the one AP GPU task apparently continued to run for another 51 minutes before finally calling boinc_finish at 03:51:36.

One interesting difference in this case is that I've just recently started running Lunatics on this particular machine, whereas the previously reported incidents on the other machine were all running the stock AP app. So, apparently, the problem exists in both stock and Lunatics, but only for GPU tasks, not CPU tasks.

In summary, the BOINC crashes seem to occur immediately after an AP GPU task is started, but any AP GPU tasks running at the time of the crash seem to keep running to a normal finish, at which point the call to boinc_finish is left unanswered, since BOINC is by then long gone. Deleting the boinc_finish_called file from the Slot directory for each AP GPU task before restarting BOINC succeeds in allowing the tasks to restart and finish normally (again), without triggering the "finish file present too long" error.
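That manual cleanup could be scripted. Here's a sketch that just lists any leftover finish files before a restart (the slots/*/boinc_finish_called layout matches what's described above; the data directory path is whatever your installation uses, so treat that as an assumption):

```python
import glob
import os

def find_finish_files(data_dir):
    """Return paths of leftover boinc_finish_called files in BOINC's
    slot directories, so they can be reviewed (and deleted) before
    restarting the client."""
    pattern = os.path.join(data_dir, "slots", "*", "boinc_finish_called")
    return sorted(glob.glob(pattern))
```

Running it against the BOINC data directory after a crash would flag exactly the slots that need the file deleted to avoid the "finish file present too long" error.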
14) Message boards : Number crunching : I'm falling, I bought a parachute. From 100% AP, to 100% MB. (Message 1486544)
Posted 45 days ago by Profile Jeff Buck
15) Message boards : Number crunching : Why is the tape list not being updated? (Message 1486422)
Posted 45 days ago by Profile Jeff Buck
On a very quick'n'dirty comparison, I saw these tapes 'during February' that you didn't:

02jn13ab
09se09af
12mr13ab

I went ahead and parsed a couple more months of data and found the 02jn13ab in my Jan 2014 files, the 09se09af in Dec 2013, and the 12mr13ab in both Dec 2013 and Jan 2014, so it's just a different methodology shifting a few files from one month to another. You'll probably pick up those other two "February" files that I found as "March" files.

Don't be fooled by the displayed RAC on my account - I have access to more records than that. And since I crunch MB only, my throughput is perhaps closer to yours than you might think - 15,308 tasks added to the database in February.

Okay, that explains why what I thought was my larger sample size didn't really pick up anything significant. It still is a bit larger (27,978 tasks for February) but not as much as it initially appeared. So, since it looks like your sample is big enough to identify everything we're likely to see, I won't bother running the rest of my files. Satisfied my curiosity, at least!
16) Message boards : Number crunching : Why is the tape list not being updated? (Message 1486297)
Posted 45 days ago by Profile Jeff Buck
This was started as a simple historical record, and since it draws on only one user's records, it's probably not complete anyway.

Since, based on RAC at least, it appears that I'm probably running through about 6 times as many WUs in a month as you are, I got curious as to whether that greater volume might produce a more "complete" list. It turns out that it doesn't seem to add much, if anything, to what you found. With only two exceptions, none of the tape counts for WUs I've received that were created in February (the only month I've run through so far) exceed those shown in your chart. Those two exceptions are 23 May 2013, for which I found 1 tape (23my13ab) where none is shown in your chart, and 31 Mar 2013, for which I found 4 tapes (31mr13aa, 31mr13ad, 31mr13ag, 31mr13ah) where your chart shows 3.

Of course, I'm just comparing numbers of tapes for each month, since I can't do a one-for-one comparison of actual tape names. Perhaps there are a few other minor deviations. Then, too, these small differences might simply be due to methodology. I'm parsing the info from WU detail pages in my archive, which means the month I'm assigning to a tape comes from WU creation date, rather than the date I actually received a task.

Here's a link to my February 2014 tape list, if you'd like to do a tape-by-tape comparison with your own list. My archives only go back through May 2013 (and the latter part of April), but I'd be happy to produce lists for what I have, if you think it might be useful. (Now that I have a parser set up, I think it'll only take about 5 minutes or so to generate each monthly list.)
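For what it's worth, the parsing itself is simple once you notice that the tape name is everything before the first dot in a task name, with a DDmmYY date baked into the front. A rough sketch (the month codes for months not appearing in these posts are my guesses):

```python
import re

# Two-letter month codes seen in tape names (e.g. "27mr13aa" = 27 Mar 2013).
# Codes for months not seen in these posts are assumed, not confirmed.
MONTHS = {"ja": 1, "fe": 2, "mr": 3, "ap": 4, "my": 5, "jn": 6,
          "jl": 7, "au": 8, "se": 9, "oc": 10, "no": 11, "de": 12}

def tape_of(task_name):
    """The tape is everything before the first dot in a task name."""
    return task_name.split(".", 1)[0]

def tape_date(tape):
    """Decode the DDmmYY prefix of a tape name.
    Returns (year, month, day), or None if the prefix doesn't parse."""
    m = re.match(r"^(\d{2})([a-z]{2})(\d{2})", tape)
    if not m or m.group(2) not in MONTHS:
        return None
    return (2000 + int(m.group(3)), MONTHS[m.group(2)], int(m.group(1)))
```

Grouping tapes by (year, month) from there is just a dictionary of lists, which is why each monthly list only takes a few minutes to generate.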
17) Message boards : Number crunching : BOINC lost attachment to S@H project (Message 1483870)
Posted 51 days ago by Profile Jeff Buck
I re-installed Boinc and connected, and I have lost no tasks and had no errors.

When you reinstalled BOINC, did you have to reattach S@H or did it find your projects and project directories automatically?

He would have had to reattach because he had lost the following file:

starting with "Couldn't parse account file account_setiathome.berkeley.edu.xml", "Couldn't parse statistics_setiathome.berkeley.edu.xml", and "Project SETI@home is in state file but no account file found".

Had he restored them from another host he may not have had to do that, whether he would have lost his Wu's is another matter.

Claggy

Well, I guess the loss of that account file is the reason that I had to reattach, but I was wondering whether Bernie had to do the same thing when he reinstalled BOINC.

What could have caused that account file to disappear, anyway? (I'm guessing that it was a shutdown problem, but I'm not really sure.)
18) Message boards : Number crunching : BOINC lost attachment to S@H project (Message 1483868)
Posted 51 days ago by Profile Jeff Buck
I re-installed Boinc and connected, and I have lost no tasks and had no errors.

When you reinstalled BOINC, did you have to reattach S@H or did it find your projects and project directories automatically?
19) Message boards : Number crunching : Panic Mode On (87) Server Problems? (Message 1483845)
Posted 51 days ago by Profile Jeff Buck
20) Message boards : Number crunching : BOINC lost attachment to S@H project (Message 1483759)
Posted 51 days ago by Profile Jeff Buck
Friday evening, I discovered that BOINC had lost track of S@H on my host 7057115 following a restart. It was suddenly telling me that "This computer is not attached to any projects". The short version is that I eventually ended up having to add the S@H project back to BOINC, but in the process wound up with 146 Abandoned tasks. The crisis has passed, for now, but I'd like to understand whether there's some other approach I could take to avoid the Abandoned tasks situation if this sort of thing happens again. (I suspect the root cause is an incomplete system shutdown leading to startup anomalies, so unless I can track down that gremlin, BOINC / S@H will continue to be at risk.)

The first thing I did when I saw the "not attached" message was to confirm that the S@H project directory was still present and populated. It was, and as far as I could tell still had all required files present. The Event Log, though, had numerous error messages, starting with "Couldn't parse account file account_setiathome.berkeley.edu.xml", "Couldn't parse statistics_setiathome.berkeley.edu.xml", and "Project SETI@home is in state file but no account file found". I won't try to post everything here unless somebody thinks it necessary, but basically there are a lot of "... outside project in state file" messages for every WU and task (the ones that ultimately got abandoned).

I initially tried just shutting down the BOINC Manager and client, but still got the same "not attached" message, although this time the Event Log only showed the "Couldn't parse ..." messages without the myriad of additional error messages. At this point I finally just decided to go ahead and add S@H back to BOINC, hoping that it would recognize the existing project directory and tasks, but of course it didn't. It immediately tried to download two new tasks and all the associated application files. However, those downloads were failing with a series of "Can't create HTTP response output file ..." messages, one for each task and app file.

So then I just threw up my hands and rebooted the machine. When BOINC came back up this time, the new downloads proceeded without any problems and I eventually got all the application files back (well, the MB ones, anyway) and a whole queue full of new tasks, but of course by now the original 146 tasks had been abandoned. I guess BOINC just overwrote the entire S@H project directory, since I also lost my app_config.xml and mbcuda.cfg files, which I had to recreate.

As I say, the crisis has passed and BOINC seems to be running smoothly again, but I'd like to know if there's a way to reconnect BOINC to the project without such complete carnage, should it ever lose its memory like this again. Any suggestions?



Copyright © 2014 University of California