Posts by Jeff Buck


1) Message boards : Number crunching : Perhaps my 7th wingman will be the charm! (or maybe the 8th) (Message 1590386)
Posted 29 minutes ago by Profile Jeff Buck
LOL

Well, it was certainly entertaining while it lasted! I see one last bit of mystery in that last host's Stderr, which looks to be truncated, after multiple restarts, with no pulse counts included. A fitting finish.
2) Message boards : Number crunching : Perhaps my 7th wingman will be the charm! (or maybe the 8th) (Message 1590190)
Posted 6 hours ago by Profile Jeff Buck
And now I have my 8th wingman, after the 7th one timed out. I thought 7 would be the limit, but I guess we'll soon find out if it stops at 8, since number 8's task summary doesn't indicate a particularly successful host.

State: All (70) · In progress (8) · Validation pending (0) · Validation inconclusive (0) · Valid (1) · Invalid (0) · Error (61)


This is downright comical!

Edit: Although now that I take a second look at his task list, the one and only Valid task he has is an AP v6. Maybe there's still hope!
3) Message boards : Number crunching : Panic Mode On (91) Server Problems? (Message 1590025)
Posted 15 hours ago by Profile Jeff Buck
The sense I'm getting is that if there's only a small number of tasks to report, they go through fine, but if there's a large quantity, they appear to fail. However, I'd almost bet that some of those are actually getting reported even though it doesn't look like any of them are. The scheduler appears to take an all-or-nothing approach when it comes to clearing the queue. My #1 cruncher finally managed to report 184 tasks about an hour and a half ago, but now the backlog is building up again. Still haven't received any new work, though.
4) Message boards : Number crunching : "Zombie" AP tasks - still alive in AP v7 (Message 1589992)
Posted 16 hours ago by Profile Jeff Buck
I don't wish to rain on your parade but a couple of those tasks have validated with your wingperson and the others are just waiting for your wingpersons to return their results. ;-)

Cheers.

Dry as a bone here in California. ;^)

If you're referring to the tasks from the last BOINC crash, the reason they're fine is that I deleted all the "finish" files from the slot directories before restarting BOINC. If you look down through the Stderr for each of them, you'll see two calls to boinc_finish, such as these in task 3793460876:

05:29:08 (1344): called boinc_finish(0)
11:40:54 (1236): called boinc_finish(0)

The first was generated when the zombie task originally completed (after continuing to run for about 45 minutes following BOINC's crash at 04:47). The second is for the completion after I discovered the crash, deleted the original finish file and restarted BOINC. The task goes back to the last checkpoint, then generally finishes again in a few minutes.
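For anyone who wants to do the same cleanup, it boils down to something like this little sketch (the data directory path is just an example for a default Windows install; the "boinc_finish_called" name is what I see in my own slot folders):

import os

# Example default BOINC data directory on Windows; adjust for your install.
SLOTS_DIR = os.path.join(r"C:\ProgramData\BOINC", "slots")

# With BOINC shut down, look in each slot directory for a leftover
# "boinc_finish_called" marker and delete it, so the task can go back to
# its last checkpoint instead of erroring out with
# "finish file present too long" when BOINC restarts.
for slot in os.listdir(SLOTS_DIR):
    marker = os.path.join(SLOTS_DIR, slot, "boinc_finish_called")
    if os.path.isfile(marker):
        print("Removing stale finish file:", marker)
        os.remove(marker)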
5) Message boards : Number crunching : Panic Mode On (91) Server Problems? (Message 1589924)
Posted 18 hours ago by Profile Jeff Buck
I can report completed tasks, though sometimes it takes several tries, but I haven't been able to D/L any new ones. One of my machines is out of GPU work. If it doesn't get any before I go to bed, I'll probably shut it down.
6) Message boards : Number crunching : not getting jobs from SETI (Message 1589904)
Posted 18 hours ago by Profile Jeff Buck
I do know there's a problem with older GPUs running Astropulse tasks with the 340.52 driver, as discussed in @Pre-FERMI nVidia GPU users: Important warning, so those might be blocked from your machine. However, I'm not aware of any issue with that GPU/driver combination affecting the Multibeam tasks (cuda, in your case).
7) Message boards : Number crunching : "Zombie" AP tasks - still alive in AP v7 (Message 1589880)
Posted 19 hours ago by Profile Jeff Buck
Well, I had another BOINC crash at 4:47 this morning, just 3 days after the last one, but on a different machine, my xw9400. Same apparent trigger, but with a single AP already running on each of the other 3 GPUs, I ended up with 4 AP zombies, tasks 3793460876, 3793445511, 3793445524 and 3793460874. Based on the event log, that last one appears to have been the trigger. I also found an MB task, 3793085246, with a "boinc_finish_called" file in its slot directory. I suspect that it was just unlucky enough to get caught in its termination phase when the crash happened. In any event, deleting the "finish" file before restarting BOINC also allowed it to restart and then finish again normally.

I got to thinking about the apparent gap of 6+ months between these BOINC crashes and the subsequent 2 crashes in 3 days on different machines. What was different during those 6+ months? One promising theory that I had was that the period roughly coincided with the span where we were processing mostly older 2008 and 2009 data. Then we recently jumped ahead to more recent tapes, 2010-2014.

That theory almost works, but I found one flaw in it. In reviewing the BOINC crash occurrences, I found one that I had forgotten about, on June 30. That one turned out to be a 2009 file. So far, it's the only fly in the ointment, though.

Anyway, as long as I've dug this additional info out, I'll go ahead and post it in the hopes that it may yet prove useful when more clues surface. Here's a list of all my BOINC crashes which generated AP zombie tasks (with a small log-scanning sketch after the list). The list shows the date and host ID, followed by the dataset name of the AP task which appears to have triggered the crash. (I didn't capture the stdoutdae file for the December 30, 2013, crash, so I don't know for sure which "zombie" was the last one to start.)

20131230_7057115: ap_16oc13ac_B3_P0_00113_20131229_06439.wu_2 (don't know which of these 2 tasks was trigger)
20131230_7057115: ap_16oc13ad_B6_P1_00200_20131229_05567.wu_1 (don't know which of these 2 tasks was trigger)
20140104_7057115: ap_17oc13ac_B1_P0_00131_20140103_01567.wu_1
20140209_7057115: ap_10ap13aa_B5_P1_00191_20140208_30452.wu_1
20140310_6980751: ap_28my13ad_B1_P0_00265_20140309_08199.wu_2 (morning crash)
20140310_6980751: ap_28my13ad_B3_P1_00211_20140310_25141.wu_1 (evening crash)
20140630_6980751: ap_13mr09ab_B3_P0_00183_20140628_15930.wu_0
20141018_7057115: ap_21no10ab_B0_P0_00325_20141016_19978.wu_2
20141021_6980751: ap_06se14aa_B6_P1_00113_20141020_01141.wu_1
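And in case anyone wants to do the same digging, this is roughly how I pull those trigger candidates out of a saved copy of stdoutdae.txt (just a sketch; the message wording matches what shows up in my own logs):

# Scan a saved stdoutdae.txt and, for each client restart, report the last
# AP task that was started beforehand (my candidate "trigger" task).
last_ap_start = None
with open("stdoutdae.txt", "r", errors="replace") as log:
    for line in log:
        if "Starting task ap_" in line:
            last_ap_start = line.strip()
        elif "Starting BOINC client version" in line and last_ap_start:
            print("Client restart:", line.strip())
            print("  Last AP task started before it:", last_ap_start)
            last_ap_start = None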
8) Message boards : Number crunching : Problem with AstroPulse v7 (Message 1589825)
Posted 21 hours ago by Profile Jeff Buck
Got this: the WU ended with an error after 1 hour 44 min 10 sec of processing time on a 670.

http://setiathome.berkeley.edu/result.php?resultid=3792993814

<message>
finish file present too long
</message>

I remember seeing that error before, but at the time it was on an MB WU, before the commode builds.

Did BOINC crash, or stop unexpectedly?

My most recent thread on this is "Zombie" AP tasks - still alive in AP v7. There are links to 2 earlier threads in there.

I just experienced another one this morning also. I'm still putting together the info to add to the thread.
9) Message boards : Number crunching : The current top cruncher (Message 1589438)
Posted 2 days ago by Profile Jeff Buck
Even though Charles' RAC has dropped a lot lately, he's still the first to pass the 1,000,000,000-cobblestone mark (not bad for just under 15 months of crunching).

Congratz Charles.

Cheers.

Wow! Let's see, at my current rate, I might reach that milestone sometime in the year 2040....that is, if they let me keep running my machines in the old f[olks|ogeys|ellas|arts] home!
10) Message boards : Number crunching : The GTX750(Ti) Thread (Message 1589417)
Posted 2 days ago by Profile Jeff Buck
This would be:

-use_sleep -unroll 10 -oclfft_plan 256 16 256 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 64 4 1 -tune 2 64 4 1

Okay, thanks, Mike. I'll try those out. Since I run both APs and MBs, and less than 3% of the tasks I receive are APs, I don't expect the overall boost to be that great, but it should be interesting to try!
11) Message boards : Number crunching : The GTX750(Ti) Thread (Message 1589414)
Posted 2 days ago by Profile Jeff Buck
Up until recently, on my old xw9400 I had been running a mixed GPU configuration with a GTX660, GTX650 and two GTX640s. Purely by coincidence, in the 2 weeks leading up to the AP v7 rollout, I had gradually upgraded that host to two GTX660s and two GTX750Tis, still mixed, but a bit more closely matched.

Under the old configuration and running AP v6, I had made what I felt was conservative use of the ap_cmdline capabilities, with just a simple:

-unroll 6 -ffa_block 2048 -ffa_block_fetch 1024 -hp

This seemed to accommodate the compute unit disparities between the 3 different GPU types I was running and I was content with it.

Since AP v7 rolled out the day after I made the last upgrade to that machine, I thought I might just let it go back to the defaults, inasmuch as AP v7 was supposed to adjust those parameters according to the specific GPU it was running on. However, I've noticed that on my T7400, which runs stock, AP v7 is assigning the equivalent of "-unroll 6 -ffa_block 1536 -ffa_block_fetch 768" to a GTX660 which is a matching unit to one of the ones in the xw9400, with 6 CUs. These appear to be even more conservative values than what I'm already running on the xw9400, so I left the ap_cmdline alone for the time being.

Anyway, now that the GPUs are more closely matched on that host (1 with 6 CUs and 3 with 5 CUs), and seeing that the AP v7 default values appear to be even more conservative than I am, what might a recommended ap_cmdline for that host look like? (For that matter, what values might work for the T7400, which currently has a mixed bag of GTX780, GTX670, and GTX660?)


Since the slowest GPU in that host has 6 CUs, you can use:

-use_sleep -unroll 12 -oclfft_plan 256 16 256 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 64 4 1 -tune 2 64 4 1

Let's just run for a few days and maybe you can increase a little bit more.

I assume you mean that for the T7400. How about for the xw9400, with the two GTX660s (one with 6 CUs and one with 5 CUs) and two GTX750Tis (both with 5 CUs)?
12) Message boards : Number crunching : Panic Mode On (90) Server Problems? (Message 1589397)
Posted 2 days ago by Profile Jeff Buck
Perhaps it's an Internet backbone issue. Everything's loading normally for me, but I'm only about 100 miles from Berkeley.

EDIT: Interesting. Just editing for the sake of editing, but still pretty much instantaneous here!
13) Message boards : Number crunching : The GTX750(Ti) Thread (Message 1589354)
Posted 2 days ago by Profile Jeff Buck
Up until recently, on my old xw9400 I had been running a mixed GPU configuration with a GTX660, GTX650 and two GTX640s. Purely by coincidence, in the 2 weeks leading up to the AP v7 rollout, I had gradually upgraded that host to two GTX660s and two GTX750Tis, still mixed, but a bit more closely matched.

Under the old configuration and running AP v6, I had made what I felt was conservative use of the ap_cmdline capabilities, with just a simple:

-unroll 6 -ffa_block 2048 -ffa_block_fetch 1024 -hp

This seemed to accommodate the compute unit disparities between the 3 different GPU types I was running and I was content with it.

Since AP v7 rolled out the day after I made the last upgrade to that machine, I thought I might just let it go back to the defaults, inasmuch as AP v7 was supposed to adjust those parameters according to the specific GPU it was running on. However, I've noticed that on my T7400, which runs stock, AP v7 is assigning the equivalent of "-unroll 6 -ffa_block 1536 -ffa_block_fetch 768" to a GTX660 which is a matching unit to one of the ones in the xw9400, with 6 CUs. These appear to be even more conservative values than what I'm already running on the xw9400, so I left the ap_cmdline alone for the time being.

Anyway, now that the GPUs are more closely matched on that host (1 with 6 CUs and 3 with 5 CUs), and seeing that the AP v7 default values appear to be even more conservative than I am, what might a recommended ap_cmdline for that host look like? (For that matter, what values might work for the T7400, which currently has a mixed bag of GTX780, GTX670, and GTX660?)
14) Message boards : Number crunching : "Zombie" AP tasks - still alive in AP v7 (Message 1589073)
Posted 2 days ago by Profile Jeff Buck
With the low occurrence of 5 times between 2 machines in a year, it does make it hard to troubleshoot. If you wanted to, you could install some debug logging software; then you could have very detailed crash logs.

Having logs that only state "whatever.exe has stopped unexpectedly" can drive you bonkers. My old i7-860 would sometimes shut down or restart without warning. The only log was "windows shutdown unexpectedly at time stamp". I dealt with that for years before finding out, just about a week ago, that it was a faulty chassis power switch, which would sometimes close when not pushed. I only found out that's what it was when it failed completely.

Yep. Windows 8.1 seems to do a lot more logging than previous versions, but the logs still don't seem to tell me diddly squat! In any event, I don't think the low frequency of occurrence would justify the extra overhead that I would expect a debugging package to introduce. Actually, there's probably some log switch in BOINC that might generate an additional line or two subsequent to the Starting Task message but, again, probably not worth the extra overhead without more frequent crashes. Anyway, it's the "Finish file present too long" and zombie task issues that I'm more interested in resolving at this point, since I think there's enough solid evidence to act on there, rather than the BOINC crashes themselves.
15) Message boards : Number crunching : Phantom Triplets (Message 1589065)
Posted 2 days ago by Profile Jeff Buck
Basically the same as the white dot graphical artefacts you'd get overclocking a GPU for gaming, at which point backing off at least two 'notches' is the general corrective method. White dots in graphical glitches imply ~24-32 bits of saturation (bits flipped to on), so yeah many more than a single bit flip, though typically tied to single memory fetches.

As with overclocking, it sometimes happens with 'gaming grade' GPUs straight from the factory, due to price/performance market pressures and the non-critical nature of a rare white dot, when binning parts and setting clocks for mid-range GPUs, where the competition is steepest.

Okay, thanks, Jason. Not being a gamer, I don't really have any experience with overclocking and artifacts ("white dot" or otherwise). I just run my cards at whatever clock rate they come with, so unless they're "factory" overclocked, they run at their design frequency. For that 550Ti, the base rate for the core clock seems to be 970MHz, and Precision X doesn't seem to want to let me drop it any further, so that's why I went with the voltage adjustment, though as I mentioned, I didn't want to push it to the maximum right away.

Thanks again for all your advice and insight.
16) Message boards : Number crunching : "Zombie" AP tasks - still alive in AP v7 (Message 1589056)
Posted 2 days ago by Profile Jeff Buck
There is an issue with science apps not stopping when they should. However, BOINC & most definitely Windows Explorer should not crash when tasks start.
If I were having crashes that coincided with tasks starting, I would look for some bit of hardware that is having issues. If this last happened 6 months ago, when it was changing from cold to warm, and now it is occurring when it is going from warm to cold, I am thinking something thermal is expanding/contracting.

I think the key words there are "should not", rather than "could not". On my daily driver running Vista, Windows Explorer is actually the most crash-prone application I have. Often, simply right-clicking on a folder in the tree will cause a lockup and crash. Other times, I have no idea what triggers it. In any event, in this case, it appears to be BOINC that crashes first, taking its parent process, explorer.exe, down with it. It would be helpful if the Windows logs provided more info, but they don't (or perhaps I just don't know where to find it).

As to weather, if only it were that easy to correlate the occurrences with something external. However, it's happened once each in December, February, March, and now October on my #1 cruncher. It also happened once on my #2 cruncher in March. And here on the central coast of California, ambient temperatures are more likely to fluctuate dramatically from day to day or from hour to hour, depending on the influence of the marine layer, than from season to season. I certainly can't rule anything out, but given that the one consistent element is the start of an AP GPU task, I have to think that that's one piece of the puzzle. (Although I run both MB and AP tasks, I still put hundreds of AP tasks through those two boxes over time, and with the crashes being fairly rare, I'm sure there must be another element required, but at this point I haven't any other clues.)
17) Message boards : Number crunching : "Zombie" AP tasks - still alive in AP v7 (Message 1589029)
Posted 2 days ago by Profile Jeff Buck
The heartbeat mechanism isn't about passing messages, but simply confirming that everything is working normally. There are actually 'old' and 'new' versions of this: the term 'heartbeat' strictly applies only to the old one, in use up to about BOINC v7.0.36: after that, the applications are supposed to monitor the PID (process identifier) of the client. But either should do, and heartbeats should still be used unless both the client and the application understand the new PID mechanism.

But whatever the inter-process communication in use, it's clear that the application - Astropulse, in this case - should take responsibility for shutting itself down if the BOINC client 'goes away' - crashes, or is deliberately killed in the course of performing this sort of test. We have been through a number of changes in this area with Astropulse over the last couple of months, but maybe there's still a need to double-check that the basic procedure works in all cases.

Thanks for that explanation, Richard. When I was originally running my tests back in March, it appeared that the tasks that did successfully shut down may have been using the "old" heartbeat mechanism, as a message stating "No heartbeat from core client for 30 sec - exiting" would appear in the Stderr.

Just a quick follow-up on the "old" vs. "new" heartbeat mechanism. Back in March, I was still running BOINC 7.2.33 when the "No heartbeat..." message was produced. Currently, on the machine where BOINC crashed yesterday, I'm running BOINC 7.2.42 but got the same message for the MB GPU tasks that were running at the time of the crash, since the x41zc app hasn't changed. However, I notice that the stock MB CPU tasks, such as this one, included "02:51:48 (4180): BOINC client no longer exists - exiting" in the Stderr. That timestamp appears to be about 5 seconds after the time that I think BOINC crashed.
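Just to make the "new" mechanism concrete, my understanding is that it amounts to something like this (a rough sketch only; the function name and polling interval are my own invention, not the actual BOINC API, and it leans on the third-party psutil module):

import sys
import time
import psutil  # third-party: pip install psutil

def watch_client(client_pid, poll_seconds=5):
    # Instead of waiting for heartbeat messages, the science app just checks
    # whether the client's process ID still exists and bails out if it
    # doesn't (cf. "BOINC client no longer exists - exiting" in the Stderr).
    while True:
        if not psutil.pid_exists(client_pid):
            print("BOINC client no longer exists - exiting")
            sys.exit(0)
        time.sleep(poll_seconds)

# A real app would run something like watch_client(client_pid) in a
# background thread alongside the crunching, with the PID obtained however
# the client provides it.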
18) Message boards : Number crunching : "Zombie" AP tasks - still alive in AP v7 (Message 1589012)
Posted 2 days ago by Profile Jeff Buck
Jeff, can you supply any more info about the "BOINC crashed on my T7400 at 2:51 AM local time" indications? I have no idea how Win 8.1 might have handled that, but on older Windows I consider it quite possible that a messagebox saying "This application is no longer responding and will be shut down." may very well not free the application PID until after the user reacts. If so, the BOINC developers may need to rethink that method of detecting when the BOINC client has died.

I'll try taking a look at the Windows logs on that machine, probably a bit later today.

Okay, finally got around to looking at those Windows logs and, as I suspected, the only relevant entry I could find is this one:

Log Name:      Application
Source:        Microsoft-Windows-Winlogon
Date:          10/18/2014 2:51:45 AM
Event ID:      1002
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      T7400
Description:   The shell stopped unexpectedly and explorer.exe was restarted.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Winlogon" Guid="{DBE9B383-7CF3-4331-91CC-A3CB16A3B538}" EventSourceName="Winlogon" />
    <EventID Qualifiers="16384">1002</EventID>
    <Version>0</Version>
    <Level>4</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2014-10-18T09:51:45.000000000Z" />
    <EventRecordID>8943</EventRecordID>
    <Correlation />
    <Execution ProcessID="0" ThreadID="0" />
    <Channel>Application</Channel>
    <Computer>T7400</Computer>
    <Security />
  </System>
  <EventData>
    <Data>explorer.exe</Data>
  </EventData>
</Event>

It doesn't indicate why Explorer stopped "unexpectedly", just that it was restarted. That event occurs approximately 2 seconds after the last entries in the BOINC log prior to the apparent BOINC crash:
18-Oct-2014 02:51:43 [SETI@home] Computation for task 24se14aa.26295.17654.438086664195.12.91_0 finished
18-Oct-2014 02:51:43 [SETI@home] [cpu_sched] Preempting 04se14aa.7333.4157.438086664199.12.169_0 (removed from memory)
18-Oct-2014 02:51:43 [SETI@home] Starting task ap_21no10ab_B0_P0_00325_20141016_19978.wu_2
18-Oct-2014 02:51:43 [SETI@home] [cpu_sched] Starting task ap_21no10ab_B0_P0_00325_20141016_19978.wu_2 using astropulse_v7 version 705 (opencl_nvidia_100) in slot 4
18-Oct-2014 08:14:06 [---] Starting BOINC client version 7.2.42 for windows_x86_64

As you can see, the last entries shown are for the start of an AP task. This has been the case for every one of these BOINC crashes that I've encountered.
19) Message boards : Number crunching : "Zombie" AP tasks - still alive in AP v7 (Message 1588903)
Posted 3 days ago by Profile Jeff Buck
The heartbeat mechanism isn't about passing messages, but simply confirming that everything is working normally. There are actually 'old' and 'new' versions of this: the term 'heartbeat' strictly applies only to the old one, in use up to about BOINC v7.0.36: after that, the applications are supposed to monitor the PID (process identifier) of the client. But either should do, and heartbeats should still be used unless both the client and the application understand the new PID mechanism.

But whatever the inter-process communication in use, it's clear that the application - Astropulse, in this case - should take responsibility for shutting itself down if the BOINC client 'goes away' - crashes, or is deliberately killed in the course of performing this sort of test. We have been through a number of changes in this area with Astropulse over the last couple of months, but maybe there's still a need to double-check that the basic procedure works in all cases.

Thanks for that explanation, Richard. When I was originally running my tests back in March, it appeared that the tasks that did successfully shut down may have been using the "old" heartbeat mechanism, as a message stating "No heartbeat from core client for 30 sec - exiting" would appear in the Stderr.

Also, in my last post in my original "Zombie" thread, I summarized my tests on various systems, as follows:
My original tests were on NVIDIA GPUs running stock apps under Win 8.1. Here are the results from testing on three of my other machines:

6980751 (Win XP, NVIDIA GPUs, Lunatics apps): AP GPU task continues to run w/o BOINC, MB GPU and MB CPU tasks shut down properly
6979886 (Win Vista, NVIDIA GPU, Lunatics apps): AP GPU task continues to run w/o BOINC, MB GPU and MB CPU tasks shut down properly
6912878 (Win 7, ATI GPU, stock apps): AP GPU and MB GPU task continue to run w/o BOINC, MB CPU tasks shut down properly

So it appears that AP GPU tasks fail to shut down under every configuration that I can test and stock MB GPU tasks fail to shut down when running on an ATI GPU (at least under Win 7).

As you can see, I also ran into the problem with stock MB GPU tasks running on an ATI GPU under Win 7. I don't know if that app has changed since then, but inasmuch as I stopped running that box several weeks ago, I can't retest the scenario.
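For anyone who wants to repeat that test, the procedure boils down to something like this sketch (the process name fragments are just examples from my own hosts and will vary by app build, the lookups use the third-party psutil module, and obviously don't run it on work you care about):

import time
import psutil  # third-party: pip install psutil

# Example name fragments for the science apps; adjust for your installed builds.
SCIENCE_APPS = ("astropulse", "setiathome", "x41z")

def running_science_apps():
    return [p for p in psutil.process_iter(["name"])
            if any(s in (p.info["name"] or "").lower() for s in SCIENCE_APPS)]

# Kill the client (but not the manager or tray) to simulate a crash...
for p in psutil.process_iter(["name"]):
    name = (p.info["name"] or "").lower()
    if "boinc" in name and "mgr" not in name and "tray" not in name:
        p.kill()

# ...then give the apps a couple of minutes to notice the client is gone
# and see which ones are still running (the "zombies").
time.sleep(180)
for p in running_science_apps():
    print("Still running without the client:", p.pid, p.info["name"])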
20) Message boards : Number crunching : "Zombie" AP tasks - still alive in AP v7 (Message 1588885)
Posted 3 days ago by Profile Jeff Buck
Jeff, can you supply any more info about the "BOINC crashed on my T7400 at 2:51 AM local time" indications? I have no idea how Win 8.1 might have handled that, but on older Windows I consider it quite possible that a messagebox saying "This application is no longer responding and will be shut down." may very well not free the application PID until after the user reacts. If so, the BOINC developers may need to rethink that method of detecting when the BOINC client has died.

I'll try taking a look at the Windows logs on that machine, probably a bit later today. As I recall, when I dug into them once before for the same sort of BOINC crash, there was little to find other than a few messages indicating Windows Explorer had also had a failure of some sort and had been restarted at about the same time. It wasn't really possible to tell which came first, the BOINC crash or the Explorer problem, but they seemed to be related. In any event, while Explorer restarts cleanly, BOINC doesn't. When it happens in the middle of the night, I don't discover it until the next morning when I realize that the room where that machine is located is much quieter and much cooler than it should be! :^) No message box remaining on the screen, if there even was one to begin with.

