Posts by Jeff Buck


1) Message boards : Number crunching : AP V7 (Message 1591767)
Posted 18 hours ago by Profile Jeff Buck
Scenario one, running 3 MB:
0.33 + 0.33 + 0.33 = 0.99. Now one MB quits -> only 0.33 + 0.33 = 0.66 running -> can start a new AP. Now running 0.33 + 0.33 + 0.34 = 1.00.

Scenario two, running one AP and two MB:
0.34 + 0.33 + 0.33 = 1.00. Now one MB finishes. Running 0.34 + 0.33 = 0.67. Can start a new MB, or if an AP is scheduled to run next, must wait for more of the GPU to free up.

Scenario three, running 2 AP = 0.68. Must wait. After the other AP finishes, it can start an AP or an MB.

Scenarios one and three behave as expected. It's the "Scenario two" where BOINC is inconsistent. Most of the time, it will start a new MB whether or not an AP is the next task at the top of the queue (which is what I assume you mean by "scheduled"). That AP (and sometimes several APs) will remain in a "Ready to start" status until another AP finishes and frees up that .34 GPU. MBs that are lower in the queue will be started ahead of the APs if only .33 is available. Normally, the only time 2 APs run on a single GPU is when the 2 MBs running with the first AP happen to finish simultaneously (or nearly so).

But every once in a while, BOINC lets an MB finish without starting another task of any kind. On rare occasions, it will actually do that on two different GPUs. Then, when the final MB finishes on one of the GPUs, it will go ahead and start a second AP on that GPU while simultaneously starting the next available MB on the second GPU. I've never been able to identify a pattern in any of this. :^)
2) Message boards : Number crunching : AP V7 (Message 1591761)
Posted 19 hours ago by Profile Jeff Buck
When AP runs, MB doesn't. So I would expect that 2 APs could run at the same time when MB isn't doing anything.

With your app_config settings, that seems odd. Are you sure BOINC is actually reading the app_config.xml file? When you start BOINC, or if you select "Read config files" on the Advanced menu, does your Event Log show a "Found app_config.xml" entry?


I got BOINC to finally re-read the app_config.xml. It is now crunching 2 AP WU's at a time. I will monitor for any problems.

Thanks everyone. :-)

You're welcome. Glad you got it sorted out.
3) Message boards : Number crunching : AP V7 (Message 1591739)
Posted 19 hours ago by Profile Jeff Buck
When AP runs, MB doesn't. So I would expect that 2 APs could run at the same time when MB isn't doing anything.

With your app_config settings, that seems odd. Are you sure BOINC is actually reading the app_config.xml file? When you start BOINC, or if you select "Read config files" on the Advanced menu, does your Event Log show a "Found app_config.xml" entry?
4) Message boards : Number crunching : AP V7 (Message 1591729)
Posted 20 hours ago by Profile Jeff Buck
You didn't mention whether, when a single AP task is running on your gpu, an MB task is also running. Based on your app_config.xml, I'd expect that to be the case.

I've found that BOINC is sometimes unpredictable when a host is running mixed AP and MB, and the <gpu_usage> values are different for the two types. On my T7400, which currently has a GTX 780, a GTX 670, and a GTX 660, I've tried to set up the app_config.xml to run either 3 MB tasks on each GPU, or 1 AP and 2 MB tasks, which I've found makes about the most efficient use of those GPUs.

Theoretically, <gpu_usage> of .34 for the AP and .33 for the MB should do that, and it does.....most of the time. When an AP comes to the top of the queue, it will start up when an MB finishes, if there isn't already an AP running on that GPU. Once all 3 GPUs have a single AP running, BOINC will usually bypass APs and replace an MB that finishes with the next MB in line. However, sometimes it doesn't do that. In those cases, when an MB finishes on a GPU that already has an AP running, it won't start the next MB but will instead wait until the last MB finishes and then launch a second AP on that GPU, which makes that GPU noticeably underutilized.
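
For reference, a minimal app_config.xml along those lines might look something like this (the app names are the ones used in this thread; the <cpu_usage> values are just placeholders, so adjust to taste):

<app_config>
   <app>
      <name>astropulse_v7</name>
      <gpu_versions>
         <gpu_usage>0.34</gpu_usage>
         <cpu_usage>0.5</cpu_usage>
      </gpu_versions>
   </app>
   <app>
      <name>setiathome_v7</name>
      <gpu_versions>
         <gpu_usage>0.33</gpu_usage>
         <cpu_usage>0.2</cpu_usage>
      </gpu_versions>
   </app>
</app_config>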

So, sometimes I just set both AP and MB <gpu_usage> values to .33 and take my chances that I won't wind up with 3 APs running at the same time on one GPU. That's usually not a problem when APs are scattered in the queue, as is usually the case, but every once in a while the scheduler sends a whole block of APs at once, or on consecutive work fetches.

In any event, with your settings, you might not be able to get 2 APs running at the same time unless the 2nd and 3rd MBs finish at exactly the same time, or BOINC decides to wait for all MBs to finish before launching the 2nd AP.

EDIT: Actually, the more I think about it, the more it seems the difficulty might be in launching the first AP, since when a single MB finishes, it only frees up .33 of a GPU, while the AP needs .5 GPU to start. BOINC would have to wait for a second MB to finish before it could launch the AP.
5) Message boards : Number crunching : AP V7 (Message 1591694)
Posted 22 hours ago by Profile Jeff Buck
Seems to me that the <max_concurrent> lines might be your problem. If I recall, that limits the total number of tasks that the application (i.e., astropulse_v7, astropulse_v6, setiathome_v7) can run on your machine, not the total number for a given GPU. You should probably remove those lines.


Jeff,

I got the <max_concurrent> lines from Joe Segur when I asked about creating an app_config.xml for Beta. The changes I made to the app_config.xml for MB for Main and Beta work flawlessly. They should also work, then, for AP - shouldn't they???


TL

Yeah, I guess if you only have the one GPU and you're not running any APs on your CPU, it shouldn't make any difference.
6) Message boards : Number crunching : Panic Mode On (91) Server Problems? (Message 1591640)
Posted 1 day ago by Profile Jeff Buck
It's late Friday afternoon in Berkeley, almost 5 PM...anybody seen this before? ;^)
7) Message boards : Number crunching : Lunatics Windows Installer v0.43 Release Notes (Message 1591551)
Posted 1 day ago by Profile Jeff Buck
The oclFFT_plan will more than compensate for it.
It speeds things up by at least 10% if set correctly.
Try this for your multi-GPU host.

-use_sleep -unroll 10 -oclFFT_plan 256 16 256 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 64 4 1 -tune 2 64 4 1

Okay, thanks. -oclFFT_plan it will be on both. ;^)

Now I just need to start getting AP tasks again.
8) Message boards : Number crunching : Lunatics Windows Installer v0.43 Release Notes (Message 1591476)
Posted 1 day ago by Profile Jeff Buck
-oclFFT_plan is case sensitive.

Uh, oh. I actually just cut and pasted the recommendations exactly the way you had provided them over in The GTX750(Ti) Thread for my two boxes, one being
-use_sleep -unroll 12 -oclfft_plan 256 16 256 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 64 4 1 -tune 2 64 4 1

and the other being
-use_sleep -unroll 10 -oclfft_plan 256 16 256 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 64 4 1 -tune 2 64 4 1


Could that be why I've been seeing about a 15%-40% increase in my AP run times on those boxes? (Huge decrease in CPU times, though.)


The -use_sleep command will cause the time to increase, but will allow you to use the CPU that is normally dedicated to the app for something else.

Well, since the GPUs are the real workhorses, I'm afraid the modest gains in CPU availability won't offset the large increase in run times on the GPUs.
9) Message boards : Number crunching : Lunatics Windows Installer v0.43 Release Notes (Message 1591441)
Posted 1 day ago by Profile Jeff Buck
-oclFFT_plan is case sensitive.

Uh, oh. I actually just cut and pasted the recommendations exactly the way you had provided them over in The GTX750(Ti) Thread for my two boxes, one being
-use_sleep -unroll 12 -oclfft_plan 256 16 256 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 64 4 1 -tune 2 64 4 1

and the other being
-use_sleep -unroll 10 -oclfft_plan 256 16 256 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 64 4 1 -tune 2 64 4 1


Could that be why I've been seeing about a 15%-40% increase in my AP run times on those boxes? (Huge decrease in CPU times, though.)
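
For what it's worth, the fix would presumably be just capitalizing the switch, e.g. for the second box:

-use_sleep -unroll 10 -oclFFT_plan 256 16 256 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 64 4 1 -tune 2 64 4 1

with everything else left unchanged.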
10) Message boards : Number crunching : AP V7 (Message 1591428)
Posted 1 day ago by Profile Jeff Buck
Seems to me that the <max_concurrent> lines might be your problem. If I recall, that limits the total number of tasks that the application (i.e., astropulse_v7, astropulse_v6, setiathome_v7) can run on your machine, not the total number for a given GPU. You should probably remove those lines.
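
To illustrate, a <max_concurrent> entry like this one inside app_config.xml caps astropulse_v7 at 2 running tasks across the whole host, no matter how many GPUs are installed (the other values shown here are just placeholders):

   <app>
      <name>astropulse_v7</name>
      <max_concurrent>2</max_concurrent>
      <gpu_versions>
         <gpu_usage>0.5</gpu_usage>
         <cpu_usage>0.5</cpu_usage>
      </gpu_versions>
   </app>

Removing the <max_concurrent> line leaves the per-GPU packing entirely to <gpu_usage>.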
11) Message boards : Number crunching : Perhaps my 7th wingman will be the charm! (or maybe the 8th) (Message 1590829)
Posted 2 days ago by Profile Jeff Buck
I did send a PM.
Joe

If he takes notice of your PM and updates his host, at least something positive will come out of this little episode. I think at least a few of the other wingmen on this WU could use a similar nudge. From time to time, I've tried sending PMs to other users when I saw that a wingman's host appeared to have only recently gone off the rails, but only one of them ever responded, and I think even that took about a month.

It's a shame that the project doesn't have some functionality in one of the servers that would automatically generate an email to a user when a host crosses some defined threshold of Invalids and Errors, perhaps when those results exceed 50% of Valid results. The email wouldn't have to diagnose the problem, just point out to the user that a problem appears to exist and direct them to the Message Boards if they need assistance. I can't help but feel that such a process could enable a whole lot of hosts to regain lost productivity, which would surely be a good thing for the project.

The current "system", which relies on individual users to occasionally PM other users, with what I suspect are widely varying degrees of diplomacy, doesn't seem like it accomplishes much. Then, too, quite a few of the wayward rigs belong to Anonymous users who can't be PM'd in the first place. Only the project administrators, or an automated system they implement, can reach the Anonymous ones.

Automatically generating some emails, perhaps once every couple of weeks or once a month, can't be that big a deal, can it?
12) Message boards : Number crunching : blanked AP tasks, Part II (Message 1590810)
Posted 2 days ago by Profile Jeff Buck
Yes, fairly rare. Out of 36 100% blanked AP v7 tasks that have validated on my hosts, only 5 have inflated credits, between 246.57 and 534.44. I think these will slowly disappear as more hosts reach their 11 validations threshold.
13) Message boards : Number crunching : Phantom Triplets (Message 1590418)
Posted 3 days ago by Profile Jeff Buck
Well, I got another one already, task 3791593451, which found 24 triplets where the wingmen found none. So I guess my incremental voltage increase didn't have any effect. I've gone ahead now and bumped it up the full 0.05v per Jason's recommendation, to 1.150v, which appears to be the maximum for the card. At this point, I guess I don't mind if the card fries. That'll just be my excuse to buy another GTX 750Ti (which I figure would probably pay for itself in about 8 months in reduced electric costs). :^)
14) Message boards : Number crunching : Perhaps my 7th wingman will be the charm! (or maybe the 8th) (Message 1590386)
Posted 3 days ago by Profile Jeff Buck
LOL

Well, it was certainly entertaining while it lasted! I see one last bit of mystery in that last host's Stderr, which looks to be truncated, after multiple restarts, with no pulse counts included. A fitting finish.
15) Message boards : Number crunching : Perhaps my 7th wingman will be the charm! (or maybe the 8th) (Message 1590190)
Posted 3 days ago by Profile Jeff Buck
And now I have my 8th wingman, after the 7th one timed out. I thought 7 would be the limit, but I guess we'll soon find out if it stops at 8, since number 8's task summary doesn't indicate a particularly successful host.

State: All (70) · In progress (8) · Validation pending (0) · Validation inconclusive (0) · Valid (1) · Invalid (0) · Error (61)


This is downright comical!

Edit: Although now that I take a second look at his task list, the one and only Valid task he has is an AP v6. Maybe there's still hope!
16) Message boards : Number crunching : Panic Mode On (91) Server Problems? (Message 1590025)
Posted 3 days ago by Profile Jeff Buck
The sense I'm getting is that if there's only a small number of tasks to report, they go through fine, but if there's a large quantity, they appear to fail. However, I'd almost bet that some of those are actually getting reported even though it doesn't look like any of them are. The scheduler appears to take an all or nothing approach when it comes to clearing the queue. My #1 cruncher finally managed to report 184 tasks about an hour and a half ago, but now the backlog is building up again. Still haven't received any new work, though.
17) Message boards : Number crunching : "Zombie" AP tasks - still alive in AP v7 (Message 1589992)
Posted 3 days ago by Profile Jeff Buck
I don't wish to rain on your parade but a couple of those tasks have validated with your wingperson and the others are just waiting for your wingpersons to return their results. ;-)

Cheers.

Dry as a bone here in California. ;^)

If you're referring to the tasks from the last BOINC crash, the reason they're fine is that I deleted all the "finish" files from the slot directories before restarting BOINC. If you look down through the Stderr for each of them, you'll see two calls to boinc_finish, such as these in task 3793460876:

05:29:08 (1344): called boinc_finish(0)
11:40:54 (1236): called boinc_finish(0)

The first was generated when the zombie task originally completed (after continuing to run for about 45 minutes following BOINC's crash at 04:47). The second is for the completion after I discovered the crash, deleted the original finish file and restarted BOINC. The task goes back to the last checkpoint, then generally finishes again in a few minutes.
18) Message boards : Number crunching : Panic Mode On (91) Server Problems? (Message 1589924)
Posted 3 days ago by Profile Jeff Buck
I can report completed tasks, though sometimes it takes several tries, but I haven't been able to D/L any new ones. One of my machines is out of GPU work. If it doesn't get any before I go to bed, I'll probably shut it down.
19) Message boards : Number crunching : not getting jobs from SETI (Message 1589904)
Posted 3 days ago by Profile Jeff Buck
I do know there's a problem with older GPUs running Astropulse tasks with the 340.52 driver, as discussed in @Pre-FERMI nVidia GPU users: Important warning, so those might be blocked from your machine. However, I'm not aware of any issue with that GPU/driver combination affecting the Multibeam tasks (cuda, in your case).
20) Message boards : Number crunching : "Zombie" AP tasks - still alive in AP v7 (Message 1589880)
Posted 3 days ago by Profile Jeff Buck
Well, I had another BOINC crash at 4:47 this morning, just 3 days after the last one, but on a different machine, my xw9400. Same apparent trigger, but with a single AP already running on each of the other 3 GPUs, I ended up with 4 AP zombies, tasks 3793460876, 3793445511, 3793445524 and 3793460874. Based on the event log, that last one appears to have been the trigger. I also found an MB task, 3793085246, with a "boinc_finish_called" file in its slot directory. I suspect that it was just unlucky enough to get caught in its termination phase when the crash happened. In any event, deleting the "finish" file before restarting BOINC also allowed it to restart and then finish again normally.

I got to thinking about the apparent gap of 6+ months between these BOINC crashes and the subsequent 2 crashes in 3 days on different machines. What was different during those 6+ months? One promising theory that I had was that the period roughly coincided with the span where we were processing mostly older 2008 and 2009 data. Then we recently jumped ahead to more recent tapes, 2010-2014.

That theory almost works, but I found one flaw in it. In reviewing the BOINC crash occurrences, I found one that I had forgotten about, on June 30. That one turned out to be a 2009 file. So far, it's the only fly in the ointment, though.

Anyway, as long as I've dug this additional info out, I'll go ahead and post it in the hopes that it may yet prove useful when more clues surface. Here's a list of all my BOINC crashes that generated AP zombie tasks. The list shows the date and host ID, followed by the dataset name of the AP task which appears to have triggered the crash. (I didn't capture the stdoutdae file for the December 30, 2013, crash, so I don't know for sure which "zombie" was the last one to start.)

20131230_7057115: ap_16oc13ac_B3_P0_00113_20131229_06439.wu_2 (don't know which of these 2 tasks was trigger)
20131230_7057115: ap_16oc13ad_B6_P1_00200_20131229_05567.wu_1 (don't know which of these 2 tasks was trigger)
20140104_7057115: ap_17oc13ac_B1_P0_00131_20140103_01567.wu_1
20140209_7057115: ap_10ap13aa_B5_P1_00191_20140208_30452.wu_1
20140310_6980751: ap_28my13ad_B1_P0_00265_20140309_08199.wu_2 (morning crash)
20140310_6980751: ap_28my13ad_B3_P1_00211_20140310_25141.wu_1 (evening crash)
20140630_6980751: ap_13mr09ab_B3_P0_00183_20140628_15930.wu_0
20141018_7057115: ap_21no10ab_B0_P0_00325_20141016_19978.wu_2
20141021_6980751: ap_06se14aa_B6_P1_00113_20141020_01141.wu_1

