Posts by Bob Mahoney Design

1) Message boards : Number crunching : I QUIT! (Message 915689)
Posted 8 Jul 2009 by Bob Mahoney Design
Post:
Quote of Bill Walker:
Are you sure you're Canadian? you(sic) over simplify(sic) complicated arguements(sic) like a German, or an American.

If I leave my old clunker running 9 to 5 because I'm expecting business related e-mails, and then give the spare CPU cycles to BOINC, there is no net increase in wasted energy, green house gasses(sic), etc. (Bob's comment: In reality, energy use doubles.)

If I build a multi CPU multi GPU speed machine to occaisonally(sic) read those same e-mails, and let it run 24/7 to get RAC to prop up my fragile ego(sick), THAT is Environmentally Irresponsible.

OK, time for a mid-thread review.

The message quoted here is the most, um, 'interesting' of the bunch, and it does help me to draw a serious conclusion......

Seriously, that must be some strong beer and Schnapps you've got in The Great White North!

Bob
2) Message boards : Number crunching : I'm quitting Seti 'cause of increased power cost coming from "Cap and Trade" (Message 912376)
Posted 28 Jun 2009 by Bob Mahoney Design
Post:
... Not to bash on those who choose to go the extra mile, but the project never expected anyone to build entire farms and shoulder more responsibility (power consumption) than anyone else.

The entire point of Distributed Computing is that it is done by spare CPU cycles around the world, meaning when the computer is powered on doing other things but not using all of its CPU cycles. Turning off your computer when you are done, or when it's hot out, or when you go on vacation - it's all the same.

This is a down-side to the natural competitiveness that many people have; their yearning to reach the top and collect points and be number 1. In this game, it costs real money to do that. Congrats to those who can. The rest of us have a budget and other problems, especially in this economically troubling time.

Well put, Ozzfan!

I can vouch for your last paragraph. Quite often, our competitive nature turns a hobby into something that feels like a job... A difficult job that takes over too much of our life.

Bob
3) Message boards : Number crunching : CLOSED** SETI/BOINC Milestones (tm) XVII **CLOSED (Message 908926)
Posted 18 Jun 2009 by Bob Mahoney Design
Post:

Best Wishes on your next project!

@ all

Congratulations to all on reaching your own personal Milestones!

Byron

Hey Byron! Yeah, SETI@home is the ultimate test for any computer. The next one is going to get tortured just like the others. It will be fun.

Hey everybody with milestones - You found SETI@home, you found the forums, you found the Milestones thread, you experienced a personal milestone, and you posted it. Isn't that a milestone in itself? It took me 4 years before I stumbled over to the forums. Doh.

Bob
4) Message boards : Number crunching : CLOSED** SETI/BOINC Milestones (tm) XVII **CLOSED (Message 908790)
Posted 18 Jun 2009 by Bob Mahoney Design
Post:
I retired the Top Host that served so well for so long. It was a wonderful machine.

Check out my profile here for a picture of the computer, plus some technical data.

Bob


Congrats! That was an amazing host, and your next project seems very ambitious indeed!


Off topic, but is that the same ExecPC that started out as a BBS back in the 80s and 90s? ExecPC was one of my top favorite BBSes to dial into, and it always made me want to start my own BBS because it seemed so fun.

Yep, that ExecPC BBS.

Still available via Telnet at bbs.execpc.com

My good friend Curt Shambeau is still running it in his basement with an original Novell file server and 80386 hardware. Yes, 80386. He's going for the duration record. I birthed it in 1983, I left in 1998. We spanned the era from 110baud to broadband, Arpanet to Internet. Wow, what a trip.

Past employees are now re-forming the collective on Facebook.

Anecdotes? Oh my. Visits from the FBI (sometimes with concealed weapons) to ask me if I am illegally downloading HBO movies from satellites with "those dozens of 2400baud modems"...Death threats from users gone crazy...Employees and/or users coming together in marriage...The employee party where they drank no beer, just Mountain Dew, and slam-danced themselves dizzy (and someone made our company logo out of Spam)...The international file compression competition...The Million Caller Milestone competitions...We used up all of the phone lines in an entire industrial park before we went to fiber...And the intensity of working with 100 brilliant employees plus thousands of blazingly intelligent users (peaked at 85,000 active customers during my tour of duty), all for the same cause - to advance mankind's progress through new and better ways to share knowledge and information.

It was fun. The good people who run SETI@home and the loyal users of SETI@home are, for me, a wonderful deja vu of the experience of creating and running ExecPC. That is one of the biggest compliments I can give.

Thanks, OzzFan.

Bob
5) Message boards : Number crunching : CLOSED** SETI/BOINC Milestones (tm) XVII **CLOSED (Message 908014)
Posted 16 Jun 2009 by Bob Mahoney Design
Post:
I retired the Top Host that served so well for so long. It was a wonderful machine.

Check out my profile here for a picture of the computer, plus some technical data.

Bob
6) Message boards : Number crunching : 6.6.36 Released - FYI (Message 907896)
Posted 15 Jun 2009 by Bob Mahoney Design
Post:
The scheduler in 6.6.36 seems to be totally broken. Starting a machine from scratch last night I've received probably 300 6.03s. Meanwhile my GPUs have been idle for 16 hours. S@H only on the machine in question.

Funny, the night before that I received about 300 608s and only two 603s. Then got some 603s, one took a looong time to run, it adjusted the Duration Correction Factor (DCF), that put the system into 'hurry up' mode (EDF), that created a string of "Waiting to run", that over-ran the GPU memory, that locked up the computer, that influenced me to detach it from the project. Now the computer is sitting in the penalty box. :)

The host in question will be allowed to play again, but only after it and BOINC decide to get along. Waiting patiently for BOINC 6.10.xx, which should have multiple DCF, one for each class of task on that host.
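
To illustrate why one DCF per task class should help, here is a minimal Python sketch. The update rule (jump up at once when a task overruns, drift down slowly otherwise) is a simplified assumption and not BOINC's actual code, and the class names and run times are made-up numbers.

```python
from collections import defaultdict

def update_dcf(dcf, estimated_s, actual_s):
    """Assumed rule: rise at once on an overrun, decay 10% of the way otherwise."""
    ratio = actual_s / estimated_s
    if ratio > dcf:
        return ratio
    return dcf + 0.1 * (ratio - dcf)

def final_dcfs(finished, per_class):
    """finished: iterable of (task_class, raw_estimate_s, actual_s)."""
    dcfs = defaultdict(lambda: 1.0)
    for task_class, estimated_s, actual_s in finished:
        # One factor for the whole host, or one factor per class of task.
        key = task_class if per_class else "host"
        dcfs[key] = update_dcf(dcfs[key], estimated_s, actual_s)
    return dict(dcfs)

# Hypothetical history: CUDA tasks run as estimated, one CPU task overruns 3x.
history = [("cuda", 480, 480), ("cpu", 3600, 10800), ("cuda", 480, 480)]

shared = final_dcfs(history, per_class=False)
split = final_dcfs(history, per_class=True)
```

With the single shared factor, the CPU overrun inflates every CUDA estimate as well. With per-class factors, the CUDA factor in this sketch never moves.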

Just a note of support: IMHO, SETI@home and BOINC are the most ambitious projects of their type in all the world. All research projects have periods of instability and adaptation. We are in the middle of one of those times. It is the nature of the beast. Bleeding edge is painful but exciting.

Just an opinion from a SETI@home fan...

Bob
7) Message boards : Number crunching : BOINC v6.6.31 available (Message 906190)
Posted 11 Jun 2009 by Bob Mahoney Design
Post:
1) AP tasks influencing the anticipated running times of CUDA tasks is a design flaw in BOINC which we're going to have to live with until at least BOINC v6.10

The worst effects can be mitigated by careful fine-tuning of the FLOPs figures in app_info.xml: if you get it right (carefully balanced for the performance bias of your particular hardware), you can avoid EDF almost entirely.

2) Except that then, you wouldn't have discovered the mis-behaviour of CUDA under stress! From what I'm reading, there are problems:

a) When there are two or more CUDA devices in a single host (which rules me out for testing, sadly)

b) When a combination of cache size/DCF/deadlines brings 'shorties' forward in EDF - which to my mind should be flagged as 'High Priority', but doesn't seem to be.

I think someone - not me, I'm single GPU only - is going to have to do some logging with <coproc_debug> (see Client configuration), try and work out what's happening, and report the analysis and logs to boinc_alpha.

Thanks, Richard, that sums it up nicely.

The Lunatics Unified Installer was so much fun to play with, I was hoping to avoid doing the fpops/flops process on another computer. Now that I've done the fpops/flops thing, "To completion" times have settled down, so the odds of EDF mode will decrease.
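
The fpops/flops process can be sketched roughly as below. It assumes BOINC's raw estimate is the task's estimated floating-point operation count divided by the application's flops figure, then scaled by DCF. The function names and all the numbers are hypothetical, chosen only to show the arithmetic.

```python
def flops_for_app(fpops_est, measured_runtime_s):
    """Flops figure that makes fpops_est / flops equal the measured run time."""
    return fpops_est / measured_runtime_s

def estimated_runtime_s(fpops_est, flops, dcf=1.0):
    """Assumed estimate: operation count over speed, scaled by DCF."""
    return fpops_est / flops * dcf

# Hypothetical tasks: same nominal work estimate, very different real speed.
cpu_flops = flops_for_app(fpops_est=2.4e13, measured_runtime_s=6000)
gpu_flops = flops_for_app(fpops_est=2.4e13, measured_runtime_s=480)

cpu_estimate = estimated_runtime_s(2.4e13, cpu_flops)
gpu_estimate = estimated_runtime_s(2.4e13, gpu_flops)
```

If both estimates already match reality at a DCF of 1, a finished task on either side has little reason to push the shared DCF up, which is the balance the tuning is after.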

I have also decreased the CUDA cache to 2 days.

Funny, isn't it, that the answer for another BOINC-exception-case on SETI@home once again is "keep your cache under 2 or 3 days"?

Bob
8) Message boards : Number crunching : BOINC v6.6.31 available (Message 905949)
Posted 10 Jun 2009 by Bob Mahoney Design
Post:
Fred said: ... When one EDF CUDA WU finished another started and the partner EDF CUDA WU continued to completion. When a CPU MB WU completed and uploaded, there was no effect on the CUDA WU's in flight. And when I gave up and restored the original settings (including removing the config flags) and allowed the pre-empted CUDA WU's to complete, they did so without error (no -5!!).

The only difference on my host at the moment is that I am running down my cache..

The only difference I can point out with my situation is as follows:

Whenever the "Waiting to run" problem happened, I had a queue of waiting CPU tasks that was way beyond my "Additional work buffer" setting.

On one computer I had 5 days of CPU tasks waiting, "Additional work buffer" set to 1.6 days.

The other computer had 10+ days of CPU tasks waiting, "Additional work buffer" set to 4.5 days.

Perhaps BOINC (or whatever) only gets confused when the CPU tasks queue is full of extra days of work in comparison to the GPU queue?

Bob
9) Message boards : Number crunching : BOINC v6.6.31 available (Message 905800)
Posted 10 Jun 2009 by Bob Mahoney Design
Post:
Fred, is there an imbalance in cache size between your 608 vs. 603+AP queue on your system? I mean, do you have an intentional difference in cache size, achieved by filling up one, then going back to a shorter queue for running? On mine there is a full 10 days of AP for the CPU, and half that duration of MB for CUDA. I'm wondering if BOINC is confused by the cache size contrast?

I work on the principle that the minimum turn-round time specified for MB tasks is 7 days, so with a 3-day cache I am rarely going to run into EDF. I don't try to micro-manage the cache - I just tell it 3 days (with a 0.1 day connect interval) and leave BOINC to it.
You have an (approx.) 5-day cache for MB. If your cache is filled up when your DCF is at or near its minimum value and then the DCF is doubled by a task taking longer than expected (I have watched this happen), then any 7-day return task that was toward the end of your queue is in deadline trouble and you are into EDF. Running a 3 (or less) day cache gives much more headroom in this respect and I have not run out of work at any time recently, even with the major network problems we have experienced. I am pretty sure that halving your cache size would virtually eliminate the "waiting to run"s.

F.
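
The headroom argument above can be checked with a few lines of Python. This is a simplified sketch: it assumes the queue runs back to back, that the last task in the queue has the full 7 days left, and that a DCF change rescales the whole queue uniformly. The function names are hypothetical.

```python
def queue_days_after_dcf_change(cache_days, dcf_before, dcf_after):
    """Estimated days of queued work once the DCF changes."""
    return cache_days * dcf_after / dcf_before

def last_task_at_risk(cache_days, dcf_before, dcf_after, deadline_days):
    """True if the task at the end of the queue would now miss its deadline."""
    projected = queue_days_after_dcf_change(cache_days, dcf_before, dcf_after)
    return projected > deadline_days

# Figures from the post: 7-day deadline, DCF doubling, 5-day vs 3-day cache.
five_day = last_task_at_risk(5, 1.0, 2.0, 7)   # 10 days of work vs 7-day deadline
three_day = last_task_at_risk(3, 1.0, 2.0, 7)  # 6 days of work vs 7-day deadline
```

Under these assumptions the 5-day cache tips into deadline trouble when the DCF doubles, while the 3-day cache still fits inside the 7-day deadline.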

That is it! That is exactly what I just observed. I turned the host back on for 5 minutes and observed the following:

1. An AP task completed.
2. This influenced the CUDA tasks to nearly triple their "To completion" estimates from 8min to 22min.
3. EDF mode arrived, and CUDA task hijacking began.

Yesterday, after restoring 10+ days of AP, and running ONLY AP (no CUDA), my system put the 8 running AP tasks into "Running high priority", with no ill effect. Problem is only with EDF on CUDA, possibly only while running CUDA in conjunction with AP or 603 on CPU.

What is troubling is that completion of an AP task on the CPU so heavily influences the "time to completion" estimate for CUDA tasks. If that is how the current BOINC works, that does not seem appropriate.

Bob
10) Message boards : Number crunching : BOINC v6.6.31 available (Message 905783)
Posted 10 Jun 2009 by Bob Mahoney Design
Post:
If it's anything like mine (running v6.6.28, .31, or .33), then once the GPU is in EDF mode (and it doesn't show "high priority" in BM) then for every 2 tasks that start, one will be left in "waiting to run" when the next 2 start until it comes out of EDF mode. I have recently noticed that with v6.6.33, when the GPU is in EDF mode, the completion of a CPU task will put *both* GPU tasks into "waiting to run" and start a new CPU task and 2 new CUDA tasks. The only way to avoid all this seems to be to stay out of EDF mode on the CUDA, which I do manage to do most of the time with my 3-day cache.

F.

I think that is exactly what happened. It looked like the first "waiting to run" was proper: a single task with a later deadline was suspended. As you say, the next time (minutes later) a GPU task (typically a shorty) started, it caused two running GPU tasks to revert to "waiting to run". After that, suspension happened in multiples of two. Keep in mind we are running two GPUs.

Let us not forget that Questor is experiencing this with a single GPU. So he must be getting singles of "waiting to run" as it happens.

It does NOT happen when I run ONLY CPU or ONLY GPU. I can beat the heck out of such a setup and it never fails. As soon as I add a task to the other side of the computer, it eventually has this issue.

Fred, is there an imbalance in cache size between your 608 vs. 603+AP queue on your system? I mean, do you have an intentional difference in cache size, achieved by filling up one, then going back to a shorter queue for running? On mine there is a full 10 days of AP for the CPU, and half that duration of MB for CUDA. I'm wondering if BOINC is confused by the cache size contrast?

Note: Don't anyone get too upset about the big AP cache here - a 6.6.31 "waiting to run" crash took most of my hard drive with it last week. I finally got it running long enough to retrieve the 'lost' AP units yesterday.

Bob
11) Message boards : Number crunching : BOINC v6.6.31 available (Message 905763)
Posted 10 Jun 2009 by Bob Mahoney Design
Post:
Waiting to Run issue:

...and it just created 3 more in the past few minutes.

Bob

Hoo Hah! It just created another seven! This thing is a "Waiting to run" factory!

Bob

(Edit changed one to two. Oh yeah.)
(Edit again, changed two to seven. Wow.)
(Last edit: I'm shutting this computer off. Ready for debugging.)
12) Message boards : Number crunching : BOINC v6.6.31 available (Message 905761)
Posted 10 Jun 2009 by Bob Mahoney Design
Post:
Waiting to Run issue:

I have a system here that created 12 waiting to run last night (from which it crashed). I rebooted it, aborted the 12 hung tasks, and it just created 3 more in the past few minutes.

If it is not too tough to do, I'll volunteer it for debugging.

(Thanks to everyone who responded so far. No obvious pattern has emerged re. this problem.)

Bob
13) Message boards : Number crunching : Warning once again when upgrading Boinc to newer versions (Message 905751)
Posted 10 Jun 2009 by Bob Mahoney Design
Post:
I've had all of those problems with both 6.6.31 and 6.6.33. 12 tasks were "waiting to run", and this system crashed out last night.

The odd part is Vyper is NOT running any tasks on the CPU (is that correct?), and I thought this problem only appeared in the CPU+GPU situation.

There is no good pattern to the problem yet.

Bob
14) Message boards : Number crunching : Panic Mode On (16) Server problems (Message 905537)
Posted 9 Jun 2009 by Bob Mahoney Design
Post:
Here I wait
Spinning Atom CPU
In an idle state
Not a WU to do!

Entropy encroaching
Little processor that could
Time to do some coaxing
Kick the servers! Kick them good!



Hee hee.
15) Message boards : Number crunching : BOINC v6.6.31 available (Message 904018)
Posted 5 Jun 2009 by Bob Mahoney Design
Post:
Re. the "waiting to run" issue, where tasks say "waiting to run" and never get restarted...

It happens to Fred W when he is running CUDA tasks plus either AP or MB on his CPU. Same for me.

Is everyone who is experiencing this problem running tasks on their CPU as well as on CUDA? If so, this might be a very good clue about the bug.

Bob
16) Message boards : Number crunching : BOINC v6.6.31 available (Message 903682)
Posted 4 Jun 2009 by Bob Mahoney Design
Post:
Bob, you may be thinking of the way they have narrowed the range on the VLARs. I've got a few that would have been -6'd before but are now deemed to be acceptable to run. Raistmer was a bit on the cautious side when he first figured the range.

Good point. I'll watch for a near-VLAR to see if the duration factor goes out of whack and forces high priority.

Also, I'll try starting with a cache of 0.1 day to decrease the odds of a long-running WU skewing durations too far.

Bob
17) Message boards : Number crunching : BOINC v6.6.31 available (Message 903680)
Posted 4 Jun 2009 by Bob Mahoney Design
Post:
From Richard:
...
Starting with v6.6.23 (I think they missed a couple of release numbers), CUDA tasks should run in the order they're received from the server, unless there's a deadline problem. Then, they would switch to 'earliest deadline' mode, but should also show a flag for running in "High Priority". Are you seeing that? (You may need to extend the width of the 'status' column to be sure). Fred's screenshot doesn't show High Priority, so the shorty tasks may simply have been the next due to run in FIFO order: unfortunately, with a bespoke sort order set, we can't see the tasks in issue order.

[Tip:
If, like Fred, you have a bespoke sort order set, you can clear it from the registry by clearing these two registry keys:

[HKEY_CURRENT_USER\Software\Space Sciences Laboratory, U.C. Berkeley\BOINC Manager\Tasks]
"SortColumn"=dword:ffffffff
"SortAscending"=dword:00000001

It's the only way I've found.
/Tip]

With a 1.6 day cache, there should never be any need for High Priority running, and hence no EDF. But beware: if you have a full cache, and something goes wrong with your Duration Correction Factor or other time metrics, a 1.6 day cache can suddenly evaluate to seven or more days, and trigger EDF. I've had that happen with a near-VLAR which escaped my rebranding. Look and see if your current BOINC estimates for unstarted tasks seem to be realistic.

If that doesn't throw up any clues, we may be moving into the territory of extended debug logging - see Client configuration. Are any of you up for that? We're probably talking about <coproc_debug>, <cpu_sched> and <cpu_sched_debug>.
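
For anyone taking up the logging, the sketch below builds the text of a cc_config.xml that switches those three flags on. It assumes the usual layout of a `<log_flags>` section inside `<cc_config>`. The helper name is hypothetical, and the Client configuration page remains the authority on which flags your BOINC version accepts.

```python
import xml.etree.ElementTree as ET

def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag switched on."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each flag is an element whose text is 1 (on) or 0 (off).
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")

config_text = build_cc_config(["coproc_debug", "cpu_sched", "cpu_sched_debug"])
```

The resulting text would be saved as cc_config.xml in the BOINC data directory and picked up the next time the client reads its config file.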

I miss the old "Accessible view" option in BOINC - where tasks resorted to 'natural' order.

Points:

1. I run only SETI@home, no other projects at this time.
2. Apparent EDF mode (preempting) still has NONE of my tasks saying "High Priority"
3. Preempted tasks usually have a deadline EARLIER THAN the task that replaced it.
4. "Waiting to run" tasks never get restarted.
5. I don't think my duration correction factor ever got skewed enough (from near-VLAR runtime influence) to force EDF. But I will double check this.
6. I DID have AP running on the CPU on both computers. This might be a factor.
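
To spell out why point 3 looks wrong, here is a minimal Python sketch of the two orderings Richard describes. It is only an illustration of FIFO versus earliest-deadline-first selection, not BOINC's scheduler, and the task names and deadlines are invented.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    received: int   # arrival order from the server
    deadline: int   # days from now

def next_fifo(tasks):
    """Normal mode: run in the order received."""
    return min(tasks, key=lambda t: t.received)

def next_edf(tasks):
    """Earliest-deadline-first: run whatever is due soonest."""
    return min(tasks, key=lambda t: t.deadline)

# Hypothetical queue.
queue = [Task("wu_a", 1, 20), Task("wu_b", 2, 7), Task("wu_c", 3, 12)]

fifo_pick = next_fifo(queue).name
edf_pick = next_edf(queue).name
```

Under genuine EDF the task that takes over should never have a later deadline than the one it replaces, so a preempted task with the EARLIER deadline fits neither ordering.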

Before I volunteer the big system for sacrificial testing, I'll try running it as CUDA-only, with no AP. This will eliminate some obvious questions in order to purify the test environment. As soon as it sees the first "waiting to run", you can tell me how to torture the computer in any way you like. Then we will at least know it is not related to AP on the same system. I'm ready to retire it and take it to storage, but if it can help out with the problem, let's do it.

Bob
18) Message boards : Number crunching : BOINC v6.6.31 available (Message 903640)
Posted 4 Jun 2009 by Bob Mahoney Design
Post:

I wondered if you'd had so many VLAR kills it had cleared all your tasks out - making it look like you'd had lots of the preempt problems.

The other -5 I mentioned is the memory problem rather than the pre-empt issue. 1247338546
...
<core_client_version>6.6.31</core_client_version>
<![CDATA[
<message>
- exit code -5 (0xfffffffb)
</message>
<stderr_txt>
...
setiathome_CUDA: Found 2 CUDA device(s):
Work Unit Info:
...............
WU true angle range is : 8.984127
Cuda error 'cudaMemcpy(dev_cx_DataArray, cx_DataArray, NumDataPoints * sizeof(*cx_DataArray), cudaMemcpyHostToDevice)' in file 'd:/BTR/SETI6/SETI_MB_CUDA/client/cuda/cudaAcceleration.cu' in line 262 : unspecified launch failure.
SETI@home error -5 Can't open file
(work_unit.sah) in read_wu_state() errno=2

</stderr_txt>
]]>

Perhaps it is a bad GPU. I assumed it was overloaded with resident "wait to run" tasks, maybe it is actually bad memory. I will remove that card and try some test runs without it.

Bob
19) Message boards : Number crunching : BOINC v6.6.31 available (Message 903630)
Posted 4 Jun 2009 by Bob Mahoney Design
Post:
From Fred W:
...
So the completion and upload of an EDF WU on one half of my GTX295 causes the WU running on the other half to be pre-empted. I then end up with a whole list of "Waiting to run" WU's which error out with: [-5]

From Richard:
Unfortunately, all my CUDA cards are single-core, so I can't follow this one up with personal observations. But it sure sounds as if there's another bug still waiting to be tracked down.

My system with two single-core CUDA cards (two GTX285) was creating "wait to run" tasks at a rate comparable, card per card, with my 6xGTX295 system.

Fred's observation is alarming.

Since the preempting is also happening on my single-core GPU system, it goes back to my question: Why is the preempting happening in the first place? In other words, with a short cache (1.6 days), and plenty of time for all tasks to complete before deadline, why does a WU go EDF in the first place? Is the WU born and flagged that way before I get it? I thought EDF was a calculated state based on the immediate context within the local host?

Bob
20) Message boards : Number crunching : BOINC v6.6.31 available (Message 903612)
Posted 4 Jun 2009 by Bob Mahoney Design
Post:

You also seem to have been unlucky and have snagged a whole bunch of VLARs which have been killed off :-

The VLARs don't seem to be a problem. The autokill has been working OK, I think.

...
I see one -5 task which looks like the output from the pre-empt problem from this morning at 5:35 which I assume was after you've upgraded to 6.6.31? However other -5's say they are VLAR kills as well.

Yes, that was after the upgrade. Still, it does not make sense that tasks with later deadlines are doing the preempting, and the QX9770 system was only running a cache of 1.6 days. I have not kept up on the pre-empt situation as discussed in the forums here. It has been surprising how much it happens on my systems since upgrades above BOINC 6.6.21. It finally got bad enough that both systems are now detached and turned off... out of necessity - they just don't work anymore with today's configurations of software and WUs.

</stderr_txt>
]]>

Other errors include :-
Cuda error 'cudaMemcpy(dev_cx_DataArray, cx_DataArray, NumDataPoints * sizeof(*cx_DataArray), cudaMemcpyHostToDevice

which seems to point at a memory problem with CUDA - are your machines ever rebooted? The extra waiting tasks may still be in CUDA memory and it just ran out - famous last words - perhaps a reboot will clear the problem.

Last night I reset the system before going to bed. Unfortunately, the big system (6xGTX295) got so many preempts during my 6 hours of sleep, well, it died.


I have started to get a whole bunch of waiting to runs - I think this is because the preempting is no longer erroring them when the tasks switch - before they just errored and disappeared.

I really abused 6.6.31 when I upgraded it and couldn't make it -5 once - all with Windows XP SP3.

John.

Thanks for the input, John.

I'm guessing the nature of the 6xGTX295 exaggerates such problems, and the end result (failure) arrives sooner. That's why I tried building the system for SETI with only two GPUs. The problems are more subtle, yet it was also creating 3 or 4 "waiting to run"s per day for the past couple of days.

Question: What is the decision process that makes BOINC preempt in this situation? Is this a bug, or is there some logic to it?

Bob




 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.