Boinc 6.6.20 has problem getting new Einstein S5R5 units?

John
Joined: 5 Jun 99
Posts: 30
Credit: 71,930,436
RAC: 16,588
United States
Message 888511 - Posted: 26 Apr 2009, 16:44:17 UTC

I am placing this under Seti because it appears to me that Boinc 6.6.20 is the culprit. Anyway, I am running several projects on 4 machines, and as everybody knows Einstein recently had server problems. In that interlude I ran dry on Einstein and concurrently discovered that there was a newer version of Boinc (6.6.20). I decided to try it, as I hoped that they had fixed some problems which had made me decide not to run Cuda units. I upgraded to 6.6.20 on all my machines except one which did not have a Cuda capable GPU. It was left at version 6.2.14. (As a side note, the improvements in 6.6.20 with regard to running Cuda made me decide to stay with it.)

When Einstein fixed their problems and began delivering new units, the 6.2.14 machine immediately downloaded the new S5R5 executables and new units and began processing. The machine with 6.6.20 tried to get new units but always got zero. This went on for several days. Finally I decided to revert to Boinc version 6.4.7 on this machine. It immediately began downloading the new executables and new units. Seems like there is a bug there somewhere!

I would also ask the developers to go back to announcing the number of seconds of work being requested. This is helpful to me.
ID: 888511
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 13181
Credit: 154,063,658
RAC: 200,406
United Kingdom
Message 888516 - Posted: 26 Apr 2009, 17:04:58 UTC
Last modified: 26 Apr 2009, 17:05:26 UTC

I second the motion to provide more information on work fetch requests. Just the number of seconds may not be perfect, but we need more than at present.

There is a difficult, fiddly way of getting more information, by setting logging flags in a file called cc_config.xml - which you have to create from scratch the first time you use it (it isn't there normally). If you had set up 'work_fetch_debug' for BOINC v6.6.20, you would probably have seen that it was asking Einstein for CUDA work (only) - and getting none because Einstein haven't released a CUDA application yet. Perhaps less surprising when you look at it that way.
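For anyone who has not created that file before, here is a minimal sketch of what it contains, written as a small Python helper that prints the XML. The helper name `build_cc_config` is made up for illustration; the `cc_config`, `log_flags` and `work_fetch_debug` element names are the ones BOINC uses. The file is assumed to go in the BOINC data directory, whose location varies by platform and install, so the sketch only prints the text rather than guessing a path.

```python
def build_cc_config(flags):
    """Return the text of a cc_config.xml that switches on the given log flags.

    Each flag goes on its own line inside <log_flags>, set to 1 (enabled).
    """
    lines = ["<cc_config>", "  <log_flags>"]
    lines += [f"    <{name}>1</{name}>" for name in flags]
    lines += ["  </log_flags>", "</cc_config>"]
    return "\n".join(lines) + "\n"


# Only the flag mentioned in this thread; other flags can be added to the list.
config_text = build_cc_config(["work_fetch_debug"])
print(config_text)
```

Save the printed text as cc_config.xml, then restart the client or ask it to re-read its config file. The extra work fetch lines should then show up in the Messages tab.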

As to why it wasn't asking for standard CPU work: there have been other changes in the work fetch procedures in v6.6.20 (apart from asking for CPU and CUDA work separately). I don't know how much you've been following recent debates - you've been a SETI member for a long time, but it doesn't look as if you've used these message boards much - but you may be familiar with the concept of 'debt', used to keep track of the balance between your crunching for different projects. Debt still exists in v6.6.20, but the way it works has changed radically, and I don't think anybody fully understands the new setup yet. But if you are having difficulty getting work for a project after upgrading to v6.6.20, to the extent that you have no work at all for that project, it can be worth 'resetting' that project from BOINC Manager (resetting cancels all work in progress for the project, which is why I stress only doing it when you have no work at all). Resetting the project also resets the debt, and you should start downloading new work again: once v6.6.20 has been kick-started this way, it seems to continue running more smoothly.
ID: 888516
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 888529 - Posted: 26 Apr 2009, 18:02:32 UTC - in response to Message 888511.  

I am placing this under Seti because it appears to me that Boinc 6.6.20 is the culprit.

Keep in mind, however, that BOINC and SETI@Home are not the same, and while there is some cross-over in personnel, a posting to the SETI@Home forums is not necessarily going to be read by the BOINC developers.

ID: 888529
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 13181
Credit: 154,063,658
RAC: 200,406
United Kingdom
Message 888531 - Posted: 26 Apr 2009, 18:05:38 UTC - in response to Message 888529.  

I am placing this under Seti because it appears to me that Boinc 6.6.20 is the culprit.

Keep in mind, however, that BOINC and SETI@Home are not the same, and while there is some cross-over in personnel, a posting to the SETI@Home forums is not necessarily going to be read by the BOINC developers.

Nor is a posting to the BOINC forums, as Jord keeps reminding us!
ID: 888531
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 888534 - Posted: 26 Apr 2009, 18:22:12 UTC - in response to Message 888531.  
Last modified: 26 Apr 2009, 18:22:38 UTC

I am placing this under Seti because it appears to me that Boinc 6.6.20 is the culprit.

Keep in mind, however, that BOINC and SETI@Home are not the same, and while there is some cross-over in personnel, a posting to the SETI@Home forums is not necessarily going to be read by the BOINC developers.

Nor is a posting to the BOINC forums, as Jord keeps reminding us!

In every case, if you want to talk to the developers, you need to talk to the developers.

... and they pick the venue.

It seems that the best venue, if you must contact them, is the BOINC Developers mailing list.

It also seems appropriate to post this on the Einstein forums, in case it is something the Einstein administrators can address.

I'm not saying it is inappropriate here -- just that it might not actually reach the people the OP wants to reach.
ID: 888534
archae86
Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 888558 - Posted: 26 Apr 2009, 20:01:06 UTC

I'm a non-CUDA person who runs both SETI and Einstein, and had three hosts run out of Einstein work during the recent unpleasantness.

One of these hosts was a 6.6.20 host. The recovery behavior there was quite different than on the 5.10.45 and 5.10.20 hosts.

I run with 4% SETI, 96% Einstein work share allocation, so encounter the various issues with highly asymmetric work share sometimes.

The 6.6.20 host had pulled something like two days supply of SETI down during the Einstein famine, not unreasonable given that I had requested a total queue just over two days.

However, once Einstein came to life, and gave it _one_ WU, 6.6.20 once again applied the 4% share, decided that SETI was in grave danger of missing deadlines, and began running four SETI jobs at high priority, and not fetching any more Einstein! I think it actually posted a deadline miss message or status somewhere, though the deadlines were weeks in the future.

While one can see the internal logic at each step, the overall behavior was not pleasing to me.

I'll confess, after watching this for a few hours, I tampered. Setting SETI to no work fetch, I then suspended it, which promptly induced a useful amount of Einstein download (though nowhere near two days worth). However, when I removed the SETI suspension, 6.6.20 immediately suspended all Einstein work to resume running 4xSETI at high priority.

So I took the more severe step of aborting nearly all the unstarted SETI work, as the price of not extending my several day Einstein outage by two additional days.

None of which is to say it is nonfunctional, but in this case the combination of a strongly asymmetric work share request and a multi-day outage on the high share project gave behavior not to my liking.
ID: 888558
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 888567 - Posted: 26 Apr 2009, 20:21:05 UTC - in response to Message 888558.  


I run with 4% SETI, 96% Einstein work share allocation, so encounter the various issues with highly asymmetric work share sometimes.

<snip>

None of which is to say it is nonfunctional, but in this case the combination of a strongly asymmetric work share request and a multi-day outage on the high share project gave behavior not to my liking.

Not only is it functional, but it looks like it was trying to do what it could to meet your requirements.

If you'd let it run, BOINC would have honored the commitment to SETI (finish the work it was given, on time) and then concentrated on Einstein-only until the extra time for SETI was "paid back" to Einstein.
ID: 888567
Aurora Borealis
Volunteer tester
Joined: 14 Jan 01
Posts: 3075
Credit: 5,631,463
RAC: 0
Canada
Message 888571 - Posted: 26 Apr 2009, 20:39:32 UTC
Last modified: 26 Apr 2009, 20:45:02 UTC

The behavior seen by archae86 is normal although objectionable to many. The problem is that resource share is used for work scheduling as well as work fetch.

From Boinc's point of view, it could only give Seti 4% of the CPU time if it followed the rules. Its only safe option to meet the deadlines, and get back to a point where it could give Seti only 4%, was to go into EDF mode and burn through the Seti WUs quickly. It would then only download and crunch Einstein for a while until long term debt was brought into balance. The behavior may be annoying in the short term, but left to its own devices Boinc would probably recover within a day or so and start making the time up to Einstein over the next few days before trying to get more Seti work again. One advantage of a multicore system is that resource share balance can recover fairly quickly.

Under normal circumstances this works very well, especially if you have multiple projects. On my C2D system, I usually see a newly downloaded AP WU go into EDF for a few hours until Boinc is satisfied that its 5% share can be accommodated in normal round robin. By the time the AP has been completed the long term debt is already in balance and it can ask for more work from Seti.
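The decision described above can be sketched in a few lines of Python. This is an illustration only, not BOINC's actual scheduler: the `Task` class, the function names and all the numbers are hypothetical, and it assumes a project's share of the cores can be spread evenly over its queued tasks. It predicts which projects would miss a deadline if held to their resource share, then runs those projects' tasks first in earliest-deadline-first (EDF) order.

```python
from dataclasses import dataclass


@dataclass
class Task:
    project: str
    hours_left: float      # estimated CPU hours still needed
    deadline_hours: float  # wall-clock hours from now until the deadline


def misses_under_share(tasks, shares, cores):
    """Projects predicted to miss a deadline if each gets only its resource share."""
    total = sum(shares.values())
    at_risk = set()
    for project, share in shares.items():
        rate = cores * share / total  # core-hours of progress per wall-clock hour
        queued = sorted((t for t in tasks if t.project == project),
                        key=lambda t: t.deadline_hours)
        elapsed = 0.0
        for task in queued:
            elapsed += task.hours_left / rate
            if elapsed > task.deadline_hours:
                at_risk.add(project)
                break
    return at_risk


def run_order(tasks, shares, cores):
    """At-risk projects' tasks first, earliest deadline first; the rest keep their order."""
    at_risk = misses_under_share(tasks, shares, cores)
    urgent = sorted((t for t in tasks if t.project in at_risk),
                    key=lambda t: t.deadline_hours)
    normal = [t for t in tasks if t.project not in at_risk]
    return urgent + normal


# Hypothetical queue: 64 three-hour Seti tasks due in 21 days, one Einstein task.
shares = {"SETI": 4, "Einstein": 96}
tasks = [Task("SETI", 3.0, 504.0) for _ in range(64)] + [Task("Einstein", 10.0, 336.0)]

print(misses_under_share(tasks, shares, cores=4))
print(run_order(tasks, shares, cores=4)[0].project)
```

With these made-up numbers Seti is flagged as at risk and its tasks jump to the front of the queue, while the single Einstein task waits.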
ID: 888571
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1348
Credit: 10,594,765
RAC: 3,669
New Zealand
Message 888576 - Posted: 26 Apr 2009, 20:51:48 UTC
Last modified: 26 Apr 2009, 20:52:44 UTC

News from Einstein front page dated Apr 24 09

The Einstein@Home project is finally back up and running properly. The binary radio pulsar search may not be sending out new work for a couple of more days while we fix some VPN problems between the WU generator in Hannover and the project database in Milwaukee. However the S5R5 gravitational-wave search IS sending out plenty of work. Many thanks again to our loyal volunteers and contributors for their patience!
ID: 888576
archae86
Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 888641 - Posted: 27 Apr 2009, 0:20:41 UTC

I guess I was not clear enough.

The really undesirable aspect was not the assignment of four cores to SETI, though that was not quite my preference (and won't be made up so quickly as suggested, when you recall the 4% prescribed share), but rather the failure to download even a single quad of Einstein work.

No, I don't expect it to be omniscient, but, in case it is not obvious, this was bad for my personal case as it failed to take advantage of potentially limited server uptime and work available to build a buffer. No, I'd not expect instantly to get two days of Einstein, but I would expect it to backfill Einstein as it depleted the SETI. So long as I watched, this did not happen, and I suspect it would not have resumed remotely normal Einstein work fetch until it had nearly cleared the SETI work completely.

To those who slam down all scheduler comments with "that's normal", I was not speaking to you, but rather to others who might be considering whether to upgrade older versions, or to interpret behavior on their host.

Regarding Einstein, it has sent my hosts multiple binary radio pulsar search results since yesterday--the front page comment may be out of date in that respect, or perhaps they are depleting previously generated work.
ID: 888641
Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 888673 - Posted: 27 Apr 2009, 2:43:01 UTC

@archae86

There are other work fetch issues under discussion. My recommendation for those who see these issues is to actually back down from the "recommended" version. Sadly, as has happened in the past, a version that is not really ready is promoted as being ready ... *MY* feeling, based on observations on several projects and comments LIKE yours, is that 6.6.20 and its later cousins are not suitable for use in many circumstances.

The version that *I* personally recommend is, has been, and likely will be for some time, 6.5.0 if you do CUDA ... your choice if you do not ... but none of the 6.6.x versions, in my opinion, have stood the test of time across the broad spectrum of participants.

When you have "unbalanced" work load configurations the new 6.6.x work fetch does not seem to keep itself in balance. I have one system where I have to use cc_config to reset debts about once every 24-48 hours ... as soon as I note that I cannot keep 4 GPU Grid tasks queued, because I know then that the test version (6.6.23) that I am running is unrecoverable. Another participant on Rosetta has the opposite problem, where he cannot get the full queue of Rosetta work.

If history is a guide, after a few more dot releases we MAY have the debt logic working better, though I have strong doubts that this really will be true, because we don't really have a history report that shows actuals other than the running numbers, which may not be correctly calculated.

By this I mean, what are the actual seconds per day allocated to each project ...those we can then add up to see if they match what we SHOULD have accumulated.
ID: 888673
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 888674 - Posted: 27 Apr 2009, 2:46:03 UTC - in response to Message 888641.  

I guess I was not clear enough.

The really undesirable aspect was not the assignment of four cores to SETI, though that was not quite my preference (and won't be made up so quickly as suggested, when you recall the 4% prescribed share), but rather the failure to download even a single quad of Einstein work.

No, I don't expect it to be omniscient, but, in case it is not obvious, this was bad for my personal case as it failed to take advantage of potentially limited server uptime and work available to build a buffer. No, I'd not expect instantly to get two days of Einstein, but I would expect it to backfill Einstein as it depleted the SETI. So long as I watched, this did not happen, and I suspect it would not have resumed remotely normal Einstein work fetch until it had nearly cleared the SETI work completely.

To those who slam down all scheduler comments with "that's normal", I was not speaking to you, but rather to others who might be considering whether to upgrade older versions, or to interpret behavior on their host.

Regarding Einstein, it has sent my hosts multiple binary radio pulsar search results since yesterday--the front page comment may be out of date in that respect, or perhaps they are depleting previously generated work.

I'm not arguing what's normal, I'm saying that in my opinion, it is desirable.

The reason I think it's the right thing to do is because BOINC does not (and should not) try to predict the deadlines on the next Einstein work unit. BOINC does not try to figure out if the deadlines are constant or variable, because a project can introduce new science applications (as Einstein has), or crunch time can change dramatically (as with SETI at different angle ranges).

When BOINC sees deadline pressure, there is always the chance that a newly assigned work unit could have an earlier deadline -- that new work could change a workable situation into an impossible one.

The other part of what I said was "yes, you'll end up doing more SETI than you strictly care to do, but you'll do more Einstein later as a result" -- the impact on your particular case is temporary.

Dismissing everyone who disagrees with you just because "it's normal" is to dismiss those who tried to constructively explain why BOINC is doing the right thing.
ID: 888674
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 888677 - Posted: 27 Apr 2009, 2:56:29 UTC - in response to Message 888673.  
Last modified: 27 Apr 2009, 2:59:46 UTC

When you have "unbalanced" work load configurations the new 6.6.x work fetch does not seem to keep itself in balance. I have one system where I have to use cc_config to reset debts about once every 24-48 hours ... as soon as I note that I cannot keep 4 GPU Grid tasks queued, because I know then that the test version (6.6.23) that I am running is unrecoverable. Another participant on Rosetta has the opposite problem, where he cannot get the full queue of Rosetta work.

The big change from 6.4.x to 6.6.x is that 6.4.x tried to keep the cache full, and would pull from any project to "top up."

It wouldn't necessarily get work from the "most owed" project.

6.6.20 won't use projects that are ahead of the game to fill the "extra days" -- it will only use the "least owed" projects to fill up to the "connect every 'x'" size.

Among other things, that makes me think that users who have "extra days" at zero could see very different results from those who have extra days near the maximum.

I also suspect that clients that have built up much debt under 6.4.5 are going to take a lot of time to "burn off" months of accumulated debt, and a lot of reports will center around that.

That is of course assuming that people have the patience to watch over weeks and months of work, and not hours or days.

I suspect that resource share needs to be kept for GPU and CPU work separately, and those who are most interested in seeing every slot filled might select a different set of design flaws.

[edit]I'll note, Paul, that archae86 says he isn't doing CUDA. In my opinion, 6.6.20 will come closer to keeping his system busy and averaging out to his desired resource share, even if it looks out of balance at times.[/edit]
ID: 888677
Nick: ID 666
Volunteer tester
Joined: 18 May 99
Posts: 13054
Credit: 36,531,359
RAC: 21,040
United Kingdom
Message 888684 - Posted: 27 Apr 2009, 3:33:28 UTC

With a resource share of 4% on a quad, you are normally allocating just under one hour per day to Seti. With Einstein off and BOINC filling your two day cache with Seti, it means that when Einstein recovered and started sending you work, your Seti cache was effectively over 192 days of work (2 * 4 * 24). As that is much greater than the longest Seti deadline, priority, EDF or panic mode must come into operation.
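A rough check of that arithmetic, as a sketch. It assumes the queued Seti work is measured in core-hours and that the 4% share is the only CPU time Seti ever gets; the variable names are made up for illustration. The figure is worked out two ways, since "4%" can be read as 4% of one core's day (the "just under one hour" above) or as 4% of all four cores.

```python
CORES = 4          # quad-core host
SETI_SHARE = 0.04  # 4% resource share
CACHE_DAYS = 2     # requested work queue, in days

# A two day cache filled entirely with Seti, in core-hours: 2 * 4 * 24.
backlog_core_hours = CACHE_DAYS * CORES * 24

# Seti's normal daily allowance under each reading of "4%".
one_core_budget = SETI_SHARE * 24               # 4% of a single core's day
whole_machine_budget = SETI_SHARE * CORES * 24  # 4% of all four cores

# Days needed to clear the backlog if Seti were held to that allowance.
days_one_core = backlog_core_hours / one_core_budget
days_whole_machine = backlog_core_hours / whole_machine_budget

print(backlog_core_hours, "core-hours queued")
print(round(days_one_core), "days at 4% of one core")
print(round(days_whole_machine), "days at 4% of four cores")
```

That gives roughly 200 days on the first reading and roughly 50 days on the second. Either way the queue cannot be cleared quickly at the nominal share, which is consistent with the client switching to deadline-driven scheduling.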

ID: 888684
archae86
Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 888706 - Posted: 27 Apr 2009, 6:11:48 UTC - in response to Message 888684.  

With a resource share of 4% on a quad, you are normally allocating just under one hour per day to Seti. With Einstein off and BOINC filling your two day cache with Seti, it means that when Einstein recovered and started sending you work, your Seti cache was effectively over 192 days of work (2 * 4 * 24). As that is much greater than the longest Seti deadline, priority, EDF or panic mode must come into operation.

But it does not have to refuse to download Einstein work.

ID: 888706
Nick: ID 666
Volunteer tester
Joined: 18 May 99
Posts: 13054
Credit: 36,531,359
RAC: 21,040
United Kingdom
Message 888709 - Posted: 27 Apr 2009, 6:23:16 UTC - in response to Message 888706.  

With a resource share of 4% on a quad, you are normally allocating just under one hour per day to Seti. With Einstein off and BOINC filling your two day cache with Seti, it means that when Einstein recovered and started sending you work, your Seti cache was effectively over 192 days of work (2 * 4 * 24). As that is much greater than the longest Seti deadline, priority, EDF or panic mode must come into operation.

But it does not have to refuse to download Einstein work.

My observation of BOINC operation says that if it has tasks from all attached projects, then it will try to process tasks as per the STD and resource share, which in your case would just put the computer further into the mire.
Probably it is better not to download Einstein work, or download a minimum until sufficient Seti tasks have been cleared out of the way. And then, if possible, not download any more Seti tasks for the foreseeable future.
ID: 888709
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 888990 - Posted: 28 Apr 2009, 0:26:16 UTC - in response to Message 888706.  

With a resource share of 4% on a quad, you are normally allocating just under one hour per day to Seti. With Einstein off and BOINC filling your two day cache with Seti, it means that when Einstein recovered and started sending you work, your Seti cache was effectively over 192 days of work (2 * 4 * 24). As that is much greater than the longest Seti deadline, priority, EDF or panic mode must come into operation.

But it does not have to refuse to download Einstein work.

It isn't refusing Einstein; it is refusing to add work until it can safely complete everything on time.

Think of "not downloading work" as "not pouring gasoline on a fire."
ID: 888990
Ghery S. Pettit
Joined: 7 Nov 99
Posts: 298
Credit: 27,360,150
RAC: 3,267
United States
Message 889037 - Posted: 28 Apr 2009, 3:09:06 UTC - in response to Message 888673.  


The version that *I* personally recommend is, has been, and likely will be for some time, 6.5.0 if you do CUDA ... your choice if you do not ... but none of the 6.6.x versions, in my opinion, have stood the test of time across the broad spectrum of participants.



I'm curious about why this recommendation. I'm running CUDA and run 100% SETI except for those rare occasions (they've happened twice) when SETI craters and then I run Rosetta just to keep the machines busy in case the SETI WUs run out. What is the problem with 6.6.x with CUDA? I'll update all other machines, but I'll hold off on the one CUDA capable machine until I hear more about this.

Thanks.
ID: 889037