Message boards : Number crunching : New Twist For "Aborted by project" Issue?
zombie67 [MM] · Joined: 22 Apr 04 · Posts: 758 · Credit: 27,771,894 · RAC: 0
> 5492: limit from 4 to 8, and make it a compile-time parameter

So there are two variables in play: has the code been updated, and has the per-thread limit been changed? It could be, for example, that the new code has been implemented but the per-thread limit was reduced to 50 at the same time. That would result in no net change. In any case, I have noticed no change.

Dublin, California
Team: SETI.USA
Brian Silvers · Joined: 11 Jun 99 · Posts: 1681 · Credit: 492,052 · RAC: 0
> 5492: limit from 4 to 8, and make it a compile-time parameter

For those of us on single cores, we'd notice if our quotas got cut in half... The detail page for my machines still has:
The way I read what was posted, each project could compile the scheduler code independently, providing the compile-time value for the max (up to 8). If that is the case, then the compilation for SAH could've just been given a value of 4.

Additionally, your "need" for additional caching is being disproved by the fact that you are having large numbers of project-side aborts. The more of those types of aborts you have, the more it is indicative of your machine not being fast enough to process what you have in the cache before quorums are met. That is also a justification for my viewpoint that the cap of 400/day does not need to be raised at this time. You simply cannot process that much that quickly on a normal day and have it all be scientifically needed. All a higher quota is going to do is allow you to "tank up" quicker, but you're already able to do that faster than you can process anyway; otherwise the messages about hitting the quota would never go away... They do end up going away, by your own admission.

The only reason for maintaining such a large cache as what you've got is to survive massive outages. My 3-day cache has weathered all but Thumper crashing... When that happened, I went idle for a few days, then fired up Einstein for a few days...
Joined: 5 Jan 00 · Posts: 2892 · Credit: 1,499,890 · RAC: 0
IF by 'per-thread' you mean 'per-CPU', then no, it is still at a max of 100/CPU/day, giving a maximum possible quota of 400/day on a host with 4 or more CPUs under the old setup, and 800/day on a host with 8 or more CPUs if they have accepted the change using its default value of 8. This leaves 4 possibilities:

1. The scheduler code has not yet been updated to this version.
2. The change was later backed out and the scheduler has been updated beyond it. I will research the changelog in a few minutes to see if it was later backed out.
3. The scheduler code has recently been updated to use 8 CPUs per host as the maximum value in calculating a host's total quota of results per day, but you have yet to notice it.
4. The scheduler code has been updated as above, but they used a compile-time switch to override the default of 8 and set it back down to 4.

As I said earlier, I am not sure if Berkeley has implemented the change in the scheduler's code or not. I thought they had, but am not sure.

On a side note: on this particular host I am typing this on (a dual-core AMD64), I am beginning to see these aborts as well. I am running 3 projects on this one using equal resource shares (100/100/100), with a connect interval of 0 and a cache override of half a day (and yes, I am using a version of BOINC that groks the override (5.10.x)). Maybe I will drop the override to one-third or one-quarter of a day and see if that stops the project aborts, or at least makes them more rare.

https://youtu.be/iY57ErBkFFE
#Texit
Don't blame me, I voted for Johnson (L) in 2016.
Truth is dangerous... especially when it challenges those in power.
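The quota arithmetic being debated in this thread can be sketched in a few lines. This is a toy illustration, not the actual BOINC scheduler source; the function name and parameters are assumptions mirroring the quoted changelog entry ("limit from 4 to 8, and make it a compile-time parameter"):

```python
# Toy sketch of a per-host daily quota derived from a compile-time CPU cap.
# Names and defaults are illustrative assumptions, not BOINC code.

def host_daily_quota(ncpus: int,
                     per_cpu_quota: int = 100,
                     max_cpus_per_host: int = 8) -> int:
    """Count at most max_cpus_per_host CPUs, then scale by the per-CPU quota."""
    return min(ncpus, max_cpus_per_host) * per_cpu_quota

# Old setup (cap of 4): a quad or bigger host tops out at 400/day.
# New default (cap of 8): an 8-thread host could reach 800/day.
```

Under this model, possibility 4 above is simply compiling with the cap set back to 4, which reproduces the old 400/day ceiling exactly.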
Joined: 5 Jan 00 · Posts: 2892 · Credit: 1,499,890 · RAC: 0
I'm now starting to see these aborts on a host with an average turnaround on this project of 0.4 days, 3 projects (equal shares), and a 0.5-day cache. I'm going to drop my cache to a bit less than 0.4 days and see if they go away, or at least become less frequent. I realize that it's not for everyone, but I've chosen to weather potential outages here by crunching on multiple projects instead of tanking up a large cache. YMMV, SSFD.
Brian Silvers · Joined: 11 Jun 99 · Posts: 1681 · Credit: 492,052 · RAC: 0
It is likely #1, but could be #4... I think it is wise to make it a configurable option, so it's doubtful it was backed out.

Zombie, for whatever reason, feels it "needs" to be changed to 8 right now. Joe Segur mentioned in another thread a temporary increase to 128/CPU. I don't think that's needed either, unless there is a very lengthy run of immediate -9 results that completely decimates someone's cache. However, what's to say that the other 112 pulled would be any different if it has gotten that bad?

The point that seems to be missed is that two things are true:

1) Even the fastest systems are still not able to process more than about 300 units on average per day. 300 is less than 400 by a considerable percentage.

2) Astropulse and multibeam will increase processing time, reducing even further the number of results that can be processed in any "average" day. This means that 400/day will still be able to fill up a cache quickly.

All in all, unless something really bad starts happening, none of this is anything to worry about... Database stability is something to be more concerned about... Reducing the size of a database will generally make it more stable: queries need less memory and run faster, and it takes less time to compress and to do backups, etc, etc, etc... IMO, YMMV, etc, etc, etc...
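Point 1 above can be sanity-checked with back-of-the-envelope arithmetic. The per-result CPU time used here (about 2,300 seconds) is an assumed figure chosen to match the "about 300 units per day" claim for a fast 8-thread host of the era, not a measured value:

```python
# Rough throughput estimate: how many results a host can finish per day,
# given its thread count and the CPU time per result. The 2,300 s figure
# below is an assumption for illustration, not a benchmark.

SECONDS_PER_DAY = 86_400

def results_per_day(threads: int, cpu_seconds_per_result: float) -> float:
    return threads * SECONDS_PER_DAY / cpu_seconds_per_result

# An 8-thread host at ~2,300 s/result lands near 300 results/day,
# comfortably under a 400/day quota; a single core manages under 40.
```

This is why, on an average day, the 400/day cap never binds even for the fastest hosts; it only matters when refilling an emptied cache after an outage.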
Joined: 5 Jan 00 · Posts: 2892 · Credit: 1,499,890 · RAC: 0
Nope. Just checked the changelog, and if it WAS backed out, the changelog doesn't list it.
zombie67 [MM] · Joined: 22 Apr 04 · Posts: 758 · Credit: 27,771,894 · RAC: 0
> For those of us on single cores, we'd notice if our quotas got cut in half...

?? No. The ratio is no different, so there would be no difference in noticeability.

> Additionally, your "need" for additional caching is being disproved by the fact that you are having large numbers of project-side aborts. The more of those types of aborts you have, the more it is indicative of your machine not being fast enough to process what you have in the cache before quorums are met.

No. The need for a large queue is *not* to process tasks before quorums are met. It is to keep the CPUs fully utilized. CPUs not working generate no work. The assumption being that any work issued is needed.

> The only reason for maintaining such a large cache as what you've got is to survive massive outages. My 3-day cache has weathered all but Thumper crashing...

Check out my graph here: http://www.boincstats.com/stats/user_graph.php?pr=sah&id=1828996

See the big flat spot? A larger cache would have allowed continued crunching for the duration, with the ramp line continuing when the project was back online.

> When that happened, I went idle for a few days, then fired up Einstein for a few days...

Idle? Blasphemy!
zombie67 [MM] · Joined: 22 Apr 04 · Posts: 758 · Credit: 27,771,894 · RAC: 0
> IF by 'per-thread' you mean 'per-cpu',

No. I mean per-thread. For example, a P4 with HT has two threads. BOINC sees this as two cores, but it is a single core. Another example is my dual-Xeon machine. It has 4 cores, with 2 threads per core, so BOINC sees it as 8 cores. But it is really 8 threads on 4 cores.
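The thread/core distinction being argued here is just multiplication, but it is worth making explicit since BOINC counts logical processors. A minimal sketch (the helper name is hypothetical, not a BOINC API call):

```python
# BOINC counts logical processors (hardware threads), not physical cores.
# This helper is purely illustrative of the examples in the post above.

def logical_cpus(physical_cores: int, threads_per_core: int) -> int:
    return physical_cores * threads_per_core

# P4 with HyperThreading: 1 core x 2 threads -> BOINC "sees" 2 CPUs.
# Dual Xeon example:      4 cores x 2 threads -> BOINC "sees" 8 CPUs.
```

So a per-"CPU" quota is really a per-thread quota on HT hardware, which is exactly why the terminology matters for the 4-vs-8 cap debate.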
Joined: 5 Jan 00 · Posts: 2892 · Credit: 1,499,890 · RAC: 0
Uhhh... hmmm... Well, while Astropulse (at least in its current 'debugging' form -- it's still very much Alpha) is dog slow (an estimated 150+ hours for a 100% run on my fast host; undoubtedly it will speed up somewhat as it gets closer to Beta and eventual Release status), once the backlog is crunched there will only be one AP workunit generated for every 256 or so S@H Enhanced workunits.

As far as multibeam goes, the version currently in Beta (5.21) is slightly faster (at least in my experience, YMMV) than the stock 5.15 app over here in the main project. I don't think multibeam is going to increase processing time very much at all, at least not in its current form. What it will do is provide several times the workunits per recording period compared to the old feed and recorder. However, from what I understand, the multibeam feed and recorder will be available and functioning less often than the old ones. So, it's all likely to even out in the wash.

I do agree with your points about database reliability. Also, given the projected long-term workunit shortage, I think the possibility exists that the daily quota might even be decreased, to give more people a chance to crunch some of the available data at the expense of the big guns always getting a full load.
zombie67 [MM] · Joined: 22 Apr 04 · Posts: 758 · Credit: 27,771,894 · RAC: 0
> Zombie, for whatever reason, feels it is "needed" right now to be changed to 8.

I don't know that it needs to be increased to 8, or if the current 400 max limit just needs to be increased. In other words, the way we get there is not important. I don't see a downside to having a limit so high it will never be met.

> 1) Even the fastest systems are still not able to process more than about 300 units on average per day. 300 is less than 400 by a considerable percentage.

I don't care about averages. I want to ensure there is never down-time.

> 2) Astropulse and multibeam will increase processing time, reducing even further the number of results that can be processed in any "average" day. This will mean that 400/day will still be able to fill up a cache quickly.

Fair enough. If/when that is in place, *then* we can talk about it.

> All in all, unless something really bad starts happening, none of this is anything to worry about...

The past month of downtime proves we need it right now.
Joined: 5 Jan 00 · Posts: 2892 · Credit: 1,499,890 · RAC: 0
> IF by 'per-thread' you mean 'per-cpu',

I am fully aware of the existence of HT. However, as you stated, BOINC 'sees' it as two CPUs, and this project uses the term 'CPU' on its webpages in regard to this quota. I was not trying to correct you on your use of the term 'per-thread'; it is the more correct term. For instance, on this dual-core box, BOINC sees two CPUs, so it runs two threads. However, if I run something else that is CPU-intensive and set its core affinity to one of the two CPUs, BOINC's two threads will spend the great majority of their time sharing the remaining CPU (getting about 49 or 50% of that CPU each). Again, I was not trying to correct you, merely to verify that we were, in fact, talking about the same thing by referring to the term used on this project's website, however incorrect it may be.
zombie67 [MM] · Joined: 22 Apr 04 · Posts: 758 · Credit: 27,771,894 · RAC: 0
> Again, I was not trying to correct you, merely to verify that we were, in fact, talking about the same thing by referring to the term used on this project's website, however incorrect it may be.

Fair enough. People tend to use the terms "CPU", "chip", and "core" interchangeably, and it leads to mucho confusion, particularly when it comes to BOINC. If we use "thread", there is no way to mess it up... well, less of a way.
Joined: 5 Jan 00 · Posts: 2892 · Credit: 1,499,890 · RAC: 0
If you want to ensure that there is never down-time on your hosts (that is, that they are always working on a project), the only real way to do so is to attach to multiple projects, as you have done. S@H/BOINC has never guaranteed 100% uptime, nor that work will always be available. Over the years, project staff have stated this repeatedly. See the sections "Finite work supply" and "Why is SETI@home switching to BOINC?" in their transition document: http://setiathome.berkeley.edu/transition.php

Especially given its underfunded status, expecting both 100% uptime from this project's servers and an 'all you can eat' availability of work 100% of the time just isn't very realistic.
zombie67 [MM] · Joined: 22 Apr 04 · Posts: 758 · Credit: 27,771,894 · RAC: 0
> I don't care about averages. I want to ensure there is never down-time.

Yes, yes, yes. I know all this. Anyone who has read even a few of the messages around here knows this. I have said many times that I am trying to benchmark my 8-way machine. Attaching to other projects is not a solution to that. But that's just a selfish goal. *But* I am also arguing this from a couple of other perspectives:

1) Some people are dedicated to a project, not to DC/BOINC. For them, there is no other project.

2) Yeah, I have what is considered a "hot" machine now. A year from now, it will be an energy-wasting slug that I will have a hard time justifying the energy to run.

Projects need to start adjusting their perspective on what is a reasonable limit. Now.
Brian Silvers · Joined: 11 Jun 99 · Posts: 1681 · Credit: 492,052 · RAC: 0
> All in all, unless something really bad starts happening, none of this is anything to worry about...

Let's say you tanked up to full quickly after an outage. Let's say that uploading and reporting are working fine, but downloading is not, which is the situation we've had happen a few times lately. Have you thought through the impact when this happens? Here's what will happen:
> 1) Even the fastest systems are still not able to process more than about 300 units on average per day. 300 is less than 400 by a considerable percentage.

Even if ensuring that you don't have "down-time" means that perhaps others do? You are attached to multiple projects, so I know that you know there are more and don't object to attaching to them. IMO, and you can tell me if I'm wrong, you want your quad attached here at 100% uptime because of the rivalry between the Mac users / team and Francois. Is that rivalry really that important to you? He'll be just as impacted as you are... and will gain the same benefit that you seem to wish to gain... I don't see the big advantage, if this is your primary motivation... :shrug:

Someone who I wouldn't have expected to be in favor of upping it was in favor of upping it. I'm not going to name them, but if they read this they'll know I'm talking about them... ;-) Their point was that the restriction was originally put in place so that a host generating a lot of error results (real errors, not aborts) would get stopped and not keep putting continual junk into the result database; IOW, to stop a "runaway train" scenario... They felt that so long as your system wasn't generating errors, they had no problem with increasing the number of CPUs to 8 and even bumping the 100 to 150. I never took it up with this person, but at that time, just like now, I felt that perhaps that was the original intent; but with the fragility of the project and the low amounts of available work, the original intent is not the only benefit...
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13947 · Credit: 208,696,464 · RAC: 304
> Projects need to start adjusting their perspective on what is a reasonable limit. Now.

Then give them the money to do so. Now.

People can wish for & plead & demand as much as they like; if the project doesn't have the resources to do it, then it can't be done. People need to adjust their perspective on what is a reasonable expectation.

Grant
Darwin NT
zombie67 [MM] · Joined: 22 Apr 04 · Posts: 758 · Credit: 27,771,894 · RAC: 0
> Projects need to start adjusting their perspective on what is a reasonable limit. Now.

Do you not see the star next to my name?
zombie67 [MM] · Joined: 22 Apr 04 · Posts: 758 · Credit: 27,771,894 · RAC: 0
> Let's say you tanked up to full quickly after an outage. Let's say that uploading and reporting are working fine, but downloading is not, which is the situation we've had happen a few times lately. Have you thought through the impact when this happens? Here's what will happen:

The way around all that (I think) is to turn off network activity.

> IMO, and you can tell me if I'm wrong, you want your quad (zombie say: OCTO) attached here at 100% uptime because of the rivalry between the Mac users / team and Francois. Is that rivalry really that important to you? He'll be just as impacted as you are... and will gain the same benefit that you seem to wish to gain... I don't see the big advantage, if this is your primary motivation... :shrug:

Frankly, it is a benchmark *and* a rivalry. I want to know how well my Mac with alexkan's application performs against Windows with Chicken's. Is there some amount of competition? Heck yeah! That's the whole point of the points and statistics pages.

The following stuff? No idea what you are talking about.

> Someone who I wouldn't have expected to be in favor of upping it was in favor of upping it. I'm not going to name them, but if they read this they'll know I'm talking about them... ;-) Their point was that the restriction was originally put in place so that a host generating a lot of error results (real errors, not aborts) would get stopped and not keep putting continual junk into the result database; IOW, to stop a "runaway train" scenario... They felt that so long as your system wasn't generating errors, they had no problem with increasing the number of CPUs to 8 and even bumping the 100 to 150.
Brian Silvers · Joined: 11 Jun 99 · Posts: 1681 · Credit: 492,052 · RAC: 0
> For those of us on single cores, we'd notice if our quotas got cut in half...

My response you quoted above was a reply to your proposed thought that an increase to 8 cores happened in conjunction with a decrease to 50 per core. This is one reason why you and I are butting heads on this: you won't do, and/or don't understand, some very simple math. While there would be effectively no change for you (400 = 400), for me 100 turns into 50. My effective quota would indeed be cut in half, while the effective result for you is equivalence.

Extending this line of thinking: if I ran into a large batch of -9 results, since I have one core and no hyperthreading (I have 1 AMD system and one pre-HT Intel system), I am a lot more easily "screwed" out of available work than you are as it stands already. If you were concerned about *everyone*, and if you really "don't care how we get there", then asking for 200/core with the 4-core max still in place would seem considerably less "greedy", IMO, YMMV, etc, etc, etc...
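The "simple math" above can be written out explicitly. Scheme A is the old setup (100 per CPU, capped at 4 CPUs); scheme B is the hypothetical change floated earlier in the thread (50 per CPU, capped at 8). The function is a sketch for illustration only:

```python
# Effective daily quota under two hypothetical schemes. Scheme names and
# parameter values follow the thread's discussion, not any real config.

def quota(ncpus: int, per_cpu: int, cap: int) -> int:
    return min(ncpus, cap) * per_cpu

octo_old   = quota(8, 100, 4)   # 8-thread host, old scheme
octo_new   = quota(8,  50, 8)   # 8-thread host, hypothetical scheme
single_old = quota(1, 100, 4)   # single-core host, old scheme
single_new = quota(1,  50, 8)   # single-core host, hypothetical scheme
```

The 8-thread host gets 400 either way, while the single-core host drops from 100 to 50: no change for one party, a halving for the other.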
The only thing that allows you to have a "larger cache" is the new BOINC client, and that is a different issue than raising the quota. Getting 800/day would still end up at the maximum of 10 days cached, based on the number of seconds of work required for your system, unless you have the newer client. The length of the Thumper outage was either beyond the capability of a 10-day cache to support or at the extreme end of being able to support it. Fundamentally, I "get" the idea that you want your quad at 100%. It's just that you don't seem to be looking at the entire picture...
Due to the costs of electricity and other misgivings about the current state of DC in general, if things have not improved by the time I reach 500K BOINC-wide, I'm considering a reduction to non-24x7 operation. I should hit that target figure sometime in August. If by year-end things are still not better (specifically with Einstein and with LHC, and if there has been no change in Orbit's situation; I'm not so worried about SETI, they'll figure this stuff out by then), then I'm considering stopping DC as a whole. I can't afford to buy an energy-efficient system at this point. If that hasn't changed, then financial realities must be considered.

Brian
W-K 666 · Joined: 18 May 99 · Posts: 19691 · Credit: 40,757,560 · RAC: 67
Looking at my family's five computers, it would appear that over 80% of WUs are validated within 2 days. So any cache longer than this has a high chance of running into the auto-abort situation.
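W-K's rule of thumb can be turned into a crude model. Assuming most workunits reach quorum within about 2 days of issue, roughly the portion of a cache queued beyond that window is at risk of a server-side abort. This is a toy illustration with a deliberately simplified uniform-queue assumption; real turnaround distributions are messier:

```python
# Toy model: fraction of a work cache exposed to server-side aborts,
# assuming tasks are processed in order and quorums typically complete
# within `validate_days` of issue. Purely illustrative.

def exposed_fraction(cache_days: float, validate_days: float = 2.0) -> float:
    if cache_days <= validate_days:
        return 0.0
    return (cache_days - validate_days) / cache_days

# A 3-day cache: the last day of it (1 of 3) sits past the typical
# 2-day validation window; a 10-day cache leaves 80% of it exposed.
```

Under this model, a cache at or below the typical validation window sees essentially no project-side aborts, which matches the earlier reports that dropping the cache override below the host's average turnaround made the aborts rarer.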
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.