How to Fix the current Issues - One man's opinion

Profile popandbob
Volunteer tester

Joined: 19 Mar 05
Posts: 551
Credit: 4,673,015
RAC: 0
Canada
Message 2029472 - Posted: 27 Jan 2020, 0:47:53 UTC

I've been a long-term SETI cruncher and I would like to put forward my thoughts on a fix for the current issues.

I propose that the current replica server be used to create a separate project for the "Power Users" - those who have high-speed computers with lots of GPUs or CPUs.
This separate project would have vastly increased task sizes (at least 10x larger) with greatly reduced deadlines (1-2 weeks maximum).
There should be a credit bonus (say 10%?) for the increased task sizes and reduced deadlines, to encourage those power users to switch to this lower-overhead project.
I would estimate half of the current servers here at main could be shifted over to this "Power Users" project.
As an added benefit, the project's efficiency would increase due to the reduced overlap present in the larger tasks.

Power users here currently create a massive workload for the servers. The top 10 computers alone (as of when I checked last night) had over 280,000 tasks in the database with zero in progress. If things were running well they would easily have another 40,000+ tasks in the database.
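
As a rough sketch of that arithmetic (assuming database rows simply scale inversely with task size):

    # Back-of-envelope: rows held by the top 10 hosts now, versus the
    # same work packaged as 10x-larger "Power Users" tasks.
    rows_top10_now = 280_000       # tasks in the database, zero in progress
    size_multiplier = 10           # proposed minimum increase in task size

    rows_top10_split = rows_top10_now // size_multiplier
    print(rows_top10_split)        # -> 28000 rows instead of 280000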

I understand this would cause an increased workload on the team, running 2 projects instead of 1; however, the reduced load on the servers would help alleviate issues such as those we are currently facing. As an added benefit, when one project has issues, the other can take some of the load, which means the team would not have to prioritize repairs to keep things running to the degree they do now.
I also understand this would require the main servers to handle all of the task viewing and other read-only traffic. However, several years ago, when the number of tasks was vastly smaller, they handled it just fine, and I believe that with the above changes they could handle it again, since most set-and-forget users, who will remain on this project, will not be checking up on tasks anywhere near as often. Those "Power Users" who do check up on tasks would create less load on the "Power Users" project due to the increased task sizes.

These changes would still allow those older, part-time contributors to do their bit, while letting the "Power Users" put their ever-increasing hardware capabilities to use as well.
ID: 2029472
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 2029475 - Posted: 27 Jan 2020, 1:12:23 UTC - in response to Message 2029472.  
Last modified: 27 Jan 2020, 1:46:05 UTC

I propose that the current replica server be used to create a separate project
Splitting things into 2 projects will more than double the work the servers need to do, not to mention the staff, and the fact is the present hardware is incapable of dealing with the current level of demand. And the project wants and needs more systems than it presently has in order to process the data it's already got.

The simple fact is the project has pretty much maxed out the capabilities & reached the limits of its present database servers. Re-jigging the database can do only so much, but the project needs more work to be processed than is presently done, and new hardware is the only way forward that I can see.


Edit-
Actually it wouldn't double the server work, but the fact is it would result in an increased load over what the servers are already dealing with.
Grant
Darwin NT
ID: 2029475
Profile popandbob
Volunteer tester

Joined: 19 Mar 05
Posts: 551
Credit: 4,673,015
RAC: 0
Canada
Message 2029482 - Posted: 27 Jan 2020, 2:57:48 UTC - in response to Message 2029475.  
Last modified: 27 Jan 2020, 2:58:02 UTC

I'm not sure how you figure it would increase the server work, Grant.
Creating a separate project with greatly increased unit sizes would vastly decrease server loading, because it would have at least 10x fewer tasks/workunits to handle!
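
As a rough sketch, in Python (the daily throughput is a made-up round number, and the per-task event count is illustrative):

    # Hypothetical fast host: the same data crunched per day, two task sizes.
    tasks_per_day_1x = 4000        # assumed throughput with current tasks
    events_per_task = 5            # e.g. send, report, validate, assimilate, purge
    size_multiplier = 10

    events_1x = tasks_per_day_1x * events_per_task
    events_10x = (tasks_per_day_1x // size_multiplier) * events_per_task
    print(events_1x, events_10x)   # -> 20000 vs 2000 server events per day
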
ID: 2029482
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 2029487 - Posted: 27 Jan 2020, 3:34:24 UTC - in response to Message 2029482.  

I'm not sure how you figure it would increase the server work, Grant.
You would be running an extra database on the same server. Same amount of data, same type of data, but in more tables.
More tables are good when normalising a database (up to a point), but when you are creating more tables to effectively double up on some of the data, and keeping it on the same hardware, you're increasing the load that hardware has to deal with.

It would be better to fix the issue than just work around it. Hardware is going to continue to improve, so the load on the servers is going to continue to increase even if there are no more people or new systems joining the project. The present database servers have reached their limits; shuffling things around may provide some temporary relief, but it doesn't address the underlying issue - the hardware has reached its limit.
Grant
Darwin NT
ID: 2029487
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19043
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2029508 - Posted: 27 Jan 2020, 11:52:58 UTC
Last modified: 27 Jan 2020, 11:54:50 UTC

Define "Power Users".
Is it by Participant or Host?
If Participant, then it might be a person running lots of low-performance hardware.
If by Host, then you will find that some Participants fall into both categories.

How often do you promote or demote to or from the "Power User" group?

I am not a power user, but my single host, with a single 2060 GPU, is in the top 800; as a participant, though, I'm not in the top 1000.

Therefore your "Power Users" are going to be a small band of brothers, and as they are crunching 1000's of tasks/day there will be lots of pairing with the same host. What happens when one or more of these hosts installs an incompatible GPU driver?
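
A toy simulation of that pairing problem (the pool size and task count are made up; quorum-of-2 pairing within the pool is the assumption):

    import random

    # A small "Power Users" pool: every workunit is paired inside it, so one
    # host returning bad results taints a large share of the workunits.
    hosts = [f"host{i}" for i in range(10)]
    bad_hosts = {"host3"}              # the incompatible-driver machine

    trials = 100_000
    resends = 0
    for _ in range(trials):
        pair = random.sample(hosts, 2) # quorum of 2, drawn from the pool
        if bad_hosts & set(pair):      # its result won't validate...
            resends += 1               # ...so a third copy must be sent out
    print(resends / trials)            # ~0.2: about 1 in 5 workunits affected
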
ID: 2029508
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20252
Credit: 7,508,002
RAC: 20
United Kingdom
Message 2029517 - Posted: 27 Jan 2020, 15:16:55 UTC - in response to Message 2029508.  

Define "Power Users"...

As in running enough compute to keep your home warm in winter?

;-)


Happy crunchin'!
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 2029517
Profile popandbob
Volunteer tester

Joined: 19 Mar 05
Posts: 551
Credit: 4,673,015
RAC: 0
Canada
Message 2029566 - Posted: 27 Jan 2020, 22:15:42 UTC - in response to Message 2029487.  

You would be running an extra database on the same server. Same amount of data, same type of data, but in more tables.
More tables are good when normalising a database (up to a point), but when you are creating more tables to effectively double up on some of the data, and keeping it on the same hardware, you're increasing the load that hardware has to deal with.


No, it would not be an extra database on the same server - my plan was to stop using the replica BOINC database and use that server to host the separate project.

As to your other comment: yes, hardware will continue to improve, so do you A) tell participants with older hardware not to bother, or B) set up a separate project for those with faster hardware? It could be done under this project as well, but that adds user complexity.
ID: 2029566
Profile popandbob
Volunteer tester

Joined: 19 Mar 05
Posts: 551
Credit: 4,673,015
RAC: 0
Canada
Message 2029568 - Posted: 27 Jan 2020, 22:20:11 UTC - in response to Message 2029508.  

Define "Power Users".


Everyone would be free to sign up for the separate project. It's up to the user to make sure their computer is able to meet the deadlines, no different than any other short-deadline project.
ID: 2029568
Profile Retvari Zoltan

Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2029585 - Posted: 27 Jan 2020, 23:24:14 UTC
Last modified: 27 Jan 2020, 23:35:20 UTC

This is not only one man's opinion.
See my post* regarding this matter in the server issues thread.
I thought of starting a new thread about it myself, but here it is.
*EDIT: let me quote myself, as we should discuss it in this thread.
Retvari Zoltan wrote:
... this project should seriously consider doubling the length of its workunits, while reducing the max allowed to 50+50. That would halve the number of entries in the tables the server needs to keep. You can name it sah v9. After a test period it could be decided to go back to sah v8, or to double the length of the workunits again (reducing limits to 25+25), or even to keep both alive. The variety in the performance of the devices connected to this project is so large it could be seen even from the Moon, which makes it reasonable for this project to let go of its "one size fits all" attitude, because that is the root cause of the server crashes. The practical problems we face every day are only the consequence of that. Tinkering with the server components and micro-managing the acute problems covers it for a while, but the time spent on it could be put into making the project more future-proof instead. The outages won't go away while the root cause is present in the system. It hurts every cruncher (though it hurts the top performers the most) and therefore it hurts the performance of the whole project.
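
A sketch of the table arithmetic in that quote (the row count is hypothetical; 100+100 is assumed as the current per-host limit):

    # "sah v9": double the workunit length, halve the table entries.
    result_rows_v8 = 20_000_000        # hypothetical result-table size
    cpu_limit, gpu_limit = 100, 100    # assumed current in-progress limits

    print(result_rows_v8 // 2)             # -> 10000000: same data, half the rows
    print(cpu_limit // 2, gpu_limit // 2)  # -> 50 50, i.e. the "50+50"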

ID: 2029585
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19043
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2029611 - Posted: 28 Jan 2020, 4:42:28 UTC - in response to Message 2029585.  
Last modified: 28 Jan 2020, 4:44:05 UTC

This is not only one man's opinion.
See my post* regarding this matter in the server issues thread.
I thought of starting a new thread about it myself, but here it is.
*EDIT: let me quote myself, as we should discuss it in this thread.
Retvari Zoltan wrote:
... this project should seriously consider doubling the length of its workunits, while reducing the max allowed to 50+50. That would halve the number of entries in the tables the server needs to keep. You can name it sah v9. After a test period it could be decided to go back to sah v8, or to double the length of the workunits again (reducing limits to 25+25), or even to keep both alive. The variety in the performance of the devices connected to this project is so large it could be seen even from the Moon, which makes it reasonable for this project to let go of its "one size fits all" attitude, because that is the root cause of the server crashes. The practical problems we face every day are only the consequence of that. Tinkering with the server components and micro-managing the acute problems covers it for a while, but the time spent on it could be put into making the project more future-proof instead. The outages won't go away while the root cause is present in the system. It hurts every cruncher (though it hurts the top performers the most) and therefore it hurts the performance of the whole project.

The science database only allows 30 items of interest to be recorded; this is not going to change. That is why we get the -9 overflows.
The workunits have already been doubled once, from ~350kb (250kb + (2 * 50kb overlap)) to the present-day 700kb. If your suggestion of doubling, and doubling again, were implemented, the science database would only be storing an eighth of the items of interest per unit of data that it did originally. I don't know if that is significant, but I am sure it will not please those who study our findings.
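
The eighth falls out of the fixed 30-signal cap against a growing workunit size; a quick sketch:

    # Recordable items of interest per unit of data, under the 30-signal cap.
    original_kb = 350                  # ~250kb + 2 * 50kb overlap
    current_kb = 700                   # v8: already doubled once
    doubled_twice_kb = current_kb * 4  # double, then double again -> 2800

    max_signals_per_result = 30        # the hard cap behind the -9 overflows
    print(original_kb / doubled_twice_kb)  # -> 0.125: an eighth of the
                                           # original recording density
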
ID: 2029611
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029615 - Posted: 28 Jan 2020, 5:06:40 UTC - in response to Message 2029611.  

The science database only allows 30 items of interest to be recorded; this is not going to change. That is why we get the -9 overflows.
The clients could process the bigger workunits in several parts, producing multiple independent sets of results, each covering a similar time window as before. The assimilators would have more work to do per workunit, but no more per source tape than with the small workunits.
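
A minimal sketch of that idea (the chunk count and the analyse() stub are placeholders, not the real application code):

    MAX_SIGNALS = 30                   # science-database cap per result set

    def analyse(chunk):
        # stand-in for the real spike/gaussian/pulse/triplet search
        return []

    def process_big_workunit(data, n_chunks=10):
        """One big workunit, reported as independent per-chunk result sets."""
        size = len(data) // n_chunks
        results = []
        for i in range(n_chunks):
            chunk = data[i * size : (i + 1) * size]
            results.append(analyse(chunk)[:MAX_SIGNALS])  # cap per chunk,
        return results                                    # not per workunit
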
ID: 2029615
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19043
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2029620 - Posted: 28 Jan 2020, 6:38:36 UTC - in response to Message 2029615.  

The science database only allows 30 items of interest to be recorded; this is not going to change. That is why we get the -9 overflows.
The clients could process the bigger workunits in several parts, producing multiple independent sets of results, each covering a similar time window as before. The assimilators would have more work to do per workunit, but no more per source tape than with the small workunits.

Wouldn't that just produce the same number of results as the present system, so we would still have 11 million "Results returned and awaiting validation"?
ID: 2029620
Profile Retvari Zoltan

Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2029621 - Posted: 28 Jan 2020, 6:40:49 UTC - in response to Message 2029611.  
Last modified: 28 Jan 2020, 6:44:57 UTC

The science database only allows 30 items of interest to be recorded; this is not going to change.
What law of nature forbids it to change?
ID: 2029621
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19043
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2029622 - Posted: 28 Jan 2020, 6:49:00 UTC - in response to Message 2029621.  

The science database only allows 30 items of interest to be recorded; this is not going to change.
What forbids it to change?

How many results does the science database hold, considering it has been running for over 20 years?
Conservative estimate 3 million/day * 365 days * 20 years = ...
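
Evaluating that estimate as written:

    print(3_000_000 * 365 * 20)   # -> 21900000000, roughly 22 billion results
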
ID: 2029622
Profile Retvari Zoltan

Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2029623 - Posted: 28 Jan 2020, 7:30:13 UTC - in response to Message 2029622.  

The science database only allows 30 items of interest to be recorded; this is not going to change.
What forbids it to change?
How many results does the science database hold, considering it has been running for over 20 years?
Conservative estimate 3 million/day * 365 days * 20 years = ...
This amount decays exponentially as we go back in time, but the volunteers of this project can provide the computing power to convert (or even re-calculate) that amount of data (as the computing power is exponentially growing), though I'm not sure if it should be converted at all. The architecture of the science database can be changed without changing the meaning of the data in it, so this project can use a different architecture in the future.
ID: 2029623
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 2029624 - Posted: 28 Jan 2020, 7:34:20 UTC - in response to Message 2029566.  

No, it would not be an extra database on the same server - my plan was to stop using the replica BOINC database and use that server to host the separate project.
Which is just asking for grief.
The whole point of the replica is that, if there is an unrecoverable failure of the main database, everything can be restored from the replica. Nothing is lost. If worse comes to worst and the main server itself dies a horrible death, the replica could become the main server. Once again - nothing lost.
It is not an option IMHO.
Grant
Darwin NT
ID: 2029624
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 2029625 - Posted: 28 Jan 2020, 7:42:45 UTC - in response to Message 2029585.  
Last modified: 28 Jan 2020, 7:50:20 UTC

this project should seriously consider doubling the length of its workunits,
This has already been done over the life of the project as the application has been developed (presently v8). It has been suggested in the past to increase the processing time to alleviate server issues, however the response has been that at present any further work on the data would only result in more computation time, with absolutely no improvement in the science data.
And the fact is there are heaps of data to be processed, so the faster people can process it, the better. Increasing the time it takes to process a WU, with no science benefit, would just be wasting people's time, power and computing resources.

Better to fix the server problems than implement kludges designed to work around them, which will only need to be redone over & over as hardware improves and more people join the project.
Grant
Darwin NT
ID: 2029625
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19043
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2029627 - Posted: 28 Jan 2020, 7:48:36 UTC - in response to Message 2029623.  

The science database only allows 30 items of interest to be recorded; this is not going to change.
What forbids it to change?
How many results does the science database hold, considering it has been running for over 20 years?
Conservative estimate 3 million/day * 365 days * 20 years = ...
This amount decays exponentially as we go back in time, but the volunteers of this project can provide the computing power to convert (or even re-calculate) that amount of data (as the computing power is exponentially growing), though I'm not sure if it should be converted at all. The architecture of the science database can be changed without changing the meaning of the data in it, so this project can use a different architecture in the future.

Not as exponential as you might think, as the number of active users has diminished over that period, for several reasons - BOINC and credit screw to name but two.
ID: 2029627
MarkJ - Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester

Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 2029629 - Posted: 28 Jan 2020, 8:16:49 UTC

It was suggested some time back (in some other message thread) that they could shorten the deadlines. That should reduce the number of workunits out in the field. Quite a few projects use 2-week deadlines, which should be enough time for the slower hosts to complete an MB task, with maybe an extra week for Astropulse.
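
Roughly, the number out in the field is bounded by issue rate times deadline length (Little's law); a sketch with made-up round numbers:

    # Worst-case tasks "out in the field" = issue rate x deadline length.
    issue_rate_per_day = 1_000_000     # assumed, for illustration only

    for deadline_days in (49, 14):     # e.g. a ~7-week deadline vs 2 weeks
        print(deadline_days, issue_rate_per_day * deadline_days)
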
BOINC blog
ID: 2029629
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 2029632 - Posted: 28 Jan 2020, 8:30:13 UTC - in response to Message 2029629.  
Last modified: 28 Jan 2020, 8:30:50 UTC

It was suggested some time back (in some other message thread) that they could shorten the deadlines. That should reduce the number of workunits out in the field. Quite a few projects use 2-week deadlines, which should be enough time for the slower hosts to complete an MB task, with maybe an extra week for Astropulse.
The vast majority of work is returned within a few days (the result of someone's efforts looking at it several years back), so a very short deadline wouldn't be an issue.
However, people do have system issues (hardware dying, software updates trashing things), and fires, cyclones, hurricanes, flooding etc. tend to interrupt their power supply & communications. So I feel a 1-month deadline would be a good compromise. It helps reduce the time it takes for work to be returned, but it also allows for people to have issues and still get their work back before the deadline in most cases.
Grant
Darwin NT
ID: 2029632