Panic Mode On (108) Server Problems?

Author	Message
David@home Volunteer tester Send message Joined: 16 Jan 03 Posts: 755 Credit: 5,040,916 RAC: 28	Message 1904348 - Posted: 2 Dec 2017, 8:45:06 UTC Great news that the SETI team managed to solve the database problem. Had to do a manual update as BOINC manager was in a 4 hour deep sleep. Picked up GPU work only but hopefully will pick up CPU work as the splitters catch up with demand. Wonder who the lucky ones were that got a cache of all those Astropulse redos that were building up, I missed out on those. ID: 1904348 ·

Sid Volunteer tester Send message Joined: 12 Jun 07 Posts: 16 Credit: 10,968,872 RAC: 0	Message 1904352 - Posted: 2 Dec 2017, 9:48:15 UTC Last modified: 2 Dec 2017, 9:52:45 UTC 167 on both Linux and Windoze machines. . . .looks like we're back. ID: 1904352 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004	Message 1904357 - Posted: 2 Dec 2017, 10:09:07 UTC Well. Meowmeowmeow. Middle of a Friday night and Seti comes back online? Kudos to whoever was working on things this late!! Thankyouthankyouthankyou! Meow! "Time is simply the mechanism that keeps everything from happening all at once." ID: 1904357 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004	Message 1904360 - Posted: 2 Dec 2017, 10:15:44 UTC - in response to Message 1904358. Last modified: 2 Dec 2017, 10:16:17 UTC But, as expected, there's trouble in paradise. SSP stopped updating: [As of 2 Dec 2017, 9:40:04 UTC] And what usually follows after that.....well, you all know that :-( Well, I am hoping the success was not that short lived, and the SSP snag is just due to the heavy load things must be under. Kitties are hopeful. Meow. "Time is simply the mechanism that keeps everything from happening all at once." ID: 1904360 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004	Message 1904362 - Posted: 2 Dec 2017, 10:19:49 UTC - in response to Message 1904361. But, as expected, there's trouble in paradise. SSP stopped updating: [As of 2 Dec 2017, 9:40:04 UTC] And what usually follows after that.....well, you all know that :-( Well, I am hoping the success was not that short lived, and the SSP snag is just due to the heavy load things must be under. Kitties are hopeful. Meow. The SSP just updated. Thanks Dog for that :-) I prefer to thank the kitties. Meow! "Time is simply the mechanism that keeps everything from happening all at once." ID: 1904362 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1856 Credit: 268,616,081 RAC: 1,349	Message 1904363 - Posted: 2 Dec 2017, 10:25:39 UTC All nice, full caches on all machines. Definitely came roaring back ... ID: 1904363 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004	Message 1904366 - Posted: 2 Dec 2017, 10:33:11 UTC - in response to Message 1904363. Last modified: 2 Dec 2017, 10:34:57 UTC All nice, full caches on all machines. Definitely came roaring back ... Then you got lucky and hit the servers just after they came back up. Now it's going to be like the server lottery trying to get work with all the hungry computers to feed. And THAT depends on things staying glued together under the heavy load. Meowpatience. "Time is simply the mechanism that keeps everything from happening all at once." ID: 1904366 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874	Message 1904382 - Posted: 2 Dec 2017, 12:30:35 UTC It's unfortunate that, as happened so often after past outages, the luck of the draw has thrown tapes full of Arecibo shorties into the splitter just as we need a steady supply of good, chewy, work. One of my GPU machines has finally filled its 200 task queue, and it's got precisely 100 shorties and 100 guppies - nothing in between. ID: 1904382 ·

Advent42 Send message Joined: 23 Mar 17 Posts: 175 Credit: 4,015,683 RAC: 0	Message 1904395 - Posted: 2 Dec 2017, 13:48:58 UTC - in response to Message 1904382. Yeah!! The search continues...:-) ID: 1904395 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1904421 - Posted: 2 Dec 2017, 16:04:15 UTC - in response to Message 1904066. It's all fine. The project will be back in January. Maybe a little earlier than that. Which January? . . :) Stephen :) ID: 1904421 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1904425 - Posted: 2 Dec 2017, 16:18:19 UTC - in response to Message 1904225. I too have wondered why SETI has stuck with the extremely long deadlines I assess were implemented for the original hardware used on the project. That kind of hardware is 18 years in the past and does not need to continue to be supported. I agree with you Jeff, I would expect the sizes of databases and the strain they put on the project would be greatly lessened if the deadlines were reduced by a month, let's say from the current 2 month deadline. . . That would get my support :) Stephen .. ID: 1904425 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1904427 - Posted: 2 Dec 2017, 16:27:54 UTC - in response to Message 1904227. I too have wondered why SETI has stuck with the extremely long deadlines I assess were implemented for the original hardware used on the project. That kind of hardware is 18 years in the past and does not need to continue to be supported. I agree with you Jeff, I would expect the sizes of databases and the strain they put on the project would be greatly lessened if the deadlines were reduced by a month, let's say from the current 2 month deadline. As I understand it, the reason there has been no adjustment is because Eric does not wish to disenfranchise anybody from participating in this project. And that would include folks with very meager hardware resources. Not everybody can afford what some of us are able to. That is why. . . Hi Mark, . . I have heard that reasoning before but there is no rational reason for any rig, no matter how slow, to download more work than it can process in a month. If the gear can only process 2 tasks per week then set the cache so that you only download half a dozen jobs and a one month deadline is still not an issue. Since every time you upload/report results you get fresh work (OK I'm an optimist 8^} ) such a rig would still be productive. Why should any rig have 100 tasks cached if it would take that rig 6 months to process them? I completely agree that such disproportionate downloading makes a great argument for shorter deadlines. Stephen .. ID: 1904427 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1904430 - Posted: 2 Dec 2017, 16:38:32 UTC - in response to Message 1904244. Don't forget it isn't just the raw crunching time you need to consider for deadlines - it's also all the dead time when the computer is switched off or in use. And for Android, when it's away from the charger. . . But if any given device cannot process a single task within a month, whether due to insufficient crunching power or lack of run time, then is that device really fit for purpose ?? Stephen ?? ID: 1904430 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874	Message 1904439 - Posted: 2 Dec 2017, 17:10:34 UTC - in response to Message 1904427. Why should any rig have 100 tasks cached if it would take that rig 6 months to process them? If the pattern is persistent, it wouldn't be able to. Work is requested by time (use the <sched_op_debug> flag, and really read the Event Log): the maximum time request is 20 days, not 6 months. ID: 1904439 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1904447 - Posted: 2 Dec 2017, 17:44:16 UTC Given the arguments in favor of the roughly 8-week deadlines for normal AR and VLAR MB tasks, in order to accommodate even the most laggardly of hosts, can anyone then explain the 3-week deadlines for AP tasks? On my machines, regardless of the CPU or GPU or OS, AP tasks take longer to run than the longest-running of those MB tasks, in some cases, 2 or 3 times as long. If 3 weeks is an adequate deadline for APs, why not for MBs? Mind you, I'm not advocating for that short a deadline for either of those categories of tasks, but that's always struck me as a glaring inconsistency. ID: 1904447 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1904448 - Posted: 2 Dec 2017, 18:00:28 UTC - in response to Message 1904439. Last modified: 2 Dec 2017, 18:08:07 UTC Why should any rig have 100 tasks cached if it would take that rig 6 months to process them? If the pattern is persistent, it wouldn't be able to. Work is requested by time (use the <sched_op_debug> flag, and really read the Event Log): the maximum time request is 20 days, not 6 months. . . But that is the system's weakness. If a rig can crunch a WU in 2 hours (say an old i7 using just one CPU core) and they set their work request to the 20 day maximum allowed they will get the full server limited allocation of 100 tasks, even if they are returning only a few per week or only invalid results. I have come across many wingmen like that. If the allocation were made on average return time as others have suggested, rather than average run time, then the allocation numbers would be appropriately reduced. . . BTW, just how do you set flags in the event log?? Stephen . . ID: 1904448 ·
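A rough sketch of the arithmetic in the post above, not the actual BOINC scheduler logic. It assumes work is sized purely from estimated run time and uses the figures quoted in the post: a 2-hour task on one core, a 20-day cache setting, and a 100-task server limit. The function names and the three-per-week return rate are hypothetical, chosen only to illustrate "a few per week".

```python
import math


def tasks_granted(cache_days, hours_per_task, server_limit=100, cores=1):
    """Tasks a host would hold if work is sized by run time alone (assumed model)."""
    hours_requested = cache_days * 24 * cores
    by_runtime = math.floor(hours_requested / hours_per_task)
    # The per-host server limit caps whatever the run-time estimate allows.
    return min(by_runtime, server_limit)


def weeks_to_clear(tasks, returned_per_week):
    """Weeks needed to return a cache at the host's real return rate."""
    return tasks / returned_per_week


# Figures from the post: 20-day request, 2 hours per task, one core.
granted = tasks_granted(cache_days=20, hours_per_task=2)
print(granted)

# Hypothetical host that actually returns only 3 tasks per week.
print(round(weeks_to_clear(granted, returned_per_week=3), 1))
```

Run time alone would justify 240 tasks, so the 100-task limit is what binds. At three returns per week that cache would take roughly 33 weeks to clear, far beyond the roughly 8-week deadline discussed in this thread, which is the gap that sizing by average return time would close.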

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874	Message 1904449 - Posted: 2 Dec 2017, 18:05:19 UTC - in response to Message 1904447. Given the arguments in favor of the roughly 8-week deadlines for normal AR and VLAR MB tasks, in order to accommodate even the most laggardly of hosts, can anyone then explain the 3-week deadlines for AP tasks? On my machines, regardless of the CPU or GPU or OS, AP tasks take longer to run than the longest-running of those MB tasks, in some cases, 2 or 3 times as long. If 3 weeks is an adequate deadline for APs, why not for MBs? Mind you, I'm not advocating for that short a deadline for either of those categories of tasks, but that's always struck me as a glaring inconsistency. No, but... Some of it is covered in the (extremely ancient) Astropulse FAQ page. Astropulse had been around as a concept for some years, but this particular implementation was written as a grad-student project by Josh von Korff. Because it formed part of his Examined coursework for whichever degree it was, Eric - as his supervisor - very deliberately: (a) left him to work out the solutions to his own problems (b) required him to handle deployment, snagging, and dealing with user feedback as part of his training. Josh got it working and deployed, passed his degree, and moved on to continue his academic career at another institution. I suppose the justification at the time (about 10 years ago, just before GPUs were added to the crunching mix) was that you could opt in or out of AP, only those with the most powerful CPUs would choose to opt in - others with deadline trouble could content themselves with MB only. ID: 1904449 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1904453 - Posted: 2 Dec 2017, 18:37:57 UTC - in response to Message 1904449. No, but... Some of it is covered in the (extremely ancient) Astropulse FAQ page. Astropulse had been around as a concept for some years, but this particular implementation was written as a grad-student project by Josh von Korff. Because it formed part of his Examined coursework for whichever degree it was, Eric - as his supervisor - very deliberately: (a) left him to work out the solutions to his own problems (b) required him to handle deployment, snagging, and dealing with user feedback as part of his training. Josh got it working and deployed, passed his degree, and moved on to continue his academic career at another institution. I suppose the justification at the time (about 10 years ago, just before GPUs were added to the crunching mix) was that you could opt in or out of AP, only those with the most powerful CPUs would choose to opt in - others with deadline trouble could content themselves with MB only. Actually, it looks like it was, at least initially, only an opt-out choice, unless you were running optimized apps. All others got AP tasks automatically, if their systems met the requirements. I'm intrigued by a couple of statements in there. First, that "The initial deadline for Astropulse tasks will be 14 days.", and then there's "... If our server judges that your computer cannot complete an Astropulse workunit in 22.5 days (75% of the maximum 30 days)...". So, the 14 days apparently got bumped up early on, but what's the meaning of "maximum 30 days"? Was that the maximum for any S@h task in those olden days? If so, how did we jump to 8 weeks, even as CPUs got faster and GPUs came into the mix? 
The bottom line, to me, is that decisions regarding task deadlines are among those that were made a long time ago, and no longer take into account the processing environment as it exists today, both in terms of end-user hardware and the project's periodic database woes. Many aspects of the project are moving forward. These legacy decisions should be revisited to evaluate whether the reasons behind them still make sense. ID: 1904453 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13847 Credit: 208,696,464 RAC: 304	Message 1904457 - Posted: 2 Dec 2017, 18:48:21 UTC - in response to Message 1904453. So, the 14 days apparently got bumped up early on, but what's the meaning of "maximum 30 days"? Was that the maximum for any S@h task in those olden days? If so, how did we jump to 8 weeks, even as CPUs got faster and GPUs came into the mix? Those numbers were probably based on the original pre-BOINC Seti work times, then beefed up for the original BOINC Seti. Since then, there have been several versions of Seti, each one involving more processing than the previous one and longer runtimes than the previous version for given hardware. Hence the long deadlines, based on the crunching time for that much older hardware. The bottom line, to me, is that decisions regarding task deadlines are among those that were made a long time ago, and no longer take into account the processing environment as it exists today, both in terms of end-user hardware and the project's periodic database woes. Many aspects of the project are moving forward. These legacy decisions should be revisited to evaluate whether the reasons behind them still make sense. That's the best argument yet for changing the deadlines IMHO. Current basic Android devices would be on par with what was a highend P4 computation wise. More recent Android devices are not only higher performing, but multi core as well; let alone current CPUs with AVX, AVX2 and IPC (instructions per clock) improvements. And then there are GPUs. Many of the deadlines were based on WU run times for what was the lowend hardware of the day. Current lowend hardware matches, or even exceeds, highend hardware of that period. Grant Darwin NT ID: 1904457 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874	Message 1904460 - Posted: 2 Dec 2017, 18:52:44 UTC - in response to Message 1904453. The bottom line, to me, is that decisions regarding task deadlines are among those that were made a long time ago, and no longer take into account the processing environment as it exists today, both in terms of end-user hardware and the project's periodic database woes. Many aspects of the project are moving forward. These legacy decisions should be revisited to evaluate whether the reasons behind them still make sense. I totally agree. But they need to be rational, considered revisitations, taking into account the needs of everyone - the project itself, the users who post here, the users who don't post here, the users with the latest hardware, the users with one clunky hand-me-down, the users who are exclusively dedicated to SETI, the users who spread themselves thinly across multiple projects..... And everyone in between. What the project needs most of all is time to think, and data to base their decisions on (which means fixing those broken webpages, like client types and science status, which haven't updated since before Matt Lebofsky was diverted away from us). And that means more people. And more people means more money. Thinking caps on, please. ID: 1904460 ·


SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.