Panic Mode On (71) Server problems?

musicplayer
Joined: 17 May 10
Posts: 2430
Credit: 926,046
RAC: 0
Message 1205652 - Posted: 14 Mar 2012, 3:04:53 UTC - in response to Message 1205635.  

Hmm. If I go for Seti@home v7 as well (perhaps Seti@home Enhanced then becomes a double portion), what about the Graphics Preference (like Minimalist) and Rainbow for the Color Preference? Will they work with Seti@home v7?

I also set the Default Computer Location to Home when doing this (it really does not matter).
ID: 1205652
Belthazor
Volunteer tester
Joined: 6 Apr 00
Posts: 219
Credit: 10,373,795
RAC: 13
Russia
Message 1205728 - Posted: 14 Mar 2012, 10:28:52 UTC

It's funny - it seems like the AP 6.01 tasks aren't being counted on the server status page: the number of APs "out in the field" is still decreasing...
ID: 1205728
LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1205732 - Posted: 14 Mar 2012, 10:53:20 UTC - in response to Message 1205652.  

Hmm. If I go for Seti@home v7 as well (perhaps Seti@home Enhanced then becomes a double portion), what about the Graphics Preference (like Minimalist) and Rainbow for the Color Preference? Will they work with Seti@home v7?

I also set the Default Computer Location to Home when doing this (it really does not matter).


You can go for MB V7 as much as you like, but there are no V7 apps yet, so checking that box doesn't do anything for now (apart from when you check only the V7 box and uncheck the 'accept other work' box, in which case you don't get any tasks).
I'm not the Pope. I don't speak Ex Cathedra!
ID: 1205732
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1205755 - Posted: 14 Mar 2012, 13:24:04 UTC - in response to Message 1205728.  

It's funny - it seems like the AP 6.01 tasks aren't being counted on the server status page: the number of APs "out in the field" is still decreasing...

Yes, it appears that the code that counts all the AP values has not been updated to include v6. It might be an easy fix, or it might be a nightmare. Most likely an easy fix though.
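
If that guess is right, the bug could be as mundane as a tally that was never taught about the new app version. A purely hypothetical illustration - all names and counts below are invented, not the actual SETI@home server code:

#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Invented per-app-version counts of results out in the field.
    std::map<std::string, int> in_field = {
        {"astropulse_v505", 31473},
        {"astropulse_v6",   12000}, // new app; made-up number
    };

    // A tally that only knows about v505: v6 results never show up,
    // so the displayed "out in the field" figure keeps shrinking.
    std::vector<std::string> counted = {"astropulse_v505"};

    int total = 0;
    for (const auto& name : counted) total += in_field[name];

    std::cout << "AP results out in the field (as displayed): "
              << total << "\n"; // omits the v6 tasks entirely
}
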
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1205755
cliff
Joined: 16 Dec 07
Posts: 625
Credit: 3,590,440
RAC: 0
United Kingdom
Message 1205799 - Posted: 14 Mar 2012, 15:56:52 UTC

OK, so what's up with comms between users and Berkeley now?
Can't contact server, hung downloads, timeouts, etc.

Do we once again have router or network problems at Berkeley?
Or is it the feed itself that's gone toes-up?

I thought those problems had been sorted out - we had good comms for a while, but now we're back to substandard.

Cheers,
Cliff,
Been there, Done that, Still no damn T shirt!
ID: 1205799
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1205801 - Posted: 14 Mar 2012, 16:00:08 UTC - in response to Message 1205799.  

OK, so what's up with comms between users and Berkeley now?
Can't contact server, hung downloads, timeouts, etc.

Do we once again have router or network problems at Berkeley?
Or is it the feed itself that's gone toes-up?

I thought those problems had been sorted out - we had good comms for a while, but now we're back to substandard.

Cheers,

It's because the communications link out of Berkeley is completely saturated by the AP_v6 rollout. Please bookmark Cricket.

I expect it to remain like this for at least a week, maybe longer.
ID: 1205801
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1205819 - Posted: 14 Mar 2012, 16:29:18 UTC

I was just looking through my task list and saw an AP_v505 that was inconclusive, so I looked at the WU for it: http://setiathome.berkeley.edu/workunit.php?wuid=894563761. Mine was _7, and it has now gone out to _8. It's one of those that keeps getting sent to stock Linux apps, or to "hit and run" people who install, download a bunch of work, and then bail.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1205819
Khangollo
Joined: 1 Aug 00
Posts: 245
Credit: 36,410,524
RAC: 0
Slovenia
Message 1205833 - Posted: 14 Mar 2012, 17:04:33 UTC
Last modified: 14 Mar 2012, 17:07:51 UTC

From the server status page (for Astropulse):
Results out in the field: 31,473
Results returned and awaiting validation: 62,379
Workunits waiting for validation: 0

So, now that v505 isn't splitting anymore, does this mean there are over 30,000 AP workunits stuck in the "waiting for validation" state for at least one of their tasks? Are the project admins aware of this problem with stuck APs?
ID: 1205833
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1205841 - Posted: 14 Mar 2012, 17:22:17 UTC - in response to Message 1205833.  

So, now that v505 isn't splitting anymore, does this mean there are over 30,000 AP workunits stuck in the "waiting for validation" state for at least one of their tasks? Are the project admins aware of this problem with stuck APs?

They're not so much stuck as waiting for wingmen to report or time out; time-outs are probably 2/3 to 3/4 of the actual situation. With only a 25-day deadline, they will get sent out to a third wingman, and so on. The number of v505s will decrease, but at a slow rate - more like exponential decay: fast now while there are many of them, slower and less frequent as the total gets smaller.

Though since there is a 25-day deadline and the last ones were split about 15 days ago, even if they make it all the way to _9, that leaves somewhere around 200 days from now for the very last ones to be expired, retired, or completed. 200 days from now is the last day of September, so by Halloween the 505s should be completely gone.
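
As a back-of-the-envelope check of that estimate (a minimal sketch: the 25-day deadline and the ~15 days since the last split come from the posts above; the assumption that _0/_1 were the initial pair and reissues run _2 through _9 is mine):

#include <iostream>

int main() {
    const int deadline_days    = 25; // AP deadline quoted above
    const int days_since_split = 15; // last v505 WUs split ~15 days ago
    const int last_suffix      = 9;  // worst case: reissues up to _9

    // The current wingmen time out when their 25-day deadline lapses.
    const int first_timeout = deadline_days - days_since_split; // 10 days

    // Assuming _0/_1 were the initial pair, _2.._9 are 8 reissues, each
    // of which can run a full deadline before timing out in turn.
    const int reissues   = last_suffix - 1; // 8
    const int worst_case = first_timeout + reissues * deadline_days;

    std::cout << "Worst-case days until the last v505 clears: "
              << worst_case << "\n"; // 210, i.e. "around 200 days"
}
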
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1205841
red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1205848 - Posted: 14 Mar 2012, 17:43:19 UTC - in response to Message 1205841.  
Last modified: 14 Mar 2012, 17:46:56 UTC

So, now that v505 isn't splitting anymore, does this mean there are over 30,000 AP workunits stuck in the "waiting for validation" state for at least one of their tasks? Are the project admins aware of this problem with stuck APs?

They're not so much stuck as waiting for wingmen to report or time out; time-outs are probably 2/3 to 3/4 of the actual situation. With only a 25-day deadline, they will get sent out to a third wingman, and so on. The number of v505s will decrease, but at a slow rate - more like exponential decay: fast now while there are many of them, slower and less frequent as the total gets smaller.

Though since there is a 25-day deadline and the last ones were split about 15 days ago, even if they make it all the way to _9, that leaves somewhere around 200 days from now for the very last ones to be expired, retired, or completed. 200 days from now is the last day of September, so by Halloween the 505s should be completely gone.

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.
ID: 1205848
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1205849 - Posted: 14 Mar 2012, 17:49:03 UTC - in response to Message 1205848.  

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Agreed. The same thing was said when we went from "Astropulse" to "Astropulse_v5", and again when v505 came along, but apparently it can't -- or more accurately, won't -- be done. They will work themselves out over time. It's not as if having some of those still floating around causes anyone any anguish. There are apps for crunching them, so if you happen to be assigned one, you can crunch it and return it, helping to get rid of them.

Once the Lunatics app for v6 gets released, you can put r409 (or the GPU equivalent) back into your app_info along with the new v6 app, and change your preferences to allow the 505s to be sent. That's what I plan on doing.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1205849
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1205850 - Posted: 14 Mar 2012, 17:51:18 UTC - in response to Message 1205848.  

So, now that v505 isn't splitting anymore, does this mean there are over 30,000 AP workunits stuck in the "waiting for validation" state for at least one of their tasks? Are the project admins aware of this problem with stuck APs?

They're not so much stuck as waiting for wingmen to report or time out; time-outs are probably 2/3 to 3/4 of the actual situation. With only a 25-day deadline, they will get sent out to a third wingman, and so on. The number of v505s will decrease, but at a slow rate - more like exponential decay: fast now while there are many of them, slower and less frequent as the total gets smaller.

Though since there is a 25-day deadline and the last ones were split about 15 days ago, even if they make it all the way to _9, that leaves somewhere around 200 days from now for the very last ones to be expired, retired, or completed. 200 days from now is the last day of September, so by Halloween the 505s should be completely gone.

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Conversely, why bother to send them out? They will be processed in due course anyway. To do otherwise would require micro-management by the project staff, and I'm sure they've got plenty of higher-priority tasks on their hands already.
ID: 1205850
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51469
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1205851 - Posted: 14 Mar 2012, 17:54:06 UTC - in response to Message 1205849.  

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Agreed. The same thing was said when we went from "Astropulse" to "Astropulse_v5", and again when v505 came along, but apparently it can't -- or more accurately, won't -- be done. They will work themselves out over time. It's not as if having some of those still floating around causes anyone any anguish. There are apps for crunching them, so if you happen to be assigned one, you can crunch it and return it, helping to get rid of them.

Once the Lunatics app for v6 gets released, you can put r409 (or the GPU equivalent) back into your app_info along with the new v6 app, and change your preferences to allow the 505s to be sent. That's what I plan on doing.

As the kitties prefer MB anyway, no changes here until things settle and the new Lunatics installer is released...
I'll take any 505 reissues that happen my way, and add v6 when the opti version has been tested for a bit and the validation/credit issues have been sorted.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1205851
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1205852 - Posted: 14 Mar 2012, 17:56:33 UTC - in response to Message 1205849.  

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Agreed. The same thing was said when we went from "Astropulse" to "Astropulse_v5", and again when v505 came along, but apparently it can't -- or more accurately, won't -- be done. They will work themselves out over time. It's not as if having some of those still floating around causes anyone any anguish. There are apps for crunching them, so if you happen to be assigned one, you can crunch it and return it, helping to get rid of them.

Once the Lunatics app for v6 gets released, you can put r409 (or the GPU equivalent) back into your app_info along with the new v6 app, and change your preferences to allow the 505s to be sent. That's what I plan on doing.

It won't be needed. We're writing both the apps and the installer in such a way that they will keep existing v505 tasks running and fetch resends if they're available, while at the same time handling the new v6 tasks properly, and doing both types a bit faster than at present. That's the plan, at least. It's complicated, which is why we need the time for testing.
ID: 1205852
cliff
Joined: 16 Dec 07
Posts: 625
Credit: 3,590,440
RAC: 0
United Kingdom
Message 1205854 - Posted: 14 Mar 2012, 18:05:25 UTC - in response to Message 1205801.  

Hi Richard,
Thanks for the info - bookmarked the insect :-)
Cheers,
Cliff,
Been there, Done that, Still no damn T shirt!
ID: 1205854
red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1205865 - Posted: 14 Mar 2012, 18:20:18 UTC - in response to Message 1205850.  
Last modified: 14 Mar 2012, 18:33:28 UTC

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Conversely, why bother to send them out? They will be processed in due course anyway. To do otherwise would require micro-management by the project staff, and I'm sure they've got plenty of higher-priority tasks on their hands already.

A regime such that whenever a WU needs to be resent after a timeout, it would always go to a computer with a RAC above some minimum, to minimise the risk of another timeout. This should reduce the overall number of results awaiting validation, and the whinges about slow wingmen.
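
For illustration, that regime might look something like the sketch below. The Host fields and the selection rule are my assumptions, not BOINC's actual scheduler code:

#include <iostream>
#include <vector>

// Hypothetical host record; field names are illustrative, not BOINC's schema.
struct Host {
    int    id;
    double rac;                 // recent average credit
    double avg_turnaround_days; // average reporting time for this host
};

// Route a timed-out result only to hosts whose RAC clears the minimum,
// preferring the fastest average turnaround among the qualifiers.
const Host* pick_resend_host(const std::vector<Host>& hosts, double min_rac) {
    const Host* best = nullptr;
    for (const Host& h : hosts) {
        if (h.rac < min_rac) continue;
        if (!best || h.avg_turnaround_days < best->avg_turnaround_days)
            best = &h;
    }
    return best; // nullptr means nobody qualifies; fall back to any host
}

int main() {
    std::vector<Host> hosts = {
        {1,   950.0, 20.0}, // slow, low-RAC host
        {2, 42000.0,  1.5}, // fast, high-RAC host
        {3, 15000.0,  3.0},
    };
    if (const Host* h = pick_resend_host(hosts, 10000.0))
        std::cout << "resend goes to host " << h->id << "\n"; // host 2
}
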
ID: 1205865
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51469
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1205867 - Posted: 14 Mar 2012, 18:21:43 UTC - in response to Message 1205865.  

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Conversely, why bother to send them out? They will be processed in due course anyway. To do otherwise would require micro-management by the project staff, and I'm sure they've got plenty of higher-priority tasks on their hands already.

A regime such that whenever a WU needs to be resent after a timeout, it would always go to a computer with a RAC above some minimum, to minimise the risk of another timeout. This should reduce the overall number of results awaiting validation, and the whinges about slow wingmen.

RAC is not always indicative of turnaround time...
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1205867
red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1205872 - Posted: 14 Mar 2012, 18:29:14 UTC - in response to Message 1205867.  
Last modified: 14 Mar 2012, 18:33:18 UTC

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Conversely, why bother to send them out? They will be processed in due course anyway. To do otherwise would require micro-management by the project staff, and I'm sure they've got plenty of higher-priority tasks on their hands already.

A regime such that whenever a WU needs to be resent after a timeout, it would always go to a computer with a RAC above some minimum, to minimise the risk of another timeout. This should reduce the overall number of results awaiting validation, and the whinges about slow wingmen.

RAC is not always indicative of turnaround time...

Given the current 400/50 limits, per-computer RAC tracks turnaround quite closely. If it's not close enough, then also include a maximum per-computer average turnaround time.
ID: 1205872
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1205906 - Posted: 14 Mar 2012, 20:22:34 UTC - in response to Message 1205867.  

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Conversely, why bother to send them out? They will be processed in due course anyway. To do otherwise would require micro-management by the project staff, and I'm sure they've got plenty of higher-priority tasks on their hands already.

A regime such that whenever a WU needs to be resent after a timeout, it would always go to a computer with a RAC above some minimum, to minimise the risk of another timeout. This should reduce the overall number of results awaiting validation, and the whinges about slow wingmen.

RAC is not always indicative of turnaround time...

Quite true - turnaround time is what counts, and the BOINC database keeps that statistic for each app version on a host. The available BOINC feature judges reliability using that, plus the count of consecutive valid results, plus a daily quota at least equal to the basic setting of 100. The documentation is slightly out of date, but gives a reasonable overall view. Reading the code in sched_send.cpp and a few other source files gives accurate, current information on how it works.

Having that feature turned on would of course put additional load on the Scheduler processes, but with the improved servers that may become feasible. What settings would be appropriate would take some thought; the general advice that about 25% of hosts ought to be considered reliable seems sensible. If set too tight, reissued tasks might occupy positions in the Feeder queue for too long.
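
As a rough illustration of the test described above - a sketch only, with field and parameter names of my own choosing; the authoritative logic lives in sched_send.cpp and the related source files:

#include <iostream>

// Per-host, per-app-version statistics of the kind Joe says BOINC keeps.
// Names are illustrative, not the actual BOINC structures.
struct HostAppVersion {
    double avg_turnaround_days; // average turnaround for this app version
    int    consecutive_valid;   // valid results in a row
    int    max_jobs_per_day;    // daily quota; 100 is the basic setting
};

// A host/app-version pair counts as "reliable" when it turns work around
// quickly, has a streak of valid results, and an undamaged daily quota.
bool is_reliable(const HostAppVersion& hav,
                 double max_turnaround_days,   // project tuning knob
                 int    min_consecutive_valid, // likewise
                 int    base_quota = 100)
{
    return hav.avg_turnaround_days <= max_turnaround_days
        && hav.consecutive_valid   >= min_consecutive_valid
        && hav.max_jobs_per_day    >= base_quota;
}

int main() {
    HostAppVersion fast{2.0, 25, 100}, slow{18.0, 3, 40};
    // e.g. require turnaround within 6 days and 10 valid results in a row
    std::cout << is_reliable(fast, 6.0, 10) << " "   // 1
              << is_reliable(slow, 6.0, 10) << "\n"; // 0
}
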
                                                                  Joe
ID: 1205906
red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1205933 - Posted: 14 Mar 2012, 21:49:34 UTC - in response to Message 1205906.  
Last modified: 14 Mar 2012, 22:16:59 UTC

Quite true - turnaround time is what counts, and the BOINC database keeps that statistic for each app version on a host. The available BOINC feature judges reliability using that, plus the count of consecutive valid results, plus a daily quota at least equal to the basic setting of 100. The documentation is slightly out of date, but gives a reasonable overall view. Reading the code in sched_send.cpp and a few other source files gives accurate, current information on how it works.

Having that feature turned on would of course put additional load on the Scheduler processes, but with the improved servers that may become feasible. What settings would be appropriate would take some thought; the general advice that about 25% of hosts ought to be considered reliable seems sensible. If set too tight, reissued tasks might occupy positions in the Feeder queue for too long.

Could we kill two birds with one stone and have two feeder queues in different instances of the scheduler? Ideally one with a big queue for the reliable hosts, which would include resends, and a second with a smaller queue for the others.
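
One way to picture that proposal, purely as a sketch: it assumes resends can be tagged as such at enqueue time, and it uses a std::queue where the real feeder works from a shared-memory array.

#include <iostream>
#include <queue>

struct Job {
    int  result_id;
    bool is_resend; // assumes resends can be tagged at enqueue time
};

// Two feeder queues, as proposed: a large one serving the reliable hosts
// (and carrying the resends), a smaller one serving everyone else.
struct DualFeeder {
    std::queue<Job> reliable_queue; // big queue, resends routed here
    std::queue<Job> general_queue;  // smaller queue for other hosts

    void enqueue(const Job& j) {
        (j.is_resend ? reliable_queue : general_queue).push(j);
    }
};

int main() {
    DualFeeder feeder;
    feeder.enqueue({101, false}); // fresh work -> general queue
    feeder.enqueue({102, true});  // resend -> reliable queue
    std::cout << feeder.reliable_queue.size() << " "   // 1
              << feeder.general_queue.size()  << "\n"; // 1
}
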
ID: 1205933