Panic Mode On (71) Server problems?

musicplayer
Joined: 17 May 10
Posts: 2430
Credit: 926,046
RAC: 0
Message 1205652 - Posted: 14 Mar 2012, 3:04:53 UTC - in response to Message 1205635.  

Hmm. If I go for Seti@home v7 as well (perhaps Seti@home Enhanced then becomes a double portion), what about the Graphics Preference (like Minimalist) and Rainbow for the Color Preference? Will they work with Seti@home v7?

I also set the Default Computer Location to Home when doing this (it really does not matter).
ID: 1205652
Belthazor
Volunteer tester
Joined: 6 Apr 00
Posts: 219
Credit: 10,373,795
RAC: 13
Russia
Message 1205728 - Posted: 14 Mar 2012, 10:28:52 UTC

It's funny - it seems like the AP 6.01 tasks aren't being counted on the server status page: the number of APs "out in the field" is still decreasing...
ID: 1205728
LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1205732 - Posted: 14 Mar 2012, 10:53:20 UTC - in response to Message 1205652.  

Hmm. If I go for Seti@home v7 as well (perhaps Seti@home Enhanced then becomes a double portion), what about the Graphics Preference (like Minimalist) and Rainbow for the Color Preference? Will they work with Seti@home v7?

I also set the Default Computer Location to Home when doing this (it really does not matter).


You can go for MB V7 as much as you like, but there are no V7 apps yet, so checking that box doesn't do anything for now (apart from when you check only the V7 box and uncheck the 'accept other work' box, in which case you don't get any tasks).
I'm not the Pope. I don't speak Ex Cathedra!
ID: 1205732
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1205755 - Posted: 14 Mar 2012, 13:24:04 UTC - in response to Message 1205728.  

It's funny - it seems like the AP 6.01 tasks aren't being counted on the server status page: the number of APs "out in the field" is still decreasing...

Yes, it appears that the code that counts all the AP values has not been updated to include v6. It might be an easy fix, or it might be a nightmare. Most likely an easy fix though.
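
If that guess is right, the bug could be as mundane as a tally that was never taught about the new app version. A purely hypothetical illustration - all names and counts below are invented, not the actual SETI@home server code:

#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Invented per-app-version counts of results out in the field.
    std::map<std::string, int> in_field = {
        {"astropulse_v505", 31473},
        {"astropulse_v6",   12000}, // new app; made-up number
    };

    // A tally that only knows about v505: v6 results never show up,
    // so the displayed "out in the field" figure keeps shrinking.
    std::vector<std::string> counted = {"astropulse_v505"};

    int total = 0;
    for (const auto& name : counted) total += in_field[name];

    std::cout << "AP results out in the field (as displayed): "
              << total << "\n"; // omits the v6 tasks entirely
}
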
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1205755
cliff
Joined: 16 Dec 07
Posts: 625
Credit: 3,590,440
RAC: 0
United Kingdom
Message 1205799 - Posted: 14 Mar 2012, 15:56:52 UTC

OK, so what's up with comms between users and Berkeley now?
Can't contact server, hung downloads, timeouts, etc.

Do we once again have router or network problems at Berkeley?
Or is it the feed itself that's gone toes-up?

I thought those problems had been sorted out - we had good comms for a while, but now we're back to substandard.

Cheers,
Cliff,
Been there, Done that, Still no damn T shirt!
ID: 1205799
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1205801 - Posted: 14 Mar 2012, 16:00:08 UTC - in response to Message 1205799.  

OK, so what's up with comms between users and Berkeley now?
Can't contact server, hung downloads, timeouts, etc.

Do we once again have router or network problems at Berkeley?
Or is it the feed itself that's gone toes-up?

I thought those problems had been sorted out - we had good comms for a while, but now we're back to substandard.

Cheers,

It's because the communications link out of Berkeley is completely saturated by the AP_v6 rollout. Please bookmark Cricket.

I expect it to remain like this for at least a week, maybe longer.
ID: 1205801
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1205819 - Posted: 14 Mar 2012, 16:29:18 UTC

I was just looking through my task list and saw an AP_v505 that was inconclusive, so I looked at the WU for it: http://setiathome.berkeley.edu/workunit.php?wuid=894563761. Mine was _7, and it has now gone out to _8. It's one of those that keeps getting sent to stock Linux apps, or to "hit and run" people who install, download a bunch of work, and then bail.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1205819
Khangollo
Joined: 1 Aug 00
Posts: 245
Credit: 36,410,524
RAC: 0
Slovenia
Message 1205833 - Posted: 14 Mar 2012, 17:04:33 UTC
Last modified: 14 Mar 2012, 17:07:51 UTC

From the server status page (for Astropulse):
Results out in the field: 31,473
Results returned and awaiting validation: 62,379
Workunits waiting for validation: 0

So, now that v505 isn't splitting anymore, does this mean there are over 30,000 AP workunits stuck in the "waiting for validation" state for at least one of their tasks? Are the project admins aware of this problem with stuck APs?
ID: 1205833
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1205841 - Posted: 14 Mar 2012, 17:22:17 UTC - in response to Message 1205833.  

So, now that v505 isn't splitting anymore, does this mean there are over 30,000 AP workunits stuck in the "waiting for validation" state for at least one of their tasks? Are the project admins aware of this problem with stuck APs?

They're not so much stuck as waiting for wingmen to report or time out; time-outs are probably 2/3 to 3/4 of the actual situation. With only a 25-day deadline, they will get sent out to a third wingman, and so on. The number of v505s will decrease, but at a slow rate - more like exponential decay: fast now while there are many of them, slower and less frequent as the total gets smaller.

Though since there is a 25-day deadline and the last ones were split about 15 days ago, even if they make it all the way to _9, that leaves somewhere around 200 days from now for the very last ones to be expired, retired, or completed. 200 days from now is the last day of September, so by Halloween the 505s should be completely gone.
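
As a back-of-the-envelope check of that estimate (a minimal sketch: the 25-day deadline and the ~15 days since the last split come from the posts above; the assumption that _0/_1 were the initial pair and reissues run _2 through _9 is mine):

#include <iostream>

int main() {
    const int deadline_days    = 25; // AP deadline quoted above
    const int days_since_split = 15; // last v505 WUs split ~15 days ago
    const int last_suffix      = 9;  // worst case: reissues up to _9

    // The current wingmen time out when their 25-day deadline lapses.
    const int first_timeout = deadline_days - days_since_split; // 10 days

    // Assuming _0/_1 were the initial pair, _2.._9 are 8 reissues, each
    // of which can run a full deadline before timing out in turn.
    const int reissues   = last_suffix - 1; // 8
    const int worst_case = first_timeout + reissues * deadline_days;

    std::cout << "Worst-case days until the last v505 clears: "
              << worst_case << "\n"; // 210, i.e. "around 200 days"
}
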
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1205841
red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1205848 - Posted: 14 Mar 2012, 17:43:19 UTC - in response to Message 1205841.  
Last modified: 14 Mar 2012, 17:46:56 UTC

So, now that v505 isn't splitting anymore, does this mean there are over 30,000 AP workunits stuck in the "waiting for validation" state for at least one of their tasks? Are the project admins aware of this problem with stuck APs?

They're not so much stuck as waiting for wingmen to report or time out; time-outs are probably 2/3 to 3/4 of the actual situation. With only a 25-day deadline, they will get sent out to a third wingman, and so on. The number of v505s will decrease, but at a slow rate - more like exponential decay: fast now while there are many of them, slower and less frequent as the total gets smaller.

Though since there is a 25-day deadline and the last ones were split about 15 days ago, even if they make it all the way to _9, that leaves somewhere around 200 days from now for the very last ones to be expired, retired, or completed. 200 days from now is the last day of September, so by Halloween the 505s should be completely gone.

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.
ID: 1205848
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1205849 - Posted: 14 Mar 2012, 17:49:03 UTC - in response to Message 1205848.  

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Agreed. The same thing was said when we went from "Astropulse" to "Astropulse_v5", and again when v505 came along, but apparently it can't -- or more accurately, won't -- be done. They will work themselves out over time. It's not as if having some of those still floating around causes anyone any anguish. There are apps for crunching them, so if you happen to be assigned one, you can crunch it and return it, helping to get rid of them.

Once the Lunatics app for v6 gets released, you can put r409 (or the GPU equivalent) back into your app_info along with the new v6 app, and change your preferences to allow the 505s to be sent. That's what I plan on doing.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 1205849
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1205850 - Posted: 14 Mar 2012, 17:51:18 UTC - in response to Message 1205848.  

So, now that v505 isn't splitting anymore, does this mean there are over 30,000 AP workunits stuck in the "waiting for validation" state for at least one of their tasks? Are the project admins aware of this problem with stuck APs?

They're not so much stuck as waiting for wingmen to report or time out; time-outs are probably 2/3 to 3/4 of the actual situation. With only a 25-day deadline, they will get sent out to a third wingman, and so on. The number of v505s will decrease, but at a slow rate - more like exponential decay: fast now while there are many of them, slower and less frequent as the total gets smaller.

Though since there is a 25-day deadline and the last ones were split about 15 days ago, even if they make it all the way to _9, that leaves somewhere around 200 days from now for the very last ones to be expired, retired, or completed. 200 days from now is the last day of September, so by Halloween the 505s should be completely gone.

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Conversely, why bother to send them out? They will be processed in due course anyway. To do otherwise would require micro-management by the project staff, and I'm sure they've got plenty of higher-priority tasks on their hands already.
ID: 1205850
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51469
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1205851 - Posted: 14 Mar 2012, 17:54:06 UTC - in response to Message 1205849.  

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Agreed. The same thing was said when we went from "Astropulse" to "Astropulse_v5", and again when v505 came along, but apparently it can't -- or more accurately, won't -- be done. They will work themselves out over time. It's not as if having some of those still floating around causes anyone any anguish. There are apps for crunching them, so if you happen to be assigned one, you can crunch it and return it, helping to get rid of them.

Once the Lunatics app for v6 gets released, you can put r409 (or the GPU equivalent) back into your app_info along with the new v6 app, and change your preferences to allow the 505s to be sent. That's what I plan on doing.

As the kitties prefer MB anyway, no changes here until things settle and the new Lunatics installer is released...
I'll take any 505 reissues that happen my way, and add v6 when the opti version has been tested for a bit and the validation/credit issues have been sorted.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1205851
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1205852 - Posted: 14 Mar 2012, 17:56:33 UTC - in response to Message 1205849.  

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Agreed. The same thing was said when we went from "Astropulse" to "Astropulse_v5", and again when v505 came along, but apparently it can't -- or more accurately, won't -- be done. They will work themselves out over time. It's not as if having some of those still floating around causes anyone any anguish. There are apps for crunching them, so if you happen to be assigned one, you can crunch it and return it, helping to get rid of them.

Once the Lunatics app for v6 gets released, you can put r409 (or the GPU equivalent) back into your app_info along with the new v6 app, and change your preferences to allow the 505s to be sent. That's what I plan on doing.

It won't be needed. We're writing both the apps and the installer in such a way that they will keep existing v505 tasks running and fetch resends if they're available, while at the same time handling the new v6 tasks properly, and doing both types a bit faster than at present. That's the plan, at least. It's complicated, which is why we need the time for testing.
ID: 1205852
cliff
Joined: 16 Dec 07
Posts: 625
Credit: 3,590,440
RAC: 0
United Kingdom
Message 1205854 - Posted: 14 Mar 2012, 18:05:25 UTC - in response to Message 1205801.  

Hi Richard,
Thanks for the info - bookmarked the insect :-)
Cheers,
Cliff,
Been there, Done that, Still no damn T shirt!
ID: 1205854
red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1205865 - Posted: 14 Mar 2012, 18:20:18 UTC - in response to Message 1205850.  
Last modified: 14 Mar 2012, 18:33:28 UTC

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Conversely, why bother to send them out? They will be processed in due course anyway. To do otherwise would require micro-management by the project staff, and I'm sure they've got plenty of higher-priority tasks on their hands already.

A regime such that whenever a WU needs to be resent after a timeout, it would always go to a computer with a RAC above some minimum, to minimise the risk of another timeout. This should reduce the overall number of results awaiting validation, and the whinges about slow wingmen.
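
For illustration, that regime might look something like the sketch below. The Host fields and the selection rule are my assumptions, not BOINC's actual scheduler code:

#include <iostream>
#include <vector>

// Hypothetical host record; field names are illustrative, not BOINC's schema.
struct Host {
    int    id;
    double rac;                 // recent average credit
    double avg_turnaround_days; // average reporting time for this host
};

// Route a timed-out result only to hosts whose RAC clears the minimum,
// preferring the fastest average turnaround among the qualifiers.
const Host* pick_resend_host(const std::vector<Host>& hosts, double min_rac) {
    const Host* best = nullptr;
    for (const Host& h : hosts) {
        if (h.rac < min_rac) continue;
        if (!best || h.avg_turnaround_days < best->avg_turnaround_days)
            best = &h;
    }
    return best; // nullptr means nobody qualifies; fall back to any host
}

int main() {
    std::vector<Host> hosts = {
        {1,   950.0, 20.0}, // slow, low-RAC host
        {2, 42000.0,  1.5}, // fast, high-RAC host
        {3, 15000.0,  3.0},
    };
    if (const Host* h = pick_resend_host(hosts, 10000.0))
        std::cout << "resend goes to host " << h->id << "\n"; // host 2
}
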
ID: 1205865
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51469
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1205867 - Posted: 14 Mar 2012, 18:21:43 UTC - in response to Message 1205865.  

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Conversely, why bother to send them out? They will be processed in due course anyway. To do otherwise would require micro-management by the project staff, and I'm sure they've got plenty of higher-priority tasks on their hands already.

A regime such that whenever a WU needs to be resent after a timeout, it would always go to a computer with a RAC above some minimum, to minimise the risk of another timeout. This should reduce the overall number of results awaiting validation, and the whinges about slow wingmen.

RAC is not always indicative of turnaround time...
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1205867
red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1205872 - Posted: 14 Mar 2012, 18:29:14 UTC - in response to Message 1205867.  
Last modified: 14 Mar 2012, 18:33:18 UTC

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Conversely, why bother to send them out? They will be processed in due course anyway. To do otherwise would require micro-management by the project staff, and I'm sure they've got plenty of higher-priority tasks on their hands already.

A regime such that whenever a WU needs to be resent after a timeout, it would always go to a computer with a RAC above some minimum, to minimise the risk of another timeout. This should reduce the overall number of results awaiting validation, and the whinges about slow wingmen.

RAC is not always indicative of turnaround time...

Given the current 400/50 limits, per-computer RAC tracks turnaround quite closely. If it's not close enough, then also include a maximum per-computer average turnaround time.
ID: 1205872
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1205906 - Posted: 14 Mar 2012, 20:22:34 UTC - in response to Message 1205867.  

Given there are only 31,386, why not just send them all out again to a couple more systems with RACs over 10,000? They would all be done in two weeks at most.

Conversely, why bother to send them out? They will be processed in due course anyway. To do otherwise would require micro-management by the project staff, and I'm sure they've got plenty of higher-priority tasks on their hands already.

A regime such that whenever a WU needs to be resent after a timeout, it would always go to a computer with a RAC above some minimum, to minimise the risk of another timeout. This should reduce the overall number of results awaiting validation, and the whinges about slow wingmen.

RAC is not always indicative of turnaround time...

Quite true - turnaround time is what counts, and the BOINC database keeps that statistic for each app version on a host. The available BOINC feature judges reliability using that, plus the count of consecutive valid results, plus a daily quota at least equal to the basic setting of 100. The documentation is slightly out of date, but gives a reasonable overall view. Reading the code in sched_send.cpp and a few other source files gives accurate, current information on how it works.

Having that feature turned on would of course put additional load on the Scheduler processes, but with the improved servers that may become feasible. What settings would be appropriate would take some thought; the general advice that about 25% of hosts ought to be considered reliable seems sensible. If set too tight, reissued tasks might occupy positions in the Feeder queue for too long.
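
As a rough illustration of the test described above - a sketch only, with field and parameter names of my own choosing; the authoritative logic lives in sched_send.cpp and the related source files:

#include <iostream>

// Per-host, per-app-version statistics of the kind Joe says BOINC keeps.
// Names are illustrative, not the actual BOINC structures.
struct HostAppVersion {
    double avg_turnaround_days; // average turnaround for this app version
    int    consecutive_valid;   // valid results in a row
    int    max_jobs_per_day;    // daily quota; 100 is the basic setting
};

// A host/app-version pair counts as "reliable" when it turns work around
// quickly, has a streak of valid results, and an undamaged daily quota.
bool is_reliable(const HostAppVersion& hav,
                 double max_turnaround_days,   // project tuning knob
                 int    min_consecutive_valid, // likewise
                 int    base_quota = 100)
{
    return hav.avg_turnaround_days <= max_turnaround_days
        && hav.consecutive_valid   >= min_consecutive_valid
        && hav.max_jobs_per_day    >= base_quota;
}

int main() {
    HostAppVersion fast{2.0, 25, 100}, slow{18.0, 3, 40};
    // e.g. require turnaround within 6 days and 10 valid results in a row
    std::cout << is_reliable(fast, 6.0, 10) << " "   // 1
              << is_reliable(slow, 6.0, 10) << "\n"; // 0
}
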
                                                                  Joe
ID: 1205906
red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1205933 - Posted: 14 Mar 2012, 21:49:34 UTC - in response to Message 1205906.  
Last modified: 14 Mar 2012, 22:16:59 UTC

Quite true - turnaround time is what counts, and the BOINC database keeps that statistic for each app version on a host. The available BOINC feature judges reliability using that, plus the count of consecutive valid results, plus a daily quota at least equal to the basic setting of 100. The documentation is slightly out of date, but gives a reasonable overall view. Reading the code in sched_send.cpp and a few other source files gives accurate, current information on how it works.

Having that feature turned on would of course put additional load on the Scheduler processes, but with the improved servers that may become feasible. What settings would be appropriate would take some thought; the general advice that about 25% of hosts ought to be considered reliable seems sensible. If set too tight, reissued tasks might occupy positions in the Feeder queue for too long.

Could we kill two birds with one stone and have two feeder queues in different instances of the scheduler? Ideally one with a big queue for the reliable hosts, which would include resends, and a second with a smaller queue for the others.
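
One way to picture that proposal, purely as a sketch: it assumes resends can be tagged as such at enqueue time, and it uses a std::queue where the real feeder works from a shared-memory array.

#include <iostream>
#include <queue>

struct Job {
    int  result_id;
    bool is_resend; // assumes resends can be tagged at enqueue time
};

// Two feeder queues, as proposed: a large one serving the reliable hosts
// (and carrying the resends), a smaller one serving everyone else.
struct DualFeeder {
    std::queue<Job> reliable_queue; // big queue, resends routed here
    std::queue<Job> general_queue;  // smaller queue for other hosts

    void enqueue(const Job& j) {
        (j.is_resend ? reliable_queue : general_queue).push(j);
    }
};

int main() {
    DualFeeder feeder;
    feeder.enqueue({101, false}); // fresh work -> general queue
    feeder.enqueue({102, true});  // resend -> reliable queue
    std::cout << feeder.reliable_queue.size() << " "   // 1
              << feeder.general_queue.size()  << "\n"; // 1
}
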
ID: 1205933