Current download problem prohibits also other projects downloads

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 458020 - Posted: 14 Nov 2006, 23:50:50 UTC - in response to Message 458008.  

From the front page:

November 14, 2006
A configuration problem on our servers has caused workunit downloads to fail since yesterday afternoon. This has been fixed. However, we are bringing the whole project down for our regular Tuesday outage to back up our database. We should be back up in a few hours (22:00 UTC).


Be patient, they are working on it.

Actually, they fixed it about three hours ago (before 21:00 UTC - beating their own estimate!). Might be useful to bookmark this page.

Still, we definitely have a long catch-up period this time, and things will be slow for a while yet.
ID: 458020 · Report as offensive
Nathan

Joined: 11 Apr 01
Posts: 10
Credit: 9,996,407
RAC: 0
United States
Message 458025 - Posted: 14 Nov 2006, 23:54:26 UTC - in response to Message 458020.  

Actually, they fixed it about three hours ago (before 21:00 UTC - beating their own estimate!).


Sure doesn't look like it based on the machines here.
ID: 458025 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 458028 - Posted: 14 Nov 2006, 23:57:11 UTC - in response to Message 457977.  

Hmmm, I don't know. The INR-688 interface to Cogent "flatlined" yesterday around 3 PM Berkeley time, so it would seem they knew they were in trouble before they left yesterday (Also an afternoon time frame was mentioned in the news item).

Still, the extracurricular activity factor may have played a part in how long it was out. ;-)

Alinator

Sure, it flatlined, but can you infer that they knew ;-) ?

a) Were they there? I don't know if it was a holiday / team meeting / awayday / schmooze the sponsors / fit a new receiver at Arecibo / anything else sort of day. All good reasons for not fixing it.
b) Did they notice? Do they have a flat-line alarm? Not a good reason for keeping us in the dark. Even while all the new development is taking place, they should nominate someone to mind the shop during working hours.
ID: 458028 · Report as offensive
dragon1

Joined: 17 Sep 05
Posts: 33
Credit: 4,438,013
RAC: 0
Canada
Message 458031 - Posted: 14 Nov 2006, 23:59:30 UTC

Likely a big backlog... I have JUST received 7 downloads... AND SETI also now sees my preferences (i.e. 1.5 days), something it hasn't done in about a week. Maybe (likely) our good friends at Berkeley have been working on that earlier reported issue too. Hooray...
ID: 458031 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester

Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 458057 - Posted: 15 Nov 2006, 0:16:28 UTC - in response to Message 457960.  

It's my guess that his client has requested X seconds of work from SETI and the order was filled, but the work just hasn't reached him yet, so the computer thinks it has X seconds on hand when in fact it doesn't. The scheduler has already taken that into account and isn't requesting other work. At least, I think this is correct.

So when the download is finally completed, he would have X seconds on hand, and if the code were changed to see the nonexistent download and actually get work from elsewhere, then if the outage was short, the host would be overcommitted and risk missing deadlines.
tony


I think that's what is happening. This is one rare situation when micromanaging may be needed. As I mentioned, suspending SETI for a minute allowed other projects to download some work and keep my hosts busy, at least for a while.

Harri

This is a known issue in the current release, where it does CPU scheduling system-wide. The current beta does scheduling on a per-core basis.

Actually, there is a problem with the current code as well. If the only contactable project with a higher LTD than -task switch interval is acknowledging work requests by granting work, but the download server is off line, then the BOINC client believes that it has enough work on the way and should not request any more (when it actually does not have work on the way). I believe I have found a fix that will go into 5.7.4 or so.
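
To make the failure mode concrete, here is a minimal Python sketch. It is not BOINC's actual code. It assumes the client decides whether to ask for work by comparing its queued seconds against a buffer, and that granted tasks are counted even while their files are stuck in transfer. `Task`, `queued_seconds`, `needs_more_work`, the 6-hour task estimate and the 1.5-day buffer are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    project: str
    est_seconds: float  # estimated run time of the task
    downloaded: bool    # False while the input files are stuck in transfer


def queued_seconds(tasks, count_pending_downloads):
    """Sum the estimated work the client believes it has on hand."""
    return sum(
        t.est_seconds
        for t in tasks
        if t.downloaded or count_pending_downloads
    )


def needs_more_work(tasks, buffer_seconds, count_pending_downloads):
    """True if the client would ask some project for more work."""
    return queued_seconds(tasks, count_pending_downloads) < buffer_seconds


# Hypothetical host: 7 tasks granted by the scheduler, none downloaded yet.
tasks = [Task("SETI@home", 6 * 3600, downloaded=False) for _ in range(7)]
buffer_seconds = 1.5 * 86400  # hypothetical 1.5-day work buffer

# Counting granted-but-undownloaded work: the buffer looks full, no request.
print(needs_more_work(tasks, buffer_seconds, count_pending_downloads=True))   # False

# Counting only runnable work: the buffer is empty, so work is requested.
print(needs_more_work(tasks, buffer_seconds, count_pending_downloads=False))  # True
```

Simply ignoring pending downloads is not a free fix: as Tony notes above, if the outage is short the stuck work arrives on top of whatever was fetched elsewhere, and the host can end up overcommitted.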


BOINC WIKI
ID: 458057 · Report as offensive
Pappa
Volunteer tester

Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 458077 - Posted: 15 Nov 2006, 0:38:24 UTC - in response to Message 458028.  
Last modified: 15 Nov 2006, 0:38:42 UTC

If traffic in/out is no longer happening because the server has issues, then the higher-volume traffic will appear to flatline.

So the NFS issue prevents the traffic from happening... the lower volume of traffic is workstations trying to connect...


Hmmm, I don't know. The INR-688 interface to Cogent "flatlined" yesterday around 3 PM Berkeley time, so it would seem they knew they were in trouble before they left yesterday (Also an afternoon time frame was mentioned in the news item).

Still, the extracurricular activity factor may have played a part in how long it was out. ;-)

Alinator

Sure, it flatlined, but can you infer that they knew ;-) ?

a) Were they there? I don't know if it was a holiday / team meeting / awayday / schmooze the sponsors / fit a new receiver at Arecibo / anything else sort of day. All good reasons for not fixing it.
b) Did they notice? Do they have a flat-line alarm? Not a good reason for keeping us in the dark. Even while all the new development is taking place, they should nominate someone to mind the shop during working hours.


Please consider a Donation to the Seti Project.

ID: 458077 · Report as offensive
Nathan

Joined: 11 Apr 01
Posts: 10
Credit: 9,996,407
RAC: 0
United States
Message 458198 - Posted: 15 Nov 2006, 4:35:16 UTC

All the machines I can see say "Activities Suspended". No processing--things are just stopped. Restarted BOINC, updated project, etc. Nothing.
ID: 458198 · Report as offensive
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 458337 - Posted: 15 Nov 2006, 15:01:33 UTC - in response to Message 458057.  
Last modified: 15 Nov 2006, 15:02:00 UTC

This is a known issue in the current release, where it does CPU scheduling system-wide. The current beta does scheduling on a per-core basis.

Actually, there is a problem with the current code as well. If the only contactable project with a higher LTD than -task switch interval is acknowledging work requests by granting work, but the download server is off line, then the BOINC client believes that it has enough work on the way and should not request any more (when it actually does not have work on the way). I believe I have found a fix that will go into 5.7.4 or so.


John, just to make sure I'm reading this right.

Let's say Project A has the highest LTD, Project B the next highest, Project C next, and so forth.

Project A is returning a NNW on requests, Project B is sending work but DLs are failing. You're saying that when the work for Project A runs out and BOINC discovers the result for Project B is unrunnable, it won't fall back and DL a result from Project C?

Alinator
ID: 458337 · Report as offensive
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 458340 - Posted: 15 Nov 2006, 15:09:45 UTC - in response to Message 458028.  

Hmmm, I don't know. The INR-688 interface to Cogent "flatlined" yesterday around 3 PM Berkeley time, so it would seem they knew they were in trouble before they left yesterday (Also an afternoon time frame was mentioned in the news item).

Still, the extracurricular activity factor may have played a part in how long it was out. ;-)

Alinator

Sure, it flatlined, but can you infer that they knew ;-) ?

a) Were they there? I don't know if it was a holiday / team meeting / awayday / schmooze the sponsors / fit a new receiver at Arecibo / anything else sort of day. All good reasons for not fixing it.
b) Did they notice? Do they have a flat-line alarm? Not a good reason for keeping us in the dark. Even while all the new development is taking place, they should nominate someone to mind the shop during working hours.


Apparently, according to the Tech News, they were busy enough worrying about other issues that they *didn't* realize new work wasn't going out the door. ;-)

Oh well, things like that happen from time to time (even if alarms go off, just ask the guys at TMI). :-)

Alinator


ID: 458340 · Report as offensive
zombie67 [MM]
Volunteer tester

Joined: 22 Apr 04
Posts: 758
Credit: 27,771,894
RAC: 0
United States
Message 458655 - Posted: 16 Nov 2006, 1:19:28 UTC
Last modified: 16 Nov 2006, 2:05:46 UTC

Well, I have been (and still am) away on travel during this, and these are my observations. I usually get 17k per day. About 10k of that comes from machines that run SETI exclusively. The rest (11 machines) run SETI/Rosetta/WCG at 100/50/50. During the outage, I made 10.8. Now, the SETI-exclusive machines were running on a setting of 1.3 (or greater) days of work.

This tells me that one of two things happened:

1) I lost almost all the points on the mixed project machines...they sat idle, when they could have just worked on the other two projects instead.

or

2) The "Connect to network about every..." setting does not do what I have been told it does. And no, this is not attributable to the recent bug. These machines have been on the 1.3 days for (forgot to write) over a month now.

I'm guessing #1 based on other posts in this thread. So....we have uncovered a real bug, IMO. And a serious one at that.
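
Spelling out the arithmetic behind that guess as a short Python sketch. It assumes that "10.8" means 10.8k per day and that the SETI-exclusive machines kept earning their usual 10k from cached work; neither is stated outright.

```python
# Figures from the post; the interpretation of "10.8" as 10.8k is an assumption.
usual_total = 17_000      # usual credit per day, all machines
usual_seti_only = 10_000  # usual share from the SETI-exclusive machines
outage_total = 10_800     # credit per day during the outage (assumed 10.8k)

# What the 11 mixed-project machines normally contribute.
usual_mixed = usual_total - usual_seti_only    # 7000

# Assumption: the SETI-exclusive machines kept running from their 1.3-day cache,
# so whatever is left over came from the mixed-project machines.
outage_mixed = outage_total - usual_seti_only  # 800

shortfall = 1 - outage_mixed / usual_mixed
print(f"Mixed machines lost about {shortfall:.0%} of their usual credit")
```

Under those assumptions the mixed machines delivered roughly 0.8k of an expected 7k, which is consistent with them sitting mostly idle rather than switching to Rosetta or WCG.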
Dublin, California
Team: SETI.USA
ID: 458655 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester

Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 458693 - Posted: 16 Nov 2006, 1:56:24 UTC - in response to Message 458337.  

This is a known issue in the current release, where it does CPU scheduling system-wide. The current beta does scheduling on a per-core basis.

Actually, there is a problem with the current code as well. If the only contactable project with a higher LTD than -task switch interval is acknowledging work requests by granting work, but the download server is off line, then the BOINC client believes that it has enough work on the way and should not request any more (when it actually does not have work on the way). I believe I have found a fix that will go into 5.7.4 or so.


John, just to make sure I'm reading this right.

Let's say Project A has the highest LTD, Project B the next highest, Project C next, and so forth.

Project A is returning a NNW on requests, Project B is sending work but DLs are failing. You're saying that when the work for Project A runs out and BOINC discovers the result for Project B is unrunnable, it won't fall back and DL a result from Project C?

Alinator

Depends on the LTD of C and whether the system is otherwise in EDF. If the system is not in EDF and the LTD of C is above the cutoff, then work would be fetched. Otherwise not. This is a bug for which I have submitted a fix. It did not make it into 5.7.4 as that came out immediately after I submitted the fix, but before anyone had a chance to look at it and check it in.
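
The rule described here can be written down as a small Python sketch. It is an illustration only, not the client's code: the function name, the 60-minute task switch interval and the LTD values are assumptions, and the cutoff of minus the task switch interval is taken from the earlier post in this thread (message 458057). LTD is long-term debt; EDF is earliest-deadline-first mode.

```python
def would_fetch_from(ltd_seconds, task_switch_interval_seconds, in_edf):
    """Return True if work would be fetched from a project under the rule above.

    Work is fetched only if the host is not in EDF mode and the project's
    long-term debt is above the cutoff (minus the task switch interval).
    """
    cutoff = -task_switch_interval_seconds
    return (not in_edf) and ltd_seconds > cutoff


switch_interval = 60 * 60  # assumed 60-minute task switch interval, in seconds

# Project C slightly in debt but above the cutoff, host not in EDF: fetch.
print(would_fetch_from(-1000, switch_interval, in_edf=False))  # True

# Project C below the cutoff: no fetch, even though B's downloads are stuck.
print(would_fetch_from(-5000, switch_interval, in_edf=False))  # False

# Host in EDF mode: no fetch regardless of C's debt.
print(would_fetch_from(-1000, switch_interval, in_edf=True))   # False
```

The second and third cases are the ones where a host can run dry while Project B's downloads are failing.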


BOINC WIKI
ID: 458693 · Report as offensive
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.