Current download problem also blocks other projects' downloads

Author	Message
Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 458020 - Posted: 14 Nov 2006, 23:50:50 UTC - in response to Message 458008. From the front page: November 14, 2006 A configuration problem on our servers has caused workunit downloads to fail since yesterday afternoon. This has been fixed. However, we are bringing the whole project down for our regular Tuesday outage to back up our database. We should be back up in a few hours (22:00 UTC). Be patient, they are working on it. Actually, they fixed it about three hours ago (before 21:00 UTC - beating their own estimate!). Might be useful to bookmark this page. Still, we definitely have a long catch-up period this time, and things will be slow for a while yet.

Nathan Joined: 11 Apr 01 Posts: 10 Credit: 9,996,407 RAC: 0	Message 458025 - Posted: 14 Nov 2006, 23:54:26 UTC - in response to Message 458020. Actually, they fixed it about three hours ago (before 21:00 UTC - beating their own estimate!). Sure doesn't look like it based on the machines here.

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 458028 - Posted: 14 Nov 2006, 23:57:11 UTC - in response to Message 457977. Hmmm, I don't know. The INR-688 interface to Cogent "flatlined" yesterday around 3 PM Berkeley time, so it would seem they knew they were in trouble before they left yesterday (Also an afternoon time frame was mentioned in the news item). Still, the extracurricular activity factor may have played a part in how long it was out. ;-) Alinator Sure, it flatlined, but can you infer that they *knew* ;-) ? a) Were they there? I don't know if it was a holiday / team meeting / awayday / schmooze the sponsors / fit a new receiver at Arecibo / anything else sort of day. All good reasons for not fixing it. b) Did they notice? Do they have a flat-line alarm? Not a good reason for keeping us in the dark. Even while all the new development is taking place, they should nominate someone to mind the shop during working hours.

dragon1 Joined: 17 Sep 05 Posts: 33 Credit: 4,438,013 RAC: 0	Message 458031 - Posted: 14 Nov 2006, 23:59:30 UTC Likely a big backlog....I have JUST received 7 downloads...AND SETI also now sees my preferences, i.e. 1.5 days...something it hasn't done in about a week. Maybe (likely) our good friends at Berk. have been working on that earlier reported issue too. Hooray....

John McLeod VII Volunteer developer Volunteer tester Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 458057 - Posted: 15 Nov 2006, 0:16:28 UTC - in response to Message 457960. It's my guess that his client has requested X seconds of work from seti and the order was filled, but just hasn't reached him yet, so the puter thinks it has X seconds on hand when in fact it doesn't. The scheduler has already taken that into account and isn't requesting other work. At least, I think this is correct. So when the download is finally completed, he would have X seconds on hand, and if the code was changed to see the nonexistent download and actually get work from elsewhere, then if the outage was short, the host would be overcommitted and risk missing deadlines. tony I think that's what is happening. This is one rare situation when micromanaging may be needed. As I mentioned, suspending seti for a minute allowed other projects to download some work and keep my hosts busy, at least for a while. Harri This is a known issue in the current release, where it does CPU scheduling system-wide. The current beta does scheduling on a per-core basis. Actually, there is a problem with the current code as well. If the only contactable project with a higher LTD than -task switch interval is acknowledging work requests by granting work, but the download server is offline, then the BOINC client believes that it has enough work on the way and should not request any more (when it actually does not have work on the way). I believe I have found a fix that will go into 5.7.4 or so. BOINC WIKI
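
The accounting problem described in the post above can be illustrated with a minimal sketch. This is not BOINC client code: the `Task` class, the function names, and the `count_pending_downloads` flag are hypothetical, chosen only to show how counting work that was granted but never downloaded can stop the client from asking any project for more.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a granted result; not a BOINC type."""
    project: str
    est_seconds: float  # estimated run time of the task
    downloaded: bool    # False while its files are still stuck in transfer


def buffered_seconds(tasks, count_pending_downloads):
    """Total work the client believes it has on hand or on the way."""
    return sum(
        t.est_seconds
        for t in tasks
        if t.downloaded or count_pending_downloads
    )


def needs_more_work(tasks, target_seconds, count_pending_downloads):
    """True if the client would ask a project for more work."""
    return buffered_seconds(tasks, count_pending_downloads) < target_seconds


# Assumed example: a 1.5 day buffer and one granted task whose download stalls.
target = 1.5 * 86400
queue = [Task("SETI@home", 150000.0, downloaded=False)]

# Counting the stalled download, the buffer looks full and nothing is requested.
stuck = needs_more_work(queue, target, count_pending_downloads=True)

# Ignoring it, the client sees an empty buffer and would ask elsewhere.
unstuck = needs_more_work(queue, target, count_pending_downloads=False)
```

Ignoring pending downloads entirely has the cost tony points out: if the outage is short, the stalled work arrives on top of the newly fetched work and the host may be overcommitted. The sketch only shows the two extremes and does not represent the fix that was actually submitted.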

Pappa Volunteer tester Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0	Message 458077 - Posted: 15 Nov 2006, 0:38:24 UTC - in response to Message 458028. Last modified: 15 Nov 2006, 0:38:42 UTC If I were to think that traffic in/out is no longer happening because the server has issues, then the higher volume of traffic will appear to flatline. So the NFS issue prevents the traffic from happening... the lower volume of traffic is workstations trying to connect... Hmmm, I don't know. The INR-688 interface to Cogent "flatlined" yesterday around 3 PM Berkeley time, so it would seem they knew they were in trouble before they left yesterday (Also an afternoon time frame was mentioned in the news item). Still, the extracurricular activity factor may have played a part in how long it was out. ;-) Alinator Sure, it flatlined, but can you infer that they *knew* ;-) ? a) Were they there? I don't know if it was a holiday / team meeting / awayday / schmooze the sponsors / fit a new receiver at Arecibo / anything else sort of day. All good reasons for not fixing it. b) Did they notice? Do they have a flat-line alarm? Not a good reason for keeping us in the dark. Even while all the new development is taking place, they should nominate someone to mind the shop during working hours. Please consider a Donation to the Seti Project.

Nathan Joined: 11 Apr 01 Posts: 10 Credit: 9,996,407 RAC: 0	Message 458198 - Posted: 15 Nov 2006, 4:35:16 UTC All the machines I can see say "Activities Suspended". No processing--things are just stopped. Restarted BOINC, updated project, etc. Nothing.

Alinator Volunteer tester Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0	Message 458337 - Posted: 15 Nov 2006, 15:01:33 UTC - in response to Message 458057. Last modified: 15 Nov 2006, 15:02:00 UTC This is a known issue in the current release, where it does CPU scheduling system-wide. The current beta does scheduling on a per-core basis. Actually, there is a problem with the current code as well. If the only contactable project with a higher LTD than -task switch interval is acknowledging work requests by granting work, but the download server is offline, then the BOINC client believes that it has enough work on the way and should not request any more (when it actually does not have work on the way). I believe I have found a fix that will go into 5.7.4 or so. John, just to make sure I'm reading this right. Let's say Project A has the highest LTD, Project B the next highest, Project C next, and so forth. Project A is returning a NNW on requests, Project B is sending work but DLs are failing. You're saying that when the work for Project A runs out and BOINC discovers the result for Project B is unrunnable it won't fall back and DL a result from Project C? Alinator

Alinator Volunteer tester Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0	Message 458340 - Posted: 15 Nov 2006, 15:09:45 UTC - in response to Message 458028. Hmmm, I don't know. The INR-688 interface to Cogent "flatlined" yesterday around 3 PM Berkeley time, so it would seem they knew they were in trouble before they left yesterday (Also an afternoon time frame was mentioned in the news item). Still, the extracurricular activity factor may have played a part in how long it was out. ;-) Alinator Sure, it flatlined, but can you infer that they *knew* ;-) ? a) Were they there? I don't know if it was a holiday / team meeting / awayday / schmooze the sponsors / fit a new receiver at Arecibo / anything else sort of day. All good reasons for not fixing it. b) Did they notice? Do they have a flat-line alarm? Not a good reason for keeping us in the dark. Even while all the new development is taking place, they should nominate someone to mind the shop during working hours. Apparently, according to the Tech News, they were busy enough worrying about other issues that they didn't realize new work wasn't going out the door. ;-) Oh well, things like that happen from time to time (even if alarms go off, just ask the guys at TMI). :-) Alinator

zombie67 [MM] Volunteer tester Joined: 22 Apr 04 Posts: 758 Credit: 27,771,894 RAC: 0	Message 458655 - Posted: 16 Nov 2006, 1:19:28 UTC Last modified: 16 Nov 2006, 2:05:46 UTC Well, I have been (and still am) away on travel during this, and these are my observations. I usually get 17k per day. About 10k of that comes from machines that run SETI exclusively. The rest (11 machines) run SETI/Rosetta/WCG at 100/50/50. During the outage, I made 10.8k. Now, the SETI-exclusive machines were running on a setting of 1.3 (or greater) days of work. This tells me that one of two things happened: 1) I lost almost all the points on the mixed project machines...they sat idle, when they could have just worked on the other two projects instead. or 2) The "Connect to network about every..." setting does not do what I have been told it does. And no, this is not attributable to the recent bug. These machines have been on the 1.3 days for (forgot to write) over a month now. I'm guessing #1 based on other posts in this thread. So....we have uncovered a real bug, IMO. And a serious one at that. Dublin, California Team: SETI.USA

John McLeod VII Volunteer developer Volunteer tester Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 458693 - Posted: 16 Nov 2006, 1:56:24 UTC - in response to Message 458337. This is a known issue in the current release, where it does CPU scheduling system-wide. The current beta does scheduling on a per-core basis. Actually, there is a problem with the current code as well. If the only contactable project with a higher LTD than -task switch interval is acknowledging work requests by granting work, but the download server is offline, then the BOINC client believes that it has enough work on the way and should not request any more (when it actually does not have work on the way). I believe I have found a fix that will go into 5.7.4 or so. John, just to make sure I'm reading this right. Let's say Project A has the highest LTD, Project B the next highest, Project C next, and so forth. Project A is returning a NNW on requests, Project B is sending work but DLs are failing. You're saying that when the work for Project A runs out and BOINC discovers the result for Project B is unrunnable it won't fall back and DL a result from Project C? Alinator Depends on the LTD of C and whether the system is otherwise in EDF. If the system is not in EDF and the LTD of C is above the cutoff, then work would be fetched. Otherwise not. This is a bug for which I have submitted a fix. It did not make it into 5.7.4 as that came out immediately after I submitted the fix, but before anyone had a chance to look at it and check it in. BOINC WIKI
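
The rule in the reply above (fetch from Project C only if the system is not in EDF and C's LTD is above the cutoff) can be written out as a small sketch. It assumes the cutoff is the negative of the task switch interval, as the earlier post in this thread suggests; the function name, its parameters, and the 60-minute interval are hypothetical and are not taken from the BOINC source.

```python
def would_fetch_from(ltd_seconds, in_edf, task_switch_interval_seconds):
    """Return True if work would be fetched from a project under the stated rule.

    ltd_seconds: the project's long-term debt (LTD), in seconds.
    in_edf: True if the client is in earliest-deadline-first mode.
    task_switch_interval_seconds: assumed to define the cutoff as its negative.
    """
    cutoff = -task_switch_interval_seconds
    return (not in_edf) and ltd_seconds > cutoff


# Assumed interval of 60 minutes, used here only for illustration.
interval = 60 * 60

fetch_c = would_fetch_from(-1000.0, in_edf=False, task_switch_interval_seconds=interval)
skip_low_ltd = would_fetch_from(-5000.0, in_edf=False, task_switch_interval_seconds=interval)
skip_edf = would_fetch_from(0.0, in_edf=True, task_switch_interval_seconds=interval)
```

In Alinator's scenario this means Project C is skipped whenever its LTD has dropped to the cutoff or below, or whenever the host is already in EDF, even though Project B's granted work cannot actually run.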
