Problems...
Author | Message |
---|---|
Lint trap Send message Joined: 30 May 03 Posts: 871 Credit: 28,092,319 RAC: 0 |
As you get new work, then there is a correct entry that links everything together (in all the tables). I had another validation error today. My wingman and I both got the WU on the 10th. The wingman returned first, and I got a validation error when I reported mine. @Joe Segur: No offense meant to anyone! I was referring only to the two people who were in discussion at the time, not an all-inclusive "we". Martin |
Pappa Send message Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0 |
Tasks were discarded by BOINC client Raistmer This is a BOINC core issue that has been long overlooked/ignored. As I think about it, the server code should say: if "Resend" is turned off, then any WUs assigned to the machine should be aborted by the server when a Reset occurs. This would resolve the issue. Regards Please consider a Donation to the Seti Project. |
Pappa Send message Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0 |
According to a comment from Eric, once we get past what everyone had on their machines during the outage and get it reported, the validate errors should stop happening. Self-healing. Fred et al., I am at a loss for what to say. I would guess there is a reason for the lack of Tech News updates. Regards Please consider a Donation to the Seti Project. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14674 Credit: 200,643,578 RAC: 874 |
This is a BOINC core issue that has been long overlooked/ignored. As I think about it, the server code should say: if "Resend" is turned off, then any WUs assigned to the machine should be aborted by the server when a Reset occurs. This would resolve the issue. The server can only mark them as disposed of if the client actively sends a message telling it about the reset. That means that the reset button has a double action: 1) Update project: send scheduler request, wait for ack, retry as necessary, etc. 2) Clean up project files. What happens if you're trying to reset because the project files have got screwed up, and it can't communicate with the server? Or the server's down: does the reset button just hang until the project comes back up? Perhaps the best that could be done would be to pop up a question: "You still have tasks for this project - do you want to attempt a project update before resetting?" Choosing 'Yes' would cancel the reset and send an 'Update' instruction to the core client instead. Choosing 'No' would do a forced reset as now, so you would have an escape route if it's all gone completely pear-shaped. |
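The pop-up proposed above can be sketched as a small decision function. This is only an illustration of the suggested behaviour, not BOINC code: the `Action` enum, the function name and its parameters are hypothetical.

```python
from enum import Enum


class Action(Enum):
    UPDATE_ONLY = "update"    # cancel the reset, ask the core client to update
    FORCED_RESET = "reset"    # clean up project files without waiting for the server


def choose_reset_action(tasks_on_host: int, user_wants_update: bool) -> Action:
    """Decide what the reset button does (hypothetical logic from the proposal)."""
    if tasks_on_host == 0:
        # Nothing would be lost, so there is no reason to ask the question.
        return Action.FORCED_RESET
    if user_wants_update:
        # 'Yes': try to talk to the server first, do not reset.
        return Action.UPDATE_ONLY
    # 'No': escape route when communication is impossible.
    return Action.FORCED_RESET
```

The forced reset stays reachable in every case, which keeps the "escape route" for a project whose server is down.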
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Richard, I did a project update before the reset. Neither action, update nor reset, "freed" the discarded tasks. So now they are listed on the web site but physically absent from my host. The only way to get rid of them (besides a project detach, probably) is to wait until the deadline. IMO, the BOINC core client could give the server a hint (on update or on reset) that those tasks are lost and should be sent to another host. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14674 Credit: 200,643,578 RAC: 874 |
Richard, I did project update before reset. Well, it would have to be on update (communication with server): reset would be reserved for the situation where no communication is possible (ultimate big red....) What was the state of the tasks in the interval between the finger-fumble and the reset? |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
On the host side they just disappeared, with a message in the log roughly like "no app for 603 version, task discarded". On the server side (web page) they still remain in the "green" state, as "ghosts". |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14674 Credit: 200,643,578 RAC: 874 |
on host side they just disappeared with message in log ~"no app for 603 version, task discarded" Doesn't that mean we just have to generalise changeset 19235? "If a RESULT uses an app version that is missing [a coprocessor], abort it (rather than deleting it)." |
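A minimal sketch of the "abort rather than delete" idea, assuming that an aborted task stays in the client's state and is therefore reported on the next scheduler contact (which would avoid the "ghosts"). The `Task` class, the state names and the function are hypothetical and are not taken from the BOINC source.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    app_version: int
    state: str = "ready"


def abort_tasks_without_app(tasks, installed_versions):
    """Mark tasks whose app version is missing as aborted; return them for reporting."""
    to_report = []
    for task in tasks:
        if task.app_version not in installed_versions:
            # Keep the task record instead of deleting it, so the server can be told.
            task.state = "aborted"
            to_report.append(task)
    return to_report
```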
Leopoldo Send message Joined: 4 Aug 99 Posts: 102 Credit: 3,051,091 RAC: 0 |
(btw, is there any reason for the double mention of all files in the heading section? I mean, why should each file_ref have a corresponding file_info?) My interpretation is:
|
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Yes, it's declaration-like stuff. But I'm not sure such a declaration is needed in app_info. The file is not so long that separate declaration/definition sections are needed. BOINC can easily infer all needed files from parsing the app_version sections. |
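To illustrate the point about inferring files from the app_version sections, here is a rough sketch that collects every file named in a file_ref and compares the result with the file_info declarations. The sample XML is made up: the file names and app name are hypothetical, and the child elements `name` and `file_name` are assumed from the usual app_info.xml layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical app_info.xml content; kernels.cl is referenced but never declared.
APP_INFO = """<app_info>
  <app><name>example_app</name></app>
  <file_info><name>example_app.exe</name><executable/></file_info>
  <app_version>
    <app_name>example_app</app_name>
    <version_num>603</version_num>
    <file_ref><file_name>example_app.exe</file_name><main_program/></file_ref>
    <file_ref><file_name>kernels.cl</file_name></file_ref>
  </app_version>
</app_info>"""


def referenced_files(root):
    """Files that the app_version sections actually use."""
    return {(ref.findtext("file_name") or "").strip() for ref in root.iter("file_ref")}


def declared_files(root):
    """Files declared up front through file_info."""
    return {(info.findtext("name") or "").strip() for info in root.iter("file_info")}


root = ET.fromstring(APP_INFO)
missing = referenced_files(root) - declared_files(root)
print(sorted(missing))  # ['kernels.cl']
```

With a check like this, a forgotten declaration could be reported by name instead of leading to deleted files.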
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
on host side they just disappeared with message in log ~"no app for 603 version, task discarded" Just not for the co-processor case: GPUs can be swapped, and it is better not to trash these tasks. Maybe just the reverse should be done: do not trash tasks at all, just mark them as "no needed app binary exists" or something like that, as it currently does with a missing GPU. Then the user gets a chance to repair a possible error, and if not, the tasks will eventually just hit the deadline. Currently they hit the deadline anyway, but with no option left to repair the situation. |
Pappa Send message Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0 |
This is a BOINC core issue that has been long overlooked/ignored. As I think about it, the server code should say: if "Resend" is turned off, then any WUs assigned to the machine should be aborted by the server when a Reset occurs. This would resolve the issue. Sorry, went to bed early and have been running. Reset: BOINC informs you that a project reset might be required! The reason may be that one of the project application files is suspect (open in active memory) and causing errors. When the "Reset" is done, all files are removed as suspect... Another reason is that a file that is open in memory has become corrupt; a reset then refreshes the application files and reopens the project application. There is still a bit of conflict about the true purpose of the Reset command. My thinking is that if the applications (and support files) open and running in memory are suspected to be the problem, then a reset should only touch those applications and support files, leaving work alone... As I sat down, I picked a project that I can reset on this machine. It had no work, so it does not matter. 1. Disable Network Activity. 2. Open sched_request_projectname.xml 3. Open sched_reply_projectname.xml 4. With Explorer open, Reset the project. As I use UltraEdit for my editor, if a file changes (while open) I am notified of the change and asked to reload it... What I saw in Explorer is that files were removed from the project. The message log (6.10.36) states * Resetting Project. * Resuming Network Activity. End of story... Neither the Scheduler Request nor the Reply file was updated. The <rpc_seqno> did not update. What did happen was a Direct RPC to the project and the files were removed. So the logic is there, but it assumes that every project has "resend" turned on. All files, Applications and Work are "refreshed." The good and the bad... Good - If you have tasks that are associated with your computer, they are resent on the next scheduler contact (which should be immediate). 
Bad - If you have work that was waiting to report, it is now burned toast! If resend is turned off, you also have orphans (science that has wasted time and is now waiting to time out). What I would expect to see is the RPC update written to sched_request_projectname.xml. Thus the sequence number would update. Any work waiting to be reported would be reported. Any work yet to be processed would be removed and resent (along with the appropriate applications etc.). Why you have to remove work is still not clear. If the scheduler is set to "not resend" the work, you end up with this: work that is ready to report is reported, and the scheduler looks up the assigned work that is left and sets the flag "aborted." Then it can be sent to machines waiting for work. The problem: Project Down. A scheduler update will not happen, nor will the direct RPC. The files are just removed and not replaced. End of story until a scheduler can be contacted. Regards Please consider a Donation to the Seti Project. |
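The expected reset handling described above could be summarised roughly as follows. This is a sketch of the proposal only, under the assumption that the scheduler knows each task's state at reset time; the function name, the state strings and the outcome labels are all hypothetical.

```python
def plan_reset(tasks, resend_enabled):
    """Map each task name to what should happen on reset (hypothetical)."""
    plan = {}
    for name, state in tasks.items():
        if state == "ready_to_report":
            # Finished work is reported instead of being thrown away.
            plan[name] = "report"
        elif resend_enabled:
            # Unprocessed work is removed locally and resent to the same host.
            plan[name] = "resend"
        else:
            # No resend: flag as aborted so it can go to another machine.
            plan[name] = "abort_and_reissue"
    return plan
```

For example, `plan_reset({"wu_1": "ready_to_report", "wu_2": "unstarted"}, resend_enabled=False)` reports `wu_1` and frees `wu_2` for another host, so no orphans are left waiting to time out.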
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Why you have to remove work is still not clear. +2 (from both hands :D ) [Actually, the reason could be a suspicion that the task data files are corrupted. But in the case of misconfiguration this task discarding looks absolutely unneeded and wasteful to me. And why does BOINC delete ALL app_version files if ONE file is missing? For example, I failed to declare a single *.cl file, and then the executable was gone too. So, after declaring the *.cl file and launching BOINC, I again had a misconfigured setup, because now the primary executable was missing: BOINC had deleted it by itself. For what reason??? And please note that app_info is used; it's Anonymous platform. So BOINC can't expect that the project will just update the application files. It should expect that the operator is responsible for the configuration (as Richard pointed out before). So why the hell is it messing with my config, deleting files I already added??] |
Pappa Send message Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0 |
Why you have to remove work is still not clear. Data file corruption is marbles in a pipe... Normally it ends in a computation error.. Next Please.. So once the data/workunits have landed on the machine and the checksum matches... It should be good unless there is corruption in the underlying file system or what was sent from the server was corrupt to start with. Please consider a Donation to the Seti Project. |
dino Send message Joined: 21 Sep 01 Posts: 11 Credit: 1,048,310 RAC: 0 |
Results ready to send: 54,476 (much more than in recent days). Results received in last hour: 6,037 (about 10% of the normal number). Results returned and awaiting validation: 5,808,203 (much more than normal). Workunits waiting for assimilation: 550,799 (higher than usual). [As of 13 Mar 2010 15:50:19 UTC] I think this is the same router problem we saw at the end of February... Has anyone tried to pathping the router? The server status page shows the same situation as during the last outage... |
dino Send message Joined: 21 Sep 01 Posts: 11 Credit: 1,048,310 RAC: 0 |
Results ready to send: 63,834. Results out in the field: 4,830,980. Results received in last hour: 6,331. Results returned and awaiting validation: 5,791,231. Workunits waiting for assimilation: 540,378. [As of 13 Mar 2010 22:10:12 UTC] I'm sure this is the same network problem... My results do not upload and I can't download new WUs. Try pathping on the router and you will see x% packet loss. |
Lint trap Send message Joined: 30 May 03 Posts: 871 Credit: 28,092,319 RAC: 0 |
The good news is all my Validate Errors have disappeared! Thanks! No uploads from here though, and hence no downloads possible. And pathping is showing packet losses again, about the same as during February's event. Martin |
KB7RZF Send message Joined: 15 Aug 99 Posts: 9549 Credit: 3,308,926 RAC: 2 |
The good news is all my Validate Errors have disappeared! Thanks! I've gotten 3 more downloads, but the 1 I'm trying to upload just refuses. Probably gonna have to wait till Monday when the guys get in and give the servers a swift kick to start the ball rolling again. LOL |