Problems...

Message boards : Number crunching : Problems...
Profile Lint trap

Joined: 30 May 03
Posts: 871
Credit: 28,092,319
RAC: 0
United States
Message 977505 - Posted: 12 Mar 2010, 1:11:06 UTC - in response to Message 977380.  
Last modified: 12 Mar 2010, 1:11:46 UTC

As you get new work, then there is a correct entry that links everything together (in all the tables).


I had another validation error today. The wingman and I both got the WU on the 10th. The wingman returned first, and I got a validation error when I reported mine.

@Joe Segur; No offense meant to anyone! I was only referring to the two persons who were in discussion at the time, not an all-inclusive we.

Martin
ID: 977505 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 977514 - Posted: 12 Mar 2010, 2:06:49 UTC - in response to Message 977498.  

Tasks were discarded by BOINC client

I thought you said Due to app_info misconfiguration.

That's not BOINC client, that's operator error (no criticism - we all do it).

If you break it, you own both parts - until you fix it. BOINC doesn't supply its own glue - anyone who ventures down the anonymous platform route is well advised to heed the warnings 'advanced users only', and learn the manual recovery procedures from their manual experiments.


No-no-no, I didn't say (I hope) that discarding the tasks was BOINC's fault.
app_info was incomplete (btw, is there any reason for the double mention of all files in the heading section? I mean, why should each file_ref have a corresponding file_info?), so discarding was legal. What I'm trying to say is: why didn't BOINC, knowing that it had just discarded the tasks, report this fact back to the server as computation errors?

And yes, before resetting I did a few project updates (each time receiving no tasks; BOINC just refused to ask for new work until I did a project reset, another strange behavior BTW).

Tasks are usually discarded due to "operator error", as Richard said, so BOINC's behavior in such a situation is probably not well tested (who would deliberately repeat such an experiment ;) )...


Raistmer

This is a BOINC core issue that has long been overlooked or ignored. As I think about it, the server code should say: if "Resend" is turned off, then any WUs assigned to the machine should be aborted by the server when a Reset occurs. This would resolve the issue.

Regards


Please consider a Donation to the Seti Project.

ID: 977514 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 977515 - Posted: 12 Mar 2010, 2:09:06 UTC - in response to Message 977380.  

In a comment from Eric: as we get past what everyone had on their machines (during the outage) and get it reported, the validate errors should stop happening. Self-healing.
As you get new work, then there is a correct entry that links everything together (in all the tables).

In the meantime Eric periodically runs the credit script that accounts for the missing credits. How they will pick the "canonical result" is up to them.

Patience and keep returning results.

Regards


But I have a "validate error" on a WU that was sent to me on 10 March. Methinks there is something else afoot here - not "self-healing".

And welcome back (again)...

F.


Fred et al

I am at a loss for what to say. I would guess there is a reason for no Tech News updates.

Regards

Please consider a Donation to the Seti Project.

ID: 977515 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 977638 - Posted: 12 Mar 2010, 10:37:38 UTC - in response to Message 977514.  

This is a BOINC core issue that has long been overlooked or ignored. As I think about it, the server code should say: if "Resend" is turned off, then any WUs assigned to the machine should be aborted by the server when a Reset occurs. This would resolve the issue.

The server can only mark them as disposed of if the client actively sends a message telling it about the reset. That means that the reset button has a double action:

1) Update project: send scheduler request, wait for ack, retry as necessary, etc.
2) Clean up project files.

What happens if you're trying to reset because the project files have got screwed up, and it can't communicate with the server? Or the server's down - does the reset button just hang until the project comes back up?

Perhaps the best that could be done would be to pop up a question:

You still have tasks for this project - do you want to attempt a project update before resetting?

Choosing 'Yes' would cancel the reset, and send an 'Update' instruction to the core client instead. Choosing 'No' would do a forced reset as now, so you had an escape route if it's all gone completely pear-shaped.
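
A minimal sketch of the prompt logic suggested above, assuming hypothetical names throughout (`Action`, `on_reset_clicked` and both parameters are illustrations, not BOINC client code):

```python
from enum import Enum


class Action(Enum):
    UPDATE = "update"              # contact the scheduler, leave files alone
    FORCED_RESET = "forced_reset"  # clean up project files, no server contact


def on_reset_clicked(tasks_on_host, try_update_first):
    """Decide what the reset button does (hypothetical logic)."""
    if tasks_on_host == 0:
        # Nothing would be lost, so there is nothing to ask about.
        return Action.FORCED_RESET
    if try_update_first:
        # User chose 'Yes': cancel the reset and send an update instead.
        return Action.UPDATE
    # User chose 'No': escape route, reset exactly as the button does now.
    return Action.FORCED_RESET
```

The point of the design is that the forced path never waits on the server, so a dead project or a broken installation cannot hang the button.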
ID: 977638 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 977801 - Posted: 12 Mar 2010, 18:26:37 UTC - in response to Message 977638.  

Richard, I did a project update before the reset.
Neither action, update or reset, "freed" the discarded tasks. So now they are listed on the web site but physically absent from my host. The only way to get rid of them (besides a project detach, probably) is to wait until the deadline.
IMO, the BOINC core client could give the server a hint (on update or on reset) that those tasks are lost and should be sent to another host.
ID: 977801 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 977812 - Posted: 12 Mar 2010, 18:39:30 UTC - in response to Message 977801.  

Richard, I did a project update before the reset.
Neither action, update or reset, "freed" the discarded tasks. So now they are listed on the web site but physically absent from my host. The only way to get rid of them (besides a project detach, probably) is to wait until the deadline.
IMO, the BOINC core client could give the server a hint (on update or on reset) that those tasks are lost and should be sent to another host.

Well, it would have to be on update (communication with server): reset would be reserved for the situation where no communication is possible (ultimate big red....)

What was the state of the tasks in the interval between the finger-fumble and the reset?
ID: 977812 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 977818 - Posted: 12 Mar 2010, 19:20:46 UTC - in response to Message 977812.  

On the host side they just disappeared, with a message in the log like "no app for 603 version, task discarded".
On the server side (web page) they still remain in the "green" state, as "ghosts".
ID: 977818 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 977871 - Posted: 12 Mar 2010, 21:52:03 UTC - in response to Message 977818.  

On the host side they just disappeared, with a message in the log like "no app for 603 version, task discarded".
On the server side (web page) they still remain in the "green" state, as "ghosts".

Doesn't that mean we just have to generalise changeset 19235?

if a RESULT uses an app version that is missing [a coprocessor], abort it (rather than deleting it).
The client will report the result on the next scheduler RPC, and the server will make a new instance.
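
To make the difference between deleting and aborting concrete, here is a toy model. It is a sketch only: `scheduler_rpc`, the task names and the state strings are hypothetical and do not come from the BOINC sources.

```python
def scheduler_rpc(server_in_progress, client_tasks):
    """Simulate one scheduler RPC.

    The client reports every task it holds in the 'aborted' state and then
    forgets it. Returns the tasks the server still counts as in progress.
    """
    reported = {name for name, state in client_tasks.items() if state == "aborted"}
    for name in reported:
        del client_tasks[name]
    return server_in_progress - reported


server = {"wu_1", "wu_2"}

# Case 1: wu_1 was silently deleted, so the client has nothing to say about it.
after_delete = scheduler_rpc(server, {"wu_2": "ready"})

# Case 2: wu_1 was aborted instead, so it is reported and the server can reissue it.
after_abort = scheduler_rpc(server, {"wu_1": "aborted", "wu_2": "ready"})
```

In case 1 the server still lists `wu_1` although the host no longer has it, which is the "ghost" described earlier; in case 2 only `wu_2` remains in progress.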
ID: 977871 · Report as offensive
Profile Leopoldo
Volunteer tester
Joined: 4 Aug 99
Posts: 102
Credit: 3,051,091
RAC: 0
Russia
Message 977882 - Posted: 12 Mar 2010, 22:48:04 UTC - in response to Message 977498.  
Last modified: 12 Mar 2010, 22:50:32 UTC

(btw, is there any reason for the double mention of all files in the heading section? I mean, why should each file_ref have a corresponding file_info?)

my interpretation is:

  • files specified in <file_info> sections (right after the <app>) are declarations of all files which may be needed by any version of the specified app (i.e. for a file existence check);
  • and files mentioned in the <file_ref> sections of one <app_version> are the files which must be copied to the corresponding slot, along with the workunit file and the chosen version of the app, for future processing.
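
As an illustration of that reading, the sketch below checks that every <file_ref> has a matching <file_info>. The XML fragment is a made-up example: the file names and `example_app` are hypothetical, and the element names beyond those mentioned in this thread follow the usual app_info.xml layout as far as I know it. `undeclared_files` is a helper written for this post, not part of BOINC.

```python
import xml.etree.ElementTree as ET

# Hypothetical app_info.xml content; kernels.cl is referenced but never declared.
APP_INFO = """
<app_info>
  <app><name>example_app</name></app>
  <file_info><name>example_app.exe</name><executable/></file_info>
  <app_version>
    <app_name>example_app</app_name>
    <version_num>603</version_num>
    <file_ref><file_name>example_app.exe</file_name><main_program/></file_ref>
    <file_ref><file_name>kernels.cl</file_name></file_ref>
  </app_version>
</app_info>
"""


def undeclared_files(app_info_xml):
    """Return file_ref names that have no corresponding file_info entry."""
    root = ET.fromstring(app_info_xml)
    declared = {fi.findtext("name", "").strip() for fi in root.iter("file_info")}
    referenced = [fr.findtext("file_name", "").strip() for fr in root.iter("file_ref")]
    return [name for name in referenced if name not in declared]


missing = undeclared_files(APP_INFO)  # ['kernels.cl']
```

Running a check like this before starting the client would catch the kind of omission discussed in this thread without risking any tasks.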

ID: 977882 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 977889 - Posted: 12 Mar 2010, 23:14:08 UTC - in response to Message 977882.  
Last modified: 12 Mar 2010, 23:19:17 UTC

Yes, it's declaration-like stuff. But I'm not sure such a declaration is needed in app_info.
The file isn't long enough to need separate declaration/definition stuff. BOINC can easily infer all the needed files from parsing the app_version sections.
ID: 977889 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 977890 - Posted: 12 Mar 2010, 23:17:02 UTC - in response to Message 977871.  

On the host side they just disappeared, with a message in the log like "no app for 603 version, task discarded".
On the server side (web page) they still remain in the "green" state, as "ghosts".

Doesn't that mean we just have to generalise changeset 19235?

if a RESULT uses an app version that is missing [a coprocessor], abort it (rather than deleting it).
The client will report the result on the next scheduler RPC, and the server will make a new instance.


But only not for the co-processor case. GPUs can be swapped, and it's better not to trash these tasks. Maybe just the reverse should be done: don't trash tasks at all, just mark them as "no needed app binary exists" or something like that, as it currently does with a missing GPU. Then the user gets a chance to repair a possible error, and if not, the tasks will eventually just hit the deadline. Currently they hit the deadline anyway, but with no option left to repair the situation.
ID: 977890 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 977927 - Posted: 13 Mar 2010, 1:25:56 UTC - in response to Message 977638.  

This is a BOINC core issue that has long been overlooked or ignored. As I think about it, the server code should say: if "Resend" is turned off, then any WUs assigned to the machine should be aborted by the server when a Reset occurs. This would resolve the issue.

The server can only mark them as disposed of if the client actively sends a message telling it about the reset. That means that the reset button has a double action:

1) Update project: send scheduler request, wait for ack, retry as necessary, etc.
2) Clean up project files.

What happens if you're trying to reset because the project files have got screwed up, and it can't communicate with the server? Or the server's down - does the reset button just hang until the project comes back up?

Perhaps the best that could be done would be to pop up a question:

You still have tasks for this project - do you want to attempt a project update before resetting?

Choosing 'Yes' would cancel the reset, and send an 'Update' instruction to the core client instead. Choosing 'No' would do a forced reset as now, so you had an escape route if it's all gone completely pear-shaped.


Sorry went to bed early and have been running.

Reset: BOINC informs you that a project reset might be required! The reason may be that one of the project application files is suspect (open in active memory) and causing errors. When the "Reset" is done, all files are removed as suspect... Another reason is that a file that is open in memory has become corrupt; a reset then refreshes the application files and reopens the project application. There is still a bit of conflict about the true purpose of the Reset command. My thinking is that if the applications (and support files) open and running in memory are suspected to be the problem, then a reset should only touch those applications and support files, leaving work alone...

As I sat down, I picked a project that I can reset on this machine. It had no work so it does not matter.

1. Disable Network Activity.
2. Open sched_request_projectname.xml
3. Open sched_reply_projectname.xml
4. With Explorer open, Reset the project.

As I use Ultraedit for my editor, if a file changes (while open) I am notified of a change to the file and asked to reload it... What I saw in Explorer is that files were removed from the project. The message log (6.10.36) states

* Resetting Project.
* Resuming Network Activity.

End of story...

Neither the scheduler request file nor the reply file was updated. The <rpc_seqno> did not update. What did happen was a direct RPC to the project, and the files were removed.

So the logic is there, but it assumes that every project has "resend" turned on. All files, applications and work are "refreshed."

The good and the bad...
Good - If you have tasks that are associated with your computer they are resent on the next scheduler contact (which should be immediate).

Bad - If you have work that was waiting to report, it is now burned toast! If resend is turned off, You also have Orphans (Science that has wasted time and is now waiting to time out).

What I would expect to see is the RPC update written to sched_request_projectname.xml, so that the sequence number updates. Any work waiting to be reported would be reported. Any work yet to be processed would be removed and resent (along with the appropriate applications etc.). Why you have to remove work is still not clear.
If the scheduler is set to "not resend" the work, you end up with the work that is ready to report being reported, and the scheduler looks up the assigned work that is left and sets the flag "aborted." Then it can be sent to machines waiting for work.

The problem: project down. A scheduler update will not happen, nor will the direct RPC. The files are just removed and not replaced. End of story until a scheduler can be contacted.
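
The server-side handling expected above could be sketched roughly like this. It is an assumption-laden illustration: `handle_reset`, its parameters and the outcome strings are hypothetical names, not BOINC scheduler code, and it only covers the case where the scheduler can actually be reached.

```python
def handle_reset(assigned, reported_by_client, resend_enabled):
    """Return the server-side outcome for each task assigned to a host that resets."""
    outcome = {}
    for task in assigned:
        if task in reported_by_client:
            # Finished work that was waiting to report is kept, not burned.
            outcome[task] = "reported"
        elif resend_enabled:
            # Project resends lost work to the same host.
            outcome[task] = "resend_to_host"
        else:
            # No resend: flag it so it can go to another machine right away.
            outcome[task] = "aborted"
    return outcome


no_resend = handle_reset(["wu_a", "wu_b"], {"wu_a"}, resend_enabled=False)
with_resend = handle_reset(["wu_a", "wu_b"], {"wu_a"}, resend_enabled=True)
```

With resend off, `wu_b` ends up "aborted" instead of sitting as an orphan until its deadline; with resend on, it goes back to the same host.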

Regards



Please consider a Donation to the Seti Project.

ID: 977927 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 978128 - Posted: 13 Mar 2010, 10:04:39 UTC - in response to Message 977927.  
Last modified: 13 Mar 2010, 10:15:33 UTC

Why you have to remove work is still not clear.


+2 (from both hands :D )

[Actually, the reason could be a suspicion that the task data files are corrupted. But in a case of misconfiguration this task discarding looks absolutely unneeded and wasteful to me. And why does BOINC delete ALL app_version files if ONE file is missing?
For example, I failed to declare a single *.cl file, and then the executable was gone too. So, after declaring the *.cl file and launching BOINC, I again had a misconfigured setup, because now the primary executable was missing; BOINC had deleted it by itself. For what reason??? And please note that app_info is in use; it's the anonymous platform. So BOINC can't expect that the project will just update the application files. It should expect that the operator is responsible for the configuration (as Richard pointed out before). So why the hell is it messing with my config, deleting files I had already added??]
ID: 978128 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 978240 - Posted: 13 Mar 2010, 16:55:57 UTC - in response to Message 978128.  

Why you have to remove work is still not clear.


+2 (from both hands :D )

[Actually, the reason could be a suspicion that the task data files are corrupted. But in a case of misconfiguration this task discarding looks absolutely unneeded and wasteful to me. And why does BOINC delete ALL app_version files if ONE file is missing?
For example, I failed to declare a single *.cl file, and then the executable was gone too. So, after declaring the *.cl file and launching BOINC, I again had a misconfigured setup, because now the primary executable was missing; BOINC had deleted it by itself. For what reason??? And please note that app_info is in use; it's the anonymous platform. So BOINC can't expect that the project will just update the application files. It should expect that the operator is responsible for the configuration (as Richard pointed out before). So why the hell is it messing with my config, deleting files I had already added??]


Data file corruption is marbles in a pipe... Normally it ends in a computation error.. Next Please.. So once the data/workunits have landed on the machine and the checksum matches... It should be good unless there is corruption in the underlying file system or what was sent from the server was corrupt to start with.



Please consider a Donation to the Seti Project.

ID: 978240 · Report as offensive
Profile Julie
Volunteer moderator
Volunteer tester
Joined: 28 Oct 09
Posts: 34041
Credit: 18,883,157
RAC: 18
Belgium
Message 978275 - Posted: 13 Mar 2010, 18:15:22 UTC
Last modified: 13 Mar 2010, 18:15:46 UTC

I think the upload server is out of order again...
rOZZ
Music
Pictures
ID: 978275 · Report as offensive
dino

Joined: 21 Sep 01
Posts: 11
Credit: 1,048,310
RAC: 0
Italy
Message 978370 - Posted: 13 Mar 2010, 22:08:22 UTC - in response to Message 978275.  

Results ready to send 54,476 (much more than in the last few days)
Results received in last hour 6,037 (about 10% of the normal number)
Results returned and awaiting validation 5,808,203 (much more than the normal number)
Workunits waiting for assimilation 550,799 (higher than usual)

[As of 13 Mar 2010 15:50:19 UTC]

I think this is the same router problem we saw at the end of February...
Has anyone tried to pathping the router?

The server status page shows the same situation as during the last outage...

ID: 978370 · Report as offensive
dino

Joined: 21 Sep 01
Posts: 11
Credit: 1,048,310
RAC: 0
Italy
Message 978377 - Posted: 13 Mar 2010, 22:18:05 UTC - in response to Message 978370.  
Last modified: 13 Mar 2010, 22:22:11 UTC

Results ready to send 63,834
Results out in the field 4,830,980
Results received in last hour 6,331
Results returned and awaiting validation 5,791,231
Workunits waiting for assimilation 540,378

[As of 13 Mar 2010 22:10:12 UTC]

I'm sure this is the same network problem...
My results do not upload and I can't download new WUs.

Try pathping on the router and you will see x% packet loss.
ID: 978377 · Report as offensive
Profile Julie
Volunteer moderator
Volunteer tester
Joined: 28 Oct 09
Posts: 34041
Credit: 18,883,157
RAC: 18
Belgium
Message 978402 - Posted: 14 Mar 2010, 0:25:42 UTC

I can't upload but I do get WU's though, received about 45 WU's this evening on 1 computer
rOZZ
Music
Pictures
ID: 978402 · Report as offensive
Profile Lint trap

Joined: 30 May 03
Posts: 871
Credit: 28,092,319
RAC: 0
United States
Message 978415 - Posted: 14 Mar 2010, 1:06:59 UTC
Last modified: 14 Mar 2010, 1:07:40 UTC

The good news is all my Validate Errors have disappeared! Thanks!

No uploads from here though, and hence no downloads possible.

and Pathping is showing packet losses again, about same as during Feb's event.

Martin
ID: 978415 · Report as offensive
KB7RZF
Volunteer tester
Joined: 15 Aug 99
Posts: 9549
Credit: 3,308,926
RAC: 2
United States
Message 978428 - Posted: 14 Mar 2010, 1:43:35 UTC - in response to Message 978415.  

The good news is all my Validate Errors have disappeared! Thanks!

No uploads from here though, and hence no downloads possible.

and Pathping is showing packet losses again, about same as during Feb's event.

Martin

I've gotten 3 more downloads, but the 1 I'm trying to upload just refuses. Probably gonna have to wait till Monday when the guys get in and give the servers a swift kick to start the ball rolling again. LOL
ID: 978428 · Report as offensive
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.