Message boards :
Number crunching :
Request for help- BOINC server software configuration.
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
Rosetta has a problem: if a batch of tasks results in instant errors, those instant bomb-out times are used in the estimated-time-to-completion calculations. End result: systems, even those with small caches, get work they have no chance of finishing. We don't have that happen here at SETI, so I figure the servers here are set to use only Validated task times to determine estimated completion times. Would any of those familiar with BOINC happen to know what is required, and where this can be configured? No luck finding any reference to it here: https://boinc.berkeley.edu/trac/wiki/ProjectOptions Thanks. Edit: looking into things more closely makes it even uglier; it all ties in with Credit New. It looks like all Invalid and Error task completion times are taken into account for estimated-completion-time calculations, but those that are significantly different shouldn't be. Grant Darwin NT
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
The keyword to look for is "runtime outlier". We did have exactly this problem at SETI around 2011, and we pressurised David Anderson to implement a fix. It's done in the validator (which of course is project-specific code): in SETI's case, we look for the overflow marker ("SETI@Home Informational message -9 result_overflow NOTE: The number of results detected equals the storage space allocated.") in MB tasks, and the percentage of radar blanking in AP tasks. Tell them to look at https://boinc.berkeley.edu/trac/wiki/ValidationSimple#Runtimeoutliers
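The mechanism described above can be sketched in miniature. This is a hypothetical Python illustration of the logic only, not BOINC's actual C++ validator code; the `OVERFLOW_MARKER` test stands in for SETI's project-specific overflow check, and all field names are invented for the sketch:

```python
# Hypothetical sketch of the "runtime outlier" idea: the (project-specific)
# validator flags results whose runtime is unrepresentative, and flagged
# results are then excluded from the averages that drive completion-time
# estimates. Field names and the marker string are illustrative only.

OVERFLOW_MARKER = "result_overflow"  # stand-in for SETI's -9 overflow message

def flag_runtime_outlier(result):
    """Project-specific test; here: an overflowed task ended early,
    so its runtime says nothing about normal task duration."""
    result["runtime_outlier"] = OVERFLOW_MARKER in result["stderr"]
    return result

def average_runtime(results):
    """Only valid, non-outlier results feed the runtime estimate."""
    usable = [r["runtime"] for r in results
              if r["valid"] and not r["runtime_outlier"]]
    return sum(usable) / len(usable) if usable else None

results = [flag_runtime_outlier(r) for r in [
    {"runtime": 25000, "valid": True, "stderr": "completed normally"},
    {"runtime": 30,    "valid": True, "stderr": "result_overflow NOTE: ..."},
    {"runtime": 27000, "valid": True, "stderr": "completed normally"},
]]
print(average_runtime(results))  # the 30 s overflow result is ignored
```

The key point is that the flag is set per result, by project code, which is why a stock BOINC server without such a check lets the short runtimes through.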
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
Richard Haselgrove wrote: The keyword to look for is "runtime outlier". We did have exactly this problem at SETI around 2011, and we pressurised David Anderson to implement a fix. It's done in the validator (which of course is project-specific code): in SETI's case, we look for the overflow marker
Excellent. Thank you. Grant Darwin NT
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
Of course it still doesn't help that Rosetta has a replication of just 1 task per workunit. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Jord wrote: Of course it still doesn't help that Rosetta has a replication of just 1 task per workunit.
I think even with replication 1, projects still need a validator - it can act as a sanity check that the result file is properly formatted and complete. All they need is to design a rule to sort the sheep from the goats.
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
I know that; it's just that with a replication of more than 1 task per WU, if something goes wrong with a (batch of) task(s), it's easier to check whether it's a bad host or a bad (batch of) task(s).
Kissagogo27 Send message Joined: 6 Nov 99 Posts: 716 Credit: 8,032,827 RAC: 62 |
After reading this: https://ralph.bakerlab.org/forum_thread.php?id=84 each WU we process is a part of a research run... perhaps each WU we process is half-crossed with others?
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
Jord wrote: Of course it still doesn't help that Rosetta has a replication of just 1 task per workunit.
I thought the same thing, considering some of the noise that systems have pumped out over the years here at SETI, but my understanding is that each task is seeded with a random number when it starts, so even for a task that is resent, even to the same computer, the result produced would be different. So comparing two results from the same task/WU wouldn't work. I guess they find out just how valid it is when they use it in their actual models. Grant Darwin NT
rob smith Send message Joined: 7 Mar 03 Posts: 22532 Credit: 416,307,556 RAC: 380 |
Rosetta jobs are nothing like those of SETI. As I understand it, each Rosetta task follows on from its predecessor and calculates for a set amount of time; the endpoint of the task is thus determined not by reaching a particular conclusion but by reaching a specified duration, regardless of what progress towards the final end-point has been made in that time. Further, there is a degree of checking within the task to ensure that it is being executed correctly. The fact that the tasks are run-time dependent, not goal dependent, makes validation of an individual task almost impossible. The next task for that job picks up where its predecessor left off and carries on for its duration, and so on until the job is complete. SETI, however, had discrete work units, each of which was evaluated against a set of goals by a pair of independent computers. Within limits, the tasks ran for as long as each computer required to finish. Once both computers finished, their results were then compared. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe?
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
I suppose it depends on what the exact problem is. Grant wrote: Rosetta has a problem- if a batch of tasks results in instant errors, those instant bomb out times are used in the Estimated time to completion calculations.
If that was an 'error while computing' on the volunteer's computer, the server code should discard the runtime for the task, with no effect on future estimates. But if it was a fubar in creating the workunit, and the volunteer successfully completed calculating nop 1000 times, or whatever, then that's a problem. LHC has a similar problem with its SixTrack application. The job is to simulate possibly stable, possibly unstable, orbits round the LHC. The unstable orbits are the target for elimination: a very short unstable run is a valid and significant data point. But it messes up BOINC's estimates. LHC have requested a change to the server code, but I don't understand what they're asking. They want to move the 'runtime outlier' check to the WU - they have normal replication-of-two validation - whereas outliers are strictly a host-result-level problem. It doesn't affect wingmates - their estimates are based on their own runtimes.
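The distinction drawn above can be shown in a small sketch (hypothetical Python with simplified outcome codes, not BOINC's real enums or server code): a result that errored out while computing should be filtered on its outcome alone, before any runtime statistics are touched, whereas a malformed WU that "succeeds" in seconds slips past that filter and needs the outlier check instead.

```python
# Hypothetical sketch: why an outcome filter alone is not enough.
# Outcome strings are simplified stand-ins for BOINC's real codes.
SUCCESS, COMPUTE_ERROR = "success", "compute_error"

def feeds_runtime_stats(result):
    """A compute error is discarded on outcome alone; a fast 'success'
    from a malformed WU passes, and only the outlier flag catches it."""
    if result["outcome"] != SUCCESS:
        return False                      # error while computing: ignore
    return not result.get("runtime_outlier", False)

crashed = {"outcome": COMPUTE_ERROR, "runtime": 30}
bad_wu  = {"outcome": SUCCESS, "runtime": 30}   # fubar'd WU that "ran ok"
flagged = {"outcome": SUCCESS, "runtime": 30, "runtime_outlier": True}

print(feeds_runtime_stats(crashed))  # False
print(feeds_runtime_stats(bad_wu))   # True - this is the problem case
print(feeds_runtime_stats(flagged))  # False - outlier check catches it
```

The middle case is the LHC/SixTrack situation: a genuinely short, successful run that is a valid data point for the science but poison for the runtime estimates.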
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
Richard Haselgrove wrote: I suppose it depends on what the exact problem is.
The error was in the WU creation, in that the application doesn't recognise the task as a valid format, so it comes up as a computation error:
Outcome: Computation error
Client state: Compute error
ERROR: Cannot determine file type. Current supported types are: PDB, CIF, SRLZ, MMTF
ERROR:: Exit from: ..\..\..\src\core\import_pose\import_pose.cc line: 380
BOINC:: Error reading and gzipping output datafile: default.out
Run time was 40 sec or less, against a target CPU time of 8 hrs.
I had a look at job runtime estimation, which ties in closely with Credit New. Because Rosetta tasks run for a fixed period of time (selectable from 2 hrs to 36 hrs) and have a 4 hr grace period after which a watchdog timer will end the task, as near as I could figure out, the extremely large wu.fpops_bound value (necessary for the 4 hour watchdog cutoff) appears to break the sanity check, so extremely short completion times (i.e. tasks erroring out in seconds), even those that are an Error, are included in estimated-completion-time calculations instead of being excluded. Someone also mentioned (though I haven't checked for myself) that wu.fpops_est is set at 80,000 GFLOPs regardless of whether the target CPU runtime is 2 hrs or 36 hrs. That would probably explain the APR values for their applications (along with the huge variability of granted credit), and also means that the wu.fpops_bound values would have to be truly huge to allow for 2 hr to 36 hr tasks with a 4 hr extension allowed. Grant Darwin NT
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
But this:
Outcome: Computation error
Client state: Compute error
should trump everything else. It should mean the result goes nowhere near the runtime estimation code.
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
Richard Haselgrove wrote: But this: should trump everything else.
*shrug* I'd have thought so too - that only Valid tasks are used for runtime-estimation calculations. But it was a widespread problem; I had about 5 or 6 of those tasks on my systems and they errored out in 20-40 sec. The next batch of new work I got for that application had an estimated completion time of around 39 min (I'm pretty sure the estimated time before that was around 7 hours or so). Grant Darwin NT
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
It's up to the system admins now. Either they disabled the 'failed task' check when they wrote their bespoke target time code, or there's a bug in Credit New (surprise!) which they'll have to take up with David. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
Richard Haselgrove wrote: It's up to the system admins now. Either they disabled the 'failed task' check when they wrote their bespoke target time code, or there's a bug in Credit New (surprise!) which they'll have to take up with David.
I forwarded your link and sent my WAG (Wild Arse Guess) relating to bound values etc. So it's all in their court now. Grant Darwin NT
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Now that you've clarified the error status, the link is probably redundant - although it points to an alternate solution. The other issue - fixed fpops_est for different duration tasks - is a weakness I've seen at other projects too, but it's a second-order problem - without the errors, the project should survive. It would help if they checked their WUs before sending them out, but that's probably too much to hope for. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.