-177 (0xffffffffffffff4f) Faults


Geek@Play
Project donor
Volunteer tester
Joined: 31 Jul 01
Posts: 2466
Credit: 85,734,269
RAC: 27,569
United States
Message 1005703 - Posted: 18 Jun 2010, 13:49:16 UTC
Last modified: 18 Jun 2010, 13:55:03 UTC

Overnight this computer downloaded 15 version 603 MB work units. All errored out with the same fault code of -177 (0xffffffffffffff4f). I have now removed version 603 MB from my app_info so that no more will be requested. I would be happy to add the 603 info back into the app_info if anyone has any troubleshooting they want to do.

Of the 15 errored out work units, 10 failed at a run time of 0.00. 3 errored with a run time of exactly 1,533.89. 1 errored out at exactly 1,534.45 and 1 errored out at exactly 1,534.22 seconds.

This all concerns this computer only. I am convinced that this is NOT a problem on my end. 3 errors at exactly the same time and 10 errors at exactly the same time? Not a hardware problem here. And again this computer was successfully crunching 603 work units before the server changes earlier this week.

Again, if anyone at Berkeley is interested, I'm still here. I have copies of the errors that report to Microsoft available if interested.

[edit] More info here from yesterday.
____________
Boinc....Boinc....Boinc....Boinc....

Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 357,953
RAC: 37
Germany
Message 1005709 - Posted: 18 Jun 2010, 14:06:42 UTC - in response to Message 1005703.

Of the 15 errored out work units, 10 failed at a run time of 0.00. 3 errored with a run time of exactly 1,533.89. 1 errored out at exactly 1,534.45 and 1 errored out at exactly 1,534.22 seconds.

Pretty strange to get "Maximum elapsed time exceeded" in such a short run time. Perhaps you should download a new copy of the 6.03 exe.
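For anyone puzzled by the two forms of the code in the thread title: the long hex value appears to be nothing more than -177 written as an unsigned 64-bit number (two's complement). A minimal Python sketch of that conversion, where `as_unsigned_hex` is a helper name made up here for illustration:

```python
def as_unsigned_hex(value, bits=64):
    """Render a signed integer as the hex of its two's-complement bit pattern."""
    mask = (1 << bits) - 1          # e.g. 0xFFFF...FFFF for 64 bits
    return hex(value & mask)        # masking wraps negative values around


# The exit code reported in this thread.
print(as_unsigned_hex(-177))
# The same code viewed as a 32-bit value, for comparison.
print(as_unsigned_hex(-177, bits=32))
```

So the two numbers in the error report are the same exit code, not two separate faults.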

FWIW, did you check your app_info.xml for <flops> statements, and what is your DCF on that host?

Gruß,
Gundolf
____________
Computer sind nicht alles im Leben. (Kleiner Scherz)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours

Geek@Play
Project donor
Volunteer tester
Joined: 31 Jul 01
Posts: 2466
Credit: 85,734,269
RAC: 27,569
United States
Message 1005718 - Posted: 18 Jun 2010, 14:17:18 UTC
Last modified: 18 Jun 2010, 14:18:30 UTC

This is a completely new install from yesterday evening. There are no flops entries in the app_info file. Yes, I copied a new exe into the work directory, but the errors are still coming. It crunches AP just fine.

[edit] DCF now is .9598
____________
Boinc....Boinc....Boinc....Boinc....

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8465
Credit: 48,910,753
RAC: 76,227
United Kingdom
Message 1005722 - Posted: 18 Jun 2010, 14:19:54 UTC

Not the application.

This is eerily reminiscent of a sequence which took place at SETI Beta during testing, reported to the boinc_alpha buglist.

"Richard Haselgrove" wrote:
I got a block of SETI Beta tasks yesterday which all errored out with

01-Jun-2010 00:31:11 [SETI@home Beta Test] Aborting task 18dc09aa.22310.12337.5.13.19_1: exceeded elapsed time limit 0.000000

(details followed)

"David Anderson" wrote:
I fixed the problem. As Richard suggested,
it affected only jobs that were being resent to a client
using anonymous platform.
-- David

"Richard Haselgrove" wrote:
David got this sorted off-list yesterday - according to changeset 21671, there was a problem with resent tasks on anonymous platform.

Except - today, I'm getting the same problem on tasks which are *not* resent (but are anonymous platform - host 23491)

Tasks show zero time 'To completion' in BOINC Manager, and have the same

<rsc_fpops_est>0.000000</rsc_fpops_est>
<rsc_fpops_bound>0.000000</rsc_fpops_bound>

"David Anderson" wrote:
possibly fixed now.
-- David

"Richard Haselgrove" wrote:
Yes, tasks issued since around 19:30 UTC today have had plausible (if to my eye slightly low) fpops_est values


So, Beta testing can put a bandaid over some problems. But it sounds as if either (a) the bandaid hasn't been copied to the main project, or (b) yet a third variant of the problem has surfaced. But I'm pretty sure it's a server problem - check those <rsc_fpops_bound> values to be certain.
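One way to do that check without reading the file by eye is a short script. This is only a sketch: it assumes client_state.xml parses as ordinary XML and that each `<workunit>` element carries `<name>` and `<rsc_fpops_bound>` children. The sample text and the work unit names in it are invented for illustration, and `suspicious_workunits` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Invented sample in the general shape of client_state.xml (not real data).
SAMPLE = """<client_state>
<workunit>
    <name>example_wu_ok</name>
    <rsc_fpops_est>100000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>1000000000000000.000000</rsc_fpops_bound>
</workunit>
<workunit>
    <name>example_wu_bad</name>
    <rsc_fpops_est>0.000000</rsc_fpops_est>
    <rsc_fpops_bound>0.000000</rsc_fpops_bound>
</workunit>
</client_state>"""


def suspicious_workunits(xml_text, min_bound=1.0):
    """Return (name, bound) for work units whose bound is below min_bound."""
    root = ET.fromstring(xml_text)
    flagged = []
    for wu in root.iter("workunit"):
        name = wu.findtext("name", default="(unnamed)")
        bound = float(wu.findtext("rsc_fpops_bound", default="0"))
        if bound < min_bound:
            flagged.append((name, bound))
    return flagged


print(suspicious_workunits(SAMPLE))
```

To run it against a real host, read the file contents into a string and pass that in place of `SAMPLE`. Work on a copy of the file rather than the live one.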

I've got to set off on a 3-hour cross-country drive now, and I don't have the time or enough detail to report it. But if people can check and post their findings while I'm en route, I'll check in once I've arrived and found a computer to fire up.

Geek@Play
Project donor
Volunteer tester
Joined: 31 Jul 01
Posts: 2466
Credit: 85,734,269
RAC: 27,569
United States
Message 1005728 - Posted: 18 Jun 2010, 14:25:25 UTC - in response to Message 1005722.

I can't find any <rsc_fpops_bound> with a 603 app since I don't have any at the moment. Will reenable and report as soon as I can catch a 603.

____________
Boinc....Boinc....Boinc....Boinc....

Geek@Play
Project donor
Volunteer tester
Joined: 31 Jul 01
Posts: 2466
Credit: 85,734,269
RAC: 27,569
United States
Message 1005743 - Posted: 18 Jun 2010, 15:07:45 UTC
Last modified: 18 Jun 2010, 15:21:18 UTC

Just found this computer had the same fault on June 15 on nearly 40 work units.

This seems to be a serious problem, among many problems after the server upgrade.

[edit]

Here is the info from the client_state for some of the 603 work units waiting to crunch.

<rsc_fpops_bound>61671868056682.000000</rsc_fpops_bound>
<rsc_fpops_bound>211269334891737.310000</rsc_fpops_bound>
<rsc_fpops_bound>61671868056682.000000</rsc_fpops_bound>
<rsc_fpops_bound>211445572728336.590000</rsc_fpops_bound>
<rsc_fpops_bound>211445572728336.590000</rsc_fpops_bound>
<rsc_fpops_bound>211445572728336.590000</rsc_fpops_bound>
<rsc_fpops_bound>211405094592406.970000</rsc_fpops_bound>
<rsc_fpops_bound>211405094592406.970000</rsc_fpops_bound>
<rsc_fpops_bound>211405094592406.970000</rsc_fpops_bound>
<rsc_fpops_bound>211405094592406.970000</rsc_fpops_bound>
<rsc_fpops_bound>211405094592406.970000</rsc_fpops_bound>

I can send the entire client_state.xml file if anyone is interested.
____________
Boinc....Boinc....Boinc....Boinc....

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8465
Credit: 48,910,753
RAC: 76,227
United Kingdom
Message 1005797 - Posted: 18 Jun 2010, 17:29:30 UTC - in response to Message 1005743.

Here is the info from the client_state for some of the 603 work units waiting to crunch.

Those ones are OK (by eye - not got a calculator out to check). The "smoking gun" would be to find one which displayed a zero, or ridiculously low, "To completion" time in Task Manager before it even started, then check the <rsc_fpops_bound> for that exact task.
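For those who do want to get the calculator out, here is a rough sketch. It assumes the client derives its elapsed time limit as rsc_fpops_bound divided by the speed (flops) it believes the application runs at. That is my reading of the "exceeded elapsed time limit" message, not something confirmed in this thread. The `ASSUMED_FLOPS` value is purely hypothetical, so substitute whatever your own host reports.

```python
def elapsed_limit_seconds(rsc_fpops_bound, flops):
    """Seconds a task may run before being aborted, assuming limit = bound / flops."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    return rsc_fpops_bound / flops


# Distinct bound values posted earlier in this thread.
BOUNDS = [
    61671868056682.0,
    211269334891737.31,
    211445572728336.59,
    211405094592406.97,
]

ASSUMED_FLOPS = 3.0e9  # hypothetical speed estimate, not taken from any real host

for bound in BOUNDS:
    limit = elapsed_limit_seconds(bound, ASSUMED_FLOPS)
    print(f"bound={bound:.0f}  limit={limit / 3600:.2f} hours")
```

A bound of zero gives a limit of zero seconds under this assumption, which would match tasks failing at a run time of 0.00.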

I can send the entire client_state.xml file if anyone is interested.

No thanks, I'm on a rustic computer, on the end of a rustic telephone line, in the depths of the country. Back to full equipment on Monday, if it's still a problem then.

Geek@Play
Project donor
Volunteer tester
Joined: 31 Jul 01
Posts: 2466
Credit: 85,734,269
RAC: 27,569
United States
Message 1005799 - Posted: 18 Jun 2010, 17:31:02 UTC - in response to Message 1005797.

Have a good weekend Richard. Thanks for your help.
____________
Boinc....Boinc....Boinc....Boinc....

Dave Lewis
Joined: 12 Apr 99
Posts: 14
Credit: 19,327,883
RAC: 8,080
United States
Message 1005808 - Posted: 18 Jun 2010, 17:53:06 UTC

My computer also generated a ton of these errors, but they involve v6.08 cuda WUs. I hadn't checked for a week or so and I was really surprised to find this. I had to detach in early May because of a failing hard drive, and when I set my system up again with stock software everything looked to be working okay until I noticed the errors just now. Any suggestions appreciated.
____________

Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 357,953
RAC: 37
Germany
Message 1005812 - Posted: 18 Jun 2010, 18:00:19 UTC - in response to Message 1005808.

Any suggestions appreciated.

See Richard Haselgrove's posts in this same thread.

Gruß,
Gundolf

BilBg
Volunteer tester
Joined: 27 May 07
Posts: 2642
Credit: 6,000,259
RAC: 4,280
Bulgaria
Message 1005842 - Posted: 18 Jun 2010, 19:29:34 UTC - in response to Message 1005743.
Last modified: 18 Jun 2010, 19:39:50 UTC


I'm now on my K6-2+ / 524 MHz computer
It has 4 tasks, uses opt app

Last download:
17-Jun-2010 14:44:02 [SETI@home] Finished download of 18mr10aa.31979.20517.15.10.129

I see that rsc_fpops_bound is exactly 10 times the rsc_fpops_est:

<rsc_fpops_est>159697111958278.000000</rsc_fpops_est>
<rsc_fpops_bound>1596971119582780.000000</rsc_fpops_bound>

<rsc_fpops_est>160566093629472.000000</rsc_fpops_est>
<rsc_fpops_bound>1605660936294720.000000</rsc_fpops_bound>

<rsc_fpops_est>163980079177073.000000</rsc_fpops_est>
<rsc_fpops_bound>1639800791770730.000000</rsc_fpops_bound>

<rsc_fpops_est>161006148967213.000000</rsc_fpops_est>
<rsc_fpops_bound>1610061489672130.000000</rsc_fpops_bound>

<app_info>
    <app>
        <name>setiathome_enhanced</name>
    </app>
    <file_info>
        <name>KWSN_2.4V_MMX_MB.exe</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>setiathome_enhanced</app_name>
        <version_num>528</version_num>
        <file_ref>
            <file_name>KWSN_2.4V_MMX_MB.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>
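The "exactly 10 times" observation can be confirmed mechanically. A small sketch using the four estimate/bound pairs from this post, written as integers so there is no floating-point rounding (the name `PAIRS` is just for illustration):

```python
# (rsc_fpops_est, rsc_fpops_bound) pairs copied from the post above.
PAIRS = [
    (159697111958278, 1596971119582780),
    (160566093629472, 1605660936294720),
    (163980079177073, 1639800791770730),
    (161006148967213, 1610061489672130),
]

for est, bound in PAIRS:
    # Integer comparison, so "exactly 10 times" really means exactly.
    print(est, bound, bound == 10 * est)
```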

____________



- ALF - "Find out what you don't do well ..... then don't do it!" :)

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4230
Credit: 1,042,833
RAC: 320
United States
Message 1005859 - Posted: 18 Jun 2010, 20:16:36 UTC - in response to Message 1005842.

...
I see that rsc_fpops_bound is exactly 10 times the rsc_fpops_est
...

That's true. The splitter uses that ratio, and BOINC's attempt to do per-application estimates server-side adjusts both by the same amount.

However, in the core client the Duration Correction Factor is applied only to estimates, not the bound. With almost all hosts having DCF around 0.2 the effective bound was about 50 times the estimate. With the server-side adjustment effectively trying to drive DCF toward 1.0, the allowance should reduce to around 10. If BOINC's adjustments were reasonably accurate that would be adequate, but at least for now they are not that accurate.

Because the bound is merely intended to keep a corrupted task or application from running forever, increasing it to give your host time to do the work won't harm the project in any way. More info in my post in Some changes made to this recent BOINC update BUT.
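The arithmetic behind the "about 50 times" and "around 10" figures can be sketched as follows, assuming (as described above) that DCF scales only the estimate, so the effective allowance is the bound-to-estimate ratio divided by DCF. The function name is invented for illustration, and the 7.8 is a DCF value reported elsewhere in this thread.

```python
def effective_allowance(bound_over_est, dcf):
    """How many times the DCF-corrected estimate a task may run before the bound trips."""
    if dcf <= 0:
        raise ValueError("DCF must be positive")
    return bound_over_est / dcf


SPLITTER_RATIO = 10.0  # bound is 10 times the estimate, per this thread

for dcf in (0.2, 1.0, 7.8):
    allowance = effective_allowance(SPLITTER_RATIO, dcf)
    print(f"DCF={dcf}: allowance is about {allowance:.2f}x the corrected estimate")
```

On this reading, a host whose DCF has climbed well above 1.0 has very little headroom left, which would fit tasks being cut off shortly before they would have finished.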
Joe

Terror Australis
Volunteer tester
Joined: 14 Feb 04
Posts: 1711
Credit: 204,371,690
RAC: 25,221
Australia
Message 1006098 - Posted: 19 Jun 2010, 8:13:20 UTC

Just got one of these myself. Task 1638222570, if anyone wants to look at the details. DCF was reading 7.8. I've reset it back to 1.

I've still got the <flops> values in the app_info file for this box. What's the current thinking on this: are flops values in or out?

And yes the protection is working, Max tasks for Anon Platform, nvidia GPU has been reset back to 99 from 208. :-P

Brodo

Geek@Play
Project donor
Volunteer tester
Joined: 31 Jul 01
Posts: 2466
Credit: 85,734,269
RAC: 27,569
United States
Message 1006770 - Posted: 21 Jun 2010, 3:19:55 UTC

I received 11 more errors today, this time on cuda work, just seconds before they would have ended normally. Example here.

Must be many other crunchers in this predicament.
____________
Boinc....Boinc....Boinc....Boinc....

hiamps
Volunteer tester
Joined: 23 May 99
Posts: 4292
Credit: 72,971,319
RAC: 0
United States
Message 1006777 - Posted: 21 Jun 2010, 3:51:51 UTC - in response to Message 1006770.

Yep, same here.
____________
Official Abuser of Boinc Buttons...
And no good credit hound!

RottenMutt
Joined: 15 Mar 01
Posts: 992
Credit: 207,654,737
RAC: 1
United States
Message 1006789 - Posted: 21 Jun 2010, 4:49:19 UTC - in response to Message 1006777.

ditto

running AK_v8b_win_x64_SSE41
____________

Geek@Play
Project donor
Volunteer tester
Joined: 31 Jul 01
Posts: 2466
Credit: 85,734,269
RAC: 27,569
United States
Message 1006794 - Posted: 21 Jun 2010, 5:36:56 UTC - in response to Message 1006789.

ditto

running AK_v8b_win_x64_SSE41


You still have MB (non-cuda) work??? I have not received one all weekend. Nothing except cuda and AP work, and not one VLAR that could be rescheduled to the CPUs.

Looks like they cherry picked the tapes (disks) for the weekend to minimize problems.

____________
Boinc....Boinc....Boinc....Boinc....

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8465
Credit: 48,910,753
RAC: 76,227
United Kingdom
Message 1007034 - Posted: 21 Jun 2010, 19:57:43 UTC

Finally catching up after my weekend away. Looks like my Fermi had about 30 of these: All tasks for computer 4292666 - all being stopped in their tracks at around 5 minutes.

I had flops correction in place at the time, but no effective VLAR catcher: now I've swapped that over - no flops entry in app_info, and ReSchedule installed. All I need now are some tasks to try it out - stuffed full of Beta at the moment.

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,274
RAC: 0
Korea, North
Message 1007037 - Posted: 21 Jun 2010, 20:02:52 UTC - in response to Message 1007034.

wish I could get beta
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Area 51
Joined: 31 Jan 04
Posts: 965
Credit: 42,193,520
RAC: 0
United Kingdom
Message 1008843 - Posted: 27 Jun 2010, 6:15:03 UTC
Last modified: 27 Jun 2010, 6:15:15 UTC

What is the current thinking on these errors? I have accumulated 118 overnight! I had comms disabled, so I could re-process them. Any thoughts?
____________


Copyright © 2014 University of California