-177 (0xffffffffffffff4f) Faults

Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 1005703 - Posted: 18 Jun 2010, 13:49:16 UTC
Last modified: 18 Jun 2010, 13:55:03 UTC

Overnight this computer downloaded 15 version 603 MB work units. All errored out with the same fault code, -177 (0xffffffffffffff4f). I have now removed version 603 MB from my app_info so that no more will be requested. I would be happy to add the 603 info back into the app_info if anyone wants to do some troubleshooting.
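A side note on the two spellings of the fault code: 0xffffffffffffff4f is simply -177 written out as an unsigned 64-bit two's-complement value. A minimal Python sketch of the conversion follows; the helper names are made up for illustration and are not part of BOINC.

```python
MASK64 = (1 << 64) - 1  # all 64 bits set


def to_signed64(value: int) -> int:
    """Interpret a 64-bit unsigned integer as a two's-complement signed value."""
    value &= MASK64
    # If the top bit is set, the value represents a negative number.
    return value - (1 << 64) if value >> 63 else value


def to_hex64(value: int) -> str:
    """Render a (possibly negative) integer as 64-bit two's-complement hex."""
    return f"0x{value & MASK64:016x}"


print(to_signed64(0xFFFFFFFFFFFFFF4F))  # -177
print(to_hex64(-177))                   # 0xffffffffffffff4f
```

So both forms in the thread title refer to the same exit status.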

Of the 15 errored out work units, 10 failed at a run time of 0.00. 3 errored with a run time of exactly 1,533.89. 1 errored out at exactly 1,534.45 and 1 errored out at exactly 1,534.22 seconds.

This all concerns this computer only. I am convinced that this is NOT a problem on my end. 3 errors at exactly the same time and 10 errors at exactly the same time? Not a hardware problem here. And again this computer was successfully crunching 603 work units before the server changes earlier this week.

Again, if anyone at Berkeley is interested, I'm still here. I have copies of the errors that get reported to Microsoft available if anyone is interested.

[edit] More info here from yesterday.
Boinc....Boinc....Boinc....Boinc....
ID: 1005703 · Report as offensive
Profile Gundolf Jahn

Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 1005709 - Posted: 18 Jun 2010, 14:06:42 UTC - in response to Message 1005703.  

Of the 15 errored out work units, 10 failed at a run time of 0.00. 3 errored with a run time of exactly 1,533.89. 1 errored out at exactly 1,534.45 and 1 errored out at exactly 1,534.22 seconds.

Pretty strange to get "Maximum elapsed time exceeded" in such a short run time. Perhaps you should download a new copy of the 6.03 exe.

FWIW, did you check your app_info.xml for <flops> statements, and what is your DCF on that host?

Gruß,
Gundolf
Computer sind nicht alles im Leben. (Kleiner Scherz)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours
ID: 1005709 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 1005718 - Posted: 18 Jun 2010, 14:17:18 UTC
Last modified: 18 Jun 2010, 14:18:30 UTC

This is a completely new install from yesterday evening. There are no flops entries in the app_info file. Yes, I copied a new exe into the work directory, but the errors are still coming. It crunches AP just fine.

[edit] DCF now is .9598
Boinc....Boinc....Boinc....Boinc....
ID: 1005718 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1005722 - Posted: 18 Jun 2010, 14:19:54 UTC

Not the application.

This is eerily reminiscent of a sequence which took place at SETI Beta during testing, reported to the boinc_alpha buglist.

"Richard Haselgrove" wrote:
I got a block of SETI Beta tasks yesterday which all errored out with

01-Jun-2010 00:31:11 [SETI@home Beta Test] Aborting task 18dc09aa.22310.12337.5.13.19_1: exceeded elapsed time limit 0.000000

(details followed)

"David Anderson" wrote:
I fixed the problem. As Richard suggested,
it affected only jobs that were being resent to a client
using anonymous platform.
-- David

"Richard Haselgrove" wrote:
David got this sorted off-list yesterday - according to changeset 21671, there was a problem with resent tasks on anonymous platform.

Except - today, I'm getting the same problem on tasks which are *not* resent (but are anonymous platform - host 23491)

Tasks show zero 'To completion' time in BOINC Manager, and have the same

<rsc_fpops_est>0.000000</rsc_fpops_est>
<rsc_fpops_bound>0.000000</rsc_fpops_bound>

"David Anderson" wrote:
possibly fixed now.
-- David

"Richard Haselgrove" wrote:
Yes, tasks issued since around 19:30 UTC today have had plausible (if to my eye slightly low) fpops_est values


So, Beta testing can put a bandaid over some problems. But it sounds as if either (a) the bandaid hasn't been copied to the main project, or (b) yet a third variant of the problem has surfaced. But I'm pretty sure it's a server problem - check those <rsc_fpops_bound> values to be certain.
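To illustrate why a zero <rsc_fpops_bound> would kill a task the moment it starts: as far as I understand it, the client's elapsed time limit is roughly the bound divided by the speed (flops) it assumes for the application version. That formula, the function name, and the 40 GFLOPS figure below are all assumptions for illustration, not something copied from the BOINC source.

```python
def elapsed_time_limit(rsc_fpops_bound: float, flops: float) -> float:
    """Seconds a task may run before being aborted (assumed formula: bound / flops)."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    return rsc_fpops_bound / flops


ASSUMED_FLOPS = 4.0e10  # hypothetical speed estimate, purely illustrative

# A zero bound gives a zero limit, matching "exceeded elapsed time limit 0.000000".
print(f"{elapsed_time_limit(0.0, ASSUMED_FLOPS):.6f}")     # 0.000000

# A plausible (hypothetical) bound gives a limit well above zero.
print(f"{elapsed_time_limit(2.0e14, ASSUMED_FLOPS):.2f}")  # 5000.00
```

If that reading is right, a limit of exactly 0.000000 in the log points at the bound sent by the server rather than at the host.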

I've got to set off for a 3-hour cross-country drive now, and I don't have the time or enough detail to report it now. But if people can check and post their findings while I'm en-route, I'll check in once I've arrived and found a computer to fire up.
ID: 1005722 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 1005728 - Posted: 18 Jun 2010, 14:25:25 UTC - in response to Message 1005722.  

I can't find any <rsc_fpops_bound> with a 603 app since I don't have any at the moment. Will reenable and report as soon as I can catch a 603.

Boinc....Boinc....Boinc....Boinc....
ID: 1005728 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 1005743 - Posted: 18 Jun 2010, 15:07:45 UTC
Last modified: 18 Jun 2010, 15:21:18 UTC

Just found this computer had the same fault on June 15 on nearly 40 work units.

This seems to be a serious problem, among many problems after the server upgrade.

[edit]

Here is the info from the client_state for some of the 603 work units waiting to crunch.

<rsc_fpops_bound>61671868056682.000000</rsc_fpops_bound>
<rsc_fpops_bound>211269334891737.310000</rsc_fpops_bound>
<rsc_fpops_bound>61671868056682.000000</rsc_fpops_bound>
<rsc_fpops_bound>211445572728336.590000</rsc_fpops_bound>
<rsc_fpops_bound>211445572728336.590000</rsc_fpops_bound>
<rsc_fpops_bound>211445572728336.590000</rsc_fpops_bound>
<rsc_fpops_bound>211405094592406.970000</rsc_fpops_bound>
<rsc_fpops_bound>211405094592406.970000</rsc_fpops_bound>
<rsc_fpops_bound>211405094592406.970000</rsc_fpops_bound>
<rsc_fpops_bound>211405094592406.970000</rsc_fpops_bound>
<rsc_fpops_bound>211405094592406.970000</rsc_fpops_bound>

I can send the entire client_state.xml file if anyone is interested.
Boinc....Boinc....Boinc....Boinc....
ID: 1005743 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1005797 - Posted: 18 Jun 2010, 17:29:30 UTC - in response to Message 1005743.  

Here is the info from the client_state for some of the 603 work units waiting to crunch.

Those ones are OK (by eye - not got a calculator out to check). The "smoking gun" would be to find one which displayed a zero, or ridiculously low, "To completion" time in Task Manager before it even started, then check the <rsc_fpops_bound> for that exact task.
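For anyone who would rather not eyeball the file, here is a rough Python sketch that flags work units with a zero or implausibly low <rsc_fpops_bound>. It assumes client_state.xml holds <workunit> elements with <name> and <rsc_fpops_bound> children; the sample data, the function name, and the 1e12 threshold are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for client_state.xml; names are hypothetical, the first bound
# is one of the values posted earlier in this thread.
SAMPLE = """<client_state>
  <workunit>
    <name>example_wu_ok</name>
    <rsc_fpops_bound>61671868056682.000000</rsc_fpops_bound>
  </workunit>
  <workunit>
    <name>example_wu_bad</name>
    <rsc_fpops_bound>0.000000</rsc_fpops_bound>
  </workunit>
</client_state>"""


def suspicious_workunits(xml_text, min_bound=1.0e12):
    """Return (name, bound) for work units whose bound is below min_bound."""
    root = ET.fromstring(xml_text)
    flagged = []
    for wu in root.iter("workunit"):
        name = wu.findtext("name", default="(unnamed)")
        # A missing tag is treated as zero so that it gets flagged too.
        bound = float(wu.findtext("rsc_fpops_bound", default="0"))
        if bound < min_bound:
            flagged.append((name, bound))
    return flagged


for name, bound in suspicious_workunits(SAMPLE):
    print(f"{name}: rsc_fpops_bound={bound:.6f}")
```

To try it on a real host, read the file's text and pass it in instead of SAMPLE; the threshold would need tuning to whatever counts as "ridiculously low" for the application.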

I can send the entire client_state.xml file if anyone is interested.

No thanks, I'm on a rustic computer, on the end of a rustic telephone line, in the depths of the country. Back to full equipment on Monday, if it's still a problem then.
ID: 1005797 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 1005799 - Posted: 18 Jun 2010, 17:31:02 UTC - in response to Message 1005797.  

Have a good weekend Richard. Thanks for your help.
Boinc....Boinc....Boinc....Boinc....
ID: 1005799 · Report as offensive
Dave Lewis

Joined: 12 Apr 99
Posts: 34
Credit: 53,432,603
RAC: 108
United States
Message 1005808 - Posted: 18 Jun 2010, 17:53:06 UTC

My computer also generated a ton of these errors, but they involve v6.08 cuda WUs. I hadn't checked for a week or so and was really surprised to find this. I had to detach in early May because of a failing hard drive, and when I set my system up again with stock software everything looked to be working okay until I noticed the errors just now. Any suggestions appreciated.
ID: 1005808 · Report as offensive
Profile Gundolf Jahn

Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 1005812 - Posted: 18 Jun 2010, 18:00:19 UTC - in response to Message 1005808.  

Any suggestions appreciated.

See Richard Haselgrove's posts in this same thread.

Gruß,
Gundolf
ID: 1005812 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1005842 - Posted: 18 Jun 2010, 19:29:34 UTC - in response to Message 1005743.  
Last modified: 18 Jun 2010, 19:39:50 UTC


I'm now on my K6-2+ / 524 MHz computer
It has 4 tasks, uses opt app

Last download:
17-Jun-2010 14:44:02 [SETI@home] Finished download of 18mr10aa.31979.20517.15.10.129

I see that rsc_fpops_bound is exactly 10 times the rsc_fpops_est:
      <rsc_fpops_est>159697111958278.000000</rsc_fpops_est>
    <rsc_fpops_bound>1596971119582780.000000</rsc_fpops_bound>

      <rsc_fpops_est>160566093629472.000000</rsc_fpops_est>
    <rsc_fpops_bound>1605660936294720.000000</rsc_fpops_bound>

      <rsc_fpops_est>163980079177073.000000</rsc_fpops_est>
    <rsc_fpops_bound>1639800791770730.000000</rsc_fpops_bound>

      <rsc_fpops_est>161006148967213.000000</rsc_fpops_est>
    <rsc_fpops_bound>1610061489672130.000000</rsc_fpops_bound>
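A quick way to confirm the factor of 10 without a calculator, using the four pairs quoted above (the variable and function names are just for illustration):

```python
# (rsc_fpops_est, rsc_fpops_bound) pairs copied from the four tasks above.
PAIRS = [
    (159697111958278.0, 1596971119582780.0),
    (160566093629472.0, 1605660936294720.0),
    (163980079177073.0, 1639800791770730.0),
    (161006148967213.0, 1610061489672130.0),
]


def bound_to_est_ratio(est: float, bound: float) -> float:
    """Ratio of the abort bound to the estimate for one work unit."""
    return bound / est


for est, bound in PAIRS:
    print(bound_to_est_ratio(est, bound))
```

Each line prints 10.0 for these tasks.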


<app_info>

    <app>
        <name>setiathome_enhanced</name>
    </app>

    <file_info>
        <name>KWSN_2.4V_MMX_MB.exe</name>
        <executable/>
    </file_info>

    <app_version>
        <app_name>setiathome_enhanced</app_name>
        <version_num>528</version_num>
        <file_ref>
            <file_name>KWSN_2.4V_MMX_MB.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>

</app_info>


 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1005842 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1005859 - Posted: 18 Jun 2010, 20:16:36 UTC - in response to Message 1005842.  

...
I see that rsc_fpops_bound is exactly 10 times the rsc_fpops_est
...

That's true: the splitter uses that ratio, and BOINC's attempt to do per-application estimates server-side adjusts both by the same amount.

However, in the core client the Duration Correction Factor is applied only to estimates, not the bound. With almost all hosts having DCF around 0.2 the effective bound was about 50 times the estimate. With the server-side adjustment effectively trying to drive DCF toward 1.0, the allowance should reduce to around 10. If BOINC's adjustments were reasonably accurate that would be adequate, but at least for now they are not that accurate.
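The arithmetic behind those figures can be sketched in a few lines of Python. The function name is made up; the only inputs are the splitter's 10:1 ratio and the DCF values mentioned above, and the sketch assumes DCF simply multiplies the estimate while leaving the bound alone.

```python
def effective_allowance(dcf: float, bound_to_est_ratio: float = 10.0) -> float:
    """How many times the DCF-corrected estimate a task may run before the bound hits."""
    if dcf <= 0:
        raise ValueError("DCF must be positive")
    # The bound is not scaled by DCF, but the estimate is, so the ratio grows as DCF shrinks.
    return bound_to_est_ratio / dcf


print(effective_allowance(0.2))  # about 50 times the corrected estimate
print(effective_allowance(1.0))  # about 10 times the corrected estimate
```

The margin shrinks by a factor of five as DCF moves from 0.2 to 1.0, which is why less accurate server-side estimates start to matter.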

Because the bound is merely intended to keep a corrupted task or application from running forever, increasing it to give your host time to do the work won't harm the project in any way. More info in my post in Some changes made to this recent BOINC update BUT.
                                                               Joe
ID: 1005859 · Report as offensive
Terror Australis
Volunteer tester

Joined: 14 Feb 04
Posts: 1817
Credit: 262,693,308
RAC: 44
Australia
Message 1006098 - Posted: 19 Jun 2010, 8:13:20 UTC

Just got one of these myself. Task 1638222570, if anyone wants to look at the details. DCF was reading 7.8. I've reset it back to 1.

I've still got the <flops> values in the app_info file for this box. What's the current thinking on this: are flops values in or out?

And yes the protection is working, Max tasks for Anon Platform, nvidia GPU has been reset back to 99 from 208. :-P

Brodo
ID: 1006098 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 1006770 - Posted: 21 Jun 2010, 3:19:55 UTC

I received 11 more errors today, this time on cuda work, just seconds before they would have ended normally. Example here.

Must be many other crunchers in this predicament.
Boinc....Boinc....Boinc....Boinc....
ID: 1006770 · Report as offensive
Profile hiamps
Volunteer tester
Joined: 23 May 99
Posts: 4292
Credit: 72,971,319
RAC: 0
United States
Message 1006777 - Posted: 21 Jun 2010, 3:51:51 UTC - in response to Message 1006770.  

Yep, same here.
Official Abuser of Boinc Buttons...
And no good credit hound!
ID: 1006777 · Report as offensive
Profile RottenMutt
Joined: 15 Mar 01
Posts: 1011
Credit: 230,314,058
RAC: 0
United States
Message 1006789 - Posted: 21 Jun 2010, 4:49:19 UTC - in response to Message 1006777.  

ditto

running AK_v8b_win_x64_SSE41
ID: 1006789 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 1006794 - Posted: 21 Jun 2010, 5:36:56 UTC - in response to Message 1006789.  

ditto

running AK_v8b_win_x64_SSE41


You still have MB (non-cuda) work? I have not received any all weekend. Nothing except cuda and AP work all weekend, and not one VLAR that could be rescheduled to the CPUs.

Looks like they cherry picked the tapes (disks) for the weekend to minimize problems.

Boinc....Boinc....Boinc....Boinc....
ID: 1006794 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1007034 - Posted: 21 Jun 2010, 19:57:43 UTC

Finally catching up after my weekend away. Looks like my Fermi had about 30 of these: All tasks for computer 4292666 - all being stopped in their tracks at around 5 minutes.

I had flops correction in place at the time, but no effective VLAR catcher: now I've swapped that over - no flops entry in app_info, and ReSchedule installed. All I need now are some tasks to try it out - stuffed full of Beta at the moment.
ID: 1007034 · Report as offensive
Profile skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1007037 - Posted: 21 Jun 2010, 20:02:52 UTC - in response to Message 1007034.  

wish I could get beta


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1007037 · Report as offensive
Profile Area 51
Joined: 31 Jan 04
Posts: 965
Credit: 42,193,520
RAC: 0
United Kingdom
Message 1008843 - Posted: 27 Jun 2010, 6:15:03 UTC
Last modified: 27 Jun 2010, 6:15:15 UTC

What is the current thinking on these errors? I have accumulated 118 overnight! I had comms disabled so I could re-process them. Any thoughts?
ID: 1008843 · Report as offensive
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.