Unexplained cpu errors

Message boards : Number crunching : Unexplained cpu errors


MadMaC
Volunteer tester
Joined: 4 Apr 01
Posts: 201
Credit: 47,158,217
RAC: 0
United Kingdom
Message 1025807 - Posted: 17 Aug 2010, 11:19:39 UTC

Not sure why this is happening...
I can't check the ones in the image because uploads aren't working, but browsing through the tasks for this machine I can see errors stating:

<core_client_version>6.10.56</core_client_version>
<![CDATA[
<message>
- exit code -12 (0xfffffff4)
</message>
<stderr_txt>






</stderr_txt>
]]>
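BOINC prints the exit code as a signed decimal followed by its two's-complement bit pattern in hex, so -12 appears as 0xfffffff4 when shown 32 bits wide; the same encoding at 64 bits turns -177 into 0xffffffffffffff4f. A minimal sketch of that conversion in both directions (the helper names are hypothetical, not part of BOINC):

```python
def exit_code_to_hex(code: int, bits: int = 32) -> str:
    """Render a signed exit code as its unsigned two's-complement hex pattern."""
    mask = (1 << bits) - 1
    return f"0x{code & mask:0{bits // 4}x}"


def hex_to_exit_code(text: str, bits: int = 32) -> int:
    """Recover the signed exit code from a hex pattern of the given width."""
    value = int(text, 16)
    if value >= 1 << (bits - 1):  # top bit set -> negative number
        value -= 1 << bits
    return value


if __name__ == "__main__":
    print(exit_code_to_hex(-12))            # 0xfffffff4
    print(exit_code_to_hex(-177, bits=64))  # 0xffffffffffffff4f
    print(hex_to_exit_code("0xfffffff4"))   # -12
```

This is handy when a stderr dump only shows the hex form and you want to look up the decimal error number.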



The machine in question is

http://setiathome.berkeley.edu/results.php?hostid=5424775

I'm suspending CPU calculations for now, and apologise to my wingmen :-(
Sid
Volunteer tester
Joined: 12 Jun 07
Posts: 16
Credit: 10,968,872
RAC: 0
United States
Message 1025811 - Posted: 17 Aug 2010, 11:37:27 UTC


Any heat issues?
aad
Joined: 3 Apr 99
Posts: 101
Credit: 204,131,099
RAC: 26
Netherlands
Message 1025812 - Posted: 17 Aug 2010, 11:38:25 UTC
Last modified: 17 Aug 2010, 11:43:07 UTC

Have you already tried a reboot?

Or perhaps you have Windows update issues (new drivers)?
geoff
Joined: 25 Apr 00
Posts: 123
Credit: 34,100,351
RAC: 18
United Kingdom
Message 1025819 - Posted: 17 Aug 2010, 11:59:02 UTC

Looking at the error tasks for host 5424775, they are all GPU errors with exit status -177 or -12.
Miep
Volunteer moderator
Joined: 23 Jul 99
Posts: 2412
Credit: 351,996
RAC: 0
Message 1025829 - Posted: 17 Aug 2010, 12:25:09 UTC - in response to Message 1025819.  

"Looking at error tasks for 5424775 they are all GPU errors with exit status of -177 and -12"


He said the tasks in question hadn't been reported yet.

-177 (elapsed time too long) is a consequence of the introduction of server-side DCF (duration correction factor) corrections, and can be prevented by running the new rescheduler to correct the 'maximum allowed elapsed time before abort' values.
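To see why correcting those stored values prevents -177, here is a minimal sketch of the elapsed-time check. It assumes, as in the BOINC client of this era, that the limit is derived as the workunit's `rsc_fpops_bound` divided by the app version's `flops` estimate, and that the task is aborted with "Maximum elapsed time exceeded" once its elapsed time passes that limit. The function names and all numbers are hypothetical, and which stored value the rescheduler actually rewrites is also an assumption here.

```python
# Assumption: limit = rsc_fpops_bound / flops; the client aborts the task
# with -177 ("Maximum elapsed time exceeded") once elapsed time exceeds it.

def max_elapsed_time(rsc_fpops_bound: float, flops: float) -> float:
    """Seconds a task may run before being aborted (assumed formula)."""
    if flops <= 0:
        raise ValueError("flops estimate must be positive")
    return rsc_fpops_bound / flops


def would_abort(elapsed: float, rsc_fpops_bound: float, flops: float) -> bool:
    """True if the elapsed time exceeds the assumed limit."""
    return elapsed > max_elapsed_time(rsc_fpops_bound, flops)


if __name__ == "__main__":
    bound, flops = 1e15, 1e12  # hypothetical values
    print(max_elapsed_time(bound, flops))          # 1000.0
    print(would_abort(1500.0, bound, flops))       # True
    # Raising the stored bound tenfold lifts the limit to 10000 s.
    print(would_abort(1500.0, bound * 10, flops))  # False
```

The point of the sketch: if the flops estimate is too high relative to how fast the task really runs, the limit is too short and a healthy task gets killed; enlarging the bound (or correcting the estimate) restores headroom.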

-12 is a mostly GPU-specific problem that is being worked on at Lunatics.

Was that the complete stderr text?! That shouldn't be empty...

I'd also suggest a reboot and maybe some basic system checks before setting BOINC to run a single task as a test (set No New Tasks, select all CPU tasks, suspend them, then resume one).
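The same one-task test can be scripted with boinccmd, BOINC's command-line tool. A minimal sketch, assuming boinccmd is on the PATH and talking to the local client, and that you have already picked out the CPU task names (for example from `boinccmd --get_tasks`). The project URL constant and the task names are placeholders; by default the script only prints the commands instead of running them.

```python
import subprocess

# Assumed project URL; the task names used below are hypothetical.
PROJECT_URL = "http://setiathome.berkeley.edu/"


def one_task_test_commands(cpu_tasks: list[str], keep: str) -> list[list[str]]:
    """Build boinccmd calls: No New Tasks, suspend every CPU task, resume one."""
    if keep not in cpu_tasks:
        raise ValueError(f"{keep!r} is not in the task list")
    cmds = [["boinccmd", "--project", PROJECT_URL, "nomorework"]]
    for name in cpu_tasks:
        cmds.append(["boinccmd", "--task", PROJECT_URL, name, "suspend"])
    cmds.append(["boinccmd", "--task", PROJECT_URL, keep, "resume"])
    return cmds


def run(cmds: list[list[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    tasks = ["example_task_a_0", "example_task_b_1"]  # hypothetical names
    run(one_task_test_commands(tasks, keep="example_task_a_0"))
```

Suspending everything first and then resuming only the chosen task mirrors the manual steps, so exactly one CPU task is left running.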
Carola
-------
I'm multilingual - I can misunderstand people in several languages!
MadMaC
Volunteer tester
Joined: 4 Apr 01
Posts: 201
Credit: 47,158,217
RAC: 0
United Kingdom
Message 1025868 - Posted: 17 Aug 2010, 14:51:19 UTC

Thanks for the input. I don't think it was heat-related, as the rig is in an air-conditioned room.
I have rebooted the box and will see how it goes running one task.
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1025876 - Posted: 17 Aug 2010, 15:18:28 UTC - in response to Message 1025868.  

@Miep (Carola): you posted just ahead of me, but you're right, most of them are -177 errors on the GPU.


Name: 23my10aa.8366.12746.14.10.159_1
Workunit: 643374638
Created: 13 Aug 2010 22:49:58 UTC
Sent: 14 Aug 2010 0:33:09 UTC
Received: 16 Aug 2010 8:30:43 UTC
Server state: Over
Outcome: Client error
Client state: Compute error
Exit status: -177 (0xffffffffffffff4f)
Computer ID: 5424775
Report deadline: 27 Aug 2010 18:49:49 UTC
Run time: 1,583.59 s
CPU time: 560.84 s
Validate state: Invalid
Credit: 0.00
Application version: SETI@home Enhanced, Anonymous platform (NVIDIA GPU)
Stderr output:

<core_client_version>6.10.56</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>
s = 'bf54a347'
00533c54 bf54aa46 bf54adc5 bec27b55 bec29295 bec2a9d4 AK_v8b_win_SSE3_AMD!+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = 'bf54a6c6'


All my GPU errors were heat-related, so I took some serious action ;-) and turned my host with the ATI HD5770 & 4850 on its side, lying flat like an 'old-fashioned PC'. But it did work!

I'm going to run my QX9650 host without a case, because everything got so hot that I burned my fingers taking out a hard disk! Both cards (let alone more cards) produce so much heat that the 2 intake and 2 exhaust fans are simply not enough.
And overclocking can be a real killer, too.
Just my own observations on my own rigs, plus what I keep reading about CUDA/CAL errors on GPUs on these forums.



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.