Strange Invalid MB Overflow tasks with truncated Stderr outputs...

Message boards : Number crunching : Strange Invalid MB Overflow tasks with truncated Stderr outputs...

TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1461332 - Posted: 6 Jan 2014, 23:46:24 UTC
Last modified: 6 Jan 2014, 23:55:23 UTC

Seems I've received another one. The last one was a week or two ago and, as I remember, it was the same: the Stderr output just stops, and the task receives an immediate Invalid. Since the task is so short, nothing is really lost. It's just puzzling what actually happened, since other overflows complete normally, including the overflow immediately preceding the one that failed.

Work Unit Info:
...............
WU true angle range is : 2.684834
re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes

</stderr_txt>
]]>

Run time 12.45
CPU time 11.67
Validate state Invalid
Credit 0.00
Application version SETI@home v7 Anonymous platform (NVIDIA GPU)
ID: 1461332
Batter Up
Joined: 5 May 99
Posts: 1946
Credit: 24,860,347
RAC: 0
United States
Message 1461355 - Posted: 7 Jan 2014, 1:56:49 UTC - in response to Message 1461332.  

I just got this.

Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
]]>

Run time 20.13
CPU time 2.38
Validate state Invalid
Credit 0.00
Application version SETI@home v7 v7.00 (cuda50)

http://setiathome.berkeley.edu/result.php?resultid=3321928417
ID: 1461355
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1461356 - Posted: 7 Jan 2014, 1:58:17 UTC - in response to Message 1461332.  

Looking at the similar wingmen (apps, GPU generation, etc.) processing fine, it seems to point definitely toward something specific to the system. As it's been a while, I don't recall what has been tried so far. On the off chance there is some resolved issue specific to that GPU, and since you're using a new Boinc revision, is there any particular reason for not updating the driver? There can be funky interactions with the way newer Boinc kills apps under some conditions, especially if the driver takes its time cleaning up.
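Purely as a generic illustration of how an abrupt kill can leave a log cut short (not a diagnosis of what Boinc or the driver actually does here), the sketch below has a hypothetical child process write two lines to its own log file, flush only the first, and then exit without normal cleanup. `os._exit()` stands in for a hard kill; the child script and the file name `stderr.txt` are assumptions made for the demo.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical child program: writes two log lines to a file, flushes only
# the first, then exits abruptly. os._exit() skips Python's normal shutdown,
# so data still sitting in the file object's buffer is never written.
CHILD = r"""
import os, sys
log = open(sys.argv[1], "w", buffering=64 * 1024)
log.write("Work Unit Info:\n")
log.flush()                                    # reaches the file
log.write("Thread call stack limit is: 1k\n")  # still buffered
os._exit(0)                                    # stand-in for a hard kill
"""


def run_child_and_read_log() -> str:
    """Run the child in a temporary directory and return what reached its log."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stderr.txt")
        subprocess.run([sys.executable, "-c", CHILD, path], check=True)
        with open(path) as fh:
            return fh.read()


if __name__ == "__main__":
    print(repr(run_child_and_read_log()))
```

Run as written, only the first line survives. Whether the science app's Stderr passes through a buffer like this under Boinc is not established in this thread.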
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1461356
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1461357 - Posted: 7 Jan 2014, 1:59:50 UTC - in response to Message 1461355.  

I just got this.


That one looks like a Boinc bug Claggy was telling me he reported recently ... could be the same thing.
ID: 1461357
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1461367 - Posted: 7 Jan 2014, 2:20:11 UTC - in response to Message 1461356.  

Looking at the similar wingmen (apps, GPU generation, etc.) processing fine, it seems to point definitely toward something specific to the system. As it's been a while, I don't recall what has been tried so far. On the off chance there is some resolved issue specific to that GPU, and since you're using a new Boinc revision, is there any particular reason for not updating the driver? There can be funky interactions with the way newer Boinc kills apps under some conditions, especially if the driver takes its time cleaning up.

I tried 331.82 on my XP Dual core Host and it failed to produce any better CUDA runtimes than 266.58. What 331.82 did accomplish was to make the Host completely unusable when running an AstroPulse on the 8800, whereas there isn't much of a problem when running an AP with 266.58. When you only have a Dual core processor, using half of it when not necessary isn't an option. I had the same results in Windows 8, where 266.58 isn't an option. Running an AP with 331.82 on a Dual core Host makes the Host extremely annoying to use. Definitely something to be avoided when possible.

Since I've been using 266.58 for over a year without this problem, I'm inclined to place the blame elsewhere.
ID: 1461367
Jeff Buck · Crowdfunding Project Donor* · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1461375 - Posted: 7 Jan 2014, 2:40:41 UTC

I seem to get one of these about every week or two, on different machines. They're always WUs where the wingmen get -9 overflows in which the Pulse count is less than 30, but one or more of the other counts brings the total up to 30. Most take only a few seconds to overflow, but some take several minutes. Here's one from last Friday, where the wingmen's counts were 29,0,0,1,0:

Name	12mr13af.14976.20108.438086664199.12.0_1
Workunit	1393362608
Created	3 Jan 2014, 2:37:28 UTC
Sent	3 Jan 2014, 3:08:56 UTC
Received	3 Jan 2014, 19:19:05 UTC
Server state	Over
Outcome	Success
Client state	Done
Exit status	0 (0x0)
Computer ID	6979886
Report deadline	23 Jan 2014, 14:18:38 UTC
Run time	5.23
CPU time	1.40
Validate state	Invalid
Credit	0.00
Application version	SETI@home v7
Anonymous platform (NVIDIA GPU)
Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>

Here's one from a couple weeks ago on a different machine, where the wingmen's counts were 29,0,0,0,1:
Name	09se09af.21444.23789.438086664205.12.22_1
Workunit	1382306633
Created	19 Dec 2013, 10:42:47 UTC
Sent	19 Dec 2013, 16:19:23 UTC
Received	20 Dec 2013, 7:45:44 UTC
Server state	Over
Outcome	Success
Client state	Done
Exit status	0 (0x0)
Computer ID	7057115
Report deadline	10 Feb 2014, 5:46:48 UTC
Run time	1,266.05
CPU time	197.13
Validate state	Initial
Credit	0.00
Application version	SETI@home v7 v7.00 (cuda50)
Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
]]>

And from yet another machine, at about the same time, where the wingmen's counts were 28,0,0,2,0:
Name	02dc13ae.8857.7429.438086664203.12.247_1
Workunit	1382671822
Created	20 Dec 2013, 0:06:32 UTC
Sent	20 Dec 2013, 6:00:57 UTC
Received	20 Dec 2013, 14:33:36 UTC
Server state	Over
Outcome	Success
Client state	Done
Exit status	0 (0x0)
Computer ID	6980751
Report deadline	11 Feb 2014, 4:38:27 UTC
Run time	2,476.34
CPU time	114.64
Validate state	Initial
Credit	0.00
Application version	SETI@home v7 v7.00 (cuda42)
Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 660, 2047 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 5 
     pciBusID = 24, pciSlotID = 0
  Device 2: GeForce GT 640, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 5, pciSlotID = 0
  Device 3: GeForce GT 640, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 69, pciSlotID = 0
  Device 4: GeForce GTX 650, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 88, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 660 is okay
SETI@home using CUDA accelerated device GeForce GTX 660
mbcuda.cfg, processpriority key detected
pulsefind: blocks per SM 4 (Fermi or newer default)
pulsefind: periods per launch 100 (default)
Priority of process set to ABOVE_NORMAL successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 4.20

Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.434356

Kepler GPU current clockRate = 1162 MHz

re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k

</stderr_txt>
]]>

As you can see, sometimes the STDERR is almost completely empty, and other times it shows all the way to that "Thread call stack limit" line. I haven't been able to identify any consistency between the two types, but the end result for both is always an Invalid. Sometimes it's an "immediate" Invalid (as with my first example), and sometimes it doesn't get flagged as Invalid until the first wingman reports (as with the 2nd and 3rd examples).
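To sort a batch of these results into the two types described above, a small script can report where each Stderr output stops. This is a minimal sketch assuming the text has been copied from the task page into a string; the function name `last_stderr_line` is hypothetical, and it only looks at the `<stderr_txt>` block shown in these examples.

```python
import re


def last_stderr_line(stderr_output: str) -> str:
    """Return the last non-blank line inside <stderr_txt>...</stderr_txt>,
    "<empty>" if that block holds only whitespace, or
    "<no stderr_txt block>" if the tags are missing."""
    match = re.search(r"<stderr_txt>(.*?)</stderr_txt>", stderr_output, re.DOTALL)
    if match is None:
        return "<no stderr_txt block>"
    lines = [line.strip() for line in match.group(1).splitlines() if line.strip()]
    return lines[-1] if lines else "<empty>"


# Two shapes seen in this thread: an empty block and one that stops early.
empty = """<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
]]>"""

partial = """<stderr_txt>
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k

</stderr_txt>"""

print(last_stderr_line(empty))    # <empty>
print(last_stderr_line(partial))  # Thread call stack limit is: 1k
```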
ID: 1461375
Jeff Buck · Crowdfunding Project Donor* · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1461378 - Posted: 7 Jan 2014, 2:49:25 UTC

While I'm at it, here's one more, where one wingman got counts of 2,28,0,0,0 and two others got counts of 18,0,0,12,0 to earn the validation:
Name	01dc13ac.14707.15609.438086664195.12.254_1
Workunit	1378744616
Created	14 Dec 2013, 19:23:30 UTC
Sent	14 Dec 2013, 23:29:56 UTC
Received	15 Dec 2013, 4:43:17 UTC
Server state	Over
Outcome	Success
Client state	Done
Exit status	0 (0x0)
Computer ID	6980751
Report deadline	6 Feb 2014, 6:40:09 UTC
Run time	873.69
CPU time	116.70
Validate state	Invalid
Credit	0.00
Application version	SETI@home v7 v7.00 (cuda42)
Stderr output

<core_client_version>7.0.64</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 660, 2047 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 5 
     pciBusID = 24, pciSlotID = 0
  Device 2: GeForce GT 640, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 5, pciSlotID = 0
  Device 3: GeForce GT 640, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 69, pciSlotID = 0
  Device 4: GeForce GTX 650, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 88, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 660 is okay
SETI@home using CUDA accelerated device GeForce GTX 660
mbcuda.cfg, processpriority key detected
pulsefind: blocks per SM 4 (Fermi or newer default)
pulsefind: periods per launch 100 (default)
Priority of process set to ABOVE_NORMAL successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 4.20

Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.426631

Kepler GPU current clockRate = 1162 MHz

re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k

</stderr_txt>
]]>
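The wingmen's counts quoted in these two posts (29,0,0,1,0; 29,0,0,0,1; 28,0,0,2,0; 2,28,0,0,0; 18,0,0,12,0) all follow the pattern described: no single count reaches 30, but the total does. A minimal sketch that checks this, assuming (as the post states) that the overflow triggers when the five counts together reach 30; the function and constant names are hypothetical, and no meaning is assigned to the individual positions.

```python
from typing import Sequence

OVERFLOW_LIMIT = 30  # total reported signals at which the -9 overflow occurs

# Wingmen's counts quoted in this thread.
THREAD_COUNTS = [
    (29, 0, 0, 1, 0),
    (29, 0, 0, 0, 1),
    (28, 0, 0, 2, 0),
    (2, 28, 0, 0, 0),
    (18, 0, 0, 12, 0),
]


def describe_overflow(counts: Sequence[int]) -> str:
    """Classify a result's five signal counts.

    "no overflow"          - the total stays below the limit
    "single-type overflow" - one count alone reaches the limit
    "mixed overflow"       - only the combined total reaches it
    """
    if sum(counts) < OVERFLOW_LIMIT:
        return "no overflow"
    if max(counts) >= OVERFLOW_LIMIT:
        return "single-type overflow"
    return "mixed overflow"


for counts in THREAD_COUNTS:
    print(counts, describe_overflow(counts))  # each prints "mixed overflow"
```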
ID: 1461378
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1461379 - Posted: 7 Jan 2014, 2:50:42 UTC

Here's the emerging pattern:

<core_client_version>7.2.33</core_client_version>

ID: 1461379
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1461380 - Posted: 7 Jan 2014, 2:53:45 UTC - in response to Message 1461367.  
Last modified: 7 Jan 2014, 2:55:13 UTC

Looking at the similar wingmen (apps, GPU generation, etc.) processing fine, it seems to point definitely toward something specific to the system. As it's been a while, I don't recall what has been tried so far. On the off chance there is some resolved issue specific to that GPU, and since you're using a new Boinc revision, is there any particular reason for not updating the driver? There can be funky interactions with the way newer Boinc kills apps under some conditions, especially if the driver takes its time cleaning up.

I tried 331.82 on my XP Dual core Host and it failed to produce any better CUDA runtimes than 266.58. What 331.82 did accomplish was to make the Host completely unusable when running an AstroPulse on the 8800, whereas there isn't much of a problem when running an AP with 266.58. When you only have a Dual core processor, using half of it when not necessary isn't an option. I had the same results in Windows 8, where 266.58 isn't an option. Running an AP with 331.82 on a Dual core Host makes the Host extremely annoying to use. Definitely something to be avoided when possible.

Since I've been using 266.58 for over a year without this problem, I'm inclined to place the blame elsewhere.


Agreed. It's looking like Claggy's Boinc bug reports. [Edit:] As for AP, you might want to enquire about the newer, lower-CPU-usage builds. Not my department, but I understand they should be noticeably better on either the old or newer drivers.
ID: 1461380
Jeff Buck · Crowdfunding Project Donor* · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1461390 - Posted: 7 Jan 2014, 3:29:11 UTC - in response to Message 1461379.  

Here's the emerging pattern:

<core_client_version>7.2.33</core_client_version>

Actually, if you take a look at the additional example I added, I was still on <core_client_version>7.0.64</core_client_version>. In fact, I'd have to check, but I may be able to come up with examples under 7.0.64 going back as far as July or August, although they certainly seem to be getting more frequent lately.
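Testing the client-version theory across many invalids comes down to counting the `core_client_version` tag in each Stderr output. A minimal sketch, assuming the outputs have been collected as strings; the function name is hypothetical, and the sample list uses versions mentioned in this thread for illustration only, not as a real tally.

```python
import re
from collections import Counter
from typing import Iterable

VERSION_RE = re.compile(r"<core_client_version>([^<]+)</core_client_version>")


def tally_client_versions(stderr_outputs: Iterable[str]) -> Counter:
    """Count how many of the given Stderr outputs came from each BOINC client version."""
    counts: Counter = Counter()
    for text in stderr_outputs:
        match = VERSION_RE.search(text)
        counts[match.group(1).strip() if match else "unknown"] += 1
    return counts


# Illustrative input only.
invalids = [
    "<core_client_version>7.2.33</core_client_version>",
    "<core_client_version>7.2.33</core_client_version>",
    "<core_client_version>7.0.64</core_client_version>",
]
print(tally_client_versions(invalids))
```

On its own, a tally of invalids says little; the same count over valid tasks from the same hosts would be needed before reading anything into a version.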
ID: 1461390
Jeff Buck · Crowdfunding Project Donor* · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1461404 - Posted: 7 Jan 2014, 6:13:23 UTC

And another one, just today (first one on this machine since Dec. 20), where both wingmen got counts of 28,0,2,0,0:
Name	16oc13ab.5599.3748.438086664199.12.0_1
Workunit	1396580051
Created	6 Jan 2014, 12:32:13 UTC
Sent	6 Jan 2014, 14:50:55 UTC
Received	6 Jan 2014, 18:06:50 UTC
Server state	Over
Outcome	Success
Client state	Done
Exit status	0 (0x0)
Computer ID	6980751
Report deadline	27 Jan 2014, 2:00:37 UTC
Run time	7.47
CPU time	1.88
Validate state	Invalid
Credit	0.00
Application version	SETI@home v7 v7.00 (cuda50)
Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 660, 2047 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 5 
     pciBusID = 24, pciSlotID = 0
  Device 2: GeForce GT 640, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 5, pciSlotID = 0
  Device 3: GeForce GT 640, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 69, pciSlotID = 0
  Device 4: GeForce GTX 650, 1023 MiB, regsPerBlock 65536
     computeCap 3.0, multiProcs 2 
     pciBusID = 88, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 660 is okay
SETI@home using CUDA accelerated device GeForce GTX 660
mbcuda.cfg, processpriority key detected
pulsefind: blocks per SM 4 (Fermi or newer default)
pulsefind: periods per launch 100 (default)
Priority of process set to ABOVE_NORMAL successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 5.00

Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  2.737595

</stderr_txt>
]]>
ID: 1461404
Richard Haselgrove · Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1461594 - Posted: 7 Jan 2014, 23:30:31 UTC

I can't comment on the truncated stderr_txt problems, but the tasks with status 'validate error' seem to have been server problems (probably a bad volume mount between the validate server and the upload storage area). Tasks of mine which were showing 'validate error' before maintenance are now showing 'valid'.
ID: 1461594
Batter Up
Joined: 5 May 99
Posts: 1946
Credit: 24,860,347
RAC: 0
United States
Message 1462023 - Posted: 9 Jan 2014, 5:59:50 UTC

Something is still not right.

I just got a bunch of time exceeded with a report date of tomorrow.

Task,3322128229
WU,1397035059
Sent 7 Jan 2014, 1:27:22 UTC
Due 9 Jan 2014, 2:36:52 UTC
Timed out - no response 0.00 0.00 --- SETI@home v7 v7.00
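The Sent and Due times above show how short the deadline on these tasks was. A minimal sketch that computes the window, assuming the timestamp format shown on the task page; the helper name `parse_utc` is hypothetical.

```python
from datetime import datetime, timezone

FMT = "%d %b %Y, %H:%M:%S"  # matches stamps like "7 Jan 2014, 1:27:22"


def parse_utc(stamp: str) -> datetime:
    """Parse a task-page timestamp such as '7 Jan 2014, 1:27:22 UTC'."""
    return datetime.strptime(stamp.removesuffix(" UTC"), FMT).replace(tzinfo=timezone.utc)


sent = parse_utc("7 Jan 2014, 1:27:22 UTC")
due = parse_utc("9 Jan 2014, 2:36:52 UTC")
print(due - sent)  # 2 days, 1:09:30
```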
ID: 1462023
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1462026 - Posted: 9 Jan 2014, 6:07:09 UTC - in response to Message 1462023.  

Something is still not right.

I just got a bunch of time exceeded with a report date of tomorrow.

Task,3322128229
WU,1397035059
Sent 7 Jan 2014, 1:27:22 UTC
Due 9 Jan 2014, 2:36:52 UTC
Timed out - no response 0.00 0.00 --- SETI@home v7 v7.00

From what I can see they are all VLARs.

Your PC would have made a request for CPU & GPU work without getting any, and then sent a request for just GPU work, which results in this happening.

It's no fault at your end, and as they are shown in red they are not held against you, so you have nothing to worry about.

Cheers.
ID: 1462026
Batter Up
Joined: 5 May 99
Posts: 1946
Credit: 24,860,347
RAC: 0
United States
Message 1462122 - Posted: 9 Jan 2014, 16:34:42 UTC - in response to Message 1462026.  

request for just GPU work which results in this happening.

It's no fault at your end and as they are in red they are not held against you so you have nothing to worry about.

Cheers.

With the goings-on of late, I'm not the one who should worry. Thank you for the reply.
ID: 1462122
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1464214 - Posted: 14 Jan 2014, 7:46:48 UTC - in response to Message 1461379.  
Last modified: 14 Jan 2014, 8:10:30 UTC

Here's the emerging pattern:

<core_client_version>7.2.33</core_client_version>

Well, there goes that theory. I just received the same type of 'Invalid' on the Host I left at 7.2.28. This Host had a 'Consecutive valid tasks' count of around 7000 before this strange task; now it has to start over. This was another overflow exit, according to the Wingperson. Note the truncated Stderr output:

Computer ID: 6796475
Coprocessors: NVIDIA GeForce GTS 250 (1024MB) driver: 332.21
Operating System: Microsoft Windows 8.1 Professional with Media Center x86 Edition
Run time: 1,652.17
CPU time: 228.25
Validate state: Invalid

Stderr output

<core_client_version>7.2.28</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
]]>

Task         Computer   Sent                      Time reported              Status                               Run time (sec)   CPU time (sec)   Credit    Application
3325713600   5360046    9 Jan 2014, 5:13:27 UTC   14 Jan 2014, 6:30:29 UTC   Completed, validation inconclusive   2,402.84         183.21           pending   SETI@home v7 v7.00 (cuda50)
3325713601   6796475    9 Jan 2014, 5:13:35 UTC   9 Jan 2014, 14:11:41 UTC   Completed, marked as invalid         1,652.17         228.25           0.00      SETI@home v7 Anonymous platform (NVIDIA GPU)
3334501898   ---        Unsent                    ---


??
ID: 1464214
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1464218 - Posted: 14 Jan 2014, 8:00:44 UTC

I'd be looking at what other programs running in the background could cause this problem for those affected.

Cheers.
ID: 1464218
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1464220 - Posted: 14 Jan 2014, 8:14:10 UTC - in response to Message 1464218.  
Last modified: 14 Jan 2014, 8:23:44 UTC

You do realize this is a different Host from the OP...right? There was nothing going on with the Win8 machine at that time except SETI; it's not used until around 0930 EST. Apparently the Invalid didn't pop up until the Wingperson reported.
ID: 1464220
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1464223 - Posted: 14 Jan 2014, 8:21:15 UTC - in response to Message 1464220.  

You do realize this is a different Host from the OP...right? There is nothing going on with the Win8 machine at present, it's not being used...except for SETI.

I do, and I'd suggest that both of you look into the possibility I raised.

Cheers.
ID: 1464223
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1464225 - Posted: 14 Jan 2014, 8:25:44 UTC - in response to Message 1464223.  

You do realize this is a different Host from the OP...right? There is nothing going on with the Win8 machine at present, it's not being used...except for SETI.

I do, and I'd suggest that both of you look into the possibility I raised.

Cheers.

One Host was Windows XP, the Other Windows 8.1.
Do you really think the same very improbable Background App was running on Both?
Get Real...
ID: 1464225


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.