Strange Invalid MB Overflow tasks with truncated Stderr outputs...


TBar
Volunteer tester
Joined: 22 May 99
Posts: 1177
Credit: 41,868,231
RAC: 114,400
United States
Message 1461332 - Posted: 6 Jan 2014, 23:46:24 UTC
Last modified: 6 Jan 2014, 23:55:23 UTC

Seems I've received another one. The last one was a week or two ago, and as I remember it looked the same. The Stderr output just stops, and the task receives an immediate Invalid. Since the run is so short, nothing is really lost. It's just puzzling what actually happened, since other overflows complete normally, as did the overflow immediately preceding the one that failed.

Work Unit Info:
...............
WU true angle range is : 2.684834
re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes

</stderr_txt>
]]>

Run time 12.45
CPU time 11.67
Validate state Invalid
Credit 0.00
Application version SETI@home v7 Anonymous platform (NVIDIA GPU)

Batter Up
Joined: 5 May 99
Posts: 1839
Credit: 24,858,559
RAC: 11
United States
Message 1461355 - Posted: 7 Jan 2014, 1:56:49 UTC - in response to Message 1461332.

I just got this.

Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
]]>

Run time 20.13
CPU time 2.38
Validate state Invalid
Credit 0.00
Application version SETI@home v7 v7.00 (cuda50)

http://setiathome.berkeley.edu/result.php?resultid=3321928417
____________

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4920
Credit: 72,650,779
RAC: 4,214
Australia
Message 1461356 - Posted: 7 Jan 2014, 1:58:17 UTC - in response to Message 1461332.

Looking at the similar wingmen (apps, GPU generation, etc.) processing fine, this seems to point definitely toward something specific to the system. As it's been a while, I don't recall what was tried so far. On the off chance there is some resolved issue specific to that GPU, and you're using a new Boinc revision, is there any particular reason for not updating the driver? There can be funky interactions with the way newer Boinc kills apps under some conditions, especially if the driver takes its time cleaning up.
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4920
Credit: 72,650,779
RAC: 4,214
Australia
Message 1461357 - Posted: 7 Jan 2014, 1:59:50 UTC - in response to Message 1461355.

I just got this.


That one looks like a Boinc bug Claggy was telling me he reported recently ... could be the same thing.
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1177
Credit: 41,868,231
RAC: 114,400
United States
Message 1461367 - Posted: 7 Jan 2014, 2:20:11 UTC - in response to Message 1461356.

Looking at the similar wingmen (apps, GPU generation, etc.) processing fine, this seems to point definitely toward something specific to the system. As it's been a while, I don't recall what was tried so far. On the off chance there is some resolved issue specific to that GPU, and you're using a new Boinc revision, is there any particular reason for not updating the driver? There can be funky interactions with the way newer Boinc kills apps under some conditions, especially if the driver takes its time cleaning up.

I tried 331.82 on my XP dual-core Host and it failed to produce any better CUDA runtimes than 266.58. What 331.82 did accomplish was to make the Host completely unusable when running an AstroPulse on the 8800, whereas there isn't that much of a problem when running an AP with 266.58. When you only have a dual-core processor, using half of it when not necessary isn't an option. I had the same results in Windows 8, where 266.58 isn't an option. Running an AP with 331.82 on a dual-core Host makes it extremely annoying to use the Host. Definitely something to be avoided when possible.

Since I've been using 266.58 for over a year without this problem, I'm inclined to place the blame elsewhere.

Profile Jeff Buck
Joined: 11 Feb 00
Posts: 258
Credit: 29,028,484
RAC: 82,817
United States
Message 1461375 - Posted: 7 Jan 2014, 2:40:41 UTC

I seem to get about one of these just about every week or two, on different machines. They're always WUs where the wingmen get -9 overflows where the Pulse count is less than 30, but one or more of the other counts brings the total up to 30. Most only take a few seconds to overflow, but some take several minutes. Here's one from last Friday, where the wingmen's counts were 29,0,0,1,0:

Name 12mr13af.14976.20108.438086664199.12.0_1
Workunit 1393362608
Created 3 Jan 2014, 2:37:28 UTC
Sent 3 Jan 2014, 3:08:56 UTC
Received 3 Jan 2014, 19:19:05 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 6979886
Report deadline 23 Jan 2014, 14:18:38 UTC
Run time 5.23
CPU time 1.40
Validate state Invalid
Credit 0.00
Application version SETI@home v7
Anonymous platform (NVIDIA GPU)
Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
]]>

Here's one from a couple weeks ago on a different machine, where the wingmen's counts were 29,0,0,0,1:
Name 09se09af.21444.23789.438086664205.12.22_1
Workunit 1382306633
Created 19 Dec 2013, 10:42:47 UTC
Sent 19 Dec 2013, 16:19:23 UTC
Received 20 Dec 2013, 7:45:44 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 7057115
Report deadline 10 Feb 2014, 5:46:48 UTC
Run time 1,266.05
CPU time 197.13
Validate state Initial
Credit 0.00
Application version SETI@home v7 v7.00 (cuda50)
Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
]]>

And from yet another machine, about the same time, where the wingmen's counts were 28,0,0,2,0:
Name 02dc13ae.8857.7429.438086664203.12.247_1
Workunit 1382671822
Created 20 Dec 2013, 0:06:32 UTC
Sent 20 Dec 2013, 6:00:57 UTC
Received 20 Dec 2013, 14:33:36 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 6980751
Report deadline 11 Feb 2014, 4:38:27 UTC
Run time 2,476.34
CPU time 114.64
Validate state Initial
Credit 0.00
Application version SETI@home v7 v7.00 (cuda42)
Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 4 CUDA device(s):
Device 1: GeForce GTX 660, 2047 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 5
pciBusID = 24, pciSlotID = 0
Device 2: GeForce GT 640, 1023 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 2
pciBusID = 5, pciSlotID = 0
Device 3: GeForce GT 640, 1023 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 2
pciBusID = 69, pciSlotID = 0
Device 4: GeForce GTX 650, 1023 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 2
pciBusID = 88, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GTX 660 is okay
SETI@home using CUDA accelerated device GeForce GTX 660
mbcuda.cfg, processpriority key detected
pulsefind: blocks per SM 4 (Fermi or newer default)
pulsefind: periods per launch 100 (default)
Priority of process set to ABOVE_NORMAL successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 4.20

Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 0.434356

Kepler GPU current clockRate = 1162 MHz

re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k

</stderr_txt>
]]>

As you can see, sometimes the STDERR is almost completely empty, and other times it shows all the way to that "Thread call stack limit" line. I haven't been able to identify any consistency between the two types, but the end result for both is always an Invalid, although sometimes it's an "immediate" Invalid (as with my first example) and sometimes it doesn't get flagged as Invalid until the first wingman reports (as with the 2nd and 3rd examples).
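
For anyone who wants to sort saved result pages by how far the Stderr got before it stopped, here is a minimal Python sketch. The marker strings are copied from the excerpts quoted in this thread; the function name `last_stage` and the idea of treating those lines as "stages" are only assumptions for illustration, not anything taken from the application or from Boinc.

```python
import re

# Marker lines copied from the stderr excerpts in this thread, in the
# order they appear in a run that gets that far. Treating them as
# "stages" is an assumption made for this sketch.
STAGES = [
    "setiathome_CUDA: Found",
    "Priority of worker thread set successfully",
    "Work Unit Info:",
    "WU true angle range is",
    "re-using dev_GaussFitResults array for dev_AutoCorrIn",
    "Thread call stack limit is",
]

def last_stage(result_text):
    """Return the last known marker found inside <stderr_txt>, or None if it is empty."""
    match = re.search(r"<stderr_txt>(.*?)</stderr_txt>", result_text, re.DOTALL)
    body = match.group(1) if match else ""
    found = None
    for marker in STAGES:
        if marker in body:
            found = marker
    return found

# Two shortened samples modelled on the outputs posted above.
empty_sample = "<![CDATA[\n<stderr_txt>\n\n</stderr_txt>\n]]>"
partial_sample = (
    "<![CDATA[\n<stderr_txt>\n"
    "setiathome_CUDA: Found 4 CUDA device(s):\n"
    "Work Unit Info:\n...............\n"
    "WU true angle range is : 2.737595\n"
    "\n</stderr_txt>\n]]>"
)

print(last_stage(empty_sample))
print(last_stage(partial_sample))
```

Running this over a folder of saved result pages would at least show whether the "empty" and "stopped partway" cases cluster by host, app version, or date.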

Profile Jeff Buck
Joined: 11 Feb 00
Posts: 258
Credit: 29,028,484
RAC: 82,817
United States
Message 1461378 - Posted: 7 Jan 2014, 2:49:25 UTC

While I'm at it, here's one more, where one wingman got counts of 2,28,0,0,0 and two others got counts of 18,0,0,12,0 to earn the validation:

Name 01dc13ac.14707.15609.438086664195.12.254_1
Workunit 1378744616
Created 14 Dec 2013, 19:23:30 UTC
Sent 14 Dec 2013, 23:29:56 UTC
Received 15 Dec 2013, 4:43:17 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 6980751
Report deadline 6 Feb 2014, 6:40:09 UTC
Run time 873.69
CPU time 116.70
Validate state Invalid
Credit 0.00
Application version SETI@home v7 v7.00 (cuda42)
Stderr output

<core_client_version>7.0.64</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 4 CUDA device(s):
Device 1: GeForce GTX 660, 2047 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 5
pciBusID = 24, pciSlotID = 0
Device 2: GeForce GT 640, 1023 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 2
pciBusID = 5, pciSlotID = 0
Device 3: GeForce GT 640, 1023 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 2
pciBusID = 69, pciSlotID = 0
Device 4: GeForce GTX 650, 1023 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 2
pciBusID = 88, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GTX 660 is okay
SETI@home using CUDA accelerated device GeForce GTX 660
mbcuda.cfg, processpriority key detected
pulsefind: blocks per SM 4 (Fermi or newer default)
pulsefind: periods per launch 100 (default)
Priority of process set to ABOVE_NORMAL successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 4.20

Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 0.426631

Kepler GPU current clockRate = 1162 MHz

re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k

</stderr_txt>
]]>
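
The overflow condition mentioned in these posts (the wingmen's counts adding up to 30) can be checked with a few lines of Python. This is only a sketch: the limit of 30 is taken from the posts above, the helper name `is_overflow` is made up, and the thread does not say which position in the five-number list corresponds to which signal type, so the code only looks at the total.

```python
# The limit of 30 comes from the posts above; everything else here
# (names, structure) is hypothetical and for illustration only.
OVERFLOW_LIMIT = 30

def is_overflow(counts):
    """Return True when the reported signal counts add up to the limit or more."""
    return sum(counts) >= OVERFLOW_LIMIT

# Wingmen's counts quoted in this thread so far.
examples = [
    (29, 0, 0, 1, 0),
    (29, 0, 0, 0, 1),
    (28, 0, 0, 2, 0),
    (2, 28, 0, 0, 0),
    (18, 0, 0, 12, 0),
]

for counts in examples:
    print(counts, sum(counts), is_overflow(counts))
```

Every set of counts quoted so far totals exactly 30, with no single count reaching 30 on its own.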

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4920
Credit: 72,650,779
RAC: 4,214
Australia
Message 1461379 - Posted: 7 Jan 2014, 2:50:42 UTC

Here's the emerging pattern:

<core_client_version>7.2.33</core_client_version>

____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4920
Credit: 72,650,779
RAC: 4,214
Australia
Message 1461380 - Posted: 7 Jan 2014, 2:53:45 UTC - in response to Message 1461367.
Last modified: 7 Jan 2014, 2:55:13 UTC

Looking at the similar wingmen (apps, GPU generation, etc.) processing fine, this seems to point definitely toward something specific to the system. As it's been a while, I don't recall what was tried so far. On the off chance there is some resolved issue specific to that GPU, and you're using a new Boinc revision, is there any particular reason for not updating the driver? There can be funky interactions with the way newer Boinc kills apps under some conditions, especially if the driver takes its time cleaning up.

I tried 331.82 on my XP dual-core Host and it failed to produce any better CUDA runtimes than 266.58. What 331.82 did accomplish was to make the Host completely unusable when running an AstroPulse on the 8800, whereas there isn't that much of a problem when running an AP with 266.58. When you only have a dual-core processor, using half of it when not necessary isn't an option. I had the same results in Windows 8, where 266.58 isn't an option. Running an AP with 331.82 on a dual-core Host makes it extremely annoying to use the Host. Definitely something to be avoided when possible.

Since I've been using 266.58 for over a year without this problem, I'm inclined to place the blame elsewhere.


Agreed. It's looking like Claggy's Boinc bug reports. [Edit:] As for AP, might want to enquire about the newer lower CPU usage builds. Not my department, but I understand they should be noticeably better on either the old or newer drivers.
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Profile Jeff Buck
Joined: 11 Feb 00
Posts: 258
Credit: 29,028,484
RAC: 82,817
United States
Message 1461390 - Posted: 7 Jan 2014, 3:29:11 UTC - in response to Message 1461379.

Here's the emerging pattern:

<core_client_version>7.2.33</core_client_version>

Actually, if you take a look at the additional example I added, I was still on <core_client_version>7.0.64</core_client_version>. In fact, I'd have to check, but I may be able to come up with examples under 7.0.64 going back as far as July or August, although they certainly seem to be getting more frequent lately.
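
To test a "which client version" theory against more than a handful of examples, the version tag can be pulled out of each saved Stderr output and tallied. Below is a minimal Python sketch; the function name `tally_client_versions` is hypothetical, and it assumes each saved output contains a `<core_client_version>` tag like the ones quoted in this thread.

```python
import re
from collections import Counter

# Matches the version tag exactly as it appears in the quoted outputs.
VERSION_RE = re.compile(r"<core_client_version>([^<]+)</core_client_version>")

def tally_client_versions(stderr_outputs):
    """Count how many saved outputs were produced under each client version."""
    counts = Counter()
    for text in stderr_outputs:
        match = VERSION_RE.search(text)
        counts[match.group(1) if match else "unknown"] += 1
    return counts

# Shortened samples using the two versions mentioned so far.
samples = [
    "<core_client_version>7.2.33</core_client_version>\n<![CDATA[\n<stderr_txt>\n\n</stderr_txt>\n]]>",
    "<core_client_version>7.2.33</core_client_version>\n<![CDATA[\n<stderr_txt>\n\n</stderr_txt>\n]]>",
    "<core_client_version>7.0.64</core_client_version>\n<![CDATA[\n<stderr_txt>\n\n</stderr_txt>\n]]>",
]

for version, count in sorted(tally_client_versions(samples).items()):
    print(version, count)
```

A tally like this only shows which versions the invalid tasks ran under; it would need the same count for valid tasks before it says anything about cause.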

Profile Jeff Buck
Joined: 11 Feb 00
Posts: 258
Credit: 29,028,484
RAC: 82,817
United States
Message 1461404 - Posted: 7 Jan 2014, 6:13:23 UTC

And another one, just today (first one on this machine since Dec. 20), where both wingmen got counts of 28,0,2,0,0:

Name 16oc13ab.5599.3748.438086664199.12.0_1
Workunit 1396580051
Created 6 Jan 2014, 12:32:13 UTC
Sent 6 Jan 2014, 14:50:55 UTC
Received 6 Jan 2014, 18:06:50 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 6980751
Report deadline 27 Jan 2014, 2:00:37 UTC
Run time 7.47
CPU time 1.88
Validate state Invalid
Credit 0.00
Application version SETI@home v7 v7.00 (cuda50)
Stderr output

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 4 CUDA device(s):
Device 1: GeForce GTX 660, 2047 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 5
pciBusID = 24, pciSlotID = 0
Device 2: GeForce GT 640, 1023 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 2
pciBusID = 5, pciSlotID = 0
Device 3: GeForce GT 640, 1023 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 2
pciBusID = 69, pciSlotID = 0
Device 4: GeForce GTX 650, 1023 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 2
pciBusID = 88, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GTX 660 is okay
SETI@home using CUDA accelerated device GeForce GTX 660
mbcuda.cfg, processpriority key detected
pulsefind: blocks per SM 4 (Fermi or newer default)
pulsefind: periods per launch 100 (default)
Priority of process set to ABOVE_NORMAL successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 5.00

Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 2.737595

</stderr_txt>
]]>

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8377
Credit: 46,811,102
RAC: 24,140
United Kingdom
Message 1461594 - Posted: 7 Jan 2014, 23:30:31 UTC

I can't comment on the truncated stderr_txt problems, but the tasks with status 'validate error' seem to have been server problems (probably a bad volume mount between the validate server and the upload storage area). Tasks of mine which were showing 'validate error' before maintenance are now showing 'valid'.

Batter Up
Joined: 5 May 99
Posts: 1839
Credit: 24,858,559
RAC: 11
United States
Message 1462023 - Posted: 9 Jan 2014, 5:59:50 UTC

Something is still not right.

I just got a bunch of time exceeded with a report date of tomorrow.

Task,3322128229
WU,1397035059
Sent 7 Jan 2014, 1:27:22 UTC
Due 9 Jan 2014, 2:36:52 UTC
Timed out - no response 0.00 0.00 --- SETI@home v7 v7.00
____________

Profile Wiggo
Joined: 24 Jan 00
Posts: 6540
Credit: 90,723,716
RAC: 75,119
Australia
Message 1462026 - Posted: 9 Jan 2014, 6:07:09 UTC - in response to Message 1462023.

Something is still not right.

I just got a bunch of time exceeded with a report date of tomorrow.

Task,3322128229
WU,1397035059
Sent 7 Jan 2014, 1:27:22 UTC
Due 9 Jan 2014, 2:36:52 UTC
Timed out - no response 0.00 0.00 --- SETI@home v7 v7.00

From what I can see they are all vlars.

Your PC would've made a request for CPU & GPU without getting work and then sent a request for just GPU work which results in this happening.

It's no fault at your end and as they are in red they are not held against you so you have nothing to worry about.

Cheers.

Batter Up
Joined: 5 May 99
Posts: 1839
Credit: 24,858,559
RAC: 11
United States
Message 1462122 - Posted: 9 Jan 2014, 16:34:42 UTC - in response to Message 1462026.

request for just GPU work which results in this happening.

It's no fault at your end and as they are in red they are not held against you so you have nothing to worry about.

Cheers.

With the goings-on of late, I'm not the one who should worry. Thank you for the reply.
____________

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1177
Credit: 41,868,231
RAC: 114,400
United States
Message 1464214 - Posted: 14 Jan 2014, 7:46:48 UTC - in response to Message 1461379.
Last modified: 14 Jan 2014, 8:10:30 UTC

Here's the emerging pattern:

<core_client_version>7.2.33</core_client_version>

Well, there goes that theory. I just received the same type of 'Invalid' on the Host I left at 7.2.28. This Host had a previous 'Consecutive valid tasks' number of around 7000 before this strange task. Now it has to start over. This was another overflow exit, according to the Wingperson. Note the truncated Stderr output:

Computer ID: 6796475
Coprocessors: NVIDIA GeForce GTS 250 (1024MB) driver: 332.21
Operating System: Microsoft Windows 8.1 Professional with Media Center x86 Edition
Run time: 1,652.17
CPU time: 228.25
Validate state: Invalid

Stderr output

<core_client_version>7.2.28</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
]]>

Task 3325713600
Computer 5360046
Sent 9 Jan 2014, 5:13:27 UTC
Time reported 14 Jan 2014, 6:30:29 UTC
Status Completed, validation inconclusive
Run time (sec) 2,402.84
CPU time 183.21
Credit pending
Application SETI@home v7 v7.00 (cuda50)

Task 3325713601
Computer 6796475
Sent 9 Jan 2014, 5:13:35 UTC
Time reported 9 Jan 2014, 14:11:41 UTC
Status Completed, marked as invalid
Run time (sec) 1,652.17
CPU time 228.25
Credit 0.00
Application SETI@home v7 Anonymous platform (NVIDIA GPU)

Task 3334501898
Computer ---
Status Unsent


??

Profile Wiggo
Joined: 24 Jan 00
Posts: 6540
Credit: 90,723,716
RAC: 75,119
Australia
Message 1464218 - Posted: 14 Jan 2014, 8:00:44 UTC

I'd be looking at what other programs are running in the background that could cause this problem for those affected.

Cheers.

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1177
Credit: 41,868,231
RAC: 114,400
United States
Message 1464220 - Posted: 14 Jan 2014, 8:14:10 UTC - in response to Message 1464218.
Last modified: 14 Jan 2014, 8:23:44 UTC

You do realize this is a different Host from the OP...right? There was nothing going on with the Win8 machine at that time, it's not used until around 0930EST...except for SETI. Apparently the invalid didn't pop up until the Wingperson reported.

Profile Wiggo
Joined: 24 Jan 00
Posts: 6540
Credit: 90,723,716
RAC: 75,119
Australia
Message 1464223 - Posted: 14 Jan 2014, 8:21:15 UTC - in response to Message 1464220.

You do realize this is a different Host from the OP...right? There is nothing going on with the Win8 machine at present, it's not being used...except for SETI.

I do, and I'd suggest you both look at the possibility of what I posted.

Cheers.

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1177
Credit: 41,868,231
RAC: 114,400
United States
Message 1464225 - Posted: 14 Jan 2014, 8:25:44 UTC - in response to Message 1464223.

You do realize this is a different Host from the OP...right? There is nothing going on with the Win8 machine at present, it's not being used...except for SETI.

I do, and I'd suggest you both look at the possibility of what I posted.

Cheers.

One Host was Windows XP, the Other Windows 8.1.
Do you really think the same very improbable Background App was running on Both?
Get Real...


Copyright © 2014 University of California