Strange Invalid MB Overflow tasks with truncated Stderr outputs...
Author | Message |
---|---|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Seems I've received another one. The last one was a week or two ago. As I remember, it was the same. The Stderr output just stops, and it receives an immediate Invalid. Since it's so short, nothing is really lost. It's just puzzling what actually happened, since other overflows complete normally, as did the overflow immediately preceding the one that failed. Work Unit Info: ............... WU true angle range is : 2.684834 re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes </stderr_txt> ]]> Run time 12.45 CPU time 11.67 Validate state Invalid Credit 0.00 Application version SETI@home v7 Anonymous platform (NVIDIA GPU) |
Batter Up Joined: 5 May 99 Posts: 1946 Credit: 24,860,347 RAC: 0 |
I just got this. Stderr output <core_client_version>7.2.33</core_client_version> <![CDATA[ <stderr_txt> </stderr_txt> ]]> Run time 20.13 CPU time 2.38 Validate state Invalid Credit 0.00 Application version SETI@home v7 v7.00 (cuda50) http://setiathome.berkeley.edu/result.php?resultid=3321928417 |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Looking at the similar wingmen (apps, gpu generation etc) processing fine, seems to point definitely toward something specific to the system. As it's been a while I don't recall what was tried so far. On the off chance there is some resolved issue specific to that GPU, and you're using a new Boinc revision, is there any particular reason for not updating the driver? There can be funky interactions with the way newer Boinc kills apps under some conditions, especially if the driver takes its time cleaning up. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I just got this. That one looks like a Boinc bug Claggy was telling me he reported recently ... could be the same thing. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Looking at the similar wingmen (apps, gpu generation etc) processing fine, seems to point definitely toward something specific to the system. As it's been a while I don't recall what was tried so far. On the off chance there is some resolved issue specific to that GPU, and you're using a new Boinc revision, is there any particular reason for not updating the driver? There can be funky interactions with the way newer Boinc kills apps under some conditions, especially if the driver takes its time cleaning up. I tried 331.82 on my XP Dual core Host and it failed to produce any better CUDA runtimes than 266.58. What 331.82 did accomplish was to make the Host completely unusable when running an AstroPulse on the 8800, whereas there isn't that much of a problem when running an AP with 266.58. When you only have a Dual core processor, using half of it when not necessary isn't an option. I had the same results in Windows 8, where 266.58 isn't an option. Running an AP with 331.82 on a Dual core Host makes it extremely annoying to use the Host. Definitely something to be avoided when possible. Since I've been using 266.58 for over a year without this problem, I'm inclined to place the blame elsewhere. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
I seem to get one of these about every week or two, on different machines. They're always WUs where the wingmen get -9 overflows where the Pulse count is less than 30, but one or more of the other counts brings the total up to 30. Most only take a few seconds to overflow, but some take several minutes. Here's one from last Friday, where the wingmen's counts were 29,0,0,1,0: Name 12mr13af.14976.20108.438086664199.12.0_1 Workunit 1393362608 Created 3 Jan 2014, 2:37:28 UTC Sent 3 Jan 2014, 3:08:56 UTC Received 3 Jan 2014, 19:19:05 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 6979886 Report deadline 23 Jan 2014, 14:18:38 UTC Run time 5.23 CPU time 1.40 Validate state Invalid Credit 0.00 Application version SETI@home v7 Anonymous platform (NVIDIA GPU) Stderr output <core_client_version>7.2.33</core_client_version> <![CDATA[ <stderr_txt> </stderr_txt> Here's one from a couple of weeks ago on a different machine, where the wingmen's counts were 29,0,0,0,1: Name 09se09af.21444.23789.438086664205.12.22_1 Workunit 1382306633 Created 19 Dec 2013, 10:42:47 UTC Sent 19 Dec 2013, 16:19:23 UTC Received 20 Dec 2013, 7:45:44 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 7057115 Report deadline 10 Feb 2014, 5:46:48 UTC Run time 1,266.05 CPU time 197.13 Validate state Initial Credit 0.00 Application version SETI@home v7 v7.00 (cuda50) Stderr output <core_client_version>7.2.33</core_client_version> <![CDATA[ <stderr_txt> </stderr_txt> ]]> And from yet another machine, about the same time, where the wingmen's counts were 28,0,0,2,0: Name 02dc13ae.8857.7429.438086664203.12.247_1 Workunit 1382671822 Created 20 Dec 2013, 0:06:32 UTC Sent 20 Dec 2013, 6:00:57 UTC Received 20 Dec 2013, 14:33:36 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 6980751 Report deadline 11 Feb 2014, 4:38:27 UTC Run time 2,476.34 CPU time 114.64 Validate state Initial Credit 0.00 Application version SETI@home v7 v7.00 (cuda42) Stderr output <core_client_version>7.2.33</core_client_version> <![CDATA[ <stderr_txt> setiathome_CUDA: Found 4 CUDA device(s): Device 1: GeForce GTX 660, 2047 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 5 pciBusID = 24, pciSlotID = 0 Device 2: GeForce GT 640, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 5, pciSlotID = 0 Device 3: GeForce GT 640, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 69, pciSlotID = 0 Device 4: GeForce GTX 650, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 88, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device 1: GeForce GTX 660 is okay SETI@home using CUDA accelerated device GeForce GTX 660 mbcuda.cfg, processpriority key detected pulsefind: blocks per SM 4 (Fermi or newer default) pulsefind: periods per launch 100 (default) Priority of process set to ABOVE_NORMAL successfully Priority of worker thread set successfully setiathome enhanced x41zc, Cuda 4.20 Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements. Work Unit Info: ............... WU true angle range is : 0.434356 Kepler GPU current clockRate = 1162 MHz re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes Thread call stack limit is: 1k </stderr_txt> ]]> As you can see, sometimes the STDERR is almost completely empty, and other times it shows all the way to that "Thread call stack limit" line. I haven't been able to identify any consistency between the two types, but the end result for both is always an invalid, although sometimes it's an "immediate" Invalid (as with my first example) and sometimes it doesn't get flagged as Invalid until the first wingman reports (as with the 2nd and 3rd examples). |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
While I'm at it, here's one more, where one wingman got counts of 2,28,0,0,0 and two others got counts of 18,0,0,12,0 to earn the validation: Name 01dc13ac.14707.15609.438086664195.12.254_1 Workunit 1378744616 Created 14 Dec 2013, 19:23:30 UTC Sent 14 Dec 2013, 23:29:56 UTC Received 15 Dec 2013, 4:43:17 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 6980751 Report deadline 6 Feb 2014, 6:40:09 UTC Run time 873.69 CPU time 116.70 Validate state Invalid Credit 0.00 Application version SETI@home v7 v7.00 (cuda42) Stderr output <core_client_version>7.0.64</core_client_version> <![CDATA[ <stderr_txt> setiathome_CUDA: Found 4 CUDA device(s): Device 1: GeForce GTX 660, 2047 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 5 pciBusID = 24, pciSlotID = 0 Device 2: GeForce GT 640, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 5, pciSlotID = 0 Device 3: GeForce GT 640, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 69, pciSlotID = 0 Device 4: GeForce GTX 650, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 88, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device 1: GeForce GTX 660 is okay SETI@home using CUDA accelerated device GeForce GTX 660 mbcuda.cfg, processpriority key detected pulsefind: blocks per SM 4 (Fermi or newer default) pulsefind: periods per launch 100 (default) Priority of process set to ABOVE_NORMAL successfully Priority of worker thread set successfully setiathome enhanced x41zc, Cuda 4.20 Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements. Work Unit Info: ............... WU true angle range is : 0.426631 Kepler GPU current clockRate = 1162 MHz re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes Thread call stack limit is: 1k </stderr_txt> ]]> |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Here's the emerging pattern: <core_client_version>7.2.33</core_client_version> |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Looking at the similar wingmen (apps, gpu generation etc) processing fine, seems to point definitely toward something specific to the system. As it's been a while I don't recall what was tried so far. On the off chance there is some resolved issue specific to that GPU, and you're using a new Boinc revision, is there any particular reason for not updating the driver? There can be funky interactions with the way newer Boinc kills apps under some conditions, especially if the driver takes its time cleaning up. Agreed. It's looking like Claggy's Boinc bug reports. [Edit:] As for AP, you might want to enquire about the newer lower CPU usage builds. Not my department, but I understand they should be noticeably better on either the old or newer drivers. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Here's the emerging pattern: Actually, if you take a look at the additional example I added, I was still on <core_client_version>7.0.64</core_client_version>. In fact, I'd have to check, but I may be able to come up with examples under 7.0.64 going back as far as July or August, although they certainly seem to be getting more frequent lately. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
And another one, just today (first one on this machine since Dec. 20), where both wingmen got counts of 28,0,2,0,0: Name 16oc13ab.5599.3748.438086664199.12.0_1 Workunit 1396580051 Created 6 Jan 2014, 12:32:13 UTC Sent 6 Jan 2014, 14:50:55 UTC Received 6 Jan 2014, 18:06:50 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 6980751 Report deadline 27 Jan 2014, 2:00:37 UTC Run time 7.47 CPU time 1.88 Validate state Invalid Credit 0.00 Application version SETI@home v7 v7.00 (cuda50) Stderr output <core_client_version>7.2.33</core_client_version> <![CDATA[ <stderr_txt> setiathome_CUDA: Found 4 CUDA device(s): Device 1: GeForce GTX 660, 2047 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 5 pciBusID = 24, pciSlotID = 0 Device 2: GeForce GT 640, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 5, pciSlotID = 0 Device 3: GeForce GT 640, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 69, pciSlotID = 0 Device 4: GeForce GTX 650, 1023 MiB, regsPerBlock 65536 computeCap 3.0, multiProcs 2 pciBusID = 88, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device 1: GeForce GTX 660 is okay SETI@home using CUDA accelerated device GeForce GTX 660 mbcuda.cfg, processpriority key detected pulsefind: blocks per SM 4 (Fermi or newer default) pulsefind: periods per launch 100 (default) Priority of process set to ABOVE_NORMAL successfully Priority of worker thread set successfully setiathome enhanced x41zc, Cuda 5.00 Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements. Work Unit Info: ............... WU true angle range is : 2.737595 </stderr_txt> ]]> |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
I can't comment on the truncated stderr_txt problems, but the tasks with status 'validate error' seem to have been server problems (probably a bad volume mount between the validate server and the upload storage area). Tasks of mine which were showing 'validate error' before maintenance are now showing 'valid'. |
Batter Up Joined: 5 May 99 Posts: 1946 Credit: 24,860,347 RAC: 0 |
Something is still not right. I just got a bunch of time exceeded with a report date of tomorrow. Task,3322128229 WU,1397035059 Sent 7 Jan 2014, 1:27:22 UTC Due 9 Jan 2014, 2:36:52 UTC Timed out - no response 0.00 0.00 --- SETI@home v7 v7.00 |
Wiggo Joined: 24 Jan 00 Posts: 34831 Credit: 261,360,520 RAC: 489 |
Something is still not right. From what I can see, they are all VLARs. Your PC would've made a request for CPU and GPU work without getting any, and then sent a request for just GPU work which results in this happening. It's no fault at your end, and as they are in red they are not held against you, so you have nothing to worry about. Cheers. |
Batter Up Joined: 5 May 99 Posts: 1946 Credit: 24,860,347 RAC: 0 |
request for just GPU work which results in this happening. With the goings on of late I'm not the one who should worry. Thank you for the reply. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Here's the emerging pattern: Well, there goes that theory. I just received the same type of 'Invalid' on the Host I left at 7.2.28. This Host had a previous 'Consecutive valid tasks' number of around 7000 before this strange task. Now it has to start over. This was another overflow exit, according to the Wingperson. Note the truncated Stderr output: Computer ID: 6796475 Coprocessors: NVIDIA GeForce GTS 250 (1024MB) driver: 332.21 Operating System: Microsoft Windows 8.1 Professional with Media Center x86 Edition Run time: 1,652.17 CPU time: 228.25 Validate state: Invalid Stderr output <core_client_version>7.2.28</core_client_version> <![CDATA[ <stderr_txt> </stderr_txt> ]]> Task Computer Sent Time reported Status Run time(sec) CPU time Credit Application 3325713600 5360046 9 Jan 2014, 5:13:27 UTC 14 Jan 2014, 6:30:29 UTC Completed, validation inconclusive 2,402.84 183.21 pending SETI@home v7 v7.00 (cuda50) 3325713601 6796475 9 Jan 2014, 5:13:35 UTC 9 Jan 2014, 14:11:41 UTC Completed, marked as invalid 1,652.17 228.25 0.00 SETI@home v7 Anonymous platform (NVIDIA GPU) 3334501898 --- Unsent --- ?? |
Wiggo Joined: 24 Jan 00 Posts: 34831 Credit: 261,360,520 RAC: 489 |
I'd be looking at what other programs are running in the background that could cause this problem for those affected. Cheers. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
You do realize this is a different Host from the OP...right? There was nothing going on with the Win8 machine at that time, it's not used until around 0930EST...except for SETI. Apparently the invalid didn't pop up until the Wingperson reported. |
Wiggo Joined: 24 Jan 00 Posts: 34831 Credit: 261,360,520 RAC: 489 |
You do realize this is a different Host from the OP...right? There is nothing going on with the Win8 machine at present, it's not being used...except for SETI. I do, and I'd suggest both of you look at the possibility of what I posted. Cheers. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
You do realize this is a different Host from the OP...right? There is nothing going on with the Win8 machine at present, it's not being used...except for SETI. One Host was Windows XP, the Other Windows 8.1. Do you really think the same very improbable Background App was running on Both? Get Real... |
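
For anyone wanting to collect these systematically rather than eyeballing task pages, here is a rough Python sketch of the bookkeeping discussed in this thread. It is not part of any SETI@home or BOINC tool. The function names are made up for illustration, the marker lines are simply copied from the Stderr samples posted above, and the limit of 30 is taken from the wingmen's counts quoted above. It assumes you have pasted each task's "Stderr output" text into a string yourself.

```python
import re
from collections import Counter

# Total signal count at which the wingmen in this thread reported a -9 overflow.
OVERFLOW_LIMIT = 30

# Lines seen in the Stderr samples posted above, in the order they appear there.
# Assumption: other app builds may print different lines.
MILESTONES = [
    "WU true angle range is",
    "re-using dev_GaussFitResults array for dev_AutoCorrIn",
    "re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut",
    "Thread call stack limit is",
]


def client_version(stderr_output):
    """Return the text of <core_client_version>, or None if the tag is absent."""
    match = re.search(
        r"<core_client_version>([^<]+)</core_client_version>", stderr_output
    )
    return match.group(1) if match else None


def last_milestone(stderr_output):
    """Report how far the Stderr got before it stopped."""
    match = re.search(r"<stderr_txt>(.*?)</stderr_txt>", stderr_output, re.DOTALL)
    if match is None:
        return "no stderr_txt block"
    body = match.group(1).strip()
    if not body:
        return "empty"
    reached = "before work unit info"
    for marker in MILESTONES:
        if marker in body:
            reached = marker
    return reached


def is_overflow(counts):
    """True when the wingman's five signal counts add up to the overflow limit."""
    return sum(counts) >= OVERFLOW_LIMIT


def tally(stderr_outputs):
    """Count truncated tasks per (client version, last line reached)."""
    return Counter((client_version(s), last_milestone(s)) for s in stderr_outputs)


if __name__ == "__main__":
    samples = [
        "<core_client_version>7.2.33</core_client_version> <![CDATA[ "
        "<stderr_txt> </stderr_txt> ]]>",
        "<core_client_version>7.2.28</core_client_version> <![CDATA[ "
        "<stderr_txt> </stderr_txt> ]]>",
    ]
    for key, count in tally(samples).items():
        print(key, count)
    print(is_overflow((29, 0, 0, 1, 0)))
```

Grouping by client version and by the last line reached is only meant to make it easier to check whether the failures cluster around one BOINC release or one stopping point. With the handful of samples posted so far, they don't appear to.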