I need ideas for testing a GPU


Message boards : Number crunching : I need ideas for testing a GPU

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 71,088,780
RAC: 82,226
Argentina
Message 1211703 - Posted: 29 Mar 2012, 23:01:04 UTC

On my main host, one of the two 560 Ti GPUs is throwing invalid results... It's not something new, but next week I'll have some time to do some testing and I want to try to solve it...

If I test the GPU with standard tools (FurMark and the like) it doesn't fail, so I can't send it to the technical service: I have nothing to show that it's failing other than saying that some SETI results don't get validated, which is hard to prove.
If someone knows a better utility for this kind of testing, it would help...

But as I'm pretty sure it will be difficult to get it replaced until it starts to fail more seriously and noticeably (which may "never" happen), I want to know whether it's worth keeping it crunching, and that depends on the ratio of invalid results to total results crunched. (If the ratio is low it's worth keeping it running, but if it's high it might be a waste of energy, bandwidth and time...)

The point is that I can't think of any way to get that ratio... I can't rely on any local logging tool, as I need to know how each result was validated. So, does anyone have an idea of how I can get the data about how many results were valid and how many were invalid for a specific GPU on a specific host?

Thanks in advance for any idea.


To expand on what is happening, this is what I know/did/tested:

    I know for sure that all the invalid results come from this GPU, and I know that some results from this card are not invalid. Usually the invalid tasks are reported as -9 errors, but there are also results that finish successfully and later become invalid.

    I have a 1000 W PSU and the consumption reported by the UPS is about 500 W, so I don't think there is a power issue (and if it were the PSU, I should have seen failures on both cards...)

    This GPU also throws invalids in Einstein, but whatever is happening seems to affect those tasks much less, which is no surprise as the SETI optimized apps put much more load on the GPUs.

    Both cards are the same brand (Zotac) and have the same settings. The one that doesn't throw invalids is closer to the CPU, so it usually runs a bit hotter than the failing one: the good one runs at around 66 ºC while the "failing" one runs at 63 ºC.

    They are not overclocked, but I'm using Afterburner to raise the voltage because at stock voltage they tend to downclock very often. Even so, sometimes (less than once a week), and with equal chances, one of them bites the dust and enters the failsafe downclocked mode.

    As I am using the public beta of the Nvidia AP app, I'm using version 266.66 of the drivers (it is the specific one for these boards and is supposed to be equivalent to 266.58). I've read something about a problem in these drivers, but even if that's true, I think it should affect both cards, not just one. I'm thinking about upgrading the driver as soon as I get rid of the last APs I've got. Interestingly, I've not seen any invalid AP, but I guess I've just missed them...




____________

Tom
Joined: 12 Aug 11
Posts: 114
Credit: 4,566,097
RAC: 0
United States
Message 1211708 - Posted: 29 Mar 2012, 23:27:03 UTC

You didn't mention if you have tried swapping the card positions to eliminate
a motherboard or power cabling issue?

Bill

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 71,088,780
RAC: 82,226
Argentina
Message 1211712 - Posted: 29 Mar 2012, 23:39:57 UTC - in response to Message 1211708.
Last modified: 29 Mar 2012, 23:49:18 UTC

I didn't swap them between slots, but I've changed the power cables, with no change in behaviour.

I thought about swapping, but if I swap them they will get renumbered, so I'll need to work out when the tasks were reported to know which GPU they came from... As I haven't had time to test them and check the results page often, it got delayed...

Anyway, thanks for reminding me about this option...

[EDIT:] One more thing: whatever the result of swapping the cards, unless it stops failing I'll still have the same problem of measuring the failure ratio...
____________

Lint trap
Joined: 30 May 03
Posts: 859
Credit: 26,138,167
RAC: 13,163
United States
Message 1211723 - Posted: 30 Mar 2012, 1:12:35 UTC
Last modified: 30 Mar 2012, 1:51:21 UTC

Another idea, run the suspect card solo for testing purposes. Put it in the primary slot and run everything through it you can.

I'm doing that right now with a used 460 (same make/model as my original) from eBay. Looks like the problem here is my ancient socket 775 mobo. The 2nd x16 physical slot runs at x4 1.1 speed. Both slots run at x4 1.1 when both cards are present, which (edited: could be) too slow for the 460's - ended up bogging the whole system down when trying to crunch. Solo the 'new' card is fine. As good as the original. So far.


Lt

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 71,088,780
RAC: 82,226
Argentina
Message 1211789 - Posted: 30 Mar 2012, 5:44:40 UTC - in response to Message 1211723.

Thanks for taking the time to try to help me.
That's another good way to test the card itself. But if the card still produces some invalid results, I'll be left with the same question: how bad is it to keep it running?

Making wild assumptions: on average I see that the number of invalids for this host is between 5 and 10% of the valid ones. As I have 2 GPUs and I'm not crunching on the CPU, and guessing that the already validated tasks (i.e. already compared to their sisters) are evenly distributed across the GPUs, and assuming that effectively all the invalid ones come from the suspect GPU, I estimate that this card is ruining around 15% of its tasks. I can live with that, as I'll still be contributing a lot more with the other 85%...
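The arithmetic behind that estimate can be written out as a rough sketch. It assumes exactly what the paragraph assumes: the GPUs share the work equally and every invalid result comes from the suspect card. The function name and parameters are made up for illustration.

```python
def suspect_gpu_invalid_fraction(invalid_to_valid, n_gpus=2):
    """Fraction of the suspect GPU's own tasks that end up invalid.

    invalid_to_valid: host-wide ratio of invalid to valid results (e.g. 0.05).
    n_gpus: number of GPUs assumed to share the work equally.
    """
    # Host-wide: invalid / total = r / (1 + r) when invalid = r * valid.
    host_invalid_fraction = invalid_to_valid / (1.0 + invalid_to_valid)
    # The suspect GPU handles 1/n_gpus of the tasks but produces all invalids,
    # so its own failure fraction is n_gpus times the host-wide one.
    return n_gpus * host_invalid_fraction


for r in (0.05, 0.10):
    print(f"invalid/valid = {r:.2f} -> suspect GPU fails about "
          f"{suspect_gpu_invalid_fraction(r):.1%} of its tasks")
```

With invalids at 5 to 10% of the valid results, this gives roughly 9.5% to 18% for the suspect card, so a guess of around 15% sits inside that range.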

But all that thinking is very flawed. Among all the wild assumptions, the main reason for doubt is that most of the invalid tasks are first marked as inconclusive, then they are resent, and only some time later are they marked as invalid. So the valid tasks might show up faster than the invalid ones, and maybe the real ratio is very far from that 15%.
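One way to live with that delay is to report a range instead of a single number: treat every still-inconclusive task as valid for the optimistic bound and as invalid for the pessimistic one. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def invalid_ratio_bounds(valid, invalid, inconclusive):
    """Return (lower, upper) bounds on the final invalid ratio.

    lower: every inconclusive task eventually validates.
    upper: every inconclusive task eventually becomes invalid.
    """
    total = valid + invalid + inconclusive
    if total == 0:
        raise ValueError("no tasks to count")
    lower = invalid / total
    upper = (invalid + inconclusive) / total
    return lower, upper


# Made-up example counts, not real data from this host.
low, high = invalid_ratio_bounds(valid=170, invalid=10, inconclusive=20)
print(f"final invalid ratio is between {low:.1%} and {high:.1%}")
```

The wider the gap between the two bounds, the less the current valid/invalid counts can be trusted on their own.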

I've been thinking about this and it seems that my only solution is to bring up my own project on a "personal SETI server", using one of my other hosts as "wingman". I know it's possible, as the BOINC platform is available for private distributed computing networks (and is already packaged in a VM ready to be launched). Getting WUs would be as easy as copying the ones sent by the real SETI server from all of my hosts to my server, but I'd also need to write a validator app to compare the results... I feel that the whole process would need much more time and knowledge than I'll ever have...

Another option might be writing a software bot to dive into the results page two or three times a day and collect all the data... I guess it will be a complex task to write the parser that extracts the useful data from the page's HTML and then discards the results already read... At least it sounds a bit less difficult than the idea of my own project...
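The parsing and "already read" bookkeeping could be sketched roughly as below, using only Python's standard library. Everything about the page layout is an assumption: the sample HTML, the column order (task ID first, status second) and the status labels are invented for illustration, and the real results page would have to be inspected first. Fetching the page and working out which GPU ran each task are not shown.

```python
from collections import Counter
from html.parser import HTMLParser

FINAL_STATES = {"Valid", "Invalid"}  # hypothetical status labels


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows with only <th> cells stay empty
                self.rows.append(self._row)
            self._row = None


def update_tally(html, seen, tally):
    """Count tasks in a final state that were not seen before; return how many were new."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    new = 0
    for row in parser.rows:
        if len(row) < 2:
            continue
        task_id, status = row[0], row[1]  # assumed column order
        # Skip inconclusive/pending rows so they are picked up on a later run.
        if status not in FINAL_STATES or task_id in seen:
            continue
        seen.add(task_id)
        tally[status] += 1
        new += 1
    return new


# Made-up pages standing in for two visits to the results page.
PAGE_1 = """<table>
<tr><th>Task</th><th>Status</th></tr>
<tr><td><a href="#">1001</a></td><td>Valid</td></tr>
<tr><td><a href="#">1002</a></td><td>Invalid</td></tr>
</table>"""
PAGE_2 = """<table>
<tr><th>Task</th><th>Status</th></tr>
<tr><td><a href="#">1002</a></td><td>Invalid</td></tr>
<tr><td><a href="#">1003</a></td><td>Valid</td></tr>
<tr><td><a href="#">1004</a></td><td>Inconclusive</td></tr>
</table>"""

seen, tally = set(), Counter()
update_tally(PAGE_1, seen, tally)
update_tally(PAGE_2, seen, tally)
print(dict(tally))
```

Only tasks in a final state are recorded, so a task that is inconclusive today is counted once it later turns valid or invalid. In a real bot the `seen` set would need to be saved to disk between runs.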

OK... Sorry for thinking out loud... :O
And again, thanks for helping me. I'll always appreciate people who give their time trying to help others they don't even know, no matter whether the answer can be used or not.

____________

Lint trap
Joined: 30 May 03
Posts: 859
Credit: 26,138,167
RAC: 13,163
United States
Message 1211888 - Posted: 30 Mar 2012, 12:46:50 UTC - in response to Message 1211789.


OK, more input is good. Even thinking out loud is good sometimes...:)


But all that thinking is very flawed. Among all the wild assumptions, the main reason for doubt is that most of the invalid tasks are first marked as inconclusive, then they are resent, and only some time later are they marked as invalid. So the valid tasks might show up faster than the invalid ones, and maybe the real ratio is very far from that 15%.


I had inconclusives too when running the Beta AP app, some of which became Invalid. But, I never suspected my card was at fault because valid work was being done in between the occasional inconclusive, and the card showed no signs of any real problems otherwise.


Lt

Copyright © 2014 University of California