Linux Fedora 17 CUDA pain. Worked on Fedora 16


log in

Advanced search

Message boards : Number crunching : Linux Fedora 17 CUDA pain. Worked on Fedora 16

Author Message
Profile Andy Lee Robinson
Avatar
Send message
Joined: 8 Dec 05
Posts: 615
Credit: 40,819,319
RAC: 43,330
Hungary
Message 1302773 - Posted: 6 Nov 2012, 9:01:04 UTC

I'm supposed to be a professional linux guru, but this is driving me mental!

A few months ago I managed to get lunatics linux+cuda app running on fc16, and since upgrading to fc17 and boinc-client-7.0.29-1.r25790svn.fc17.x86_64 GPU tasks are aborting with this error about 90% of the time.

Cuda error '(cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, (cudaAcc_NumDataPoints / fftlen) * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost))' in file 'cuda/cudaAcc_summax.cu' in line 239 : unspecified launch failure.

app_info.xml is correct as far as I know.

ldd setiathome_x41g_x86_64-pc-linux-gnu_cuda32 gives this:
linux-vdso.so.1 => (0x00007fff7dfff000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003506e00000)
libcudart.so.3 (0x00007faf91bd6000)
libcufft.so.3 (0x00007faf8fe20000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x000000350d600000)
libm.so.6 => /lib64/libm.so.6 (0x0000003507200000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003508e00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003506a00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003506600000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003507600000)
librt.so.1 => /lib64/librt.so.1 (0x0000003507a00000)

libs are in /var/lib/boinc/projects/setiathome.berkeley.edu
-rwxr-xr-x 1 boinc boinc 313872 Dec 2 2011 libcudart.so.3
-rwxr-xr-x 1 boinc boinc 28317K Dec 2 2011 libcufft.so.3

/dev contains these, launched by nvidia-smi -pm 1 in rc.local
crw-rw-rw- 1 root root 195, 0 Nov 5 09:42 nvidia0
crw-rw-rw- 1 root root 195, 255 Nov 5 09:42 nvidiactl

/etc/ld.so.conf points to where cuda is installed:
/usr/local/cuda/lib
/usr/local/cuda/lib64
and
/usr/lib64/nvidia
/usr/lib/nvidia

echo $PATH
/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/root/bin:/usr/local/cuda/bin

In directory /var/lib/boinc/projects/setiathome.berkeley.edu
I copied a failed WU to work_unit.sah and ran
./setiathome_x41g_x86_64-pc-linux-gnu_cuda32 -standalone
and it completed OK.
I'm at a loss to explain why it won't work reliably under boinc-client :(

Profile jason_geeProject donor
Volunteer developer
Volunteer tester
Avatar
Send message
Joined: 24 Nov 06
Posts: 4963
Credit: 73,070,273
RAC: 14,828
Australia
Message 1302800 - Posted: 6 Nov 2012, 11:05:02 UTC
Last modified: 6 Nov 2012, 11:08:20 UTC

The crux of the issues relate to the continually evolving specifications on which the hardware/firmware & software infrastructure are based on. Whether on not you subscribe to Microsoft 'wisdom', the WDDM (Windows Device Driver Model, introduced from Vista onwards) was designed to replace the more traditional hardware based XP driver model (XPDM, ala directX - XBox ) with a fully virtualised environment to last some 10-15 years into the future.

That involves significant architectural difficulties cross generation at all levels of abstraction, not least of those involving synchronisation and security concepts familiar to multithreading/OS implementations beforehand, but otherwise alien to 'coprocessors'. I naively suspect the latter [Linux] kernel may have introduced multithreaded display drivers, so inducing the same set of issues.


Some recent Windows updates related to Texture/Font cache functionality. It wouldn't surprise me at all if recent Linux Kernel/driver updates mirror the 'odd behaviour' to at least some extent, and that Aaron, Martin myself & maybe fedora/redhat/centos need to slap each other around a bit to figure out what'll work there as an intermediate fix... until technologies stabilise a bit at least.

Don't feel too bad. Getting the messages about the rapid pace of development through to Windows developers is at least as difficult, as evidenced by game developers similarly struggling to keep up. Somebpdy needs to strap devs down to listen, and Engineers to reign in progress enough for us Luddites to keep up.

Jason
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Profile Tron
Send message
Joined: 16 Aug 09
Posts: 180
Credit: 2,236,055
RAC: 0
United States
Message 1302838 - Posted: 6 Nov 2012, 14:24:23 UTC

I have had the same problem with ubuntu 12.04 , I bounced around video drivers till it worked right. I now only see that error about one in 50 tasks. I suppose thats an acceptable rate.

Profile ML1
Volunteer tester
Send message
Joined: 25 Nov 01
Posts: 8376
Credit: 4,104,598
RAC: 1,048
United Kingdom
Message 1302926 - Posted: 7 Nov 2012, 0:31:22 UTC - in response to Message 1302800.

Mmmmm... I'm not that on the bleeding edge of distros at the moment so not noticed...

Anyone else confirm? Or worth me taking a look on a test machine?...


Happy fast crunchin',
Martin

____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Profile jason_geeProject donor
Volunteer developer
Volunteer tester
Avatar
Send message
Joined: 24 Nov 06
Posts: 4963
Credit: 73,070,273
RAC: 14,828
Australia
Message 1303012 - Posted: 7 Nov 2012, 7:09:54 UTC - in response to Message 1302926.

... Or worth me taking a look on a test machine?...


Always worth reproducing & characterising the issues where practical, particularly isolating the particular technology change to OS/Driver/library specific areas. Notifying the right parties with as much info as possible seems to work for me.

Would you have a relatively recent Centos install Martin ? being further down the development chain there might be some insights there as well ?

____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Profile Andy Lee Robinson
Avatar
Send message
Joined: 8 Dec 05
Posts: 615
Credit: 40,819,319
RAC: 43,330
Hungary
Message 1308116 - Posted: 20 Nov 2012, 15:37:34 UTC - in response to Message 1303012.
Last modified: 20 Nov 2012, 15:52:33 UTC

Jason, I've identified why it happens! Just don't know how to solve it yet though.

I get a ton of spam - many thousands per day. DKIM, virus and spamassassin scanning takes up a lot of webserver CPU.
Given that most of the spam comes from countries that none of my clients do business with - VN, CN, UA, TH, ID, BR etc, I did something clever on the webserver to ease the load.

I set up iptables rules to preroute the addresses from just those countries and send them over openvpn to here, then to the i7 with the nvidia 550 card to process the spam and automatically report to spamcop etc.

I also wrote log scanners to ban ip addresses of antisocial machines, port scanners, ssh/ftp attacks, phpmyadmin, http proxy probes etc...

Not only do they defend servers from attack, the bad ip addresses are also distributed over mysql replication to all other servers and added to their iptables too. (They are purged automatically after a few days depending on history).

The really weird thing, I noticed in the messages log that CUDA errors were happening when iptables blocked an address! WTF???

I stopped proxying mail traffic to the i7 now, and the CUDA errors have gone away.

It looks like there is a path to investigate, but how in the world does a net packet rejection cause CUDA to fail?

The kernel and rsyslog would be involved, and maybe a writing line to the console and messages log introduced some kind of delay that caused the error.

I noticed it also happened with Einstein CUDA app. and occurrence is very strongly correlated from the messages file:
The app that was running at the time was also aborted with an error, and the seti app also generated similar NVRM errors.

...
Nov 20 10:21:21 ares kernel: [875066.460137] MAIL_DROP:IN=em1 OUT= SRC=109.162.92.6
Nov 20 10:21:21 ares kernel: [875066.471965] NVRM: Xid (0000:03:00): 13, 0001 00000000 000090c0 00002390 00000000 00000000
Nov 20 10:21:24 ares kernel: [875069.418647] MAIL_DROP:IN=em1 OUT= SRC=109.162.92.6
Nov 20 10:21:25 ares kernel: [875071.061918] NVRM: Xid (0000:03:00): 13, 0001 00000000 000090c0 00001b0c 00000000 00000000
...

Perhaps a workaround could be for a CUDA app to handle these errors more gracefully, pause and retry n times if a function fails because of an occasional kernel hiccup?

Meanwhile, I hope relevant maintainers can look into this bizarre behaviour and solve it.

Profile jason_geeProject donor
Volunteer developer
Volunteer tester
Avatar
Send message
Joined: 24 Nov 06
Posts: 4963
Credit: 73,070,273
RAC: 14,828
Australia
Message 1308132 - Posted: 20 Nov 2012, 16:29:57 UTC - in response to Message 1308116.
Last modified: 20 Nov 2012, 16:36:37 UTC

Right, that sounds like down the right path.

As underlying driver architectures have evolved, mostly driven by Windows Gaming :), the technologies have become more interdependant in unexpected ways.

Whatever the Linux equivalent of a Windows 'Deferred Procedure Call' is (aka software interrupt ), On Windows you check the latency with DPC Latency checker ( http://www.thesycon.com/deu/latency_check.shtml )

On Windows you can starkly see when enabling any system device that has a poor quality driver ( such as a wifi card, usb devices etc ), by its impact on the DPC latency, which directly affects memory transfer 'timeouts' streaming multimedia performance/dropouts). In later app builds I use Boincapi's new temporary exit facilities, so catching & resolving these failures when they occur (still preferable to not trigger them of course).

In your case, barring some equivalent Linux DPC latency checking tool, I would suspect your NIC driver may be the culprit... Failing that, iptables itself or other supporting infrastructure may not be using best practices with respect to multithreaded environment, particularly not returning from interrupts immediately & handling IO completion in a separate (deferred) thread, but instead processing for too long in a kernel callback or similar. Worth delving deeper if better NICs/Drivers yield no improvement.

Jason
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Profile Andy Lee Robinson
Avatar
Send message
Joined: 8 Dec 05
Posts: 615
Credit: 40,819,319
RAC: 43,330
Hungary
Message 1308372 - Posted: 21 Nov 2012, 15:07:18 UTC - in response to Message 1308132.
Last modified: 21 Nov 2012, 15:09:42 UTC

Unlikely to be the onboard nic and drivers, it deals with gigabytes of transfers without upsetting CUDA, a packet has to be received before being rejected, so I suspect something else. (I'll modify the iptables rules to drop silently and see if that helps).

I grew up with hardware interrupts and non-maskable interrupts on a 6502 micro!
I don't know the intricacies of CUDA and hardware level architecture, but it is my understanding that non-urgent interrupts should themselves be interruptible, and I would define a boinc task and functions called, cuda or otherwise to be non-urgent!

Please have a look at the CUDA app code again and consider a retry if a routine fails. I think the CUDA library should be looked at too by NVidia to respect the demands of a system and yield - adapting a kernel to accommodate a library's deficiencies or inaccurate assumptions is a bigger task, though probably not without precedent.
I'll have a look to see where I can file a bug report!

Profile jason_geeProject donor
Volunteer developer
Volunteer tester
Avatar
Send message
Joined: 24 Nov 06
Posts: 4963
Credit: 73,070,273
RAC: 14,828
Australia
Message 1308559 - Posted: 22 Nov 2012, 0:20:24 UTC - in response to Message 1308372.

Unlikely to be the onboard nic and drivers, it deals with gigabytes of transfers without upsetting CUDA, a packet has to be received before being rejected, so I suspect something else. (I'll modify the iptables rules to drop silently and see if that helps).

I grew up with hardware interrupts and non-maskable interrupts on a 6502 micro!
I don't know the intricacies of CUDA and hardware level architecture, but it is my understanding that non-urgent interrupts should themselves be interruptible, and I would define a boinc task and functions called, cuda or otherwise to be non-urgent!

Please have a look at the CUDA app code again and consider a retry if a routine fails. I think the CUDA library should be looked at too by NVidia to respect the demands of a system and yield - adapting a kernel to accommodate a library's deficiencies or inaccurate assumptions is a bigger task, though probably not without precedent.
I'll have a look to see where I can file a bug report!


Yes those aspects have been ongoing development matters for all concerned, including MS, nVidia (through reports backstage by myself & others), and in app development. Those issues really only came forward as driver infrastructure moved toward being more multithreaded for scalability.

As mentioned, the most recent app codebase ( up to x41zb ) takes the path of using Boincapi's temporary exit mechanisms where needed, with retries under certain known recoverable/transient conditions.

After initialisation steps were moved to be much more atomic than the original 6.08-6.10 code, various forms of recovery were experimented with. The primary reason boinc temporary exits have been found to be preferred in many cases, is that historically various driver, library, OS & genuine hardware issues have been pretty much unrecoverable at the application level (context corruption). A full exit/restart reinitialises the Cuda context etc completely, often after the OS decided to reset the device. Another aspect to consider is we are dealing with consumer level hardware, in many cases being pushed too far , which is going to take a whole new approach toward ensuring some level of reliability/integrity.

Much of this challenge originates from the incorporation of DMA engines & helper threads, into traditionally not very threadsafe programming environment. For example boincapi freeing host memory buffers underneath active transfers was found to be the dominant cause of 'sticky-downclocks' seen when snoozing/exiting Boinc, effectively sending everything lower level into failsafe. That's one example only, and in retrospect it would have been helpful if nVidia had pushed forward warnings about thread safety as the architectures moved toward the current highly threaded implementation. (rather than wait for our reports :) )

Once someone gets around to making an updated build for Linux, I'd be interested to see if there are any special considerations there as well. It's certainly been a long haul on shifting ground, as was expected.

current public beta x41zb sources reside in Berkeley's svn at:
https://setisvn.ssl.berkeley.edu/trac/browser/branches/sah_v7_opt/Xbranch

,would require some makefile updates to reflect the changes for V7 autocorrelation additions in testing on the beta project, along with other refinements. Hopefully as the project resolves some technical issues server side, and the current approaches toward improving cross-device-generation, cross cuda-version matches are able to be verified/improved, then the focus can shift more toward dealing with the likes of rogue hosts & misconfigured systems, which tend to dominate more than more pedestrian issues of the past.

Jason
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Message boards : Number crunching : Linux Fedora 17 CUDA pain. Worked on Fedora 16

Copyright © 2014 University of California