Posts by Andy Lee Robinson


1) Message boards : Number crunching : Welcome Back! (Message 1311505)
Posted 595 days ago by Andy Lee Robinson
Yay! and http://setistats.haveland.com came back too without any twiddling.
2) Message boards : Number crunching : Linux Fedora 17 CUDA pain. Worked on Fedora 16 (Message 1308372)
Posted 609 days ago by Andy Lee Robinson
It's unlikely to be the onboard NIC and its drivers: it handles gigabytes of transfers without upsetting CUDA, and a packet has to be received before it can be rejected, so I suspect something else. (I'll modify the iptables rules to drop silently and see if that helps.)

I grew up with hardware interrupts and non-maskable interrupts on a 6502 micro!
I don't know the intricacies of CUDA and hardware-level architecture, but it is my understanding that non-urgent interrupts should themselves be interruptible, and I would class a BOINC task and any functions it calls, CUDA or otherwise, as non-urgent!

Please have a look at the CUDA app code again and consider a retry if a routine fails. I think NVidia should look at the CUDA library too, so that it respects the demands of the system and yields. Adapting a kernel to accommodate a library's deficiencies or inaccurate assumptions is a bigger task, though probably not without precedent.
I'll have a look to see where I can file a bug report!
3) Message boards : Number crunching : Linux Fedora 17 CUDA pain. Worked on Fedora 16 (Message 1308116)
Posted 610 days ago by Andy Lee Robinson
Jason, I've identified why it happens! Just don't know how to solve it yet though.

I get a ton of spam - many thousands per day. DKIM, virus and spamassassin scanning takes up a lot of webserver CPU.
Given that most of the spam comes from countries that none of my clients do business with - VN, CN, UA, TH, ID, BR etc, I did something clever on the webserver to ease the load.

I set up iptables rules to preroute the addresses from just those countries and send them over openvpn to here, then to the i7 with the nvidia 550 card to process the spam and automatically report to spamcop etc.

I also wrote log scanners to ban ip addresses of antisocial machines, port scanners, ssh/ftp attacks, phpmyadmin, http proxy probes etc...

Not only do they defend servers from attack, the bad ip addresses are also distributed over mysql replication to all other servers and added to their iptables too. (They are purged automatically after a few days depending on history).

The really weird thing, I noticed in the messages log that CUDA errors were happening when iptables blocked an address! WTF???

I stopped proxying mail traffic to the i7 now, and the CUDA errors have gone away.

It looks like there is a path to investigate, but how in the world does a net packet rejection cause CUDA to fail?

The kernel and rsyslog would be involved, and maybe writing a line to the console and messages log introduced some kind of delay that caused the error.

I noticed it also happens with the Einstein CUDA app: the app that was running at the time was aborted with an error, and the seti app generated similar NVRM errors.
The messages file shows that the errors are very strongly correlated with the packet drops:

...
Nov 20 10:21:21 ares kernel: [875066.460137] MAIL_DROP:IN=em1 OUT= SRC=109.162.92.6
Nov 20 10:21:21 ares kernel: [875066.471965] NVRM: Xid (0000:03:00): 13, 0001 00000000 000090c0 00002390 00000000 00000000
Nov 20 10:21:24 ares kernel: [875069.418647] MAIL_DROP:IN=em1 OUT= SRC=109.162.92.6
Nov 20 10:21:25 ares kernel: [875071.061918] NVRM: Xid (0000:03:00): 13, 0001 00000000 000090c0 00001b0c 00000000 00000000
...
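
One way to check the correlation is to pair each NVRM Xid line with the most recent MAIL_DROP line and print the gap between their kernel timestamps. This is only a minimal sketch: the function name `xid_after_drop` and the 2-second default window are assumptions for illustration, and it relies on the bracketed kernel timestamps having the form shown in the excerpt above.

```shell
# Hypothetical helper: reads kernel log lines on stdin and reports every
# NVRM Xid line that follows a MAIL_DROP line within WINDOW seconds.
# Usage: xid_after_drop [WINDOW] < logfile
xid_after_drop() {
    awk -v window="${1:-2}" '
        {
            # Pull the [seconds.microseconds] kernel timestamp out of the line;
            # skip lines that do not carry one.
            if (!match($0, /\[ *[0-9]+\.[0-9]+\]/)) next
            ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
        }
        /MAIL_DROP/ { last_drop = ts; have_drop = 1; next }
        /NVRM: Xid/ {
            if (have_drop && ts - last_drop <= window)
                printf "Xid at %.6f, %.3f s after last drop\n", ts, ts - last_drop
        }
    '
}
```

Running it as `xid_after_drop 2 < /var/log/messages` would list only the Xid errors that closely follow a drop; comparing that count with the total number of Xid lines gives a rough idea of how strong the link is.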

Perhaps a workaround could be for a CUDA app to handle these errors more gracefully, pause and retry n times if a function fails because of an occasional kernel hiccup?
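
The "pause and retry n times" idea can be sketched as a small shell wrapper. This shows the control flow only, not CUDA app code: `retry_n` is a hypothetical helper name, and whether a CUDA context can actually be reused after such a failure is a separate question that this does not answer.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping PAUSE
# seconds between failed attempts. Returns 0 on the first success,
# 1 if every attempt failed.
# Usage: retry_n ATTEMPTS PAUSE command [args...]
retry_n() {
    retry_attempts=$1
    retry_pause=$2
    shift 2
    retry_i=1
    while [ "$retry_i" -le "$retry_attempts" ]; do
        "$@" && return 0
        # Only pause if another attempt is still to come.
        [ "$retry_i" -lt "$retry_attempts" ] && sleep "$retry_pause"
        retry_i=$((retry_i + 1))
    done
    return 1
}
```

For example, `retry_n 3 5 some_command` would try `some_command` at most three times with a five-second pause in between.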

Meanwhile, I hope relevant maintainers can look into this bizarre behaviour and solve it.
4) Message boards : Number crunching : Suggestions for people having problems connecting to the servers (Message 1307177)
Posted 613 days ago by Andy Lee Robinson
Something needs a really good kicking!
Uploads are working, but I haven't been able to report or get any new work for the last couple of days, apart from a couple of freak occasions. I configured a socks proxy, SS5, on my webserver, and that is similarly unreliable.
Curiously boinc on the webserver itself does appear to report results, so I would expect that running the proxy on it would help, but no. Perhaps the scheduler chooses a server based on host id and not its ip address?
5) Message boards : Number crunching : Linux Fedora 17 CUDA pain. Worked on Fedora 16 (Message 1302773)
Posted 624 days ago by Andy Lee Robinson
I'm supposed to be a professional linux guru, but this is driving me mental!

A few months ago I managed to get lunatics linux+cuda app running on fc16, and since upgrading to fc17 and boinc-client-7.0.29-1.r25790svn.fc17.x86_64 GPU tasks are aborting with this error about 90% of the time.

Cuda error '(cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, (cudaAcc_NumDataPoints / fftlen) * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost))' in file 'cuda/cudaAcc_summax.cu' in line 239 : unspecified launch failure.

app_info.xml is correct as far as I know.

ldd setiathome_x41g_x86_64-pc-linux-gnu_cuda32 gives this:
linux-vdso.so.1 => (0x00007fff7dfff000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003506e00000)
libcudart.so.3 (0x00007faf91bd6000)
libcufft.so.3 (0x00007faf8fe20000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x000000350d600000)
libm.so.6 => /lib64/libm.so.6 (0x0000003507200000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003508e00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003506a00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003506600000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003507600000)
librt.so.1 => /lib64/librt.so.1 (0x0000003507a00000)

libs are in /var/lib/boinc/projects/setiathome.berkeley.edu
-rwxr-xr-x 1 boinc boinc 313872 Dec 2 2011 libcudart.so.3
-rwxr-xr-x 1 boinc boinc 28317K Dec 2 2011 libcufft.so.3

/dev contains these, launched by nvidia-smi -pm 1 in rc.local
crw-rw-rw- 1 root root 195, 0 Nov 5 09:42 nvidia0
crw-rw-rw- 1 root root 195, 255 Nov 5 09:42 nvidiactl

/etc/ld.so.conf points to where cuda is installed:
/usr/local/cuda/lib
/usr/local/cuda/lib64
and
/usr/lib64/nvidia
/usr/lib/nvidia

echo $PATH
/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/root/bin:/usr/local/cuda/bin

In directory /var/lib/boinc/projects/setiathome.berkeley.edu
I copied a failed WU to work_unit.sah and ran
./setiathome_x41g_x86_64-pc-linux-gnu_cuda32 -standalone
and it completed OK.
I'm at a loss to explain why it won't work reliably under boinc-client :(
6) Message boards : Number crunching : 62 AP_V5 Left In The Field (Message 1242242)
Posted 777 days ago by Andy Lee Robinson

Report deadline 9 Jun 2012 | 8:49:09 UTC


Received 6 Jun 2012 | 12:17:25 UTC

Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 6259723
Report deadline 9 Jun 2012 | 8:49:09 UTC
Run time 456,056.73
CPU time 417,734.70
Validate state Task was reported too late to validate
Credit 0.00
Application version Astropulse v505 v5.05


Err... Received 6 Jun, report deadline 9 Jun - it was returned within the period, so why too late to validate?

An AMD E350 is a silly processor to use on Boinc, but IMHO it still should have validated.
7) Message boards : Number crunching : SAH On Linux (Message 1240631)
Posted 780 days ago by Andy Lee Robinson
This should renice the cpu/gpu threads as a simple command:

for i in $(ps -C setiathome_x41g -o pid=); do renice -n 0 $i; done;

I guess it could be added to /etc/crontab to run every minute:
* * * * * root sleep 10; (for i in $(ps -C setiathome_x41g -o pid=); do renice -n 0 $i; done;)

Not sure how much it really helps, and it might not be a good idea on a production webserver/db :-)
8) Message boards : News : Major Power Outage at SSL (Message 1239959)
Posted 781 days ago by Andy Lee Robinson

https://twitter.com/#!/setihome

Funny how the colour "yellow" and the word "bypass" come to mind... :-)
9) Message boards : News : Major Power Outage at SSL (Message 1239766)
Posted 782 days ago by Andy Lee Robinson
Instead of an impossible redirect, I would have suggested a managed update of DNS, as the DNS servers shouldn't be on site. For a planned outage, the TTL on the subdomain's resource record is lowered before the switch, so that resolvers refresh more often and the update takes effect almost immediately.
However, with infinite wisdom, the forum website has the same domain name as the project's, so after a DNS update the backup info site would be swamped with half a million machines trying to connect. Not good!

The forum is also intimately tied to the project database, so it couldn't run standalone, unless connecting to a remote multiple master db, or to a slave as readonly.

A free blog site such as setiathome.wordpress.com could suffice, at present available for someone to register.

Or how about:
Twitter feed?
Boincstats shoutbox?
An official Facebook page? (not the uninteractive wiki)

I could also update my stats page on http://setistats.haveland.com/ if someone lets me know.

So many ways to communicate now - no excuses for info darkness.
10) Message boards : Number crunching : Hang on wingys, I'm trying.......... (Message 1234239)
Posted 794 days ago by Andy Lee Robinson
Mark, I have 3 P5Ks still running.
On one of them, 2 of the SATA ports failed and it is running on what is left.
Your board could have a similar problem, so try the disk on different ports.
If it's still wobbly, try with another disk?

It may be that retirement is looming...
I had to tone down the overclocking on my boards to maintain stability.
11) Message boards : Number crunching : 62 AP_V5 Left In The Field (Message 1233901)
Posted 794 days ago by Andy Lee Robinson
A theoretical 250 days to bounce around peeing everyone off!

There has to be a way for a user or admin to flag a bad host and force reassignment, prioritize high-turnover hosts etc.
There already is something in place that limits delivery to hosts that produce many errors, so it shouldn't be difficult to add. I remember having this discussion some years ago!

Not checking in for a few days could count as an error, if the user has specified that the machine is permanently connected to the net and expects to check in/report daily.
12) Message boards : Number crunching : 62 AP_V5 Left In The Field (Message 1233768)
Posted 795 days ago by Andy Lee Robinson
I'm also wondering why it went back up to 4. I hope someone in the lab is at least curious enough to check for himself and see if they're out to hosts that are likely to return them.


Simply because tasks that have timed out "return home" to wait for a new host to come along. Until then, they are not out in the field!
13) Message boards : Number crunching : how much GPU can the download server support? (Message 1213367)
Posted 842 days ago by Andy Lee Robinson
The plan right now is to replace all 3 of the current download/upload servers and compile them into one server.


hm... then it'll probably need the tcp/ip settings tweaking a lot to handle the number of sessions and a shedload of cpu for connection tracking and massive disk concurrency.

I think 4 servers, using iptables to route based on the least significant two bits of the client address over a 1G network, or 4x100Mbps split between different providers, would get things running acceptably again.
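
A rough sketch of what routing on the two low bits could look like, assuming the bits are taken from the client's source IP address. iptables accepts an address/mask pair with a non-contiguous mask, so `0.0.0.N/0.0.0.3` matches every source whose last two bits equal N. The backend addresses, the port and both function names are hypothetical, and the rules are only printed for review, not applied.

```shell
# Which of the 4 shards a client address falls into (last octet AND 3).
lsb_shard() {
    last_octet=${1##*.}
    echo $(( last_octet & 3 ))
}

# Print one DNAT rule per backend; backend N receives the clients whose
# source address ends in the bit pattern N.
# Usage: print_shard_rules PORT backend0 backend1 backend2 backend3
print_shard_rules() {
    port=$1
    shift
    n=0
    for backend in "$@"; do
        echo "iptables -t nat -A PREROUTING -p tcp --dport $port -s 0.0.0.$n/0.0.0.3 -j DNAT --to-destination $backend"
        n=$((n + 1))
    done
}
```

For instance, `print_shard_rules 80 10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4` prints four rules, and `lsb_shard 109.162.92.6` shows which shard a given client would land on. Because a client always maps to the same backend, retries go back to the same server.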

The current state of affairs is really really painful, with so much bandwidth wasted in retries.

The lowest hanging fruit (knee level) is to change the WU data encoding for MB tasks from base64 to binary (à la Astropulse); that would win about 20% efficiency for a simple tweak.
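
Base64 turns every 3 bytes into 4 characters, so the raw data is a quarter smaller than its encoded form. A minimal sketch of the arithmetic follows; the 262144-byte payload used in the usage note is a hypothetical figure, not a measured workunit size, and line breaks in the encoded data are ignored. The gain on a whole workunit file would be somewhat lower because anything outside the encoded data is unaffected.

```shell
# Length in characters of the base64 encoding of N raw bytes
# (4 output characters for every started group of 3 input bytes).
base64_len() {
    echo $(( ( ($1 + 2) / 3 ) * 4 ))
}

# Percentage saved by sending N raw bytes instead of their base64 form
# (integer division, so the result is rounded down).
binary_saving_pct() {
    encoded=$(base64_len "$1")
    echo $(( (encoded - $1) * 100 / encoded ))
}
```

With the hypothetical payload, `base64_len 262144` gives 349528 characters and `binary_saving_pct 262144` gives 25.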
14) Message boards : News : Your chance to be famous. (Message 1209522)
Posted 851 days ago by Andy Lee Robinson
I wonder why msattler springs to mind... :)
15) Message boards : Number crunching : memory as a ram disk (Message 1191018)
Posted 902 days ago by Andy Lee Robinson
Not quite true... might gain an extra couple of seconds per day!
16) Message boards : Number crunching : Sparse FFT: 10x speed (Message 1187056)
Posted 914 days ago by Andy Lee Robinson
See the obscure and misleading "SODA, with a D" topic...

I don't think it'll be useful because it throws the baby out with the bathwater, and the babies we are looking for are really really really tiny!
17) Message boards : Number crunching : SODA, with a D (Message 1186962)
Posted 915 days ago by Andy Lee Robinson
Thanks Martin, I do know what they are, where and why they are used!
(I wrote an FFT algorithm in 6502 ASM to do digital vocoding amongst other things for my postgrad thesis 25 years ago!)

This algorithm is not a lossless FFT; it economizes by throwing away insignificant frequencies and working on the rest.
Not too useful for SETI where everything is significant - the faintest of signals may be overlooked and discarded.
It might be useful for superficial scanning, followed by a deeper search with the normal FFT.
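
The concern about faint signals can be illustrated with a toy example. This is not the sparse FFT algorithm itself, only the effect of keeping the k largest frequency bins of an ordinary DFT. The signal (a strong tone in bin 5 plus a faint tone in bin 12, over 64 samples) and the function name `top_k_bins` are made up for the illustration.

```shell
# Toy illustration: build a 64-sample signal with a strong tone in bin 5 and
# a faint tone in bin 12, compute a plain DFT, and print the K strongest bins.
# Usage: top_k_bins K FAINT_AMPLITUDE
top_k_bins() {
    awk -v k="$1" -v faint="$2" 'BEGIN {
        n = 64; pi = atan2(0, -1)
        for (t = 0; t < n; t++)
            x[t] = sin(2*pi*5*t/n) + faint * sin(2*pi*12*t/n)
        # Magnitude of each positive-frequency bin (direct DFT, no FFT).
        for (f = 1; f < n/2; f++) {
            re = 0; im = 0
            for (t = 0; t < n; t++) {
                re += x[t] * cos(2*pi*f*t/n)
                im -= x[t] * sin(2*pi*f*t/n)
            }
            mag[f] = sqrt(re*re + im*im)
        }
        # Repeatedly pick the largest bin not chosen yet.
        for (i = 0; i < k; i++) {
            best = 0
            for (f = 1; f < n/2; f++)
                if (!(f in used) && (best == 0 || mag[f] > mag[best])) best = f
            used[best] = 1
            printf "%s%d", (i ? " " : ""), best
        }
        print ""
    }'
}
```

With `top_k_bins 1 0.01` only the strong tone survives and the faint one is discarded, while `top_k_bins 2 0.01` keeps both. In real data the faint tone would also have to compete with noise bins for a place among the k retained coefficients.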
18) Message boards : Number crunching : Breathing life into an old computer (Message 1186785)
Posted 915 days ago by Andy Lee Robinson
Now I just have to speed up the Pentium 4.


Bin it, and spend the money you save on its electricity bill on another GPU for the other machine.
19) Message boards : Number crunching : SODA, with a D (Message 1186782)
Posted 915 days ago by Andy Lee Robinson
um... that should be Foxtrot Foxtrot Tango if you want to use the international convention...

Interesting article, but pity that the graphic incorrectly sums the example waveforms, and appears to just be drawn!

I'm not sure that sparse FFTs are applicable for SETI, as they are used for lossy compression of media.
20) Message boards : Number crunching : Ramdisk (Message 1185306)
Posted 921 days ago by Andy Lee Robinson
client_state.xml updated quite often and takes ~4MB on my not too fast host...
On more fast hosts it would be much bigger.

So, there is something in BOINC that can win from faster "HDD" access.
Robustness of such BOINC setup is another question.

That is true. With my low-end rigs, my client_state has never been more than 1MB, but it gets re-generated/updated every 60 seconds, and that <1MB file ends up with hundreds of fragments in the file system.

I imagine that when we lose the limits and some people go back to 10,000+ tasks in their cache, client_state must be huge ( >10MB ), and with that, thousands of fragments, which makes disk-access time for it a lot slower.

However, as stated before, since it is volatile storage, you would have to write the contents of it to disk somewhat frequently as a safety measure.



It isn't worth it - disk transfers are mostly cached and handled by DMA, so while the cpu is waiting for I/O the FPU/GPU can still be working.

A 10 MB file loads and saves in less than 1/4 of a second, so even at once a minute, that's only 1/240th of the time, during which the FPU/GPUs are still crunching.

Disk caching makes running from ramdisk almost redundant, unless used like a "persistent cache" that doesn't expire other cached stuff.

For example, if I need to do a lot of random access processing on files that might impact the performance of other disk intensive applications, mysql etc. by head thrashing, then I copy from disk to the ramdrive, do the intensive stuff on them and copy back.

Perhaps php sessions can benefit, with rsynced backup and modified httpd start/stop to save/restore on boot/shutdown, but as these are all cached anyway, any performance benefits are really marginal. Any site requiring such performance should have its sessions distributed through a mysql cluster and memcached anyway!

The best improvement I've seen from a ramdisk is using /dev/shm as a temp dir for mysql, which I can't recommend highly enough.
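
A minimal sketch of that setup, assuming a Linux host with GNU `stat` and a MySQL server that reads its options from a `[mysqld]` section. `tmpdir` is the server option that sets the temporary directory; the two helper names are hypothetical. The stanza is only printed, so it can be reviewed before being added to the server's option file.

```shell
# True if PATH is on a tmpfs (RAM-backed) filesystem; false if it is not,
# or if stat cannot report the filesystem type.
is_tmpfs() {
    [ "$(stat -f -c %T "$1" 2>/dev/null)" = "tmpfs" ]
}

# Print the option-file stanza that points mysqld's temp dir at DIR.
print_tmpdir_stanza() {
    printf '[mysqld]\ntmpdir = %s\n' "$1"
}
```

Typical use would be `is_tmpfs /dev/shm && print_tmpdir_stanza /dev/shm`. A tmpfs has a size limit and lives in memory, so it is worth checking with `df -h /dev/shm` that large temporary tables will fit.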



Copyright © 2014 University of California