Posts by Andy Lee Robinson

1) Message boards : News : New Data from AO (Message 1609571)
Posted 15 days ago by Andy Lee Robinson
Yay! At last the wheels are moving again!
2) Message boards : Number crunching : Panic Mode On (92) Server Problems? (Message 1605442)
Posted 25 days ago by Andy Lee Robinson
Not much happening, work in progress still steadily falling.

Might be a good idea to have a thorough cleanout before restart!
3) Message boards : Number crunching : Welcome Back! (Message 1311505)
Posted 5 Dec 2012 by Andy Lee Robinson
Yay! and came back too without any twiddling.
4) Message boards : Number crunching : Linux Fedora 17 CUDA pain. Worked on Fedora 16 (Message 1308372)
Posted 21 Nov 2012 by Andy Lee Robinson
It's unlikely to be the onboard NIC and drivers: they deal with gigabytes of transfers without upsetting CUDA, and a packet has to be received before it can be rejected, so I suspect something else. (I'll modify the iptables rules to drop silently and see if that helps.)
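
A minimal sketch of what a "silent drop" rule could look like, assuming the blocked traffic is inbound SMTP on TCP port 25 and that "silently" means no LOG rule and no REJECT reply. The helper name `silent_drop_rule`, the INPUT chain, and the example address are hypothetical; the function only prints the command so it can be reviewed before being run as root.

```shell
#!/bin/bash
# Hypothetical helper: print (do not execute) an iptables rule that drops
# SMTP traffic from one source address without logging or replying.
silent_drop_rule() {
    local src="$1"
    # -j DROP discards the packet; no preceding "-j LOG" rule means nothing
    # is written to the kernel log for it.
    printf 'iptables -A INPUT -s %s -p tcp --dport 25 -j DROP\n' "$src"
}

# Example with a documentation-range address (assumption, not from the logs).
silent_drop_rule 203.0.113.7
```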

I grew up with hardware interrupts and non-maskable interrupts on a 6502 micro!
I don't know the intricacies of CUDA and hardware-level architecture, but it is my understanding that non-urgent interrupts should themselves be interruptible, and I would define a BOINC task and any functions it calls, CUDA or otherwise, as non-urgent!

Please have a look at the CUDA app code again and consider a retry if a routine fails. I think NVIDIA should also look at the CUDA library so that it respects the demands of the system and yields. Adapting a kernel to accommodate a library's deficiencies or inaccurate assumptions is a bigger task, though probably not without precedent.
I'll have a look to see where I can file a bug report!
5) Message boards : Number crunching : Linux Fedora 17 CUDA pain. Worked on Fedora 16 (Message 1308116)
Posted 20 Nov 2012 by Andy Lee Robinson
Jason, I've identified why it happens! Just don't know how to solve it yet though.

I get a ton of spam - many thousands per day. DKIM, virus and spamassassin scanning takes up a lot of webserver CPU.
Given that most of the spam comes from countries that none of my clients do business with - VN, CN, UA, TH, ID, BR etc, I did something clever on the webserver to ease the load.

I set up iptables rules to preroute the addresses from just those countries and send them over openvpn to here, then to the i7 with the nvidia 550 card to process the spam and automatically report to spamcop etc.

I also wrote log scanners to ban ip addresses of antisocial machines, port scanners, ssh/ftp attacks, phpmyadmin, http proxy probes etc...

Not only do they defend servers from attack, the bad ip addresses are also distributed over mysql replication to all other servers and added to their iptables too. (They are purged automatically after a few days depending on history).

The really weird thing: I noticed in the messages log that CUDA errors were happening when iptables blocked an address! WTF???

I stopped proxying mail traffic to the i7 now, and the CUDA errors have gone away.

It looks like there is a path to investigate, but how in the world does a net packet rejection cause CUDA to fail?

The kernel and rsyslog would be involved, and maybe writing a line to the console and messages log introduced some kind of delay that caused the error.

I noticed it also happened with the Einstein CUDA app: the app that was running at the time was aborted with an error, and the SETI app also generated similar NVRM errors.
The occurrence is very strongly correlated in the messages file:

Nov 20 10:21:21 ares kernel: [875066.460137] MAIL_DROP:IN=em1 OUT= SRC=
Nov 20 10:21:21 ares kernel: [875066.471965] NVRM: Xid (0000:03:00): 13, 0001 00000000 000090c0 00002390 00000000 00000000
Nov 20 10:21:24 ares kernel: [875069.418647] MAIL_DROP:IN=em1 OUT= SRC=
Nov 20 10:21:25 ares kernel: [875071.061918] NVRM: Xid (0000:03:00): 13, 0001 00000000 000090c0 00001b0c 00000000 00000000
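
The correlation in the excerpt above could be checked mechanically over a whole messages file. This is a rough sketch only: the function name `correlate_xid` and the default 2-second window are assumptions, and it relies on the bracketed kernel timestamp being present on each line, as it is in the lines quoted here.

```shell
#!/bin/bash
# Count NVRM Xid lines that occur within WINDOW seconds after a MAIL_DROP
# line, comparing the kernel timestamps in square brackets.
# Reads syslog lines on stdin; WINDOW defaults to 2 seconds (assumption).
correlate_xid() {
    local window="${1:-2}"
    awk -v window="$window" '
        {
            # Skip lines without a kernel stamp such as [875066.460137]
            if (!match($0, /\[ *[0-9]+\.[0-9]+\]/)) next
            ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
        }
        /MAIL_DROP:/ { last_drop = ts; seen_drop = 1 }
        /NVRM: Xid/ {
            total++
            if (seen_drop && ts - last_drop <= window) near++
        }
        END {
            printf "%d of %d Xid errors within %ss of a MAIL_DROP\n", near, total, window
        }
    '
}
```

Usage would be along the lines of `correlate_xid 2 < /var/log/messages`, where the log path is whatever file the kernel messages are written to on the machine in question.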

Perhaps a workaround could be for a CUDA app to handle these errors more gracefully, pause and retry n times if a function fails because of an occasional kernel hiccup?
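
The "pause and retry n times" idea can be illustrated at the shell level. This is only a sketch of the pattern: the function name `retry` is hypothetical, and it wraps a whole command rather than an individual CUDA call inside the app, which is where the real fix would have to live.

```shell
#!/bin/bash
# retry MAX DELAY CMD...: run CMD up to MAX times, sleeping DELAY seconds
# between failed attempts. Returns 0 on the first success, 1 if all fail.
retry() {
    local max="$1" delay="$2" attempt=1
    shift 2
    while true; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        # Give up once the allowed number of attempts is used.
        if [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

For example, `retry 3 5 some_command` would try `some_command` at most three times with a five-second pause after each failure.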

Meanwhile, I hope relevant maintainers can look into this bizarre behaviour and solve it.
6) Message boards : Number crunching : Suggestions for people having problems connecting to the servers (Message 1307177)
Posted 17 Nov 2012 by Andy Lee Robinson
Something needs a really good kicking!
Uploads are working, but I haven't been able to report or get any new work for the last couple of days, apart from a couple of freak occasions. I configured a SOCKS proxy, SS5, on my webserver, and that is similarly unreliable.
Curiously boinc on the webserver itself does appear to report results, so I would expect that running the proxy on it would help, but no. Perhaps the scheduler chooses a server based on host id and not its ip address?
7) Message boards : Number crunching : Linux Fedora 17 CUDA pain. Worked on Fedora 16 (Message 1302773)
Posted 6 Nov 2012 by Andy Lee Robinson
I'm supposed to be a professional linux guru, but this is driving me mental!

A few months ago I managed to get lunatics linux+cuda app running on fc16, and since upgrading to fc17 and boinc-client-7.0.29-1.r25790svn.fc17.x86_64 GPU tasks are aborting with this error about 90% of the time.

Cuda error '(cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, (cudaAcc_NumDataPoints / fftlen) * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost))' in file 'cuda/' in line 239 : unspecified launch failure.

app_info.xml is correct as far as I know.

ldd setiathome_x41g_x86_64-pc-linux-gnu_cuda32 gives this: => (0x00007fff7dfff000) => /lib64/ (0x0000003506e00000) (0x00007faf91bd6000) (0x00007faf8fe20000) => /lib64/ (0x000000350d600000) => /lib64/ (0x0000003507200000) => /lib64/ (0x0000003508e00000) => /lib64/ (0x0000003506a00000)
/lib64/ (0x0000003506600000) => /lib64/ (0x0000003507600000) => /lib64/ (0x0000003507a00000)

libs are in /var/lib/boinc/projects/
-rwxr-xr-x 1 boinc boinc 313872 Dec 2 2011
-rwxr-xr-x 1 boinc boinc 28317K Dec 2 2011

/dev contains these, launched by nvidia-smi -pm 1 in rc.local
crw-rw-rw- 1 root root 195, 0 Nov 5 09:42 nvidia0
crw-rw-rw- 1 root root 195, 255 Nov 5 09:42 nvidiactl

/etc/ points to where cuda is installed:

echo $PATH

In directory /var/lib/boinc/projects/
I copied a failed WU to work_unit.sah and ran
./setiathome_x41g_x86_64-pc-linux-gnu_cuda32 -standalone
and it completed OK.
I'm at a loss to explain why it won't work reliably under boinc-client :(
8) Message boards : Number crunching : 62 AP_V5 Left In The Field (Message 1242242)
Posted 6 Jun 2012 by Andy Lee Robinson

Received 6 Jun 2012 | 12:17:25 UTC

Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 6259723
Report deadline 9 Jun 2012 | 8:49:09 UTC
Run time 456,056.73
CPU time 417,734.70
Validate state Task was reported too late to validate
Credit 0.00
Application version Astropulse v505 v5.05

Err... Received 6 Jun, report deadline 9 Jun - it was returned within the period, so why too late to validate?

An AMD E350 is a silly processor to use on Boinc, but IMHO it still should have validated.
9) Message boards : Number crunching : SAH On Linux (Message 1240631)
Posted 3 Jun 2012 by Andy Lee Robinson
This should renice the cpu/gpu threads as a simple command:

for i in $(ps -C setiathome_x41g -o pid=); do renice -n 0 $i; done;

I guess it could be added to /etc/crontab to run every minute:
* * * * * root sleep 10; (for i in $(ps -C setiathome_x41g -o pid=); do renice -n 0 $i; done;)

Not sure how much benefit it really brings, and it might not be helpful on a production webserver/db :-)
10) Message boards : News : Major Power Outage at SSL (Message 1239959)
Posted 2 Jun 2012 by Andy Lee Robinson!/setihome

Funny how the colour "yellow" and the word "bypass" come to mind... :-)
11) Message boards : News : Major Power Outage at SSL (Message 1239766)
Posted 2 Jun 2012 by Andy Lee Robinson
Instead of an impossible redirect, I would have suggested a managed update of DNS, as the DNS servers shouldn't be on site. For a planned outage, the subdomain's resource record refresh rate is increased (i.e. its TTL is lowered) before the switch so that an update takes effect almost immediately.
However, with infinite wisdom, the forum website has the same domain name as the project's, so after a DNS update the backup info site would be swamped with half a million machines trying to connect. Not good!

The forum is also intimately tied to the project database, so it couldn't run standalone, unless connecting to a remote multiple master db, or to a slave as readonly.

A free blog site such as could suffice, at present available for someone to register.

Or how about:
Twitter feed?
Boincstats shoutbox?
An official Facebook page? (not the uninteractive wiki)

I could also update my stats page on if someone lets me know.

So many ways to communicate now - no excuses for info darkness.
12) Message boards : Number crunching : Hang on wingys, I'm trying.......... (Message 1234239)
Posted 20 May 2012 by Andy Lee Robinson
Mark, I have 3 P5Ks still running.
On one of them, 2 of the SATA ports failed and it is running on what is left.
Your board could be a similar thing, so try the disk on different ports.
If it's still wobbly, try with another disk?

It may be that retirement is looming...
I had to tone down the overclocking on my boards to maintain stability.
13) Message boards : Number crunching : 62 AP_V5 Left In The Field (Message 1233901)
Posted 20 May 2012 by Andy Lee Robinson
A theoretical 250 days to bounce around peeing everyone off!

There has to be a way for a user or admin to flag a bad host and force reassignment, prioritize high-turnover hosts, etc.
There already is something in place that limits delivery to hosts that produce many errors, so shouldn't be difficult to add. I remember having this discussion some years ago!

Not checking in for a few days could count as an error, if the user has specified that the machine is permanently connected to the net and expects to check in/report daily.
14) Message boards : Number crunching : 62 AP_V5 Left In The Field (Message 1233768)
Posted 20 May 2012 by Andy Lee Robinson
I'm also wondering why it went back up to 4. I hope someone in the lab is at least curious enough to check for himself and see if they're out to hosts that are likely to return them.

Simply because tasks that have timed out "return home" to wait for a new host to come along. Until then, they are not out in the field!
15) Message boards : Number crunching : how much GPU can the download server support? (Message 1213367)
Posted 2 Apr 2012 by Andy Lee Robinson
The plan right now is to replace all 3 of the current download/upload servers and compile them into one server.

hm... then it'll probably need the tcp/ip settings tweaking a lot to handle the number of sessions and a shedload of cpu for connection tracking and massive disk concurrency.

I think 4 servers, using iptables to route based on the least significant two bits over a 1G network, or 4x100 Mbps split between different providers, would get things running acceptably again.
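
To make the two-bit split concrete, here is a small sketch of how a client could be mapped to one of four servers. It assumes the bits are taken from the client's IPv4 address (the low two bits of the last octet), which the post does not spell out; the function name `server_for_ip` is hypothetical.

```shell
#!/bin/bash
# Map an IPv4 address to a server index 0..3 using its two least
# significant bits, i.e. the low two bits of the last octet.
server_for_ip() {
    local ip="$1"
    # Strip everything up to the final dot to get the last octet.
    local last_octet="${ip##*.}"
    # 10# forces base 10 so an octet like "08" is not read as octal.
    echo $(( 10#$last_octet & 3 ))
}

server_for_ip 192.0.2.10   # last octet 10 = binary 1010, low bits 10
```

Because addresses are spread fairly evenly over the low bits, each of the four servers would see roughly a quarter of the clients, and a given client always lands on the same one.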

The current state of affairs is really really painful, with so much bandwidth wasted in retries.

The lowest-hanging fruit (knee level) is to change the WU data encoding for MB tasks from base64 to binary (à la Astropulse); that would win about 20% efficiency for a simple tweak.
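
The size of the gain can be estimated with a little arithmetic. This sketch covers only the encoded payload: base64 emits 4 characters per 3 raw bytes, so the payload alone shrinks by about a quarter, and the figure for a whole workunit file depends on headers and line breaks that are ignored here. The function names and the 300000-byte payload are assumptions for illustration.

```shell
#!/bin/bash
# Size in bytes of the padded base64 encoding of N raw bytes: 4 * ceil(N / 3).
base64_size() {
    local n="$1"
    echo $(( (n + 2) / 3 * 4 ))
}

# Percentage of the base64-encoded size saved by sending raw binary instead
# (integer arithmetic, rounded down).
binary_saving_percent() {
    local n="$1" encoded
    encoded=$(base64_size "$n")
    echo $(( (encoded - n) * 100 / encoded ))
}

# Hypothetical payload size, not taken from a real workunit.
binary_saving_percent 300000
```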
16) Message boards : News : Your chance to be famous. (Message 1209522)
Posted 24 Mar 2012 by Andy Lee Robinson
I wonder why msattler springs to mind... :)
17) Message boards : Number crunching : memory as a ram disk (Message 1191018)
Posted 2 Feb 2012 by Andy Lee Robinson
Not quite true... might gain an extra couple of seconds per day!
18) Message boards : Number crunching : Sparse FFT: 10x speed (Message 1187056)
Posted 21 Jan 2012 by Andy Lee Robinson
See the obscure and misleading "SODA, with a D" topic...

I don't think it'll be useful because it throws the baby out with the bathwater, and the babies we are looking for are really really really tiny!
19) Message boards : Number crunching : SODA, with a D (Message 1186962)
Posted 21 Jan 2012 by Andy Lee Robinson
Thanks Martin, I do know what they are, where and why they are used!
(I wrote an FFT algorithm in 6502 ASM to do digital vocoding amongst other things for my postgrad thesis 25 years ago!)

This algorithm is not a lossless FFT; it economizes by throwing away insignificant frequencies and working on the rest.
Not too useful for SETI where everything is significant - the faintest of signals may be overlooked and discarded.
It might be useful for superficial scanning, then do a deeper search with the normal FFT.
20) Message boards : Number crunching : Breathing life into an old computer (Message 1186785)
Posted 20 Jan 2012 by Andy Lee Robinson
Now I just have to speed up the Pentium 4.

Bin it, and spend the money you save on its electricity bill on another GPU for the other machine.


Copyright © 2014 University of California