Posts by Andy Lee Robinson

1) Message boards : Number crunching : Say goodbye to SETI@home v7. (Message 1803848)
Posted 21 Jul 2016 by Profile Andy Lee Robinson
Post:
Just noticed https://setistats.haveland.com/ wasn't working properly so just fixed it with a new cert.

Look forward to v7 retiring so I can tidy up the page.

sah_v8 stats still haven't been fixed since I reported the problem a few months ago.
The sah_status.xml stats published by the project duplicate the astropulse stats into the v8 stats.
This should be easy to fix if someone could bring it to a developer's attention!
2) Message boards : Number crunching : Bug in sah_status.xml (Message 1782746)
Posted 27 Apr 2016 by Profile Andy Lee Robinson
Post:
I just discovered a bug in sah_status.xml while updating my setistats site:

sah_v8 statistics are duplicated ap stats:

Sample selection:
<sah_current_result_creation_rate>0/sec</sah_current_result_creation_rate>
<ap_current_result_creation_rate>0.0136/sec</ap_current_result_creation_rate>
<sah_v8_current_result_creation_rate>0.0136/sec</sah_v8_current_result_creation_rate>
<sah_results_in_progress>324</sah_results_in_progress>
<ap_results_in_progress>9340</ap_results_in_progress>
<sah_v8_results_in_progress>9340</sah_v8_results_in_progress>


I hope someone could attract the attention of someone that could fix this!
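
A check like this could be scripted. The following is a minimal Python sketch, assuming the stats are flat child elements of a single root; the `<status>` wrapper and the helper name `duplicated_stats` are hypothetical and only there for illustration. It lists every stat whose `ap_` value is identical to its `sah_v8_` counterpart. Two stats can be legitimately equal by chance (for example both zero), so a match is a hint rather than proof.

```python
import xml.etree.ElementTree as ET

# Sample values copied from the post; the <status> root is an assumed wrapper.
SAMPLE = """<status>
<sah_current_result_creation_rate>0/sec</sah_current_result_creation_rate>
<ap_current_result_creation_rate>0.0136/sec</ap_current_result_creation_rate>
<sah_v8_current_result_creation_rate>0.0136/sec</sah_v8_current_result_creation_rate>
<sah_results_in_progress>324</sah_results_in_progress>
<ap_results_in_progress>9340</ap_results_in_progress>
<sah_v8_results_in_progress>9340</sah_v8_results_in_progress>
</status>"""


def duplicated_stats(xml_text, a="ap_", b="sah_v8_"):
    """Return the stat names whose `a`-prefixed and `b`-prefixed values match."""
    root = ET.fromstring(xml_text)
    values = {el.tag: (el.text or "").strip() for el in root.iter()}
    dupes = []
    for tag, value in values.items():
        if tag.startswith(a):
            stat = tag[len(a):]
            if values.get(b + stat) == value:
                dupes.append(stat)
    return sorted(dupes)


print(duplicated_stats(SAMPLE))
```

With the sample above, both `current_result_creation_rate` and `results_in_progress` are reported as duplicated.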
3) Message boards : Number crunching : Update on Linux 64 -Nividia-V8-MB ????? (Message 1771338)
Posted 13 Mar 2016 by Profile Andy Lee Robinson
Post:
Solved it soon afterwards but didn't get a chance to report it.

The PSU's high CFM fan was the culprit and should have been my first suspect but I didn't notice at first because it draws air in through a filter on the bottom and isn't visible.
It had gotten stiff and stopped after about 3 years of continuous use, so I cleaned and oiled it to buy some time and ordered a new one.
The PSU was simply overheating without enough airflow under load, which would explain why it kept rebooting during the POST until it had cooled down enough to complete.

Wasn't much fun fscking raid disks and resynchronizing the mysql databases several times, but it hasn't crashed since even while throwing everything I have at it. :-)

Very relieved it wasn't the GPU!
4) Message boards : Number crunching : Update on Linux 64 -Nividia-V8-MB ????? (Message 1766750)
Posted 21 Feb 2016 by Profile Andy Lee Robinson
Post:
Thanks Jason, two machines working and crunching again, third waiting for a new GPU.

Yes, just need to include an app_info in the package.
Incidentally, the file doesn't appear to be true xml if the order of entries is significant - the containers don't appear to be very hierarchical.
5) Message boards : Number crunching : Update on Linux 64 -Nividia-V8-MB ????? (Message 1766746)
Posted 21 Feb 2016 by Profile Andy Lee Robinson
Post:
Hvala (thanks)!

Got it working eventually - doing my nut because I had this:
<file_info>
  <file_name>libcudart.so.6.0</file_name>
  <executable/>
</file_info>

<file_info> expects "name", but <file_ref> expects "file_name":
<file_ref>
  <file_name>libcudart.so.6.0</file_name>
</file_ref>


Don't ya just love consistency?
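
This kind of mismatch can be caught before restarting boinc. Here is a minimal Python sketch, assuming only the naming rule described above (`<name>` inside `<file_info>`, `<file_name>` inside `<file_ref>`) and an `<app_info>` root element. The function name `check_app_info` is hypothetical, and this is not the check that boinc itself performs.

```python
import xml.etree.ElementTree as ET


def check_app_info(xml_text):
    """Return a list of naming problems found in an app_info.xml-style document."""
    root = ET.fromstring(xml_text)
    problems = []
    declared = set()

    # Every <file_info> should declare its file with <name>.
    for info in root.iter("file_info"):
        name = info.findtext("name")
        if name is None:
            problems.append("file_info: missing <name>")
        else:
            declared.add(name.strip())

    # Every <file_ref> should point at a declared file with <file_name>.
    for ref in root.iter("file_ref"):
        file_name = ref.findtext("file_name")
        if file_name is None:
            problems.append("file_ref: missing <file_name>")
        elif file_name.strip() not in declared:
            problems.append(f"file_ref: {file_name.strip()} has no matching file_info")
    return problems


BAD = """<app_info>
  <file_info>
    <file_name>libcudart.so.6.0</file_name>
    <executable/>
  </file_info>
  <app_version>
    <file_ref>
      <file_name>libcudart.so.6.0</file_name>
    </file_ref>
  </app_version>
</app_info>"""

for problem in check_app_info(BAD):
    print(problem)
```

Run against the broken fragment, it flags the `<file_info>` with no `<name>` and, as a consequence, the `<file_ref>` that no longer matches any declared file.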

Yep, the video card was also a problem :( Replaced it with a smaller one and the reboots have stopped. Really annoying.
6) Message boards : Number crunching : Update on Linux 64 -Nividia-V8-MB ????? (Message 1766585)
Posted 20 Feb 2016 by Profile Andy Lee Robinson
Post:
Thanks Jason,

Have tried to install it, but need a working xml spec to add to app_info.xml

I made some guesses, but boinc just deletes the app after restarting, even if owned by root.

I have no idea what version_num or plan_class should be, or even if coproc/type is correct.

    <app>
      <name>setiathome_v8</name>
    </app>
    <file_info>
      <name>setiathome_x41zi_x86_64-pc-linux-gnu_cuda60</name>
      <executable/>
    </file_info>
    <app_version>
        <app_name>setiathome_v8</app_name>
        <version_num>800</version_num>
        <platform>x86_64-pc-linux-gnu</platform>
        <plan_class>cuda_fermi</plan_class>
        <avg_ncpus>0.05</avg_ncpus>
        <max_ncpus>1.0</max_ncpus>
        <coproc>
            <type>CUDA</type>
            <count>1</count>
        </coproc>
        <file_ref>
            <file_name>setiathome_x41zi_x86_64-pc-linux-gnu_cuda60</file_name>
            <main_program/>
        </file_ref>
        <file_ref>
            <file_name>libcudart.so.6.0</file_name>
        </file_ref>
        <file_ref>
            <file_name>libcufft.so.6.0</file_name>
        </file_ref>
    </app_version>


(I also have another problem in that a month ago this machine started resetting without warning after a few minutes to a few hours of running boinc, and even resets during the boot up. Happens with cpu apps or large videos - have swapped mainboards with another machine, peripherals and ram, and still happens. Only constants were the video card GTX760 and PSU which I thoroughly cleaned. Hope it's not the GPU - next thing to swap and test.)
7) Message boards : Number crunching : Update on Linux 64 -Nividia-V8-MB ????? (Message 1766075)
Posted 18 Feb 2016 by Profile Andy Lee Robinson
Post:
bump... :-)
8) Message boards : Number crunching : Wanna save energy on gpu crunshing ? (Message 1765871)
Posted 17 Feb 2016 by Profile Andy Lee Robinson
Post:
Some scary numbers there with the costs of power!

I guess one could buy another Arecibo antenna with what all participants have spent on energy in contributing to the project so far.

I use my remote webserver to crunch permanently as the electricity isn't counted in the rack rental.
At home I use the servers for heating so crunching is much more justifiable.

Costs are much more of a consideration than they used to be!
9) Message boards : Number crunching : Wanna save energy on gpu crunshing ? (Message 1765738)
Posted 17 Feb 2016 by Profile Andy Lee Robinson
Post:
Just to clarify units because a lot of people seem to be confused:

The watt is an instantaneous measurement of power = a rate of flow.
Energy is measured in joules, which are watt-seconds (watts multiplied by seconds, not watts per second) = how much flowed over a period of time.
1 kWh = 3,600,000 joules.

21 watts per task is meaningless without the time it took, so assuming 1 hour:

21 Wh or 0.021 kWh is 75,600 joules (enough energy to lift 100 kg by 77 meters!)
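
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and g = 9.81 m/s^2 is the usual approximation for standard gravity.

```python
JOULES_PER_WH = 3600  # 1 W sustained for 1 hour = 3600 watt-seconds = 3600 J
G = 9.81              # approximate standard gravity in m/s^2


def energy_joules(power_watts, hours):
    """Energy used by a constant load: power multiplied by time."""
    return power_watts * hours * JOULES_PER_WH


def lift_height_m(energy_j, mass_kg):
    """Height a mass could be raised with that energy (E = m * g * h)."""
    return energy_j / (mass_kg * G)


e = energy_joules(21, 1)
print(e)                             # 75600
print(round(lift_height_m(e, 100)))  # 77
```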
10) Message boards : Number crunching : OS on a HDD and RAID 0 with two 6TB HDDs? (Message 1765508)
Posted 16 Feb 2016 by Profile Andy Lee Robinson
Post:
RAID0
Risky Array of Irretrievable Data!
11) Message boards : Number crunching : OS on a HDD and RAID 0 with two 6TB HDDs? (Message 1765318)
Posted 16 Feb 2016 by Profile Andy Lee Robinson
Post:
Seconded - anyone using RAID0 on such large drives must be nuts unless the data really is temporary and expendable.

I use an array of 6x1TB drives as RAID10 using mdadm software raid.
A disk fails about once a year, and it's fairly trivial to replace and resync.
I'd go for more, smaller drives rather than fewer, larger drives because resyncing is quicker and the data rate is higher.
I also keep it synced with another machine that has similar capacity, in case something really bad happens.

An array of 6TB drives could take half a day or more to resync, and bigger drives mean bigger storage and more time spent on housekeeping.
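
As a rough sanity check on that figure, here is a back-of-envelope sketch in Python. The 150 MB/s sustained rate is purely an assumption for illustration; real resync speed depends on the drives, the controller and how busy the array is.

```python
def resync_hours(capacity_tb, rate_mb_per_s):
    """Hours to copy one whole drive at a constant sequential rate."""
    seconds = capacity_tb * 1e12 / (rate_mb_per_s * 1e6)
    return seconds / 3600


# Assumed sustained rate of 150 MB/s (hypothetical, not measured).
for tb in (1, 6):
    print(f"{tb} TB: {resync_hours(tb, 150):.1f} h")
```

At that assumed rate a 1 TB drive takes a couple of hours and a 6 TB drive around eleven, and that is before any other load on the array slows things down.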

Cannot stress enough that RAID is not a substitute for keeping backups!

Also, disk speed and capacity are of practically negligible benefit for SAH because the work is all FPU/GPU based.

A couple of SSDs and a dedicated NAS for media can make quite a decent system.
12) Message boards : Number crunching : Update on Linux 64 -Nividia-V8-MB ????? (Message 1763961)
Posted 11 Feb 2016 by Profile Andy Lee Robinson
Post:
Thanks Jason, nice to know that you're still working on it - looking forward to getting crunching again!
13) Message boards : Number crunching : Update on Linux 64 -Nividia-V8-MB ????? (Message 1763878)
Posted 10 Feb 2016 by Profile Andy Lee Robinson
Post:
It's often painful when life gets in the way :-(
Hope you managed to get back into the groove...
Do you have any more news on progress of Linux v8 cuda?
We've got machines twiddling their virtual thumbs!
14) Message boards : News : New Data from AO (Message 1609571)
Posted 5 Dec 2014 by Profile Andy Lee Robinson
Post:
Yay! At last the wheels are moving again!
http://setistats.haveland.com/
15) Message boards : Number crunching : Panic Mode On (92) Server Problems? (Message 1605442)
Posted 25 Nov 2014 by Profile Andy Lee Robinson
Post:
Not much happening, work in progress still steadily falling.
http://setistats.haveland.com/

Might be a good idea to have a thorough cleanout before restart!
16) Message boards : Number crunching : Welcome Back! (Message 1311505)
Posted 5 Dec 2012 by Profile Andy Lee Robinson
Post:
Yay! and http://setistats.haveland.com came back too without any twiddling.
17) Message boards : Number crunching : Linux Fedora 17 CUDA pain. Worked on Fedora 16 (Message 1308372)
Posted 21 Nov 2012 by Profile Andy Lee Robinson
Post:
It's unlikely to be the onboard NIC and drivers: they deal with gigabytes of transfers without upsetting CUDA. A packet has to be received before it can be rejected, so I suspect something else. (I'll modify the iptables rules to drop silently and see if that helps.)

I grew up with hardware interrupts and non-maskable interrupts on a 6502 micro!
I don't know the intricacies of CUDA and hardware level architecture, but it is my understanding that non-urgent interrupts should themselves be interruptible, and I would define a boinc task and any functions it calls, CUDA or otherwise, as non-urgent!

Please have a look at the CUDA app code again and consider a retry if a routine fails. I think the CUDA library should be looked at too by NVidia to respect the demands of a system and yield - adapting a kernel to accommodate a library's deficiencies or inaccurate assumptions is a bigger task, though probably not without precedent.
I'll have a look to see where I can file a bug report!
18) Message boards : Number crunching : Linux Fedora 17 CUDA pain. Worked on Fedora 16 (Message 1308116)
Posted 20 Nov 2012 by Profile Andy Lee Robinson
Post:
Jason, I've identified why it happens! Just don't know how to solve it yet though.

I get a ton of spam - many thousands per day. DKIM, virus and spamassassin scanning takes up a lot of webserver CPU.
Given that most of the spam comes from countries that none of my clients do business with - VN, CN, UA, TH, ID, BR etc, I did something clever on the webserver to ease the load.

I set up iptables rules to preroute the addresses from just those countries and send them over openvpn to here, then to the i7 with the nvidia 550 card to process the spam and automatically report to spamcop etc.

I also wrote log scanners to ban ip addresses of antisocial machines, port scanners, ssh/ftp attacks, phpmyadmin, http proxy probes etc...

Not only do they defend servers from attack, the bad ip addresses are also distributed over mysql replication to all other servers and added to their iptables too. (They are purged automatically after a few days depending on history).

The really weird thing is that I noticed in the messages log that CUDA errors were happening when iptables blocked an address! WTF???

I stopped proxying mail traffic to the i7 now, and the CUDA errors have gone away.

It looks like there is a path to investigate, but how in the world does a net packet rejection cause CUDA to fail?

The kernel and rsyslog would be involved, and maybe writing a line to the console and messages log introduced some kind of delay that caused the error.

I noticed it also happened with the Einstein CUDA app: the app that was running at the time was aborted with an error, and the seti app generated similar NVRM errors.
The messages file shows that the occurrences are very strongly correlated:

...
Nov 20 10:21:21 ares kernel: [875066.460137] MAIL_DROP:IN=em1 OUT= SRC=109.162.92.6
Nov 20 10:21:21 ares kernel: [875066.471965] NVRM: Xid (0000:03:00): 13, 0001 00000000 000090c0 00002390 00000000 00000000
Nov 20 10:21:24 ares kernel: [875069.418647] MAIL_DROP:IN=em1 OUT= SRC=109.162.92.6
Nov 20 10:21:25 ares kernel: [875071.061918] NVRM: Xid (0000:03:00): 13, 0001 00000000 000090c0 00001b0c 00000000 00000000
...
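
To put a number on that correlation, the kernel timestamps can be compared directly. The Python sketch below is one way to do it, using only the four lines quoted above; the function name `xid_delays` and the regular expression are illustrative, and four lines are far too few to prove anything on their own.

```python
import re

LOG = """\
Nov 20 10:21:21 ares kernel: [875066.460137] MAIL_DROP:IN=em1 OUT= SRC=109.162.92.6
Nov 20 10:21:21 ares kernel: [875066.471965] NVRM: Xid (0000:03:00): 13, 0001 00000000 000090c0 00002390 00000000 00000000
Nov 20 10:21:24 ares kernel: [875069.418647] MAIL_DROP:IN=em1 OUT= SRC=109.162.92.6
Nov 20 10:21:25 ares kernel: [875071.061918] NVRM: Xid (0000:03:00): 13, 0001 00000000 000090c0 00001b0c 00000000 00000000
"""

# Capture the kernel uptime stamp and whether the line is a drop or an Xid error.
LINE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] (MAIL_DROP|NVRM: Xid)")


def xid_delays(log_text):
    """For each NVRM Xid line, seconds elapsed since the most recent MAIL_DROP."""
    last_drop = None
    delays = []
    for line in log_text.splitlines():
        match = LINE.search(line)
        if not match:
            continue
        stamp, kind = float(match.group(1)), match.group(2)
        if kind == "MAIL_DROP":
            last_drop = stamp
        elif last_drop is not None:
            delays.append(round(stamp - last_drop, 6))
    return delays


print(xid_delays(LOG))
```

For the excerpt above the two Xid errors follow a dropped packet by roughly 0.012 s and 1.64 s. Running the same function over a full messages file would show whether that pattern holds.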

Perhaps a workaround could be for a CUDA app to handle these errors more gracefully, pause and retry n times if a function fails because of an occasional kernel hiccup?
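
The pause-and-retry idea looks roughly like this, sketched in Python rather than CUDA C just to show the control flow. The names `call_with_retry` and `flaky` are hypothetical, and a real CUDA app may also need to reset or reinitialise device state before a retry has any chance of succeeding.

```python
import time


def call_with_retry(func, retries=3, delay_s=0.5, sleep=time.sleep):
    """Call func(); on RuntimeError pause and try again, up to `retries` retries."""
    attempt = 0
    while True:
        try:
            return func()
        except RuntimeError:
            if attempt >= retries:
                raise  # give up and let the caller see the last error
            attempt += 1
            sleep(delay_s)


# Stand-in for a routine that fails twice and then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


print(call_with_retry(flaky, sleep=lambda s: None))  # ok
```

Capping the number of retries matters: a genuinely broken task should still fail instead of looping forever.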

Meanwhile, I hope relevant maintainers can look into this bizarre behaviour and solve it.
19) Message boards : Number crunching : Suggestions for people having problems connecting to the servers (Message 1307177)
Posted 17 Nov 2012 by Profile Andy Lee Robinson
Post:
Something needs a really good kicking!
Uploads are working, but I haven't been able to report or get any new work for the last couple of days, apart from a couple of freak occasions. I configured a socks proxy, SS5, on my webserver, and that is similarly unreliable.
Curiously boinc on the webserver itself does appear to report results, so I would expect that running the proxy on it would help, but no. Perhaps the scheduler chooses a server based on host id and not its ip address?
20) Message boards : Number crunching : Linux Fedora 17 CUDA pain. Worked on Fedora 16 (Message 1302773)
Posted 6 Nov 2012 by Profile Andy Lee Robinson
Post:
I'm supposed to be a professional linux guru, but this is driving me mental!

A few months ago I managed to get lunatics linux+cuda app running on fc16, and since upgrading to fc17 and boinc-client-7.0.29-1.r25790svn.fc17.x86_64 GPU tasks are aborting with this error about 90% of the time.

Cuda error '(cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, (cudaAcc_NumDataPoints / fftlen) * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost))' in file 'cuda/cudaAcc_summax.cu' in line 239 : unspecified launch failure.

app_info.xml is correct as far as I know.

ldd setiathome_x41g_x86_64-pc-linux-gnu_cuda32 gives this:
linux-vdso.so.1 => (0x00007fff7dfff000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003506e00000)
libcudart.so.3 (0x00007faf91bd6000)
libcufft.so.3 (0x00007faf8fe20000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x000000350d600000)
libm.so.6 => /lib64/libm.so.6 (0x0000003507200000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003508e00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003506a00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003506600000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003507600000)
librt.so.1 => /lib64/librt.so.1 (0x0000003507a00000)

libs are in /var/lib/boinc/projects/setiathome.berkeley.edu
-rwxr-xr-x 1 boinc boinc 313872 Dec 2 2011 libcudart.so.3
-rwxr-xr-x 1 boinc boinc 28317K Dec 2 2011 libcufft.so.3

/dev contains these, launched by nvidia-smi -pm 1 in rc.local
crw-rw-rw- 1 root root 195, 0 Nov 5 09:42 nvidia0
crw-rw-rw- 1 root root 195, 255 Nov 5 09:42 nvidiactl

/etc/ld.so.conf points to where cuda is installed:
/usr/local/cuda/lib
/usr/local/cuda/lib64
and
/usr/lib64/nvidia
/usr/lib/nvidia

echo $PATH
/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/cuda/bin:/root/bin:/usr/local/cuda/bin

In directory /var/lib/boinc/projects/setiathome.berkeley.edu
I copied a failed WU to work_unit.sah and ran
./setiathome_x41g_x86_64-pc-linux-gnu_cuda32 -standalone
and it completed OK.
I'm at a loss to explain why it won't work reliably under boinc-client :(

