High performance Linux clients at SETI

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462	Message 1984435 - Posted: 10 Mar 2019, 20:24:00 UTC I just started getting http internal server errors. Is it just me? Tom A proud member of the OFA (Old Farts Association). ID: 1984435 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 38198 Credit: 261,360,520 RAC: 489	Message 1984440 - Posted: 10 Mar 2019, 20:56:12 UTC It may be just you, Tom, as it's all working fine and quick here. Cheers. ID: 1984440 ·

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5126 Credit: 276,046,078 RAC: 462	Message 1984444 - Posted: 10 Mar 2019, 21:13:13 UTC - in response to Message 1984440. It may be just you, Tom, as it's all working fine and quick here. Cheers. Yes, it was "just me". I dusted off a Linux/Cuda91 HD and Seti didn't want to play. I removed Seti, re-installed Tbars-all-in-One. I started the BOINC Manager up and am now running under a new computer ID. :) Tom A proud member of the OFA (Old Farts Association). ID: 1984444 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874	Message 1984449 - Posted: 10 Mar 2019, 21:25:49 UTC - in response to Message 1984435. I just started getting http internal server errors. Is it just me? Tom You really ought to look at what you send to the server to see what makes it fall over. It's the contents of the file sched_request_setiathome.berkeley.edu.xml which remains unaltered in your top-level BOINC data directory until overwritten by the next request five minutes later. Other people - notably Keith - also complain about http 500 Internal Server Errors. Could something indigestible be being generated by that non-standard BOINC version? ID: 1984449 ·
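For anyone who wants to follow Richard's suggestion, here is a minimal sketch of one way to look at that request file. It only checks that the file parses as XML and counts the top-level elements. It assumes the file is plain, well-formed XML, and the function name `summarize_request` is made up for this illustration; it is not part of BOINC.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_request(xml_text):
    """Return a small summary of a scheduler request, or the parse error.

    Hypothetical helper for illustration only: it does not know what the
    server expects, it just reports whether the text is well-formed XML
    and which top-level elements it contains.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple
        line, column = err.position
        return {"ok": False, "error": str(err), "line": line, "column": column}
    # Count how many times each direct child tag appears under the root
    counts = Counter(child.tag for child in root)
    return {"ok": True, "root": root.tag, "children": dict(counts)}
```

To use it, read sched_request_setiathome.berkeley.edu.xml from your own BOINC data directory (the location differs between installs) and pass the text to `summarize_request`. A parse error, or an unexpectedly large number of some element, would at least be a starting point for a bug report.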

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1984475 - Posted: 11 Mar 2019, 0:30:19 UTC - in response to Message 1984449. I only get those on congested Tuesdays as can be expected. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1984475 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1984476 - Posted: 11 Mar 2019, 0:43:24 UTC - in response to Message 1984449. Could something indigestible be being generated by that non-standard BOINC version? Just curious as to your definition of 'non-standard'. As the person that compiled 7.8.3 I can declare it is probably more 'standard' than any other version of BOINC for Ubuntu you will find. The only change from the client release is the omission of a failed Manager 'Bug Fix' that caused more problems than it allegedly fixed. The BOINC part is untouched from the release. So, without that bad Manager 'Bug Fix', it is more standard than any compile that includes that alleged 'Bug Fix', which itself is a bug. I think I've had one person say they had a display problem on a non-Ubuntu system, and no complaints from those running Ubuntu. There is only one part of the Berkeley download page for Linux that is still current, and that would be the part that claims: "Linux x64 Tested on the current Ubuntu distribution; may work on others." Shame Linux is the only platform for which BOINC will not provide an app. The 7.8.3 version would have been a welcome addition to that page; as it is, those Linux apps at Berkeley should probably be removed, as they haven't worked on any newer versions of Linux in years. BTW, Keith hasn't used 7.8.3 in a very long time. At present he isn't using any version of BOINC from the All-In-One package. The only time I see 'http 500 Internal Server Errors' in Linux is when my other non-Linux machines are giving the same error after an outage. ID: 1984476 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1984479 - Posted: 11 Mar 2019, 1:25:40 UTC Last modified: 11 Mar 2019, 1:27:48 UTC Anybody who reads the forums regularly has probably developed the impression, as I have, that there is an anti-Linux bias among many members. We even have one member using a username professing his bias. So pretty obvious. And Richard, you yourself have told me you know nothing about Linux and that I was left to my own devices to compile the BOINC platform once I uncovered a bug that had been present for years. So is it any wonder, since we have no official Linux support for apps or recent BOINC releases, that we are left to our own devices to support ourselves? And TBar's BOINC 7.8.3 fixed the jumping task-list sorting bug, which had been present for years and never attended to by the BOINC code maintainers. Did he get any recognition of that fact? No. Just ignored as usual, with Linux being the red-headed stepchild. I agree that the old 7.2.42 BOINC repository version should be removed. It causes more issues than it solves. If a new user needs to install a Linux version of BOINC, they should get it from their distro repository, as it will at least be current within a couple of releases of the current code branch. Or approach one of the current Linux users for help and support, as we can actually offer some practical support and knowledge, for which the Windows BOINC support mechanism is useless for now. We certainly are not getting any of the direct support from normal BOINC resources that all Windows and Mac users regularly receive. So the Linux user community is left to support itself. As we traditionally have. But then we see negative posts and comments alluding to our use of Linux. I just try to ignore the obvious bias and soldier on as best I can. 
But it is disheartening to regularly see the anti-Linux bias, since I want to assume all members who contribute their processing power would be welcomed with open arms, and it shouldn't matter what platform they run their computing devices on. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1984479 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1984524 - Posted: 11 Mar 2019, 8:46:44 UTC Last stats said Seti had 92K active users. There are nowhere near 10%, or 9K, Linux users. I would hazard a guess at maybe a couple of hundred active Linux users. Definitely a minority. But what I was commenting on are the digs buried in comments that the Linux users are "cheating the system", "causing undue pressures to the database" or "not using the system as designed", vis-a-vis that they aren't using Windows as the majority of users are. That is the type of bias I read in the posts. The Seti servers run on Scientific Linux, so there must be someone at Seti who knows Linux. I don't have any animosity towards Windows. I started BOINC on Windows, but before BOINC, on Classic, I was an OS/2 Warp user, an even smaller minority. It never got any love either. I never had many issues with Windows 7, but the one machine I converted over to Windows 10 was a large mistake. Lots of issues, and finally I decided I had had enough of Windows and decided to revisit Linux. It just so happened that the Linux special app appeared at the same time and soon proved itself to be the most efficient at processing Seti MB tasks. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1984524 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874	Message 1984526 - Posted: 11 Mar 2019, 8:57:47 UTC Sorry - I seem to have touched a raw nerve here. I haven't got time to answer all those points before some of you leave us for a night's sleep, but I hope to have a full reply posted before you wake up again. Apologies if you felt I disrespected Linux users - that wasn't my intention at all - but it does go to the heart of how we analyse the causes of problems, and what we do with that information once we have it. Later. ID: 1984526 ·

Bernie Vine Volunteer moderator Volunteer tester Send message Joined: 26 May 99 Posts: 9959 Credit: 103,452,613 RAC: 328	Message 1984528 - Posted: 11 Mar 2019, 9:18:25 UTC Last modified: 11 Mar 2019, 9:26:53 UTC I think the problem is twofold. 1) The Linux app is so fast that each machine with a modern GPU cannot last through a normal outage, let alone the sort of problems we had last week. This has caused problem 2. 2) Some Linux special app users have recompiled Boinc so as to report large numbers of GPUs, typically 64 or 48, so as to be able to stockpile work for the outage. What this does is this. If you look at just the top 20 machines, and assume each one has an average of 4 GPUs, after an outage or longer shutdown they would be trying to return 20x4x100=8,000 tasks. Of course they will then request a further 8,000 tasks. However, there are not 80 GPUs in the top 20 machines but 595, so they will be returning and asking for up to 595x100=59,500 tasks. The top twenty machines are asking for 7 times the amount of work you might reasonably expect. Now to me that would appear to contribute to the slowness of the recovery after an outage, and I think it might just annoy or upset others. Please note I have nothing against Linux, I have even started a new Linux cruncher yesterday, however I can see how it may make others question things. ID: 1984528 ·
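The arithmetic in the post above can be checked with a few lines. This is only a sketch of the same sums. The factor of 100 is taken from the post and treated here as a per-GPU task allowance, which is an assumption, and the names are invented for the example.

```python
# Assumed per-GPU task allowance, taken from the "x100" in the post above
TASKS_PER_GPU = 100


def tasks_in_flight(gpu_count, per_gpu=TASKS_PER_GPU):
    """Tasks a set of hosts could return (and then re-request) after an outage."""
    return gpu_count * per_gpu


# Top 20 machines at an assumed average of 4 real GPUs each
expected = tasks_in_flight(20 * 4)

# Top 20 machines as reported: 595 GPUs in total
reported = tasks_in_flight(595)

# How many times larger the reported figure is than the expected one
ratio = reported / expected

print(expected, reported, round(ratio, 2))
```

The ratio works out at a little over 7, which matches the "7 times" quoted in the post.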

-= Vyper =- Volunteer tester Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537	Message 1984532 - Posted: 11 Mar 2019, 10:08:47 UTC - in response to Message 1984528. Last modified: 11 Mar 2019, 10:09:49 UTC Now to me that would appear to contribute to the slowness of the recovery after an outage, and I think it might just annoy or upset others. Please note I have nothing against Linux, I have even started a new Linux cruncher yesterday, however I can see how it may make others question things. This is part of our agenda, which we have provided as extra information when using "buffered versions". ".....Again, to relieve the pressure on the servers, to give everybody else what they need until it's time to start to fill up the buffers again. Why? To use the advantage of the GPU-spoofed Boinc executable to actually benefit the servers instead of waging a DDoS war amongst all other users. Use the benefits of the larger cache to ease the pressure instead, as a polite favour." To put it in another perspective, my hosts don't even try to download after the servers come live. Approx. 12 hours later, give or take a few hours, the first host is allowed to contact the servers to download workunits; then about one hour later the next one gets its allowance, and so on. We don't want to be part of the flood of users wanting to get their share of WUs because their cache is empty or near empty. Some of us have so many fast GPUs that it's necessary to try to get something amongst the other perhaps 140K+ hosts, but for me there's no need to take part and elbow for work. So this special version has allowed us to AID instead of hampering the system. That's a huge difference, isn't it? _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group ID: 1984532 ·

Bernie Vine Volunteer moderator Volunteer tester Send message Joined: 26 May 99 Posts: 9959 Credit: 103,452,613 RAC: 328	Message 1984536 - Posted: 11 Mar 2019, 11:17:30 UTC Last modified: 11 Mar 2019, 11:28:39 UTC As suggested, starting a thread to separate this discussion from the "panic mode thread" Thanks to Richard for suggesting it. I had the same thought while out for my morning walk! ID: 1984536 ·

Bernie Vine Volunteer moderator Volunteer tester Send message Joined: 26 May 99 Posts: 9959 Credit: 103,452,613 RAC: 328	Message 1984538 - Posted: 11 Mar 2019, 11:35:48 UTC ".....Again, to relieve the pressure on the servers, to give everybody else what they need until it's time to start to fill up the buffers again. Why? To use the advantage of the GPU-spoofed Boinc executable to actually benefit the servers instead of waging a DDoS war amongst all other users. Use the benefits of the larger cache to ease the pressure instead, as a polite favour." To put it in another perspective, my hosts don't even try to download after the servers come live. Approx. 12 hours later, give or take a few hours, the first host is allowed to contact the servers to download workunits; then about one hour later the next one gets its allowance, and so on. We don't want to be part of the flood of users wanting to get their share of WUs because their cache is empty or near empty. So you are saying that after any outage, all the users with "spoofed" GPUs wait until the traffic dies down before reporting or asking for new work. If so, then I applaud your community spirit, and it may help to let this fact be better known. ID: 1984538 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874	Message 1984540 - Posted: 11 Mar 2019, 11:49:39 UTC - in response to Message 1984538. I applaud your community spirit, and it may help to let this fact be better known. Hear, hear. I agree on both counts. ID: 1984540 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1984543 - Posted: 11 Mar 2019, 12:40:01 UTC Last modified: 11 Mar 2019, 13:18:24 UTC Adding more info to the thread, not gasoline, please understand. There are only a few of us who use these spoofed builds; all are heavy users with high-performance crunchers, and besides me (who knows almost nothing) the others are well trained in the "boinc mysteries". We all care about the SETI project and the integrity of the DB, and try to do all we can to preserve it. That is one of the main reasons why we keep these builds in a closed circle. The claims about them make little sense: we all used rescheduling before these builds, and the only thing that changes is that we no longer need the rescheduler. Nothing changes in the number of WUs stored in our computers' caches, and the builds bring no crunching speed or performance gain. I know fear of the unknown is natural, but this is not something to be feared. We only changed the program to avoid the use of the rescheduler, not the way BOINC works, so I'm sure it has nothing to do with the http error posted in the thread, or with last week's DB crash as was suggested in another thread. Adding to what Vyper posted, from our team's closed forum: ------------------------------------------------------------ Also pulling 6000 extra tasks from the server right after maintenance/outage is just stupid management in my eyes. Be kind to others, and the servers! I see Vyper shows a scheduled network disable during that time to prevent that -- Good going there! Sorry. My mistake. I incorrectly imagined that anyone who uses these builds is an advanced user and knows about bunkering and how to control the request for new work after the outages, as Vyper explained. The main idea of these builds is exactly to help get through the outages without pain. For us & the servers. 
In my case the host automatically shuts down requests for new work an hour before the outages and returns to asking for new work only after 12 hrs: 7AM to 7PM in my time zone (UTC -5, similar to EST in the US, just without DST).
<day_prefs>
   <day_of_week>2</day_of_week>
   <net_start_hour>19.00</net_start_hour>
   <net_end_hour>7.00</net_end_hour>
</day_prefs>
<edit> If the outage is unscheduled I manage the start/stop of requests for new work manually. But the large cache can keep my host crunching for about 1.5 days even during these unexpected outages. ------------------------------------------------------------ So you can see we are trying to do our best to avoid pushing large amounts of WUs when the outages are happening. If anyone wishes to blame us for anything, feel free to post, but please not because of fear of the unknown. We just found a different path to follow instead of rescheduling. I know a lot of people use the reschedulers (there are a lot of them that work fine), and nobody says anything about that, and the impact on the WU cache is exactly the same. ID: 1984543 ·
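To make the network window in that post easier to read, here is a small sketch that parses the same `<day_prefs>` fragment and answers "is network access allowed at this hour?". It assumes a start hour later than the end hour means the allowed window wraps past midnight, and that day 2 is Tuesday counting from Sunday as 0. Both assumptions, and the function names, are for illustration only and are not taken from the BOINC source.

```python
import xml.etree.ElementTree as ET

# The fragment quoted in the post above
DAY_PREFS = """<day_prefs>
   <day_of_week>2</day_of_week>
   <net_start_hour>19.00</net_start_hour>
   <net_end_hour>7.00</net_end_hour>
</day_prefs>"""


def parse_day_prefs(xml_text):
    """Return (day_of_week, net_start_hour, net_end_hour) from the fragment."""
    root = ET.fromstring(xml_text)
    day = int(root.findtext("day_of_week"))
    start = float(root.findtext("net_start_hour"))
    end = float(root.findtext("net_end_hour"))
    return day, start, end


def network_allowed(hour, start, end):
    """True if `hour` (0-24, local time) falls inside the allowed window.

    Assumed semantics: start == end means no restriction, and start > end
    means the window wraps past midnight (e.g. 19:00 through to 07:00).
    """
    if start == end:
        return True
    if start < end:
        return start <= hour < end
    return hour >= start or hour < end
```

With the values above, `network_allowed(8.0, 19.0, 7.0)` is false and `network_allowed(20.0, 19.0, 7.0)` is true, which matches the "7AM to 7PM" quiet period described in the post.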

Bernie Vine Volunteer moderator Volunteer tester Send message Joined: 26 May 99 Posts: 9959 Credit: 103,452,613 RAC: 328	Message 1984545 - Posted: 11 Mar 2019, 13:45:41 UTC There are only a few of us who use these spoofed builds; all are heavy users with high-performance crunchers, and besides me (who knows almost nothing) the others are well trained in the "boinc mysteries". We all care about the SETI project and the integrity of the DB, and try to do all we can to preserve it. That is one of the main reasons why we keep these builds in a closed circle. Thank you for explaining. This is the very first that I, and I expect others, have heard about the "shutdown" during and after outages. This is to be commended, but I think that when anyone who cares to look can see these 64 GPU machines in the top 20, it might have been a good "PR" exercise to let people know that you weren't "swamping the server". I know a lot of people use the reschedulers (there are a lot of them that work fine), and nobody says anything about that, and the impact on the WU cache is exactly the same. Yes, I should probably have mentioned them in my earlier post; however, they are not as easy to see. After what you have said, I hope they are doing the same thing, displaying the same community spirit and not swamping the servers with thousands of tasks after an outage. ID: 1984545 ·

-= Vyper =- Volunteer tester Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537	Message 1984548 - Posted: 11 Mar 2019, 14:11:24 UTC - in response to Message 1984545. There are only a few of us who use these spoofed builds; all are heavy users with high-performance crunchers, and besides me (who knows almost nothing) the others are well trained in the "boinc mysteries". We all care about the SETI project and the integrity of the DB, and try to do all we can to preserve it. That is one of the main reasons why we keep these builds in a closed circle. Thank you for explaining. This is the very first that I, and I expect others, have heard about the "shutdown" during and after outages. This is to be commended, but I think that when anyone who cares to look can see these 64 GPU machines in the top 20, it might have been a good "PR" exercise to let people know that you weren't "swamping the server". I know a lot of people use the reschedulers (there are a lot of them that work fine), and nobody says anything about that, and the impact on the WU cache is exactly the same. Yes, I should probably have mentioned them in my earlier post; however, they are not as easy to see. After what you have said, I hope they are doing the same thing, displaying the same community spirit and not swamping the servers with thousands of tasks after an outage. I can't say that "everyone" in the team does it 100%. In the guide on how to use the Boinc executable I've written that it's a recommendation to do so, because when we all have tasks to crunch, why should the host take part and "steal" bandwidth from the 140,000 other hosts trying to report/download stuff over a crowded connection? If a particular host takes part in that, we only add a whole bunch of TCP timeouts to no avail anyway, because Seti is totally flooded with demand/report requests. We even have a guide on setting the client to report only 100 results at a time just to prevent these timeouts, because those are only unnecessary requests. 
https://setistats.haveland.com/sah_v8_creation.html If we look at this chart (weekly) we can see that after an extended outage it took almost two whole days for the systems to get back to where they were before. And that was for a real downtime of only about 12 to 16 hours. _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group ID: 1984548 ·
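The idea of reporting only 100 results at a time, mentioned in the post above, amounts to splitting a backlog into fixed-size batches. The sketch below shows that idea in isolation. It is not the team's guide and not how the BOINC client implements reporting, and `report_batches` is a name made up for this example.

```python
def report_batches(task_ids, batch_size=100):
    """Yield successive slices of at most `batch_size` finished tasks.

    Illustration only: a real client would send each batch in its own
    scheduler request and wait for the reply before sending the next.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(task_ids), batch_size):
        yield task_ids[start:start + batch_size]


# Example: a backlog of 250 finished tasks becomes three smaller requests
backlog = list(range(250))
sizes = [len(batch) for batch in report_batches(backlog)]
print(sizes)
```

Smaller requests are less likely to hit a timeout on a congested server, at the cost of needing more round trips to clear the backlog.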

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874	Message 1984567 - Posted: 11 Mar 2019, 16:26:08 UTC OK, since I'm the one who started this hare running, I'd better try and answer some of the points raised overnight. Tom M wrote: I dusted off a Linux/Cuda91 HD and Seti didn't want to play. I removed Seti, re-installed Tbars-all-in-One. That's where I came in. There had been an http 500 internal server error. On a Sunday night, that didn't seem like the normal Tuesday outage recovery problem, and I wondered whether something else might be causing it. Given that the only thing the servers ever see from our computers is that sched_request file, I asked (with a question mark) whether there might be something unusual about it? More for future reference than anything else, since Tom had already re-installed and got past the problem by that stage. TBar wrote: As the person that compiled 7.8.3 ... That's the first time the code version number was quoted, and to be honest it makes me even more suspicious. In my opinion, v7.8.3 is probably one of the poorest code versions to base a special version on. I've nothing against home builds - I build and run them myself - but I try to keep awareness of what is good and bad about each one, and I usually build them to explore an ongoing problem, or to test a potential fix. Some background: BOINC development was funded by the US Government's NSF through a succession of renewed grants. But sometime in late 2014 or early 2015 the grant was not renewed, and in Summer 2015 BOINC started losing core staff. A version 7.6 was released around that time (and several patches/fixes were added later), but development work basically stopped. The BOINC project management was nominally handed over to "The Community", but with no preparatory community development work and no visible structure. A management committee existed, but on paper only - I'm pretty certain it never met. 
Then, in Summer 2017, representatives from some of the major projects - most notably Kevin Reed from World Community Grid - called together a working party to get the show back on the road. Myself and Jord van der Elst (who you'll know as 'Ageless') were invited to sit on that working party to provide perspective from the user/volunteer viewpoint. And round about that time, BOINC v7.8 was released with whatever sporadic improvements had trickled in by then. But it was never thoroughly tested, and many significant improvements which were made almost as soon as development restarted were never incorporated into 7.8. Since then, I've spent many hours in teleconference meetings with BOINC staff, project staff, and volunteer developers, and I've got to know and understand them much better than before. I've even managed to code some user-requested improvements myself. And those improvements are in the C++ codebase which is common across all platforms - I personally test them in Windows, but they will appear on Mac and Linux as well. I was personally asked to act as Release Manager for the 7.10 version - and it was a big eye-opener as to how much work goes into a BOINC release. The codebase is essentially complete before we start, but packaging, documentation, included files and so on all have to be checked. My personal mailing list for that release included Laurence Field of CERN (BOINC lead for Linux), Gianfranco Costamagna (LocutusOfBorg PPA), and Germano Massullo (Fedora package maintainer). So I can assure you that Linux was not ignored in the release process. One of the things that became clear during this process is that many key BOINC people are very knowledgeable about Linux indeed - in fact, I think there's no recognised way of running BOINC *servers* except on Linux. But these people don't run the BOINC client in the same way that we home volunteers would. 
So when people complain about Linux users being ignored by the developers, that really should be read in the context of 'Linux CLIENT users'. And I think much of the feeling of isolation comes, not so much from the BOINC tools, but from the way they're packaged and distributed. Windows users have always had an Installer which presents them with a choice of Service mode or User mode installations. In recent years, User mode installations have become more popular because of the restrictions which Microsoft have placed on GPU drivers. For Linux, GPU drivers are integrated with the kernel (something which keeps catching people out), and so Service mode continues to be viable. That's the way the package maintainers like to work, too, and it's the only reason why the lonely v7.2.42 user-mode script still appears on the download page. The binary files would be the same whichever way they were installed, so in principle that old script could be updated to deliver modern binaries - any takers? Another problem which I think I've identified from the outside is the nature of the growing home Linux user base. Linux users in academic and scientific circles seem to be happy (happiest?) working at the command line, but refugees from Windows 8 and 10 prefer working with a GUI. And there are a lot of them to choose from now... So, perhaps we should think in terms of a new, GUI-based installation interface which allowed a choice of User or Service mode installations on as wide a range of Linux distros as possible. Does that sound achievable, and if so - how? I think most of the other misunderstandings have been - most helpfully - covered by Vyper and Juan's description of how the buffered high-performance clients were designed to be used. But if there are any other remaining issues which require additional work, I think I have the contacts now and I'll be happy to feed them in where I can. ID: 1984567 ·

Sirius B Volunteer tester Send message Joined: 26 Dec 00 Posts: 24930 Credit: 3,081,182 RAC: 7	Message 1984574 - Posted: 11 Mar 2019, 17:12:56 UTC - in response to Message 1984567. Last modified: 11 Mar 2019, 17:14:26 UTC Great post. Another problem which I think I've identified from the outside is the nature of the growing home Linux user base. Linux users in academic and scientific circles seem to be happy (happiest?) working at the command line, but refugees from Windows 8 and 10 prefer working with a GUI. And there are a lot of them to choose from now... So, perhaps we should think in terms of a new, GUI-based installation interface which allowed a choice of User or Service mode installations on as wide a range of Linux distros as possible. Does that sound achievable, and if so - how? I've tried Linux (Mint) & liked it. However, and I have brought this up in the past on these boards, the main issue is the one you've highlighted. Blame MS for the GUI. During the DOS days, one had to use the CLI. The number of times I messed up Autoexec.bat & Config.sys files - I had to learn to check files before saving & replacing current files. The GUI made things so easy. :-) After so many years of that, it won't be easy to return to using the CLI. :-( Edited for spelling. ID: 1984574 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874	Message 1984578 - Posted: 11 Mar 2019, 17:27:59 UTC - in response to Message 1984572. At the bottom of every seti page we see this SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956. Are we now saying that should be amended to "was"? You'll have to ask Eric. https://www.nsf.gov/awardsearch/showAward?AWD_ID=0307956&HistoricalAwards=false still says 'Continuing grant'. But it doesn't say anything about high performance Linux clients. SETI != BOINC. ID: 1984578 ·

©2025 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.