Message boards :
Technical News :
Comedy (Jun 17 2009)
Matt Lebofsky Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
I've been busy. Almost too much to write about, none of it all that interesting in the grand scheme of things, so I'll just stick to recent stuff.

Our main problem lately has been the mysql database. Given the increased number of users, and the lack of astropulse work (which tends to "slow things down"), the result table in the mysql database is under constant heavy attack. Over the course of a week this table gets severely fragmented, resulting in more disk i/o to do the same selects/updates. This has always been a problem, which is why we "compress" the database every Tuesday. However, the increased use means a larger, more fragmented table, and it doesn't fit so easily into memory.

This is really a problem when the splitter comes along every ten minutes and checks to see if there's enough work available to send out (thus asking the question: should I bother generating more work?). This is a simple count on the result table, but if we're in a "bad state" this count, which normally takes a second, could take literally hours and stall all other queries, like the feeder, and therefore nobody can get any work. There are up to six splitters running at any given time, so multiply this problem by six.

We came up with several obvious solutions to this problem, all of which had non-obvious opposite results. Finally we had another thing to try: make a tiny database table which contains these counts, and have a separate program run every so often to do the counts and populate that table. This way, instead of six splitters doing a count query every ten minutes, one program does a single count query every hour (and against the replica database). We made the necessary changes and fired it off yesterday after the outage. Of course it took forever to recover from the outage. When I checked in again at midnight last night I found the splitters finally got the call to generate more work... and were failing on science database inserts.
I figured this was some kind of compile problem, so I fell back to the previous splitter version... but that one was failing reading the raw data files! Then I realized we were in a spate of raw data files that were deemed "questionable," so this wasn't a surprise. I let it go as it was late. As expected, nature took its course and a couple hours later the splitter finally found files it could read and got to work.

That is, until our main /home account server crashed! When that happens, it kills *everything*. Jeff got in early and was already recovering that system before I noticed. He pretty much had it booted up just when I arrived. However, all the systems were hanging on various other systems due to our web of cross-automounts. I had to reboot 5 or 6 of them and do all the cleanup following that. In one lucky case I was able to clean up the automounter maps without having to reboot.

So we're recovering from all that now. Hopefully we can figure out the new splitter problems and get that working as well, or else we'll start hitting those bad mysql periods really soon.

- Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
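Matt's counts-table scheme can be sketched roughly as follows. This is a hypothetical illustration only, using Python's sqlite3 as a stand-in for the project's mysql setup; the table layout, column names, and state values are all invented here, not the real SETI@home schema.

```python
# Sketch of the "tiny counts table" idea: one program runs the expensive
# COUNT periodically and stores the totals where the splitters can read
# them cheaply. All names below are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A stand-in "result" table and the tiny cache table.
cur.execute("CREATE TABLE result (id INTEGER PRIMARY KEY, state INTEGER)")
cur.execute("CREATE TABLE result_counts (state INTEGER PRIMARY KEY, n INTEGER)")
cur.executemany("INSERT INTO result (state) VALUES (?)",
                [(s,) for s in [0, 0, 1, 2, 2, 2]])

def refresh_counts(cur):
    """Run the expensive COUNT once (e.g. hourly, against a replica)
    and repopulate the small cache table."""
    cur.execute("DELETE FROM result_counts")
    cur.execute("INSERT INTO result_counts "
                "SELECT state, COUNT(*) FROM result GROUP BY state")

def results_ready_to_send(cur, state=2):
    """What a splitter would do instead of a full count on the big
    result table: one cheap primary-key lookup."""
    row = cur.execute("SELECT n FROM result_counts WHERE state = ?",
                      (state,)).fetchone()
    return row[0] if row else 0

refresh_counts(cur)
print(results_ready_to_send(cur))  # 3 rows in state 2
```

With this split, the six splitters never touch the big table for their "should I bother generating more work?" check; only the single refresher does, and only every hour.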
Johnney Guinness Send message Joined: 11 Sep 06 Posts: 3093 Credit: 2,652,287 RAC: 0 |
Matt, It sounds like the "ER" in there every day. It's surgery on the MySQL databases every few hours! John. |
B-Man Send message Joined: 11 Feb 01 Posts: 253 Credit: 147,366 RAC: 0 |
It looks like you are like the classic Little Dutch boy at the dike and are running out of fingers to stick in the dike. Good luck fixing all the issues. |
Vid Vidmar* Send message Joined: 19 Aug 99 Posts: 136 Credit: 1,830,317 RAC: 0 |
... We came up with several obvious solutions to this problem, all of which had non-obvious opposite results. Finally we had another thing to try, which was to make a tiny database table which contains these counts, and have a separate program that runs every so often do these counts and populate the proper table... Hey. Thanks for updating us. Must have been hellish booting all those 'puters and making them work. But, what really interests me, why did you choose to use a separate program instead of triggers? BR, |
ML1 Send message Joined: 25 Nov 01 Posts: 21207 Credit: 7,508,002 RAC: 20 |
I've been busy...

Sounds like quite a maelstrom!

Our main problem lately has been the mysql database. Given the increased number of users, and the lack of astropulse work (which tends to "slow things down")...

As in "slows down the flood of results from the users", and so lets the servers speed up! :-)

... hanging on various other systems due to our web of cross-automounts. I had to reboot 5 or 6 of them and do all the cleanup following that. ...

Any possibility of arranging the mounts in a tree structure rather than a random mapping?... Or rearranging the data locations to avoid them in the first place?... (I'm sure you're on to that already if at all possible. I've played with multiple network mounts. Very useful on guaranteed dedicated bandwidth or as a temporary fix. Otherwise, it is a nightmare just waiting to happen...)

or else we'll start hitting those bad mysql periods really soon.

The central database is a recurring theme... Can the database itself be split up across multiple machines to ease the central bottleneck? Could some operations, instead of updating the database in real time, be run offline and batched through more efficiently?

Good luck, Martin

See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
ra]in-man Send message Joined: 25 Aug 06 Posts: 2 Credit: 503,787 RAC: 0 |
So I'm assuming this is why my completed SETI tasks are having issues uploading, and getting new jobs is a pain. |
DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
Matt, I have a constructive idea. Instead of deleting results every 24 hours, could you just schedule this operation late Monday night/Tue morning (weekly) before the Tuesday outage? That would avoid the fragmented table extents and indexes during the week. Before the outage, the validated results at least 24 hours old would be purged. During the outage, you can do your compress as normal. The only downside is that this method will consume more disk space on the database server. But the performance should increase. It might also help the database if the compress operation you do has the option to "reuse storage"; Oracle has this feature. This keeps the extents allocated for the table and index empty but locked for its table. That way, when more inserts happen, there is no need for the DB system to "find free space". You may have to drop index(es) before this delete operation and re-create them afterward for acceptable performance. Please point the DBA (Jeff?) to this post. He could tell in about 10 seconds if either or both of these strategies will work or not. I hope it helps. |
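DJStarfox's weekly pre-outage purge could look roughly like the sketch below. This is a hypothetical illustration with sqlite3 standing in for the real database; the schema, the index name, and the 24-hour cutoff handling are all assumptions, not the project's actual db_purge.

```python
# Sketch of the suggested weekly purge: drop the index, bulk-delete
# validated results at least 24 hours old, then rebuild the index
# (dropping/recreating indexes around a big delete is the part the
# post says may be needed for acceptable performance).
import sqlite3
import time

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE result "
            "(id INTEGER PRIMARY KEY, validated INTEGER, received REAL)")
cur.execute("CREATE INDEX idx_result_received ON result (received)")

now = time.time()
day = 86400.0
# Three old validated rows, one fresh one, one old-but-unvalidated one.
rows = [(1, now - 3 * day), (1, now - 2 * day), (1, now - 2 * day),
        (1, now - 0.5 * day), (0, now - 5 * day)]
cur.executemany("INSERT INTO result (validated, received) VALUES (?, ?)", rows)

def weekly_purge(cur, now):
    """Run once, late Monday night, before the Tuesday compress."""
    cur.execute("DROP INDEX idx_result_received")
    cur.execute("DELETE FROM result WHERE validated = 1 AND received < ?",
                (now - day,))
    cur.execute("CREATE INDEX idx_result_received ON result (received)")

weekly_purge(cur, now)
remaining = cur.execute("SELECT COUNT(*) FROM result").fetchone()[0]
print(remaining)  # the fresh row and the unvalidated one survive
```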
perryjay Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
Matt, you should see this thread http://setiathome.berkeley.edu/forum_thread.php?id=54204 A lot of people are getting validate errors. (myself included) PROUD MEMBER OF Team Starfire World BOINC |
Berserker Send message Joined: 2 Jun 99 Posts: 105 Credit: 5,440,087 RAC: 0 |
The only downside is that this method will consume more disk space on the database server. But the performance should increase. Far from the only downside I'm afraid. You just replaced one problem (fragmentation) with another (seven times more records). The whole point of db_purge is to stop the tables getting so big that they won't fit in memory. The optimal solution is to design the query such that it does not need to traverse all rows - an inherently expensive operation - but I think we can safely assume that isn't possible or it would have been done already. The next best is to reduce the number of times the query is used, for example by caching the result, which is what was attempted. After that comes keeping the table in memory to avoid using slow disks, which is what db_purge is supposed to do (but isn't doing very well right now). Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking. |
Keith Jones Send message Joined: 25 Oct 00 Posts: 17 Credit: 5,266,279 RAC: 0 |
Hi Matt, Sorry to hear about all the difficulties. Thanks for all the work. Hang in there! I wasn't sure what version of MySQL you're currently using (or even planning to use in the future), so at the risk of teaching you to suck eggs I thought I'd better mention that MySQL began shipping Cluster version 7.0 in April. It supports clustering, in-memory tables, load balancing, fault tolerance, etc. If you haven't upgraded to it, it might be worth a thought so you can spread the load around as per ML1's suggestion. Best wishes, Keith |
DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
The only downside is that this method will consume more disk space on the database server. But the performance should increase.

What query are you talking about that must traverse all rows? That's a bad query in any context. The problem I was addressing relates to minimizing disk I/O. The table is obviously too big to fit completely into memory anyway (remember to include indexes). So, reducing I/O should also reduce the amount of steady-state memory usage for that table. Having contiguous records in an index also helps when searching, because the search cost goes from log(n)+(total_records / deleted_records) to log(n). By improving the index seek time, queries will be shorter on average. By optimizing the most common case, we're making the most impact on performance. There are several other things they could do, including lowering the value of query_cache_min_res_unit to the data size of one row from the result table (the default is 4K). But without actually being there to work on this and see the performance metrics (e.g., Qcache_lowmem_prunes, etc.), I can only make thoughtful, constructive suggestions and research them before saying anything. It's up to Matt, Jeff, and the rest of them to find the time for discussion, decision, and implementation. Performance vs storage space has always been a hallmark trade-off in computer science applications, but here I am simply presenting an alternative to the status quo that, to the best of my professional experience, will accomplish what I said it would. |
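The query_cache_min_res_unit suggestion would be a my.cnf change. A sketch is below; query_cache_min_res_unit and query_cache_size are real MySQL 5.x query-cache variables, but the values shown are purely illustrative guesses, since the thread doesn't show the server's actual configuration or row sizes.

```ini
# my.cnf sketch -- illustrative values only, not a recommendation
[mysqld]
query_cache_size         = 256M
# Default allocation unit is 4K; a smaller unit wastes less cache
# space if cached results (single result-table rows) are tiny.
query_cache_min_res_unit = 512
```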
ra]in-man Send message Joined: 25 Aug 06 Posts: 2 Credit: 503,787 RAC: 0 |
Sorry if this is the wrong section to post... but I am unable to upload completed seti@home work:

6/17/2009 8:47:05 PM SETI@home Started upload of 14mr09ac.1821.2526.11.8.232_2_0
6/17/2009 8:47:22 PM Project communication failed: attempting access to reference site
6/17/2009 8:47:22 PM SETI@home Temporarily failed upload of 01mr09ae.31974.7025.7.8.170_3_0: connect() failed
6/17/2009 8:47:22 PM SETI@home Backing off 2 hr 14 min 37 sec on upload of 01mr09ae.31974.7025.7.8.170_3_0
6/17/2009 8:47:22 PM SETI@home Started upload of 10mr09ac.6383.762157.5.8.217_0_0
6/17/2009 8:47:23 PM Internet access OK - project servers may be temporarily down.
6/17/2009 8:47:27 PM Project communication failed: attempting access to reference site
6/17/2009 8:47:27 PM SETI@home Temporarily failed upload of 14mr09ac.1821.2526.11.8.232_2_0: connect() failed
6/17/2009 8:47:27 PM SETI@home Backing off 22 min 24 sec on upload of 14mr09ac.1821.2526.11.8.232_2_0
6/17/2009 8:47:27 PM SETI@home Started upload of 12mr09ac.9323.17250.5.8.227_1_0
6/17/2009 8:47:28 PM Internet access OK - project servers may be temporarily down.
6/17/2009 8:47:44 PM Project communication failed: attempting access to reference site
6/17/2009 8:47:44 PM SETI@home Temporarily failed upload of 10mr09ac.6383.762157.5.8.217_0_0: connect() failed
6/17/2009 8:47:44 PM SETI@home Backing off 1 hr 44 min 21 sec on upload of 10mr09ac.6383.762157.5.8.217_0_0
6/17/2009 8:47:45 PM Internet access OK - project servers may be temporarily down.
6/17/2009 8:47:49 PM Project communication failed: attempting access to reference site
6/17/2009 8:47:49 PM SETI@home Temporarily failed upload of 12mr09ac.9323.17250.5.8.227_1_0: connect() failed
6/17/2009 8:47:49 PM SETI@home Backing off 12 min 44 sec on upload of 12mr09ac.9323.17250.5.8.227_1_0
6/17/2009 8:47:50 PM Internet access OK - project servers may be temporarily down.
6/17/2009 8:49:34 PM SETI@home Started upload of 04mr09ac.12347.481.6.8.24_0_0
6/17/2009 8:49:55 PM SETI@home work fetch suspended by user
6/17/2009 8:49:55 PM Project communication failed: attempting access to reference site
6/17/2009 8:49:55 PM SETI@home Temporarily failed upload of 04mr09ac.12347.481.6.8.24_0_0: connect() failed
6/17/2009 8:49:55 PM SETI@home Backing off 3 min 44 sec on upload of 04mr09ac.12347.481.6.8.24_0_0
6/17/2009 8:49:56 PM Internet access OK - project servers may be temporarily down.
6/17/2009 8:53:39 PM SETI@home Started upload of 04mr09ac.12347.481.6.8.24_0_0
6/17/2009 8:54:01 PM Project communication failed: attempting access to reference site
6/17/2009 8:54:01 PM SETI@home Temporarily failed upload of 04mr09ac.12347.481.6.8.24_0_0: connect() failed
6/17/2009 8:54:01 PM SETI@home Backing off 10 min 3 sec on upload of 04mr09ac.12347.481.6.8.24_0_0 |
Andy Lee Robinson Send message Joined: 8 Dec 05 Posts: 630 Credit: 59,973,836 RAC: 0 |
Performance vs storage space has always been a hallmark trade-off in computer science applications, but here I am simply presenting an alternative to status quo, that to the best of my professional experience, will accomplish what I said it would.

Well, to avoid hitting the db, I'd just add another line to the code that adds +n or -n to a persistent key in memcache after each sql update. It would only use a few bytes instead of causing such grief. The memcache key could be reset daily or weekly with absolute values from one or more count(*) queries, and then just record deltas as the db is updated. The values would be available immediately to all machines aware of the memcache pool, without hitting the db at all. It could also be used for real-time stats on the public website with no hit on the db either. yum install memcached See http://blogs.vinuthomas.com/2006/02/06/memcached-with-php/ |
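Andy's delta-counter scheme can be sketched as below. To keep the example self-contained, a trivial in-process class stands in for memcache; a real deployment would use a memcached client against the shared pool, and the key name and numbers here are invented.

```python
# Sketch of the memcache delta-counter idea: reset the counter from an
# authoritative COUNT(*) occasionally, then record +n/-n deltas alongside
# each db update instead of re-counting the big table.

class FakeMemcache:
    """Minimal stand-in exposing the set/get/incr/decr operations the
    scheme relies on (real memcached supports these natively)."""
    def __init__(self):
        self.store = {}
    def set(self, key, value):
        self.store[key] = value
    def get(self, key):
        return self.store.get(key)
    def incr(self, key, delta=1):
        self.store[key] = self.store.get(key, 0) + delta
        return self.store[key]
    def decr(self, key, delta=1):
        self.store[key] = self.store.get(key, 0) - delta
        return self.store[key]

mc = FakeMemcache()

# Reset from an absolute count(*) once a day or week...
mc.set("results_ready", 412)

# ...then just record deltas as the db is updated:
mc.incr("results_ready", 10)   # a splitter inserted 10 new results
mc.decr("results_ready", 3)    # the scheduler sent 3 out

print(mc.get("results_ready"))  # 419
```

Every machine aware of the memcache pool sees the current value immediately, and the count query against the database disappears entirely outside the occasional reset.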
Norwich Gadfly Send message Joined: 29 Dec 08 Posts: 100 Credit: 488,414 RAC: 0 |
Yes, that reminds me of the dark ages (well, the 1970s actually), when updates were glorified merges of transactions with the previous master to create the new generation of the master. Master files were copied to indexed-sequential when random access was required. |
Dirk Sadowski Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
Thanks Matt for the (daily) update! *thumb up* Could someone guess/say when SETI@home (and -Beta Test) will be running well again? My GPU cruncher is offline and I miss the noise of the fan.. ;-) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Matt, 1) I'm sure you've noticed, but just in case you run straight to the database when you get in - you've got another problem: no uploads since yesterday. That can't be a database problem, of course, and it's not a bandwidth problem either (been below 40 Mbit/s all day). So I presume it's one of those pesky mounts, or an underlying file system problem on the upload data store. 2) I hadn't realised quite how successful you'd been in recruiting active users recently: If they are going to enjoy a satisfactory user experience (and not take their spare cycles away again), you've GOT to get that database under control: and if it's grown too big to fit in memory, I don't see how that's going to be possible. Instead of trying to get it to work at the present (bloated) size, have you thought about trying to reduce it to a manageable size? Size will be proportional to (active_users)*(tasks_per_user): you want to maximize the first term, so that would mean reducing the second. There is an unfortunate tendency among crunchers to increase cache sizes to the maximum at the first sign of trouble. That solves the problem from their own personal perspective, but I think too few of them (us?) stop to think about the strain they impose on the project they purport to support. I saw an interesting FAQ at GPUgrid this morning, while looking for an alternative CUDA project: they "... give an additional 25% [credit] for WUs returned within two days. This is useful for us to reduce latency of the results ..." - in other words, use a bit of economic social engineering instead of brute database force to solve the problem. Look at host 4947578: eight cores, 5 GPUs, and a turnround of 19.4 days. How much does that add to your query load? |
Dirk Sadowski Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
@ ra]in-man Have a look in the NC forum: http://setiathome.berkeley.edu/forum_forum.php?id=10 You are not alone with this prob.. |
Dirk Sadowski Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
... AFAIK, for example, if you switch off the PC before the claimed credits are granted, this will increase this value. BTW, normally it should be 6 GPUs.. 3 x GTX295.. maybe the user has probs enabling the last GPU..? |
Virtual Boss* Send message Joined: 4 May 08 Posts: 417 Credit: 6,440,287 RAC: 0 |
This is really a problem when the splitter comes along every ten minutes and checks to see if there's enough work available to send out (thus asking the question: should I bother generating more work). This is a simple count on the result table, but if we're in a "bad state" this count which normally takes a second could take literally hours, and stall all other queries, like the feeder, and therefore nobody can get any work. There are up to six splitters running at any given time, so multiply this problem by six.

Is it possible to have different settings for each of the splitters? If you could set individual timing and turn-on/turn-off thresholds on queue size...

For example -
1 splitter - every 10 mins, turnon @ <90% full, turnoff @ >100% full.
2 splitters - every 30 minutes, turnon @ <80% full, turnoff @ >95% full.
2 splitters - every 60 minutes, turnon @ <70% full, turnoff @ >85% full.
Rest of splitters - every 120 minutes, turnon @ <60% full, turnoff @ >75% full.

This would reduce the query load considerably while maintaining a good supply of WUs ready to send.

PS: The values I suggested were only guesswork on my part, and would probably have to be adjusted for optimum performance. |
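The tiered turn-on/turn-off idea amounts to per-group hysteresis on the queue fill level. A minimal sketch, using the guesswork thresholds from the post (nothing here reflects the real splitter code):

```python
# Sketch of tiered splitter control with hysteresis: each tier switches
# on below its turn_on fill level, off above its turn_off level, and
# otherwise keeps its current state. Thresholds are the post's guesses.

class SplitterTier:
    def __init__(self, count, interval_min, turn_on, turn_off):
        self.count = count            # splitters in this tier
        self.interval = interval_min  # minutes between queue checks
        self.turn_on = turn_on        # start splitting below this fill
        self.turn_off = turn_off      # stop splitting above this fill
        self.running = False

    def check(self, fill):
        """Hysteresis: the gap between turn_on and turn_off stops a
        tier from flapping on and off around a single threshold."""
        if fill < self.turn_on:
            self.running = True
        elif fill > self.turn_off:
            self.running = False
        return self.running

tiers = [
    SplitterTier(1, 10,  0.90, 1.00),
    SplitterTier(2, 30,  0.80, 0.95),
    SplitterTier(2, 60,  0.70, 0.85),
    SplitterTier(1, 120, 0.60, 0.75),
]

def active_splitters(fill):
    """How many splitters are running at a given queue fill level."""
    return sum(t.count for t in tiers if t.check(fill))

print(active_splitters(0.65))  # 5 -- the three lower-threshold tiers run
print(active_splitters(0.97))  # 1 -- only the first tier is still on
```

Only the first tier ever polls every ten minutes; the others check far less often, which is where the reduction in count-query load would come from.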
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
If they are going to enjoy a satisfactory user experience (and not take their spare cycles away again), you've GOT to get that database under control: and if it's grown too big to fit in memory, I don't see how that's going to be possible. Hmmm.... Admittedly, I learned about databases when 64k was a lot of memory (and it was core, not RAM) but you sure couldn't keep the whole database in memory then, and I'm not sure why it "must" fit now. I don't claim to be a MySQL expert by any means, I just assume that it's competent. You have to measure to work this out, but scanning the whole database is likely rarely needed. There is probably an active region that'd be nice to have cached. But the whole database? Seems unlikely, even in this day of much RAM. I can definitely see that some operations (like getting a count) would be expensive -- and would tend to push the active records from the cache, and it sounds like they've got a start on preventing those. It does sound like MySQL does not recover used space well on the scale that SETI adds and removes records. That's a (bigger) flaw. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.