Emergence (Aug 13 2013)

Author	Message
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 1402757 - Posted: 13 Aug 2013, 21:07:53 UTC Hello again! Once again I'm emerging from a span of time where I was either out of the lab or in the lab working on non-newsworthy development, and realizing it's been way too long since I drummed up one of these reports. We had our usual Tuesday outage again today. Same old, same old. However, last week we had some scary, unexpected server crashes. First oscar (our main mysql server) crashed, and then a couple hours after that so did carolyn (the replica). Neither crashed so much as the kernels got into some sort of deadlock and couldn't be unwedged - in both cases we got the people down at the colocation facility to reboot the machines for us and all was well. Except the replica database needed to be resync'ed. I did so rather quickly, though the project had been up for a while and thus was not at a safe, clean break point. I thought all was well until after coming out of today's outage, when the replica hit a point of confusion in its logs. I guess I need that clean break point - I'm resync'ing again now and will do so again more safely next week. No big deal - this isn't hurting normal operations in the least. Though largely we are under normal operating conditions, there are other behind-the-scenes activities going on - news to come when the time is right. One thing I can mention is that we're closer and closer to deciding that getting our science database entirely on solid state drives is going to be unavoidable if we are to ever analyze all this data. We just keep hitting disk i/o bottlenecks no matter what we try to speed things up. Any other thoughts and questions? Am I missing anything? Yes, I know about the splitters getting stuck on some files... - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 1402757 ·

arkayn Volunteer tester Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0	Message 1402776 - Posted: 13 Aug 2013, 21:33:12 UTC Thanks Matt. ID: 1402776 ·

Claggy Volunteer tester Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1402796 - Posted: 13 Aug 2013, 22:20:16 UTC - in response to Message 1402757. Thanks for all that Matt, Claggy ID: 1402796 ·

mr.mac52 Send message Joined: 18 Mar 03 Posts: 67 Credit: 245,882,461 RAC: 0	Message 1402831 - Posted: 13 Aug 2013, 23:29:02 UTC - in response to Message 1402757. Thanks Matt, nice to hear from you and some of what is going on down home. John ID: 1402831 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 1402897 - Posted: 14 Aug 2013, 4:21:16 UTC 1) SSD is a good way to go for maximum I/O (even cheap SSDs advertise over 50,000 IOPS). My only concern would be the number of writes they receive could end up shortening their lives by a bit, even though it is something like 100,000 writes to any one sector. Databases do a lot of writing, so it's not too far-fetched. You would want to go with the more durable enterprise-level SSDs, and make sure you do some sort of RAID setup on them so that if one goes bad, you can swap in a spare and keep going, and send the bad one off to be RMA'ed if the warranty period is still valid. 2) This one is a low-priority, but it involves the database, so anything that can help keep it tidy is worth looking into: When we get stuck WUs where _0 missed the deadline and it got sent out to _2, and then _0 reports before _2 does, _0 and _1 validate, and _2 gets left in "awaiting validation" forever (one of many examples). For this situation, are the uploaded result files for each completed task still on-disk and available for the validator to use them to cross-check? I heard one time that once two results are compared, a canonical is chosen and the other one is "discarded." Does discarded mean deleted immediately, or what? It just seems that once a canonical is chosen, anything that gets reported after that never gets validated. Either there is nothing to validate against, or there is just some logic tweaking needed somewhere. I know it is a daunting task to break it down and look at every line of code, but if we can try to narrow it down to a small portion/section, it would surely make it much easier to fix it. Either the validator doesn't check to see if all of the results have reported, or it checks the first two that it can, chooses a canonical, discards the other, and when the third comes along, it is expecting to be comparing against the other two, but only finds one (the canonical) and then just stops. Getting to the bottom of this issue (and I imagine there are probably tens of thousands overall) can clean the database up quite a bit, and every little bit helps. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 1402897 ·
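
The endurance concern in point 1 can be put into rough numbers. The sketch below is a back-of-the-envelope estimate, not a vendor formula: it assumes ideal wear leveling, and the function name `ssd_lifetime_years`, the drive capacity, the daily write volume, and the write-amplification factor are all made-up illustration values. Only the 100,000-cycle figure comes from the post; many drives are rated for far fewer cycles, so the second call uses a lower, equally hypothetical rating.

```python
def ssd_lifetime_years(capacity_gb, pe_cycles, host_writes_gb_per_day,
                       write_amplification=1.0):
    """Rough years until the flash wears out, assuming ideal wear leveling."""
    if host_writes_gb_per_day <= 0:
        raise ValueError("daily write volume must be positive")
    # Total data the flash can absorb before every cell hits its cycle limit.
    flash_budget_gb = capacity_gb * pe_cycles
    # Each host write costs more than one flash write (write amplification).
    host_budget_gb = flash_budget_gb / write_amplification
    return host_budget_gb / host_writes_gb_per_day / 365


# Hypothetical drive: 500 GB, 2,000 GB written per day, amplification of 3.
optimistic = ssd_lifetime_years(500, 100_000, 2_000, write_amplification=3.0)
cautious = ssd_lifetime_years(500, 3_000, 2_000, write_amplification=3.0)
print(f"100,000 cycles: {optimistic:.1f} years")
print(f"  3,000 cycles: {cautious:.1f} years")
```

The estimate is dominated by the cycle rating and the daily write volume, which is why the advice above about enterprise-grade drives and a RAID setup with spares matters for a write-heavy database.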

Thomas Volunteer tester Send message Joined: 9 Dec 11 Posts: 1499 Credit: 1,345,576 RAC: 0	Message 1402994 - Posted: 14 Aug 2013, 9:14:40 UTC Thanks for the heads-up Matt and for the news about the outage of last week. ID: 1402994 ·

David S Volunteer tester Send message Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12	Message 1403073 - Posted: 14 Aug 2013, 13:43:49 UTC - in response to Message 1402897. [quote]2) This one is a low-priority, but it involves the database, so anything that can help keep it tidy is worth looking into: When we get stuck WUs where _0 missed the deadline and it got sent out to _2, and then _0 reports before _2 does, _0 and _1 validate, and _2 gets left in "awaiting validation" forever (one of many examples). For this situation, are the uploaded result files for each completed task still on-disk and available for the validator to use them to cross-check? I heard one time that once two results are compared, a canonical is chosen and the other one is "discarded." Does discarded mean deleted immediately, or what? It just seems that once a canonical is chosen, anything that gets reported after that never gets validated. Either there is nothing to validate against, or there is just some logic tweaking needed somewhere. I know it is a daunting task to break it down and look at every line of code, but if we can try to narrow it down to a small portion/section, it would surely make it much easier to fix it. Either the validator doesn't check to see if all of the results have reported, or it checks the first two that it can, chooses a canonical, discards the other, and when the third comes along, it is expecting to be comparing against the other two, but only finds one (the canonical) and then just stops. Getting to the bottom of this issue (and I imagine there are probably tens of thousands overall) can clean the database up quite a bit, and every little bit helps.[/quote] In a similar vein, we are now also finding WUs where one or both of the users reported some error or other but they've been validated anyway, only they're not being purged. I'm wondering first of all how these can be valid science; I have one where I returned a -9 and my wingmate returned a "CUFFT in line 62" error. Second, why aren't they purging? Is it because each of them has a third user who timed out? One other observation: I notice that the nitpickers are showing as Running. Is this cause for celebration? Even a little bit? David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. ID: 1403073 ·

David S Volunteer tester Send message Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12	Message 1403074 - Posted: 14 Aug 2013, 13:45:27 UTC - in response to Message 1402994. [quote]Thanks for the heads-up Matt and for the news about the outage of last week.[/quote] Oh, I meant to say that too. David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. ID: 1403074 ·

ML1 Volunteer moderator Volunteer tester Send message Joined: 25 Nov 01 Posts: 20635 Credit: 7,508,002 RAC: 20	Message 1403240 - Posted: 14 Aug 2013, 22:36:05 UTC - in response to Message 1402757. [quote]... One thing I can mention is that we're closer and closer to deciding that getting our science database entirely on solid state drives is going to be unavoidable if we are to ever analyze all this data. We just keep hitting disk i/o bottlenecks no matter what we try to speed things up...[/quote] Have these people sponsor you to try out their new kit? "Skyera unveils rival-crushing 21PB-a-rack flash monster" That should be a win-win. You get rapid access to your largest database of the universe in the universe. They get galactic scale Marketing hype for their hardware... Yes? ;-) Keep searchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) ID: 1403240 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 35583 Credit: 261,360,520 RAC: 489	Message 1403253 - Posted: 14 Aug 2013, 23:36:13 UTC [quote]One other observation: I notice that the nitpickers are showing as Running. Is this cause for celebration? Even a little bit?[/quote] Your answer to this was here, http://setiathome.berkeley.edu/forum_thread.php?id=72057&postid=1382867, in Matt's last report. ;-) Cheers. ID: 1403253 ·

W-K 666 Volunteer tester Send message Joined: 18 May 99 Posts: 19227 Credit: 40,757,560 RAC: 67	Message 1403280 - Posted: 15 Aug 2013, 2:36:04 UTC Last modified: 15 Aug 2013, 2:37:01 UTC Thanks for the update Matt. Just in case nobody there has noticed, "tape" 20jn12ac 50.20 GB (13) (done) has been in this state for a few days at least. ID: 1403280 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14661 Credit: 200,643,578 RAC: 874	Message 1403352 - Posted: 15 Aug 2013, 8:04:26 UTC - in response to Message 1403280. [quote]Thanks for the update Matt. Just in case nobody there has noticed, "tape" 20jn12ac 50.20 GB (13) (done) has been in this state for a few days at least.[/quote] Yes, I think they know about that one - they manually stopped the splitters wasting time on it. Matt said "files" (plural) - I think the other problematic one is 23mr08aa, which has never moved beyond the first channel, but is still tying up a splitter. ID: 1403352 ·

Speedy Volunteer tester Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89	Message 1404154 - Posted: 17 Aug 2013, 5:41:46 UTC Thanks for the update Matt, it is very much appreciated. I have a couple of questions: When you transfer data to be split to the servers, how much do you send? On roughly the 14th of August there was some work sent. I work this out to be 18 gig a minute; I was working with the figure that it was 300 MB a second. I think the transfer lasted around 16 hours, so that works out to be 1080 gig per hour, which would work out to be 17,280 gig for the 16 hours, if my mathematics is correct. My 2nd question is: what speed are the disks/media that you are transferring them onto? Lastly, how much total storage for work to be split is there? I must say 300 MB per second average is very impressive for data to be transferred to media. I have to agree with Cosmic_Ocean regarding the number of writes the SSDs will receive. ID: 1404154 ·
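
The arithmetic in the post above can be checked in a few lines. This is only a sketch under the post's own assumption of a steady 300 MB per second and decimal units (1 GB = 1000 MB); the function name `transferred_gb` is made up for illustration.

```python
def transferred_gb(rate_mb_per_s, seconds):
    """Data moved at a steady rate, in decimal GB (1 GB = 1000 MB)."""
    return rate_mb_per_s * seconds / 1000


RATE_MB_PER_S = 300  # figure assumed in the post
per_minute = transferred_gb(RATE_MB_PER_S, 60)
per_hour = transferred_gb(RATE_MB_PER_S, 60 * 60)
sixteen_hours = transferred_gb(RATE_MB_PER_S, 16 * 60 * 60)

print(per_minute)      # 18.0
print(per_hour)        # 1080.0
print(sixteen_hours)   # 17280.0
```

So the figures hold if the rate really was 300 megabytes per second. If the rate was actually in megabits per second, every number drops by a factor of eight.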

tullio Volunteer tester Send message Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1	Message 1404744 - Posted: 18 Aug 2013, 16:19:16 UTC I am starting to use SSDs. A 120 GB disk on my laptop reaches about 200 MB/s in sequential read, measured by the "hdparm -t -T" Linux command. A 250 GB disk reaches up to 380 MB/s on the same command. Tullio ID: 1404744 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 1404913 - Posted: 19 Aug 2013, 2:59:32 UTC - in response to Message 1404154. [quote]I must say 300 MB per second average is very impressive for data to be transferred to media.[/quote] 300MB/sec isn't really all that unrealistic. Even with 7200RPM mechanical disks, if you put 3-4 in a RAID array so that you are writing to at least 3-4 disks at any given time, you can achieve that number. My 4x500GB RAID-5 array is using WD RE2 drives from...2007/8, and it gets 190MB/sec write and 225MB/sec read. A single spare 500GB RE2 does about 80/105. I think it is due to my 16KB stripe size. When I can find somewhere to dump 1400GB of data to, I'm going to rebuild the array with a 64KB stripe (actually, I think the newer firmware for my Areca card now supports 128KB). Friend of mine built a 4x1TB RAID5 with the same Areca card two years ago using Seagate disks. He went with 64KB stripe and was getting 270/350. And I've seen some reviews for the WD 4TB Black Edition drives, and they have some really awesome transfer speeds just as a single disk.. high 100's. However, gigabit has an actual limit of ~128MiB/sec. 300Mbit/sec isn't really all that amazing.. only about 36MiB/sec, which some USB-connected devices can do. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 1404913 ·
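
The bit-to-byte conversions at the end of the post above can be checked with a short sketch. It covers raw line-rate arithmetic only and ignores protocol overhead; the helper name `mbit_to_mib_per_s` is made up for illustration. With decimal prefixes, 300 Mbit/s lands near the ~36 MiB/s quoted, while 1 Gbit/s comes out nearer 119 MiB/s; the ~128 figure corresponds to counting a gigabit as 2^30 bits.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def mbit_to_mib_per_s(mbit_per_s):
    """Convert decimal megabits per second to mebibytes per second."""
    bytes_per_s = mbit_per_s * 1_000_000 / 8
    return bytes_per_s / MIB


print(f"{mbit_to_mib_per_s(300):.1f}")    # 35.8
print(f"{mbit_to_mib_per_s(1000):.1f}")   # 119.2
print((2**30 / 8) / MIB)                  # 128.0
```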

Shannock9 Send message Joined: 4 Jul 99 Posts: 1396 Credit: 634,964 RAC: 0	Message 1405672 - Posted: 20 Aug 2013, 22:16:01 UTC Last modified: 20 Aug 2013, 22:22:09 UTC [quote]I must say 300 MB per second average is very impressive for data to be transferred to media.[/quote] "Never underestimate the bandwidth of a station wagon full of disk packs trundling down the freeway." - my boss, circa 1968 ID: 1405672 ·

Speedy Volunteer tester Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89	Message 1405867 - Posted: 21 Aug 2013, 8:22:04 UTC - in response to Message 1404913. [quote]However, gigabit has an actual limit of ~128MiB/sec. 300Mbit/sec isn't really all that amazing.. only about 36MiB/sec, which some USB-connected devices can do.[/quote] 300 Mb a second is not that fast in terms of transferring from hard drive to hard drive. My point was that seeing data transfer on the upload link at 300 Mb a second for more than an hour (16 hours in the case I was referring to) is impressive, in my opinion. ID: 1405867 ·

Link Send message Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0	Message 1407276 - Posted: 24 Aug 2013, 12:38:04 UTC - in response to Message 1402897. [quote]Cosmic_Ocean wrote: When we get stuck WUs where _0 missed the deadline and it got sent out to _2, and then _0 reports before _2 does, _0 and _1 validate, and _2 gets left in "awaiting validation" forever (one of many examples). For this situation, are the uploaded result files for each completed task still on-disk and available for the validator to use them to cross-check? I heard one time that once two results are compared, a canonical is chosen and the other one is "discarded." Does discarded mean deleted immediately, or what? It just seems that once a canonical is chosen, anything that gets reported after that never gets validated. Either there is nothing to validate against, or there is just some logic tweaking needed somewhere. I know it is a daunting task to break it down and look at every line of code, but if we can try to narrow it down to a small portion/section, it would surely make it much easier to fix it. Either the validator doesn't check to see if all of the results have reported, or it checks the first two that it can, chooses a canonical, discards the other, and when the third comes along[/quote] ... it should be marked as "too late to validate" and no credit awarded. That's how it works on MB, or at least how it worked last time I saw it. Only on AP something seems to be broken with that. Only if the late result arrives before the resend all 3 are validated. ID: 1407276 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 1407551 - Posted: 25 Aug 2013, 8:15:41 UTC - in response to Message 1407276. [quote]... it should be marked as "too late to validate" and no credit awarded. That's how it works on MB, or at least how it worked last time I saw it. Only on AP something seems to be broken with that. Only if the late result arrives before the resend all 3 are validated.[/quote] I have seen "too late to validate" before, but that's only if the late one gets reported after two of them validate and before the WU gets purged. There is nothing stopping you from reporting work late otherwise though. You even get messages in BOINC that say "you may not get credit for this, consider aborting it," but you can still crunch it and try to report it. The case I'm talking about though, the late wingmate returns and reports before the third wingmate does and it causes it to just hang, and the third wingmate is the one that ends up being punished for something they didn't do. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 1407551 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 35583 Credit: 261,360,520 RAC: 489	Message 1407582 - Posted: 25 Aug 2013, 10:51:29 UTC - in response to Message 1407551. [quote][quote]... it should be marked as "too late to validate" and no credit awarded. That's how it works on MB, or at least how it worked last time I saw it. Only on AP something seems to be broken with that. Only if the late result arrives before the resend all 3 are validated.[/quote] I have seen "too late to validate" before, but that's only if the late one gets reported after two of them validate and before the WU gets purged. There is nothing stopping you from reporting work late otherwise though. You even get messages in BOINC that say "you may not get credit for this, consider aborting it," but you can still crunch it and try to report it. The case I'm talking about though, the late wingmate returns and reports before the third wingmate does and it causes it to just hang, and the third wingmate is the one that ends up being punished for something they didn't do.[/quote] And I've seen all 3 validate at times also. Cheers. ID: 1407582 ·
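
The three outcomes reported in this thread for a result that arrives after a canonical has been chosen (all three validate, "too late to validate", or stuck in "awaiting validation") can be laid side by side in a toy model. This is emphatically not BOINC's validator code: the function `simulate`, the `late_policy` names, and the rule of picking the first arrival as canonical are all hypothetical, and the model assumes every result agrees scientifically. It only illustrates how the handling of a late arrival would produce each of the observed states.

```python
def simulate(arrival_order, quorum=2, late_policy="compare"):
    """Return the final state of each result for one workunit (toy model).

    late_policy controls what happens to a result reported after a
    canonical result has been chosen:
      "compare"  - check it against the canonical (toy: it always matches)
      "too_late" - mark it "too late to validate"
      "ignore"   - never look at it again, so it stays pending
    """
    if late_policy not in ("compare", "too_late", "ignore"):
        raise ValueError(f"unknown late_policy: {late_policy}")
    states = {}
    canonical = None
    for name in arrival_order:
        if canonical is not None:
            # A canonical result already exists; apply the late policy.
            if late_policy == "compare":
                states[name] = "valid"
            elif late_policy == "too_late":
                states[name] = "too late to validate"
            else:
                states[name] = "awaiting validation"
            continue
        states[name] = "awaiting validation"
        if len(states) >= quorum:
            # Quorum reached: pick the first arrival as canonical and
            # validate everything reported so far.
            canonical = arrival_order[0]
            for pending in states:
                states[pending] = "valid"
    return states


# _1 reports on time, the late _0 reports next, the resend _2 reports last.
order = ["_1", "_0", "_2"]
for policy in ("compare", "too_late", "ignore"):
    print(policy, simulate(order, late_policy=policy))
```

Under the hypothetical "ignore" policy, _2 is the result left in "awaiting validation", which matches the stuck case described above. Whether the real validator behaves this way is exactly the open question in the thread.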
