Emergence (Aug 13 2013)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1402757 - Posted: 13 Aug 2013, 21:07:53 UTC

Hello again! Once again I'm emerging from a span of time where I was either out of the lab or in the lab working on non-newsworthy development, and realizing it's been way too long since I drummed up one of these reports.

We had our usual Tuesday outage again today. Same old, same old. However, last week we had some scary, unexpected server crashes. First oscar (our main mysql server) crashed, and then a couple hours after that so did carolyn (the replica). Neither crashed so much as the kernels got into some sort of deadlock and couldn't be unwedged - in both cases we got the people down at the colocation facility to reboot the machines for us and all was well. Except the replica database needed to be resync'ed. I did so rather quickly, though the project had been up for a while and thus was not at a safe, clean break point. I thought all was well until after coming out of today's outage, when the replica hit a point of confusion in its logs. I guess I need that clean break point - I'm resync'ing again now and will do so again more safely next week. No big deal - this isn't hurting normal operations in the least.

Though largely we are under normal operating conditions, there are other behind the scenes activities going on - news to come when the time is right. One thing I can mention is that we're closer and closer to deciding that getting our science database entirely on solid state drives is going to be unavoidable if we are to ever analyze all this data. We just keep hitting disk i/o bottlenecks no matter what we try to speed things up.

Any other thoughts and questions? Am I missing anything? Yes, I know about the splitters getting stuck on some files...

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1402757 · Report as offensive
Profile arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1402776 - Posted: 13 Aug 2013, 21:33:12 UTC

Thanks Matt.

ID: 1402776 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1402796 - Posted: 13 Aug 2013, 22:20:16 UTC - in response to Message 1402757.  

Thanks for all that Matt,

Claggy
ID: 1402796 · Report as offensive
Profile mr.mac52
Joined: 18 Mar 03
Posts: 67
Credit: 245,882,461
RAC: 0
United States
Message 1402831 - Posted: 13 Aug 2013, 23:29:02 UTC - in response to Message 1402757.  

Thanks Matt, nice to hear from you and some of what is going on down home.

John
ID: 1402831 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1402897 - Posted: 14 Aug 2013, 4:21:16 UTC

1) SSD is a good way to go for maximum I/O (even cheap SSDs advertise over 50,000 IOPS). My only concern would be that the number of writes they receive could end up shortening their lives a bit, even though the rated endurance is something like 100,000 writes to any one sector. Databases do a lot of writing, so it's not too far-fetched.

You would want to go with the more durable enterprise-level SSDs, and make sure you do some sort of RAID setup on them so that if one goes bad, you can swap in a spare and keep going, and send the bad one off to be RMA'ed if the warranty period is still valid.
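To put a rough number on the endurance worry, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name, the 400 GB capacity, the daily write volume and the write-amplification factor are made up, and the 100,000-cycle figure is just the ballpark mentioned above. It also assumes perfect wear leveling, which real drives only approximate.

```python
def ssd_lifetime_years(capacity_gb, pe_cycles, writes_gb_per_day,
                       write_amplification=1.0):
    """Rough years until the rated program/erase cycles are used up.

    Assumes perfect wear leveling, so every cell ages at the same rate.
    """
    if writes_gb_per_day <= 0:
        raise ValueError("writes_gb_per_day must be positive")
    # Total data the flash can absorb before hitting its rated cycle count.
    total_writable_gb = capacity_gb * pe_cycles
    # The drive physically writes more than the host asks for.
    physical_gb_per_day = writes_gb_per_day * write_amplification
    return total_writable_gb / physical_gb_per_day / 365.0


# Hypothetical workload: 400 GB drive, 2 TB of database writes per day,
# each host write costing three physical writes.
years = ssd_lifetime_years(capacity_gb=400, pe_cycles=100_000,
                           writes_gb_per_day=2_000,
                           write_amplification=3.0)
print(f"Estimated lifetime: {years:.1f} years")
```

If the real cycle rating is lower than 100,000, the estimate shrinks in direct proportion, which is why the enterprise-grade drives and the RAID spare are worth the money.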



2) This one is a low-priority, but it involves the database, so anything that can help keep it tidy is worth looking into:

We get stuck WUs where _0 missed the deadline and the WU got sent out to _2; then _0 reports before _2 does, _0 and _1 validate, and _2 gets left in "awaiting validation" forever (one of many examples).

For this situation, are the uploaded result files for each completed task still on-disk and available for the validator to use them to cross-check? I heard one time that once two results are compared, a canonical is chosen and the other one is "discarded." Does discarded mean deleted immediately, or what?

It just seems that once a canonical is chosen, anything that gets reported after that never gets validated. Either there is nothing to validate against, or there is just some logic tweaking needed somewhere. I know it is a daunting task to break it down and look at every line of code, but if we can try to narrow it down to a small portion/section, it would surely make it much easier to fix it.

Either the validator doesn't check to see if all of the results have reported, or it checks the first two that it can, chooses a canonical, discards the other, and when the third comes along, it is expecting to be comparing against the other two, but only finds one (the canonical) and then just stops.

Getting to the bottom of this issue (and I imagine there are probably tens of thousands overall) can clean the database up quite a bit, and every little bit helps.
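
To make the second guess concrete, here is a toy model of it. This is NOT the BOINC validator code; the classes, the function names and the `recheck_late` switch are all invented for illustration, and results are treated as agreeing when their outputs are simply equal. It only shows how a result reported after the canonical is chosen could sit in "awaiting validation" if nothing ever compares it against the canonical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Result:
    name: str
    output: Optional[str] = None   # None until the host reports
    state: str = "in progress"


@dataclass
class WorkUnit:
    quorum: int = 2
    results: list = field(default_factory=list)
    canonical: Optional[Result] = None


def validate(wu, recheck_late):
    """One validator pass over a work unit (hypothetical logic)."""
    pending = [r for r in wu.results if r.state == "awaiting validation"]
    if wu.canonical is None:
        if len(pending) < wu.quorum:
            return  # not enough reported results yet
        output, count = Counter(r.output for r in pending).most_common(1)[0]
        if count >= wu.quorum:
            for r in pending:
                r.state = "valid" if r.output == output else "invalid"
            wu.canonical = next(r for r in pending if r.output == output)
    elif recheck_late:
        # Expected behaviour: compare latecomers against the canonical.
        for r in pending:
            r.state = "valid" if r.output == wu.canonical.output else "invalid"
    # Otherwise (the suspected bug) latecomers are never looked at again.


def report(wu, result, output, recheck_late):
    result.output = output
    result.state = "awaiting validation"
    validate(wu, recheck_late)


def run_scenario(recheck_late):
    wu = WorkUnit()
    r0, r1, r2 = Result("_0"), Result("_1"), Result("_2")
    wu.results = [r0, r1]
    report(wu, r1, "signal-A", recheck_late)  # _1 returns on time
    wu.results.append(r2)                     # _0 missed deadline, _2 sent out
    report(wu, r0, "signal-A", recheck_late)  # _0 reports late, before _2
    report(wu, r2, "signal-A", recheck_late)  # _2 finally reports
    return {r.name: r.state for r in wu.results}


print(run_scenario(recheck_late=False))
print(run_scenario(recheck_late=True))
```

With `recheck_late=False`, _0 and _1 come out valid and _2 stays in "awaiting validation", which matches the symptom; with `recheck_late=True` all three validate. If the real validator behaves like the first case, the code path taken once a canonical already exists would be the small section to look at.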
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1402897 · Report as offensive
Thomas
Volunteer tester

Joined: 9 Dec 11
Posts: 1499
Credit: 1,345,576
RAC: 0
France
Message 1402994 - Posted: 14 Aug 2013, 9:14:40 UTC

Thanks for the heads-up Matt and for the news about the outage of last week.
ID: 1402994 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1403073 - Posted: 14 Aug 2013, 13:43:49 UTC - in response to Message 1402897.  

2) This one is a low-priority, but it involves the database, so anything that can help keep it tidy is worth looking into:

We get stuck WUs where _0 missed the deadline and the WU got sent out to _2; then _0 reports before _2 does, _0 and _1 validate, and _2 gets left in "awaiting validation" forever (one of many examples).

For this situation, are the uploaded result files for each completed task still on-disk and available for the validator to use them to cross-check? I heard one time that once two results are compared, a canonical is chosen and the other one is "discarded." Does discarded mean deleted immediately, or what?

It just seems that once a canonical is chosen, anything that gets reported after that never gets validated. Either there is nothing to validate against, or there is just some logic tweaking needed somewhere. I know it is a daunting task to break it down and look at every line of code, but if we can try to narrow it down to a small portion/section, it would surely make it much easier to fix it.

Either the validator doesn't check to see if all of the results have reported, or it checks the first two that it can, chooses a canonical, discards the other, and when the third comes along, it is expecting to be comparing against the other two, but only finds one (the canonical) and then just stops.

Getting to the bottom of this issue (and I imagine there are probably tens of thousands overall) can clean the database up quite a bit, and every little bit helps.

In a similar vein, we are now also finding WUs where one or both of the users reported some error or other but they've been validated anyway, only they're not being purged. I'm wondering first of all how these can be valid science; I have one where I returned a -9 and my wingmate returned a "CUFFT in line 62" error. Second, why aren't they purging? Is it because each of them has a third user who timed out?

One other observation: I notice that the nitpickers are showing as Running. Is this cause for celebration? Even a little bit?

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1403073 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1403074 - Posted: 14 Aug 2013, 13:45:27 UTC - in response to Message 1402994.  

Thanks for the heads-up Matt and for the news about the outage of last week.

Oh, I meant to say that too.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1403074 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20140
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1403240 - Posted: 14 Aug 2013, 22:36:05 UTC - in response to Message 1402757.  

... One thing I can mention is that we're closer and closer to deciding that getting our science database entirely on solid state drives is going to be unavoidable if we are to ever analyze all this data. We just keep hitting disk i/o bottlenecks no matter what we try to speed things up...

Have these people sponsor you to try out their new kit?

Skyera unveils rival-crushing 21PB-a-rack flash monster


That should be a win-win. You get rapid access to your largest database of the universe in the universe. They get galactic scale Marketing hype for their hardware...

Yes?

;-)


Keep searchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1403240 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1403253 - Posted: 14 Aug 2013, 23:36:13 UTC


One other observation: I notice that the nitpickers are showing as Running. Is this cause for celebration? Even a little bit?


Your answer to this was here, http://setiathome.berkeley.edu/forum_thread.php?id=72057&postid=1382867, in Matt's last report. ;-)

Cheers.
ID: 1403253 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1403280 - Posted: 15 Aug 2013, 2:36:04 UTC
Last modified: 15 Aug 2013, 2:37:01 UTC

Thanks for the update Matt.

Just in case nobody there has noticed, "tape" 20jn12ac 50.20 GB (13) (done) has been in this state for a few days at least.
ID: 1403280 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1403352 - Posted: 15 Aug 2013, 8:04:26 UTC - in response to Message 1403280.  

Thanks for the update Matt.

Just in case nobody there has noticed, "tape" 20jn12ac 50.20 GB (13) (done) has been in this state for a few days at least.

Yes, I think they know about that one - they manually stopped the splitters wasting time on it.

Matt said "files" (plural) - I think the other problematic one is 23mr08aa, which has never moved beyond the first channel, but is still tying up a splitter.
ID: 1403352 · Report as offensive
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1639
Credit: 12,921,799
RAC: 89
New Zealand
Message 1404154 - Posted: 17 Aug 2013, 5:41:46 UTC

Thanks for the update Matt, it is very much appreciated. I have a couple of questions. When you transfer data to be split to the servers, how much do you send? On roughly the 14th of August there was some work sent. I worked this out to be 18 gig a minute, working with a figure of 300 MB a second. I think the transfer lasted around 16 hours, so that works out to 1080 gig per hour, which would be 17,280 gig for the 16 hours, if my mathematics is correct. My 2nd question: what speed are the disks/media that you are transferring onto? Lastly, how much total storage for work to be split is there?
I must say 300 MB per second average is very impressive for data to be transferred to media.
I have to agree with Cosmic_Ocean regarding the number of writes the SSDs will receive.
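
For anyone who wants to check the sums, a quick sketch. It assumes the 300 figure really is megabytes per second and uses decimal units (1 gig = 1000 MB); the function name is just for illustration.

```python
def transfer_volume_gb(rate_mb_per_s, seconds):
    """Data moved at a steady rate, in decimal gigabytes."""
    return rate_mb_per_s * seconds / 1000


RATE_MB_PER_S = 300  # assumed reading of the transfer rate

per_minute = transfer_volume_gb(RATE_MB_PER_S, 60)
per_hour = transfer_volume_gb(RATE_MB_PER_S, 60 * 60)
total_16h = transfer_volume_gb(RATE_MB_PER_S, 16 * 60 * 60)

print(per_minute, per_hour, total_16h)
```

Those three values come out as 18, 1080 and 17,280 gig, so the arithmetic holds as long as the rate really is in megabytes rather than megabits.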

ID: 1404154 · Report as offensive
Profile tullio
Volunteer tester

Joined: 9 Apr 04
Posts: 8797
Credit: 2,930,782
RAC: 1
Italy
Message 1404744 - Posted: 18 Aug 2013, 16:19:16 UTC

I am starting to use SSDs. A 120 GB disk on my laptop reaches about 200 MB/s in sequential read, measured by the "hdparm -t -T" Linux command. A 250 GB disk reaches up to 380 MB/s on the same command.
Tullio
ID: 1404744 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1404913 - Posted: 19 Aug 2013, 2:59:32 UTC - in response to Message 1404154.  

I must say 300 MB per second average is very impressive for data to be transferred to media.

300MB/sec isn't really all that unrealistic. Even with 7200RPM mechanical disks, if you put 3-4 in a RAID array so that you are writing to at least 3-4 disks at any given time, you can achieve that number.

My 4x500GB RAID-5 array is using WD RE2 drives from...2007/8, and it gets 190MB/sec write and 225MB/sec read. A single spare 500GB RE2 does about 80/105. I think it is due to my 16KB stripe size. When I can find somewhere to dump 1400GB of data to, I'm going to rebuild the array with a 64KB stripe (actually, I think the newer firmware for my Areca card now supports 128KB).

Friend of mine built a 4x1TB RAID5 with the same Areca card two years ago using Seagate disks. He went with 64KB stripe and was getting 270/350.

And I've seen some reviews for the WD 4TB Black Edition drives, and they have some really awesome transfer speeds just as a single disk.. high 100's.

However, gigabit has an actual limit of ~119MiB/sec (125MB/sec). 300Mbit/sec isn't really all that amazing.. only about 36MiB/sec, which some USB-connected devices can do.
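
The bits-versus-bytes conversion is easy to get wrong, so here is a small sketch of it. It assumes network rates are decimal (1 Mbit = 1,000,000 bits) and that MiB means 1,048,576 bytes; the function name is made up, and protocol overhead is ignored.

```python
def mbit_to_mib_per_s(mbit_per_s):
    """Convert a link rate in megabits/sec to mebibytes/sec."""
    bytes_per_s = mbit_per_s * 1_000_000 / 8   # 8 bits per byte
    return bytes_per_s / (1024 * 1024)         # bytes per MiB


link_300 = mbit_to_mib_per_s(300)
link_gigabit = mbit_to_mib_per_s(1000)

print(f"300 Mbit/s  = {link_300:.1f} MiB/s")
print(f"1000 Mbit/s = {link_gigabit:.1f} MiB/s")
```

Under those assumptions 300 Mbit/s is a little under 36 MiB/s and a full gigabit link is roughly 119 MiB/s before any overhead.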
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1404913 · Report as offensive
Profile Shannock9
Joined: 4 Jul 99
Posts: 1396
Credit: 634,964
RAC: 0
United Kingdom
Message 1405672 - Posted: 20 Aug 2013, 22:16:01 UTC
Last modified: 20 Aug 2013, 22:22:09 UTC

I must say 300 MB per second average is very impressive for data to be transferred to media.

"Never underestimate the bandwidth of a station wagon full of disk packs trundling down the freeway."
- my boss, circa 1968

ID: 1405672 · Report as offensive
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1639
Credit: 12,921,799
RAC: 89
New Zealand
Message 1405867 - Posted: 21 Aug 2013, 8:22:04 UTC - in response to Message 1404913.  

Cosmic_Ocean wrote:
However, gigabit has an actual limit of ~119MiB/sec (125MB/sec). 300Mbit/sec isn't really all that amazing.. only about 36MiB/sec, which some USB-connected devices can do.

300 Mb a second is not that fast in terms of transferring from hard drive to hard drive. Where I was coming from is that seeing data transfer on the upload link at 300 Mb a second for more than an hour (16 hours in the case I was referring to) is impressive, in my opinion.
ID: 1405867 · Report as offensive
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1407276 - Posted: 24 Aug 2013, 12:38:04 UTC - in response to Message 1402897.  

Cosmic_Ocean wrote:
We get stuck WUs where _0 missed the deadline and the WU got sent out to _2; then _0 reports before _2 does, _0 and _1 validate, and _2 gets left in "awaiting validation" forever (one of many examples).

For this situation, are the uploaded result files for each completed task still on-disk and available for the validator to use them to cross-check? I heard one time that once two results are compared, a canonical is chosen and the other one is "discarded." Does discarded mean deleted immediately, or what?

It just seems that once a canonical is chosen, anything that gets reported after that never gets validated. Either there is nothing to validate against, or there is just some logic tweaking needed somewhere. I know it is a daunting task to break it down and look at every line of code, but if we can try to narrow it down to a small portion/section, it would surely make it much easier to fix it.

Either the validator doesn't check to see if all of the results have reported, or it checks the first two that it can, chooses a canonical, discards the other, and when the third comes along

... it should be marked as "too late to validate" and no credit awarded.

That's how it works on MB, or at least how it worked last time I saw it. Only on AP something seems to be broken with that. Only if the late result arrives before the resend are all 3 validated.
ID: 1407276 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1407551 - Posted: 25 Aug 2013, 8:15:41 UTC - in response to Message 1407276.  

... it should be marked as "too late to validate" and no credit awarded.

That's how it works on MB, or at least how it worked last time I saw it. Only on AP something seems to be broken with that. Only if the late result arrives before the resend are all 3 validated.

I have seen "too late to validate" before, but that's only if the late one gets reported after two of them validate and before the WU gets purged.

There is nothing stopping you from reporting work late otherwise though. You even get messages in BOINC that say "you may not get credit for this, consider aborting it," but you can still crunch it and try to report it.

The case I'm talking about though, the late wingmate returns and reports before the third wingmate does and it causes it to just hang, and the third wingmate is the one that ends up being punished for something they didn't do.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1407551 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1407582 - Posted: 25 Aug 2013, 10:51:29 UTC - in response to Message 1407551.  

... it should be marked as "too late to validate" and no credit awarded.

That's how it works on MB, or at least how it worked last time I saw it. Only on AP something seems to be broken with that. Only if the late result arrives before the resend are all 3 validated.

I have seen "too late to validate" before, but that's only if the late one gets reported after two of them validate and before the WU gets purged.

There is nothing stopping you from reporting work late otherwise though. You even get messages in BOINC that say "you may not get credit for this, consider aborting it," but you can still crunch it and try to report it.

The case I'm talking about though, the late wingmate returns and reports before the third wingmate does and it causes it to just hang, and the third wingmate is the one that ends up being punished for something they didn't do.

And I've seen all 3 validate at times also.

Cheers.
ID: 1407582 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.