I Just Don't Believe It

Profile MikeSW17
Volunteer tester

Joined: 3 Apr 99
Posts: 1603
Credit: 2,700,523
RAC: 0
United Kingdom
Message 165187 - Posted: 9 Sep 2005, 17:46:59 UTC

This copying of the upload/download folders just doesn't add-up.

Ok so I don't know the exact figures for the data sizes being moved, but the upper limit is 1Tb because that is what they have.

Unless I've got my math screwed, a Tb is ~1,000,000,000 bytes and 48 hours is 172,800 seconds.

So, moving 1Tb in 48 hours means that the data is transferring at a stupidly pathetic 6Kb per second - less than that as it's not even a whole Tb.

Perhaps UCB should try connecting the systems with a direct serial cable rather than via a dial-up modem! It's not even transferring data at DSL speed... that would be 10 times faster.

UCB really need to be asking just why a simple copy/move operation cannot function on their system.

IMO, either the network, the raid, or the file system is badly screwed-up.

ID: 165187 · Report as offensive
Profile Rom Walton (BOINC)
Volunteer tester
Joined: 28 Apr 00
Posts: 579
Credit: 130,733
RAC: 0
United States
Message 165193 - Posted: 9 Sep 2005, 18:00:20 UTC

Wouldn't the slowdown of file lookups, which slows down the validation process, also slow down a copy operation?

----- Rom
BOINC Development Team, U.C. Berkeley
My Blog
ID: 165193 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 165194 - Posted: 9 Sep 2005, 18:00:45 UTC - in response to Message 165187.  

UCB really need to be asking just why a simple copy/move operation cannot function on their system.

IMO, either the network, the raid, or the file system is badly screwed-up.

Maybe it's just me, but.....

If the problem is "poor performance on the current RAID" and the proposed fix is to "get the files off of the current raid" then isn't the file copy operation likely to be subject to the same "poor performance" issue?
ID: 165194 · Report as offensive
Profile Martin P.

Joined: 19 May 99
Posts: 294
Credit: 27,230,961
RAC: 2
Austria
Message 165195 - Posted: 9 Sep 2005, 18:01:36 UTC - in response to Message 165187.  
Last modified: 9 Sep 2005, 18:04:03 UTC

This copying of the upload/download folders just doesn't add-up.

Ok so I don't know the exact figures for the data sizes being moved, but the upper limit is 1Tb because that is what they have.

Unless I've got my math screwed, a Tb is ~1,000,000,000 bytes and 48 hours is 172,800 seconds.

So, moving 1Tb in 48 hours means that the data is transferring at a stupidly pathetic 6Kb per second - less than that as it's not even a whole Tb.

Perhaps UCB should try connecting the systems with a direct serial cable rather than via a dial-up modem! It's not even transferring data at DSL speed... that would be 10 times faster.

UCB really need to be asking just why a simple copy/move operation cannot function on their system.

IMO, either the network, the raid, or the file system is badly screwed-up.


Not Quite:

1.000 bytes = 1 kilobyte
1.000.000 bytes = 1.000 kB = 1 MB
1.000.000.000 bytes = 1.000.000 kB = 1.000 MB = 1 GB
1.000.000.000.000 bytes = 1.000.000.000 kB = 1.000.000 MB = 1.000 GB = 1 TB

So, you are wrong by a factor of 1.000.

Mind: I used European thousands separators! I also left out the 24 and its multiples in 1024.

ID: 165195 · Report as offensive
Profile Francis Noel
Joined: 30 Aug 05
Posts: 452
Credit: 142,832,523
RAC: 94
Canada
Message 165196 - Posted: 9 Sep 2005, 18:02:24 UTC
Last modified: 9 Sep 2005, 18:03:23 UTC

I snipped a wiki


A terabyte (derived from the SI prefix tera-) is a unit of information or computer storage equal to one trillion (one long scale billion) bytes. It is commonly abbreviated TB.

Because of irregularities in using the binary prefix in the definition and usage of the kilobyte, the exact number in common practice could be either one of the following:

1,000,000,000,000 bytes – 1000^4 or 10^12.
1,099,511,627,776 bytes – 1024^4 or 2^40. This capacity may be expressed unambiguously as a tebibyte.

The prefix "tera" originates from the Greek word teras meaning 'monster'


I think a couple zeros were missing from your original calculations MikeS.
I'll let someone else do the math, I positively sock at it.

Edit: I got beaten to the post while reading that fascinating wiki, sorry for the redundancy.
mambo
ID: 165196 · Report as offensive
Profile MicroBeta

Joined: 22 Jun 04
Posts: 7
Credit: 224,853
RAC: 0
United States
Message 165198 - Posted: 9 Sep 2005, 18:08:10 UTC

Actually I think that 1,000,000,000 is a Gb.

If I have this correct then 6Kb/s would become
6000Kb/s or approx 6Mb/s.

Assuming:
Kb - Kilobyte = 2^10=1,024
Mb - Megabyte = 2^20=1,048,576
Gb - Gigabyte = 2^30=1,073,741,824
Tb - Terabyte = 2^40=1,099,511,627,776

1,099,511,627,776 bytes/172,800 sec = 6,362,915 bytes/s

If I have this wrong, let me know.
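
For anyone who wants to check the arithmetic in this thread, here is a minimal Python sketch. It assumes the worst case discussed above, a full terabyte moved in exactly 48 hours, and works it out for both the decimal and the binary reading of "terabyte". The names `rate`, `TB_DECIMAL` and `TB_BINARY` are made up for this illustration.

```python
# Average throughput needed to move one terabyte in 48 hours.
# Assumption: the full 1 TB upper limit is copied; the real data set is smaller.

SECONDS = 48 * 60 * 60       # 172,800 seconds in 48 hours

TB_DECIMAL = 1000 ** 4       # 1,000,000,000,000 bytes
TB_BINARY = 1024 ** 4        # 1,099,511,627,776 bytes (a tebibyte)


def rate(total_bytes, seconds=SECONDS):
    """Return the average throughput in bytes per second."""
    return total_bytes / seconds


for label, size in (("decimal TB", TB_DECIMAL), ("binary TB", TB_BINARY)):
    bytes_per_sec = rate(size)
    print(f"{label}: {bytes_per_sec:,.0f} bytes/s "
          f"(about {bytes_per_sec / 1e6:.1f} million bytes/s)")
```

Either way the answer lands at roughly 6 million bytes per second, not 6 thousand, which is the factor of 1,000 pointed out earlier in the thread.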

Mike

ID: 165198 · Report as offensive
Profile hooded.figure
Volunteer tester

Joined: 15 Dec 02
Posts: 33
Credit: 670,271
RAC: 0
United States
Message 165199 - Posted: 9 Sep 2005, 18:09:59 UTC - in response to Message 165198.  

Actually I think that 1,000,000,000 is a Gb.

If I have this correct then 6Kb/s would become
6000Kb/s or approx 6Mb/s.

Assuming:
Kb - Kilobyte = 2^10=1,024
Mb - Megabyte = 2^20=1,048,576
Gb - Gigabyte = 2^30=1,073,741,824
Tb - Terabyte = 2^40=1,099,511,627,776

1,099,511,627,776 bytes/172,800 sec = 6,362,915 bytes/s

If I have this wrong, let me know.

Mike


You are right, I was actually working this out when you posted

-matt
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety."
- Benjamin Franklin,
Historical Review of Pennsylvania, 1759.
ID: 165199 · Report as offensive
Profile MikeSW17
Volunteer tester

Joined: 3 Apr 99
Posts: 1603
Credit: 2,700,523
RAC: 0
United Kingdom
Message 165202 - Posted: 9 Sep 2005, 18:16:32 UTC - in response to Message 165193.  

Wouldn't the slowdown of file lookups, which slows down the validation process, also slow down a copy operation?


Yes Rom, I expect it would. And that's what is worrying me.
The slow file lookups are due to individual directory size, not the volume size.
If so, the directories may well end up on different systems, but as I understand it, the directories' size and content won't change, so the lookups will remain just as slow after this copy process.


ID: 165202 · Report as offensive
Profile MikeSW17
Volunteer tester

Joined: 3 Apr 99
Posts: 1603
Credit: 2,700,523
RAC: 0
United Kingdom
Message 165204 - Posted: 9 Sep 2005, 18:27:26 UTC

Well there you go then, somehow I forgot Gb comes between Mb and Tb - Oops, sorry.

Mind you, that still only gives a less-than-6Mb/sec transfer rate, which IMO is still very poor for the hardware involved.



ID: 165204 · Report as offensive
Profile Francis Noel
Joined: 30 Aug 05
Posts: 452
Credit: 142,832,523
RAC: 94
Canada
Message 165205 - Posted: 9 Sep 2005, 18:39:54 UTC


Mind you, that still only gives a less-than-6Mb/sec transfer rate, which IMO is still very poor for the hardware involved.


What if you factor in the directory lookup "overhead" ?
mambo
ID: 165205 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 165211 - Posted: 9 Sep 2005, 18:52:03 UTC - in response to Message 165193.  

Rom

Yes it would, so finding the real issue of whether it is a hardware or software failure is the hard part.

Wouldn't the slowdown of file lookups, which slows down the validation process, also slow down a copy operation?


If we look at just the Raid Array itself (hardware)
* If the controller talking to the backplane/fabric has an issue, it will have trouble finding the correct stripe to look at.
* If the controller talking to the backplane/fabric has an issue, it may read the data from the stripe, calculate incorrect parity, and issue a retry.
* If the controller talking to the backplane/fabric has a cache issue, it will read "data" slower.
* If the controller talking to the backplane/fabric has a cache issue, it may see incorrect parity and issue a retry.
* If the backplane/fabric has an issue talking to a single drive in the stripe, you have a retry.
* If a cable connecting the controller to the backplane/fabric has an issue, data is read/written with incorrect parity and retried.
* If the array is mirrored, an issue writing the data across "both" stripes normally results in a retry. Reboot, break the mirror and test...
* If a drive in the array is failing (presuming no mirror), parity is checked when data is read or written; if parity is not correct, there is a retry.
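
To make the "incorrect parity, then retry" pattern in the list above concrete, here is a toy Python sketch. It is only a model: it assumes simple XOR parity across the blocks of a stripe (the thread does not say which RAID level or controller is involved), and `xor_parity`, `read_with_retry` and the `read_stripe` callback are hypothetical names, not any real controller's interface.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_with_retry(read_stripe, max_retries=3):
    """Call read_stripe() until data and stored parity agree.

    read_stripe is a hypothetical callback returning (data_blocks, parity).
    Returns (data_blocks, attempts_used); raises IOError if every attempt fails.
    """
    for attempt in range(1, max_retries + 1):
        data_blocks, parity = read_stripe()
        if xor_parity(data_blocks) == parity:
            return data_blocks, attempt
        # Parity mismatch: something on the path returned bad data, so re-read.
    raise IOError(f"parity mismatch after {max_retries} attempts")
```

The point of the sketch is that every mismatch costs a whole extra read, so a marginal cable, cache or drive can show up purely as slowness rather than as a hard error.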

In many of these cases it may only show up as a "Wait" (Controller BIOS reporting) and not be the true cause of the "reported" error.

So any reasonable RAID controller has diagnostics that can test the RAID array and report failures in the "Controller" log (most keep a limited log for troubleshooting)... It would include Fibre Channel and cable connections. In the case of a soft failure of a single drive that does not reach the percentage/threshold indicating a true hardware failure, you end up with multiple retries until it succeeds.

The OS logs should show failed Ethernet connections and point to the issue...


R/

Al

Please consider a Donation to the Seti Project.

ID: 165211 · Report as offensive
Profile Dorsai
Joined: 7 Sep 04
Posts: 474
Credit: 4,504,838
RAC: 0
United Kingdom
Message 165212 - Posted: 9 Sep 2005, 18:54:48 UTC
Last modified: 9 Sep 2005, 18:56:31 UTC

"It used to be fast, but now is slow" is a basic summation of what Matt said re the RAID unit.

I get the feeling that in a few weeks/months someone @ Berkeley will have a Eureka! moment when they notice that:

A "gizmo" in the corner has accidentally been switched from "fast" to "slow"

OR

A plug has come out....

OR

A cable has had a nail put through it.

OR

"Who the @~@~ turned this off!!!!"

OR

Something equally silly, that the basic system is able to "compensate" for, but at a major loss of performance.

Foamy is "Lord and Master".
(Oh, + some Classic WUs too.)
ID: 165212 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 165218 - Posted: 9 Sep 2005, 19:04:26 UTC - in response to Message 165194.  

Ned

UCB really need to be asking just why a simple copy/move operation cannot function on their system.

IMO, either the network, the raid, or the file system is badly screwed-up.

Maybe it's just me, but.....

If the problem is "poor performance on the current RAID" and the proposed fix is to "get the files off of the current raid" then isn't the file copy operation likely to be subject to the same "poor performance" issue?


Moving the UL/DL to a separate machine will improve performance of UL/DL. The "Validation" will get some relief... and may provide some additional time to determine the true cause. However, a portion of the potential problem has been removed in the troubleshooting process.

* If it is an Inode issue then it should be readily apparent... With the reduced number of files, "Things should just fly."
* If it is a hardware issue things (validation/etc..) will still be slow... Back to troubleshooting.

R/

Al

Please consider a Donation to the Seti Project.

ID: 165218 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 165219 - Posted: 9 Sep 2005, 19:06:36 UTC
Last modified: 9 Sep 2005, 19:08:07 UTC

I'll tell you exactly what happened.

First off - we're just doing uploads. That's 11 million files on disks totalling about 160GB. Raw transfer rate would be, if you do the math correctly, just under 1MB/sec over 48 hours. This is about the I/O rate we are seeing.

EDIT: It's this slow strictly because of the directory lookups that have to happen for each file being copied off the old file server.
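
The figures quoted above (11 million files, about 160GB, 48 hours) can be broken down per file with a short Python sketch. It assumes decimal gigabytes and an evenly spread copy, neither of which is stated in the post, so treat the results as rough averages only.

```python
# Per-file view of the upload copy, using the figures quoted in the post.
# Assumptions: 160 GB means 160 * 10**9 bytes, and the copy runs at a steady rate.

FILES = 11_000_000
TOTAL_BYTES = 160 * 1000 ** 3
SECONDS = 48 * 60 * 60

bytes_per_sec = TOTAL_BYTES / SECONDS   # overall throughput
files_per_sec = FILES / SECONDS         # files copied each second
avg_file_bytes = TOTAL_BYTES / FILES    # average size of one upload file
ms_per_file = 1000 / files_per_sec      # time budget per file

print(f"throughput:   {bytes_per_sec / 1e6:.2f} MB/s")
print(f"files/second: {files_per_sec:.1f}")
print(f"average file: {avg_file_bytes / 1000:.1f} kB")
print(f"time/file:    {ms_per_file:.1f} ms")
```

With files this small, each one gets only a few tens of milliseconds, so the cost of finding the file in its directory can easily dominate the cost of moving its bytes.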

Reason why original estimates were off: We did some speed tests first by copying a select few directories from the old server to the new one. We deleted those test copies, and did some more tests on the same directories with three copy processes running - it went 3 times as fast! So we fired that off and guessed it would take about 12-16 hours to complete. However, we didn't take into account that a few test directories were still in cache from our tests. So the three copy processes weren't actually going that much faster than just one.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 165219 · Report as offensive
Profile [B@H] Ray
Volunteer tester
Joined: 1 Sep 00
Posts: 485
Credit: 45,275
RAC: 0
United States
Message 165221 - Posted: 9 Sep 2005, 19:08:51 UTC

Summarizing the above post:

The large number of files (making it hard to read the directory) along with the large size is what is causing it to take a long time.

------------------

Also remember that they have been having problems with the old disk array not working correctly for almost 2 months now. That is what caused the past long outages in the first place. So there is a chance that the new (but still old) disk array will be a lot faster and not have the large WTV queue after this, which would be well worth not being able to turn in work for two days. So you have to wait a little longer to turn units in, but may get credit a lot faster as a result of the outage; only time will tell.

There are newer disk arrays out there for which this project would be small, but they cost a lot more than the project has to spend. They would be good, but the project, like our families, has to work within a budget. The only way to get one would be if one of the users found out how much it would cost and took up a collection from the users. But most users have very little extra, and the collection would probably still come up short.


Pizza@Home Rays Place Rays place Forums
ID: 165221 · Report as offensive
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 165228 - Posted: 9 Sep 2005, 19:33:52 UTC - in response to Message 165219.  

Matt

Thank You

I'll tell you exactly what happened.

First off - we're just doing uploads. That's 11 million files on disks totalling about 160GB. Raw transfer rate would be, if you do the math correctly, just under 1MB/sec over 48 hours. This is about the I/O rate we are seeing.

EDIT: It's this slow strictly because of the directory lookups that have to happen for each file being copied off the old file server.

Reason why original estimates were off: We did some speed tests first by copying a select few directories from the old server to the new one. We deleted those test copies, and did some more tests on the same directories with three copy processes running - it went 3 times as fast! So we fired that off and guessed it would take about 12-16 hours to complete. However, we didn't take into account that a few test directories were still in cache from our tests. So the three copy processes weren't actually going that much faster than just one.

- Matt


R/

Al

Please consider a Donation to the Seti Project.

ID: 165228 · Report as offensive
Profile Anigel
Volunteer tester
Joined: 5 Dec 99
Posts: 101
Credit: 643,544
RAC: 0
United Kingdom
Message 165237 - Posted: 9 Sep 2005, 19:50:35 UTC

Remember when working out what a MB, GB, TB is that only disk manufacturers use the incorrect 1KB = 1000 bytes. This means they can sell you a drive labelled 200GB on the box whilst in use it has a much lower capacity.

1 gigabyte = 8,589,934,592 bits, not the 8,000,000,000 bits these manufacturers use to artificially inflate the size of their drives.

Over a 200GB drive, that makes a difference of

1,717,986,918,400 - 1,600,000,000,000 = 117,986,918,400 bits or nearly 14GB

So when you buy a 200GB drive you are actually buying a 186GB drive. Scale that up to a 360GB drive and you are actually buying a 335GB one.

In any other market this would be treated as false advertising, however somehow hard drive manufacturers seem exempt from this even though they overstate their products' capacity by 7%.
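
The drive-size comparison above is easy to reproduce. This is a small Python sketch, assuming the box uses 1 GB = 1000^3 bytes while the operating system reports in units of 1024^3 bytes; `advertised_to_binary_gb` is a name made up for the illustration.

```python
# Convert an advertised (decimal) drive capacity into binary gigabytes.
# Assumption: the label uses 1 GB = 1000**3 bytes, the OS uses 1024**3 bytes.

DECIMAL_GB = 1000 ** 3
BINARY_GB = 1024 ** 3


def advertised_to_binary_gb(advertised_gb):
    """Return the capacity as reported in 1024**3-byte units."""
    return advertised_gb * DECIMAL_GB / BINARY_GB


for label_gb in (200, 360):
    print(f"{label_gb} GB on the box is about "
          f"{advertised_to_binary_gb(label_gb):.0f} GB in binary units")

# The gap between the two definitions of a gigabyte, as a percentage.
gap_percent = (BINARY_GB / DECIMAL_GB - 1) * 100
print(f"binary GB is {gap_percent:.1f}% larger than decimal GB")
```

The gap grows with each prefix step, so it is larger again at the terabyte level than at the gigabyte level shown here.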
Part of Teamseti
For SetiBoinc status graphs visit Teamseti status graphs
ID: 165237 · Report as offensive
Divide Overflow
Volunteer tester
Joined: 3 Apr 99
Posts: 365
Credit: 131,684
RAC: 0
United States
Message 165239 - Posted: 9 Sep 2005, 19:56:53 UTC

Is this a file move operation, or a file copy > verify > delete source operation? I keep hearing the term "copy" which indicates that these 11 million upload files will all need to be deleted when the copy operation is done.

ID: 165239 · Report as offensive
Profile SunMicrosystemsLLG

Joined: 4 Jul 05
Posts: 102
Credit: 1,360,617
RAC: 0
United Kingdom
Message 165240 - Posted: 9 Sep 2005, 19:58:29 UTC

Also, if the destination device contains a 'new' RAID volume this will probably have to be left quite a while to 'sync' on completion of the file copy.
ID: 165240 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 165242 - Posted: 9 Sep 2005, 20:00:54 UTC - in response to Message 165240.  

Also, if the destination device contains a 'new' RAID volume this will probably have to be left quite a while to 'sync' on completion of the file copy.

If it's an empty RAID, shouldn't it stay in sync as files are added?
ID: 165242 · Report as offensive