New tech news item...

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 150360 - Posted: 11 Aug 2005, 20:50:05 UTC

Check out the new item in tech news...

Bottom line:

All queues are moving in a positive direction, though some much slower than others, except for the validation queue, which is just barely unable to currently keep up. The only bottleneck slowing validation down is large directory sizes on the upload/download filesystem. This is being addressed in many ways, and we should see this queue start to drain as fixes are applied.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 150360 · Report as offensive
Profile Pooh Bear 27
Volunteer tester
Joined: 14 Jul 03
Posts: 3224
Credit: 4,603,826
RAC: 0
United States
Message 150362 - Posted: 11 Aug 2005, 20:57:20 UTC

Awesome tech article, Matt. Glad to see things moving forward, and starting to find some relief.

Keep up the great work.



My movie https://vimeo.com/manage/videos/502242
ID: 150362 · Report as offensive
Profile tekwyzrd
Volunteer tester
Joined: 21 Nov 01
Posts: 767
Credit: 30,009
RAC: 0
United States
Message 150364 - Posted: 11 Aug 2005, 21:05:48 UTC
Last modified: 11 Aug 2005, 21:06:50 UTC

Thanks for the detailed report. I may be wrong but it seems to me that if the current directories are too large it might help if the results were organized with a directory for each tape. They could then be eliminated after the results are validated and deleted.
Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws.
Douglas Adams (1952 - 2001)
ID: 150364 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20372
Credit: 7,508,002
RAC: 20
United Kingdom
Message 150369 - Posted: 11 Aug 2005, 21:19:02 UTC - in response to Message 150360.  
Last modified: 11 Aug 2005, 21:19:38 UTC

Check out the new item in tech news...

Bottom line:

All queues are moving in a positive direction, though some much slower than others, except for the validation queue, which is just barely unable to currently keep up. The only bottleneck slowing validation down is large directory sizes on the upload/download filesystem. This is being addressed in many ways, and we should see this queue start to drain as fixes are applied.

- Matt

Matt, thanks for the detailed tech news. Good to see the server side of what's happening!

Re:
Adding more fan-out directories wouldn't help, as then we would have an equally large directory of subdirectories. Of course, we could make a fan-out of fan-outs, but this would require some significant code changes, as well as long outage to implement, and frankly it's an ungraceful solution.

So don't keep us in suspense! What is going to be the graceful solution to this?

Fewer files?
A database?
ReiserFS?!
Or another one or two levels of fan-out?

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 150369 · Report as offensive
Profile Sir Ulli
Volunteer tester
Joined: 21 Oct 99
Posts: 2246
Credit: 6,136,250
RAC: 0
Germany
Message 150372 - Posted: 11 Aug 2005, 21:26:59 UTC

Thanks for the info, Matt, and it is good to know that you are working on this.

Greetings from Germany NRW
Ulli


ID: 150372 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 150379 - Posted: 11 Aug 2005, 21:48:23 UTC - in response to Message 150369.  


So don't keep us in suspense! What is going to be the graceful solution to this?

Fewer files?
A database?
ReiserFS?!
Or another one or two levels of fan-out?


Answer: fewer files. We have a lot to delete, and we're deleting as fast as we can without hurting normal operations.

Not sure what you mean by database. Our database is just fine. Faster database won't help the current problem.

ReiserFS: well, our current upload/download file server doesn't support it. We are pretty certain we can get by without it, as long as we optimize our code and wait for queues to drain. Bear in mind the current condition is pathological, and we should have far fewer files on disk than we do now. Of course, we are working towards making sure this doesn't happen again (if at all possible)!

More fan-out levels: I mention this in the tech note. It would require a major programming change (major in that all our working systems would have to be broken open, recompiled, tested, etc.), but this would only solve the current problem. It would then require a long outage to move half a terabyte of files around, which would aggravate the current problem.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 150379 · Report as offensive
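
For readers following the fan-out discussion above, here is a minimal sketch of what a "fan-out of fan-outs" could look like. It is not the project's actual code: the function name `fanout_path`, the `levels` and `fanout` parameters, and the choice of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def fanout_path(root, filename, levels=2, fanout=256):
    """Return root/<h1>/<h2>/.../filename, one hashed subdirectory per level.

    Hypothetical helper: `levels` and `fanout` are illustrative parameters,
    not real project configuration options.
    """
    if not 1 <= fanout <= 256:
        raise ValueError("fanout must be between 1 and 256")
    if not 1 <= levels <= 32:
        raise ValueError("levels must be between 1 and 32")  # SHA-256 yields 32 bytes
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    # One digest byte per level selects the subdirectory at that level.
    parts = [format(digest[i] % fanout, "02x") for i in range(levels)]
    return PurePosixPath(root, *parts, filename)
```

With two levels of 256, one million files would average roughly 15 per leaf directory. The sketch says nothing about the costs Matt describes: every program that reads or writes these files would need the same path logic, and existing files would have to be moved during an outage.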
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20372
Credit: 7,508,002
RAC: 20
United Kingdom
Message 150383 - Posted: 11 Aug 2005, 22:19:42 UTC - in response to Message 150379.  
Last modified: 11 Aug 2005, 22:40:09 UTC


So don't keep us in suspense! What is going to be the graceful solution to this?
...

Answer: fewer files. We have a lot to delete, and we're deleting as fast as we can without hurting normal operations.

That works fine for now. Will the active upload/download file count stay manageable for the future when s@h-classic is closed?

Not sure what you mean by database. Our database is just fine. Faster database won't help the current problem.

There are many small files, and ReiserFS is not supported. Hence, why not use a separate database just for handling the uploaded result files, and even the download WUs?...


OK on the fan-out levels being a very big thing to change and to include in all the various bits of code. Perhaps modularise it into another server process dedicated to get and put files? (Or is this getting to be too much like another database?!)

Aside: I've recently reshuffled 200GBytes of files between four partitions across two physical disks. Yes, it does take a long time!

Thanks for the heads-up. Keep up the good work!

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 150383 · Report as offensive
Profile Sir Ulli
Volunteer tester
Joined: 21 Oct 99
Posts: 2246
Credit: 6,136,250
RAC: 0
Germany
Message 150385 - Posted: 11 Aug 2005, 22:46:31 UTC - in response to Message 150360.  
Last modified: 11 Aug 2005, 22:47:17 UTC

Check out the new item in tech news...

Bottom line:

All queues are moving in a positive direction, though some much slower than others, except for the validation queue, which is just barely unable to currently keep up. The only bottleneck slowing validation down is large directory sizes on the upload/download filesystem. This is being addressed in many ways, and we should see this queue start to drain as fixes are applied.

- Matt


I think there is also an Internet problem there:

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\Dokumente und Einstellungen\ulli>ping setiathome.ssl.berkeley.edu

Pinging setiathome.ssl.berkeley.edu [128.32.18.152] with 32 bytes of data:

Reply from 128.32.18.152: bytes=32 time=364ms TTL=235
Reply from 128.32.18.152: bytes=32 time=235ms TTL=235
Request timed out.
Reply from 128.32.18.152: bytes=32 time=249ms TTL=235

Ping statistics for 128.32.18.152:
Packets: Sent = 4, Received = 3, Lost = 1 (25% loss),
Approximate round trip times in milliseconds:
Minimum = 235ms, Maximum = 364ms, Average = 282ms

C:\Dokumente und Einstellungen\ulli>tracert setiathome.ssl.berkeley.edu

Tracing route to setiathome.ssl.berkeley.edu [128.32.18.152] over a maximum of 30 hops:

1 1 ms 1 ms 1 ms 192.168.0.22
2 63 ms 76 ms 160 ms 212-62-80-254.teleos-web.de [212.62.80.254]
3 179 ms 61 ms 185 ms m5-re.dts-online.net [212.62.64.3]
4 163 ms 60 ms 195 ms m10-hf.dts-online.net [212.62.64.30]
5 220 ms 65 ms 145 ms DTS.DO-2-pos130.de.lambdanet.net [217.71.111.29]
6 155 ms 64 ms 71 ms DUS-2-pos210.de.lambdanet.net [217.71.105.57]
7 69 ms 68 ms 201 ms AMS-2-pos100.nl.lambdanet.net [82.197.128.17]
8 96 ms 207 ms 73 ms gsr12416.ams.he.net [195.69.145.150]
9 81 ms 82 ms 201 ms pos0-0.gsr12416.lon.he.net [216.66.24.157]
10 164 ms 203 ms 225 ms pos8-0.gsr12416.nyc.he.net [216.218.200.101]
11 336 ms 344 ms 235 ms pos7-0.gsr12012.sjc.he.net [216.218.254.153]
12 233 ms 243 ms 259 ms pos1-2.gsr12416.fmt.he.net [64.71.128.182]
13 232 ms 245 ms 232 ms pos2-1.gsr12416.pao.he.net [64.62.249.122]
14 232 ms 291 ms 249 ms paix-px1--hurricane-ge.cenic.net [198.32.251.69]
15 368 ms 270 ms 275 ms dc-oak-dc2--oakk-dc1-p2p-1.cenic.net [137.164.22.193]
16 338 ms 391 ms 278 ms ucb--oak-dc2-ge.cenic.net [137.164.23.30]
17 313 ms 249 ms 258 ms g3-14.inr-202-reccev.Berkeley.EDU [128.32.0.39]
18 278 ms 244 ms 260 ms g6-2.inr-230-spr.Berkeley.EDU [128.32.255.114]
19 247 ms * 286 ms solen.SSL.Berkeley.EDU [128.32.18.209]
20 * * * Request timed out.
21 * * * Request timed out.
22 * * * Request timed out.
23 * * * Request timed out.
24 398 ms * * klaatu.ssl.berkeley.edu [128.32.18.152]
25 240 ms * 237 ms klaatu.ssl.berkeley.edu [128.32.18.152]

Trace complete.

C:\Dokumente und Einstellungen\ulli>


Both tracert and ping report problems...

Greetings from Germany NRW
Ulli
ID: 150385 · Report as offensive
Don Erway
Volunteer tester

Joined: 18 May 99
Posts: 305
Credit: 471,946
RAC: 0
United States
Message 150390 - Posted: 12 Aug 2005, 0:10:03 UTC

All the queues are shrinking, except one...

I suggest that rather than come up with a way to "solve" this problem of too many files, which should really be a condition that is never allowed to happen anyway, the system just be set up to throttle back on WU output, whenever the queues are higher than an hour or so.

Stop pouring out new WUs. Cut them back enough until you see the validator queue start to actually drop, then keep them cut there, until it drops all the way.

There are plenty of other worthy projects to take up any spare cycles, and keeping the queues small at all times, guarantees good file access performance, so the whole system can run as designed.


ID: 150390 · Report as offensive
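
The throttling idea in the post above could be sketched roughly as follows. This is only an illustration under assumed inputs: `validation_backlog`, `normal_batch`, and `backlog_limit` are hypothetical counts, and nothing in the thread says the project's scheduler exposes such values.

```python
def workunits_to_release(validation_backlog, normal_batch, backlog_limit):
    """Return how many new workunits to release in this cycle.

    All three parameters are hypothetical counts chosen for illustration.
    """
    if validation_backlog <= backlog_limit:
        # The validator is keeping up, so release the usual amount.
        return normal_batch
    # Scale output down in proportion to how far the backlog exceeds the limit.
    return normal_batch * backlog_limit // validation_backlog
```

For example, with a limit of 1000 queued results and a normal batch of 1000 workunits, a backlog of 2000 halves the batch and a backlog of 4000 cuts it to a quarter.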
Profile Ananas
Volunteer tester

Joined: 14 Dec 01
Posts: 195
Credit: 2,503,252
RAC: 0
Germany
Message 150403 - Posted: 12 Aug 2005, 1:08:19 UTC

I can confirm the network (traceroute/ping) trouble :-(
_____________________________

As for the file system slowdown:

It sometimes helps to create a new directory, hardlink all files to the new directory, delete the old directory and rename the new one to the name of the old one.
ID: 150403 · Report as offensive
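
The hardlink trick described in the post above can be sketched as below. It is a minimal illustration, not a tested operational procedure: the function name `rebuild_directory` and the `.rebuild` suffix are made up for the example, it assumes a flat directory of regular files on a filesystem that supports hard links, and it assumes no other process touches the directory while it runs. Whether it actually speeds anything up depends on the filesystem.

```python
import os


def rebuild_directory(old_dir):
    """Replace old_dir with a fresh directory holding hard links to the same files."""
    old_dir = os.path.abspath(old_dir)
    new_dir = old_dir + ".rebuild"  # hypothetical temporary name

    with os.scandir(old_dir) as it:
        entries = list(it)
    # Refuse anything but a flat directory of regular files.
    if any(not e.is_file(follow_symlinks=False) for e in entries):
        raise ValueError("only flat directories of regular files are handled")

    os.mkdir(new_dir)
    for entry in entries:
        # A hard link adds a second name for the same data; nothing is copied.
        os.link(entry.path, os.path.join(new_dir, entry.name))
    for entry in entries:
        # Removing the old name leaves the data reachable through the new link.
        os.unlink(entry.path)
    os.rmdir(old_dir)  # fails if anything new appeared in the meantime
    os.rename(new_dir, old_dir)
```

Using `os.rmdir` rather than a recursive delete is deliberate: if another process dropped a file into the old directory mid-way, the call fails instead of silently discarding it.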
N/A
Volunteer tester

Joined: 18 May 01
Posts: 3718
Credit: 93,649
RAC: 0
Message 150412 - Posted: 12 Aug 2005, 1:49:07 UTC - in response to Message 150390.  

Don't just cut back on WU production, but re-task the splitter: Make it a temporary validator.

Isn't there cluster SW that can do that (and how hard would it be to implement)?
ID: 150412 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 150420 - Posted: 12 Aug 2005, 4:25:36 UTC - in response to Message 150390.  

All the queues are shrinking, except one...

I suggest that rather than come up with a way to "solve" this problem of too many files, which should really be a condition that is never allowed to happen anyway, the system just be set up to throttle back on WU output, whenever the queues are higher than an hour or so.

Stop pouring out new WUs. Cut them back enough until you see the validator queue start to actually drop, then keep them cut there, until it drops all the way.

There are plenty of other worthy projects to take up any spare cycles, and keeping the queues small at all times, guarantees good file access performance, so the whole system can run as designed.


Remember that some clients have 10 days of work cached. It may take 10 days before the throttling really "throttles" -- and starved clients may just connect more often.
ID: 150420 · Report as offensive
Profile Toby
Volunteer tester
Joined: 26 Oct 00
Posts: 1005
Credit: 6,366,949
RAC: 0
United States
Message 150421 - Posted: 12 Aug 2005, 4:25:49 UTC

Yep. Definitely something wrong with the network:

--- klaatu.ssl.berkeley.edu ping statistics ---
100 packets transmitted, 70 received, 30% packet loss, time 107238ms
rtt min/avg/max/mdev = 76.058/83.147/103.322/5.213 ms


I confirmed from 2 other locations. One of them is on internet2 so the problem appears to be internal to berkeley unless I1 and I2 traffic go over the same wire coming into berkeley.

Maybe if we could block out the alien radio signal that is causing interference on the line...
A member of The Knights Who Say NI!
For rankings, history graphs and more, check out:
My BOINC stats site
ID: 150421 · Report as offensive
Profile Richard Smith

Joined: 2 Feb 00
Posts: 19
Credit: 7,319,258
RAC: 0
United Kingdom
Message 150427 - Posted: 12 Aug 2005, 7:40:20 UTC - in response to Message 150379.  

So don't keep us in suspense! What is going to be the graceful solution to this?

Not sure what you mean by database. Our database is just fine. Faster database won't help the current problem.


- Matt


Why not store these tiny files in a database rather than separate files?
ID: 150427 · Report as offensive
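
To make the question above concrete, here is a rough sketch of keeping small result files as rows in a database instead of as individual files. SQLite is used purely as a stand-in so the example is self-contained; the thread does not say which database the project runs, and the table name `result_files` and the helper functions are hypothetical. Note that Matt's reply earlier in the thread says the database is not the current bottleneck.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store with one row per small result file."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS result_files ("
        "name TEXT PRIMARY KEY, payload BLOB NOT NULL)"
    )
    return conn


def put_result(conn, name, payload):
    """Store or overwrite the bytes uploaded under the given file name."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO result_files (name, payload) VALUES (?, ?)",
            (name, payload),
        )


def get_result(conn, name):
    """Return the stored bytes, or None if no such file was uploaded."""
    row = conn.execute(
        "SELECT payload FROM result_files WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0]
```

The primary key index replaces the directory lookup, so finding one file no longer depends on how many entries share a directory. The trade-off is that every upload and download would then pass through the database server.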
Profile Tigher
Volunteer tester

Joined: 18 Mar 04
Posts: 1547
Credit: 760,577
RAC: 0
United Kingdom
Message 150430 - Posted: 12 Aug 2005, 8:20:45 UTC - in response to Message 150421.  

Yep. Definitely something wrong with the network:

--- klaatu.ssl.berkeley.edu ping statistics ---
100 packets transmitted, 70 received, 30% packet loss, time 107238ms
rtt min/avg/max/mdev = 76.058/83.147/103.322/5.213 ms


I confirmed from 2 other locations. One of them is on internet2 so the problem appears to be internal to berkeley unless I1 and I2 traffic go over the same wire coming into berkeley.

Maybe if we could block out the alien radio signal that is causing interference on the line...



Maybe this?
August 11, 2005
There is a hardware problem with the building network here at SSL. This is affecting the scheduling and web servers. You may see intermittent connection problems. The SSL network folks are working on a fix.

ID: 150430 · Report as offensive
Profile tekwyzrd
Volunteer tester
Joined: 21 Nov 01
Posts: 767
Credit: 30,009
RAC: 0
United States
Message 150432 - Posted: 12 Aug 2005, 8:25:53 UTC - in response to Message 150430.  

Yep. Definitely something wrong with the network:

--- klaatu.ssl.berkeley.edu ping statistics ---
100 packets transmitted, 70 received, 30% packet loss, time 107238ms
rtt min/avg/max/mdev = 76.058/83.147/103.322/5.213 ms


I confirmed from 2 other locations. One of them is on internet2 so the problem appears to be internal to berkeley unless I1 and I2 traffic go over the same wire coming into berkeley.

Maybe if we could block out the alien radio signal that is causing interference on the line...



Maybe this?
August 11, 2005
There is a hardware problem with the building network here at SSL. This is affecting the scheduling and web servers. You may see intermittent connection problems. The SSL network folks are working on a fix.



I've been connecting to the forum with no problems since just after 4am EST
Nothing travels faster than the speed of light with the possible exception of bad news, which obeys its own special laws.
Douglas Adams (1952 - 2001)
ID: 150432 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20372
Credit: 7,508,002
RAC: 20
United Kingdom
Message 152181 - Posted: 16 Aug 2005, 10:49:04 UTC - in response to Message 150360.  
Last modified: 16 Aug 2005, 10:50:01 UTC

...The only bottleneck slowing validation down is large directory sizes on the upload/download filesystem. This is being addressed in many ways, ... - Matt

Matt, have you noted this good point from doublechaz?

Deleting the files does not necessarily reduce the number of directory entries. The directory table remains at the maximum size from whenever you had the maximum number of files listed!

Hence, you get no speedup from deleting old files.

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 152181 · Report as offensive

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.