Project Status - 08/26/2005 4pm PST

Bradshawma

Joined: 17 Mar 04
Posts: 20
Credit: 2,362,303
RAC: 0
United Kingdom
Message 159505 - Posted: 29 Aug 2005, 21:38:04 UTC

Just had another look at server status and, correct me if I am wrong, but I am sure the file deleters were running on Penguin earlier today but are now running on Kryten. If I am right, this would seem to be progress.

ID: 159505
Scarecrow

Joined: 15 Jul 00
Posts: 4520
Credit: 486,601
RAC: 0
United States
Message 159506 - Posted: 29 Aug 2005, 21:40:15 UTC - in response to Message 159505.  
Last modified: 29 Aug 2005, 21:41:24 UTC

Just had another look at server status and, correct me if I am wrong, but I am sure the file deleters were running on Penguin earlier today but are now running on Kryten. If I am right, this would seem to be progress.

I think you are correct. Field Engineer's mantra #3... "When in doubt, swap it out"
:)
ID: 159506
Chris Weber

Joined: 22 Oct 04
Posts: 3
Credit: 191,704
RAC: 0
United Kingdom
Message 159516 - Posted: 29 Aug 2005, 22:09:16 UTC

Little information, and mostly inaccurate at that.
Einstein@Home, you got a new user.
SETI@home, goodbye.

ID: 159516
Mike Gelvin
Joined: 23 May 00
Posts: 92
Credit: 9,298,464
RAC: 0
United States
Message 159525 - Posted: 29 Aug 2005, 22:23:44 UTC - in response to Message 159506.  

I think they figured out that the problem was not the number of files at all... but rather the bandwidth to the file system. Had the problem been the number of files, we would have seen an exponential increase in speed of validation as the number of files dropped.
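To make that reasoning concrete, here is a rough toy model (purely invented numbers, nothing measured from the actual servers) of how the backlog would drain under the two theories:

```python
# Toy sketch only: contrasts a bandwidth-bound drain (fixed cost per result,
# so the backlog falls linearly) with a file-count-bound drain (cost per
# result grows with how many files remain, so the drain accelerates as the
# directories empty). All numbers are invented for illustration.
N0 = 100_000                      # hypothetical starting backlog

def simulate(rate_fn, steps=50):
    """rate_fn(remaining) -> results processed per step."""
    n, trace = float(N0), []
    for _ in range(steps):
        trace.append(int(n))
        n = max(0.0, n - rate_fn(n))
    return trace

bandwidth_bound  = simulate(lambda n: 2_000)                # fixed throughput
file_count_bound = simulate(lambda n: 2.0e8 / max(n, 1.0))  # cost ~ directory size

for step in range(0, 50, 10):
    print(step, bandwidth_bound[step], file_count_bound[step])
```

If the per-result cost really scaled with the directory size, the second curve is what we should have seen: a visible speed-up as the queue shrank, rather than a roughly linear drain.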

Just had another look at server status and, correct me if I am wrong, but I am sure the file deleters were running on Penguin earlier today but are now running on Kryten. If I am right, this would seem to be progress.

I think you are correct. Field Engineer's mantra #3... "When in doubt, swap it out"
:)



ID: 159525
Profile Paul D. Buck
Volunteer tester

Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 159553 - Posted: 29 Aug 2005, 22:50:14 UTC - in response to Message 159415.  

Actually, it is the various results that have passed their deadline which are timed out and rescheduled.
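A minimal sketch of what "timed out and rescheduled" means in practice (not BOINC's actual transitioner code; the field names and the re-issue deadline below are assumptions):

```python
# Simplified illustration: any in-progress result whose report deadline has
# passed is marked as a no-reply, and a fresh copy is queued so the workunit
# can still reach quorum. The late result can still earn credit if it comes
# back while the workunit remains in the database.
import time

def time_out_overdue(results):
    """results: list of dicts with 'workunit', 'state' and 'deadline' keys."""
    now = time.time()
    reissued = []
    for r in results:
        if r["state"] == "in_progress" and r["deadline"] < now:
            r["state"] = "no_reply"
            reissued.append({"workunit": r["workunit"],
                             "state": "unsent",
                             "deadline": now + 14 * 86400})  # assumed 2-week deadline
    return results + reissued

# one overdue result gets flagged and a replacement queued
queue = [{"workunit": 42, "state": "in_progress", "deadline": time.time() - 3600}]
print(time_out_overdue(queue))
```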


Rom: I have been trying to upload and download since 2 A.M. and all I get is "NO SCHEDULERS RESPOND". Am I missing something ("BOINC 4.19")? Kindly answer this. Thanks.

Everyone is out ...

The system is still being recovered.

If the feeder is off, there is no database connection for the schedulers ...
ID: 159553
Bradshawma

Joined: 17 Mar 04
Posts: 20
Credit: 2,362,303
RAC: 0
United Kingdom
Message 159561 - Posted: 29 Aug 2005, 22:56:56 UTC - in response to Message 159525.  

I think they figured out that the problem was not the number of files at all... but rather the bandwidth to the file system. Had the problem been the number of files, we would have seen an exponential increase in speed of validation as the number of files dropped.

Just had another look at server status and, correct me if I am wrong, but I am sure the file deleters were running on Penguin earlier today but are now running on Kryten. If I am right, this would seem to be progress.

I think you are correct. Field Engineer's mantra #3... "When in doubt, swap it out"
:)



No, surely the validators would have maintained a constant speed, because the file deleters were off-line while the validator queue was emptying. This would mean that the number of files was constant throughout and is only reducing now that the deleters are working.
ID: 159561
Profile [B^S] Spydermb
Volunteer tester
Joined: 16 Jul 99
Posts: 496
Credit: 10,860,148
RAC: 0
United States
Message 159578 - Posted: 29 Aug 2005, 23:11:00 UTC

From Technical News
August 29, 2005 - 23:00 UTC
So we're still offline, as we have been for the past week. Actually it'll be a full week tomorrow. We decided to keep the servers off one more night to clear out the remaining assimilation/deletion queues but we plan to come back on line at some point tomorrow no matter what. Regarding this lengthy outage, we have some good news and bad news.
The good news is that the entire validation queue has been drained. So people worried their backlogged credit would never arrive should be quite happy now. As well, those who fear their results will arrive past deadline to be counted should fear not. As long as the respective workunits are still in the database, credit will be granted. We'll hold off running db_purge for a while, so people can return their work after a long outage without missing any deadlines. It also should be noted that the antique deleters finished several days ago, and have reduced the result directories by about 40% in size.

Now the bad news. Even though the result directories are much smaller, and most of the servers are idle since many queues are empty, the assimilators and deleters are still running way too slow. There has been some speed improvement over the past week, but hardly enough. There's some NFS weirdness going on that wasn't so obvious before. So we're hastily looking into that, hopefully finding out what the problem is before tomorrow.



BOINC SYNERGY is an International Team and We Welcome All BOINC Participants!
ID: 159578
Mike Gelvin
Joined: 23 May 00
Posts: 92
Credit: 9,298,464
RAC: 0
United States
Message 159579 - Posted: 29 Aug 2005, 23:11:43 UTC - in response to Message 159561.  
Last modified: 29 Aug 2005, 23:14:09 UTC

I think they figured out that the problem was not the number of files at all... but rather the bandwidth to the file system. Had the problem been the number of files, we would have seen an exponential increase in speed of validation as the number of files dropped.

Just had another look at server status and, correct me if I am wrong, but I am sure the file deleters were running on Penguin earlier today but are now running on Kryten. If I am right, this would seem to be progress.

I think you are correct. Field Engineer's mantra #3... "When in doubt, swap it out"
:)



No, surely the validators would have maintained a constant speed, because the file deleters were off-line while the validator queue was emptying. This would mean that the number of files was constant throughout and is only reducing now that the deleters are working.


Before the deleters were turned off (prior to Saturday), the decline in units waiting to be validated was linear as well. Once the deleters were off, the validators acquired the much-needed bandwidth to do their thing. (All of this is speculation.)




ID: 159579
Profile John Cropper
Joined: 3 May 00
Posts: 444
Credit: 416,933
RAC: 0
United States
Message 159581 - Posted: 29 Aug 2005, 23:12:54 UTC - in response to Message 159578.  

[coughDEFRAGcough]
ID: 159581
Profile betonklaus
Joined: 28 Feb 03
Posts: 10
Credit: 31,836,074
RAC: 19
Germany
Message 159605 - Posted: 29 Aug 2005, 23:22:24 UTC

OK - I see the problems you have. Our problem now is that many files we crunched are going past their deadlines. The files are ready but we cannot upload them. Is that now wasted CPU time for us?

Sorry, my English is not so good.
Lebe den Tag - es könnte dein letzter sein.....

Live the day - maybe it's your last.....

ID: 159605
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 159635 - Posted: 29 Aug 2005, 23:53:37 UTC - in response to Message 159605.  

OK - I see the problems you have. Our problem now is that many files we crunched are going past their deadlines. The files are ready but we cannot upload them. Is that now wasted CPU time for us?

Sorry, my English is not so good.

According to the news, as long as a work unit is in the database, overdue work will still get credit.

... and they aren't purging work units from the database.
ID: 159635
itenginerd
Joined: 1 Aug 00
Posts: 37
Credit: 39,905
RAC: 0
United States
Message 159657 - Posted: 30 Aug 2005, 1:06:21 UTC - in response to Message 159581.  

[coughDEFRAGcough]


lmao. Nice to see even the big projects need a good old-fashioned defrag now and then.

's what they get for not using a more robust file system like NTFS. (there's a scary statement for you)

(j)
James
ID: 159657
Steven Wilcox
Volunteer tester

Joined: 23 Sep 99
Posts: 36
Credit: 86,104,929
RAC: 131
United States
Message 159675 - Posted: 30 Aug 2005, 1:49:00 UTC - in response to Message 159657.  

[coughDEFRAGcough]


lmao. Nice to see even the big projects need a good old-fashioned defrag now and then.

's what they get for not using a more robust file system like NTFS. (there's a scary statement for you)

(j)
James


NFS (Network File System) on Unix and NTFS are not the same thing. NFS is more like a NetBIOS file share. The problems are most likely network errors (timeouts etc.), which could be caused by a bad port on a switch, a NIC in one of the systems, or just too much traffic. Most of the servers appear to be older ones with only 10 Mbit or 100 Mbit network interfaces (E3500, E450, U10, U60); not sure what the D220R has. Just one system with a bad connection talking to the right server could gum up the works.

I've seen duplex problems on my own network with Sun systems and Cisco switches: if the switch thinks the link is full duplex (send/receive at the same time) and the system thinks the link is half duplex (send or receive, NOT both), errors will climb quickly. They cleaned up some of these errors earlier in the month/June.

I'm sure this is just one item they're checking while trying to fix the NFS problems.
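For what it's worth, a quick sketch of the kind of check that catches this, written against Linux-style sysfs counters (the Sun boxes above would need kstat or netstat -i instead, so treat the paths as assumptions):

```python
# Sample the per-interface error/collision counters twice; on a half/full
# duplex mismatch these typically climb steadily under load.
import glob, os, time

def read_errors():
    counts = {}
    for stats_dir in glob.glob("/sys/class/net/*/statistics"):
        iface = stats_dir.split("/")[4]
        total = 0
        for counter in ("rx_errors", "tx_errors", "collisions"):
            with open(os.path.join(stats_dir, counter)) as f:
                total += int(f.read())
        counts[iface] = total
    return counts

before = read_errors()
time.sleep(30)                              # sample again half a minute later
after = read_errors()

for iface, errs in sorted(after.items()):
    delta = errs - before.get(iface, 0)
    if delta:
        print(f"{iface}: {delta} new errors/collisions in 30s -- check duplex settings")
```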

Steve
ID: 159675
itenginerd
Joined: 1 Aug 00
Posts: 37
Credit: 39,905
RAC: 0
United States
Message 159683 - Posted: 30 Aug 2005, 2:16:35 UTC - in response to Message 159675.  

NFS (Network File System) on Unix and NTFS are not the same thing.


erm... that was kinda the joke.

Berkeley's got more problems than we care to deal with if they're running RISC boxen against NTFS filesystems. 8)

(j)
James
ID: 159683
ampoliros
Volunteer tester
Joined: 24 Sep 99
Posts: 152
Credit: 3,542,579
RAC: 5
United States
Message 159742 - Posted: 30 Aug 2005, 4:18:49 UTC

NFS has its share of problems (I don't know what version they are using), and I've had to muddle through some of them at one point or another.

It's more common than you might think to have "dropped" packets when the connection is 100 Mb/s from the file server to the switch and 10 Mb/s from the switch to the client. And, as mentioned, you have the possibility of half/full duplex confusion. If packets are being lost/dropped somewhere, it's not that hard to find the problem. (And if both ends see problems, it's the switch.)

Slow performance could also mean problems with name/IP resolution.

One "way-out-there" idea is that there's a software firewall on one of these things that's filtering packets (I hope not, that's just crazy).

There could also be a problem with file locking (normally handled by the kernel, but by a separate daemon for NFS).
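As a small illustration of that last point: a POSIX lock request like the one below is what NFS hands off to the separate lock daemon (rpc.lockd on NFSv3), so if that daemon is sick the call can hang or fail even while plain reads and writes still work. The path is made up for the example:

```python
# Try to take (and immediately release) a non-blocking exclusive lock on a
# file that lives on the NFS mount; a healthy lock daemon answers right away.
import fcntl

with open("/mnt/nfs_volume/lock_probe", "a") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("lock granted -- the lock daemon is answering")
        fcntl.lockf(f, fcntl.LOCK_UN)
    except OSError as exc:
        print("lock refused or lock daemon not responding:", exc)
```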

7,049 S@H Classic Credits
ID: 159742
Scarecrow

Joined: 15 Jul 00
Posts: 4520
Credit: 486,601
RAC: 0
United States
Message 159745 - Posted: 30 Aug 2005, 4:27:50 UTC - in response to Message 159742.  

There could also be a problem with file locking (normally handled by the kernel, but by a separate daemon for NFS).

At least part of my bald spot can be attributed to NFS, the sync/async and no_subtree_check export options, and the bad cables, flaky NIC cards and cranky switch in between. Lots of little things, alone and in combination, that can slow things down and be annoyingly elusive. However, that experience did teach me that by talking to myself I tend to meet a much nicer class of people.
ID: 159745
Profile John Cropper
Joined: 3 May 00
Posts: 444
Credit: 416,933
RAC: 0
United States
Message 159821 - Posted: 30 Aug 2005, 9:23:32 UTC - in response to Message 159516.  

Little information, and mostly inaccurate at that.
Einstein@Home, you got a new user.
SETI@home, goodbye.



Ass. Door. The. Way out. Hit. Let. Don't. Your. On. The.

You do the math...

Stewie: So, is there any tread left on the tires? Or at this point would it be like throwing a hot dog down a hallway?

Fox Sunday (US) at 9PM ET/PT
ID: 159821
Bronco
Volunteer tester
Joined: 22 Jun 05
Posts: 123
Credit: 19,340
RAC: 0
France
Message 159825 - Posted: 30 Aug 2005, 9:39:38 UTC

Maybe it's a simple network problem. Don't forget that since the DNS move, some crunchers are no longer able to reach SETI ...

Probably a nice combination of two or three little problems anyway.
"In a world without walls and fences, who needs windows and gates ?"
for the team
ID: 159825
Mibe, ZX-81 16kb
Volunteer tester

Joined: 30 Jun 99
Posts: 42
Credit: 2,622,033
RAC: 0
Sweden
Message 159828 - Posted: 30 Aug 2005, 9:52:28 UTC

If the problematic filesystem is NFS-mounted, it sure would be an easy quick fix to just mount another filesystem as a replacement for half of the 1024 fan-out directories and, voilà, they have doubled their filesystem speed.

Granted, that assumes they have another RAID with sufficient free disk space. If not, they can replace one quarter (256) or one tenth (102) of the directories, depending on free disk space, and at least alleviate some of the problem.

This doesn't fix the underlying problem, but it would get things up and running temporarily while they search for a permanent solution.
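A rough sketch of how that remapping could look, assuming a hash-style fan-out such as BOINC uses for its upload directories (the paths, the bucket rule and the symlink trick are all assumptions here, and copying the existing files across is deliberately left out):

```python
# Hypothetical: point the top half of the 1024 fan-out buckets at a second
# mounted volume via symlinks, so the daemons keep using the same paths
# while half the file traffic lands on the new disks.
import hashlib, os

FANOUT = 1024
OLD_ROOT = "/old_volume/results"     # hypothetical current result area
NEW_ROOT = "/new_volume/results"     # hypothetical spare RAID

def bucket_for(filename):
    """Map a result file name to one of the fan-out subdirectories."""
    return int(hashlib.md5(filename.encode()).hexdigest(), 16) % FANOUT

for bucket in range(FANOUT // 2, FANOUT):
    old_dir = os.path.join(OLD_ROOT, str(bucket))
    new_dir = os.path.join(NEW_ROOT, str(bucket))
    os.makedirs(new_dir, exist_ok=True)
    if os.path.isdir(old_dir) and not os.path.islink(old_dir):
        os.rename(old_dir, old_dir + ".old")     # keep the original contents aside
    if not os.path.exists(old_dir):
        os.symlink(new_dir, old_dir)

print("example file lands in bucket", bucket_for("12ja04aa.12345.result"))
```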
ID: 159828
Profile Tern
Volunteer tester
Joined: 4 Dec 03
Posts: 1122
Credit: 13,376,822
RAC: 44
United States
Message 159928 - Posted: 30 Aug 2005, 14:30:06 UTC

I just noticed an interesting "curve" in the graph of Ready To Send results, here:
http://bluenorthernsoftware.com/scarecrow/sahstats/lastweek/

It looks like right at 750,000 results, the linear increase that was there up to that point turned into a nice curve - because of a file system problem with > 750K files? The curve starts at the same time the WFV went to zero, so it may be because the UCB folks started other processes as soon as WFV was zero, slowing down the splitters. Still, whatever caused it, I hope they make a note of it - if we need the splitters running full-out some time, they'll know ONE possible performance bottleneck!
ID: 159928