Project Status - 08/26/2005 4pm PST
Author | Message |
---|---|
Bradshawma Joined: 17 Mar 04 Posts: 20 Credit: 2,362,303 RAC: 0 |
Just had another look at server status and, correct me if I am wrong, but I am sure the file deleters were running on Penguin earlier today but are now running on Kryten. If I am right, this would seem to be progress. |
Scarecrow Joined: 15 Jul 00 Posts: 4520 Credit: 486,601 RAC: 0 |
"Just had another look at server status and, correct me if I am wrong, but I am sure the file deleters were running on Penguin earlier today but are now running on Kryten. If I am right, this would seem to be progress." I think you are correct. Field Engineer's mantra #3... "When in doubt, swap it out" :) |
Chris Weber Joined: 22 Oct 04 Posts: 3 Credit: 191,704 RAC: 0 |
Little information, and mostly inaccurate at that. Einstein@Home. you got a new user. Seti@Home. Goodbye. |
Mike Gelvin Joined: 23 May 00 Posts: 92 Credit: 9,298,464 RAC: 0 |
I think they figured out that the problem was not the number of files at all... but rather the bandwidth to the file system. Had the problem been the number of files, we would have seen an exponential increase in speed of validation as the number of files dropped. "Just had another look at server status and, correct me if I am wrong, but I am sure the file deleters were running on Penguin earlier today but are now running on Kryten. If I am right, this would seem to be progress." |
Paul D. Buck Joined: 19 Jul 00 Posts: 3898 Credit: 1,158,042 RAC: 0 |
Actually it is the various results which have passed their deadline which are timed out and rescheduled. Everyone is out ... The system is still being recovered. If the feeder is off, no database connection for the schedulers ... |
Bradshawma Joined: 17 Mar 04 Posts: 20 Credit: 2,362,303 RAC: 0 |
"I think they figured out that the problem was not the number of files at all... but rather the bandwidth to the file system. Had the problem been the number of files, we would have seen an exponential increase in speed of validation as the number of files dropped." No, surely the validators would have maintained a constant speed, because the file deleters were offline while the validator queue was emptying. This would mean that the number of files would have been constant throughout and is only reducing now that the deleters are working. |
[B^S] Spydermb Joined: 16 Jul 99 Posts: 496 Credit: 10,860,148 RAC: 0 |
From Technical News, August 29, 2005 - 23:00 UTC: "So we're still offline, as we have been for the past week. Actually it'll be a full week tomorrow. We decided to keep the servers off one more night to clear out the remaining assimilation/deletion queues but we plan to come back online at some point tomorrow no matter what. Regarding this lengthy outage, we have some good news and bad news. The good news is that the entire validation queue has been drained. So people worried their backlogged credit would never arrive should be quite happy now. As well, those who fear their results will arrive past deadline to be counted should fear not. As long as the respective workunits are still in the database, credit will be granted. We'll hold off running db_purge for a while, so people can return their work after a long outage without missing any deadlines. It also should be noted that the antique deleters finished several days ago, and have reduced the result directories by about 40% in size. Now the bad news. Even though the result directories are much smaller, and most of the servers are idle since many queues are empty, the assimilators and deleters are still running way too slow. There has been some speed improvement over the past week, but hardly enough. There's some NFS weirdness going on that wasn't so obvious before. So we're hastily looking into that, hopefully finding out what the problem is before tomorrow." BOINC SYNERGY is an International Team and We Welcome All BOINC Participants! |
Mike Gelvin Joined: 23 May 00 Posts: 92 Credit: 9,298,464 RAC: 0 |
"I think they figured out that the problem was not the number of files at all... but rather the bandwidth to the file system. Had the problem been the number of files, we would have seen an exponential increase in speed of validation as the number of files dropped." Before the deleters were turned off (prior to Sat), the decline in validated units was linear as well. Once the deleters were off, the validators acquired the much-needed bandwidth to do their thing. (All of this is speculation.) |
John Cropper Joined: 3 May 00 Posts: 444 Credit: 416,933 RAC: 0 |
[coughDEFRAGcough] |
betonklaus Joined: 28 Feb 03 Posts: 10 Credit: 31,836,074 RAC: 19 |
Ok - I see the problems you have. Our problem now is that many files we crunched are running out of time. The files are ready but we cannot upload them. Is that now wasted CPU time for us? Sorry, my English is not so good. Live the day - it could be your last..... |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
"Ok - I see the problems you have. Our problem now is that many files we crunched are running out of time. The files are ready but we cannot upload them. Is that now wasted CPU time for us?" According to the news, as long as a work unit is in the database, overdue work will still get credit... and they aren't purging work units from the database. |
itenginerd Joined: 1 Aug 00 Posts: 37 Credit: 39,905 RAC: 0 |
"[coughDEFRAGcough]" lmao. Nice to see even the big projects need a good old-fashioned defrag now and then. That's what they get for not using a more robust file system like NTFS. (there's a scary statement for you) (j) James |
Steven Wilcox Joined: 23 Sep 99 Posts: 36 Credit: 86,104,929 RAC: 131 |
"[coughDEFRAGcough]" NFS (Network File System) on Unix and NTFS are not the same. NFS is more like a NetBIOS file share. The problems are most likely network errors (timeouts etc.), which could be caused by a bad network port on a switch, a bad NIC on the system, or just too much traffic. Most of the servers appear to be older ones with only 10 Mb/s or 100 Mb/s network links (E3500, E450, U10, U60). Not sure what the D220R has. Only one system with a bad connection talking to the right server could gum up the works. I've seen duplex problems on my network with Sun systems and Cisco switches: if the switch thinks the link is full duplex (send and receive at the same time) and the system thinks the link is half duplex (send or receive only, NOT both), errors will go up quickly. They cleaned up some of these errors earlier in the month/June. I'm sure this is just one item they're checking in trying to fix the NFS problems. Steve |
itenginerd Joined: 1 Aug 00 Posts: 37 Credit: 39,905 RAC: 0 |
"NFS (Network File System) on Unix and NTFS are not the same." erm... that was kinda the joke. Berkeley's got more problems than we care to deal with if they're running RISC boxen against NTFS filesystems. 8) (j) James |
ampoliros Joined: 24 Sep 99 Posts: 152 Credit: 3,542,579 RAC: 5 |
NFS has its share of problems (I don't know what version they are using) and I've had to muddle through some of them at one point or another. It's more common than you might think to have "dropped" packets when the connection is 100Mb/s from the file server to the switch and 10Mb/s from the switch to the client. And as mentioned you have the possibility of half/full duplex confusion. If packets are being lost/dropped somewhere, it's not that hard to find the problem. (And if both see problems, it's the switch.) Slow performance could also mean problems with name/IP resolution. One "way-out-there" idea is that there's a software firewall on one of these things that's filtering packets (I hope not, that's just crazy). There could also be a problem with file locking (normally handled by the kernel but is a separate daemon for NFS). 7,049 S@H Classic Credits |
Scarecrow Joined: 15 Jul 00 Posts: 4520 Credit: 486,601 RAC: 0 |
"There could also be a problem with file locking (normally handled by the kernel but is a separate daemon for NFS)." At least part of my bald spot can be attributed to NFS, the sync/async and no_subtree_check options, and the bad cables, flaky NICs and cranky switch in between. Lots of little things, alone and in combination, that can slow things down and be annoyingly elusive. However, that experience did teach me that by talking to myself I tend to meet a much nicer class of people. |
John Cropper Joined: 3 May 00 Posts: 444 Credit: 416,933 RAC: 0 |
Little information, and mostly inaccurate at that. Ass. Door. The. Way out. Hit. Let. Don't. Your. On. The. You do the math... Stewie: So, is there any tread left on the tires? Or at this point would it be like throwing a hot dog down a hallway? Fox Sunday (US) at 9PM ET/PT |
Bronco Joined: 22 Jun 05 Posts: 123 Credit: 19,340 RAC: 0 |
May be a stupid network problem. Don't forget that since the DNS move, some crunchers are no longer able to reach Seti ... Probably a nice combination of 2 or 3 nice little problems anyway. "In a world without walls and fences, who needs windows and gates?" for the team |
Mibe, ZX-81 16kb Joined: 30 Jun 99 Posts: 42 Credit: 2,622,033 RAC: 0 |
If the problematic filesystem is NFS-mounted, it sure would be an easy quick-fix to just mount another filesystem as a replacement for half of the 1024 fan-out directories and voilà, they have doubled their filesystem speed. Provided they have another RAID with sufficient free disk space. If not, they can replace one quarter (256) or one tenth (102) of the directories, depending on free disk space, and at least alleviate some of the problem. Now this doesn't help the underlying problem, but would get things up and running temporarily while they search for a permanent solution. |
Tern Joined: 4 Dec 03 Posts: 1122 Credit: 13,376,822 RAC: 44 |
I just noticed an interesting "curve" in the graph of Ready To Send results, here: http://bluenorthernsoftware.com/scarecrow/sahstats/lastweek/ It looks like right at 750,000 results, the linear increase that was there up to that point turned into a nice curve - because of a file system problem with > 750K files? The curve starts at the same time the WFV went to zero, so it may be because the UCB folks started other processes as soon as WFV was zero, slowing down the splitters. Still, whatever caused it, I hope they make a note of it - if we need the splitters running full-out some time, they'll know ONE possible performance bottleneck! |
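
Mibe's suggestion earlier in the thread (moving half, a quarter, or a tenth of the 1024 fan-out directories onto a second filesystem) can be sketched roughly as follows. This is only an illustration under stated assumptions: the hash used here is a stand-in and not necessarily the one the project's server software uses, the mount points `/mnt/raid_a` and `/mnt/raid_b` and the `result_N_0` file names are hypothetical, and it only estimates what share of files would land on the second volume, not any actual speed-up.

```python
import hashlib

FANOUT = 1024                  # number of fan-out directories mentioned in the thread
FIRST_VOLUME = "/mnt/raid_a"   # hypothetical mount point of the existing filesystem
SECOND_VOLUME = "/mnt/raid_b"  # hypothetical mount point of the extra filesystem


def fanout_index(filename: str, fanout: int = FANOUT) -> int:
    """Map a file name to a fan-out directory index.

    Illustrative hash only; the real server may hash names differently.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % fanout


def mount_for(index: int, moved: int) -> str:
    """Directories with index in [0, moved) live on the second volume."""
    return SECOND_VOLUME if index < moved else FIRST_VOLUME


def share_on_second_volume(filenames, moved: int) -> float:
    """Fraction of the given files that would land on the second volume."""
    hits = sum(
        1 for name in filenames
        if mount_for(fanout_index(name), moved) == SECOND_VOLUME
    )
    return hits / len(filenames)


if __name__ == "__main__":
    # Made-up file names, just to exercise the mapping.
    names = [f"result_{i}_0" for i in range(20_000)]
    for moved in (512, 256, 102):  # half, a quarter, a tenth of 1024
        share = share_on_second_volume(names, moved)
        print(f"{moved} directories moved: {share:.1%} of files on second volume")
```

If the hash spreads files evenly, the share of files on the second volume should track the share of directories moved. Whether that translates into doubled throughput depends on the bottleneck actually being the first volume rather than the network or NFS itself, which is the open question in this thread.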