Message boards :
Number crunching :
NEW Technical -- NEWS -- in case someone missed it
Author | Message |
---|---|
Crunch3r Send message Joined: 15 Apr 99 Posts: 1546 Credit: 3,438,823 RAC: 0 |
July 23, 2005 - 00:15 UTC We are looking for bottlenecks in workunit production. We may have found one. A number of processes that read and write to the upload/download storage device (e.g., splitters, the data server, validators) now do so across the ethernet switch that connects our data closet machines to the SSL LAN. This 100Mbps switch may well be overloaded. We are moving intra-closet data-intensive traffic to a separate 1Gbps switch. Today we moved the data server machine and one of the machines that does both splitting and validation over to this switch for their upload/download traffic. Where we had been seeing NFS (Network File System) errors on both of these machines before the move to the new switch, we are not seeing errors on either of them now. ---- regards Crunch3r Join BOINC United now! |
Crunch3r Send message Joined: 15 Apr 99 Posts: 1546 Credit: 3,438,823 RAC: 0 |
July 26, 2005 - 19:00 UTC Over the past week the BOINC data server finally caught up (after moving this service off a D220 and onto an E3500 with three times the CPU and memory). However, after the floodgates opened up, the splitters couldn't keep up with the large backlog of work. At the end of the day on Friday we discovered that all machines talking to the SnapAppliance over the Gigabit switch were happy, but the ones talking over the LAN were having chronic NFS dropouts. We moved one of the splitter machines onto the Gigabit switch and its NFS dropouts disappeared, and in turn the workunit queue began to grow. Over the weekend the queue returned to almost full (about 500K results ready to send out). So we are in the process of reconfiguring various pieces of hardware to get all of the back-end processes that need to talk to the SnapAppliance onto the Gigabit switch. This is no easy task, as hardware is involved (each server added to the Gigabit switch needs an extra ethernet port, for example), and sometimes physical placement is an issue (as some servers are nowhere near the switch). This may mean that some services will shuffle around to servers in proximity to the switch. We shall see. Meanwhile, the assimilators have been falling behind. We recently added code to parallelize this process (like the transitioners and validators) and this has helped the backlog, but only slightly. This wouldn't be that much of an issue, except (a) with the assimilators behind, the file_deleter is also behind, (b) the file_deleter, among other things, is not yet talking via the Gigabit switch, and (c) the once-empty workunit queue has been filling up all weekend. What does all this mean? The SnapAppliance is dangerously full with fresh workunits and a large backlog of old work. So... we actually turned off the splitters this morning so the assimilators/deleter could catch up a bit. 
We also just converted the "old" kryten into the machine "penguin", which will run extra assimilators and deleters. These will appear on the server status page shortly. ALSO! As part of this grand Gigabit switch endeavor, we had to free up a port on the scheduler, so we made a DNS switch this morning moving all scheduler traffic off the Cogent link and onto the Berkeley campus net. This should be transparent to all parties involved as the scheduler bandwidth is minimal (far less than the SETI@home web server, which is also on the campus net), but while the new DNS maps propagate some users will be unable to contact the scheduler. This should clear up relatively quickly (several hours for most of the world, maybe days for the few with ISPs that have finicky DNS servers). Join BOINC United now! |
Misfit Send message Joined: 21 Jun 01 Posts: 21804 Credit: 2,815,091 RAC: 0 |
...but while the new DNS maps propagate some users will be unable to contact the scheduler. This should clear up relatively quickly (several hours for most of the world, maybe days for the few with ISPs that have finicky DNS servers). 12 hours here and counting. me@rescam.org |
Paul D. Buck Send message Joined: 19 Jul 00 Posts: 3898 Credit: 1,158,042 RAC: 0 |
Here is more fun: one computer on my net can update, one cannot ... :) |
Prognatus Send message Joined: 6 Jul 99 Posts: 1600 Credit: 391,546 RAC: 0 |
Same here. I noticed that none of my PCs running TMR's optimized clients can report, but another (4.18) reports fine every time. However unlikely... but I wonder if this has something to do with the anonymous platform. :/ [edit] I saw the discussion about the name service change in another thread, and this seems the likely cause. But it's really strange anyhow, because one of my PCs that can report is on the same router as one that cannot! This doesn't compute... :) [/edit] |
EclipseHA Send message Joined: 28 Jul 99 Posts: 1018 Credit: 530,719 RAC: 0 |
This type of change can take 72 or more hours, and there's nothing that UCB can do about it (it's your local ISP that isn't synced). Whatever you do, don't change anything! Any mods you make may break stuff. |
Astro Send message Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0 |
Same here. I noticed that none of my PCs running TMR's optimized clients can report, but another (4.18) reports fine every time. However unlikely... but I wonder if this has something to do with the anonymous platform. :/ Prognatus and others, last night the old dying laptop of mine which wasn't connected all day just got right on through. This laptop couldn't connect beforehand, and didn't connect after the other one got through. Both have optimized clients. Then this A.M. this puter is getting through. I guess "patience" is what's needed. PS: I'm a dial-up user |
spacemeat Send message Joined: 4 Oct 99 Posts: 239 Credit: 8,425,288 RAC: 0 |
does someone have the new scheduler IP? i want to make a temporary entry in the local hosts file. one of my machines ran out of work and needs an update. its 3-day cache usually dries up in 12 hours
Thierry Van Driessche Send message Joined: 20 Aug 02 Posts: 3083 Credit: 150,096 RAC: 0 |
does someone have the new scheduler IP? i want to make a temporary entry in the local hosts file. one of my machines ran out of work and needs an update. its 3-day cache usually dries up in 12 hours Using a proxy can work around the problem for returning results. I just used it myself when I had the communication problem. See here. |
spacemeat Send message Joined: 4 Oct 99 Posts: 239 Credit: 8,425,288 RAC: 0 |
128.32.18.173 setiboinc.ssl.berkeley.edu On Windows, add this line to C:\windows\system32\drivers\etc\hosts and save the file. Open up a command prompt and type "ipconfig /flushdns", then update the SETI project. Do not forget to eventually remove that line from your hosts file. SSHing into one of my remote machines on a different network showed it already had the update. |
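Spelled out as a sketch, the workaround above looks like this (the IP and hostname are the ones posted in this thread; the hosts-file paths are the standard Windows and Unix locations, and both need administrator rights to edit):

```shell
# Temporary DNS override while the new scheduler record propagates.
# Append the posted mapping to the hosts file:
#   Windows:    C:\windows\system32\drivers\etc\hosts
#   Unix-like:  /etc/hosts
#
#   128.32.18.173   setiboinc.ssl.berkeley.edu
#
# On Windows, flush the local resolver cache so the entry takes effect:
ipconfig /flushdns

# Update the SETI project in the BOINC client, then remove the hosts
# entry once normal DNS resolves again; otherwise the client will keep
# using the hard-coded IP even if the scheduler moves later.
```

Note the cleanup step matters: a stale hosts entry silently overrides DNS, so the client would break again the next time the project changes the scheduler's address.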
Crunch3r Send message Joined: 15 Apr 99 Posts: 1546 Credit: 3,438,823 RAC: 0 |
128.32.18.173 setiboinc.ssl.berkeley.edu It's working for me now. Seems the update to the root servers went well. No problems updating or contacting the scheduler from here. regards Crunch3r Join BOINC United now! |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.