NEW Technical -- NEWS -- in case some one missed it

Message boards : Number crunching : NEW Technical -- NEWS -- in case some one missed it

Profile Crunch3r
Volunteer tester
Joined: 15 Apr 99
Posts: 1546
Credit: 3,438,823
RAC: 0
Germany
Message 141121 - Posted: 23 Jul 2005, 0:57:16 UTC

July 23, 2005 - 00:15 UTC
We are looking for bottlenecks in workunit production, and we may have found one. A number of processes that read and write to the upload/download storage device (e.g., splitters, the data server, validators) now do so across the Ethernet switch that connects our data closet machines to the SSL LAN. This 100 Mbps switch may well be overloaded.

We are moving data-intensive intra-closet traffic to a separate 1 Gbps switch. Today we moved the data server machine and one of the machines that does both splitting and validation over to this switch for their upload/download traffic. Before the move we had been seeing NFS (Network File System) errors on both of these machines; we are not seeing errors on either of them now.

----
regards
Crunch3r


Join BOINC United now!
ID: 141121
Profile Crunch3r
Volunteer tester
Joined: 15 Apr 99
Posts: 1546
Credit: 3,438,823
RAC: 0
Germany
Message 143242 - Posted: 26 Jul 2005, 20:50:29 UTC

July 26, 2005 - 19:00 UTC
Over the past week the BOINC data server finally caught up (after moving this service off a D220 and onto an E3500 with three times the CPU and memory). However, after the floodgates opened, the splitters couldn't keep up with the large backlog of work.

At the end of the day on Friday we discovered that all machines talking to the SnapAppliance over the Gigabit switch were happy, but the ones talking over the LAN were having chronic NFS dropouts. We moved one of the splitter machines onto the Gigabit switch and its NFS dropouts disappeared, and in turn the workunit queue began to grow. Over the weekend the queue returned to almost full (about 500K results ready to send out).

So we are in the process of reconfiguring various pieces of hardware to get all of the back-end processes that need to talk to the SnapAppliance onto the Gigabit switch. This is no easy task, as hardware is involved (each server added to the Gigabit switch needs an extra ethernet port, for example), and sometimes physical placement is an issue (as some servers are nowhere near the switch). This may mean that some services will shuffle around to servers in proximity to the switch. We shall see.

Meanwhile, the assimilators have been falling behind. We recently added code to parallelize this process (like the transitioners and validators) and this has helped the backlog, but only slightly. This wouldn't be that much of an issue, except (a) with the assimilators behind, the file_deleter is also behind, (b) the file_deleter among other things is not yet talking via the Gigabit switch, and (c) the empty workunit queue has been filling up all weekend. What does all this mean? The SnapAppliance is dangerously full with fresh workunits and a large backlog of old work.

So... we actually turned off the splitters this morning so the assimilators/deleter could catch up a bit. We also just converted the "old" kryten into the machine "penguin" which will run extra assimilators and deleters. These will appear on the server status page shortly.
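
The reasoning above (storage fills whenever the splitters add data faster than the deleters remove it, and drains once the splitters are paused) can be sketched as a toy model. This is only an illustration: the function name `simulate_storage` and every number below are hypothetical, in arbitrary units, and are not measurements from the SnapAppliance.

```python
def simulate_storage(used, capacity, split_rate, delete_rate, hours):
    """Return storage usage after each hour, clamped to the range [0, capacity].

    split_rate:  data added per hour by the splitters (hypothetical units)
    delete_rate: data removed per hour by the file deleters (hypothetical units)
    """
    history = []
    for _ in range(hours):
        used = min(capacity, max(0, used + split_rate - delete_rate))
        history.append(used)
    return history


# Made-up numbers, chosen only to show the direction of the trend.
with_splitters = simulate_storage(used=90, capacity=100, split_rate=5, delete_rate=3, hours=4)
splitters_off = simulate_storage(used=90, capacity=100, split_rate=0, delete_rate=3, hours=4)

print(with_splitters)  # [92, 94, 96, 98]
print(splitters_off)   # [87, 84, 81, 78]
```

With the splitters running, usage creeps toward capacity even though the deleters are working; with the splitters off, the same delete rate brings usage down.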

ALSO! As part of this grand Gigabit switch endeavor, we had to free up a port on the scheduler, so we made a DNS switch this morning moving all scheduler traffic off the Cogent link and onto the Berkeley campus net. This should be transparent to all parties involved as the scheduler bandwidth is minimal (far less than the SETI@home web server, which is also on the campus net), but while the new DNS maps propagate some users will be unable to contact the scheduler. This should clear up relatively quickly (several hours for most of the world, maybe days for the few with ISPs that have finicky DNS servers).

Join BOINC United now!
ID: 143242
Profile Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21804
Credit: 2,815,091
RAC: 0
United States
Message 143415 - Posted: 27 Jul 2005, 1:10:50 UTC - in response to Message 143242.  

...but while the new DNS maps propagate some users will be unable to contact the scheduler. This should clear up relatively quickly (several hours for most of the world, maybe days for the few with ISPs that have finicky DNS servers).

12 hours here and counting.
me@rescam.org
ID: 143415
Profile Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 143535 - Posted: 27 Jul 2005, 3:56:53 UTC

Here is more fun: one computer on my net can update, one cannot ... :)
ID: 143535
Profile Prognatus
Joined: 6 Jul 99
Posts: 1600
Credit: 391,546
RAC: 0
Norway
Message 143543 - Posted: 27 Jul 2005, 4:09:09 UTC - in response to Message 143535.  
Last modified: 27 Jul 2005, 4:29:36 UTC

Same here. I noticed that none of my PCs running TMR's optimized clients can report, but another (4.18) reports fine every time. However unlikely... but I wonder if this has something to do with anonymous platform. :/

[edit]
I saw the discussion about the name service (DNS) change in another thread, and this seems the likely cause. But it's really strange anyhow, because one of my PCs that can report is on the same router as one that cannot! This doesn't compute... :)
[/edit]
ID: 143543
EclipseHA
Joined: 28 Jul 99
Posts: 1018
Credit: 530,719
RAC: 0
United States
Message 143582 - Posted: 27 Jul 2005, 5:55:17 UTC

This type of change can take 72 or more hours, and there's nothing that UCB can do about it. (It's your local ISP which isn't synced.)

Whatever you do, don't change anything! Any mods you make may break stuff.
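
A read-only check changes nothing, so it is a safe way to see whether your resolver has picked up a new record yet. The following is a minimal sketch, not anything from the project: the helper names `resolved_addresses` and `has_new_record` are made up for illustration, and the hostname and address in the usage lines are documentation placeholders (`scheduler.example.org`, `192.0.2.10`), not the real scheduler. The resolver is passed in as a parameter so the logic can be tried without network access.

```python
import socket


def resolved_addresses(hostname, resolver=socket.gethostbyname_ex):
    """Return the sorted IPv4 addresses the given resolver reports for hostname."""
    _name, _aliases, addresses = resolver(hostname)
    return sorted(addresses)


def has_new_record(hostname, expected_ip, resolver=socket.gethostbyname_ex):
    """True if expected_ip is among the addresses currently returned for hostname."""
    try:
        return expected_ip in resolved_addresses(hostname, resolver)
    except OSError:
        # Lookup failures (socket.gaierror, socket.herror) count as "not yet visible".
        return False


# Stand-in resolver with placeholder data, so the example runs offline.
def fake_resolver(hostname):
    return (hostname, [], ["192.0.2.10"])


print(has_new_record("scheduler.example.org", "192.0.2.10", fake_resolver))  # True
print(has_new_record("scheduler.example.org", "192.0.2.99", fake_resolver))  # False
```

Dropping the third argument makes the same call use the system resolver, which is what the BOINC client on that machine would see.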


ID: 143582
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 143633 - Posted: 27 Jul 2005, 10:05:51 UTC - in response to Message 143543.  
Last modified: 27 Jul 2005, 10:08:20 UTC

Same here. I noticed that none of my PCs running TMR's optimized clients can report, but another (4.18) reports fine every time. However unlikely... but I wonder if this has something to do with anonymous platform. :/

Prognatus and others, last night the old dying laptop of mine, which wasn't connected all day, got right on through. This laptop couldn't connect beforehand, and didn't connect after the other one got through. Both have optimized clients. Then this morning, this computer is getting through. I guess "patience" is what's needed.

PS I'm a dial up user
ID: 143633
Profile spacemeat
Joined: 4 Oct 99
Posts: 239
Credit: 8,425,288
RAC: 0
United States
Message 143681 - Posted: 27 Jul 2005, 13:03:03 UTC

does someone have the new scheduler IP? i want to make a temporary entry in the local hosts file. one of my machines ran out of work and needs an update. its 3-day cache usually dries up in 12 hours
ID: 143681
Profile Thierry Van Driessche
Volunteer tester
Joined: 20 Aug 02
Posts: 3083
Credit: 150,096
RAC: 0
Belgium
Message 143690 - Posted: 27 Jul 2005, 13:37:47 UTC - in response to Message 143681.  
Last modified: 27 Jul 2005, 13:47:10 UTC

does someone have the new scheduler IP? i want to make a temporary entry in the local hosts file. one of my machines ran out of work and needs an update. its 3-day cache usually dries up in 12 hours

Using a proxy can resolve the problem for returning results. I just used it myself having the communication problem.

See here.
ID: 143690
Profile spacemeat
Joined: 4 Oct 99
Posts: 239
Credit: 8,425,288
RAC: 0
United States
Message 143702 - Posted: 27 Jul 2005, 14:15:30 UTC - in response to Message 143690.  

128.32.18.173 setiboinc.ssl.berkeley.edu

for win, add this line to C:\windows\system32\drivers\etc\hosts and save the file.
open up a command prompt and type "ipconfig /flushdns"
update the seti project
do not forget to eventually remove that line from your hosts file

i got the IP by ssh'ing into one of my remote machines on a different network, which already had the update
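
The add-then-remove steps above can be sketched as plain text manipulation. This is a rough illustration only: the helpers `add_override` and `remove_override` are hypothetical names, they work on the hosts file contents as a string rather than touching the real file, and the starting contents (`127.0.0.1 localhost`) are an assumed example.

```python
def _names_on_line(line):
    """Return the hostnames on one hosts-file line, ignoring comments and the address."""
    fields = line.split("#", 1)[0].split()
    return fields[1:]


def add_override(hosts_text, ip, hostname):
    """Return hosts_text with 'ip hostname' appended, unless hostname is already mapped."""
    for line in hosts_text.splitlines():
        if hostname in _names_on_line(line):
            return hosts_text
    if hosts_text and not hosts_text.endswith("\n"):
        hosts_text += "\n"
    return hosts_text + f"{ip} {hostname}\n"


def remove_override(hosts_text, hostname):
    """Return hosts_text without any line that maps hostname.

    The whole line is dropped, which is fine for the single-name entry used here.
    """
    kept = [line for line in hosts_text.splitlines()
            if hostname not in _names_on_line(line)]
    return "\n".join(kept) + ("\n" if kept else "")


original = "127.0.0.1 localhost\n"  # assumed starting contents
patched = add_override(original, "128.32.18.173", "setiboinc.ssl.berkeley.edu")
print(patched, end="")
restored = remove_override(patched, "setiboinc.ssl.berkeley.edu")
print(restored == original)  # True
```

Writing the result back to the real hosts file needs administrator rights, and the DNS cache flush from the steps above still applies afterwards.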
ID: 143702
Profile Crunch3r
Volunteer tester
Joined: 15 Apr 99
Posts: 1546
Credit: 3,438,823
RAC: 0
Germany
Message 143830 - Posted: 27 Jul 2005, 23:55:36 UTC - in response to Message 143702.  

128.32.18.173 setiboinc.ssl.berkeley.edu

for win, add this line to C:\windows\system32\drivers\etc\hosts and save the file.
open up a command prompt and type "ipconfig /flushdns"
update the seti project
do not forget to eventually remove that line from your hosts file

i got the IP by ssh'ing into one of my remote machines on a different network, which already had the update


It's working for me now. Seems the update to the root servers went well. No problems updating or contacting the scheduler from here.

regards
Crunch3r

Join BOINC United now!
ID: 143830


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.