Panic Mode On (22) Server problems

Author	Message
arkayn Volunteer tester Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0	Message 920602 - Posted: 23 Jul 2009, 7:32:24 UTC Let's hope this one stays open a little bit longer, which would mean the servers are running better. ID: 920602 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304	Message 920606 - Posted: 23 Jul 2009, 8:08:18 UTC - in response to Message 920602. Last modified: 23 Jul 2009, 8:11:32 UTC I was just about to post a follow up to a discussion in the other thread, and then i couldn't. So i might as well post it here. :-) Yet another SSD from Intel. I draw your attention to the Random Read & Random Write performance, in particular the comparison to the VelociRaptor. EDIT- BTW- whatever Eric did is still working. There have been a few times where it's taken a retry or 2 before a result has uploaded, but this is during traffic that previously nothing would have been able to upload in. Grant Darwin NT ID: 920606 ·

gizbar Send message Joined: 7 Jan 01 Posts: 586 Credit: 21,087,774 RAC: 0	Message 920611 - Posted: 23 Jul 2009, 8:50:14 UTC This morning, I finally cleared my backlog completely. I hope that all the tasks were reported in time, and that I'll get all the credit that I worked for. I've noticed that I've got quite a few of the new "double precision workunits?" that will take approximately twice as long. Let's hope all the changes are for the good, and we can settle down to a bit of reliability for a while. The servers can breathe a sigh of relief once the pressure drops, and hopefully get back to normal. regards, Gizbar. *A proud GPU User Server Donor!* ID: 920611 ·

ML1 Volunteer moderator Volunteer tester Send message Joined: 25 Nov 01 Posts: 20258 Credit: 7,508,002 RAC: 20	Message 920640 - Posted: 23 Jul 2009, 12:02:34 UTC Last modified: 23 Jul 2009, 12:03:38 UTC From the previous thread: with "the Staff" having worked for a solid 3 days through the TCP settings on the Upload server, the Log Jam should be broken. Well, things are certainly running more smoothly from the user's point of view. For my system here, uploads and downloads clear pretty much instantly. Curiously, on Cricket the downloads look to be maxed out for 1 hour periods at a time. More significantly: Has the uploads issue really been 'fixed' by only tweaking the upload server TCP settings? Or has the fix been greatly helped by also avoiding the download servers from saturating the downlink? Regards, Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) ID: 920640 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304	Message 920641 - Posted: 23 Jul 2009, 12:15:50 UTC - in response to Message 920640. Last modified: 23 Jul 2009, 12:16:09 UTC More significantly: Has the uploads issue really been 'fixed' by only tweaking the upload server TCP settings? Or has the fix been greatly helped by also avoiding the download servers from saturating the downlink? My vote is for the tweak. For a given level of download traffic i'm getting uploads going through when before they would have taken several attempts. I'm even getting some uploads going through after a couple of attempts where previously nothing would have gotten through. Grant Darwin NT ID: 920641 ·

Joseph Monk Send message Joined: 31 Mar 07 Posts: 150 Credit: 1,181,197 RAC: 0	Message 920649 - Posted: 23 Jul 2009, 12:46:29 UTC - in response to Message 920641. Whatever they did is great: every one of my uploads has gone through right away and my downloads are coming cleanly. 6.6.11 seems to be the magic version for people like me (two CUDA cards on Linux x64), so the timing is perfect. Life is good now! ID: 920649 ·

Space Cowboy Volunteer tester Send message Joined: 24 Apr 00 Posts: 43 Credit: 1,730,621 RAC: 0	Message 920672 - Posted: 23 Jul 2009, 14:31:09 UTC Good to see everything back up and running again. Question though, anyone know why the option to view tasks is disabled and the one to view pending credit on your account page has vanished? ID: 920672 ·

Fred W Volunteer tester Send message Joined: 13 Jun 99 Posts: 2524 Credit: 11,954,210 RAC: 0	Message 920675 - Posted: 23 Jul 2009, 14:40:19 UTC - in response to Message 920672. Good to see everything back up and running again. Question though, anyone know why the option to view tasks is disabled and the one to view pending credit on your account page has vanished? It has been documented in several threads. Both those functions put a heavy load on the replica database which is only just catching up with the master. They will be switched back on when things have settled down (probably next week when Berkeley is fully staffed again). F. ID: 920675 ·

Vistro Send message Joined: 6 Aug 08 Posts: 233 Credit: 316,549 RAC: 0	Message 920686 - Posted: 23 Jul 2009, 15:07:09 UTC - in response to Message 920675. Good to see everything back up and running again. Question though, anyone know why the option to view tasks is disabled and the one to view pending credit on your account page has vanished? It has been documented in several threads. Both those functions put a heavy load on the replica database which is only just catching up with the master. They will be switched back on when things have settled down (probably next week when Berkeley is fully staffed again). F. Why isn't it fully staffed now? Vacation? ID: 920686 ·

zoom3+1=4 Volunteer tester Send message Joined: 30 Nov 03 Posts: 65736 Credit: 55,293,173 RAC: 49	Message 920688 - Posted: 23 Jul 2009, 15:21:33 UTC - in response to Message 920686. Good to see everything back up and running again. Question though, anyone know why the option to view tasks is disabled and the one to view pending credit on your account page has vanished? It has been documented in several threads. Both those functions put a heavy load on the replica database which is only just catching up with the master. They will be switched back on when things have settled down (probably next week when Berkeley is fully staffed again). F. Why isn't it fully staffed now? Vacation? Yep. Last I heard, 1/3rd of the staff is on vacation (Matt). He earned it too. The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's ID: 920688 ·

OzzFan Volunteer tester Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28	Message 920696 - Posted: 23 Jul 2009, 16:07:39 UTC - in response to Message 920688. Good to see everything back up and running again. Question though, anyone know why the option to view tasks is disabled and the one to view pending credit on your account page has vanished? It has been documented in several threads. Both those functions put a heavy load on the replica database which is only just catching up with the master. They will be switched back on when things have settled down (probably next week when Berkeley is fully staffed again). F. Why isn't it fully staffed now? Vacation? Yep, Last I heard 1/3rd of the staff is on vacation(Matt), He earned It too. There's another person on vacation too. ID: 920696 ·

Vistro Send message Joined: 6 Aug 08 Posts: 233 Credit: 316,549 RAC: 0	Message 920700 - Posted: 23 Jul 2009, 16:18:49 UTC - in response to Message 920696. Well, a new internet link won't fix the tasks list. They would need a new server for that, right? ID: 920700 ·

OzzFan Volunteer tester Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28	Message 920704 - Posted: 23 Jul 2009, 16:26:18 UTC - in response to Message 920700. Well, a new internet link won't fix the tasks list. They would need a new server for that, right? They need a gigabit internet connection to ensure enough bandwidth for all the hungry crunchers. They need powerful enough servers to not drop all the connections asking for more work or returning results. ID: 920704 ·

Sutaru Tsureku Volunteer tester Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 920706 - Posted: 23 Jul 2009, 16:38:02 UTC Last modified: 23 Jul 2009, 16:40:05 UTC I don't know what is happening on other PCs, but my GPU cruncher is now starting to get UL and DL 'http errors'. EDIT: Also, as I already mentioned in the other panic thread, the DL speed is cut by ~50%. ID: 920706 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 920740 - Posted: 23 Jul 2009, 18:25:18 UTC - in response to Message 920700. Well, a new internet link won't fix the tasks list. They would need a new server for that, right? They can use many things. One issue "may" be the way the BOINC client handles uploads and downloads -- just being too aggressive when things are slow. That could be fixed in 6.6.38. ID: 920740 ·

Josef W. Segur Volunteer developer Volunteer tester Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 920757 - Posted: 23 Jul 2009, 19:46:59 UTC
Vistro asked in the previous thread: I don't know why this is really making me scratch my head.. Results ready to send: For each workunit, "empty" results are generated that are then sent out to individual users to be filled with data. This is the number of excess empty results ready to be sent out, i.e. a backlog in case demand exceeds the current rate of creation. You are trying to sell to me that it stockpiles tens of thousands of identical files, sends one off, then deletes it? Why can't it keep ONE, and send it off, and keep it once it's done? Is there a value that says: Here's how many have been split and are now ready to be fired! What's the difference between a workunit and a result?
Here's an overall brief rundown of data handling.
Data which was recorded at Arecibo is received on 750 GB Hard Disks, then broken down into files of 50.20 GB or smaller for more convenient handling. Those files are sometimes called 'tapes' because the size resembles the amount of data on one tape of an older recorder system.
A splitter works with one of the 14 channels in one of those files. An ap_splitter divides the data into 13.42 second sequential chunks, each being a WU. An mb_splitter gets 107.37 seconds of data and breaks it down into 256 frequency subbands, each of which is a WU. Each WU is saved as a file, and a database entry is made in the WORKUNIT table identifying where the file is and much other information such as the basis for estimated crunch time, how far out the deadline should be, etc.
The Transitioner checks the database for new WUs, and creates 2 records in the RESULT table because the project setting is 2 for initial replication. Those 2 are then added to the "Results ready to send" queue shown on the server status page.
The Feeder notifies the Scheduler about a set of up to 100 of those Results. Those 100 slots are preassigned, some for MB work, some for AP_v5, and some for AP_v505, probably a 96:1:3 ratio now. The Feeder goes to sleep for a few seconds after it has updated those slots, either by filling them all or leaving some empty because there are none of the correct type in "Ready to send".
As a Scheduler process handles a request for work from a host, it checks those slots for suitable work. Preferences, host capabilities, or an app_info.xml can all rule out some types. If no work is found, a "(Project has no work available)" message is sent; otherwise the name and URL of the workunit and the desired result are added to the reply message, and database fields are updated to show the work "In progress". Also, the executable and any other files needed to do the work are identified in the reply, including the URLs if the host isn't using an app_info.xml. If the estimated crunch time hasn't fulfilled the amount of work the host requested, the Scheduler looks for more suitable work; otherwise it's done and the reply is sent.
The host receives the reply containing one or more tasks. I view a task as being a set of directions:
1. Download xxxxx workunit and any other needed files you don't already have.
2. Start yyyyy application to crunch the workunit and produce an output file with the right name.
3. When the application finishes, upload the result file.
4. Sometime after the result has successfully uploaded, report the task complete.
When a host reports a result, a Scheduler process updates the RESULT and WORKUNIT tables as needed. Multiple reported results are done in a batch, which is less costly in terms of database operations.
When successful results from 2 hosts (project setting) have been reported, the Validator loads the two result files and checks whether they match. If so, one is declared canonical and both are granted credit. Otherwise, the validation state causes the Transitioner to create another RESULT entry, and so on.
A canonical result is entered into the master science database by the Assimilator. If there are no more "In progress" results for the WU, the workunit file and all associated result files can be deleted. The database workunit and result entries are kept for another day, usually visible to users as web pages.
---------------------------
There's a lot of detail I've left out, but those are the high points. I don't know all the details, for that matter. Hope it helps. Joe ID: 920757 ·
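The workunit/result distinction in Joe's rundown (one workunit file, two RESULT records created up front, a third created only if the first two disagree) can be sketched roughly in Python. This is only an illustration and not BOINC code: the class names, state strings, and helper functions are made up for the example, and comparing outputs with plain equality is an assumption standing in for whatever the real Validator does.

```python
from dataclasses import dataclass, field
from typing import List, Optional

INITIAL_REPLICATION = 2  # project setting quoted in the post


@dataclass
class Result:
    """One RESULT record: a slot for a host's answer (hypothetical model)."""
    wu_name: str
    state: str = "ready_to_send"  # then "in_progress", then "reported"
    output: Optional[str] = None
    credited: bool = False


@dataclass
class Workunit:
    """One WORKUNIT record: points at a single data file (hypothetical model)."""
    name: str
    results: List[Result] = field(default_factory=list)
    canonical: Optional[Result] = None


def create_results(wu: Workunit, count: int = INITIAL_REPLICATION) -> None:
    """Transitioner step: add empty result records for a workunit."""
    wu.results.extend(Result(wu.name) for _ in range(count))


def send(result: Result) -> None:
    """Scheduler step: hand a result to a host."""
    result.state = "in_progress"


def report(result: Result, output: str) -> None:
    """Scheduler step: a host reports its finished task."""
    result.output = output
    result.state = "reported"


def validate(wu: Workunit) -> bool:
    """Validator step: pick a canonical result if two reported outputs match."""
    reported = [r for r in wu.results if r.state == "reported"]
    if len(reported) < INITIAL_REPLICATION:
        return False
    for i, a in enumerate(reported):
        for b in reported[i + 1:]:
            if a.output == b.output:  # assumption: exact match
                wu.canonical = a
                a.credited = b.credited = True
                return True
    # No match yet: ask for one more result unless one is already outstanding.
    if all(r.state == "reported" for r in wu.results):
        create_results(wu, 1)
    return False


# Walk one workunit through a disagreement and a tie-breaking third result.
wu = Workunit("wu_demo")
create_results(wu)
for r in wu.results:
    send(r)
report(wu.results[0], "signal A")
report(wu.results[1], "signal B")
validate(wu)                      # mismatch, so a third result is created
send(wu.results[2])
report(wu.results[2], "signal A")
validate(wu)                      # first and third match
```

The point of the sketch is the one Vistro picks up on in the next post: the stockpile counted under "Results ready to send" is made of small database records, not copies of the data file, so pre-creating them is cheap.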

Vistro Send message Joined: 6 Aug 08 Posts: 233 Credit: 316,549 RAC: 0	Message 920759 - Posted: 23 Jul 2009, 19:51:34 UTC - in response to Message 920757. Oh. That makes a lot of sense. You precreate the result records because it would be easier to do it now when you have the free time than later. ID: 920759 ·

__W__ Send message Joined: 28 Mar 09 Posts: 116 Credit: 5,943,642 RAC: 0	Message 920785 - Posted: 23 Jul 2009, 21:29:17 UTC Last modified: 23 Jul 2009, 21:58:38 UTC Congratulations to the hard-working SETI staff. Since the last maintenance (except for the first 2-3 hours, when the wave runs through :-) ) ULs and DLs have been running smoothly - no errors or retries for me. They must have found some extra MHz/GBit somewhere, which are rarer than truffles. Maybe we should donate a truffle pig to them, and when they have found enough bandwidth they could have a nice barbecue - with the pig of course ;-). __W__ _______________________________________________________________________________ ID: 920785 ·

-= Vyper =- Volunteer tester Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537	Message 920843 - Posted: 23 Jul 2009, 23:36:04 UTC Yes, for the first time in nearly 2 months things are settling. My top cruncher is extremely sensitive to server errors, and I was getting used to getting some work to crunch around Saturday or Sunday after a dry cache occurred during the Tuesday backups. This time I was up and running merely the day after, with work filling up so it wouldn't drain. Many thanks, and extremely nicely done, s@h staff. This was a really good move to increase the sensitivity of regular MB work paired with the tweaking of Apache on the upload server. Hope the "Panic thread" doesn't need to be flooded that much in the near future. Kind regards Vyper _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group ID: 920843 ·

HAL9000 Volunteer tester Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57	Message 920890 - Posted: 24 Jul 2009, 1:30:46 UTC - in response to Message 920704. Well, a new internet link won't fix the tasks list. They would need a new server for that, right? They need a gigabit internet connection to ensure enough bandwidth for all the hungry crunchers. They need powerful enough servers to not drop all the connections asking for more work or returning results. You know... I bet that once the gb line is up and running that usage will end up topping out at like 110mb lol. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu ID: 920890 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.