Panic Mode On (22) Server problems

Message boards : Number crunching : Panic Mode On (22) Server problems

arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 920602 - Posted: 23 Jul 2009, 7:32:24 UTC

Let's hope this one stays open a little bit longer, which would mean the servers are running better.

ID: 920602
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13368
Credit: 208,696,464
RAC: 304
Australia
Message 920606 - Posted: 23 Jul 2009, 8:08:18 UTC - in response to Message 920602.  
Last modified: 23 Jul 2009, 8:11:32 UTC

I was just about to post a follow-up to a discussion in the other thread, and then I couldn't. So I might as well post it here. :-)

Yet another SSD from Intel.
I draw your attention to the Random Read & Random Write performance, in particular the comparison to the VelociRaptor.



EDIT-
BTW- whatever Eric did is still working. There have been a few times where it's taken a retry or 2 before a result has uploaded, but this is during traffic that previously nothing would have been able to upload in.
Grant
Darwin NT
ID: 920606
gizbar
Joined: 7 Jan 01
Posts: 586
Credit: 21,087,774
RAC: 0
United Kingdom
Message 920611 - Posted: 23 Jul 2009, 8:50:14 UTC

This morning, I finally cleared my backlog completely. I hope that all the tasks were reported in time, and that I'll get all the credit that I worked for. I've noticed that I've got quite a few of the new "double precision workunits?" that will take approximately twice as long. Let's hope all the changes are for the good, and we can settle down to a bit of reliability for a while. The servers can breathe a sigh of relief once the pressure drops, and hopefully get back to normal.

regards, Gizbar.


A proud GPU User Server Donor!
ID: 920611
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 16614
Credit: 7,508,002
RAC: 20
United Kingdom
Message 920640 - Posted: 23 Jul 2009, 12:02:34 UTC
Last modified: 23 Jul 2009, 12:03:38 UTC

From the previous thread:

"the Staff" having worked for a solid 3 days through the TCP settings on the Upload server the Log Jam should be broken.

Well, things are certainly running more smoothly from the user's point of view. For my system here, uploads and downloads clear pretty much instantly.

Curiously, on Cricket the downloads look to be maxed out for 1 hour periods at a time.

More significantly: Has the uploads issue really been 'fixed' by only tweaking the upload server TCP settings? Or has the fix been greatly helped by also keeping the download servers from saturating the downlink?


Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 920640
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13368
Credit: 208,696,464
RAC: 304
Australia
Message 920641 - Posted: 23 Jul 2009, 12:15:50 UTC - in response to Message 920640.  
Last modified: 23 Jul 2009, 12:16:09 UTC

More significantly: Has the uploads issue really been 'fixed' by only tweaking the upload server TCP settings? Or has the fix been greatly helped by also keeping the download servers from saturating the downlink?

My vote is for the tweak.
For a given level of download traffic I'm getting uploads going through where before they would have taken several attempts. I'm even getting some uploads going through after a couple of attempts where previously nothing would have gotten through.
Grant
Darwin NT
ID: 920641
Joseph Monk
Joined: 31 Mar 07
Posts: 150
Credit: 1,181,197
RAC: 0
Korea, South
Message 920649 - Posted: 23 Jul 2009, 12:46:29 UTC - in response to Message 920641.  

Whatever they did is great; every one of my uploads has gone through right away and my downloads are coming cleanly. 6.6.11 seems to be the magic version for people like me (two CUDA cards on Linux x64), so the timing is perfect. Life is good now!
ID: 920649
Space Cowboy
Volunteer tester
Joined: 24 Apr 00
Posts: 43
Credit: 1,730,621
RAC: 0
United Kingdom
Message 920672 - Posted: 23 Jul 2009, 14:31:09 UTC

Good to see everything back up and running again. Question though, anyone know why the option to view tasks is disabled and the one to view pending credit on your account page has vanished?
ID: 920672
Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 920675 - Posted: 23 Jul 2009, 14:40:19 UTC - in response to Message 920672.  

Good to see everything back up and running again. Question though, anyone know why the option to view tasks is disabled and the one to view pending credit on your account page has vanished?

It has been documented in several threads. Both those functions put a heavy load on the replica database which is only just catching up with the master. They will be switched back on when things have settled down (probably next week when Berkeley is fully staffed again).

F.
ID: 920675
Vistro
Joined: 6 Aug 08
Posts: 233
Credit: 316,549
RAC: 0
United States
Message 920686 - Posted: 23 Jul 2009, 15:07:09 UTC - in response to Message 920675.  

Good to see everything back up and running again. Question though, anyone know why the option to view tasks is disabled and the one to view pending credit on your account page has vanished?

It has been documented in several threads. Both those functions put a heavy load on the replica database which is only just catching up with the master. They will be switched back on when things have settled down (probably next week when Berkeley is fully staffed again).

F.



Why isn't it fully staffed now? Vacation?
ID: 920686
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 64937
Credit: 55,293,173
RAC: 49
United States
Message 920688 - Posted: 23 Jul 2009, 15:21:33 UTC - in response to Message 920686.  

Good to see everything back up and running again. Question though, anyone know why the option to view tasks is disabled and the one to view pending credit on your account page has vanished?

It has been documented in several threads. Both those functions put a heavy load on the replica database which is only just catching up with the master. They will be switched back on when things have settled down (probably next week when Berkeley is fully staffed again).

F.



Why isn't it fully staffed now? Vacation?

Yep, last I heard 1/3rd of the staff is on vacation (Matt). He earned it too.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 920688
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15688
Credit: 84,761,841
RAC: 28
United States
Message 920696 - Posted: 23 Jul 2009, 16:07:39 UTC - in response to Message 920688.  

Good to see everything back up and running again. Question though, anyone know why the option to view tasks is disabled and the one to view pending credit on your account page has vanished?

It has been documented in several threads. Both those functions put a heavy load on the replica database which is only just catching up with the master. They will be switched back on when things have settled down (probably next week when Berkeley is fully staffed again).

F.



Why isn't it fully staffed now? Vacation?

Yep, last I heard 1/3rd of the staff is on vacation (Matt). He earned it too.


There's another person on vacation too.
ID: 920696
Vistro
Joined: 6 Aug 08
Posts: 233
Credit: 316,549
RAC: 0
United States
Message 920700 - Posted: 23 Jul 2009, 16:18:49 UTC - in response to Message 920696.  

Well, a new internet link won't fix the tasks list.

They would need a new server for that, right?

ID: 920700
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15688
Credit: 84,761,841
RAC: 28
United States
Message 920704 - Posted: 23 Jul 2009, 16:26:18 UTC - in response to Message 920700.  

Well, a new internet link won't fix the tasks list.

They would need a new server for that, right?


They need a gigabit internet connection to ensure enough bandwidth for all the hungry crunchers. They need powerful enough servers to not drop all the connections asking for more work or returning results.
ID: 920704
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 920706 - Posted: 23 Jul 2009, 16:38:02 UTC
Last modified: 23 Jul 2009, 16:40:05 UTC


I don't know what is happening on other PCs, but...

My GPU cruncher is now starting to get UL and DL 'http errors'.


EDIT:
Also, as I already mentioned in the other panic thread:
The DL speed has been cut by ~50%.

ID: 920706
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 920740 - Posted: 23 Jul 2009, 18:25:18 UTC - in response to Message 920700.  

Well, a new internet link won't fix the tasks list.

They would need a new server for that, right?

They can use many things.

One issue "may" be the way the BOINC client handles uploads and downloads -- just being too aggressive when things are slow. That could be fixed in 6.6.38.

ID: 920740
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 920757 - Posted: 23 Jul 2009, 19:46:59 UTC

Vistro asked in the previous thread:
I don't know why this is really making me scratch my head..

Results ready to send: For each workunit, "empty" results are generated that are then sent out to individual users to be filled with data. This is the number of excess empty results ready to be sent out, i.e. a backlog in case demand exceeds the current rate of creation.


You are trying to sell to me that it stockpiles tens of thousands of identical files, sends one off, then deletes it? Why can't it keep ONE, and send it off, and keep it once it's done?

Is there a value that says: Here's how many have been split and are now ready to be fired!

What's the difference between a workunit and a result?


Here's an overall brief rundown of data handling.

Data which was recorded at Arecibo is received on 750 GB hard disks, then broken down into files of 50.20 GB or smaller for more convenient handling. Those files are sometimes called 'tapes' because the size resembles the amount of data on one tape of an older recorder system.

A splitter works with one of the 14 channels in one of those files. An ap_splitter divides the data into sequential 13.42-second chunks, each being a WU. An mb_splitter takes 107.37 seconds of data and breaks it down into 256 frequency subbands, each of which is a WU. Each WU is saved as a file, and a database entry is made in the WORKUNIT table identifying where the file is, along with much other information such as the basis for estimated crunch time, how far out the deadline should be, etc.

The Transitioner checks the database for new WUs, and creates 2 records in the RESULT table because the project setting is 2 for initial replication. Those 2 are then added to the "Results ready to send" queue shown on the server status page.
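As a rough illustration of that step, here is a minimal Python sketch of a transitioner pass. It is not the BOINC code; the class names, the `transition` function and the in-memory queue are all made up for the example, and only the replication value of 2 comes from the description above.

```python
from dataclasses import dataclass, field
from itertools import count

INITIAL_REPLICATION = 2  # project setting mentioned in the post

@dataclass
class Result:            # hypothetical stand-in for a RESULT table row
    id: int
    wu_name: str
    state: str = "ready_to_send"

@dataclass
class Workunit:          # hypothetical stand-in for a WORKUNIT table row
    name: str
    results: list = field(default_factory=list)

_result_ids = count(1)

def transition(wu, ready_queue):
    """Create the initial result records for a new workunit and queue them."""
    if wu.results:       # not a new workunit, nothing to create in this pass
        return []
    created = [Result(next(_result_ids), wu.name) for _ in range(INITIAL_REPLICATION)]
    wu.results.extend(created)
    ready_queue.extend(created)
    return created

ready_to_send = []
for name in ("wu_a", "wu_b", "wu_c"):
    transition(Workunit(name), ready_to_send)
print(len(ready_to_send))  # 6
```

Three new workunits give six entries in the queue, which is why the "Results ready to send" count is a count of records rather than of stockpiled files.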

The Feeder notifies the Scheduler about a set of up to 100 of those Results. Those 100 slots are preassigned, some for MB work, some for AP_v5, and some for AP_v505, probably a 96:1:3 ratio now. The Feeder goes to sleep for a few seconds after it has updated those slots, either by filling them all or leaving some empty because there are none of the correct type in "Ready to send".
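The slot idea can be sketched like this in Python. Everything here is an assumption for illustration: the slot layout, the `fill_slots` name and the queues are invented, and the 96:1:3 split is only the guess given above.

```python
from collections import deque

# Assumed split of the 100 slots, taken from the "probably 96:1:3" guess.
SLOTS_PER_APP = {"MB": 96, "AP_v5": 1, "AP_v505": 3}

def fill_slots(slots, ready):
    """Fill each empty slot from the ready queue of its own app type.

    Slots whose app type has nothing ready are simply left empty.
    Returns the number of slots filled in this pass.
    """
    filled = 0
    for i, (app, current) in enumerate(slots):
        if current is None and ready[app]:
            slots[i] = (app, ready[app].popleft())
            filled += 1
    return filled

# One (app, result) pair per slot; None means the slot is empty.
slots = [(app, None) for app, n in SLOTS_PER_APP.items() for _ in range(n)]
ready = {
    "MB": deque(f"mb_{i}" for i in range(200)),
    "AP_v5": deque(),            # nothing of this type in "Ready to send"
    "AP_v505": deque(["ap_0"]),  # only one available
}
filled = fill_slots(slots, ready)
empty = sum(1 for _, result in slots if result is None)
print(filled, empty)  # 97 3
```

In this made-up run 97 of the 100 slots get filled and 3 stay empty, because the AP queues cannot supply their preassigned slots even though plenty of MB work is waiting.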

As a Scheduler process handles a request for work from a host, it checks those slots for suitable work. Preferences, host capabilities, or an app_info.xml can all rule out some types. If no work is found a "(Project has no work available)" message is sent, otherwise the name and URL of the workunit and the desired result are added to the reply message, and database fields are updated to show the work "In progress". Also, the executable and any other files needed to do the work are identified in the reply, including the URLs if the host isn't using an app_info.xml. If the estimated crunch time hasn't fulfilled the amount of work the host requested, the Scheduler looks for more suitable work, otherwise it's done and the reply is sent.
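A simplified Python sketch of that loop follows. The slot dictionaries, field names and `handle_request` are hypothetical, and the real Scheduler checks far more (preferences, host capabilities, database updates) than the single `allowed_apps` filter used here.

```python
NO_WORK = "(Project has no work available)"

def handle_request(slots, requested_seconds, allowed_apps):
    """Take results from the feeder slots until the estimated crunch
    time covers what the host asked for."""
    assigned, total = [], 0.0
    for i, slot in enumerate(slots):
        if total >= requested_seconds:
            break                      # request fulfilled, stop looking
        if slot is None or slot["app"] not in allowed_apps:
            continue                   # empty slot or unsuitable work type
        assigned.append(slot["result"])
        total += slot["est_seconds"]
        slots[i] = None                # slot stays empty until the Feeder refills it
    if not assigned:
        return NO_WORK, []
    return "ok", assigned

slots = [
    {"app": "MB", "result": "mb_1", "est_seconds": 3000},
    {"app": "AP_v505", "result": "ap_1", "est_seconds": 40000},
    {"app": "MB", "result": "mb_2", "est_seconds": 3000},
    {"app": "MB", "result": "mb_3", "est_seconds": 3000},
]
status, work = handle_request(slots, 5000, {"MB"})
print(status, work)  # ok ['mb_1', 'mb_2']
```

A host that only accepts a type with no filled slots gets the "no work" message even while other types are available, which matches what users sometimes see in their logs.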

The host receives the reply containing one or more tasks. I view a task as being a set of directions:
1. Download xxxxx workunit and any other needed files you don't already have.
2. Start yyyyy application to crunch the workunit and produce an output file with the right name.
3. When the application finishes, upload the result file.
4. Sometime after the result has successfully uploaded, report the task complete.

When a host reports a result, a Scheduler process updates the RESULT and WORKUNIT tables as needed. Multiple reported results are handled in one batch, which is less costly in terms of database operations.

When successful results from 2 hosts (project setting) have been reported, the Validator loads the two result files and checks whether they match. If so, one is declared canonical and both are granted credit. Otherwise, the validation state causes the Transitioner to create another RESULT entry, and so on.
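The decision can be sketched in Python as below. This is only an illustration: the `validate` function and the result dictionaries are invented, plain equality stands in for whatever comparison the real Validator performs on result files, and the handling of a third result is my assumption about what "and so on" means.

```python
def validate(results, quorum=2, matches=lambda a, b: a == b):
    """Decide what to do with the results reported so far for one workunit."""
    ok = [r for r in results if r["outcome"] == "success"]
    if len(ok) < quorum:
        return {"action": "wait"}              # not enough successful results yet
    for a in ok:
        agreeing = [b for b in ok if matches(a["output"], b["output"])]
        if len(agreeing) >= quorum:
            return {
                "action": "grant_credit",
                "canonical": a["host"],        # one matching result becomes canonical
                "credited": [b["host"] for b in agreeing],
            }
    return {"action": "create_another_result"} # left to the Transitioner

def reported(host, output, outcome="success"):
    return {"host": host, "output": output, "outcome": outcome}

print(validate([reported("h1", "x"), reported("h2", "x")])["action"])  # grant_credit
print(validate([reported("h1", "x"), reported("h2", "y")])["action"])  # create_another_result
```

With a mismatch, a third result that agrees with one of the first two would let that pair validate, while the odd one out gets no credit in this sketch.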

A canonical result is entered into the master science database by the Assimilator. If there are no more "In progress" results for the WU, the workunit file and all associated result files can be deleted. The database workunit and result entries are kept for another day, usually visible to users as web pages.
---------------------------

There's a lot of detail I've left out, but those are the high points. I don't know all the details, for that matter. Hope it helps.
                                                             Joe
ID: 920757
Vistro
Joined: 6 Aug 08
Posts: 233
Credit: 316,549
RAC: 0
United States
Message 920759 - Posted: 23 Jul 2009, 19:51:34 UTC - in response to Message 920757.  

Oh.

That makes a lot of sense. You precreate the result records because it would be easier to do it now when you have the free time than later.
ID: 920759
__W__
Joined: 28 Mar 09
Posts: 116
Credit: 5,943,642
RAC: 0
Germany
Message 920785 - Posted: 23 Jul 2009, 21:29:17 UTC
Last modified: 23 Jul 2009, 21:58:38 UTC

Congratulations to the hard-working SETI staff.
Since the last maintenance (except for the first 2-3 hours, when the wave runs through :-) ) ULs and DLs have been running smoothly - no errors or retries for me.

They must have found some extra MHz/GBit somewhere, which are rarer than truffles. Maybe we should donate a truffle pig to them, and when they have found enough bandwidth they could have a nice barbecue - with the pig of course ;-).

__W__
_______________________________________________________________________________
ID: 920785
-= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 920843 - Posted: 23 Jul 2009, 23:36:04 UTC

Yes, for the first time in nearly 2 months things are settling down.

My top cruncher is extremely sensitive to server errors, and I had gotten used to only getting some work to crunch around Saturday or Sunday after the cache ran dry during the Tuesday backups.

This time I was up and running again just the day after, with work filling up so the cache wouldn't drain.

Many thanks, and extremely nicely done, S@H staff. It was a really good move to increase the sensitivity of regular MB work, paired with the tweaking of Apache on the upload server.

Hope the "Panic thread" doesn't need to be flooded that much in the near future.

Kind regards Vyper

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 920843
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6533
Credit: 196,805,888
RAC: 57
United States
Message 920890 - Posted: 24 Jul 2009, 1:30:46 UTC - in response to Message 920704.  

Well, a new internet link won't fix the tasks list.

They would need a new server for that, right?


They need a gigabit internet connection to ensure enough bandwidth for all the hungry crunchers. They need powerful enough servers to not drop all the connections asking for more work or returning results.


You know... I bet that once the gb line is up and running that usage will end up topping out at like 110mb lol.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!
ID: 920890


 
©2022 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.