Panic Mode On (50) Server problems?

Message boards : Number crunching : Panic Mode On (50) Server problems?

Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1126538 - Posted: 10 Jul 2011, 9:24:06 UTC - in response to Message 1126504.  
Last modified: 10 Jul 2011, 9:25:22 UTC

Yes...but why did you say 'Burma'?

Who did?

Well, all I can say is that if anyone is still having problems ATM, then it's your bad luck. :D

My 3 have been going through the cycles without any problems the last few days and still maintaining their caches after they were all drained last week for updating.

Cheers.
ID: 1126538
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1126615 - Posted: 10 Jul 2011, 17:17:08 UTC

Doesn't look like the replica is gaining any ground. Still just over 200,000 seconds behind.

Matt is right.. there's something fishy going on there. There's no reason on a hardware level that the replica can't keep up.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1126615
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1126623 - Posted: 10 Jul 2011, 17:42:17 UTC - in response to Message 1126531.  
Last modified: 10 Jul 2011, 17:42:55 UTC

http://setiathome.berkeley.edu/workunit.php?wuid=753788639

The 1st did not arrive in time, so a 3rd WU was sent out.
But in the meantime the 1st result came in (2 Jul 2011 - 6:11:09 UTC).

You are right. I misinterpreted it.

But mine should not have started on July 8; it should have been deleted before my machine started to crunch this one, since two results had already been validated.

The second host reported late.
The deadline was 27 June, so it was sent to a third host.
But the second returned before you did and already got credit.

You could still get credit for it.

Look again. ;-)

The 2nd result came in time.

Name ap_02ap11ae_B4_P0_00026_20110601_16835.wu_0
Workunit 753788639
Created 2 Jun 2011 | 0:29:01 UTC
Sent 2 Jun 2011 | 4:26:56 UTC
Received 2 Jul 2011 | 6:11:09 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 5541162
Report deadline 27 Jun 2011 | 4:26:56 UTC


The first two WUs both had the deadline 27 Jun 2011 - 4:26:56 UTC.

If I follow the URL, the top result (which I called the 1st) came in:
ap_02ap11ae_B4_P0_00026_20110601_16835.wu_0
2 Jul 2011 - 6:11:09 UTC
..the middle result (I said 2nd) came:
ap_02ap11ae_B4_P0_00026_20110601_16835.wu_1
13 Jun 2011 - 17:24:07 UTC

So the 2nd WU/result came in time.
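
For anyone who wants to double-check the on-time question, here is a minimal Python sketch that compares the two received times above against the shared deadline. The helper names (`parse_utc`, `on_time`) and the short keys `wu_0` / `wu_1` are made up for illustration. Only the timestamps come from the task details quoted in this post, and the sketch assumes a result is "on time" when it arrives no later than the deadline.

```python
from datetime import datetime, timezone

# Timestamp layout used in this post, e.g. "27 Jun 2011 4:26:56" (UTC).
# Note: %b matches English month abbreviations in the default C locale.
STAMP_FORMAT = "%d %b %Y %H:%M:%S"


def parse_utc(stamp):
    """Parse a 'D Mon YYYY H:MM:SS' string as a timezone-aware UTC datetime."""
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)


def on_time(received_at, deadline):
    """Assumption: a result is on time if it arrives no later than the deadline."""
    return received_at <= deadline


# Figures quoted above for workunit 753788639.
deadline = parse_utc("27 Jun 2011 4:26:56")
received = {
    "wu_0": parse_utc("2 Jul 2011 6:11:09"),
    "wu_1": parse_utc("13 Jun 2011 17:24:07"),
}

status = {name: on_time(stamp, deadline) for name, stamp in received.items()}
overdue_by = received["wu_0"] - deadline  # how far past the deadline wu_0 arrived

for name, ok in status.items():
    print(name, "on time" if ok else "late")
print("wu_0 overdue by", overdue_by)
```

Numbering by name suffix, `wu_1` is the one that made the deadline and `wu_0` is the late one, which is the point being argued here.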

I number the WUs/results by the suffix at the end of the name, which matches the top-to-bottom order in the overview.

You count in the order in which the results were sent back.

;-)


- Best regards! - Sutaru Tsureku, team seti.international founder. - Optimize your PC for higher RAC. - SETI@home needs your help. -
ID: 1126623
Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1126695 - Posted: 10 Jul 2011, 22:39:08 UTC - in response to Message 1126615.  

Doesn't look like the replica is gaining any ground. Still just over 200,000 seconds behind.

Matt is right.. there's something fishy going on there. There's no reason on a hardware level that the replica can't keep up.

Not really fishy... the MB assimilators are catching up pretty well, and that's additional load on the database. We have now reached the point where assimilated results are purged from the database at the speed they were assimilated, and that was faster than usual (that is, faster than without a backlog of over 1,000,000). And then for the last few hours we were returning around 80,000 results per hour. So the very fact that it's not falling further behind, as it sometimes did in the past under even lighter load, is IMO a good thing.
ID: 1126695
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1126711 - Posted: 11 Jul 2011, 17:14:29 UTC - in response to Message 1126708.  

Welcome back!

Meowza!
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1126711
ToxicTBag
Joined: 5 Feb 10
Posts: 101
Credit: 57,197,902
RAC: 0
United Kingdom
Message 1126713 - Posted: 11 Jul 2011, 17:16:35 UTC

Phew! Was really getting worried this time. WB all :-)
ID: 1126713
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1126715 - Posted: 11 Jul 2011, 17:19:02 UTC

Ok, where did everybody go? And why wasn't I invited? :-)


PROUD MEMBER OF Team Starfire World BOINC
ID: 1126715
Lint trap
Joined: 30 May 03
Posts: 871
Credit: 28,092,319
RAC: 0
United States
Message 1126717 - Posted: 11 Jul 2011, 17:21:31 UTC



Checking my bp...should be getting back to normal now....Lt
ID: 1126717
S@NL - XP_Freak
Joined: 10 Jul 99
Posts: 99
Credit: 6,248,265
RAC: 0
Netherlands
Message 1126722 - Posted: 11 Jul 2011, 17:32:54 UTC

I almost panicked.

Goodbye Seti Classic
ID: 1126722
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1126727 - Posted: 11 Jul 2011, 17:37:54 UTC

I didn't panic---quite.
I was a bit worried though. It's not often the server crashes take the forums down with them.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1126727
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1126733 - Posted: 11 Jul 2011, 17:50:27 UTC

And, Hmmmmmmmm.....
They're splitting AP work again.
Hopefully that's a sign some progress has been made on the AP issues, and not just an oversight when bringing the servers back online.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1126733
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1126735 - Posted: 11 Jul 2011, 17:54:56 UTC - in response to Message 1126733.  

And, Hmmmmmmmm.....
They're splitting AP work again.
Hopefully that's a sign some progress has been made on the AP issues, and not just an oversight when bringing the servers back online.



Yeah, now if they could just do something with those 69k+ workunits awaiting validation we might get somewhere.



PROUD MEMBER OF Team Starfire World BOINC
ID: 1126735
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1126740 - Posted: 11 Jul 2011, 18:10:18 UTC - in response to Message 1126735.  


One of the download servers appears to still be down.
Most of my downloads time out as soon as they start.
Grant
Darwin NT
ID: 1126740
Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 1126743 - Posted: 11 Jul 2011, 18:17:57 UTC - in response to Message 1126504.  

Yes...but why did you say 'Burma'?


I panicked.

ID: 1126743
Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 1126767 - Posted: 11 Jul 2011, 19:13:42 UTC - in response to Message 1126747.  

Yes...but why did you say 'Burma'?

I panicked.


[BSM Williams voice on]
I will not have panicking in my jungle!
[BSM Williams voice off]

"Shoulders back, lovely boy".
ID: 1126767
Dimly Lit Lightbulb 😀
Volunteer tester
Joined: 30 Aug 08
Posts: 15399
Credit: 7,423,413
RAC: 1
United Kingdom
Message 1126858 - Posted: 11 Jul 2011, 21:49:33 UTC

That was a bit of a panicky moment there lads. But a stiff upper lip, and a cup of tea, helped in the meantime :).
ID: 1126858
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1126865 - Posted: 11 Jul 2011, 22:04:20 UTC - in response to Message 1126619.  

Well, it is gaining ground, but slowly. On 8 Jul 2011, 22:21:59 UTC it was 71 hours behind, now it is something like 59 hours behind.

It's going painfully slow though, so something fishy is going on. Maybe all the slowness and problems we have, not only with the replica, but with many other services, is because we have so many results in the different databases now, that we are reaching the limits of what the current database software, and hardware, can handle in real time.


I know it has gained some ground, but now since I've been paying attention to it, over the past ~36 hours, it is still sitting at 211,000 seconds behind. Hasn't gained or lost any ground.
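
Since the thread mixes hours and seconds for the replica lag, here is a rough Python sketch that puts the quoted figures on one scale. The function and variable names are made up for illustration. The only inputs are the numbers quoted above: 71 hours behind on 8 Jul, and 211,000 seconds behind now.

```python
SECONDS_PER_HOUR = 3600


def seconds_to_hours(seconds):
    """Convert a lag in seconds to hours."""
    return seconds / SECONDS_PER_HOUR


def hours_to_seconds(hours):
    """Convert a lag in hours to seconds."""
    return hours * SECONDS_PER_HOUR


# Figures quoted in this thread (hypothetical variable names).
lag_8_jul_s = hours_to_seconds(71)  # "71 hours behind" on 8 Jul 2011
lag_now_s = 211_000                 # "211,000 seconds behind" now

lag_now_h = seconds_to_hours(lag_now_s)
gained_s = lag_8_jul_s - lag_now_s  # ground made up since 8 Jul

print(f"current lag: {lag_now_h:.1f} h")
print(f"gained since 8 Jul: {gained_s} s ({seconds_to_hours(gained_s):.1f} h)")
```

211,000 seconds works out to roughly 58.6 hours, so the "something like 59 hours" figure and the 211,000-second figure describe the same lag.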

Not really fishy... the MB assimilators are catching up pretty well, and that's additional load on the database. We have now reached the point where assimilated results are purged from the database at the speed they were assimilated, and that was faster than usual (that is, faster than without a backlog of over 1,000,000). And then for the last few hours we were returning around 80,000 results per hour. So the very fact that it's not falling further behind, as it sometimes did in the past under even lighter load, is IMO a good thing.


I do understand there's a lot of stress on the database(s), but the replica should be able to keep up with the master. The master does most of the work, and the replica just has to copy the changes. Aside from certain things like the tasks pages usually coming from the replica rather than the master, the apparent lack of performance doesn't add up when you look at the hardware being nearly identical between the two. This implies some kind of software limitation, assuming there are no issues with the hardware.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1126865
B-Man
Volunteer tester
Joined: 11 Feb 01
Posts: 253
Credit: 147,366
RAC: 0
United States
Message 1126886 - Posted: 11 Jul 2011, 23:09:30 UTC - in response to Message 1126865.  
Last modified: 11 Jul 2011, 23:19:26 UTC

Well, it is gaining ground, but slowly. On 8 Jul 2011, 22:21:59 UTC it was 71 hours behind, now it is something like 59 hours behind.

It's going painfully slow though, so something fishy is going on. Maybe all the slowness and problems we have, not only with the replica, but with many other services, is because we have so many results in the different databases now, that we are reaching the limits of what the current database software, and hardware, can handle in real time.


I know it has gained some ground, but now since I've been paying attention to it, over the past ~36 hours, it is still sitting at 211,000 seconds behind. Hasn't gained or lost any ground.

Not really fishy... the MB assimilators are catching up pretty well, and that's additional load on the database. We have now reached the point where assimilated results are purged from the database at the speed they were assimilated, and that was faster than usual (that is, faster than without a backlog of over 1,000,000). And then for the last few hours we were returning around 80,000 results per hour. So the very fact that it's not falling further behind, as it sometimes did in the past under even lighter load, is IMO a good thing.


I do understand there's a lot of stress on the database(s), but the replica should be able to keep up with the master. The master does most of the work, and the replica just has to copy the changes. Aside from certain things like the tasks pages usually coming from the replica rather than the master, the apparent lack of performance doesn't add up when you look at the hardware being nearly identical between the two. This implies some kind of software limitation, assuming there are no issues with the hardware.

Jocelyn is nowhere near the same spec as Carolyn. Jocelyn is a 2004-2005 model Sun server; Carolyn is from 2010, so 5-6 years younger, with over 3X the memory of Jocelyn. I say be glad it handles even 1/4 of the amount Carolyn does. The project only has 3 modern servers: Carolyn, Oscar and Synergy. All the others are 3-10 years old. That is ancient in terms of server hardware and for the most part functionally obsolete, but we keep using them in many cases because we can't afford to replace them.

Edit to add: Carolyn has 8 CPU cores and Jocelyn has 4 CPUs (4 cores, one per chip). So we're looking at half the number of cores on an older, slower chip architecture, so don't tell me they are almost the same.
ID: 1126886
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1126915 - Posted: 12 Jul 2011, 2:19:07 UTC - in response to Message 1126886.  

I guess when Jeff upgraded MySQL on Jocelyn to the latest stable version Friday he was hoping that would make a significant improvement. If the lag had kept dropping as it did for the first day it would be below 150000 seconds by now. At least there haven't been any big backsteps, and Jeff probably has additional plans if the other servers will give him time to try them.
                                                                  Joe
ID: 1126915
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1126955 - Posted: 12 Jul 2011, 7:26:25 UTC
Last modified: 12 Jul 2011, 7:27:53 UTC

Jocelyn is nowhere near the same spec as Carolyn. Jocelyn is a 2004-2005 model Sun server; Carolyn is from 2010, so 5-6 years younger, with over 3X the memory of Jocelyn. I say be glad it handles even 1/4 of the amount Carolyn does. The project only has 3 modern servers: Carolyn, Oscar and Synergy. All the others are 3-10 years old. That is ancient in terms of server hardware and for the most part functionally obsolete, but we keep using them in many cases because we can't afford to replace them.

Edit to add: Carolyn has 8 CPU cores and Jocelyn has 4 CPUs (4 cores, one per chip). So we're looking at half the number of cores on an older, slower chip architecture, so don't tell me they are almost the same.


My bad.. I was thinking carolyn and oscar were still operating as master and replica. I guess I didn't pay any attention to recent configurations.

In that case.. I can easily see why jocelyn can't keep up. She was the best server we had before the new ones arrived, and is why new servers were acquired.

Unrelated: [/panic] Almost ran out of "ready to start" tasks in my cache. One request granting 14 APs averted that crisis.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1126955


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.