Panic Mode On (21) Server problems

Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 920440 - Posted: 22 Jul 2009, 21:13:24 UTC - in response to Message 920437.  
Last modified: 22 Jul 2009, 21:14:12 UTC

Probably not before Monday when Matt and Jeff get back.

P.S. I should just stop saying we're working. We're maxed out again.
@SETIEric@qoto.org (Mastodon)

ID: 920440
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 920448 - Posted: 22 Jul 2009, 21:28:00 UTC - in response to Message 920440.  

Probably not before Monday when Matt and Jeff get back.

P.S. I should just stop saying we're working. We're maxed out again.

Every time you say that, another ten thousand fingers hit the retry button. You should know that by now.
ID: 920448
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 920449 - Posted: 22 Jul 2009, 21:31:41 UTC

Well hey, after the fcgi configuration in the <virtualhost> stuff got sorted out earlier, I was able to push about 30 pending uploads through, and then they just stopped going through. Oh well. Not a big deal. I'm sure most of the result files for my shorties will manage to get through before someone else downloads, crunches, and tries to upload.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 920449
Miklos M.

Joined: 5 May 99
Posts: 955
Credit: 136,115,648
RAC: 73
Hungary
Message 920452 - Posted: 22 Jul 2009, 21:41:49 UTC

Thank you, Pappa and everyone else. Looks like I have happy campers, oops, computers here with their stomachs full of WUs.
ID: 920452
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 920460 - Posted: 22 Jul 2009, 21:53:48 UTC
Last modified: 22 Jul 2009, 21:54:20 UTC


Over the course of the day I was able to upload all my results,
some quickly and some slowly.

But now it is happening again:
Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: HTTP error


Also, the download speed to my client is about 50% slower, or worse,
judging by the 'transmission overview'.

Hmm.. the old BOINC Manager showed the KB/s in the messages after every transfer.
The new one does not.
Is it possible to enable this with cc_config.xml?

ID: 920460
Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 920471 - Posted: 22 Jul 2009, 22:21:39 UTC - in response to Message 920460.  

Hmm.. the old BOINC Manager showed the KB/s in the messages after every transfer.
The new one does not.
Is it possible to enable this with cc_config.xml?

Yes, with <file_xfer_debug>, but that also gives other output.
22/07/2009 22:12:08|SETI@home|Started upload of 17oc08aa.21970.2526.6.8.124_0_0
22/07/2009 22:12:08||[file_xfer_debug] URL: http://setiboincdata.ssl.berkeley.edu/sah_cgi/file_upload_handler
22/07/2009 22:12:10||[file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
22/07/2009 22:12:13||[file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
22/07/2009 22:12:13||[file_xfer_debug] file transfer status 0
22/07/2009 22:12:13|SETI@home|Finished upload of 17oc08aa.21970.2526.6.8.124_0_0
22/07/2009 22:12:13|SETI@home|[file_xfer_debug] Throughput 14119 bytes/sec
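For anyone who only wants the KB/s figure out of that extra output, here is a minimal Python sketch that filters the throughput lines from a saved copy of the messages. It assumes the lines look exactly like the sample above; the function name `throughput_kb_per_sec` is made up for illustration and is not part of BOINC.

```python
import re

# Matches lines such as:
# 22/07/2009 22:12:13|SETI@home|[file_xfer_debug] Throughput 14119 bytes/sec
THROUGHPUT_RE = re.compile(r"\[file_xfer_debug\] Throughput (\d+) bytes/sec")


def throughput_kb_per_sec(log_lines):
    """Return the throughput of each reported transfer in KB/s (1 KB = 1024 bytes)."""
    rates = []
    for line in log_lines:
        match = THROUGHPUT_RE.search(line)
        if match:
            rates.append(int(match.group(1)) / 1024)
    return rates


# Two lines copied from the log excerpt above.
sample = [
    "22/07/2009 22:12:13|SETI@home|Finished upload of 17oc08aa.21970.2526.6.8.124_0_0",
    "22/07/2009 22:12:13|SETI@home|[file_xfer_debug] Throughput 14119 bytes/sec",
]

for rate in throughput_kb_per_sec(sample):
    print(f"{rate:.1f} KB/s")  # prints "13.8 KB/s"
```

All other [file_xfer_debug] lines are simply ignored, so the flag can stay enabled without having to read the rest of its output.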

Regards,
Gundolf
Computers aren't everything in life. (Just kidding)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours
ID: 920471
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 920479 - Posted: 22 Jul 2009, 22:52:46 UTC - in response to Message 920471.  

That is true: file_xfer_debug will in fact tell you the throughput. However, that one specific line used to be standard output, without any special debugging flags, in the old 5.x clients. When 6.x came along, they removed that since it was informational rather than important.

I do also miss having the throughput message in the messages tab, but oh well. 6.2.19 works great for me, and I don't have CUDA, so I'll stick with it.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 920479
Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 920482 - Posted: 22 Jul 2009, 23:01:06 UTC - in response to Message 920479.  

...however, that one specific line used to be standard output without any special debugging flags in the old 5.x clients. When 6.x came along, they removed that since it was informational rather than important...

Not quite :-)

It should be: When 5.10.x came along... 5.8.16 shows the line without a debug flag. Somewhere between 5.8.16 and 5.10.45 it disappeared.
ID: 920482
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 920500 - Posted: 23 Jul 2009, 0:16:03 UTC - in response to Message 920482.  

As far as I can find, it happened somewhere after revision 13804, so that means somewhere around 5.10.23/.24
ID: 920500
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 920504 - Posted: 23 Jul 2009, 0:23:15 UTC

Alright then, I stand mostly corrected. I was at least right that it was in the 5 series and not the 6 series.. :p
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 920504
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 920518 - Posted: 23 Jul 2009, 1:02:10 UTC
Last modified: 23 Jul 2009, 1:03:07 UTC


Thanks! :-)

It would be better if the debug message output were a little smaller.

Or maybe add a new debug log flag only for KB/s?

My GPU cruncher makes ~860 .. oops.. ;-) .. now half that.. ~430 MB AR=0.44x WU/result ULs/day.
After updating to the new NVIDIA driver and CUDA V2.3 (~30% faster).. ~560 result ULs/day.

Without a new debug log flag, the message overview in the BOINC Manager would overflow.. ;-)


EDIT: Oops.. I forgot the DLs.. ;-)

ID: 920518
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 920521 - Posted: 23 Jul 2009, 1:32:58 UTC

Well, I think it might be safe to say that the upload backlogs have finally pushed through? All of my machines are uploading finished results on the first try, and my backlogs are gone.

One thing I noticed earlier when getting rid of the ~10 pending uploads I had.. one at a time worked great about 80% of the time. Selecting multiples and hitting retry would result in one going through and the rest getting HTTP error.

I know that 'retry now' button is a hot topic around here, but I don't have hundreds of tasks waiting to go through..just a small handful that are insignificant in the grand scheme of things.

At any rate, like I said..I have no more backlogged uploads, and the tasks that do finish are going right through. My cache is slowly filling up with the new longer WUs, so having ~250 MBs in the list for a 4-day cache on a 4-core machine should technically drop to ~125 WUs. Though I am still anxiously waiting for a new AP to be given to me.
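The cache estimate above can be checked with a little arithmetic. This is only a sketch under two assumptions that the post implies but does not state: every core runs one task at a time around the clock, and the new longer WUs take about twice as long as the old ones. The function name `tasks_in_cache` is illustrative.

```python
def tasks_in_cache(cache_days, cores, hours_per_task):
    """Number of tasks needed to keep `cores` busy for `cache_days` days."""
    return cache_days * cores * 24 / hours_per_task


# Back out the old per-task runtime from the ~250 tasks quoted in the post.
old_hours = 4 * 4 * 24 / 250   # about 1.5 hours per task
new_hours = 2 * old_hours      # assumption: the new WUs run twice as long

print(round(tasks_in_cache(4, 4, old_hours)))  # 250
print(round(tasks_in_cache(4, 4, new_hours)))  # 125
```

The task count scales inversely with runtime, which is why doubling the WU length halves the list for the same 4-day cache.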

I know they're out there..the splitters keep burning through the tapes and the RTS queue is always near zero, while the Results in the Field count continues to climb.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 920521
Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 920530 - Posted: 23 Jul 2009, 2:11:30 UTC - in response to Message 920521.  

If I look at the Server Status Page I see there is a large backlog of things to get purged that would have disappeared during the weekly maintenance. They were left so that users could get credit without other hassles. This places a great deal of stress on the Master Database (we have read about those crashes before) and the indications are that the replica is quite a bit behind.

So, as Eric stated, with "the Staff" having worked for a solid 3 days on the TCP settings on the upload server, the logjam should be broken. The next 24 hours should tell. Uploads are now higher than they have been in weeks.
In a phone conversation with Eric, at this point those machines that have been patiently waiting and retrying should be sending uploads. That is good. The more impatient probably have their finger on the Retry Now button. That is okay. If it is clearing work so that you can get more, that is fine. But as fast as the splitters can split and the download servers can deliver it, it continues to add to the database volume! That is why things are turned off...

That requires another payback.

If someone inadvertently increased the size of their cache (due to bad advice) during this period, please turn it back down; it is not helping. Sanity would say never more than 4 days, or use advanced settings that allow quicker reporting while still maintaining a cache.

Thank You All

Regards

Please consider a Donation to the Seti Project.

ID: 920530
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 920543 - Posted: 23 Jul 2009, 2:46:17 UTC

I'm sensing something amiss in the background. Two hours ago, the replica was ~9,000 seconds behind. Now it is closing in on ~10,500 seconds behind.

I know there are a lot of things going on with the database, but the task pages are turned off and the replica still can't keep up.
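The two readings above imply a growth rate, sketched here in Python. It assumes the "seconds behind" figure on the status page measures how far the replica trails the master in master time; the function name is illustrative.

```python
def lag_growth_per_hour(lag_start_s, lag_end_s, elapsed_hours):
    """Average change in replica lag, in seconds of lag per wall-clock hour."""
    return (lag_end_s - lag_start_s) / elapsed_hours


# Readings from the post: ~9,000 s behind two hours ago, ~10,500 s now.
growth = lag_growth_per_hour(9_000, 10_500, 2)
print(f"lag grows by about {growth:.0f} s per hour")  # about 750 s per hour

# Assumption: lag is counted in seconds of master time, so the replica
# replays (3600 - growth) seconds of master activity per wall-clock hour.
replay_fraction = (3600 - growth) / 3600
print(f"replica keeps pace with roughly {replay_fraction:.0%} of the master's activity")
```

A negative result from the same function would mean the replica is catching up rather than falling behind.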
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 920543
Nicolas
Joined: 30 Mar 05
Posts: 161
Credit: 12,985
RAC: 0
Argentina
Message 920549 - Posted: 23 Jul 2009, 3:20:35 UTC - in response to Message 919831.  

If the project, being on campus, can't even attract STUDENT VOLUNTEERS to help 'man' the jobs, what does that say about the project? And how OUTSIDE people look at it? ....not to mention prospective Sponsors and funding sources.

Student volunteers wrote the initial BOINC project website code. Which explains why the code still sucks very badly.

Let's not stoop to insulting people, please; it is not at all productive.

Quote from Rytis Slatkevicius, admin of Primegrid and former maintainer of BOINC web code:

Don't credit broken code to me, I did not write it ;) I believe the parts of the code that you're talking about were written by undergrad students in Berkeley, at the time BOINC was just starting.


Contribute to the Wiki!
ID: 920549
Vistro
Joined: 6 Aug 08
Posts: 233
Credit: 316,549
RAC: 0
United States
Message 920552 - Posted: 23 Jul 2009, 3:35:18 UTC - in response to Message 920044.  

hey guys...

This page flat out won't load on the DSi....


Are there any forum settings that control how many posts are shown to me at a given time? I have already disabled images in the browser.
ID: 920552
OzzFan (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 920554 - Posted: 23 Jul 2009, 3:38:54 UTC - in response to Message 920552.  

Yes: in your Community Preferences, under the section "Message Display", next to "How to sort", there are two boxes where you can enter how many posts you want displayed once X number of posts have been created in a thread.
ID: 920554
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 920565 - Posted: 23 Jul 2009, 4:02:55 UTC

Now the replica is only ~8,000 seconds behind, so it seems to be catching up.

On a side-note, I changed venues and managed to get one ap_v505 after 23 work requests (I was not hitting the 'update' button, either..).
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 920565
Vistro
Joined: 6 Aug 08
Posts: 233
Credit: 316,549
RAC: 0
United States
Message 920570 - Posted: 23 Jul 2009, 4:29:08 UTC - in response to Message 920530.  
Last modified: 23 Jul 2009, 4:39:53 UTC

So happy to know that this is all behind us now. Maybe now the owner of the Dual Core will stop calling me freaking out about the failed uploads.

I have turned down the cache of the 8 core from 10 days to 1. Now it just has to burn through its current stockpile. I won't, therefore, be bothering the download server for, oh, 10 days. 5 days if these optimized apps are working like they should.


EDIT: I don't know why this is really making me scratch my head..

Results ready to send: For each workunit, "empty" results are generated that are then sent out to individual users to be filled with data. This is the number of excess empty results ready to be sent out, i.e. a backlog in case demand exceeds the current rate of creation.


You are trying to sell to me that it stockpiles tens of thousands of identical files, sends one off, then deletes it? Why can't it keep ONE, and send it off, and keep it once it's done?

Is there a value that says: Here's how many have been split and are now ready to be fired!

What's the difference between a workunit and a result?
ID: 920570
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 920586 - Posted: 23 Jul 2009, 5:18:19 UTC - in response to Message 920570.  


Don't know what Eric did, but it worked.
When I got back from work there were no uploads waiting to get through, and looking at my log they've been uploading as they've completed on the first attempt.
And this is with network traffic that would previously result in next to nothing uploading, until it had tried anywhere from 8 to 15 times.
Grant
Darwin NT
ID: 920586


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.