Panic Mode On (21) Server problems

Message boards : Number crunching : Panic Mode On (21) Server problems

Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1381
Credit: 54,506,847
RAC: 60
United States
Message 920440 - Posted: 22 Jul 2009, 21:13:24 UTC - in response to Message 920437.  
Last modified: 22 Jul 2009, 21:14:12 UTC

Probably not before Monday when Matt and Jeff get back.

P.S. I should just stop saying we're working. We're maxed out again.
@SETIEric@qoto.org (Mastodon)

ID: 920440
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14532
Credit: 200,643,578
RAC: 874
United Kingdom
Message 920448 - Posted: 22 Jul 2009, 21:28:00 UTC - in response to Message 920440.  

Probably not before Monday when Matt and Jeff get back.

P.S. I should just stop saying we're working. We're maxed out again.

Every time you say that, another ten thousand fingers hit the retry button. You should know that by now.
ID: 920448
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 920449 - Posted: 22 Jul 2009, 21:31:41 UTC

Well hey, after the fcgi configuration in the <virtualhost> stuff got sorted out earlier, I was able to push about 30 pending uploads through, and then they just stopped. Oh well, not a big deal. I'm sure most of the result files for my shorties will manage to get through before someone else downloads, crunches, and tries to upload.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 920449
Miklos M.
Joined: 5 May 99
Posts: 955
Credit: 136,115,648
RAC: 73
Hungary
Message 920452 - Posted: 22 Jul 2009, 21:41:49 UTC

Thank you, Pappa and everyone else. Looks like I have happy campers, ooops, computers here with their stomachs full of WUs.
ID: 920452
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 920460 - Posted: 22 Jul 2009, 21:53:48 UTC
Last modified: 22 Jul 2009, 21:54:20 UTC


During the day I was able to upload all my results, sometimes fast, sometimes slowly.

But now, again:
Temporarily failed upload of xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: HTTP error

The download speed to my client is also ~50% or more slower, judging by the transfer overview.

Hmm.. the old BOINC Manager showed the KB/s in the messages after every transfer. The new one doesn't. Is it possible to enable this with cc_config.xml?

ID: 920460
Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 920471 - Posted: 22 Jul 2009, 22:21:39 UTC - in response to Message 920460.  

Hmm.. the old BOINC Manager showed the KB/s in the messages after every transfer. The new one doesn't. Is it possible to enable this with cc_config.xml?

Yes, with <file_xfer_debug>, but that also gives other output.
22/07/2009 22:12:08|SETI@home|Started upload of 17oc08aa.21970.2526.6.8.124_0_0
22/07/2009 22:12:08||[file_xfer_debug] URL: http://setiboincdata.ssl.berkeley.edu/sah_cgi/file_upload_handler
22/07/2009 22:12:10||[file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
22/07/2009 22:12:13||[file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
22/07/2009 22:12:13||[file_xfer_debug] file transfer status 0
22/07/2009 22:12:13|SETI@home|Finished upload of 17oc08aa.21970.2526.6.8.124_0_0
22/07/2009 22:12:13|SETI@home|[file_xfer_debug] Throughput 14119 bytes/sec
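For reference, a minimal cc_config.xml enabling just this flag might look like the sketch below (assuming the file sits in the BOINC data directory and is picked up on a client restart or a re-read of the config file):

```xml
<!-- Minimal sketch of cc_config.xml with only the file_xfer_debug flag;
     place it in the BOINC data directory. -->
<cc_config>
  <log_flags>
    <file_xfer_debug>1</file_xfer_debug>
  </log_flags>
</cc_config>
```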

Regards,
Gundolf
Computers aren't everything in life. (Just a little joke.)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours
ID: 920471
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 920479 - Posted: 22 Jul 2009, 22:52:46 UTC - in response to Message 920471.  

Hmm.. the old BOINC Manager showed the KB/s in the messages after every transfer. The new one doesn't. Is it possible to enable this with cc_config.xml?

Yes, with <file_xfer_debug>, but that also gives other output.
22/07/2009 22:12:08|SETI@home|Started upload of 17oc08aa.21970.2526.6.8.124_0_0
22/07/2009 22:12:08||[file_xfer_debug] URL: http://setiboincdata.ssl.berkeley.edu/sah_cgi/file_upload_handler
22/07/2009 22:12:10||[file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
22/07/2009 22:12:13||[file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
22/07/2009 22:12:13||[file_xfer_debug] file transfer status 0
22/07/2009 22:12:13|SETI@home|Finished upload of 17oc08aa.21970.2526.6.8.124_0_0
22/07/2009 22:12:13|SETI@home|[file_xfer_debug] Throughput 14119 bytes/sec

Regards,
Gundolf

That's true, file_xfer_debug will in fact tell you the throughput; however, that one specific line used to be standard output, without any special debugging flags, in the old 5.x clients. When 6.x came along, they removed it, since it was informational rather than important.

I do also miss having the throughput message in the messages tab, but oh well. 6.2.19 works great for me, and I don't have CUDA, so I'll stick with it.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 920479
Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 920482 - Posted: 22 Jul 2009, 23:01:06 UTC - in response to Message 920479.  

...however, that one specific line used to be standard output, without any special debugging flags, in the old 5.x clients. When 6.x came along, they removed it, since it was informational rather than important...

Not quite :-)

It should be: when 5.10.x came along... 5.8.16 shows the line without a debug flag. Somewhere between 5.8.16 and 5.10.45 it disappeared.
ID: 920482
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15183
Credit: 4,362,181
RAC: 3
Netherlands
Message 920500 - Posted: 23 Jul 2009, 0:16:03 UTC - in response to Message 920482.  

As far as I can find, it happened somewhere after revision 13804, so that means somewhere around 5.10.23/.24.
ID: 920500
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 920504 - Posted: 23 Jul 2009, 0:23:15 UTC

Alright then, I stand mostly corrected. I was at least right that it was in the 5 series and not the 6 series.. :p
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 920504
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 920518 - Posted: 23 Jul 2009, 1:02:10 UTC
Last modified: 23 Jul 2009, 1:03:07 UTC


Thanks! :-)

It would be better if the debug message output were a little bit smaller.

Or maybe make a new debug log_flag only for the KB/s?

My GPU cruncher makes ~860.. oops.. ;-) ..now half that.. ~430 AR=0.44x WU/result uploads per day. After updating to the new nVIDIA driver and CUDA V2.3 (~30% faster).. ~560 result uploads per day.

Without a new debug log_flag, the message overview in the BOINC Manager would be overfilled.. ;-)

EDIT: Oops.. I forgot the downloads.. ;-)

ID: 920518
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 920521 - Posted: 23 Jul 2009, 1:32:58 UTC

Well, I think it might be safe to say that the upload backlogs have finally pushed through. All of my machines are uploading finished results on the first try, and my backlogs are gone.

One thing I noticed earlier when getting rid of the ~10 pending uploads I had: one at a time worked great about 80% of the time. Selecting multiples and hitting retry would result in one going through and the rest getting an HTTP error.

I know that 'retry now' button is a hot topic around here, but I don't have hundreds of tasks waiting to go through.. just a small handful that are insignificant in the grand scheme of things.

At any rate, like I said.. I have no more backlogged uploads, and the tasks that do finish are going right through. My cache is slowly filling up with the new, longer WUs, so having ~250 WUs in the list for a 4-day cache on a 4-core machine should technically drop to ~125 WUs. Though I am still anxiously waiting for a new AP to be given to me.

I know they're out there.. the splitters keep burning through the tapes and the RTS queue is always near zero, while the Results in the Field count continues to climb.
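The cache arithmetic above can be sketched as follows; the per-task runtimes are invented placeholder numbers, and only the halving relationship (WUs taking twice as long means half as many fit in a fixed-length cache) comes from the post:

```python
# Rough sketch of the cache arithmetic in the post above. If the new
# workunits take twice as long to crunch, a cache of fixed duration
# holds half as many of them. The runtimes below are made-up
# placeholder numbers, not measured SETI@home figures.

def tasks_in_cache(cache_days: float, cores: int, hours_per_task: float) -> int:
    """Number of tasks a cache of `cache_days` holds on a `cores`-core host."""
    # Total core-hours in the cache window divided by per-task runtime.
    return int(cache_days * 24 * cores / hours_per_task)

old = tasks_in_cache(cache_days=4, cores=4, hours_per_task=1.5)  # shorter WUs
new = tasks_in_cache(cache_days=4, cores=4, hours_per_task=3.0)  # twice as long
print(old, new)  # 256 128 -- the count halves, like the ~250 -> ~125 above
```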
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 920521
Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 920530 - Posted: 23 Jul 2009, 2:11:30 UTC - in response to Message 920521.  

If I look at the Server Status Page, I see there is a large backlog of things to be purged that would normally have disappeared during the weekly maintenance. They were left so that users could get credit without other hassles. This places a great deal of stress on the master database (we have read about those crashes before), and the indications are that the replica is quite a bit behind.

So, as Eric stated, the staff have worked for a solid three days on the TCP settings of the upload server, and the log jam should be broken. The next 24 hours should tell. Uploads are now higher than they have been in weeks.
From a phone conversation with Eric: at this point, those machines that have been patiently waiting and retrying should be sending uploads. That is good. The more impatient probably have their finger on the Retry Now button. That is okay; if it is clearing work so that you can get more, that is fine. But as fast as the splitters can split and the download servers can deliver, it all continues to add to the database volume, which is why things are turned off...

That requires another payback.

If someone inadvertently increased the size of their cache (due to bad advice) during this period, please turn it back down; it is not helping. Sanity would say never more than 4 days, or advanced settings that allow quicker reporting while still maintaining a cache.

Thank You All

Regards

Well, I think it might be safe to say that the upload backlogs have finally pushed through? All of my machines are uploading finished results on the first try, and my backlogs are gone.

One thing I noticed earlier when getting rid of the ~10 pending uploads I had.. one at a time worked great about 80% of the time. Selecting multiples and hitting retry would result in one going through and the rest getting HTTP error.

I know that 'retry now' button is a hot topic around here, but I don't have hundreds of tasks waiting to go through..just a small handful that are insignificant in the grand scheme of things.

At any rate, like I said.. I have no more backlogged uploads, and the tasks that do finish are going right through. My cache is slowly filling up with the new, longer WUs, so having ~250 WUs in the list for a 4-day cache on a 4-core machine should technically drop to ~125 WUs. Though I am still anxiously waiting for a new AP to be given to me.

I know they're out there..the splitters keep burning through the tapes and the RTS queue is always near zero, while the Results in the Field count continues to climb.


Please consider a Donation to the Seti Project.

ID: 920530
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 920543 - Posted: 23 Jul 2009, 2:46:17 UTC

I'm sensing something amiss in the background. Two hours ago, the replica was ~9,000 seconds behind. Now it is closing in on ~10,500 seconds behind.

I know there are a lot of things going on with the database, but the task pages are turned off and the replica still can't keep up.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 920543
Nicolas
Joined: 30 Mar 05
Posts: 161
Credit: 12,985
RAC: 0
Argentina
Message 920549 - Posted: 23 Jul 2009, 3:20:35 UTC - in response to Message 919831.  

If the project, being on campus, can't even attract STUDENT VOLUNTEERS to help 'man' the jobs, what does that say about the project? And how OUTSIDE people look at it? ....not to mention prospective sponsors and funding sources.

Student volunteers wrote the initial BOINC project website code, which explains why the code still sucks very badly.

Let's not stoop to insulting people, please; it is not at all productive.

Quote from Rytis Slatkevicius, admin of Primegrid and former maintainer of BOINC web code:

Don't credit broken code to me, I did not write it ;) I believe the parts of the code that you're talking about were written by undergrad students in Berkeley, at the time BOINC was just starting.


Contribute to the Wiki!
ID: 920549
Vistro
Joined: 6 Aug 08
Posts: 233
Credit: 316,549
RAC: 0
United States
Message 920552 - Posted: 23 Jul 2009, 3:35:18 UTC - in response to Message 920044.  

hey guys...

This page flat out won't load on the DSi....


Are there any forum settings that control how many posts are shown to me at a given time? I have already disabled images in the browser.
ID: 920552
OzzFan (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 920554 - Posted: 23 Jul 2009, 3:38:54 UTC - in response to Message 920552.  

Yes. In your Community Preferences, under the "Message Display" section ("How to sort"), there are two boxes where you can enter how many posts to display after a given number of posts have been created in a thread.
ID: 920554
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 920565 - Posted: 23 Jul 2009, 4:02:55 UTC

Now the replica is only ~8,000 seconds behind, so it seems to be catching up.

On a side note, I changed venues and managed to get one ap_v505 after 23 work requests (and I was not hitting the 'update' button, either..).
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 920565
Vistro
Joined: 6 Aug 08
Posts: 233
Credit: 316,549
RAC: 0
United States
Message 920570 - Posted: 23 Jul 2009, 4:29:08 UTC - in response to Message 920530.  
Last modified: 23 Jul 2009, 4:39:53 UTC

If I look at the Server Status Page, I see there is a large backlog of things to be purged that would normally have disappeared during the weekly maintenance. They were left so that users could get credit without other hassles. This places a great deal of stress on the master database (we have read about those crashes before), and the indications are that the replica is quite a bit behind.

So, as Eric stated, the staff have worked for a solid three days on the TCP settings of the upload server, and the log jam should be broken. The next 24 hours should tell. Uploads are now higher than they have been in weeks.
From a phone conversation with Eric: at this point, those machines that have been patiently waiting and retrying should be sending uploads. That is good. The more impatient probably have their finger on the Retry Now button. That is okay; if it is clearing work so that you can get more, that is fine. But as fast as the splitters can split and the download servers can deliver, it all continues to add to the database volume, which is why things are turned off...

That requires another payback.

If someone inadvertently increased the size of their cache (due to bad advice) during this period, please turn it back down; it is not helping. Sanity would say never more than 4 days, or advanced settings that allow quicker reporting while still maintaining a cache.

Thank You All

Regards




So happy to know that this is all behind us now. Maybe now the owner of the dual-core will stop calling me, freaking out about the failed uploads.

I have turned down the cache of the 8-core from 10 days to 1. Now it just has to burn through its current stockpile. I won't, therefore, be bothering the download server for, oh, 10 days; 5 days if these optimized apps are working like they should.


EDIT: I don't know why this is really making me scratch my head..

Results ready to send: For each workunit, "empty" results are generated that are then sent out to individual users to be filled with data. This is the number of excess empty results ready to be sent out, i.e. a backlog in case demand exceeds the current rate of creation.


You're telling me that it stockpiles tens of thousands of identical files, sends one off, then deletes it? Why can't it keep ONE, send it off, and keep it once it's done?

Is there a value that says: here's how many have been split and are now ready to be fired!

What's the difference between a workunit and a result?
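For what it's worth, the workunit/result distinction can be sketched like this; the class and field names are invented for illustration, and only the one-workunit-to-several-results relationship reflects BOINC's actual design:

```python
# Hypothetical sketch (invented names) of how BOINC relates workunits
# and results: one workunit holds the science data, and the scheduler
# creates several "result" records for it, one per host it is sent to.
# "Results ready to send" counts those result records, not copies of
# the data file itself.
from dataclasses import dataclass

@dataclass
class Workunit:
    name: str             # e.g. the name seen in upload logs
    replication: int = 2  # assumed initial replication for validation

    def make_results(self) -> list:
        # Each result shares the workunit's input data but has its own
        # identity, so different hosts can crunch it independently.
        return [f"{self.name}_{i}" for i in range(self.replication)]

wu = Workunit("17oc08aa.21970.2526.6.8.124")
print(wu.make_results())  # two result names derived from one workunit
```

Under a model like that, keeping "one file" wouldn't be enough: each outstanding copy needs its own record to track which host has it and whether it came back.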
ID: 920570
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13399
Credit: 208,696,464
RAC: 304
Australia
Message 920586 - Posted: 23 Jul 2009, 5:18:19 UTC - in response to Message 920570.  


Don't know what Eric did, but it worked.
When I got back from work there were no uploads waiting to get through, and looking at my log, they've been uploading on the first attempt as they've completed.
And this is with network traffic that would previously result in next to nothing uploading until it had tried anywhere from 8 to 15 times.
Grant
Darwin NT
ID: 920586
©2022 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.