The Server Issues / Outages Thread - Panic Mode On! (118)

Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029176 - Posted: 25 Jan 2020, 12:23:15 UTC

Another way to reduce database size would be for the servers to pair hosts crunching the same workunit more intelligently: send the wu to hosts with similar average turnaround times.
ID: 2029176
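A rough Python sketch of that matchmaking idea (the names and the 2x threshold are hypothetical, not actual BOINC scheduler code; the server does already track an average turnaround time per host):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Host:
    host_id: int
    avg_turnaround_hours: float  # per-host average the server maintains

def pick_partner(first: Host, candidates: list[Host],
                 max_ratio: float = 2.0) -> Optional[Host]:
    # Pick the candidate whose average turnaround is closest to the
    # first host's, but only if the two differ by at most max_ratio.
    best: Optional[Host] = None
    best_ratio = max_ratio
    for h in candidates:
        slow = max(h.avg_turnaround_hours, first.avg_turnaround_hours)
        fast = min(h.avg_turnaround_hours, first.avg_turnaround_hours)
        ratio = slow / fast if fast > 0 else float("inf")
        if ratio <= best_ratio:
            best, best_ratio = h, ratio
    return best  # None = no close match; fall back to normal feeder order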
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 2029177 - Posted: 25 Jan 2020, 12:23:33 UTC - in response to Message 2029175.  

The size reduction wouldn't be that much.
56% of the tasks in my 'Validation pending' list are ones I returned over 1 week ago.


I am definitely not helping in that regard... the SSD my BOINC install is on died unexpectedly, and trying to recover it is a nightmare...
ID: 2029177
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1859
Credit: 268,616,081
RAC: 1,349
United States
Message 2029178 - Posted: 25 Jan 2020, 12:24:02 UTC - in response to Message 2029175.  

The size reduction wouldn't be that much.
56% of the tasks in my 'Validation pending' list are ones I returned over 1 week ago.

Facts are always better:
Just did a spot check of one of my boxes.
Of ~5600 tasks awaiting validation, more than 2000 are older than 1 Jan, 25 days ago. Many date back to mid-October.
Somehow, I think that's probably typical, and it certainly seems significant to me.
ID: 2029178
rob smith Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22815
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2029179 - Posted: 25 Jan 2020, 12:35:23 UTC - in response to Message 2029178.  

On your highest-scoring computer, 8 tasks date back to October - that's 0.14% of your pendings!!!!
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2029179
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029180 - Posted: 25 Jan 2020, 12:39:01 UTC

Is it more efficient for all the splitters to bunch up on the same file? If they spread out over different files, it would probably dilute these overflow storms significantly.
ID: 2029180
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029181 - Posted: 25 Jan 2020, 12:41:57 UTC - in response to Message 2029179.  

On your highest-scoring computer, 8 tasks date back to October - that's 0.14% of your pendings!!!!
The impact on database size scales with the length of time. A single task lingering for 3 months has the same effect as 90 one-day tasks.
ID: 2029181
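Ville's arithmetic, written out in row-days:

\[
1\ \text{row} \times 90\ \text{days} = 90\ \text{row-days} = 90\ \text{rows} \times 1\ \text{day}
\]

One 90-day straggler occupies as much cumulative table space as ninety one-day results.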
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2029197 - Posted: 25 Jan 2020, 17:14:47 UTC - in response to Message 2029176.  

Another way to reduce database size would be for the servers to pair hosts crunching the same workunit more intelligently: send the wu to hosts with similar average turnaround times.


. . I have been thinking about that. Maybe hosts could be grouped into about half a dozen classes based on daily returns - say Class A up to 50 WUs per day, Class B 50-150/day, etc. Then assign work with the guideline that the second copy never goes to a host more than 1 or 2 classes away from the first. That should clear a large part of the prolonged pending backlog.

. . Just a thought.

Stephen

? ?
ID: 2029197
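A minimal sketch of that class scheme in Python (only the Class A and B boundaries come from Stephen's post; the remaining boundaries are invented placeholders):

# Upper bounds in WUs returned per day: Class A = up to 50,
# Class B = 50-150 (Stephen's examples); the rest are made up.
CLASS_BOUNDS = [50, 150, 500, 1500, 5000]

def host_class(daily_returns: float) -> int:
    # 0 = Class A, 1 = Class B, and so on; hosts above the last
    # boundary land in the top class.
    for i, bound in enumerate(CLASS_BOUNDS):
        if daily_returns <= bound:
            return i
    return len(CLASS_BOUNDS)

def may_pair(daily_a: float, daily_b: float, max_gap: int = 2) -> bool:
    # Send the second copy of a workunit only to a host within
    # max_gap classes of the first host.
    return abs(host_class(daily_a) - host_class(daily_b)) <= max_gap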
rob smith Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22815
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2029198 - Posted: 25 Jan 2020, 17:25:42 UTC

There is a very simple reason why that is a bad idea (and this has been discussed many, many times in the past) - the random diversity in pairings is there to reduce the possibility of common-mode errors. Recently we've seen a prime example of such errors in the way certain AMD GPUs were producing wrong results and "ganging up" on other devices that produced correct results.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2029198
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2029199 - Posted: 25 Jan 2020, 17:38:49 UTC - in response to Message 2029198.  

There is a very simple reason why that is a bad idea (and this has been discussed many, many times in the past) - the random diversity in pairings is there to reduce the possibility of common-mode errors. Recently we've seen a prime example of such errors in the way certain AMD GPUs were producing wrong results and "ganging up" on other devices that produced correct results.


. . Which is why I believe the grouping would be more effective. The tasks would be paired with different devices, not identical ones, but they would NOT be so wildly different that tasks sit pending for weeks or months.

Stephen

. .
ID: 2029199
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19714
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2029202 - Posted: 25 Jan 2020, 17:49:25 UTC - in response to Message 2029199.  

There is a very simple reason why that is a bad idea (and this has been discussed many, many times in the past) - the random diversity in pairings is there to reduce the possibility of common-mode errors. Recently we've seen a prime example of such errors in the way certain AMD GPUs were producing wrong results and "ganging up" on other devices that produced correct results.


. . Which is why I believe the grouping would be more effective. The tasks would be paired with different devices, not identical ones, but they would NOT be so wildly different that tasks sit pending for weeks or months.

Stephen

. .

It will increase the chances of an AMD/ATI GPU match as they are liable to have similar times.
ID: 2029202
rob smith Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22815
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2029203 - Posted: 25 Jan 2020, 17:59:26 UTC - in response to Message 2029199.  

Not so - it is highly probable that similar devices would end up in the same set, which would INCREASE the chances of similar devices pairing with each other, and so INCREASE the probability of another event like the one we have just suffered from, where pairs of similar AMD devices were paired and dumped what is almost certainly incorrect data into the database.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2029203
rob smith Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22815
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2029205 - Posted: 25 Jan 2020, 18:12:04 UTC - in response to Message 2029181.  

Not true - a single result occupies "one unit" of database space, while 90 occupy 90 units of database space.

Re-sends, for whatever reason, are very expensive in terms of the number of queries they require - actually a far higher figure than that required for a validation & purge cycle - hence the desire to make sure as many results as possible are not re-sent, by having long deadlines; and one result of decreasing deadlines would be to increase the re-send rate. If you think back a couple of months, there was something of a hiatus after the aborted server software update when re-send was turned on and the qps on the server went up dramatically.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2029205
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2029208 - Posted: 25 Jan 2020, 18:42:17 UTC - in response to Message 2029205.  

... there was something of a hiatus after the aborted server software update when re-send was turned on and the qps on the server went up dramatically.
That was 're-send of existing task' after the original sending was lost - whether by operator fumble or comms failure. That requires an exhaustive comparison of server records and 'other tasks' reported by hosts on RPC - that's the expensive one.

Deadlines - short or long - are easy. "Not coming back? OK, bye-bye". Create a replacement task and bung it on the back of the to-send queue for anyone to collect.
ID: 2029208
Oz
Joined: 6 Jun 99
Posts: 233
Credit: 200,655,462
RAC: 212
United States
Message 2029210 - Posted: 25 Jan 2020, 18:50:15 UTC - in response to Message 2028145.  

I am sure it will all be fixed during the next "maintenance" on tuesday....

ROTFLMAO

See? I said it would be all fixed during the next "maintenance" but since I received no response or acknowledgment, I responded to myself - now my clique is more exclusive than yours! I will be here another 20 years NOT waiting for an answer.

Thanks again to everyone at sah for keeping things running for over twenty years. I whine a lot but I really do understand how much you have done...
Member of the 20 Year Club
ID: 2029210
xpozd
Joined: 26 Jan 15
Posts: 88
Credit: 280,183
RAC: 1
Canada
Message 2029213 - Posted: 25 Jan 2020, 19:43:12 UTC - in response to Message 2029023.  

@Jimbocous
Thanks for the reply.
I re-ran the Lunatics installer and re-generated the app_info.xml,
then tried getting new tasks again, but I keep getting the same notices:

- Project communication failed: attempting access to reference site
- Internet access OK - project servers may be temporarily down.

This seems like a longer than usual time to go with no tasks at all.

  • win7starter
  • boinc: 7.14.2
  • boinc tasks: 1.78
  • Lunatics Win32 v0.44

ID: 2029213
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029215 - Posted: 25 Jan 2020, 19:54:22 UTC - in response to Message 2029197.  

. . I have been thinking about that. Maybe hosts could be grouped into about half a dozen classes based on daily returns - say Class A up to 50 WUs per day, Class B 50-150/day, etc. Then assign work with the guideline that the second copy never goes to a host more than 1 or 2 classes away from the first. That should clear a large part of the prolonged pending backlog.
This wouldn't help much. Two equally powerful hosts could have wildly different queue sizes. Better to group them by their average turnaround times. If a host returns its results within 12 hours of obtaining them, what does it matter whether it processed two or two thousand tasks during those 12 hours?
ID: 2029215
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2029216 - Posted: 25 Jan 2020, 20:01:21 UTC

For Me the problems with the Failing Uploads seem to be getting Worse. This morning I found all machines, except the fastest one, working fine. The top Mining machine was clogged with failed Uploads, dozens of them. The only machine without any Uploads waiting on retries was the slowest one. Trying to clear the Uploads on the one machine also Failed, countless times. I tried everything, then tried using my USB/Ethernet adapter, which finally allowed the Uploads to clear. But even with the USB adapter I now have an average of 6 retries waiting on that machine. It seems that if you get very many they just Fail altogether and then rapidly start piling up until the Downloads stop. At that point it becomes difficult to get the Uploads to clear.
It's Not getting any better...
ID: 2029216
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029217 - Posted: 25 Jan 2020, 20:05:55 UTC - in response to Message 2029205.  

Not true - a single result occupies "one unit" of database space, while 90 occupy 90 units of database space.
Space and spacetime are different things. One result lasting 90 days consumes one unit of row-days every day for 90 days - the same total that could have supported 90 results lasting one day each. If results on average lasted twice as long, the average number of results in the database at any time would also double.

There are other kinds of resources consumed when results are created, deleted or have their state changed, which don't depend on the time the results spend in the database. But the recent problems were caused by the database swelling too big to fit in RAM, which severely hurt server performance. That depends purely on the row counts, so long-lived results are bad.
ID: 2029217
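What Ville is describing is, in effect, Little's law from queueing theory (the label is mine, not his): if results enter the database at rate \(\lambda\) and live an average of \(\bar{T}\) days, the average number of rows is

\[
\bar{N} = \lambda \, \bar{T},
\]

so at a fixed creation rate, halving the average lifetime halves the steady-state row count - and the RAM the result table needs.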
juan BFP Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2029218 - Posted: 25 Jan 2020, 20:06:32 UTC - in response to Message 2029215.  
Last modified: 25 Jan 2020, 20:11:00 UTC

. . I have been thinking about that. Maybe hosts could be grouped into about half a dozen classes based on daily returns - say Class A up to 50 WUs per day, Class B 50-150/day, etc. Then assign work with the guideline that the second copy never goes to a host more than 1 or 2 classes away from the first. That should clear a large part of the prolonged pending backlog.
This wouldn't help much. Two equally powerful hosts could have wildly different queue sizes. Better to group them by their average turnaround times. If a host returns its results within 12 hours of obtaining them, what does it matter whether it processed two or two thousand tasks during those 12 hours?

There is an easy way to solve this. Just make all resends go to the top 50 hosts by daily production. Those hosts return work fast enough to clear the pending backlogs.
ID: 2029218
Wiggo
Joined: 24 Jan 00
Posts: 38189
Credit: 261,360,520
RAC: 489
Australia
Message 2029219 - Posted: 25 Jan 2020, 20:16:20 UTC - in response to Message 2029213.  

@Jimbocous
Thanks for the reply.
I re-ran the Lunatics installer and re-generated the app_info.xml,
then tried getting new tasks again, but I keep getting the same notices:

- Project communication failed: attempting access to reference site
- Internet access OK - project servers may be temporarily down.

This seems like a longer than usual time to go with no tasks at all.
Your old version of BOINC doesn't contain the updated certificates needed to make contact with the servers. I believe there is a workaround for that, but it'll be easier just to update to a later BOINC version. ;-)

Cheers.
ID: 2029219