Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
Oz Joined: 6 Jun 99 Posts: 233 Credit: 200,655,462 RAC: 212
I am sure it will all be fixed during the next "maintenance" on Tuesday.... See? I said it would all be fixed during the next "maintenance", but since I received no response or acknowledgment, I responded to myself - now my clique is more exclusive than yours! I will be here another 20 years NOT waiting for an answer. Thanks again to everyone at SAH for keeping things running for over twenty years. I whine a lot, but I really do understand how much you have done... Member of the 20 Year Club
xpozd Joined: 26 Jan 15 Posts: 88 Credit: 280,183 RAC: 1
@Jimbocous thanks for the reply. I re-ran the Lunatics installer and re-generated the app_info.xml, then tried getting new tasks again, but I keep getting the same notices:
- Project communication failed: attempting access to reference site
- Internet access OK - project servers may be temporarily down.
This seems like a longer than usual time to go with no tasks at all.
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
. . I have been thinking about that. Maybe if hosts were grouped in about half a dozen classes based on daily returns. Such as Class A up to 50 WUs per day, Class B 50 - 150/day etc. And then assign work with the guideline to not send the second copy to any host that is more than 1 or 2 classes different from the first. That should reduce a large part of the prolonged pending backlog.
This wouldn't help much. Two equally powerful hosts could have wildly different queue sizes. Better to group them by their average turnaround times. If a host returns its results in 12 hours from obtaining them, what does it matter whether it processed two or two thousand tasks during those 12 hours?
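As an illustration only, the pairing rule Ville describes could be sketched like this. The class boundaries, the one-class gap, and the function names are all hypothetical, not anything the SETI@home scheduler actually implements:

```python
from bisect import bisect

# Hypothetical class boundaries in hours of average turnaround time.
CLASS_BOUNDS = [6, 12, 24, 72, 168]

def turnaround_class(avg_turnaround_hours: float) -> int:
    """Map a host's average turnaround time to a class index.
    Class 0 is the fastest (< 6 h); class 5 the slowest (>= 168 h)."""
    return bisect(CLASS_BOUNDS, avg_turnaround_hours)

def may_pair(host_a_hours: float, host_b_hours: float, max_gap: int = 1) -> bool:
    """Allow the second copy of a workunit to go to host B only if its
    turnaround class is within `max_gap` classes of host A's."""
    return abs(turnaround_class(host_a_hours) - turnaround_class(host_b_hours)) <= max_gap

print(may_pair(3, 11))   # True:  classes 0 and 1
print(may_pair(3, 200))  # False: classes 0 and 5
```

The key point of the turnaround-time grouping, as opposed to grouping by daily returns, is that a slow cache-heavy host and a fast cache-light host can report at the same pace, so pairing by pace keeps the pending backlog short regardless of throughput.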
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
For Me the problems with the Failing Uploads seem to be getting Worse. This morning I found all machines, except the fastest one, working fine. The top Mining machine was clogged with failed Uploads, dozens of them. The only machine without any Uploads waiting on retries was the slowest one. Trying to clear the Uploads on that one machine also Failed, countless times. I tried everything, then tried using my USB/Ethernet adapter, which finally allowed the Uploads to clear. But even with the USB adapter I now have an average of 6 retries waiting on that machine. It seems that once you get very many they just Fail altogether and then rapidly start piling up until the Downloads stop. At that point it becomes difficult to get the Uploads to clear. It's Not getting any better...
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
Not true - a single result occupies "one unit" of database space, while 90 occupy 90 units of database space.
Space and spacetime are different things. One result lasting 90 days consumes one unit of row-days every day for 90 days, capacity that could have supported 90 results if each lasted only a day. If results on average lasted twice as long, the average number of results in the database at any moment would also double. There are other kinds of resources consumed when results are created, deleted, or change state, and those don't depend on how long the results spend in the database; but the recent problems were caused by the database swelling too big to fit in RAM, which severely hurt server performance. That depends purely on row counts, so long-lived results are bad.
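Ville's point is essentially Little's law: the average number of rows in the result table equals the result creation rate multiplied by the average lifetime of a row. A quick back-of-the-envelope check (the ~200k results/hour rate is mentioned elsewhere in the thread; the lifetimes are illustrative):

```python
def avg_rows_in_table(results_per_hour: int, avg_lifetime_days: float) -> float:
    """Little's law: average population = arrival rate * average time in system."""
    return results_per_hour * 24 * avg_lifetime_days

# At ~200k results/hour, every extra day of average row lifetime
# adds roughly 4.8 million rows to the steady-state table size.
print(avg_rows_in_table(200_000, 1.0))  # 4800000.0
print(avg_rows_in_table(200_000, 2.0))  # 9600000.0
```

This is why halving the average turnaround (or purging validated results promptly) shrinks the table even when the creation rate stays the same.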
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799
. . I have been thinking about that. Maybe if hosts were grouped in about half a dozen classes based on daily returns. Such as Class A up to 50 WUs per day, Class B 50 - 150/day etc. And then assign work with the guideline to not send the second copy to any host that is more than 1 or 2 classes different from the first. That should reduce a large part of the prolonged pending backlog.
This wouldn't help much. Two equally powerful hosts could have wildly different queue sizes. Better to group them by their average turnaround times. If a host returns its results in 12 hours from obtaining them, what does it matter whether it processed two or two thousand tasks during those 12 hours?
There is an easy way to solve this. Just make all resends go to the top 50 hosts by daily production. Those hosts return results fast enough to clear the pending backlog.
Wiggo Joined: 24 Jan 00 Posts: 34900 Credit: 261,360,520 RAC: 489
@Jimbocous Your old version of BOINC doesn't contain the updated certificates needed to make contact with the servers. I believe there is a workaround for that, but it'll be easier just to update to a later BOINC version. ;-) Cheers.
Freewill Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693
For Me the problems with the Failing Uploads seems to be getting Worse. This morning I found all machines, except the fastest one, working fine. I found the top Mining machine was clogged with failed Uploads, dozens of them. The only machine without any Uploads waiting on retries was the slowest one. Trying to clear the Uploads on the one machine also Failed, countless times. I tried everything, then tried using my USB/Ethernet adapter which finally allowed the Uploads to clear. But, even with the USB adapter I now have an average of 6 retries waiting on that machine. It seems if you get very many they just Fail altogether and then rapidly start piling up until the Downloads stop. At that point it becomes difficult to get the Uploads to clear.
I have seen a few uploads go into retry for a few minutes on each machine. They clear when I hit retry, or clear themselves if I'm not logged on. My hosts have slowly been refilling their caches. Here's the event log info for a recent one:
Sat 25 Jan 2020 03:19:00 PM EST | | Project communication failed: attempting access to reference site
Sat 25 Jan 2020 03:19:00 PM EST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_86094_HIP80163_0111.7431.409.22.45.44.vlar_2_r1267369358_0: transient HTTP error
Sat 25 Jan 2020 03:19:00 PM EST | SETI@home | Backing off 00:03:34 on upload of blc35_2bit_guppi_58691_86094_HIP80163_0111.7431.409.22.45.44.vlar_2_r1267369358_0
Sat 25 Jan 2020 03:19:01 PM EST | | Internet access OK - project servers may be temporarily down.
I hadn't really seen this until today.
JohnDK Joined: 28 May 00 Posts: 1222 Credit: 451,243,443 RAC: 1,127
Have the same upload problems, still, but it seems most if not all need only one retry before finishing the upload.
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
I'm having the same upload troubles as TBar. Constant list of uploads in backoff. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
Everything seems fine here. My queues are full and the occasional upload problems clear themselves on their own in a couple of minutes. The funny thing is that the failed uploads quote long backoff times, but despite that they retry on their own after a minute or two. But when I look at my tasks on the web site, things seem less fine. The pages take forever to open and the numbers don't look healthy. I have over 7000 tasks in 'Valid' state but my daily production is only about 2000 tasks, so it looks like there are 3.5 days worth of tasks there, and whatever software is purging the database is not doing its job of trimming the list to one day.
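The 3.5-day figure is simple division; as a sketch (the function name is illustrative, the numbers come from the post above):

```python
def purge_backlog_days(validated_rows: int, daily_production: int) -> float:
    """Rough estimate of how many days' worth of validated results the
    database purger has left un-purged on one host's task list."""
    return validated_rows / daily_production

print(purge_backlog_days(7000, 2000))  # 3.5
```

If the purger were keeping up, trimming the list to one day as Ville expects, this ratio would hover near 1.0 instead of 3.5.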
rob smith Joined: 7 Mar 03 Posts: 22227 Credit: 416,307,556 RAC: 380
I'm not sure where you get your idea of the database structure, but it isn't supported by the published database schema. Your concept would be horrendously inefficient to implement, both in terms of space and management. A result returned "early" would require every succeeding copy of the table to be interrogated and have that result "removed" - which might be OK if there were only a couple of days to look at, but as it stands there would need to be about 90 such tables, and a task returned on day one would mean every one of the remaining 89 tables would need to be updated. The schema for the "results" table is:
Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe?
JohnDK Joined: 28 May 00 Posts: 1222 Credit: 451,243,443 RAC: 1,127
Getting no work available for the last hour or so... |
Jimbocous Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349
@xpozd, As someone else has mentioned, you need to update the client to 7.14.2 here, as the security certificate in your current version is no longer valid. You can see the failure in the log with debug options on. Apparently something changed, as several folks on older loads have experienced this. Later, Jim ...
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
I'm not sure where you get your idea of the database structure, but it isn't supported by the published database schema. Your concept would be horrendously inefficient to implement, both in terms of space and management. A result returned "early" would require every succeeding copy of the table to be interrogated and have that result "removed" - which might be OK if there were only a couple of days to look at, but as it stands there would need to be about 90 such tables and a task returned on day one would mean every one of the remaining 89 tables would need to be updated.
Why would there be copies of tables? There's one table and each result occupies a row in it for as long as it exists. When the result is validated and purged, the row is freed. If I have 90 one-day tasks and one 90-day task created every day, then during the first day there are 91 tasks. The second day there are 92 tasks, because yesterday's long task is still there, and the number grows by one every day until the first long task gets purged after the 90th day. After that the database size stays constant at 180 rows. So there is an equal number of short and long tasks in the database although only a bit over 1% of all the tasks are long ones.
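Ville's arithmetic checks out with a short simulation (a sketch: the 90/1 task mix and the 1-day/90-day lifetimes are taken from his example, and a result created on day c is assumed purged after day c + lifetime - 1):

```python
def rows_at_end_of_day(day: int, shorts_per_day: int = 90, longs_per_day: int = 1,
                       short_life: int = 1, long_life: int = 90) -> int:
    """Count results still occupying a row at the end of `day`."""
    shorts = sum(shorts_per_day for c in range(day + 1) if c + short_life > day)
    longs = sum(longs_per_day for c in range(day + 1) if c + long_life > day)
    return shorts + longs

print(rows_at_end_of_day(0))    # 91: 90 short tasks + 1 long task
print(rows_at_end_of_day(1))    # 92: yesterday's long task is still there
print(rows_at_end_of_day(120))  # 180: steady state, equal short and long rows
```

The steady state of 90 short rows and 90 long rows, despite long tasks being only about 1% of creations, is the row-days effect from the earlier post made concrete.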
Mr. Kevvy Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319
I contacted Dr. Korpela and he indicated that there is still some throttling going on to keep the total results below 20M (lest we have the same issue where the results table exceeds memory), which is probably why the BLC splitters were disabled earlier. No doubt the "shorty storm" from blc35_2bit_guppi_58691_* is causing this. In the interim things seem to be improving and I'm getting just enough work to keep my machines busy, so it should be over soon.
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Here we go again: Results received in last hour = 203,416. Too many Instant Overflows. So what's going to fail first, the Server, or my host, which is frequently reporting over a Hundred completed tasks every 5 minutes?
Grant (SSSF) Joined: 19 Aug 99 Posts: 13750 Credit: 208,696,464 RAC: 304
For Me the problems with the Failing Uploads seems to be getting Worse.
Yep. I figure the upload server is struggling more than usual with a sustained return rate of over 200k/hr.
Sun 26 Jan 2020 06:30:11 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_62810_HIP23311_0035.11795.818.22.45.106.vlar_1_r606393021_0: transient HTTP error
Sun 26 Jan 2020 06:31:39 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_85780_HIP80179_0110.16201.409.22.45.89.vlar_2_r2089466694_0: transient HTTP error
Sun 26 Jan 2020 06:34:50 ACST | SETI@home | Temporarily failed upload of 08ja11ae.22268.2521.6.33.155_2_r1586865063_0: transient HTTP error
Sun 26 Jan 2020 06:36:07 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_62810_HIP23311_0035.12536.409.21.44.213.vlar_1_r1213202057_0: transient HTTP error
Sun 26 Jan 2020 06:36:34 ACST | SETI@home | Temporarily failed upload of 21ja20ab.2642.3339.9.36.96_1_r767339215_0: transient HTTP error
Sun 26 Jan 2020 06:37:19 ACST | SETI@home | Temporarily failed upload of 08ja11ae.22268.2521.6.33.155_2_r1586865063_0: transient HTTP error
Sun 26 Jan 2020 06:40:05 ACST | SETI@home | Temporarily failed upload of 21ja20aa.4253.22570.10.37.204_0_r408842142_0: transient HTTP error
Sun 26 Jan 2020 06:40:51 ACST | SETI@home | Temporarily failed upload of 21ja20ab.16816.13155.3.30.230_1_r358958715_0: transient HTTP error
Sun 26 Jan 2020 06:42:35 ACST | SETI@home | Temporarily failed upload of 21ja20ab.2642.4157.9.36.175_1_r965334702_0: transient HTTP error
Sun 26 Jan 2020 06:49:06 ACST | SETI@home | Temporarily failed upload of 21ja20aa.16790.25020.12.39.93_1_r852049672_0: transient HTTP error
Sun 26 Jan 2020 06:52:35 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_62810_HIP23311_0035.15270.0.21.44.143.vlar_0_r70048527_0: transient HTTP error
Sun 26 Jan 2020 06:55:11 ACST | SETI@home | Temporarily failed upload of 21ja20ab.27947.7429.10.37.93_0_r1454248994_0: transient HTTP error
Sun 26 Jan 2020 06:55:20 ACST | SETI@home | Temporarily failed upload of 21ja20ab.27947.7429.10.37.75_1_r932967319_0: transient HTTP error
Sun 26 Jan 2020 06:58:01 ACST | SETI@home | Temporarily failed upload of 21ja20ab.16816.15609.3.30.212_1_r738937942_0: transient HTTP error
Sun 26 Jan 2020 06:58:07 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_63126_HIP23250_0036.20166.818.22.45.21.vlar_2_r1640955414_0: transient HTTP error
Sun 26 Jan 2020 07:05:36 ACST | SETI@home | Temporarily failed upload of 21ja20ab.17859.885.12.39.223_0_r294066185_0: transient HTTP error
Sun 26 Jan 2020 07:06:39 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_62810_HIP23311_0035.17929.409.22.45.64.vlar_0_r749199146_0: transient HTTP error
Sun 26 Jan 2020 07:07:04 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58692_00323_HIP80184_0113.17969.409.22.45.64.vlar_0_r2018004276_0: transient HTTP error
Sun 26 Jan 2020 07:07:10 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_62810_HIP23311_0035.19090.409.21.44.11.vlar_0_r1164732645_0: transient HTTP error
Sun 26 Jan 2020 07:14:35 ACST | SETI@home | Temporarily failed upload of 21ja20ab.15554.22562.6.33.107_1_r168313052_0: transient HTTP error
Sun 26 Jan 2020 07:17:10 ACST | SETI@home | Temporarily failed upload of 21ja20ab.18506.10292.8.35.36_1_r1381483434_0: transient HTTP error
Sun 26 Jan 2020 07:18:08 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_63755_HIP23422_0038.28244.818.22.45.25.vlar_2_r118219340_0: transient HTTP error
Sun 26 Jan 2020 07:19:04 ACST | SETI@home | Temporarily failed upload of 21ja20aa.32052.22975.14.41.168_2_r1293512493_0: transient HTTP error
Sun 26 Jan 2020 07:21:05 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_63755_HIP23422_0038.28244.818.22.45.25.vlar_2_r118219340_0: transient HTTP error
Sun 26 Jan 2020 07:26:13 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58692_00323_HIP80184_0113.24865.0.22.45.22.vlar_1_r1730966582_0: transient HTTP error
Sun 26 Jan 2020 07:28:08 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_64387_HIP23535_0040.24381.818.22.45.202.vlar_0_r45723157_0: transient HTTP error
Sun 26 Jan 2020 07:28:12 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_86094_HIP80163_0111.24330.0.21.44.33_1_r726636297_0: transient HTTP error
Sun 26 Jan 2020 07:28:36 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_63126_HIP23250_0036.24401.818.21.44.115.vlar_1_r866925457_0: transient HTTP error
Sun 26 Jan 2020 07:28:46 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58692_00323_HIP80184_0113.24865.409.22.45.186.vlar_0_r2095728927_0: transient HTTP error
Sun 26 Jan 2020 07:29:05 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_85133_HIP80179_0108.22416.409.21.44.90.vlar_1_r285619254_0: transient HTTP error
Sun 26 Jan 2020 07:30:38 ACST | SETI@home | Temporarily failed upload of 20ja20ad.27685.11110.10.37.6_2_r1999113188_0: transient HTTP error
Sun 26 Jan 2020 07:36:51 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_62810_HIP23311_0035.7486.0.21.44.123.vlar_2_r107567880_0: transient HTTP error
Sun 26 Jan 2020 07:37:22 ACST | SETI@home | Temporarily failed upload of 21ja20ab.27508.11110.5.32.154_0_r1053912088_0: transient HTTP error
Sun 26 Jan 2020 07:38:38 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_85133_HIP80179_0108.12484.818.21.44.184.vlar_2_r1435803581_0: transient HTTP error
Sun 26 Jan 2020 07:38:44 ACST | SETI@home | Temporarily failed upload of blc56_2bit_guppi_58692_82350_HIP80974_0099.4020.818.22.45.93.vlar_2_r18747109_0: transient HTTP error
Sun 26 Jan 2020 07:38:44 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_63755_HIP23422_0038.25417.818.22.45.126.vlar_2_r469231822_0: transient HTTP error
Sun 26 Jan 2020 07:40:10 ACST | SETI@home | Temporarily failed upload of blc35_2bit_guppi_58691_62144_HIP21547_0033.2538.0.21.44.213.vlar_2_r781129339_0: transient HTTP error
Sun 26 Jan 2020 07:41:09 ACST | SETI@home | Backing off 00:07:53 on upload of blc56_2bit_guppi_58692_82350_HIP80974_0099.4020.818.22.45.93.vlar_2_r18747109_0
Sun 26 Jan 2020 07:41:58 ACST | SETI@home | Temporarily failed upload of blc56_2bit_guppi_58692_82350_HIP80974_0099.4020.818.22.45.91.vlar_2_r1378860192_0: transient HTTP error
I'm sure I've missed a few. Grant Darwin NT
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
I contacted Dr. Korpela and he indicated that there still is some throttling going on to keep the total results less than 20M
When I sum all the result counts on the ssp, I get just a bit under 20 million. Does that mean that there are no results corresponding to 'Workunits waiting for assimilation', or that those results are counted in some other category?
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
I'm still trying to understand why, after over two weeks and two long maintenance outages, they apparently haven't made any attempt to reduce the number of completed and validated tasks on the host lists. It doesn't seem they have ever let the assimilators, purgers and deleters run with unfettered freedom to clear the backlogs. That in and of itself would reduce the size of the database. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association)
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.