Panic Mode On (79) Server Problems?

bill

Joined: 16 Jun 99
Posts: 861
Credit: 29,352,955
RAC: 0
United States
Message 1307624 - Posted: 19 Nov 2012, 5:56:10 UTC - in response to Message 1307586.  

You're welcome MG. Now the grim solution
for the near future can begin.

Think Lifeboat.

https://en.wikipedia.org/wiki/Lifeboat_%28film%29
ID: 1307624 · Report as offensive
KB7RZF
Volunteer tester
Joined: 15 Aug 99
Posts: 9549
Credit: 3,308,926
RAC: 2
United States
Message 1307630 - Posted: 19 Nov 2012, 6:18:01 UTC

I've only got 42 ghost work units on my new laptop, and none on my other slower machines. But I've stopped crunching for a little while here. I know it doesn't matter much, since most machines are set it-forget it setups, but I just hate when my computers can't get work. So working on other projects for now. Gotta keep those processors warm somehow. LOL
ID: 1307630 · Report as offensive
Draconian
Volunteer tester

Joined: 16 Mar 03
Posts: 21
Credit: 1,809,058
RAC: 0
United States
Message 1307662 - Posted: 19 Nov 2012, 8:15:48 UTC
Last modified: 19 Nov 2012, 8:35:07 UTC

Proxy server working perfectly - data rate hitting the scheduler from individual hosts is too high.
This 200 WU limit has only made it worse, I would think - my system keeps asking for tasks about every 5 minutes. I'm thinking, why not open up the queue and have a mandatory backoff after WUs are sent? After all - if you just filled your queue up with 3 or 4 days' worth of data - you don't need to talk to the project for AT LEAST a day.
Just any way that the load can be taken off of the scheduler - not letting the systems ask for data all the time when they don't have to would help.

I have 200 wu queue - I contact the project and it advises it has 2 units to report and is requesting more work....when...I still have 198 in my queue....
Maybe...set project to where it does not contact until half the queue remains (when the project is up and running well)?
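
A minimal sketch of the two ideas in this post: a mandatory backoff after the queue is filled, and no work requests until half the queue remains. This is hypothetical client-side logic, not BOINC's actual scheduler code. The names `HostState` and `should_request_work` are made up for illustration, and the constants are simply the numbers mentioned above (200 tasks, one day, half the queue).

```python
from dataclasses import dataclass

# Hypothetical values taken from the post, not real BOINC settings.
QUEUE_LIMIT = 200            # per-host task limit mentioned above
BACKOFF_SECONDS = 24 * 3600  # "AT LEAST a day" after the queue is filled
REFILL_FRACTION = 0.5        # ask again once only half the queue remains


@dataclass
class HostState:
    tasks_in_queue: int
    seconds_since_last_fill: float


def should_request_work(state: HostState) -> bool:
    """Allow a work request if the queue is half empty or the backoff has expired."""
    queue_low = state.tasks_in_queue <= QUEUE_LIMIT * REFILL_FRACTION
    backoff_over = state.seconds_since_last_fill >= BACKOFF_SECONDS
    return queue_low or backoff_over


# The situation described above: 198 tasks left, last filled 5 minutes ago.
print(should_request_work(HostState(tasks_in_queue=198, seconds_since_last_fill=300)))  # False
```

The two conditions are joined with "or" so that a fast host that drains half its queue in a few hours is not left idle waiting for the backoff to expire.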
ID: 1307662 · Report as offensive
Ianab
Volunteer tester

Joined: 11 Jun 08
Posts: 732
Credit: 20,635,586
RAC: 5
New Zealand
Message 1307665 - Posted: 19 Nov 2012, 8:20:41 UTC

Basically a victim of their own success....

With all the new high powered CUDA crunchers that have been coming online the amount of work in progress has become too much for the database to handle in a reliable way. This led to the recent timeout issues as the database gradually got more sluggish, then to ghost work units, which further compound the database issues. Downward spiral until eventually the database broke completely...

I guess once it's fixed the short term fix will be to stop splitting for a while, clear the ghosts and get the work units in progress back to a number the database can handle. Then restrict the new work units to keep a sensible number in progress.

So expect some ongoing issues over the short term.

I see the plan is to make bigger work units, which should help a lot by making the database only 25% the size, but then that puts more pressure on the internet connection???

I assume there will be some gnashing of teeth, tearing of hair and threats to leave, like there usually is when problems occur. The other 99.9% of us will just sigh, select some other projects, and wait it out...

Ian
ID: 1307665 · Report as offensive
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1307668 - Posted: 19 Nov 2012, 8:28:25 UTC
Last modified: 19 Nov 2012, 8:28:43 UTC

Personally I am happy that Eric has taken time to explain the problem. Whatever they decide to do is up to them. If it increases crunching time and lowers RAC so be it. I have always been here for the science.

I have actually stopped most of my crunching here just leaving one machine to "fly the flag". I will not restart at SETI@Home, I will give them time to sort it out. As should we all.
ID: 1307668 · Report as offensive
Draconian
Volunteer tester

Joined: 16 Mar 03
Posts: 21
Credit: 1,809,058
RAC: 0
United States
Message 1307669 - Posted: 19 Nov 2012, 8:29:26 UTC - in response to Message 1307665.  

Possibly - but - if the problem is what they stated it is - then my proxy server that I am using wouldn't make a difference at all. It doesn't change the timeout - it only changes the RATE of data that is hitting seti.
When I use a proxy server - it's basically flawless. Turn off the proxy - and only once in a great while will I get to the scheduler.

I think what we are seeing is another form of an old computer hack - flood the system with data and eventually it will do something wrong (used to be used to break into systems) - think the same thing is happening here.

When I am on my proxy - the data flow to seti is moderated - after all, the proxy is sending who knows how much data to multiple places - and it works great. When not on the proxy - seti basically gets my full upload speed - sure - a small amount of data - but at full speed.
An analogy - we used to have to interleave hard drives because the system wasn't fast enough to read data straight from the drive.
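
The moderating effect described here is roughly what a token-bucket rate limiter does: short bursts are allowed, but the sustained rate is capped. The sketch below only illustrates that idea under assumptions. It is not what the proxy or the SETI@home servers actually run, and the class name and numbers are hypothetical.

```python
class TokenBucket:
    """Allow bursts up to `capacity` bytes, refilled at `rate` bytes per second."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0         # time of the last update, in seconds

    def allow(self, nbytes: float, now: float) -> bool:
        """Return True if `nbytes` may be sent at time `now`, consuming tokens."""
        elapsed = max(0.0, now - self.last)
        # Refill for the elapsed time, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# Hypothetical numbers: 100 bytes/s sustained, bursts of up to 200 bytes.
bucket = TokenBucket(rate=100, capacity=200)
print(bucket.allow(200, now=0.0))  # True: the initial burst fits
print(bucket.allow(50, now=0.1))   # False: only about 10 bytes have been refilled
```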

ID: 1307669 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1307745 - Posted: 19 Nov 2012, 16:20:54 UTC - in response to Message 1307665.  

[...]I see the plan is to make bigger work units, which should help a lot by making the database only 25% the size, but then that puts more pressure on the internet connection???[...]


If they do it like they doubled MB a while back, the WUs will stay the same size, they just increase the FFT resolution and make you do four times more work on the same data file. That change in the resolution is just something that gets changed in the XML-type header for the WU itself, but as always, it requires testing to see if it is going to work like it is expected. There's always that possibility that by increasing the precision, you may end up with many more false positives than you would expect.

It's kind of like looking at some of the satellite photos for things like the surface of another planet. It used to be something like 10-meter resolution, which meant that every pixel represented 100 square meters, and the particular color of that pixel was determined by what color was the most abundant in that 100 meter square.

As technology increased, I believe we've gotten it down to less than 5 square meters per pixel, so you end up with a huge increase in detail, and now instead of "this 100 meter square is roughly brown and flat" you have "okay, so there's probably a tree there, and oh look, a huge boulder, and it turns out this tree is on the edge of a cliff" because you now have 20 pixels to describe what you only saw in one before.

But increased detail can also be a burden, because it might be possible to end up with more signals in one WU, so -9 overflows may become much more likely, unless the limit gets increased for that, as well.
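
The arithmetic behind the analogy, worked through as a quick check. The only inputs are the figures quoted in this thread (10-meter resolution, 5 square meters per pixel, four times the work per WU); the variable names are just for illustration, and the last line assumes the database size scales with the number of WUs in progress.

```python
old_side_m = 10.0                  # 10-meter resolution: one pixel covers 10 m x 10 m
old_area_m2 = old_side_m ** 2      # 100 square meters per pixel
new_area_m2 = 5.0                  # upper bound: "less than 5 square meters per pixel"

pixels_per_old_pixel = old_area_m2 / new_area_m2
print(pixels_per_old_pixel)        # 20.0

work_factor = 4                    # four times more work on the same data file
wu_count_fraction = 1 / work_factor
print(wu_count_fraction)           # 0.25, i.e. the "25% the size" database
```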
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1307745 · Report as offensive
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1307761 - Posted: 19 Nov 2012, 17:15:44 UTC - in response to Message 1307745.  
Last modified: 19 Nov 2012, 17:21:36 UTC

[...]I see the plan is to make bigger work units, which should help a lot by making the database only 25% the size, but then that puts more pressure on the internet connection???[...]


If they do it like they doubled MB a while back, the WUs will stay the same size, they just increase the FFT resolution and make you do four times more work on the same data file. That change in the resolution is just something that gets changed in the XML-type header for the WU itself, but as always, it requires testing to see if it is going to work like it is expected. There's always that possibility that by increasing the precision, you may end up with many more false positives than you would expect.

It's kind of like looking at some of the satellite photos for things like the surface of another planet. It used to be something like 10-meter resolution, which meant that every pixel represented 100 square meters, and the particular color of that pixel was determined by what color was the most abundant in that 100 meter square.

As technology increased, I believe we've gotten it down to less than 5 square meters per pixel, so you end up with a huge increase in detail, and now instead of "this 100 meter square is roughly brown and flat" you have "okay, so there's probably a tree there, and oh look, a huge boulder, and it turns out this tree is on the edge of a cliff" because you now have 20 pixels to describe what you only saw in one before.

But increased detail can also be a burden, because it might be possible to end up with more signals in one WU, so -9 overflows may become much more likely, unless the limit gets increased for that, as well.


As I recall it was a few years ago when they last cranked up the dial on the resolution & I think it was stated that it was at the max to get any sort of useful data.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1307761 · Report as offensive
Qui-Gon
Volunteer tester
Joined: 15 May 99
Posts: 2940
Credit: 19,199,902
RAC: 11
United States
Message 1307765 - Posted: 19 Nov 2012, 17:32:30 UTC

One of the red flags that a site has been hacked or spoofed is that you find grammatical or spelling errors that are out of the ordinary, and that make a message hard to read. Did anyone else notice the most recent message on the front page seems to have such errors? For example, "the lookup of result in process", "hosts being assigned large number or [of?] results to compute", and "The host. think it received", among others. These are not normal for the seti@home front page or any technical message one usually finds on the site.
ID: 1307765 · Report as offensive
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1307767 - Posted: 19 Nov 2012, 17:36:09 UTC - in response to Message 1307765.  

One of the red flags that a site has been hacked or spoofed is that you find grammatical or spelling errors that are out of the ordinary, and that make a message hard to read. Did anyone else notice the most recent message on the front page seems to have such errors? For example, "the lookup of result in process", "hosts being assigned large number or [of?] results to compute", and "The host. think it received", among others. These are not normal for the seti@home front page or any technical message one usually finds on the site.

OR it could have simply been that Eric was tired on a Sunday when he composed that posting and was more concerned with getting the info out than worrying about being grammatically correct..........
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1307767 · Report as offensive
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1307770 - Posted: 19 Nov 2012, 17:39:58 UTC
Last modified: 19 Nov 2012, 17:40:26 UTC

Believe in the kittyman... The kitties are always right!
ID: 1307770 · Report as offensive
Lint trap
Joined: 30 May 03
Posts: 871
Credit: 28,092,319
RAC: 0
United States
Message 1307771 - Posted: 19 Nov 2012, 17:41:03 UTC



I hope Eric et al don't forget to hit the "Turbo" button when they get everything running again....I just ordered upgrades from Newegg!!

My current mobo/cpu is 5 yo, so yep, it's about time...:)

Lt
ID: 1307771 · Report as offensive
Qui-Gon
Volunteer tester
Joined: 15 May 99
Posts: 2940
Credit: 19,199,902
RAC: 11
United States
Message 1307776 - Posted: 19 Nov 2012, 17:49:12 UTC - in response to Message 1307767.  

One of the red flags that a site has been hacked or spoofed is that you find grammatical or spelling errors that are out of the ordinary, and that make a message hard to read. Did anyone else notice the most recent message on the front page seems to have such errors? For example, "the lookup of result in process", "hosts being assigned large number or [of?] results to compute", and "The host. think it received", among others. These are not normal for the seti@home front page or any technical message one usually finds on the site.

OR it could have simply been that Eric was tired on a Sunday when he composed that posting and was more concerned with getting the info out than worrying about being grammatically correct..........

Sure, that could be, but they've had a long time to determine what this problem was, and a long enough time to correct the front page message. I don't recall any messages from Eric in the past that contained so many faults.

I'm not saying that aliens have the team held in the basement of the server closet, forcing them to write messages that will throw us off the scent. I'm just commenting on the abnormality of the way this issue is being explained. If I have mistakes made yous maybe see them and wondering you are.
ID: 1307776 · Report as offensive
musicplayer

Joined: 17 May 10
Posts: 2430
Credit: 926,046
RAC: 0
Message 1307784 - Posted: 19 Nov 2012, 17:56:18 UTC
Last modified: 19 Nov 2012, 17:59:33 UTC

Oh, Eric is wearing glasses, isn't he?

Be thankful that he is not the one who is biting you here.
ID: 1307784 · Report as offensive
TPCBF

Joined: 18 May 99
Posts: 54
Credit: 4,594,980
RAC: 0
United States
Message 1307812 - Posted: 19 Nov 2012, 18:41:29 UTC - in response to Message 1307767.  

One of the red flags that a site has been hacked or spoofed is that you find grammatical or spelling errors that are out of the ordinary, and that make a message hard to read. Did anyone else notice the most recent message on the front page seems to have such errors? For example, "the lookup of result in process", "hosts being assigned large number or [of?] results to compute", and "The host. think it received", among others. These are not normal for the seti@home front page or any technical message one usually finds on the site.

OR it could have simply been that Eric was tired on a Sunday when he composed that posting and was more concerned with getting the info out than worrying about being grammatically correct..........
I doubt that this hints at a hacked site; more likely these are the "normal" typos of a sysadmin trying to get some info out quickly (which is appreciated), possibly on a smartphone or otherwise touchscreen encumbered device...
The "host. think" part is IMHO a clear indication. That happens to me when I accidentally type two blanks while walking or driving (as a (bus) passenger!). My Android phone (and I know iPhones/iPads do the same) interprets this as the "end of a sentence" and replaces those two spaces with a dot, and you just keep typing another space to move on...

Ralf
ID: 1307812 · Report as offensive
ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1307869 - Posted: 19 Nov 2012, 20:45:36 UTC

The roads are rolling.
ID: 1307869 · Report as offensive
Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30593
Credit: 53,134,872
RAC: 32
United States
Message 1307874 - Posted: 19 Nov 2012, 20:57:15 UTC - in response to Message 1307765.  

One of the red flags that a site has been hacked or spoofed is that you find grammatical or spelling errors that are out of the ordinary, and that make a message hard to read. Did anyone else notice the most recent message on the front page seems to have such errors? For example, "the lookup of result in process", "hosts being assigned large number or [of?] results to compute", and "The host. think it received", among others. These are not normal for the seti@home front page or any technical message one usually finds on the site.

Of course it is a hack, we have an explanation message and we know the staff never does that, and we know Matt is away so it wasn't him. It is signed by Eric, but we know he never writes here. Ergo it must be a hack :)

Eric, thanks for taking the time before the Green Bay game to work on it.


ID: 1307874 · Report as offensive
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22149
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1307886 - Posted: 19 Nov 2012, 21:37:33 UTC

After a short break the servers are getting back on their feet, but are still somewhat unstable - latest request was greeted thus:

19/11/2012 21:30:30 SETI@home Sending scheduler request: To fetch work.
19/11/2012 21:30:30 SETI@home Reporting 142 completed tasks, requesting new tasks for GPU
19/11/2012 21:30:32 SETI@home Scheduler request failed: Server returned nothing (no headers, no data)



Not sure what's going on, but that doesn't look like a well patient to me....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1307886 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1307890 - Posted: 19 Nov 2012, 21:45:54 UTC - in response to Message 1307886.  
Last modified: 19 Nov 2012, 22:00:25 UTC

After a short break the servers are getting back on their feet, but are still somewhat unstable - latest request was greeted thus:

19/11/2012 21:30:30 SETI@home Sending scheduler request: To fetch work.
19/11/2012 21:30:30 SETI@home Reporting 142 completed tasks, requesting new tasks for GPU
19/11/2012 21:30:32 SETI@home Scheduler request failed: Server returned nothing (no headers, no data)



Not sure what's going on, but that doesn't look like a well patient to me....

One of my machines actually updated, reported, and downloaded new files. The other two say the same thing;
11/19/2012 4:36:36 PM | SETI@home | Sending scheduler request: To fetch work.
11/19/2012 4:36:36 PM | SETI@home | Reporting 60 completed tasks
11/19/2012 4:36:36 PM | SETI@home | Requesting new tasks for NVIDIA
11/19/2012 4:36:40 PM | | Project communication failed: attempting access to reference site
11/19/2012 4:36:40 PM | SETI@home | Scheduler request failed: Server returned nothing (no headers, no data)
11/19/2012 4:36:42 PM | | Internet access OK - project servers may be temporarily down...

Another attempt;
19-Nov-2012 16:31:52 [SETI@home] Fetching scheduler list
19-Nov-2012 16:32:07 [SETI@home] Master file download succeeded
19-Nov-2012 16:32:12 [SETI@home] Sending scheduler request: Requested by user.
19-Nov-2012 16:32:12 [SETI@home] Reporting 60 completed tasks
19-Nov-2012 16:32:12 [SETI@home] Requesting new tasks for NVIDIA
19-Nov-2012 16:32:46 [SETI@home] Scheduler request failed: HTTP bad gateway
19-Nov-2012 16:32:52 [---] Using proxy info from GUI
19-Nov-2012 16:32:52 [---] Not using a proxy
19-Nov-2012 16:33:13 [SETI@home] update requested by user
19-Nov-2012 16:33:16 [SETI@home] Sending scheduler request: Requested by user.
19-Nov-2012 16:33:16 [SETI@home] Reporting 60 completed tasks
19-Nov-2012 16:33:16 [SETI@home] Requesting new tasks for NVIDIA
19-Nov-2012 16:33:18 [SETI@home] Scheduler request failed: Failure when receiving data from the peer
19-Nov-2012 16:33:39 [---] Exit requested by user...

I'm also getting the notorious 'Timeout was reached'.

I've restarted BOINC, tried a proxy, reinstalled BOINC...given up for now...
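
For anyone collecting these, here is a small sketch that tallies the failure reasons from pasted BOINC log lines. It relies only on the "Scheduler request failed: ..." text visible in the logs above and ignores the timestamp and project columns, since the three logs format those differently. The function name `tally_failures` is made up for this example.

```python
import re
from collections import Counter

# Capture everything after the marker, dropping trailing whitespace.
FAILED = re.compile(r"Scheduler request failed: (.+?)\s*$")


def tally_failures(lines):
    """Count how often each scheduler failure reason appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "19/11/2012 21:30:32 SETI@home Scheduler request failed: Server returned nothing (no headers, no data)",
    "11/19/2012 4:36:40 PM | SETI@home | Scheduler request failed: Server returned nothing (no headers, no data)",
    "19-Nov-2012 16:32:46 [SETI@home] Scheduler request failed: HTTP bad gateway",
    "19-Nov-2012 16:33:16 [SETI@home] Reporting 60 completed tasks",
    "19-Nov-2012 16:33:18 [SETI@home] Scheduler request failed: Failure when receiving data from the peer",
]

for reason, count in tally_failures(sample).most_common():
    print(count, reason)
```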
ID: 1307890 · Report as offensive
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1307895 - Posted: 19 Nov 2012, 21:55:14 UTC

I think for the time being, I'm treating it as what they always warn us about - a period of a few hours congestion after an outage.

We've had 200,000 results returned in an hour, 1,400 queries a second, and we've still got 94 Mbit/sec. Assuming they leave the splitters turned off until all the current results ready to send have been allocated and downloaded (which I hope they do), we'll get a better idea how well the scheduler copes with 'report only'.
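
To put those figures on a common per-second scale, a quick back-of-the-envelope calculation. The inputs are the three numbers quoted above; the queries-per-result ratio assumes the query rate and the result rate were measured over the same period, which the post does not state.

```python
results_per_hour = 200_000
queries_per_second = 1_400
bandwidth_mbit_per_s = 94

results_per_second = results_per_hour / 3600
queries_per_result = queries_per_second / results_per_second
bandwidth_mbyte_per_s = bandwidth_mbit_per_s / 8

print(round(results_per_second, 1))  # 55.6 results returned per second
print(round(queries_per_result, 1))  # 25.2 queries per returned result
print(bandwidth_mbyte_per_s)         # 11.75 megabytes per second
```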
ID: 1307895 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.