Panic Mode On (24) Server problems


HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4662
Credit: 123,456,526
RAC: 99,833
United States
Message 933600 - Posted: 15 Sep 2009, 22:22:38 UTC - in response to Message 933598.

Crazy EDF GPU bug, you say? Is that when BOINC processes a few % of a GPU task, then moves to another, processes a few % of a GPU task, then moves to another, processes a few % of a GPU task, then moves to another...

I had started seeing that over the past few weeks. I thought my GF8500 was just starting to fail or something, so I stopped doing GPU tasks, as it was slow anyway.


Correct..
And every suspended CUDA WU takes system RAM.
And if you have maybe ~ 20 or 30 suspended CUDA tasks, your PC can't crunch because your system RAM is overloaded.

So you should also use DEV-V6.6.37.


Well, I've just stopped doing GPU tasks on that PC. The GF8500 isn't very fast, GPU tasks bogged the system down too much, and I don't have the time to babysit it.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

[seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7122
Credit: 61,588,031
RAC: 16,373
Germany
Message 933601 - Posted: 15 Sep 2009, 22:23:14 UTC
Last modified: 15 Sep 2009, 22:24:32 UTC


I looked at your CPU-GPU combination; the CUDA tasks were calculated with the stock CUDA app.

You don't run optimized apps?

____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4662
Credit: 123,456,526
RAC: 99,833
United States
Message 933603 - Posted: 15 Sep 2009, 22:25:38 UTC - in response to Message 933601.


[Extension of my upper post]



I looked at your CPU-GPU combination; the CUDA tasks were calculated with the stock CUDA app.

You don't run optimized apps?


I had before, but I reimage the system often: I could run XP, Vista, Windows 7, Server 2003, Server 2008, or 2000 all in one day. The opt app didn't really like that. lol
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 933604 - Posted: 15 Sep 2009, 22:25:54 UTC - in response to Message 933592.


You could say this if your PC makes a RAC of ~ 20.

But if your PC makes a RAC of ~ 52,000, it must UL ~ 600 to 3,000 results/day!
And if you use DEV-V6.6.38, which doesn't work well, you should take a BOINC version which isn't so buggy.
So I now use DEV-V6.6.37, which doesn't collect result ULs; this version makes the ULs.

No, I can't use V6.6.36 with the 'crazy EDF GPU BUG'.

The UL function in DEV-V6.6.38 doesn't work well.
I hope in DEV-V6.10.x this UL function works better (like I posted in my post above) or is disabled.

I don't think you should use 6.6.36, I think you should use 6.6.38 or later.

Sutaru,

The problem is congestion: when the bandwidth is maxed out, everything is competing for the same bandwidth -- and uploads fail.

If the BOINC client slows down, that means less congestion, more successful uploads, and more successful uploads means even less congestion.

There is a saying: you can't put 8 pounds of stuff in a 5 pound bag.

When things back up, we're trying to put 800 pounds of stuff in a 5 pound bag, and we're upset because the bag is constantly breaking.

But this won't work at all if people can turn it off. If they can turn it off, they will, and if they do, we're no better off than we are right now.

This is not a new idea. This is how SMTP E-Mail works, by backing way down when things get too busy. If part of your job was running a busy mail server, you'd see exactly what the upload server (and to a lesser extent, the download and scheduling servers) see every day.

I understand that you look at your upload queue, and you see all the pending retries, and you worry that they won't get through -- and I realize that it's very hard to sit back, leave it alone and do nothing.

But you won't know what happens when you leave it alone if you keep pushing harder.

Sorry, I know, it's hard. It took me years to learn that when things are incredibly overloaded that trying to fix the overload usually makes things worse.

-- Ned
____________

[seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7122
Credit: 61,588,031
RAC: 16,373
Germany
Message 933608 - Posted: 15 Sep 2009, 22:43:18 UTC
Last modified: 15 Sep 2009, 22:44:28 UTC


Yes.. it's a pity that my GPU cruncher has such high performance and the BOINC/SETI@home servers aren't designed for this.

So more ULs stay in the BOINC overview, and when the UL finally becomes possible, maybe ~ 800 ULs (or many more) go through in a row. The report file is then so big that the scheduler server only gets it with good luck.

Twice I had to delete a report file (detach/attach project), one with ~ 1,600 results and IIRC one with ~ 800, because it was impossible to send such a big report file to the Berkeley scheduler.


So the best would be for the UL/DL/scheduler servers in Berkeley to work well 24/7.
But this is impossible.

So I MUST UL, UL and UL the results to keep the report file small, so that it has a chance of reaching the scheduler.


I have only a 3 day WU cache - more isn't possible because of the BOINC client.


Yes - this is all my prob - only the devs could help me.. or the Berkeley crew (24/7 servers).. ;-)

____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,520
RAC: 119
Netherlands
Message 933609 - Posted: 15 Sep 2009, 22:49:55 UTC - in response to Message 933604.
Last modified: 15 Sep 2009, 23:02:00 UTC

Hi, apart from the fact that the maintenance outage was very short, I haven't had, and don't have, any trouble up- or downloading.

Can't explain.
Maybe your internet connection is not OK, or too slow. Though I doubt that's true.
Sometimes I got better results by connecting to Berkeley during daytime in Europe.

My (A)DSL runs at 8000/512 Kbit/s down/up.
____________

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 933610 - Posted: 15 Sep 2009, 22:51:29 UTC - in response to Message 933608.

Yes.. it's a pity that my GPU cruncher has such high performance and BOINC isn't designed for this.

Sutaru,

I am saying the opposite.

I am saying that this is a way to make more time available for monster crunchers like yourself.

I am saying that the revised upload algorithm will help you much more than it will help anyone else.

... and you're saying "I don't care if the servers are completely buried, I want to make sure that we pile on as much as possible."

You say "the solution to upload problems is to push really, really hard" and I'm saying "imagine a world where retries are unusual, and congestion is rarely a factor."

When this happens in a crowd at some big concert or event, people get trampled and the news reports say "how sad" when an orderly line would have been fine.

-- Ned


____________

[seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7122
Credit: 61,588,031
RAC: 16,373
Germany
Message 933614 - Posted: 15 Sep 2009, 23:06:14 UTC


If the UL function in BOINC DEV-V6.6.38 worked well, I would use this version.

But why did this version collect ~ 400 result ULs?
It's hard to send this big report file to the scheduler.

I looked in the history, and for hours BOINC didn't start a new UL attempt.

Could someone explain how this UL function works?


After 3 failed ULs the UL is stopped.
But when, and how often, will new ULs start to test whether the UL server is available?

It would be good if it worked like I said in my post above..
If a UL doesn't reach the UL server, stop all ULs for ~ 60 sec. and then start one UL again. If it again can't reach the UL server, pause all ULs for another ~ 60 sec., and then start one UL again. Once one UL goes through, the next UL starts and follows the same procedure.
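
For illustration, the procedure proposed above can be sketched as a small simulation. This is only a sketch under assumptions: `probe_schedule`, `server_up_at` and the fixed 60 second pause are hypothetical names and values chosen for the example, not BOINC code.

```python
def probe_schedule(server_up_at: float, pause_s: float = 60.0):
    """Simulate the proposal: one probe upload at a time, fixed pause on failure.

    server_up_at: time in seconds at which the (hypothetical) upload server
                  starts accepting uploads again.
    Returns (number_of_failed_probes, time_of_first_successful_probe).
    """
    now = 0.0
    failed = 0
    while now < server_up_at:   # the single probe fails while the server is down
        failed += 1
        now += pause_s          # pause ALL uploads, then try one upload again
    return failed, now


# Example: the server comes back after 150 seconds.
print(probe_schedule(150))
```

Under these assumptions a 4 hour outage, `probe_schedule(4 * 3600)`, means 240 failed probes from each client before the first upload gets through.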

____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,520
RAC: 119
Netherlands
Message 933616 - Posted: 15 Sep 2009, 23:20:03 UTC - in response to Message 933610.
Last modified: 15 Sep 2009, 23:36:12 UTC

Hi, it is possible to push over and over again to get a WU (or a bunch of them) to Berkeley, but this is, as Ned Ludd said, unwanted.

Such behavior looks like a DoS attack!

BOINC has to take care of that, but it takes time. And a big cache is more difficult, certainly with several projects and GPU use.

BOINC contacts the server and then uploads a WU, or a series. If it fails, it tries again a few minutes later, etc.

The faster your (A)DSL speed, the bigger the chance of (more) WUs being up- and downloaded in a given time.
____________

Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 4,339,703
RAC: 72
Norway
Message 933617 - Posted: 15 Sep 2009, 23:21:03 UTC - in response to Message 933592.
Last modified: 15 Sep 2009, 23:24:56 UTC

The UL function in DEV-V6.6.38 doesn't work well.
I hope in DEV-V6.10.x this UL function works better (like I posted in my post above) or is disabled.

The only problem with uploading in v6.6.38 is that you don't see when the project backoff is in effect. Well, if you enable <file_xfer_debug> you will see a message like "project backoff, xxx hours/minutes/seconds", but nothing else, and you'll need to calculate yourself when the next connect attempt is. These messages are also very easily overlooked in the message tab.

The missing-info bug is fixed in v6.10.0, and you'll see in the Transfer tab when a project is being backed off, and for how long.

Since it now works as intended, it's unlikely it will be removed again. After all, the only reason the code was removed again in v5.2.2 was the lack of info to users.


edit - The code works like this: if you're repeatedly failing to upload (or download), all uploads (or downloads) to the project get an exponential backoff between 1 minute and 4 hours. If an upload (or download) later succeeds, all uploads (or downloads) will go through as normal, as long as there aren't any more connection problems, that is...
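
A minimal sketch of the schedule described above, assuming the delay simply doubles after each consecutive failure and is clamped to the 1 minute to 4 hour range. The function name and the doubling factor are illustrative assumptions, not taken from the BOINC source; the real client may well randomize the delay.

```python
MIN_BACKOFF_S = 60          # 1 minute, the lower bound quoted above
MAX_BACKOFF_S = 4 * 3600    # 4 hours, the upper bound quoted above


def project_backoff_seconds(consecutive_failures: int) -> int:
    """Wait before the next transfer attempt for the whole project.

    Assumption: the delay doubles per consecutive failure (hypothetical
    factor of 2) and is capped at MAX_BACKOFF_S.
    """
    if consecutive_failures <= 0:
        return 0  # no failures: transfers proceed normally
    delay = MIN_BACKOFF_S * 2 ** (consecutive_failures - 1)
    return min(delay, MAX_BACKOFF_S)


# Print the delay after 0..10 consecutive failures.
for n in range(11):
    print(n, project_backoff_seconds(n))
```

With these numbers the cap is reached on the 9th consecutive failure; a single success resets the count to zero.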
____________
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 933627 - Posted: 15 Sep 2009, 23:51:46 UTC - in response to Message 933614.


Could someone explain how this UL function works?


After 3 failed ULs the UL is stopped.
But when, and how often, will new ULs start to test whether the UL server is available?

You can read the ticket here.

The basic idea is:

If uploads are working, keep going.

If not, when uploads are failing, back off all uploads, not just the single failed upload.
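
A rough sketch of that idea, assuming a single timer shared by all of a project's uploads. The class and method names are invented for illustration, and the 1 minute / 4 hour bounds are the ones quoted earlier in this thread; this is not the actual BOINC changeset.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectTransfers:
    """Hypothetical model: one backoff timer shared by every pending upload."""
    pending: list = field(default_factory=list)
    failures: int = 0
    next_attempt_at: float = 0.0  # seconds on an arbitrary clock

    def may_start(self, now: float) -> bool:
        # No upload of this project may start before the shared timer expires.
        return now >= self.next_attempt_at

    def on_success(self, name: str) -> None:
        # One success clears the backoff for all remaining uploads.
        self.pending.remove(name)
        self.failures = 0
        self.next_attempt_at = 0.0

    def on_failure(self, now: float) -> None:
        # One failure delays every upload, not just the one that failed.
        self.failures += 1
        delay = min(60 * 2 ** (self.failures - 1), 4 * 3600)
        self.next_attempt_at = now + delay


p = ProjectTransfers(pending=["result_a", "result_b"])
p.on_failure(now=0)
print(p.may_start(30), p.may_start(60))
```

The point of the design is that the decision to retry is made once per project instead of once per file, so a client with hundreds of queued results sends no more retries than a client with one.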

You can read the changeset, and you can read the discussion in the developer's forum from the above link.

Without this change, the BOINC Clients attached to a project will launch a distributed denial of service attack on the servers.

By stopping the DDoS attack, throughput will improve, and Sutaru, you very much want better throughput.

Note that the <file_xfer_debug> flag turns on logging, and if it is not working, you are probably in the best position to test it.

Getting this to work right and reducing load on the servers will help throughput.

____________

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2355
Credit: 8,941,971
RAC: 4,072
United States
Message 933644 - Posted: 16 Sep 2009, 1:39:58 UTC

Totally off topic, I just wanted to point out that the post-maintenance network congestion lasted about 3 hours and has come back down to the mid-60 Mbit range.

Is that a good thing, or a bad thing?

Uploads had a large upward spike after the maintenance period was done, but have been steadily falling and are down to about 8 Mbit.

I know I don't have hundreds or thousands of WUs to upload, but I did have about 20 that kept failing earlier in the day before the outage, and they all went through on the first try after the outage.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

PhonAcq
Joined: 14 Apr 01
Posts: 1624
Credit: 22,596,429
RAC: 3,958
United States
Message 933674 - Posted: 16 Sep 2009, 4:55:46 UTC - in response to Message 933644.

My take was that nobody is doing anything special this week that would take the system down longer. Don't know, because nobody is talking on the "technical news" board.

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 933688 - Posted: 16 Sep 2009, 6:18:33 UTC - in response to Message 933617.

edit - The code works like this: if you're repeatedly failing to upload (or download), all uploads (or downloads) to the project get an exponential backoff between 1 minute and 4 hours. If an upload (or download) later succeeds, all uploads (or downloads) will go through as normal, as long as there aren't any more connection problems, that is...

Since this is something I can test, I've been testing it, and it appears to work fine. 6.10.4 is definitely better in that you can see the backoff.

As described, the logic means BOINC will try at least 6 times per day.
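
The "at least 6 times per day" figure follows from the 4 hour cap: 24 h / 4 h = 6. A small sketch of the count, assuming every attempt fails and the delay doubles from 1 minute up to the cap (the doubling factor and the function name are assumptions, not taken from the BOINC source):

```python
def attempts_in_window(window_s: int, min_s: int = 60, max_s: int = 4 * 3600) -> int:
    """Count upload attempts inside a time window if every attempt fails.

    Assumption: the first attempt happens at t = 0 and the delay doubles
    after each failure until it reaches max_s.
    """
    t, delay, count = 0, min_s, 0
    while t < window_s:
        count += 1                      # one (failed) attempt at time t
        t += delay                      # wait out the current backoff
        delay = min(delay * 2, max_s)   # double it, capped at 4 hours
    return count


DAY_S = 24 * 3600
print(attempts_in_window(DAY_S))                # first day after the first failure
print(attempts_in_window(DAY_S, min_s=4 * 3600))  # a day spent entirely at the cap
```

With these assumptions the first day after a failure sees 13 attempts, and a day spent entirely at the 4 hour cap sees 6.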

If uploads are working, there won't be a project backoff, and uploads proceed normally.
____________

Henk Haneveld
Joined: 16 May 99
Posts: 154
Credit: 1,494,832
RAC: 22
Netherlands
Message 933693 - Posted: 16 Sep 2009, 6:49:06 UTC

If a project backoff for uploads is in progress in Boinc 6.6.38 it is visible in the projects tab.

Select project and click the properties button.
____________
***********************************************


************************************************

Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 4,339,703
RAC: 72
Norway
Message 933700 - Posted: 16 Sep 2009, 8:26:48 UTC - in response to Message 933693.
Last modified: 16 Sep 2009, 8:27:29 UTC

If a project backoff for uploads is in progress in Boinc 6.6.38 it is visible in the projects tab.

Select project and click the properties button.

Hmm, I was under the impression this only showed when you can next ask for CPU, CUDA or ATI work, and not any deferrals for uploads or downloads...

I can't test it at the moment, since I can't run BOINC on this computer...
____________
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

twister@austria-national-team.at
Volunteer tester
Joined: 26 Jan 00
Posts: 30
Credit: 60,419,551
RAC: 0
Austria
Message 933701 - Posted: 16 Sep 2009, 8:34:04 UTC - in response to Message 933543.
Last modified: 16 Sep 2009, 8:36:01 UTC



?

Hmm.. no idea where you enter that.. ;-)


Maybe it was also because you had several projects on the PC?

--------------------------------------


Well, for that I made myself a DOS patch file, which I start with the Task Scheduler.

And no, I only had SETI on there as well. When I had the manager you suggested installed, I didn't get any more work packages.
Only then did I add the other projects.
So now I have the latest 6.6.36 again, with the help of this patch file.
It's running quite well now, as you can see ;-)

Oh yes, my 600,000 pendings are still gone,
even after yesterday's BOINC maintenance day.
____________

[seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7122
Credit: 61,588,031
RAC: 16,373
Germany
Message 933707 - Posted: 16 Sep 2009, 9:36:52 UTC
Last modified: 16 Sep 2009, 9:41:16 UTC


Thanks to all who would like to 'help' me..


O.K., suppose it's working 'badly' and Berkeley has UL server probs..
As you saw, every 4th UL went through with DEV-V6.6.37, but DEV-V6.6.38 stopped the ULs after every 3rd UL.
So if the counter is at 4 hours, this means my GPU cruncher has ~ 100 additional results ready for UL.
And if every 4 hours the UL server fails 3 times to receive my ULs, there is another 4 hour break, and another 4 hour break.. + 100, + 100, .. results in the UL overview..


I didn't mean that all ULs in the overview should test whether the UL server is available.
I meant that all ULs stop, and every few minutes one UL tests the UL server. If that UL isn't possible, all ULs stay stopped.
I don't know how to explain it better, English isn't my 1st language.

I don't mean that all ULs in the overview should test the UL server every 60 sec.
Only one of the whole bunch should test the UL server.
And then a break of all ULs if the 'test UL' doesn't go through.

Because of the unplanned SETI@home server outages it would be good if this break didn't last 4 hours.
You can calculate how big the report file will be then.. and the chance to reach the scheduler will be smaller and smaller..
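
To put rough numbers on this, using only figures from this thread (~ 100 results per 4 hours, i.e. about 25 per hour, which matches the ~ 600 results/day quoted earlier), here is a sketch of how the backlog grows while uploads are blocked. The function names are made up for the illustration, and the rate is only the approximate one given above.

```python
RESULTS_PER_HOUR = 100 / 4  # ~100 results per 4 hours, as stated above


def backlog_after(hours_blocked: float) -> int:
    """Results waiting for upload after uploads were blocked for this long."""
    return round(hours_blocked * RESULTS_PER_HOUR)


def hours_until(backlog: int) -> float:
    """Hours of blocked uploads needed to accumulate the given backlog."""
    return backlog / RESULTS_PER_HOUR


for h in (4, 8, 12, 24):
    print(f"{h:>2} h blocked -> ~{backlog_after(h)} results waiting")
```

At this assumed rate, a backlog of ~ 800 results would take about 32 hours to build up and ~ 1,600 results about 64 hours, if nothing got through at all.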



@ RELCLAG

Hmm.. I don't know, maybe Eric is busy.
I sent him a PM about your prob, but haven't got an answer so far.



@ all

People with high RAC/pendings, please.. can you see your pending credits?

What could be the reason that RELCLAG can't see his big pendings?
Ideas are welcome.

____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.

[seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7122
Credit: 61,588,031
RAC: 16,373
Germany
Message 933710 - Posted: 16 Sep 2009, 9:59:21 UTC
Last modified: 16 Sep 2009, 9:59:59 UTC


[Extension of my upper post]




BTW.
My pending credit: 295,397.36

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5942
Credit: 62,323,191
RAC: 37,107
Australia
Message 933711 - Posted: 16 Sep 2009, 10:04:42 UTC - in response to Message 933710.

My pending credit: 295,397.36

8,267 for me.
A new record.

*shrugs and wanders off*
____________
Grant
Darwin NT.


Copyright © 2014 University of California