I'm not sure whether you're just looking to pick a fight here BAS so, if you are, lemme know and I'll just get the popcorn and enjoy the show.
No, he's not picking a fight, just being BAS. We all know what he's like, he's made his mind up, made statements (or at least implied things), and can't possibly deviate from that first position - regardless of what anyone else says.
Sometimes I wonder if low cost hardware leads to some "strange" ways of providing backup services using complex software procedures and switching (re-routing) of data paths. This often adds to the complexity and introduces more potential for failure/error.
Yes, it does!
It can come down to this: do you buy a high-spec server with redundant PSUs, etc., backed by a 4-hour on-site guarantee from the vendor? Or do you cobble together several lower-spec machines with shared storage, take on all the overhead and complexity that goes with that, and rely on not too much hardware failing at once?
If you've never looked at what they do, Google is an interesting case of taking low cost to the extreme and providing reliability by other means (in their case, writing their own filesystem!). Their reasoning goes that if the MTBF of a single node is (say) 1000 days, then on average a single machine will break down roughly once every three years. If you have 1000 machines in a cluster, then on average you can expect to lose one machine per day. If you have 20,000 machines in a cluster, then your losses will average 20/day, or nearly one per hour.
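To put the back-of-the-envelope sum in concrete terms, here's a quick sketch of that arithmetic. The only input is the nominal 1000-day MTBF figure above, and it assumes failures are independent:

    # Back-of-the-envelope failure-rate arithmetic, assuming independent
    # failures and a nominal MTBF of 1000 days per node.
    MTBF_DAYS = 1000

    for nodes in (1, 1000, 20000):
        failures_per_day = nodes / MTBF_DAYS
        hours_between_failures = 24 / failures_per_day
        print(f"{nodes:>6} nodes: ~{failures_per_day:g} failures/day "
              f"(one roughly every {hours_between_failures:g} hours)")

Which gives you the numbers quoted: one failure every 1000 days for a single box, one a day at 1000 nodes, and one every 1.2 hours at 20,000 nodes.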
So their answer is to buy cheap, and write redundancy into the software. Their GFS (Google File System) breaks large datasets into chunks, and the controller ensures that there are at least three copies of any chunk, spread across three nodes in the cluster, and not all in the same rack. Thus you can take out any node, or even two nodes, and the data is still there. You can take out a whole rack and the data is still there - the software will automatically allocate another storage node to replicate the data to, bringing the number of available copies back up to three.
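The "three copies, never all in one rack" placement rule is simple enough to sketch in a few lines. This is only a toy illustration, not Google's actual algorithm, and the rack/node names are invented:

    import random

    # Toy illustration of the "three copies, never all in one rack" rule.
    def place_chunk(nodes_by_rack, replicas=3):
        """Pick `replicas` nodes for a chunk so the copies span at least two racks."""
        racks = [rack for rack, nodes in nodes_by_rack.items() if nodes]
        if len(racks) < 2:
            raise RuntimeError("need live nodes in at least two racks")
        # First two copies go to nodes in different racks...
        chosen = [random.choice(nodes_by_rack[rack]) for rack in random.sample(racks, 2)]
        # ...any further copies go on nodes not already holding the chunk.
        spare = [n for rack in racks for n in nodes_by_rack[rack] if n not in chosen]
        chosen.extend(random.sample(spare, replicas - 2))
        return chosen

    cluster = {
        "rack-a": ["node-a1", "node-a2"],
        "rack-b": ["node-b1", "node-b2"],
        "rack-c": ["node-c1", "node-c2"],
    }
    print(place_chunk(cluster))  # e.g. ['node-b2', 'node-a1', 'node-c1']

Lose a node, or a whole rack, and the controller just re-runs the same placement logic to get back to three copies.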
That's great for a big read-only dataset (such as the back end indexes used for serving up queries) - trying to do that with read-write data introduces the nightmare of ensuring write concurrency and consistency.
They also automate their config, so when a node is replaced, all the operator needs to do is tell the management system its identity and its role - then when the node comes online, the management system can serve it the right image, it can self-install, and it goes into service without further intervention.
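In outline it's a very small amount of logic - register the box's identity and role, then hand out the matching image when it boots. A minimal sketch of the idea (all the role and image names here are invented, and a real setup would be doing this via PXE boot and image servers):

    # Toy sketch of role-based auto-provisioning.
    ROLE_IMAGES = {
        "storage": "storage-node-image-v42",
        "index": "index-server-image-v17",
    }

    registered = {}  # node identity -> role, filled in by the operator

    def register_node(identity, role):
        """Operator step: record what the new box is and what job it should do."""
        if role not in ROLE_IMAGES:
            raise ValueError(f"unknown role {role!r}")
        registered[identity] = role

    def on_node_boot(identity):
        """Management-system step: serve the image that matches the node's role."""
        role = registered.get(identity)
        if role is None:
            raise LookupError(f"node {identity!r} has not been registered")
        return ROLE_IMAGES[role]

    register_node("rack-b/slot-07", "storage")
    print(on_node_boot("rack-b/slot-07"))  # -> storage-node-image-v42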
All brilliant, if you've an operation of such a scale that the investment is worthwhile.
For a small business, your best bet is to buy something reasonably reliable, and have a business continuity plan that will allow you to keep going during any outage. Well actually, the BC plan should drive what you put in, since the BC plan will tell you what your technical recovery time objective needs to be.
If your TRTO is "several days", then anything more than something you can repair and have running again in a day or two is overkill. On the other hand, if you end up with a TRTO of 3 hours, something you can only get going by the next day isn't worth having - you'll be out of business.