Areas to Test and to Consider
As outlined so far, there are countless areas that need to be evaluated and tested with Exchange 2007. Below I am going to try to prioritize some of the key areas, but these will vary for each organization.
Applications
A single business-critical application that doesn't work with E2k7 can halt your transition\migration before it even starts. Therefore, it is essential that all applications that interface with Exchange, besides those using standard SMTP, are tested. Below is a prioritized list of the types of applications that should be tested.
1. Those that use any of the discontinued features must be identified and replaced.
2. Any application that runs on the Exchange server itself, since the x64 OS may break it.
3. Those that use the de-emphasized features should be identified and tested.
4. MAPI or Outlook Object Model based applications; if they were built around Outlook 2003, they will need to be tested for Outlook 2007 support.
Message Routing
For large organizations, link state routing sometimes required a bit of black magic to foresee how messages would be routed, or to work out how they should have been routed.
In E2k7 all routing is now based on AD sites. With routing based on AD sites, routing should be more predictable, but below are some key factors to consider. For more information see this TechNet reference.
1) AD site link costs are critical for message routing, so they must be re-evaluated.
2) Messages are now transmitted directly from the HT server in the source AD site to the HT server in the destination AD site, that is, the sites that contain the sending and receiving mailboxes.
   a) Delayed fan-out, or message bifurcation, will cause a message to be delivered to the HT server in the AD site that will produce the fewest messages to other HT servers.
   b) If the HT server in the destination site cannot be contacted, messages will be delivered to the next closest AD site that has an available HT server. This behavior is commonly called queue at point of failure.
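To make the point about site link costs concrete, here is a minimal Python sketch of picking the lowest total cost path across AD site links. It is only an illustration of least-cost routing in general, not Exchange's actual routing code, and it ignores any tie-breaking rules. The site names, the costs, and the `least_cost_route` function are all hypothetical.

```python
import heapq

def least_cost_route(site_links, source, destination):
    """Return (total_cost, [sites]) for the cheapest path, or None if unreachable.

    site_links maps a pair of site names to the cost of the link between them.
    Links are treated as bidirectional. All names here are hypothetical.
    """
    # Build an adjacency list from the link table.
    graph = {}
    for (a, b), cost in site_links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    # Dijkstra's algorithm: always expand the cheapest path found so far.
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, site, path = heapq.heappop(queue)
        if site == destination:
            return cost, path
        if site in visited:
            continue
        visited.add(site)
        for neighbor, link_cost in graph.get(site, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None

# Hypothetical topology: the direct HQ to Branch-West link is expensive.
SITE_LINKS = {
    ("HQ", "Branch-East"): 10,
    ("HQ", "Branch-West"): 50,
    ("Branch-East", "Branch-West"): 20,
}

print(least_cost_route(SITE_LINKS, "HQ", "Branch-West"))
```

In this made-up topology the cheapest path from HQ to Branch-West runs through Branch-East (total cost 30) rather than over the direct link (cost 50). Running your real link costs through something like this during testing is a quick way to spot paths you did not expect.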
Disaster Recovery
There are many new changes that impact DR planning in E2k7. Microsoft has made some major improvements that should allow an organization to recover more quickly and easily from the loss of a database, a server, or even an entire site. Data replication is now included with Exchange, something that multiple 3rd party vendors provided for Exchange 2003 before. The out of the box replication methods include Local Continuous Replication (LCR), Cluster Continuous Replication (CCR), and Standby Continuous Replication (SCR) [coming in SP1]. LCR, CCR, and SCR all use log shipping to replicate changes from a source database to a target database.
With LCR, any changes stored in transaction logs are copied to another location, where they are then committed to the second copy of the database. When LCR or CCR is enabled (not sure about SCR at this point since it was not in Beta 1 of SP1) a storage group can only contain one database\store. This should not be an issue since E2k7 now supports up to 50 storage groups and databases (50 max databases across all storage groups). To reduce the chance of data loss and to address other factors, Microsoft reduced the transaction log file size from 5MB to 1MB in E2k7.
CCR works in a similar fashion, but logs are replicated from the primary node to the secondary node (CCR requires Windows Clustering). The secondary node could be in the same physical site or a different one (Note: There are major limitations with spanning sites with E2k7 and W2k3; Windows 2008 will resolve most of those).
SCR provides similar support but doesn't require clustering; it replicates from one E2k7 server to another. SCR supports one-one, one-many, many-many, and many-one relationships between storage groups and servers. For example, you could have five servers with five storage groups each and use SCR to replicate data in those twenty-five storage groups to a single server with twenty-five storage groups. SCR looks like it is going to provide the critical support needed for most organizations to implement site level disaster recovery without the need for 3rd party products.
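When planning a many-to-one SCR layout like the example above, it is worth checking the arithmetic against the 50 storage group limit. The Python sketch below does just that. It assumes the 50 storage group limit mentioned earlier also applies to the SCR target server, and the server names and the `fits_on_target` helper are hypothetical.

```python
MAX_STORAGE_GROUPS = 50  # per-server limit cited in the text; assumed to apply to the SCR target too

def scr_target_load(source_servers):
    """Total storage groups that would be replicated to a single SCR target.

    source_servers maps a (hypothetical) server name to its storage group count.
    """
    return sum(source_servers.values())

def fits_on_target(source_servers, existing_on_target=0):
    """True if the target can hold its own storage groups plus all replicated ones."""
    return existing_on_target + scr_target_load(source_servers) <= MAX_STORAGE_GROUPS

# The example from the text: five servers with five storage groups each.
sources = {f"MBX{i}": 5 for i in range(1, 6)}

print(scr_target_load(sources))
print(fits_on_target(sources))
```

The five by five example comes to twenty-five storage groups, which leaves headroom on the target. Eleven such servers (fifty-five storage groups) would not fit under the same assumption.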
In addition to the data replication support in E2k7, Microsoft has also updated Exchange to support true database portability. Previously, if a server died or a database needed to be recovered on another physical server, that recovery server had to have the same name, domain membership, and administrative group name, and had to be set up with the /DisasterRecovery switch. This required most organizations to have a server standing by adding no value. With E2k7, databases can be copied to or restored to another server, which was possible in E2k3 with Recovery Storage Groups in a limited fashion. The big difference is that clients can now connect to these "restored" databases, after users' settings have been updated in the Active Directory and a few other steps (which can all be scripted with the EMS) have been carried out. So instead of spending a couple of hours building a server and then using the Mailbox Recovery Wizard or ExMerge to copy data, a database can just be mounted on an existing server and a script run to allow users to connect to it.
So in addition to the new OS (x64) and the many other changes in E2k7 that might impact existing DR plans, the new features may drastically change those plans.
Clustering and HA
As mentioned above, a new type of clustering has been added to E2k7 called Cluster Continuous Replication (CCR). CCR uses the Majority Node Set clustering model, unlike previous versions of Exchange, which used a shared data model, now called Single Copy Clusters (SCC). With SCC, the OS and storage system required that multiple nodes could connect to the drives or LUNs holding the Exchange and quorum data. An SCC config could be created with a basic SCSI storage system, special cables, and SCSI controllers to support a two node cluster. Supporting 3+ nodes required a SAN, iSCSI, or other storage system that supports multiple connections from servers and device reservations or locking. Both of these were fairly complex to set up and in many environments DECREASED uptime due to this complexity. For these reasons Microsoft and experienced Exchange consultants would only recommend clustering in special circumstances.
With SCC you can't address the #2 cause of downtime (#1 being human error), which is database issues. The most "common" problems with Exchange 2003 and earlier versions, which were not very common, were database corruption and storage system failures. In both cases all users on the cluster who are in an affected database may be taken offline, depending on the level of corruption or failure. With CCR, and similarly with LCR, data can be replicated across two completely different storage systems. CCR does this by requiring two nodes, which is also the maximum, with each node accessing its own copy of the data directly. This data can be stored locally, on iSCSI, or on a SAN, but both nodes should NOT store data on the same iSCSI or SAN storage system, otherwise you still have a single point of failure in your storage system.
LCR provides higher availability by allowing data to be stored in two places. Each location should be on a different storage system, with a dedicated RAID\iSCSI\HBA controller, dedicated external storage cabinet\SAN, and dedicated network\fabric for each. This way there is no single point of failure in the storage system. If a failure were to occur, the Exchange admin would need to change the database paths in EMC\EMS to point to the secondary location, or copy\move the secondary files to the primary location, and then remount the failed storage groups.
CCR takes this one step further by providing server redundancy and automatic failover support. Unlike CCR, LCR does require manual or scripted intervention in the case of a database or storage system failure. With CCR an entire server can be lost and the standby server will start servicing users after the several minutes it takes the standby node to take ownership of the cluster. One area CCR doesn't address is individual database failure. Unfortunately, if a single database fails in a CCR cluster, the entire cluster must be failed over to the standby node. During the failover all users will be disconnected until the standby node has taken full ownership and mounted all stores. Therefore, a business decision must be made in such a case: keep the users on the failed database offline while the problem is troubleshot, or take all users down for a brief period while the cluster fails over.
Similar to LCR, SCR replicates data to another location, but this location must be on a different server. Because the data is on a different server, the databases cannot simply be mounted and accessed by users. So SCR requires additional manual\scripted steps to enable users to connect to the new server that is now hosting their data in the case of a failure. I plan on writing an entire article on this process at some point, but basically the Active Directory and DNS need to be updated so Outlook clients know where to find the user's mailbox.
CCR, LCR, and SCR are major new additions and should significantly affect existing DR plans. They are also among the key features that should help justify the deployment of Exchange 2007.
Scalability
Everyone by now should know that Exchange 2007 requires an x64 OS (Windows 2003 SP2 or Longhorn) [Note: W2k3 SP2 is required for E2k7 SP1; only W2k3 SP1 is required for E2k7 RTM], so I'm not going to go into detail on this. The key thing this affects is caching on Exchange, which directly affects the I/O operations per second (IOPS) generated by end users. With E2k3 only 700MB of RAM could be used for caching, but E2k7 can use GBs. The current sweet spot is about 24GB, but this might be improved with SP1. Past 32GB of RAM, the 4GB DIMMs required become cost-prohibitive and the additional memory doesn't provide linear scalability.
So what does all this extra memory and caching allow? Well, with E2k3 you could deploy about 3-4K medium use Outlook users on a single server. I have deployed over 6K on a single server, but this was for a 24x7x365 operation where only 40% of the users would ever be connected at once. As mentioned above, the major limiting factor was the amount of memory available for caching. Since not all of the most commonly accessed user data could be cached for everyone, each user would generate .5 - 1 IOPS. Therefore, the storage system became the bottleneck. With E2k7 this changes due to 64-bit memory addressing and other database changes. The IOPS profile can be reduced by 50-70% with E2k7 and enough memory. The decision that now must be made is how much money should be spent on memory versus the storage system.
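The memory versus storage trade-off comes down to simple arithmetic, which the short Python sketch below works through. It uses only the figures quoted above (.5 - 1 IOPS per user, a 50-70% reduction); the 4,000 user count and the `storage_iops` helper are hypothetical, and real sizing should be based on measured profiles rather than this back-of-the-envelope estimate.

```python
def storage_iops(users, iops_per_user, reduction=0.0):
    """Estimate the IOPS a storage system must sustain.

    users:         number of mailbox users on the server
    iops_per_user: IOPS each user generates (the text quotes .5 - 1 for E2k3)
    reduction:     fraction of IOPS saved by extra caching (0.5 - 0.7 per the text)
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be a fraction in [0, 1)")
    return users * iops_per_user * (1.0 - reduction)

# Hypothetical server: 4,000 users at the heavy end of the quoted range.
users = 4000
baseline = storage_iops(users, 1.0)
best_case = storage_iops(users, 1.0, reduction=0.7)
worst_case = storage_iops(users, 1.0, reduction=0.5)

print(round(baseline), round(worst_case), round(best_case))
```

For this hypothetical server the storage requirement drops from about 4,000 IOPS to somewhere between roughly 1,200 and 2,000 IOPS, which is the saving to weigh against the cost of the extra RAM.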
What does this have to do with testing, you might ask? Well, the obvious thing is that the lab environment must now be able to simulate, say, 5,000 users where before it only needed to simulate 2,000 users. In addition, the DR and backup plans will have to be modified to support the profile of a server with this many more users, if the business decision can be made to put this many users on a single server. LCR, CCR, and SCR should help justify putting this many "eggs in one basket" for most organizations.