PASS Summit 2013: Final Recap

I haven’t traditionally written recap posts for the PASS Summit, but this year was just phenomenal, and I think that down the road I would regret it if I didn’t take a few minutes to summarize the highlights.

Perry and I arrive at the PASS Summit

In no particular order…

#SQLRun

On Wednesday morning some 70 people congregated in downtown Charlotte for the now-traditional #SQLRun.  Organized by Jes Borland (whom I really don’t have enough wonderful adjectives to describe), it was a 3.4-mile run with fellow SQL runners in the complete dark.   It was the perfect way to start Day 1 of the Summit.  A run, when it includes friends, is never bad.  I met a couple of new people whom I saw throughout the week, proving again that common interests outside SQL Server help facilitate those relationships within PASS.  Whatever your passion, I encourage you to find people with the same non-SQL interests at conferences and SQLSaturdays.  You just never know who you’ll meet.

My Sessions

My first session was Wednesday morning after the keynote, and it was Making the Leap from Profiler to Extended Events.

Perry checking out the crowd before my XE session

This was one of my favorite topics to cover this year, and based on feedback throughout the week, it hit home with many people.  Over 300 attendees made it to the session (the picture above was taken 15 minutes before I started), and I had great questions throughout and finished right on time.  In case you missed it, I’ll be giving a shorter version of the same session this Wednesday at noon EDT for the DBA Virtual Chapter (sign up here) and again next week at SQLIntersection.  Scripts can be downloaded from the PASS site, or from the SQLskills site under the Conference Demos heading.

My second session was Friday morning at 8 AM, How to Avoid Living at Work: Lessons from Working from Home.  Despite the early hour, we had a good number of attendees and a great discussion.  As I mentioned in a post back in August, I’m still adjusting, but it’s going well. :)

The WIT Panel

I had the honor of sitting on the WIT Panel on Thursday, and even though I probably said less than the other panelists, I had the opportunity to address a couple great questions (including one from an audience member).

2013 WIT Panel (L to R: Gail Shaw, Kevin Kline, Cindy Gross, Rob Farley, me, and moderator Mickey Stuewe)

You can view the recording here, and since Thursday I’ve had a lot of time to reflect on what else I could have said, particularly when I answered the question from the audience member.  I want to include it here, for reference, and if you watch the video it starts at 59:11:

I had an interesting experience.  I was walking around the Expo yesterday and after having a short conversation with someone, someone said to me, well, you are a woman working in technology, you are a foreigner, you are a former collegiate athlete, and you are young.  You have all this working against you, how are you going to make it in this industry?

My reply to her was that I would have said, “How am I NOT going to make it?”  Because here’s the thing: YOU decide what you can and cannot do, what you will and will not do.  You are in complete control of your destiny.  People will doubt you.  People will tell you that you aren’t good enough, don’t know enough, that you’re not “something enough”.  Don’t listen to them.  Know who you are…and if you don’t know, figure it out.  I firmly believe that once you fully accept the person that you are, and you like that person, that nothing will stop you.  Have confidence in yourself and then go forth and conquer.  And to the guy that said that?  There’s one part of me that wants to kick his @$$.  The other part of me feels sorry for him.  He has no idea what he’s up against.

The SQLskills Team

A conference bonus is that I get to see the SQLskills team.  It’s not often we’re all together because we’re scattered throughout the US.  I had time with every member of the team, including a couple dinners which really provide time to catch up in a relaxed setting.  I also moderated Kimberly’s session, Skewed Data, Poor Cardinality Estimates, and Plans Gone Bad, on Thursday, which was a lot of fun.  If you have any interest in statistics, go watch her session on PASS TV.

Me and Kimberly – notice how tall I am?! (photo credit @AmazingScotch)

SQL Sentry

I cannot say enough good things about SQL Sentry.  They sponsored many events at the Summit including (and if I miss one please let me know):

  • Quizbowl at the Welcome Reception
  • Karaoke on Tuesday night
  • #SQLRun on Wednesday morning (they marked the path and provided t-shirts to those who registered early)
  • WIT Panel Luncheon (including a cool USB hub for swag)
  • The SQL Sentry shuttle on Tuesday, Wednesday, and Thursday nights that provided transportation for Summit attendees around Charlotte

In addition to being a community leader, SQL Sentry is simply a remarkable company.  I have met many members of their team, and it’s a close-knit group that values their customers, and just puts out great products.  I have been a fan of the company and its team since I joined the community, and they raised the bar even further this year.  Well done.

Dr. DeWitt’s Keynote

On Thursday morning Dr. DeWitt returned to the PASS Summit…I actually have no idea how many times he’s given a talk at the PASS Summit, but I know that for each of the past four years that I have been there, he’s been there.  This year his topic was Hekaton and of course it did not disappoint.

Perry listening to Dr. DeWitt talk about Hekaton while I type rapidly

I live-blogged his session and was able to capture a fair bit of his content.  Dr. DeWitt explains complex database topics in a way that many understand – he’s not just a smart guy, he’s a great teacher.  Thank you Dr. DeWitt for your session, and thank you PASS for bringing him back again.  Can we do it again next year?

My Peeps

I cannot list everyone here.  You would all just end up looking for your name. :)  But seriously, there are so many factors that contribute to a successful Summit for me, and one of them is most certainly seeing friends and meeting new people.  Whether we had a 5 minute chat, discussed a technical problem and how to solve it, or enjoyed a beer at some point: thank you for being part of the SQL community, and for being part of my world.  I feel so fortunate to have a group of individuals, within my professional field, whom I call true friends.

Ok, ok…I have to give a special shout out to Johan Bijnens who brought me chocolate all the way from Belgium, and Aaron Bertrand who brought me Kinder Eggs from Canada.  Thank you both for feeding my addiction. :)

I’m already thinking about next year’s Summit, but I hope to see you all before then.  Have a great week, and good luck catching up on email!

p.s. One of my favorite pictures from the week, courtesy of Jimmy May.  And if you’re wondering why the heck this Perry the Platypus stuffed animal shows up in all these pictures…well, I take him with me on trips and then take pictures to send back to my kids.  They think it’s hilarious.  Ok…I do too.

Perry and me before my XE session

PASS Summit 2013: Day 2

And day 2 at this year’s PASS Summit starts with a sweet surprise from Aaron Bertrand ( b | t ), Kinder eggs.  It promises to be a good day.

Today is Dr. DeWitt‘s keynote (did I mention that he’s a University of Michigan alum? Go Blue!), and here we go…

8:15 AM

Douglas McDowell, outgoing Vice-President of Finance, starts with information about the PASS budget.  Summit is the largest source of revenue for PASS, the Business Analytics Conference provided a nice contribution to the budget this year (over $100,000), and PASS has one million dollars in reserve.

Last year PASS spent 7.6 million dollars on the SQL Server community, with the largest amount spent on the Summit.  The second largest cost was the BA Conference.  Per Douglas, Headquarters (HQ) is a critical investment for PASS.  Right now the IT department has 3 individuals maintaining 520 websites.  (And you thought you were a busy DBA!)  One initiative for PASS this year, and going forward, is an international expansion, which took about 30% of the budget this past year.  Overall, PASS is in a very good financial place – and thanks to Douglas for all his work as a Board member.

8:31 AM

Bill Graziano takes the stage to thank Douglas for his time on the Board, and also Rob Farley who moves off the Board this year.  Bill asked Rushabh to come on stage…Rushabh has been on the Board of Directors for 8 years.  He’s held the positions of VP of Marketing, Executive VP of Finance, and President.

8:34 AM

Incoming PASS President Tom LaRock takes the stage, and starts with an homage to Justin Timberlake and Jimmy Fallon’s hashtag video.  Awesome.  Tom introduces the incoming PASS BoD:

  • Thomas LaRock (President)
  • Adam Jorgensen (Executive VP of Finance)
  • Denise McInerney (VP of Marketing)
  • Bill Graziano (Immediate Past President)
  • Jen Stirrup (EMEA seat)
  • Tim Ford (US seat)
  • Amy Lewis (open seat)

Tom has announced the PASS BA Conference – it will be May 7-9, 2014 in CA.  Next year’s Summit will be November 4-7, 2014 in Seattle, WA.

The WIT Lunch is today – and I’m on the panel so I hope to see you there!

8:41 AM

Dr. DeWitt takes the stage, and the topic is Hekaton: Why, What, and How.

I was able to meet the co-author of this session, Rimma Nehme, before today’s keynote – she’s a Senior Researcher in his lab (which is apparently in an old Kroger grocery store building on the Madison campus).

DeWitt says that Hekaton is an OLTP rocket ship.  The marketing team has renamed Hekaton to In-Memory OLTP, and DeWitt wants people to vote on Twitter.  I am Team #Hekaton…it just sounds cooler (and it’s much easier to type).

He’s covering three things: the What, Why, and How of Hekaton.

Hekaton is memory optimized, but durable.  It’s a very high performance OLTP engine, but can be used for more than that.  It’s fully integrated into SQL Server 2014, not a bolt-on.  Architected for modern CPUs.  (Slide deck will be posted later, I’ll post the link when I have it.)

Why Hekaton?  Many OLTP databases now fit in memory.  There are certain kinds of workloads whose demands SQL Server can no longer meet.  Historically, OLTP performance has been improved by better software (driven by TPC benchmarks), CPU performance doubling every 2 years, and existing DBMS software maturing.  DeWitt says we’ve done as much as we can with mainline products.  CPUs are not getting faster – that well is dry.

Hekaton was picked as the name because the goal was a 100X improvement.  They’re not quite there yet – customers have seen 15-20X.  The arithmetic shows why a new approach is needed: if a workload burning 1 million instructions per transaction yields only 100 TPS, then getting to 10,000 TPS would require cutting the instructions per transaction to a value that’s just not possible with the existing engine.

Getting to 100X is what’s driving the flood of new main-memory products (e.g. Oracle TimesTen, IBM solidDB, VoltDB), including Hekaton.  Why a new engine?  Why not just pin all the tables in memory?  That won’t do the trick.  Performance would still be limited by the use of:

  • latches for shared data structures such as the buffer pool and lock table
  • locking as the concurrency control mechanism
  • interpretation of query plans

The implications of a shared buffer pool are worth considering.  Assume the pool is empty.  Query 1 comes along and needs page 7.  Is page 7 in the pool?  No, so a frame is allocated and the query has to wait while the I/O occurs.  The I/O completes and Query 1 can continue.  Remember that the buffer pool is a shared data structure.  If Query 2 checks for page 7, the buffer manager will report where it is, but Query 2 will be blocked by the latch on page 7 until Query 1 is finished.

(sidebar: a transaction or query only holds 2 latches at a time)

There can be significant contention for latches on “hot” pages in the buffer pool.  This can be a big performance hit.  All “shared” data must be protected with latches.

The need for concurrency control…  Query 1: A = A + 100.  Database actions: Read A, update the value, write A.  Query 2: A = A + 500.  Database actions: Read A, update the value, write A.  If A was originally 1000, after both queries, it will be 1600.  This represents a serial schedule.
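
In T-SQL terms, the serial schedule above looks like the following sketch, using a hypothetical one-row table dbo.Account whose integer column A starts at 1000 (my illustration, not a slide from the keynote):

-- Query 1
BEGIN TRANSACTION;
UPDATE dbo.Account SET A = A + 100;
COMMIT;

-- Query 2
BEGIN TRANSACTION;
UPDATE dbo.Account SET A = A + 500;
COMMIT;

-- Returns 1600 when the two transactions run serially
SELECT A FROM dbo.Account;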

Two-phase locking, developed by Dr. Jim Gray, is the standard.  Two simple rules:

  1. Before access, query must acquire “appropriate” lock type from Lock Manager
  2. Once a query releases a lock, no further locks can be acquired

If these rules are followed, the resulting schedule of actions is equivalent to some serial (good) schedule.  Dr. Gray received the Turing Award for the proof of this, one of two given to database scientists.

(sidebar: can I get a fellowship in Dr. DeWitt’s lab?  Seriously…)

A deadlock detection/resolution mechanism is still needed (the team wanted to get rid of this for Hekaton…which is part of why it’s been a 5 year effort).

After a query is parsed and optimized, you get an execution plan, which is given to a query interpreter that walks the tree of operators and executes them in a particular order.  When the database is on disk, the cost of interpreting the tree is insignificant.

These three things – latches, locking, and query plan interpretation – are why you can’t get to 100X with current implementations.

Currently in SQL Server, shared data structures use latches.  Concurrency control is done via locking, and query execution is via interpretation.

With Hekaton, shared data structures are lock-free.  For concurrency control, versions with timestamps plus optimistic concurrency control are used.  For query execution, queries are compiled into a DLL that is loaded when they are executed.  This is what will get us to 100X.

SQL Server has 3 query engines – relational, Apollo (column store), and Hekaton.

To use Hekaton, create a memory-optimized table.  There are two kinds of durability: schema-only and schema-and-data.  (Every Hekaton table must have a primary key index – it can be hash or range.  There is also a new b-tree in Hekaton, the Bw-tree, which gives high performance on range queries.)  There are some schema limitations for V1.  Once you’ve created the table, populate it: run a SELECT INTO statement, or do a BULK LOAD from a file – just make sure it’s going to fit into memory.  Then, use the table.  Via the standard ad-hoc T-SQL query interface (termed “interop”), you get up to a 3X performance boost.  Adapt, recompile, and execute T-SQL stored procedures, and you get a 5X-30X improvement.
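
As a concrete illustration of the DDL involved – this is just a minimal sketch based on the SQL Server 2014 CTP2 syntax, with a hypothetical table, not anything shown in the keynote – creating a durable memory-optimized table looks something like this (it assumes a MEMORY_OPTIMIZED_DATA filegroup has already been added to the database):

-- Minimal sketch (hypothetical table); assumes a MEMORY_OPTIMIZED_DATA
-- filegroup already exists in the database
CREATE TABLE dbo.ShoppingCart
(
    CartId      INT       NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    UserId      INT       NOT NULL,
    CreatedDate DATETIME2 NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

Specifying DURABILITY = SCHEMA_ONLY instead gives the non-durable flavor, where the data (but not the table definition) is lost on restart.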

Query optimization is the hardest part of relational databases, per Dr. DeWitt.  Lock-free data structures truly are rocket science – they make query optimization look simple.

Lock-free data structures invented by Maurice Herlihy at Brown University – got him elected to the National Academy of Engineering (which is a big deal).

When you think lock-free, think latch-free – it allows multiple processes with threads to access the same data structure without blocking.  Dr. DeWitt has a great slide showing the performance differences with multiple threads for latched vs. lock-free structures.  He mentioned that it was a tough slide to animate (and if you see it, you’ll understand why…I was actually wondering how he did it).  With lock-free (aka latch-free?) structures, an update does not block reader threads – there are no performance hits.  Every shared data structure in Hekaton was built around this functionality.

In Hekaton, now have a different concurrency control.  It’s optimistic:

  • Conflicts are assumed to be rare
  • Transactions run to “completion” without blocking or setting locks
  • Conflicts detected during a Validation phase

Second component of concurrency control is multiversion – updating a row creates a NEW version of the row.  It works really well when you do this in memory.  The third component is timestamps – every row version has a timestamp:

  • Each row version has an associated time range
  • Transactions use their begin timestamp to select the correct version
  • Timestamps are also used to create a total order for transactions, to obtain the equivalent of a serial order

This approach drastically reduces number of threads – dozens not hundreds.

Transaction phases in Hekaton:

  • Read committed versions of rows
  • Updates create new tentative versions
  • Track read set, write set, and scan set

When the transaction is done, it goes through a second phase, validation; this is where the concurrency control mechanism decides whether the transaction can commit.  It then reaches its commit point…

When a transaction begins, the current clock value is used as the Begin_TS for the transaction.  At the start of the validation phase, the transaction is given a unique End_TS, which is used during validation to determine whether it is safe to commit the transaction.  Begin_TS values are NOT unique; End_TS values are ALWAYS unique.

Hekaton tables have either a hash or range index on a unique key.  Rows are allocated space from SQL Server’s heap storage.  Additional indexes (hash or range) can be built on other attributes.

Hekaton row format – all rows are tagged with a Begin_TS and End_TS.  The latest version has infinity in the End_TS (the most recent version of the row).  The Begin_TS is the End_TS of the inserting transaction.  The End_TS is the logical time when the row was deleted and/or replaced with a new version.

Multiversioning Example – take a transaction that increases a value by 10,000.  A new version of the row is created.  Pointers are used to link the rows together in memory – don’t think of them as being contiguous in memory.  The transaction puts its signature (transaction ID) on each row (the End_TS of the initial row, the Begin_TS of the new row).  When the transaction is later validated and committed, for all rows it updated/created/deleted, it will re-access each row with that transaction ID and replace it with the End_TS.  NO LATCHES ARE USED!  NO LOCKS ARE SET!  NO BLOCKING OF ANY TRANSACTIONS!  (I’m not not yelling, and neither is Dr. DeWitt.)  This is timestamps and versioning – used on rows AND transactions.  New versions of rows are always created when doing updates.  Per Dr. DeWitt, competitors are not going to have the same level of performance.

9:27 AM

Optimistic multi-version – this is the lock/latch-free mechanism in Hekaton (Dr. DeWitt says it so fast it’s hard to catch). :)

When is it safe to discard “old” versions of a row?  When the begin timestamp of the oldest query in the system is ahead of the last End_TS, the older rows are no longer needed.  Hekaton garbage collection is non-blocking, cooperative, incremental, parallel, and self-throttling.  It has minimal impact on performance, and it happens completely under the covers.

Steps:

  1. Updates create a new version of each updated row
  2. Transactions use a combination of timestamps and versions for concurrency control
  3. A transaction is allowed to read only versions of rows whose “valid” time overlaps the Begin_TS of the transaction.
  4. Transactions essentially never block (WAIT, there’s a caveat here that Dr. DeWitt is glossing over…hm…)

Validation Phase

  1. Transaction obtains a unique End_TS
  2. Determine if transaction can be safely committed.

Validation steps depend on the isolation level of the transaction – “new” isolation levels for Hekaton.

Read Stability key idea: check that each version read is still “visible” at the end of the transaction using End_TS.

Phantom Avoidance requires repeating each scan to check whether new versions have become visible since the transaction started.  If any scan returns additional rows, validation fails.  This sounds expensive, but keep in mind that all rows are in memory.  It is only performed for transactions running at the serializable isolation level, and it is still a LOT cheaper than acquiring and releasing locks.

There is also a post-processing phase with 3 sub-phases (which I couldn’t type fast enough…oy).

Checkpoints & Recovery – the data is not lost: there is a normal checkpoint process, and logs are used to generate checkpoints (which hold the data through a shutdown).  Restart/recovery starts by loading a known checkpoint and scanning the log to recover all work since then, and it’s fully integrated with HA (giving readable secondaries of memory-optimized tables).

The standard method for query execution on a relational system is complicated…and slow-ish.  Regular T-SQL access to Hekaton tables is interop: queries can access and update both Hekaton and disk-resident tables, but interpreted execution limits performance.  When you put it all into a DLL, you get much faster execution.

Native plan generation – the query goes through the parse tree, which produces a logical plan that is fed into the optimizer, which produces a physical plan – but these are likely to be different for Hekaton (different algorithms and cost model).  The physical plan is then translated into C code (the ugliest C code you’ve ever seen, no function calls, per DeWitt), which goes into a C compiler and produces a DLL – very slim, containing only what’s needed to run the stored procedure.  The DLL is then loaded and invoked – it’s stored in the catalog.
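
To make that DLL step concrete, here is a minimal sketch of a natively compiled stored procedure against the hypothetical table from earlier, again using the SQL Server 2014 CTP2 syntax (the procedure name and parameters are made up):

CREATE PROCEDURE dbo.usp_AddToCart
    @CartId INT, @UserId INT, @CreatedDate DATETIME2
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH
    (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    -- The body is translated to C and compiled into a DLL when the
    -- procedure is created, so execution skips the interpreter entirely
    INSERT INTO dbo.ShoppingCart (CartId, UserId, CreatedDate)
    VALUES (@CartId, @UserId, @CreatedDate);
END;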

9:50 AM

Dr. DeWitt just gave a bunch of quantitative data showing performance improvements in terms of transactions/sec, instructions/sec and CPU…I couldn’t copy it fast enough. :)

For more details, there is a DBA-focused session today at 1:30 PM in Room 208 A-B, and a dev-focused session tomorrow at 10:15 AM.

Dr. DeWitt takes some time to thank his team.  This is something I really appreciate about him.  He brings Rimma Nehme on stage (who surprised him by showing up today) and gives nothing but praise…ending with a slide that has a picture of Rimma and a bubble: “Let the woman drive.”  Love it!

You can download DeWitt’s deck and supporting files here.

Thank you Dr. DeWitt!

PASS Summit 2013: Day 1

Greetings from Charlotte, NC where Day 1 of the PASS Summit is about to kick off!  I just got back from this year’s #SQLRun, sponsored by SQL Sentry, and I can’t believe the size to which the event has grown since Jes ( b | t ) organized the first run two years ago.  It was great to see so many fellow runners at 6 AM this morning!

Today’s keynote kicks off with Quentin Clark, Corporate Vice President of the Data Platform Group at Microsoft, and one announcement you can probably expect is the release of CTP2 for SQL Server 2014, which you can download here (thank you Magne Fretheim for the link!).  Check out the SQL Server 2014 CTP2 Product Guide and Release Notes for more details.  If you cannot attend this morning’s keynote because you’re still recovering from the run, or recovering from last night, or because you’re not here, just tune in to PASSTV to watch it live.  Stay tuned to PASSTV to catch specific sessions live – note that each session will have a hashtag to follow, to which you can also Tweet questions.

Speaking of sessions, I’m up first after the keynote this morning with Making the Leap from Profiler to Extended Events in Room 217A.  If you’ve been a long-time user of Profiler and Trace, but know you need to get familiar with Extended Events (as much as you might dread it!), please join me.  It’s a demo-filled session that walks you through how to re-wire your approach to capturing live data.  We’re not diving too deep into XE, I’ll leave that to Jonathan’s two Extended Events courses on Pluralsight, but this session will get you over that initial “I don’t know where to start” feeling.  Hope to see you there!

During this morning’s keynote I also expect to see some members from the PASS Board of Directors – notably outgoing President Bill Graziano and incoming President Thomas LaRock.  If you’re interested, there is Q&A with the PASS BoD on Friday from 9:15 AM to 10:45 AM.  Not familiar with PASS?  Visit the Community Zone this week for more info!

I won’t be at the blogger’s table this morning to live blog the Keynote, but will be present tomorrow for the keynote from Dr. David DeWitt, which I’m looking forward to immensely.  Until then, I hope to see you at my session or elsewhere at Summit.  Say hi, send a tweet, and enjoy the day at Summit!

Thoughts on the Nomination Committee and PASS Board Elections

The PASS Nomination Committee finished interviewing those who applied for the Board last Wednesday, August 28, 2013.  In the event that you’re not familiar with the NomCom and what it does, there’s a short paragraph on the Nomination Committee page.  After the interviews, the candidates were ranked based on their written applications and their interviews with the NomCom.  The candidates were informed of their final ranking, and have the opportunity to withdraw their name from consideration before the slate is publicly announced.

The Process

When I applied to be on the NomCom, I had an understanding of what it entailed, but I didn’t know all the details.  For a bit of NomCom history, I would encourage you to review the PASS archive about the Election Review Committee (ERC).  Within the archive there is a Documents page, and the two documents there outline the process for the NomCom to follow when reviewing applications and interviewing candidates.  The NomCom NDA limits what I can disclose about the events of this year, but I can tell you that it was extremely eye-opening, and worth every minute I spent working with the other members of the team.  I learned even more about PASS as a whole, what important qualities are needed for a Board member, and gained great insight into the applicants who applied.  And of course the whole process got me thinking…

Thoughts on the Candidates

The slate for the three open Board positions will be announced in the coming weeks, and voting will begin soon after.  The applications originally submitted by the candidates will be available on the PASS site, and candidates will have additional opportunities to answer questions and campaign.

During the NomCom process I was reminded that people communicate best in different ways.  Some write extremely well, others speak extremely well.  Some are fortunate to excel at both.

To the candidates: I encourage you to know your strengths, and then take advantage of them.  If you can articulate your thoughts, your vision, and your qualifications in writing so that your passion also shines through: do it.  Get your content published – on the PASS site, on your blog, by someone who supports you.  If you can explain your opinions and ideas better by speaking: do it.  Host a conference call, a Google hangout, something, to have your voice heard (literally).

Thoughts on Voting

To anyone who is not running, I encourage you to exercise your right as an active PASS member to vote.

But I more strongly encourage you to take the time to find out as much as you can about each candidate before you vote.  I say this because it’s possible for an election to become a popularity contest, and you may or may not know every candidate who is running.  You may know candidates personally, or you may only know them by name.  You may have seen a candidate speak at an event.  You may recognize a candidate’s name because you’ve seen it in the PASS Connector.  You may not know anyone who’s on the slate (this was certainly the case for me the first year I voted in a PASS election).  Whether you know a candidate or not, before you vote you do need to know what they think about PASS, what they’ve done, where they see PASS going, and whether they will be a good leader.  Take advantage of the information the candidates share to learn as much as you can about each individual.

Why You Should Vote

The easy answer, when someone asks, “Why should I vote?” is because you’re a member of the organization and it’s your right.  Some may argue it’s your duty.

I don’t know why you should vote.  But I’ll tell you why I’m voting.

Almost six years ago at a SQL Server conference – which was not the PASS Summit – I realized that I was not the only person who really loved working with SQL Server.  I discovered there were more like me.  Just over three years ago I finally realized there was a professional organization for all the people who loved working with SQL Server: PASS.  It took me over two years to find PASS.  Two.  And why does that matter?  Because in the three-plus years since I found PASS I have become a better professional.  Yes, I have found a lot more people like me and I have developed friendships that extend beyond the walls of an organization and the functions of an application.  And for that I’m eternally grateful.  But I am just as thankful for the opportunities I have encountered as a PASS member, and for the improvements I’ve made technically and professionally.

So for me, I vote to invest in my career, my future.  I can’t do it alone.  I cannot become better by just sitting in my corner of the world.  I need my co-workers, my colleagues, my user group, my fellow PASS members, my #sqlfamily.  And to keep that group moving forward, we need a strong and focused Board.  I’ll vote for that any day of the week and twice on Sunday.

PASS Board of Directors Applications

Yesterday, August 7, 2013, was the last day to submit applications for the PASS Board of Directors election.  I have the honor of serving on the Nomination Committee this year, and while there is very little I can disclose, I do want to take a moment to extend a heartfelt thank you to those that submitted an application for a Board of Directors position.

Thank you for taking the time to complete the application, I know it can be daunting.

Thank you for taking the time to think through what you want to see happen as a Board member.  Whether you are elected or not, I hope that you will still endeavor to make those things happen.

Thank you for talking to your friends and colleagues and asking them to be references for you (and thank you to those that agreed to do so).

And thank you, in advance, for committing additional time to meet with the NomCom to share your story and your goals.

When you run for any office, or any board, in any arena, you take a risk.  You become vulnerable, and you put your own name out there for discussion.  That takes courage, and I applaud all of you for stepping forward and taking a risk so that you may serve the SQL Server community as a member of the PASS Board.  I wish all of you the best, and I am confident that three very qualified individuals will fill the Board positions this year.  Good luck!

What I Know For Sure…After One Year at SQLskills

Today, August 1, 2013, marks my one year anniversary as a member of the SQLskills team.  Simply put, it’s been a great year.  Challenging in many ways, exhausting at times, but absolutely what I wanted (and expected) to be doing in this role.  Over the past year I’ve been asked many times, “How’s your new job?!”  It’s not-so-new now, but since I didn’t blog much about the non-technical side of life during the past year, I thought I’d use this post to tell you about my new job.  Specifically, the five most important things I learned during the past year.

Talking, out loud, is important

Working for SQLskills means I work remotely, therefore I work from home.  This was quite an adjustment.  I knew it would significantly change the rhythm of each day, but I had no idea what it would look like.  I’ve considered writing about it many times, but a few months ago Merrill Aldrich wrote a post, Telecommuting, Month 9, that explained – very well – many of my own thoughts and observations.  In the comments my friend Jes Borland, who also works from home, clearly articulates one challenge of working remotely.

I found out that what I miss is being able to say, out loud, “I have this idea. What do you think of it?” and getting immediate feedback.

Yes.  YES!  I love the solitude of my office…having the entire house to myself.  Some days I don’t even turn on music or anything for background noise.  But when I want to talk about something, I want to talk about it right now…out loud (funny sidebar, this video makes me laugh…let’s taco ‘bout it).  Trying to discuss ideas over email or chat isn’t the same.  It doesn’t create the same excitement, or the cross-pollination of ideas that occurs during a true conversation.  As Joe says, “it’s where the magic happens.”  It’s true.

Half the battle is realizing the problem.  The other half is figuring out what to do about it.  I make notes about what I want to discuss, and then fire off an email or set up a WebEx.  Jon and I have had numerous late night WebEx sessions where we talk through something, and suddenly at 1 AM I find myself with a litany of post-it notes spread across my desk and ideas churning in my head.  I love those moments.  They are not as organic or spontaneous as they were in an office setting, but I can still make them happen with a little effort.

When theory meets execution

SQL Server is a vast product, and many of us have seen and done a lot…but we haven’t seen and done everything.  As such, there are scenarios and tasks that we’ve read about, that make sense, but we haven’t actually walked through on our own.  We know what’s required to set up an availability group.  We have the checklist, the steps are logical, we can estimate how long it will take, and we’ve read every supporting blog post and technical article we can find.  But I’ve yet to find anything that replaces the actual execution of the task.  In some cases, what’s expected is actually what happens.  And that’s a wonderful thing.  But there are other times where what is planned is not what occurs.  I like this quote I just read in Bob Knight’s book, The Power of Negative Thinking:

Don’t be caught thinking something is going to work just because you think it’s going to work.

Planning beats repairing.

Theory and execution are not always the same – it’s certainly nice when they are and when the implementation goes as planned.  But don’t rely on it.  Ultimately, practice and preparation are required to consistently ensure success.

Nothing can replace experience

If you’ve worked in technology a while, you know that a core skill is troubleshooting.  And to be good at troubleshooting, you must have an approach, a methodology that you follow as you work through an issue.  But to be really good at troubleshooting, you also need to recognize patterns.

I came into this role with many years of experience troubleshooting database issues.  But I spent the majority of that time looking at the same database, across different customer installations (if you don’t know my background, I used to work for a software vendor and as part of my job I supported the application database).  I became familiar with the usual database-related problems, and knew how to quickly identify and fix them.  We typically call this pattern matching, and I found it well explained in this excerpt from The Sports Gene, where it’s defined as “chunking.”  From the article:

… rather than grappling with a large number of individual pieces, experts unconsciously group information into a smaller number of meaningful chunks based on patterns they have seen before.

In the past year I’ve seen a lot of new patterns.  And some days were extremely frustrating because I would look at a problem, get stuck, and then ask another member of the team to look at the issue with me.  It was usually Jon, who would often look at the issue for a couple minutes and then say, “Oh it’s this.”  It was infuriating.  And I would ask Jon how he knew that was the problem.  The first time I asked him, I think he thought I was questioning whether he was right.  But in fact, I just wanted to know how he figured it out so quickly.  His response?  “I’ve seen it before.  Well maybe not this exact thing, but something similar.”  It’s pattern matching.  It’s chunking.  It’s experience.  You cannot read about it.  You cannot talk about it.  You just have to go get it.  And be patient.

I have a great team

I actually have two great teams: my team at work and my team at home.  I work with individuals who are experts in the SQL Server Community.  Their support is unwavering.  Their willingness to help knows no limits.  I am always appreciative for the time and the knowledge they share, and I am proud to not just work with them, but to call them friends.  To the SQLskills team: thank you for a fantastic first year – I look forward to what’s ahead!  (And happy birthday Glenn!)

My team at home is Team Stellato: my husband Nick and my two kids.  The first year of any job is an adventure, and for me there’s a lot of overhead – a lot of thought around what I’m doing, what I need to finish, what’s next, etc.  And much of that continues when I’m not at my desk.  I haven’t always been 100% present this past year, and over the last 12 months I’ve said, I don’t know how many times, that I’m still figuring it out.  And I am still figuring it out.  It’s hard to balance everything.  It’s hard to stay in the moment all the time.  I firmly believe I can do it, but I also believe I can do it better than I’m doing it today.  Thank you Nick for just being you – being supportive, understanding, and patient, and for making me laugh.  We’ll get there.  And thank you to my kids for trying to understand that being at home and being available aren’t always the same thing.  This year I will do better at being present during our time.

Make time for the gym

The last item to mention is something I need to be successful, but it may not be necessary for everyone.  It’s exercise.  It seems pretty straightforward, right?  For some reason it’s a continual battle I fight in my head.  I don’t always have enough hours in the day to get done what I want to get done, so something has to give.  I’m very quick to sacrifice a run, a spin class, or a hot yoga session.  My thought process is: “I will need 30/60/90 minutes for that workout.  That’s time I could spend working/hanging out with my family/having lunch with a friend.”  But when I give up that workout multiple days in a row, my mental and emotional health suffer…more than my physical health.  A workout clears my head – solutions come faster, ideas flow easier, I am more focused when I need to be – and it reduces my stress.  It’s ironic if you think about it…making time to work out introduces this stress (“Can I do everything?!”) but the act of working out makes everything else I need to do so much easier.  And it’s not about how far I run, or how many classes I get to in a week.  It’s the workout itself – whether it’s an intense 50 minutes of spin, a 1.5 mile run while the kids bike, or an hour in the yoga studio.

Year 2 and beyond

So, how’s my new job?  It’s great.  In many ways it is exactly what I expected, and in other ways it’s not – and that’s not a bad thing.  I didn’t anticipate every challenge I would have in working from home, but I am not afraid of them, nor do I think they’re unconquerable.  I have learned how to step back and critically look at where I am in my career, and evaluate what’s working well and what isn’t.  And this is working well.  It’s hard – hard because I am learning a ton and juggling many things, and that can be exhausting.  But I wouldn’t want it any other way.  I hate to be bored!  I absolutely love working with people who know so much, because it reminds me how much there is to know and what I can learn.  It is a fantastic motivator for me.  And the SQLskills team is fun.  A little weird at times :) but very fun and extremely supportive.  I cannot explain the importance of that, for me, enough.  And so begins year 2, let’s see what adventures this brings…IE0 anyone?!!

The Accidental DBA (Day 26 of 30): Monitoring Disk I/O

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. If you know someone who would benefit from this class, refer them and earn a $50 Amazon gift card – see class pages for details. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

Database storage can seem like a black box. A DBA takes care of databases, and those databases often reside somewhere on a SAN – space simply presented to a DBA as a drive letter representing some amount of space. But storage is about more than a drive letter and a few hundred GBs. Yes, having enough space for your database files is important. But I often see clients plan for capacity and not performance, and this can become a problem down the road. As a DBA, you need to ask your storage admin not just for space, but for throughput, and the best way to back up your request is with data.

I/O Data in SQL Server

I’ve mentioned quite a few DMVs in these Accidental DBA posts, and today is no different. If you want to look at I/O from within SQL Server, you want to use the sys.dm_io_virtual_file_stats DMV. Prior to SQL Server 2005 you could get the same information using the fn_virtualfilestats function, so don’t despair if you’re still running SQL Server 2000!  Paul has a query that I often use to get file information in his post, How to examine IO subsystem latencies from within SQL Server. The sys.dm_io_virtual_file_stats DMV accepts database_id and file_id as inputs, but if you join over to sys.master_files, you can get information for all your database files. If I run this query against one of my instances, and order by write latency (desc) I get:

output from sys.dm_io_virtual_file_stats

This data makes it look like I have some serious disk issues – a write latency of over 1 second is disheartening, especially considering I have a SSD in my laptop! I include this screenshot because I want to point out that this data is cumulative. It only resets on a restart of the instance. You can initiate large IO operations – such as index rebuilds – against a database that can greatly skew your data, and it may take time for the data to normalize again. Keep this in mind not only when you view the data at a point in time, but when you share findings with other teams. Joe has a great post that talks about this in more detail, Avoid false negatives when comparing sys.dm_io_virtual_file_stats data to perfmon counter data, and the same approach applies to data from storage devices that your SAN administrators may use.

The information in the sys.dm_io_virtual_file_stats DMV is valuable not only because it shows latencies, but also because it tells you which files have the highest number of reads and writes and MBs read and written. You can determine which databases (and files) are your heavy hitters and trend that over time to see if it changes and how.
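
If you want a starting point before grabbing Paul’s full query, a rough sketch of that join (not his query, just the basic shape, using only columns from sys.dm_io_virtual_file_stats and sys.master_files) looks something like this:

SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_latency_ms,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms,
    vfs.num_of_bytes_read / 1048576 AS mb_read,
    vfs.num_of_bytes_written / 1048576 AS mb_written
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
    AND vfs.file_id = mf.file_id
ORDER BY avg_write_latency_ms DESC;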

I/O Data in Windows

If you want to capture I/O data from Windows, Performance Monitor is your best bet. I like to look at the following counters for each disk:

  • Avg. Disk sec/Read
  • Avg. Disk Bytes/Read
  • Avg. Disk sec/Write
  • Avg. Disk Bytes/Write

Jon talked about PerfMon counters earlier, and the aforementioned counters tell you about latency and throughput.  Latency is how long it takes for an I/O request, but this can be measured at different points along the layers of a solution. Normally we are concerned with latency as measured from SQL Server. Within Windows, latency is the time from when Windows initiated the I/O request to the completion of the request. As Joe mentioned in his post, you may see some variation between what you see for latency from SQL Server versus from Windows.

When we measure latency using Windows Performance Monitor, we look at Avg. Disk sec/Read and Avg. Disk sec/Write. Disk cache, whether on a disk, a controller card, or a storage system, impacts read and write values. Writes are typically written to cache and should complete very quickly. Reads, when not in cache, have to be pulled from disk and that can take longer.  While it’s easy to think of latency as being entirely related to disk, it’s not. Remember that we’re really talking about the I/O subsystem, and that includes the entire path from the server itself all the way to the disks and back. That path includes things like HBAs in the server, switches, controllers in the SAN, cache in the SAN, and the disks themselves. You can never assume that latency is high because the disks can’t keep up. Sometimes the queue depth setting for the HBAs is too low, or perhaps you have an intermittently bad connection with a failing component like a GBIC (gigabit interface converter) or maybe a bad port card. You have to take the information you have (latency), share it with your storage team, and ask them to investigate. And hopefully you have a savvy storage team that knows to investigate all parts of the path. A picture is worth a thousand words in more complex environments. It is often best to draw out, with the storage administrator, the mapping from the OS partition to the SAN LUN or volume. This should generate a discussion about the server, the paths to the SAN, and the SAN itself. Remember what matters is getting the I/O to the application. If the I/O leaves the disk but gets stuck along the way, that adds to latency. There could be an alternate path available (multi-pathing), but maybe not.

Our throughput, measured by Avg. Disk Bytes/Read and Avg. Disk Bytes/Write, tells us how much data is moving between the server and storage. This is valuable to understand, and often more useful than counting I/Os, because we can use it to understand how much data our disks will need to be able to read and write to keep up with demand. Ideally you capture this information when the system is optimized – simple things like adding indexes to reduce full table scans can affect the amount of I/O – but often you will need to just work within the current configuration.

Capturing Baselines

I alluded to baselines when discussing the sys.dm_io_virtual_file_stats DMV, and if you thought I was going to leave it at that then you must not be aware of my love for baselines!

You will want to capture data from SQL Server and Windows to provide throughput data to your storage administrator. You need this data to procure storage on the SAN that will not only give you enough space to accommodate expected database growth, but that will also give you the IOPs and MB/sec your databases require.

Beyond a one-time review of I/O and latency numbers, you should set up a process to capture the data on a regular basis so you can identify if things change and when. You will want to know if a database suddenly starts issuing more I/Os (did someone drop an index?) or if the change in I/Os is gradual. And you need to make sure that I/Os are completing in the timeframe that you expect. Remember that a SAN is shared storage, and you don’t always know with whom you’re sharing that storage. If another application with high I/O requirements is placed on the same set of disks, and your latency goes up, you want to be able to pinpoint that change and provide metrics to your SAN administrator that support the change in performance in your databases.
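
A simple way to do that – just a sketch, with a hypothetical table name – is to snapshot the DMV into a history table on a schedule (a SQL Agent job works fine), and then compare the deltas between captures, since the DMV values are cumulative:

-- One-time setup: a hypothetical history table
CREATE TABLE dbo.FileStatsHistory
(
    capture_time         DATETIME2 NOT NULL DEFAULT SYSDATETIME(),
    database_id          INT       NOT NULL,
    [file_id]            INT       NOT NULL,
    num_of_reads         BIGINT    NOT NULL,
    io_stall_read_ms     BIGINT    NOT NULL,
    num_of_writes        BIGINT    NOT NULL,
    io_stall_write_ms    BIGINT    NOT NULL,
    num_of_bytes_read    BIGINT    NOT NULL,
    num_of_bytes_written BIGINT    NOT NULL
);

-- Scheduled capture; trend the differences between rows, not the raw values
INSERT INTO dbo.FileStatsHistory
    (database_id, [file_id], num_of_reads, io_stall_read_ms,
     num_of_writes, io_stall_write_ms, num_of_bytes_read, num_of_bytes_written)
SELECT database_id, [file_id], num_of_reads, io_stall_read_ms,
       num_of_writes, io_stall_write_ms, num_of_bytes_read, num_of_bytes_written
FROM sys.dm_io_virtual_file_stats(NULL, NULL);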

Summary

As a DBA you need to know how your databases perform when it comes to reads and writes, and it’s a great idea to get to know your storage team. It’s also a good idea to understand where your databases really “live” and what other applications share the same storage. When a performance issue comes up, use your baseline data as a starting point, and don’t hesitate to pull in your SAN administrators to get more information. While there’s a lot of data readily available for DBAs to use, you cannot get the entire picture on your own. It may not hurt to buy your storage team some pizza or donuts and make some new friends. :)  Finally, if you’re interested in digging deeper into the details of SQL Server I/O, I recommend starting with Bob Dorr’s work on the topic.

The Accidental DBA (Day 25 of 30): Wait Statistics Analysis

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. If you know someone who would benefit from this class, refer them and earn a $50 Amazon gift card – see class pages for details. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

For the last set of posts in our Accidental DBA series we’re going to focus on troubleshooting, and I want to start with Wait Statistics.  When SQL Server executes a task, if it has to wait for anything – a lock to be released from a page, a page to be read from disk into memory, a write to the transaction log to complete – then SQL Server records that wait and the time it had to wait.  This information accumulates, and can be queried using the sys.dm_os_wait_stats DMV, which was first available in SQL Server 2005.  Since then, the waits and queues troubleshooting methodology has been a technique DBAs can use to identify problems, and areas for optimizations, within an environment.

If you haven’t worked with wait statistics, I recommend starting with Paul’s wait stats post, and then working through Tom Davidson’s SQL Server 2005 Waits and Queues whitepaper.

Viewing Wait Statistics

If you run the following query:

SELECT *
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

You will get back output that isn’t that helpful, as you can see below:

sys.dm_os_wait_stats output

It looks like FT_IFTS_SCHEDULER_IDLE_WAIT is the biggest wait, and SQL Server’s waited for 1930299679 ms total.  This is kind of interesting, but not what I really need to know.  How do I really use this data?  It needs some filtering and aggregation.  There are some waits that aren’t going to be of interest because they occur all the time and are irrelevant for our purposes; we can filter out those wait types.  To make the most of our wait stats output, I really want to know the highest wait based on the percentage of time spent waiting overall, and the average wait time for that wait.  The query that I use to get this information is the one from Paul’s post (mentioned above).  I won’t paste it here (you can get it from his post) but if I run that query against my instance, now I get only three rows in my output:

sys.dm_os_wait_stats output with wait_types filtered out
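
If you just want the general shape of such a query, here is a simplified sketch (this is not Paul’s query; his filters out many more wait types than the handful shown here):

SELECT TOP (10)
    wait_type,
    wait_time_ms / 1000.0 AS wait_time_sec,
    100.0 * wait_time_ms / SUM(wait_time_ms) OVER () AS pct_of_total,
    1.0 * wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN
    (N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
     N'CHECKPOINT_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH', N'XE_TIMER_EVENT',
     N'BROKER_TO_FLUSH')
ORDER BY wait_time_ms DESC;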

If we reference the various wait types listed in the MSDN entry for sys.dm_os_wait_stats, we see that the SQLTRACE_WAIT_ENTRIES wait type “Occurs while a SQL Trace event queue waits for packets to arrive on the queue.”

Well, this instance is on my local machine and isn’t very active, so that wait is likely due to the default trace that’s always running.  In a production environment, I probably wouldn’t see that wait, and if I did, I’d check to see how many SQL Traces were running.  But for our purposes, I’m going to add that as a wait type to filter out, and then re-run the query.  Now there are more rows in my output, and the percentage for the PAGEIOLATCH_SH and LCK_M_X waits has changed:

sys.dm_os_wait_stats output with SQLTRACE_WAIT_ENTRIES also filtered out

If you review the original query, you will see that the percentage calculation for each wait type uses the wait_time_ms for the wait divided by the SUM of wait_time_ms for all waits.  But “all waits” are those wait types not filtered by the query.  Therefore, as you change what wait types you do not consider, the calculations will change.  Keep this in mind when you compare data over time or with other DBAs in your company – it’s a good idea to make sure you’re always running the same query that filters out the same wait types.

Capturing Wait Statistics

So far I’ve talked about looking at wait statistics at a point in time.  As a DBA, you want to know what waits are normal for each instance.  And there will be waits for every instance; even if it’s highly tuned or incredibly low volume, there will be waits.  You need to know what’s normal, and then use those values when the system is not performing well.

The easiest way to capture wait statistics is to snapshot the data to a table on a regular basis, and you can find queries for this process in my Capturing Baselines for SQL Server: Wait Statistics article on SQLServerCentral.com.  Once you have your methodology in place to capture the data, review it on a regular basis to understand your typical waits, and identify potential issues before they escalate.  When you do discover a problem, then you can use wait statistics to aid in your troubleshooting.
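
The bare-bones version of that idea – a sketch with a hypothetical table name, not the queries from the article – is simply a scheduled snapshot of the DMV:

-- One-time setup: a hypothetical snapshot table
CREATE TABLE dbo.WaitStatsHistory
(
    capture_time        DATETIME2    NOT NULL DEFAULT SYSDATETIME(),
    wait_type           NVARCHAR(60) NOT NULL,
    waiting_tasks_count BIGINT       NOT NULL,
    wait_time_ms        BIGINT       NOT NULL,
    signal_wait_time_ms BIGINT       NOT NULL
);

-- Run on a schedule (e.g. a SQL Agent job); the DMV is cumulative,
-- so compare deltas between captures when you analyze the data
INSERT INTO dbo.WaitStatsHistory
    (wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats;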

Using the Data

At the time that you identify a problem in your environment, a good first step is to run your wait statistics query and compare the output to your baseline numbers.  If you see something out of the ordinary, you have an idea where to begin your investigation.  But that’s it; wait statistics simply tell you where to start searching for your answer.  Do not assume that your highest wait is the problem, or even that it’s a problem at all.  For example, a common top wait is CXPACKET, and CXPACKET waits indicate that parallelism is used, which is expected in a SQL Server environment.  If that’s your top wait, does that mean you should immediately change the MAXDOP setting for the instance?  No.  You may end up changing it down the road, but a better direction is to understand why that’s the highest wait.  You may have CXPACKET waits because you’re missing some indexes and there are tons of table scans occurring.  You don’t need to change MAXDOP, you need to start tuning.

Another good example is the WRITELOG wait type.  WRITELOG waits occur when SQL Server is waiting for a log flush to complete.  A log flush occurs when information needs to be written to the database’s transaction log.  A log flush should complete quickly, because when there is a delay in a log write, then the task that initiated the modification has to wait, and tasks may be waiting behind that.  But a log flush doesn’t happen instantaneously every single time, so you will have WRITELOG waits.  If you see WRITELOG as your top wait, don’t immediately assume you need new storage.  You should only assume that you need to investigate further.  A good place to start would be looking at read and write latencies, and since I’ll be discussing monitoring IO more in tomorrow’s post we’ll shelve that discussion until then.

As you can see from these two examples, wait statistics are a starting point.  They are very valuable – it’s easy to think of them as “the answer”, but they’re not.  Wait statistics do not tell you the entire story about a SQL Server implementation.  There is no one “thing” that tells you the entire story, which is why troubleshooting can be incredibly frustrating, yet wonderfully satisfying when you find the root of a problem.  Successfully troubleshooting performance issues in SQL Server requires an understanding of all the data available to aid in your discovery and investigation, understanding where to start, and what information to capture to correlate with other findings.

The Accidental DBA (Day 23 of 30): SQL Server HA/DR Features

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. If you know someone who would benefit from this class, refer them and earn a $50 Amazon gift card – see class pages for details. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

Two of the most important responsibilities for any DBA are protecting the data in a database and keeping that data available.  As such, a DBA may be responsible for creating and testing a disaster recovery plan, and creating and supporting a high availability solution.  Before you create either, you have to know your RPO and RTO, as Paul talked about a couple weeks ago.  Paul also discussed what you need to consider when developing a recovery strategy, and yesterday Jon covered considerations for implementing a high availability solution.

In today’s post, I want to provide some basic information about disaster recovery and high availability solutions used most often.  This overview will give you an idea of what options might be a fit for your database(s), but you’ll want to understand each technology in more detail before you make a final decision.

Backup/Restore

No matter what type of implementation you support, you need a disaster recovery plan.  Your database may not need to be highly available, and you may not have the budget to create a HA solution even if the business wants one.  But you must have a method to recover from a disaster.  Every version, and every edition, of SQL Server supports backup and restore.  A bare bones DR plan requires a restore of the most recent database backups available – this is where backup retention comes into play.  Ideally you have a location to which you can restore.  You may have a server and storage ready to go, 500 miles away, just waiting for you to restore the files.  Or you may have to purchase that server, install it from the ground up, and then restore the backups.  While the plan itself is important, what matters most is that you have a plan.
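
At its simplest, that plan is just backup and restore T-SQL.  A minimal sketch, with made-up database and file names:

-- On the production server: take a full backup to a network share.
BACKUP DATABASE [SalesDB]
TO DISK = N'\\BackupShare\SalesDB\SalesDB_Full.bak'
WITH CHECKSUM, INIT;

-- On the DR server: restore it, moving the files to local paths.
RESTORE DATABASE [SalesDB]
FROM DISK = N'\\BackupShare\SalesDB\SalesDB_Full.bak'
WITH MOVE N'SalesDB' TO N'D:\Data\SalesDB.mdf',
     MOVE N'SalesDB_log' TO N'L:\Log\SalesDB_log.ldf',
     CHECKSUM, RECOVERY;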

Log Shipping

Log shipping exists on a per-user-database level and requires the database recovery model to use either full or bulk-logged recovery (see Paul’s post for a primer on the differences).  Log shipping is easy to understand – it’s backup from one server and restore on another – but the process is automated through jobs.  Log shipping is fairly straightforward to configure and you can use the UI or script it out (prior to SQL Server 2000 there was no UI).  Log shipping is available in all currently supported versions of SQL Server, and all editions.

You can log ship to multiple locations, creating additional redundancy, and you can configure a database for log shipping if it’s the primary database in a database mirroring or availability group configuration.  You can also use log shipping when replication is in use.

With log shipping you can allow limited read-only access on secondary databases for reporting purposes (make sure you understand the licensing impact), and you can take advantage of backup compression to reduce the size of the log backups and therefore decrease the amount of data sent between locations.  Note: backup compression was first available only in SQL Server 2008 Enterprise, but starting in SQL Server 2008 R2 it was available in Standard Edition.

While log shipping is often used for disaster recovery, you can use it as a high availability solution, as long as you can accept some amount of data loss and some amount of downtime.  Alternatively, in a DR scenario, if you implement a longer delay between backup and restore, and data is changed or removed from the primary database – either purposefully or accidentally – you may be able to recover it from the secondary.
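
Under the covers, the log shipping jobs are doing something like the following (a simplified sketch with placeholder names and paths – the wizard or the generated scripts create and schedule the real jobs for you):

-- On the primary, on a schedule: back up the transaction log to a share.
BACKUP LOG [SalesDB]
TO DISK = N'\\LogShipShare\SalesDB\SalesDB_20131021_0900.trn'
WITH CHECKSUM, INIT;

-- On the secondary, after the copy job moves the file: restore each log backup in order.
-- NORECOVERY keeps the database restoring; STANDBY allows read-only access between restores.
RESTORE LOG [SalesDB]
FROM DISK = N'\\LogShipShare\SalesDB\SalesDB_20131021_0900.trn'
WITH STANDBY = N'D:\Data\SalesDB_undo.tuf';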

Failover Cluster Instance

A Failover Cluster Instance (also referred to as FCI or SQL FCI) exists at the instance level and can seem scary to newer DBAs because it requires a Windows Server Failover Cluster (WSFC).  A SQL FCI usually requires more coordination with other teams (e.g. server, storage) than other configurations.  But clustering is not incredibly difficult once you understand the different parts involved.  A Cluster Validation Tool was made available in Windows Server 2008, and you should ensure the supporting hardware successfully passes its configuration tests before you install SQL Server; otherwise you may not be able to get your instance up and running.

SQL FCIs are available in all currently supported versions of SQL Server, and can be used with Standard Edition (2 nodes only), Business Intelligence Edition in SQL Server 2012 (2 nodes only), and Enterprise Edition.  The nodes in the cluster share the same storage, so there is only one copy of the data.  If a failure occurs for a node, SQL Server fails over to another available node.

If you have a two-node WSFC with only one instance of SQL Server, one of the nodes is always unused, basically sitting idle.  Management may view this as a waste of resources, but that second node is insurance: it is there to keep SQL Server available if the first node fails.  You can install a second SQL Server instance and use log shipping or mirroring with snapshots to create a secondary copy of the database for reporting (again, pay attention to licensing costs).  Or, those two instances can both support production databases, making better use of the hardware.  However, be aware of resource utilization when a node fails and both instances run on the same node.

Finally, a SQL FCI can provide intra-data center high availability, but because it uses shared storage, you do have a single point of failure.  A SQL FCI can be used for cross-data center disaster recovery if you use multi-site SQL FCIs in conjunction with storage replication.  This does require a bit more work and configuration, because you have more moving parts, and it can become quite costly.

Database Mirroring

Database mirroring is configured on a per-user-database basis and the database must use the Full recovery model.  Database mirroring was introduced in SQL Server 2005 SP1 and is available in Standard Edition (synchronous only) and Enterprise Edition (synchronous and asynchronous).  A database can be mirrored to only one secondary server, unlike log shipping.

Database mirroring is extremely easy to configure using the UI or scripting.  A third instance of SQL Server, configured as a witness, can detect the availability of the primary and mirror servers.  In synchronous mode with automatic failover, if the primary server becomes unavailable and the witness can still see the mirror, failover will occur automatically if the database is synchronized.

Note that you cannot mirror a database that contains FILESTREAM data, and mirroring is not appropriate if you need multiple databases to failover simultaneously, or if you use cross-database transactions or distributed transactions.  Database mirroring is considered a high availability solution, but it can also be used for disaster recovery, assuming the lag between the primary and mirror sites is not so great that the mirror database is too far behind the primary for RPO to be met.  If you’re running Enterprise Edition, snapshots can be used on the mirror server for point-in-time reporting, but there’s a licensing cost that comes with reading off the mirror server (as opposed to if it’s used only when a failover occurs).
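
To give you a sense of the setup, here’s a stripped-down sketch of the T-SQL side (server names, the port, and the database name are placeholders, and it assumes the mirror database has already been restored WITH NORECOVERY):

-- On each instance: a database mirroring endpoint.
CREATE ENDPOINT [Mirroring]
    STATE = STARTED
    AS TCP (LISTENER_PORT = 5022)
    FOR DATABASE_MIRRORING (ROLE = PARTNER);

-- On the mirror instance: point at the principal.
ALTER DATABASE [SalesDB] SET PARTNER = 'TCP://PrincipalServer.domain.com:5022';

-- On the principal instance: point at the mirror.
ALTER DATABASE [SalesDB] SET PARTNER = 'TCP://MirrorServer.domain.com:5022';

-- SAFETY FULL = synchronous; SAFETY OFF = asynchronous (Enterprise Edition only).
ALTER DATABASE [SalesDB] SET SAFETY FULL;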

Availability Groups

Availability groups (AGs) were introduced in SQL Server 2012 and require Enterprise Edition.  AGs are configured for one or more databases, and if a failover occurs, the databases in a group fail over together.  An AG can have up to four secondary replicas, and up to two of them can be synchronous – three synchronous replicas in total, counting the primary – whereas database mirroring allows only a single mirror.  Failover in an Availability Group can be automatic or manual.  Availability Groups do require a Windows Server Failover Cluster (WSFC), but do not require a SQL FCI.  An AG can be hosted on SQL FCIs, or on standalone servers within the WSFC.

Availability Groups support readable secondary replicas that are kept current with low latency, so you can offload reporting to another server and have it be near real-time.  Availability Groups offer some fantastic functionality, but just as with a SQL FCI, there are many moving parts and the DBA cannot work in a vacuum for this solution; it requires a group effort.  Make friends with the server team, the storage team, the network folks, and the application team.
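
To give you an idea of the moving parts, here’s a stripped-down sketch of creating an AG in T-SQL (server names, endpoint URLs, and the database name are placeholders; it assumes the WSFC is built, the AlwaysOn feature is enabled on both instances, and the database has been restored on the secondary WITH NORECOVERY):

-- On the primary: create the availability group with two synchronous replicas.
CREATE AVAILABILITY GROUP [SalesAG]
FOR DATABASE [SalesDB]
REPLICA ON
    N'SQL01' WITH (
        ENDPOINT_URL = N'TCP://SQL01.domain.com:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY)),
    N'SQL02' WITH (
        ENDPOINT_URL = N'TCP://SQL02.domain.com:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));

-- On the secondary: join the AG and add the restored database to it.
ALTER AVAILABILITY GROUP [SalesAG] JOIN;
ALTER DATABASE [SalesDB] SET HADR AVAILABILITY GROUP = [SalesAG];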

Transactional Replication

Transactional Replication gets a shout out here, even though it is not always considered a high availability solution, as Paul discusses in his post, In defense of transaction replication as an HA technology.  But it can work as a high availability solution provided you can accept its limitations.  For example, there is no easy way to fail back to the primary site…however, I would argue this is true for log shipping as well because log shipping requires you to back up and restore (easy but time consuming).  In addition, with transactional replication you don’t have a byte-for-byte copy of the publisher database, as you do with log shipping, database mirroring or availability groups.  This may be a deal-breaker for some, but it may be quite acceptable for your database(s).

Transactional Replication is available in all currently supported versions and in Standard and Enterprise Editions, and may also be a viable option for you for disaster recovery.  It’s important that you clearly understand what it can do, and what it cannot, before you decide to use it.  Finally, replication in general isn’t for the faint of heart.  It has many moving parts and can be overwhelming for an Accidental DBA.  Joe has a great article on SQL Server Pro that covers how to get started with transactional replication.

Summary

As we’ve seen, there are many options available that a DBA can use to create a highly available solution and/or a system that can be recovered in the event of a disaster.  It all starts with understanding how much data you can lose (RPO) and how long the system can be unavailable (RTO), and you work from there.  Remember that the business needs to provide RPO and RTO to you, and then you create the solution based on that information.  When you present the solution back to the business, or to management, make sure it is a solution that YOU can support.  As an Accidental DBA, whatever technology you choose must be one with which you’re comfortable, because when a problem occurs, you will be the one to respond and that’s not a responsibility to ignore.  For more information on HA and DR solutions I recommend the following:

The Accidental DBA (Day 19 of 30): Tools for On-Going Monitoring

This month the SQLskills team is presenting a series of blog posts aimed at helping Accidental/Junior DBAs ‘keep the SQL Server lights on’. It’s a little taster to let you know what we cover in our Immersion Event for The Accidental/Junior DBA, which we present several times each year. If you know someone who would benefit from this class, refer them and earn a $50 Amazon gift card – see class pages for details. You can find all the other posts in this series at http://www.SQLskills.com/help/AccidentalDBA. Enjoy!

In yesterday’s post I covered the basics of baselines and how to get started.  In addition to setting up baselines, it’s a good idea to get familiar with some of the free tools available to DBAs that help with continued monitoring of a SQL Server environment.

Performance Monitor and PAL

I want to start with Performance Monitor (PerfMon).  I’ve been using PerfMon since I started working with computers and it is still one of my go-to tools.  Beginning in SQL Server 2005, Dynamic Management Views and Functions (DMVs and DMFs) were all the rage, as they exposed so much more information than had been available to DBAs before.  (If you don’t believe me, try troubleshooting a parameter sniffing issue in SQL Server 2000.)  But PerfMon is still a viable option because it provides information about Windows as well as SQL Server.  There are times that it’s valuable to look at that data side-by-side.  PerfMon is on every Windows machine, it’s reliable, and it’s flexible.  It provides numerous configuration options, not to mention all the different counters that you can collect.  You have the ability to tweak it for different servers if needed, or just use the same template every time.  It allows you to generate a comprehensive performance profile of a system for a specified time period, and you can look at performance in real time.

If you’re going to use PerfMon regularly, take some time to get familiar with it.  When viewing live data, I like to use config files to quickly view counters of interest.  If I’ve captured data over a period of time and I want to quickly view and analyze the data, I use PAL.  PAL stands for Performance Analysis of Logs and it’s written and managed by some folks at Microsoft.  You can download PAL from CodePlex, and if you don’t already have it installed, I recommend you do it now.

Ok, once PAL is installed, set up PerfMon to capture some data for you.  If you don’t know which counters to capture, don’t worry.  PAL comes with default templates that you can export and then import into PerfMon and use immediately.  That’s a good start, but to get a better idea of what counters are relevant for your SQL Server solution, plan to read Jonathan’s post on essential PerfMon counters (it goes live this Friday, the 21st).  Once you’ve captured your data, you can then run it through PAL, which will do all the analysis for you and create pretty graphs.  For step-by-step instructions on how to use PAL, and to view some of those lovely graphs, check out this post from Jonathan, Free Tools for the DBA: PAL Tool.  Did you have any plans for this afternoon?  Cancel them; you’ll probably have more fun playing with data.
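
One more note on counters: PerfMon is the right tool for collection, but if you just want to sanity-check a SQL Server counter from a query window, the same SQL Server counters are exposed through sys.dm_os_performance_counters.  A minimal sketch (the counter names are just examples):

-- Spot-check a couple of SQL Server counters from T-SQL.
-- Note: per-second counters here are cumulative, so computing a rate requires two samples.
SELECT object_name, counter_name, instance_name, cntr_value, cntr_type
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Batch Requests/sec', N'Page life expectancy')
ORDER BY object_name, counter_name;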

SQL Trace and Trace Analysis Tools

After PerfMon, my other go-to utility was SQL Trace.  Notice I said “was.”  As much as I love SQL Trace and its GUI, Profiler, they’re deprecated in SQL Server 2012.  I’ve finally finished my mourning period and moved on to Extended Events.  However, many of you are still running SQL Server 2008 R2 and earlier, so I know you’re still using Trace.  How many of you are still doing analysis by pushing the data into a table and then querying it?  Ok, put your hands down, it’s time to change that.  Now you need to download ClearTrace and install it.

ClearTrace is a fantastic, light-weight utility that will parse and normalize trace files.  It uses a database to store the parsed information, then queries it to show aggregated information from one trace file, or a set of files.  The tool is very easy to use – you can sort queries based on reads, CPU, duration, etc.  And because the queries are normalized, if you group by the query text you can see the execution count for the queries.

A second utility, ReadTrace, provides the same functionality as ClearTrace, and more.  It’s part of RML Utilities, a set of tools developed and used by Microsoft.  ReadTrace provides the ability to dig a little deeper into the trace files, and one of the big benefits is that it allows you to compare two trace files.  ReadTrace also stores information in a database, and normalizes the data so you can group by query text, or sort by resource usage.  I recommend starting with ClearTrace because it’s very intuitive to use, but once you’re ready for more powerful analysis, start working with ReadTrace.  Both tools include well-written documentation.

Note: If you’re a newer DBA and haven’t done much with Trace, that’s ok.  Pretend you’ve never heard of it, embrace Extended Events.
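
To make that a little less intimidating, here’s a minimal sketch of an event session in SQL Server 2012 syntax (the session name, duration threshold, and file path are placeholders):

-- Capture statements that take longer than one second, written to an event file.
CREATE EVENT SESSION [LongRunningQueries] ON SERVER
ADD EVENT sqlserver.sql_statement_completed (
    ACTION (sqlserver.sql_text, sqlserver.database_id)
    WHERE duration > 1000000)   -- duration is in microseconds
ADD TARGET package0.event_file (SET filename = N'C:\Temp\LongRunningQueries.xel')
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);
GO
ALTER EVENT SESSION [LongRunningQueries] ON SERVER STATE = START;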

SQLNexus

If you’re already familiar with the tools I’ve mentioned above, and you want to up your game, then the next utility to conquer is SQLNexus.  SQLNexus analyzes data captured by SQLDiag and PSSDiag, utilities shipped with SQL Server that Microsoft Product Support uses when troubleshooting customer issues.  The default templates for SQLDiag and PSSDiag can be customized, by you, to capture any and all information that’s useful and relevant for your environment, and you can then run that data through SQLNexus for your analysis.  It’s pretty slick and can be a significant time-saver, but the start-up time is higher than with the other tools I’ve mentioned.  It’s powerful in that you can use it to quickly capture point-in-time representations of performance, either as a baseline or as a troubleshooting step.  Either way, you’re provided with a comprehensive set of information about the solution – and again, you can customize it as much as you want.

Essential DMVs for Monitoring

In SQL Server 2012 SP1 there are 178 Dynamic Management Views and Functions.  How do you know which ones are the most useful when you’re looking at performance?  Luckily, Glenn has a great set of diagnostic queries to use for monitoring and troubleshooting.  You can find the queries on Glenn’s blog, and he updates them as needed, so make sure you follow his blog or check back regularly to get the latest version.  And even though I rely on Glenn’s scripts, I wanted to call out a few of my own favorite DMVs:

  • sys.dm_os_wait_stats – I want to know what SQL Server is waiting on, when there is a problem and when there isn’t.  If you’re not familiar with wait statistics, read Paul’s post, Wait statistics, or please tell me where it hurts (I still chuckle at that title).
  • sys.dm_exec_requests – When I want to see what’s executing currently, this is where I start.
  • sys.dm_os_waiting_tasks – In addition to the overall waits, I want to know what tasks are waiting right now (and the wait_type).
  • sys.dm_exec_query_stats – I usually join to other DMVs such as sys.dm_exec_sql_text to get additional information, but there’s some great stuff in here including execution count and resource usage (see the example query after this list).
  • sys.dm_exec_query_plan – Very often you just want to see the plan.  This DMF returns cached plans as well as plans for queries that are currently executing.
  • sys.dm_db_stats_properties – I always take a look at statistics in new systems and when there’s a performance issue, initially just to check when they were last updated and what the sample size was.  This DMF lets me do that quickly for a table, or an entire database (only in SQL Server 2008 R2 SP2 and SQL Server 2012 SP1).
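
Here’s the example query mentioned above – a minimal sketch that pulls the top queries by total worker time from sys.dm_exec_query_stats, along with the statement text and cached plan:

-- Top 10 cached queries by total CPU, with statement text and plan.
SELECT TOP (10)
    qs.execution_count,
    qs.total_worker_time,
    qs.total_logical_reads,
    qs.total_elapsed_time,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY qs.total_worker_time DESC;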

Kimberly will dive into a few of her favorite DMVs in tomorrow’s post.

Wrap Up

All of the utilities mentioned in this post are available for free.  But it’s worth mentioning that there are tools you can purchase that provide much of the same functionality and more.  As an Accidental DBA, you may not always have a budget to cover the cost of these products, which is why it’s important to know what’s readily available.  And while the free tools may require more effort on your part, using them to dig into your data and figure out what’s really going on in your system is one of the best ways to learn about SQL Server and how it works.