Day 2 is starting here at PASS Summit in Seattle – watch this space for updates the next hour and a half!
Grant Fritchey is kicking off day 2! I’m a big fan of Grant. He talks about the community and how people should take advantage of what’s available here at Summit. I couldn’t agree more – I wouldn’t be where I am in my career without this conference, nor would I have the friends I do all over the world. Grant hands it over to Denise McInerney, VP of Marketing, and Denise shares her journey within PASS (she’s a 6 year Board Member). Denise continues driving the point about the value of this conference and PASS. She then talks about the volunteers for PASS, and announces the winner of the PASSion Award. This year’s winner is Roberto Fonseca.
Denise is talking about the survey feedback from the past couple years, and session evaluations at PASS. *PLEASE* complete these by the end of next week (if not earlier) – both PASS and the speakers truly value this feedback. If you provide additional feedback there is a Board Q&A tomomrrow at Friday, 2PM. Today is the WIT lunch, and Denise announces that next year’s Summit is November 6 – November 9, 2017. Denise introduce Rimma Nehme, a Group Product Manager/Architect for Azure Cosmos DB and Azure HDInsight at Microsoft. Today is going to uncover CosmosDB. It will be technical! Let’s go!!
Do we need another database?
This is the wrong question to ask. If you look at the rising data problems and needs, most of the production systems today (particularly the ones designed in the 70s and 80s), the modern calls and needs are addressing these problems. 90% of the world’s data was created in the last 2 years alone. The estimated growth in the next 3-5 years is 50x. Another trend is global, and another is data is big. Not just a few TB, but trends of companies processing hundreds of TB to pedabytes. Every 60 seconds 200 million emails are generated. Rimma is throwing out ridiculous numbers about the amount of data being generated. I can’t keep up!
Data is also interconnected. What you do in Seattle can be connected to another location in the world. This the butterfly affect. We are experiencing about 125 exabytes of data (that’s a lot of zeroes). Companies are looking at ways of extracting that data and monetize that information. Another trend is the balance of power continues to shift from structured to unstructured data. About 80% of data originates in unstructured data. Never push the data to computation – push the computation to the data.
When dealing with distributed, you need to deal with a lot of differences. For example, different architectures. In 2010 an engineer inside Microsoft observed this and identified that we need a different architectures to deal with these fundamental differences in data at scale. This is how Project Florence was born, which is the base of what is now CosmosDB. It was one of the exemplary partnerships between Microsoft Research and the Azure Data team.
At the time they were working to address the problem of the data for large scale applications within Microsoft (e.g. XBox). They tried the “earthly databases”, building something on their own, and these options weren’t working. Hence project Florence to meet the internal needs. A basic set of requirements were laid out:
- Turnkey global distribution
- Guaranteed low latency at the 99th percentiles, worldwide
- Guaranteed HA within region and globally
- Guaranteed consistency
- Elastically scale throughput and storage, at any time, on demand, and globally
- Comprehensive SLAs (availability, latency, throughput, consistency)
- Operate at low cost (this is crucial and important! – first customers were Microsoft departments)
- Iterate and query without working about schemas and index management (applications evolve over time and rapidly))
- Provide a variety of data model and API choices
This manifests into three principals that have evolved
- Global distribution from the ground up
- fully resource governed stack
- Schema-agnostic service
It is very hard to build any service (particularly with such requirements).
If it was easy, everyone would do it (via NASA). So this is how CosmosDB was built. This is used internally by Microsoft. It is one of the fastest services in the cloud. It is a ring-0 service, meaning it is available in all regions by default. It is millions of lines of C++ code. It is 7 years in the making, it is truly not a new services. Here is what it looks like (a bit of marketing here).
The foundations of the service for a globally distributed, massively scale-able multi–model database service are
- comprehensive SLA
- five well-defined consistency model
- guaranted low latency at t the99th percentile
- elasticscale out of storage and throughput
Fine grained multi-tenancy. This cannot be an after thought. From left to right, you can take a physical resource like a cluster and dedicate to a single tenant (e.g. customer or database). You can take an entire machine and dedicate. You can go another step and take a machine to homogeneous customers. The final level of granularity is taking that machine and dividing between heterogeneous tenants and providing performance and scalability.
In terms of global distribution, Azure has 42 regions world wife…36 are available, 6 are still being built out. You can span your CosmosDB across all of those regions.
Within a database account you have a database. Within that you have users and permissions. Within that CosmosDB is a container. A container of data with a particular data model. Below that are other user defined code. The database may span multiple clusters and regions and you can scale it in terms of these containers. It is designed to scale throughput and storage INDEPENDENTLY. How is the system designed by the scene (10K foot view)? Within regions there are data centers, with data centers there are stamps, within that there are fault domains, within that there are containers and within that the replicas. Within the replicas are the data. On the database engine this is where the secret sauce comes in – bw-indexes, resource manager, log manager, IO manager, etc. On any cluster will see thousands or hundreds of tenants. Need to make sure that none of the tenants are noisy.
Another tenant that is import is the concept of partitioning. How does CosmosDB solve this? The tenants create containers of data and behind the scenes these are partitions. The partitions are comprised are 4 replicas. This is consistent and reliable. Each one is a smart construct. Out of those partitions, you can create partition sets. These can then span clusters, federations, data centers, regions. You can overlay topologies to implement solutions that span across multiple regions across the planet. You need to make sure that the applications then work really well (particularly when merge or split partitions set). You have the partition which is a data block and then you can build the partition set of various topological.
What are some of the best practices?
- Always want to select a partition key that provides even distribution
- user location aware partition key for access locally
- Select a partition key that can be a transaction scope
- Don’t want to use the timestamp for write-heavy workloadso
The resource model summary : Resources, Resource model, partitioning model
Core capabilities Turnkey global distribution – this is adding regions with a click. Yu can come to an Azure portal, you can see the map of the entire world and pick the regions where you want your data to be. The data is replicated behind the scenes and then its available for access. You’re not dealing with VMs, cores. You can add and remove regions at any time and the application does not need to be re-deployed. The connection between application and database is logical. This is enabled by multi-homing capability API. You can connect to physically to the end point. Another capability is that you can associate priorities with each of the regions. If there is an outage or failover in a region, the failover will occur in the order of priority, and that can be changed at any time. Something added for customers is to simulate a regional outage (but don’t go crazy with this says Rimma!). You can test HA of the entire application stack.
Another capability is being able to provide geo-fencing. If you come from any other part of the world there can be regulations where data has to present in particular regions, so if data needs to stay withing a location for requirements, that capability is required.
How does AlwaysOn work? By virtue of partitioning have multiple locations. One replica goes down, the application will be unaffected. If partition goes down, the application will go t partition in another region. If an entire region goes down, the application will go to another region. The data is always available.
Another area of interest is active writers and active readers in any region. Right now turnkey provided at database level, but looking to push this down to the partition key level (a true Active Active topology). Online backups are available, they are stored in Azure blob storage. The key capability is that it’s intended for “oops I deleted my data”, it”s’ not for a data center going down (that’s hwy you have multiple regions).
Second capability is elastic scale out. As data size, scale throughput independently. Could start out with small amount of data and keep adding more and more. Back end will seamlessly scale. Transnational data tends to be small, web and content data is medium sized, and social data/machine generated data is much larger. As data size grows or throughput grows, scale occurs and this happens seamlessly behind the scenes. This is done with SSDs behind the scenes.
Resource governance is the next component. As operations occur, they occur RUs. You provision RUs that you need (how many transactions/sec to you need?). All replicas (just a partitioning of data) get a certain budget of RUs. If you exceed, you’ll get rate limited. At any time can increase provision throughput. Can then support more transactions/sec. Can also decrease at any time.
RU is a read-based currency partitioned at granularity of a 1 second. It is normalized across DB operations. Cost the operations via machine learning pipelines that cost queries (e.g. scans, lookups, etc.). Have run machine learning on models on telemetry data, and then calibrate the cost model accordingly for RUs. ((DATA DRIVEN). Back to partitioning model: at any time can change throughput and behind the scenes you can specify the throughput (RUs) you want. Behind the scenes the re-partitioning will occur, and each one will get more RUs to provide the throughput asked for. This is where splitting/merging partitions matters, but it happens behind the scenes and you don’t have to worry about it.
What about when you add regions? You want to add more RUs so you don”t starve existing regions. Those RUs are spread across all partitions and regions. Rimma shows how one customer elastically provisioned resources during the holiday season to size up to meet demand, and then size down when no-longer needed. In a 3 day period, Rimma shows a graph of RUs. At the top end there are 3 trillion RUs. (THREE TRILLION IN THREE DAYS PEOPLE)
Various nodes have a various number of RUs serving different workloads, and you can look at the different tenants and partitions in there. Multi-tenancy and global distribution at that level is incredible.
Another tenant: Guaranteed low-latency at 99%. This was a core requirement because time is money. From the business part of view, twice as much revenue lost to slowdowns. So the system is designed. At 99th percentile, less than 10ms for the reads measured at 1KB document (which is =80-90% of workload). At average, will observe lower latency (less than 2 ms for reads and 6ms for writes. How is this accomplished? Reads and writes from local region and SSDs done. The database designed to be write optimized and using latch-free database engine. All data is indexed by default. This is a fundamental difference from relational databases, here we have automatically indexed SSD storage. Customer example: application in California and data in far east. Added another region and then latency dropped. Over black Friday/cyber Monday latencies less than 10ms for reads and 15ms for writes. Azure Cosmos DB allows you to be the speed of light. If have a database in Chicago and have friends in Paris who want to read your data. If this was a central database they would request to read the data from Paris and getting that data from Chicago to Paris takes 80-100 ms. With CosmosDB you get it in less than 10ms because of those regional locations.
The last here is the consistency model in CosmosDB. How does it go? When you are dealing with any distributed system, whether databases or other sytem, typically you are faced with fundamental trade off of latency, availability, consistency and throughput. If centralized database all requests against primary copy. With global distribution, get geo–replication get HA and low latency. But what happens if one replica can’t communicate with others and updates are being made? What kind of consistency guarantees are made? This can be a problem! Do you need to wait for data to be synchronized before you serve it? Do you want strong consistency or eventual consistency? Do you want the red pill or blue pill? With a relational database you get a perfect consistency. They won’t serve the data until quorum is an agreement. The price there is latency. On the other hand, the whole movement of no consistency guarantees means low latency. But real-world consistency is not a binary choice as just described.
What about something in between? The research literature talks about the wild west of consistency models (not one or the other). A parper recommended is Replicated Data Consistency Explained Through Baseball by Doug terry, a Microsoft Research individual. Uses real-world examples from baseball. Depending on who you are in the game, you might get value out of different consistency models. The engineers asked the question: can we pick out an intermedicate consistency model and is easy to configure, programmable, presents clear trade-offs? Most real-life applications don”t fall into those two extremes. Bounded-stateless, session (monotonic reads and writes local to geo location), and consistent prefix (when updates applied, the prefixes are guaranteed to be consistent).
How is this accomplished? use TLA+ specifications to specify consistency models. If you don”t know, check out video by Leslie Lampert who is an important figure in how the system was designed. Leslie was a Touring award winner for Paxis (sp?) algorithm and founding father of what is used in the stack.
Operationalized the five different consistency models. Using telemetry to see then how those models are used. Only 3% use strong consistency, 4% use eventual, and 93% are using the three models in between. Depending on consistency model specified, might need more computational work, which requires RU. Have to make trade offs accordingly, and you can monetize this and decide what’s’ best.
Comprehensive SLAs…high availability SLAs are not enough. Microsoft is in the service business, not just the software business. Typically services don’t give any SLA, or give for HA. When tried to approach this problem, asked “What are all the things that developers and customers really care about?” They care about performance, throughput, consistency, availability and latency. SO this is the first service in the market that has published comprehensive SLAs that are backed up by Microsoft financially. Can now see that guaranteed if come in and run workload, will get guaranteed performance, throughput, consistency, availability and latency . Availability tracked at the tenant and partition level in 5 minute granularity. Customers can see their run time statistics against their SLA.
Service is also multi-model. Wanted to enable native integration with different data models. Behind scenes just ARS model (atom-record-sequence). All models get translated to ARS model. Very easy for the service to then on-board other data models now and in the future. If want document and graph, do not need two copies of data, it can be handled by the same set of data. This is a powerful combination — to look at data through different lenses.
Why schema agnostic approach? Modern applications that are built in the cloud are not static. Can start with one schema, add more tables/new columns…need a robust approach to handle these scenarios. The object model is schema-free. The data gets stored as-is. How do you query this data? Behind the scenes the data is ARS. At the global scale, dealing with indexes, schema, etc. is a nonstarter. In CosmosDB there is schema agnostic indexing. The indexes are a union of all document trees, and can then consolidate into one and only keep unique values. All of this structure info is then normalized. It is an inverted index which gives optimal write performance. Can identify where documents located and then serve up. The index overhead in practice is pretty small. There is a published paper, Schema-Agnostic Indexing with Azure Cosmos DB, go read it!
The Bw-tree in Azure Cosmos DB is highly concurrent, non-blocking, optimized for SSDs. Avoids random writes. There are three layers, the b-tree, cache and log structured store (see paper). Rhema is going faster now. I’m not kidding. Bw-tree is implemented as delta updates. There is a mapping table to the updates and updates stored as deltas (sounds like in-memory index structure?).
Rimma shows architecture of Query Processing, there are different semantics but have the same underlying components (compiler, optimizer, etc.). The engine is very flexible and expect that in the future will host other run-times. The multi-API approach allows native support for multiple APIs. If want to store data in cloud but not re-write your app, you can. There are more APIs coming in the future. What does this strategy enable?
- No recompilation needed
- Better SLAs, lower TCO
- leverage the existing OSS tool-chain and ecosystem and development IT expertise
- Life and shift from on-premises to cloud
- No vendor lock-in
- Symmetric on-premises and cloud database development
All happy databases are alike, each unhappy database is unhappy in its own way (Kyle Kingsbury via Leo Tolstoy).
How run service? Weekly deployments worldwide. 16 hours of stress testing every day. It’s like magic factory across the globe. Also run variant and invariant checks. Bots that are fixing nodes that might have issues. Consistency checking and reporting going over the data continually.
In conclusion…wanted to put herself in our shoes. It’s a lot of information to digest…particularly if not invested in this technology. Why should you care? Rimma brings up a quote from Darwin:
It is not the strongest species that survive, nor the most intelligent, but the ones most responsive to change.
Can try CosmosDB for free, no need to credit card info, etc. Childhood dream of going to Cosmos (space) will be fulfilled.
Key points to remember:
- global distribution, horizontal partitioning and fine grained multi-tenancy cannot be an afterthought
- schema agnostic database engine design is crucial for a globally distributed database
- intermediate consistency models are extremely useful
- globally distributed database must provide comprehensive SLAs beyond just HA
This is a hidden gem, but the bigger message remember the entire NoSQL movement is a counter-culture movement. But the question is how would we build databases if we started today? Without the legacy that we know, would we look at things differently? Would we focus on limitations or focus on the bigger picture? Sometimes it is ok to break the rules and try to different things. Life is too short to build something that nobody wants. If we focus on real pain points, not the “nice to have things”, but really look at the issues and abandon our constraints and self–imposed limits, we can create a modern SQL. Rimma ends by thanking Dharma Shukla and entire CosmosDB team.