Databricks provides a Unified Analytics Platform that lets data science teams collaborate with data engineering and lines of business to build data products. At its Connect() developer event today, Microsoft announced the new Azure Databricks service for the highest-performance streaming analytics projects. Microsoft worked with the founders of Apache Spark on this new service. Azure Databricks is an Apache Spark-based analytics platform that delivers one-click setup, streamlined workflows and an interactive workspace. Azure Databricks also integrates natively with Azure SQL Data Warehouse, Azure Storage, Azure Cosmos DB, Azure Active Directory and Power BI. Read about these integrations below.
- Diversity of VM types: Customers can use all existing VMs including F-series for machine learning scenarios, M-series for massive memory scenarios, D-series for general purpose, etc.
- Security and Privacy: In Azure, the customer retains ownership and control of data. We have built Azure Databricks to adhere to these standards. We aim for Azure Databricks to hold all the compliance certifications that the rest of Azure holds.
- Flexibility in network topology: Customers have a diversity of network infrastructure needs. Azure Databricks supports deployments in customer VNETs, which can control which sources and sinks can be accessed and how they are accessed.
- Azure Storage and Azure Data Lake integration: These storage services are exposed to Databricks users via DBFS to provide caching and optimized analysis over existing data.
- Azure Power BI: Users can connect Power BI directly to their Databricks clusters using JDBC in order to query data interactively at massive scale using familiar tools.
- Azure Active Directory (AAD): AAD provides access control for resources and is already in use in most enterprises. Azure Databricks workspaces deploy in customer subscriptions, so AAD can naturally be used to control access to sources, results, and jobs.
- Azure SQL Data Warehouse, Azure SQL DB, and Azure CosmosDB: Azure Databricks easily and efficiently uploads results into these services for further analysis and real-time serving, making it simple to build end-to-end data architectures on Azure.
- Internally, we use Azure Container Services to run the Azure Databricks control-plane and data-planes via containers.
- Accelerated Networking provides the fastest virtualized network infrastructure in the cloud. Azure Databricks utilizes this to further improve Spark performance.
- The latest generation of Azure hardware (Dv3 VMs) comes with NVMe SSDs capable of blazing 100 µs latency on I/O. These make Databricks I/O performance even better.
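To make the Azure SQL DB bullet above more concrete, here is a minimal sketch of how results might be pushed from a Spark job into an Azure SQL Database table. It assumes Spark's generic JDBC data source and the Microsoft SQL Server JDBC driver, which may not be the native integration the announcement refers to. The helper name `azure_sql_jdbc_options` and the server, database, table, and user names are hypothetical.

```python
from typing import Dict


def azure_sql_jdbc_options(server: str, database: str, table: str,
                           user: str, password: str) -> Dict[str, str]:
    """Build options for Spark's generic JDBC data source (hypothetical helper).

    Assumes the target is an Azure SQL Database reachable on the default
    port 1433 under the *.database.windows.net host name.
    """
    url = (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"database={database};encrypt=true;loginTimeout=30"
    )
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        # Driver class shipped with the Microsoft JDBC Driver for SQL Server.
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }


if __name__ == "__main__":
    # All names below are placeholders for illustration only.
    opts = azure_sql_jdbc_options(
        "myserver", "salesdb", "dbo.daily_totals", "etl_user", "<secret>"
    )
    print(opts["url"])
    # On a cluster, with a DataFrame named `results`, the write could look like:
    # results.write.format("jdbc").options(**opts).mode("append").save()
```

Keeping the connection options in a plain dictionary keeps the sketch runnable without a cluster. In practice the password would come from a secret store rather than from source code.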
On a related note, Microsoft today announced that it is joining the MariaDB Foundation as a platinum member. It will soon release a preview of Azure Database for MariaDB, a fully managed MariaDB service in the cloud. Microsoft also announced Apache Cassandra API support for Cosmos DB, which will offer Cassandra as a service with turnkey global distribution, multiple consistency levels and industry-leading SLAs.