Business Intelligence & Data Mining

    Big Data & Hadoop

    Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend toward larger data sets stems from the additional information that can be derived from analyzing a single large set of related data, compared with separate smaller sets holding the same total amount of data. Such analysis allows correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."

    At multiple terabytes in size, the text and images of Wikipedia are a classic example of big data.

    Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers". What is considered "big data" varies depending on the capabilities of the organization managing the data set, and on the capabilities of the applications traditionally used to process and analyze it.

    HDInsight is an Apache Hadoop implementation that runs in globally distributed Microsoft datacenters. It’s a service that lets you build a Hadoop cluster in minutes when you need it, and tear it down after you run your MapReduce jobs. As Windows Azure Insiders, we believe there are a couple of key value propositions of HDInsight. The first is that it’s 100 percent Apache-based, not a special Microsoft version, meaning that as Hadoop evolves, Microsoft will embrace the newer versions. Moreover, Microsoft is a major contributor to the Apache Hadoop project and has contributed a great deal of its query optimization know-how to Hive, the query tooling.

    The second compelling aspect of HDInsight is that it works seamlessly with Windows Azure Blobs, which are mechanisms for storing large amounts of unstructured data that can be accessed from anywhere in the world via HTTP or HTTPS. HDInsight also makes it possible to persist the metadata of table definitions in SQL Server, so that when the cluster is shut down, you don’t have to re-create your data models from scratch.
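
    To make the MapReduce jobs mentioned above more concrete, here is a minimal sketch of the map, shuffle, and reduce pattern as an in-memory word count in plain Python. It does not use Hadoop or HDInsight, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical names chosen for illustration only. On a real cluster the framework would distribute these phases across many servers.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key (done by the framework on a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    # Illustrative input only; a real job would read from distributed storage.
    sample = ["big data is big", "data sets grow"]
    print(word_count(sample))
```

    Keeping the three phases as separate functions mirrors how the work is divided: the map and reduce steps hold the job-specific logic, while the grouping step in between is generic.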

    Microsoft SQL Server is a complete set of enterprise-ready technologies and tools that help people derive the most value from information at the lowest total cost of ownership. Microsoft virtualization technologies, which include Hyper-V in Windows Server with SP1 and System Center, give you new options for deploying SQL Server flexibly. You can also reduce hardware and energy costs.