Big Data Architecture

Google was among the first to develop a 'Big Data Architecture' to serve millions of users with their specific queries. Its search engine gathered and organized the information on the web with the goal of serving relevant results, and it further prioritized online advertisements on behalf of clients. To accomplish all this, it created web-crawling agents that follow links and copy the content of every web page they reach. It then sorts, or indexes, that content so that users can search it effectively.
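
The indexing step can be illustrated with a small inverted index, which maps each word to the pages that contain it. This is only a minimal sketch of the general idea, not Google's actual implementation; the page URLs, the sample text and the function names (tokenize, build_index, search) are hypothetical and chosen for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, content in pages.items():
        for word in tokenize(content):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


# Hypothetical crawled pages (URL -> text); a real crawler would fetch these.
PAGES = {
    "http://example.com/a": "Big data needs distributed storage",
    "http://example.com/b": "Distributed processing of big data",
    "http://example.com/c": "Cooking with fresh data is not a thing",
}

INDEX = build_index(PAGES)
print(search(INDEX, "big data"))
```

Because the index is built once and queried many times, a search only has to intersect a few sets instead of scanning every page.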

Because Internet data is growing exponentially, Google developed a scale-out architecture that can increase its storage capacity linearly by adding computers to its network. The information is distributed over a large number of machines in the cluster.
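
One simple way to spread records over the machines of a cluster is to hash each record's key and use the hash to pick a machine. The sketch below assumes hypothetical node names and keys and uses plain modulo hashing, which is an illustrative choice and not a description of how Google places data.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick a node by hashing the key; the same key always maps to the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Return a mapping of node name -> list of keys stored on that node."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement


# Hypothetical record keys and cluster members.
KEYS = [f"page-{i}" for i in range(1000)]
SMALL_CLUSTER = ["node-1", "node-2", "node-3"]
LARGE_CLUSTER = SMALL_CLUSTER + ["node-4", "node-5", "node-6"]

small = distribute(KEYS, SMALL_CLUSTER)
large = distribute(KEYS, LARGE_CLUSTER)

# Adding machines lowers the number of keys each one has to hold.
for name, placement in (("3 nodes", small), ("6 nodes", large)):
    print(name, {node: len(keys) for node, keys in placement.items()})
```

Modulo hashing moves most keys to a different node whenever the cluster size changes, so production systems commonly prefer schemes such as consistent hashing that relocate far fewer keys.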

  • Cloud Bigtable: A NoSQL database that stores a variety of objects in a key-value architecture.
  • MapReduce: A parallel processing architecture developed by Google in which thousands of computers work on a large data set in parallel, each processing one chunk of the data, so that the overall job produces results quickly.
  • Google Cloud Dataflow: A batch and stream data processing service intended for batch computation, ETL (extract, transform, load) and streaming analytics. It supports fast, simplified pipeline development via expressive Java and Python APIs in the Apache Beam SDK. It also provides a rich set of windowing and session analysis primitives as well as an ecosystem of source and sink connectors. This design lets users reuse code across streaming and batch pipelines.
  • Google Cloud Dataproc: A managed service for Apache Hadoop and Spark that offers automated cluster management, per-second billing and clusters that scale quickly to their new sizes.
  • Cloud Datalab: A tool based on Jupyter, an open-source data science platform that allows users to create and share interactive documents that contain visualizations, text and live code.
  • Cloud Dataprep: A service operated by Trifacta that allows users to clean up their structured and unstructured data in preparation for analysis.
  • Data Studio: Turns big data insights into charts and dashboards that business executives and typical office workers can understand. It is also used to create shareable visualizations and reports that help workforces make data-driven decisions.
  • Serverless computing: Automates server provisioning and configuration, freeing developers to create and update apps while worrying less about setting up and managing the servers required to run them.
  • BigQuery: An enterprise big data warehouse for low-cost business analytics on petabyte-scale datasets.
  • Cloud IoT Core: Enables businesses to securely link their IoT devices to Google's analytics and AI services. Users can ingest all IoT data and connect to state-of-the-art analytics services including Google Cloud Pub/Sub, Google Cloud Dataflow, Google Cloud Bigtable, Google BigQuery, and Google Cloud Machine Learning Engine to gain actionable insights.
  • Cloud Pub/Sub: Ingests IoT event streams, enabling event-driven computing and stream analytics.
  • Cloud Platform AI Portfolio: Developers seeking to build AI-enabled apps have a range of Google APIs at their disposal.
  • Google Natural Language API: Used to read between the lines, revealing a user's intent from a text chat or determining the sentiment surrounding a brand or product from social media posts.
  • Cloud Video Intelligence: Turns videos into searchable content by using a library of 20,000 labels. It automatically analyzes video and can identify objects and when they appear.
  • Google Cloud Speech API: Turns speech into text and helps turn voice commands issued in over 110 languages and variants into action. It breaks down language barriers with real-time, neural network-based translation services.
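
The MapReduce model mentioned in the list above can be sketched on a single machine as a word count. This is a minimal illustration under stated assumptions: the worker threads stand in for the separate computers of a real cluster, and the function names and sample chunks are hypothetical rather than part of any Google API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map phase: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle phase: group all emitted values by their key."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce phase: sum the values collected for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(chunks, workers=4):
    """Run the map phase in parallel, then shuffle and reduce the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_counts(shuffle(mapped))


# Hypothetical input, already split into chunks (one per worker).
CHUNKS = ["big data big clusters", "data moves to clusters", "big results"]
COUNTS = word_count(CHUNKS)
print(COUNTS["big"], COUNTS["data"], COUNTS["clusters"])  # prints: 3 2 2
```

Each chunk is mapped independently, so adding workers speeds up the map phase without changing the code of the map or reduce functions.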

The Google File System (GFS) was the precursor of HDFS (the Hadoop Distributed File System), and Google's work also influenced the columnar database system HBase, the querying tool Hive, Storm, and the Y-shaped architecture.

A 'Big Data Architecture' should be secure, cost-effective, resilient, and adaptive to new needs and environments.

Google Cloud Platform (GCP): A range of public cloud hosting services for computing, storage, networking, big data, machine learning and the Internet of Things (IoT), as well as cloud management, security, developer tools and application development, all running on Google hardware. The Google Cloud Platform services used by software developers, cloud administrators and other enterprise IT professionals include:

  • Google Compute Engine: An infrastructure-as-a-service (IaaS) offering that provides users with virtual machine instances for workload hosting.
  • Google App Engine: A platform-as-a-service (PaaS) offering that gives software developers access to Google's scalable hosting. Developers use a software development kit (SDK) to develop software products that run on App Engine.
  • Google Cloud Storage: A cloud storage platform designed to store large, unstructured data sets. Google also offers database storage options, including Cloud Datastore for NoSQL nonrelational storage, Cloud SQL for fully relational MySQL storage and Google's native Cloud Bigtable database.
  • Google Container Engine: A management and orchestration engine, based on Kubernetes, for Docker containers that runs within Google's public cloud.