Friday, April 10, 2015

Book review - Joe Celko's Complete Guide to NoSQL

I've finished reading "Joe Celko's Complete Guide to NoSQL" and here are my impressions…

Joe Celko is a well-known name in the relational world and I've wanted to read one of his books for a long time, so I thought: who better than someone who knows relational databases inside out to talk about NoSQL? The view would be more critical and realistic, not some daydream claiming relational databases will be dead by 2020 or whenever.

Unfortunately the book falls short, especially if your expectations are high, as mine were. The author loses his way in some chapters, with content that doesn't add to the subject at hand and with examples and code snippets that are beside the point, while other chapters are left quite superficial.

And if you want a book with small examples of MongoDB, Hadoop, Cassandra, Neo4J, etc. to draw parallels, this is NOT your book. In general the author only mentions the products, without showing usage examples (e.g., programming or administration) or outlining the pros and cons of each one, which makes it clear that the book's goal is to cover the concepts behind the different types of databases.

Strong points:

  • The author covers other database models that go beyond the four traditional NoSQL "categories" (key-value store, column store, document store and graph).
  • With his vast experience, Celko knows a good part of the history of several databases, which really helps us understand how the technology evolved.
  • Many of the chapter references are fantastic, offering a good guide for digging deeper into the subject.
  • For the topics the author seems most at home with, he does a good job of illustrating how the problem is solved in the relational world and comparing it with the other approach.
    • Some sensitive issues, such as concurrency, isolation levels and their effects, are covered in excellent detail.

Weak points:

  • It's common to stop in the middle of a chapter and ask yourself… why is this here? It's well out of context and could have been just a reference for further reading.
  • It's also clear which chapters the author has more experience with and spends more time on, which makes the book uneven.
    • E.g.: the chapter "Hierarchical and network databases" is more detailed and extensive than the one on map-reduce.
  • It's not a book that shows substantial NoSQL code examples; instead it has snippets showing how the problem would be solved with traditional SQL.
    • A small curiosity… the word JSON appears once in the book. The word MongoDB shows up in only 4 different paragraphs.
  • You probably expect more detail on columnar databases or map-reduce than a chapter on "Biometrics, Fingerprints, and Specialized Databases".

So I'm torn between two and three stars, and I'll settle on 2.5 stars (out of 5!). Some chapters are very good, while others aren't worth the time, since they're shallow even as an overview.

Here is the link to the book on Amazon: http://www.amazon.com/Celkos-Complete-Guide-NoSQL-Non-Relational/dp/0124071929

I confess I have other Joe Celko books on my shelf (or on my Kindle) that I haven't read yet and, even though this one disappointed me, I will certainly invest the time to read the others, since I've had great recommendations for them.

Cheers

Luciano Caixeta Moreira - {Luti}
luciano.moreira@srnimbus.com.br
www.twitter.com/luticm
www.srnimbus.com.br

Thursday, April 9, 2015

Scheduled: My first Oracle training

In June 2015 I'm going to take my first Oracle training and, of course, being a database geek, I don't even need to say how excited I am about it. So I decided to write a bit about it...

I've wanted to get to know Oracle for a long time, because today I know next to nothing about the product, which is almost a cardinal sin for a professional who studies every day to become a better data architect.

I strongly believe that knowing other databases, understanding their strengths, limitations and architectural details, keeps you from becoming a narrow-minded zealot and lets you propose elegant solutions to complex problems... Besides being a lot of fun!
For example: I'm running a benchmark of DB2's columnar storage and discussing the use of SQL Server ColumnStore at a company with the Nimbus consultants. By comparing the different implementations, we already have a good idea of potential difficulties and how to overcome them.

And if I keep adding knowledge of Oracle, NoSQL and other data storage and processing models, the problems I'll get the chance to face in the future will certainly be even more challenging.

Remember, you're hearing this from a professional who has had SQL Server around for more than 15 years and who, even after spending the last 2 years dedicating more time to DB2, considers himself a better SQL Server consultant and instructor today.

So, am I going to take Oracle's official training courses? No way! Right now it's pure prejudice on my part, built over years of getting to know different SQL Server training offerings, so I'm putting all my chips on Nerv. I see a lot of similarity between the training models of Nimbus and Nerv, the company of Oracle ACE Ricardo Portilho (http://nervinformatica.com.br/instrutores.php), so the choice was a logical one.

I'm going to start right away with an advanced course, Oracle Performance Diagnostic and Tuning (http://nervinformatica.com.br/opdt.php), because I'll be able to draw many parallels with other DBMSs and potentially ask interesting questions that would be completely out of context in a beginners' class.

After the course I'll record my impressions here! And there I'll try to keep my curiosity under control, otherwise I'm sure Portilho will throw me out of the classroom and I'll never take another training course... :-)

Cheers

Luciano Caixeta Moreira - {Luti}
luciano.moreira@srnimbus.com.br
www.twitter.com/luticm
www.srnimbus.com.br

Wednesday, April 8, 2015

SQL Server Além do Conceito – Blog Post Collection

A collection of posts from Brazilian SQL Server blogs has been released, with the goal of compiling a few articles chosen by each author.
The idea came from a group that calls itself "SQL Friends" and aims to help people interested in SQL Server by centralizing some very interesting content in a single place.

And obviously the book is free... go download it!

** Nogare kindly provided space on his site to publish the book, which can also be accessed directly from Dropbox: https://www.dropbox.com/s/15bk2vh2cjdrpu5/SQL%20Server%20Al%C3%A9m%20do%20Conceito%20-%20Blog%20Post%20Collection%20-%20Original.pdf?dl=0



I hope you enjoy it, happy reading.

Cheers

Luciano Caixeta Moreira - {Luti}
luciano.moreira@srnimbus.com.br
www.twitter.com/luticm
www.srnimbus.com.br

Tuesday, April 7, 2015

[Fun] Exam objectives... The longest list in the history of computing.

I stumbled upon this just now and couldn't help laughing out loud...

Exame C2030-136: Foundations of IBM Big Data & Analytics Architecture V1 (http://www-03.ibm.com/certify/tests/objC2030-136.shtml)

If you go to the "Objectives" tab, you'll find a small list, which I'm pasting below for the record... I think it's the complete study material. LOL

I might as well throw this into a Hadoop cluster and implement some map-reduce jobs... BIG DATA FOR THE WIN!!! :-)
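Just for fun, a minimal sketch of the kind of word count I'd run over that list, with plain Python standing in for the map and reduce phases (no actual Hadoop cluster involved, and the objectives.txt file name is made up):

# Toy "map-reduce" word count over the exam objectives text.
# Plain Python stands in for the map and reduce phases; objectives.txt is hypothetical.
from collections import Counter
import re

def map_phase(line):
    # Emit (word, 1) pairs, like a Hadoop mapper would.
    return [(word, 1) for word in re.findall(r"[a-z&]+", line.lower())]

def reduce_phase(pairs):
    # Sum the counts per word, like a reducer would.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

with open("objectives.txt", encoding="utf-8") as f:
    pairs = [pair for line in f for pair in map_phase(line)]

for word, total in reduce_phase(pairs).most_common(10):
    print(word, total)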

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Section 1: Big Data & Analytics Benefits and Concepts

Explain Volume, Velocity, Variety and Veracity in relation to BD&A.
With emphasis on the following:

Volume (data at scale) is about rising volumes of data in all of your systems - which presents a challenge for both scaling those systems and also the integration points among them.

Experts predict that the volume of data in the world will grow to 25 Zettabytes in 2020.

Big Data & Analytics shifts the scale from analyzing subsets of data to analyzing all the data available.

Velocity (data in motion) is about ingesting and analyzing data in motion, opening up the opportunity to compete on speed to insight and speed to act.

Big Data & Analytics shifts from analyzing data after it has landed in a warehouse or mart to analyzing data in motion as it is generated, in real time.

Variety (data in many forms) is about managing many types of data, and understanding and analyzing them in their native form.

80% of all the data created daily is unstructured - videos, images, emails, and social media

Big Data & Analytics shifts from cleansing data before analysis to analyzing the information as is and cleansing only when needed.

Veracity (trustworthiness of data): as the complexity of big data rises, it becomes harder to establish veracity, which is essential for confident decisions.

According to the IBM GTO study in 2012, by 2015, 80% of all available data will be uncertain and rising uncertainty = declining confidence.

Big Data & Analytics shifts from starting with a hypothesis and testing that against data to exploring all the data to identify correlations.

Define analytics and the various types
With emphasis on the following:

To solve a specific business problem, organizations will need to deploy the right type of analytics to suit each distinct situation.

Descriptive Analytics is the retrospective analysis that provides a rearview mirror view on the business through reporting on what happened and what is currently happening. Business Intelligence (BI) falls under this category.

Predictive Analytics (PA) is the forward-looking analysis that provides an organization with future-looking insights into the business by predicting what is likely to happen and why it's likely to happen.

Prescriptive Analytics not only anticipates what will happen and when it will happen, but also why it will happen. Further, prescriptive analytics suggests decision options on how to take advantage of a future opportunity or mitigate a future risk and shows the implication of each decision option. Prescriptive analytics can continually take in new data to re-predict and re-prescribe, thus automatically improving prediction accuracy and prescribing better decision options. Prescriptive analytics ingests hybrid data, a combination of structured (numbers, categories) and unstructured data (videos, images, sounds, texts), and business rules to predict what lies ahead and to prescribe how to take advantage of this predicted future without compromising other priorities.

Cognitive Analytics allows organizations to determine the best course of action by using cognitive computing systems that learn and interact naturally with people to extend what either humans or machines could do on their own. They help human experts make better decisions by penetrating the complexity of Big Data.

Describe the role of machine learning in BD&A - Extracting meaningful information from large data sets and processing these large datasets in a reasonable amount of time is challenging. Traditionally, data analysis has been dominated by trial and error, and this approach becomes impossible when datasets are large and heterogeneous. Machine learning will enable cognitive systems to learn, reason and engage with us in a more natural and personalized way. These systems will get smarter and more customized through interactions with data, devices and people. They will help us take on what may have been seen as unsolvable problems by using all the information that surrounds us and bringing the right insight or suggestion to our fingertips right when it's most needed. Machine learning offers a solution to this problem by emphasizing real-time, highly scalable predictive analytics using fast and efficient algorithms for real-time processing of data. Examples of business use cases that benefit from machine learning techniques include churn prevention, customer segmentation, fraud detection and product recommendations.

The difference between PA and BI: PA provides more forward-looking answers and recommendations to questions that cannot be addressed at all by BI. PA + BI delivers significantly higher returns than traditional BI implementations that are not predictive in nature.

Describe the value of analytics to support business decisions
With emphasis on the following:

Describe the role of interactive analysis and reporting

Business analytics enables organizations to turn information into better outcomes. To achieve better outcomes, decision makers need to make smarter decisions, based on having answers to the following: How are we doing? Why is the trend occurring? What should we be doing? Analytics empowers users at every level of the organization, in every role, to make better decisions. It comes down to a desire to make more business decisions based on actual facts rather than gut instinct. Business users need tools that allow them to sort data any way they like to derive additional insights.

-Putting sophisticated yet simple-to-use interactive analysis and reporting tools into the hands of every business user drives creativity and inspires a culture of evidence-based decision making, allowing users to derive additional insights with minimal reliance on IT.

-Self-service predictive analytics puts the power of predictive modeling in the hands of business users. Using predictive models, users identify patterns based on what has happened in the past, and use them to predict what is likely to happen in the future. For example, you can use a model to predict which customers are least likely to churn, or most likely to respond to a particular offer, based on characteristics such as income, age, and the organizations and memberships they subscribe to. The resulting predictions can be used to generate lists of target customers or cases of interest, as input for strategic planning, or can be integrated with rules in the context of a predictive application.
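As a rough illustration of the kind of model described above (not any specific vendor tooling), a minimal churn-scoring sketch in Python with scikit-learn might look like this; the churn_history.csv file and its columns are invented for the example:

# Minimal churn-prediction sketch with scikit-learn (illustrative only).
# churn_history.csv and its columns (income, age, memberships, churned) are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("churn_history.csv")
X = data[["income", "age", "memberships"]]
y = data["churned"]  # 1 = customer left, 0 = customer stayed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score every customer; the highest scores feed a target list or a rule engine.
data["churn_score"] = model.predict_proba(X)[:, 1]
print(model.score(X_test, y_test))
print(data.sort_values("churn_score", ascending=False).head())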

Explain how Big Data & Analytics are interlocked
With emphasis on the following:

Data: Data is both unstructured and structured. In general, data does not need to be processed by Extract, Transform, and Load (ETL) to be useful.

Variety: Data comes from multiple sources, across different domains. Care must be taken to avoid aligning data with specific applications early in the workflow, so the data can be used in other applications.

Association: Correlation criteria range from simple to complex. Solutions will employ

Provisioning: Resources are provisioned based on consumer demand. Expect IT to be responsible for allocation and initial data staging.
- Benefits:
- Reduced IT cycle time and management cost.
- Extended range of sources to meet demand.
- Storage allocated and expanded to meet demand.
- Processing capacity added and removed to meet demand.
- Personas: IT and line of business users are jointly responsible for finding and initially preparing data.

Governance: Newly provisioned data must adhere to corporate and data governance policies. Mechanisms must be created to share policy decisions and to track compliance.

Analytics: Services and tools that augment data sources define the analytics that add value and understanding to new data.

Schema: Data is generally organized in columns so it is easily consumed by analytic tools. Unstructured data is also used, but is restricted to search and to tools that use MPP frameworks like Hadoop. Data should not necessarily be normalized for use in a specific application, but provisioned for more general access and analysis.

Visualizations: Data is often highlighted using advanced visualizations and graphics where traditional charts do not offer enough insight into overall dataset.

Accuracy: Provide results that are relevant and statistically sound. Tools must highlight areas of concern where assumptions may be incomplete or statistically invalid.

Range of users: Tools must allow business users to use analytics and statistics without insisting that these users be domain experts.

Compatibility: Results that are compliant with existing reporting and application frameworks.

Persuade: Use results to prove a hypothesis and to persuade.

Related Information: Extend knowledge to related concepts and other sources of data.

Explain the various data preparation processes
With emphasis on the following:

Explain methods used to transform the data for analytics

One of the main functions of an ETL tool is to transform structured data. The transformation step is the most vital stage of building a structured data warehouse. Here are the major transformation types in ETL:

- Format revision. Fields can contain numeric and text data types. If they do, you need to standardize and change the data type to text to provide values that could be correctly perceived by the users. The length of fields can also be different and you can standardize it.

- Decoding of fields. In multiple source systems, the same data items are described by a variety of field values. Also, many legacy systems are notorious for using cryptic codes to represent business values. This ETL transformation type changes codes into values that make sense to the end-users.

- Calculated and derived values. Sometimes you have to calculate the total cost and the profit margin before data can be stored in the data warehouse, which is an example of the calculated value. You may also want to store customer's age separately–that would be an example of the derived value.

- Splitting single fields. The first name, middle name, and last name, as well as some other values, were stored as a large text in a single field in the earlier legacy systems. You need to store individual components of names and addresses in separate fields in your data repository to improve the operating performance by indexing and analyzing individual components.

- Merging of information. This type of data transformation in ETL does not literally mean the merging of several fields to create a single field. In this case, merging of information stands for establishing the connection between different fields, such as product price, description, package types, and viewing these fields as a single entity. 
- Character set conversion
- Unit of measurement conversion
- Date/Time conversion
- Summarization
- Key restructuring
- De-duplication

Transformation of structured data also varies by where it occurs. Data can be transformed in the source system before it is moved, in an ETL engine, or in the target system after it lands (ELT).

Some type of feature extraction must be applied to unstructured data to convert it to structured data before applying the above kinds of transformations.
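To make the transformation types above more concrete, here is a minimal sketch of a few of them (format revision, decoding of fields, a derived value, a calculated value, and splitting a single field) in plain Python; the field names and code table are invented for illustration:

# Illustrative ETL-style transformations on one source record (hypothetical fields).
from datetime import date

STATUS_CODES = {"A": "Active", "I": "Inactive", "P": "Pending"}  # decoding of cryptic codes

record = {"customer_id": 42, "status": "I", "full_name": "Maria da Silva",
          "birth_date": "1980-06-15", "unit_price": "10.50", "quantity": "3"}

def transform(rec):
    out = {}
    out["customer_id"] = str(rec["customer_id"])                # format revision: number -> text
    out["status"] = STATUS_CODES.get(rec["status"], "Unknown")  # decoding of fields
    parts = rec["full_name"].split()                            # splitting a single field
    out["first_name"], out["last_name"] = parts[0], parts[-1]
    born = date.fromisoformat(rec["birth_date"])
    today = date.today()
    out["age"] = today.year - born.year - ((today.month, today.day) < (born.month, born.day))  # derived value
    out["total_cost"] = float(rec["unit_price"]) * int(rec["quantity"])  # calculated value
    return out

print(transform(record))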

Explain methods used to clean the data for analytics.

Data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate data from a data set.

Data validation typically applies data cleansing methods interactively as data is being entered. Other forms of data cleansing are typically performed in larger batches.

Care must be taken when modifying data to be used for analytics that the modifications do not affect the analytic results. Enough cleansing must be done to permit analytics, but less change is better.

Error detection may be strict or fuzzy. Strict detection may match against permitted legal values, or apply simple algorithms like regular expressions. Fuzzy detection applies statistical techniques to identify errors. Duplicates may be detected within a data set using strict or fuzzy matching.

Error correction may include standardization of values, converting formats, enhancing data by merging with additional data, filtering out incomplete records, and merging duplicate records.

High-quality data needs to pass a set of quality criteria. Those include:
- Validity: The degree to which the measures conform to defined business rules or constraints. Data constraints fall into the following categories: 

- Data-Type Constraints - e.g., values in a particular column must be of a particular datatype, e.g., boolean, numeric (integer or real), date, etc.

- Range Constraints: typically, numbers or dates should fall within a certain range. That is, they have minimum and/or maximum permissible values.

- Mandatory Constraints: Certain columns cannot be empty.

- Unique Constraints: A field, or a combination of fields, must be unique across a dataset. For example, no two persons can have the same social security number.

- Set-Membership constraints: The values for a column come from a set of discrete values or codes. For example, a person's gender may be Female, Male or Unknown (not recorded).

- Foreign-key constraints: This is the more general case of set membership. The set of values in a column is defined in a column of another table that contains unique values. For example, in a US taxpayer database, the "state" column is required to belong to one of the US's defined states or territories: the set of permissible states/territories is recorded in a separate States table. The term foreign key is borrowed from relational database terminology.

- Regular expression patterns: Occasionally, text fields will have to be validated this way. For example, phone numbers may be required to have the pattern (999) 999-9999.

- Cross-field validation: Certain conditions that utilize multiple fields must hold. For example, in laboratory medicine, the sum of the components of the differential white blood cell count must be equal to 100 (since they are all percentages). In a hospital database, a patient's date of discharge from hospital cannot be earlier than the date of admission.

- Decleansing is detecting errors and syntactically removing them for better programming.

- Accuracy: The degree of conformity of a measure to a standard or a true value.

- Completeness: The degree to which all required measures are known. 

- Consistency: The degree to which a set of measures is equivalent across systems.

- Uniformity: The degree to which a set of data measures is specified using the same units of measure in all systems.
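A minimal sketch of how some of the validity checks above could be expressed in Python; the rules, field names and sample record are invented for illustration and are not tied to any particular cleansing tool:

# Illustrative strict error detection against a few constraint types (hypothetical rules).
import re

ALLOWED_STATES = {"NY", "CA", "TX"}                    # set-membership / foreign-key style set
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")   # regular expression pattern

def validate(row):
    errors = []
    if not row.get("customer_id"):                          # mandatory constraint
        errors.append("customer_id is required")
    if not isinstance(row.get("age"), int):                 # data-type constraint
        errors.append("age must be an integer")
    elif not 0 <= row["age"] <= 130:                        # range constraint
        errors.append("age out of range")
    if row.get("state") not in ALLOWED_STATES:              # set-membership constraint
        errors.append("unknown state code")
    if row.get("phone") and not PHONE_PATTERN.fullmatch(row["phone"]):
        errors.append("phone does not match (999) 999-9999")
    if row.get("discharge_date") and row.get("admission_date"):
        if row["discharge_date"] < row["admission_date"]:   # cross-field validation
            errors.append("discharge before admission")
    return errors

print(validate({"customer_id": 1, "age": 34, "state": "CA", "phone": "(555) 123-4567"}))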

Data Preparation: Data is prepared to balance flexibility, timeliness, accuracy and size. ETL and Big Data are often driven by different requirements.
- ETL
- Volume: Dataset size is driven by application requirements. Datamarts contain summary or snapshot data.
- Velocity: Data is staged until it can be filtered and added to the appropriate data warehouse.
- Variety: Key structures normalized to follow First Normal Form (1NF) to Third Normal Form (3NF) schema
- Veracity: Data is trustworthy within the domain of the application for which it is designed.
- Big Data
- Volume: Dataset size bounded by technology and cost constraints. Virtual datamarts contain dynamic information.
- Velocity: Data is captured for both real time and post processing as required.
- Variety: Key structures are denormalized. Data is generally inconsistent when viewed across application domains.
- Veracity: Data is trustworthy when it is shown to be consistent and accurate.

Describe the trade-off of transforming, cleaning or aligning of data for analytics

Tradeoffs: Data becomes more accurate within the domain of an application as it is processed and cleaned. At the same time, it can become less flexible and less accurate in different applications. Data is more readily consumed by analytics when it is formatted in columns.
- Variety: Data comes from multiple sources, across different domains.
- Generality: Less processed data potentially is more flexible but less consistent.
- Accuracy: Highly processed data is potentially more accurate in predefined application domains; less so across domains.
- Conformance: Data that is aligned with existing reporting structures is typically easier to use and more relevant. Data alignment can reduce consumability in different domains.
- Consistency: Removal of duplicate and missing values increases accuracy within targeted application domain.
- Schema and Cleaning: Data structures in 1NF to 3NF increase relevance within targeted application domain.

Describe the four steps for preparing data that is "good enough" for analytics

These are "Data Wrangling" steps for a data reservoir:
- Provision storage. Consider cloud for elastic environments. 
- Create columns that classify and distinguish incoming data. Use date/time and geolocation when nothing else fits.
- Clean potential key columns so they have the most consistent values possible across the largest number of domains (non-trivial). Remove duplicates and normalize aliases to achieve something close to 1NF.
- Apply analytics to add value to the source data set. Update schema and global catalog where appropriate.

Describe the need and use for geospatial and temporal filtering in BD preparation.

Filtering by an interval of time can reduce the amount of data to be considered for analytic purposes.

Filtering by a geospatial region can reduce the amount of data to be considered for analytic purposes.

These 2 filters are commonly applied as a first step to reduce large data volumes.
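A minimal sketch of what these two filters look like as a first data-reduction step, assuming each record carries a timestamp and latitude/longitude fields (all names and values here are invented for the example):

# Illustrative temporal + geospatial (bounding box) filtering of event records.
from datetime import datetime

events = [
    {"ts": datetime(2015, 4, 7, 10, 0), "lat": -15.79, "lon": -47.88, "value": 10},
    {"ts": datetime(2015, 4, 8, 22, 0), "lat": 40.71, "lon": -74.00, "value": 7},
]

def in_window(e, start, end):
    return start <= e["ts"] <= end

def in_box(e, lat_min, lat_max, lon_min, lon_max):
    return lat_min <= e["lat"] <= lat_max and lon_min <= e["lon"] <= lon_max

start, end = datetime(2015, 4, 7), datetime(2015, 4, 8)
filtered = [e for e in events
            if in_window(e, start, end) and in_box(e, -35.0, 5.0, -75.0, -30.0)]
print(filtered)  # only the first event passes both filters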

Explain specific data preparation techniques for creating structured data
With emphasis on the following:

Define text analytics

Text analytics refers to the process of deriving high-quality information from text. It involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.

Describe the benefits of text analytics

Text analytics applies analytic techniques to unstructured text such as may be found in documents or within text fields in structured data.

Text analytics on survey text lets you transform unstructured survey text into quantitative data to gain insight using sentiment analysis.

Text analytics on social media content can be used to identify buzz (level of interest in a topic) and sentiment (positive or negative feelings toward a topic). Social media content can also augment customer profiles.

Social media analytics can be used to predict customer behavior and create targeted marketing campaigns

Text analytics can be used to filter or categorize documents, such as sorting email into folders or identifying spam

NLP can provide the following benefits from text analytics:

- Provides an easy to use environment for capturing the knowledge of your business domain experts into dictionaries and semantic rules for re-use. 

- Allows customizable Information Extraction for logical reasoning to draw inferences from natural, unstructured communications. 

- Offers Entity & Relationship Recognition to classify words or phrases into categories that can be analyzed for business meaning.

Explain how to extract features from unstructured data to provide input for analytics

Analytic tools and algorithms require input to be presented in specific statistical data types: binary (yes/no or true/false), categorical (arbitrary labels like blood type or gender), ordinal (relative score), binomial (number of successes out of possible), count (number of items in a given space or time), and real numbers.

Feature extraction is any process for converting unstructured data into statistical data types. Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately. Some examples:

- Feature extraction on text includes word stemming to standardize words into root words, bag of words to remove sequence information and compare documents in a vector space model, counting word occurrences, regular expressions for detecting simple patterns in sequences, search indexing, and phonetic indexing algorithms like Soundex or NYSIIS.

- Feature extraction on image data includes histograms, edge detection, blob detection, template matching, motion detection, visual flow, and optical character recognition.

- Feature extraction on machine log data may include pattern recognition, standardization to a common format, classification, correlation (collecting messages from different systems that belong to a single event), and various filtering techniques.
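A minimal sketch of text feature extraction as described above: bag of words with word counts over a tiny document set, plus a simple regular-expression pattern detector. Everything here is a toy example, not any particular product's API:

# Toy bag-of-words feature extraction: turn free text into fixed-length count vectors.
import re
from collections import Counter

docs = ["Payment failed twice, card 1234 rejected",
        "Payment accepted, thanks for the quick support"]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

vocabulary = sorted({tok for d in docs for tok in tokenize(d)})

def bag_of_words(text):
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]   # one count per vocabulary term

vectors = [bag_of_words(d) for d in docs]
print(vocabulary)
print(vectors)

# Simple regular-expression pattern detection on the same text (possible card fragments).
print([re.findall(r"\d{4}", d) for d in docs])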

Describe the general sources of data for BD&A
With emphasis on the following:

Big data requires a new view on business intelligence, data management, governance, and delivery. Traditional IT landscapes are becoming too expensive to scale and maintain in order to meet the continuous exponential growth of data, both structured and unstructured. Data truly is driving the need for innovative and more cost effective solutions.

The availability of new and more data opens up opportunities for gaining more in depth and accurate insights on operations, customers, weather, traffic, and so on. It is important to keep in mind that Big Data & Analytic solutions deal not only with emerging data (or new data sources). A truly cohesive solution considers traditional data as well as this emerging data, often with an objective of having each enhance the understanding of the other.

Data volumes and data generation speed are significantly higher than they have ever been before. All these new kinds of data require a new set of technologies to store, process, and make sense of the data. About 80% of the data available to an organization is unstructured, and many new types of data are coming from sources such as social media posts (Twitter, Facebook), surveillance cameras, digital pictures, call center logs, climate data, and many others. There are also other types of structured data from sensor devices, smart meters, clickstream data, and others.

Big data is typically identified as having one or more of the following attributes: volume, variety, velocity, and veracity. Typical types of source data can be broken down into the following categories, each of which may demonstrate any of the four V's.
- Machine and sensor data: An increasingly sensor-enabled, instrumented, and connected world generates huge volumes of data with machine-speed characteristics; this covers all data generated by machines. Examples: RFID data; data generated from servers, applications, networks, etc.
- Image and video: Digital images and videos. Examples: security surveillance cameras, smartphones, web images/videos.
- Enterprise content: An organization's documents, and other content, that relate to the organization's processes. Examples: documents, forms, checks.
- Transaction and application data: Typically structured data that describes an event, generally recognized as a traditional source. Examples: point-of-sale transaction data, data entered by a user via a web form, data entered into a CRM application.
- Social data: An expression of social media (user-generated content on the internet) in a computer-readable format that also includes metadata providing context (location, engagement, links). Focused strictly on publicly available data. Examples: Twitter, Facebook, YouTube.
- Third-party data: Data obtained under license from third-party organizations. Typically structured and recognized as a traditional source. Examples: Experian, D&B.

Section 2: Big Data & Analytics Design Principles

Explain when it is appropriate to use Hadoop to support the BD&A use case
With emphasis on the following:

Exploration, Landing and Archive - Big Data Repositories:

A new and economical generation of technology has emerged to enable organizations to explore, apply analytics to, and extract value from any kind of data. Open source Hadoop became the technology of choice to store all the data (big data).

There are some limitations or drawbacks to a Hadoop database: it often compromises consistency in favor of availability, and the current offerings lack full ACID transaction support. Also, as the tools are still evolving, there is a need for some complex coding to perform certain types of data processing. Even so, the platform offers many benefits that enable organizations to deliver analytics more efficiently.

Because Hadoop is based on a Massively Parallel Processing (MPP) architecture, it's possible to scale a Hadoop cluster to hundreds (and even thousands) of nodes. Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of your larger data sets, which provides the scalability that is needed for big data processing. In addition to cost and scalability, Hadoop also brings other benefits such as high availability and data redundancy.

Landing

Landing refers to a landing or provisioning area. In the case of Big Data & Analytics this landing area may support data that is not yet modeled and/or unstructured data. A key consideration is to utilize a cost efficient platform. Data can be landed in its original format (raw state) and leveraged both for a historical record of its original state and to support exploration activities.

Open source Hadoop on commodity hardware became the preferred choice, as it can scale as needed and offers a cost-effective solution to store and process large amounts of structured and unstructured data. A Hadoop database provides a mechanism for storage and retrieval of data that is not necessarily modeled in the traditional tabular relations used in relational databases.

One of the key benefits of the Hadoop system is the fact that data can be stored in the Hadoop Distributed File System (HDFS) before a schema is defined. This makes it very simple for organizations to collect any kind of data (structured and unstructured) and land it in a Hadoop system for further investigation and processing. After the data is loaded, a schema can be applied, providing SQL-like capabilities that allow data interaction and exploration using traditional BI and reporting tools.

With the evolution of data integration tools, the Hadoop platform also became an attractive, cost-effective platform to support part of the data transformation process. This allows organizations to offload such workloads from expensive platforms such as a data warehouse.

Exploration

Exploration is one of the key elements of the next-generation architecture for delivering information and insights.

Exploration refers to a system supporting exploratory querying and analytics on detailed data (not aggregated), often in raw form. While it may have an SQL-based interface, it is not restricted to structured data. Activities supported here include clustering, segmentation, forecasting, predictive analytics, and others that require large amounts of data. Typical users include business analytics users and/or data scientists.

In this new architecture the exploration repository has much more flexibility and can provide much more agility for end users to quickly obtain insights on new sets of information that were not available in the data warehouse and data marts. New Big Data and Analytics tools allow users to access structured and unstructured data for exploratory analytics and to identify interesting correlations and insights, particularly by leveraging the detailed data to support exploratory analysis while pushing down as much processing as possible to execute directly in this area, without needing to duplicate and move the data elsewhere for exploratory use.

Typical tools used are R, SAS, and SPSS. The data here may include replicated data pulled from operational systems, third party and public sources. Exploration activities imply that the answer or the question is not typically known. Thus, a modeled environment is ill suited for these activities. Rather, a schema-less environment allows users to access data in its original form quickly, manipulate it in various ways, and use all available data.

The cost and flexibility of this platform make it ideal for storing a longer history of data (data that was typically stored in the warehouse and data marts). Basically, the Hadoop database becomes a system-of-record area and fulfills the needs of the power users (data scientists) who need to perform exploratory analytics on deep history (several years) and raw data.

Archive

While the need to archive "cold" data from data warehousing environments (to reduce costs and improve performance) is as necessary as ever, customers also require the ability to query this data. Similarly, while this data may be of no interest for operational reporting or business intelligence, it may still be relevant for a small set of users performing exploratory or deep analytics.

As the Hadoop infrastructure provides a more cost-effective platform, data that no longer needs to be active can be archived there for legal reasons. This allows the data to be accessed in the future without the need to procure new hardware to restore it from other magnetic devices.

Explain when it is appropriate to use data streaming to support the BD&A use case
With emphasis on the following:

When streaming data is available, it can either be analyzed in real time as it arrives or stored for later analysis.

Analyzing streaming data in real time can reduce the time to deliver results.

Analyzing streaming data in real time can reduce the storage required by eliminating the need to collect data in a repository before analyzing it.

Techniques that can be applied to streaming data include summarization by creating metadata from unstructured data, aggregating data to create averages, filtering and other data reduction techniques.

Streaming data from multiple sources can be correlated and merged into a single stream.

Streaming data includes audio data, video data, sensor data, geospatial data, telemetry, telematics, and machine log data. The internet of things is generating an increasing volume of streaming data.

Sample use cases are:

Real time fraud detection, cyber security, telematics, network load optimization, real time contextual marketing campaign
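A minimal sketch of the kind of processing described above, filtering and aggregating a stream as it arrives with a sliding-window average, written as plain Python generators rather than any specific streaming engine (the sensor source and alert threshold are invented for the example):

# Illustrative stream processing: filter readings and keep a sliding-window average.
from collections import deque
import random

def sensor_stream(n=20):
    # Stand-in for an unbounded source (sensors, telemetry, logs...).
    for _ in range(n):
        yield {"sensor": "s1", "reading": random.uniform(0, 100)}

def windowed_average(stream, size=5, alert_above=80.0):
    window = deque(maxlen=size)
    for event in stream:
        if event["reading"] < 0:          # filtering / data reduction
            continue
        window.append(event["reading"])
        avg = sum(window) / len(window)   # aggregation on data in motion
        if avg > alert_above:
            print("ALERT: average", round(avg, 1), "over last", len(window), "readings")
        yield avg

for _ in windowed_average(sensor_stream()):
    pass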

Explain when it is appropriate to leverage data streaming and Hadoop for data integration Extract, Transform, and Load (ETL).
With emphasis on the following:

Hadoop is designed to use low-cost commodity hardware to provide a data repository. Using Hadoop as a landing zone for streaming data can provide a cost effective repository for collecting streaming data for later exploration and analysis.

Other options for landing streaming data include relational database or file storage systems.

Describe when to use analytics on data-in-motion vs analytics on data-at rest to support a business use case
With emphasis on the following:

Analytics on Data at Rest

Data at rest consists of physically stored data that is considered more static. This data supports analytics for use cases where decisions are not needed in real time.

Data at rest provides historical data for analysis.

Analytics on data-in-motion - Real-Time Analytical Processing

Data in motion consists of leveraging high-speed, highly scalable computing systems to perform analytics on streaming data that is temporarily persisted in memory.

Applying analytics to data in motion supports real-time decision-making, adapting to business environments, and helping customers when it really matters - now. It involves ingesting data at varying volumes and velocities, classifying and filtering that data, applying appropriate analytics to it, and providing the appropriate information immediately so that proper action may be taken (either by a person or automatically). Data may be coming from social media, machine data, log data, or other sources.

The analytics applied in real time are the same models that can be applied to batch data analysis; the models are built from collected batch data. Streaming data may eventually be landed to disk and used to update or fine-tune the models. This creates a continuous learning cycle. With each new interaction an organization learns more about its customers, which can then be applied to future interactions. In this way, real-time processing works in coordination with the broader big data and analytics architecture. It acts as a method of ingesting data to persist in the environment, and as a method to apply analytics in the moment for immediate results.

Real-time analytical processing (RTAP) analyzes massive data volumes quickly (in real time) and turns data into insight to be used to make better decisions. Data can be quickly ingested, analyzed, and correlated as it arrives from thousands of real-time sources (sensor devices, machine data, call detail records, etc.). The insights resulting from this processing are turned into actions for automated decision management.

There are many applications of RTAP in almost any industry; here are just a few examples to illustrate the benefits of real-time processing and analytics.

Alerting solutions:

The RTAP application notifies the user(s) that the analysis has identified that a situation (based on a set of rules or process models) has occurred, and provides options and recommendations for appropriate actions. Alerts are useful in situations where the application should not be automatically modifying the process or automatically taking action. They are also effective in situations where the action to be taken is outside the scope of influence of the RTAP application.

Example of alerting application: A patient monitoring application would alert a nurse to take a particular action, such as administering additional medicine.

Feedback applications:

The RTAP application identifies that a situation (based on a set of rules or process models) has occurred and makes the appropriate modifications to the processes to prevent further problems or to correct the problems that have already occurred.

Feedback analysis is useful in, for example, manufacturing scenarios where the application has determined that defective items have been produced and takes action to modify components to prevent further defects.

As an example: A manufacturer of plastic containers might run an application that uses the data from sensors on the production line to check the quality of the items through the manufacturing cycle. If defective items are sensed, the application generates instructions to the blending devices to adjust the ingredients to prevent further defects from occurring.

Detecting system or application failures:

The RTAP application is designed to notice when a data source does not respond or generate data in a prescribed period of time.

Failure detection is useful in determining system failure in remote locations or problems in communication networks.

Examples: An administrator for a critical communications network deploys an application to continuously test that the network is delivering an adequate response time. When the application determines that the speed drops below a certain level or the network is not responding at all, it alerts the administrator.

Use the CAP theorem to choose an optimal data storage technology.
With emphasis on the following:

Brewer's CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

Consistency (all nodes see the same data at the same time)

Availability (a guarantee that every request receives a response about whether it was successful or failed)

Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

Relational database systems provide consistency and availability.

NoSQL systems like Cloudant and CouchDB provide availability and partition tolerance.

NoSQL systems like HBase, Google BigTable, and MongoDB provide consistency and partition tolerance.

Describe the considerations of security on BD&A.
With emphasis on the following:

Explain what are the common focused areas across a security framework

People: Manage and extend enterprise identity context across all security domains with end-to-end Identity Intelligence.

Data: Enterprise-wide solutions for assuring the privacy and Integrity of trusted Information in your data center

Applications: Reducing the costs of developing secure applications and assuring the privacy and Integrity of trusted information.

Infrastructure:
-Endpoint and Server: Ensuring endpoints, servers, and mobile devices remain compliant, updated, and protected against today's threats.
-Network: Guard against sophisticated attacks using an Advanced Threat Protection Platform with Insight into users, content and applications.

Security Intelligence and Analytics: Helping our customers optimize security with additional context, automation and integration.

Describe the data types in the context of data security and compliance

When talking about data, we distinguish between:

Static Data - data at rest, on disks, tapes and in data repositories (databases, data warehouses, etc)

Dynamic Data - data in motion, as it is being extracted/used by individuals or applications

Meta Data - Data about data, configuration, the settings and vulnerability of the repository itself.

Describe dynamic data security best practices

Create a secure, detailed, verifiable audit trail of all database activities
-User activity, including privileged users
-User creation and object creation and manipulation

Gain visibility into all database activity involving sensitive data
-Who, what, when and how
-Real-time alerts for suspicious activity

Integrate with business processes for audit compliance
-Dissemination of reports to appropriate personnel for signoff and review
-Retain reports and signoffs per audit requirements

Cross-platform, common solution for the enterprise.

Describe Static Data security best practices

All data sources potentially contain sensitive information

Data is distributed as needed throughout the cluster by the Big Data application

Deploy encryption agents to all systems hosting Data Stores

Agents protect the data store at the file system or volume level

Big Data governance automation solutions need to support security and privacy in a manner optimal to the speed and quality objectives of the organization.

The solutions should integrate with corporate glossary, data definitions, and blueprints to align with broader corporate governance processes to ensure accuracy.

The solution should be able to explore and profile data, explore its lineage and relationships, and classify sensitive data.

Data Architects need to De-identify sensitive data within the warehouse and apply obfuscation techniques to both structured and unstructured data while maintaining effective alignment with their analytics objectives.

IT security needs to monitor the warehouses and gain real-time alerts with centralized reporting of audit data while preventing data breaches.

IT needs the system to automate change management in response to policy, resource and environmental changes.

Enterprise firms need this to scale to large platforms like System z.

Security officers will need the system to integrate with their SIEM platform for a cohesive enterprise dashboard.

Describe the primary prerequisites for predictive analytics.
With emphasis on the following:
List prerequisites for predictive analytics:

Data sufficient to train a predictive model for the predictive goal

A business requirement for predictive analytics, such as one of the business applications listed in this article (i.e., a way a predictive model can and will be used, rather than just being a nifty model that may not provide business value); management buy-in for the integration and deployment of predictive scores

Buy-in from the business user community for a predictive analytics initiative

Management sponsorship

Historic data that adequately captures problem domain

Propose alternate outcomes and determine rules that model associated behaviors.

Explain the role of SQL for BD&A
With emphasis on the following:

SQL on Hadoop to support Big Data & Analytics Applications

Hive
- Hive is a data warehouse solution that has a thin SQL-like querying language called HiveQL. This language is used for querying data, and it saves you from writing native MR processing to get your data out. Since you already know SQL, Hive is a good solution because it enables you to take advantage of your SQL knowledge to get data in and out of Apache Hadoop. One limitation of the Hive approach, though, is that it makes use of the append-only nature of HDFS to provide storage. This means that it is phenomenally easy to get the data in, but you cannot update it. Hive is not a database but a data warehouse with convenient SQL querying built on top of it. Despite the convenient interface, query times, particularly on very large datasets, are long enough that jobs are submitted and results are accessed when available, which means the information is not available interactively.

HBASE
- HBase, by comparison, is a key-value (NoSQL) data store that enables you to write, update, and read data randomly, just like any other database. But it's not SQL. HBase enables you to make use of Hadoop in a more traditional real-time fashion than would normally be possible with the Hadoop architecture. Processing and querying data is more complex with HBase, but you can combine the HBase structure with Hive to get an SQL-like interface. HBase can be really practical as part of a solution that adds the data, processes it, summarizes it through MR, and stores the output for use in future processing.

In short, think of Hive as an append-only SQL database and HBase as a more typical read-write NoSQL data store. Hive is useful for SQL integration if you want to store long-term data to be processed, summarized, and loaded back. Hive's major limitation is query speed. When dealing with billions of rows, there is no live querying of the data that would be fast enough for any interactive interface to the data. For example, with data logging, the quantities of data can be huge, but what you often need is quick, flexible querying on either summarized or extreme data (i.e., faults and failures). HBase is useful when you want to store large volumes of flexible data and query that information, but you might want only smaller datasets to work with. Hence, you might export data that simultaneously: needs to be kept "whole", such as sales or financial data; may change over time; and also needs to be queried. HBase can then be combined with traditional SQL or Hive to allow snapshots, ranges, or aggregate data to be queried.
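As a rough illustration of the difference in access styles (not an example from the exam material), a batch-style Hive query versus a random single-row read/write in HBase might look something like this in Python, assuming the community pyhive and happybase client libraries and made-up host and table names:

# Illustrative only: batch-style SQL over Hive vs. random read/write in HBase.
# Assumes the pyhive and happybase client libraries and hypothetical host/table names.
from pyhive import hive
import happybase

# Hive: submit a HiveQL query; results come back when the batch job finishes.
hive_conn = hive.Connection(host="hadoop-master", port=10000, username="analyst")
cursor = hive_conn.cursor()
cursor.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)

# HBase: write and read a single row by key, closer to a traditional data store.
hbase_conn = happybase.Connection("hadoop-master")
table = hbase_conn.table("sales_snapshots")
table.put(b"2015-04-08#BR", {b"summary:total": b"123456.78"})
print(table.row(b"2015-04-08#BR"))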

SQL on Relational Databases (RDBMS) to support Big Data & Analytics

SQL on Relational Databases is more appropriate for transactional and Operational Analytical Workloads.

Transactional Applications:
- Workloads that drive a lot of inserts, updates and deletes
- Workloads that generate random access on disk 
- Workloads that need to guarantee transaction integrity.
- Workloads with large concurrency (hundreds/thousands of transactions per second).

Operational Analytical Applications:
- Analytical workloads in which high query performance is a key requirement.
- Analytical workloads that require the transactional characteristics of a relational database (transaction integrity, high concurrency levels, etc.).

Compare columnar and row oriented storage for RDBMS .
With emphasis on the following:

A columnar database stores data in columns instead of rows. The major differences between traditional row-oriented databases and column-oriented databases are in performance, storage requirements, and schema modification.

Columnar databases are a great option when your table has lots of columns and your queries touch only a small number of them. A column store is designed to write and read data from hard disk storage efficiently, which can speed up the time to return a query. One of the major benefits of a columnar database is that it compresses the data greatly, which makes operations very fast. Also, a column-oriented database is self-indexing, so it uses less disk space than a row-oriented database system. The columnar approach is especially important in the data warehousing domain, which deals with large volumes of data: large amounts of complex data are loaded, transformed, and accumulated, which can be done easily with a column-oriented database system.

Row-oriented storage is ideal when many columns of a single row are required at the same time and when the row size is relatively small, as the entire row can be retrieved with a single disk read. Row-oriented databases are well suited for OLTP workloads, which are more heavily loaded with interactive transactions.
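A minimal sketch of why the layout matters: the same data stored row-wise and column-wise in plain Python, where an analytic query touching one column only scans a single contiguous list in the columnar layout (no specific database engine implied; the table and values are invented):

# Row-oriented vs column-oriented layout of the same table (illustrative only).
rows = [
    {"order_id": 1, "customer": "Ana",  "region": "SP", "amount": 120.0},
    {"order_id": 2, "customer": "Beto", "region": "RJ", "amount": 80.0},
    {"order_id": 3, "customer": "Caio", "region": "SP", "amount": 200.0},
]

# Column store: one list per column, stored contiguously and easy to compress.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Analytic query: total amount. The columnar layout reads only the "amount" column...
total_columnar = sum(columns["amount"])

# ...while the row layout drags every other field of every row along with it.
total_row = sum(r["amount"] for r in rows)

# OLTP-style access: fetch one whole order. The row layout returns it in one step.
order_2 = rows[1]

print(total_columnar, total_row, order_2)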

Describe when to leverage "Information Virtualization" on BD&A
With emphasis on the following:

Explain what is "Information Virtualization".

Information virtualization provides views over stored information that has been adapted to suit the needs of systems and processes that use the information. Information virtualization enables information to be stored in various locations but managed, viewed, and used as a single collection. It maximizes the sharing of information, while ensuring individual teams receive information that is customized to their needs.

Information virtualization has two layers of services:
- Information delivery: Information Locator, Search & Navigation, User Interfaces & Reports, Information Services & application program interfaces(APIs).
- Information provisioning: Cache, Consolidation, Federation, Replication.

Describe the information virtualization capabilities.

Information delivery.
- Consumer-focused access points include user interfaces, services, and APIs that provide information to both systems and people. This information can be the business information itself or descriptions and statistics about the business information, which is called metadata. Metadata is used to locate the correct information to use.

Information provisioning.
- Authoritative information sources are made available to the access points using the most appropriate provisioning mechanisms. The provisioning mechanisms can be the following items: 

- Caching provides read-only, in-memory local copies of data that provide fast access for situations where the same query is issued repeatedly on slowly changing information. 

- Federation is real-time extraction and merging of information from a selection of sources providing the most current (but not necessarily most consistent) information. 

- Consolidation makes use of a pre-constructed source of information fed from multiple sources. This approach provides consistent and complete local information (although it might not be the latest available). 

- Replication is an exact local copy of remotely sourced information, which provides locally stored, read-only access to information that is maintained elsewhere. 

- The choice of provisioning method is determined by the needs of the systems and processes that use the information. 

Describe the various ways to index Big Data including.
With emphasis on the following:

Search: Search provides access to unmodeled structured and unstructured data using keywords.

Variety: Data comes from multiple sources, across different domains.

Ease of Use: Data related to keywords is easily found using familiar search interface.

Association: Keywords and results can be associated to metadata that provides easy association with enterprise reporting and analytics.

Indexing: Search indexes enable a search engine to retrieve ordered results in a fraction of the time needed to access the original content with each query.

Indexing: Full-text indexes provide access to large amounts of search data with results provided in milliseconds
- Indexing structured data in columns is the basis for faceted result groupings.
- Indexing unstructured data is the basis for keyword clustering and grouping.
- Indexing structured and unstructured data can enable better search queries by augmenting unstructured keywords with related structured metadata.

Results: Search engines provide results by accessing the index that was built by crawling and processing the source content.

Filtering: Faceted results based on structured data indexing facilitate refinement and filtering.

Exploration: Keyword cluster results based on unstructured data facilitate exploration and discovery of related information.
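A minimal sketch of the indexing and faceting ideas above, assuming a tiny in-memory document collection (not any particular search engine): a full-text inverted index answers keyword queries quickly, and structured metadata attached to each document drives faceted result groupings.

    from collections import defaultdict

    # Hypothetical documents: unstructured text plus structured metadata.
    docs = {
        1: {"text": "router overheating after firmware update", "product": "router"},
        2: {"text": "modem firmware update failed",              "product": "modem"},
        3: {"text": "router dropping wifi connections",          "product": "router"},
    }

    # Build the inverted (full-text) index: keyword -> set of document ids.
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for word in doc["text"].split():
            index[word].add(doc_id)

    def search(keyword):
        """Keyword lookup against the index instead of scanning the content."""
        hits = index.get(keyword, set())
        # Faceted grouping based on the structured 'product' column.
        facets = defaultdict(int)
        for doc_id in hits:
            facets[docs[doc_id]["product"]] += 1
        return sorted(hits), dict(facets)

    print(search("firmware"))   # e.g. ([1, 2], {'router': 1, 'modem': 1})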

Describe the benefits of in-memory database to support BD&A solutions.
With emphasis on the following:

In-memory databases provided faster response times than traditional disk-based RDBMSs for applications that required extreme performance. However, the drawback was that the data set had to fit entirely in memory, because most of these in-memory products had no ability to store inactive parts of tables on disk the way traditional RDBMS vendors did. Therefore, these in-memory databases tended to be used for small transaction-processing applications or small data marts rather than data warehouses, as the data set size and memory requirements did not lend themselves to the volumes of larger data warehouse environments.

Analytic queries tend to access only a portion of a table's columns rather than the entire row of data. On a traditional row-based table the result is many more I/O requests to bring the requested columns into memory, because the database must pull full rows of data from disk to access the desired columns. END RESULT - Poor application performance.

In-memory access to data is much faster than disk I/O, and although today's servers allow large memory configurations for database buffer caches, having to bring entire rows of data into memory (to access only a few columns) wastes memory and is not as efficient as a columnar table for analytic queries. END RESULT - Server resources (memory and cores) are inefficiently utilized and query performance is not optimized (better than with smaller amounts of memory, but not as efficient, because undesired columns are stored in the database buffer cache).
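To make the row-versus-column argument concrete, here is a small, hedged Python sketch (in-memory lists standing in for disk pages, not any vendor's storage engine): the analytic query needs only one column, so the columnar layout touches far less data than pulling entire rows.

    # The same table in two layouts.
    row_store = [                      # one tuple per row (OLTP-friendly)
        (1, "north", 120.0, "2015-04-01"),
        (2, "south",  75.5, "2015-04-01"),
        (3, "north", 210.0, "2015-04-02"),
    ]

    column_store = {                   # one array per column (analytics-friendly)
        "id":     [1, 2, 3],
        "region": ["north", "south", "north"],
        "amount": [120.0, 75.5, 210.0],
        "date":   ["2015-04-01", "2015-04-01", "2015-04-02"],
    }

    # Analytic query: SUM(amount).
    # Row layout: every full row is materialized just to read one field.
    total_rows = sum(row[2] for row in row_store)

    # Column layout: only the 'amount' column is scanned.
    total_cols = sum(column_store["amount"])

    assert total_rows == total_cols == 405.5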

Big data volumes are increasing faster than memory costs are dropping

In-memory databases can support a larger number of concurrent users. Responsiveness is higher because the data is resident in memory.

Section 3: IBM Big Data & Analytics Adoption

Explain how to leverage maturity models for IBM Big Data & Analytics to identify the customers current state and define the future progression. 
With emphasis on the following:

A Solution Advisor can leverage a variety of maturity models to support their assessment of an organization's readiness for BD&A, such as the IBM BD&A Maturity Model and the Analytics Quotient (AQ) Maturity Model. The Solution Advisor can use these models to solicit information about the organization that helps identify how prepared it is and how it can most effectively adopt BD&A. The Solution Advisor can review the information with the organization to clearly identify its current ability to consume and manage information and then apply appropriate analytics technology for new insights or decisions. The organization's leaders can use the forward-looking progression points of maturity to consider new levels of advancement they may want to attain and work with the Solution Advisor to outline the next steps to make these improvements. The various models address different qualities of an organization and can be applied in combination or separately, as appropriate.

The IBM Big Data & Analytics Maturity Model helps organizations assess their current capabilities in order to generate value from big data investments in support of strategic business initiatives. It does so by forming a considered assessment of the desired target state, identifying gaps, and provides guidance on the steps required to realize this end state. The BD&A Maturity model depicts 5 stages of progress in maturity for 6 different areas an organization needs to consider to support big data and analytics.

The 5 stages of progression include Ad Hoc, Foundational, Competitive, Differentiating, and Breakaway. These stages are designed to reflect both maturity levels of capability and degree of competitiveness an organization has.

The 6 areas include business strategy, information, analytics, culture and execution, architecture, and governance. These address a holistic set of business and technology areas that affect an organization's ability to compete using BD&A.

Solution Advisors can examine each stage against each measurement area with the customer to identify how well that customer performs across the different areas and so determine how competitive the customer is as a whole. They can then identify the aspects where the customer wants to grow next, and to what level of maturity over a defined period of time, to chart a plan for growth.

AQ Maturity Model measures your organization's readiness, ability and capacity to locate and apply insight, thereby re-orienting your business to make better decisions that deliver better outcomes. It measures your ability to act based on understanding history and context from the past, and your ability to make insightful forecasts and anticipate likely outcomes to optimize recommendations and judiciously automate decisions.

The AQ concept has two core components. The first is a numerical score that we calculate based on your answers to 15 multiple-choice questions. The second component is an AQ Maturity Model that maps these scores to one of four stages of increasing analytical maturity.

The four stages of Analytical Maturity are as follows: Novice, Builder, Leader, Master.

The Solution Advisor can devise a variety of techniques to apply the models and solicit information from the organization to understand their current state and future objectives.

They can use the AQ Maturity model questionnaire to help an organization identify their AQ Maturity. The Solution Advisor can solicit results from appropriate resources in the organization and compile their results in a workshop style. There are guidelines and techniques to advise an organization to progress from each maturity stage that the Solution Advisor can leverage to advise them further.

For more information on the IBM Big Data & Analytics Maturity Model, see the AQ Maturity Model and The Big Data & Analytics Hub.

Provide industry adoption examples for BD&A to guide a customer on their own use cases. 
With emphasis on the following:

A Solution Architect can reference common use cases that align with the customer's industry to guide their client on potential ways to adopt BD&A to solve common problems or attain relevant gains. There are a core set of general use cases that translate across a variety of industries and many vertical-industry specific use cases that directly apply to their unique context. Each of these use cases will have case study examples and configuration patterns of hardware and software that deliver them. Once a Solution Architect has identified the main business objectives for an organization to improve itself using BD&A, she can provide the appropriate use case to guide them on how they might adopt BD&A.

There are 5 core use cases that apply to a diverse range of vertical industries which include Big Data Exploration, Enhanced 360 view of the customer, Security Intelligence Extension, Operations Analysis, and Data Warehouse Modernization. These use cases have configuration patterns of components to ingest and manage the appropriate types of information then apply relevant analytics technologies to support the insight or decision required. Most customers of any vertical industry will find at least one of these use cases to be applicable though they may require some additional industry context to be applied. 

- Big Data Exploration - Find, visualize, understand all big data to improve decision making. Big data exploration addresses the challenge that every large organization faces: information is stored in many different systems and silos and people need access to that data to do their day-to-day work and make important decisions.

- Enhanced 360° View of the Customer - Extend existing customer views by incorporating additional internal and external information sources. Gain a full understanding of customers: what makes them tick, why they buy, how they prefer to shop, why they switch, what they'll buy next, and what factors lead them to recommend a company to others.

- Security Intelligence Extension - Lower risk, detect fraud and monitor cyber security in real time. Augment and enhance cyber security and intelligence analysis platforms with big data technologies to process and analyze new types (e.g. social media, emails, sensors, Telco) and sources of under-leveraged data to significantly improve intelligence, security and law enforcement insight.

- Operations Analysis - Analyze a variety of machine and operational data for improved business results. The abundance and growth of machine data, which can include anything from IT machines to sensors and meters and GPS devices requires complex analysis and correlation across different types of data sets. By using big data for operations analysis, organizations can gain real-time visibility into operations, customer experience, transactions and behavior.

- Data Warehouse Modernization - Integrate big data and data warehouse capabilities to increase operational efficiency. Optimize your data warehouse to enable new types of analysis. Use big data technologies to set up a staging area or landing zone for your new data before determining what data should be moved to the data warehouse. Offload infrequently accessed or aged data from warehouse and application databases using information integration software and tools.

Most industry verticals have use cases and case studies that apply directly to their context, which a Solution Architect can use to provide more direct alignment with the organization's needs. Some examples can be found in healthcare, insurance, and retail, among others.

Healthcare organizations are leveraging big data technology to capture all of the information about a patient to get a more complete view for insight into care coordination and outcomes-based reimbursement models, population health management, and patient engagement and outreach. Successfully harnessing big data unleashes the potential to achieve the three critical objectives for healthcare transformation: Build sustainable healthcare systems, Collaborate to improve care and outcomes, Increase access to healthcare.

Insurance companies harness big data to drive business results on four imperatives to drive competitive advantage and differentiation: Create a customer-focused enterprise, Optimize enterprise risk management, Optimize multi-channel interaction, Increase flexibility and streamline operations.

Retail companies use BD&A to generate valuable insights for personalizing marketing and improving the effectiveness of marketing campaigns, optimizing assortment and merchandising decisions, and removing inefficiencies in distribution and operations. Adopting solutions designed to capitalize on big data allows companies to navigate the shifting retail landscape and drive positive transformation including these critical objectives: Deliver a smarter shopping experience, Build smarter merchandising and supply networks.

Describe the benefits of social media analytics to support BD&A use cases.
With emphasis on the following:

IBM Social Media Analytics can help organizations harness social data and take decisive action across the enterprise to address use cases like identifying new patterns and trends for your product development team, protecting your brand image, micro-segmenting customers to refine marketing campaigns, and targeting prospective new hires for HR. The functional results gained from Social Media Analytics include:

Understand attitudes, opinions and evolving trends in the market.

Correct course faster than competitors.

Identify primary influencers in social media segments.

Predict customer behavior and improve customer satisfaction.

Recommend next best action.

Create customized campaigns & promotions.

Develop competitive human resource strategies.

The Social Media Analytics Framework supported includes Social Media Impact, Segmentation, Discovery, and Relationships.

Social Media Impact helps answer the question "ARE WE MAKING THE RIGHT INVESTMENTS IN PRODUCTS, SERVICES, MARKETS, CAMPAIGNS, PARTNERS?" This supports the initial step of monitoring what's happening in the social world to uncover sentiment regarding your products, services, markets, campaigns, employees, and partners.

Social Media Segmentation helps answer the questions "ARE WE REACHING THE INTENDED AUDIENCE - AND ARE WE LISTENING?". This step helps you categorize your audience by locations, demographics, influencers, recommenders, detractors, users and prospective users so you can understand if you are reaching your intended audience and adjust your message for best results.

Social Media Discovery helps answer the questions "WHAT NEW TOPICS ARE EMERGING IN SOCIAL MEDIA? WHAT NEW IDEAS CAN WE DISCOVER?" This capability helps organizations discover the "unknown unknowns" among social media topics, participants, and sentiment. The powerful analytical platform of IBM Social Media Analytics can uncover hidden or unexpected elements within social media dialogues that may be critical to your business strategy.

Social Media Relationships helps answer the question "WHAT IS DRIVING SOCIAL MEDIA ACTIVITY, BEHAVIOR AND SENTIMENT?" Understanding the relationships between social media topics is an advanced capability that helps determine the strength of negative or positive sentiment.

Social Media Analytics is a packaged offering composed of a variety of IBM BD&A components, including BigInsights, Cognos, SPSS, and DB2, together with contracts with third-party providers of social media data feeds.

Describe how to create a customer roadmap and blueprint to adopt BD&A
With emphasis on the following:

Solution Architects can use a variety of sources to help a customer define their roadmap and blueprint for adopting BD&A over both short- and long-term planning horizons, including the IBV 9 levers, prior maturity model discovery workshops, and the BD&A reference architecture with appropriate use cases.

The Institute for Business Value conducted a study in 2013 to understand why some organizations excel at creating value from analytics and they identified nine key sets of activities that differentiate organizations creating the greatest value from analytics which can be treated as levers for value creation. A Solution Architect can examine these with a customer to help them develop a blueprint for their own growth.

The maturity models like BD&A Maturity Model and Analytics Quotient can provide a significant set of requirements and objectives for a Solution Architect to guide companies to develop plans that will make them more competitive using Analytics.

The BD&A Reference Architecture offers much context, example, and use cases to help a Solution Architect guide customers towards a solution pattern that meets their needs.

The IBV 9 Levers include Culture, Data, Expertise, Funding, Measurement, Platform, Source of Value, Sponsorship and Trust. The IBV study provides further details about each of these levers and the behaviors leading firms exhibit with them to achieve greater competitive advantage using BD&A. The study also provides three key sets of recommendations that Solution Architects can use to develop a business-driven blueprint for BD&A adoption using strategy, technology, and organization.

A Solution Architect can review the 9 levers with a customer and help them identify their current approaches to each in contrast to leading organizations and develop goals for improvements. They can use the case study examples provided for each lever to develop ideas for tactics and approaches to grow.

A Solution Architect can apply the three recommendations with customers, as outlined in the study referenced below:

- Strategy - Accelerate analytics with a results based program and Instill a sense of purpose. Establish a business-driven agenda for analytics that enables executive ownership, aligns to enterprise strategy and business goals, and defines new business capabilities.

- Technology - Enrich the core analytics platform and capabilities and architect for the future. Enable the agenda with shared analytical expertise, new technologies, and a simplified and flexible platform.

- Organization - Drive change with analytics as a core competency and Enable the organization to act. Create a data driven culture built on relationships to generate business value.

The maturity models provided by IBM and others are tools to help a Solution Advisor identify the challenge and opportunity areas where customers can increase their competitive position using BD&A. Typically a customer may have a wide variety of areas to address, spanning governance, information, strategy, and analytics, that cannot all be tackled at once. The Solution Advisor can identify the greatest sources of value for the customer to address first, and the building blocks for the next growth points, to define the customer roadmap. Some of the maturity models, like the Analytics Quotient, offer prescriptive next steps based on the particular phase the customer has attained.

The BD&A Reference Architecture offers a variety of use cases aligned with the imperatives that a Solution Architect can leverage as examples for customers of how they can solve the problems or embrace the opportunities they typically face. Each of these use cases provides configuration patterns of components that come together to support the customer's objectives. The Solution Architect can compare the customer's existing architecture with the configuration pattern to determine which new solutions the customer can acquire to align with the new target.

Define the various business and technical roles within BD&A.
With emphasis on the following:

Business users and business leaders have a role to play in the development and maintenance of a BD&A solution by providing complete and continuous guidance to IT about their evolving requirements, use cases, and terminology. Business users need to be motivated by the transformational effects analytics can provide and willing to adapt to that change by valuing and leveraging the insights provided.

Executive sponsors / decision makers - Identify the organizational strategy for BD&A and changes they will make in response to insight, provide budgetary and political support for BD&A initiatives. Identify BD&A focal points and priorities and support planning for execution.

Analytic consumer - provide requirements for analytic services, context in which information is used, and timing and delivery methods needed.

Domain expert - advise the leaders for BD&A efforts about the terminology, processes, and information challenge areas faced by a target domain within the organization.

Business Analyst - interface between business and IT about the business requirements and how they can be applied to the BD&A environment to provide effective information.

The Chief Data Architect role ensures that data assets, data integration, and the means for acquiring data are supported by a data architecture that helps the organization achieve its strategic goals.

The Chief Data Officer role is responsible for defining, developing, and implementing the strategy and methods by which the organization acquires, manages, analyzes, and governs data. It also carries the strategic responsibility to drive the identification of new business opportunities through more effective and creative use of data, and to define the plan and roadmap for BD&A adoption.

The IT department has a role in building a data reservoir which includes managing all aspects of information (both structured and unstructured) from business requirements to the design of solutions (logical and physical). The roles address the full information management life cycle from creation, classification and acquisition, through cleanse, transform, and storage to presentation, distribution, security, privacy, archiving and governance.

-The Data or Information Architect defines the methods, processes, and architecture for all of these efforts and must address the key issues of information management: understanding why information needs to be managed (information requirements); what information will be managed (information architecture); who will manage the information (people and governance); how the information will be managed (processes, tools, policy, and solution architectures); and where the information will be managed (locations and nodes). The Data Architect also plays a key role in enhancing the data reservoir in the following ways:

-Adding new types of repositories in the data reservoir to support specialist data storage and analytics

-Adding new data refineries to exchange data between the operational systems and the data reservoir. This approach ensures the data reservoir has the latest operational information and that the operational systems benefit from the insight generated in the data reservoir.

-Adding feeds from non-traditional sources of information, such as log data and social media.

Data Analysts help identify sources of data for analysis then supports its preparation and cleansing for inclusion in data reservoirs.

Data Scientists help organizations identify the best value questions to pursue with a BD&A initiative, identify the appropriate data sources to extract the answer, prepare the data and create analytic routines that address the logic of the query, perform analysis on the data and generate reports for business users to leverage. The Data Scientist works with business users to identify ways the data can integrate with their workloads and business processes to transform their decision making to become more data-driven.

Data Scientists support the data mining process. Data mining focuses on extracting patterns and previously unknown facts from large volumes of data. It helps businesses uncover key insights, patterns, and trends in data, and then uses that insight to optimize business decisions. Data mining techniques can be divided into major categories, which include classification (arranging data into predefined groups), clustering (similar to classification, but the groups are not predefined), and regression (statistical analysis of the relationship between a dependent variable and one or more independent variables).
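The three data mining categories map naturally onto common library calls. The sketch below uses scikit-learn purely as an illustration, with toy data and default parameters; it is an assumption for the example, not a prescribed IBM tooling choice.

    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Toy data: two numeric features per observation.
    X = [[1, 1], [1, 2], [8, 8], [9, 8]]

    # Classification: arrange data into predefined groups (labels are known).
    y_class = [0, 0, 1, 1]
    clf = LogisticRegression().fit(X, y_class)
    print(clf.predict([[2, 1], [8, 9]]))        # e.g. [0 1]

    # Clustering: similar grouping, but the groups are not predefined.
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    print(labels)                                # e.g. [0 0 1 1] or [1 1 0 0]

    # Regression: statistical relationship between a dependent variable
    # and one or more independent variables.
    y_reg = [2.0, 3.0, 16.0, 17.0]
    reg = LinearRegression().fit(X, y_reg)
    print(reg.predict([[5, 5]]))                 # a continuous estimate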

Section 4: IBM Big Data & Analytics Solutions

Explain when it is appropriate to use IBM BigInsights versus other Hadoop distributions.
With emphasis on the following:

Benefits of the General Parallel File System (GPFS) on a Hadoop system

GPFS was originally designed as a SAN file system, which typically isn't suitable for a Hadoop cluster, since Hadoop clusters use locally attached disks to drive high performance for MapReduce applications. GPFS-FPO is an implementation of shared-nothing architecture that enables each node to operate independently, reducing the impact of failure events across multiple nodes. By storing your data in GPFS-FPO, you are freed from the architectural restrictions of HDFS. Additionally, you can take advantage of the GPFS-FPO pedigree as a multipurpose file system to gain tremendous management flexibility.

You can manage your data using existing toolsets and processes along with a wide range of enterprise-class data management functions offered by GPFS, including:
-Full POSIX compliance
-Snapshot support for point-in-time data capture
-Simplified capacity management by using GPFS for all storage needs
-Policy-based information lifecycle management capabilities to manage petabytes of data
-Control over placement of each replica at file-level granularity
-Infrastructure to manage multi-tenant Hadoop clusters based on service-level agreements (SLAs) 
-Simplified administration and automated recovery

Original Map Reduce (MR)

MR is a framework introduced by Google for programming commodity computer clusters to perform large-scale data processing in a single pass. The framework is designed in a way that a MR cluster can scale to thousands of nodes in a fault-tolerant manner. But the MR programming model has its own limitations. Its one-input and two-stage data flow is extremely rigid, in addition to the fact that it is very low-level. For example, you must write custom code for even the most common operations. Hence, many programmers feel uncomfortable with the MR framework and prefer to use SQL as a high-level declarative language. Several projects (Apache Pig, Apache Hive, and HadoopDB) have been developed to ease the task of programmers and provide high-level declarative interfaces on top of the MR framework.

The term MR actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MR implies, the reduce job is always performed after the map job.
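A minimal word-count sketch of the two phases, written as plain Python rather than against any specific Hadoop API: the map step emits (key, value) tuples and the reduce step combines the tuples that share a key.

    from collections import defaultdict

    def map_phase(line):
        # Map: convert input into intermediate key/value pairs.
        for word in line.split():
            yield (word.lower(), 1)

    def reduce_phase(key, values):
        # Reduce: combine all values observed for one key.
        return (key, sum(values))

    lines = ["Big data needs big clusters", "big data big insight"]

    # Shuffle/sort stand-in: group intermediate pairs by key.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)

    results = [reduce_phase(k, v) for k, v in sorted(grouped.items())]
    print(results)   # [('big', 4), ('clusters', 1), ('data', 2), ...]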

Adaptive MR

The adaptive MR component is capable of running distributed application services on a scalable, shared, heterogeneous grid. This low-latency scheduling solution supports sophisticated workload management capabilities beyond those of standard Hadoop MR.

The adaptive MR component can orchestrate distributed services on a shared grid in response to dynamically changing workloads. This component combines a service-oriented application middleware (SOAM) framework, a low-latency task scheduler, and a scalable grid management infrastructure. This design ensures application reliability while also ensuring low-latency and high-throughput communication between clients and compute services.

Hadoop has limited prioritization features, whereas the adaptive MR component has thousands of priority levels and multiple options that you can configure to manage resource sharing. This sophisticated resource sharing allows you to prioritize interactive workloads in ways that are not possible in a traditional MR environment. For example, with the adaptive MR component, you can start multiple Hadoop jobs and associate those jobs with the same consumer. Within that consumer, jobs can share resources based on individual priorities.

For example, consider a 100 slot cluster where you start job "A" with a priority of 100. Job "A" starts, and consumes all slots if enough map tasks exist. You then start job "B" while job "A" is running, and give job "B" a priority of 900, which is nine times greater than the priority of job "A". The adaptive MR component automatically rebalances the cluster to give 90 slots to job "B" and 10 slots to job "A", so that resources are distributed in a prioritized manner that is transparent to the jobs.
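The rebalancing arithmetic in this example is simply a proportional share of slots by priority. A small sketch, using the same hypothetical numbers (100 slots, priorities 100 and 900):

    def share_slots(total_slots, priorities):
        """Split slots across jobs in proportion to their priorities."""
        # Note: any rounding remainder from integer division is ignored here.
        total_priority = sum(priorities.values())
        return {job: total_slots * p // total_priority
                for job, p in priorities.items()}

    print(share_slots(100, {"A": 100, "B": 900}))   # {'A': 10, 'B': 90}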

The adaptive MR component is not a Hadoop distribution. It relies on an MR implementation that includes Hadoop components like Pig, Hive, HBase, and a distributed file system. The scheduling framework is optimized for MR workloads that are compatible with Hadoop. Because InfoSphere BigInsights is built on Hadoop, you can use the adaptive MR component as a workload scheduler for InfoSphere BigInsights instead of the standard MR scheduler. When coupled with InfoSphere BigInsights, the adaptive MR component transparently provides improved performance at a lower cost for a variety of big data workloads.

YARN or MR 2.0 (MRv2)

MR has undergone a complete overhaul in Hadoop 0.23 and is now called MRv2 or YARN.

The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

The RM and per-node slave, the NodeManager (NM), form the data-computation framework. The RM is the ultimate authority that arbitrates resources among all the applications in the system.

The per-application AM is, in effect, a framework specific library and is tasked with negotiating resources from the RM and working with the NM(s) to execute and monitor the tasks.
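A purely conceptual sketch of the MRv2 split of responsibilities (toy Python classes, not real YARN API calls): the ResourceManager arbitrates cluster resources, each application brings its own ApplicationMaster to negotiate containers, and NodeManagers run the work on their nodes.

    class NodeManager:
        """Per-node agent that runs and monitors containers."""
        def __init__(self, name, slots):
            self.name, self.slots = name, slots

    class ResourceManager:
        """Global authority that arbitrates resources among applications."""
        def __init__(self, node_managers):
            self.free = {nm.name: nm.slots for nm in node_managers}

        def allocate(self, needed):
            granted = []
            for node, slots in self.free.items():
                while slots and len(granted) < needed:
                    granted.append(node)      # container placed on this node
                    slots -= 1
                self.free[node] = slots
            return granted

    class ApplicationMaster:
        """Per-application library that negotiates with the RM."""
        def __init__(self, app, tasks):
            self.app, self.tasks = app, tasks

        def run(self, rm):
            containers = rm.allocate(self.tasks)
            return f"{self.app}: {self.tasks} tasks on {containers}"

    rm = ResourceManager([NodeManager("node1", 2), NodeManager("node2", 2)])
    print(ApplicationMaster("wordcount", 3).run(rm))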

Describe how data can be provisioned for use with SPSS modeler to deliver analytic results.
With emphasis on the following:

SPSS Modeler: Uses statistical models to associate data with outcomes.

Decision Management: Propose next best action based rules and related algorithms.

Predictive Analytics: Historic data used to model future outcomes

Optimization: Automated systems feed results and current data into the models that recommend optimal actions.

Data Preparation: Data is prepared with a variety of procedures to enhance SPSS Modeler functionality.

Volume: Hadoop File System (HDFS) supports highest volume data (BigInsights).
-Schema: Data is structured to meet needs of multiple applications (InfoSphere MDM). 
-Flexibility: Complete datasets with sufficient rows and columns to capture range of data anticipated. (BigInsights, Data Stage, DataClick, Client 3rd Party tools)
-Compliance: Hive over Hadoop offers convenience of SQL over the volume supported by HDFS (BigSQL)

Applications: Applications that rely on APIs are typically more flexible and scalable, and offer greater reuse.

Environments: On premise solutions offer good security and control (InfoSphere Brand). Cloud environments like IBM Bluemix offer flexible and highly scalable development options.

Velocity: High speed data acquisition with need for fast results drive the adoption of technologies like Hadoop and related parallel processing environments (SPSS).
-Timeliness: real-time requirements drive adoption of streams and related technology like Platform Computing Symphony.
-Flexibility: Query and processing flexibility drive adoption towards Hadoop

Volume: Large volume data processing drives adoption of data in:
-HDFS with associations built by SPSS Modeler. 
-Netezza database
-DB2 with BLU acceleration.

Data: Characteristics include: 
-Flexibility: Comes from wider data variety (DB2, Watson Explorer),
-Accuracy: enhanced by alignment with enterprise reporting and analytics metadata (Cognos BI).
-Compliance: Improved with common access methods, like SQL and MDX (DB2, Netezza).
-Consistency: In general, analytics improve as data becomes cleaner and more consistent (DataStage). 
-Reuse: Results of data exploration and discovery need to be shared (InfoSphere MDM)

Describe what solutions are needed to control and manage BD&A workloads.
With emphasis on the following:

Workloads: Workloads are constrained by data volume, transfer rates, responsiveness, processing capacity and user load. Designing with these criteria in mind allows for more proactive workload optimization.

Data: Data volume and transfer rates are easily underestimated. Resource provisioning starts with planning and ends at deployment and rollout (IBM C&SI, System Z)

Inbound Data: Data-in-motion and application latency are key elements related to velocity. Data-at-rest drives data volume and data staging limits including escrow (InfoSphere Streams, InfoSphere Data Stage, Maximo)

Transfer Rates and Capacity: Network topology and optimal data storage provisioning are typical concerns for on-premise solutions. For cloud environments, the relative location of data is often prominent (InfoSphere Information Server, System Z, Bluemix, SoftLayer).

User Load: Predicting user demand is challenging. Downstream implications include all of the above attributes (System Z, IBM C&SI, IBM Bluemix).

Define how to improve Risk Management with BD&A
With emphasis on the following:

Managing Risk: Organizations must identify risks and weigh them against business objectives. With big data analysis, stakeholders can strike a balance between risk and opportunity (IBM Algorithmics software, IBM OpenPages Software).

Impact: The cascading impact of intrusions highlights the need to understand risk holistically, rather than managing it in silos. Use of relevant external data, such as social media, becomes a critical element in operational risk management.

Velocity: Stock markets and currency exchanges are impacted by a broad range of inputs. Real-time analytics are a crucial element in reducing the potential cost of losses and managing financial risk.

Regulations: New risk elements are created with the introduction of cloud computing, social media, enterprise mobility, and big data. All impact regulatory compliance and complicate overall risk management.

Compliance: Policy definition for big data requires expanded policy definitions with more diverse compliance frameworks. Tracking compliance and dealing with violations becomes increasingly important.

Describe the benefits of "in-database" analytics (Pure Data for Analytics, aka Netezza).
With emphasis on the following:

In-database analytics

"In-database analytics" refers to the practice of executing analytic computations and processes directly inside the database where the data reside. This approach to analytics offers many benefits over the traditional approach of extracting data from a database, porting it to a separate analytic environment for analysis, and then shipping the results back to the database where other users or applications can access them.

One of the most significant benefits of the in-database approach is improved performance. As data volumes grow, reducing the overall time from questions to answers becomes increasingly important. First, when analytic processes are executed inside the database alongside the data, data movement is eliminated. Performance is thereby improved by reducing the overall time to obtain analytic results. Second, the parallelization of analytic algorithms and processes found in data warehousing platforms improves performance by dramatically speeding up computations, compared to many analytic applications that have only single-threaded computations. Third, large-scale scoring performance with parallelized predictive models is greatly enhanced in the database environment. We can readily point to real-world use cases in which overall performance has improved by one to two orders of magnitude - from days or hours to minutes or seconds - as computational processes have been moved into the database.
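A small sketch of the "move the computation, not the data" idea, using Python's built-in sqlite3 purely as a stand-in database (not PureData or Netezza): the first approach hauls every row into the application, while the second pushes the aggregation into the database so only the answer travels.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("north", 120.0), ("south", 75.5), ("north", 210.0)])

    # Traditional approach: extract all rows, then analyze in the application.
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    by_region = {}
    for region, amount in rows:
        by_region[region] = by_region.get(region, 0.0) + amount

    # In-database approach: the computation runs where the data lives,
    # and only the (small) result set is moved.
    in_db = dict(conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"))

    assert by_region == in_db == {"north": 330.0, "south": 75.5}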

Another valuable benefit of in-database analytics is improved analyst productivity and better analytic results, stemming from database-centric management of data resources. Too often, business analysts believe that their analytic models would be better if they could incorporate certain data into their work, only to find that obtaining that data is prohibitively difficult due to siloed data sources. Even when analysts can access the data they need, they typically spend most of their time preparing the data for analysis. Many data transformation and preparation steps are common to many different analysis needs and can be automated within the database, contributing to both analyst productivity and quality of results.

Finally, performing analytics directly inside the database enhances the accessibility of predictive models and other results by business users and applications throughout the enterprise. With all results and models available in the database, information does not have to be moved from an analytic environment to a separate reporting environment. Many BI reporting tools can access existing analytic results or call modeling algorithms and other queries on-the-fly to create customized results as needed by a business user.

In summary, the in-database approach to analytics offers substantial benefits in performance, analyst productivity, and quality and timeliness of analytic results. In addition, many BI applications and analyst tools integrate well with the in-database analytic environment, offering greater flexibility and value in analytic solution design, implementation, and usage.

Describe the key BD&A components required to support a real-time operational analytics.
With emphasis on the following:

This section describes the high-level structure of the components of Big Data and Analytics Real-Time Analytical Processing. Figure 1 represents the key components that are commonly used to build and deploy a Big Data and Analytics solution.

Data Sensor & Data Capture Services

Modern sensor and information technologies make it possible to continuously collect sensor data, which is typically obtained as real-time, real-valued numerical data. Examples include the Internet, web logs, chat, sensor networks, social media, telecommunications call detail records, biological sensor signals (such as ECG and EEG), astronomy, images, audio, medical records, military surveillance, and eCommerce. Other examples include vehicles driving around cities or a power plant generating electricity, which can be equipped with numerous sensors that produce data from moment to moment.

The Data Sensor and Data Capture Services provide the capabilities to sense and capture the data (as it is being generated) and to deliver it into a "data streaming pipeline". This process can leverage change data capture tools to capture data from relational data sources. Other sources of information include messaging systems, web services, or sensor devices.

Data Streaming Pipeline

The Data Streaming Pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in a time-sliced fashion. To avoid data latency during stream processing, data is processed in memory as it passes between the different elements of an application.

A Data Streaming Pipeline might also be expanded to execute in parallel across multiple systems (a grid of processing computers). In that case, a high-speed network interconnects the different systems in the grid to minimize data latency.
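A minimal sketch of such a pipeline as chained Python generators, with each element feeding the next entirely in memory. This is illustrative only; a product such as InfoSphere Streams expresses the same idea with its own operators, and the event fields and lookup table here are assumptions.

    def source(events):
        # Data sensor / capture element: emit events as they arrive.
        for event in events:
            yield event

    def filter_step(stream):
        # Drop (reject) events we do not care about.
        for event in stream:
            if event["amount"] > 0:
                yield event

    def enrich_step(stream, customer_lookup):
        # Augment ("enrich") the event with reference data.
        for event in stream:
            event["country"] = customer_lookup.get(event["customer"], "unknown")
            yield event

    events = [{"customer": "c1", "amount": 42.0},
              {"customer": "c2", "amount": -1.0}]
    lookup = {"c1": "BR"}

    pipeline = enrich_step(filter_step(source(events)), lookup)
    for event in pipeline:
        print(event)   # {'customer': 'c1', 'amount': 42.0, 'country': 'BR'}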

Data Integration Services

As data is collected from multiple sources, a design pattern with Real-Time Analytical Processing might need to filter (reject), join, transform, categorize, and correlate data in real time. Ideally, hundreds of data sources are ingested, categorized, and delivered to consumers. The data integration services might also store the incoming streaming data in a repository (Hadoop or a data warehouse) to be used for further analysis, to produce new predictive models, or to improve existing models.

As part of the data integration services, there might be a need to augment (also referred to as "enrichment") the incoming data stream to provide extra benefit by retrieving additional information, for instance from external databases, and incorporating this information into the outgoing data streaming message.

Predictive Analytics Services

Predictive Analytics services provide the ability to leverage a predictive model to score new data (transactions, events) in the data streaming process for the likelihood or probability of an event against expected results. For example, when an online payment transaction takes place, a predictive model processes the input data and provides a predictive score that gives the probability that the transaction is either genuine or fraudulent. Real-time scoring allows the predictive model to be applied to transactions as they are happening.

One example of leveraging predictive analytics in a data streaming process: a credit card provider deploys a predictive analytics application to detect fraud in credit card transactions. The application generates a predictive score in real time during authorization, or in near real time shortly after authorization, that indicates the likelihood that the transaction is fraudulent. The fraud decision is applied at the point of interaction, which ultimately reduces fraud losses.

Decision Management Services

Decision management is the process of optimizing and automating high-velocity decisions to consistently generate results. It leverages business rules engines, optimization, and predictive scoring to automate the decision process. By gaining detailed insight into present conditions and evaluating the best possible future events and outcomes, the quality of decisions is improved.

In the credit card fraud application example (described previously), the decision-making algorithm reads the score result for each transaction and decides whether to deny the transaction if it qualifies as fraudulent or approve the transaction if it is not considered fraudulent. In the case of fraud, the action could be a message to the operator at the point of sale followed by another action to block the credit card.
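A hedged sketch of that decision step: a trivial rule reads the predictive score and returns the action. The threshold and actions are assumptions for illustration, not a real product rule set.

    FRAUD_THRESHOLD = 0.8    # assumed cut-off for this sketch

    def decide(transaction, fraud_score):
        """Turn a predictive score into an automated, consistent decision."""
        if fraud_score >= FRAUD_THRESHOLD:
            return {"action": "deny",
                    "notify": "point-of-sale operator",
                    "follow_up": "block card " + transaction["card"]}
        return {"action": "approve"}

    print(decide({"card": "1234"}, fraud_score=0.93))   # deny, notify, block
    print(decide({"card": "1234"}, fraud_score=0.12))   # approve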

Application Integration Services

One of the last steps in real-time analytics processing is to perform actions based upon the insights gained with the predictive analytics services. These actions need to be translated into decisions and integrated into the operational business processes.

There are many ways to integrate the business decisions (actions) into the operational business processes. An important element is data delivery, which is the mechanism that physically moves data from application to application. It provides features such as transaction integrity and data prioritization.

Depending on the target application requirements, there might be different services that can be utilized. For example, a messaging system can be used to deliver the information to the point of interaction. In a fraud detection solution for credit cards, the action needs to be delivered via a message to the point of sale in order to approve or deny the transaction.

Data Delivery Services

The Data Delivery Services ensure the physical movement of data through the system. In the data streaming process the data is passed to the next set of services.

In order to automate decisions, the properties of the transport layer in use should reflect the required data delivery assurance, confirmation-of-delivery reports, transactional behavior, and possibly audit trails or logging.

IBM data delivery solutions include the IBM WebSphere MQ family of products, which is firmly based on MQSeries (now known as WebSphere MQ) messaging, with its assured delivery and transactional features as well as its wide platform coverage. A variety of other transport mechanisms, such as JMS, CORBA, and HTTP, are also supported, with assured delivery and transactional behavior provided independently of the transport.

Administration and Monitoring Services

The Administration and Monitoring Services provide the capabilities for the administrator to manage and monitor the solution, including real-time supervision of active instances, exception handling, historical analysis, simulation, etc. Typically, a graphical tool is used to model business processes.

Identify the components that support text analytics.
With emphasis on the following:

Explain what is IBM Watson Content Analytics?

IBM Watson Content Analytics provides an Unstructured Information Management Architecture (UIMA) compliant analytics engine and a rich platform for deploying text analytics solutions.

IBM Watson Content Analytics is a platform for deriving rapid insight. It can transform raw information into business insight quickly, without building models or deploying complex systems, enabling all knowledge workers to derive insight in hours or days, not weeks or months. Flexible and extensible for deeper insights, Content Analytics enables better decision making by deriving valuable insights from your enterprise content regardless of source or format. It allows deep, rich text analysis of your information. Solutions built on Content Analytics can help organizations surface undetected problems, fix content-centric process inefficiencies, improve customer service and corporate accountability, reduce operating costs and risks, and discover new revenue opportunities.

UIMA is an open framework for building analytic applications - to find latent meaning, relationships and relevant facts hidden in unstructured text. UIMA defines a common, standard interface that enables text analytics components from multiple vendors to work together. It provides tools for either creating new interoperable text analytics modules or enabling existing text analytics investments to operate within the framework.

Although UIMA originated at IBM, it has since become an open source project at the Apache Software Foundation. UIMA is the only standard for semantic search and content analytics recognized by the Organization for the Advancement of Structured Information Standards (OASIS).

IBM Watson Content Analytics can analyze documents, comment and note fields, problem reports, e-mail, web sites, and other text-based information sources. Sample applications:

-Product defect detection. Analysis of service and maintenance records provides early insight into product defects and service issues before they become widespread, thus enabling quicker resolution and lower after market service and recall costs.

-Insurance fraud analysis. Analysis of claims documentation, policies, and other customer information allows organizations to identify patterns and hidden relationships in claims activities to reduce incidents of fraud and unnecessary payouts.

-Advanced intelligence for anti-terrorism and law enforcement. Analysts can uncover hidden patterns and identify potential criminal or terrorist activity by better analysis of various information sources such as field analyst reports, surveillance transcripts, public records and financial transactions.

-Customer support and self-service. Analysis of call center logs, support e-mails, and other support documentation provides more accurate problem identification to better identify appropriate resolutions.

-e-Commerce and product finders. Customers can find both target and complementary products online, and sales reps can make the right cross-sell offer by identifying relationships and concepts from analysis of product content, customer profiles and sales notes.

Explain what is IBM InfoSphere BigInsights Text Analytics?

IBM InfoSphere BigInsights Text Analytics is a powerful system for extracting structured information from unstructured and semi-structured text by defining rules to create extractor programs.

-InfoSphere BigInsights Text Analytics lifecycle

-The end-to-end lifecycle covers developing Text Analytics extractors, running them against the InfoSphere BigInsights cluster, and visualizing their results. 

-Text Analytics framework

-The Text Analytics framework consists of the modules and AQL files that comprise extractors, your input data in the supported document formats, document language specifications, and developer tooling. 

-Developing Text Analytics extractors

-InfoSphere BigInsights includes pre-built extractor libraries that you can use to extract a fixed set of entities. You can build custom extractors by using the Text Analytics Workflow perspective in your Eclipse development environment, or by extending the extractors that are found in the pre-built extractor libraries. 

-Improving the performance of an extractor

-You can use these basic guidelines to help you to design high performance extractors. 

-Using Text Analytics on the cluster

-When you are satisfied with the levels of quality and performance on the developed extractor, you can publish the extractor into the InfoSphere BigInsights Console as an application. This application can then be deployed by an administrator for use across InfoSphere BigInsights. 

-Running Text Analytics extractors
-In addition to running an extractor in a Text Analytics application, there are programmatic Jaql and Java APIs for compiling and running an extractor.
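As an illustrative analog of the rules-based extraction described in the list above (plain Python regular expressions, not actual AQL and not the BigInsights Jaql or Java APIs), the sketch below pulls structured fields out of unstructured text. The field names and patterns are assumptions for the example.

    import re

    # Assumed, simplified extraction rules (illustrative only).
    RULES = {
        "phone":  re.compile(r"\b\d{3}-\d{4}\b"),
        "ticket": re.compile(r"\bTICKET-\d+\b"),
    }

    def extract(text):
        """Return structured (field, value, offset) tuples from raw text."""
        results = []
        for field, pattern in RULES.items():
            for match in pattern.finditer(text):
                results.append((field, match.group(), match.start()))
        return results

    note = "Customer on 555-1234 reopened TICKET-8812 about billing."
    print(extract(note))
    # [('phone', '555-1234', 12), ('ticket', 'TICKET-8812', 30)]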

Describe the role of information and corporate governance on IBM BD&A
With emphasis on the following:

Information Governance for traditional IT has a maturity model that includes Outcomes, Enablers, Core Disciplines, and Supporting Disciplines. It is best for these to be in place prior to engaging Big Data, but whatever is in place will evolve further for a Big Data initiative. Corporate Governance is the management of business policy and processes from a corporate standpoint, and Information Governance should be aligned with it to guide Big Data as well.

Outcomes establish why governance is needed and include Data Risk Management and Compliance, and Value Creation.

Enablers include Organization Structures and Awareness, Policy, and Stewardship.

Core Disciplines include Data Quality Management, Information Lifecycle Management, and Information Security and Privacy.

Supporting Disciplines include Data Architecture, Classification and Metadata, and Audit Information Logging and Reporting.

Information Governance for Big Data should build upon what is established for traditional IT and account for the new levels of scale: big data means bigger risks and bigger rewards, depending upon the effectiveness of governance programs. A root principle for Big Data Information Governance is that the objective of governing information for an organization is to move information as quickly as is practical while keeping it as high in quality and as secure as is practical. Balancing speed with quality and security requires high-quality automation solutions to achieve optimal performance.

The capabilities to integrate and govern big data are core components of and the foundation for the IBM Big Data Platform. These capabilities help you build confidence in big data, make well-founded decisions, and take decisive actions to accelerate your information-intensive projects. The primary products for information integration and governance are IBM InfoSphere Information Server, IBM InfoSphere Data Replication, IBM InfoSphere Federation Server, IBM InfoSphere Master Data Management, IBM InfoSphere Optim, and IBM InfoSphere Guardium. These products have served traditional IT for many years and have distinct value to support Big Data projects as well.

Information Server supports high-performance integration of large volumes of information for data warehouses, Hadoop systems, streaming information sources, and transactional systems. It also incorporates a broad range of data cleansing and transformation functions to ensure the appropriate integration of data with the correct level of quality. InfoSphere Information Server also benefits from balanced optimization capabilities that enable organizations to choose the deployment option (such as ETL or ELT) that works best for their architecture.

InfoSphere Master Data Management is a complete and flexible MDM solution that creates trusted views of party, product, or other domains to improve operational business processes, big data, and analytics. MDM creates context for big data by providing trusted information about how incoming unstructured data fits into the business environment. MDM can also help identify and structure big data that is used in controlled environments. Conversely, big data creates context for MDM by providing new insights from social media and other sources, which helps companies build richer customer profiles.

InfoSphere Optim Solutions help organizations meet the requirements for information governance and address challenges that are exacerbated by the increasing volume, variety, and velocity of data. InfoSphere Optim Solutions focus on three core governance areas: Information lifecycle management, Test data management, Data privacy.

IBM InfoSphere Guardium solutions help organizations implement the data security and privacy capabilities that they need for big data environments. With these solutions, organizations can protect against a complex threat landscape, including insider fraud, unauthorized changes, and external attacks, while remaining focused on business goals and automating compliance.

Governance Risk and Compliance: Organizations must identify risks and weigh them against business objectives. With big data analysis, stakeholders can strike a balance between risk and opportunity (IBM Algorithmics software, IBM OpenPages Software).

Impact: The cascading impact of intrusions highlights the need to understand risk holistically, rather than managing it in silos. Use of relevant external data, such as social media, becomes a critical element in operational risk management.

Velocity: Stock markets and currency exchanges are impacted by a broad range of inputs. Real-time analytics are a crucial element in reducing the potential cost of losses and managing financial risk.

Regulations: New risk elements are created with the introduction of cloud computing, social media, enterprise mobility, and big data. All impact regulatory compliance and complicate overall risk management.

Compliance: Policy definition for big data requires expanded policy definitions with more diverse compliance frameworks. Tracking compliance and dealing with violations becomes increasingly important.

Section 5: IBM Big Data & Analytics Infrastructure Considerations

Describe how infrastructure matters in enabling Big Data & Analytics.
With emphasis on the following:

Systems components / capability affect overall performance.

CPU capacity provides computing power to address workloads.

Memory capacity is critical for in-memory computing.

I/O capacity required for high ingestion workloads.

Network capacity defines efficiency of data flow.

Infrastructure determines the effectiveness of the system to the application and critical operations.

High systems availability and recovery characteristics ensure tasks can be performed when needed.

Scalability in, out and up supports the varying demands of the application.

Efficient virtualization and resource management.

Optimized data storage and access.

Data tiering and compression.

Parallel processing.

Optimized compute.

Infrastructure impacts how effectively and efficiently clients can manage and use data and perform analytics to secure competitive advantage.

Access, speed and availability matters for BD&A workloads.

Access matters to get new levels of visibility into customers and operations.

Speed matters to accelerate insights in real-time at the point of impact.

Availability matters to consistently deliver insights to the people and processes that need them.

Considerations include the ability to support large volumes of data and the ease of managing data at different temperatures.

The infrastructure must be able to provide real-time insights while sharing secured access to all relevant data, no matter its type or where it resides.

Lack of proper infrastructure design can impede or even prevent attainment of business goals.

Key client concerns include delivering real-time analytics, inadequate performance, the governance model, data latency, data completeness, working with multiple platforms, many security boundaries, and many points of failure with challenging recovery scenarios.

Appliances - Pure Data Systems

With the challenge of growing volume, velocity, and variety of data used today in all aspects of the business, using a multi-purpose system for all data workloads is often not the most cost-effective or lowest-risk approach, and definitely not the fastest to deploy. The new PureData System is optimized exclusively for delivering data services to today's demanding applications. Like each of the IBM PureSystems, it offers built-in expertise, integration by design, and a simplified experience throughout its life cycle.

-Built-in expertise - Codified data management best practices are provided for each workload. PureData System delivers automated pattern-based deployment and management of highly reliable and scalable database services.

-Integration by design - Hardware, storage and software capabilities are designed and optimized for specific high performance data workloads such as patented data filtering using programmable hardware (FPGAs) for ultrafast execution of analytic queries without the need for indices.

-Simplified experience - The PureData system provides single part procurement with no assembly required (ready to load data in hours), open integration with 3rd party software, integrated management console for the entire system, single line of support, integrated system upgrades and maintenance.

-The new PureData system comes in different models that have been designed, integrated and optimized to deliver data services to today's demanding applications with simplicity, speed & lower cost

-Reference: PureData

PureData for Analytics - PureData system for analytics presents solutions targeted at big data, warehousing and analytics applications. These systems are designed to handle all types of big data workloads and will leverage Netezza and DB2 technologies.

PureData for Operational Analytics - is an expert integrated data system that is designed and optimized specifically for the demands of an operational analytics workload. The system is a complete, out-of-the-box solution for operational analytics that provides both the simplicity of an appliance and the flexibility of a custom solution. Designed to handle 1000+ concurrent operational queries, it delivers mission-critical reliability, scalability and outstanding performance.

PureData for Hadoop - is a purpose-built, standards-based, expert integrated system that architecturally integrates IBM InfoSphere BigInsights Hadoop-based software, server, and storage into a single, easy-to-manage system.

Powered by IBM Netezza
-IBM Netezza Analytics is an embedded, purpose-built, advanced analytics platform that fuses data warehousing and in-database analytics into a scalable, high-performance, massively parallel analytic platform that is designed to crunch through petascale data volumes. It's high-performance - In-database, parallelized algorithms take advantage of IBM Netezza's Asymmetric Massively Parallel Processing architecture.

IBM Netezza powered analytics offerings:
-IBM PureData system for analytics, powered by Netezza technology.

-A simple appliance for serious analytics that delivers optimized performance of complex analytics and algorithms.

-IBM Netezza Replication Services 

-Resilient, high-speed data replication from a primary system to one or more geographically dispersed target systems with robust disaster recovery.

-IBM Netezza 100 

-Extremely versatile, compact, and easy to install, the IBM Netezza 100 data warehouse appliance is designed for test and development. It offers organizations 100 gigabytes to 10 terabytes of user data capacity and fast time-to-value.

-IBM DB2 Analytics Accelerator 

-A workload optimized appliance add-on enabling the integration of business insights into operational processes driving winning strategies.

IBM Big Data solutions that work with IBM Netezza include:
-IBM SPSS® Modeler - IBM SPSS Modeler is a high-performance predictive analytics workbench.

-IBM InfoSphere BigInsights - A core component of IBM's platform for big data, IBM InfoSphere™ BigInsights is inspired by, and is compatible with, open source Hadoop and used to store, manage, and gain insights from Internet-scale data at rest. When paired with IBM Netezza Analytics via a high-speed connection, massive volumes of distributed data and content, and ad-hoc analytics can be processed quickly and efficiently to find predictive patterns and glean valuable insights.

-IBM InfoSphere Streams - A core component of IBM's platform for big data, IBM InfoSphere Streams allows you to capture and act on all the business data as the business events unfold. When paired with IBM Netezza Analytics massive volumes of stream data are ingested into IBM Netezza and analyzed over a longer period of time to find predictive patterns and glean valuable insights.

-Reference: Nztechnology

Describe the role of storage and storage management software in Big Data and Analytics
With emphasis on the following:

Storage considerations for Big Data and Analytics

Storage characteristics offer varying levels of value add to enhance performance in BD&A applications.
-Data Acceleration with Flash.

-Flash based storage systems drive speed of access.

-Drive as much as 12x faster analytics results on certain workloads.

-Data Workload Diversity and Flexibility.

-Workload management with specialized storage solutions such as XIV delivers predictable performance that scales linearly without hotspots, delivering insights from analytics faster with tuning-free data distribution.
-Scale-out, parallel processing of GPFS Elastic Storage software and integration with IBM FlashSystem dramatically accelerates performance of Analytics clusters.

-IBM SmartCloud Virtual Storage Center with SAN Volume Controller (SVC) automatically optimizes data warehouse performance and cost across Flash and Disk 

-Data Protection and Retention

-High speed encryption on every drive type secures data

-IBM FlashSystem 840 offers AES-XTS 256-bit encryption with zero performance impact. 

-Reduce the amount of data to be stored with data deduplication solutions such as ProtecTIER.

-Remote clusters and mirroring enable near-continuous data availability.

-Application snapshot management reduces the risk of data loss.

-Reference: Analytics

Compute environments also impact storage characteristics to enhance performance in Big Data and Analytics applications
-Data Compute Environments

-Data and compute integration, such as mainframe integration with DB2 and specialty analytics "engines" leveraging DS8870 delivers 4x reduction in batch times.

-Storage dense integration such as Power8 also provides enhanced performance characteristics.

-Systems compute architecture can make a huge difference in managing and processing Big Data and Analytics workloads.

Virtual environments and open source technologies like OpenStack and IBM Gridscale enable the rapid creation of data warehouses.

IBM Flash storage is a key enabler for a variety of Analytics applications

IBM FlashSystem technologies enable rapid data movement within infrastructures to target new data sources for analytics insights.

IBM FlashSystem for Streams processing enhances and accelerates real-time analytics and actions supporting instant decisions from the analysis of multiple data streams for both structured and unstructured data.

IBM Flash storage achieves its acceleration by reducing latency to its lowest levels.

-IBM FlashSystem "extreme" performance enables real-time business decisions.

-Accelerates analytical applications improving time to value of the data. 

-Maintains quick response times as data sets and/or concurrency needs grow without performance degradation.

-Boosts response time to enable faster decision making with IBM MicroLatency.

-Reference: http://www.ibm.com/systems/storage/flash/840/index.html

IBM FlashSystem efficiency drives data ecosystem and business value

Increased performance of applications in a smaller footprint and reduced operational costs.

Creates value with rapid deployment, efficient use of IT staff as well as power and cooling savings.

Allows you to do more with less.

IBM FlashSystem resiliency delivers data availability and access where you need it, when you need it.

Leverages high availability and hot-swappable components to make sure the data is accessible when it's needed.

Offers concurrent code upgrades to maximize uptime and availability.

IBM XIV has unique capabilities that can make it an ideal platform for Big Data and Analytics.

IBM XIV provides tuning-free performance across diverse and dynamic workloads.

-Grid scale architecture accommodates any workload anytime delivering consistent and predictable performance.

-Optimally tuned at all times to address demanding workloads and increasing numbers of users, scaling performance linearly with capacity. 

-Enhanced SSD caching options provide up to 4.5x boost in access speed.

IBM XIV delivers exceptional efficiency with industry-leading simplicity and inherently environmentally friendly attributes.

-Grid scale architecture automatically handles optimal data placement and enables optimal utilization.

-Intuitive interface and automated management allow hundreds of IBM XIV systems to be managed with minimal resources.

-Scales far, fast and with ease, handling massive loads of data and diverse workloads requiring minimal operator intervention. 

-Virtual infrastructure enables rapid provisioning of new services in minutes.

IBM XIV provides enterprise resiliency for uncompromised data availability and business continuity.

-Grid scale architecture with full disk redundancy, fast disk rebuild capability and site mirroring deliver uninterrupted access to data. 

-Provides resilient performance by handling unexpected workload surges better than any other storage.

-Ability to predict and mitigate failure through self-healing.

-Provides tune-free scaling with robust mobility and automated volume migration across XIV systems.

-Supports data encryption for security and data integrity.

What is the role of a Storage hypervisor

Pooled physical resources are consumed by virtual machines resulting in high asset utilization.

Virtual machines are mobile, giving CIOs their choice of physical storage devices.

A common set of capabilities and centralized management is provided for virtual machines regardless of the physical server they are sitting on.

Combines storage virtualization and management capabilities.

Provides cost savings and flexibility.

What is the role of IBM SmartCloud Virtual Storage Center (VSC)

It is IBM's storage hypervisor.

Software Defined Storage that is dynamic, service-oriented and cost effective.

IBM SmartCloud VSC delivers to customers, under one licensed software product, the complete set of advanced functions available in IBM Tivoli Storage Productivity Center, the full set of virtualization, remote-mirroring and FlashCopy capabilities of the IBM System Storage SAN Volume Controller (SVC), and complete use of IBM Tivoli Storage FlashCopy Manager.

Capabilities include:

-Physical resources from multiple arrays, vendors, and datacenters, pooled together and accessed anywhere.

-Common capabilities regardless of storage class or vendor

-Mobility of storage volumes on the fly based on workload balancing policies without disruption of service.

-Centralized management to optimize your people for the challenges of day-to-day operations.

-Pay-per-use storage resources - end users are aware of the impact of their consumption and service-level choices.

What are some of the storage optimization techniques for Big Data and Analytics.

Virtualization that allows for choice of storage devices and capabilities for efficiency and improved utilization of storage resources.

Thin provisioning satisfies storage requests without allocating large blocks of data and improves capacity utilization across the infrastructure.

IBM Easy Tier flash optimization automatically moves the most active data to the fastest storage, relocates small amounts of data (sub-volume) for SSD optimization, and enables efficient high performance for critical applications (see the sketch after this list).

IBM Storage Analytics Engine tiered storage optimization provides analytics-based tier recommendations for multiple storage systems; it analyzes file system data, disk performance and capacity utilization, and reduces storage costs by up to 50% without added complexity.

IBM Real-time Compression minimizes storage space by applying compression to block and file data, compressing data on existing capacity or on externally virtualized storage.
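
To make the tiering idea concrete, here is a minimal, hypothetical sketch in Python (not IBM Easy Tier code; the extent IDs and tier capacity are invented): extents with the highest I/O counts are promoted to the flash tier, and everything else stays on disk.

# Minimal, hypothetical sketch of access-frequency-based tiering (not IBM Easy Tier code).
from collections import Counter

class TieredStore:
    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity   # number of extents the flash tier can hold
        self.access_counts = Counter()       # I/O "heat map" per extent
        self.fast_tier = set()               # extents currently placed on flash

    def record_io(self, extent_id):
        self.access_counts[extent_id] += 1

    def rebalance(self):
        # Promote the hottest extents to flash; everything else stays on disk.
        hottest = [e for e, _ in self.access_counts.most_common(self.fast_capacity)]
        self.fast_tier = set(hottest)

    def tier_of(self, extent_id):
        return "flash" if extent_id in self.fast_tier else "disk"

store = TieredStore(fast_capacity=2)
for extent in ["e1", "e1", "e2", "e3", "e1", "e3"]:   # invented I/O trace
    store.record_io(extent)
store.rebalance()
print(store.tier_of("e1"), store.tier_of("e2"))        # e1 -> flash, e2 -> disk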

How does software defined storage impact performance for Big Data and Analytics

Software defined storage provides enterprise class storage that uses standard hardware with all the important storage and management functions performed in intelligent software.

Software defined storage delivers automated, policy-driven, application-aware storage services through orchestration of the underlying storage infrastructure in support of an overall software defined environment.

Software Defined Storage Benefits include:

-Save acquisition costs by using standard servers and storage instead of expensive, special purpose hardware.

-Realize extreme scale and performance through linear, building block, scale-out.

-Increase resource and operational efficiency by pooling redundant isolated resources and optimizing utilization.

-Achieve greater IT agility by being able to quickly react, provision and redeploy resources in response to new requirements.

-Lower data management costs through policy driven automation and tiered storage management.

Software Defined Storage utilizes elastic data infrastructure services.

How does Elastic Storage Server help manage vast volumes and variety of data

Elastic Storage is a proven, scalable, high-performance data and file management solution (based upon GPFS technology). Reference: GPFS Technology

An Elastic Storage Server environment demonstrates how technology adapts to application and business needs. 

-An elastic storage environment provides the technology to match the specific needs of applications and data usage characteristics to perform efficiently. It integrates optimized storage systems with storage optimization technologies and storage software to deliver the agility, performance & scalability features for any data need. It is IBM's implementation of software-defined storage.

Elastic storage infrastructure environments leverage optimized storage systems that enable data independence and provide easy to use, resilient and optimized data delivery environments.

Elastic storage incorporates IBM technologies supporting Hadoop: it stores large amounts of unstructured data, and IBM storage technologies like GridScale and GPFS enable simple, rapid deployments.

Key features of IBM Elastic Storage software
-Extreme Scalability
-Customers with 18 PB file systems
-Maximum file system size of 1 Million Yottabytes / 8 quintillion files per file system
-High Performance
-Parallel file access
-Distributed, scalable, high performance metadata
-Flash acceleration
-Proven Reliability, Availability and Integrity
-Snapshots, Replication
-Rolling upgrades
-Fast rebuild times for disk failures
-Ease of administration
-Full Data Lifecycle management
-Policy-driven automation and tiered storage management
-Match the cost of storage to the value of data
-Storage pools to create tiers of storage (Speed vs capacity)
-Integration with IBM Tivoli Storage Manager and IBM Linear Tape File System (LTFS) Enterprise Edition (EE)
-Reference: https://www.youtube.com/watch?v=hpkpXE6KkTw&feature=youtube

How does IBM Storwize V7000 capabilities support efficient data economics for an Analytics environment?

Designed to complement virtualized environments, IBM Storwize V7000 Unified and IBM Storwize V7000 are enterprise-class, flash-optimized modular storage systems.

Provides improved economics of data storage with hardware-accelerated real-time data compression.

Has integrated support for file and block data to consolidate workloads.

Optimizes performance with fully automated storage tiering.

Improves network utilization for remote mirroring with innovative replication technology.

Deploys storage quickly with management tools and built-in support for leading software platforms.

Reference: http://www.ibm.com/partnerworld/wps/servlet/ContentHandler/TSD03111USEN

Describe how customers with data already managed by System z can extend the platform to integrate Analytics.
With emphasis on the following:

What are System z Hybrid Transaction and Analytics Processing (HTAP) capabilities?

Hybrid Transaction and Analytics processing vision on System z.

HTAP empowers application leaders to innovate via greater situational awareness and improved business agility. HTAP is a concept published by Gartner.

HTAP addresses the four major drawbacks of traditional IT approaches:

-Architectural and technical complexity - Data doesn't need to move from operational databases to separated data warehouses / data marts to support analytics.

-Analytic latency - Transactional data is readily available for analytics when created.

-Synchronization - Drill-down from analytic aggregates always points to the "fresh" data.

-Data duplication - The need to create multiple copies of the same data is eliminated (or at least reduced).

How does IDAA fit in to customer analytics projects?

DB2+IDAA provides a hybrid architecture for the wide spectrum of analytic workloads by simplifying the environment and reducing data duplication.

IBM's latest technology is the IBM DB2 Analytics Accelerator (IDAA) for z/OS, a high-performance accelerator appliance for the DB2 for z/OS environment that delivers dramatically faster complex business analysis, transparently to all users. It is powered by Netezza technology: IDAA processes queries quickly, and its role in the System z environment is to accelerate access to data.

There are specific workloads suitable for IDAA - including workloads that were not previously viable.

IDAA enables business-critical analytics:

-Query data at high speeds–to significantly improve response times for unpredictable, complex and long-running query workloads.

-Extend the capabilities of DB2 for z/OS–to support a cost-effective analytics solution for data warehousing, business intelligence and predictive analytics.

-Lower operating costs–by reducing System z disk requirements and offloading query workloads to a high-performance platform.

-Reference - http://www.ibm.com/software/products/en/db2analacceforzos/

What characteristics does system z provide for business critical analytics?

Unique: executing predictive models inside the transactional database, with little data movement, results in a multiple-times performance improvement compared to moving data out for analytics. It avoids data proliferation across multiple systems, can enable faster response times with fewer compute resources, and reduces potential security exposures and outages.

Achieve huge scales of execution without performance degradation.

Leverage historical and current transaction data to produce the most accurate results.

Provides best of breed security with System z infrastructure.

High level of transaction level auditing and logging for governance.

Tested High Availability/Disaster Recovery capabilities already configured with System z.

The unique ability to integrate IBM SPSS predictive fraud scoring into the DB2 on z/OS database.

Optimized compute intensive analytics processing with zEnterprise System performance for floating point operations.

Superior performance and scalability of Cognos on the zEnterprise platform.

Business intelligence and predictive analytics solutions available on System z include:
-Cognos Business Intelligence for Linux on System z
-Cognos Business Intelligence for z/OS
-SPSS Analytical Decision Management for Linux on System z
-SPSS Collaboration and Deployment Services for Linux on System z
-SPSS Modeler for Linux on System z
-SPSS Statistics for Linux on System z
-DB2 Query Management Facility (QMF)

Data warehousing solutions available on System z include:
-DB2 Analytics Accelerator
-DB2 for z/OS Value Unit Edition (VUE)
-InfoSphere Information Server on System z
-IBM zEnterprise Analytics System 9700 / 9710
-References: http://www.ibm.com/software/os/systemz/badw/
http://www.ibm.com/systems/z/solutions/data.html

How does System z excel at analytics to provide superior fraud protection?

Most banks, insurers, health plan providers and government tax agencies accept a certain percentage of fraud as a cost of doing business, adopting a "pay and chase" approach that does not optimize business results.

When operational systems are on System z, System z is uniquely positioned to integrate anti-fraud business analytics with these systems - enabling fraud to be detected and stopped pre-payment, optimizing business results.

The System z anti-fraud architecture enables real-time decision making.

IBM Smarter Analytics - Anti-fraud, Waste and Abuse Solutions on zEnterprise System provide a comprehensive set of fraud solutions to fit specific needs in banking, insurance, healthcare and tax audit and compliance.
-References: IBM zEnterprise Smarter Analytics for Banking
IBM Smarter Analytics Signature Solution - Anti-Fraud, Waste and Abuse for Healthcare
IBM Smarter Analytics: Anti-Fraud, Waste and Abuse Solution for Insurance
IBM Smarter Analytics Signature Solution - Anti-Fraud, Waste and Abuse for Tax

How does System z excel at analytics to provide superior customer insights?

When operational systems are on System z, System z is uniquely positioned to integrate customer insight analytics with these systems, improving customer satisfaction and loyalty and increasing customer lifetime value.

The integration of predictive scoring, rules and processes with systems of record enables real-time decision-making.

The IBM Signature Solution - Next Best Action is a comprehensive approach to creating an exemplary customer experience by gathering and utilizing actionable insights, with a focus on driving new revenue. This approach provides a coherent customer contact strategy that helps organizations build and develop long-term customer relationships that lead to a lifetime of high customer value.

Reference: http://www.ibm.com/common/ssi/cgi-bin/ssialias?subtype=SP&infotype=PM&appname=STGE_ZS_ZS_USEN&htmlfid=ZSS03115USEN#loaded

Section 5: IBM Big Data & Analytics Infrastructure Considerations cont.

Explain why Power 8 is the Big Data & Analytics platform of choice for customers with distributed platforms
With emphasis on the following:

What capabilities does Power 8 provide for data analytics?

The first generation of systems built on the innovative POWER8 design is optimized for big data & analytics.

Linux on Power Hadoop assets and their integration with BigInsights (Platform Symphony, GPFS) and Streams. The Linux on Power Hadoop reference architecture and its design principles optimize unstructured big data performance.

IBM DB2 with BLU Acceleration solution optimized on Power hardware (a conceptual sketch of these techniques follows this list):
-Dynamic In-Memory 

-In-memory columnar processing with dynamic movement of data from storage.

-Actionable Compression

-Patented compression technique that preserves order so data can be used without decompressing.

-Parallel Vector Processing

-Multi-core and SIMD parallelism (Single Instruction Multiple Data)

-Data Skipping.

-Skips unnecessary processing of irrelevant data.
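
A rough sketch in Python with NumPy of how these techniques combine (illustrative only, not DB2 BLU internals; the column values and block size are invented): per-block min/max synopses let the scan skip blocks that cannot satisfy a predicate, and the surviving blocks are filtered with a single vectorized operation.

# Illustrative sketch of a columnar scan with data skipping (not DB2 BLU code).
import numpy as np

# A single column stored as fixed-size blocks, each with a min/max synopsis.
blocks = [np.array([12, 18, 25]), np.array([40, 44, 49]), np.array([70, 75, 90])]
synopses = [(b.min(), b.max()) for b in blocks]

def scan_greater_than(threshold):
    hits = []
    for block, (lo, hi) in zip(blocks, synopses):
        if hi < threshold:                         # data skipping: the whole block is irrelevant
            continue
        hits.append(block[block > threshold])      # vectorized (SIMD-style) filter on the column
    return np.concatenate(hits) if hits else np.array([])

print(scan_greater_than(45))   # only the last two blocks are touched -> [49 70 75 90]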

Provides an optimized platform for analytics applications such as SPSS, Cognos, SAS and SAP. Enables rapid deployment of business and predictive analytics.

-Integrated solutions greatly improve customers' time to value by reducing the time to install, verify and configure software stacks.

-Accelerates complex Cognos queries and reports.

-Delivers faster scoring results for predictive recommended actions to decision makers.

-BLU Acceleration on Power provides better throughput than an in-memory database competitor for operational analytics.

-Proven platform for Security, Reliability and Simplicity.

Power 8 provides a storage-dense integrated big data platform optimized to simplify & accelerate big data analytics.
-Core/disk mix of Power on Linux cluster can be readily adapted vs x86 cluster with fixed core/disk ratio.
-Accelerate ROI: easy to procure, deploy, use and manage.
-Higher ingest rates deliver faster insights than competitive Hadoop solutions.
-Better reliability and resiliency with 73% fewer outages and 92% fewer performance problems than x86.
-Tailor cluster resources to meet specific workload CPU, memory, I/O requirements.

Elaborate on Power 8 performance benefits

Innovations in the POWER8 processor design deliver performance improvements for analytics applications that lead to faster time to insight.

CAPI - Coherent Accelerator Processor Interface.

-Smart, simplified attach for accelerators: GPUs, flash memory, networking & FPGAs.

-Connects directly to processor, sharing the same address space

-Improves performance, reduces latency, and provides more workload for your dollar.

POWER8 exploits additional cores, more threads, larger caches, memory bandwidth.

-It is 4 times faster than the previous generation. It combines the computing power, memory bandwidth and I/O for a faster route to big data reinvention.

-It has 4 times the threads per core vs. x86.

-It has 4 times the memory bandwidth vs. x86.

-It has 2.4 times the data bandwidth of POWER7.

Multi-threading; High I/O performance; processor and cache; Java optimization.

Understand measurement results of Power 8 over x86 with various workloads.

-Produces insights up to 50 times faster running Cognos BI reports and analytics on POWER8 with DB2 with BLU Acceleration versus commodity x86 with a traditional database.

Why Power Architecture is ideal for Big Data & Analytics Workloads

Multi-Threading
-Twice the thread capability to run parallel Java workloads within the same system unit.

-More threads enables faster BLU Acceleration Single-Instruction, Multiple Data vector processing.

-More threads enable faster Hadoop workload processing.

High I/O Performance
-High memory and I/O bandwidth enables faster Cognos ad hoc queries and better SPSS scoring throughput.

-Higher speed that significantly outperforms competitive x86 Hadoop configurations - also key for Watson.

Processor and Cache
-Large processor cache per core, large numbers of cores and high DRAM capacity on a single server benefit Cognos Cubes performance.

-Extensive Soft-Error Recovery, self-healing for solid faults, and alternate processor recovery is ideal for Hadoop-based workloads.

Java Optimization
-IBM's Java Virtual Machine is specifically optimized for the POWER architecture to deliver optimal performance of big data and analytics solutions.

Integrated solutions (pre-loaded) for Cognos, SPSS and DB2 BLU

-Cognos BI is optimized for Power.

-3 to 10x performance improvements.

-Thread caching memory allocator.

-Kernel settings to enable Cognos to better leverage AIX.

-Pre-loaded, pre-optimized AIX "auto-tune".

-Cognos Dynamic Cubes run better on Power.

-BLU Acceleration for Power Systems Edition.

-Speed of thought analytics.

-Operational simplicity.

-Business Agility.

-SPSS is optimized for Power.

-SPSS Modeler is optimized for high performance reading tens of millions of records.

-SPSS Collaboration & Deployment Services is optimized to execute scoring of large data volumes for real-time results.

Simple acquisition, deployment and implementation.

Performance advantage for these applications over other platforms.

Linux on Power solutions for Big Data and Analytics

Linux on Power scale-out solution for BigInsights with GPFS (Elastic Storage) and Platform Computing Symphony (Adaptive MapReduce capability).

Linux on Power architecture provides enhancements for InfoSphere BigInsights and InfoSphere Streams applications vs. an x86 internal-storage-based architecture.

IBM Watson Foundations is built upon Linux on Power and optimized for analytic accelerators, including real-time analytics for data in motion and leveraging Apache Hadoop technology. IBM Watson Foundations exploits the POWER architecture. The core components include IBM InfoSphere BigInsights, IBM InfoSphere Streams and IBM Power Systems.

Reference: Big Data Analytics Solutions on IBM Power Systems

Section 6: IBM Big Data & Reference Architecture

Describe the benefits of IBM integrated BD&A platform.

IBM offers comprehensive, integrated Big Data & Analytics Platform capabilities to address any use case, no matter where you start or end. Varying analytic scenarios will require different components of a broad and deep architecture to efficiently address the customer's problem. A one-size-fits-all model (for example: an in-memory data store, engineered system, or proprietary coding language requirement) doesn't give you the flexibility to design an analytic application that efficiently matches what you're required to do. IBM acknowledges that organizations have already made investments in IT infrastructure; the Big Data & Analytics Platform embraces this fact and enables clients to start small, on a single pain point, and extend the architecture as the business demands.

IBM Services brings unmatched industry and domain experience to help you forge your big data and analytics strategy and roadmap. We take an outcomes-driven approach that prioritizes high-impact initiatives to help you outperform your peers. GBS BAO Services: 60 use cases across 17 industries, 10 signature solutions, 9 analytics solution centers, 9,000+ consultants, 30,000+ analytics-driven client engagements.

Cognitive systems like Watson that transform how organizations think, act, and operate in the future: Learning through interactions, they deliver evidence based responses driving better outcomes. Watson: First of a kind cognitive system, Watson for engagement, Watson in healthcare, Watson in finance.

Streaming analytics that process data in real time as it flows within and from outside the enterprise: This enables nimble assessment, analysis, and action in the moment. The real-time capability opens up myriad use cases that other vendors cannot even consider, such as real-time fraud detection, health monitoring, and machine maintenance. For example: Would you want to know when a machine is likely to fail? By analyzing real-time data from sensors, you can anticipate when a problem is likely to occur, before it does. That's the value of streaming analytics. InfoSphere Streams: Only real-time analytics processing; analyze data in motion, providing sub-millisecond response times; simplify development of streaming applications using an Eclipse-based IDE; extend the value of existing systems, integrating with your applications; and supports both structured and unstructured data sources, including SPSS Modeler for predictive capabilities and real-time scoring.
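
As a rough illustration of the data-in-motion idea (plain Python, not InfoSphere Streams code; the sensor values, window size and threshold are invented), a streaming application keeps only a small sliding window of recent readings and raises an alert the moment an anomalous value arrives, instead of landing all the data first and analyzing it later.

# Conceptual data-in-motion sketch (not InfoSphere Streams code): act on each reading as it arrives.
from collections import deque

def monitor(readings, window=5, threshold=3.0):
    recent = deque(maxlen=window)              # only the sliding window is kept in memory
    for value in readings:
        recent.append(value)
        avg = sum(recent) / len(recent)
        if value > avg * threshold:            # simple anomaly rule applied to data in motion
            yield ("ALERT", value, round(avg, 2))

sensor_feed = [1.0, 1.2, 0.9, 1.1, 9.5, 1.0]   # hypothetical sensor stream
for alert in monitor(sensor_feed):
    print(alert)                               # fires on 9.5 without storing the stream first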

Enterprise class Hadoop: IBM has taken open source Hadoop and augmented it for the enterprise by adding such things as a SQL engine, visual console interface, development, provisioning, and security features. The result: InfoSphere BigInsights. IBM has also infused the latest cutting edge technology from IBM Research such as super-charged performance, text analytics and visualization tools to round out the package. InfoSphere BigInsights: Provides advanced analytics built on Hadoop technology; 4x faster than any other Hadoop distribution; Enterprise-ready: BigSQL, security, reliability, designed for usability. Integrates with IBM and other information solutions to help enhance data manipulation and management tasks.

Analytics everywhere: While many people consider "analytics" as simply viewing a dashboard, running a report, or exploring data interactively, the game has changed. Now, analytics must be applied everywhere, informing every decision, optimizing every process, fueling every interaction. Code-free predictive analytics and decision management, when fused with the data in real time or right time, extends the analytic metaphor from simply reflecting performance to strategically driving performance at all levels of the organization. Analytic Catalyst accelerates analytics by automatically uncovering key insights and predictive drivers in big data without the need for programming or advanced statistical knowledge. SPSS Modeler: Visual modeling, automated algorithm selection; model once, deploy anywhere (real-time, Hadoop, warehouse, appliance); version and deployment management of multiple models including 3rd-party algorithms and R, with update in place. SPSS Analytic Catalyst: Automatic discovery of interesting patterns delivered in interactive visuals and plain language; a catalyst to know where to dig deeper. IBM "brand": modular, open, integrated; deploy what you need, when you need it; open APIs and OpenStack; tightly integrated to work together; all data, all analytics, all decisions, all perspectives.

High performance, strategic infrastructure that matters: Organizations recognize that you need a highly flexible, scalable IT infrastructure tuned for today's big data and analytic environments that enables shared, security-rich access to trustworthy information, on premise, in the cloud, or anywhere in between. IBM infrastructure allows for scale out for distributed needs like Hadoop, scale up for large-scale workloads, and scale in, bringing processing closer to the data. Best economics in the industry with workload-driven choice: Power Systems, System z, System x; Flash storage for real-time processing; data optimization to keep the data needed most often readily available with the rest in the background; parallel processing with massively parallel systems to address one query with multiple threads connected as one.

Governance and trust for all data & analytics: Making sure the data you rely on is the right data ensures you'll "be more right, more often." InfoSphere Information Integration and Governance: Information Server; Data Replication; Optim for privacy; Guardium for data protection; Master Data Management.

Define the Acquire, Grow, Retain customers imperative and related use cases
With emphasis on the following:

Explain what are the key business values of the Acquire, Grow, Retain customers imperative

Personalization: Ensure each customer interaction is unique and tailored to the buying journey by predicting the best communication method, channel, message and time of delivery for each customer.

Profitability: Improve a customer's lifetime value through advanced association methods that optimize marketing resources and deliver targeted up-sell and cross-sell offers in real time.

Retention: Improve retention and customer satisfaction by detecting anomalies in desired behavior through sentiment analysis and scoring to proactively make more attractive, tailored offers.

Acquisition: Improve accuracy and response to marketing campaigns, reduce acquisition costs and predict lifetime value using granular micro segments based on profitable customers.

Describe what are the key use cases and/or IBM solutions of the Acquire, Grow, Retain customers imperative

Attract new customers - Marketing analytics solutions

Enhanced 360 view of the customer

Better understand customers - Customer Analytics solutions

Provide a consistent customer experience - Predictive Customer Intelligence

Define the Transform Financial Processes imperative and related use cases.
With emphasis on the following:

Explain what are the key business values of the Transform Financial Processes imperative

Planning & performance management: Create plans and forecasts based on historical trends and future predictions, while aligning resources with profit and growth opportunities.

Disclosure management and financial close: Manage the last mile of finance by monitoring and automating the financial close process and disclosure to external constituents.

Incentive compensation Management: Align your sales performance with your sales strategy by enabling new kinds of compensation plans that drive desired sales behavior.

Human capital management: Leverage data to maximize employee productivity and satisfaction, as well as to attract and retain the best employees.

Describe what are the key use cases and/or IBM solutions of the Transform Financial Processes imperative

Monitor and automate financial close and reporting processes - Solutions for financial close, regulatory and management reporting

Drive sales performance with innovative compensation plans - Sales Performance Management solutions

Make smarter decisions by analyzing performance and profitability - Financial analysis solution

Define the Improved IT Economics imperative and related use cases.
With emphasis on the following:

Explain what are the key business values of the Improved IT Economics imperative

Harness and analyze all data: Explore your data, at rest or in motion, closer to where it resides for near-real-time analysis and insight.

Infuse a full range of analytics throughout your organization: Deliver insights into every decision, every business process and every system of engagement to drive better business outcomes.

Be proactive about privacy and governance: Ensure that all data you analyze is safe, secure and accurate.

Describe what are the key use cases and/or IBM solutions of the Improved IT Economics imperative

Real-time actionable insight - IBM SPSS Modeler Gold, IBM InfoSphere Streams

Cloud services to improve IT economics - Big Data and analytics software-as-a-service, Business process-as-a-service, Infrastructure as a service, IBM BlueMix cloud platform

Governance and security for trusted data - Information Integration and Governance, IBM Security Intelligence with Big Data

The infrastructure to maximize insights - IBM Solution for Hadoop - Power System Edition, IBM BLU Acceleration Solution - Power Systems Edition, IBM Solution for Analytics - Power Systems Edition

Define the Manage Risk imperative and related use cases
With emphasis on the following:

Explain what are the key business values of the Manage Risk Processes imperative

Financial risk: Understand the potential impact of financial risks, enabling more risk-aware strategic decision making.

Operational risk and compliance: Employ an integrated and comprehensive approach to managing operational risk and meeting regulatory requirements.

Enterprise risk management: Manage business risks to strike the right balance between risk taking and commercial gain.

Describe what are the key use cases and/or IBM solutions of the Manage Risk Processes imperative

Leverage risk to achieve better outcomes - IBM risk analytics solutions

Identify and manage operational risks proactively - IBM PureData System for Analytics

Reduce litigation and compliance risk with defensible disposal - IBM Defensible Disposal solution

Define the Optimize Operations and Reduce Fraud imperative and related use cases.
With emphasis on the following:

Explain what are the key business values of the Optimize Operations and Reduce Fraud imperative

Business process optimization: Increase operating margins with improved process efficiency by basing real-time decisions on an optimized blend of predictive models, new data sources and business rules.

Infrastructure and asset efficiency: Reduce costs, improve service levels and prevent failures, using machine data with predictive root-cause analysis to optimize operations and identify repair actions

Counter fraud: Identify fraudulent activity, financial crimes and improper payments, and determine appropriate actions to mitigate these events with the intent to reduce loss and improve customer experience.

Public safety and defense: Anticipate threats and risks, uncover crime patterns and trigger factors, derive and use new insights to make better decisions, resolve problems proactively and manage resources effectively.

Describe what are the key use cases and/or IBM solutions of the Optimize Operations and Reduce Fraud imperative

Gain insight into the health of your assets - IBM Predictive Maintenance and Quality

Detect and prevent fraud to shrink costs - IBM Signature solution - Anti-fraud, waste and abuse

Use fact-based insights to inform decisions in real time - IBM predictive operational analytics solutions

Expand visibility to see patterns, link cases and fight fraud - IBM Intelligent Investigation Manager

Define the Create new business model imperative and related use cases.
With emphasis on the following:

Explain what are the key business values of the Create new business model imperative.

Use new perspectives to explore strategic options for business growth:

Data-driven products and services: Harness customer, sensor and location data to create new data-driven products and services

Mass experimentation: Change the practice and behavior of innovation by using vast amounts of data. Link the right people with the right high-performance computing capabilities to prototype new ideas, prove or disprove hypotheses, or predict best-outcome scenarios.

Non-traditional partnership: Forge lucrative alliances with non-traditional partners to create new revenue streams.

Describe what are the key use cases of the create new business model imperative

Create innovative products using new sources of data - IBM SPSS Analytic Catalyst.

Manage patients proactively by analyzing real-time data - Stream computing solutions.

Offer innovative insurance policies based on telematics data - Connected Vehicle.

Define the role of a business decision engine.
With emphasis on the following:

The featured capabilities of a business decision management system include:

Operational decision management: Automate and manage your repeatable business decisions with next-generation business rules.

Decision optimization: Make better decisions by applying advanced analytics to automatically prescribe actions or strategies.

Supply chain management: Anticipate, control and react to demand and supply volatility within the supply chain.

Define the analytical sources.
With emphasis on the following:

What are Analytics Sources?

Analytics sources provide the information for different types of analytics processing. 

-Some of this analytics processing occurs inside of the systems hosting the analytics sources. 

-Some analytics processing occurs in the provisioning engines as information is moved between the analytics sources. 

-Some analytics processing occurs in the information interaction systems.

Describe what technologies are available as Analytics Source repositories

The analytics sources include shared operational systems such as master data hubs, reference data hubs, activity data hubs, and content management hubs. The analytics sources also include systems such as data warehouses, map-reduce (Hadoop), files, databases, and data marts that host historical information harvested from many sources.

Describe the role of each Analytics Source repository

Map-reduce processing (Hadoop): Map-reduce processing (Hadoop) provides a flexible storage system that can hold data in many formats. Schemas and other forms of annotations can be mapped onto the data after it is stored, allowing it to be used for multiple purposes.
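
A minimal schema-on-read sketch in Python (the JSON records and field names are invented, and this is not Hadoop code): raw records are stored exactly as they arrived, and each consumer projects only the fields it cares about at read time.

# Schema-on-read sketch: raw data is stored untouched; structure is applied when reading.
import json

raw_records = [                               # imagine these were landed as-is in the store
    '{"ts": "2015-04-10T10:00:00", "cust": "C1", "amount": 120.5, "channel": "web"}',
    '{"ts": "2015-04-10T10:01:00", "cust": "C2", "amount": 75.0}',
]

def read_with_schema(lines, fields):
    """Project each raw record onto the schema a given consumer cares about."""
    for line in lines:
        rec = json.loads(line)
        yield {f: rec.get(f) for f in fields}

# Two consumers, two different "schemas", same stored data.
print(list(read_with_schema(raw_records, ["cust", "amount"])))
print(list(read_with_schema(raw_records, ["ts", "channel"])))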

Warehouses: Warehouses are still needed to provide efficient access to consolidated and reconciled information for analytics, reports, and dashboards

Operational Data Stores (ODS): An ODS is another type of implementation, holding time-sensitive operational data that needs to be accessed efficiently for both simple queries and complex reporting to support tactical business initiatives. In the traditional architecture, the analytical sources are created based on specific business needs. As new business requirements arise, a new data process is needed to generate the data and make it available for consumers (users and/or applications).

Files: Files are used for many purposes particularly for moving large amounts of information.

Databases: Databases offer structured storage for real-time access.

Data marts: Data marts still provide subsets of information specifically formatted for a particular team or style of processing. They are read-only copies of information that are regularly refreshed from the analytical sources.

Master data hubs: Master data hubs consolidate and manage master data such as customer, supplier, product, account, and asset, as part of a master data management (MDM) program.

Reference data hubs: Reference data hubs manage code tables and hierarchies of values used to transform and correlate information from different sources.

Content hubs: Content hubs manage documents and other media such as image, video, and audio files that must be controlled and managed through formal processes.

Activity data hubs: Activity data hubs manage consolidated information about recent activity, including analytical decisions that are related to the entities in the master data and content hubs. This type of data is normally managed by applications. However, because the organization is processing information about activity from outside of the scope of its applications, a new type of hub is needed to manage the dynamic nature of this type of information.

Analytics hubs: Analytics hubs appear among the operational hubs to create additional insight in real time. These hubs support advanced analytics such as predictive analytics and optimization, along with business rules.

Define the key "information zones" on a BD&A architecture.
With emphasis on the following:

Explain what is an Information Zone

An information zone defines a collection of systems where information is used and managed in a specific way.

Explain the relationship between different Information Zones

The Information Zones overlap with one another when the same information is being used for multiple purposes. This approach is necessary when there is a high volume of information making it uneconomical for each team to have its own private copy of the information. This architecture ensures the availability of suitable information for all of the teams in the organization, with maximum reuse and flexibility to support new use cases.

Explain typical Information Zones available on a BD&A architecture

Landing Area Zone: manages raw data just received from the applications and the new sources. This data has had minimal verification and reformatting performed on it.

Shared Analytics Information Zone: contains information that has been harvested for reporting and analytics.

Deep Data & Map Reduce Zone: contains detailed information that is used by analytics to create new insight and summaries for the business. This data is kept for some time after the analytics processing has completed, to enable detailed investigation of the original facts if the analytics processing discovers unexpected values.

Integrated Warehouse & Marts Zone: contains consolidated and summarized historical information that is managed for reporting and analytics.

Exploration Zone: provides the data that is used for exploratory analytics. Exploratory analytics uses a wide variety of raw data and managed information.

Shared Operational Information Zone: has systems that contain consolidated operational information that is being shared by multiple systems. This zone includes the master data hubs, content hubs, reference data hubs, and activity data hubs.

Information Delivery Zone: contains information that has been prepared for use by the information interaction solutions. It typically contains read-only information that is regularly refreshed to support the needs of the systems using it. It provides some of the authoritative information sources that are used in information virtualization where the original source of information is not suitable for direct access.

Section 6: IBM Big Data & Reference Architecture cont.

Describe the methods to load data into different types of data reservoir repositories.
With emphasis on the following:

A Data Reservoir is a data lake that provides data to an organization for a variety of analytics processing including:

Discovery and exploration of data

Simple ad hoc analytics

Complex analysis for business decisions

Reporting

Real-time analytics

A data reservoir consists of four types of major subsystems:

Data reservoir repositories
-Hadoop providing the generic store for all types of data.
-Specialized repositories for specific workloads.

Data Refineries
-Information provisioning and information preparation capabilities.

Information Governance
-Cataloging and policy based management.

Information Virtualization
-Simplified access to Information in the data lake.

Explain what are data reservoir repositories
The repository types, with their descriptions, backing products and architecture patterns, are:

Harvested Data
-Information Warehouse: A repository optimized for high-speed analytics. This data is structured and contains a correlated and consolidated collection of information. Products: PureData for Analytics; Industry Models. Pattern: Information Warehouse.
-Deep Data: A repository optimized both for high volumes and variety of data. Data is mapped to data structures after it is stored, so effort is spent as needed rather than at the time of storing. Products: InfoSphere BigInsights; Industry Models. Pattern: Map-Reduce Node.

Descriptive Data
-Catalog: A repository and applications for managing the catalog of information stored in the data reservoir. Products: InfoSphere Information Server; Industry Models. Pattern: Information Identification.
-Information Views: Definitions of simplified subsets of information stored in the data reservoir repositories. These views are created with the information consumer in mind. Products: Relational Database; InfoSphere MDM; InfoSphere Federation Server. Pattern: Virtual Information Collection.

Deposited Data
-Information collections that have been stored by the data reservoir information users. These information collections may contain new types of information, analysis results or notes. Products: InfoSphere BigInsights. Pattern: Physical Information Collection.

Shared Operational Data
-Asset Hub: A repository for slowly changing operational master data (information assets) such as customer profiles, product definitions and contracts. This repository provides authoritative operational master data for the real-time interfaces, real-time analytics and for data validation in information ingestion. It is a reference repository of the operational MDM systems but may also be extended with new attributes that are maintained by the reservoir. When this hub is taking data from more than one operational system, there may also be additional quality and deduplication processes running that will improve the data. These changes are published from the asset hub for distribution both inside and outside the reservoir. Products: InfoSphere MDM Advanced Edition. Pattern: Information Asset and Information Asset Hub.
-Activity Hub: A repository for storing recent activity related to a master entity. This repository is needed to support the real-time interfaces and real-time analytics. It may be loaded through the information ingestion process and through the real-time interfaces. However, many of its values will have been derived from analytics running inside the data reservoir. Products: InfoSphere MDM Custom Domain Hub; Industry Models. Pattern: Information Activity and Information Activity Hub.
-Code Hub: A repository of code tables and mappings used for joining information sources to create information views. Products: InfoSphere Reference Data Management Hub (RDM). Pattern: Information Code and Information Code Hub.
-Content Hub: A repository of documents, media files and other content that has been managed under a content management repository and is classified with relevant metadata to understand its content and status. Products: FileNet.
-Operational Status: A repository providing a historical record of the data from the systems of record. Products: Database. Pattern: Operational Status Node.

Explain what is "Information Ingestion" Data Reservoir Services and IBM offering in support of this capability -

Information Ingestion is where data from the information sources is loaded into the data reservoir. This data is treated as reference data (read only) by the processes in the data reservoir. The information ingestion component is responsible for validating the incoming data, transforming relevant structured data to the data reservoir format and routing it to the appropriate data reservoir repositories. IBM InfoSphere Information Server serves this capability very well.
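
The ingestion flow described above can be sketched as a tiny pipeline in Python (the validation rule, target repositories and record format are invented; this is not an InfoSphere Information Server job): each incoming record is validated, transformed into the reservoir format, and routed to an appropriate repository.

# Hypothetical information-ingestion sketch: validate -> transform -> route.
def validate(record):
    return "id" in record and record.get("amount", 0) >= 0

def transform(record):
    # Normalize to the (made-up) reservoir format.
    return {"entity_id": record["id"], "amount": float(record.get("amount", 0)),
            "kind": record.get("kind", "unknown")}

def route(record):
    # Structured transactions go to the warehouse, everything else to the deep-data store.
    return "warehouse" if record["kind"] == "transaction" else "deep_data"

repositories = {"warehouse": [], "deep_data": [], "rejected": []}
incoming = [{"id": "A1", "amount": 10, "kind": "transaction"},
            {"id": "A2", "kind": "clickstream"},
            {"amount": -5}]                        # fails validation

for rec in incoming:
    if not validate(rec):
        repositories["rejected"].append(rec)
        continue
    clean = transform(rec)
    repositories[route(clean)].append(clean)

print({k: len(v) for k, v in repositories.items()})   # {'warehouse': 1, 'deep_data': 1, 'rejected': 1}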

Describe the differences between on-premise cloud and hybrid cloud with respect to Big Data.
With emphasis on the following:

Explain what is Big Data on Cloud

Big Data requires large amounts of data storage, processing, and interchange. The traditional platforms for data analysis, such as data warehouses, cannot easily or cheaply scale to meet Big Data demands. Further, most of the data is unstructured and unsuitable for traditional relational databases and data warehouses.

Platforms to process Big Data require significant up-front investment. The methods for processing Big Data rely on parallel-processing models such as MapReduce (MR), in which the processing workload is spread across many CPUs on commodity compute nodes. The data is partitioned between the compute nodes at run time, and the management framework handles inter-machine communication and machine failures. The most famous embodiment of an MR cluster, Hadoop, was designed to run on a large number of machines that don't share memory or disks (the shared-nothing model we mentioned earlier).
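
The partition-and-spread idea can be illustrated with a minimal, single-machine word count in Python (conceptual only; a real Hadoop job would express the same map and reduce logic through Hadoop's own APIs): the input is split into partitions, each partition is mapped independently, intermediate pairs are shuffled by key, and a reduce step aggregates them.

# Minimal MapReduce-style word count (conceptual; not Hadoop code).
from collections import defaultdict

def map_phase(partition):
    for line in partition:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(mapped):
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

partitions = [["big data needs parallel processing"],     # partition on node 1
              ["parallel processing needs many nodes"]]   # partition on node 2
mapped = [pair for p in partitions for pair in map_phase(p)]   # maps run independently per partition
print(reduce_phase(shuffle(mapped)))   # e.g. {'parallel': 2, 'processing': 2, 'needs': 2, ...}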

Cloud computing, on the other hand, is the perfect vehicle to scale to accommodate such large volumes of data. Cloud computing can divide and conquer large volumes of data through the use of partitioning (storing data in more than one region or availability zone). Further, cloud computing can provide cost efficiencies by utilizing commodity compute nodes and network infrastructure, and requiring fewer administrators (thanks to standardizing the available offerings through the Cloud Service catalog), and programmers (through the use of well-defined APIs). However, as we have also discussed, cloud computing environments are built for general purpose workloads and use resource pooling to provide elasticity on demand.

So it seems that a cloud computing environment is well-suited for Big Data, provided the shared-nothing model can be honored. But there is another big difference, namely, the acute volatility of Big Data workloads compared to typical workloads in a cloud computing environment.

Explain how to make Big Data and Cloud work together

If the cloud computing environment can be modified appropriately, Big Data and Cloud can come together in beneficial ways:

-The Cloud engine can act as the orchestrator providing rapid elasticity. 

-Big data solutions can serve as storage back-ends for the Cloud image catalog and large-scale instance storage.

-Big data solutions can be workloads running on the cloud. 

-For Big Data and the Cloud to truly work together, a number of changes to the Cloud are required:

CPUs for Big Data processing: 

-A Graphics Processing Unit (GPU) is a highly parallel computing device originally designed for rendering graphics. GPUs have evolved to become general purpose processors with hundreds of cores, and are considered more powerful than typical CPUs for executing arithmetic-intensive (versus memory-intensive) applications in which the same operations are carried out on many data elements in parallel fashion. Recent research has explored tightly integrating CPUs and GPUs on a single chip. So one option is to create a resource pool with special compute chips for high performance computing for Big Data. 

-Another option of boosting computing capacity for Big Data in the Cloud is to create a resource pool with multi-core CPUs, which can achieve greater performance (in terms of calculations per second) for each unit of electrical power that is consumed than their single-core equivalents. With quad-core and hex-core CPUs now commonplace, this is the most attractive and cheapest way to create dedicated resource pools for Big Data processing in a Cloud.

Networking for Big Data processing 

-With a need to handle, potentially, petabytes of multi-structured data with unknown and complex relationships, the typical network design in a cloud infrastructure is no longer sufficient. Special considerations are required for ingesting the data into a Hadoop cluster, with a dedicated network to allow parallel processing algorithms such as MR to shuffle data between the compute nodes. At least three types of network segments are required: 

-Data: Dedicated to MR applications, with 10 Gb of bandwidth for lower latency and higher throughput.

-Admin: A separate and dedicated network for management of all compute nodes and traffic not related to MR. 

-Management: A platform for an Integrated Management Module (IMM) (can optionally share the same VLAN subnet as the Admin segment).

Storage for Big Data processing

-One of the biggest changes is to the storage subsystem. There are two ways these changes can be addressed:

-Direct Attached Storage: The compute nodes are designed with multi-core commodity hardware and a large array of local disks. The local disks do not employ RAID and are utilized as just a bunch of disks (JBOD). In this case, the built-in redundancy of Big Data file systems such as HDFS is utilized, because they replicate blocks across multiple nodes.

-A second option is to use a new type of storage architecture that allows storing and accessing data as objects instead of files. Rather than using traditional enterprise storage (such as SAN or NAS, which is then pooled and provisioned on-the-fly), IaaS clouds are extended to provision rack-aware workloads and include support for object storage.

With these changes to the cloud architecture, we can finally bring together Big Data and cloud computing. The following scenarios are now possible:

-Provision a Hadoop cluster on bare-metal hardware. 

-Operate a hybrid cloud (part hypervisor for VM provisioning, part bare-metal for the data store), which is the most common cloud configuration today.

-Reconfigure the entire cloud on demand.

Explain how to make the right decision for deploying Big Data on the cloud in terms of the placement of Big Data solution components.

The issue here is: where should Big Data solution components be placed? Recommendation: consider some of the relevant architectural decisions being discussed here and place the components to meet the needs in those areas:

-Data sensitivity 

-Performance 

-Scalability 

-Financial 

-Availability 

-Backup and recovery 

-Disaster Recovery

The decisions you make in these areas will each provide some guidance on your placement choices and constraints. If the decision comes down to a choice between on-premise and off-premise cloud locations for a component, then the guiding factor is usually cost.

Describe the benefits of parallel and distributed processing for BD&A.
With emphasis on the following:

Parallel Processing and Distributed Computing Concepts

Parallel Processing
-Parallel processing is the simultaneous use of more than one CPU or processor core to execute a program or multiple computational threads. Ideally, parallel processing makes programs run faster because there are more engines (CPUs or cores) running them. In practice, it is often difficult to divide a program in such a way that separate CPUs or cores can execute different portions without interfering with each other. Most computers have just one CPU, but some models have several, and multi-core processor chips are becoming the norm. There are even computers with thousands of CPUs. With single-CPU, single-core computers, it is possible to perform parallel processing by connecting the computers in a network; however, this type of parallel processing requires very sophisticated distributed processing software.
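
A minimal Python sketch of the idea, using the standard-library multiprocessing module to spread the same (deliberately trivial) computation over several CPU cores:

    from multiprocessing import Pool, cpu_count

    def work(n):
        # Stand-in for an arithmetic-intensive task.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        inputs = [200_000] * 8
        # Each worker process can run on its own CPU core where available.
        with Pool(processes=cpu_count()) as pool:
            results = pool.map(work, inputs)
        print(len(results), "tasks completed")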

Massive Parallel Processing (MPP)
-Massively parallel processing refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel.

Distributed Computing
-Distributed computing is a field of computer science that studies distributed systems. A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems range from SOA-based systems to massively multiplayer online games to peer-to-peer applications.
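
The defining trait above, components that coordinate only by passing messages, can be sketched in a few lines of Python. The two processes below stand in for two networked nodes exchanging a request and a reply over a pipe; a real distributed system would use sockets or an RPC layer instead:

    from multiprocessing import Process, Pipe

    def worker(conn):
        # The "remote" component: receive a request, send back a result.
        request = conn.recv()
        conn.send({"status": "ok", "answer": sum(request["numbers"])})
        conn.close()

    if __name__ == "__main__":
        parent_conn, child_conn = Pipe()
        p = Process(target=worker, args=(child_conn,))
        p.start()
        parent_conn.send({"numbers": [1, 2, 3, 4]})   # message out
        print(parent_conn.recv())                     # message back: {'status': 'ok', 'answer': 10}
        p.join()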

Benefits of MPP and Distributed Computing for Big Data and Analytics

Key benefits of MPP and distributed computing are scalability and performance for analytical applications.

The principles of MPP and processing data close to the source are equally applicable to advanced analytics on large data sets. Netezza appliances process complex algorithms expressed in languages other than SQL on a massively parallel scale, with none of the intricacies typical of parallel and grid programming. Running analytics of any complexity on-stream against huge data volumes eliminates the delays and costs incurred in moving data to separate hardware. It accelerates performance by orders of magnitude, making Netezza the ideal platform for converging data warehousing with advanced analytics.

The Netezza architecture combines the best elements of Symmetric Multiprocessing and MPP to create an appliance purpose-built for analyzing petabytes of data quickly. Every component of the architecture, including the processor, FPGA, memory, and network, is carefully selected and optimized to service data as fast as the physics of the disk allows, while minimizing cost and power consumption. Then Netezza software orchestrates these components to operate concurrently on the data stream in a pipeline fashion, thus maximizing utilization and extracting the utmost throughput from each MPP node. In addition to raw performance, this balanced architecture delivers linear scalability to more than a thousand processing streams executing in parallel, while offering a very economical total cost of ownership.

InfoSphere Information Server is a parallel engine with an MPP topology: each processing node has its own CPUs, disks, and memory. When the engine processes a workload, these hardware resources are not shared across the nodes; this is often referred to as a shared-nothing architecture (a toy hash-partitioning sketch follows the topology notes below).

-Benefits and considerations of the MPP engine tier topology for Data Integration: all engine nodes are available to do data integration work, and the failure of a single server has only a corresponding fractional impact on the total server resources available.

-Benefits and considerations of the grid topology:

-Offers massive scalability of the engine tier.

-Significant infrastructure consolidation is possible, compared to silos of departmental servers with high idle time.

-Offers customizable workload management.

-Provides a shared service or a virtualized platform: a clear separation of the platform from customer projects.

-Offers the possibility to develop project-charging models.

-Multiple environments and projects can share a pool of compute nodes.

-Administration effort and complexity scale more linearly than with MPP.

-Encourages organizational rather than departmental standards.

-The compute node pool is inherently highly available; the loss of a compute node does not cause a complete service outage. 

-Compute nodes can be added without a service outage. 

-Multiple environments (development, test, and preproduction) can share a pool of compute nodes, with workload management to make the most effective use of the available hardware resources at any point in the project lifecycle.
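
Returning to the shared-nothing architecture mentioned before these topology notes, here is a toy Python sketch of the idea: rows are hash-partitioned on a key so that each node owns, and processes, its own slice of the data with its own CPU, memory, and disks. This is a generic illustration, not InfoSphere's actual partitioning logic; the node count and row layout are invented for the example:

    from collections import defaultdict
    from zlib import crc32

    NODES = 4  # number of processing nodes in a hypothetical MPP engine

    def owning_node(key):
        # Hash-partition on the key so the same key always lands on the same node.
        return crc32(key.encode("utf-8")) % NODES

    rows = [{"customer_id": f"C{i:04d}", "amount": i * 10} for i in range(12)]

    partitions = defaultdict(list)
    for row in rows:
        partitions[owning_node(row["customer_id"])].append(row)

    for node, part in sorted(partitions.items()):
        print(f"node {node}: {len(part)} rows")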

IBM PureData System for Operational Analytics is an expert integrated data system designed for operational analytic workloads across the enterprise. The system leverages an MPP architecture and is designed for high performance; it can handle up to 1,000 concurrent operational queries.

PureData System for Operational Analytics provides:

-Fast performance using parallel processing technology and other advanced capabilities.

-Built-in expertise and analytics to help you expertly manage database workloads at lower cost.

-Simpler administration for easier management and lower cost of ownership.

Define the role of MDM in BD&A
With emphasis on the following:

Master data comprises the key facts describing your core business entities: customers, partners, products, services, locations, accounts, and contracts. Master data is the high-value, common information an organization uses repeatedly across many business processes.

Master data domains are often used as the dimensions in star or snowflake schema for data warehouses or data marts.

Reference data includes codes and tables of reference information commonly used throughout the enterprise. These may include country, province, and state codes and abbreviations, or industry-standard codes such as the ICD-10 diagnostic codes in medicine and the SIC codes that classify kinds of businesses. Reference data can be used to standardize data from multiple sources to improve correlation and data quality.
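
As a small illustration of reference data in practice, the Python sketch below uses a made-up country reference table to standardize values arriving from two sources before they are correlated; real reference data would live in a managed table, not a hard-coded dictionary:

    # Hypothetical reference table mapping raw spellings to ISO country codes.
    COUNTRY_REF = {
        "brasil": "BR", "brazil": "BR",
        "united states": "US", "usa": "US", "u.s.a.": "US",
    }

    def standardize_country(raw):
        return COUNTRY_REF.get(raw.strip().lower(), "UNKNOWN")

    source_a = [{"customer": "Acme", "country": "Brazil"}]
    source_b = [{"customer": "Acme Corp", "country": "BRASIL "}]

    for record in source_a + source_b:
        record["country_code"] = standardize_country(record["country"])
        print(record)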

Since customer-centric initiatives are driving adoption of Big Data & Analytics, and Master Data Management provides an authoritative source within the enterprise for the key facts about customers, MDM provides a core of confidence around which an organization can scale customer information to Big Data volumes.

Watson Explorer can leverage MDM to provide authoritative data about customers, products, and business partners. This data can be correlated and presented with data about those same entities gathered from external sources, including social media.

MDM Probabilistic Match for BigInsights (IBM Big Match) scales MDM using Hadoop HBase. Big Match is the MDM probabilistic matching engine and its pre-built algorithms running natively within BigInsights for customer data matching. The same probabilistic matching engine used inside InfoSphere MDM is now available to execute on billions of records in BigInsights. Because this matching algorithm has a basis in decision theory and statistics, it is well adapted to searching and matching data with inconsistently populated fields, or data that has not been cleansed or standardized.
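
To give a feel for probabilistic (as opposed to exact) matching, the toy Python scorer below compares two customer records field by field and accumulates a weighted similarity; a pair above a threshold is treated as the same entity. This is only a didactic sketch with invented weights; IBM's engine derives its weights statistically and uses far richer comparison functions:

    from difflib import SequenceMatcher

    # Hypothetical field weights; a real engine derives these statistically.
    WEIGHTS = {"name": 0.5, "birth_date": 0.3, "city": 0.2}
    MATCH_THRESHOLD = 0.75

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def match_score(rec1, rec2):
        return sum(w * similarity(rec1[f], rec2[f]) for f, w in WEIGHTS.items())

    r1 = {"name": "Jon Smith",  "birth_date": "1980-05-01", "city": "Sao Paulo"}
    r2 = {"name": "John Smyth", "birth_date": "1980-05-01", "city": "São Paulo"}

    score = match_score(r1, r2)
    print(round(score, 2), "match" if score >= MATCH_THRESHOLD else "no match")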

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-


Regards

Luciano Caixeta Moreira - {Luti}
luciano.moreira@srnimbus.com.br
www.twitter.com/luticm
www.srnimbus.com.br