CPA Business and Data Analytics pdf notes and Videos


Recommended tools: Excel, R

UNIT DESCRIPTION

This course is aimed at enabling the candidate to use information technology to support decision making through business analytics. The candidate is expected to demonstrate digital competency in preparation and analysis of financial statements, forecasting and related areas in data analytics.


To attempt this paper, a candidate shall be required to have passed all other examination papers within the CPA qualification.
Candidates will be required to have core knowledge of quantitative techniques, financial accounting and reporting and financial management. Candidates are also expected to have knowledge in their specialisation areas of management accounting, audit, tax and public financial management.

The paper will be attempted over three hours in a controlled, computerized environment (examination centres with computer laboratories).


A candidate who passes this paper should be able to:

  • Discuss fundamental aspects of big data and data analytics from the CRISP (cross-industry standard process for data mining) framework, data visualisation and emerging issues.
  • Apply data analytics in preparation of financial statements, financial statements analysis and forecasting, carrying out sensitivity/scenario analysis and presenting financial data and metrics using dashboards.
  • Apply data analytics in financial management principles including time value of money analysis, evaluation of capital projects, sensitivity/scenario analysis and presentation of information using dashboards.
  • Apply data analytics in management accounting to estimate product costs, breakeven analysis, budget preparation, sensitivity/scenario analysis and flexible budgets.
  • Apply data analytics in auditing techniques including key financial trends, fraud detection, tests of control, model reviews and validation issues.
  • Apply data analytics in estimating tax payable and in public sector financial management.


1.0 Introduction to Excel

– Utilising the keyboard shortcuts in Excel
– Conducting data analysis using data tables, pivot tables and other common functions
– Using advanced formulas and functions to enhance the functionality of financial models

2.0 Introduction to data analytics

2.1 The CRISP (cross-industry standard process for data mining) framework for data analytics

  • Data concepts – conceptual, logical, physical data models
  • Stages in data lifecycle: identifying data sources, modeling data requirements, obtaining data, recording data, using data for making business decisions, removing data

2.2 Big data and data analytics

  • Definition of big data
  • The 5Vs of big data
  • Types of data analytics: descriptive analytics, prescriptive analytics and predictive analytics

2.3 Tools for data analytics

  • Data cleaning tools (Alteryx, SSIS, Datastage, others)
  • Data Management (Storage/DBA): SQL, Oracle, Cloud Computing (AWS, Azure), others
  • Reporting/Visualization: Excel, PowerBI, Tableau, MicroStrategy, others

2.4 Data visualization in Excel

  • Definition of data visualization
  • Benefits of data visualization
  • Types of visualization: comparison, composition and relationships
  • Qualities of good data visualization

3.0 Core application of data analytics

3.1 Financial accounting and reporting

  • Prepare financial statements; statement of profit or loss, statement of financial position and statement of cash flow for companies and groups
  • Analyse financial statements using ratios, common size statements, trend and cross-sectional analysis, graphs and charts
  • Prepare forecast financial statements under specified assumptions
  • Carry out sensitivity analysis and scenario analysis on the forecast financial statements
  • Data visualization and dashboards for reporting

3.2 Financial Management

  • Time value of money analysis for different types of cash flows
  • Loan amortization schedules
  • Project evaluation techniques using net present value – (NPV), internal rate of return (IRR)
  • Carry out sensitivity analysis and scenario analysis in project evaluation
  • Data visualisation and dashboards
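As a preview of the financial management topics above, the arithmetic behind Excel's NPV and IRR worksheet functions can be sketched in a few lines of Python. The cash flows and the 12% discount rate below are made-up illustration figures, and the IRR is found by simple bisection rather than Excel's internal algorithm.

```python
# Sketch of NPV and IRR -- the same logic behind Excel's NPV and IRR functions.
# Cash flows and the 12% rate are illustrative figures only.

def npv(rate, cash_flows):
    """Net present value; cash_flows[0] occurs at time 0 (undiscounted)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

def irr(cash_flows, lo=-0.99, hi=10.0, tol=1e-9):
    """Internal rate of return by bisection: the rate at which NPV = 0."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if npv(lo, cash_flows) * npv(mid, cash_flows) <= 0:
            hi = mid          # sign change in [lo, mid]: root is there
        else:
            lo = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

flows = [-100_000, 30_000, 40_000, 50_000, 20_000]  # outlay + yearly inflows
print(round(npv(0.12, flows), 2))   # 6972.84
print(round(irr(flows) * 100, 2))   # IRR as a percentage
```

A positive NPV at the 12% hurdle rate (and an IRR above it) indicates the project adds value under these assumed cash flows.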

4.0 Application of data analytics in specialised areas

4.1 Management accounting

  • Estimate cost of products (goods and services) using high-low and regression analysis method
  • Estimate price, revenue and profit margins
  • Carry out break-even analysis
  • Budget preparation and analysis (including variances)
  • Carry out sensitivity analysis and scenario analysis and prepare flexible budgets
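The high-low method and break-even analysis listed above reduce to simple arithmetic: variable cost per unit is the cost difference between the highest and lowest activity levels divided by the activity difference, and the break-even point is fixed cost divided by contribution margin per unit. A small Python sketch with made-up cost observations:

```python
# High-low cost estimation and break-even analysis.
# Activity levels, costs, and the selling price are illustrative only.

observations = [  # (units produced, total cost)
    (1_200, 46_000), (900, 37_000), (1_500, 55_000), (1_100, 43_000),
]

high = max(observations)   # highest activity level (1500 units)
low = min(observations)    # lowest activity level (900 units)

# High-low method: slope between the two extreme activity levels.
variable_cost_per_unit = (high[1] - low[1]) / (high[0] - low[0])
fixed_cost = high[1] - variable_cost_per_unit * high[0]

# Break-even point: units at which total revenue equals total cost.
selling_price = 60.0
break_even_units = fixed_cost / (selling_price - variable_cost_per_unit)

print(variable_cost_per_unit)        # 30.0
print(fixed_cost)                    # 10000.0
print(round(break_even_units, 1))    # 333.3
```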

4.2 Auditing

  • Analysis of trends in key financial statements components
  • Carry out 3-way order matching
  • Fraud detection
  • Test controls (specifically segregation of duties) identifying combinations of users involved in processing transactions
  • Carry out audit sampling from large data set
  • Model review and validation issues
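Audit sampling from a large data set, as listed above, is often a simple random selection without replacement. A minimal Python sketch, using a made-up population of transaction IDs and a fixed seed so the selection is reproducible for review:

```python
import random

# Simple random sampling from a large transaction population,
# as used when selecting items for audit testing.
# The population is a made-up list of transaction IDs.

population = [f"TXN-{i:06d}" for i in range(1, 50_001)]  # 50,000 transactions

random.seed(42)                            # fixed seed: reproducible selection
sample = random.sample(population, k=60)   # 60 items, without replacement

print(len(sample))       # 60
print(len(set(sample)))  # 60 -> no transaction selected twice
```

In practice the sample size would come from a sampling-risk calculation, and stratified or monetary-unit sampling may be preferred; this sketch shows only the mechanics.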

4.3 Taxation and public financial management

  • Compute tax payable for individuals and companies
  • Prepare wear and tear deduction schedules
  • Analyse public sector financial statements using analytical tools
  • Budget preparation and analysis (including variances)
  • Analysis of both public debt and revenue in both county and national government
  • Data visualisation and reporting in the public sector
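Computing tax payable for an individual usually means applying progressive rate bands. A minimal Python sketch of band-by-band accumulation; the bands and rates below are purely hypothetical, not any actual tax schedule:

```python
# Progressive tax computation: tax payable is accumulated band by band.
# The bands and rates below are hypothetical, for illustration only.

BANDS = [            # (upper limit of the band, marginal rate); None = no cap
    (24_000, 0.10),
    (32_333, 0.25),
    (None,   0.30),
]

def tax_payable(income):
    """Total tax on `income` under the hypothetical progressive bands."""
    tax, lower = 0.0, 0
    for upper, rate in BANDS:
        if upper is None or income <= upper:
            tax += (income - lower) * rate   # last (partial) band
            break
        tax += (upper - lower) * rate        # full band
        lower = upper
    return tax

print(round(tax_payable(50_000), 2))
```

Wear and tear deduction schedules follow a similar pattern of repeated rate application, typically to a declining balance year by year.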

5.0 Emerging issues in data analytics

  • Skepticism and challenges in data analytics
  • Ethical issues in data analytics
  • Data Security / Data Protection
  • Performance (Limitations within analytic tools)




Although many of you may already be familiar with Excel, let's have a small introduction anyway. Microsoft Excel is a spreadsheet application developed by Microsoft, first released in 1985, that is used to record, analyze, and visualize data.

Fig: Microsoft Excel Spreadsheet
Data in Excel is organized in rows and columns. Excel is commonly used to record and analyze data, perform mathematical operations, and visualize structured data in charts and graphs. Another important application of Excel is automating tasks through Excel macros.
To perform the tasks mentioned above quickly, Excel has a set of shortcuts. Various operations can be achieved with a few simple keyboard strokes. Let’s dive deep into the Excel shortcuts that can help us work better on an Excel spreadsheet.
Need for Excel Shortcuts
Excel supports a plethora of keyboard shortcuts that help you work efficiently and increase productivity. Instead of accessing the toolbar with a mouse, two or three keystrokes perform significant functions. Isn't that easier and time-saving? Using Excel shortcuts drastically increases your speed and thus reduces working time.

Do you have to memorize all of these shortcuts? No. However, it is an advantage if you can remember a few of them. With regular practice, you will be able to remember most of the common Excel shortcuts.
Let’s now look at the Excel shortcuts cheat sheet, which you should know when working on Microsoft Excel. In this article, we have categorized 50 Excel shortcuts based on their operations. First, we will look at the workbook shortcut keys.

In this section, we will understand the basics of operating a workbook. We will learn how to create a new workbook, open an existing workbook, and save a spreadsheet so that you don’t lose any data or calculations that you have done. We will then go through how you can toggle between several different sheets in a workbook.

Excel Shortcuts

1. To create a new workbook        Ctrl + N

2. To open an existing workbook         Ctrl + O

3. To save a workbook/spreadsheet      Ctrl + S

4. To close the current workbook      Ctrl + W

5. To close Excel       Alt + F4

6. To move to the next sheet     Ctrl + PageDown

7. To move to the previous sheet   Ctrl + PageUp

8. To go to the Data tab      Alt + A

9. To go to the View tab         Alt + W

10. To go to the Formulas tab        Alt + M

Those were the Excel shortcuts that can help you navigate through your spreadsheet. Once the workbook creation is done, the next key step is cell formatting.


A cell in Excel holds all the data that you are working on. Several different shortcuts can be applied to a cell, such as editing a cell, aligning cell contents, adding a border to a cell, adding an outline to all the selected cells, and many more. Here is a sneak peek into these Excel shortcuts.

Excel Shortcuts

11. To edit a cell  F2

12. To copy and paste cells       Ctrl + C, Ctrl + V

13. To italicize and make the font bold   Ctrl + I, Ctrl + B

14. To center align cell contents Alt + H + A + C

15. To fill color  Alt + H + H

16. To add a border  Alt + H + B

17. To remove outline border    Ctrl + Shift + _

18. To add an outline to the selected cells    Ctrl + Shift + &

19. To move to the next cell   Tab

20. To move to the previous cell Shift + Tab

21. To select all the cells on the right         Ctrl + Shift + Right arrow

22. To select all the cells on the left       Ctrl + Shift + Left Arrow

23. To select the column from the selected cell to the end of the table Ctrl + Shift + Down Arrow

24. To select all the cells above the selected cell  Ctrl + Shift + Up Arrow

25. To select all the cells below the selected cell Ctrl + Shift + Down Arrow

In addition to the above-mentioned cell formatting shortcuts, let's look at a few more advanced cell formatting Excel shortcuts that might come in handy.
We will learn how to add a comment to a cell. Comments are helpful when giving extra information about cell content. We will also learn how to find value and replace it with another value in the spreadsheet.
After this, we will look into how to insert the current time, current date, activate a filter, and add a hyperlink to a cell. Finally, we will see how to apply a format to the data in a cell.

Excel Shortcuts

26. To add a comment to a cell  Shift + F2

27. To delete a cell comment     Shift + F10 + D

28. To display find and replace        Ctrl + H

29. To activate the filter      Ctrl + Shift + L (or Alt + Down Arrow)

30. To insert the current date      Ctrl + ;

31. To insert current time   Ctrl + Shift + :

32. To insert a hyperlink        Ctrl + K

33. To apply the currency format   Ctrl + Shift + $

34. To apply the percent format      Ctrl + Shift + %

35. To go to the “Tell me what you want to do” box        Alt + Q

After working with cell formatting Excel shortcuts, the next step is to understand how to work with an entire row/column in Excel.

Row and Column Formatting Shortcut Keys
In this section, we’ll look at some critical row and column formatting shortcuts.

Excel Shortcuts

36. To select the entire row   Shift + Space

37. To select the entire column      Ctrl + Space

38. To delete a column     Alt + H + D + C

39. To delete a row        Shift + Space, Ctrl + –

40. To hide selected row  Ctrl + 9

41. To unhide selected row   Ctrl + Shift + 9

42. To hide a selected column   Ctrl + 0

43. To unhide a selected column  Ctrl + Shift + 0

44. To group rows or columns     Alt + Shift + Right arrow

45. To ungroup rows or columns        Alt + Shift + Left arrow

These shortcuts cover deleting rows and columns, hiding and unhiding selected rows and columns, and grouping and ungrouping rows and columns.
Now that we have looked at the different shortcut keys for formatting cells, rows, and columns, it is time to jump into understanding an advanced topic in Excel, i.e. dealing with pivot tables. Let’s look at the different shortcuts to summarize your data using a pivot table.
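Before moving on, it helps to see what a pivot table actually computes: grouping rows by one or more fields and aggregating a value for each group. A small Python sketch of that summarization, using made-up sales records:

```python
from collections import defaultdict

# What a pivot table does under the hood: group rows by two fields
# (here, region and product) and aggregate a value (sum of sales).
# The sales records below are illustrative only.

rows = [
    {"region": "East", "product": "A", "sales": 100},
    {"region": "East", "product": "B", "sales": 150},
    {"region": "West", "product": "A", "sales": 200},
    {"region": "East", "product": "A", "sales": 50},
    {"region": "West", "product": "B", "sales": 75},
]

pivot = defaultdict(float)  # (row label, column label) -> aggregated value
for r in rows:
    pivot[(r["region"], r["product"])] += r["sales"]

for (region, product), total in sorted(pivot.items()):
    print(region, product, total)
# East A 150.0
# East B 150.0
# West A 200.0
# West B 75.0
```

In Excel the same result comes from dragging Region to Rows, Product to Columns, and Sales to Values in a PivotTable.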



2.1 THE CRISP (CROSS-INDUSTRY STANDARD PROCESS FOR DATA MINING) FRAMEWORK FOR DATA ANALYTICS


Data modeling (data modelling) is the process of creating a data model for the data to be stored in a database. This data model is a conceptual representation of Data objects, the associations between different data objects, and the rules.

Data modeling helps in the visual representation of data and enforces business rules, regulatory compliance, and government policies on the data. Data models ensure consistency in naming conventions, default values, semantics, and security, while ensuring the quality of the data.

Data Models in DBMS
The data model is defined as an abstract model that organizes data description, data semantics, and consistency constraints of data. The data model emphasizes what data is needed and how it should be organized, rather than what operations will be performed on the data. A data model is like an architect's building plan: it helps to build conceptual models and set relationships between data items.
The two types of Data Modeling Techniques are

1. Entity Relationship (E-R) Model
2. UML (Unified Modelling Language)

We will discuss them in detail later.


Why use Data Model?

The primary goals of using a data model are:

  • Ensures that all data objects required by the database are accurately represented. Omission of data will lead to faulty reports and produce incorrect results.
  • A data model helps design the database at the conceptual, physical and logical levels.
  • The data model structure helps to define the relational tables, primary and foreign keys and stored procedures.
  • It provides a clear picture of the base data and can be used by database developers to create a physical database.
  • It is also helpful in identifying missing and redundant data.
  • Though the initial creation of the data model is labour- and time-consuming, in the long run it makes IT infrastructure upgrades and maintenance cheaper and faster.

Types of Data Models

There are mainly three different types of data models:

  1. Conceptual: This data model defines WHAT the system contains. It is typically created by business stakeholders and data architects. The purpose is to organize, scope and define business concepts and rules.
  2. Logical: Defines HOW the system should be implemented regardless of the DBMS. It is typically created by data architects and business analysts. The purpose is to develop a technical map of rules and data structures.
  3. Physical: This data model describes HOW the system will be implemented using a specific DBMS. It is typically created by DBAs and developers. The purpose is the actual implementation of the database.

Conceptual Model

The main aim of this model is to establish the entities, their attributes, and their relationships. At this data modeling level, hardly any detail of the actual database structure is available.

The three basic tenets of a data model are:

Entity: A real-world thing

Attribute: Characteristics or properties of an entity

Relationship: Dependency or association between two entities

For example:

  • Customer and Product are two entities. Customer number and name are attributes of the Customer entity
  • Product name and price are attributes of product entity
  • Sale is the relationship between the customer and product

Characteristics of a conceptual data model

  • Offers organisation-wide coverage of the business concepts.
  • This type of data model is designed and developed for a business audience.
  • The conceptual model is developed independently of hardware specifications (such as data storage capacity or location) and software specifications (such as DBMS vendor and technology). The focus is to represent data as a user sees it in the "real world."

Conceptual data models, also known as domain models, create a common vocabulary for all stakeholders by establishing basic concepts and scope.

Logical Data Model

Logical data models add further information to the conceptual model elements. They define the structure of the data elements and set the relationships between them.

The advantage of the logical data model is that it provides a foundation for the physical model. However, the modeling structure remains generic.

At this data modeling level, no primary or secondary keys are defined. You need to verify and adjust the connector details that were set earlier for relationships.

Characteristics of a Logical data model

  • Describes data needs for a single project but could integrate with other logical data models depending on the scope of the project.
  • Designed and developed independently of the DBMS.
  • Data attributes have datatypes with exact precision and length.
  • Normalization is applied to the model, typically up to 3NF (third normal form).

Physical Data Model

A physical data model describes a database-specific implementation of the data model. It offers database abstraction and helps generate the schema, thanks to the richness of metadata offered by a physical data model.

This type of data model also helps to visualize database structure. It helps to model database columns, keys, constraints, indexes, triggers, and other RDBMS features.

Characteristics of a physical data model:

  • The physical data model describes data needs for a single project or application, though it may be integrated with other physical data models based on project scope.
  • The data model contains relationships between tables, addressing the cardinality and nullability of the relationships.
  • Developed for a specific version of a DBMS, location, data storage or technology to be used in the project.
  • Columns should have exact datatypes, lengths assigned and default values.
  • Primary and foreign keys, views, indexes, access profiles, authorizations, etc. are defined.
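To make the three levels concrete, the Customer/Product/Sale example given earlier for the conceptual model can be taken all the way down to a physical model: concrete tables, datatypes, and primary/foreign keys. The sketch below assumes SQLite as the target DBMS; table and column names are illustrative choices, not prescribed by the syllabus.

```python
import sqlite3

# The Customer-Product-Sale example expressed as a physical data model:
# concrete tables, datatypes, and primary/foreign keys.
# SQLite is just one possible target DBMS.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_number INTEGER PRIMARY KEY,
        name            TEXT NOT NULL
    );
    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL,
        price        REAL NOT NULL
    );
    CREATE TABLE sale (          -- the relationship between the two entities
        sale_id         INTEGER PRIMARY KEY,
        customer_number INTEGER NOT NULL REFERENCES customer(customer_number),
        product_id      INTEGER NOT NULL REFERENCES product(product_id)
    );
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['customer', 'product', 'sale']
```

Note how the conceptual "Sale" relationship becomes a table of its own, carrying foreign keys to both entities.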

Advantages and Disadvantages of Data Model:

Advantages of Data model:

  • The main goal of designing a data model is to make certain that data objects offered by the functional team are represented accurately.
  • The data model should be detailed enough to be used for building the physical database.
  • The information in the data model can be used for defining the relationship between tables, primary and foreign keys, and stored procedures.
  • A data model helps the business to communicate within and across organizations.
  • A data model helps to document data mappings in the ETL process.
  • It helps to identify the correct sources of data to populate the model.

Disadvantages of Data model:

  • To develop a data model, one should know the characteristics of the physical data storage.
  • A data model built as a navigational system produces complex application development and management, and requires detailed knowledge of the underlying data organisation.
  • Even small changes made to the structure require modification of the entire application.
  • There is no set data manipulation language in the DBMS.


  • Data modeling is the process of developing a data model for the data to be stored in a database.
  • Data models ensure consistency in naming conventions, default values, semantics, and security, while ensuring the quality of the data.
  • The data model structure helps to define the relational tables, primary and foreign keys and stored procedures.
  • There are three types of data models: conceptual, logical, and physical.
  • The main aim of the conceptual model is to establish the entities, their attributes, and their relationships.
  • The logical data model defines the structure of the data elements and sets the relationships between them.
  • A physical data model describes the database-specific implementation of the data model.
  • The main goal of designing a data model is to make certain that data objects offered by the functional team are represented accurately.
  • The biggest drawback is that even small changes made to the structure require modification of the entire application.


Data Lifecycle Management (DLM)

Data Lifecycle Management (DLM) is the process that follows data from creation to destruction, with each phase controlled by a set of policies customized to your business needs. Your data lifecycle management policies should reflect your compliance regulations, privacy standards, and your required degree of data accessibility.

1. Data creation

This one may be beyond obvious but take a moment to see where the bulk of your data is generated. You would hope that valuable active data from research and development efforts, customer interactions on your website, data entry, shared/purchased data, financial data, and transactional data constitute the bulk of your data creation phase. But if you see that your employees are saving memes, Instagram screenshots, or YouTube videos on your servers and taking up cloud space, then you need to tighten your data policies and enforcement.

2. Data maintenance & storage

The quality and accuracy of your data are as important as its accessibility. Bad data can directly impact revenue, as poor data hygiene is a source of inefficiency. 77% of leaders don’t trust the data available to them to make complex business decisions. Once you have curated the data that will drive business decisions, how do you ensure it remains accurate and accessible? It seems like circular logic because it is. The constant cycling of data generation, analysis, integration, storage, and elimination gives Executives the quality data they need to make decisions. But that data maintenance cycle needs governance.

3. Data usage

What is the value of your data? How are you synthesizing the results of data analytics? This is the phase where you align value with action. How is your data used and moved around your enterprise? Maybe you incorporate feedback from end-users into product enhancement opportunities? Roles need to be defined around who has access to sensitive data.

4. Data publication

Data publication and sharing can create issues around compliance and security restrictions. While it is important to share your valuable insights and research, you need to control the way your data leaves your enterprise. And the way recipients engage with your publications needs to be tracked and evaluated.

5. Data archiving

Data archiving strategies should be built around the utility and sensitivity of the data stored. Consider privacy, data ownership, legal requirements, and the length of time you need to keep that data. Archiving removes your data from your active environment but keeps it in deep storage should you need it again.

6. Data destruction

Free yourself by deleting active and archived data that no longer holds value to your organization. Not only does the data need to be appropriately destroyed, but you must adhere to internal governance policies and legal standards depending on the sensitivity of the data.


2.2 BIG DATA AND DATA ANALYTICS

Data refers to the quantities, characters, or symbols on which operations are performed by a computer, and which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

Big Data is a collection of data that is huge in volume and growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently.
Big data refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many entries (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.

Types Of Big Data
Following are the types of Big Data:

1. Structured
2. Unstructured
3. Semi-structured

Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and deriving value from it. However, nowadays we are foreseeing issues as the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.

Examples Of Structured Data

An ‘Employee’ table in a database is an example of Structured Data

Employee_ID   Employee_Name     Gender   Department   Salary_In_lacs
2365          Rajesh Kulkarni   Male     Finance      650000
3398          Pratibha Joshi    Female   Admin        650000
              Shushil Roy
7500          Shubhojit Das     Male     Finance      500000
7699          Priya Sane        Female   Finance      550000

Any data with unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available to them but, unfortunately, they don't know how to derive value out of it, since this data is in its raw or unstructured form.

Examples Of Un-structured Data

The output returned by a 'Google Search'


Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is not actually defined with, for example, a table definition as in a relational DBMS. An example of semi-structured data is data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file:

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
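Because semi-structured data carries its structure in tags rather than in a fixed schema, it must be parsed before analysis. A small Python sketch that reads records like the personal-data XML example above; the tag names (`rec`, `name`, `sex`, `age`) are assumed for illustration:

```python
import xml.etree.ElementTree as ET

# Parsing semi-structured data: each record carries structure (tags),
# but there is no fixed schema as in a relational table.
# Tag names <rec>, <name>, <sex>, <age> are assumed for illustration.

xml_data = """
<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>
"""

root = ET.fromstring(xml_data)
records = [
    (rec.findtext("name"), rec.findtext("sex"), int(rec.findtext("age")))
    for rec in root.findall("rec")
]
print(records)
```

Once parsed into tuples like these, the data can be loaded into a structured store or spreadsheet for analysis.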

Big data can be described by the following characteristics:

• Volume
• Variety
• Velocity
• Variability
• Veracity

Volume
The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not. The size of big data is usually larger than terabytes and petabytes.

Variety
The heterogeneity of sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.

Velocity
The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time. Compared to small data, big data is produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing.

Veracity
The truthfulness or reliability of the data, which refers to the data quality and the data value. Big data must not only be large in size, but must also be reliable in order to achieve value in its analysis. The quality of captured data can vary greatly, affecting the accuracy of analysis.

Value
The worth in information that can be achieved by the processing and analysis of large datasets. Value can also be measured by an assessment of the other qualities of big data. Value may also represent the profitability of information that is retrieved from the analysis of big data.

Variability
The characteristic of the changing formats, structure, or sources of big data. Big data can include structured, unstructured, or combinations of structured and unstructured data. Big data analysis may integrate raw data from multiple sources. The processing of raw data may also involve transformations of unstructured data to structured data.

Other possible characteristics of big data include:

Exhaustive
Whether the entire system (i.e., n=all) is captured or recorded or not. Big data may or may not include all the available data from sources.

Fine-grained and uniquely lexical
Respectively, the proportion of specific data of each element per element collected, and whether the element and its characteristics are properly indexed or identified.

Relational
Whether the data collected contains common fields that would enable a conjoining, or meta-analysis, of different data sets.

Extensional
Whether new fields in each element of the data collected can be added or changed easily.

Scalability
Whether the size of the big data storage system can expand rapidly.

The ability to process Big Data in a DBMS brings multiple benefits, such as:

1) Businesses can utilize outside intelligence when making decisions
Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.

2) Improved customer service
Traditional customer feedback systems are being replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are used to read and evaluate consumer responses. Related benefits include:
• Early identification of risk to the product/services, if any
• Better operational efficiency

3) Big Data technologies can be used to create a staging area or landing zone for new data before identifying what data should be moved to the data warehouse. In addition, such integration of Big Data technologies and a data warehouse helps an organization offload infrequently accessed data.


Data analytics (DA) is the process of examining data sets in order to find trends and draw conclusions about the information they contain. Increasingly, data analytics is done with the aid of specialized systems and software.

1. Predictive data analytics (What is likely to happen?)

Predictive analytics may be the most commonly used category of data analytics. Businesses use predictive analytics to identify trends, correlations, and causation. The category can be further broken down into predictive modeling and statistical modeling; however, it’s important to know that the two go hand in hand.
For example, an advertising campaign for t-shirts on Facebook could apply predictive analytics to determine how closely conversion rate correlates with a target audience’s geographic area, income bracket, and interests. From there, predictive modeling could be used to analyze the statistics for two (or more) target audiences, and provide possible revenue values for each demographic.
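The correlation question in the t-shirt example ("how closely does conversion rate correlate with income?") is answered with a correlation coefficient. A Python sketch computing Pearson's r on made-up campaign data:

```python
from math import sqrt
from statistics import mean

# Pearson correlation between two campaign metrics -- the kind of
# "how closely does X correlate with Y" question predictive analytics asks.
# The income and conversion-rate figures below are made up.

income = [25, 32, 40, 48, 55, 63, 70]              # audience income ('000s)
conversion = [1.1, 1.4, 1.9, 2.2, 2.8, 3.1, 3.6]   # conversion rate (%)

def pearson(xs, ys):
    """Pearson's r: covariance divided by the product of spreads."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson(income, conversion), 3))  # close to 1: strong positive
```

A value near +1 would support targeting higher-income segments; predictive modeling would then go further and estimate the revenue for each segment.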

2. Prescriptive data analytics (What should be done?)

Prescriptive analytics is where AI and big data combine to help predict outcomes and identify what actions to take. This category of analytics can be further broken down
into optimization and random testing. Using advancements in ML, prescriptive analytics can help answer questions such as “What if we try this?” and “What is the best action?” You can test the correct variables and even suggest new variables that offer a higher chance of generating a positive outcome.

3. Diagnostic data analytics (Why did it happen?)

While not as exciting as predicting the future, analyzing data from the past can serve an important purpose in guiding your business. Diagnostic data analytics is the process of examining data to understand cause and effect, or why something happened. Techniques such as drill down, data discovery, data mining, and correlations are often employed.

Diagnostic data analytics help answer why something occurred. Like the other categories, it too is broken down into two more specific categories: discover and alerts, and query and drill downs. Query and drill downs are used to get more detail from a report. For example, consider a sales rep who closed significantly fewer deals one month: a drill down could show fewer workdays, due to a two-week vacation.
Discover and alerts notify of a potential issue before it occurs, for example, an alert about a lower amount of staff hours, which could result in a decrease in closed deals. You could also use diagnostic data analytics to “discover” information such as the most-qualified candidate for a new position at your company.
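A query-and-drill-down flow like the sales-rep example can be sketched in plain Python (the data is hypothetical):

```python
# Drill-down sketch: start from a monthly summary, then drill into the rep
# whose closed deals dropped. All records are made up.
deals = [
    {"rep": "Amina", "month": "May", "closed": 12, "workdays": 21},
    {"rep": "Amina", "month": "Jun", "closed": 5,  "workdays": 11},  # 2-week leave
    {"rep": "Brian", "month": "May", "closed": 10, "workdays": 21},
    {"rep": "Brian", "month": "Jun", "closed": 11, "workdays": 22},
]

# Top-level report: closed deals per rep per month.
summary = {(d["rep"], d["month"]): d["closed"] for d in deals}

# Drill down on the anomaly: why did Amina's June total fall?
detail = [d for d in deals if d["rep"] == "Amina" and d["month"] == "Jun"]
print(summary[("Amina", "Jun")], detail[0]["workdays"])
```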

4. Descriptive data analytics (What happened?)

Descriptive analytics is the backbone of reporting; it's impossible to have business intelligence (BI) tools and dashboards without it. It addresses the basic questions of "how many, when, where, and what."
Once again, descriptive analytics can be further separated into two categories: ad hoc reporting and canned reports. A canned report is one that has been designed previously and contains information around a given subject. An example of this is a monthly report sent to your ad agency or ad team that details performance metrics on your latest ad efforts.
Ad hoc reports, on the other hand, are designed by you and usually aren't scheduled. They are generated when there is a need to answer a specific business question. These reports are useful for obtaining more in-depth information about a specific query. An ad hoc report could focus on your corporate social media profile, examining the types of people who've liked your page and other industry pages, as well as other engagement and demographic information. Its hyper specificity helps give a more complete picture of your social media audience. Chances are you won't need to view this type of report a second time (unless there's a major change to your audience).
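The counting behind such a descriptive report can be sketched in a few lines of Python (the audience data is hypothetical):

```python
# Descriptive "what happened?" sketch: counting page likes by audience
# segment, the kind of figure a canned or ad hoc report would show.
from collections import Counter

likes = ["18-24", "25-34", "25-34", "35-44", "25-34", "18-24"]  # made up
by_segment = Counter(likes)
print(by_segment.most_common(1))
```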

Data analysis tools are software and programs that collect and analyze data about a business, its customers, and its competition in order to improve processes and help uncover insights to make data-driven decisions.

1. MonkeyLearn

MonkeyLearn is a no-code machine learning platform that provides a full suite of advanced data analysis tools to analyze unstructured data, in real-time and around the clock.

You can set up MonkeyLearn to automatically analyze large collections of data right away using pre-trained models, or you can create your own customized text analysis models in a simple point-and-click interface.

MonkeyLearn’s suite of data analysis tools allows you to classify data by topic, sentiment, intent, and more, or extract relevant information, like names, locations, and keywords. Native integration and a robust API make it easy to connect the tools and apps you already use to MonkeyLearn’s machine learning tools.

To make it really simple to discover insights in your text data, MonkeyLearn Studio provides an in-app data visualization tool, so you can go from data analysis to data visualization all in one place.

2. RapidMiner

RapidMiner is a data science platform that helps companies build predictive machine learning models from data. It’s aimed at data analytics teams that want to tackle challenging tasks and handle large amounts of data, so you’ll need a technical background.

Depending on your needs, you can opt for different solutions, including TurboPrep, which allows you to clean and prepare your data; AutoModel, which provides different algorithms to build machine learning models; and DataStudio, to create a visual workflow and explore your data.


3. KNIME

KNIME is a free, open-source platform to create data science workflows. It has an intuitive drag-and-drop interface that allows you to import data from different sources, build advanced machine learning solutions, and visualize data.

Like most open platforms, it's constantly being updated and has an active community of contributors. KNIME allows users to create flows visually, making it simple even for non-programmers.

4. Talend

Talend offers a suite of cloud apps for data integration. It’s designed to help businesses collect all their data in a single platform so that teams can access the right data when they need it.

The platform has a series of in-built machine learning components, which allow users to analyze data without the need to code. It uses classification, clustering, recommendation, and regression algorithms.

Talend offers a free open-source version and various commercial alternatives.

5. Excel

Microsoft Excel can be used to filter, organize, and visualize quantitative data, making it the perfect tool for performing simple data analysis. You can use a wide range of formulas and filters, and create pivot tables, charts, and graphs to synthesize the data you've gathered. But there's a limit to the amount of data that Excel can handle, so you may need more powerful tools if you'd like to analyze data at scale.

Explore common functions and formulas for data analysis in Excel.
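As a rough illustration, the kind of aggregation an Excel pivot table or SUMIF performs looks like this in Python (the regions and figures are made up):

```python
# Pivot-table-style aggregation: total sales per region, analogous to an
# Excel pivot table or SUMIF. All figures are hypothetical.
from collections import defaultdict

rows = [
    ("Nairobi", 120), ("Mombasa", 80), ("Nairobi", 95), ("Kisumu", 60),
]
totals = defaultdict(int)
for region, amount in rows:
    totals[region] += amount   # group by region, sum amounts
print(dict(totals))
```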

6. Airtable

Airtable is a user-friendly cloud collaboration tool defined as “part spreadsheet, part database”. It provides data analysis and data visualization functions (like other traditional spreadsheet tools) but with a powerful database on the backend. By using “views”, you can easily interact with the database to manage, track, and find data. Plus, developers can connect Airtable with other apps through an API.

There’s a free plan available with the basic features for you to get started.

7. Power BI

Business intelligence tools, like Microsoft Power BI, are extremely important in the data analysis process because they make it easy for businesses to spot trends, patterns, and insights across large sets of data.

Microsoft Power BI allows users to import data from hundreds of sources, and drag and drop elements, to create real-time dashboards and reports. Equipped with AI, an Excel integration, and pre-built and custom data connectors, you can gain valuable insights and easily share them with the rest of your team.

8. Tableau

Tableau is a powerful analytics and data visualization platform that allows you to connect all your data and create compelling reports and interactive dashboards that update in real-time. It’s easy to use, supports large amounts of data, and can be run on-premise or in the cloud.

There’s a free trial available and different plans for individual users and organizations.

9. ClicData

ClicData is an end-to-end business intelligence platform with extensive data connectivity, data transformation, automation and visualization features. ClicData is 100% cloud-based and works on all operating systems and devices.

Within a day, you can easily connect and blend data from various sources and build dashboards with their drag-and-drop interface. They offer self-service BI with online resources as well as full-service BI with in-app support and expert services.

10. R

Free and open-source programming languages are a great option if you're building your own data analysis tools. R is widely used for exploratory data analysis, statistical computing, and data visualization. At first, it was mainly used by researchers and academics, but it has since branched out into the business world. Learning R is relatively easy, even if you don't have a programming background.

11. Python

Python is one of the most in-demand programming languages today and it’s considered the preferred language for machine learning. It stands out for being very flexible, allowing you to build solutions for various use cases. Plus, it’s fairly straightforward to learn and write.

12. Qlik

With both cloud and on-premises deployment, Qlik offers helpful tools both for those with extensive technical backgrounds and for users who are not fully computer literate. QlikView offers in-memory data processing for very fast results, and the visualization of color-coded data relationships makes the results and insights easy to understand.

13. SAS Business Intelligence

The SAS Business Intelligence platform focuses on visualizations that can be easily understood and shared across an organization, with a clear path to implementing change in order to streamline processes and improve customer satisfaction.

SAS BI aims to help clients answer specific questions, like “Where do my customers come from?” and “Where are most accidents occurring?”

14. Looker

Looker integrates with existing tools to introduce new, highly-focused data that can show previously unseen data relationships to help teams make more informed decisions.

Customizable programs and applications ensure that models are designed specifically for individual clients. And many of their “embedded analytics solutions” come pre-designed for industries like retail, healthcare, and more.

15. SQL Programming Language

Structured query language (SQL) is the standard language created to communicate with databases and is particularly useful when handling structured data. Used to search, add, update, and delete data, among other operations, SQL makes it easy to organize structured data.

Most existing structured data is stored in SQL databases, so programs written in the language can easily unlock that data for powerful results.
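A minimal sketch of those four operations (search, add, update, delete) against an in-memory SQLite database; the table and data are invented:

```python
# The four SQL verbs against an in-memory SQLite database.
# Table and column names are made up for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, balance REAL)")
con.execute("INSERT INTO clients (name, balance) VALUES (?, ?)", ("Wanjiku", 1500.0))   # add
con.execute("UPDATE clients SET balance = balance + 500 WHERE name = ?", ("Wanjiku",))  # update
rows = con.execute("SELECT name, balance FROM clients").fetchall()                      # search
con.execute("DELETE FROM clients WHERE balance < 1000")                                 # delete
print(rows)
```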


Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no one absolute way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset to dataset. But it is crucial to establish a template for your data cleaning process so you know you are doing it the right way every time.


Data cleaning is the process that removes data that does not belong in your dataset. Data transformation is the process of converting data from one format or structure into another. Transformation processes are also referred to as data wrangling or data munging: transforming and mapping data from one “raw” form into another format for warehousing and analysis.

While the techniques used for data cleaning may vary according to the types of data your company stores, you can follow these basic steps to map out a framework for your organization.

Step 1: Remove duplicate or irrelevant observations

Remove unwanted observations from your dataset, including duplicate observations or irrelevant observations. Duplicate observations will happen most often during data collection. When you combine data sets from multiple places, scrape data, or receive data from clients or multiple departments, there are opportunities to create duplicate data. De-duplication is one of the largest areas to be considered in this process. Irrelevant observations are when you notice observations that do not fit into the specific problem you are trying to analyze. For example, if you want to analyze data regarding millennial customers, but your dataset includes older generations, you might remove those irrelevant observations. This can make analysis more efficient and minimize distraction from your primary target—as well as creating a more manageable and more performant dataset.
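Step 1 can be sketched in plain Python; the dataset and the millennial birth-year cutoff are illustrative:

```python
# Step 1 sketch: drop exact duplicates and filter out irrelevant rows
# (here, non-millennial birth years; the 1981-1996 cutoff is illustrative).
customers = [
    {"name": "Otieno", "birth_year": 1992},
    {"name": "Otieno", "birth_year": 1992},  # duplicate from a second source
    {"name": "Mary",   "birth_year": 1958},  # outside the target segment
    {"name": "Fatma",  "birth_year": 1988},
]

seen, cleaned = set(), []
for row in customers:
    key = (row["name"], row["birth_year"])
    if key in seen:
        continue                              # remove duplicate observation
    seen.add(key)
    if 1981 <= row["birth_year"] <= 1996:     # keep relevant observations only
        cleaned.append(row)
print(len(cleaned))
```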

Step 2: Fix structural errors

Structural errors are when you measure or transfer data and notice strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes. For example, “N/A” and “Not Applicable” may both appear, but they should be analyzed as the same category.
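Fixing structural errors usually means normalizing case, whitespace, and synonymous labels; a minimal sketch (the labels and mapping are illustrative):

```python
# Normalize inconsistent labels so "N/A", "n/a ", and "Not Applicable"
# land in one category. The synonym mapping is illustrative.
SYNONYMS = {"n/a": "not applicable"}

def normalize(label: str) -> str:
    label = label.strip().lower()        # fix stray spaces and capitalization
    return SYNONYMS.get(label, label)    # merge synonymous spellings

raw = ["N/A", "n/a ", "Not Applicable", "Applicable"]
cleaned_labels = [normalize(x) for x in raw]
print(cleaned_labels)
```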


Having clean data will ultimately increase overall productivity and allow for the highest quality information in your decision-making. Benefits include:

• Removal of errors when multiple sources of data are at play.
• Fewer errors make for happier clients and less-frustrated employees.
• Ability to map the different functions and what your data is intended to do.
• Monitoring errors and better reporting to see where errors are coming from, making it easier to fix incorrect or corrupt data for future applications.
• Using tools for data cleaning will make for more efficient business practices and quicker decision-making.

DataStage is an ETL tool used to Extract data from different data sources, Transform the data as per the business requirement, and Load it into the target database. The data source can be of any type: relational databases, files, external data sources, etc. Using the DataStage ETL tool we deliver quality data, which in turn is used for business intelligence. DataStage was first launched by VMark and was later acquired by IBM. It was earlier called ‘Data Integrator’.
Why do we need DataStage?
Before asking ‘Why do we need DataStage?’, let us look at traditional batch processing.
Below is the process that was followed in traditional batch processing:
1. Load data from the source to disk.
2. Perform transformations on disk and save the results back to disk.
3. Move the data from disk to the target.
Traditional batch processing becomes impractical with big data volumes, and it is very complex to manage the many small jobs needed to meet the requirement.
To overcome these drawbacks we needed batch processing that can be done in parallel. For this need, ETL batch processing systems were developed to deal with large data volumes in parallel.
Parallel processing can be done based on pipelining and partitioning.
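A toy version of a pipelined, partitioned ETL run (the data and the VAT rule are invented; this is not DataStage code, just the general pattern in Python):

```python
# Minimal extract -> transform -> load flow with partitioned, parallel
# transforms, the pattern described above. All data is made up.
from concurrent.futures import ThreadPoolExecutor

def extract():
    return [{"amount": "100"}, {"amount": "250"}, {"amount": "40"}]

def transform(partition):
    # Business rule (illustrative): cast to int and add 16% VAT.
    return [{"amount": round(int(r["amount"]) * 1.16)} for r in partition]

def load(rows, target):
    target.extend(rows)

source = extract()
partitions = [source[:2], source[2:]]      # partitioning
warehouse = []
with ThreadPoolExecutor() as pool:         # parallel transform
    for result in pool.map(transform, partitions):
        load(result, warehouse)
print(warehouse)
```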

How does DataStage works?
DataStage usually involves the steps below:
We design jobs for extraction, transformation, and loading, in a sequential or parallel manner.
We then schedule, run, and monitor the jobs, and create batch jobs.
DataStage has different components that help us achieve the overall extract, transform, and load:
Administrator: – Manages global settings and interacts with the systems.
Designer: – Used to create DataStage jobs and job sequences, which are compiled into executable programs. The Designer is mainly for developers.
Director: – Used to monitor and manage DataStage jobs. DataStage support roles use it to monitor jobs and fix job failures.
Manager: – Used to manage, browse, and edit the data warehouse repository.
The terminology we use is as below: project, job, stage, link.
Types of jobs: parallel jobs, job sequences, and server jobs.
Parallel jobs:
Stages and links can be combined in a shared container, and instances of the shared container can be reused in various other parallel jobs. However, a container can be used only within the job in which it is defined.
Server jobs:
Stages are used to represent sources, conversion steps, or targets. Stages are of two kinds: active or passive.
Links:
A link connects the various stages in a job and indicates the flow of data when the job is run.
Processing Stage Types
A DataStage job usually consists of stages, links, and transforms. The stages describe the flow of data from a data source to the target data source. A stage can have a single data source as input, or multiple data sources, and one or more data outputs.
Let us discuss the various stages that we use in DataStage. In job design, the stages you can use include:
1) Transform stage
2) Filter stage
3) Aggregator stage
4) Remove duplicates stage
5) Join stage
6) Lookup stage
7) Copy stage
8) Sort stage
9) Containers


Advantages:
• Connects to multiple types of data sources.
• Handles large volumes of data, bulk transfer, and complex transformations.
• Refreshes and synchronizes data as often as needed.
• Reliable and flexible in connecting to different types of databases.
• Partitioning algorithms.
• Easy integration and a single interface for heterogeneous sources.
• Performs well on both Windows and Unix servers.

Disadvantages:
• We need to either install the client or connect to the server for ETL work.
• No automated mechanism for error handling and recovery.
• There is no UNIX DataStage client.
• The software can be expensive for small or mid-size companies.

Features of DataStage
1) It supports the transformation of large volumes of data.
2) Real-time data integration, which enables connectivity between data sources and applications.
3) Optimized hardware utilization.
4) Supports data collection and integration.
5) Powerful, scalable, fast, flexible, and effective to build, deploy, update, and manage your data integration.
6) Supports big data and Hadoop.

Uses of DataStage in various fields or companies:
DataStage is now used worldwide. Companies that use DataStage include Cooper Companies, SAS, etc.
Career path for DataStage:
Currently, ETL tool usage is on the rise, and ETL is not confined to a particular industry; it is used in every industry to manage data and put it into a usable format.
There are other ETL tools, such as Informatica and Talend, which are cheaper than DataStage.
To be more specific about the career path, you can move into data analytics, which would be easier to handle and a milestone in your career, since you already have good knowledge of ETL tools.
The things to remember from the above session are the definition and the flow of a DataStage job.
DataStage is an ETL tool used to Extract data from different data sources, Transform the data as per the business requirement, and Load it into the target database. The data source can be of any type: relational databases, files, external data sources, etc. Using the DataStage ETL tool we deliver quality data, which in turn is used for business intelligence.
DataStage usually involves the steps below:
We design jobs for extraction, transformation, and loading, in a sequential or parallel manner.
We then schedule, run, and monitor the jobs, and create batch jobs.
Key aspects are as below:
1) Data transformation
2) Jobs
3) Parallel processing
DataStage has four main components:
1) Administrator
2) Manager
3) Designer
4) Director
It refreshes and synchronizes data as needed, connects reliably and flexibly to different types of databases, provides partitioning algorithms, and offers easy integration with a single interface for heterogeneous sources.

Something as simple as presenting data in graphic format may seem to have no downsides. But
sometimes data can be misrepresented or misinterpreted when placed in the wrong style of data

visualization. When choosing to create a data visualization, it’s best to keep both the advantages and disadvantages in mind.

Our eyes are drawn to colors and patterns. We can quickly identify red from blue, and squares
from circles. Our culture is visual, including everything from art and advertisements to TV and movies. Data visualization is another form of visual art that grabs our interest and keeps our eyes on the message. When we see a chart, we quickly see trends and outliers. If we can see something, we internalize it quickly. It’s storytelling with a purpose. If you’ve ever stared at a massive spreadsheet of data and couldn’t see a trend, you know how much more effective a visualization can be.

Some other advantages of data visualization include:

• Easily sharing information.
• Interactively explore opportunities.
• Visualize patterns and relationships.

While there are many advantages, some of the disadvantages may seem less obvious. For
example, when viewing a visualization with many different datapoints, it’s easy to make an inaccurate assumption. Or sometimes the visualization is just designed wrong so that it’s biased or confusing.

Some other disadvantages include:

• Biased or inaccurate information.
• Correlation doesn’t always mean causation.
• Core messages can get lost in translation.


The importance of data visualization is simple: it helps people see, interact with, and better understand data. Whether simple or complex, the right visualization can bring everyone on the same page, regardless of their level of expertise.

It’s hard to think of a professional industry that doesn’t benefit from making data more understandable. Every STEM field benefits from understanding data—and so do fields in government, finance, marketing, history, consumer goods, service industries, education, sports, and so on.

While we’ll always wax poetic about data visualization, there are practical, real-life applications that are undeniable. And, since visualization is so prolific, it’s also one of the most useful professional skills to develop. The better you can convey your points visually, whether in a dashboard or a slide deck, the better you can leverage that information. The concept of the citizen data scientist is on the rise. Skill sets are changing to accommodate a data-driven world. It is increasingly valuable for professionals to be able to use data to make decisions and to use visuals to tell stories of how data informs the who, what, when, where, and how.

While traditional education typically draws a distinct line between creative storytelling and technical analysis, the modern professional world also values those who can cross between the two: data visualization sits right in the middle of analysis and visual storytelling.


As the “age of Big Data” kicks into high gear, visualization is an increasingly key tool for making sense of the trillions of rows of data generated every day. Data visualization helps to tell stories by curating data into a form that is easier to understand, highlighting the trends and outliers. A good visualization tells a story, removing the noise from data and highlighting the useful information.

However, it’s not simply as easy as dressing up a graph to make it look better or slapping on the “info” part of an infographic. Effective data visualization is a delicate balancing act between form and function. The plainest graph could be too boring to catch any notice, or it could make a powerful point; the most stunning visualization could utterly fail at conveying the right message, or it could speak volumes. The data and the visuals need to work together, and there’s an art to combining great analysis with great storytelling.


• Chart: Information presented in a tabular, graphical form with data displayed along two axes. Can be in the form of a graph, diagram, or map.
• Table: A set of figures displayed in rows and columns.
• Graph: A diagram of points, lines, segments, curves, or areas that represents certain variables in comparison to each other, usually along two axes at a right angle.
• Geospatial: A visualization that shows data in map form, using different shapes and colors to show the relationship between pieces of data and specific locations.
• Infographic: A combination of visuals and words that represent data. Usually uses charts or diagrams.
• Dashboards: A collection of visualizations and data displayed in one place to help with analyzing and presenting data.
More specific examples
• Area Map: A form of geospatial visualization, area maps are used to show specific values set over a map of a country, state, county, or any other geographic location. Two common types of area maps are choropleths and isopleths.
• Bar Chart: Bar charts represent numerical values compared to each other. The length of the bar represents the value of each variable.
• Box-and-whisker Plots: These show a selection of ranges (the box) across a set measure (the bar).
• Bullet Graph: A bar marked against a background to show progress or performance against a goal, denoted by a line on the graph.
• Gantt Chart: Typically used in project management, Gantt charts are a bar chart depiction of timelines and tasks.
• Heat Map: A type of geospatial visualization in map form which displays specific data values as different colors (this doesn’t need to be temperatures, but that is a common use).
• Highlight Table: A form of table that uses color to categorize similar data, allowing the viewer to read it more easily and intuitively.
• Histogram: A type of bar chart that splits a continuous measure into different bins to help analyze the distribution.
• Pie Chart: A circular chart with triangular segments that shows data as a percentage of a whole.
• Treemap: A type of chart that shows different, related values in the form of rectangles nested together.
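As a small worked example, the binning a histogram performs can be sketched in Python (the values and bin width are arbitrary):

```python
# Histogram sketch: split a continuous measure into fixed-width bins,
# then draw a crude text chart. Values and bin width are illustrative.
from collections import Counter

values = [3, 7, 8, 12, 14, 15, 19, 22]
bin_width = 10
bins = Counter((v // bin_width) * bin_width for v in values)
for start in sorted(bins):
    print(f"{start}-{start + bin_width - 1}: {'#' * bins[start]}")
```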
