**1. Why should we hire you as a Data Analyst?**

Data Analysts can collect, run and crunch data for insight that helps the organization make good decisions. They look for relationships and can discuss their results well. Their main job role is to use the data to grab opportunities for preventive measures. Data Analyst requires critical thinking and creative skills. I have all the above qualities of a Data Analyst which will help the organization to grow.

**2. Explain the process of Data Analysis.**

Data Analysis is the process of assembling, cleansing, understanding, renovating and modeling data to collect insights and produce reports to gain business profits.

**Collect Data:**The data is collected from different sources and is stored so that it can be cleaned and prepared.**Analyze Data:**Once the data is prepared, the next step is to analyze the data. A model is run continually for improvements. Then, the mode is authenticated to check whether it fulfills the business requirements.**Create Reports:**Lastly, the model is executed and then reports created are forwarded onto the stakeholders.

**3. What do you think is the responsibility of a Data Analyst?**

Responsibility of a Data Analyst is as follows-

- Provide support to all Data Analysis and coordinate with end users and team
- Solving business related issues for clients and carrying out audit on data
- Analyze results and understand data using statistical techniques and create reports
- Find new process or areas for development opportunities
- Obtain data from data sources and preserve databases/data systems
- Filter and “clean” data, and review reports
- Safeguarding database by managing access system

**4. What skills are essential to become a Data Analyst?**

This is one of the most important Data Analyst interview question and answer. Following skills are required to become a Data Analyst–

- Knowledge on reporting packages like Business Objects, programming languages like XML, Java script, or ETL frameworks and databases like SQL, SQLite, etc.
- Strong skills with the capability to examine, form, assemble and distribute big data with accuracy
- Technical skills like database design, data models, data mining and segmentation

Robust knowledge on statistical packages for analyzing huge datasets (SAS, **Excel**, SPSS, etc.)

**5. What do you understand by Data Cleansing?**

Data cleaning is also called as data cleansing, it finds and removes errors and irregularities from data in order to improve the quality of data.

**6. State the difference between Data Mining and Data Profiling.**

**Data Mining: **Data Mining means analyzing of data with respect to finding relations that have not been revealed earlier. It mainly emphases on the recognition of unusual records, dependencies and cluster analysis.

**Data Profiling: **Data Profiling is the process of analyzing individual characteristics of data. It mainly focuses on sharing valued information on data features such as data type, frequency etc.

**7. According to you what are the best practices for Data Cleaning?**

Some of the best practices for data cleaning are as follows-

- Sort data by various characteristics
- For large datasets cleanse data stepwise and improve the data quality
- For large datasets, split them into small data. Working with less data will improve your repetition speed
- Arrange data by projected frequency and detect the most common problems
- Analyze the summary for every column
- Keep track of every date cleaning process, to make changes or remove operations if required

**8. Explain the important steps in the Data Validation process.**

Data Validation is the process of validating data. It consists of two processes involved in it. They are Data Screening and Data Verification.

**Data Screening:**Various types of algorithms are used in this step to screen the entire data to find out any wrong values.**Data Verification:**All the distrusted values are evaluated on different use-cases, and then a final decision is taken on whether the value has to be included in the data or not.

**9. List the important steps in an analytics project.**

Different steps in an analytics project include-

- Problem definition
- Data exploration
- Data preparation
- Modelling
- Validation of data
- Implementation and tracking

**10. Which tools can be used for Data-Analysis?**

Tools that are used for Data Analysis are as follows-

- Tableau
- RapidMiner
- Open Refine
- KNIME
- Google Search Operators
- Solver
- NodeXL
- io
- Wolfram Alpha’s
- Google Fusion tables

**11. What do you mean by Logistic Regression?**

Logistic regression is a statistical technique for investigative a dataset in which there are one or more independent variables that defines the result.

**12. State the difference between Data Mining and Data Profiling.**

**Data profiling:**It focuses on the instance analysis of individual features. It gives information on various characteristics like value range, discrete value and their frequency, occurrence of null values, data type, length, etc.**Data mining:**It targets on cluster analysis, discovery of unusual records, dependencies, sequence detection, relation holding between numerous attributes, etc.

**13. What are the criteria to identify whether a developed data model is good or not?**

Data Analyst should have a good knowledge to identify the developed data model as this is the tricky Data Analyst interview question and answer frequently asked.

Below are a few criteria which needs to be considered to decide whether a developed data model is good or not-

- A model designed for the dataset should have expectable performance.
- A model is said to be a good model if it can smoothly adjust to changes according to business requirements
- If the data gets transformed, the model should be able to measure according to the data
- The model should also be able to easily accepted by the clients for actionable and profitable results

**14. What are common problems faced by Data Analyst?**

Below given are some of the common problems faced by Data Analyst-

- Common misspelling
- Duplicate entries
- Missing values
- Illegal values
- Varying value representations
- Identifying overlapping data

**15. Which framework was developed by Apache for processing large data set in a distributed computing environment?**

Hadoop and MapReduce are the framework developed by Apache for processing large data set in a distributed computing environment.

**16. List the missing patterns that are generally observed?**

Following are the missing patterns that are generally observed-

- Missing totally at random
- Missing at random
- Missing that rest on the missing value itself
- Missing that depends on unnoticed input variable

**17. List a few problems that Data Analyst usually find out while performing the analysis?**

The following are the problems that are usually encountered while performing Data Analysis-

- Duplicate entries and spelling mistakes, harm the data quality
- Refering data from a poor source may create a problem as it would require a lot of time to clean the data
- Extracting data from various sources will differ in representation. When combined, it may happen that the difference in representation could result in a delay
- Finally, if there is an inadequate data, then that could be a problem to perform analysis of data

**18. What do mean by KNN imputation method?**

In **KNN** imputation method, the missing attribute values are credited by using the attributes value that are quite similar to the attribute whose values are missing. By using a distance function, the comparison of two attributes is examined.

**19. State the Data Validation methods used by Data Analyst.**

Data Validation methods used by Data Analyst are-

- Data screening
- Data verification

**20. What can you do with suspected or missing data?**

- Prepare a validation report that reveals information of all alleged data. It should give information like validation measures that it failed and the date and time of manifestation
- Experience personnel should examine the doubtful data to regulate their acceptability
- Invalid data should be allocated and changed with a validation code
- To work on missing data, execute the finest analysis strategy like deletion method, single imputation methods, model-based methods, etc.

**21. Define a Waterfall Chart.**

The waterfall chart displays both positive and negative values which holds the final result value. For instance, if you are analyzing a company’s net income, then you can have the statistics of all the cost values. With such kind of statistics, you can visualize how the value from revenue to the net income is obtained when all the costs are abstracted.

**22. How can you handle the multi-source problems?**

Multi-source problems can be handled by-

- Reformation of schemas to complete a schema integration
- Find similar records and combine them into single record containing all related attributes without redundancy

**23. What do you understand by the term Outlier?**

The outlier is a frequently used terms by Data Analysts. It refers for a value that appears far away and separates from an overall pattern in a sample. Outliers are of two types-

**Univariate****Multivariate**

**24. Define Hierarchical Clustering Algorithm.**

Hierarchical clustering algorithm is the algorithm that can combine and divide present groups, producing a hierarchical structure that displays the order in which groups are divided or combined.

**25. Explain K-mean Algorithm.**

K mean is a well-known segregating method. Objects are categorized as belonging to one of K groups. K-mean algorithm has spherical clusters i.e. the data points in a cluster are positioned around that cluster. Also, the variance/spread of the clusters is similar in K-mean algorithm.

**26. Define a Pivot Table.**

A Pivot Table is a most useful feature in Microsoft Excel which summarizes huge datasets within a fraction of second. It is simple to use as it requires dragging and dropping rows/columns headers to generate reports.

**27. List the different sections of a Pivot Table.**

The different sections of a Pivot table are as follows:

**Values Area****Rows Area****Column Area****Filter Area**

**28. Can you create a Pivot Table from multiple tables?**

Yes, you can make one Pivot Table from numerous other tables when there is a connection between these tables.

**29. List the key skills required for Data Analyst.**

Key skills required for a Data Analyst are-

**Database knowledge**

- Database management
- Data blending
- Querying
- Data manipulation

**Predictive Analytics**

- Basic descriptive statistics
- Predictive modeling
- Advanced analytics

**Big Data Knowledge**

- Big data analytics
- Unstructured data analysis
- Machine learning

**Presentation skill**

- Data visualization
- Insight presentation
- Report design

**30. What details are required from client before creating a dashboard?**

Clients should give the below details before creating a dashboard in Excel-

- Purpose of the Dashboards
- Different data sources
- Usage of the Excel Dashboard
- The version of Office the client uses

**31. Define Collaborative Filtering.**

Collaborative filtering is a simple process to make a recommendation system based on user behavioral data. The most important mechanisms of collaborative filtering are **users- items- interest**.

**32. List the tools used in Big Data.**

Data Analyst should be smart enough and should be technically sound to know all the tools used in Big data. Since it can be a repeatedly asked for Data Analyst interview questions and answers. Following are the tools used in Big Data-

- Hadoop
- Hive
- Pig
- Flume
- Mahout
- Sqoop

**33. Define KPI, design of experiments and 80/20 rule.**

**KPI**: It is an abbreviation for **Key Performance Indicator**. It is a metric that includes any combination of spreadsheets, reports or charts regarding business process

**Design of experiments**: It is the preliminary process used to split your data, sample and set up of a data for statistical analysis

**80/20 rules****:** It states that 80 percent of your income derives from 20 percent of your clients

**34. What is Normal Distribution?**

Normal Distribution is one of the most important and commonly used distributions in statistics. Usually known as the Bell Curve or Gaussian curve, Normal Distributions, measure how much values can vary in their means and in their standard deviations.

**35. Explain A/B Testing?**

A/B testing is the statistical theory testing for a randomized research with two variables A and B. It is also called as the split testing. It is an analytical method that evaluates population parameters based on sample statistics. This test associates two web pages by displaying two variants A and B, to a similar number of visitors, and the variant that gives improved conversion rate wins. The goal of A/B Testing is to find out if there are any changes to the web page.

**36. What is the statistical power of sensitivity?**

The statistical power of sensitivity is used to authenticate the precision of a classifier. This classifier can be any of the Logistic Regression, Support Vector Machine, Random Forest etc.

Sensitivity is the ratio of Forecasted True Events to Total Events. True Events are the events that are actually true and the model also forecasts them as true.

**37. Explain the term Alternative Hypothesis.**

To explain the term Alternative Hypothesis, you should know what the Null Hypothesis is. Null Hypothesis is a statistical phenomenon that is used to test for probable denial under the statement that result of chance would be true.

The Alternative Hypothesis is again a statistical phenomenon which is conflicting to the Null Hypothesis. Generally, it is measured that the observations are outcome of an effect with some casual of variation.

**38. State the difference between univariate, bivariate and multivariate analysis.**

The difference between univariate, bivariate and multivariate analysis is as follows:

**Univariate:**An expressive statistical method that can be distinguished depending on the count of variables involved at a given occurrence of time**Bivariate:**It is used to find the alteration between two variables at a time**Multivariate:**It is the study of more than two variables. This analysis is used to recognize the effect of variables on the comebacks

**39. Define Eigenvectors and Eigenvalues.**

**Eigenvectors:** Eigenvectors are fundamentally used to recognize linear transformations. These are measured for a correlation or a covariance matrix. Eigenvectors are the instructions along which a specific linear transformation acts either by flipping, compressing or stretching.

**Eigenvalue: **Eigenvalues can be referred to as the strength of the transformation or the aspect by which the compression occurs in the direction of eigenvectors.

**40. What is Map Reduce?**

Map-reduce is a framework to develop large data sets, dividing them into subsets, processing each subset on a different server and then combining results gained on each.

**41. Explain Clustering** **and list the properties for clustering algorithms.**

Clustering is a classification technique that is applied to data. Clustering algorithm splits a data set into natural groups or clusters.

Below are the properties for clustering algorithm-

- Hierarchical or flat
- Iterative
- Hard and soft
- Disjunctive

**42. List the statistical methods that are useful for Data-Analyst.**

Following are the statistical methods that are useful for Data Analyst-

- Bayesian method
- Markov process
- Spatial and cluster processes
- Rank statistics, percentile, outlier’s detection
- Imputation techniques, etc.
- Simplex algorithm
- Mathematical optimization

**43. What do you understand by Series Analysis?**

**Time series analysis** can be carried out in two fields, **frequency and time**. In Time series analysis the productivity of a specific process can be estimated by evaluating the preceding data by the help of several methods like exponential smoothening, log-linear regression method, etc.

**44. What do you mean by Correlogram Analysis?**

A correlogram analysis is the method of spatial analysis in geography. It contains of a series of predictable autocorrelation coefficients measured for a different spatial relationship. It can be used to build a correlogram for distance-based data, when the raw data is stated as distance rather than values at individual points.

**45. Explain different types of Hypothesis Testing.**

Following are the different types of hypothesis testing:

**T-test:**T-test is used when the standard deviation is indefinite and the sample size is comparatively small**Chi-Square Test for Independence:**These tests are used to find out the significance of the association between categorical variables in the population sample**Analysis of Variance (ANOVA):**This kind of hypothesis testing is used to analyze differences between the means in various groups. This test is often used similarly to a T-test but, is used for more than two groups**Welch’s T-test:**This test is used to find out the test for equality of means between two population samples

**46. State the difference between variance and covariance.**

Variance and Covariance are two mathematical terms which are used frequently in statistics. Variance basically refers to how apart numbers are in relation to the mean. Covariance, on the other hand, refers to how two random variables will change together. This is basically used to calculate the correlation between variables.

In case you have attended any Data Analytics interview in the recent past, do paste those interview questions in the comments section and we’ll answer them ASAP. You can also comment below if you have any questions in your mind, which you might have faced in your Data Analytics interview.

**47. Explain interleaving in SAS.**

Interleaving in SAS means combining individual sorted SAS data sets into one sorted data set. You can interleave data sets using a SET statement along with a BY statement.

**48. Define Hash Table.**

In calculating, a hash table is a map of keys to standards. It is a data structure used to implement an associative array. It uses a hash function to calculate an index into an array of slots, from which chosen value can be fetched.

**49. How can you avoid hash table collisions?**

A hash table collision occurs when two dissimilar keys hash to the same value. Two data cannot be preserved in the same slot in array.

Below techniques can avoid hash table collision-

**Separate Chaining**:

It uses the data structure to preserve multiple items that hash to the similar slot

**Open addressing**:

It finds for another slot using a second function and preserve item in first empty slot that is found

**50. Which imputation method is preferred more?**

Although single imputation is widely used, it does not reflect the uncertainty created by missing data at random. So, multiple imputation is more favorable then single imputation in case of data missing at random.

**51. Explain the functioning of PROC SQL. **

PROC SQL is nothing but a simultaneous process for all the observations. The following steps occur when a **PROC SQL** gets executed:

- SAS scans each and every statement in the SQL procedure and checks the syntax errors.
- The SQL optimizer detects the query into the statement. So, the SQL optimizer basically decides how the SQL query should be executed in order to minimize the runtime.
- If there are any tables in the FROM statement, then they are loaded into the data engine where they can then be accessed in the memory.
- Codes and Calculations are executed.
- The Final Table is created in the memory.
- The Final Table is sent to the output table described in the SQL statement.

**52. What is the default port for SQL?**

The default TCP port assigned by the official Internet Number Authority (IANA) for SQL server is 1433.

**Hierarchical DBMS:**As the name suggests, this type of DBMS has a style of predecessor-successor type of relationship. So, it has a structure similar to that of a tree, wherein the nodes represent records and the branches of the tree represent fields.**Relational DBMS (RDBMS):**This type of DBMS, uses a structure that allows the users to identify and access data*in relation*to another piece of data in the database.**Network DBMS:**This type of DBMS supports many to many relations wherein multiple member records can be linked.**Object-oriented DBMS:**This type of DBMS uses small individual software called objects. Each object contains a piece of data and the instructions for the actions to be done with the data.

**53. What is Normalization?**

**Normalization** is the process of organizing data to avoid duplication and redundancy. There are many successive levels of normalization. These are called **normal forms**. Each consecutive normal form depends on the previous one. The first three normal forms are usually adequate.

**First Normal Form (1NF)**– Here, every non-key column value is dependent on the whole primary key**Second Normal Form (2NF)**– It is reliant uniquely on the primary key and no other non-key column value**Third Normal Form (3NF)****Boyce- Codd Normal Form (BCNF)***– BCNF is the advanced version of 3NF. A table is said to be in BCNF if it is 3NF and for every X ->Y, relation X should be the super key of the table*

**54. State the different types of Joins.**

The various types of joins used to retrieve data between tables are** Inner Join**, **Left Join**, **Right Join** and **Full Outer Join**. Refer to the image on the right side.

**Inner join:** Inner Join in MySQL is the most common type of join. It is used to get back all the rows from various tables where the join condition is fulfilled.

**Left Join:**Left Join in MySQL is used to get back all the rows from the left table, but only the identical rows from the right table where the join condition is satisfied.**Right Join:**Right Join in MySQL is used to return all the rows from the right table, but only the identical rows from the left table where the join condition is satisfied.**Full Join:**Full join gets back all the records when there is a similarity in any of the tables.

**55. Explain NVL and NVL2 functions in SQL.**

NVL(exp1, exp2) and NVL2(exp1, exp2, exp3) are functions which check whether the value of exp1 is null or not.

If we use NVL(exp1,exp2) function, then if exp1 is not null, then the value of exp1 will be returned; else the value of exp2 will be returned. But, exp2 must be of the same data type of exp1.

Similarly, if we use NVL2(exp1, exp2, exp3) function, then if exp1 is not null, exp2 will be returned, else the value of exp3 will be returned.

**56. How can you create stories in Tableau?**

Stories are used to narrate a sequence of events or make a business use-case. The Tableau Dashboard provides various options to create a story. All the story points can be based on a dissimilar view or dashboard, or the entire story can be based on the same conception.

**To create a story in Tableau you can follow the below steps:**

- Click the
**New Story tab**. - In the
*lower-left corner*of the screen, choose a size for your story. Choose from one of the predefined sizes, or set a custom size, in pixels. - To
**edit**, double-click on the title. You can also change your title’s font, color, and alignment. Click Apply to view your changes. - To start building your story,
*drag a sheet*from the**Story tab**on the left and drop it into the**center of the view.** - Click
**Add**a caption to**summarize**the story point. - To highlight the main idea, you can modify a filter or sort on a field in the view, then
**save**by clicking**Update**above the navigator box.

**57. State the major difference between joining and blending in Tableau.**

The **Joining** term is used when you are combining data from the same source, for example, worksheet in an Excel file or tables in an Oracle database. While **blending** needs two entirely defined data sources in your report.

**58. Define aggregation and disaggregation of data.**

**Aggregation of data:** Aggregation of data refers to the process of viewing numeric values or the measures at a higher and more summarized level of data. When you place a measure on a shelf, Tableau will automatically aggregate your data. You can determine whether the aggregation has been applied to a field or not, by simply looking at the function. This is because the function always appears in front of the field’s name when it is placed on a shelf.

**Disaggregation of data:** Disaggregation of data allows you to view every row of the data source which can be useful while analyzing measures.

**59. What is n-gram?**

An n-gram is an adjoining sequence of n items from a given order of text or speech. It is a kind of probabilistic language model for forecasting the next item in such an order in the form of a (n-1).

**60. List the criteria for a good data model.**

Criteria for a good data model includes

- It can be easily consumed
- Large data varying in a good model should be scalable
- It should provide predictable performance
- A good model can adjust to changes in requirements