
Hi readers, Job Basket brings you these updated interview questions. If you have any query about this technology or any other, we are here to help. If you need more questions or another set of interview questions, kindly comment below, and do share the link to these questions with your friends and colleagues too. Thanks!

  1. What is Big Data?

Big data is defined as the voluminous amount of structured, unstructured, or semi-structured data that has huge potential for mining but is so large that it cannot be processed using traditional database systems. Big data is characterized by its high velocity, volume, and variety, which require cost-effective and innovative methods of information processing to draw meaningful business insights. More than the volume of the data, it is the nature of the data that defines whether it is considered Big Data or not.

  2. What is the difference between Hadoop and a traditional RDBMS?

Hadoop vs RDBMS:

Criteria                  | Hadoop                                                               | RDBMS
Datatypes                 | Processes semi-structured and unstructured data.                    | Processes structured data.
Schema                    | Schema on read                                                       | Schema on write
Best fit for applications | Data discovery and massive storage/processing of unstructured data. | Best suited for OLTP and complex ACID transactions.
Speed                     | Writes are fast                                                      | Reads are fast

  3. What do the four V’s of Big Data denote?

Simple explanation of the four critical features of big data:
a) Volume – scale of data
b) Velocity – analysis of streaming data
c) Variety – different forms of data
d) Veracity – uncertainty of data

  4. Differentiate between structured and unstructured data.

Data that can be stored in traditional database systems in the form of rows and columns, for example online purchase transactions, is referred to as structured data. Data that can only be partially stored in traditional database systems, for example data in XML records, is referred to as semi-structured data. Unorganized and raw data that cannot be categorized as structured or semi-structured is referred to as unstructured data. Facebook updates, tweets on Twitter, reviews, web logs, etc. are all examples of unstructured data.

  5. On what concept does the Hadoop framework work?

The Hadoop framework works on the following two core components:

1) HDFS – Hadoop Distributed File System is the Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks, and it operates on a master-slave architecture.

2) Hadoop MapReduce – This is the Java-based programming paradigm of the Hadoop framework that provides scalability across various Hadoop clusters. MapReduce distributes the workload into various tasks that can run in parallel. Hadoop MapReduce jobs perform two separate tasks: the map job breaks down the data sets into key-value pairs or tuples, and the reduce job then takes the output of the map job and combines those tuples into a smaller set of tuples. The reduce job is always performed after the map job is executed.
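To make the map and reduce steps concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API. The class name and the input/output paths are illustrative placeholders, not part of the original answer.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map job: break each input line into (word, 1) key-value pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce job: combine the tuples emitted by the mappers into a smaller set of (word, total) pairs.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // placeholder HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // placeholder HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The mapper emits a (word, 1) pair for every token it sees, and the reducer sums those counts per word – exactly the map-then-reduce flow described above.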

  6. What are the main components of a Hadoop application?

Hadoop applications use a wide range of technologies that provide a great advantage in solving complex business problems.

The core components of a Hadoop application are:

1) Hadoop Common

2) HDFS

3) Hadoop MapReduce

4) YARN

Data Access Components are – Pig and Hive

Data Storage Component is – HBase

Data Integration Components are – Apache Flume, Sqoop, Chukwa

Data Management and Monitoring Components are – Ambari, Oozie and Zookeeper.

Data Serialization Components are – Thrift and Avro

Data Intelligence Components are – Apache Mahout and Drill.

  7. What is Hadoop Streaming?

The Hadoop distribution has a generic application programming interface for writing map and reduce jobs in any desired programming language like Python, Perl, Ruby, etc. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the mapper or reducer.
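As a rough sketch of how such a job might be launched: the script names mapper.py and reducer.py, the HDFS paths, and the exact location of the streaming jar are placeholders and will vary by installation.

```
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop/input \
  -output /user/hadoop/output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py \
  -file reducer.py
```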

  8. What are the most commonly defined input formats in Hadoop?

The most common Input Formats defined in Hadoop are:

  • Text Input Format – the default input format in Hadoop; it breaks files into lines, with each line treated as a record.
  • Key Value Input Format – used for plain text files where each line is split into a key and a value (by default at the first tab character).
  • Sequence File Input Format – used for reading files stored in Hadoop’s binary SequenceFile format, i.e. sequences of key-value pairs.
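To show where these come in, here is a minimal sketch of selecting an input format on a MapReduce job using the standard org.apache.hadoop.mapreduce API; the class name and input path are illustrative placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "input format example");
    job.setJarByClass(InputFormatExample.class);

    // TextInputFormat is the default, so nothing needs to be set to use it.
    // To read each line as a tab-separated key/value pair instead:
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // SequenceFileInputFormat.class would be set here for SequenceFile input.
    FileInputFormat.addInputPath(job, new Path(args[0])); // placeholder input path
  }
}
```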

  9. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.

NameNode: The NameNode is at the heart of the HDFS file system. It manages the metadata, i.e. the contents of the files are not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. The NameNode uses two files for the namespace:

fsimage file – keeps track of the latest checkpoint of the namespace.

edits file – a log of the changes that have been made to the namespace since the last checkpoint.

Checkpoint Node:

The Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode’s directory. It creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.

Backup Node:

The Backup Node also provides checkpointing functionality like the Checkpoint Node, but in addition it maintains an up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.
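For reference, the HDFS user guide describes dedicated NameNode startup options for these roles; the exact path to the hdfs script depends on the installation.

```
# Start a Checkpoint Node
bin/hdfs namenode -checkpoint

# Start a Backup Node
bin/hdfs namenode -backup
```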

  10. What is your favourite tool in the Hadoop ecosystem?

The answer to this question helps the interviewer learn which big data tools you are well-versed in and interested in working with. If you show an affinity towards a particular tool, the probability that you will be deployed to work on that tool is higher. If you say that you have good knowledge of all the popular big data tools like Pig, Hive, HBase, Sqoop and Flume, it shows that you have knowledge of the Hadoop ecosystem as a whole.