Sqoop Interview Questions and Answers For Freshers Part-2
–exclude-tables table1,table2 ..
This will import all the tables except the ones mentioned in the exclude-tables clause.
12.When the source data keeps getting updated frequently, what is the approach to keep it in sync with the data in HDFS imported by sqoop?
sqoop can have 2 approaches.
a − To use the –incremental parameter with append option where value of some columns are checked and only in case of modified values the row is imported as a new row.
b − To use the –incremental parameter with lastmodified option where a date column in the source is checked for records which have been updated after the last import.
13.What is the usefulness of the options file in sqoop.
The options file is used in sqoop to specify the command line values in a file and use it in the sqoop commands.
For example the –connect parameter’s value and –user name value scan be stored in a file and used again and again with different sqoop commands.
14.Is it possible to add a parameter while running a saved job?
Yes, we can add an argument to a saved job at runtime by using the –exec option
sqoop job –exec jobname — — newparameter
15.How do you fetch data which is the result of join between two tables?
By using the –query parameter in place of –table parameter we can specify a sql query. The result of the query will be imported
16.How can we slice the data to be imported to multiple parallel tasks?
Using the –split-by parameter we specify the column name based on which sqoop will divide the data to be imported into multiple chunks to be run in parallel.
17.How can you choose a name for the mapreduce job which is created on submitting a free-form query import?
By using the –mapreduce-job-name parameter. Below is a example of the command.
sqoop import \
–connect jdbc:mysql://mysql.example.com/sqoop \
–username sqoop \
–password sqoop \
–query ‘SELECT normcities.id, \
FROM normcities \
JOIN countries USING(country_id) \
WHERE $CONDITIONS’ \
–split-by id \
–target-dir cities \
18.Before starting the data transfer using mapreduce job, sqoop takes a long time to retrieve the minimum and maximum values of columns mentioned in –split-by parameter. How can we make it efficient?
We can use the –boundary –query parameter in which we specify the min and max value for the column based on which the split can happen into multiple mapreduce tasks. This makes it faster as the query inside the –boundary-query parameter is executed first and the job is ready with the information on how many mapreduce tasks to create before executing the main query.
19.What is the difference between the parameters sqoop.export.records.per.statement and sqoop.export.statements.per.transaction
The parameter “sqoop.export.records.per.statement” specifies the number of records that will be used in each insert statement.
But the parameter “sqoop.export.statements.per.transaction” specifies how many insert statements can be processed parallel during a transaction.
20.How will you implement all-or-nothing load using sqoop?
Using the staging-table option we first load the data into a staging table and then load it to the final target table only if the staging load is successful.