Create Table like External – Hive
February 2, 2017 1 Comment
There is no end to the Technical Learning. I have been working on Oracle Database Technologies for almost 15 years and thought of learning something new. This thirst got me into Big Data and Hadoop. I started a kind of a self-learning and believe me, I found it very interesting. This blog is about one of my encounter during a practice session on Hive. Usually, I blog to help my readers to know about my experiences. This is again on the same line, but will need resolution or validation from experts across the globe.
My understanding on External and Internal Table is as under:
- Internal Tables store the data populated into a directory specified under metastore.warehouse.dir or if the location is explicitly specified during table creation. Once the table is dropped, the underlying data is removed as well. This gives you full control over the data. I can relate this to Oracle Tables with a difference that the data is stored inside the database and it is a logical structure.
- External Tables do not store data but points to the data. Hive doesn’t have the control over the data as it is shared by other tools like Pig etc. Dropping of External table does not remove the data from the storage. This is similar to the External Tables of Oracle, where we create the structure of the table similar to the format of the txt or csv file.
I created an Internal Table and could observe the behavior explained above. Dropping a table removes the data as well. External table, on the other hand, does not remove it and therefore, my understanding looks fine here. Further, I was also working on a CREATE TABLE syntax and the understanding here was :
- CREATE TABLE will create an Internal Table as this is the default.
- For External Table, we need to specify CREATE EXTERNAL TABLE command
- However, CREATE TABLE table_name like external_table_name will create an External table as I am creating a Table from an External Table.
The first 2 bullet points are fine. The problem was with the third bullet point. The outcome is as under :
## let me use the database that I created for my own practice use vivek; drop table emp; create table emp ( empno int comment 'This is Employee Number', ename string comment 'Employee Name', country string, city string, zipcode int) row format delimited fields terminated by ','; load data local inpath '/home/cloudera/emp.txt' into table emp; hive (vivek)> select * from emp limit 5; OK emp.empno emp.ename emp.country emp.city emp.zipcode 3980 Lancaster USA California 118718 3981 Fort Collins USA Colorado 118652 3982 Coral Springs USA Florida 117549 3983 Stamford USA Connecticut 117083 3984 Thousand Oaks USA California 117005 ## Describing the table. See the Bold and underlined text hive (vivek)> describe formatted emp; OK col_name data_type comment # col_name data_type comment empno int This is Employee Number ename string Employee Name country string city string zipcode int # Detailed Table Information Database: vivek Owner: cloudera CreateTime: Wed Feb 01 22:22:40 PST 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/vivek.db/emp Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATE true numFiles 1 totalSize 3501 transient_lastDdlTime 1486016570 Time taken: 0.255 seconds, Fetched: 34 row(s) ## Now, lets create an External Table. The data for which is already copied onto Hadoop create external table emp_ext ( empno int comment 'This is Employee Number', ename string comment 'Employee Name', country string, city string, zipcode int) row format delimited fields terminated by ',' location '/user/cloudera/emp'; hive (vivek)> select * from emp_ext limit 5; OK emp_ext.empno emp_ext.ename emp_ext.country emp_ext.city emp_ext.zipcode 3980 Lancaster USA California 118718 3981 Fort Collins USA Colorado 118652 3982 Coral Springs USA Florida 117549 3983 Stamford USA Connecticut 117083 3984 Thousand Oaks USA California 117005 Time taken: 0.116 seconds, Fetched: 5 row(s) ## Describing the table. See the Bold and underlined text hive (vivek)> describe formatted emp_ext; OK col_name data_type comment # col_name data_type comment empno int This is Employee Number ename string Employee Name country string city string zipcode int # Detailed Table Information Database: vivek Owner: cloudera CreateTime: Wed Feb 01 22:31:25 PST 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://quickstart.cloudera:8020/user/cloudera/emp Table Type: EXTERNAL_TABLE Table Parameters: COLUMN_STATS_ACCURATE false EXTERNAL TRUE numFiles 0 numRows -1 rawDataSize -1 totalSize 0 transient_lastDdlTime 1486017085 Time taken: 0.216 seconds, Fetched: 37 row(s)
Both the tables contain same data with the difference that EMP is Internal Table and EMP_EXT is an External Table. Internal Table created a directory into HDFS as can be seen from the DESCRIBE FORMATTED OUTPUT for EMP. The directory is /user/hive/warehouse/vivek.db/emp. Dropping EMP will remove this data as well, which is tried and tested. Next, I will create another table from EMP_EXT and will not specify INTERNAL or EXTERNAL keyword.
hive (vivek)> create table emp_ext1 like emp_ext location '/user/cloudera/emp'; OK Time taken: 0.21 seconds hive (vivek)> select * from emp_ext1 limit 5; OK emp_ext1.empno emp_ext1.ename emp_ext1.country emp_ext1.city emp_ext1.zipcode 3980 Lancaster USA California 118718 3981 Fort Collins USA Colorado 118652 3982 Coral Springs USA Florida 117549 3983 Stamford USA Connecticut 117083 3984 Thousand Oaks USA California 117005 Time taken: 0.107 seconds, Fetched: 5 row(s) ## Next, Describe the new table. See the bold and underlined text hive (vivek)> describe formatted emp_ext1; OK col_name data_type comment # col_name data_type comment empno int This is Employee Number ename string Employee Name country string city string zipcode int # Detailed Table Information Database: vivek Owner: cloudera CreateTime: Wed Feb 01 22:41:23 PST 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://quickstart.cloudera:8020/user/cloudera/emp Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATE false numFiles 0 numRows -1 rawDataSize -1 totalSize 0 transient_lastDdlTime 1486017683 Time taken: 0.275 seconds, Fetched: 36 row(s)
The table metadata says that its an Internal Table also referred as MANAGED Tables. My understanding that creating a table from an external table will implicitly create an External table looks to be wrong here. Now, in this case, the location is /user/cloudera/emp which is also shared by EMP_EXT table. I dropped the EMP_EXT1 table to check the impact of this on EMP_EXT.
hive (vivek)> drop table emp_ext1; OK Time taken: 0.482 seconds hive (vivek)> show tables; OK tab_name emp emp_ext Time taken: 0.097 seconds, Fetched: 2 row(s) hive (vivek)> select * from emp_ext limit 5; OK emp_ext.empno emp_ext.ename emp_ext.country emp_ext.city emp_ext.zipcode Time taken: 0.122 seconds
The drop command removed the data from HDFS and querying EMP_EXT fetched no results. This means, even if you create a table from an External table, without specifying EXTERNAL keyword for the CREATE TABLE command, it will create an INTERNAL TABLE. I reconstructed the data again i.e. dropped EMP_EXT and EMP_EXT1.
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/emp Found 1 items -rw-r--r-- 1 cloudera cloudera 3501 2017-02-01 22:49 /user/cloudera/emp/emp.txt use vivek; hive (vivek)> create external table emp_ext ( > empno int comment 'This is Employee Number', > ename string comment 'Employee Name', > country string, > city string, > zipcode int) > row format delimited fields terminated by ',' > location '/user/cloudera/emp'; OK Time taken: 0.601 seconds hive (vivek)> select * from emp_ext limit 5; OK emp_ext.empno emp_ext.ename emp_ext.country emp_ext.city emp_ext.zipcode 3980 Lancaster USA California 118718 3981 Fort Collins USA Colorado 118652 3982 Coral Springs USA Florida 117549 3983 Stamford USA Connecticut 117083 3984 Thousand Oaks USA California 117005 Time taken: 0.698 seconds, Fetched: 5 row(s) hive (vivek)> describe formatted emp_ext; OK col_name data_type comment # col_name data_type comment empno int This is Employee Number ename string Employee Name country string city string zipcode int # Detailed Table Information Database: vivek Owner: cloudera CreateTime: Wed Feb 01 22:51:46 PST 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://quickstart.cloudera:8020/user/cloudera/emp Table Type: EXTERNAL_TABLE Table Parameters: COLUMN_STATS_ACCURATE false EXTERNAL TRUE numFiles 0 numRows -1 rawDataSize -1 totalSize 0 transient_lastDdlTime 1486018306 Time taken: 0.287 seconds, Fetched: 37 row(s) ## Will Create EMP_EXT1 hive (vivek)> create table emp_ext1 like emp_ext location '/user/cloudera/emp'; OK Time taken: 0.155 seconds hive (vivek)> select * from emp_ext1 limit 5; OK emp_ext1.empno emp_ext1.ename emp_ext1.country emp_ext1.city emp_ext1.zipcode 3980 Lancaster USA California 118718 3981 Fort Collins USA Colorado 118652 3982 Coral Springs USA Florida 117549 3983 Stamford USA Connecticut 117083 3984 Thousand Oaks USA California 117005 Time taken: 0.131 seconds, Fetched: 5 row(s) hive (vivek)> describe formatted emp_ext1; OK col_name data_type comment # col_name data_type comment empno int This is Employee Number ename string Employee Name country string city string zipcode int # Detailed Table Information Database: vivek Owner: cloudera CreateTime: Wed Feb 01 22:53:31 PST 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://quickstart.cloudera:8020/user/cloudera/emp Table Type: MANAGED_TABLE Table Parameters: COLUMN_STATS_ACCURATE false numFiles 0 numRows -1 rawDataSize -1 totalSize 0 transient_lastDdlTime 1486018411 Time taken: 0.258 seconds, Fetched: 36 row(s)
The output is same as what it was during our previous execution. I am investigating it further whether my understanding “CREATE TABLE table_name like external_table_name will create an External table as I am creating a Table from an External Table” is wrong or whether it was valid for the earlier versions of Hive. Just for curiosity, I made following changes and Wow..dropping the table didn’t remove the underlying data from HDFS.
hive (vivek)> alter table emp_ext1 set TBLPROPERTIES('table type'='EXTERNAL_TABLE'); OK Time taken: 0.375 seconds hive (vivek)> alter table emp_ext1 set TBLPROPERTIES('EXTERNAL'='TRUE'); OK Time taken: 0.268 seconds hive (vivek)> describe formatted emp_ext1; OK col_name data_type comment # col_name data_type comment empno int This is Employee Number ename string Employee Name country string city string zipcode int # Detailed Table Information Database: vivek Owner: cloudera CreateTime: Wed Feb 01 22:53:31 PST 2017 LastAccessTime: UNKNOWN Protect Mode: None Retention: 0 Location: hdfs://quickstart.cloudera:8020/user/cloudera/emp Table Type: EXTERNAL_TABLE Table Parameters: COLUMN_STATS_ACCURATE false EXTERNAL TRUE last_modified_by cloudera last_modified_time 1486019158 numFiles 0 numRows -1 rawDataSize -1 table type EXTERNAL_TABLE totalSize 0 transient_lastDdlTime 1486019158 Time taken: 0.206 seconds, Fetched: 40 row(s) hive (vivek)> select * from emp_ext1 limit 5; OK emp_ext1.empno emp_ext1.ename emp_ext1.country emp_ext1.city emp_ext1.zipcode 3980 Lancaster USA California 118718 3981 Fort Collins USA Colorado 118652 3982 Coral Springs USA Florida 117549 3983 Stamford USA Connecticut 117083 3984 Thousand Oaks USA California 117005 Time taken: 0.157 seconds, Fetched: 5 row(s) hive (vivek)> drop table emp_ext1; OK Time taken: 0.207 seconds hive (vivek)> select * from emp_ext limit 5; OK emp_ext.empno emp_ext.ename emp_ext.country emp_ext.city emp_ext.zipcode 3980 Lancaster USA California 118718 3981 Fort Collins USA Colorado 118652 3982 Coral Springs USA Florida 117549 3983 Stamford USA Connecticut 117083 3984 Thousand Oaks USA California 117005 Time taken: 0.159 seconds, Fetched: 5 row(s)
The ALTER TABLE command changed the properties of the Table from Managed to External and droping this table didn’t drop tbe underlying data from HDFS, which was evident from the fact that I could query from EMP_EXT.
Looking for the answer to my query – CREATE TABLE table_name LIKE external_table – Will it create an Internal Table or an External Table ?