Create Table like External – Hive

There is no end to the Technical Learning. I have been working on Oracle Database Technologies for almost 15 years and thought of learning something new. This thirst got me into Big Data and Hadoop. I started a kind of a self-learning and believe me, I found it very interesting. This blog is about one of my encounter during a practice session on Hive. Usually, I blog to help my readers to know about my experiences. This is again on the same line, but will need resolution or validation from experts across the globe.

My understanding on External and Internal Table is as under:

  • Internal Tables store the data populated into a directory specified under metastore.warehouse.dir or if the location is explicitly specified during table creation. Once the table is dropped, the underlying data is removed as well. This gives you full control over the data. I can relate this to Oracle Tables with a difference that the data is stored inside the database and it is a logical structure.
  • External Tables do not store data but points to the data. Hive doesn’t have the control over the data as it is shared by other tools like Pig etc. Dropping of External table does not ¬†remove the data from the storage. This is similar to the External Tables of Oracle, where we create the structure of the table similar to the format of the txt or csv file.

I created an Internal Table and could observe the behavior explained above. Dropping a table removes the data as well. External table, on the other hand, does not remove it and therefore, my understanding looks fine here. Further, I was also working on a CREATE TABLE syntax and the understanding here was :

  • CREATE TABLE will create an Internal Table as this is the default.
  • For External Table, we need to specify CREATE EXTERNAL TABLE command
  • However, CREATE TABLE table_name like external_table_name will create an External table as I am creating a Table from an External Table.

 

The first 2 bullet points are fine. The problem was with the third bullet point. The outcome is as under :

## let me use the database that I created for my own practice
use vivek;

drop table emp;

create table emp (
empno int comment 'This is Employee Number',
ename string comment 'Employee Name',
country string,
city string,
zipcode int)
row format delimited fields terminated by ',';

load data local inpath '/home/cloudera/emp.txt' into table emp;

hive (vivek)> select * from emp limit 5;
OK
emp.empno       emp.ename       emp.country     emp.city        emp.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005

## Describing the table. See the Bold and underlined text

hive (vivek)> describe formatted emp;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:22:40 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/hive/warehouse/vivek.db/emp
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   true
        numFiles                1
        totalSize               3501
        transient_lastDdlTime   1486016570

Time taken: 0.255 seconds, Fetched: 34 row(s)

## Now, lets create an External Table. The data for which is already copied onto Hadoop

create external table emp_ext (
empno int comment 'This is Employee Number',
ename string comment 'Employee Name',
country string,
city string,
zipcode int)
row format delimited fields terminated by ','
location '/user/cloudera/emp';

hive (vivek)> select * from emp_ext limit 5;
OK
emp_ext.empno   emp_ext.ename   emp_ext.country emp_ext.city    emp_ext.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.116 seconds, Fetched: 5 row(s)

## Describing the table. See the Bold and underlined text

hive (vivek)> describe formatted emp_ext;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:31:25 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             EXTERNAL_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        EXTERNAL                TRUE
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        totalSize               0
        transient_lastDdlTime   1486017085

Time taken: 0.216 seconds, Fetched: 37 row(s)

Both the tables contain same data with the difference that EMP is Internal Table and EMP_EXT is an External Table. Internal Table created a directory into HDFS as can be seen from the DESCRIBE FORMATTED OUTPUT for EMP. The directory is /user/hive/warehouse/vivek.db/emp. Dropping EMP will remove this data as well, which is tried and tested. Next, I will create another table from EMP_EXT and will not specify INTERNAL or EXTERNAL keyword.


hive (vivek)> create table emp_ext1 like emp_ext location '/user/cloudera/emp';
OK
Time taken: 0.21 seconds
hive (vivek)> select * from emp_ext1 limit 5;
OK
emp_ext1.empno  emp_ext1.ename  emp_ext1.country        emp_ext1.city   emp_ext1.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.107 seconds, Fetched: 5 row(s)

## Next, Describe the new table. See the bold and underlined text

hive (vivek)> describe formatted emp_ext1;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:41:23 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        totalSize               0
        transient_lastDdlTime   1486017683

Time taken: 0.275 seconds, Fetched: 36 row(s)

The table metadata says that its an Internal Table also referred as MANAGED Tables. My understanding that creating a table from an external table will implicitly create an External table looks to be wrong here. Now, in this case, the location is /user/cloudera/emp which is also shared by EMP_EXT table. I dropped the EMP_EXT1 table to check the impact of this on EMP_EXT.

hive (vivek)> drop table emp_ext1;
OK
Time taken: 0.482 seconds
hive (vivek)> show tables;
OK
tab_name
emp
emp_ext
Time taken: 0.097 seconds, Fetched: 2 row(s)
hive (vivek)> select * from emp_ext limit 5;
OK
emp_ext.empno   emp_ext.ename   emp_ext.country emp_ext.city    emp_ext.zipcode
Time taken: 0.122 seconds

The drop command removed the data from HDFS and querying EMP_EXT fetched no results. This means, even if you create a table from an External table, without specifying EXTERNAL keyword for the CREATE TABLE command, it will create an INTERNAL TABLE. I reconstructed the data again i.e. dropped EMP_EXT and EMP_EXT1.

[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/emp
Found 1 items
-rw-r--r--   1 cloudera cloudera       3501 2017-02-01 22:49 /user/cloudera/emp/emp.txt

use vivek;

hive (vivek)> create external table emp_ext (
            > empno int comment 'This is Employee Number',
            > ename string comment 'Employee Name',
            > country string,
            > city string,
            > zipcode int)
            > row format delimited fields terminated by ','
            > location '/user/cloudera/emp';
OK
Time taken: 0.601 seconds

hive (vivek)> select * from emp_ext limit 5;
OK
emp_ext.empno   emp_ext.ename   emp_ext.country emp_ext.city    emp_ext.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.698 seconds, Fetched: 5 row(s)

hive (vivek)> describe formatted emp_ext;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:51:46 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             EXTERNAL_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        EXTERNAL                TRUE
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        totalSize               0
        transient_lastDdlTime   1486018306

Time taken: 0.287 seconds, Fetched: 37 row(s)

## Will Create EMP_EXT1

hive (vivek)> create table emp_ext1 like emp_ext location '/user/cloudera/emp';
OK
Time taken: 0.155 seconds
hive (vivek)> select * from emp_ext1 limit 5;
OK
emp_ext1.empno  emp_ext1.ename  emp_ext1.country        emp_ext1.city   emp_ext1.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.131 seconds, Fetched: 5 row(s)

hive (vivek)> describe formatted emp_ext1;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:53:31 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        totalSize               0
        transient_lastDdlTime   1486018411

Time taken: 0.258 seconds, Fetched: 36 row(s)

The output is same as what it was during our previous execution. I am investigating it further whether my understanding “CREATE TABLE table_name like external_table_name will create an External table as I am creating a Table from an External Table” is wrong or whether it was valid for the earlier versions of Hive. Just for curiosity, I made following changes and Wow..dropping the table didn’t remove the underlying data from HDFS.

hive (vivek)> alter table emp_ext1 set TBLPROPERTIES('table type'='EXTERNAL_TABLE');
OK
Time taken: 0.375 seconds
hive (vivek)> alter table emp_ext1 set TBLPROPERTIES('EXTERNAL'='TRUE');
OK
Time taken: 0.268 seconds
hive (vivek)> describe formatted emp_ext1;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:53:31 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             EXTERNAL_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        EXTERNAL                TRUE
        last_modified_by        cloudera
        last_modified_time      1486019158
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        table type              EXTERNAL_TABLE
        totalSize               0
        transient_lastDdlTime   1486019158

Time taken: 0.206 seconds, Fetched: 40 row(s)
 
hive (vivek)> select * from emp_ext1 limit 5;
OK
emp_ext1.empno  emp_ext1.ename  emp_ext1.country        emp_ext1.city   emp_ext1.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.157 seconds, Fetched: 5 row(s)
hive (vivek)> drop table emp_ext1;
OK
Time taken: 0.207 seconds

hive (vivek)> select * from emp_ext limit 5;
OK
emp_ext.empno   emp_ext.ename   emp_ext.country emp_ext.city    emp_ext.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.159 seconds, Fetched: 5 row(s)

The ALTER TABLE command changed the properties of the Table from Managed to External and droping this table didn’t drop tbe underlying data from HDFS, which was evident from the fact that I could query from EMP_EXT.

Looking for the answer to my query – CREATE TABLE table_name LIKE external_table – Will it create an Internal Table or an External Table ?

%d bloggers like this: