How do I Purge Data Efficiently ?

Purging Data from a table or set of tables is a common practice and, in most of the cases, is done for Performance reason. “How do I efficiently purge the data ?” is a question normally asked to minimize the time it takes to purge the data. While there are various ways to purge data from a table, the time it takes to purge, depends on various factors, like :

  1. Strategy Selected
  2. Amount of Data to be Purged

In cases, where Purging is a frequency based regular activity, then Partitioning is an efficient solution, but then Performance needs to be evaulated. In most of the cases, implementing Partitioning for maintenance needs some compromise on Performance, as both cannot be achieved together. Moreover, in my opinion, Partitioning in an OLTP system will not yield much of the performance benefit as these queries are very lightweight and fetch very less data.

One time Purging, for performance, requires getting rid of data that is not accessed and therefore, can be safely removed. For this one time activity, partitioning for purging is obviously not a solution. In this case, one of the following strategy is commnly used and each of these have their own characteristics :

  1. Normal DELETE
  2. CTAS or INSERT INTO

While “Normal DELETE” is to be used for a very less %age of data, creating a new table with the data to be retained seems to be a best strategy for larger volumes. Again, the size of the table should also be taken into consideration before opting for any strategy. I will refrain from specifying any percentages (of data to be purged) to come up with a strategy, on the basis, that, there is no thumb rule to be considered for the same. The deciding factor has to be :

  1. Your own Knowledge about the Object
  2. Time and Resource it takes to Delete the Data
  3. Whether Normal Delete requires any Object (Table & Indexes) Re-org ?
  4. Time and Resouce it takes to Create a New Table and its corresponding Indexes.

Recently, one of my customer called upon to sought my opinion on a scheduled Purging Activity and to check the feasibility of optimization in the planned strategy.

SCENARIO
Data to be purged from a 50 GB partitioned table. The table is range partitioned, with each partition holding one months data. The partitions are from December 2004 till July 2009 and the data is to be purged, based on a WHERE condition, till March 2008. All the data from April 2008 till now is to be retained. So, of the total 68 Million rows till march 2008, total 54 Million rows have to be deleted. Total Number of rows in the table, as of current date, are 138 Million.

Assuming this to be a simple purging, the application team came up with the strategy of using NORMAL Delete, followed by a re-org. The Database size was not huge and the Server Capacity was 4 CPU Machine with 16 GB RAM. The delete statement to be used and the planned strategy for this activity was :

delete /*+ parallel(a,16) */ from itxn a
where	a.trdt <= to_date('31-Mar-2008 23:59:59','DD-Mon-YYYY HH24:MI:SS')
and	a.chid = 'TM';
  1. Drop all the Indexes : To Speed up the Deletion
  2. Delete from the Table : Expected to take 90 Minutes
  3. Table Re-Org : 30 Minutes
  4. Index Creation : 60 Minutes

The Total Downtime, as expected from the above timelines, is approx. 3 Hours. Based on the strategy, I could see wastage of resources (and time) and duplication of some steps. Eliminating this duplication, as explained in this blog, would optimize the purging process and accomplish the task in minimal downtime. For me, 90 Minutes for Deletion is Redundant and needs to be eliminated. Another step, that seems to be redundant is the Index Creation, but to speed up the purging, I thought of including this step and perform the activity in less than 1 hour.

OPTIMIZATION
The Critical Statistics for optimizing this purging process, as mentioned in the Scenario section above are :

  1. We have a table with Monthly Partitions
  2. Total Partitions : 56 (Dec 2004 to Jul 2009)
  3. Purging from 40 Partitions (Dec 2004 to March 2008)
  4. 16 Partitions to be untouched (April 2008 to Jul 2009)
  5. Total Rows in the Table 138 Million and to Purge 54 Million (39%)
  6. Out of 138 Million Rows, 68 Million rows are from Not-to-be-touched Partitions (49%)

Based on these Inputs, I had following two options, and the one that scores over another is the second approach.

APPROACH 1

  1. Create a New Partitioned Table with the data to be retained.
  2. Rename the OLD Table and all its Indexes
  3. Rename the New Table to ORIGINAL
  4. Create all necessary Indexes on the new Table.

APPROACH 2

  1. Create a New Non-Partitioned Table.
  2. Drop all the indexes from the to-be-purged table. (optional but was done)
  3. Insert the data to be retained from one partition at a time into the New Non-Partitioned Table.
  4. Exchange Partition.
  5. Create all necessary Indexes on the to-be-purged table.

The advantage of Approach 2 over Approach 1 is that 49% of the rows from not-to-be-touched partitions will not be accessed and therefore, this 2nd approach will take half the time as compared to Approach 1 as it access all the rows. Once this was finalized, it was just the matter of automating step 3 and 4 from approach 2. Automating eliminated any errors due to manual intervention. The automated pl/sql block is posted below :

create or replace procedure purge_process
as
cursor main is
select * from purge_job
where completed_yn is null
order by 1;
k_statement varchar2(30);
l_statement varchar2(1000);
l_count number;
begin
/* Get the total Number of Partitions to be accessed */
    select partition_position into p_count from dba_tab_partitions
    where table_owner=’GPI’
    and table_name=’ITXN’
    and partition_name=’ITXN_PART_MAR2008′;
    execute immediate ‘truncate table purge_job’;
    insert into ipgora.purge_job select partition_name, null from dba_tab_partitions
    where table_owner=’GIP’
    and table_name=’ITXN’
    and partition_position<=p_count;

 commit;

 dbms_application_info.set_module(module_name=>’Purging of ATM’,action_name=>’Job Table’);
    execute immediate ‘alter session set db_file_multiblock_read_count=128′;
    for i in main
      loop
       begin
         k_statement:=i.partition_name;
        dbms_application_info.set_module(module_name=>’Purging of TM’,action_name=>k_statement);
        execute immediate ‘truncate table ITXN_NEW’;
         /* Insert Rows to be Retained into Temp Table. Work on One Partition at a time. */
         /* Criteria ((Date<= ’31-Mar-2008′) and chid!=’TM’) To be Retained*/
         l_statement:=’insert /*+ append */ into ITXN_NEW select /*+ parallel(a,8) */ * from ITXN partition (‘||i.partition_name||’) a where (a.trdt     <= to_date(‘||””||’31-Mar-2008 23:59:59’||””||’,’||””||’DD-Mon-YYYY HH24:MI:SS’||””||’) and chid!=’||””||’TM’||””||’) OR (a.trdt > to_date(‘||””||’31-Mar-2008 23:59:59’||””||’,’||””||’DD-Mon-YYYY HH24:MI:SS’||””||’)))’;
        execute immediate l_statement;
        commit;
        l_statement:=’alter table itxn exchange partition ‘||i.partition_name||’ with table ITXN_NEW’;
        execute immediate l_statement;
        commit;
        update purge_job set completed_yn=’Y’ where partition_name=i.partition_name;
        commit;
        exception when others then
        update purge_job set completed_yn=’F’ where partition_name=i.partition_name;
        commit;
     end;
  end loop;
end;

On the production, the purging took 18 minutes and the Index creation took 45 Minutes. Therefore, the total downtime for the activity was approximately 1 hour. The steps involved, as per the procedure created were :

  1. Create a purge_job table for monitoring the progress.
  2. Get the Partition_position from dba_tab_partitions for the Last Partition to be considered. In this case, March 2008.
  3. Insert partition_name and NULL into purge_job table.
  4. Truncate the Temporary Table created for this purpose.
  5. In a Loop, get one partition at a time and insert the rows to be retained into Temporary Table.
  6. Exchange the current partition (from where the rows are fetched) with Temporary table. At this stage, the Temporary table will contain all the rows from the current partition and the current partition will contain the rows to be retained.
  7. Update the Purge_job table with Completed_YN status as ‘Y’. This marks this current partition as completed.
  8. Proceed with next partitions…continue till the March 2008 Partition as fetched into P_COUNT variable.

With this approach, the goal (purging) was accomplished in an efficient manner.

For unknown reasons, I am not able to access my previous blogspot url http://viveklsharma.blogspot.com. I have a blog, posted in March 2008, where I used ROWID Parallelism to accomplish a challenging purging from a 400 GB table in less than 3-4 hours. This was a non-partitioned table and based on the available resources, rowid parallelism (as introduced by Thomas Kyte) was effectively used.

Advertisements

About Vivek Sharma
I am an Oracle Professional from Mumbai (India). I blog on the issues that I feel is Interesting for my readers. Some of these are my real life examples, which I hope, you would find interesting. Comments are always a welcome. The Technical Observations & Views here are my own and not necessarily those of Oracle or its affiliates. These are purely based on my understandings, learnings and resolutions of various customer issues.

4 Responses to How do I Purge Data Efficiently ?

  1. Vikram More says:

    Hi Vivek ,
    Hope you are doing great.

    I was just going through various purging strategies one can come across , and managed to visit this site of yours.

    Glad that i came across this site 🙂

    Would want to know your latest updates too.

    Is there any site Or blog you update regularly so that i can be in sync 🙂

    Have a gr8 time

    Rgds
    Vikram More

    Like

  2. Pingback: The Urge to Purge | The Data Roundtable

  3. Scott says:

    It annoys me that people don’t research properly. Why didn’t you do this http://www.dba-oracle.com/t_oracle_drop_partition.htm

    Like

    • Vivek Sharma says:

      Scott, this is not a Partition Drop kind of a Purging. If you see the DELETE statement written by the Application Team, there is a WHERE condition CHID=’TM’. This means, not all the rows are to be purged from a partition. This is a conditional purging. Partition DROP will drop the entire partition.

      Hope this clarifies.

      Regards
      Vivek

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s