Things to Keep in Mind When Working with Data Migration Scripts

Data migrations and data fixes are very common in the industry, and we do them more often than we would like. Many different reasons can lead to a data fix/migration for our system. Most of the time, such changes are applied to a live database, and if we are not careful enough, we might cause more problems than we were aiming to fix.

The following are a few points to keep in mind when working with data migration scripts. Some points refer to Oracle Database specifically, but every database has a similar concept or process to achieve the same thing.

Data Back-Up

It is always advisable to take a backup of the rows you are going to impact, even if it is corrupt data. The backup comes in very handy if your fix does not work as expected, and it is always better to go back to a known issue rather than a new one.

Create an equivalent backup table based on the condition under which you want to apply your fix (i.e. the issue selection criteria).
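
For example, a minimal sketch using the hypothetical tableA/column2 names from the later examples; the ROWID is captured as well because it is useful later (see the ROWID section):

create table tableA_bkp as
  select t.rowid as row_id, t.*
    from tableA t
   where t.column2 = 'ABCD';   -- issue selection criteria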

Better to Iterate over the backup table

If you have taken a backup of the data you want to migrate, it is better to use the backup table in the selection criteria (i.e. while creating the cursor).

This makes sure you do not touch any rows you have not backed up. It is very likely that by the time your execution completes, new rows needing migration will have been added. As your backup table does not have those rows, you should not touch them. Iterating over the backup table helps you avoid such rows and minimize the impact. You can always re-run the script for the new rows after backing them up.

Use Cursor if possible

Cursors are great. A cursor is a pointer to a private SQL area that stores information about the processing of a SELECT or DML statement. When you want to migrate data, create a cursor to hold that data. It becomes very handy when you process/iterate over the data and makes your script more readable, flexible, and efficient.
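
As a rough sketch (assuming the hypothetical tableA_bkp backup table from above), a cursor can drive the fix while iterating over the backup table only:

declare
  cursor c_fix is
    select row_id
      from tableA_bkp;            -- iterate over the backup table, not the live table
begin
  for r in c_fix loop
    update tableA
       set column1 = '1234'
     where rowid = r.row_id
       and column1 <> '1234';     -- skip rows that already have the new value
  end loop;
  commit;
end;
/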

Make Scripts re-runnable

Idempotency is helpful. Anything that can go wrong will go wrong.

The script may fail or only partially run for any number of reasons. If your script is not re-runnable, you will be stuck in the middle of an execution, forced to produce a quick fix without much testing, and you will likely end up making things worse.

Make sure you add proper conditions and cross-checks in the SQL. If you are updating a row, it is better to use a merge statement, or an update with a condition on the value you are trying to set.

Example:

update tableA set column1 = '1234' where column2 = 'ABCD';

vs

update tableA set column1 = '1234' where column2 = 'ABCD' and column1 <> '1234';

Here, if the value is already correct, the statement avoids updating it. If you run the script for the first time, it updates column1. If you re-run it, rows where column1 has already been migrated do not get updated again. This makes the script re-runnable.
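
A merge statement driven by the backup table achieves the same effect. This is only a sketch using the hypothetical tableA_bkp table, assuming column2 uniquely identifies the rows:

merge into tableA a
using tableA_bkp b
   on (a.column2 = b.column2)
 when matched then
   update set a.column1 = '1234'
        where a.column1 <> '1234';   -- no-op on re-run if already migrated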

Use Index and partition keys in the SQL

Your production database will always be larger and under heavier load than your test environment. Because of this, SQL performance can vary drastically between the two databases.

If you use the proper index and partition key in the SQL, you increase your chances of seeing better performance.

If you do not have an index, see if you can add one.
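
As an illustration, assuming tableA is partitioned on a hypothetical partition_key column and column2 is indexed, adding both to the predicate lets Oracle prune partitions and use the index:

update tableA
   set column1 = '1234'
 where partition_key = 'P2020'   -- partition key limits the scan to one partition
   and column2 = 'ABCD'          -- indexed column narrows down the rows
   and column1 <> '1234';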

Query Execution Plan

The query execution plan tells you a lot that you can use to improve the performance of your SQL by optimizing it. It shows you how your SQL will perform on a given dataset.

Many times, a query that runs very well on a small data set does not work well on a large one. The query plan tells you the cost, cardinality, indexes/partitions used, and so on. This gives you an idea of what is going wrong or what could go wrong. Understand the query execution plan beforehand to avoid such mistakes.
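
In Oracle, one common way to inspect the plan (sketched here against the running tableA example) is:

explain plan for
  update tableA
     set column1 = '1234'
   where column2 = 'ABCD'
     and column1 <> '1234';

select * from table(dbms_xplan.display);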

Understand how the table is used in Production

This is extremely important when you are touching a live database. If you are not aware of how your system is using the table that you are trying to modify, you may end up with I/O starvation or deadlocks.

Your system should not suffer because of the script execution. It should not have to spend too much time waiting because the script has acquired a lock on a row.

Knowing how reads/writes happen on the table gives you better insight and helps you write effective and efficient SQL.

Plan your commits/rollbacks

Always commit or roll back changes in small batches. This helps with the following:

  • Fewer rows locked
  • Less I/O wait for the system
  • Smaller impact if a rollback is needed
  • Fewer (or no) chances of a deadlock

There can be more factors to consider for your commit/rollback strategy depending on your system. Evaluate them before making a decision.
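
For instance, a minimal sketch of batched commits (again assuming the hypothetical tableA_bkp table with a row_id column):

declare
  v_count pls_integer := 0;
begin
  for r in (select row_id from tableA_bkp) loop
    update tableA
       set column1 = '1234'
     where rowid = r.row_id
       and column1 <> '1234';

    v_count := v_count + 1;
    if mod(v_count, 1000) = 0 then
      commit;                      -- commit every 1000 rows to keep locks short-lived
    end if;
  end loop;
  commit;                          -- final commit for the remaining rows
end;
/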

Evaluate usage of bulk collect/insert/update/delete

With Oracle, bulk operations (BULK COLLECT and bulk insert/update/delete) provide a great way to batch work and optimize SQL. When writing a data script, always evaluate the feasibility of using bulk operations.
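
A rough sketch of BULK COLLECT with FORALL, using the same hypothetical backup table:

declare
  type t_rowid_tab is table of rowid;
  v_rowids t_rowid_tab;
begin
  select row_id
    bulk collect into v_rowids
    from tableA_bkp;               -- fetch all target ROWIDs in one round trip

  forall i in 1 .. v_rowids.count  -- apply the update as a single bulk operation
    update tableA
       set column1 = '1234'
     where rowid = v_rowids(i)
       and column1 <> '1234';

  commit;
end;
/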

Use ROWID whenever possible

Every record has a unique ROWID within the database, representing the physical location on disk where the record lives. Using it gives a great boost to SQL performance.

While creating the backup table, if we collect the ROWID as well, the subsequent update/delete can be done using the ROWID. Having the physical location saves the table scans we would otherwise do to locate our rows.

A ROWID generally remains unchanged, but in the case of a partitioned table it can change if the row migrates from one partition to another. It is better to understand how the table is partitioned and its impact on ROWIDs before using them in SQL.
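
Continuing the sketch above, where the hypothetical tableA_bkp captured the ROWIDs in a row_id column, the fix can then target rows directly by ROWID:

update tableA a
   set a.column1 = '1234'
 where a.rowid in (select b.row_id from tableA_bkp b)
   and a.column1 <> '1234';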

Working With Date/Time

When working with date/time values, be very careful about how they are used. Oracle supports many formats for date/time, which is convenient when you are looking at data, but the same flexibility can become very painful while modifying it.

When there is a comparison or update of a date/time value, make sure the formats are compatible. It is very common for the actual impact to differ from the intention.

The script should explicitly state which format it is going to use.

alter session set nls_date_format='dd-mm-yy hh24:mi:ss';

With the above line, we set nls_date_format to avoid any conflict or confusion about dates. There are many formats available, and you can set whichever fits your need.

Another option is to use a relative date. Say you want to set a date for next year,

update tableA set column_date = CURRENT_TIMESTAMP + interval '1' year;

This will set the date to 1 year from the current time.

If you are comparing dates,

update tableA set column1 = '1234' where trunc( column_date ) < to_date('01-DEC-20', 'DD-MON-YY');

It is always better to consider these date/time items, as your test environment may not have all the combinations that can go wrong. If they are taken care of properly, you have one less thing to worry about.

Mimic Production Execution steps in test Environment

Sometimes it really matters how the script is executed. Your SQL tool may quietly take care of a few things even if you miss them, but if you run the same script from the command line, you will start seeing issues.

It is advisable to find out how the script will be executed in production and follow the same steps in the test environment. This helps avoid last-minute surprises.

This also helps estimate the overall expected run time. It is important to know how long it is going to take, so you can avoid last-minute anxiety.

Plan Back-up table cleanup

Once you are done with the data script, you need to handle clean-up as well. If you do not clean up the backup data, you may end up with too much unwanted data. It is okay not to delete the backup data immediately, as you might want to monitor how the fix went and whether any issues come up in the near future.

You can put a process in place that cleans up such data based on the conventions followed (say, a certain table name format) after a certain time period.
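
For example, if backup tables follow a hypothetical _BKP naming convention, a periodic job could list the old ones and drop them once the retention period has passed:

select table_name
  from user_tables
 where table_name like '%\_BKP' escape '\';   -- find backup tables by naming convention

drop table tableA_bkp purge;                   -- once the fix is verified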

Revisit Constraints

It is very common that data corruption happens because proper database constraints were not in place, proper code checks were missing, and so on. If you have a rule that defines what valid data looks like, it is better to create constraints around it to avoid future data corruption.
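
As a sketch, a hypothetical rule such as "column1 must not be null" can be encoded directly in the schema:

alter table tableA
  add constraint tableA_column1_nn check (column1 is not null);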

Check existing Constraints /Triggers /Indexes on the table

When we update rows, any trigger whose criteria are satisfied will also fire. This adds time to the script execution. Sometimes those triggers are necessary, and sometimes they can be skipped.

Evaluate whether you need those triggers during the data script execution. If not, you can temporarily disable them to get better performance and re-enable them once the changes are applied.

If you are updating a column that has an index on it, you might see a dip in performance. With each updated value, Oracle needs to maintain the index entries. In such cases, you may want to disable the index to avoid the overhead and rebuild it once you are done.
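
A sketch of this, using hypothetical trigger and index names and assuming the index is non-unique:

alter trigger tableA_trg disable;            -- skip the trigger during the data fix
alter index tableA_column1_idx unusable;     -- stop maintaining the index

-- ... run the data script ...

alter trigger tableA_trg enable;
alter index tableA_column1_idx rebuild;      -- rebuild the index once the fix is applied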

Prepare Rollback Script

It is very important to keep a rollback script ready. In case of any issue, the rollback script can be used. If possible, the rollback can even be triggered automatically from the initial script based on a condition or failure encountered.
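
Since the backup table holds the original values, a minimal rollback sketch (again using the hypothetical tableA_bkp with a row_id column) is to restore from it:

update tableA a
   set a.column1 = (select b.column1
                      from tableA_bkp b
                     where b.row_id = a.rowid)   -- restore the original value
 where a.rowid in (select b.row_id from tableA_bkp b);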

Logs

Logs are every developer's friend. Always ensure there is enough logging for you to understand how the script is running, and if any issue occurs, the logs should be sufficient to identify the root cause. As we are working with data, there can be a data anomaly that causes things to break; the logs should capture that so you understand what is going on.
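
Even something as simple as DBMS_OUTPUT (or an insert into a dedicated log table) around each batch goes a long way; a minimal sketch:

begin
  update tableA
     set column1 = '1234'
   where column2 = 'ABCD'
     and column1 <> '1234';

  -- run "set serveroutput on" in SQL*Plus to see this output
  dbms_output.put_line(to_char(sysdate, 'dd-mm-yy hh24:mi:ss')
                       || ' rows updated: ' || sql%rowcount);   -- record what the batch did
  commit;
end;
/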

Summary

These are some of the general practices that developers should follow to make such changes go smoothly.

With all that said and done, you may still need to go to a DBA for review, but if you have the above items checked and in place, it makes the DBA's life easy and they can focus on other important things.