Proof of concept (POC) for migrating legacy customer data to a new data store.
The legacy data contains duplicate customer records, which need to be de-duplicated based on phone number and address.
The approach is to insert all data into Elasticsearch and then run a query (using more_like_this) for each customer to find the different entries that belong to the same customer.
After deduplication, each customer is saved to the new data store together with all of the original legacy customer ids, so that the information spread across the duplicate entries stays traceable.
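The merge step could look roughly like the sketch below. It assumes the Elasticsearch lookup has already returned the group of entries judged to be the same customer. The `LegacyCustomer` and `MergedCustomer` records and the `merge` method are hypothetical names for illustration, not taken from the project, and the choice to let the first entry supply the surviving field values is an assumption. Records require Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration; the project's real model classes may differ.
record LegacyCustomer(long legacyId, String name, String phone, String address) {}

record MergedCustomer(String name, String phone, String address, List<Long> legacyIds) {}

class DedupMergeSketch {

    // Collapses a group of entries already identified as the same customer
    // into one record, keeping every original legacy id.
    static MergedCustomer merge(List<LegacyCustomer> duplicates) {
        if (duplicates.isEmpty()) {
            throw new IllegalArgumentException("need at least one entry");
        }
        LegacyCustomer first = duplicates.get(0);
        List<Long> ids = new ArrayList<>();
        for (LegacyCustomer c : duplicates) {
            ids.add(c.legacyId());
        }
        // Assumption: the first entry supplies the surviving field values.
        return new MergedCustomer(first.name(), first.phone(), first.address(), List.copyOf(ids));
    }

    public static void main(String[] args) {
        MergedCustomer merged = merge(List.of(
                new LegacyCustomer(101L, "Jane Doe", "555-0100", "1 Main St"),
                new LegacyCustomer(2045L, "Jane Doe", "555-0100", "1 Main Street")));
        System.out.println(merged.legacyIds()); // prints [101, 2045]
    }
}
```

Keeping the ids as a list on the merged record is one way to preserve the link back to every legacy entry; a separate mapping table would work as well.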
Uses three Spring Batch jobs, one each to:
- generate dummy data of 5 million records
- add all records to Elasticsearch
- deduplicate the customer records and push to new data store.
Using Spring Batch makes it possible to resume a job after a failure.
- Takes 4m53s649ms to generate 5 million customers.
- Takes 15m32s17ms to push 5 million records to Elasticsearch.
- Deduplication processes about 10K records per minute. At this speed it will take about 8 hours 20 minutes (500 minutes) to complete all 5 million customers.
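The deduplication estimate follows directly from the measured rate. A minimal sketch of the arithmetic, with `MigrationEstimate` and `estimate` as hypothetical names and a steady processing rate as the assumption:

```java
import java.time.Duration;

class MigrationEstimate {

    // Estimated wall-clock time to process totalRecords at a steady rate.
    static Duration estimate(long totalRecords, long recordsPerMinute) {
        // Ceiling division, so a partial final minute still counts.
        long minutes = (totalRecords + recordsPerMinute - 1) / recordsPerMinute;
        return Duration.ofMinutes(minutes);
    }

    public static void main(String[] args) {
        Duration d = estimate(5_000_000L, 10_000L);
        System.out.println(d.toHours() + "h " + d.toMinutesPart() + "m"); // prints 8h 20m
    }
}
```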
MacBook Pro with 2.3 GHz Quad-Core Intel Core i7 processor, 16 GB 1600 MHz DDR3 RAM and 500 GB flash storage. Postgres and Elasticsearch running in Docker.
Update the Postgres and Elasticsearch endpoints in application.properties. Run the following command to build.
./mvnw clean install
Run the following commands one job at a time, starting each job after the previous one has finished.
# Generate Dummy data
java -jar -Dspring.profiles.active=generate_dummy target/mcd-0.0.1-SNAPSHOT.jar
# add all records to elasticsearch
java -jar -Dspring.profiles.active=save_es target/mcd-0.0.1-SNAPSHOT.jar
# Deduplicate the customer records and push to the new data store
java -jar -Dspring.profiles.active=dedup target/mcd-0.0.1-SNAPSHOT.jar
- The Elasticsearch search query can be improved to give better and faster results. Right now it uses more_like_this.
- Use Spring Batch partitioning to distribute deduplication tasks across worker nodes.
- Run the jobs again after the first migration with a parameterized ItemReader.
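For the partitioning idea, one common approach is to split the customer id range into contiguous chunks and hand each chunk to a worker. The sketch below shows only that range-splitting logic in plain Java. It assumes customers have numeric, roughly dense ids. `IdRangePartitionSketch`, `IdRange` and `partition` are hypothetical names, and in a real job this logic would sit behind Spring Batch's `Partitioner` interface, with each range placed in a step's `ExecutionContext`.

```java
import java.util.ArrayList;
import java.util.List;

class IdRangePartitionSketch {

    // Inclusive id range assigned to one worker (hypothetical type).
    record IdRange(long minId, long maxId) {}

    // Splits [minId, maxId] into at most gridSize contiguous, non-overlapping ranges.
    static List<IdRange> partition(long minId, long maxId, int gridSize) {
        if (gridSize <= 0 || maxId < minId) {
            throw new IllegalArgumentException("invalid range or grid size");
        }
        long total = maxId - minId + 1;
        long chunk = (total + gridSize - 1) / gridSize; // ceiling division
        List<IdRange> ranges = new ArrayList<>();
        for (long start = minId; start <= maxId; start += chunk) {
            ranges.add(new IdRange(start, Math.min(start + chunk - 1, maxId)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Four workers over 5 million ids: each gets 1,250,000 ids.
        for (IdRange r : partition(1, 5_000_000, 4)) {
            System.out.println(r.minId() + " .. " + r.maxId());
        }
    }
}
```

The same min/max parameters could also drive the parameterized ItemReader on later runs, for example to process only ids added since the first migration.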