Executing gradual imports is the winning data migration strategy

Data imports are a common task in IT project implementations. Whether you're doing content migrations from a legacy Web Content Management System or importing millions of data points to your shiny new Customer Data Platform, it's likely that migrations are on the menu for the development team before the big go-live.

The range of data migrations is wide, but the core is always the same: move data from one system to another. In many cases the process is complicated by the fact that the existing system needs to run until the last minute before switching over to the new tools. The source data is often tainted by inconsistencies or may even contain false information, so doing data cleansing or scrubbing might be in order while you're at it.

Migrations are not an exact science, but in general you can consider there to be two approaches: big bang and gradual migrations. Big bang migrations take the approach of executing the whole migration at a single point in time. Gradual migrations, sometimes called trickle migrations, take a piecemeal approach - continuously importing data from the live system to the system being developed.

Both methods can lead to a successful transition to a new production system, but the incremental method arguably carries less risk when you get to the finish line. Big bang migrations can be simulated beforehand with data snapshots, but those are often impractical to run frequently - making the feedback cycle between execution and verification longer. With gradual imports you can get individual data points from production systems at a faster pace, increasing the agility of the entire process.
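To make the idea concrete, here is a minimal sketch of a trickle import loop in Python. The fetch_changed_items and import_item helpers, and the source, target and state_store objects they use, are hypothetical placeholders for whatever your source and target systems actually expose; the point is simply that each scheduled run only moves what changed since the previous successful run, so problems surface in small batches.

```python
from datetime import datetime, timezone

# Hypothetical helpers - in a real project these would wrap the source
# system's export API and the target system's import layer.
def fetch_changed_items(source, since):
    """Return items modified in the source system after `since`."""
    return source.query(modified_after=since)

def import_item(target, item):
    """Create or update a single item in the target system."""
    target.upsert(item)

def run_incremental_import(source, target, state_store):
    # Watermark of the previous successful run; start from the epoch on the first run.
    last_run = state_store.get("last_run", datetime(1970, 1, 1, tzinfo=timezone.utc))
    started_at = datetime.now(timezone.utc)

    failures = []
    for item in fetch_changed_items(source, since=last_run):
        try:
            import_item(target, item)
        except Exception as exc:  # record bad items without aborting the whole batch
            failures.append((item, exc))

    # Only advance the watermark when everything succeeded, so failed items
    # are picked up again on the next scheduled run.
    if not failures:
        state_store["last_run"] = started_at
    return failures
```

Run on a schedule - a nightly cron job, for example - this keeps the new system continuously populated while the old one stays live.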

No piece of data is an island

A system is by definition a regularly interacting or interdependent group of items forming a unified whole. Whatever data you are importing will eventually be consumed by other systems. For example, a Digital Experience Platform is a system that integrates functions such as data storage in a database, a file storage backend (both complex systems in themselves), and a management interface.

A lot can go wrong here. Let's say you're importing user generated content from an eCommerce site. Someone left a comment on a product with a thousand candy bar emojis (🍫🍫🍫🍫…), which works fine in the source system. But maybe at some point in time there were some workarounds, and the candy bars aren't really candy bars, because the old system does not support the UTF-8 character set. So instead of 🍫s they're stored as placeholders like :candybar: in the source data.

No problem. Our import process can handle this by converting the placeholders back to real candy bars. The content field now looks like it should, but wait a minute... why is the user's avatar image not showing up? Ah, we decided to store files based on the username, and the file storage we have does not allow storing a file as 氷の男.png. Easy, just run it through a hash function and store it as f07db9de.png.
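As a rough sketch of those two clean-up steps, the snippet below converts the legacy placeholders back to real characters and derives a storage-safe file name from a username. The placeholder table and the eight-character hash are assumptions made for the example, not the behaviour of any particular product.

```python
import hashlib
import re

# Legacy placeholders and the UTF-8 characters they stand for.
# The mapping is an assumption based on the example above.
PLACEHOLDERS = {":candybar:": "🍫"}

def restore_emojis(text: str) -> str:
    """Replace legacy :placeholder: tokens with their real UTF-8 characters."""
    return re.sub(
        r":[a-z_]+:",
        lambda match: PLACEHOLDERS.get(match.group(0), match.group(0)),
        text,
    )

def safe_filename(username: str, extension: str = ".png") -> str:
    """Derive a storage-safe name for usernames the file backend can't handle."""
    digest = hashlib.sha256(username.encode("utf-8")).hexdigest()[:8]
    return digest + extension

print(restore_emojis("Great product! :candybar::candybar:"))  # Great product! 🍫🍫
print(safe_filename("氷の男"))  # a short ASCII-only name derived from the hash
```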

Perfect! The comment field and image show up just as you'd expect in the back office. But huh, now we're getting reports from the user acceptance testing crew that they're seeing boxes instead of candy bars? Ahh, no biggie, it was just that the font we've selected does not support emojis and they show up as unknown characters. Our design team and front end developers can switch to a variant that supports these.

And now finally, everything that could (and did) go wrong in our fictional case has been fixed and we're ready to go. The difference between the approaches (big bang vs. gradual) is that in a big bang all of these issues might have hit us at the same time, together with fifteen other similar ones - an overwhelming amount of work to fix for "a simple thing like that". With daily incremental imports, these issues are more likely to surface earlier, with ample time to handle them before the production release.

Implementing trickle migrations with Ibexa DXP

Ibexa DXP exposes a wide array of open APIs that developers can use to implement data migration tasks. With them you've got the freedom to work by dumping in large data blobs or by handling imports using granular incremental processes. Before you evaluate which of these is best for your use case, you should look into the options available for the source data. Do they provide all the information needed? Should you import from a single source or from multiple systems? Do you need resources to develop something new for the source system?

Some common methods for exposing source data for migration processes are:

  • Standard public APIs (REST, SOAP, etc.)
  • Purpose built custom endpoints (exposing data using a custom JSON format, etc.)
  • Common content feed formats (ATOM, RSS, etc.)
  • A pool of files stored on disk (XML, CSV, etc.).
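The file-based option is the simplest to sketch. Assuming a directory of CSV exports with hypothetical columns ('id', 'title', 'body', 'modified'), a reader for such a source could look roughly like this:

```python
import csv
from pathlib import Path

def read_source_records(directory: str):
    """Yield one normalized record per row from every CSV file in a directory.

    The column names are assumptions made for the example; a real source
    dump defines its own layout.
    """
    for csv_path in sorted(Path(directory).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                yield {
                    "id": row["id"],
                    "title": row["title"].strip(),
                    "body": row["body"],
                    "modified": row["modified"],
                    "source_file": csv_path.name,
                }
```

Whatever the source format, normalizing records into one shape early keeps the rest of the import pipeline independent of where the data came from.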

Once you've got your data sources figured out, you can consider the optimal way to get the data into the target system. The methods available to you via Ibexa DXP are:

  • Migration files
  • Public HTTP APIs
  • Server-side PHP APIs.

The migrations bundle introduced in Ibexa DXP 3.3 allows moving repository structures between environments using a standard YAML format. It is designed for moving data structure changes (adding fields to content items, creating root container objects, etc.), but can also be used for importing content. This method can be limiting where complex logic is needed, but it is useful for modest amounts of straightforward content and for prototyping data models.
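As a rough idea of the shape such a file can take, the snippet below sketches a migration that creates a single content item. The exact keys and supported options vary by version, so treat this as illustrative only and check the Ibexa migrations documentation for the authoritative schema.

```yaml
# Illustrative migration file - verify keys and values against the Ibexa
# documentation for your product version.
- type: content
  mode: create
  metadata:
    contentType: article
    mainTranslation: eng-GB
  location:
    parentLocationId: 2
  fields:
    - fieldDefIdentifier: title
      languageCode: eng-GB
      value: 'Imported article'
```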

The Ibexa DXP product family supports common HTTP-based APIs out of the box. With our RESTful and GraphQL APIs, developers gain read and write access to the data repositories on Ibexa DXP instances. Using these, developers are free to interact with the system in their preferred programming and scripting languages, such as JavaScript and Python. These APIs provide full flexibility, but performance can be a limiting factor if, for example, you're moving gigabytes of binary files back and forth.
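A small HTTP import script could look roughly like the sketch below. The endpoint path, authentication and payload shape are placeholders rather than the actual Ibexa REST or GraphQL contract, which is described in the product's API reference; the sketch only shows the general pattern of pushing migrated records over HTTP.

```python
import requests

# Placeholder values - the real Ibexa REST routes, media types and payload
# structure are defined in the product's REST API reference.
API_BASE = "https://example.com/api"
session = requests.Session()
session.auth = ("import_user", "secret")  # assumed HTTP basic auth for the sketch

def push_record(record: dict) -> None:
    """Send one migrated record to the target system over HTTP."""
    response = session.post(
        f"{API_BASE}/content",           # hypothetical endpoint
        json={
            "title": record["title"],
            "body": record["body"],
            "remoteId": record["id"],    # keep the source id so re-runs can update
        },
        timeout=30,
    )
    response.raise_for_status()
```

Combined with an incremental loop like the one sketched earlier, each scheduled run then pushes only the records that changed since the previous run.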

The Public PHP API gives developers direct access to our server-side APIs. These are the same APIs that power the administration interface and all the other features our core product uses. It is common to use the Symfony Console component to expose these APIs for imports and other scripting purposes. The PHP API provides the most low-level access to our safe APIs and is generally the way to go for the most complex integrations, such as those requiring multiple passes to form relationships between migrated objects.

Conclusion

Migration processes where you run the old and new systems in parallel well before going live have many benefits. With incremental data imports you can find issues faster and make sure all the pieces of the integration fit together perfectly. Running migrations continuously from early on until the last minute gives you confidence for the final rollout.

In addition, this might make you rethink parts of your project. Some tasks labelled as imports might be rendered irrelevant. For example, when you're integrating multiple systems, the initial import might also serve as the primary integration method going forward. Let's say you want to expose product sales data to your e-commerce platform operators. This data is stored in the legacy online commerce solution, but the source data actually lives in your ERP. If your ERP platform will continue to work as it has in the past, it might make sense to import data directly from there and leave the legacy e-commerce platform out of it completely.

For developers, Ibexa DXP offers flexibility and a range of methods for performing migrations. With our modular product structure your developers can also keep using the same familiar APIs as your organization graduates from digital sales with Ibexa Experience to full-blown transactional commerce with Ibexa Commerce.
