Migrating WordPress data for a Ruby on Rails application

Feb 03, 2016

Table of contents:

Generating the Rake Task
Reading the WordPress XML Export
Creating the Data Objects
Importing the data
Conclusion

Last week we looked at converting WordPress HTML into normal HTML and Markdown (Rendering Markdown and HTML in Ruby).

This is a great first step, but we also need a way of taking the generated WordPress export and migrating the data to the new application.

This needs to be an automated process so that I can keep running it during development, and so that when I put the new version live I already have a process I can simply run.

There are a couple of existing solutions for converting WordPress data to a new format, but because I’m not migrating to an off-the-shelf CMS, none of them would work for me.

Fortunately it’s not too difficult to do the job ourselves.

In today’s tutorial I will be walking through how I wrote my migration task.

Generating the Rake Task

Whenever I want to migrate the data I’m going to want to run a command in Terminal to kick things off. Rails uses Rake (Understanding and Using Ruby Rake) for command line stuff so we can write our own Rake task for running the migration.

First we need to generate a new Rake task:

rails g task wordpress import

This will generate a new import task that is namespaced under wordpress.

If you look in the lib/tasks directory you should find a new file called wordpress.rake:

namespace :wordpress do
  desc 'Import WordPress data'
  task import: :environment do
  end
end

Reading the WordPress XML Export

Next we need to take the XML export that WordPress generates and read it into a structure we can work with.

In order to do this, I will be using the Nokogiri gem.

Add the following line to your Gemfile:

gem 'nokogiri'

And run the following command in Terminal:

bundle install

Next I’m going to create a new directory under lib called word_press and a new file called data.rb:

module WordPress
  class Data
  end
end

In order to pass the XML into Nokogiri, we first need to read the file.

I’ll handle this in the initialize method:

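attr_reader :doc

def initialize
  # Hard-coded path to the WordPress XML export
  path = File.expand_path('wordpress.xml')

  # Strip stray U+0004 control characters (not valid in XML) before parsing,
  # and assign to @doc so the attr_reader above can return the document
  @doc = Nokogiri.XML(File.read(path).gsub("\u0004", ''))
end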

In this example I’m hard coding the path to the XML export. You could pass this as an option from the Rake command, but because this is specific to my application, and it’s never going to change, I don’t mind hard coding it.
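If you would rather not hard code the path, one option is to read it from an environment variable and fall back to the default. This is only a sketch: the WORDPRESS_EXPORT variable name and the export_path helper are hypothetical and are not part of my actual task.

```ruby
# Hypothetical helper: resolve the export path from an environment
# variable, falling back to the hard-coded default used in this post.
DEFAULT_EXPORT = 'wordpress.xml'

# `env` defaults to the real environment, but any Hash-like object that
# responds to `fetch` can be passed in (handy for testing).
def export_path(env = ENV)
  File.expand_path(env.fetch('WORDPRESS_EXPORT', DEFAULT_EXPORT))
end

puts export_path
```

Rake copies any NAME=value arguments on the command line into ENV, so with a helper like this the task could be run as rake wordpress:import WORDPRESS_EXPORT=path/to/export.xml.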

Finally I’m going to provide a single method for getting the posts from the export:

def posts
  doc
    .xpath("//item[wp:post_type = 'post']")
    .collect { |post| WordPress::Post.new(post) }
end

Nokogiri provides an xpath interface for traversing the XML structure. I’m only interested in the posts so that’s the only bit I need.

I collect over the array of results from the xpath query and create an array of new Post objects that will be returned from this method.

For my application, I’m using the posts as an entry point for getting all of the data from the export.

Creating the Data Objects

The next step is to create Data Objects for each of the types of data you want to migrate.

module WordPress
  class Post
    def initialize(doc)
      @doc = doc
    end
  end
end

By wrapping the Nokogiri element in a Ruby class, I can make any customisations and conversions as the object is read.

For example, if you just want to pass the data on, you can simply provide a method and return the value:

def title
  @doc.xpath('title').text
end

def slug
  @doc.xpath('wp:post_name').text
end

But if you want to convert to a different format, you can encapsulate that in the method.

For example, in last week’s tutorial I was converting the WordPress HTML into Markdown and regular HTML.

I can deal with this conversion process inside of this class:

def content
  content = @doc.xpath('content:encoded').text
  content = format_syntax_highlighter(content)
  # Collapse runs of blank lines into a single blank line
  content.gsub(/\n{2,}/, "\n\n")
end

def html
  Render::HTML.new.render(markdown)
end

def markdown
  # Memoise the rendered Markdown so the conversion only runs once
  @markdown ||= Render::Markdown.new.render(content)
end

# Convert [lang]...[/lang] shortcodes into fenced Markdown code blocks
def format_syntax_highlighter(text)
  text.gsub(%r{\[(\w+)\](.+?)\[/\1\]}m) { "\n```#{$1}#{$2}```\n" }
end

To the outside world, this conversion process is completely hidden.
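To make the shortcode conversion concrete, here is a small standalone sketch of the two string transformations used in the content method. The sample input is made up, the collapse_blank_lines name is only for illustration, and the fence is built from a constant purely so the example is easy to read.

```ruby
# Three backticks, built programmatically to keep the example readable.
FENCE = '`' * 3

# Turn [lang]...[/lang] shortcodes into fenced Markdown code blocks.
# The \1 backreference makes sure the closing tag matches the opening one.
def format_syntax_highlighter(text)
  text.gsub(%r{\[(\w+)\](.+?)\[/\1\]}m) { "\n#{FENCE}#{$1}#{$2}#{FENCE}\n" }
end

# Collapse any run of two or more newlines into exactly one blank line.
def collapse_blank_lines(text)
  text.gsub(/\n{2,}/, "\n\n")
end

input  = "Intro\n\n\n[php]\necho 'hi';\n[/php]\n\n\n\nOutro"
output = collapse_blank_lines(format_syntax_highlighter(input))

puts output
```

The shortcode name becomes the language hint on the opening fence, and the extra blank lines left around the block are tidied up by the second step.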

You can also create more classes to encapsulate related entities. For example, each post will have related comments so I can repeat the process of collecting these related items:

def comments
  @doc.xpath('wp:comment').collect { |comment| Comment.new(comment) }
end

Now I can deal with the comment specific formatting in its own object.

Importing the data

Finally back in the wordpress.rake task we can deal with the actual importing process.

This will basically mean taking each object from the WordPress data export and creating new Active Record objects and relations.

namespace :wordpress do
  desc 'Import WordPress data'
  task import: :environment do
    # Get the WordPress data
    data = WordPress::Data.new

    # Import the posts (the block variable is named post so it does not
    # shadow the outer data variable)
    data.posts.each do |post|
      article = Article.new
      article.title = post.title

      # etc
      article.save!
    end
  end
end

The structure I’ve decided on for my CMS is more complicated than a regular blog and so this provides a nice opportunity to create the object graph for each article. That is something I definitely could not have done if I had used a general purpose solution.
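Because I want to run the import repeatedly during development, one approach is to look each article up by its slug before saving, so a second run updates the existing records instead of duplicating them. The sketch below is an assumption about how you might do this, not my actual task. To keep it runnable without Rails, Article is a small in-memory stand-in that mimics Active Record's find_or_initialize_by and save!, and ImportedPost is a hypothetical stand-in for WordPress::Post.

```ruby
# In-memory stand-in for an Active Record model (illustration only).
class Article
  STORE = {}

  attr_accessor :slug, :title

  # Mimics ActiveRecord's find_or_initialize_by for a slug lookup.
  def self.find_or_initialize_by(slug:)
    STORE[slug] || new.tap { |article| article.slug = slug }
  end

  def self.count
    STORE.size
  end

  def save!
    STORE[slug] = self
  end
end

# Hypothetical stand-in for WordPress::Post.
ImportedPost = Struct.new(:slug, :title)

# Create or update one article per post, keyed on the slug.
def import_posts(posts)
  posts.each do |post|
    article = Article.find_or_initialize_by(slug: post.slug)
    article.title = post.title
    article.save!
  end
end

posts = [
  ImportedPost.new('hello-world', 'Hello world'),
  ImportedPost.new('second-post', 'Second post')
]

# Running the import twice leaves the same number of articles.
import_posts(posts)
import_posts(posts)

puts Article.count
```

In a real Rails application the stand-in class would be replaced by your own model, and this assumes the slug is unique for each article.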

Conclusion

In today’s tutorial we’ve covered a couple of interesting areas of Ruby development including creating Rake tasks as well as the very useful Nokogiri gem.

By encapsulating each chunk of data from the WordPress export as a class we can deal with whatever conversion details we require.

Although there are many existing solutions for migrating data from a WordPress blog, none of them came close to satisfying my requirements.

Hopefully if you are looking to do the same, you can use these last two posts as a foundation for building what you need.