The Fastest Way To Load Data Into Postgres With Ruby on Rails

By Eze Sunday Eze

December 14, 2021

Data migration is a delicate and sometimes complicated and time-consuming process. Whether you are loading data from a legacy application to a new application or you just want to move data from one database to another, you’ll most likely need to create a migration script that will be accurate, efficient, and fast to help with the process — especially if you are planning to load a huge amount of data.

There are several ways you can load data from an old Rails app or other application to Rails. In this article, I’ll explain a few ways to load data to a PostgreSQL database with Rails. We’ll go over their pros and cons, so you can choose the method that works best for your situation.

Postgres is an innovative database. According to a recent study by DB-Engines (PDF), PostgreSQL’s popularity rating increased by 65 percent from January 2016–January 2019, while the rating of MySQL, SQL Server, and Oracle decreased by 10–16 percent during the same period.

PostgreSQL has a strong reputation for handling large data sets. However, with the wrong tools and solutions, its powers can be undermined. So what’s the fastest way to load data to a Postgres database in your Rails app? Let’s look at four different methods, and then we’ll see which is the fastest.

Inserting one record at a time to load data to your Postgres database
- Pros of single-row inserts with Postgres
- Cons of single-row inserts with Postgres
Bulk Inserts with Active Record Import to load data to your Postgres database
- Pros of Bulk Inserts with Active Record in Ruby on Rails and Postgres
- Cons of Bulk Inserts with Active Record in Ruby on Rails and Postgres
Using PostgreSQL Copy with Activerecord-copy to load data to your Postgres database
- Pros of using PostgreSQL Copy with Activerecord-copy
- Cons of using PostgreSQL Copy with Activerecord-copy
4. Using background jobs to load data to your Postgres database
Final Thoughts About Loading Large Data Sets into a PostgreSQL Database with Rails
Speed comparison of different ways to load data into Postgres with Rails
Other articles and resources you might like

Diagram of Ruby on Rails Insert Methods

Inserting one record at a time to load data to your Postgres database

One easy way to load data to a Postgres database is to loop through the data and insert them one at a time.

Here’s a sample code to do this in Rails, assuming we have the source data in a CSV file:

# lib/tasks/one_record_at_a_time.rake
require 'csv'
require "benchmark"

namespace :import do
   desc "imports data from csv to postgresql"
   task :single_record => :environment do
       #This function loops over the content of the csv file and creates a new record for each of them.
       def insert_user
           CSV.foreach(filename, headers: true) do |row|
               User.create(row)
           end
       end
       puts Benchmark.realtime {insert_user } #Here we are using benchmark to measure the speed
   end
end

But there’s a problem with this approach. Inserting data one at a time into a PostgreSQL database is extremely slow. I ran this Rake task to insert over a million records and measured it with Benchmark. The report came back with a result of over 1.3 hours, that’s a long time. There's overhead in both the database and the application in processing rows one-by-one, and additional latency in waiting for the database round trip for each row.

We’ll see a better approach in the next section, but for now, here’s a summary of the pros and cons of single-row inserts:

Pros of single-row inserts with Postgres

Doesn’t require an external dependency

Cons of single-row inserts with Postgres

Very slow
Might lock your session for a long time
Not suitable for inserting large data sets
If one insert fails, you’re stuck with partially loaded data

Bulk Inserts with Active Record Import to load data to your Postgres database

Running a bulk insert query is a better and faster way to load data into your Postgres database, and the Rails gem activerecord-import makes it easy to load massive data in bulk in a way that the Active Record ORM can understand and manipulate.

Instead of hitting your database multiple times, processing transactions, and doing all the back and forth with your app and database, the Active Record Import gem allows you to build up large insert queries and run them at once.

You can install the Active Record Import gem by adding gem 'activerecord-import' to your Gemfile and running bundle install in your terminal. This gem adds import to Active Record classes. That means you’ll only need to call the import method on your model classes to load the data into your database.

Here is an example:

# lib/tasks/active_record_import.rake
require 'csv'
require "benchmark"

namespace :import do
   desc "imports data from csv to postgresql"
   users = []
   task :batch_record => :environment do
       CSV.foreach(filename, headers: true) do |row|
           users << row
       end
       newusers = users.map do |attrs|
           User.new(attrs)
       end
       time = Benchmark.realtime {User.import(newusers)}
       puts time
   end
end

Notice how we’re building up the record in an array—users—and passing the array to the import method on the User model— User.import(newusers).

That’s really all that needs to be done. However, you can choose to pass only some specific columns and the values in an array to the import method if you want to. For example, User.import columns values where the columns will be an array like ["first_name", "last_name"], while the values will be an array like [ ['Peter', 'Joseph'], ['Banabas', 'Bob Jones'] ].

I analyzed loading a million records into a Postgres database with Rails using this method, and it took only 5.1 minutes. Remember the first method took 1.3 hours? This method is 1,529% ( ~15x ) faster. That’s impressive.

Pros of Bulk Inserts with Active Record in Ruby on Rails and Postgres

Follows Active Record Associations, meaning Rails ORM is able to do its magic with the loaded data
Faster to load data into your PostgreSQL database
Doesn’t have per-row overhead
If insert fails, your transaction will rollback the insert

Cons of Bulk Inserts with Active Record in Ruby on Rails and Postgres

The activerecord-import gem might conflict with other gems that add .import method to the Active Record model. However, in cases where this might happen, you can use the .bulk_import method also attached to your model classes as an alternative.

See how batch import improved our speed by over 1,529%? That was incredible, right? There is still a faster way to load data to a Postgres database.

Using PostgreSQL Copy with Activerecord-copy to load data to your Postgres database

COPY is the fastest way to load data to a PostgreSQL database; it uses the combined power of a bulk insert and avoids some of the overhead of repeatedly parsing and planning an INSERT.

The gem activerecord-copy provides an easy-to-use interface for implementing COPY in your Rails app. You’ll need to add the line gem 'activerecord-import' to your Gemfile and run bundle install in your terminal to install the gem and get ready to use it.

Here is a sample Rake task showing how you can use it:

# lib/tasks/active_record_copy.rake
require 'csv'
require "benchmark"
namespace :copy do
   desc "imports data from csv to postgresql"
   task :data => :environment do
       def insert_user
           users = []
           CSV.foreach(filename, headers: true) do |row|
               users << row
           end
           time = Time.now.getutc
          
           User.copy_from_client [:first_name, :last_name, :email, :created_at, :updated_at] do |copy|
               users.each do |d|
                   copy << [d[:first_name], d[:last_name], d[:email] ,time, time ]
               end
           end
       end
       puts Benchmark.realtime {insert_user}
   end
end

The activerecord-copy gem adds a copy_from_client method to all your model classes, as shown in the snippet above (you’ll have to define the columns and their values as shown).

Note that when you use the activerecord-copy gem, the time stamp is not created for you automatically. You’ll have to create this yourself. You’ll also notice where I created the time stamp time = Time.now.getutc; that’s because Rails will not create time stamps for you automatically with COPY.

Pros of using PostgreSQL Copy with Activerecord-copy

Doesn’t have per-row overhead
If insert fails, your transaction will rollback the insert
Super fast

Cons of using PostgreSQL Copy with Activerecord-copy

Manually set time stamps (created_at, updated_at, etc.)

I analyzed the activerecord-copy performance with a transaction of over one million records, as I did for other methods, and the speed is about 1.5 minutes. Insanely fast compared to the other methods we’ve seen in this article.

4. Using background jobs to load data to your Postgres database

If you frequently load new data to your database, one great way to improve your app’s performance is to run your data loading using a background job. There are several tools that make this possible, for example, Rails’ delayed_job gem, sidekiq, and resque.

However, just like Active Record, Rails uses Active Jobs to allow you to use any of these supported adapters within your Rails app without bothering about job-specific implementation. So you could set up a script for Active Record and run the script in a background job using Active Jobs and the delayed_job adapter. That way, you'll be running your data loading in the background.

Let’s walk through how to set up your Active Job to run your background process:

Since you’re going to use the delayed_job adapter, install the delayed_job_active_record gem.
Add gem 'delayed_job_active_record' to your Gemfile.
Run bundle install on your terminal/command line.
Run the following command to create a delayed job migration for the delayed jobs table:

rails g delayed_job:active_record
rake db:migrate

Generate an Active Job by running the following command:

rails generate job import_data

Open the file created in your app/jobs directory—app/jobs/import_data_job.rb—and add your data loading code:

# app/jobs/import_data_job.rb
class ImportDataJob < ApplicationJob
   queue_as :default
   def perform(*args)
   # Write your code here to load records to the database. You can use any of the fast methods we've discussed.
   end
end

In order for Rails to be aware of the Active Job adapter you want to use, you need to add the adapter to your config file. Just add this line: config.active_job.queue_adapter = :delayed_job_active_record.

   # config/application.rb
   module YourApp
     class Application < Rails::Application
       # Be sure to have the adapter's gem in your Gemfile
       # and follow the adapter's specific installation
       # and deployment instructions.
       config.active_job.queue_adapter = :delayed_job_active_record
     end
   end

Depending on how often you want the job to run, you can set the job to be enqueued at a specific time or immediately, following the instructions in the Active Jobs documentation.

One way you can do this is to allow the job to run asynchronously. Create a Rake task, add ImportDataJob.perform_later to the task, and run it. Example:

namespace :active_jobs do
   desc "imports data from sql to postgresql"
   task :import => :environment do
       ImportDataJob.perform_later
   end
end

Once this is done, you can now run the task rake active_jobs:import on your terminal.

Final Thoughts About Loading Large Data Sets into a PostgreSQL Database with Rails

When considering how to optimize your database performance, it’s best to first figure out the optimization options the database has already provided. As you may have noticed, most of the tools and techniques in this article leverage the hidden power of the PostgreSQL database. Sometimes, it might just be your implementation slowing down your database performance.

Speed comparison of different ways to load data into Postgres with Rails

Here’s a table summarizing the various speeds of the methods discussed in this article.

Method	Speed	Amount of records
One record at a time insert	1.3 hours	1,000,000
Bulk inserts with Activerecord Import	5.1 minutes	1,000,000
PostgreSQL Copy with Activerecord-copy	1.5 minutes	1,000,000
Using Background Jobs	< 1 sec (perceived)	1,000,000

You’ve learned that if you’re loading a huge amount of data into your PostgreSQL database, one insert at a time is slow and shouldn’t even be considered. For ultimate performance, you want to use COPY. Of course, you’ve also seen the caveats of each method, and you should weigh all the pros and cons before making your final decision.

Share this article: If you liked this article we’d appreciate it if you’d tweet it to your peers.