How to Set Up a Read-Only PostgreSQL Slave Server for Data Analytics

By Matthew Rathbone on November 09, 2016


This article was co-authored by Anton Antonenko


Let’s say you have a production PostgreSQL database containing a bunch of useful data and you want to give some co-workers access to this database for analytics, reporting, or just for poking around. A good way to do this is to set up a new slave server with read-only access - that way they won’t impact any production systems.

In this article I will guide you through configuring streaming replication between your master (production) PostgreSQL server and a brand new slave server. Streaming replication keeps your slave up to date in near real-time, and is commonly used for ‘hot standby’ setups.

We’ll go through:

  1. Configuring the master server
  2. Setting up and configuring the slave server
  3. Copying data between the master and the slave
  4. Starting the slave and checking to make sure everything looks good

Variables we use in this article

<IP_MASTER> # ip address of the master server
<IP_SLAVE> # ip address of the slave server
<DATA_DIR> # data directory of the database

Configure the master

Update postgresql.conf

First we need to make some changes to the Master server’s postgresql.conf to support replication:

listen_addresses = '*' 
wal_level = hot_standby 
wal_keep_segments = 256 

# if you are running postgres < 9.5 you need to set checkpoint_segments
# on newer postgres (>= 9.5) you need to set min_wal_size & max_wal_size instead
# NOTE: set EITHER checkpoint_segments OR min/max_wal_size, never both

# < 9.5
checkpoint_segments = 256

# >= 9.5
max_wal_size = 256MB
min_wal_size = 128MB


max_wal_senders = 3   # the maximum number of concurrent replication connections from standby servers
hot_standby = on      # ignored on a master, but harmless to set here; it matters on the standby

The wal_level setting defines how much information we write to the write-ahead log (WAL). The WAL contains every change made to your database from a particular checkpoint - i.e. the INSERT/UPDATE/ALTER statements that define your current database state from a particular point in time. Before actually applying these changes to your tables, Postgres writes them to the WAL - hence the name write-ahead. Setting this value to hot_standby tells Postgres to include as much information in the logs as a hot standby server requires. By default it only includes enough information to recover from a crash.

The checkpoint_segments value defines the maximum number of WAL segments between automatic checkpoints, not the size of individual segments - each segment is 16MB by default. If you have a very big database and run UPDATE/ALTER queries often, raising this value can improve the speed of those queries at the cost of extra disk space for the WAL. The related wal_keep_segments setting above tells the master how many past segments to keep in pg_xlog so the slave can catch up if it falls behind. You can let PostgreSQL pick default values for checkpoint_segments and min/max_wal_size by simply leaving them out of postgresql.conf.
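
If you're curious how much disk the WAL actually uses before tuning these values, you can just look at the pg_xlog directory inside <DATA_DIR> - each file there is one segment:

du -sh <DATA_DIR>/pg_xlog        # total size of the WAL directory
ls <DATA_DIR>/pg_xlog | wc -l    # number of segment files (16MB each by default)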

Update pg_hba.conf

We need to add the following line to pg_hba.conf on the master, which allows the slave to connect for replication:

host    replication   all   <IP_SLAVE>/32    trust
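
Note that trust lets the slave connect without any authentication at all, which is only sensible on a locked-down private network. If that doesn't describe your setup, here's a sketch of a safer alternative using a dedicated replication role and md5 authentication - the replicator name and password are placeholders of my choosing, not something the defaults give you:

-- on the master, create a dedicated replication role:
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'choose-a-strong-password';

# then the pg_hba.conf line becomes:
host    replication   replicator   <IP_SLAVE>/32    md5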

Set up master-slave ssh access

Next, we will generate SSH keys to allow password-less SSH access from the master to the slave. This is an important part of setting up the replica, since we need the servers configured for file transfer from the master to the slave (we'll use rsync in a later step). You can skip this part if you already have inter-machine SSH credentials set up - I only present a simple way to do it here.

Execute the following command on the Master to create SSH keys and copy them to the slave server

sudo su postgres
ssh-keygen
ssh-copy-id <IP_SLAVE>

When you run ssh-keygen, the generator will ask you where to save the new key file. By default it will save it inside the postgres user's home directory - in my case /var/lib/pgsql/.ssh/id_rsa. Simply save it to the default location; if a key file already exists there, check with your co-workers what it was used for before agreeing to overwrite it.

You should now be able to SSH from the master to the slave without a password (make sure you are the postgres user).
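
A quick way to verify this - the hostname command is just an arbitrary no-op to prove the connection works:

sudo su postgres
ssh <IP_SLAVE> hostname    # should print the slave's hostname without asking for a password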

Restart to make changes take effect

Finally, let's restart the master server to apply the config changes (the init script name varies with your PostgreSQL version and distro - mine is 9.3):

/etc/init.d/postgresql-9.3 restart
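
It's worth confirming the new settings took effect, since wal_level and max_wal_senders only change on a full restart:

sudo -u postgres psql -c "SHOW wal_level;"         # should print 'hot_standby'
sudo -u postgres psql -c "SHOW max_wal_senders;"   # should print '3'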

Configure the slave server

We’re going to tell the slave server to behave as a hot-standby replica, so it needs to know the master server's IP address and be able to connect to it.

Update postgresql.conf

The change here is pretty simple:

hot_standby = on

Create a recovery.conf

Then you’ll need to create a recovery.conf file in the PostgreSQL ‘data’ directory (in my case it is /var/lib/pgsql/9.3/data/). This configuration file is read when the server starts and controls its recovery and standby behavior. We’ll add the following content.

standby_mode = 'on'   # since we are configuring a hot-standby replica, this stays 'on'
primary_conninfo = 'host=<IP_MASTER>'   # the master's IP address
trigger_file = '<DATA_DIR>/failover.start'   # the so-called failover (or trigger) file; creating it tells this machine to 'fail over', taking over from the master and ending replication
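
If you went with the md5 + dedicated role suggestion from the master section (or your master listens on a non-default port), primary_conninfo needs the credentials too. A sketch, using the same hypothetical replicator role from earlier:

primary_conninfo = 'host=<IP_MASTER> port=5432 user=replicator password=choose-a-strong-password'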

The final step of slave configuration is shutting PostgreSQL down on the slave, since we're about to copy a bunch of files over from the master, which would confuse a running server.

/etc/init.d/postgresql-9.3 stop

Copy database from master to slave

At this point our two servers are able to talk to each other, but the master is full of data and the slave is empty. To bootstrap the slave we'll copy all the data over from the master.

IMPORTANT: Make sure that you are executing the following commands from the master server, otherwise you might lose all your data!

The current database statuses should be: Master ON; Slave OFF

1. Rsync

First, we need to copy the database from the master to the slave, so we need a tool that can transfer files over the network - rsync is a great fit. The below command copies the data to the slave whilst ignoring the master-specific configuration files, and also excludes the recovery.conf we just created on the slave so that --delete doesn't remove it.

rsync -av --delete --exclude pg_xlog --exclude postgresql.conf --exclude pg_hba.conf --exclude postmaster.pid --exclude recovery.conf --exclude failover.start <DATA_DIR>/* postgres@<IP_SLAVE>:<DATA_DIR>/
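
Since --delete removes slave-side files that don't exist on the master, you may want to preview the transfer first with rsync's -n (--dry-run) flag, which prints what would be copied or deleted without touching anything:

rsync -avn --delete --exclude pg_xlog --exclude postgresql.conf --exclude pg_hba.conf --exclude postmaster.pid --exclude recovery.conf --exclude failover.start <DATA_DIR>/* postgres@<IP_SLAVE>:<DATA_DIR>/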

2. Checkpoint creation

Next, we need to create a checkpoint in the database and write a backup label file containing the minimum WAL position required for recovery. PostgreSQL ships a function for exactly this, called pg_start_backup; it puts the master into backup mode so the files we copy next form a consistent base.

sudo -u postgres psql -c "select pg_start_backup('init of slave', true)"
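
pg_start_backup drops a backup_label file into the data directory; if you want to see the checkpoint it recorded, you can read it directly:

cat <DATA_DIR>/backup_label    # shows the checkpoint WAL location and our 'init of slave' label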

3. Rsync again

By running rsync again, we'll copy two things: the freshly created backup label file, and any changes made in the master database during the first rsync. rsync only transfers files that are new or have changed, so this second pass is much faster.

rsync -av ...    # same command as in step 1

Once the second rsync completes, we need to exit backup mode on the master by running the following command (still on the master).

sudo -u postgres psql -c "SELECT pg_stop_backup();"
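
To double-check that the master really left backup mode, you can ask it directly - pg_is_in_backup() is available from PostgreSQL 9.3:

sudo -u postgres psql -c "SELECT pg_is_in_backup();"    # should return 'f'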

Start the slave

So this is pretty much it! Our final step will be starting the database on the slave. Simply execute (on the slave):

/etc/init.d/postgresql-9.3 start
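
If everything is wired up correctly, the slave's log will show it entering standby mode and connecting to the master. The log location varies by distro; with the layout used in this article it's something like:

tail -n 20 <DATA_DIR>/pg_log/postgresql-*.log
# look for lines like:
#   LOG:  entering standby mode
#   LOG:  started streaming WAL from primary at 0/3000000 on timeline 1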

There you go! Now data replication is configured and active between your master and slave servers.

Checking on Replication Status

Once we’ve configured everything, we should check that it works as designed. First, confirm that the slave serves reads: log in to the database and try to select something from any table (for this purpose I have a special test_replica table in my database):

postgres=# SELECT * FROM test_replica ;
     test
--------------
 IT WORKS!!!!
(1 row)

Now, let’s pretend to be a naughty user. Try to insert something on the slave server, which should not be allowed since the slave has read-only access:

postgres=# INSERT INTO test_replica VALUES ('ima naughty boi');
ERROR:  cannot execute INSERT in a read-only transaction

Great!
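
Another quick sanity check: ask the slave whether it considers itself in recovery (i.e. running as a standby):

postgres=# SELECT pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)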

Now let’s talk about replication monitoring. There are special functions for checking the current WAL position on each server.

On the master:

postgres=# SELECT pg_current_xlog_location();

 pg_current_xlog_location
--------------------------
 0/30195B8
(1 row)

And on the slave server there is a similar function:

postgres=# SELECT pg_last_xlog_receive_location();

pg_last_xlog_receive_location
-------------------------------
 0/30195B8
(1 row)

By comparing these two values you can see whether the servers are in sync. A small difference is normal - streaming replication is asynchronous, so the slave always lags slightly. But if the slave falls so far behind that the master has already recycled the WAL segments it needs (i.e. further back than wal_keep_segments), you'll have to re-seed it by copying the data from the master again. If the servers consistently diverge over time, your slave might need beefing up a bit in order to keep up with the master.
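
Rather than eyeballing the two positions, you can also ask the master for the full picture via the pg_stat_replication view, and compute the replay lag in bytes with pg_xlog_location_diff (both available on PostgreSQL 9.2 and later):

postgres=# SELECT client_addr, state, sent_location, replay_location, pg_xlog_location_diff(sent_location, replay_location) AS replay_lag_bytes FROM pg_stat_replication;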

The other way to check replication status is to inspect the WAL sender/receiver processes from a command shell. Note that the xlog files live under <DATA_DIR> (in my case: /var/lib/pgsql/9.3/data/pg_xlog/); each file in that folder is a WAL segment.

On the master:

 ps aux | egrep 'wal\ssender'

##command output will look like:
postgres  3251  0.0  0.1 325924  2788 ?        Ss   12:57   0:00 postgres: wal sender process postgres <IP_SLAVE>(39912) streaming 0/3019778

On the slave:

[root@Slave ~]$ ps aux | grep 'wal\sreceiver'

##command output will look like: 
postgres 14399  0.0  0.1 332352  3252 ?        Ss   12:57   0:00 postgres: wal receiver process   streaming 0/3019778

Wrap-up

At this point you have a live replica of your production database that is a much safer place for non-critical analytics and reporting queries. The benefit of setting up a streaming ‘hot-standby’ replica is that data is fresh and up to date, just like on the master. The downside is that the slave has to be performant enough to keep up with the master's WAL; if you have a particularly busy database this may be expensive.
