Starting a new project with PostgreSQL
Some tips and hints not to forget *from the start*
- Tags: #Tips
- Published
- Reading time: 15 minutes
Foreword: a checklist attempt for new PG projects
In a recent customer meeting, someone asked me: “How do we start our new project with PostgreSQL? Since we’re new to it, what shouldn’t we forget right from the start? The kind of things, you know, that may cost us a lot to fix once the project is up and running?”.
This is a great question I’ve been getting from time to time for a decade now, and I felt it could be a good topic for an article on my blog. I may update it from time to time, when I (probably) realize that, dammit, I forgot that… ;-).
And be sure I’ll forget things. :-)
I’m not pretending everything here is good advice for everyone, nor am I covering everything in detail; it’s more a list of things I suggest, based on my experience. You may have an even better list somewhere, I don’t know. It’s just “one more” in the wild…
By the way, I’d be very curious to get your feedback and best practices here, so don’t hesitate to start the conversation with me!
“Think cluster” from the start
Starting a new project with PostgreSQL is most often done on a single PostgreSQL instance, aka “Stand Alone”.
So developers will write all the necessary SQL to insert, update, delete and select the data in there, treating the database as one unique source of truth, and they will use a single connection.
Later, you’ll realize that PostgreSQL has replication support, in many flavours, and you’ll probably want to use that to build a stronger infrastructure, with replicas, and so on.
Going a bit deeper, you’ll want to move some workload to the replicas, because you have a lot of “read only” traffic, and that’s cool, because replicas in the PostgreSQL world can be opened to read-only SQL.
So basically, the “hint” here is to think about that right from the start.
That can be as easy as declaring two connection URLs instead of a single one: one could be named “RW” for read/write, and the other “RO” for read-only:
- anything that is a write, or needs the most recent data to be read, will be sent to the “RW” connection;
- anything that is a read and can accept some delay, or let’s say “possibly incomplete” data, will be sent to the “RO” connection.
And it doesn’t matter if both of those RW and RO URLs point to the very same PostgreSQL “Stand Alone” instance when you start the project!
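For example (the hostnames, users and database name below are made up for illustration), the application configuration could simply hold two connection URIs:

```python
# A minimal sketch: two entries in the application's configuration.
# Both may point to the very same stand-alone instance at the beginning,
# and diverge once replicas exist (hostnames and names are hypothetical).
DATABASE_URLS = {
    "rw": "postgresql://app_rw@db-primary.example.com:5432/appdb",
    "ro": "postgresql://app_ro@db-replica.example.com:5432/appdb",
}
```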
The good habit here is to teach developers that two sources of truth can exist, and that they should use the RO one as much as possible. And only they can decide which queries go where, because they have the application logic in mind, which DBAs most often don’t.
So when you go from Stand Alone to a cluster with PostgreSQL (one primary, many replicas, synchronous or not…), you’ll just have to change the connection strings, and the app will automatically start spreading the read-only workload over the replica(s).
It’s really better to rely on developers than on any sort of magical tool that will read and try to understand what’s happening, to guess whether a transaction and/or statement can actually be balanced to read-only replicas.
But for that, think about it right from the start of the project.
“Think cluster” and “High Availability” from the start
Not only do devs have to “think cluster” from the start, but also “high availability”.
Once you move from a PostgreSQL Stand Alone to a PostgreSQL cluster, with a primary and many replicas, you may want to go HA with some tools added to PostgreSQL (the best known and most used in this regard are Patroni + etcd).
Those tools, among other things, allow automatic failover of the primary PostgreSQL instance when it goes down. Most of the existing stacks will also use things like HAProxy to manage the ports used (e.g. 5432 for the RW and 5433 for the RO). Those tools will automatically perform the failover and modify the connections so the RW and RO ports come back available.
Those operations generally take seconds to perform on modern infrastructures.
Give a chance to your application to benefit from that!
That means that if you want the best user experience here, you simply add the FAIL AND RETRY pattern to your application.
That means trapping database errors when they happen: if it’s a sudden disconnection, try to reconnect; if that doesn’t work, try again a second or more later. Loop that process for, let’s say, 10 or 20 seconds. If it still doesn’t work, something else is happening…
But for most failover scenarios, that will be enough. Your app will see the downtime, and so will the user, but it will only last a few seconds.
Of course, depending on your processes, and whether the replication is synchronous or not, you may do other checks in the error handling after a failover. But your app won’t have that “Database connection failed” message directly thrown at your users while the failover is happening!
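Here is a minimal sketch of that FAIL AND RETRY idea, in Python with psycopg (the connection URL, number of attempts and delay are made-up values to adapt to your own context):

```python
import time

import psycopg  # psycopg 3; the same idea works with any other driver

# Made-up URL: in practice this comes from your "RW"/"RO" configuration.
RW_URL = "postgresql://app_rw@db-primary.example.com:5432/appdb"


def connect_with_retry(url: str, attempts: int = 10, delay: float = 2.0):
    """Try to (re)connect for roughly attempts * delay seconds.

    During a failover handled by a Patroni/HAProxy-style stack, the port
    usually comes back within a few seconds, so a simple loop like this
    hides most of the downtime from the user.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return psycopg.connect(url)
        except psycopg.OperationalError as err:
            last_error = err
            time.sleep(delay)
    # Still failing after ~20 seconds: something else is going on.
    raise last_error
```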
Don’t dive into tuning too early
Strangely, one of the questions I get very often is: “How do I tune PostgreSQL given our server’s technical specs and the planned usage?”. This mostly means which values one should put in postgresql.conf (to simplify, because those configuration values can be customized per database and/or role and/or… lots of things!).
Most newcomers to PostgreSQL are a bit afraid of all the parameters postgresql.conf has, so they ask me and my peers about the magic tricks and so on.
And that’s one of the worst legacies of proprietary software usage.
It’s especially true for databases, where only a few gurus know that a given parameter even exists, with that name and those possible values. Whereas in PostgreSQL, they are all listed in there, and completely commented.
But having a caption on each of those buttons doesn’t mean one knows how and when to use it. They just see it as a plane’s cockpit.
I may surprise you, but among this ton of parameters, actually very (very) few have to be changed from the start.
If you’re searching for good-enough starting parameters, I strongly recommend you take a look at PGTune. I’ve been using it for a long time now, and it has always given me a decent starting point.
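To give an idea, applying a handful of PGTune-like values can be as simple as the following (the values are purely illustrative, not recommendations; compute your own for your hardware):

```sql
-- Illustrative values only: compute your own with PGTune for your hardware.
-- ALTER SYSTEM writes to postgresql.auto.conf; shared_buffers needs a
-- restart, most of the others only need a reload.
ALTER SYSTEM SET shared_buffers = '4GB';
ALTER SYSTEM SET effective_cache_size = '12GB';
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET work_mem = '32MB';
ALTER SYSTEM SET random_page_cost = 1.1;              -- assuming SSD storage
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();
```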
Really, don’t spend much time here! It isn’t worth it much. You can come back to it later, once you’ve done the following.
Rather try to:
- buy a bigger box, with more CPUs, because PostgreSQL isn’t multithreaded, but it parallelizes very nicely when you need it most;
- buy more RAM, the fastest you can, with the biggest cache you can;
- buy SSDs or the fastest disks you can.
And probably the most important: double-check your SCHEMA and the QUERIES you’re writing to use it.
It’s often said here and there that the “SQL part” of optimisation accounts for 50% of the performance. I’d rather say it’s probably more than that now; I wouldn’t be surprised if it’s actually 2/3 of it.
There are a ton of articles out there on how to do that. I may write some, but plenty have been around for a while…
Some hints though.
Some focus on the data types you may use. Speaking of those, please, never ever use Large Objects: they are a legacy from the past, and you should be using bytea instead, if you really need to store binaries in a database at all, which I don’t recommend much, except for some use cases. At least, try to keep those as small as possible.
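As a quick illustration (the table and column names below are made up), a plain bytea column is all you need:

```sql
-- Hypothetical table: a bytea column instead of a Large Object.
CREATE TABLE attachments (
    id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    filename text  NOT NULL,
    content  bytea NOT NULL   -- keep these payloads as small as possible
);
```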
Some articles focus on INDEXes. Those are my favorites. NEVER create indexes beforehand, except those you need for primary or foreign keys, or anything you will use for sure.
You’d be surprised how often we see that. Probably in every customer case I work on. There. Are. TONS… of unused indexes.
An index is costly to create and to maintain, takes a lot of space on disk, etc. If you want PostgreSQL to do that for you, please at least make sure your queries will be using it!
Check your unused indexes. For that, I recommend you write the query yourself rather than copy/pasting things from the web without understanding what they do. The best way is to read about The Cumulative Statistics System, then pg_stat_all_indexes and pg_statio_all_indexes… then the same thing, but not the all ones, rather the user ones, because you want to focus on your application’s schema, not the rest.
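For illustration only (again, build and understand your own version!), a minimal sketch of such a query based on pg_stat_user_indexes could look like this:

```sql
-- Indexes of your application's schemas that have never been scanned,
-- biggest first. Double-check before dropping anything: some of them
-- may back a primary key, unique or exclusion constraint.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```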
Finally, before complaining like “Hey! PostgreSQL is not using my index?!?”… there are actually a lot of cases where it won’t use it, or maybe not like you think… Small table? Wrong configuration parameters? Forgot to run ANALYZE, i.e. wrong statistics on the table? By default I always assume the PostgreSQL optimizer is smarter than I am. And I must sadly admit I couldn’t demonstrate the contrary so far :-/
In short: “Don’t over-tune from the start”.
Rather go step by step, when you need it. Don’t be afraid: PostgreSQL has all you need to gain performance when the defaults are no longer sufficient.
Another good thing to do is to plan table volumes, usage patterns, etc., to get a good idea of how the database will likely grow. And there is one thing you can do here that, on the projects where I saw it (unfortunately not many), helped a lot: try to load-test your app.
There are a lot of tools out there that can help you achieve this. pgSimload is one of them. And it has FAIL & RETRY built in ;-) (OK, this was a very unfair advertisement for my own project, but hey, it’s my blog, isn’t it?).
Frankly, don’t use JMeter or such :-| Your app deserves better. pgSimload at least can run multiple clients, multiple configs (like RW and RO JSON configs), think times (sleep and rsleep can be combined), etc…
Monitoring and Alerting
Set up a decent monitoring and alerting stack directly from the start.
Personally, I like pgMonitor because it is simple, reliable, and based on standards like sql_exporter, Prometheus, Grafana, etc.
You could also give PostgreSQL Workload Analyzer a try. It has great features, among them a close-up view of what’s going on on the servers, hypothetical indexes, and index creation suggestions, so you don’t create unused indexes anymore.
Whatever tool you choose, do set it up and use it. But also test it: try to create problems on your system to check that the alerting works the way you think it should.
From the start, probably one of the most important things to monitor with PostgreSQL is free disk space. Running out of it happens a lot in customer cases!
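Free disk space itself is watched at the operating-system level (or by your exporter), but a quick look from inside PostgreSQL at how much space each database uses is a useful complement, for example:

```sql
-- Size of each database, biggest first (free disk space itself has to be
-- monitored at the operating system / exporter level).
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
```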
Under-provisioning is another common mistake, but that’s a bit outside the monitoring and alerting topic.
Practise restore. Again. And Again.
Because backups have to be functional from the very start, you have to practise restore too.
You can NEVER BE SURE backups are fine UNTIL they are proven to be restorable when you need them to be.
Practise that at least once per month. Every single person who may have to restore, or who is responsible for the databases, should do it. Procedures have to be written, scripts too, and both have to be tested.
Really, it sounds like a joke, but it really isn’t. Countless users, not to mention customers, have lost data because they thought backups were fine. They weren’t in the end, and all you can do is cry when all that remains after a crash is a useless set of “backups”.
I strongly recommend you use a tool that is designed for backup and restore. Among the existing ones in the PostgreSQL world, the most used and best known is, without any doubt, pgBackRest. I used to use and recommend it before I was actually working at Crunchy Data.
I’m hearing people in the back of the room shouting:
“Bullshit!! The most known and used in the world is most probably pg_dump!!!”
Well. If you think that tool is a backup one, you’re 100% wrong. It’s a tool to create a logical backup of a PostgreSQL database… I call that an export of a database.
If you plan to use “nothing else than PostgreSQL”, use pg_basebackup instead.
But I personally would still recommend pgBackRest in all cases, starting from the simplest!
Whatever the tool, use the one that best fits you, and please, again, practise restore at least once per month.
Go “FULL SSL” from the start
Also, configure everything with SSL in your project from the start.
Not only between the PostgreSQL server and its clients, but also between the probes and the monitoring stack, between Patroni and etcd, between PostgreSQL and the pgBackRest server. Etc. Everywhere.
Because that will make you learn about SSL and certificates, think about certificate rotation, root CAs, and other topics that most often come up later in the project. And that’s a real problem: you can’t imagine how much time is lost in emails and meetings around which port is used now, which one to open and why, etc…
Build it as secure as possible right from the start, and learn how to maintain it.
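On the PostgreSQL server side, a minimal sketch of what that means could look like the following (the file paths are placeholders, and pg_hba.conf still has to be adjusted so only SSL connections are accepted):

```sql
-- Placeholders: point these at your real certificate files, then reload.
ALTER SYSTEM SET ssl = 'on';
ALTER SYSTEM SET ssl_cert_file = '/etc/ssl/certs/server.crt';
ALTER SYSTEM SET ssl_key_file  = '/etc/ssl/private/server.key';
ALTER SYSTEM SET ssl_ca_file   = '/etc/ssl/certs/root.crt';
SELECT pg_reload_conf();
-- In pg_hba.conf, prefer "hostssl" lines (e.g. with scram-sha-256)
-- so non-SSL connections are rejected.
```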
Create many roles (users) directly
I mean it. A modern RDBMS like PostgreSQL really handles multiple roles well. You don’t have to have only one “user” or “dbuser” that does everything. And please, don’t use PostgreSQL’s superuser (postgres most of the time) in your applications!
Like it’s done in marketing with different “personas”, try to create many different roles. At the very least, you should have one or more to be used in most OLTP scenarios, and one or more to be used for statistics and other big computations.
It will also allow you to use more advanced security features like Row Level Security when you need them.
And finally, it will also allow you to fine-tune some important configuration parameters “per role” (aka “per user”), with things like ALTER ROLE … SET configuration_parameter.
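A minimal sketch of what those “personas” could look like (role names and values below are made up for illustration):

```sql
-- Hypothetical "personas": one role for the OLTP traffic, one for
-- reporting and big computations. Neither of them is a superuser.
CREATE ROLE app_oltp      LOGIN PASSWORD 'change-me';
CREATE ROLE app_reporting LOGIN PASSWORD 'change-me';

-- Per-role tuning: short statements for OLTP, more memory for reporting.
ALTER ROLE app_oltp      SET statement_timeout = '5s';
ALTER ROLE app_reporting SET work_mem = '256MB';
```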
Prepare for frequent upgrades
CVEs, bugs… minor and major versions… You’ll be amazed at how often you’ll have to upgrade. It’s even more surprising for users coming from proprietary software, where there are probably as many bugs and CVEs, but… less frequent updates?
As the official documentation states:
“To receive notifications about security releases or other security related news, you can subscribe to the pgsql-announce mailing list. If you set your subscription to only include the tag Security, it will exclude all other announcements that are sent to this list.”
Just do that. I strongly recommend it.
Also, don’t think it will be PostgreSQL only: it will be every component in your stack, even the smallest, because a security problem can happen in any of them.
And if you’re using a more complex set of tools around PostgreSQL, then you most probably need a PostgreSQL vendor to monitor that for you and prepare packages or images with updated software. It’s called a subscription… and you know where to get one, I guess.
So, brace for impact: prepare for frequent upgrades!
Speaking of PostgreSQL, migrating from a minor version to the next one (e.g. from 17.1 to 17.2) is as easy as replacing the binaries. It will still need the services to be restarted, and therefore planned.
That’s the very key here: a best practice is to plan, from the beginning of the story, a maintenance window, and probably more than one… Typically, a monthly one, a quarterly one and a yearly one have to be planned in advance.
This will allow you to do the necessary updates at the best time for you, without having to ask for permission every time. Because it’s your responsibility, I strongly encourage you to set those rules beforehand, rather than asking afterwards, when everybody thought the project would be seven nines…
Also, since fewer and fewer users can actually afford hours, if not minutes, of downtime, this will likely push you to automate things, test on non-prod systems, etc. That is a happy consequence of setting maintenance windows, IMHO.
And you’ll rapidly fall into CI/CD things. And yes, I recommend that too: thinking “Continuous Integration” from the start gives better odds of success in this area, IMHO.
You have special needs? Look at EXTENSIONS first
For those who already know the PostgreSQL world, this is most probably obvious. It is not for the ones just embracing it now.
PostgreSQL already does a lot of things. As an example, it has NoSQL features, a complete Full Text Search, etc. But if you have particular needs and didn’t find the relevant keywords in the official PostgreSQL documentation, before thinking you’ll have to search somewhere else… or do it yourself, please search into EXTENSIONS!
I mean not only those already included, I mean the whole world of hundreds of extensions for PostgreSQL!
Here are some lists where you can search for what you need:
- 1000+ PostgreSQL EXTENSIONs by Joel Jakobsson
- Software Catalogue - PostgreSQL Extensions on the PostgreSQL.org website: 18 products listed so far :-/
And if you plan to use PostgreSQL in a public cloud, check on the provider’s documentation pages whether the extension(s) you need is (are) supported.
And if you still can’t find anything, try a web search like “my need extension postgresql”.
Special extension to consider: PostgreSQL Anonymizer
You should use PostgreSQL Anonymizer right from the start in any PostgreSQL project where you will (or may) store personally identifiable information, or any commercially sensitive data.
GDPR is a serious thing. Private data is a serious thing… And learning how to use PostgreSQL Anonymizer from the beginning of the project is a huge plus. Habits will come fast enough, and good practices too.
With all the data leaks we have seen in recent years, and will unfortunately still see in the future, a great practice is to learn from the start how to anonymize data when there’s no need to have the real values. The documentation of this extension is clear enough about what it does and how. There’s a full tutorial on the project’s website that I recommend you follow.
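Just to give an idea of the declarative style of this extension (the table, columns and rules below are made up; check the extension’s documentation for the exact setup steps of your version):

```sql
-- Hypothetical masking rules: the real names come from your own schema.
CREATE EXTENSION IF NOT EXISTS anon CASCADE;
SELECT anon.init();

SECURITY LABEL FOR anon ON COLUMN customer.lastname
    IS 'MASKED WITH FUNCTION anon.fake_last_name()';
SECURITY LABEL FOR anon ON COLUMN customer.email
    IS 'MASKED WITH FUNCTION anon.fake_email()';

-- Static masking: replace the real values according to the rules above
-- (typically on a copy of the database, never on production itself!).
SELECT anon.anonymize_database();
```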
It’s a good complement to going “FULL SSL” from the start. Most of today’s applications store data in databases; let’s make that as secure as we can by default.
That’s all folks!
That’s all for today, the 1st release of this article. I hope I’ll be able to add things in there in the future :-)