PostgreSQL was built to handle a wide range of use cases, but flexibility also has a flip side. Make sure you're not making these all-too-common design, configuration, tuning, and other mistakes.

A lot can go wrong with a PostgreSQL installation. Worse, many problems may lurk undetected as the issue builds over a period of time, then suddenly strike with a major impact that brings it to the forefront of everyone's attention. Whether it's a glaring drop in performance or a dramatic rise in resource consumption and billing costs, it's important to identify such problems as early as possible, or, better yet, avoid them by configuring your implementation to suit the desired workload.

Drawing on Percona's experience helping countless PostgreSQL shops over the years, we've compiled a list of the most common mistakes. Even if you think you've configured your PostgreSQL installation the right way, you may still find this list useful in validating your setup.

Mistake #1: Running the default configuration

PostgreSQL works right out of the box, but it's not very well configured for your needs. The default configuration is very basic and not tuned for any specific workload. This excessively conservative configuration allows PostgreSQL to run in any environment, with the expectation that users will configure it for their needs.

The pgtune tool offers a subset of configurations based on hardware resources and the type of workload. That's a good starting point for configuring your PostgreSQL cluster based on what your workload needs. Additionally, you may have to configure the autovacuum, log, checkpoint, and WAL (write-ahead log) retention variables.

It's really important that your server is optimally configured for any immediate future needs to avoid any unnecessary restarts. So take a look at all GUCs with the "postmaster" context in the pg_settings catalog view.

<code>
SELECT name, setting, boot_val
FROM pg_settings
WHERE context = 'postmaster';
</code>

This is especially critical when setting up a high availability (HA) cluster, because any downtime for the primary server will degrade the cluster and cause the promotion of a standby server to the primary server role.

Mistake #2: Unoptimized database design and architecture

This point cannot be emphasized enough. I've personally seen organizations pay more than five times the cost they needed to, simply because of unoptimized database design and architecture.

One of the best tips here is to look at what your workload needs right now, and in the near future, rather than what might be required in six months to a year's time. Looking too far ahead means that your tables are designed for future needs that may never be realized. And that's just one aspect of it.

Alongside this, overreliance on object-relational mapping (ORM) is also a major cause of poor performance. ORMs are used to connect applications to databases using object-oriented programming languages, and they should simplify life for your developers over time. However, it's critical that you understand what an ORM provides and what kind of performance impact it introduces. Under the hood, an ORM may be executing multiple queries, whether that's to combine multiple relations, to perform aggregations, or even to split up query data. Overall, you'll experience higher latency and lower throughput on your transactions when using an ORM.
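To see how that overhead creeps in, consider the classic "N+1" pattern that some ORMs generate when they lazily load related rows. The sketch below is purely illustrative; the customers and orders tables are hypothetical, and real ORMs differ in when and how they batch these queries.

<code>
-- What a hand-written query might look like: one round trip using a join.
SELECT c.id, c.name, o.id AS order_id, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id;

-- What a lazily loading ORM may issue instead: one query for the parent rows...
SELECT id, name FROM customers;
-- ...followed by one query per customer for the related rows,
-- multiplying round trips and latency as the result set grows.
SELECT id, total FROM orders WHERE customer_id = 1;
SELECT id, total FROM orders WHERE customer_id = 2;
</code>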
Beyond ORMs, improving your database architecture is about structuring data so that your read and write operations are optimal for indexes as well as for relations. One approach that can help is to denormalize the database, as this reduces SQL query complexity and the associated joins so that you may fetch data from fewer relations. In the end, performance is driven by a simple three-step process of "definition, measurement, and optimization" in your environment, for your application and workload.

Mistake #3: Not tuning the database for the workload

Tuning for a workload requires insights into the amount of data you intend to store, the nature of the application, and the type of queries to be executed. You can always tune and benchmark your setup until you are happy with the resource consumption under a severe load.

For example, can your entire database fit into your machine's available RAM? If yes, then you obviously would want to increase the shared_buffers value for it. Similarly, understanding the workload is key to how you configure the checkpoint and the autovacuum processes. For example, you'll configure these very differently for an append-only workload compared to a mixed online transaction processing workload that meets the Transaction Processing Performance Council Type C (TPC-C) benchmark.

There are a lot of useful tools out there that provide query performance insights. You might check out my blog post on query performance insights, which discusses some of the open source options available, or see my presentation on YouTube. At Percona, we have two tools that will help you immensely in understanding query performance patterns:

- PMM – Percona Monitoring and Management is a free, fully open source project that provides a graphical interface with detailed system statistics and query analytics. Feel free to try out the PMM demo, which caters to MySQL, MongoDB, and PostgreSQL.
- pg_stat_monitor – This is an enhanced version of pg_stat_statements that provides more detailed insights into query performance patterns, actual query plans, and query text with parameter values. It's available on Linux from our download page or as RPM packages from the PostgreSQL community yum repositories.

Mistake #4: Improper connection management

The connections configuration looks innocuous at first glance. However, I've seen instances where a very large value for max_connections has caused out of memory errors, so configuring max_connections requires some attention.

The number of cores, the amount of memory available, and the type of storage must be factored in when configuring max_connections. You don't want to overload your server resources with connections that may never be used. Then there are kernel resources that are also being allocated per connection; the PostgreSQL documentation on managing kernel resources has more details.

When clients are executing queries that take very little time, a connection pooler significantly improves performance, as the overhead of spawning a connection becomes significant in this type of workload.

Mistake #5: Vacuum isn't working properly

Hopefully, you have not disabled autovacuum. We've seen in many production environments that users have disabled autovacuum altogether, usually due to some underlying issue. If autovacuum isn't really working in your environment, there can be only three reasons for it (the query after this list can help you spot which one applies):

1. The vacuum process is not being triggered, or at least not as frequently as it should be.
2. Vacuuming is too slow.
3. The vacuum isn't cleaning up dead rows.
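A quick way to narrow this down is to look at dead-tuple counts and the most recent autovacuum runs per table. The following is a minimal sketch using the pg_stat_user_tables view; the ordering and row limit are arbitrary choices, not recommendations.

<code>
-- Tables with the most dead tuples, along with the last autovacuum and
-- autoanalyze times, to see whether autovacuum is keeping up.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
</code>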
Reasons 1 and 2 are directly related to configuration options. You can see the vacuum-related options by querying the pg_settings view.

<code>
SELECT name,
       short_desc,
       setting,
       unit,
       CASE
           WHEN context = 'postmaster' THEN 'restart'
           WHEN context = 'sighup' THEN 'reload'
           ELSE context
       END AS "server requires"
FROM pg_settings
WHERE name LIKE '%vacuum%';
</code>

The speed can potentially be improved by tuning autovacuum_work_mem and the number of parallel workers. The triggering of the vacuum process may be tuned via configuring scale factors or thresholds.

When the vacuum process isn't cleaning up dead tuples, it's an indication that something is holding back key resources. The culprits could be one or more of these:

- Long-running queries or transactions.
- Standby servers in a replication environment with the hot_standby_feedback option turned on.
- A larger than required value of vacuum_defer_cleanup_age.
- Replication slots that hold down the xmin value and prevent the vacuum from cleaning dead tuples.

If you want to manage the vacuum of a relation manually, then follow Pareto's law (aka the 80/20 rule): tune the cluster to an optimal configuration and then tune specifically for those few tables. Remember that autovacuum or toast.autovacuum may be disabled for a specific relation by specifying the associated storage option during the CREATE or ALTER statement.

Mistake #6: Rogue connections and long-running transactions

A number of things can hold your PostgreSQL cluster hostage, and rogue connections are one of them. Other than holding onto connection slots that could be used by other applications, rogue connections and long-running transactions hold onto key resources that can wreak havoc throughout the system. To a lesser extent, in a replication environment with hot_standby_feedback turned on, long-running transactions on the standby may prevent the vacuum on the primary server from doing its job.

Think of a buggy application that opens a transaction and stops responding thereafter. It might be holding onto locks or simply preventing the vacuum from cleaning up dead tuples, as those remain visible to such transactions. What if that application were to open a huge number of such transactions?

More often than not, you can get rid of such transactions by configuring idle_in_transaction_session_timeout to a value tuned for your queries. Of course, always keep the behavior of your application in mind whenever you start tuning this parameter.

Beyond tuning idle_in_transaction_session_timeout, monitor pg_stat_activity for any long-running queries, or any sessions that have been waiting for client-related events for longer than the expected amount of time. Keep an eye on the timestamps, the wait events, and the state columns.

<code>
backend_start   | 2022-10-25 09:25:07.934633+00
xact_start      | 2022-10-25 09:25:11.238065+00
query_start     | 2022-10-25 09:25:11.238065+00
state_change    | 2022-10-25 09:25:11.238381+00
wait_event_type | Client
wait_event      | ClientRead
state           | idle in transaction
</code>

Other than these, prepared transactions (especially orphaned prepared transactions) can also hold onto key system resources (locks or the xmin value). I would recommend setting up a nomenclature for prepared transactions to define their age. Say, a prepared transaction with a max age of 5 minutes may be created as PREPARE TRANSACTION 'foo_prepared 5m'.
<code>
SELECT gid,
       prepared,
       REGEXP_REPLACE(gid, '.* ', '') AS age
FROM pg_prepared_xacts
WHERE prepared + CAST(REGEXP_REPLACE(gid, '.* ', '') AS INTERVAL) < NOW();
</code>

This provides a scheme for applications to define the age of their prepared transactions. A cron job or a scheduled job could then monitor and roll back any prepared transactions that remain active beyond their intended age.

Mistake #7: Over-indexing or under-indexing

Surely there's nothing wrong with over-indexing a relation. Or is there? To get the best performance out of your PostgreSQL instance, it is imperative that you understand how PostgreSQL manages indexes.

There are multiple types of indexes in PostgreSQL. Each has a different use case, and each has its own overheads. B-tree is the most commonly used index type, and it is used for primary keys as well. The past few major releases have seen a lot of performance-related (and debloating) improvements in B-tree indexes. Here is one of my blog posts that discusses duplicate version churns in PostgreSQL 14.

When an index scan is executed on a relation, for each matching tuple, it accesses the heap to fetch both data and visibility information, so that only the version visible to the current transaction is chosen. Over-indexing will cause updates to more indexes, therefore consuming more resources without reaping the desired benefits. Similarly, under-indexing will cause more heap scans, which will potentially lead to more I/O operations and therefore a drop in performance.

Indexing is not just about the number of indexes you have on a relation; it is how optimized those indexes are for the desired use cases. Ideally, you would want to hit an index-only scan each time, but there are limitations. Although B-tree indexes support index-only scans for all operators, GiST and SP-GiST indexes support them only for some operators. See the documentation for more details.

Following a simple checklist can help you validate that your system is optimally set up for indexes:

- Ensure the configuration is properly set (e.g., random_page_cost is tuned for your hardware).
- Check that statistics are up to date, or at least that the ANALYZE or VACUUM commands have been run on the relations with indexes. This will ensure that statistics are more or less up to date, so that the planner has a better probability of choosing an index scan.
- Create the right type of index (B-tree, hash, or another type).
- Use indexes on the right columns. Don't forget to include non-indexed columns to avoid heap access. Not all index types allow covering indexes, so do check the documentation.
- Get rid of unnecessary indexes. See pg_statio_user_indexes for more insights into indexes and block hits.
- Understand the impact of covering indexes on features like deduplication, duplicate version churns, and index-only scans.

See this wiki page on index maintenance for more useful queries.

Mistake #8: Inadequate backups and HA

HA is not just about keeping a service up and running. It's also about ensuring that the service responds within the defined acceptance criteria and that it satisfies the RPO (recovery point objective) and RTO (recovery time objective) targets. To match the uptime requirements and the number of nines you are targeting, refer to this wiki page for percentage calculations, or see the quick calculation below.

Many factors must be considered for meeting RPO and RTO, including planned downtimes, any automated or manual activities, their frequencies and durations, and of course the costs associated with unplanned downtime.
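As a rough, back-of-the-envelope illustration of what those availability percentages mean in practice, the following query (purely illustrative) converts a handful of common "nines" targets into the maximum downtime they allow per year.

<code>
-- Allowed downtime per year, in minutes, for a given availability target.
SELECT nines AS availability_pct,
       round((100 - nines) / 100 * 365 * 24 * 60, 1) AS max_downtime_minutes_per_year
FROM (VALUES (99.0), (99.9), (99.99), (99.999)) AS t(nines);
</code>

For example, 99.9% availability allows roughly 8.8 hours of downtime per year, while 99.99% allows less than an hour.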
Having both accurate and timely backups and the ability to recover them effectively play a key role in defining both of these parameters. How frequently is the data being backed up? How are the WAL files being managed? How are the backups and WAL files being validated?

Depending on the workload and the maintenance windows available, backups should be taken at least once every seven days. Alongside this, you should test your recovery process on a regular basis, so that you know those backups are useful. The truth is that your plan around this will be successful only if you can get your application back up and processing. An untested backup should not be trusted.

Mistake #9: Mismanaging extensions

PostgreSQL comes with more than 50 extensions. Then there are all the third-party extensions provided by individuals and organizations. There are commonly used extensions like pg_stat_statements provided by the core, and there are other well-known extensions like PostGIS that are not part of the core.

First and foremost, you should ensure that any set of extensions you have deployed can work together without breaking one another. Alongside this, there are performance considerations. Some extensions are simply SQL extensions, but there are others that carry a shared object or a DLL, which will consume more resources and affect overall performance. Make sure you understand what resources will be consumed by these extensions.

More importantly, any extensions that are preloaded become part of the server. Whether or not you have created a SQL interface by issuing a CREATE EXTENSION statement, these extensions will be working behind the scenes. For example, adding pg_stat_statements to shared_preload_libraries will cause performance degradation whether you have created the SQL interface or not. The overall lesson here is to think carefully about whether you really need those extensions.

Here are some queries around extensions that you may find useful. You can query the pg_extension catalog for information about the installed extensions.

<code>
SELECT * FROM pg_extension;
</code>

Similarly, you can find out which extensions are available on the system.

<code>
SELECT * FROM pg_available_extensions();
</code>

And you can find out the available versions of the extensions you have.

<code>
SELECT * FROM pg_available_extension_versions();
</code>

Mistake #10: Ignoring supporting tools

Setting up your PostgreSQL cluster should not be looked at in isolation. You need to consider what other supporting tools could lead to an improved experience with PostgreSQL, so it is imperative that you analyze the available tools. There are misconceptions around some tools owing to older releases that had serious issues, so look at the new versions, the proactiveness of the respective community, and the frequency of releases.

As an example, let's review a few tools that are used for connection pooling and load balancing in the PostgreSQL ecosystem: PgBouncer, HAProxy, and Pgpool-II.

HAProxy is a load balancer. Note that the versions packaged with various distributions are rather old. CentOS 7 has version 1.5, CentOS 8 has version 1.8, and the latest is 2.6. Just for reference, HAProxy 2.4 had 1687 new commits. While it is easier to use the packages provided by your distribution, those might be too old.

PgBouncer is a lightweight connection pooler. Although it's single-threaded, kernels that support the SO_REUSEPORT option might allow load balancing if you run multiple PgBouncer instances that listen to the same port.
Check your kernel's documentation to see whether it supports load balancing, round robin, or neither. Using a systemd template, you can run multiple PgBouncer instances in a very simple and elegant manner: simply create a file /etc/systemd/system/pgbouncer@.service and run however many instances of PgBouncer you like using commands like systemctl start pgbouncer@1, systemctl start pgbouncer@2, and so on.

Pgpool-II has picked up a lot over the last few years. Many features have been added, including monitoring and quorum presence, so it offers a lot more than just connection pooling.

What will you use? PgBouncer? HAProxy? Pgpool-II? Or perhaps PgBouncer and HAProxy? Do you have streaming replication configured? With HAProxy, you may need to set up separate ports for readers and writers. In the end, the selection will be determined by your use cases (and in some cases, the misuse cases!).

PostgreSQL is an incredibly popular open source database for many reasons. It is designed to be easy to use and extensible enough to meet the needs of a wide range of users. However, this flexibility does mean that you have to look at your approach and consider tuning your installation accordingly. Not only will your application perform better, which means happier users, but you might also save a substantial amount on costs over time.