by Charly Batista

What software developers should know about SQL

feature
Oct 10, 2023 · 10 mins
Databases, Relational Databases, Software Development

The lingua franca of databases, Structured Query Language is ubiquitous, powerful, and important for developers to understand. Start with these tips.


Since Structured Query Language was invented in the early 1970s, it has been the default way to manage interaction with databases. SQL remains one of the top five programming languages according to Stack Overflow, with around 50% of developers using it as part of their work. Despite this ubiquity, SQL still has a reputation for being difficult or intimidating. Nothing could be further from the truth, as long as you understand how SQL works.

At the same time, because businesses today place more and more value on the data they create, knowing SQL will provide more opportunities for you to excel as a software developer and advance your career. So what should you know about SQL, and what problems should you look to avoid?

Don’t fear the SQL

SQL can be easy to use because it is so structured. SQL strictly defines how to put queries together, making them easier to read and understand. If you are looking at someone else’s code, you should be able to understand what they want to achieve by going through the query structure. This also makes it easier to tune queries over time and improve performance, particularly if you are looking at more complex operations and JOINs.

However, many developers are put off by SQL because of their initial experience, and that experience often comes down to how they use the first command they learn: SELECT. The most common mistake developers make when starting to write SQL is deciding which columns to include in a SELECT. If you just want to look at your data and get a result, why not select everything with SELECT *?

Using SELECT * too widely can have a big impact on performance, and it can make your queries hard to optimize over time. Do you really need every column in your query, or can you be more specific? This has a real-world impact, because it can produce massive ResultSet responses that inflate the memory footprint your server needs to run efficiently. If your query pulls back too much data, you end up assigning more memory to it than the work requires, particularly if you are running your database in a cloud service. Cloud consumption costs money, so you can end up spending far more than necessary simply because of how you wrote a query.
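To make that concrete, here is a minimal JDBC sketch of the difference. The table, column names, and connection details are hypothetical; the point is that the second query only asks the database for the columns the code actually reads.

```java
import java.sql.*;

public class SelectColumnsExample {
    public static void main(String[] args) throws SQLException {
        // Hypothetical local PostgreSQL database with a "users" table.
        String url = "jdbc:postgresql://localhost:5432/appdb";
        try (Connection conn = DriverManager.getConnection(url, "app", "secret")) {
            // Avoid: pulls every column (and far more data) into the ResultSet.
            // ResultSet all = conn.createStatement().executeQuery("SELECT * FROM users");

            // Prefer: ask only for the columns the code actually needs.
            String sql = "SELECT id, email FROM users WHERE created_at > ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setDate(1, Date.valueOf("2023-01-01"));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("id") + " " + rs.getString("email"));
                    }
                }
            }
        }
    }
}
```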

Know your data types

Another common problem for developers using SQL involves the type of data they expect to find in a column. Two of the most common types you will work with are integers and variable characters, or varchar. Integer fields contain only numbers, while varchar fields can contain numbers, letters, and other characters. If you approach your data expecting one type, typically integers, and then get another, you can end up with data type mismatches in your predicates.

To avoid this problem, be careful with the statements and prepared statements you use regularly, so you do not end up in situations where you expect one result and get something else. Similarly, when you JOIN database tables, check that the join columns do not have different data types. Verifying this up front helps you avoid data loss when the JOIN is carried out, such as values being implicitly truncated or converted.
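As a rough sketch of what this looks like in practice, the example below uses a prepared statement and a JOIN where both sides of the join and the bound parameter share the same integer type. The table names, columns, and connection details are hypothetical.

```java
import java.sql.*;

public class TypeMismatchExample {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost:5432/appdb"; // hypothetical connection details
        try (Connection conn = DriverManager.getConnection(url, "app", "secret")) {
            // Bind parameters with the type the column actually has.
            // If orders.customer_id is an integer column, use setLong/setInt,
            // not setString, so the predicate compares integers to integers.
            String sql = "SELECT o.id, o.total "
                       + "FROM orders o "
                       + "JOIN customers c ON c.id = o.customer_id "  // both sides integer
                       + "WHERE c.id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, 42L);               // matches the integer column type
                // ps.setString(1, "42");         // avoid: forces an implicit cast
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("id") + " " + rs.getBigDecimal("total"));
                    }
                }
            }
        }
    }
}
```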

Another issue that commonly gets overlooked is character sets, or charsets. Always check that your application and your database are using the same charset. Mismatched charsets lead to encoding problems that can garble your application's output and prevent you from storing text in a particular language or with particular symbols. At worst, they can cause data loss or odd errors that are hard to debug.
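One way to catch this early is to ask the database which encodings are in play. The sketch below is PostgreSQL-specific and uses hypothetical connection details; other databases expose this information through different commands.

```java
import java.sql.*;

public class CharsetCheck {
    public static void main(String[] args) throws SQLException {
        // Hypothetical PostgreSQL connection; adjust for your environment.
        String url = "jdbc:postgresql://localhost:5432/appdb";
        try (Connection conn = DriverManager.getConnection(url, "app", "secret");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT current_setting('server_encoding') AS server_enc, "
               + "       current_setting('client_encoding') AS client_enc")) {
            if (rs.next()) {
                String serverEnc = rs.getString("server_enc");
                String clientEnc = rs.getString("client_enc");
                System.out.println("server=" + serverEnc + " client=" + clientEnc);
                if (!serverEnc.equalsIgnoreCase(clientEnc)) {
                    // Flag a difference so you can confirm the conversion between
                    // client and server encodings is actually what you intend.
                    System.err.println("Note: client and server encodings differ");
                }
            }
        }
    }
}
```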

Understand when data order matters

One assumption many developers make when they start working with databases is that the order of columns no longer matters. After all, plenty of database vendors tell us we don't need to think about schemas because their tools will take care of all of that for us. However, while it might appear there is no impact, column order can carry a sizable computational cost on your infrastructure. When you are using cloud services that charge for usage, that cost can add up quickly.

It is important to know that not all databases behave the same here, and not all indexes are the same either. For example, column order is very important for composite indexes, because their columns are evaluated from the leftmost onward, in the order they were declared when the index was created. That order therefore has a real impact on performance over time.

However, the order in which you declare columns in a WHERE clause does not have the same impact. This is because the database has components, such as the query planner and query optimizer, that try to reorganize a query into the most efficient form for execution. They can reorder the columns in the WHERE clause, but they still depend on the order of the columns in your indexes.
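Here is a small sketch of that distinction, using a hypothetical orders table on PostgreSQL. The composite index is built on (customer_id, created_at), so queries that filter on the leftmost column can use it, no matter where that column appears in the WHERE clause.

```java
import java.sql.*;

public class CompositeIndexOrder {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost:5432/appdb"; // hypothetical
        try (Connection conn = DriverManager.getConnection(url, "app", "secret");
             Statement st = conn.createStatement()) {
            // Composite index: customer_id is the leftmost column, then created_at.
            st.execute("CREATE INDEX IF NOT EXISTS idx_orders_cust_date "
                     + "ON orders (customer_id, created_at)");

            // Can use the index: the predicate covers the leftmost column.
            st.executeQuery("SELECT id FROM orders WHERE customer_id = 42").close();

            // Typically cannot use the index efficiently: filtering on created_at
            // alone skips the leftmost column, so the database may fall back to a scan.
            st.executeQuery("SELECT id FROM orders WHERE created_at > '2023-01-01'").close();

            // The order inside the WHERE clause itself does not matter; the planner
            // handles it, as long as customer_id appears somewhere in the predicate.
            st.executeQuery("SELECT id FROM orders "
                          + "WHERE created_at > '2023-01-01' AND customer_id = 42").close();
        }
    }
}
```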

So, it is not as simple as it sounds. Understanding where data order affects operations and indexes gives you opportunities to improve overall performance and optimize your design. To do that, the cardinality of your data and your operators matters a great deal. Understanding it will help you put a better design in place and get more long-term value out of it.

Watch out for language differences

One common stumbling block for those just starting out with SQL is NULL. For developers using Java, JDBC (Java Database Connectivity) provides the API that connects their application to a database. JDBC does map SQL NULL to Java null, but the two are not the same thing. SQL NULL really represents an unknown value, so the comparison NULL = NULL evaluates to UNKNOWN rather than true, which is quite different from null == null in Java.

The end result is that comparisons and arithmetic operations involving NULL may not produce what you expect; an expression such as price + NULL simply yields NULL. Knowing this discrepancy helps you avoid problems as you translate between your application code and your database queries.
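The following sketch shows the safe patterns on the Java side, using a hypothetical users table: IS NULL in the SQL predicate, and wasNull() (or a null check on object types) when reading values back through JDBC.

```java
import java.sql.*;

public class NullSemantics {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost:5432/appdb"; // hypothetical
        try (Connection conn = DriverManager.getConnection(url, "app", "secret");
             Statement st = conn.createStatement()) {
            // "WHERE phone = NULL" never matches: the comparison evaluates to UNKNOWN.
            // Use IS NULL / IS NOT NULL instead.
            try (ResultSet rs = st.executeQuery(
                    "SELECT id, phone FROM users WHERE phone IS NULL")) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    String phone = rs.getString("phone"); // SQL NULL maps to Java null for objects
                    boolean missing = rs.wasNull();       // wasNull() reports SQL NULL for the last read
                    System.out.println(id + " " + (missing ? "no phone" : phone));
                }
            }
        }
    }
}
```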

There are some other common patterns to avoid when combining Java and databases. These all concern where operations get carried out. For example, you could load tables with separate queries into maps and then join them in Java memory for processing, but that is far more complicated and computationally expensive than it needs to be. Push ordering, aggregation, and mathematical work down to the database instead. In the vast majority of cases, it is easier to express these computations in SQL than to process them in Java memory.
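For example, rather than loading customers and orders into two Java maps and stitching them together in memory, a single query can ask the database to join, aggregate, and sort. The schema below is hypothetical, but the shape of the solution is the same for most reporting-style queries.

```java
import java.sql.*;

public class AggregateInDatabase {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost:5432/appdb"; // hypothetical
        try (Connection conn = DriverManager.getConnection(url, "app", "secret")) {
            // The database joins, aggregates, and sorts; Java only reads the result.
            String sql = "SELECT c.id, c.name, SUM(o.total) AS total_spent "
                       + "FROM customers c "
                       + "JOIN orders o ON o.customer_id = c.id "
                       + "GROUP BY c.id, c.name "
                       + "ORDER BY total_spent DESC";
            try (PreparedStatement ps = conn.prepareStatement(sql);
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " " + rs.getBigDecimal("total_spent"));
                }
            }
        }
    }
}
```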

Let the database do the work

Alongside making the work easier to read and check, the database will usually carry out the computation faster than your own algorithm. Just because you can process results in memory doesn't mean you should; you rarely win on speed, and spending on memory-heavy cloud instances is more expensive than letting your database produce the results.

This also applies to pagination. Pagination is how you sort and split the results of a query across multiple pages rather than returning them all at once, and it can be carried out either in the database or in Java memory. Just as with mathematical operations, pagination should be done in the database. The reason is simple: paginating in memory means bringing all the data across to the application, carrying out the operation there, and going back to the database for the next request, with every round trip over the network adding latency. Letting the database handle these operations is much more efficient.

Databases also have useful clauses that make these operations even more efficient. By taking advantage of LIMIT, OFFSET, TOP, START AT, and FETCH, you can make your pagination requests handle large data sets more efficiently. Similarly, you can avoid early row lookups to further improve performance.
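Here is a simple sketch of database-side pagination using LIMIT and OFFSET through JDBC, with a hypothetical orders table. The syntax shown is the PostgreSQL and MySQL flavor; other databases use TOP or FETCH instead. The comment at the end points at a keyset variant that avoids early row lookups on deep pages.

```java
import java.sql.*;

public class PaginationExample {
    // Returns one page of orders, letting the database do the paging.
    // Table and column names are hypothetical.
    static void printPage(Connection conn, int pageSize, int pageNumber) throws SQLException {
        String sql = "SELECT id, created_at, total FROM orders "
                   + "ORDER BY created_at DESC, id DESC "
                   + "LIMIT ? OFFSET ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, pageSize);
            ps.setInt(2, pageSize * pageNumber);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getBigDecimal("total"));
                }
            }
        }
        // For deep pages, a keyset ("seek") variant avoids scanning the skipped rows:
        // ... WHERE (created_at, id) < (?, ?) ORDER BY created_at DESC, id DESC LIMIT ?
    }
}
```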

Use connection pooling

Establishing a connection between an application and a database takes both work and time before any transaction can be carried out. If your application talks to the database regularly, that setup cost is an overhead you want to avoid. The standard approach is a connection pool, where a set of connections is kept open and reused over time rather than being opened and closed for every transaction. Connection pooling has been standardized since JDBC 3.0.

However, not every developer implements connection pooling in their applications, which leads to an overhead on application performance that is easily avoided. Connection pooling greatly improves performance compared to the same system running without it, reduces overall resource usage, cuts connection creation time, and gives you more control over resource consumption. It is important, of course, to check that your application follows the JDBC steps for closing connections and handing them back to the pool, and to be clear about which part of your application is responsible for doing so in practice.
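As one example of what this looks like, the sketch below uses the popular HikariCP library to keep a pool of connections open; any JDBC DataSource-based pool follows the same pattern. The connection details and pool size are assumptions, and the try-with-resources block returns the connection to the pool rather than physically closing it.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PooledAccess {
    // One pool for the lifetime of the application; connection details are hypothetical.
    private static final HikariDataSource POOL = createPool();

    private static HikariDataSource createPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://localhost:5432/appdb");
        config.setUsername("app");
        config.setPassword("secret");
        config.setMaximumPoolSize(10);
        return new HikariDataSource(config);
    }

    static int countUsers() throws SQLException {
        // Closing the connection here hands it back to the pool for reuse.
        try (Connection conn = POOL.getConnection();
             PreparedStatement ps = conn.prepareStatement("SELECT COUNT(*) FROM users");
             ResultSet rs = ps.executeQuery()) {
            rs.next();
            return rs.getInt(1);
        }
    }
}
```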

Take advantage of batch processing

Today, we see lots of emphasis on real-time transactions. You may think your whole application should work in real time to keep up with customer demands or business needs, but that may not be the case. Batch processing is still the most common and most efficient way to handle large numbers of writes, compared with running many individual INSERT operations.

JDBC can really help here, because it supports batch processing directly. For example, you can create a batched INSERT with a single SQL statement and multiple sets of bind values, which is far more efficient than running each operation on its own. One thing to bear in mind is to load data during off-peak times so you avoid hurting the performance of your regular transactions. If that is not possible, look at running smaller batches at regular intervals instead. This keeps your database up to date while keeping each transaction small, and it helps you avoid database locks and race conditions.
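A minimal sketch of a JDBC batch insert is shown below, with a hypothetical events table. One prepared statement carries many bind value sets, the whole batch is sent with executeBatch, and the surrounding transaction is committed once.

```java
import java.sql.*;
import java.util.List;

public class BatchInsertExample {
    // Inserts many rows with one prepared statement and a single batch execution.
    // Table and column names are hypothetical.
    static void insertEvents(Connection conn, List<String> payloads) throws SQLException {
        String sql = "INSERT INTO events (payload, created_at) VALUES (?, now())";
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);               // commit the whole batch at once
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String payload : payloads) {
                ps.setString(1, payload);
                ps.addBatch();                   // queue this set of bind values
            }
            ps.executeBatch();                   // send the whole batch in one round trip
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```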

Whether you are new to SQL or you have been using it for years, it remains a critical language skill for the future. By putting the lessons above into practice, you should be able to improve your application performance and take advantage of what SQL has to offer.

Charly Batista is PostgreSQL technical lead at Percona.

New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.