Database design is a complex topic, and this guide assumes you already have some familiarity with databases, or have an existing database to work with.
Below we discuss database best practices. Some of them may seem blatantly obvious to you. If so, that’s a great! For others, you may be hearing about these for the first time. Either way, you’ll get the most out of your database by following these best practices. We’ll discuss what to do, and why you should do it.
We won’t discuss what table structure you should use or how to normalize your database. Those are topics best left to another guide. These best practices are practical, easy to use tips to get the most out of your database.
But, first where did these best practices come from?
Before we get to the practical suggestions, let’s take a look at the guiding principles behind these time-tested best practices. The databases you manage need to have data integrity, be easy to use and learn for new users, and be extensible for the future. These guiding principles define a high quality database. How?
Data integrity is critical to a database. What good is the data you have if you can’t rely on it? Duplicate data, orphaned data, and conflicting data are all problems that need to be resolved. Better yet, your database design should prevent them from ever becoming an issue in the first place.
Ease of Use/Learning Curve
There will come a time when you’ll try to explain your database to someone else. There may even come a time (six months from now), when you’ll need to try to explain your database to yourself. You will struggle. Best practices provide mental shortcuts; chances are, the new person will have seen other databases that follow these best practices as well. By following best practices you’ll also make it clearer for others to understand your design and thinking. A lot of these best practices focus on making what you think explicit and hard-coded, rather than implicit and in your head.
If your project is successful (and even if it isn’t), your data storage needs will grow. A new feature will be added and you’ll need to extend your database. It might be a new table, it might be new columns, but you’ll need to make some sort of change. A goal of these best practices is to make extending your database as easy as possible. You will be able to make changes without fear of unnecessarily breaking backwards compatibility.
Standardization (The Golden Rule)
If there is one thing you take away from this guide, it should be that standardization is the most important part of database design. In fact, we call it the golden rule of database design. Once you settle on a naming or design scheme, you must stick with it. It doesn’t matter whether you choose to use camelCase or underscore_case. Once you’ve picked one, stick with it. If you decide to change it later on, you must change the scheme everywhere. Stick to the standard, write it down and get the rest of your team to stick to it too.
Why is this so important?
Because standardization makes your life easier. When you’re querying your database, you don’t need to remember whether the attribute you’re searching for is personName or person_name, you’ll know immediately. Standardization will save you literally hours of debugging stupid issues in your queries, or consulting database diagrams. There’s no valid reason not to follow a standard scheme, and it is always worth the upfront cost to put one in place.
Naming is one of the hardest problems in computer science. Luckily, it’s a lot easier to name a column or table than it is to name a random variable in a script. This is because relational databases are designed around storing entities. Each row of data in your table should describe a distinct entity, and the table itself is a collection of entities. So, when naming a table, consider what the name of your entity is. Since your table is a collection of entities, think about what the plural of your entity name is. If you’re creating a database for a town, you might be storing information about the buildings in your town, which you would store in the “Buildings” table. For some entities, this isn’t as easy. If you want to store information about a specific person, you might store it in a “People” table or a “Persons” table. Both options work, since it’s obvious in both cases what is stored in the table, even if you wouldn’t use “Persons” in regular speech.
You should try to keep your table names as simple as possible. There’s no need for redundant information. You should never include words that don’t add any unique information about the table. “PersonsTable” is a bad name because “Table” adds no additional information. The “Persons” table name can only be successfully used in queries where table names are used.
When naming columns it can be tempting to use short, simple names. For example, in the “Persons” table mentioned above, you may be tempted to include a “Name” column. After all, the name column is part of the “Persons” table, so it obviously represents the name of the person. But what happens when that column is used in a complex query with aliases, does “p.Name” obviously mean the column represents a person’s name? Or maybe it’s the name of something else. You have no way of knowing without tracing the query back and figuring out where the column originates.
You should design your column names such that it is immediately clear at a glance what the column represents. Yes, this will mean more typing, but that’s what autocomplete is for! Each column name should include both the entity name and the attribute name. For example, the “Name” column mentioned above should be called “PersonName”. This removes ambiguity over the exact information that is being displayed. When query results are displayed, you’ll never mistake the “PersonName” column for the “EmployerName” column, for example.
One last point on column names. It’s very beneficial to include a descriptive word for the type of data stored in a column. Some common column names obviously describe the type. For example, it would be odd to see “PersonName” storing lots of rows with only numerical values. For other columns it may not be as obvious. For example, if you have a database storing data for a network for retail store locations, you may want to track the amount of sales made by each store, but a column named “StoreSales” is ambiguous. Is that column tracking the number of sales? Or the dollar value for the sales? Either “StoreSalesCount” or “StoreSalesAmount” more clearly identify what is stored in the column.
You may get slightly longer names by following these best practices, but they’re best practices for a reason. They really do make a difference in making your database easy to use with a shallow learning curve for new users.
We can admit that “no acronyms” may be a bit extreme, but acronyms seriously suck. You should stay away from any acronyms or abbreviations that aren’t used by the general public. Everyone probably knows what “RSVPDate” stands for, so using that as a column name is probably fine.
Each time you’re thinking about using an acronym or abbreviation in your column, ask yourself if someone from outside your industry would know what it means. Even if you think they would, do a quick search for “YOURACRONYM acronym”. In most cases you’ll get over a hundred examples of acronyms that match your letters. Context matters with acronyms and abbreviations, and your context won’t always be the same. When you come back to your database in six months, will you still think the acronym means the same thing? What about the new team member. Do you think they have the same context as you? They won’t, so avoid acronyms whenever possible.
At this point, you might feel like all these best practices for naming will having you typing forever. Long names can certainly go too far, but if you’re keeping things simple and following these standards, your names will be incredibly information dense. That means reading your database diagram will be super explanatory, not a slog. And when it comes time to query your database, you’re already using a SQL IDE with autocorrect anyway. Right?
ID or TableNameID?
In each table, you’ll normally have at least one column that specifies the primary key. This column, by convention, is typically an ID column with a unique number or value to reference the row by. Now, should that column be named simply “ID” or “TableNameID”? This is a topic that can and has been debated endlessly. If you’re following the column naming best practices from above, the answer is obvious. Since we’re following our golden rule of standardization, you will of course use the “TableNameID” convention.
By using “TableNameID”, you remove all ambiguity when using the ID column. It prevents accidentally joining on the wrong columns in a query, since you’ll always join on columns of the same name. It’s very easy to accidentally join “m.ID” to “a.ModelID” when what you really wanted to do was join “m.ID” to “a.ManufacturerID”. It’s much harder to accidentally join “m.ManufacturerID” to “a.ModelID”.
Use Foreign Keys
There are many reasons to include references from one table to another. This is used most commonly when an entity can be associated with multiple other entities. This relationship always needs to be enforced by the database using a foreign key. This is essential to maintaining data integrity. If you don’t use them, it is only a matter of time until you experience a data integrity issue. A row in one table will reference a row in another that doesn’t exist, or you’ll accidentally add a reference to a row that never existed. Something will go wrong, so don’t put yourself in that position. Add foreign keys any time a column is referencing a value in another column.
What’s our golden rule? Standardization. You should have a standard method of naming your columns that reference values from another table. The best method of doing this is to use identical column names in both tables. This is an implicit signal that these columns can be joined. That signal can be confirmed later by checking if the foreign key exists, but you should design your database in such a way that joining columns with the same name is always a safe assumption. If you’re following the best practices on column naming we mentioned above, it won’t only be obvious which column is being referenced, but also the table it is meant to join. Most often you’ll reference the ID column of a table. For example, an “Employee” table might include a “DepartmentID” column to reference the department the employee belongs to, as described in the “Department” table.
Sometimes you’ll have a column that should be limited to specific values. For example, your “Employee” might have an “EmployeeStatus” column that describes whether they are new hire, regular employee, have resigned, etc. You should decide in advance what the limited set of values for this column will be, and use an enumerated type if it’s available to you.
If your database management system doesn’t include an enumerated type (we’re looking at you, SQL Server), then you should create an enumerated type table. In our example, that’d be an “EmployeeStatusEnum” table, with a single column with the values you’ll allow. You can then use a foreign key to ensure that your employee statuses will only be one of the selected types. This prevents spelling mistakes from messing with your reports, and makes sure that any time a new value needs to be added to the enumerated type, it is done in a consistent way.
An index provides your database with a quick way to find and access any row in a table. Without an index, your database looks at a table as a generic block of data. Every time you query that table, your database will individually look at each record to see if it matches your query. With an index, your database has already figured out which rows in your table match your query, and can simply grab the rows you need without searching the whole table. This can provide a huge performance gain.
Commonly Joined Columns
Assuming you’re using primary and foreign keys throughout your database, your database will already have indexes. Database management systems automatically keep indexes in order to enforce those keys. If, for some reason, you commonly join columns that do not belong to a foreign or primary key, you should add indexes to those columns as well.
There is a trade off to using indexes. Although they can make it significantly faster to read from your table, they also make it slower to write to your table. Each time you add or update a row in your table, the database needs to also update the index to keep them in sync. In most cases, this trade off is worth it to quickly read data. It’s only when you get very large tables that are being updated extremely frequently that you may want to remove, or limit the indexes you use. In the vast majority of cases, this won’t be a concern for you.
Null is Your Friend
What value represents nothing? For some people, it’s 0 character strings, “”. For others, it’s a space ” “. There is only one correct answer though, and that is a NULL value. You should (as a standard!) always use null to represent unknown values in your table. Using empty strings, or spaces, or “Unknown”, or anything else will only result in painful issues later on. Anybody using your database for the first time will know that null means an empty value. Nulls are also treated specially by your database, saving you storage space and making sure that you don’t accidentally join two rows that shouldn’t be joined. A null value will never join a null value, but “Unknown” will match “Unknown” in another table.
Schemas are logical separations of tables. For most database management systems, they don’t have any impact on query performance. If possible, you should use them to group common tables together. This will make your database easier to use and understand by providing additional context to your tables and quickly identifying which types of tables relate to one another.
Schemas also prepare your database for future growth. You may need to split your database in the future to different physical systems in order to handle storage size requirements or performance loads. Schemas provide a logical starting point for where to split your application.
Note that MySQL treats schemas differently from other database management systems. In MySQL, schemas and databases are the same concept and treated one-to-one, so you won’t be able to use them as logical separations unless there is no interaction between schemas.
To review, in order to follow best practices you should:
- Use clear, plural entity names for your tables.
- Use clear column names that describe the entity, the attribute and the type of data stored in the column.
- Avoid acronyms and abbreviations! Stick to clear, descriptive language.
- Use foreign keys to enforce data integrity.
- Use indexes to speed up querying your database.
- Always use NULL to represent unknown values.
- Use schemas to make your database easier to understand and scale.
These best practices are designed to get you the most out of your database and prevent you from making stupid mistakes. Above all else, focus on standardization throughout your database. Once you have picked a way of doing things, stick with it throughout your database!
Are you looking for an easier way to work with your database? Try out modelDBA, which let’s you view and edit your database with a simple interactive database diagram and interface. Try out the latest version for free here.