Programming

Introduction to databases

Understanding databases (plus spreadsheets tips)


This article introduces the basics of databases, as a building engineer understands them.

I think it serves as a backbone for understanding construction-related software, BIM (Building Information Modeling) in particular.

What’s a database?

Simply put, a database is an organized container of data, structured so that the information can be queried (fancy term to say read) using algorithmic logic.

We can divide databases into two categories:

  1. Relational
  2. Non-relational

Relational databases

A relational database is one that stores data in tables, were the columns are called fields and the rows are called records.

They are also known as SQL databases, because of Structured Query Language, a programming language used for CRUD (Create Read Update Delete) operations on data. SQL syntax looks like this:

SELECT field_1, field_2, field_3 FROM table

You can have a relational database structure even without SQL; for instance, you could exploit simple CSV (Coma Separated Value) files, which are nothing more than a table-like structure, and whatever programming language you are comfortable with. What’s important is the structure and the logic, not the language.

The word “relational” describes the fact that the data about records are stored in different tables, connected with relations.

Here an example speaks more than one hundred words.

Let’s say we own a grocery store, and we want to create a database to store information about clients, products and orders. To do that we need three tables.

clients table:

id first_name last_name
0 John Doe
1 Ellie Knight
2 Billy Cooke

products table:

id description price
0 apple 1.00
1 banana 1.40
2 pineapple 1.30
3 pear 1.50
4 orange 1.60

orders table:

id clients products
0 1 1
1 0 4
2 2 0

Here the orders table is relational because it relates the records of clients and products tables.

Note: every record has a unique integer identifier, known as the primary key.

Note: the orders table makes use of many-to-many relations. One-to-one and one-to-many relation types exist too.

An example of a one-to-one relation can be demonstrated by adding a contact_data table to the database.

contact_data table:

id email client_id
0 johndoe@email.com 0
1 ellieknight@email.com 2
2 billycooke@email.com 1

Note: an email can only be assigned to one client and vice-versa, hence one-to-one.

An example of a one-to-many relation can be demonstrated by adding a messages table to the database.

messages table:

id message client_id
0 Aliquam congue orci eros, at posuere arcu blandit sed. 0
1 Lorem ipsum dolor sit amet, consectetur adipiscing elit. 2
2 Fusce scelerisque facilisis massa, nec porta ex tincidunt vel. 0

Note: a message can only be connected to one client, but a client could have written more than one message, hence one-to-many.

For a real implementation, it would make more sense to add a contact_data_id and message_id field to the clients‘ table, instead of adding a client_id field to the contact_data and messages tables. The type of relation would be the same though.

Non-relational databases

A non-relational database is one that stores data in collections, which are tree-like structures that contain key-value pairs.

Popular non-relational databases come in the form of JSON (JavaScript Object Notation) or XML (eXtensible Markup Language) files.

Note that proper databases are usually stored in binary form, so that they are faster to query, not in text form, which is good to be read by humans. This doesn’t change the nature of the structure nor the underlying logic.

Let’s use the same example as before, this time in a non-relational format (JSON formatting is used here).

clients collection:

{

"clients": [

{"id": 0, "first_name": "John", "last_name": "Doe"},

{"id": 1, "first_name": "Ellie", "last_name": "Knight"},

{"id": 2, "first_name": "Billy"}

]

}

Note: the last item of the clients‘ collection doesn’t have a last_name.

products collection:

{

"products": [

{"id": 0, "description": "apple", "price": 1.00},

{"id": 1, "description": "banana", "price": 1.40},

{"id": 2, "description": "pineapple", "price": 1.30},

{"id": 3, "description": "pear", "price": 1.50},

{"id": 4, "description": "orange", "price": 1.60},

]

}

orders collection:

{

"orders": [

{"id": 0, "client_name": "Ellie", "product_description": "banana", "price": 1.40},

{"id": 1, "client_name": "John", "product_description": "orange", "price": 1.60},

{"id": 2, "client_name": "Billy", "product_description": "apple", "price": 1.00}

]

}

Note: all the information I want to store about the orders, is in the orders collection. This is what we mean when we say that there are no relations.

Comparison table

Relational Non-Relational
Data uses schemas Schema-less
Relations! No (or very few) Relations
Data is distributed across multiple tables Data is typically merged or nested in a few collections
Horizontal scaling is difficult or impossible; Vertical scaling is possible Both horizontal and vertical scaling is possible
Limitations for lots of (thousands) read & write queries per second Great performance for mass (simple) read & write requests

What is a schema? A schema is a structure of keys (or fields, in case of a relational database) that must be followed by every item (or record). Relational databases force you to stick to the schema, non-relational don’t.

What scaling means? Scaling means to change the way the database is stored on disk (or solid drive, or tape, or whatever support is used to memorize data when you’re reading this article).

Horizontal scaling means that different parts of the database are stored in different locations, so to increase the available space we can add more servers (fancy term to describe a computer connected to the web, usually entirely dedicated to store data and “serve” it when the remote client asks for it).

Vertical scaling means that the entire database is stored in a single location, so the only way to increase the available space is to add drives to that specific server (or local machine). Of course, you can still have a backup of the database in different locations for safety purposes.

The limitation in read & write operations in relational databases is connected to the fact that to edit a record we often need to read from other tables, whereas on non-relational databases most relevant data is connected in the same collection.

Bonus tip (using spreadsheets)

Often times we find ourselves using spreadsheets as databases.

Keeping in mind that a spreadsheet is not the best way to implement a database, there are good practices that we can use, now that we know a little more about what a database is under the hood.

Note: since spreadsheets are tables, they are suited for a relational structure.

So here’s a brief list of those practices I find myself using:

  • one table per sheet;
  • separate report tables form data collection tables (the latter are the actual database);
  • every table has a header in the first row;
  • avoid row headers in data collection tables;
  • the first column is dedicated to unique indexes (use the function =ROW()-2 to start from 0 in the second row and have auto-increment);
  • use indexes to create relations, not the actual data;
  • retrieve data from indexes using the =VLOOKUP() function.

Conclusions

I don’t really have anything to add here, so here’s a percentage of the number of times I wrote the word “database” in this article for those who are tired of it. I counted 29 over 1134 total words (roughly 2,6%!).

Useful references

  1. Academind, SQL vs NoSQL or MySQL vs MongoDB video | https://www.youtube.com/watch?v=ZS_kXvOeQ5Y
  2. SQL tutorial | https://www.w3schools.com/sql/
  3. VLOOKUP function | https://support.google.com/docs/answer/3093318?hl=en

Credits

  1. Cover image background photo by Twitter: @jankolario on Unsplash

04 Apr 2020
Mattia Bressanelli

Comments

Public comments will be displayd here.

Leave a Reply

Your email address will not be published. Required fields are marked *