Relational Database Modelling and Functional Dependency¶

Today's Lecture¶

Problems in Relational Models
Duplicate vs Redundant Data
Normal Form Databases
Determinants and Identifiers
Determinants and Redundancy
Well-normalised Tables
Fully-normalised Tables
Multivalued Determinacy
Advantages of Full Normalisation

Problems in Relational Modelling¶

Objectives

Illustrate techniques to describe information in terms of table definitions and occurrences
Guard against anomalies when we insert, delete or lose consistency of data in the tables
How do we know our tables are correct?

Key Points to Remember About Relational Models¶

Ordering of rows is not significant
Ordering of columns is not significant
Each row/column intersection contains a single attribute value. Multiple values are not allowed
Each row in a table must be distinct.

→ A row can always be uniquely identified by quoting an appropriate combination of attribute values.

A table conforming to these restrictions is called a normalised table or first normal form.

Example Problem¶

Suppose a company called TechScience wishes to create a database for managing its engineers and the engineering projects on which they work.

Suggest a relational structure (i.e. table(s)) for storing:

Engineer employee IDs
Engineer Employee Names
Project IDs
Project Names

Policy: No engineer can work on more than one project at a time. But a project can have multiple engineers working on it.

Which Is the Correct Table Structure?¶

Solution A:

AllInformation (EngineerId, EngineerName, ProjectID, ProjectName)

Solution B:

Engineer (EngineerId, EngineerName)
Project (ProjectID, ProjectName, EngineerId)

Solution C:

Engineer (EngineerId, EngineerName, ProjectID)
Project (ProjectID, ProjectName)

Solution D:

Engineer (EngineerId, EngineerName)
Project (ProjectID, ProjectName)
EngineerAllocationsToProjects (EngineerId, ProjectID)

Solution A: Single Table¶

AllInformation (EngineerId, EngineerName, ProjectID, ProjectName)

EngineerId	EngineerName	ProjectID	ProjectName
007	James Bond	888	Ventilator
008	Peter Parker	888	Ventilator
009	Bruce Wayne	999	Heart Monitor

Is there any unnecessary duplication?

→ Yes. ProjectName "Ventilator" is duplicated. If we delete one copy, we can still deduce the value from the other — so it is unnecessarily duplicated (i.e. redundant).

Unnecessary duplication creates problems when updating data — you need to ensure all copies are updated when their value changes.

Solution B: Two Tables¶

Project (ProjectID, ProjectName, EngineerId)
Engineer (EngineerId, EngineerName)

ProjectID	ProjectName	EngineerId
888	Ventilator	007
888	Ventilator	008
999	Heart Monitor	009

EngineerId	EngineerName
007	James Bond
008	Sarah-Jane Parker
009	Bruce Wayne

Is there unnecessary duplication? → Yes. "Ventilator" is still duplicated.

Solution C: Two Tables¶

Engineer (EngineerId, EngineerName, ProjectID)
Project (ProjectID, ProjectName)

EngineerId	EngineerName	ProjectID
007	James Bond	888
008	Sarah-Jane Parker	888
009	Bruce Wayne	999

ProjectID	ProjectName
888	Ventilator
999	Heart Monitor

Is there unnecessary duplication? → No. You cannot delete an attribute value in any of the tables without losing some information from the database. You can't delete a value and then work out what that value was.

Solution D: Three Tables¶

Engineer (EngineerId, EngineerName)
Project (ProjectID, ProjectName)
EngineerAllocationsToProjects (EngineerId, ProjectID)

EngineerId	EngineerName
007	James Bond
008	Sarah-Jane Parker
009	Bruce Wayne

ProjectID	ProjectName
888	Ventilator
999	Heart Monitor

EngineerId	ProjectID
007	888
008	888
009	999

Is there unnecessary duplication? → No.

You can't delete an attribute value in any of the tables without losing some information from the database.

Duplicate vs Redundant Data¶

Duplicated data — occurs where an attribute (column) has two or more identical values.

Redundant data — occurs if you can delete a value without information being lost.

→ Redundancy is unnecessary duplication.

Duplication vs. Redundancy: Example¶

Part#	Part Desc.
P2	Nut
P2	nut
P1	Bolt
P1	bolt
P3	Washer
P3	washer
P4	Nut
P4	nut

Parts and their textual descriptions are contained in the table.
We can see some information is duplicated (e.g. "nut" appears multiple times).
Is there any redundant information here?

If we delete a "nut" value, we still have other rows with "nut" — we can still work out what the value was. But we have not lost information about which part has that description.

→ Nut was duplicate but NOT redundant! (each row is distinguishable by Part#)

Duplication vs. Redundancy: With Supplier¶

Now introduce S# — the supplier ID.

S#	Part#	Part Desc.
S2	P1	Bolt
S2	P1	bolt
S7	P6	Bolt
S2	P4	Nut
S5	P1	Bolt

Table represents: Part ID and its description AND who supplies each part.
Two suppliers can supply the same part.
We can see duplicate information (e.g. "Bolt" × 3).
Is any of the information here redundant?

If we delete "bolt" — we can still deduce it from other rows. Some duplicated data was redundant.

Eliminating Redundancy¶

We cannot just delete values from the table! Preferable to split the table into 2 tables:

Parts table:

Part#	Part Desc.
P1	Bolt
P4	Nut
P6	Bolt

Supplies table:

S#	Part#
S2	P1
S7	P6
S2	P4
S5	P1

Eliminated redundancy by table splitting.
P1 description only appears once.
Relationship is made by including Part# in both tables.

So far we've assumed that table structures which permit redundancy can be recognised by inspection of table occurrence. This is not entirely accurate — attribute values are subject to insertion, change, and deletion.

Eliminating Redundancy (cont.)¶

Consider this table SP:

S#	P#	Desc
S2	P1	Bolt
S7	P1	Bolt
S2	P4	Nut

Inspection of table SP does not reveal any redundancy. It could even suggest that "no two suppliers may supply the same part#".

→ Need to take care not to conclude too much from inspecting data — think more deeply about possible data.

Repeating Groups¶

We stated earlier: "Each attribute must have at most one value in each row."

Violating this — having multiple values in a single cell — creates repeating groups:

S#	SName	P#
S5	Wells	P1
S2	Heath	P1, P4, P6
S9	Edwards	P8, P2, P6

Problems:

Table is an asymmetric representation of symmetrical data
Rows can be sorted into S# but not into P#
Rows are different length due to variation in number of P#'s
If rows were fixed length, they would need to be padded with null values

→ Not possible in a relational database.

Elimination of Repeating Groups (Method 1)¶

Easiest way to eliminate repeating groups is to write out the table occurrence using a vertical layout and fill in the blanks by duplicating the non-repeating data:

S#	SName	P#
S5	Wells	P1
S2	Heath	P1
S2	Heath	P4
S2	Heath	P6
S9	Edwards	P8
S9	Edwards	P2
S9	Edwards	P6

But this can lead to 'redundancy' of information! (Does the right-hand table contain redundant information?)

Elimination of Repeating Groups (Method 2)¶

Split the table into two tables so that the repeating group appears in one table and the rest of the attributes in another. Need to provide correspondence by including a key attribute with the repeating group table.

Suppliers table:

S#	SName
S5	Wells
S2	Heath
S9	Edwards

Supplies table:

S#	P#
S5	P1
S2	P1
S2	P4
S2	P6
S9	P8
S9	P2
S9	P6

Unfortunate Conclusion¶

It is not possible to tell by looking at relational tables in a database whether:

There is the potential for redundancy

A snapshot of a table is an inadequate guide to the presence or absence of redundant data.

We need to know the underlying rules — the DBA must discover the rules which apply to the conceptual model.

Bad Database Design¶

customerID	date_of_birth	surname
1001	25-03-1980	Jones
1001	26-03-1980	Jones
1002	13-01-1985	Smith

Data integrity fail! — Customer 1001 has two different dates of birth.

When something in the database can be logically impossible → bad data → bad database design → not normalised.

Normalized Tables¶

A normalized database is structured so that it cannot contain redundant information.

→ You can't give customer 2 dates of birth even if you wanted to.

Benefits of normalized tables:

Easier to understand
Easier to modify
Protected from anomalies (bad insert, delete, or updates)

Codd's Normal Forms¶

Codd identified rules which govern the way we create tables to avoid anomalies when inserting or deleting values in these tables. These rules are called Normal Forms.

There are three and a half important levels (and two further levels which are occasionally used).

1^st Normal Form (1NF)¶

A relation is in first normal form if the domain of each attribute contains only atomic values and the value of each attribute contains only a single value from that domain.

How to Violate 1NF¶

1. Using the order of rows to represent information

Names
Scary
Sporty
Posh
Geri
Baby

There is no right order — using row order to convey information violates 1NF.

2. Mixing data types in a single column

Names	height_in_cm
Scary	165
Sporty	167
Posh	Between 160 and 165

Now a column has a mixture of data types — violates 1NF.

3. A table without a valid primary key

Having duplicate primary key values — violates 1NF.

4. Repeating groups in a row

cust_id	product_1	quantity_1	product_2	quantity_2	product_3	quantity_3
1	camp_soup	1	sars_vinegar	2	saxa_salt	1

Having repeated groups in the same row — violates 1NF.

2^nd Normal Form (2NF)¶

A relation is in 2^nd normal form if, in addition to satisfying the criteria for 1^st normal form, every non-key column is fully functionally dependent on the entire primary key.

How to Violate 2NF¶

Having a non-key attribute that is dependent on part of a primary key.

Example — Shopping Basket:

cust_id	product	quantity	cust_rating
1	camp_soup	1	gold
1	sars_vinegar	2	gold
1	saxa_salt	1	gold
2	tesco_popcorn	2	silver
2	tesco_juice	2	silver

Primary key is {cust_id, product}. But cust_rating depends only on cust_id — part of the primary key. Violates 2NF.

Remedy: Decompose into:

cust_id	product	quantity
1	camp_soup	1
1	sars_vinegar	2

cust_id	cust_rating
1	gold
2	silver

3^rd Normal Form (3NF)¶

A relation is in 3^rd normal form if, in addition to satisfying the criteria for 2^nd normal form, no non-key attributes are transitively dependent upon the primary key.

Test¶

Relation should not have a non-key attribute functionally determined by another non-key attribute (or by a set of non-key attributes). That is, there should be no transitive dependency of a non-key attribute on the primary key.

Remedy¶

Decompose and set up a relation that includes the non-key attribute(s) that functionally determine other non-key attribute(s).

Boyce-Codd Normal Form (BCNF / 3½ NF)¶

All attributes in a relation should be dependent on the key, the whole key and nothing but the key.

Summary: Test & Remedy¶

Normal Form	Test	Remedy (Normalize)
First (1NF)	Relation should have no multivalued attributes or nested relations.	Form new relations for each multivalued attribute or nested relation.
Second (2NF)	For relations where primary key contains multiple attributes, no non-key attribute should be functionally dependent on a part of the primary key.	Decompose and set up a new relation for each partial key with its dependent attribute(s). Keep a relation with the original primary key and any attributes fully functionally dependent on it.
Third (3NF)	Relation should not have a non-key attribute functionally determined by another non-key attribute (no transitive dependency).	Decompose and set up a relation that includes the non-key attribute(s) that functionally determine other non-key attribute(s).

Determinants¶

If there are rules such that duplicate values of attribute A are always associated with the same value of attribute B (within any given occurrence of the table), then attribute A is a determinant of attribute B.

Notation: A → B

Example: If each possible P# value has precisely one associated part description value (i.e. P4 has just one description "nut"), then we can say that P# is a determinant of Part Description.

P# → P_Desc
P# → Qty_in_Stock

Determinants: Stock Table¶

P#	P_Desc	Qty
P1	Bolt	5000
P2	Nut	8300
P3	Washer	9750
P4	nut	2326

Question: Is P_Desc a determinant of P#? No. Question: Is P_Desc a determinant of Qty? No.

Superfluous Attribute¶

If P# determines Qty, then the composite attribute {P#, P_Desc} also determines Qty, but P_Desc is superfluous.

→ We assume determinants do not contain any superfluous attributes.

Determinacy Diagrams¶

We need a notation to express where one attribute determines another — we use determinacy diagrams.

A → B

Example — Stock table determinacy diagram:

P# → P_Desc
P# → Qty

Since P# determines both P_Desc and Qty:

P# → P_Desc
P# → Qty

Determinacy Diagrams (2): Enterprise Rules¶

Example enterprise rules:

Supplier identified by single S# & a part is identified by single P#
Each supplier has only one SName but different suppliers may have same names
A given supplier supplies a given part in just one pack size
A supplier may supply many different parts in many different pack sizes

Draw the determinacy diagram.

Determinacy Diagrams (3): Answers¶

S# is a determinant of SName (each supplier has one name)
{S#, P#} is a determinant of PackSize (a given supplier supplies a given part in one pack size)

S# → SName
{S#, P#} → PackSize

Transitive Determinants¶

If A is determinant of B and B is determinant of C, then A is also determinant of C.

A → B → C
Therefore: A → C

Identifiers¶

"No two rows in a table can have identical values throughout" — therefore an individual row can always be identified by quoting the values of all its attributes. However, some values may not be needed.

Example: Employee (Employee#, Employee_name, Salary)

Rules:

No two rows should have the same value for Employee#
→ Employee# is a row identifier of the table

Note: Where a composed attribute forms the identifier, no component part (of the identifier) can be null (entity constraint).

Identifiers: Candidate Keys¶

In a diagram with attributes {D, E, F, G, H, I, K, L, M, N, O, P, Q}:

Identifier D — uniquely identifies all rows
Identifier G — uniquely identifies all rows
Identifier {K, L} — composite identifier
Identifiers P & Q — both are candidate identifiers

→ 2 candidate identifiers

Determinacy & Redundancy¶

Given a determinacy diagram, we can detect and eliminate table structures which could contain redundant data.

Example: A company sells products to customers through its sales team.

Customer# → Salesman# → Salesman_name

Each customer# is associated with one salesman#, but a salesman# may be associated with several different customer#
Therefore salesman# could have duplicate values
But salesman# is a determinant of salesman name
Therefore each occurrence of a duplicate salesman# value will be associated with the same salesman name → table can contain redundant values of salesman

But: Customer# values cannot be duplicated (because customer# is the identifier for our table), so we cannot allow redundant values of salesman#.

Potential redundancy arises because salesman# is a determinant but we did not choose it as an identifier.

Boyce-Codd Rule for Detecting Redundancy¶

Customer#  Salesman#  Salesman_name
1          21         John
2          25         Jack
3          29         Jim
4          29         Jim

Customer# is the identifier (determinant & identifier)
Salesman# is a determinant only (non-identifying determinant)
Potential redundancy: "Jim" appears twice

Parts and Suppliers Example¶

Qty	S#	P_Desc	P#	Unit Price

Potential redundancy in values of P_Desc and Unit_Price
P# is a determinant but not a candidate identifier!
This gives rise to the Boyce-Codd Rule for detecting redundancy

Transformation into Well-Normalised Table (3NF / BCNF)¶

Boyce/Codd rule for determining redundancy:

"Every determinant must be a candidate identifier."

A table which obeys this rule is said to be in Boyce/Codd Normal Form (BCNF).

To put it another way:

"All attributes in a relation should be dependent on the key, the whole key and nothing but the key."

Remedy: A determinant which is not a candidate identifier is called a non-identifying determinant. To transform a badly normalised table into well-normalised tables:

Create new tables such that each non-identifying determinant in the old table becomes a candidate identifier in a new table.

Example of BCNF Normalisation¶

Original table (not in BCNF):

Customer# → Salesman# → Salesman_name

Customer#	Salesman#	Salesman_name
3	A	John
4	A	John
5	B	Jack
6	C	Jim

Danger if we tried to update Salesman_name — "John" appears twice; updating one without the other creates inconsistency.

After normalisation:

Customer-Salesman table:

Customer#	Salesman#
3	A
4	A
5	B
6	C

Salesman table:

Salesman#	Salesman_name
A	John
B	Jack
C	Jim

Fully Normalised Tables¶

Fully normalised tables are structured in such a way that they cannot contain redundant data.

Generally, a well-normalised table (i.e. one in which each determinant is a candidate identifier) is also fully normalised, but not always — further normalisation may be desirable.

Fully Normalised Tables: Hidden Transitive Dependency¶

Badly normalised:

| order# | customer# | customer_name |

This has a hidden transitive dependency: order# → customer# → customer_name

After normalisation:

Order-Customer table:

order#	customer#
1	C1
2	C2
3	C1

Customer_name table:

customer#	customer_name
C1	Smith
C2	Jones

If we delete Smith from row 3 of customer_name table:

We can still use order#(1) to find corresponding customer# in order-customer table
Search order-customer table for another order# placed by that customer
Use order# to search customer_name table for corresponding customer_name

Multivalued Determinacy¶

Example: Library Database Enterprise Rules

Each book has a unique book#
Each author has a unique author#
Every author has a name and every book has a title
Each subject classification has a unique subject_name
Book# does not distinguish an individual copy of a book, only an individual work
A book may be written by several authors and be classified under several subject_names
An author may write several books
A subject_name may be applied to several books

Determinacy diagram:

author# → author_name
book# → book_title
subject_name → subject_classification

Well-Normalised Tables (Library)¶

Author (author#, author_name)
Book (book#, book_title)
Author_Book_Subject (author#, book#, subject_name)

Book Example: Multivalued Dependency¶

Book# B15 is jointly authored by A2 and A5 and is classified under subject names biology and physics.

If every author of a given book is always associated with all the subject-names under which the book is classified, then the attribute subject-name can contain certain redundant values.

Author#	Book#	Subject-name
A2	B15	biology
A2	B15	physics
A5	B15	biology
A5	B15	physics
A2	B18	physics

If subject names biology & physics were deleted from rows 1 and 2, it would be possible to deduce those values from rows 3 and 4.

Therefore: Author-Book-Subject is well-normalised but NOT fully normalised.

The table is not fully normalised because the same set of subject-names is associated with each author of the same book. Book# is said to multi-determine author# & subject_name.

Full Normalisation (Library)¶

Full normalisation can be achieved by splitting the table into two parts:

Author-Book table:

Author#	Book#
A2	B15
A5	B15
A2	B18

Book-Subject table:

Book#	Subject-name
B15	Biology
B15	Physics
B18	Physics

Advantages of Full Normalisation¶

So far, emphasis has been placed on eliminating redundancy. Further benefits relate to deletion and insertion operations.

Deletion Side Effect¶

In a non-fully normalised table:

C#	CName	S#	SName
C1	Brown	S4	Jones
C2	Carter	S7	Samson
C3	Cross	S4	Jones
C4	Barns	S8	Baker

Delete C2 → delete whole tuple → lose salesman information (Samson)
Deleting C# on its own is not allowed as it is an identifier and cannot be null

Insertion Side Effect¶

Add Salesman S3 whose name is Hall
You cannot do this until that salesman is associated with a customer, otherwise identifier C# will be null

Should be modelled as:

Customer table:

C#	CName
C1	Brown
C2	Carter
C3	Cross
C4	Barns