INDEX

The list above is a set of concrete values. They are not meaningful unless we know what they are values of. We can guess at some of the values: Mary Smith is likely the name of a person; 19 Oak is probably a street address; 75 could be just about anything – an age, degrees, a golf score, etc.

WHAT IS METADATA?

In order for these values to be meaningful, there must be a meaning attached to them. In effect we need data about the data, which is called metadata. See the table below.

*METADATA*	*DATA*
name of person	Mary Smith
age	75
number of grandchildren	2
telephone number	9164537834
college degree	BS
street address	19 Oak
name of cat	Jack
color of eyes	blue

WHAT IS A RECORD?

The metadata gives meaning to the data, but questions still remain – whose age, who has 2 grandchildren, etc. The data need to be applied to some entity, in this case a particular person – Mary Smith. The data above refer to a single entity; a set of data values treated as a unit is called a record.

WHAT IS AN ENTITY?

Records refer to entities. An entity may be a person, a product, an event, a corporation, a disease, etc. In general, an entity is anything that can be described by a set of data values. Usually, a record will contain a data value(s) that uniquely identifies the entity – it is commonly called the primary key..

WHAT IS A FILE?

A group of similar records treated as a unit is called a file. In this example it would be a file of persons or person file. Person file may be too general; for example, employee file or customer file is more specific. Generally, the file name indicates the type of entity that is being referenced – employee file, payroll file, student file, inventory file, sales file, accounts payable file, diseases file, etc.

WHAT IS A DATABASE?

A group of files treated as a unit is called a database. A database acts as an integrated set of files which is well suited to perform the following:

Answer queries requiring the consolidation of data from more than one file;
Reduce the redundancy of data;
Respond quickly to ad hoc (unplanned) queries;
Easily accommodate the dynamic changes that occur to data and their structure;
Program/Data Independence – minimize the impact a change in data structure has on programs using that data.

WHAT IS INFORMATION?

Information is the response given to a question (query); it usually involves the processing of data.

For example:

QUERY: List the names of all employees with a BS degree.

RESPONSE: Each record in the file is examined and those records with a degree of BS are selected and the names displayed.

NOTE: Only data that is used becomes information, data that is never used is wasted space.

WHAT IS A FIELD?

When records are stored they require some type of space – paper, computer hard drive, etc. A piece of space is allocated for each unit of data referenced by the metadata of a given record – this piece of space is usually referred to as a field. It is not uncommon to refer to the metadata of a record as the record’s fields. For example, a record may have a name field, an address field, etc. Defining a file’s structure (design) usually means defining the fields of the file’s record - each record in a given file has the samee format. If a file were stored on paper, the fields may appear as column headings (see below). A column of space may be designated to contain name values and another column may be designated for address values.

*NAME*	*ADDRESS*
Mary Smith	18 Oak Street
John Doe	123 Ames Avenue
Emma Quinn	38 Pearl Street

In C++, a record’s fields are defined as follows:

struct recordType

{

string name;

string address;

};

NOTE: When defining fields (space) for computer storage, it is also necessary to indicate the data type of each field. The field names and the data type are metadata. The field is the smallest unit of meaning that is given a name.

WHAT IS A RELATIONAL DATABASE?

There are a variety of database types – one type is the widely used relational database. A relational database represents files in the form of tables, thus the files of a relational database are sometimes called tables although relation is the technical term for a file in a relational database. A relation may be characterized as a specific type of table – all relations are tables, but not all tables are relations. A table becomes a relation when it meets the following criteria:

No repeating fields (columns);
No duplicate records (rows).

WHAT IS A DATABASE MANAGEMENT SYSTEM (DBMS)?

A DBMS is software that manages and controls metadata/data. It is the successor to the traditional file system.

How does a DBMS differ from the Traditional File System?

Below are two examples of reading and displaying a file which contrast the traditional file system with a DBMS. Both methods will process the following file.

*NAME*	*JOB TITLE*
Smith	programmer
Doe	electrician
Quinn	doctor

Example 1: Tradtional file system using C++ to read and display a file.

#include<iostream>

#include<string>

Oval Callout: NOTE: The structure (metadata) of the data is defined in the program. #include <fstream>

using namespace std;

struct recType

{

string name;

string jobTitle;

};

Oval Callout: The while loop.reads each record and displays the name and job title. void main()

{

recType myRec;

ifstream inFile("nameFile.txt");

while (inFile>>myRec.name>>myRec.jobTitle)

{

cout<<myRec.name<<" "<<myRec.jobTitle<<endl;

}

Example 2: A DBMS (Microsoft’s Access) using SQL (a DBMS language) to read and display a file.

Oval Callout: NOTE: The structure of the data is NOT defined in the SQL program. select name,jobTitle

from nameFile;

Notice the absence of detail in the SQL “program.” The SQL program states WHAT is to be done and the DBMS supplies the details of HOW to perform the task. The C++ program must specify how to perform the task. The diagram below illustrates the difference.

TRADITIONAL FILE SYSTEM

DBMS

The notable difference is that the user program (C++) in the traditional file system interfaces directly with the data; in the DBMS the user program (SQL) interfaces with the DBMS which in turn interfaces with the data. In the DBMS, the file structure is stored once, and as a SQL program references a file, the DBMS accesses the file’s structure. In other words, in a DBMS, a file’s structure is centrally located, while in the traditional file system the file structure is distributed across each program that uses the file – decentralized. Each C++ program that uses the nameFile must include the file’s structure; consequently, a change in the file structure requires a change in the program using that file – the traditional file system has no program/data independence as does the DBMS. If one hundred C++ programs use the nameFile, a change in the file’s structure would require changing and recompiling each of the one hundred C++ programs!

For example: Suppose a salary field were to be added to the nameFile Each C++ program using the nameFile would require adding the new field to its struct, even if the program did not use the salary field! On the other hand, the DBMS makes one change to the centrally located, single copy of the file’s structure. Each SQL program using the nameFile remains unaffected by the file structure change – this is program/data independence.

HOW IS A DATABASE DESIGNED?

The design cycle consists of three main steps that produce the following:

I Conceptual Data Model;

II Logical Data Model;

III Physical Database.

What is a Conceptual Data Model?

A conceptual data model is a very general view (abstraction) of:

1. Entities;

2. Relationships between entities.

The conceptual data model is expressed in a graphical language, consisting mainly of blocks, ovals, and lines. The following is a basic lexicon:

= a reference to an entity class.

= entity attribute

= relationship between entity classes.

* = many

1 = one

Example 3: A company wishes to develop a database involving its salespersons and its customers. The company’s Business Rules allow salespersons to have many customers, but each customer is assigned to only one salesperson.

Step I: create the conceptual data model,

SALESPERSON

CUSTOMER

1 has *

Think of the model above as an expression of the company’s business rules using a graphical language instead of English. The model above says the following: Any given saleperson has the possibility of many customers, and any given customer has the possibility of only one salesperson. The same statement can be summarized as follows: There is a “one to many” relationship between salesperson and customer.

Each entity can have one or more attributes, which describe the entity. It is often the case that an entity has an attribute(s) that is used to uniquely identify each member of the entity class – this attribute(s) is called the primary key. By definition, no two individual entities will have the same value for the primary key. For example, the salesperson ID is the primary key, which means that no two salespersons will have the same ID number. The customer acct# is the primary key for the customer entity. There are cases where it will require more than one attribute to serve as the primary key. It is customary to underline the primary key.

SUMMARY

A.) Each block represents a class of entities – e.g., the SALESPERSON block represents the class of salesperson entities – a member of this class is an individual salesperson such as Mary Smith.

B.) The following are called cardinalities.

1_________* means “one to many” relationship;

*_________* means “many to many” relationship;

1_________1 means “one to one” relationship.

C.) means attribute - e.g., the customer’s NAME, the salesperson’s ID, etc.

C.) A primary key is one or more attributes that serve to uniquely identify each member of a class of entities

Step II: Develop the logical data model

In this example, the step will consist of transforming the conceptual data model into a relational model – the relational model is one of many types of logical data models.

Here is the process.

A.) Transform each block (entity) into a relation.

B.) Transform each relationship into its relational counterpart.

Instead of a relation, we will be using relational schema, which consists of the name of the relation followed by the attributes enclosed in parentheses.

Transform each block (entity) into a relational schema

Salesperson(salespersonId, salespersonName)

Customer(customerAcct#, customerName, customerAddress)

Transform the “one to many” relationship into its relational counterpart

The way the relational model expresses relationships, such as “one to many”, is by the presence of a common attribute in each of the connected (related) relations. In this case, the primary key salespersonId will be added to the customer relation.

Customer(customerAcct#, salespersonId, customerName, customerAddress)

Rule for tranforming “one to many” relationship: add the primary key of the “one” relation to the “many” relation.

The salespersonId attribute in the customer relation is called the foreign key, indicating that this attribute refers to the primary key of another relation, and is used to relate (integrate) the two relations.

Step III: Implement the physical database.

Since the logical data model is relational, a relational DBMS must be chosen - in this case Microsoft’s Access DBMS is chosen.

Here is the process:

A.) Convert each relational schema into a table.

1. Define the table’s metadata( (column names ,data type)

2. Add the actual data values for each record of the table.

B.) Specify the relationships between tables.

Define the table’s metadata( (field names, data type)

Add the actual data values for each record of the table.

salesperson
salespersonId	name
001	smith
002	jones
003	adams

Repeat the above for the customer table.

Specify the relationships between tables - another item of metadata.

Oval Callout: “one to many”

NOTE: ∞ means many as does *

Example 4: A company wishes to develop a database involving the products it assembles and the types of parts that constitute each product. A product is made up of many different part types, and each part type may appear in different products.

Step I: Create the conceptual data model.

PRODUCT

* contains *

The model above says that a single product contains many parts, and a single part type is used in many products; that is, there is a “ many to many” relationship between product and part type. The primary keys are ID for the product entity and PARTNUM for the part type entity.

Step II: Develop the logical data model.

Here is the process.

A.) Transform each block (entity) into a relation.

B.) Transform each relationship into it’s relational counterpart.

Instead of a relation, we will be using relational schema, which consists of the name of the relation followed by the attributes enclosed in parentheses.

Transform each block (entity) into a relational schema

product(productId, productDescription)

part(partnum, partDescription, supplier)

Transform the “many to many” relationship into its relational counterpart

A “many to many” relationship is translated by creating an additional relation that contains, at minimum, the primary key of each relation in the relationship.

product_part(productId, partnum)

Rule for tranforming many to many relationship: Create an additional relation that is a hybrid of the two relations in the many to many relationship. The hybrid relation should contain, at minimum, the primary key of each of the aforementioned relations.

Step III This step is left to the reader.

SQL

SQL (Structured Query Language) is a language designed to be used with relational databases. The language has two main parts, DDL (Data Definition Language) and DML (Data Manipulation Language). The DDL performs such tasks as defining tables, deleting tables, and altering the structure of tables. The DML performs such tasks as retrieving data and maintaining data (updating records, deleting records, inserting records). See the following chart for a more complete breakdown.

--Create

-- Delete

SELECTION

QUERIES

ACTION QUERIES

-- Alter

--Select --Update

--Insert

--Delete

The following is a brief introduction to DML, specifically SELECTION QUERIES.

SELECTION QUERIES

This category of SQL statement(s) is designed to retrieve information from a relational database. It consists of one basic statement (called the select command) and various clauses and parameters. Below is the basic form of the select command.

select ………

from………… .

Where…… …..

Select Clause

The select clause contains what is to be displayed or projected. The expressions in the select clause can be:

field names; Example: select id,description
arithmetic expressions Example: select unitPrice*quantity+25.00

+ = addition
- = subtraction
* = multipication
/ = division

various built-in functions Example: select sum(amount)

sum( )
count( )
avg( )
min( )
max( )

aliases

Example:

select sum(amount)

Oval Callout: Default column name

(Output)

Oval Callout: alias Expr1000

250000

select sum(amount) as SalesTot

Oval Callout: Alias column name

(Output)

SalesTot

250000

From Clause

The from clause contains references to the source of the data. These sources could be:

a single table; Example: from customer
a combination of tables that are joined together; Example: from sales, cust,
a virtual table (explained later). Example: from query1
a select statement that generates a table. Example: from (select * from sales)

Where Clause

The where clause contains conditional expressions and is required only when needed.

Example: where state=’CT’ or state=’RI’

Some of the conditionals are:

= (equal)
<> (not equal)
and
or
not
> (greater than)
< (less than)
>= (greater than or equal)
<= (less than or equal)

Example: Simple projection of one field name.

Select custname

From customer

Example: Simple projection of many field names.

Select custname,age

From customer

Example: Simple projection of every field name in the table.

Select *

From customer

Example: Average age of all customers.

Select avg(age)

From customer

Example:Total number of customers.

Select count(*)

From customer

Example: Using an alias.

Select count(*) as TotalCustomers

From customer

Example:Total sales of all customers.

Select sum(sales)

From customer

Example:Using the where clause.

Select custname

From customer

Where sales > 10000

Example:List the names of customers with sales over 10,000 and under the age of 36.

Select custname

From customer

Where sales > 10000 and age < 36

Example: Simple join. List the names of each customer and the name and age of the salesperson assigned to them – each customer is assigned to only one salesperson.

Select custname,salespersonName,salespersonAge

From customer,salesperson

Where customer.salespersonId = salesperson.salespersonId

NOTE: The where clause above serves to join the customer and salesperson tables on the basis of a common field – salespersonId. As a result of the join, each salesperson record is joined (attached) to the appropriate customer record.

Group by Clause

Example:List the total customer sales by state.

Select custState,sum(sales) as TotSales

From customer

Group by custState

NOTE: The group by clause above groups all the Rhode Island customers together and applies the sum(sales) function; it then groups together all the Connecticut customers and applies the sum(sales) function, etc. See the results below.

custState TotSales

CT 4780

MA 5432

NY 10908

RI 3500

(See the SQL Supplement for a more comprehensive and in-depth coverage of SQL)

DATA MODELING

CONCEPTUAL DATA MODEL

The Conceptual Data Model (CDM) is a diagram that depicts entities and their relationships with other entities. Each entity is further described by depicting its attributes. The CDM represents the first step in a top-down approach to the design and development of a database.

Attribute versus Entity

An attribute is viewed as a property of an entity, but sometimes it is difficult to distinguish an entity from an attribute.

Example 5:

ENTITY: employee

ATTRIBUTES: social security number (ss#); name; skill

In example 5 above, skill is an attribute, but in another context skill is treated as an entity – see example 6 below.

Example 6:

* 1

In example 6, each employee has one skill, and each skill can be had by many employees - there is a one to many relationship from skill to employee.

Example 7 below shows another instance where skill is treated as an entity.

Example 7:

* *

In example 7, an employee can have many skills – there is a many-to-many relationship between employee and skill.

SUMMARY

In example 5 skill is an attribute because there is a functional relationship between employee and skill. This mean that for any given individual employee there is only one associated skill. Being in a functional relationship (on the one side with another entity) is a necessary but not sufficient condition for being an attribute.

In example 6 skill is an entity because skill has attributes, and by definition anything that has attributes is an entity.

In example 7 skills is an entity because there is no functional relationship between employee and skill – for any given employee there is not one skill but many.

Rule for determining attributes: Treat everything as an entity, and then look for entities that are:

1. In a functional relationship (on the one side) with another entity;

2. And do not have attributes;

3. And have no relationships with other entities.

If an entity meets all three criteria, then treat it as an attribute.

Aggregate Entity

An aggregate entity is an entity constructed of other entities.

Example 8A: A company wishes to develop a CDM depicting the relationship between its various stores and products sold. Each store can sell different types of products, and each product type can be sold by different stores.

* *

Example 8B: Same as example 8A, except the company wishes to know the total quantity of products sold by each store; and the total quantity sold of each product.

STORE

PRODUCT

* *

Example 8C: Same as example 8A, except the company wishes to know the total quantity sold of a given product by a given store.

Example 8B shows the total quantity sold by each store – e.g. store 10 sold a total quantity of 1000 product units. The model does not show a detailed breakdown of the various products – e.g 300 units of product x, 200 units of product y, and 500 units of product z. The model in example 8B is too general, putting it another way, it has a coarse granularity. Example 8C calls for more detail or finer granularity. Example 8C will require an aggregate entity.

SALES

* *

PRODUCT

STORE

NOTE

Model 8B cannot provide the detail (finer granularity) that 8C provides, but 8C can provide the generality (coarser granularity) of 8B. Using 8C, store 10’s total quantity sold can be determined by adding up the individual store 10 totals for products x, y, and z. Fine granularity is more flexible than coarse granularity.

The quantity is now the attribute of the aggregate entity (SALES) and represents the total quantity sold of a given product by a given store. The example 8C model has a finer granularity than the example 8B model. SALES is an aggregate entity because it is constructed of other entities – a nested entity. This model will lead to a database that can answer the query How much of product y did store 10 sell; but the granularity is not fine enough to answer the query How much of product y did store 10 sell on a given date? – see example 8D.

Example 8D: Same as example 8C, except the company wishes to know the total quantity sold of a given product by a given store on a given date. Note that a given store can sell the same product type on different dates.

* *

The quantuty attribute now represents the quantity sold of a given product by a given store on a given date. Until now, the relationships have been binary relations - relations between two entities. The relationship inside the aggregate entity above is a ternary relationship – the relationship between three entities. It shows that a given store sells a given product on many dates.

Oval: quantity Example 8E: Same as example 8D, except the ternary relationship is transformed into a binary relationship – a many-to-many between an aggregate entity and a non-aggregate entity. Model 8D and 8E are semantically equivalent – they convey the same meaning.

* *

DATE

*

The 8E model above illustrates a more complex nesting, namely an aggregate entity nested inside an aggregate entity, but there are some who prefer to deal with binary relationships.

Exercise 1: Explain why the model below (call it 8F) cannot answer the same query that 8D and 8E can answer – How many units of product x did store 10 sell on March 26, 2003 and on March 26, 2004?

SALES

* *

* *

* *

Conceptual Entity/Physical Entity

Philosophers make a useful distinction called the type/token distinction. For example, how many words are listed below?

Cat

Someone says three and another says one. In a sense, both are correct – it is said that there is one type and three tokens (instances) of the type. In database design the same distinction is sometimes called the conceptual (type) / physical (token) distinction.

Example 9: In the model below, the product entity is a conceptual entity, because a given product is sold in many stores. This would be impossible if the product entity were a physical entity because a physical entity is singular and can be sold in only one store, thus in this case it is the same conceptual (type) product that can be sold in many stores.

STORE

PRODUCT

* *

Example 10: The model below indicates that there is a always one to many cardinality between a conceptual entity and a physical entity.

(CONCEPTUAL)

AUTO

(PHYSICAL)

AUTO

1 *

Generalization (SuperType/SubType)

One of the most useful organizing activities of the mind is classification – the is a relationship. For example: a savings account is a bank account; a checking account is a bank account; a cat is an animal; gold is a metal. In the preceding examples, there is a supertype / subtype relationship: animal/cat; metal/gold; bank account/savings account, etc.

Example 11: The model below (the Ụ represents an “is a ´relationship) represents a supertype/subtype relationship among entities.

Ụ Ụ Ụ

CHECKING ACCOUNT

SAVINGS ACCOUNT

MORTGAGE ACCOUNT

Weak Entity

A weak entity is one whose existence depends on another entity (called the parent); that is, the cardinality from the weak (dependent) to the parent is a minimum of one and a maximum of one. The weak entity must contain the same primary key as the parent. In the example below, the weak entity contains the same primary key (invoice#) as the parent.

Example 12: The line item entity depends on the bill entity.

1 *

Miscellaneous

Exercises

1. Explain why the model below cannot answer the same query that 8D and 8E can answer – How many units of product x did store 10 sell on March 26, 2003 and on March 26, 2004?

The ACME Corp. wishes to develop a conceptual data model based on the following business rules:

Many parts are purchased from many suppliers.

The same part can be purchased from different suppliers; each supplier may charge a different price for the same part.

Each supplier may provide many parts.

The model should, at minimum, answer the following queries:

What are the various parts provided by a given supplier?
What is the price of a given part charged by a specific supplier?

The model should include the basic attributes of each object( entity).

The BARTON stockbroker firm wishes to develop a conceptual data model to track its clients and the stocks they OWN. Among the data for each stock are its symbol, the industry it is in, and its earnings. The firm requires the client’s name, address, and social security number. One client can hold many stocks, and one stock can be held by many clients. The firm needs to know the date the client purchased the stock, the number of shares purchased, and the purchase price per share.

A department has many employees, but an employee belongs to one department. The department holds periodic meetings with individual employees. The company wishes to record the date of the last meeting with each employee.

A department has many employees and an employee belongs to more than one department. The department holds periodic meetings with individual employees. The company wishes to record the date of the last meeting with each employee.

Many employees are assigned to a given project. A single employee may be assigned to more than one project. An employee can have many skills and a skill may be had by many employees. The company desires to know the set of skills an employee applies to a given project and the date assigned to a given project.

A computer is comprised of many parts. A part can appear in many computers. The company needs to list the parts contained in their various computer models. Also, the company needs to know the quantity on hand for its computers and parts.

The company buys the same part from different manufacturers – for example, it buys hard drives from Western Digital, Seagate, Maxtor, or IBM. The company has been experiencing quality control problems and wishes to identify the manufacturer of a failed part.. Draw an enhanced model that will accommodate both A and B.

Many employees are assigned to a given project. A single employee may be assigned to more than one project. Even though many projects can be assigned to a given city, an employee may be assigned to only one project per city. Develop a model that reflects these company rules.

** Exercise Answers

1. This model states the following: for a given combination of store and product there is only one date.

Models 8d and 8e state the following: for a given combination of store and product there are many dates.

* 1

* *

uses

*

NOTE: Will the model answer the following query: What are the skills of a given employee? The answer is NO. The model indicates the skills used by various employees, but not any skills not being used at the time. The following model adds this enhancement.

has

GENERIC MODELS

(Beyond scope?!!)

NORMAL FORMS

Theory of Relations

There are a variety of logical data models with the relational model being the most widely used variety and the object model variety catching on fast. The relational model will be discussed in this section. The relational model is based on the mathematical theory of relations where tables are called relations; rows are called tuples; fields are called attributes. A relation is a specific type of table that meets the following criteria:

No repeating attributes (fields/columns);
No duplicate tuples (rows/records).
Each tuple of a relation must have the same attributes.

A table with the following fields would not be a relation because of the repeating fields (skill).

WorkerId

WorkerName

skill1

skill2

skill3

One problem with repeating fields in this case is determining how many skill fields to allow. Suppose a worker has more than 3 skills!! Suppose a particular worker has one skill. If you define more skills than any worker could possibly have, then there will be many rows (workers) with wasted fields (space) – rows with many null values in the unused skill fields.

The solution to this problem is to divide the single table into two relations (see below).

worker(workerId, workerName)

workerSkills(workerId, skill)

Each time a single worker acquires a new skill, a new row is added to the workerSkills relation.

(workerSkills Relation)

workerId skill

120	electrical
230	plumbing
120	roofing
120	plumbing
230	electrical

Anomalies

Many larger companies deploy two types of databases:

1. A database for Information Retrieval – often called a Data Warehouse.

2. A database for daily Transactions - often called a Production Database

The topic of Anomalies and Normal Forms, in general, will apply to the Production Database. A production database, distinct from a Data Warehouse, undergoes frequent and voluminous maintenance. Maintenance is:

1. Adding new rows.

2. Deleting rows.

3. Modifying rows.

Anomaly will be defined as a rare and undesirable event. Anomalies can occur when a production database undergoes maintenance. There are the following types of anomalies:

1. Addition anomaly.

2. Deletion anomaly.

3. Modification anomaly.

The following relation will be used to illustrate the three anomalies.

student(studentId, major, studentName, advisorId)

Business Rules:

A student can have many majors.
An advisor is assigned to only one major.
A major can be assigned to only one advisor.

studentId major studentName advisorId

s100	acct	Able	f400
s100	math	Able	f500
s200	acct	Baker	f400
s300	hist	Crane	f600
s400	math	Doe	F500
s500	hist	Evans	f600
s500	math	Evans	f500

Addition Anomaly:

Suppose a new student (s600, Hayes) is to be added to the table, but the student has not yet chosen a major. In this case the student data (name and id) cannot be added to the table until a major is chosen. Why? A relation cannot contain a null value in the primary key. The primary key in this table is comprised of two fields – studentId and major.

Deletion Anomaly:

Suppose students s100 and s200 leave the school and all their rows are deleted from the relation. In this case more information will be deleted than was intended. It just so happens that these two students are the only students advised by f400, thus the information that f400 is an advisor for accounting majors is also deleted.

Modification Anomaly:

Suppose a new advisor (f900) replaces f500. In this case more than one row requires modification – every row with advisor f500 needs to be modified.

Normal Forms/Normalization

Maintaining a relation such as the above can lead to anomalies. This suggests that there is a design flaw in the table. Normalization is the process of designing relations that will not be prone to anomalies and other maintenance problems.

How does normalization relate to another design process conceptual data modeling? Normalization is bottom-up and conceptual data modeling is top-down. Top-down begins with a general view of the relationships between entities; bottom-up begins with a concrete view of fields and their relationships to other fields. Each design process leads to the construction of relations, but normalization can also be used to test tables created by the top-down process for design flaws. Why not just use normalization (bottom-up) because the result should be normalized relations that do not contain design flaws, whereas the top-down approach may produce relations with design flaws?! An advantage of the top-down approach is that it also generates a company-wide graphical overview of the data and the various relationships. The larger companies often use both approaches in a complementary manner.

The normalization process was created as the consequence of analyzing flawed tables and discovering the source of the flaw and thereby providing a remedy. The source of most anomalies is attributed to certain undesirable functional dependencies (FD’s). The normalization process will be viewed in this discussion as a set of constraints outlawing (forbidding) certain FD’s. Normalization produces relations that are in various degrees of normal form. The lower the degree, the less constraints imposed on the table and the more prone to flaws. Higher degrees impose increasingly more constraints until the relation is free from flaws. Each higher degree contains all the constraints of the preceding degrees plus more. What follows are the degrees of normal forms, beginning with the lowest and moving incrementally to the higher normal forms.

First Normal Form (1nf)
Second Normal Form (2nf)
Third Normal Form (3nf)
Boyce Codd Normal Form (bcnf)
Others

Before delving into the constraints imposed by the normal forms, it is necessary to understand the terms composite key, atomic key, subkey , nonkey, and the concept of functional dependency.

Composite Key: If a primary key consists of two or more fields, the primary key is called a composite key.

Subkey: If a primary key is a composite key, then any field or combination of fields within the composite key is called a subkey – the composite key as a whole is not considered a subkey.

Atomic Key: If a primary key consists of one field, the primary key is called an atomic key. An atomic key has no subkeys - a >composite key has subkeys.

Nonkey: Any field or combination of fields that is not part of the key.

Functional Dependency: In this context we are referring to the functional dependency that occurs between fields. A functional dependency occurs between field A and field B when a given value of A is associated with only a single value of B – it is said that A determines B, or B is dependent on A.

For example: In the student relation above, a single value of studentId (100) determines studentName (Able). Conversely, a single value of studentId does not determine major – studentId (100) can be mapped onto two values of major (acct, math). In the latter case, it is said that studentId multi-determines major, or major is not dependent on studentId.

What follows is the list of normal forms and their associated constraints.

1. First Normal Form (1nf)

Table must be a relation.

i. No repeating field.

ii. No duplicate rows.

2. Second Normal Form (2nf)

a. 1nf.

b. No subkey determines a nonkey.
Each nonkey is determined by the whole key.

c. 1nf plus an atomic key = 2nf.

3. Third Normal Form (3nf)

a. 2nf.

b. No nonkey determines another nonkey.
Each nonkey is determined by the whole key and nothing but the whole key.

c. 2nf + one nonkey = 3nf.

4. Boyce Codd Normal Form (bcnf)

a. 3nf.

b. No nonkey determines a subkey.

c. 3nf + atomic key = bcnf.

A relation in bcnf is free from any anomalies caused by FD’s – all lower normal forms are not free from anomalies.

The following notation will be used:

A à B means A determines B.

Aàà means A multi-determines B; implies A does not determine B.

A-/à means A does not determine B

Exercises: Identify the highest normal form of the following relations.

student(id, major, advisor)
majoràadvisor
advisorààmajor
idààmajor
student(id, activity, fee)
activityàfee
student(id, major, advisor)
majorààadvisor
advisoràmajor
student(id, name, year, major, advisor) Note: year means freshman, sophomore, etc.
majorààadvisor
year, majoràadvisor

Answers

1nf. It does not rise to 2nf because a subkey (major) determines a nonkey (advisor).
2nf. It does not rise to 3nf because a nonkey (activity) determines a nonkey (fee).
3nf. It does not rise to bcnf because a nonkey (advisor) determines a subkey (major).
2nf. It does not rise to 3nf because a nonkey (year, major) determines a nonkey (major).

DECOMPOSITION

Decomposition is a technique for normalizing a relation. Normalizing a relation means raising a relation to a higher normal form, in our case raising a relation to bcnf. In most cases, decomposition achieves this task by decomposing the relation in question into two or more relations that are in bcnf.

Example: The relation below is in 1nf because major (a subkey) determines advisor (a nonkey). Decompose the relation into bcnf.

NOTE:

Decomposition must not change the meaning of the original relation. The various relations resulting from decomposition must collectively contain the same information as the original relation. That is, the various relations, when joined, must form the original relation. .

student(id, major, advisor)

Decomposition Process:

Steps:

Create a relation containing the attributes of the offending functional dependency – majoràadvisor.
r1(major, advisor)
The second relation (r2) must contain the remaining attribute(s) and an attribute(s) that links r1 to r2 for joining purposes. There is one remaining attribute (id) in this case, but there is a choice between major and advisor as the linking attribute (foreign key). The rule is to always choose the determinant, which is major.

r2(id, major)

As it stands, joining r1 and r2 will produce the original relation. On the other hand, if advisor were chosen as the foreign key for r2 in step 2 (r2(id, advisor)), then joining r1 and r2 would NOT reproduce the original relation, but rather a relation with erroneous data. WHY?

Advisor multi-determines major – a single advisor can advise students in different majors.
If r2 is joined to r1 based on advisor, then all of a given advisor’s majors will be associated with each student that he/she advises. Suppose a student, s2, has one major (math) and the advisor is a5. Assume that a5 is also an advisor for physics majors. Now, when r2 is joined with r1, it will erroneously indicate that s2 is majoring in math and physics!!!

Exercises: Identify the normal form and decompose the following relations.

student(id, activity, fee)
activityàfee
student(id, major, advisor)
idààmajor
majorààadvisor
advisoràmajor
part(partnum, supplier, price)
partnumààprice
supplieràprice

Answers

1. a) 2nf - nonkey (activity) determines nonkey (fee).
b) r1(activity, fee) r2(id, activity)

2. a) 3nf – nonkey (advisor) determines subkey (major).
b) r1(advisor, major) r2(id, advisor)

3. a) 1nf – subkey (supplier) determines nonkey (price).

Transforming the Conceptual Data Model into Relations

Once a conceptual data model (CDM) has been approved, the next step is to transform it into a logical data model of choice – in this case the relational data model. What follows are examples of the various transformations.

Example 1: Transforming a One to One Relationship.

THE TRANSFORMATION PROCESS

Create a relational schema for each entity.

FACULTY(facId, name) MAJOR(majorId, description).

Transform the one to one relationship.
This can be done in either of two ways:

a) Add the facId field to the major relation where it serves as a foreign key
that relates (connects) the two relations.

b) Add the majorId field to the faculty relation where it serves as a foreign
key that relates (connects) the two relations.

Oval Callout: Foreign Key
(option a)

MAJOR(majorId, facId, description).

Choosing a or b may be a toss of the coin or may be determined by a reason. Suppose the following: every faculty does not act as an advisor; each major has an advisor; and option b was chosen. If that were the case, then some faculty rows would contain a null value in the majorId field indicating that they are not advisors. Usually null values are to be avoided wherever possible – choosing option a avoids the null values in this case.

NOTE:

.The CDM and the relational model are two different languages (notations). The CDM expresses the relationship between two entities as a line connecting the two entities. The relational model expresses the relationship between two relations by way of a common field appearing in each relation. This is done by adding the primary key field of one relation to the other relation where the field is called the foreign key. A foreign key should not be shown in a CDM – it has no meaning in the CDM language.

Example 2: Transforming a One to Many Relationship.

THE TRANSFORMATION PROCESS

Create a relational schema for each entity.

DEPARTMENT(deptid, description) EMPLOYEE(empId, name).
Transform the one to many relationship.
Add a foreign key to the relation on the many side – that is, add the primary key field of the relation on the one side to the relation on the many side.

EMPLOYEE(empId, deptId, name)

Example 3: Transforming a Many to Many Relationship.

THE TRANSFORMATION PROCESS

Create a relational schema for each entity.

STUDENT(sid, name) CLASS(cid, description).
Transform the many to many relationship.
Create an additional relation containing the primary keys of the original two relations.
Each of these fields will serve as foreign keys connecting the new relation to the other two relations. The combination of the foreign keys may serve as the primary key of the new relation.

                                    ENROLL(sid, cid)

                STUDENT(sid, name)     CLASS(cid, description).

Example 4: Transforming Aggregate Entities.

THE TRANSFORMATION PROCESS

Create a relational schema for each entity.

STUDENT(sid, name) CLASS(cid, description).
Transform the aggregate entity.
a) Transform the many to many inside the aggregate entity
ENROLL(sid, cid)

b) In a sense the enroll relation is the relation representing the aggregate entity. What remains is the addition of the aggregate entity’s attributes to the enroll relation.
ENROLL(sid, cid, finalGrade)

Example 5: Transforming Ternary Relationships.

Create a relational schema for each entity.

EMPLOYEE(empId, ……) CITY(cityId, ….). PROJECT(projId, ….)

NOTE:

.The primary key in ASSIGNMENT is composed of two fields – empId and cityId. This means that a given empId cannot have two equal cityId’s without having duplicate keys. This restraint reflects the business rule that a given employee can be assigned to only one project per city.

Transform the ternary relationship.
Create a single relation containing the primary key of each entity in the ternary relationship.

ASSIGNMENT(empId, cityId, projId)

What if the ternary relationship above were inside an aggregate entity? Just add any aggregate entity attributes to the assignment relation.

XML vs RELATIONAL
(Comparison of Models)

The following relational database stores student exam scores.

student
id	name
100	able
200	baker

grades
id	class	exam	score
100	c1	1	85
100	c1	2	88
100	c1	3	92
100	c2	1	87
200	c1	1	75
200	c1	2	85
200	c3	1	90

The following XML document stores the same information.

<?xml version="1.0" ?>

- <student_grades>

- < <student>

< <id>100</id>

< <name>able</name>

- < <grades>

- < <grade>

< <exam>1</exam>

< <score>85</score>

</grade>

- < <grade>

< <exam>2</exam>

< <score>88</score>

</grade>

- < <grade>

< <exam>3</exam>

< <score>92</score>

</grade>

<grade>

< <exam>1</exam>

< <score>87</score>

</grade>

</grades>

</student>

- < <student>

< <id>200</id>

< <name>baker</name>

- < <grades>

- < <grade>

< <exam>1</exam>

< <score>75</score>

</grade>

- < <grade>

< <exam>2</exam>

< <score>85</score>

</grade>

<grade>

< <exam>1</exam>

< <score>90</score>

</grade>

</grades>

</student>

</student_grades>

Note the differences:

XML

1. Data stored in one document.

2. No foreign key(s) are required.

3. The structure is hierarchical.

4. Can contain many levels.

5. Is a text document - can be read by any text reader including notepad, and web browsers.

6. Ideally suited for exchanging data on the web.

RELATIONAL

1. Requires two tables to store the data.

2. id is repeated in the scores table as a foreign key serving as a means for joining the two tables.

3. The structure is tabular – rows and columns.

4. Limited to three levels: table; record; field.

5. Data is stored in a proprietary binary format whose access is primarily limited to a specific database system.

6. Not suitable for exchanging data on the web.

WEB APPLICATION/ASP.NET

Databases play a central role in today’s pervasive world of web applications.

WHAT IS A WEB APPLICATION?

A web application is an application that can be initiated by anyone on the web. The web user supplies the web browser with the web address (URL) of the application – for example www.google.com. The consumer of a web application will be called the client and the provider of the web application will be called the server. The server is often designed to host thousands of clients accessing the application at the same time from all regions of cyberspace. Consequently, the web application must be designed to embrace many different clients:

clients with various web browsers and versions;

clients with various computer hardware/software (platforms);

and more.

Below is a diagram depicting the interaction between clients and a server. Each client may engage in a series of request/response sequences. It should be noted that a single sever may host a variety of web applications at the same time.

WHAT ARE THE MAIN COMPONENTS OF A WEB APPLICATION?

SERVER-SIDE

Web Server such as Microsoft’s IIS.

Receives client requests
Administers the execution of server-side scripts
Renders the response readable by the client’s browser

Program technology, such as Microsoft’s ASP.NET, for storing and executing scripts. (web programs).

Provides a document for storing scripts and data – ASP.NET document uses .aspx extension
Executes server-side scripts that may interact with databases using SQL statements
Server-side scripts are written in languages such as:

i. C#

ii. VB.NET (Visual Basic)

iii. C

iv. Others

Database such as Microsoft’s Access.

Stores client data
Stores SQL stored procedures
Stores application scripts and data

CLIENT-SIDE

Web Browser such as Microsoft’s IE (Internet Explorer)

1. Has the facility to execute client-side scripts written in:

a. JavaScript

b. VBScript

c. Others

2. Has the facility to parse XML documents

3. Exposes (provides access) the DOM (Document Object Model)

HOW DOES A SERVER-SIDE SCRIPT INTERACT WITH A DATABASE?

Using ASP.NET

A set of data access classes, called ADO.NET, is exposed (made available) to server-side scripts. This set contains many useful data access classes such as:

OleDbConnection – a connection object is used to connect the script to a database.
OleDbCommand – a command object is used to execute SQL commands against the database
DataSet – a dataset object can be used as a database that is stored in the computer’s memory and is disconnected from the actual database.
OleDbDataAdapter – a DataAdapter object acts as an intermediary between the actual database and the dataset.

inputs data (executes SQL statements) from the database.
assigns data to the dataset.
updates (executes SQL statements) the database with data from the dataset.
A DataAdapter object has its own built-in connection object and command object.

Here is a sequence of statements from a C# script that creates a dataset derived from the premier database.

string strconn,strcmd1,strcmd2;

strconn="Provider=Microsoft.Jet.OLEDB.4.0;";

strconn+="data Source=c:\\inetpub\\wwwroot\\premier.mdb";

strcmd1="select * from worker";

// connect & execute the SQL command using DataAdapter’s connection object & command object

OleDbDataAdapter da=new OleDbDataAdapter(strcmd1,strconn);

DataSet ds=new DataSet(); // create a dataset object

da.Fill(ds,"worker"); // assign the worker table to the dataset (ds)

strcmd1="select * from assignment";

// execute the SQL command using DataAdapter’s command object

da.SelectCommand.CommandText=strcmd1;

da.Fill(ds,"assignment"); // assign the assignment table to the dataset (ds)

strcmd1="select * from building";

// execute the SQL command using DataAdapter’s command object

da.SelectCommand.CommandText=strcmd1; // execute the SQL command

da.Fill(ds,"building"); // assign the building table to the dataset (ds)

// finally, the DataSet, ds, is complete

MORE ABOUT THE DATASET

This section will use the DataSet, ds, that was created above. The Dataset is a very useful data access object.

Here is an outline of the DataSet.

Assume:

ds = a DataSet object

dt = a DataTable object.

dr = a DataRow object

dc = a DataColumn object

1. Each Dataset has a collection of Tables.

ds.Tables

2. Each Table has a collection of Rows plus a collection of Columns.

a. dt.Rows

b. dt.Columns

3. Collections such as Tables, Rows, and Columns:

a. have a Count property.

i. ds.Tables.Count Indicates the number of tables

ii. dt.Rows.Count Indicates the number of rows

iii. dt.Columns.Count Indicates the number of columns

b. can be indexed to reference individual members of the collection – the index is zero-based

i. ds.Tables[2] refers to the 3^rd table in the collection

ii. dt,Rows[0] refers to the 1^st row in the collection

iii. dt.Rows[2][3] refers to the value stored in the 4th column of the 3^rd row in the collection

4. Each Row is a list of column values and is indexed.

dr[3] refers to the 4^th value in the row, which represents column 4’s value.

5. Each Column has a set of properties such as ColumnName. An individual column is not indexed.

dc.ColumnName

Here is a list of various Dataset related C# notations:

ds.Tables refers to the collection of tables comprising ds

ds.Tables.Count to the number (count) of tables in the dataset

ds.Tables[0] refers to to the first table assigned to ds – the “worker” table

ds.Tables[“worker”] also refers to the “worker” table

ds.Tables[1] refers to the second table assigned to ds – the “assignment” table, etc.

ds.Tables[“worker”].Rows refers to the “worker” table’s rows

DataTable dtw=ds.Tables["worker"]; creates a DataTable object for the “worker” table

dtw.Rows refers to the “worker” table’s rows – easier notation than ds.Tables[“worker”].Rows

dtw.Rows.Count refers to the number (count) of rows in the table

dtw.Rows[0] refers to the first row of the table

dtw.Rows[0][0] refers to the value stored in the first column of the first row of the”worker” table

dtw.Rows[0]["worker_id"] refers to the value stored in the first column (“worker_id”) of the first row of the table

DataRow dr=dtw.Rows[0]; creates a DataRow object of the first row of the “worker” table

dr[0] refers to the value stored in the first column of the first row of the “worker” table

dtw.Columns refers to the collection of columns in the “worker” table

dtw.Columns.Count dtw.Columns.Count refers to the number (count) of columns in the “work” table

dtw.Columns[0].ColumnName refers to the name of column 1

DataColumn dc=dtw.Columns[0]; creates a DataColumn object of the first column of the “worker” table

dc.ColumnName refers to the name of column 1 – same as dtw.Columns[0].ColumnName

Here is a nested for loop that outputs the column name and column value of each row of the worker table

Oval Callout: This loop retrieves each row in the table

for (int row=0;row<dtw.Rows.Count;row++)

{

for (int col=0;col<dtw.Columns.Count;col++)

Oval Callout: This loop retrieves each column in the row {

Response.Write(dtw.Columns[col].ColumnName+" = ");

Response.Write(dtw.Rows[row][col]);

Response.Write("<br>");

}

Response.Write("**************************************<br>");

}

Here is the output of the first two rows:

WORKER_ID = 1235
WORKER_NAME = M. FARADAY
HRLY_RATE = 18.5
SKILL_TYPE = ELECTRIC
SUPV_ID = 1311
**************************************
WORKER_ID = 1311
WORKER_NAME = C. COULOMB
HRLY_RATE = 15.5
SKILL_TYPE = ELECTRIC
SUPV_ID = 1311
**************************************

INDEX

How does a DBMS differ from the Traditional File System?

What is a Conceptual Data Model?

Define the table’s metadata( (field names, data type)

Add the actual data values for each record of the table.

Specify the relationships between tables - another item of metadata.

Select Clause

From Clause

Where Clause

Group by Clause

Generalization (SuperType/SubType)

** Exercise Answers

GENERIC MODELS

NORMAL FORMS

Theory of Relations

Anomalies

Normal Forms/Normalization

XML vs RELATIONAL (Comparison of Models)

WHAT IS A WEB APPLICATION?

XML vs RELATIONAL
(Comparison of Models)