Variables and data


VARIABLES AND DATA
In scientific research, data arise from experiments whose results are recorded systematically. In business, data usually arise from accounting transactions or management processes (e.g., inventory, sales, payroll). Much of the data that statisticians analyze were recorded without explicit consideration of their statistical uses, yet important decisions may depend on the data. How many pints of type A blood will be required at Mt. Sinai Hospital next Thursday? How many dollars must State Farm keep in its cash account to cover automotive accident claims next November? How many yellow three-quarter-sleeve women’s sweaters will Lands’ End sell this month? To answer such questions, we usually look at historical data.

Data Terminology
An observation is a single member of a collection of items that we want to study, such as a person, firm, or region. An example of an observation is an employee or an invoice mailed last month. A variable is a characteristic of the subject or individual, such as an employee’s income or an invoice amount. The data set consists of all the values of all of the variables for all of the observations we have chosen to observe. In this book, we will use data as a plural, and data set to refer to a collection of observations taken as a whole. Data usually are entered into a spreadsheet or database as an n × m matrix. Specifically, each column is a variable (m columns) and each row is an observation (n rows). Table 2.1 shows a small data set with eight observations (8 rows) and five variables (5 columns). A data set may consist of many variables. The questions that can be explored and the analytical techniques that can be used will depend upon the data type and the number of variables. This textbook starts with univariate data sets (one variable), then moves to bivariate data sets (two variables) and multivariate data sets (more than two variables), as illustrated in Table 2.2.
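As a minimal sketch of the n × m layout described above, the Python snippet below stores a data set as a list of rows, one per observation, with one position per variable. The employees, variable names, and values are hypothetical and are not the contents of Table 2.1.

```python
# Hypothetical data set: each row is one observation (an employee),
# each column is one variable. All values are invented for illustration.
columns = ["name", "age", "income", "position"]
rows = [
    ("Ana", 34, 52000.0, "clerk"),
    ("Ben", 45, 61000.0, "manager"),
    ("Cho", 29, 48000.0, "clerk"),
]

n = len(rows)      # number of observations (rows)
m = len(columns)   # number of variables (columns)


def variable(name):
    """Return the values of one variable (one column) for all observations."""
    j = columns.index(name)
    return [row[j] for row in rows]


print(n, m)                # dimensions of the n x m matrix
print(variable("income"))  # a univariate data set drawn from the full one
```

Pulling out one column gives a univariate data set; pulling out two columns together would give a bivariate one.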

Categorical and Numerical Data
A data set may contain a mixture of data types. Two broad categories are categorical data and numerical data, as shown in Figure 2.1.

Categorical Data Categorical data (also called qualitative data) have values that are described by words rather than numbers. For example, structural lumber can be classified by the lumber type (e.g., fir, hemlock, pine), automobile styles can be classified by size (e.g., full, midsize, compact, subcompact), and movies can be categorized using common movie classifications (e.g., action and adventure, children and family, classics, comedy, documentary). Because categorical variables have nonnumerical values, it might seem that categorical data would be of limited statistical use. In fact, there are many statistical methods that can handle categorical data.

Sometimes the values of a categorical variable are represented using numbers. This is called coding. For example, a database might code payment methods using numbers:
1 = cash 2 = check 3 = credit/debit card 4 = gift card

Coding a category as a number does not make the data numerical and the numbers do not typically imply a rank. But on occasion a ranking does exist. For example, a database might code education degrees using numbers:
1 = Bachelor’s 2 = Master’s 3 = Doctorate
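The point that coding does not make the data numerical can be illustrated with a short Python sketch. The payment codes are the ones listed above; the list of coded transactions is hypothetical.

```python
from collections import Counter

# Codes from the text: each number stands in for a category label.
PAYMENT_CODES = {1: "cash", 2: "check", 3: "credit/debit card", 4: "gift card"}

# Hypothetical coded payment records (invented for illustration).
payments = [1, 3, 3, 2, 4, 3, 1]

# Counting how often each category occurs is a meaningful summary.
counts = Counter(PAYMENT_CODES[code] for code in payments)
print(counts)

# The average of the codes can be computed, but it has no interpretation:
# there is no payment method "between" check and credit/debit card.
mean_code = sum(payments) / len(payments)
print(mean_code)
```

For an ordered coding such as the education degrees, comparing codes (2 > 1) does carry meaning, but the distance between codes still does not.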

Some categorical variables have only two values. We call these binary variables. Examples include employment status (e.g., employed or unemployed), mutual fund type (e.g., load or no-load), and marital status (e.g., currently married or not currently married). Binary variables are often coded using a 1 or 0. For a binary variable, the 0-1 coding is arbitrary, so either choice is equivalent. For example, a variable such as gender could be coded as:

1 = female 0 = male
or as
1 = male 0 = female
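One way to see why the two codings are equivalent is sketched below in Python, using employment status and hypothetical observations. With 0-1 coding, the mean of the coded values is the proportion of observations coded 1, so swapping the coding simply yields the complementary proportion.

```python
# Hypothetical observations of a binary variable (invented for illustration).
status = ["employed", "unemployed", "employed", "employed", "unemployed"]

# Coding A: 1 = employed, 0 = unemployed
coded_a = [1 if s == "employed" else 0 for s in status]

# Coding B: the reverse assignment, 1 = unemployed, 0 = employed
coded_b = [1 - x for x in coded_a]

# The mean of a 0-1 variable is the proportion of observations coded 1.
prop_a = sum(coded_a) / len(coded_a)  # proportion employed
prop_b = sum(coded_b) / len(coded_b)  # proportion unemployed

print(prop_a, prop_b)
```

Either coding carries the same information, since each proportion can be recovered from the other as one minus its value.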

Numerical Data Numerical data (also called quantitative data) arise from counting, measuring something, or some kind of mathematical operation. For example, we could count the number of auto insurance claims filed in March (e.g., 114 claims) or sales for last quarter (e.g., $4,920), or we could measure the amount of snowfall over the last 24 hours (e.g., 3.4 inches). Most accounting data, economic indicators, and financial ratios are quantitative, as are physical measurements. Numerical data can be broken down into two types. A variable with a countable number of distinct values is discrete. Often, such data are integers. You can recognize integer data because their description begins with “number of.” Examples include the number of Medicaid patients in a hospital waiting room (e.g., 2) or the number of takeoffs at Chicago O’Hare International Airport in an hour (e.g., 37). Such data are integer variables because we cannot observe a fractional number of patients or takeoffs.

A numerical variable that can have any value within an interval is continuous. This would include things like physical measurements (e.g., distance, weight, time, speed) or financial variables (e.g., sales, assets, price/earnings ratios, inventory turns); for example, runner Usain Bolt’s time in the 100-meter dash (e.g., 9.58 seconds) or the weight of a package of Sun-Maid raisins (e.g., 427.31 grams). These are continuous variables because any interval such as [425, 429] grams can contain infinitely many possible values. Sometimes we round a continuous measurement to an integer (e.g., 427 grams), but that does not make the data discrete.

Apparent ambiguity between discrete and continuous is introduced when we round continuous data to whole numbers (e.g., your weight this morning). However, the underlying measurement scale is continuous. That is, a package of Sun-Maid raisins is labeled 425 grams, but on an accurate scale its weight would be a noninteger (e.g., 427.31). Precision depends on the instrument we use to measure the continuous variable. We generally treat financial data (dollars, euros, pesos) as continuous even though retail prices go in discrete steps of $0.01 (i.e., we go from $1.25 to $1.26). The FM radio spectrum is continuous, but only certain discrete values are observed (e.g., 104.3) because of Federal Communications Commission rules. Conversely, we sometimes treat discrete data as continuous when the range is very large (e.g., SAT scores) and when small differences (e.g., 604 or 605) aren’t of much importance. This topic will be discussed in later chapters. If in doubt, just think about how X was measured and whether or not its values are countable.
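The effect of instrument precision can be sketched in a few lines of Python. The weight is the raisin example from the text; the three "instruments" with 0, 1, and 2 decimal places are hypothetical.

```python
# Underlying continuous weight in grams (the raisin example from the text).
weight_grams = 427.31

# Readings from hypothetical instruments that report 0, 1, or 2 decimals.
readings = {decimals: round(weight_grams, decimals) for decimals in (0, 1, 2)}

for decimals, value in readings.items():
    print(f"{decimals} decimal place(s): {value} grams")

# Every reading describes the same package. Rounding changes what is
# recorded, not the continuous scale on which the weight was measured.
```

The coarsest reading looks like an integer, yet the variable is still continuous because a more precise instrument would distinguish values between 427 and 428 grams.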
