Maths Project on Statistics

Statistics is a mathematical science pertaining to the collection, analysis, interpretation and presentation of data.

In our daily life, we collect facts that help us answer questions about the world in which we live. These facts are often numerical, such as the number of runs scored by the Indian team against Pakistan.

The methods and techniques of collecting, presenting, analysing and interpreting numerical data in a logical and systematic manner, so as to serve a purpose, are known as ‘statistics’.

Meaning of Statistics

Statistics is concerned with scientific methods for collecting, organizing, summarizing, presenting and analyzing data, as well as deriving valid conclusions and making reasonable decisions on the basis of this analysis.

Origin and Growth of Statistics

The words ‘statistics’ and ‘statistical’ are derived from the Latin word status, meaning political state.

The German Statistik, first introduced by Gottfried Achenwall (1749), originally designated the analysis of data about the state.
It was used by the British mainly for administrative and governmental bodies.
In particular, censuses provide regular information about the population.
Today, statistics has broadened far beyond the service of a state or government; it includes areas such as business, natural and social sciences, and medicine.
Before 3000 B.C. the Babylonians used small clay tablets to record tabulations of agricultural yields and of commodities bartered or sold.
The Egyptians analyzed the population and material wealth of their kingdoms.
The Roman Empire was the first government to gather extensive data about the population, area and wealth of the territories that it controlled.

Fundamental Characteristics of Statistics

Statistical data are related to each other and are comparable.
They are aggregates of facts and not single observations. Statistics do not take into account individual cases.
Statistical data are numerically expressed.
Statistics are collected in a systematic manner.
Statistics are collected for a predetermined purpose.
Statistics deals with groups and does not study individuals.
Statistical laws are not exact; they are true only on average.

Data collected by someone other than the investigator are known as secondary data.
Data obtained in their original form are called ungrouped data or raw data.
An arrangement of raw numerical data in ascending or descending order of magnitude is called an array.

These figures are in ascending order:

4, 5, 8, 18, 28, 29, 29, 31, 40, 40, 43, 43, 46, 46, 46, 47, 47, 50, 50, 55, 55, 70, 71, 75, 75, 80, 90

Marks	No. of Students	Marks	No. of Students
4	1	46	3
5	1	47	2
8	1	50	2
18	2	55	3
28	1	70	1
29	2	71	1
31	1	75	2
40	2	80	1

Data arranged in ascending order, as above, are called arrayed data, and the arrangement itself is called an array.

An arrangement of data in a table that shows each value together with its frequency is known as a frequency distribution.

Marks are called variates. The number of students who secured a particular number of marks is called the frequency of the variate.

In other words, the frequency of a variate is the number of times that value occurs in the data.
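As a rough illustration, the tallying of frequencies can be sketched in Python. The snippet counts how often each mark occurs in the array listed earlier; the names `marks` and `frequency` are chosen here for illustration and are not part of the original project.

```python
from collections import Counter

# Marks copied from the array listed earlier, already in ascending order.
marks = [4, 5, 8, 18, 28, 29, 29, 31, 40, 40, 43, 43, 46, 46, 46,
         47, 47, 50, 50, 55, 55, 70, 71, 75, 75, 80, 90]

# Counter maps each variate (mark) to its frequency (number of occurrences).
frequency = Counter(marks)

# Print the frequency distribution, one variate per line.
for variate in sorted(frequency):
    print(variate, frequency[variate])
```

The sum of all the frequencies equals the number of observations in the array, which is a quick way to check a hand-made table.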

Continuous variables: Quantities which can take all numerical values within a certain interval.

Discontinuous (discrete) variables: Quantities which can take only certain distinct values, such as whole numbers.

Each group into which raw data is condensed is called a class. The size of the class is known as the class interval.

For example, 10 is the class interval of the class 0-10.

Each class is bounded by two figures which are called the lower limit and the upper limit.

The difference between the upper limit of the class and the lower limit of the class is called the class size.

The value which lies midway between the lower and upper limits of a class is known as its mid value or class mark.

Class mark = (upper limit + lower limit) / 2

The difference between the two extreme observations in arranged data, i.e. the difference between the maximum and minimum values of observations, is known as the Range.
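The three quantities defined above (class size, class mark and range) can be sketched in Python as follows. This is only a minimal illustration; the function names are assumptions made here and do not come from the project.

```python
def class_size(lower_limit, upper_limit):
    """Class size: upper limit minus lower limit."""
    return upper_limit - lower_limit


def class_mark(lower_limit, upper_limit):
    """Class mark (mid value): (upper limit + lower limit) / 2."""
    return (upper_limit + lower_limit) / 2


def data_range(observations):
    """Range: maximum observation minus minimum observation."""
    return max(observations) - min(observations)


print(class_size(0, 10))               # 10
print(class_mark(0, 10))               # 5.0
print(data_range([4, 5, 8, 18, 90]))   # 86
```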

The three measures of central tendency are:

Mean
Mode
Median

Mean

If x1, x2, x3, …, xn are values of a variable x, then the arithmetic mean or simply mean of these values is denoted by X and is defined as:

X = (x1 + x2 + x3 + … + xn) / n

or X = (∑i=1 to n of xi) / n

Algorithm

Prepare the frequency table so that its first column consists of the values of the variate and the second column the corresponding frequencies.
Multiply the frequency of each row with the corresponding values of the variable to obtain a third column containing fixi.
Find the sum of all entries in column III to obtain ∑fixi.
Find the sum of all the frequencies in column II to obtain ∑fi = N.
Use the formula: X = ∑fixi / N
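The steps of this algorithm can be sketched in Python as below. The function name `frequency_mean` and the sample data are hypothetical and serve only to show the calculation.

```python
def frequency_mean(values, frequencies):
    """Mean of a frequency distribution: sum(f_i * x_i) / sum(f_i)."""
    if len(values) != len(frequencies):
        raise ValueError("values and frequencies must have the same length")
    total_fx = sum(f * x for x, f in zip(values, frequencies))  # total of column III
    n = sum(frequencies)                                        # total of column II
    if n == 0:
        raise ValueError("total frequency must be positive")
    return total_fx / n


# Hypothetical data, chosen only to exercise the function.
values = [1, 2, 3, 4]
frequencies = [2, 3, 4, 1]
print(frequency_mean(values, frequencies))   # 24 / 10 = 2.4
```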

Example

Find the missing frequencies in the following frequency distribution if it is known that the mean of the distribution is 1.46.

No. of accidents (x)	0	1	2	3	4	5	Total
Frequency (f)	46	f1	f2	25	10	5	200

Computation of Arithmetic Mean

xi	fi	fixi
0	46	0
1	f1	f1
2	f2	2f2
3	25	75
4	10	40
5	5	25
Total	∑fi = 86 + f1 + f2	∑fixi = 140 + f1 + 2f2

N = 200

200 = 86 + f1 + f2
f1 + f2 = 114 … (1)

Also, Mean = 1.46:

1.46 = ∑fixi / N
1.46 = (140 + f1 + 2f2) / 200
292 = 140 + f1 + 2f2
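f1 + 2f2 = 152 … (2)

Solving (1) and (2):

Subtracting (1) from (2) gives f2 = 152 - 114 = 38, and substituting in (1) gives f1 = 114 - 38 = 76.

f1 = 76 and f2 = 38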
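The working can be checked with a short Python sketch that builds both equations directly from the table. The variable names are illustrative, and `Fraction` is used only so that the decimal mean 1.46 is handled exactly.

```python
from fractions import Fraction

xs = [0, 1, 2, 3, 4, 5]
fs = [46, None, None, 25, 10, 5]   # None marks the missing f1 (x = 1) and f2 (x = 2)
N = 200
mean = Fraction("1.46")            # exact decimal, avoids float rounding

# Totals contributed by the known rows of the table.
known_f = sum(f for f in fs if f is not None)
known_fx = sum(f * x for x, f in zip(xs, fs) if f is not None)

# Equation (1):   f1 +   f2 = N - known_f
# Equation (2): 1*f1 + 2*f2 = mean * N - known_fx
rhs1 = N - known_f
rhs2 = mean * N - known_fx

f2 = rhs2 - rhs1    # subtract (1) from (2)
f1 = rhs1 - f2      # substitute back into (1)

print(f1, f2)       # 76 38

# Check: with these frequencies the mean is 1.46 again.
filled = [46, f1, f2, 25, 10, 5]
assert sum(f * x for x, f in zip(xs, filled)) / N == mean
```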

Step Deviation Method

Obtain the frequency distribution and prepare the frequency table so that its first column consists of the values of the variable and the second column the corresponding frequencies.
Choose a number A (generally known as the assumed mean) and take deviations di = xi - A about A. Write these deviations against the corresponding values in the third column.
Divide the deviations di by h (class width) to get ui = di / h. Write these ui values in the fourth column.
Multiply the frequencies in the second column with the corresponding ui values in the fourth column to prepare a fifth column of fiui.
Find the sum of all entries in the fifth column to obtain ∑fiui and the sum of all frequencies to obtain N = ∑fi.
Use the formula: X = A + h * (1/N * ∑fiui)
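A minimal Python sketch of the step deviation method is given below. The function name is an assumption made here. The sample values are the class marks of the classes 5-10 to 40-45 with the frequencies of the grouped distribution used in the median example of this project, and the assumed mean A = 22.5 is an arbitrary choice.

```python
def step_deviation_mean(values, frequencies, assumed_mean, h):
    """Mean by the step deviation method: A + h * (sum(f_i * u_i) / N)."""
    n = sum(frequencies)                                   # N = sum of frequencies
    # u_i = (x_i - A) / h: deviation from the assumed mean, scaled by the class width
    us = [(x - assumed_mean) / h for x in values]
    total_fu = sum(f * u for f, u in zip(frequencies, us))
    return assumed_mean + h * (total_fu / n)


class_marks = [7.5, 12.5, 17.5, 22.5, 27.5, 32.5, 37.5, 42.5]
frequencies = [5, 6, 15, 10, 5, 4, 2, 2]

result = step_deviation_mean(class_marks, frequencies, assumed_mean=22.5, h=5)

# The direct formula sum(f_i * x_i) / N should give the same value.
direct = sum(f * x for x, f in zip(class_marks, frequencies)) / sum(frequencies)
print(round(result, 4), round(direct, 4))
```

Whatever value is chosen for A, the result agrees with the direct formula; a value of A near the middle of the data simply keeps the ui small and the arithmetic easy.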

Median

The median is the middle value of a distribution - the value of the variable which divides it into two equal parts.

Algorithm

Arrange the observations x1, x2, …, xn in ascending or descending order of magnitude.
Determine the total number of observations, say n.
If n is odd, then the median is the value of the ((n+1)/2)th observation.
If n is even, then the median is the mean of the (n/2)th and the ((n/2)+1)th observations.
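The algorithm can be sketched in Python as below. The function name `median` and the sample lists are illustrative. For an even number of observations the sketch follows the usual convention of averaging the two middle observations.

```python
def median(observations):
    """Median of raw (ungrouped) data."""
    ordered = sorted(observations)   # arrange in ascending order
    n = len(ordered)                 # total number of observations
    if n == 0:
        raise ValueError("median of empty data is undefined")
    middle = n // 2
    if n % 2 == 1:
        # odd n: the ((n + 1) / 2)th observation, which is index n // 2 (0-based)
        return ordered[middle]
    # even n: mean of the (n / 2)th and ((n / 2) + 1)th observations
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([7, 3, 9]))      # 7
print(median([4, 1, 3, 2]))   # 2.5
```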

Example

Calculate the median from the following distribution:

Class	Frequency
5-10	5
10-15	6
15-20	15
20-25	10
25-30	5
30-35	4
35-40	2
40-45	2

Solution: First, prepare the cumulative frequency table.

Class	Frequency	Cumulative Frequency
5-10	5	5
10-15	6	11
15-20	15	26
20-25	10	36
25-30	5	41
30-35	4	45
35-40	2	47
40-45	2	49

N = 49, N/2 = 24.5

The cumulative frequency just greater than N/2 is 26, and the corresponding class is 15-20 (median class).

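Here l = 15 (lower limit of the median class), f = 15 (frequency of the median class), F = 11 (cumulative frequency of the class preceding the median class) and h = 5 (class width).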

Median = l + ((N/2 - F) / f) * h = 15 + ((24.5 - 11) / 15) * 5 = 19.5
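The same calculation can be sketched in Python. The function name `grouped_median` and the representation of classes as (lower, upper) pairs are assumptions made for this illustration.

```python
def grouped_median(classes, frequencies):
    """Median of a grouped distribution: l + ((N/2 - F) / f) * h.

    classes is a list of (lower limit, upper limit) pairs in ascending order.
    """
    n = sum(frequencies)
    half = n / 2
    cumulative = 0
    for (lower, upper), f in zip(classes, frequencies):
        previous = cumulative      # F: cumulative frequency before this class
        cumulative += f
        if cumulative > half:      # first cumulative frequency just greater than N/2
            h = upper - lower      # class width
            return lower + ((half - previous) / f) * h
    raise ValueError("frequencies must sum to a positive number")


classes = [(5, 10), (10, 15), (15, 20), (20, 25),
           (25, 30), (30, 35), (35, 40), (40, 45)]
frequencies = [5, 6, 15, 10, 5, 4, 2, 2]

print(grouped_median(classes, frequencies))   # 19.5
```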

Mode

The mode or modal value of a distribution is that value of the variable for which the frequency is maximum. To compute the mode of a series of individual observations, we first convert it into a discrete series frequency distribution by preparing a frequency table. From the frequency table, we identify the value having maximum frequency.

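For a grouped frequency distribution, the class with the maximum frequency is called the modal class, and the mode is computed as:

Mode = l + ((f - f1) / (2f - f1 - f2)) * h

Where:

l - lower limit of the modal class
h - width of the modal class
f - frequency of the modal class
f1 - frequency of the class preceding the modal class
f2 - frequency of the class following the modal class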
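A minimal Python sketch of the grouped mode formula is given below. The function name is illustrative. The sketch assumes that a missing neighbouring class (when the modal class is the first or last one) has frequency 0, and the sample data reuse the grouped distribution from the median example of this project.

```python
def grouped_mode(classes, frequencies):
    """Mode of a grouped distribution: l + ((f - f1) / (2f - f1 - f2)) * h."""
    i = frequencies.index(max(frequencies))    # modal class (first class with maximum frequency)
    lower, upper = classes[i]
    f = frequencies[i]
    # Assumption: a neighbouring class that does not exist counts as frequency 0.
    f1 = frequencies[i - 1] if i > 0 else 0
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    h = upper - lower                          # width of the modal class
    denominator = 2 * f - f1 - f2
    if denominator == 0:
        raise ValueError("formula is undefined when 2f - f1 - f2 is zero")
    return lower + ((f - f1) / denominator) * h


classes = [(5, 10), (10, 15), (15, 20), (20, 25),
           (25, 30), (30, 35), (35, 40), (40, 45)]
frequencies = [5, 6, 15, 10, 5, 4, 2, 2]

# Modal class is 15-20 with f = 15, f1 = 6, f2 = 10.
print(round(grouped_mode(classes, frequencies), 2))
```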

Relationship Among Mean, Median and Mode

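For a moderately skewed distribution, the three measures are approximately connected by the following empirical relationship:

Mode = 3 * Median - 2 * Mean
Median = Mode + (2/3) * (Mean - Mode)
Mean = Mode + (3/2) * (Median - Mode)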
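The three forms are rearrangements of one relationship, which a short Python sketch can confirm. The function name and the sample values of mean and median are hypothetical.

```python
def empirical_mode(mean, median):
    """Estimate the mode from the relationship Mode = 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean


# Hypothetical values, chosen only to illustrate the relationship.
mean, median = 20.0, 19.0
mode = empirical_mode(mean, median)
print(mode)   # 17.0

# The other two forms give back the same median and mean.
median_again = mode + (2 / 3) * (mean - mode)
mean_again = mode + (3 / 2) * (median - mode)
print(round(median_again, 6), round(mean_again, 6))
```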

Types of Statistical Graphs

Pie Chart

A pie chart displays data as a percentage of the whole. It is a circle divided into sectors of different sizes to show different amounts of data, and each slice has a label and a percentage.

Advantages:

Visually appealing.
Shows percentage of total for each category.

Disadvantages:

No exact numerical data.
Hard to compare two data sets.
“Other” category can be a problem.
Total unknown unless specified.
Best for 3 to 7 categories only.

Bar Graph

A bar graph displays data in separate columns. If data is on a continuous scale, such as height, the bars touch each other. The bars can be vertical or horizontal.

Advantages:

Visually strong.
Can compare 2 or 3 data sets.

Disadvantages:

Graph categories can be reordered to emphasise certain effects.

Line Graph

A line graph plots continuous data as points and then joins them with a line. Multiple data sets can be grouped together, but a key must be used.

Advantages:

Can compare multiple continuous data sets easily.
Interim data can be inferred from the graph line.

Disadvantages:

Use only with continuous data.

Graph: Run rate comparison plotted as a line graph (plotted from data above).

Histogram

A histogram displays continuous data in ordered columns. Categories are of continuous measures such as time, inches, temperature, etc.

Advantages:

Visually strong.
Can compare to normal curve.
Usually the vertical axis is a frequency count of items falling into each category.

Disadvantages:

Cannot read exact values because data is grouped into categories.
More difficult to compare two data sets.
Use only with continuous data.

Graph: Histogram showing frequency distribution of the data (plotted from data above).

Uses and Applications of Statistics

Industries and Business

Reports of yearly sales and their comparison with others.
Shows where a factory or its sales are lagging and where they are performing well.

Agriculture

Amount of crops grown this year in comparison to the previous year, or in comparison to the amount of crops required by the country.
Quality and size of grains grown due to use of different fertilizers.

Forestry

How much the area under forest has grown, or how much forest has been depleted, in the last 5 years.
How much the numbers of different species of flora and fauna have increased or decreased in the last 5 years.

Education

Money spent on girls’ education in comparison to boys’ education.
Increase in number of girl students who appeared for different exams.
Comparison of results for the last 10 years.

Ecological Studies

Comparison of the increasing impact of pollution on global warming.
Increasing effect of nuclear reactors on the environment.

Medical Studies

Number of new diseases that emerged in the last 10 years.
Increase in number of patients for a particular disease.

Sports

Used to compare run rates of two different teams.
Used to compare two different players.

Conclusion

Working on statistics has been a great learning experience. It is an interesting and important topic that helps people of all ages, as well as teachers, to clarify their concepts and increase their knowledge.
