Data
was gathered from the
Another concern raised about the nature of our data is that it contains only female birthdays. Some might wonder whether females follow a different seasonal birth pattern than males. Since seasonally breeding animals synchronize their reproduction to the seasons most favorable to newborn survival, it is reasonable to believe similar factors may also cause humans to synchronize their breeding to more favorable conditions.
Research indicates males are more fragile and die earlier than females; thus fewer males than females are conceived under sub-optimal conditions involving factors such as parental age, environmental pollution, and parental smoking. At first glance, this research indicates the
Birthday Frequencies by Month
From the Meredith College Registrar’s office, 2,106 individuals’ birthdays were collected to make up our data sample. The frequency for this data is shown below in Table 3.1 and is broken down by months – for a more detailed frequency breakdown by calendar day, see Appendix A.
Month | Frequency | Observed Percent | Uniform Percent | Difference (Obs – Uni)
January | 155 | 7.36 | 8.49 | -1.13
February | 162 | 7.69 | 7.67 | 0.02
March | 173 | 8.21 | 8.49 | -0.28
April | 170 | 8.07 | 8.21 | -0.14
May | 173 | 8.21 | 8.49 | -0.28
June | 177 | 8.40 | 8.21 | 0.19
July | 195 | 9.26 | 8.49 | 0.77
August | 198 | 9.40 | 8.49 | 0.91
September | 198 | 9.40 | 8.21 | 1.19
October | 153 | 7.26 | 8.49 | -1.23
November | 172 | 8.17 | 8.21 | -0.04
December | 180 | 8.55 | 8.49 | 0.06
Notice that the percents for each month fall between 7.26 and 9.40 percent, each within roughly one percentage point of the 8.33 percent (1/12) that a twelve-bin uniformity model assigns to every month. However, according to von Mises’ original assumption of uniformity, each day has an equal probability of being a birthday. This means the uniform probability for each month depends on the number of days in that month. Uniform probabilities were therefore found by dividing each month's number of days by 365, giving 28/365, 30/365, and 31/365 for months with twenty-eight, thirty, and thirty-one days respectively.
Given these new percents, it is not enough to simply look at the number of births observed in a particular month; more important is the difference between the observed values and those expected under the uniform model. Since the values differ, the observed data varies from the uniform model, which suggests the possibility of an underlying non-uniform distribution for the observed data and population.
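The comparison between observed and uniform percents can be reproduced with a short calculation. The following is a minimal Python sketch, assuming a non-leap 365-day year as in the uniform model; the names `FREQUENCY`, `DAYS_IN_MONTH`, and `monthly_comparison` are illustrative and are not taken from the original analysis.

```python
# Observed monthly birthday counts, copied from Table 3.1.
FREQUENCY = {
    "January": 155, "February": 162, "March": 173, "April": 170,
    "May": 173, "June": 177, "July": 195, "August": 198,
    "September": 198, "October": 153, "November": 172, "December": 180,
}

# Days per month in a non-leap year (assumption: 365-day uniform model).
DAYS_IN_MONTH = {
    "January": 31, "February": 28, "March": 31, "April": 30,
    "May": 31, "June": 30, "July": 31, "August": 31,
    "September": 30, "October": 31, "November": 30, "December": 31,
}


def monthly_comparison(frequency, days_in_month):
    """Return (month, count, observed %, uniform %, difference) rows."""
    total = sum(frequency.values())
    days_in_year = sum(days_in_month.values())
    rows = []
    for month, count in frequency.items():
        observed = round(100 * count / total, 2)
        uniform = round(100 * days_in_month[month] / days_in_year, 2)
        # Difference of the rounded percents, as a table would report it.
        rows.append((month, count, observed, uniform, round(observed - uniform, 2)))
    return rows


rows = monthly_comparison(FREQUENCY, DAYS_IN_MONTH)
for row in rows:
    print(row)
```

Rounded to two places, 30/365 comes out as 8.22 percent rather than the 8.21 listed in the table, so the rows for thirty-day months may differ from Table 3.1 in the last digit.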
Recall from
earlier,
While on this level our data supports research which states that birth patterns follow a non-uniform distribution, the important question is whether the variation is large enough to reject the null assumption of a uniform distribution. When Nunikhoven [13] examined birthdays distributed by births-per-day frequency based on month, he found the difference between his observed distribution and the uniform distribution to be insignificant. Based on this research, there is evidence that even though our data suggests a distribution influenced by seasonal birth patterns, the difference might not be significant enough to reject the uniform distribution.
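One conventional way to frame the question of whether the variation is large enough is Pearson's chi-square goodness-of-fit test. The sketch below is an illustration only: the text does not state that this test is the method used in this paper, the monthly counts are copied from Table 3.1, and the function and variable names are hypothetical. The expected count for each month is taken as the sample size times that month's share of a 365-day year.

```python
# Monthly counts from Table 3.1 (January through December).
OBSERVED = [155, 162, 173, 170, 173, 177, 195, 198, 198, 153, 172, 180]
# Days per month in a non-leap year (assumption: 365-day uniform model).
DAYS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

# Tabulated chi-square critical value for 11 degrees of freedom
# (12 months minus 1) at the 5 percent significance level.
CRITICAL_VALUE_5_PERCENT = 19.675


def chi_square_statistic(observed, days):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    total = sum(observed)
    days_in_year = sum(days)
    statistic = 0.0
    for count, month_days in zip(observed, days):
        expected = total * month_days / days_in_year
        statistic += (count - expected) ** 2 / expected
    return statistic


statistic = chi_square_statistic(OBSERVED, DAYS)
reject_uniform = statistic > CRITICAL_VALUE_5_PERCENT
print(f"chi-square = {statistic:.2f}, reject uniform at 5%: {reject_uniform}")
```

The choice of a 5 percent significance level is an assumption made for illustration; a different level would use a different critical value.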
Since the
registrar’s office only revealed students’ birthdays, it is impossible to know
the name, age, or nationality of a student for any specified birthday. Due to this characteristic of the sample,
none of the birthdays were thrown out.
Our data sample is smaller than the reported student body for Meredith College during 2002 – 2003 because students can choose whether or not to give Meredith College their birthday. In this case it is reasonable to assume the 222-person/birthday difference between the two samples is due to some students not reporting their birthday to
Having addressed the main concerns about the data sample and recognized the non-uniformity and seasonal pattern of the data set for a 12-month period, it is now necessary to continue with analyzing the data set. Graph 3.2 displays the histogram for the birthdays grouped by month.
The graph supports the patterns previously noticed in the table; there is a peak in September and a drop in January and February. Notice that the graph does not communicate the difference between the observed values and the expected values based on the uniform model. As a result, the drop in January and February appears to be the trough of the data sample instead of March-May, as observed from the table earlier. In the graph, many of the months appear to contain about the same number of birthdays, and the table indicates that few months differ drastically from the uniform model. While the data is not completely uniform, the lack of variation observed may support the uniform distribution, indicating that while not exact, it does an adequate job of modeling the birthday distribution. However, since the data does support the seasonal pattern, a more detailed understanding of the data is required.
Birthday Frequencies by Calendar-Day
Since
the monthly distributions originate from an underlying uniform distribution
based on the 365 calendar days, analyzing the data according to calendar day
frequency is the next step in understanding the nature of the
Breaking the data into 365 bins, with each bin representing one calendar day, produces even more information about the distribution of the data set. On this small a scale, the data does not seem to follow any form of the uniform distribution; with rises and falls following no pattern, the data is noisy. In the smaller of the two graphs, Graph 3.3, a red line roughly marks the uniform distribution, located at the uniform probability that a person is born on any day of the year.
The resolution of the smaller graph does not provide a clear representation of each day. The graph serves only to offer a general feel for how the data is distributed by calendar day and for how the data sample relates to the uniform distribution. On this small a scale, it is apparent that the data is rather noisy, reaching highs and lows at random and following no apparent pattern. Since the data set shows no indication of an underlying uniform distribution, this absence of a pattern offers no support for either the uniform distribution or a seasonal pattern. These trends are better observed on a larger graph with higher resolution.
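The calendar-day binning described above can be sketched as follows. This is a minimal illustration, not the procedure used to build the graphs: the input format (a list of month and day pairs), the tiny `sample` list, and the function names are all hypothetical, and skipping February 29 is an assumption, since the text does not say how leap-day birthdays were handled in the 365-bin model.

```python
from collections import Counter
from datetime import date

# Any non-leap year gives the same day numbering (assumption for illustration).
REFERENCE_YEAR = 2003


def day_of_year(month, day):
    """Map a (month, day) pair to its calendar-day number, 1 through 365."""
    return date(REFERENCE_YEAR, month, day).timetuple().tm_yday


def bin_by_calendar_day(birthdays):
    """Count birthdays in 365 calendar-day bins; leap-day entries are skipped."""
    counts = Counter()
    skipped = 0
    for month, day in birthdays:
        if (month, day) == (2, 29):
            skipped += 1  # no bin exists for February 29 in a 365-day model
            continue
        counts[day_of_year(month, day)] += 1
    bins = [counts.get(d, 0) for d in range(1, 366)]
    return bins, skipped


# Hypothetical toy input, not the Meredith College data.
sample = [(8, 31), (8, 31), (1, 1), (2, 29), (12, 31)]
bins, skipped = bin_by_calendar_day(sample)

# Under the uniform model each day has probability 1/365, so a sample of
# 2,106 birthdays would average about 5.77 birthdays per calendar day.
expected_per_day = 2106 / 365
print(len(bins), skipped, round(expected_per_day, 2))
```

Comparing each bin against `expected_per_day` gives the count-scale equivalent of a horizontal uniform reference line.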
While many low points occur throughout the year, one day stands out above the rest with seventeen birthdays; this occurs on day 243, or August thirty-first, right in the middle of our two peak months. Since seventeen is the maximum number of birthdays on any one day in the sample, it would be interesting to know the minimum number of birthdays on any one day in the data set. Given the uniform distribution, it would seem reasonable to expect that in a sufficiently large population every day should contain at least one birthday. While not obvious from Graph 3.3 due to low resolution, the
Rejecting the Uniform Distribution
Recognizing that the data sample is not large enough to directly replace the uniform distribution raises another interesting question – is the sample size large enough to reject the assumption of a uniform distribution of birthdays? An adequate estimate of a multinomial distribution depends on the sample size in relation to the population. A population with many ‘bins’, or categories for data to fall into, requires a larger sample to be adequately represented than a population with fewer categories. When the updated distribution is known, the following formula, derived from work by Bromaghin, Thompson, and Tortora, yields the minimum sample size needed to accurately represent a population [7].
In this equation, is the confidence coefficient,
is the number of categories,
is the posterior probability for category i,
is the expected prior probability for category i, and
is a z-score.
Using this
formula, a minimum sample size of 3,506 people is needed for an updated
distribution which assigns less credibility to the prior distribution and 2,607
people are needed for an updated distribution which places more credibility in
the prior distribution. Despite the
updated distribution used, the
# of Birthdays on Any Day | Frequency | Percent
0 | 3 | 0.82
1 | 6 | 1.64
2 | 16 | 4.38
3 | 34 | 9.32
4 | 59 | 16.16
5 | 57 | 15.62
6 | 66 | 18.08
7 | 46 | 12.60
8 | 33 | 9.04
9 | 22 | 6.03
10 | 6 | 1.64
11 | 8 | 2.19
12 | 7 | 1.92
13 | 1 | 0.27
17 | 1 | 0.27
Graph 3.6: Frequency plot of the # of Birthdays
Frequency of Birthdays on any given Day
Another interesting exploration raised by examining the minimum and maximum number of birthdays per day is the frequency of the number of birthdays on any given calendar day. This data is displayed in Table 3.5 and Graph 3.6 above. The most common number of birthdays to fall on any one day is between four and seven; thus there is a 62.46% chance that between four and seven people are born on any chosen day of the year. For extremely high numbers extending into the teens and extremely low numbers around zero and one, the likelihood drops to around one percent. This relationship suggests the number of birthdays clumps around one group of numbers and tapers off toward both ends, thus indicating a unimodal distribution with a mode of six. Since it is impossible to have negative birthdays and the mode is six, the unimodal distribution is probably right skewed. While not discussed further in this paper, an interesting question for future exploration of the generalized birthday problem would be to examine more closely the distribution for this model and the most likely number of birthdays to occur on any given day.
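The summary figures in this paragraph can be checked directly from the counts in Table 3.5. The sketch below is illustrative only; the names are hypothetical, and using the sign of the third central moment as an indicator of right skew is an assumption about how one might test the skewness remark, not a method described in this paper.

```python
# Number of calendar days (value) on which a given number of
# birthdays (key) was observed, copied from Table 3.5.
DAYS_BY_BIRTHDAY_COUNT = {
    0: 3, 1: 6, 2: 16, 3: 34, 4: 59, 5: 57, 6: 66, 7: 46,
    8: 33, 9: 22, 10: 6, 11: 8, 12: 7, 13: 1, 17: 1,
}


def percent_of_days(freq, low, high):
    """Percent of calendar days with between low and high birthdays, inclusive."""
    total_days = sum(freq.values())
    in_range = sum(n for k, n in freq.items() if low <= k <= high)
    return 100 * in_range / total_days


def third_central_moment(freq):
    """Third central moment of birthdays per day; positive suggests right skew."""
    total_days = sum(freq.values())
    mean = sum(k * n for k, n in freq.items()) / total_days
    return sum(n * (k - mean) ** 3 for k, n in freq.items()) / total_days


mode = max(DAYS_BY_BIRTHDAY_COUNT, key=DAYS_BY_BIRTHDAY_COUNT.get)
share_4_to_7 = percent_of_days(DAYS_BY_BIRTHDAY_COUNT, 4, 7)
moment = third_central_moment(DAYS_BY_BIRTHDAY_COUNT)
print(mode, round(share_4_to_7, 2), moment > 0)
```

Computed from the raw counts, the four-to-seven share is 228/365, which rounds to 62.47 percent; the 62.46 in the text is the sum of the four already-rounded table percents.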
This initial analysis of the data is consistent with the seasonal pattern observed primarily among Caucasian Americans and also hints at a significant non-uniform distribution. Both factors are important in the decision to continue with the updating of the uniform distribution; however, the most important factor is the presence of zero birthday values for some calendar days. This indicates that an empirical distribution based on the