Data Sample

 

Data was gathered from the Meredith College database during the spring of the 2003 academic year.  Founded in 1891, Meredith College is a private women’s liberal arts college located in Raleigh, North Carolina.  During the 2002-2003 school year, the student body consisted of 2,328 undergraduate and graduate students of traditional and non-traditional age.  These students represented 28 states and 17 foreign countries, with less than one percent being of a foreign nationality.  Because the student body is almost entirely American, it is acceptable to assume that the birthdays of Meredith College students reflect the American fertility cycle.

Another concern about our data is that it contains only female birthdays.  Some might wonder whether females follow a different seasonal birth pattern than males.  Since seasonally breeding animals synchronize their reproduction to the seasons that are more favorable to newborn survival, it is reasonable to believe similar factors may also cause humans to synchronize their breeding to more favorable conditions.  Research indicates males are more fragile and die earlier than females; thus fewer males than females are conceived under sub-optimal conditions, such as those related to parental age, environmental pollution, or parental smoking.  At first glance, this research suggests that the all-female Meredith College sample may not accurately model the American fertility distribution for both genders.  However, Cagnacci, Renzi, Alessandrini, and Volpe [3] tested this theory on 14,310 births from 1995 to 2001 and concluded that the sex ratio at birth did not show significant seasonal variation.  Therefore it is reasonable to use the Meredith College data sample as a representation of the American population.

 

Birthday Frequencies by Month

            From the Meredith College Registrar’s office, 2,106 individuals’ birthdays were collected to make up our data sample.  The frequency for this data is shown below in Table 3.1 and is broken down by months – for a more detailed frequency breakdown by calendar day, see Appendix A. 

 


Table 3.1: Birthday Frequencies by Month of the Meredith College Data Sample

| Month | Frequency | Observed Percent | Uniform Percent | Difference (Obs - Uni) |
|---|---|---|---|---|
| January | 155 | 7.36 | 8.49 | -1.13 |
| February | 162 | 7.69 | 7.67 | 0.02 |
| March | 173 | 8.21 | 8.49 | -0.28 |
| April | 170 | 8.07 | 8.21 | -0.14 |
| May | 173 | 8.21 | 8.49 | -0.28 |
| June | 177 | 8.40 | 8.21 | 0.19 |
| July | 195 | 9.26 | 8.49 | 0.77 |
| August | 198 | 9.40 | 8.49 | 0.91 |
| September | 198 | 9.40 | 8.21 | 1.19 |
| October | 153 | 7.26 | 8.49 | -1.23 |
| November | 172 | 8.17 | 8.21 | -0.04 |
| December | 180 | 8.55 | 8.49 | 0.06 |

* A table with birthday frequencies by calendar day can be found in Appendix A.

 

 

Notice that the percents for each month fall between 7.26 and 9.40 percent, each within roughly one point of a twelve-bin uniformity model, under which every month would hold 1/12, or about 8.33 percent, of the birthdays.  However, according to von Mises’ original assumption of uniformity, each day has an equal probability of being a birthday.  This means the uniform probability for each month changes with the number of days in that month.  Uniform probabilities for each month were found according to the following calculations for months with twenty-eight, thirty, and thirty-one days respectively:

$$\frac{28}{365} \approx 7.671\%, \qquad \frac{30}{365} \approx 8.219\%, \qquad \frac{31}{365} \approx 8.493\%$$


Given these new percents, it is not enough to simply look at the number of births observed in a particular month; more important is the difference between the observed values and the values expected under the uniform model.  Since these differences are not zero, the observed data varies from the uniform model; this suggests the possibility of an underlying non-uniform distribution for the observed data and population. 
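The comparison between observed and day-weighted uniform percents can be sketched in a few lines of Python.  This is only an illustration, assuming the monthly counts from Table 3.1 and a 365-day year; the names `counts`, `days_in_month`, and `monthly_comparison` are ours and do not come from the original analysis.  Because the differences here are taken before rounding, a value may differ from the table in the last digit.

```python
# Monthly birthday counts copied from Table 3.1.
counts = {
    "January": 155, "February": 162, "March": 173, "April": 170,
    "May": 173, "June": 177, "July": 195, "August": 198,
    "September": 198, "October": 153, "November": 172, "December": 180,
}

# Days per month in a 365-day (non-leap) year.
days_in_month = {
    "January": 31, "February": 28, "March": 31, "April": 30,
    "May": 31, "June": 30, "July": 31, "August": 31,
    "September": 30, "October": 31, "November": 30, "December": 31,
}


def monthly_comparison(counts, days_in_month):
    """Return {month: (observed %, uniform %, observed - uniform)}."""
    total = sum(counts.values())
    year_days = sum(days_in_month.values())
    rows = {}
    for month, n in counts.items():
        observed = 100 * n / total
        # Under day-level uniformity, a month's share is its share of days.
        uniform = 100 * days_in_month[month] / year_days
        rows[month] = (observed, uniform, observed - uniform)
    return rows


rows = monthly_comparison(counts, days_in_month)
for month, (obs, uni, diff) in rows.items():
    print(f"{month:<10} {obs:6.2f} {uni:6.2f} {diff:+6.2f}")
```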

Recall from earlier that America experiences a peak in births during September and an April-May trough.  The data meets the expectation for September: the three highest months are July, August, and September, with positive differences of 0.77, 0.91, and 1.19 respectively.  According to the data, the rise in births starts in July and gradually grows until it reaches a peak in September.  The trough for the sample, however, is not confined to a single set period of months.  January and February have two of the three lowest percents of births, 7.36 and 7.69 (only October, at 7.26, is lower), but in terms of the difference from the uniform model, January has the second lowest difference at -1.13 while February is almost equal to the expected uniform value, with a difference of 0.02.  The lowest difference, -1.23, occurs in October, which is not adjacent to January.  These two extremes offer nothing in support of a trough period; however, when focusing on clusters of negative differences, the months March, April, and May stand out with differences of -0.28, -0.14, and -0.28.  This period aligns closely with the trough predicted by current research.  The large drop observed in January can also be explained by current research: since the Meredith College student body is predominantly white and, according to research, white Americans experience a trough during December and January, the noticeable difference in January is somewhat explained.  Thus our data appears to support previous research in this area.

While on this level our data supports research stating that birth patterns follow a non-uniform distribution, the important question is whether the variation is large enough to reject the null assumption of a uniform distribution.  When Nunikhoven [13] examined birthdays distributed by births-per-day frequency based on month, he found the difference between his observed distribution and the uniform distribution to be insignificant.  Based on this research, there is reason to expect that even though our data suggests a distribution influenced by seasonal birth patterns, the difference might not be significant enough to reject the uniform distribution.
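One common way to ask whether monthly variation is large enough to reject uniformity is a chi-square goodness-of-fit test.  The sketch below is an assumption on our part, not a test reported in this section: it uses the Table 3.1 counts, takes expected counts proportional to the number of days in each month, and compares the statistic with 19.675, the standard tabulated 5% critical value for 11 degrees of freedom.  The names `chi_square_statistic` and `CRITICAL_5PCT_DF11` are hypothetical.

```python
# Monthly counts (January..December) from Table 3.1 and days per month.
counts = [155, 162, 173, 170, 173, 177, 195, 198, 198, 153, 172, 180]
days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def chi_square_statistic(observed, weights):
    """Pearson chi-square statistic with expected counts proportional to weights."""
    total = sum(observed)
    weight_sum = sum(weights)
    stat = 0.0
    for obs, w in zip(observed, weights):
        expected = total * w / weight_sum
        stat += (obs - expected) ** 2 / expected
    return stat


# Tabulated chi-square critical value for 12 - 1 = 11 degrees of freedom
# at the 5% significance level (an assumed choice of level).
CRITICAL_5PCT_DF11 = 19.675

stat = chi_square_statistic(counts, days)
reject_uniform = stat > CRITICAL_5PCT_DF11
print(f"chi-square = {stat:.2f}, reject uniform at 5%: {reject_uniform}")
```

A statistic below the critical value would mean that, at the monthly level, this sample alone does not give grounds to reject the day-weighted uniform model.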

Since the registrar’s office revealed only students’ birthdays, it is impossible to know the name, age, or nationality of the student behind any given birthday.  For this reason, none of the birthdays were thrown out.  Our data sample is smaller than the reported student body for Meredith College during 2002-2003 because students can choose whether or not to give Meredith College their birthday.  It is therefore reasonable to assume that the difference of 222 birthdays between the student body and the sample is due to some students not reporting their birthday; not all of the Meredith College student body is represented in the sample.  Another important point is that the data is self-reported, so it is conceivable that some students provided an incorrect birthday.  However, considering the importance of higher education in the United States and the trust a student places in her institution, this concern was disregarded, and all of the birthdays are assumed to be correct.

Having addressed the main concerns about the data sample and recognized the non-uniformity and seasonal pattern of the data set for a 12-month period, it is now necessary to continue with analyzing the data set.  Graph 3.2 displays the histogram for the birthdays grouped by month. 

 

Graph 3.2: Histogram of Monthly Birthday Distribution

 

 

The graph supports the patterns previously noticed in the table; there is a peak in September and a drop in January and February.  Notice that the graph does not communicate the difference between the observed values and the values expected under the uniform model.  As a result, the drop in January and February appears to be the trough for the data sample, rather than March-May as observed from the table earlier.  In the graph, many of the months appear to contain about the same number of birthdays, and the table confirms that few months differ drastically from the uniform model.  While not completely uniform, the lack of variation observed may support the uniform distribution, indicating that, while not exact, it does an adequate job of modeling the birthday distribution.  However, since the data does support the seasonal pattern, a more detailed understanding of the data is required. 

 

Birthday Frequencies by Calendar-Day

Since the monthly uniform probabilities originate from an underlying uniform distribution over the 365 calendar days, analyzing the data according to calendar-day frequency is the next step in understanding the nature of the Meredith College data set.  While it may seem appropriate to look at the distribution by day of the month before exploring the distribution by day of the year, this type of analysis is deceiving and inaccurate.  Recall from the earlier analysis that the varying number of days from month to month changes the expected probability of a birthday.  This characteristic again causes problems in a distribution by day of the month, because after twenty-eight days some months no longer contribute data.  Thus this type of analysis produces a fairly strong relationship for days up to the twenty-eighth, but then tapers off until the thirty-first day.  Since it is not appropriate to simply ignore all the days after the twenty-eighth, looking at the data on a calendar-day level is the next logical step in the data analysis process.
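The tapering described above can be made concrete by counting how many months contain each day of the month.  This is a small sketch assuming a 365-day year; `months_containing` is a hypothetical helper name.

```python
# Days per month (January..December) in a non-leap year.
days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def months_containing(day_of_month, month_lengths):
    """Number of months long enough to include the given day of the month."""
    return sum(1 for length in month_lengths if length >= day_of_month)


contributing = {d: months_containing(d, days_in_month) for d in range(1, 32)}

for d in (1, 28, 29, 30, 31):
    print(f"day {d:2d}: {contributing[d]} months contribute birthdays")
```

Days 1 through 28 draw on all twelve months, days 29 and 30 on eleven, and day 31 on only seven, so raw counts by day of the month are not comparable without adjustment.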

Breaking the data into 365 bins, with each bin representing one calendar day, produces even more information about the distribution of the data set.  On a scale this small, the data does not seem to follow any form of the uniform distribution; with rises and falls following no pattern, the data is noisy.  In the smaller of the two graphs, Graph 3.3, a red line roughly marks the uniform distribution, located at the uniform probability that a person is born on any given day of the year. 

 

Graph 3.3: Histogram of the Meredith College Data Binned by Calendar Day

 

 

The resolution of the smaller graph does not provide a clear representation of each day.  This graph serves only to offer a general feel for how the data is distributed when observed by calendar day, and for how the data sample relates to the uniform distribution.  On a scale this small, it is apparent that the data is rather noisy, reaching highs and lows at random and following no apparent pattern.  Considering the data set shows no indication of an underlying uniform distribution, this absence of a pattern offers no support for either the uniform distribution or a seasonal pattern.  These trends are better observed on a larger graph with higher resolution.

 


Graph 3.4: Histogram of the Meredith College Data Binned by Calendar Day

 

 

 


While many low points occur throughout the year, one day stands out above the rest with seventeen birthdays; this occurs on day 243, or August thirty-first, right between our two peak months.  Since seventeen is the maximum number of birthdays on any one day in the sample, it is natural to ask what the minimum number of birthdays on any one day is.  Given the uniform distribution, it would seem reasonable to expect that in a sufficiently large population every day should contain at least one birthday.  While not obvious from Graph 3.3 due to its low resolution, the Meredith College data sample contains three days with no birthdays, as may be seen in P2.  These days can also be observed in the table of calendar-day frequencies in Appendix A.  Surprisingly enough, these days occur in late September, mid-October, and late November, outside the trough months of March, April, and May.  This indicates that the data sample is not large enough to directly replace the uniform distribution, because it is very unlikely that any day would not contain at least one birth in the population, especially since these days fall outside the trough period.
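To put the three empty days in context, one can estimate how many empty days a sample of this size would be expected to have if birthdays really were uniform.  The sketch below is a back-of-the-envelope calculation under assumptions of ours: 2,106 independent birthdays, 365 equally likely days, and no leap day.  The variable names are illustrative only.

```python
n_birthdays = 2106   # sample size reported for the Meredith College data
n_days = 365         # equally likely calendar days (leap day ignored)

# Probability that one particular day receives none of the n birthdays.
p_day_empty = (1 - 1 / n_days) ** n_birthdays

# Expected number of empty days, by linearity of expectation.
expected_empty_days = n_days * p_day_empty

print(f"P(a given day is empty) = {p_day_empty:.4f}")
print(f"Expected number of empty days = {expected_empty_days:.2f}")
```

Under these assumptions roughly one empty day is expected in a sample of this size, which supports reading the empty days as a sample-size limitation rather than as evidence that no one in the population is born on those days.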

 

Rejecting the Uniform Distribution

Recognizing that the data sample is not large enough to directly replace the uniform distribution raises another interesting question – is the sample size large enough to reject the assumption of a uniform distribution of birthdays?  An adequate estimation for a multinomial distribution will depend on the sample size in relation to the population.  A population with many ‘bins’ or categories for data to fall into will require a larger sample to adequately represent the population than those populations with fewer categories.  When the updated distribution is known, the following formula derived from work by Bromaghin, Thompson, and Tortora yields the minimum sample size needed to accurately represent a population. [7] 

 

The quantities in this equation are the confidence coefficient, the number of categories, the posterior probability for category i, the expected prior probability for category i, and a z-score.

Using this formula, a minimum sample size of 3,506 people is needed for an updated distribution that assigns less credibility to the prior distribution, and 2,607 people are needed for one that places more credibility in the prior distribution.  Regardless of which updated distribution is used, the Meredith College sample of 2,106 birthdays is not large enough to guarantee that the true value for each bin will fall within a confidence interval around the predicted value.  Thus the sample size is not large enough to dispute the uniform distribution.


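Table 3.5: Frequency of the # of Birthdays

| # of Birthdays on Any Day | Frequency | Percent |
|---|---|---|
| 0 | 3 | 0.82 |
| 1 | 6 | 1.64 |
| 2 | 16 | 4.38 |
| 3 | 34 | 9.32 |
| 4 | 59 | 16.16 |
| 5 | 57 | 15.62 |
| 6 | 66 | 18.08 |
| 7 | 46 | 12.60 |
| 8 | 33 | 9.04 |
| 9 | 22 | 6.03 |
| 10 | 6 | 1.64 |
| 11 | 8 | 2.19 |
| 12 | 7 | 1.92 |
| 13 | 1 | 0.27 |
| 17 | 1 | 0.27 |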

 

Graph 3.6: Frequency plot of the # of Birthdays

 


Frequency of Birthdays on any given Day

Another interesting question raised by examining the minimum and maximum number of birthdays per day is how often each number of birthdays occurs on a calendar day.  This data is displayed in Table 3.5 and Graph 3.6 above.  The most common number of birthdays to fall on any one day is between four and seven; thus for a calendar day chosen at random, there is a 62.46% chance that between four and seven people in the sample were born on it.  For extremely high counts extending into the teens and extremely low counts of zero and one, the percent drops to around one percent or less.  This relationship suggests the number of birthdays clusters around one group of values and tapers off toward both ends, indicating a unimodal distribution with a mode of six.  Since it is impossible to have a negative number of birthdays and the mode is six, the distribution is probably right skewed.  While not discussed further in this paper, an interesting question for future exploration of the generalized birthday problem would be to examine more closely the distribution for this model and the most likely number of birthdays to occur on any given day.
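The summary figures quoted above can be recomputed directly from Table 3.5.  The following sketch assumes only the table's frequencies; `day_counts` and the other names are ours.  Because it works from unrounded values, the share it prints can differ from the 62.46% in the text, which sums rounded percents, in the last digit.

```python
# Table 3.5: number of birthdays on a day -> number of calendar days.
day_counts = {
    0: 3, 1: 6, 2: 16, 3: 34, 4: 59, 5: 57, 6: 66, 7: 46,
    8: 33, 9: 22, 10: 6, 11: 8, 12: 7, 13: 1, 17: 1,
}

total_days = sum(day_counts.values())

# Share of calendar days holding between four and seven birthdays.
share_4_to_7 = 100 * sum(day_counts[k] for k in range(4, 8)) / total_days

# Most frequent number of birthdays per day (the mode).
mode = max(day_counts, key=day_counts.get)

print(f"calendar days covered: {total_days}")
print(f"share with 4-7 birthdays: {share_4_to_7:.2f}%")
print(f"mode: {mode}")
```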

This initial analysis of the data is consistent with the seasonal pattern observed primarily among Caucasian Americans and also hints at a significant non-uniform distribution.  These two factors are important in the decision to continue with the updating of the uniform distribution; however, the most important factor is the presence of zero-birthday values for some calendar days.  This indicates that the Meredith College data sample is not large enough for an empirical distribution based on it to replace the original uniform distribution; therefore some form of updating that blends the old distribution with the observed data is necessary.  Through the application of Bayesian statistics, we will attempt to address the inadequacies of the data and the uniform model.  Also, establishing the foundation for this type of updating provides a pattern and data resources for anyone who chooses to explore this extension of the generalized birthday problem in the future.
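As a rough illustration of what blending a uniform prior with observed counts can look like, the sketch below uses the posterior mean of a Dirichlet-multinomial model.  This is one standard choice and an assumption on our part; the updating scheme actually used in this paper is not shown in this section.  The function name `blend_with_uniform`, the parameter `prior_weight`, and the toy counts are all hypothetical.

```python
def blend_with_uniform(counts, prior_weight):
    """Posterior mean probabilities under a uniform Dirichlet prior.

    prior_weight is the total pseudo-count given to the uniform prior;
    a larger value places more credibility in the prior distribution.
    """
    k = len(counts)
    n = sum(counts)
    return [(c + prior_weight / k) / (n + prior_weight) for c in counts]


# Toy example with four bins (not the Meredith College data):
# the first bin has no observations, like a day with zero birthdays.
toy_counts = [0, 3, 5, 2]
posterior = blend_with_uniform(toy_counts, prior_weight=4)

for count, prob in zip(toy_counts, posterior):
    print(f"count = {count}, blended probability = {prob:.4f}")
```

The point of the sketch is that a bin with zero observations still receives a positive probability after blending, which is the property an empirical distribution alone lacks.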