Statistics Notes
1. Introduction to Statistics
Statistics is the study of collecting, analyzing, interpreting, presenting, and organizing data. It provides a way to understand and make decisions based on data. The two main branches are:
- Descriptive Statistics: Focuses on summarizing and describing the features of a data set. It includes measures such as mean, median, mode, range, variance, and standard deviation.
- Inferential Statistics: Uses a random sample of data taken from a population to describe and make inferences about the population. It includes hypothesis testing, confidence intervals, and regression analysis.
2. Descriptive Statistics
Descriptive statistics summarize and describe the main features of a data set. They provide simple summaries about the sample and the measures. This can be done through numerical calculations, graphs, or tables.
Measures of Central Tendency
- Mean: The average of all data points. It is calculated by adding all the numbers in a data set and then dividing by the count of those numbers.
\[ \text{Mean} (\mu) = \frac{\sum_{i=1}^n x_i}{n} \]
- Median: The middle value when the data is ordered. If there is an even number of observations, the median is the average of the two middle numbers.
- Mode: The most frequently occurring value in the data set. A data set may have one mode, more than one mode, or no mode at all.
Measures of Dispersion
- Range: The difference between the highest and lowest values in a data set. It provides a measure of how spread out the values are.
\[ \text{Range} = \text{Maximum} - \text{Minimum} \]
- Variance: The average of the squared differences from the mean. It measures the spread of the data points.
\[ \text{Variance} (\sigma^2) = \frac{\sum_{i=1}^n (x_i - \mu)^2}{n} \]
- Standard Deviation: The square root of the variance. It provides a measure of the average distance from the mean.
\[ \text{Standard Deviation} (\sigma) = \sqrt{\sigma^2} \]
- Interquartile Range (IQR): The difference between the 75th percentile (Q3) and the 25th percentile (Q1). It measures the spread of the middle 50% of the data.
\[ \text{IQR} = Q3 - Q1 \]
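As a quick illustration, here is a minimal NumPy/SciPy sketch that computes all of these measures for a small hypothetical sample (stats.mode with keepdims=False assumes SciPy 1.9 or later):
import numpy as np
from scipy import stats
# Hypothetical sample data (exam scores)
data = np.array([72, 85, 90, 85, 78, 88, 95, 61, 85, 79])
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Mode:", stats.mode(data, keepdims=False).mode)
print("Range:", np.max(data) - np.min(data))
print("Variance:", np.var(data))              # population variance (divides by n)
print("Standard deviation:", np.std(data))
q1, q3 = np.percentile(data, [25, 75])
print("IQR:", q3 - q1)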
3. Data Visualization
Data visualization involves the graphical representation of data to help understand and communicate insights. Some common methods include:
- Histogram: A bar graph depicting the frequency distribution of a data set. Each bar represents the frequency of data within certain ranges (bins).
- Box Plot (Box-and-Whisker Plot): A graphical representation of the data's quartiles and outliers. It shows the median, quartiles, and potential outliers.
- Scatter Plot: A graph of plotted points showing the relationship between two variables. Each point represents an observation.
- Bar Chart: A chart with rectangular bars representing categorical data. Each bar's length is proportional to the value it represents.
- Pie Chart: A circular chart divided into sectors to represent proportions. Each sector's size is proportional to its value.
4. Probability
Probability measures the likelihood of an event occurring. It ranges from 0 (impossible event) to 1 (certain event). Key concepts include:
- Basic Probability: The ratio of the number of favorable outcomes to the total number of outcomes, assuming all outcomes are equally likely.
\[ P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}} \]
- Conditional Probability: The probability of event A occurring given that event B has occurred.
\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]
- Independent Events: Two events A and B are independent if the occurrence of A does not affect the occurrence of B.
\[ P(A \cap B) = P(A) \cdot P(B) \]
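A small worked example of these rules, using a fair six-sided die with event A = "roll is even" and event B = "roll is greater than 3":
# Sample space and events for one roll of a fair die
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}        # even roll
B = {4, 5, 6}        # roll greater than 3
p_a = len(A) / len(outcomes)               # P(A) = 1/2
p_b = len(B) / len(outcomes)               # P(B) = 1/2
p_a_and_b = len(A & B) / len(outcomes)     # P(A ∩ B) = 2/6 = 1/3
p_a_given_b = p_a_and_b / p_b              # P(A|B) = (1/3)/(1/2) = 2/3
print(f"P(A) = {p_a}, P(B) = {p_b}, P(A|B) = {p_a_given_b}")
# A and B are not independent here: P(A ∩ B) = 1/3, but P(A)·P(B) = 1/4
print("Independent?", p_a_and_b == p_a * p_b)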
5. Random Variable
A random variable is a numerical outcome of a random phenomenon. In probability and statistics, random variables are used to quantify outcomes and are typically classified into two main types: discrete and continuous.
Types of Random Variables
1. Discrete Random Variable:
- Takes on a countable number of distinct values.
- Examples include the number of heads in coin tosses, the number of students in a class, and the number of cars passing through a toll booth in an hour.
- Probability Mass Function (PMF): Lists the probabilities associated with each possible value.
- Example: The number of heads in 3 coin tosses can be 0, 1, 2, or 3, with specific probabilities for each outcome.
2. Continuous Random Variable:
- Takes on an infinite number of possible values within a given range.
- Examples include the height of students, the time it takes to run a mile, and the temperature in a city.
- Probability Density Function (PDF): Describes the likelihood of the random variable falling within a particular range of values.
- Example: The time it takes to complete a task can be any non-negative real value.
Characteristics of Random Variables
1. Expectation (Mean):
The average value of a random variable over many trials.
For a discrete random variable \(X\):
\[ E(X) = \sum_{x} x \cdot P(X = x) \]
For a continuous random variable \(X\):
\[ E(X) = \int_{-\infty}^{\infty} x \cdot f(x) \, dx \]
2. Variance and Standard Deviation:
Measures the spread or dispersion of the random variable's values around the mean.
For a discrete random variable \(X\):
\[ \text{Var}(X) = \sum_{x} (x - E(X))^2 \cdot P(X = x) \]
\[ \text{SD}(X) = \sqrt{\text{Var}(X)} \]
For a continuous random variable \(X\):
\[ \text{Var}(X) = \int_{-\infty}^{\infty} (x - E(X))^2 \cdot f(x) \, dx \]
\[ \text{SD}(X) = \sqrt{\text{Var}(X)} \]
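As a concrete check, the following sketch computes \( E(X) \), \( \text{Var}(X) \), and \( \text{SD}(X) \) for the three-coin-toss variable described above:
import numpy as np
# Number of heads in 3 fair coin tosses: values and their probabilities
values = np.array([0, 1, 2, 3])
probs = np.array([1/8, 3/8, 3/8, 1/8])
e_x = np.sum(values * probs)                   # E(X) = 1.5
var_x = np.sum((values - e_x) ** 2 * probs)    # Var(X) = 0.75
print(f"E(X) = {e_x}, Var(X) = {var_x}, SD(X) = {np.sqrt(var_x):.4f}")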
Examples
1. Discrete Random Variable - Binomial Distribution:
Represents the number of successes in a fixed number of independent Bernoulli trials.
Example: Number of heads in 10 coin tosses with a fair coin (each toss has a 0.5 probability of heads).
Probability Mass Function:
\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]
Expectation and Standard Deviation:
\[ E(X) = np \]
\[ \text{SD}(X) = \sqrt{np(1-p)} \]
2. Continuous Random Variable - Normal Distribution:
Describes data that cluster around a mean.
Example: Heights of adult men in a population.
Probability Density Function:
\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \]
Expectation and Standard Deviation:
\[ E(X) = \mu \]
\[ \text{SD}(X) = \sigma \]
6. Discrete Random Variable Probability Distribution
A discrete random variable takes on a countable number of distinct values. The probability distribution of a discrete random variable lists the probabilities associated with each possible value. Examples include the binomial distribution and the Poisson distribution.
Binomial Distribution
The binomial distribution represents the number of successes in a fixed number of independent Bernoulli trials, with each trial having the same probability of success. It is defined by two parameters: the number of trials (n) and the probability of success in each trial (p).
The probability mass function (PMF) of a binomial distribution is given by:
\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]
where \( X \) is the random variable representing the number of successes, \( n \) is the number of trials, \( k \) is the number of successes, and \( p \) is the probability of success in each trial.
Example: If you flip a fair coin 10 times, the probability of getting exactly 6 heads (successes) can be calculated using the binomial distribution formula with \( n = 10 \), \( p = 0.5 \), and \( k = 6 \).
The expected value (mean) and standard deviation of a binomial distribution are given by:
\[ E(X) = np \]
\[ \text{SD}(X) = \sqrt{np(1-p)} \]
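The coin-flip example above can be computed directly with scipy.stats.binom:
from scipy.stats import binom
# P(exactly 6 heads in 10 tosses of a fair coin)
n, p, k = 10, 0.5, 6
print("P(X = 6):", binom.pmf(k, n, p))    # ≈ 0.2051
print("E(X):", binom.mean(n, p))          # np = 5.0
print("SD(X):", binom.std(n, p))          # sqrt(np(1-p)) ≈ 1.581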
Poisson Distribution
The Poisson distribution represents the number of events occurring in a fixed interval of time or space. It is defined by a single parameter, \( \lambda \), the average number of events in that interval.
The probability mass function (PMF) of a Poisson distribution is given by:
\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \]
where \( X \) is the random variable representing the number of events, \( \lambda \) is the average number of events, and \( k \) is the number of events in the interval.
Example: If a call center receives an average of 5 calls per hour, the probability of receiving exactly 7 calls in an hour can be calculated using the Poisson distribution formula with \( \lambda = 5 \) and \( k = 7 \).
The expected value (mean) and standard deviation of a Poisson distribution are both given by:
\[ E(X) = \lambda \]
\[ \text{SD}(X) = \sqrt{\lambda} \]
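The call-center example above, computed with scipy.stats.poisson:
from scipy.stats import poisson
# P(exactly 7 calls in an hour, given an average of 5 calls per hour)
lam, k = 5, 7
print("P(X = 7):", poisson.pmf(k, lam))   # ≈ 0.1044
print("E(X):", poisson.mean(lam))         # λ = 5.0
print("SD(X):", poisson.std(lam))         # sqrt(λ) ≈ 2.236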
7. Continuous Random Variable Probability Distribution
A continuous random variable can take on any value within a certain range. Its probability distribution is described by a probability density function (PDF), which specifies the likelihood of the random variable falling within a particular range of values. Examples include the normal distribution and the exponential distribution.
Normal Distribution
The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution characterized by its bell-shaped curve. It is defined by two parameters: the mean (\(\mu\)) and the standard deviation (\(\sigma\)). The PDF of a normal distribution is given by:
\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \]
When to Use: The normal distribution is used when data tend to cluster around a central mean value, and it is applicable in various fields such as natural and social sciences.
Expectation and Standard Deviation:
\[ E(X) = \mu \]
\[ \text{SD}(X) = \sigma \]
Example: Suppose the heights of adult men in a certain population are normally distributed with a mean height of 70 inches and a standard deviation of 3 inches. The height of a randomly selected man can be modeled using this normal distribution.
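A short sketch of this example using scipy.stats.norm; the query values (73 and 67 inches) are arbitrary illustrations:
from scipy.stats import norm
mu, sigma = 70, 3   # mean and standard deviation of heights, in inches
# Probability that a randomly selected man is shorter than 73 inches:
print("P(X < 73):", norm.cdf(73, loc=mu, scale=sigma))   # ≈ 0.8413
# Probability that his height falls within one standard deviation of the mean:
print("P(67 < X < 73):", norm.cdf(73, mu, sigma) - norm.cdf(67, mu, sigma))  # ≈ 0.6827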
Exponential Distribution
The exponential distribution is a continuous probability distribution often used to model the time between events in a Poisson process. It is defined by a single parameter, \(\lambda\) (the rate parameter). The PDF of an exponential distribution is given by:
\[ f(x) = \lambda e^{-\lambda x} \text{ for } x \geq 0 \]
When to Use: The exponential distribution is used for modeling the time or space between events in a process where events occur continuously and independently at a constant average rate, such as the time between arrivals at a service center.
Expectation and Standard Deviation:
\[ E(X) = \frac{1}{\lambda} \]
\[ \text{SD}(X) = \frac{1}{\lambda} \]
Example: Suppose the average rate of cars passing through a toll booth is 2 cars per minute. The time between car arrivals can be modeled using an exponential distribution with \(\lambda = 2\).
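The same example in code. Note that SciPy parameterizes the exponential distribution by scale = 1/λ rather than by λ itself:
from scipy.stats import expon
lam = 2   # 2 cars per minute, so the mean gap is 1/λ = 0.5 minutes
print("P(gap < 1 min):", expon.cdf(1, scale=1/lam))   # 1 - e^(-2) ≈ 0.8647
print("E(X):", expon.mean(scale=1/lam))               # 1/λ = 0.5
print("SD(X):", expon.std(scale=1/lam))               # 1/λ = 0.5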
Z-score Table
A Z-score table, also known as a standard normal distribution table, provides the probabilities associated with standard normal distribution values (Z-scores). It helps determine the probability of a random variable falling below or above a certain value in a standard normal distribution.
Converting a Z-score into a probability using a Z-score table involves locating the corresponding value in the table and interpreting the probability associated with it. Here's a step-by-step guide:
- Find the Z-score: Calculate the Z-score of the value you're interested in using the formula:
\[ Z = \frac{x - \mu}{\sigma} \]
- Locate the Z-score in the Table: Look for the Z-score in the rows and columns of the Z-score table. The table provides values corresponding to the area under the standard normal curve up to that Z-score.
- Interpret the Probability: Once you find the Z-score in the table, the corresponding value represents the probability (or percentage) of values falling below that Z-score in a standard normal distribution.
For example, if the Z-score is 1.96, the corresponding value in the table might be 0.9750. This means that approximately 97.50% of the values in a standard normal distribution fall below a Z-score of 1.96.
A standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. It is often denoted by \( Z \) and is used as a reference distribution in many statistical analyses.
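In practice, software can replace a printed table. For the example above:
from scipy.stats import norm
print("P(Z < 1.96):", norm.cdf(1.96))    # ≈ 0.9750
# The inverse lookup: which Z-score has 97.5% of the distribution below it?
print("Z for 0.975:", norm.ppf(0.975))   # ≈ 1.96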
8. Sampling Distribution of the Sample Mean
The sampling distribution of the sample mean is the distribution of the means of all possible samples of a given size from a population. It is used to make inferences about the population mean. The Central Limit Theorem states that the sampling distribution of the sample mean will be approximately normally distributed, regardless of the shape of the population distribution, provided the sample size is sufficiently large.
Mathematically, the mean (\( \mu_{\bar{x}} \)) and standard deviation (\( \sigma_{\bar{x}} \)) of the sampling distribution of the sample mean (\( \bar{x} \)) are calculated as follows:
- Mean of the Sampling Distribution:
\[ \mu_{\bar{x}} = \mu \]
- Standard Deviation of the Sampling Distribution:
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]
- Where:
- \( \mu \) is the population mean.
- \( \sigma \) is the population standard deviation.
- \( n \) is the sample size.
According to the Central Limit Theorem, as the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the shape of the population distribution.
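A quick simulation illustrates the theorem: sample means drawn from a strongly skewed exponential population still match the formulas above (the population and sample sizes here are arbitrary choices):
import numpy as np
rng = np.random.default_rng(0)
# Population: exponential with mean μ = 1 and standard deviation σ = 1
n, n_samples = 50, 10_000
sample_means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)
print("Mean of sample means:", sample_means.mean())   # ≈ μ = 1.0
print("SD of sample means:", sample_means.std())      # ≈ σ/√n = 1/√50 ≈ 0.141
# A histogram of sample_means is approximately bell-shaped even though
# the underlying population is heavily skewed.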
9. Inferential Statistics
Inferential statistics make predictions or inferences about a population based on a sample of data. It involves using sample data to estimate population parameters and to test hypotheses.
Hypothesis Testing
- Null Hypothesis (H0): A statement that there is no effect or no difference. It is the hypothesis that the researcher tries to disprove or reject.
- Alternative Hypothesis (H1): A statement that there is an effect or a difference. It is the hypothesis that the researcher wants to support.
- P-Value: The probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis.
- Significance Level (α): A threshold for rejecting the null hypothesis, commonly set at 0.05. It represents the probability of committing a Type I error.
- Type I Error: Incorrectly rejecting the null hypothesis when it is true (false positive).
- Type II Error: Failing to reject the null hypothesis when it is false (false negative).
Point Estimation
Point estimation involves using sample data to calculate a single value (known as a statistic) which serves as a best guess or estimate of an unknown population parameter. Examples include the sample mean, sample variance, and sample proportion.
Confidence Intervals
A confidence interval is a range of values, computed from sample data, that is likely to contain an unknown population parameter at a stated confidence level (e.g., 95%).
\[ \text{Confidence Interval} = \bar{x} \pm Z \left(\frac{\sigma}{\sqrt{n}}\right) \]
where \( \bar{x} \) is the sample mean, \( Z \) is the Z-score corresponding to the desired confidence level, \( \sigma \) is the population standard deviation, and \( n \) is the sample size.
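A minimal sketch of this interval, with hypothetical values for \( \bar{x} \), \( \sigma \), and \( n \):
import numpy as np
from scipy.stats import norm
x_bar, sigma, n = 70.0, 3.0, 36   # hypothetical sample mean, known σ, sample size
z = norm.ppf(0.975)               # ≈ 1.96 for a 95% confidence level
margin = z * sigma / np.sqrt(n)
print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")   # (69.02, 70.98)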
10. Chi-Square Distribution
The chi-square distribution is a continuous probability distribution that is widely used in hypothesis testing, particularly in tests of independence and goodness of fit. It is characterized by its degrees of freedom (df), which are related to the number of categories or variables in the data.
- Chi-Square Test for Independence: Determines whether there is a significant association between two categorical variables.
\[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \]
where \( O_i \) is the observed frequency and \( E_i \) is the expected frequency.
- Chi-Square Test for Goodness of Fit: Determines whether a sample data matches a population with a specific distribution.
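For instance, a goodness-of-fit test of whether a six-sided die is fair might look like the following sketch; the observed counts are hypothetical:
import numpy as np
from scipy.stats import chisquare
# Observed face counts from 120 hypothetical rolls
observed = np.array([25, 17, 15, 23, 24, 16])
expected = np.full(6, 120 / 6)    # 20 per face if the die is fair
chi2, p = chisquare(observed, expected)
print(f"Chi2: {chi2:.3f}, p-value: {p:.3f}")   # a large p-value fails to reject fairness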
11. Evaluation of Testing Methods and Single-Sample t-Test
The evaluation of testing methods involves assessing the validity and reliability of statistical tests. The single-sample t-test is used to determine whether the sample mean is significantly different from a known or hypothesized population mean.
- Validity: The extent to which a test measures what it claims to measure.
- Reliability: The consistency of a test in measuring what it is supposed to measure.
- Single-Sample t-Test: Compares the sample mean to a known value (population mean) to determine if there is a significant difference.
\[ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \]
where \( \bar{x} \) is the sample mean, \( \mu \) is the population mean, \( s \) is the sample standard deviation, and \( n \) is the sample size.
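A minimal sketch with scipy.stats.ttest_1samp; the scores and the hypothesized mean of 75 are hypothetical:
import numpy as np
from scipy.stats import ttest_1samp
sample = np.array([78, 74, 82, 69, 77, 80, 73, 76, 81, 75])
t_stat, p_value = ttest_1samp(sample, popmean=75)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.3f}")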
12. Hypothesis Testing for Population Mean: Paired and Independent Sample t-Tests
Paired sample t-tests and independent sample t-tests are used to compare the means of different groups:
- Paired Sample t-Test: Compares the means from the same group at different times or under different conditions. It is used when the samples are dependent.
\[ t = \frac{\bar{d}}{s_d/\sqrt{n}} \]
where \( \bar{d} \) is the mean difference, \( s_d \) is the standard deviation of the differences, and \( n \) is the number of pairs.
- Independent Sample t-Test: Compares the means from two different groups. It is used when the samples are independent.
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]
where \( \bar{x}_1 \) and \( \bar{x}_2 \) are the sample means, \( s_1^2 \) and \( s_2^2 \) are the sample variances, and \( n_1 \) and \( n_2 \) are the sample sizes.
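The worked example below applies an independent-sample t-test to compare the Chinese grades of two classes: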
import numpy as np
from scipy import stats
# Chinese grades from class A
grades_a = np.array([82.1, 88.5, 92.2, 82.3, 89.6, 79, 82.8, 83.4, 75.3, 87.2,
84.2, 86.7, 84.5, 86.6, 87, 83.3, 84.3, 86, 82.5, 73.2,
92.9, 87.1, 88.1, 85.8, 89.7, 78.5, 78, 86.7, 77.5, 78.7,
85.1, 85.7, 86.9, 85.1, 84.9, 81.2, 84.7, 89, 72.5, 89.6])
# Chinese grades from class B
grades_b = np.array([74.6, 87.5, 81, 78.7, 63.5, 74.2, 83.2, 67, 76.6, 73.1,
78.1, 81.1, 76.6, 74.2, 78.4, 85.7, 78.1, 80, 78.9, 84.3,
76.2, 79.6, 80.4, 89.1, 79.5, 71.8, 75.9, 79.3, 84.6, 75.1,
79.4, 88.4, 76.4, 91.6, 79.4, 83.6, 79.2, 68.5, 84.2, 92.4])
# Perform an independent two-sample t-test
t_stat, p_value = stats.ttest_ind(grades_a, grades_b)
print(f"t-statistic: {t_stat}, p-value: {p_value}")
t-statistic: 4.044147451260445, p-value: 0.0001224569914759775
Given the t-statistic and the very low p-value, we can reject the null hypothesis and conclude that there is a significant difference in the Chinese grades between Class A and Class B.
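For the dependent case, here is a minimal paired-sample t-test sketch using scipy.stats.ttest_rel; the before/after scores are hypothetical:
import numpy as np
from scipy.stats import ttest_rel
# The same 8 students measured before and after a tutoring program
before = np.array([72, 68, 75, 71, 64, 80, 77, 69])
after = np.array([75, 70, 74, 78, 69, 83, 80, 72])
t_stat, p_value = ttest_rel(after, before)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.3f}")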
13. Hypothesis Testing for Population Median: Non-Parametric Methods (z-Test and Exact Test)
Non-parametric methods, such as the z-test and exact test, are used for hypothesis testing when the data do not necessarily follow a normal distribution:
- z-Test: Used to determine whether there is a significant difference between an observed sample proportion and a hypothesized population proportion. For a median, the sign test applies this idea: under the null hypothesis, each observation falls above the hypothesized median with probability \( \pi = 0.5 \), as shown in the sketch after this list.
\[ z = \frac{p - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}} \]
where \( p \) is the sample proportion, \( \pi \) is the population proportion, and \( n \) is the sample size.
- Exact Test: Used when sample sizes are small, and assumptions of parametric tests cannot be met. It provides exact p-values without relying on large-sample approximations.
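As an illustration of the proportion-based approach applied to a median, here is a minimal sign-test sketch with hypothetical data, showing both the large-sample z approximation and the exact binomial version (scipy.stats.binomtest assumes SciPy 1.7 or later):
import numpy as np
from scipy.stats import norm, binomtest
# H0: the population median is 50. Under H0, each observation falls above
# the median with probability π = 0.5.
data = np.array([52, 48, 57, 61, 49, 55, 53, 60, 47, 58, 54, 62])
n = len(data)
above = np.sum(data > 50)
# Large-sample z-test on the proportion above the hypothesized median
p_hat, pi0 = above / n, 0.5
z = (p_hat - pi0) / np.sqrt(pi0 * (1 - pi0) / n)
p_value_z = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.3f}, approximate p-value = {p_value_z:.3f}")
# Exact test, preferable when n is small
print("Exact p-value:", binomtest(int(above), n, 0.5).pvalue)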
14. Hypothesis Testing for Population Mean: ANOVA (F-Test) and Multiple Comparison t-Tests
ANOVA (Analysis of Variance) uses the F-test to compare the means of three or more groups. If the F-test is significant, multiple comparison t-tests are used to identify specific group differences:
- ANOVA (F-Test): Tests whether there are any statistically significant differences between the means of three or more independent groups.
\[ F = \frac{\text{Between-group variability}}{\text{Within-group variability}} \]
import numpy as np
from scipy.stats import f_oneway
# Example data (grades for three different groups)
group1_grades = np.array([88.8, 74.5, 72.6, 83.7, 101.6, 87.7, 87.4, 73.7, 55.4, 60.7,
78.2, 70.9, 73.6, 78.5, 68.5, 55.5, 68.9, 66.6, 82.5, 87.7,
47.0, 79.2, 80.4, 67.9, 89.7, 71.4, 69.7, 92.6, 70.3, 78.0])
group2_grades = np.array([68.5, 61.5, 55.9, 74.6, 50.6, 63.6, 70.8, 75.3, 66.3, 67.2,
71.0, 72.8, 64.3, 55.9, 89.5, 71.3, 77.6, 69.5, 69.3, 75.7,
73.7, 65.0, 72.2, 46.3, 76.7, 82.7, 79.5, 82.7, 73.9, 70.4])
group3_grades = np.array([84.1, 81.9, 72.9, 77.0, 75.1, 73.0, 84.7, 79.5, 77.4, 87.2,
79.3, 72.2, 81.4, 84.8, 83.1, 82.6, 76.9, 85.5, 76.5, 73.9,
88.0, 84.1, 80.9, 71.0, 76.3, 93.9, 78.7, 75.5, 78.2, 77.7])
# Perform one-way ANOVA
f_stat, p_value = f_oneway(group1_grades, group2_grades, group3_grades)
# Output the results
print(f"F-statistic: {f_stat}, p-value: {p_value}")
F-statistic: 8.662662235567426, p-value: 0.00037079478684125814
Given the F-statistic of 8.66 and the p-value of 0.00037, we reject the null hypothesis. This means that there is strong evidence to conclude that there are significant differences in the mean grades among the three groups. Thus, at least one group's mean grade is different from the others.
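If the ANOVA is significant, pairwise t-tests identify which groups differ. A minimal sketch, reusing the three group arrays defined above and applying a simple Bonferroni correction (one of several possible correction methods):
from itertools import combinations
from scipy.stats import ttest_ind
groups = {"group1": group1_grades, "group2": group2_grades, "group3": group3_grades}
pairs = list(combinations(groups, 2))
for name_a, name_b in pairs:
    t, p = ttest_ind(groups[name_a], groups[name_b])
    # Bonferroni: multiply each raw p-value by the number of comparisons
    print(f"{name_a} vs {name_b}: t = {t:.3f}, adjusted p = {min(p * len(pairs), 1.0):.4f}")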
15. Estimation and Testing of the Relationship between Two Categorical Variables: Contingency Table Analysis
Contingency table analysis is used to estimate and test the relationship between two categorical variables. It often involves chi-square tests:
- Contingency Table: A table that displays the frequency distribution of variables. Each cell in the table shows the count or frequency of occurrences for a specific combination of values.
- Chi-Square Test: Tests the independence between two categorical variables in a contingency table.
\[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \]
where \( O_i \) is the observed frequency and \( E_i \) is the expected frequency.
- Relative Risk: A measure of association between exposure to a risk factor and the likelihood of an outcome, calculated as the ratio of the probability of the outcome occurring in the exposed group to the probability of the outcome occurring in the unexposed group.
\[ \text{Relative Risk} = \frac{{P(\text{Outcome}|\text{Exposed})}}{{P(\text{Outcome}|\text{Unexposed})}} \]
- Odds Ratio: A measure of association between exposure to a risk factor and the odds of an outcome, calculated as the ratio of the odds of the outcome occurring in the exposed group to the odds of the outcome occurring in the unexposed group.
\[ \text{Odds Ratio} = \frac{{\text{Odds of Outcome in Exposed}}}{{\text{Odds of Outcome in Unexposed}}} \]
Let's illustrate these concepts with an example:
Suppose we are studying the relationship between smoking status (smoker/non-smoker) and the incidence of lung cancer (diagnosed/not diagnosed). We collect data from 1000 individuals and create a contingency table:
| | Lung Cancer Diagnosed | Lung Cancer Not Diagnosed | Total |
|---|---|---|---|
| Smoker | 70 (30) | 230 (270) | 300 |
| Non-Smoker | 30 (70) | 670 (630) | 700 |
| Total | 100 | 900 | 1000 |
In the contingency table, the numbers outside the parentheses represent the observed frequencies (\(O_i\)), while the numbers inside the parentheses represent the expected frequencies (\(E_i\)).
From this contingency table, we can calculate the observed and expected frequencies for each cell. Then, we can use the chi-square test formula to determine whether there is a significant association between smoking status and the incidence of lung cancer.
import numpy as np
from scipy.stats import chi2_contingency
# Example data (contingency table)
# Each row of data corresponds to a smoking status (Smoker, Non-Smoker).
# The first column of data represents the count of individuals with lung cancer diagnosed,
# and the second column represents the count of individuals without lung cancer diagnosed.
data = np.array([[70, 230], [30, 670]])
# Perform Chi-Square test
chi2, p, dof, ex = chi2_contingency(data)
print(f"Chi2: {chi2}, p-value: {p}, Degrees of freedom: {dof}")
print("Expected frequencies:\n", ex)
# Calculate Relative Risk
total_exposed = data[0, 0] + data[0, 1]
total_unexposed = data[1, 0] + data[1, 1]
risk_exposed = data[0, 0] / total_exposed
risk_unexposed = data[1, 0] / total_unexposed
relative_risk = risk_exposed / risk_unexposed
print("Relative Risk:", relative_risk)
# Calculate Odds Ratio
odds_ratio = (data[0, 0] * data[1, 1]) / (data[0, 1] * data[1, 0])
print("Odds Ratio:", odds_ratio)
Chi2: 82.55291005291005, p-value: 1.0287905227821553e-19, Degrees of freedom: 1
Expected frequencies:
[[ 30. 270.]
[ 70. 630.]]
Conclusion:
Based on the results of the chi-square test, there is strong evidence to reject the null hypothesis of independence between smoking status and lung cancer diagnosis. The observed frequencies significantly differ from the expected frequencies under the assumption of independence, indicating a potential association between smoking status and lung cancer diagnosis.
Additionally, we can calculate the relative risk and odds ratio to further understand the relationship between smoking and lung cancer risk.
Relative Risk: 5.444444444444445
Odds Ratio: 6.797101449275362
Relative Risk: The relative risk, calculated as 5.44, indicates that individuals who smoke are approximately 5.44 times more likely to be diagnosed with lung cancer compared to non-smokers.
Odds Ratio: The odds ratio, calculated as 6.80, suggests that the odds of being diagnosed with lung cancer are approximately 6.80 times higher for smokers compared to non-smokers.
16. Regression Analysis
Regression analysis examines the relationship between two or more variables. It helps understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.
Linear Regression
Linear regression is a method to model the relationship between a dependent variable and one or more independent variables. The simplest form is simple linear regression with one independent variable:
\[ Y = \beta_0 + \beta_1X + \epsilon \]
where \( Y \) is the dependent variable, \( X \) is the independent variable, \( \beta_0 \) is the y-intercept, \( \beta_1 \) is the slope, and \( \epsilon \) is the error term.
It assumes a linear relationship between the dependent and independent variables. The slope \( \beta_1 \) indicates the change in \( Y \) for a one-unit change in \( X \).
import numpy as np
import matplotlib.pyplot as plt
# Step 1: Generate synthetic data
np.random.seed(0) # For reproducibility
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Step 2: Add a bias term (x0 = 1) to each instance
X_b = np.c_[np.ones((100, 1)), X] # Add x0 = 1 to each instance
# Step 3: Implement the Normal Equation (closed-form solution)
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
# Step 4: Make predictions using the model
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new] # Add x0 = 1 to each instance
y_predict = X_new_b @ theta_best
# Step 5: Visualize the results
plt.plot(X_new, y_predict, "r-", label="Predictions")
plt.plot(X, y, "b.", label="Data points")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
print("Best fitting line parameters:", theta_best)
"""
1. Import Libraries:
numpy: Provides numerical computing functions like random number generation and matrix operations.
matplotlib.pyplot: Used for creating visualizations like plots.
2. Generate Synthetic Data:
np.random.seed(0): Sets a seed for the random number generator to ensure reproducibility (the same data will be generated each time).
X = 2 * np.random.rand(100, 1): Creates a 100x1 matrix X containing random values between 0 and 2.
y = 4 + 3 * X + np.random.randn(100, 1): Creates a 100x1 matrix y representing the target values. It's a linear function of X with a slope of 3, an intercept of 4, and added random noise.
The main difference between np.random.rand and np.random.randn is the type of random numbers they generate:
np.random.rand:
Generates random numbers from a uniform distribution between 0 (inclusive) and 1 (exclusive).
This means each number within the range [0, 1) has an equal chance of being generated.
It's useful when you need random values within a specific range where all values are equally likely.
np.random.randn:
Generates random numbers from a standard normal distribution, also known as a Gaussian distribution.
This distribution has a bell-shaped curve with a mean of 0 and a standard deviation of 1.
This means most of the generated values will be close to 0, with fewer values further away on either side.
It's useful when you need random values that follow a natural bell-shaped curve, often used in statistical simulations and machine learning.
3. Add Bias Term:
X_b = np.c_[np.ones((100, 1)), X]: Adds a column of ones (representing the bias term x0 = 1) to the beginning of X, resulting in a 100x2 matrix X_b. This is necessary for the linear regression model to capture the intercept in the equation.
In the code np.ones((100, 1)), the function np.ones is used to create a NumPy array filled with ones. Here's a breakdown of its meaning:
np: This is the NumPy library, which provides powerful tools for numerical computing in Python.
ones: This is a function within NumPy that creates an array of elements set to 1.
((100, 1)): This is the shape of the array being created.
100: This specifies that the array will have 100 rows.
1: This specifies that the array will have 1 column.
Therefore, np.ones((100, 1)) creates a 100x1 matrix (column vector) where all the elements are 1.
4. Implement the Normal Equation:
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y:
X_b.T: Transposes X_b to create a 2x100 matrix.
X_b.T @ X_b: Computes the matrix product of the transpose with X_b, giving the 2x2 Gram matrix \( X^T X \) used in the normal equation.
np.linalg.inv(X_b.T @ X_b): Inverts that 2x2 matrix.
X_b.T @ y: Computes \( X^T y \), a 2x1 vector. Multiplying the inverse by this vector yields theta_best, the intercept and slope that minimize the sum of squared errors.
@: Matrix multiplication operator.
5. Make Predictions:
X_new = np.array([[0], [2]]): Creates a 2x1 matrix X_new containing two new input values to make predictions for.
X_new_b = np.c_[np.ones((2, 1)), X_new]: Adds a column of ones to X_new to include the bias term.
y_predict = X_new_b @ theta_best: Calculates the predicted target values y_predict for the new input values using the obtained theta_best.
6. Visualize Results:
plt.plot(X_new, y_predict, "r-", label="Predictions"): Plots the predicted values (red line) using the new input values and y_predict.
plt.plot(X, y, "b.", label="Data points"): Plots the original data points (blue dots).
plt.xlabel("X"): Sets the x-axis label.
plt.ylabel("y"): Sets the y-axis label.
plt.legend(): Adds a legend to the plot.
plt.show(): Displays the plot.
7. Print Best Fitting Line Parameters:
print("Best fitting line parameters:", theta_best): Prints the values of theta_best, which represent the slope and intercept of the best-fitting line.
"""
Multiple Regression
Multiple regression extends simple linear regression to include multiple independent variables. The model can be written as:
\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_nX_n + \epsilon \]
where \( Y \) is the dependent variable, \( X_1, X_2, \ldots, X_n \) are the independent variables, \( \beta_0 \) is the y-intercept, \( \beta_n \) are the slopes, and \( \epsilon \) is the error term.
Each coefficient \( \beta_n \) represents the change in the dependent variable \( Y \) for a one-unit change in the corresponding independent variable \( X_n \), holding all other variables constant.
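A minimal multiple-regression sketch on synthetic data, using np.linalg.lstsq instead of an explicit matrix inverse (a numerically safer variant of the normal-equation approach used earlier); the true coefficients 4, 3, and -2 are arbitrary choices:
import numpy as np
rng = np.random.default_rng(42)
# Synthetic data: y = 4 + 3*x1 - 2*x2 + noise
n = 200
X = rng.random((n, 2))
y = 4 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)
X_b = np.c_[np.ones(n), X]                         # prepend the intercept column
coef, *_ = np.linalg.lstsq(X_b, y, rcond=None)     # least-squares solution
print("Estimated [intercept, beta1, beta2]:", coef)   # ≈ [4, 3, -2]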
17. Correlation
Correlation is a statistical measure that indicates the extent to which two variables fluctuate together. It can be positive (both variables increase together), negative (one variable increases while the other decreases), or zero (no linear relationship).
- Correlation Coefficient (r): A value between -1 and 1 that indicates the strength and direction of a linear relationship between two variables.
\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} \]
where \( r \) is the correlation coefficient, \( x_i \) and \( y_i \) are the individual sample points, \( \bar{x} \) and \( \bar{y} \) are the sample means.
- Pearson Correlation: Measures the linear relationship between two continuous variables. Values close to 1 or -1 indicate a strong linear relationship, while values close to 0 indicate a weak or no linear relationship.
For example, if the correlation coefficient \( r \) between two variables is 0.9, it indicates a strong positive linear relationship, meaning as one variable increases, the other also increases.
- Spearman Rank Correlation: Measures the strength and direction of the monotonic relationship between two ranked variables. It is a non-parametric measure of correlation.
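A minimal sketch computing both coefficients on synthetic data in which a positive linear relationship is built in by construction:
import numpy as np
from scipy.stats import pearsonr, spearmanr
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)   # strong positive relationship plus noise
r, p = pearsonr(x, y)
rho, p_s = spearmanr(x, y)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
print(f"Spearman rho = {rho:.3f} (p = {p_s:.3g})")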
18. Common Probability Distributions
Probability distributions describe how the values of a random variable are distributed. Some common probability distributions are:
- Normal Distribution: A continuous probability distribution characterized by a bell-shaped curve, defined by mean (μ) and standard deviation (σ).
It is symmetric about the mean, with most of the data points falling within three standard deviations of the mean.
The normal distribution is widely used in statistics due to the Central Limit Theorem, which states that the sum of many independent and identically distributed random variables approaches a normal distribution, regardless of the original distribution of the variables.
- Binomial Distribution: A discrete distribution representing the number of successes in a fixed number of independent Bernoulli trials. It is defined by two parameters: the number of trials (n) and the probability of success in each trial (p).
\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]
where \( X \) is the random variable representing the number of successes, \( n \) is the number of trials, \( k \) is the number of successes, and \( p \) is the probability of success in each trial.
- Poisson Distribution: A discrete distribution representing the number of events occurring in a fixed interval of time or space. It is defined by a single parameter \( \lambda \), which is the average number of events in the interval.
\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \]
where \( X \) is the random variable representing the number of events, \( \lambda \) is the average number of events, and \( k \) is the number of events in the interval.
The Poisson distribution is often used to model the number of occurrences of rare events, such as the number of phone calls received by a call center in an hour or the number of decay events per unit time from a radioactive source.
19. Common Notations in Probability Theory and Statistics
In probability theory and statistics, similar concepts are often represented using different notations. Here are some common examples of notations used in both fields:
- Random Variable: Typically denoted by uppercase letters, such as \( X \), \( Y \), or \( Z \).
- Probability of an Event: Represented by \( P(\text{Event}) \), where Event is the outcome of interest.
- Probability Distribution: Represented by functions such as \( P(X) \) or \( f(x) \), where \( X \) is a random variable and \( x \) is a specific value.
- Expected Value: Denoted by \( E(X) \) or \( \mu \), representing the mean or average value of a random variable.
- Variance: Denoted by \( \text{Var}(X) \) or \( \sigma^2 \), representing the measure of the spread or dispersion of a random variable.
- Standard Deviation: Represented by \( \sigma \) or \( \text{SD}(X) \), representing the square root of the variance.
- Probability Density Function (PDF): Denoted by \( f(x) \), representing the function that describes the probability distribution of a continuous random variable.
- Sample Mean: Often represented by \( \bar{x} \) or \( \hat{\mu} \), representing the average value of a sample.
- Sample Variance: Denoted by \( s^2 \), representing the measure of variability within a sample.
- Sample Standard Deviation: Represented by \( s \), representing the square root of the sample variance.
- Population Mean: Typically denoted by \( \mu \), representing the average value of a population.
- Population Variance: Denoted by \( \sigma^2 \), representing the measure of variability within a population.
- Population Standard Deviation: Represented by \( \sigma \), representing the square root of the population variance.
- Estimators: Denoted by Greek letters with hats (e.g., \( \hat{\mu} \) or \( \hat{\sigma} \)), representing estimates of population parameters based on sample data.
While there may be some overlap in notations between probability theory and statistics, understanding the context in which the notation is used is important for proper interpretation. Probability theory focuses on modeling uncertain events and their likelihood, while statistics involves the analysis of data to make inferences about populations based on sample information.