Can someone give me a basic/general statistics overview for standard operations?
hope this helps https://controls.engin.umich.edu/wiki/index.php/Basic_statistics:_mean,_median,_average,_standard_deviation,_z-scores,_and_p-value
When performing statistical analysis on a set of data, the mean, median, mode, and standard deviation are all helpful values to calculate. The mean, median, and mode are all estimates of where the "middle" of a set of data is; these values are useful when creating groups or bins to organize larger sets of data. The standard deviation is the average distance between the actual data and the mean.

Mean

The mean (also known as the average) is obtained by dividing the sum of observed values by the number of observations, n:

x̄ = (x₁ + x₂ + … + xₙ) / n   (1)

Although data points fall above, below, or on the mean, it can be considered a good estimate for predicting subsequent data points. The Excel syntax for the mean is AVERAGE(starting cell:ending cell).

Median

The median is the middle value of a set of data containing an odd number of values, or the average of the two middle values of a set of data with an even number of values. The median is especially helpful when separating data into two equal-sized bins. The Excel syntax is MEDIAN(starting cell:ending cell).

Mode

The mode of a set of data is the value that occurs most frequently. The Excel syntax is MODE(starting cell:ending cell).

Now that we've discussed several ways to describe a data set, you might be wondering when to use each one. If all the data points are relatively close together, the mean gives you a good idea of what the points are closest to. If, on the other hand, almost all the points fall close to one value (or a group of close values) but an occasional value differs greatly, then the mode may describe the system more accurately, whereas the mean would incorporate the occasional outlier. The median is useful if you are interested in the range of values your system could be operating in.
Half the values should be above and half below, so you have an idea of where the middle operating point is.

Standard Deviation

The standard deviation gives an idea of how close the entire set of data is to the average value. Data sets with a small standard deviation have tightly grouped, precise data; data sets with large standard deviations have data spread out over a wide range of values. The formula for the sample standard deviation is:

s = √( Σ(xᵢ − x̄)² / (n − 1) )   (3)

The Excel syntax for the standard deviation is STDEV(starting cell:ending cell).

Sampling Distributions

Population parameters follow all types of distributions: some are normal, others are skewed like the F-distribution, and some don't even have defined moments (mean, variance, etc.), like the Cauchy distribution. However, many statistical methodologies, like the z-test (discussed later in this article), are based on the normal distribution. How does this work, given that most sample data are not normally distributed? This highlights a common misunderstanding among those new to statistical inference: the distribution of the population parameter of interest and the sampling distribution are not the same thing.

What is a sampling distribution? Imagine an engineer estimating the mean weight of widgets produced in a large batch. The engineer measures the weight of N widgets and calculates the mean; so far, one sample has been taken. The engineer then takes another sample, and another, and continues until a very large number of samples, and thus a large number of sample mean weights, have been gathered (assume for simplicity that the batch of widgets being sampled from is near infinite). The engineer has now generated a sampling distribution. As the name suggests, a sampling distribution is simply the distribution of a particular statistic (calculated for samples of a set size) for a particular population. In this example, the statistic is the mean widget weight and the sample size is N.
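The descriptive statistics discussed above can be sketched with Python's standard-library statistics module, paralleling the Excel AVERAGE, MEDIAN, MODE, and STDEV functions (the data values are made up for illustration):

```python
# Descriptive statistics for a small, made-up data set using the
# standard-library statistics module.
import statistics

data = [4.0, 5.0, 5.0, 6.0, 7.0, 9.0]

mean = statistics.mean(data)      # sum of values / number of observations
median = statistics.median(data)  # middle value (average of middle two here)
mode = statistics.mode(data)      # most frequently occurring value
stdev = statistics.stdev(data)    # sample standard deviation (n - 1 denominator)

print(mean, median, mode, round(stdev, 4))  # 6.0 5.5 5.0 1.7889
```

Note that statistics.stdev uses the n − 1 (sample) denominator, matching Excel's STDEV; statistics.pstdev gives the population version.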
If the engineer were to plot a histogram of the mean widget weights, he or she would see a bell-shaped distribution. This is because the Central Limit Theorem guarantees that as the sample size approaches infinity, the sampling distribution of a statistic calculated from those samples approaches the normal distribution. Conveniently, there is a relationship between the population standard deviation (σ) and the standard deviation of the sampling distribution (σ_x̄, also known as the standard deviation of the mean, or standard error):

σ_x̄ = σ / √N   (5)

The linear correlation coefficient is a test that can be used to see if there is a linear relationship between two variables. For example, it is useful if a linear equation is compared to experimental points.

P-values

A p-value is a statistical value that details how much evidence there is against the most common explanation for the data set. It is the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true. (Note that this is not the same as the probability that the null hypothesis itself is true; see the note on misunderstandings below.) In chemical engineering, the p-value is often used to analyze marginal conditions of a system.

The null hypothesis is considered to be the most plausible scenario that can explain a set of data. The most common null hypothesis is that the data are completely random, i.e. that there is no relationship between two system results. The null hypothesis is always assumed to be true unless proven otherwise. An alternative hypothesis predicts the opposite of the null hypothesis and is said to be supported if the null hypothesis is shown to be false. The following is an example of these two hypotheses:

Four students who sat at the same table during an exam all got perfect scores.

Null hypothesis: The lack of score deviation happened by chance.
Alternative hypothesis: There is some other reason that they all received the same score.
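The widget example above can be checked with a quick simulation: draw many samples of size N from a population, compute each sample's mean, and compare the spread of those means with σ/√N. The population mean, σ, and sample sizes here are arbitrary values chosen for illustration:

```python
# Simulate a sampling distribution of the mean and verify that its
# standard deviation is approximately sigma / sqrt(N).
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

N = 25                 # widgets per sample
num_samples = 2000     # number of samples taken
pop_mean, pop_sigma = 100.0, 5.0  # hypothetical widget weights (grams)

sample_means = [
    statistics.mean(random.gauss(pop_mean, pop_sigma) for _ in range(N))
    for _ in range(num_samples)
]

se_observed = statistics.stdev(sample_means)
se_predicted = pop_sigma / N ** 0.5  # sigma_xbar = sigma / sqrt(N)

print(se_observed, se_predicted)  # the two should be close (~1.0)
```

A histogram of sample_means would show the bell shape the Central Limit Theorem predicts, even though any single widget's weight need not be normally distributed.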
If the null hypothesis cannot be rejected, then the Honor Council will not need to be involved. However, if the evidence favors the alternative hypothesis, then more studies will need to be done in order to support it and learn more about the situation.

As mentioned previously, the p-value can be used to analyze marginal conditions. In this case, the null hypothesis is that there is no relationship between the variables controlling the data set. For example:

- Runny feed has no impact on product quality
- Points on a control chart are all drawn from the same distribution
- Two shipments of feed are statistically the same

A p-value is said to be significant if it is less than the level of significance, commonly 5%, 1%, or 0.1%, depending on how accurate the data must be or how stringent the standards are. For example, a health care company may use a lower level of significance because it has strict standards. If the p-value is significant (less than the specified level of significance), the null hypothesis is rejected and more tests should be done to investigate the alternative hypothesis.

Upon finding the p-value and deciding either to reject the null hypothesis or to fail to reject it, there is also the possibility that the wrong decision has been made. If the decision is to reject the null hypothesis when in fact the null hypothesis is true, a type 1 error has occurred. The probability of a type 1 error is the same as the level of significance, so if the level of significance is 5%, the probability of a type 1 error is 0.05, or 5%. If the decision is to fail to reject the null hypothesis when in fact the alternative hypothesis is true, a type 2 error has occurred.
With respect to the type 2 error: if the alternative hypothesis is really true, another probability that is important to researchers is that of actually detecting this and rejecting the null hypothesis. This probability is known as the power of the test, and it is defined as 1 minus the probability of making a type 2 error. In the earlier example of testing whether there is a relationship between the variables controlling the data set, either a type 1 or a type 2 error could lead to a great deal of wasted product, or even a wildly out-of-control process. Therefore, when designing the parameters for hypothesis testing, researchers must weigh their options for the level of significance and the power of the test carefully; the process, the product, and the standards for the product can all be sensitive to even small errors.

Important Note About Significant P-values

If a p-value is greater than the applied level of significance, the null hypothesis should not just be blindly accepted. Other tests should be performed in order to determine the true relationship between the variables being tested. More information on this and other misunderstandings related to p-values can be found at P-values: Frequent misunderstandings.

Calculation

There are two ways to calculate a p-value. The first method is used when the z-score has been calculated; the second, Fisher's exact method, is used when analyzing marginal conditions.

First Method: Z-Score

The method for finding the p-value is actually rather simple: first calculate the z-score, then look up its corresponding p-value using the standard normal table. This table can be found here: Media:Group_G_Z-Table.xls. This value represents the likelihood that the results reflect an actual difference between the data sets rather than random error.
To read the standard normal table, first find the row corresponding to the leading significant digits of the z-value in the column on the left-hand side of the table. After locating the appropriate row, move to the column that matches the next significant digit.

Example: if your z-score is 1.13, find the row for 1.1 and the column for 0.03. In a cumulative table, the entry is Φ(1.13) ≈ 0.8708, so the one-tailed p-value is 1 − 0.8708 = 0.1292.

Chi-Squared Test versus Fisher's Exact Test

- For small sample sizes, the chi-squared test will not always produce an accurate probability. Fisher's exact test, as its name suggests, always gives an exact result.
- The chi-squared test will not be accurate when:
  1. fewer than 20 samples are being used, or
  2. an expected count is 5 or below and there are between 20 and 40 samples.
- For large contingency tables and expected distributions that are not random, the p-value from Fisher's exact test can be difficult to compute, and the chi-squared test will be easier to carry out.

Binning in Chi-Squared and Fisher's Exact Tests

When performing various statistical analyses you will find that chi-squared and Fisher's exact tests may require binning, whereas ANOVA does not. Although there is no optimal choice for the number of bins (k), there are several formulas which can be used to calculate this number based on the sample size (N). One such example is:

k = 1 + log₂(N)

Another method involves grouping the data into intervals of equal probability or equal width. The first approach, in which the data are grouped into intervals of equal probability, is generally preferable since it handles peaked data much better. As a stipulation, each bin should contain at least 5 data points, so adjacent bins sometimes need to be joined together for this condition to be satisfied. Identifying the number of bins to use is important, but it is even more important to recognize which situations call for binning. Some chi-squared and Fisher's exact situations are listed below:

- Analysis of a continuous variable: this situation requires binning.
The idea is to divide the range of values of the variable into smaller intervals called bins.
- Analysis of a discrete variable: binning is unnecessary in this situation. For instance, a coin toss results in two possible outcomes: heads or tails. In tossing ten coins, you can simply count the number of times you received each possible outcome. This approach is equivalent to choosing two bins, each containing one possible result.

Examples of when to bin, and when not to bin:

- You have twenty measurements of the temperature inside a reactor: since temperature is a continuous variable, you should bin in this case. One approach might be to determine the mean (X) and the standard deviation (σ) and group the temperature data into four bins: T < X − σ, X − σ ≤ T < X, X ≤ T < X + σ, T ≥ X + σ.
- You have twenty data points of the heater setting of the reactor (high, medium, low): since the heater setting is discrete, you should not bin in this case.

Comparison and Interpretation of the P-value at the 95% Confidence Level

The resulting p-value is very close to zero, which is much less than 0.05. Therefore, the number of students getting sick in the dormitory is significantly higher than the number of students getting sick off campus; there is more than a 95% chance that this difference is not due to chance. Statistically, this shows that the dormitory is more conducive to the spreading of viruses. With the knowledge gained from this analysis, making changes to the dormitory may be justified. Perhaps installing sanitizer dispensers at common locations throughout the dormitory would lower the higher prevalence of illness among dormitory students. Further research may identify more specific areas of viral spreading by marking off several smaller populations of students living in different areas of the dormitory.
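The binning rules above (the k = 1 + log₂(N) rule of thumb and the four mean ± σ temperature bins) can be sketched as follows; the reactor temperatures are hypothetical values invented for illustration:

```python
# Sketch of the binning guidance: a rule-of-thumb bin count, plus the
# four-bin mean +/- sigma scheme from the reactor-temperature example.
import math
import statistics

temps = [348, 351, 350, 352, 349, 355, 347, 353, 350, 351,
         349, 352, 348, 354, 350, 351, 349, 353, 352, 350]

# Rule of thumb for the number of bins: k = 1 + log2(N)
k = 1 + math.log2(len(temps))  # N = 20 -> k ~ 5.32, round as needed

# Four-bin scheme: T < X - s, X - s <= T < X, X <= T < X + s, T >= X + s
X = statistics.mean(temps)
s = statistics.stdev(temps)
bins = [0, 0, 0, 0]
for t in temps:
    if t < X - s:
        bins[0] += 1
    elif t < X:
        bins[1] += 1
    elif t < X + s:
        bins[2] += 1
    else:
        bins[3] += 1

print(round(k, 2), bins)  # every measurement lands in exactly one bin
```

If any of the four bins ended up with fewer than 5 points, adjacent bins would be merged, per the stipulation above.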
This model of significance testing is very useful and is often applied to a multitude of data sets to determine whether discrepancies are due to chance or to actual differences between the compared samples. As you can see, purely mathematical analyses such as these often lead to physical action being taken, which is necessary in fields such as medicine and engineering, as well as other scientific and non-scientific venues.
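The first p-value method described earlier (z-score, then a standard normal table lookup) can be reproduced without a printed table, since the standard normal CDF can be written with the error function, Φ(z) = ½(1 + erf(z/√2)). A sketch using the z = 1.13 example:

```python
# Convert a z-score to a p-value using the standard normal CDF
# instead of a table lookup.
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = 1.13
p_one_tailed = 1.0 - normal_cdf(z)  # P(Z >= 1.13)
p_two_tailed = 2.0 * p_one_tailed   # P(|Z| >= 1.13)

print(round(p_one_tailed, 4))  # 0.1292, matching the table entry 0.8708
```

This matches the table-reading example: the row-1.1, column-0.03 entry of a cumulative standard normal table is about 0.8708, and 1 − 0.8708 = 0.1292.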
NOTE: Linear Regression

The correlation coefficient is used to determine whether or not there is a correlation within your data set. Once a correlation has been established, the actual relationship can be determined by carrying out a linear regression. The first step in performing a linear regression is calculating the slope and intercept of the least-squares line:

slope: a₁ = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)
intercept: a₀ = ȳ − a₁ x̄
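A minimal sketch of the least-squares slope, intercept, and linear correlation coefficient, using only the standard library (the data points are made up, chosen to lie exactly on y = 2x + 1):

```python
# Ordinary least-squares fit and linear correlation coefficient r.
import math

def linear_regression(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # linear correlation coefficient
    return slope, intercept, r

# Made-up data lying exactly on y = 2x + 1:
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept, r = linear_regression(x, y)
print(slope, intercept, r)  # 2.0 1.0 1.0
```

With real, noisy data, r would fall between −1 and 1, and a value of |r| near 1 would justify fitting the line in the first place.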
@DatChinookGuy Source: https://controls.engin.umich.edu/wiki/index.php/Basic_statistics:_mean,_median,_average,_standard_deviation,_z-scores,_and_p-value Phishing isn't cool, man...
And @ireallylikefood already posted that link...
um ok???
Wasn't talking to you... @theopenstudyowl
lol
@DatChinookGuy Why did you feel the need to copy/paste most of the stuff from the link @ireallylikefood posted? I mean, seriously...?
just be nice @geerky42
@geerky42 actually, I looked it up and wrote my own paragraphs on the topic. I just found out that the same website @ireallylikefood posted is where I was getting the important pieces from. Thank you. Seriously.
And then you thought you should copy/paste a bunch of text here? What you posted is harder to read than that website, especially since the website has pictures. Care to tell me where the logic is in copying and pasting what you "wrote" from this website, or what you are trying to imply here?
woo haha
@geerky42 either support the question or don't bother even showing up on this question. Thank you. I bet you have also copied and pasted a bunch of text from a website into a question before.
You just went and unnecessarily copied nearly the entire website. I don't see what made you think this was a good move...? That's what puzzled me...
Why not just leave @ireallylikefood post alone there?
@geerky42 some people have restricted website access, and he thought it would be nice to post what was said
OK, then just take a screenshot of the website and post it here. Copy/pasted text is pretty hard to read. Not to mention that there are a lot of missing images and equations/expressions, which make the copy/pasted text pretty much useless...
I would be very surprised if @theopenstudyowl actually read @DatChinookGuy's posts and somehow found them helpful.
"either support the question or don't bother even showing up on this question." Since when did I stop supporting the question? I was simply put off by your action... @DatChinookGuy