Edoardo Ottavianelli

Software Developer

Cybersecurity Student at Sapienza University. Passionate about Computing, Nature and the whole sphere of Science.



[Statistics] Lesson 4

Author: Edoardo Ottavianelli
31/10/2020

Research on theory (R)

10_R) Explain a unified conceptual framework to obtain all the most common measures of central tendency using the concept of distance (or "premetric" in general).

In Statistics we have many methods for extracting information about a population, or about just a sample of it. Given a set of items, say the (10, in this example :) ) students in the Cybersecurity course (as always), we can apply these functions to learn something about them. A measure of central tendency tries to describe the distribution we're analyzing in a meaningful way: the goal is to extract a few values that are representative of the larger set of objects. Central tendency is also called central location, and its measures are among the most common summary statistics. The mean, the median and the mode are the most common of them, and the mean (also called the average) is probably the best known. In previous lessons we saw several ways to compute the arithmetic mean (there are other kinds of mean too, such as the geometric mean), which is simply the average value of the set. The most common way to compute it is the naive one (sum all the values, then divide by how many there are), but we've seen that with floating-point numbers this can accumulate rounding errors, and overflow on very large sums. More accurate ways to compute the mean use Kahan summation or Knuth's running-mean algorithm. So, to take a general visual example, let's measure the students.

Student      1    2    3    4    5    6    7    8    9    10
Height (cm)  160  161  163  158  160  157  169  170  200  210
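To make the difference between the naive computation and the more careful ones concrete, here is a minimal Python sketch applied to the heights in the table above. It is not the code from the previous lessons, and the function names are only illustrative: `naive_mean` sums and divides, `kahan_mean` uses Kahan's compensated summation, and `running_mean` uses Knuth's one-value-at-a-time update.

```python
def naive_mean(values):
    """Sum everything, then divide by the count."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


def kahan_mean(values):
    """Mean via Kahan summation: a second variable carries the rounding error."""
    total = 0.0
    compensation = 0.0  # low-order bits lost by the previous additions
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total / len(values)


def running_mean(values):
    """Knuth's online update: mean_k = mean_(k-1) + (x_k - mean_(k-1)) / k."""
    mean = 0.0
    for k, v in enumerate(values, start=1):
        mean += (v - mean) / k
    return mean


heights = [160, 161, 163, 158, 160, 157, 169, 170, 200, 210]
for f in (naive_mean, kahan_mean, running_mean):
    print(f.__name__, round(f(heights), 6))
```

On ten small integers all three agree; the compensated and running versions only start to matter on long series or on values of very different magnitude.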

Let's visualize the distribution of the students' height.

We can immediately see that there are two outliers in this sample. Eight students have an arithmetic mean height of 162.25 cm, and the other two have a mean of 205 cm. The mean of the whole sample is 170.8 cm, which is taller than eight of the ten students. So the mean isn't a good value for this distribution, and we have to look at other measures of central tendency. One of them is the median: the value in the middle of a sorted list of data. If we sort our students' heights, we get this:
157 158 160 160 161 163 169 170 200 210

Now we can see there isn't a single value in the middle: with 10 items there are two middle values, 161 cm and 163 cm, with 4 values less than or equal to them on the left and 4 values greater than or equal to them on the right. By convention the median of an even-sized sample is the average of the two middle values, so here it is 162 cm. This describes the eight shorter students well (their mean is 162.25 cm), but it says nothing about the two outliers. Another measure is the mode, the most frequent value in the distribution. As we can see in the previous chart (histogram), the mode is 160 cm, because it is the only value that appears twice.
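The three measures discussed so far can be checked quickly with Python's standard `statistics` module. This is only a small sketch on the ten heights; the cut-off of 200 cm used to set the two outliers aside is an assumption made for illustration.

```python
import statistics

heights = [160, 161, 163, 158, 160, 157, 169, 170, 200, 210]

mean_height = statistics.mean(heights)      # pulled up by the two outliers
median_height = statistics.median(heights)  # average of the two middle values
mode_height = statistics.mode(heights)      # most frequent value

# Same measures without the two tallest students (illustrative cut-off).
core = [h for h in heights if h < 200]
core_mean = statistics.mean(core)
core_median = statistics.median(core)

print(mean_height, median_height, mode_height)
print(core_mean, core_median)
```

Removing the two outliers moves the mean from 170.8 cm to 162.25 cm, while the median only moves from 162 cm to 160.5 cm: the median is much less sensitive to extreme values.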
But there are other measures that help us understand the data better. They are not measures of central tendency but measures of dispersion: I'm talking about Variance and Standard Deviation. As written on Wikipedia [4]: "In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from their average value. Variance has a central role in statistics, where some ideas that use it include descriptive statistics, statistical inference, hypothesis testing, goodness of fit, and Monte Carlo sampling. Variance is an important tool in the sciences, where statistical analysis of data is common. The variance is the square of the standard deviation, the second central moment of a distribution, and the covariance of the random variable with itself, and it is often represented by sigma^2 or Var(X)." So the variance is computed by summing (x_i - mean)^2 for i = 1 to n and dividing the total by n: sigma^2 = (1/n) * sum_{i=1..n} (x_i - mean)^2. Dividing by n gives the population variance; the sample variance divides by n - 1 instead.
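The formula translates almost word for word into code. The following is a minimal Python sketch on the ten heights, using the divide-by-n (population) form described above; the variable names are illustrative.

```python
import math

heights = [160, 161, 163, 158, 160, 157, 169, 170, 200, 210]

n = len(heights)
mean = sum(heights) / n

# Population variance: average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in heights) / n

# Standard deviation: square root of the variance, back in centimetres.
std_dev = math.sqrt(variance)

print(round(variance, 2), round(std_dev, 4))
```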

The variance of this distribution is σ^2 = 313.76 cm^2. And finally we have the Standard Deviation. From Wikipedia [5]:
"In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range. Standard deviation may be abbreviated SD, and is most commonly represented in mathematical texts and equations by the lower case Greek letter sigma σ, for the population standard deviation, or the Latin letter s, for the sample standard deviation."
The standard deviation, or sigma, is obtained by taking the square root of the variance sigma^2, so in our case it is 17.713271860388 cm (about 17.71 cm).
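Coming back to the question about a unified framework based on distance: one standard way to see it (given here as a sketch, not as material taken from the lesson) is that each measure of central tendency is the point c that minimizes some total "distance" between c and the data. Squared distance leads to the mean, absolute distance to the median, and the discrete 0/1 distance (0 if equal, 1 otherwise) to the mode; minimizing the largest distance gives the midrange. The brute-force search below, over a grid of candidate centers chosen only for illustration, shows this on the ten heights.

```python
heights = [160, 161, 163, 158, 160, 157, 169, 170, 200, 210]

# Candidate centers from 150.0 to 220.0 cm in steps of 0.1 cm (illustrative grid).
candidates = [c / 10 for c in range(1500, 2201)]


def best_center(cost):
    """Return the candidate center with the smallest cost."""
    return min(candidates, key=cost)


# Sum of squared distances -> arithmetic mean.
center_squared = best_center(lambda c: sum((x - c) ** 2 for x in heights))

# Sum of absolute distances -> median (any point between the two middle values).
center_absolute = best_center(lambda c: sum(abs(x - c) for x in heights))

# Sum of 0/1 distances -> mode.
center_discrete = best_center(lambda c: sum(1 for x in heights if x != c))

# Largest single distance -> midrange, (min + max) / 2.
center_maximum = best_center(lambda c: max(abs(x - c) for x in heights))

print(center_squared, center_absolute, center_discrete, center_maximum)
```

With an even number of values every point between 161 and 163 gives the same total absolute distance, so the search may return any of them; the conventional median, 162, is the midpoint of that interval.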

11_R) What are the most common types of means known? Find 1 example where these 2 types of means arise naturally: geometric, harmonic.

  • Arithmetic mean
    • In mathematics and statistics, the arithmetic mean, or simply the mean or the average (when the context is clear), is the sum of a collection of numbers divided by the count of numbers in the collection. As we know, when computing it in floating point we prefer Knuth/Kahan/Welford and other less error-prone methods over the naive sum.
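  • Geometric mean
    • The geometric mean of n positive numbers is the n-th root of their product. It is the appropriate average for quantities that combine by multiplication, such as growth rates.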
  • Harmonic mean
    • In mathematics, the harmonic mean (sometimes called the subcontrary mean) is one of several kinds of average, and in particular, one of the Pythagorean means. Typically, it is appropriate for situations when the average of rates is desired. The harmonic mean can be expressed as the reciprocal of the arithmetic mean of the reciprocals of the given set of observations. As a simple example, the harmonic mean of 1, 4, and 4 is 3 / (1/1 + 1/4 + 1/4) = 3 / 1.5 = 2.
  • Weighted arithmetic mean
    • The weighted arithmetic mean is similar to an ordinary arithmetic mean (the most common type of average), except that instead of each of the data points contributing equally to the final average, some data points contribute more than others. The notion of weighted mean plays a role in descriptive statistics and also occurs in a more general form in several other areas of mathematics.
  • Truncated mean
    • A truncated mean or trimmed mean is a statistical measure of central tendency, much like the mean and median. It involves the calculation of the mean after discarding given parts of a probability distribution or sample at the high and low end, and typically discarding an equal amount of both. This number of points to be discarded is usually given as a percentage of the total number of points, but may also be given as a fixed number of points.
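As a sketch of how the means listed above differ in practice, here are direct Python implementations of the geometric, harmonic, weighted and truncated means. The scenarios (yearly growth factors, a round trip at two speeds, grades weighted by credits) and all function names are hypothetical examples chosen for illustration: growth rates multiply, so their natural average is geometric, while speeds over equal distances are rates, so their natural average is harmonic.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)


def weighted_mean(values, weights):
    """Arithmetic mean where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def truncated_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)


# Hypothetical yearly growth factors: +10%, +20%, -10%.
growth = geometric_mean([1.10, 1.20, 0.90])

# Hypothetical round trip: 60 km/h out, 40 km/h back over the same distance.
speed = harmonic_mean([60, 40])

# Hypothetical grades weighted by course credits.
grade = weighted_mean([28, 30], [6, 9])

# Student heights with the two lowest and two highest values discarded.
heights = [160, 161, 163, 158, 160, 157, 169, 170, 200, 210]
trimmed = truncated_mean(heights, 2)

print(round(growth, 4), round(speed, 2), round(grade, 2), round(trimmed, 2))
```

The average speed over the round trip is 48 km/h, not the arithmetic 50 km/h, because more time is spent on the slower leg.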

12_R) Explain the idea underlying the measures of dispersion and the reasons of their importance.

Measures of dispersion are important because they show the margin of error we face when we have a sample of a larger population and want to make inferences about it. They play a role in any dataset: they go along with the measures of central tendency and show us the variability of our data. Measures of central tendency, as we have seen, summarize the data with one or a few values to describe the population/sample. As the previous points showed, especially the practical example of the Cybersecurity class, measures of central tendency (mean, median, mode, etc.) cannot on their own describe the data in a realistic way. So they have to be paired with measures of dispersion (range, average deviation, variance, standard deviation...).
The greater the dispersion in a sample, the wider the margin of error you have to allow for. In other words, the greater the dispersion, the less representative your central tendency is (as we can see in the practical example).
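To see this on the practical example, the sketch below computes the four measures of dispersion named above for the ten heights, once for the whole class and once without the two tallest students. The helper name `dispersion` and the 200 cm cut-off are assumptions made for illustration, and the variance is the divide-by-n (population) form.

```python
import math


def dispersion(values):
    """Range, mean absolute deviation, population variance and standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return {
        "range": max(values) - min(values),
        "mean_abs_dev": sum(abs(x - mean) for x in values) / n,
        "variance": variance,
        "std_dev": math.sqrt(variance),
    }


heights = [160, 161, 163, 158, 160, 157, 169, 170, 200, 210]

all_students = dispersion(heights)
without_outliers = dispersion([h for h in heights if h < 200])

for name, result in (("all", all_students), ("without outliers", without_outliers)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Without the two outliers the range drops from 53 cm to 13 cm and the variance from 313.76 to 20.4375 cm^2, which is why the mean of the full class is so much less representative than the mean of the eight shorter students.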


13_R) Find out all the most important properties of the linear regression.

"In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables)."[9]
To improve my knowledge of linear regression I watched this video, and I recommend it to all the students reading this blog post.


When we say that a regression model is good or not, we are actually comparing it to another model. Let's assume you are a small restaurant owner, or a very business-minded server/waiter at a nice restaurant. In the United States, tips are a very important part of a waiter's pay, and most of the time the dollar amount of the tip is related to the dollar amount of the total bill: a 14-dollar bill will bring a small tip, while a 500-dollar bill will bring a larger one. As the waiter or owner, you would like a model that lets you predict what tip to expect for any given bill amount. So one evening you collect data for six meals. Unfortunately you forget to record the bill amounts and collect only the tips. How can you predict future tips? The only summaries available are the mean, the mode and the median. With only one variable, the prediction of future values can be based only on the previously collected values, and the best single guess is their mean. But if you also start collecting the bill amounts, you can relate the bills to the tips. With this model (called linear regression) the goal is to reduce the error between the prediction and the actual value. A visual idea of linear regression can be found below. With only one variable (the tip amounts) we can place a horizontal line at the value 10 and say that is the prediction (assuming 10 is the average). With linear regression, instead, as you can see, the predicted value adapts to the bill amount.
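The comparison between the two models can be sketched in a few lines of Python. The six bill and tip amounts below are hypothetical values invented for illustration (chosen so that the mean tip is 10), not data from the post. The sketch fits the least-squares line, compares its sum of squared errors with that of the mean-only model, and checks two well-known properties of the fitted line: its residuals sum to zero, and it passes through the point (mean bill, mean tip).

```python
# Hypothetical data for six meals: bill and tip amounts in dollars.
bills = [34, 108, 64, 88, 99, 51]
tips = [5, 17, 11, 8, 14, 5]

n = len(tips)
mean_bill = sum(bills) / n
mean_tip = sum(tips) / n

# Least-squares slope and intercept for: tip = intercept + slope * bill.
s_xy = sum((x - mean_bill) * (y - mean_tip) for x, y in zip(bills, tips))
s_xx = sum((x - mean_bill) ** 2 for x in bills)
slope = s_xy / s_xx
intercept = mean_tip - slope * mean_bill


def predict(bill):
    """Tip predicted by the fitted line for a given bill amount."""
    return intercept + slope * bill


# Sum of squared errors: mean-only model versus fitted line.
sse_mean = sum((y - mean_tip) ** 2 for y in tips)
sse_line = sum((y - predict(x)) ** 2 for x, y in zip(bills, tips))

# Property check: the residuals of the fitted line sum to (numerically) zero.
residual_sum = sum(y - predict(x) for x, y in zip(bills, tips))

print(round(slope, 4), round(intercept, 4))
print(sse_mean, round(sse_line, 3), round(residual_sum, 9))
```

On these made-up numbers the mean-only model has a squared error of 120, while the fitted line brings it down to about 30, which is the sense in which the regression model is "better" than the horizontal line at the average.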

Linear regression has many practical uses. Most applications fall into one of the following two broad categories:
  • If the goal is prediction, forecasting, or error reduction, linear regression can be used to fit a predictive model to an observed data set of values of the response and explanatory variables. After developing such a model, if additional values of the explanatory variables are collected without an accompanying response value, the fitted model can be used to make a prediction of the response.
  • If the goal is to explain variation in the response variable that can be attributed to variation in the explanatory variables, linear regression analysis can be applied to quantify the strength of the relationship between the response and the explanatory variables, and in particular to determine whether some explanatory variables may have no linear relationship with the response at all, or to identify which subsets of explanatory variables may contain redundant information about the response.


Applications / Practice (A)

9_A) Prepare separately the following charts: 1) Scatterplot, 2) Histogram/Column chart [in the histogram, within each class interval, also draw a vertical colored line where the true mean of the observations falling in that class lies] and 3) Contingency table, using the Graphics object and the DrawString(), MeasureString(), DrawLine(), etc. methods. When done, merge these charts into your previous application 7_A. Use them to represent 2 numerical variables that you select from a CSV file. In particular, in the same picture box, you will make 2 separate charts: 1 rectangle (chart) will contain the contingency table, and 1 rectangle (chart) will contain the scatterplot, with the histograms/column charts and rug plots drawn near the respective axes.

OPT 10_A. Implement your own algorithm to compute a frequency distribution of the words from any text (possibly judiciously scraped from websites) and draw some personal graphical representation of the "word cloud".


Research on applications (RA)

7_RA) Research the real-world window to viewport transformation.

Window to viewport transformation is the process of transforming 2D world-coordinate objects to device coordinates. Objects inside the world (or clipping) window are mapped to the viewport, which is the area on the screen where world coordinates are mapped to be displayed.


General Terms:
  • World coordinate – It is the Cartesian coordinate system with respect to which we define the diagram, like Xwmin, Xwmax, Ywmin, Ywmax.
  • Device Coordinate – It is the screen coordinate system where the objects are to be displayed, like Xvmin, Xvmax, Yvmin, Yvmax.
  • Window – It is the area on world coordinate selected for display.
  • ViewPort – It is the area on device coordinate where graphics is to be displayed.
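Taking (Xw, Yw) as a point in the window, we have to find the corresponding point (Xv, Yv) in the viewport:
Xv = Xvmin + (Xw - Xwmin) * sx
Yv = Yvmin + (Yw - Ywmin) * sy
where sx = (Xvmax - Xvmin) / (Xwmax - Xwmin) is the scaling factor of the x coordinate and sy = (Yvmax - Yvmin) / (Ywmax - Ywmin) is the scaling factor of the y coordinate.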

On the same resource I also found a good program that computes the window to viewport transformation [10]:
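    // C# program to implement
    // Window to ViewPort Transformation
    using System;

    class GFG
    {
        // Maps the window point (x_w, y_w) to viewport coordinates and prints it.
        static void WindowtoViewport(int x_w, int y_w,
                                     int x_wmax, int y_wmax,
                                     int x_wmin, int y_wmin,
                                     int x_vmax, int y_vmax,
                                     int x_vmin, int y_vmin)
        {
            // Scaling factors: viewport size divided by window size on each axis.
            // The (float) cast forces floating-point division.
            float sx = (float)(x_vmax - x_vmin) / (x_wmax - x_wmin);
            float sy = (float)(y_vmax - y_vmin) / (y_wmax - y_wmin);

            // Offset from the window origin, scaled, then shifted to the
            // viewport origin. The (int) cast truncates toward zero.
            int x_v = (int)(x_vmin + (x_w - x_wmin) * sx);
            int y_v = (int)(y_vmin + (y_w - y_wmin) * sy);

            Console.WriteLine("The point on viewport: ({0}, {1})", x_v, y_v);
        }

        // Entry point added so the listing runs on its own;
        // the sample values below are illustrative assumptions.
        static void Main()
        {
            // Window (20, 40)-(80, 80), viewport (30, 40)-(60, 60), point (30, 80).
            WindowtoViewport(30, 80, 80, 80, 20, 40, 60, 60, 30, 40);
            // Prints: The point on viewport: (35, 60)
        }
    }
    // This code is contributed by PrinciRaj1992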
    


OPT 8_RA) Do a research with examples about how matrices and homogeneous coordinates can be useful for graphics transformations and charts.


References

[1] Laerd Statistics - Measures of Central Tendency
[2] Wikipedia - Summary Statistics
[3] Wikipedia - Central Tendency
[4] Wikipedia - Variance
[5] Wikipedia - Standard Deviation
[6] Wikipedia - Geometric Mean
[7] Wikipedia - Harmonic Mean
[8] Wikipedia - Weighted Arithmetic Mean
[9] Wikipedia - Linear Regression
[10] GeeksforGeeks - Window to Viewport Transformation in Computer Graphics with Implementation