Edoardo Ottavianelli

Software Developer

Cybersecurity Student at Sapienza University. Passionate about Computing, Nature and the whole sphere of Science.

[Statistics] Lesson 3

Author: Edoardo Ottavianelli
26/10/2020

Research about theory (R)

7_R) Explain what are marginal, joint and conditional distributions and how we can explain the Bayes theorem using relative frequencies.

To understand these three types of distribution (marginal, joint and conditional), I'm going to use a contingency table, referring in particular to a bivariate distribution. As we saw in the first lesson, a univariate distribution considers only one characteristic of the analyzed population/sample (for instance, the age of the Statistics course students). A bivariate distribution considers two characteristics of the population/sample; more generally, a multivariate distribution considers n characteristics, with n >= 2. Let's take an example. We take our Statistics class and build a bivariate distribution for it (fake data, just an example). We have 300 students (this is our population) and we want to show the distribution by age and sex of the students.

A\S     M     F
0-23    43    28
24-26   77    36
27-30   39    20
31-37   18    19
38+     8     12

The labels tell us how to read the table: the first column divides the whole population into age classes (A), while the first row divides it by sex (S). Each of the other cells holds the value n(i,j), the number of students that fall in both the i-th row and the j-th column. For example, we can read that there are 77 male students aged 24-26, and so on.

A\S     M     F     d(A)
0-23    43    28    71
24-26   77    36    113
27-30   39    20    59
31-37   18    19    37
38+     8     12    20
d(S)    185   115   300

Now I have added a column and a row, specifically the last column and the last row. With a little arithmetic you can see that each cell of the last column contains the sum of the two values on its row, while each cell of the last row contains the sum of the values in the column above it. So the last column contains the univariate distribution of the age variable, and the last row contains the univariate distribution of the sex variable. The last cell of the last row (300) is the total number of items in the population. This view of a bivariate distribution is called a contingency table.
In this view the last row and the last column represent the two marginal distributions [1], one for sex and one for age; they're called marginal because we find them on the margins of the table. A conditional distribution, instead, is made of all the cells in one specified column (or row). For example, I can consider only the column under the M symbol, taking into account only the male students. This means my population is now considerably smaller (185 students instead of 300). We call it "conditional" because the distribution is conditioned on the fact that we consider only the male students. I can do the same on the rows, isolating for instance the row 24-26: here my population is smaller again, only 113 students, of which 77 are male and 36 female. The last definition is the joint distribution. It is made of the values in the inner cells, where each value counts the items that are in the i-th row and in the j-th column at the same time. For example, we know there are 43 male students aged 0-23. It is called "joint" because it joins the data about two variables [2].
Now, let's consider only the conditional distribution of age given that we take into account only the female students. We get the relative frequencies by dividing all the values in that column by 115, the total of our conditioned population. Reading from top to bottom we have: 28/115, 36/115, 20/115, 19/115 and 12/115. Now, I can express 36/115 as (36/300) x (300/115). The first factor is just the joint relative frequency, that is, the share of all students who are female and 24-26 years old. I can rewrite the expression as 36/115 = (36/300) / (115/300), where the denominator is just the marginal relative frequency of the female students. This is a very interesting relationship, and obviously I can do the same for the rows and obtain the same relationship (with different values, of course).
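The table operations described so far (row and column totals, and relative frequencies inside one column) can be sketched in a few lines of C#. This is only an illustration under my own assumptions: the class name ContingencyTable and its method names are hypothetical, and the counts are the fake age/sex data from the table above.

```csharp
using System;

public static class ContingencyTable
{
    // Marginal distribution of the row variable: the total of each row.
    public static int[] RowTotals(int[,] counts)
    {
        int[] totals = new int[counts.GetLength(0)];
        for (int i = 0; i < counts.GetLength(0); i++)
            for (int j = 0; j < counts.GetLength(1); j++)
                totals[i] += counts[i, j];
        return totals;
    }

    // Marginal distribution of the column variable: the total of each column.
    public static int[] ColumnTotals(int[,] counts)
    {
        int[] totals = new int[counts.GetLength(1)];
        for (int i = 0; i < counts.GetLength(0); i++)
            for (int j = 0; j < counts.GetLength(1); j++)
                totals[j] += counts[i, j];
        return totals;
    }

    // Conditional relative frequencies of the rows, given column j.
    public static double[] ConditionalOnColumn(int[,] counts, int j)
    {
        int columnTotal = ColumnTotals(counts)[j];
        double[] frequencies = new double[counts.GetLength(0)];
        for (int i = 0; i < frequencies.Length; i++)
            frequencies[i] = (double)counts[i, j] / columnTotal;
        return frequencies;
    }

    public static void Main()
    {
        // Rows: age classes 0-23, 24-26, 27-30, 31-37, 38+. Columns: M, F.
        int[,] counts = { { 43, 28 }, { 77, 36 }, { 39, 20 }, { 18, 19 }, { 8, 12 } };
        Console.WriteLine(string.Join(", ", RowTotals(counts)));    // 71, 113, 59, 37, 20
        Console.WriteLine(string.Join(", ", ColumnTotals(counts))); // 185, 115

        // Age distribution of the female students only (column index 1).
        double[] givenFemale = ConditionalOnColumn(counts, 1);
        // Same value computed as joint relative frequency / marginal relative frequency.
        double viaJoint = (36.0 / 300.0) / (115.0 / 300.0);
        Console.WriteLine(Math.Abs(givenFemale[1] - viaJoint) < 1e-12); // True
    }
}
```

The last line compares the two ways of getting 36/115 with a small tolerance, because floating point division is not guaranteed to give bit-identical results.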
Now, let's summarize in general terms. Let n be the size of the population, n(i,j) the joint frequency of the i-th row and the j-th column, n(i) the total of the i-th row and n(j) the total of the j-th column. We can say that n(i,j) / n(j) = ( n(i,j) / n ) / ( n(j) / n ). On the right-hand side, the numerator is the joint relative frequency of the cell (i,j) and the denominator is the marginal relative frequency of the j-th column. The left-hand side is the relative frequency of the i-th row conditioned on the fact that we take into account only the population of the j-th column.
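Writing the same relationship for the rows, n(i,j) / n(i) = ( n(i,j) / n ) / ( n(i) / n ), and eliminating the joint term n(i,j) / n between the two gives n(i,j) / n(j) = ( n(i,j) / n(i) ) x ( n(i) / n ) / ( n(j) / n ). This relationship is called Bayes' theorem [3].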
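As a quick numeric check, here is the same relationship written with relative frequencies, using the 24-26 row and the F column of the table. The symbols f(i|j), f(j|i), f(i) and f(j) are shorthand introduced here for the conditional and marginal relative frequencies; they are not part of the notation used in the lesson.

```latex
\begin{align*}
f(i \mid j) &= \frac{n(i,j)}{n(j)}, \qquad
f(j \mid i) = \frac{n(i,j)}{n(i)}, \qquad
f(i) = \frac{n(i)}{n}, \qquad
f(j) = \frac{n(j)}{n} \\[4pt]
f(i \mid j) &= \frac{f(j \mid i)\, f(i)}{f(j)} \\[4pt]
\frac{36}{115} &= \frac{\dfrac{36}{113} \cdot \dfrac{113}{300}}{\dfrac{115}{300}}
\end{align*}
```

In words: the share of 24-26 year olds among the female students can be recovered from the share of female students among the 24-26 year olds, once the two marginal relative frequencies are known.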

8_R) Explain the concept of statistical independence and why, in case of independence, the relative joint frequencies are equal to the products of the corresponding marginal frequencies.

Independence is a fundamental notion in probability theory, as in statistics and the theory of stochastic processes. Two events are independent, statistically independent, or stochastically independent if the occurrence of one does not affect the probability of occurrence of the other (equivalently, does not affect the odds). Similarly, two random variables are independent if the realization of one does not affect the probability distribution of the other [4].
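We know that the conditional probability P(E | F) is generally different from P(E): knowing that F has occurred changes the probability that E has occurred. When P(E | F) is equal to P(E), we say the two events are independent. Since P(E | F) = P(E ∩ F) / P(F), E and F are independent if P(E ∩ F) = P(E)*P(F). The same holds for the relative frequencies in a contingency table: if the row variable is independent of the column variable, every conditional relative frequency n(i,j) / n(j) is equal to the marginal one n(i) / n, and multiplying both sides by n(j) / n gives n(i,j) / n = ( n(i) / n ) * ( n(j) / n ). So the joint relative frequency is the product of the corresponding marginal relative frequencies.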
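To make the product rule concrete, the sketch below restates it with relative frequencies and checks one cell of the age/sex table from question 7_R. The symbol n*(i,j), for the count we would expect under independence, is notation introduced here and not taken from the lesson.

```latex
\begin{align*}
\frac{n(i,j)}{n(j)} = \frac{n(i)}{n}
\;&\Longrightarrow\;
\frac{n(i,j)}{n} = \frac{n(i)}{n} \cdot \frac{n(j)}{n}
\;\Longrightarrow\;
n^{*}(i,j) = \frac{n(i)\, n(j)}{n} \\[4pt]
n^{*}(\text{24-26}, F) &= \frac{113 \cdot 115}{300} \approx 43.3
\qquad \text{while the observed count is } 36
\end{align*}
```

Since the observed 36 differs from the roughly 43.3 expected under independence, age and sex are not independent in this fake data set.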

9_R) Do a review about charts useful for statistics and data presentation. What is the chart type that impressed you most and why?

I know that "data visualization" studies the different ways to present data, because different types of data call for different presentations. For example, for percentages I would use a pie chart, because it's the most understandable way to see a 100% total divided into categories. Let's see some of these charts and when to use them [5].

  • Line charts
    • When you want to make predictions based on a data history over time.
    • When you want to show trends. For example, how house prices have increased over time.
    • When comparing two or more different variables, situations, and information over a given period of time.
  • Bar charts
    • When you want to display data that are grouped into nominal or ordinal categories (see lesson 2)
    • To compare data among different categories.
    • Bar charts are ideal for visualizing the distribution of data when we have more than three categories.
  • Pie charts
    • When you want to create and represent the composition of something.
    • To show percentage or proportional data.
    • When comparing areas of growth within a business such as profit.
  • Histograms
    • When the data is continuous.
    • When you want to represent the shape of the data’s distribution.
    • To summarize large data sets graphically.
  • Area charts
    • When you want to show trends, rather than express specific values.
    • To show a simple comparison of the trend of data sets over the period of time.
    • To compare a small number of categories.
I could continue for a long time, because the website where I found this information lists more than 15 chart types; I listed only the charts I see most often. My favourite charts are the pie chart and the bar chart because, in my opinion, they let you see the distribution across categories very quickly and give you a fast, real idea of what you are looking at.


Applications / Practice (A)

7_A) Create - in your preferred language C# or VB.NET - a program which is able to read ANY CSV file (or at least 99% of them), assuming no prior knowledge about its structure (do not even assume that a first line with variable names is necessarily present in the CSV: when not present, clearly, do some useful automatic naming). The program should use your intelligence, creativity and data checking functions (see references below) to achieve this task. The GUI should display the variables in a control, such as for instance a Treeview (or anything you deem useful) and let the user select the data type for each field in the CSV files. Also, some data preprocessing should be carried out on the data (or a suitable subset) in order to empirically establish the most suitable type of data of each field and, thus, give a preliminary tentative choice of data types for the variable fields to the program user (which he/she can, then, try to change on the GUI at will before attempting to read the file). Test the program with several CSV files downloaded from the Internet from various languages (ita, es/us, cn, ...) to make sure that values are parsed as intended. (For specific date fields, the GUI could also let the user specify a custom format in a textbox to read them correctly).

OPT 8_A) In the previous program 7_A, as a verification, plug the code you have already developed for computing the mean and the (univariate) statistical distribution, and allow the user to select any variable and compute the arithmetic mean (only when it makes sense) and the distribution. [Make this general enough, in anticipation of next homework program, where we will also add bivariate distributions and, in general, multivariate distributions, with various charts.]


Research about applications (RA)

4_R) Find on the Internet and document all possible ways you can infer a suitable data type, useful for statistical processing, when you are getting data points as a flow of alphanumeric strings (Be aware of possible format difference due to language).

When we get a user's input, we are usually dealing with String objects. A string is a sequence of characters, written in code between two double quotes, for example "hello". But these are strings too: "1", "2.4567", "Cybersecurity", "23-9-1988". As we can see, these strings can represent various types of data: the first can be read as an integer, the second as a floating point value, the third as a plain String and the last one as a date. The operation of converting a string into a value of another type is called parsing. Obviously we have to check some weird cases, for example when a dash is inside a string and we want to parse it as a double. This is the code I wrote in my application to work out which type is the most suitable for an input.

// GUESS THE TYPE OF A VARIABLE
// Requires: using System; using System.Collections;
// Returns the most specific type that EVERY non-empty value can be parsed as.
public String guessType(ArrayList inputs)
{
    bool allInt = true, allDouble = true, allBool = true, allDate = true;
    bool anyValue = false;
    int intValue;
    double doubleValue;
    bool boolValue;
    DateTime dateValue;
    foreach (object input in inputs)
    {
        string text = (string)input;
        // empty cells are missing values: they do not rule out any type
        if (string.IsNullOrEmpty(text))
        {
            continue;
        }
        anyValue = true;
        // a single value that fails to parse rules that type out for the whole column
        if (!int.TryParse(text, out intValue)) allInt = false;
        if (!double.TryParse(text, out doubleValue)) allDouble = false;
        if (!bool.TryParse(text, out boolValue)) allBool = false;
        if (!DateTime.TryParse(text, out dateValue)) allDate = false;
    }
    // no usable value at all (empty list or only empty cells): fall back to String
    if (!anyValue) return "String";
    // Int is tested before Double because every Int also parses as a Double;
    // Date comes last, so numeric columns are never reported as dates
    if (allInt) return "Int";
    if (allDouble) return "Double";
    if (allBool) return "Bool";
    if (allDate) return "Date";
    return "String";
}
Here the default result is String (because I know every input is at least a String); the function then tries to parse the values as a Date, Integer, Double or Boolean to see whether a more specific type fits.
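The TryParse calls above rely on the current culture of the machine, so the same CSV file can be read differently on an Italian and on an American computer. A possible way to deal with the format differences due to language is to pass the culture (or an explicit date format) to the parser. The sketch below is only an illustration: the class CultureAwareParsing and its two helper methods are hypothetical names, while double.TryParse, DateTime.TryParseExact and CultureInfo.GetCultureInfo are standard .NET calls.

```csharp
using System;
using System.Globalization;

public static class CultureAwareParsing
{
    // Parses a number using the separators of the given culture/format provider.
    public static bool TryParseNumber(string text, IFormatProvider provider, out double value)
    {
        return double.TryParse(text, NumberStyles.Float | NumberStyles.AllowThousands, provider, out value);
    }

    // Parses a date that must match exactly the custom format typed by the user.
    public static bool TryParseDate(string text, string format, out DateTime value)
    {
        return DateTime.TryParseExact(text, format, CultureInfo.InvariantCulture, DateTimeStyles.None, out value);
    }

    public static void Main()
    {
        double number;
        // Italian style: dot for thousands, comma for decimals.
        bool okIt = TryParseNumber("1.234,56", CultureInfo.GetCultureInfo("it-IT"), out number);
        Console.WriteLine(okIt && Math.Abs(number - 1234.56) < 1e-9);

        // US style: comma for thousands, dot for decimals.
        bool okUs = TryParseNumber("1,234.56", CultureInfo.GetCultureInfo("en-US"), out number);
        Console.WriteLine(okUs && Math.Abs(number - 1234.56) < 1e-9);

        // Day-month-year date with a custom format, as in the "23-9-1988" example.
        DateTime date;
        bool okDate = TryParseDate("23-9-1988", "d-M-yyyy", out date);
        Console.WriteLine(okDate && date.Year == 1988 && date.Month == 9 && date.Day == 23);
    }
}
```

One caveat worth knowing: with AllowThousands the parser does not check where the group separators are placed, so a value such as "1.5" can be read as 15 under the Italian culture. This is one reason to let the user confirm the guessed culture and type in the GUI.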


5_RA) Do a research about Reflection and the type 'Type' and make all examples that you deem to be useful.

Reflection in C# is used to retrieve metadata on types at runtime. In other words, you can use reflection to inspect metadata of the types in your program dynamically -- you can retrieve information on the loaded assemblies and the types defined in them. Reflection in C# is similar to RTTI (Runtime Type Information) of C++ [6]. To work with reflection, you need to include the System.Reflection namespace in your program. Let's now dig into some code to put reflection into action. Consider the following class called Customer.

public class Customer
{
    public int Id
    {
        get; set;
    }
    public string FirstName
    {
        get; set;
    }
    public string LastName
    {
        get; set;
    }
    public string Address
    {
        get; set;
    }
}
The following code snippet shows how you can get the class name and the namespace name of the Customer class using reflection:
Type type = typeof(Customer);
Console.WriteLine("Class: " + type.Name);
Console.WriteLine("Namespace: " + type.Namespace);
The following code snippet illustrates how you can retrieve the list of the properties of the Customer class and display their names in the console window:
static void Main(string[] args)
{
    Type type = typeof(Customer);
    PropertyInfo[] propertyInfo = type.GetProperties();
    Console.WriteLine("The list of properties of the Customer class are:--");
    foreach (PropertyInfo pInfo in propertyInfo)
    {
        Console.WriteLine(pInfo.Name);
    }
}
Another example I found very useful, which reads and changes a private static field by name:
    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Reflection;
    
    namespace ReflectionTest
    {
        class Program
        {
            private static int a = 5, b = 10, c = 20;
    
            static void Main(string[] args)
            {
                Console.WriteLine("a + b + c = " + (a + b + c));
                Console.WriteLine("Please enter the name of the variable that you wish to change:");
                string varName = Console.ReadLine();
                Type t = typeof(Program);
                FieldInfo fieldInfo = t.GetField(varName, BindingFlags.NonPublic | BindingFlags.Static);
                if(fieldInfo != null)
                {
                    Console.WriteLine("The current value of " + fieldInfo.Name + " is " + fieldInfo.GetValue(null) + ". You may enter a new value now:");
                    string newValue = Console.ReadLine();
                    int newInt;
                    if(int.TryParse(newValue, out newInt))
                    {
                        fieldInfo.SetValue(null, newInt);
                        Console.WriteLine("a + b + c = " + (a + b + c));
                    }
                    Console.ReadKey();
                }
            }
        }
    }
[7]
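Since the assignment lets the user choose a data type for each CSV field, one place where the type 'Type' may come in handy is converting a string to a type that is only known at run time. This is a minimal sketch, assuming the type is given by its full name: TypeDemo and ConvertTo are hypothetical names, while Type.GetType and Convert.ChangeType are standard .NET calls.

```csharp
using System;
using System.Globalization;

public static class TypeDemo
{
    // Converts text to the type identified by its full name, e.g. "System.Int32".
    public static object ConvertTo(string text, string typeName)
    {
        // Type.GetType returns null when the name cannot be resolved.
        Type target = Type.GetType(typeName);
        if (target == null)
        {
            throw new ArgumentException("Unknown type: " + typeName);
        }
        // InvariantCulture keeps the conversion independent of the machine settings.
        return Convert.ChangeType(text, target, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        object value = ConvertTo("42", "System.Int32");
        Console.WriteLine(value.GetType().Name);      // Int32
        Console.WriteLine(typeof(double).FullName);   // System.Double
        Console.WriteLine(typeof(int).IsValueType);   // True
    }
}
```

Convert.ChangeType throws a FormatException when the text does not fit the target type, so in a real program it would probably sit behind the same kind of check that TryParse provides.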


6_RA) Do a comprehensive research about the GRAPHICS (GDI+ library) object and all its members.

In the .NET Framework library, six namespaces define managed GDI+: System.Drawing, System.Drawing.Design, System.Drawing.Drawing2D, System.Drawing.Imaging, System.Drawing.Printing, and System.Drawing.Text. Figure 1.3 shows these namespaces. To use any of the classes defined in these namespaces, you must include them in your application. [8]

The System.Drawing.Design Namespace
As its name suggests, the System.Drawing.Design namespace provides additional functionality to develop design-time controls such as custom toolbox items, graphics editors, and type converters. The System.Drawing.Design namespace also defines a few interfaces, delegates, and enumerations.
The System.Drawing.Drawing2D Namespace
The System.Drawing.Drawing2D namespace defines functionality to develop advanced two-dimensional and vector graphics applications. This namespace provides classes for graphics containers, blending, advanced brushes, matrices, and transformations. It also provides dozens of enumerations.
The System.Drawing.Imaging Namespace
Basic imaging functionality is defined in the System.Drawing namespace, while the System.Drawing.Imaging namespace provides functionality for advanced imaging. Before an application uses classes from this namespace, it must reference the System.Drawing.Imaging namespace.
The System.Drawing.Printing Namespace
The System.Drawing.Printing namespace defines printing-related classes and types in GDI+. Before an application uses classes from this namespace, it must include the namespace.
The System.Drawing.Text Namespace
The System.Drawing.Text namespace contains only a few classes related to advanced GDI+ typography functionality. Before an application uses classes from this namespace, it must include the namespace.
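To connect these namespaces to the Graphics object itself, here is a small sketch that draws a bar chart of the age distribution from question 7_R on a bitmap. It assumes a Windows machine where System.Drawing (GDI+) is available; GraphicsDemo, DrawBars and the output file name are hypothetical, while Graphics.FromImage, Clear, FillRectangle, DrawLine and Bitmap.Save are members of the real library.

```csharp
using System.Drawing;
using System.Drawing.Imaging;

public static class GraphicsDemo
{
    // Draws one bar per value on a new bitmap and returns it.
    public static Bitmap DrawBars(int[] values, int width, int height)
    {
        Bitmap bitmap = new Bitmap(width, height);
        // A Graphics object is the drawing surface; here it draws onto the bitmap.
        using (Graphics g = Graphics.FromImage(bitmap))
        {
            g.Clear(Color.White);
            if (values.Length == 0)
            {
                return bitmap; // nothing to draw
            }
            int max = 1;
            foreach (int v in values)
            {
                if (v > max) max = v;
            }
            int barWidth = width / values.Length;
            for (int i = 0; i < values.Length; i++)
            {
                // Scale the bar so that the largest value almost fills the height.
                int barHeight = values[i] * (height - 1) / max;
                g.FillRectangle(Brushes.SteelBlue, i * barWidth, height - barHeight, barWidth - 2, barHeight);
            }
            // Horizontal axis on the bottom row of pixels.
            g.DrawLine(Pens.Black, 0, height - 1, width - 1, height - 1);
        }
        return bitmap;
    }

    public static void Main()
    {
        // Marginal distribution of age: 0-23, 24-26, 27-30, 31-37, 38+.
        using (Bitmap chart = DrawBars(new[] { 71, 113, 59, 37, 20 }, 250, 150))
        {
            chart.Save("age-distribution.png", ImageFormat.Png);
        }
    }
}
```

In a Windows Forms program the same drawing calls would normally be made on the Graphics object received in a Paint event (e.Graphics) instead of on a bitmap.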


References

[1] Wikipedia - Marginal distribution
[2] Wikipedia - Joint distribution
[3] Wikipedia - Bayes' theorem
[4] Wikipedia - Independence (probability theory)
[5] IntellSpot - Types of Graphs and Charts And Their Uses
[6] InfoWorld - How to work with reflection in C#
[7] CSharp.Net - Reflection introduction
[8] CSharp Corner - GDI+ Namespaces and Classes in .NET