

[Statistics] Lesson 2

Author: Edoardo Ottavianelli
18/10/2020

Research about theory (R)

4_R) A characteristic, attribute or property of the units of observation can be measured and operationalized on different "levels" on a given unit of observation, possibly giving rise to different operative variables. Find out about the proposed classifications of variables and express your opinion about their respective usefulness.

When we deal with entities, we can describe many of their characteristics. For example, when I describe a girl I can speak about her eyes (brown or green, small or big...), her body (tall, skinny, 165 cm, about 5 ft 5 in...) or her behaviour. All of these properties can be classified into categories. The best-known classification was published by the psychologist Stanley Smith Stevens. Below is a chart, taken from a YouTube video [1], that explains this categorization very clearly.

[Chart: the levels of measurement]

We can distinguish between quantitative data and qualitative data. The former refers only to numeric data (something I can quantify); the latter refers to data I describe with words/adjectives, or with numbers that are not a measure of anything. But be careful: I can describe the same entity using both types of variables. As we saw before, I can describe the body of a girl as tall (using a scale of short, medium and tall, so qualitative data) or say she is 165 centimeters (quantitative data).
Let's focus on qualitative data. It can be split into two further categories: nominal and ordinal. Both use words to describe data. An example of nominal measurement is using the names of the seasons (winter, fall, spring and summer): for instance, I can group the days of the year by the season they fall in. Ordinal measurement differs from nominal because I can sort the information and I have a kind of range of measurement. For example, I can classify the dishes of a restaurant on a scale of disgusting, acceptable, good, delicious. Here we clearly see a scale, a range, a lower grade and an upper grade.
Turning to quantitative measurement, there is a small difference between its two levels. Both use numeric values to describe a property, but an interval scale does not necessarily have zero as its lowest value. A good example of the interval level is temperature measured in °C or °F: zero is just a value, not a lower bound. Temperature measured in K, instead, has a true zero (an actual lower bound) and is therefore a ratio level of measurement. Summing up the four levels:
  • Nominal: classifies objects/entities into the groups or sets they belong to. We cannot do any comparison on this type of data except classification/membership. Color, shape and other qualitative adjectives of an object belong to this category. Mathematical operators: “ = ”, “ ≠ ”.
  • Ordinal: defines an order relation (1st, 2nd, 3rd) between the objects. For example, we can take the height/weight of the students of the Statistics course: we can order the students by height/weight and compare them using this information. I am 180 cm and I am shorter than a 190 cm student. Mathematical operators: “ > ”, “ < ”.
  • Interval: defines a scale on which information can be placed. For example, with a meterstick we have a lower bound, an upper bound and points on this scale where all the data fall. Mathematical operators: “ + ”, “ - ”.
  • Ratio: possesses a meaningful (unique and non-arbitrary) zero value. Examples include mass, length, duration, plane angle, energy and electric charge. In contrast to the interval level, ratios are now meaningful, because having a non-arbitrary zero point makes it meaningful to say, for example, that one object has “twice the length” of another. Mathematical operators: “ * ”, “ / ”. [2]
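
To make the classification concrete, here is a minimal C# sketch (the naming is mine, not part of the lesson) that encodes the four levels and lists the operators that become meaningful at each one; note how every level inherits the operators of the level before it:

using System;

enum MeasurementLevel { Nominal, Ordinal, Interval, Ratio }

static class Levels
{
    // Operators that are meaningful at a given level (each level adds to the previous one).
    public static string AllowedOperators(MeasurementLevel level)
    {
        switch (level)
        {
            case MeasurementLevel.Nominal:  return "=  ≠";
            case MeasurementLevel.Ordinal:  return "=  ≠  <  >";
            case MeasurementLevel.Interval: return "=  ≠  <  >  +  -";
            case MeasurementLevel.Ratio:    return "=  ≠  <  >  +  -  *  /";
            default: throw new ArgumentOutOfRangeException(nameof(level));
        }
    }

    static void Main()
    {
        // e.g. season = nominal, dish rating = ordinal, °C = interval, height in cm = ratio
        foreach (MeasurementLevel level in Enum.GetValues(typeof(MeasurementLevel)))
            Console.WriteLine(level + ": " + AllowedOperators(level));
    }
}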

5_R) Describe the most common configuration of data repositories in the real world and in corporate environments. Concepts such as operational systems (OLTP), Data Warehouse (DW), Data Marts, analytical and statistical systems (OLAP), etc. Try to draw a conceptual picture of how all these elements work together and how the flow of data and information is processed to extract useful knowledge from raw data.

Online transaction processing (OLTP) systems facilitate transaction processing. We can define two meanings of the term transaction [3]:

  • Transaction in a Database Management System means the atomic process of applying changes to data.
  • Transaction in the financial world means the actual process of moving money from a source to a target.
These systems are specifically designed to process insert and update operations. One example is the ATM, where you can send/receive money. Obviously, the requirements for these systems are availability, speed and high concurrency performance. The OLTP system design requires (see the transaction sketch after this list):
  • Rollback segments: the portions of the DBMS that record the actions of transactions and provide the ability to go back and restore the system to its previous state.
  • Clusters: groups of tables in a DBMS; they help the system make JOIN operations faster.
  • Discrete transactions: this technique makes the changes atomic and defers all of them until the transaction is committed. It stores the changes in a separate environment and applies them only when the transaction is confirmed.
  • Block size: the block size of these systems must be a multiple of the operating system's block size, to avoid unnecessary, low-performance input/output operations.
  • Buffer cache size: these systems have to maximize the use of caching due to the high level of concurrent requests.
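
As promised above, here is a hedged C# sketch of such an atomic transaction, using the classic ADO.NET provider (System.Data.SqlClient). The connection string and the Accounts table are hypothetical, my own illustration rather than part of any real system described here; the point is only that both UPDATEs are committed together or rolled back together:

using System;
using System.Data.SqlClient;

class TransferDemo
{
    // Moves 'amount' from one hypothetical account to another, atomically.
    static void Transfer(string connectionString, int fromId, int toId, decimal amount)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (SqlTransaction tx = conn.BeginTransaction())
            {
                try
                {
                    var debit = new SqlCommand(
                        "UPDATE Accounts SET Balance = Balance - @a WHERE Id = @id", conn, tx);
                    debit.Parameters.AddWithValue("@a", amount);
                    debit.Parameters.AddWithValue("@id", fromId);
                    debit.ExecuteNonQuery();

                    var credit = new SqlCommand(
                        "UPDATE Accounts SET Balance = Balance + @a WHERE Id = @id", conn, tx);
                    credit.Parameters.AddWithValue("@a", amount);
                    credit.Parameters.AddWithValue("@id", toId);
                    credit.ExecuteNonQuery();

                    tx.Commit();   // all changes become visible only now
                }
                catch
                {
                    tx.Rollback(); // restore the previous state, as rollback segments allow
                    throw;
                }
            }
        }
    }
}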
This type of system is contrasted with Online Analytical Processing (OLAP): OLAP systems process much more complex queries, but in smaller volumes. OLTP uses all of the CRUD operations (Create, Read, Update and Delete), whereas OLAP systems are typically read-only. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture [4]. There are three main types of OLAP systems:
  • Multidimensional OLAP (MOLAP): MOLAP is the classic form of OLAP and is sometimes referred to as just OLAP. It stores data in an optimized multi-dimensional array storage, rather than in a relational database.
  • Relational OLAP (ROLAP): ROLAP works directly with relational databases and does not require pre-computation. The base data and the dimension tables are stored as relational tables and new tables are created to hold the aggregated information.
  • Hybrid OLAP (HOLAP): a HOLAP database uses relational tables to hold the larger quantities of detailed data, and specialized storage for at least some aspects of the smaller quantities of more-aggregate or less-detailed data. HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities of both approaches.
A Data Warehouse (DW) is a centralized repository of data that can be analyzed to make more informed decisions. DBMSs and other systems send information to the data warehouse, usually on a regular basis. Business analysts, data engineers and data scientists access the data to make decisions and study the flow of information. Data and the related analyses have become critical factors for the competitiveness of companies. Reports, dashboards and analytics tools are indispensable for extracting insights from data, monitoring business performance and supporting decision making.
Benefits of DW (taken from Wikipedia [5]): A data warehouse maintains a copy of information from the source transaction systems. This architectural complexity provides the opportunity to:
  • Integrate data from multiple sources into a single database and data model. Greater congregation of data into a single database, so that a single query engine can be used to present data in an ODS (operational data store).
  • Mitigate the problem of database isolation level lock contention in transaction processing systems caused by attempts to run large, long-running analysis queries in transaction processing databases.
  • Maintain data history, even if the source transaction systems do not.
  • Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.
  • Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data.
  • Make decision–support queries easier to write.
  • Organize and disambiguate repetitive data.
Another important, well-deserved mention goes to data marts, which are repositories containing well-ordered and classified data for a specific topic (e.g. purchasing, sales, inventory).

6_R) Show how we can obtain an online algorithm for the arithmetic mean and explain the various possible reasons why it is preferable to the "naive" algorithm based on the definition.

What is the arithmetic mean?
In mathematics and statistics, the arithmetic mean, or simply the mean or the average (when the context is clear), is the sum of a collection of numbers divided by the count of numbers in the collection. The collection is often a set of results of an experiment or an observational study, or frequently a set of results from a survey.[6]
The naive algorithm, i.e. the first that comes to mind, is obviously to sum all the values and then divide by the number of items: mean = (x1 + x2 + ... + xn) / n.
So, for instance, if I have this dataset:

Person Age
Edoardo 23
Francesca 45
Manuele 1
Chiara 22
Mario 56
Francesco 98
Cristian 33
Zaira 77
Marco 16
Gianna 19
To calculate the arithmetic mean, I have to sum all the ages and then divide by 10. So: 390/10 = 39. The arithmetic mean of this set of values is therefore 39.
We calculated this number without difficulty, but can we run into obstacles in some circumstances?
Yes. First of all, as we have seen in the previous section, we could have an unbounded stream of values. For example, we might have to collect data every n milliseconds (time-series data, for instance financial or geoseismic data), and it is practically impossible to run this type of algorithm because we will never have a finite set of values.
There are also other problems with floating point values and very large values, where the naive approach does not work. These types of problems are discussed in the Research about applications section (3_RA). A better way to calculate the arithmetic mean is the online update formula given by Knuth,
    mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n,
which updates the mean one value at a time, or the Kahan summation algorithm:
function KahanSum(input)
    var sum = 0.0                    // Prepare the accumulator.
    var c = 0.0                      // A running compensation for lost low-order bits.
    for i = 1 to input.length do     // The array input has elements indexed input[1] to input[input.length].
        var y = input[i] - c         // c is zero the first time around.
        var t = sum + y              // Alas, sum is big, y small, so low-order digits of y are lost.
        c = (t - sum) - y            // (t - sum) cancels the high-order part of y; subtracting y recovers its negative low part.
        sum = t                      // Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!
    next i                           // Next time around, the lost low part will be added to y in a fresh attempt.
    return sum
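
For the online mean itself, here is a minimal C# sketch (my own naming, not the lesson's program) of the incremental update mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n; it never builds a huge running sum, so it works on a stream of values and is less exposed to the absorption problem:

using System;

class OnlineMean
{
    private long count;
    private double mean;

    public void Add(double x)
    {
        count++;
        mean += (x - mean) / count; // incremental update, one value at a time
    }

    public double Mean
    {
        get { return count == 0 ? double.NaN : mean; }
    }
}

class OnlineMeanDemo
{
    static void Main()
    {
        var m = new OnlineMean();
        foreach (var age in new double[] { 23, 45, 1, 22, 56, 98, 33, 77, 16, 19 })
            m.Add(age);
        Console.WriteLine(m.Mean); // ~39, matching the table above (up to rounding)
    }
}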
                            


Applications / Practice (A)

4_A) Create - in both languages C# and VB.NET - a demonstrative program which computes the online arithmetic mean (if it's a numeric variable) and the distribution for a discrete variable (can use values simulated with RANDOM object).

This program is just meant to improve my skills in C# and the whole Windows development environment. It tries to compute the (naive) arithmetic mean and, if it cannot because there are non-numerical values, it computes a distribution over the population.


This program instead computes the online arithmetic mean with random values: every second it generates a random number (the grade for an exam, so in the range 18-31, with 31 excluded) and that number updates the current mean. There are two buttons: a "start" button to start the timer and the arithmetic mean computation, and a "stop" button to stop both.

This program instead computes the distribution with random values: every 0.1 seconds it generates a random number (the grade for an exam, so in the range 18-31, with 31 excluded) and that number updates the current distribution. There are two buttons: a "start" button to start the timer and the distribution computation, and a "stop" button to stop both.
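
As a rough idea of the distribution part, here is a hedged console-only C# sketch (it is not the original WinForms program, there is no timer or button): it draws random grades in the range 18-31, 31 excluded, and counts how often each one occurs:

using System;
using System.Collections.Generic;

class DistributionDemo
{
    static void Main()
    {
        var rng = new Random();
        var counts = new Dictionary<int, int>();

        for (int i = 0; i < 1000; i++)        // stands in for the 0.1 s timer ticks
        {
            int grade = rng.Next(18, 31);     // 18..30, upper bound excluded
            if (counts.ContainsKey(grade))
                counts[grade]++;
            else
                counts[grade] = 1;
        }

        foreach (var kv in counts)
            Console.WriteLine(kv.Key + ": " + kv.Value);
    }
}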

5_A) Create - in your preferred language C# or VB.NET - a demonstrative program which computes the online arithmetic mean (or "running mean") and distribution for a continuous variable (can use random simulated values). (In both cases, create your own algorithm, by either inventing it from scratch based on your own ideas, or putting it together by researching everywhere, striving for the most usable and general logic and good efficiency and numerical stability).

6_A) Create one or more simple sequences of numbers which clearly show the problem with the "naive" definition formula of the arithmetic mean, and explore possible ways to fix that. Provide alternative algorithms to minimize problems with the floating point representation with simple demos with actual numbers.
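
A hedged demo of the kind of sequence the question asks for (the numbers are my own choice): the first line shows absorption, where a huge partial sum swallows a small value entirely; the loop shows rounding errors drifting over many additions, and how a compensated (Kahan) sum keeps the mean accurate:

using System;

class NaiveMeanProblems
{
    static void Main()
    {
        // Absorption: 1.0 disappears next to a partial sum of 1e16.
        Console.WriteLine(1e16 + 1.0 == 1e16);       // True

        // Drift: summing 0.1 ten million times, naive vs Kahan-compensated.
        double naive = 0.0;
        double kahanSum = 0.0, c = 0.0;
        for (int i = 0; i < 10_000_000; i++)
        {
            naive += 0.1;

            double y = 0.1 - c;                      // same steps as the pseudocode above
            double t = kahanSum + y;
            c = (t - kahanSum) - y;
            kahanSum = t;
        }
        Console.WriteLine(naive / 10_000_000);       // typically drifts away from 0.1
        Console.WriteLine(kahanSum / 10_000_000);    // stays essentially at 0.1
    }
}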


Research about applications (RA)

3_RA) Understand how the floating point representation works and describe systematically (possibly using categories) all the possible problems that can happen. Try to classify the various issues and limitations (representation, comparison, rounding, propagation, approximation, loss of significance, cancellation, etc.) and provide simple examples for each of the categories you have identified.

In general, someone without knowledge of computer programming and computer architecture might think of floating point numbers as just numbers with a decimal separator. So, to represent 16.56, the number could be stored in the computer exactly like that: an integer part, a separator and a decimal part. This fixed-point representation is quite impractical. For example, on a computer with a 32-bit word I could assign 20 bits to the integer part and the remaining 12 bits to the decimal part. With this method I can represent numbers from about -524,287 to 524,287, and the smallest step I can represent is 1/4096, about 0.00024. This is obviously of little use for scientific computing. [9]
So, engineers came up with a new representation of floating point numbers in computers that works this way: we have a triple of values <s, m, e>, where s is the sign of the number, m is the mantissa and e is the exponent. For example, +1656 x 10^-2 is a way to represent the number 16.56. Even though this is essentially the representation used in computing today, it still has problems; let's analyze them.
  • The problem of scale. Each FP number has an exponent which determines the overall “scale” of the number, so you can represent either really small values or really large ones, though the number of digits you can devote to that is limited. Adding two numbers of different scale will sometimes result in the smaller one being “eaten”, since there is no way to fit it into the larger scale. [10]
  • Conversions to integer are not intuitive: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6. This is because conversions generally truncate rather than round. Floor and ceiling functions may produce answers which are off by one from the intuitively expected value.
  • Limited exponent range: results might overflow, yielding infinity, or underflow, yielding a subnormal number or zero. In these cases precision will be lost.
  • 64-bit integers may not fit into a double: only 53 bits of significand are available, so unless you know your integers need at most 53 bits, the only suitable floating-point type is the 80-bit extended precision format (long double) on x86 processors.
  • Error accumulation: in many if not most cases, numerical computations involve iteration rather than the evaluation of a single formula. This can cause errors to accumulate, as the chance of an intermediate result not being exactly representable rises. Even if the target solution is an attractive fixed point of your iteration formula, numerical errors may catapult you out of its domain of attraction.
  • Transcendental functions: in addition to the rounding errors introduced at every step of a computation, the values your processor computes for transcendental functions may be off by more than basic arithmetic. Optimal rounding to the nearest representable number requires distinguishing on which side of the midpoint between two representable numbers (an infinitely narrow line) the true result lies. Transcendental functions are power series with infinitely many non-zero coefficients, so one cannot really know that for a given argument without computing infinitely many terms of the series. [11]
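
A short hedged C# demo of some of the categories above (the exact printed digits depend on the runtime, so the comments only indicate the expected behaviour):

using System;

class FloatingPointIssues
{
    static void Main()
    {
        // Representation/rounding: 0.1, 0.2 and 0.3 have no exact binary representation.
        Console.WriteLine(0.1 + 0.2 == 0.3);          // False
        Console.WriteLine((0.1 + 0.2).ToString("R")); // something like 0.30000000000000004

        // Loss of significance / absorption: the small addend is eaten by the larger scale.
        Console.WriteLine(1e16 + 1.0 == 1e16);        // True

        // Cancellation: subtracting nearly equal numbers exposes the earlier rounding error.
        Console.WriteLine((1.0 + 1e-15) - 1.0);       // not exactly 1e-15

        // Limited exponent range: overflow and underflow.
        Console.WriteLine(double.MaxValue * 2);       // infinity
        Console.WriteLine(double.Epsilon / 2);        // 0

        // Truncating conversion: a quotient just below 7 is cut down to 6.
        float q = 0.63f / 0.09f;
        Console.WriteLine((int)q);                    // may print 6 instead of 7
    }
}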


References

[1] Data Science & Statistics: Levels of measurement
[2] Wikipedia - Level of measurement
[3] Wikipedia - Online Transaction Processing
[4] Wikipedia - Online Analytical Processing
[5] Wikipedia - Data Warehouse
[6] Wikipedia - Arithmetic mean
[7] Best algorithms to compute the “online data stream” arithmetic mean
[8] Wikipedia - Floating point arithmetic
[9] Progettazione di sistemi digitali - Rappresentazione dei numeri razionali
[10] StackOverflow - Floating point inaccuracy examples
[11] What you never wanted to know about floating point but will be forced to find out