The term video quality remains poorly defined even in cases where it seems that it shouldn’t be. It comes down to the word “quality”. Quality depends on the “reference” and that “reference” may be of low appeal. How do you quantify quality when the “reference” has no appeal? We often think in terms of how far the quality is from “pristine”, but that word causes even more grief, and let’s not get started on “similarity” or “fidelity”. Where do we start in defining video quality?
We must start at the beginning and define what we want to achieve. Are we trying to measure the perceived effect when we know that the video quality has been degraded? Or do we want to know whether the video quality is equal to what we already know is good?
It all starts with a simple concept: we must judge video quality in a “consistent” way, regardless of the type of distortion.
General Video Quality Defined
We are using video quality to define 3 components:
- Picture Quality – an index of the eye's ability to understand the picture
- Audio Quality – an index of the ear's ability to discern the audio
- Lip Sync – a measurement of the audio-to-video synchronization

Two kinds of measurement are used throughout:
- Metric – an algorithm that quantifies differences
- Index – an algorithm that measures quality using the Human Visual or Audio System (HVS/HAS)
When the Mean Opinion Score (MOS) is on a 1 to 5 scale, the scores are:
- 1 – Unacceptable
- 2 – Poor
- 3 – Fair
- 4 – Good
- 5 – Excellent
Types of Errors
Two types of problems can arise with digital television:
- The digital transmission path can fall below acceptable levels and cause a complete loss – i.e. no picture and no audio.
- The amount and quality of the compression can result in poor quality.
Many video CODECs use a Group of Pictures (GoP) frame structure, which consists of independently coded reference frames (“I” frames), frames coded as motion changes from the last reference frame (“P” frames), and frames coded as motion changes from the last or the next reference frame (“B” frames).
If a transmission error results in the loss of a frame, the type of frame lost determines how long the error persists:
- A loss of an “I” frame causes all frames until the next “I” frame to be lost
- A loss of a “P” frame affects the quality of all frames from this “P” frame to the next “I” frame
- A loss of a “B” frame affects only this “B” frame
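The three rules above can be sketched in a few lines of Python. This is only an illustration of the simplified rules as stated here, not a model of any real decoder: the GoP is assumed to be a string of frame types in display order, and the function name `frames_affected` is hypothetical.

```python
def frames_affected(gop, lost_index):
    """Return the indices of frames degraded when gop[lost_index] is lost.

    gop is a string of frame types such as "IBBPBBPBBIBB" (an assumed,
    simplified display-order representation).
    """
    kind = gop[lost_index]
    if kind not in "IPB":
        raise ValueError("unknown frame type: " + kind)
    if kind == "B":
        # A lost B frame affects only itself.
        return [lost_index]
    # A lost I or P frame affects every frame up to (not including)
    # the next I frame.
    affected = []
    for i in range(lost_index, len(gop)):
        if i > lost_index and gop[i] == "I":
            break
        affected.append(i)
    return affected


gop = "IBBPBBPBBIBB"
print(frames_affected(gop, 0))  # loss of the first I frame
print(frames_affected(gop, 3))  # loss of the first P frame
print(frames_affected(gop, 1))  # loss of a B frame
```

With a longer GoP the same loss persists for more frames, which is why the distance between “I” frames matters when transmission errors are expected.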
Transmission Error with Concealment on the Right
If the video signal must be re-compressed or rate-shaped to fit into the available bandwidth for transmission, the compression may be too aggressive. In this case, the video can become blurry or blocky. An example of a blocking error is shown below. The original is on the left.
Blocking Error – Too High a Compression on the Right
Video is not the only place where compression and digitization take place. Most audio CODECs detect high-frequency components and encode them with very few bits, because the human ear hears high frequencies only when they are loud. Some algorithms reduce the dynamic range to reduce the amount of data. If a transmission error occurs, the audio will pop or go silent. If the compression is too extreme, the audio will lack depth – i.e. sound tinny or hollow.
Perceptual/Objective Quality Testing
A number of algorithms have been developed to estimate video quality. These algorithms reduce an image or sound to a number. This number is then correlated with subjective data to generate a score. Since the subjective results are based on a limited number of test images, these algorithms attempt to define quality in a known, monotonic way (increasing numbers consistently reflect either better or poorer quality). In this way, a curve can generate a subjective quality score between 2 real points.
These objective algorithms, which predict a subjective score, are divided into 3 general categories:
- Full reference algorithms compare the output video stream to its input (or to another point)
- No reference algorithms analyze only the output stream
- Reduced reference algorithms extract specific information from the input stream and use it when analyzing the output stream.
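The idea of using a curve to produce a subjective score between 2 real points can be sketched as follows. This is a minimal illustration, assuming piecewise-linear interpolation between calibration points; real tools may fit a different curve. The function name `predict_mos` and the calibration values are invented for the example and are not measured data.

```python
from bisect import bisect_right


def predict_mos(objective, calibration):
    """Estimate a subjective score from an objective number.

    calibration is a list of (objective_value, subjective_score) pairs,
    sorted by objective_value. Values outside the calibrated range are
    clamped to the nearest end point.
    """
    xs = [x for x, _ in calibration]
    ys = [y for _, y in calibration]
    if objective <= xs[0]:
        return ys[0]
    if objective >= xs[-1]:
        return ys[-1]
    hi = bisect_right(xs, objective)  # first point to the right
    lo = hi - 1                       # last point at or to the left
    t = (objective - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + t * (ys[hi] - ys[lo])


# Hypothetical calibration points (objective number, subjective MOS).
points = [(20.0, 1.0), (30.0, 3.0), (40.0, 5.0)]
print(predict_mos(25.0, points))  # halfway between the first two points
```

The interpolation is only meaningful because the objective number is assumed to move in one direction as quality changes; a non-monotonic number could not be mapped this way.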
To start, the 2 streams (“reference” and “processed”) must be aligned both temporally and spatially. Audio and video synchronization (lip-sync) issues are reported at this point, and the audio and video are then aligned. For simplicity, it can be assumed that the streams stay aligned once the initial alignment difference has been calculated. If this is not the case, then each frame must be re-aligned.
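As an illustration of the temporal part of this step, the sketch below searches for the frame delay that minimizes the average squared difference between the two streams. It is one simple way to find an initial offset, not necessarily how any particular product does it. It assumes each frame is a flat list of pixel values, that the “processed” stream is only ever delayed (never ahead), and it ignores spatial alignment; the function names are hypothetical.

```python
def frame_mse(a, b):
    """Mean squared difference between two frames (flat pixel lists)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def best_temporal_offset(reference, processed, max_offset):
    """Return the delay, in frames, of processed relative to reference."""
    best, best_err = 0, float("inf")
    for offset in range(max_offset + 1):
        # Pair reference frame i with processed frame i + offset.
        pairs = list(zip(reference, processed[offset:]))
        if not pairs:
            break
        err = sum(frame_mse(r, p) for r, p in pairs) / len(pairs)
        if err < best_err:
            best, best_err = offset, err
    return best


reference = [[0, 0], [10, 10], [20, 20], [30, 30]]
processed = [[99, 99], [99, 99]] + reference  # delayed by two frames
print(best_temporal_offset(reference, processed, max_offset=3))
```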
The most widely used metrics are PSNR (Peak Signal-to-Noise Ratio) and MSE (Mean Squared Error). Both measure the mean squared error between input and output; PSNR expresses that error relative to the peak signal value, in dB. PSNR and MSE are known to be poor predictors of perceived video quality, but they do serve an important role. Unlike the indices soon to be discussed, PSNR and MSE are metrics: they measure the absolute difference between two signals, which is completely quantifiable. This is very important in QA and monitoring, where the perceived quality has already been measured in the laboratory and what is needed is a PASS/FAIL indicator. A PSNR value of 35 dB is generally considered good. A general comparison of PSNR to MOS is shown below.
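For reference, both metrics take only a few lines to compute. The sketch below uses the standard definitions (MSE as the mean of squared sample differences, PSNR as 10·log10(peak² / MSE)) and assumes 8-bit samples held in flat lists, so the peak value is 255. The `passes` helper and its 35 dB default are an illustration of the PASS/FAIL use described above, not a recommended threshold for every application.

```python
import math


def mse(reference, processed):
    """Mean squared error between two equally sized sample lists."""
    if not reference or len(reference) != len(processed):
        raise ValueError("inputs must be non-empty and the same length")
    total = sum((r - p) ** 2 for r, p in zip(reference, processed))
    return total / len(reference)


def psnr(reference, processed, peak=255):
    """Peak Signal-to-Noise Ratio in dB; infinite for identical inputs."""
    error = mse(reference, processed)
    if error == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / error)


def passes(reference, processed, threshold_db=35.0):
    """Simple PASS/FAIL indicator based on a PSNR threshold."""
    return psnr(reference, processed) >= threshold_db


ref = [100, 100, 100, 100]
out = [101, 99, 100, 100]
print(mse(ref, out), round(psnr(ref, out), 2), passes(ref, out))
```

Because the result depends only on sample differences, two distortions with the same PSNR can look very different to a viewer, which is why PSNR works better as a repeatable check than as a quality predictor.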
Traditional perceptual video quality index methods are based on a bottom-up approach, which attempts to simulate the functionality of the relevant early components of the human visual system (HVS) and human audio system (HAS). These methods usually involve:
- Video/audio alignment
- Low-pass filtering (to simulate the eye – video only)
- Calculating the differences that affect the human eye/ear:
  - Blockiness
  - Blurriness
  - Noise
  - Lack of dynamic range
  - Loss of high frequencies
- Classifying the types of distortions
- Generating a number
  - This number should move consistently in one direction as the video quality increases and in the opposite direction as it decreases.

Well-known algorithms include:
- Sarnoff/PQR – First Widely Heralded HVS Metric
- VQM – Video Quality Metric
- PEVQ – Perceptual Evaluation of Video Quality
- PEAQ – Perceptual Evaluation of Audio Quality
In either the top-down or bottom-up approach, the model produces a number, and that number needs to be correlated with the subjective MOS. Three methods exist for this:
- Differential Mean Opinion Score (DMOS)
- Mean Opinion Score (MOS)
- Just Noticeable Differences (JND)
Many people prefer the simpler MOS (strictly speaking DMOS, as it is the difference between the “reference” and “processed” Mean Opinion Scores). A group of people is asked to rate a video sequence relative to a reference (full reference). The general methodology for conducting subjective tests is outlined in ITU-R BT.500. The measurement gives a numeric value on a 1-5 scale. The ITU recommends MOS (or DMOS) under ITU-T P.910. The heuristic, nominal values for MOS are listed below:
- 4.4-5.0 – Very Satisfied
- 4.0-4.3 – Satisfied
- 3.0-3.9 – Some Users Satisfied
- 2.0-2.9 – Many Users Dissatisfied
- 1.0-1.9 – Most Users Dissatisfied
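The bands above translate directly into a lookup. The sketch below averages a panel's ratings into a MOS and maps it to the listed description. It treats each band's lower bound as a threshold, so a value such as 4.35 falls into “Satisfied”; that handling of the gaps between bands is an assumption, as are the function names.

```python
# (lower bound, description), highest band first.
BANDS = [
    (4.4, "Very Satisfied"),
    (4.0, "Satisfied"),
    (3.0, "Some Users Satisfied"),
    (2.0, "Many Users Dissatisfied"),
    (1.0, "Most Users Dissatisfied"),
]


def mean_opinion_score(ratings):
    """Average the individual 1-5 ratings given by a viewing panel."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)


def satisfaction(mos):
    """Map a MOS on the 1-5 scale to its nominal description."""
    if not 1.0 <= mos <= 5.0:
        raise ValueError("MOS must be between 1 and 5")
    for lower, label in BANDS:
        if mos >= lower:
            return label


panel = [4, 5, 4, 3, 4]
score = mean_opinion_score(panel)
print(score, satisfaction(score))
```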
JND reports how many viewers must be put into a room before one of them picks a different video than the others when asked whether the “reference” or the “processed” video looks better. The score is written as $\text{NumberOfPeople} = 2^{\text{JND}+1}$. This method is the foundation for T1.TR.75.2001 (“Objective Perceptual Video Quality Measurement Using a JND-Based Full Reference Technique”).
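| JND Score | Experts | Percentage | Description |
| --- | --- | --- | --- |
| 0 | 2 | 50% | If you ask 2 experts which video is better, they cannot agree. |
| 1 | 4 | 75% | 3 pick one sequence and 1 picks the other sequence |
| 2 | 8 | 87.5% | 7 pick one sequence and 1 picks the other sequence |
| 3 | 16 | 93.75% | 15 pick one sequence and 1 picks the other sequence |
| 4 | 32 | 96.875% | 31 pick one sequence and 1 picks the other sequence |
| 5 | 64 | 98.438% | 63 pick one sequence and 1 picks the other sequence |
| 6 | 128 | 99.219% | 127 pick one sequence and 1 picks the other sequence |
| 9 | 1024 | 99.902% | 1023 pick one sequence and 1 picks the other sequence |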
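The Experts and Percentage columns follow from the JND score: the number of experts is 2 raised to the power (JND + 1), and the percentage is the share of experts who pick the same sequence, (N − 1) / N. A short sketch that regenerates those columns is given below; the function names are hypothetical and only whole-number JND scores are considered.

```python
def jnd_experts(jnd):
    """Number of experts implied by a whole-number JND score."""
    return 2 ** (jnd + 1)


def jnd_agreement(jnd):
    """Fraction of those experts who pick the same sequence."""
    n = jnd_experts(jnd)
    return (n - 1) / n


for jnd in (0, 1, 2, 3, 4, 5, 6, 9):
    experts = jnd_experts(jnd)
    percent = 100 * jnd_agreement(jnd)
    print(jnd, experts, f"{percent:.3f}%")
```

Each additional JND halves the share of dissenting viewers, so the scale is logarithmic: moving from 1 to 2 JND is as large a perceptual step as moving from 5 to 6.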
Since the JND and DMOS algorithms use different philosophical approaches, there is no direct correlation between them. In general, we see the following behavior. Please remember that JND calculates the blockiness and blurriness of an image, while DMOS calculates whether structures are still the same shape.
| DMOS | JND | Description |
| --- | --- | --- |
| | 13+ | Probably not aligned; check spatial and temporal alignment |
| 4-3.5 | 10.0-12.99 | Unwatchable |
| 3.4999-3.0 | 7.0-9.99 | Annoying |
| 2.9999-0.4 | 2.0-6.99 | Broadcast Quality |
| 0.3999-0.0001 | 0.01-1.99 | Production Quality |
| 0 | 0 | No Defects |
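For automated reporting, the ranges in this table can be turned into a pair of lookups. The sketch below is a direct transcription of the table, with each range's lower bound treated as a threshold. Rejecting DMOS values above 4 is an assumption, since the table lists no DMOS range beyond 4, and the function names are hypothetical.

```python
def describe_jnd(jnd):
    """Map a JND score to the description in the table above."""
    if jnd < 0:
        raise ValueError("JND cannot be negative")
    if jnd == 0:
        return "No Defects"
    if jnd < 2.0:
        return "Production Quality"
    if jnd < 7.0:
        return "Broadcast Quality"
    if jnd < 10.0:
        return "Annoying"
    if jnd < 13.0:
        return "Unwatchable"
    return "Probably not aligned"


def describe_dmos(dmos):
    """Map a DMOS value (0 = no difference, 4 = worst) to a description."""
    if not 0 <= dmos <= 4:
        raise ValueError("DMOS is expected to lie between 0 and 4")
    if dmos == 0:
        return "No Defects"
    if dmos < 0.4:
        return "Production Quality"
    if dmos < 3.0:
        return "Broadcast Quality"
    if dmos < 3.5:
        return "Annoying"
    return "Unwatchable"


print(describe_jnd(4.2), "/", describe_dmos(0.2))
```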
Subjective Data
The most important item to remember is that lossy, compressed signals have distortions. To understand quality, we must correlate metrics or indices to subjective MOS data. To this end, we must have an open, searchable database of subjective data.
The VQEG (Video Quality Experts Group) created a large database of video. They compressed these videos using H.263, H.264, and MPEG-2 and conducted subjective tests. These databases are open to member companies, but are not royalty-free.
The University of Texas started with 10 royalty-free videos from the Technical University of Munich, distorted them in many ways using MPEG-2 and H.264, and transmitted them over IP networks. They conducted subjective tests and released their database, called LIVE, along with the royalty-free data.
Video Clarity ClearView Solution
Video Clarity defined 2 product lines: ClearView Video Analysis and RTM (Real Time Monitor).
RTM captures 2 live inputs, aligns the audio and video inputs, reports lip-sync issues, calculates the absolute difference between the 2 inputs (metric), continually reports the quality score, generates a pass/fail, and saves failures for further offline analysis.
ClearView Video Analysis generates test signals, captures live inputs, and inputs compressed or uncompressed files. It then aligns the audio and video and reports lip-sync issues. It calculates the DMOS, JND, and/or PSNR scores. It uses the Sarnoff/PQR algorithm ported to JND (using the VQEG database) and the MS-SSIM algorithm ported to DMOS (using the University of Texas' LIVE database). It also lets you view the “reference” and “processed” signals side-by-side or their difference maps for your own subjective evaluation.
The Sarnoff/PQR and MS-SSIM algorithms are further discussed on our website at www.videoclarity.com/
The Author
Bill Reckwerdt has been involved in digital video since the early 90’s, working on everything from digital compression and video on demand to streaming servers. He received his MS, specializing in Behavioral Modeling and Design Automation, from the University of Illinois Urbana-Champaign. He is currently the VP of Marketing and the CTO for Video Clarity, which makes quantitative, repeatable video quality testing tools. For more information about Video Clarity, please visit their website at http://www.videoclarity.com.