Statistical sampling is required for design verification and validation. If you have worked for a company or written protocols in the past - how have you explained or ensured that the sampling plan and statistical approach you have taken has been sufficient?
For myself, I have written validation protocols in the past, and the approach usually depends on the extent of the validation: whether it is minor or critical, and whether we are measuring variable or attribute data. From there we pick a target such as 90% reliability with 90% confidence (as an example) and determine how many samples we need to support that claim. The acceptance criteria for the test come afterward, along with the statistical analysis. This is a very basic, high-level approach. I'm curious how people reason about and justify the number of samples they choose for verification and validation.
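For anyone who wants to see the arithmetic behind that, here is a minimal Python sketch of the standard zero-failure (success-run) calculation, n = ln(1 - C) / ln(R); the 90/90 values are just the example from my post:

```python
import math

def success_run_n(confidence: float, reliability: float) -> int:
    """Zero-failure sample size: n = ln(1 - C) / ln(R)."""
    return math.ceil(math.log(1 - confidence) / math.log(reliability))

# 90% confidence / 90% reliability example from the post:
print(success_run_n(0.90, 0.90))  # 22 samples, all of which must pass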
Hi @Scott, I haven't done any protocols for medical devices, but in the past, working in clinical settings, we planned various protocols and proposals. We mostly used a G*Power analysis to determine the number of subjects (samples) needed for a protocol (power around 80%) so that the data would be reliable, and we used a 95% confidence level.
Did you use G*Power for your protocol? If you did, what's a good/typical power to go by?
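For what it's worth, the same a-priori calculation G*Power performs can be sketched in Python with statsmodels; the medium effect size (Cohen's d = 0.5) here is just an assumed placeholder:

```python
from statsmodels.stats.power import TTestIndPower

# A-priori power analysis for a two-sample t-test, like the G*Power
# workflow described above. The effect size is an assumption.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,  # assumed Cohen's d
                                   alpha=0.05,       # 95% confidence
                                   power=0.80)       # 80% power
print(f"Subjects needed per group: {n_per_group:.1f}")  # ~64
```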
In agreement with Scott's comment about the extent of the validation required, I have experience with similar guidelines in my position. First, a standard is set for the concentration of a known analyte that the immunoassay instrument must be able to detect in its optimal operating state. Multiple R&D instruments are repeatedly built and tested for accuracy to get a good representation of the range of values deemed acceptable. Then, through statistics, acceptance guidelines are established for a 'working' instrument. As the instruments enter their production phase, patterns in the pass/fail results for each of the hundreds of analytes begin to appear. If a particular analyte shows a consistently high pass rate, the number of replicate tests in the validation can be lowered. If a problem is deemed more common (based on R&D's initial testing), the validation procedure will include many more replicates to ensure that all mistakes are caught when the instruments are built in the commercial phase.
It seems everyone in this thread agrees that the appropriate sample size for statistical analysis stems from your desired confidence and reliability levels. At the company I work for, we use our confidence and reliability to decide how we evaluate specific product requirements, based on whether the test we are performing is variable or attribute.
For attribute testing, the confidence and reliability levels get translated into a minimum number of samples to test and, out of those tested samples, how many failures you are allowed to have. For example, for a given confidence level "X" and reliability level "Y", the internal procedure might state that you need to test a minimum of 150 samples with no failures allowed. However, that would not be your only option: the procedure also lists higher sample sizes at which you are able to accept more failures. For instance, at the same confidence and reliability levels ("X" and "Y") you might have to test 320 samples to be able to accept one failure. It works as a sliding scale: for any given confidence and reliability level, the more samples you test above the minimum, the more failures you are allowed to accept.
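To make the sliding scale concrete, here is a rough Python sketch using a binomial model. The 95%/98% confidence/reliability values are illustrative placeholders, not our procedure's actual "X" and "Y", so the resulting sample sizes will not match the 150/320 figures above:

```python
from scipy.stats import binom

def attribute_n(confidence: float, reliability: float, failures: int) -> int:
    """Smallest n such that, if true reliability were only R, the chance
    of seeing <= `failures` failures is at most 1 - C (binomial model)."""
    n = failures + 1
    while binom.cdf(failures, n, 1 - reliability) > 1 - confidence:
        n += 1
    return n

C, R = 0.95, 0.98  # illustrative confidence / reliability targets
print(attribute_n(C, R, failures=0))  # 149 samples, zero failures allowed
print(attribute_n(C, R, failures=1))  # 236 samples, one failure allowed
```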
For variable testing, the confidence and reliability levels are translated into an acceptable capability value (Cpk or Ppk). Based on the mean, the standard deviation, and the specific product requirement, the capability can be determined. At my company, which produces high-volume disposable medical devices, the minimum sample size (as dictated by our corporate statistics department) is N = 30 samples in order to assess normality and capability. You can always test more if the timeline and cost of testing your product allow it, but N = 30 is the minimum. In my experience there is really no downside to testing more samples, whereas testing only 30 can sometimes pose issues even though it is statistically defensible.
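As a rough illustration of the variable-data evaluation, here is a minimal Python sketch that checks normality on an N = 30 sample and computes Cpk; the spec limits and measurements are made up:

```python
import numpy as np
from scipy import stats

# Hypothetical N=30 measurements and made-up spec limits.
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=0.15, size=30)
LSL, USL = 9.5, 10.5

# Check normality first (Shapiro-Wilk); Cpk assumes roughly normal data.
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk p-value: {p:.3f}")  # p > 0.05: no evidence against normality

mean, sd = data.mean(), data.std(ddof=1)
cpk = min(USL - mean, mean - LSL) / (3 * sd)
print(f"Cpk = {cpk:.2f}")
```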
There are several factors to consider when determining the appropriate sample size, including the risks associated with the product, the costs of producing it, and the costs of inspection, measurement, and testing. It is especially important to consider the risks associated with the product. The Bayes success-run theorem is one useful method for determining a risk-based sample size for process validations.
Now, the point raised above is that statistical sampling is based on the confidence and reliability levels, and these in turn depend on the risks associated with the product.
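To illustrate how risk can drive the choice, here is a sketch that applies the Bayes success-run form n = ln(1 - C) / ln(R) to a risk-based table of confidence/reliability targets. The specific pairs are assumptions for illustration; real values come from each company's procedure:

```python
import math

# Illustrative only: risk-based confidence/reliability pairs vary by
# company procedure. These values are assumptions, not a standard.
RISK_LEVELS = {
    "high":   (0.95, 0.99),  # (confidence, reliability)
    "medium": (0.95, 0.95),
    "low":    (0.90, 0.90),
}

for risk, (c, r) in RISK_LEVELS.items():
    n = math.ceil(math.log(1 - c) / math.log(r))
    print(f"{risk:>6}: {c:.0%}/{r:.0%} -> n = {n} (zero failures)")
# high: 299, medium: 59, low: 22
```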
I don't have any industry experience writing protocols, but I have done this while working in a laboratory, and for our study we used a G*Power analysis. Based on that, we decided how many subjects we needed for our study. Determining how many subjects are required for reliable data is very important, especially when you work in a small laboratory where the budget is very limited.
I have less experience writing protocols, but I looked into this online and found that sample sizes are calculated based on the magnitude of the effect the researcher would like to see in the treatment population (compared with placebo). It is important to note that variables such as prevalence, expected confidence level, and expected treatment effect need to be predetermined in order to calculate sample size. As an example, Scarborough states that "on the basis of a background mortality of 56% and an ability to detect a 20% or greater difference in mortality, the initial sample size of 660 patients was modified to 420 patients to detect a 30% difference after publication of the results of a European trial that showed a relative risk of death of 0.59 for corticosteroid treatment." Determining existing prevalence and effect size can be difficult in areas of research where such numbers are not readily available in the literature.
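As a sketch of this kind of calculation, the following Python snippet sizes a two-arm trial to detect an absolute drop in mortality from 56% to 36% with 80% power. The actual trial quoted above used its own assumptions and design, so this will not reproduce the 660/420 figures:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Assumed reading of the example: background mortality 56%, and we
# want to detect a 20-point absolute drop (to 36%) at alpha = 0.05.
es = proportion_effectsize(0.56, 0.36)  # Cohen's h for two proportions
n = NormalIndPower().solve_power(effect_size=es, alpha=0.05, power=0.80)
print(f"Patients needed per arm: {n:.0f}")
```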
Statistical tests I have used for research include the paired t-test and one-way ANOVA. Typically, results are judged statistically significant at a 95% confidence level (alpha = 0.05) or better. Whereas in industry you will be checking for defects in the product, I was looking to see whether my experiment showed any variable dependence. In the lab there was not much emphasis on how many trials we needed, because we had a limited number of trials available to run.
For biological statistics, t- and z-tests are useful for data analysis. You would want your power to be at least 80% for the test to be considered good; the higher the power, the better. You would use a z-test when you know the population standard deviation (sigma) and a t-test when you are estimating that value from the sample.
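A minimal sketch of that rule of thumb, with made-up data:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

# Fabricated measurements: 15 readings against a nominal value of 10.0.
rng = np.random.default_rng(1)
sample = rng.normal(loc=10.2, scale=0.5, size=15)

# t-test: sigma unknown, estimated from the sample (small-n default).
t_stat, t_p = stats.ttest_1samp(sample, popmean=10.0)
print(f"t-test p = {t_p:.3f}")

# z-test: uses the normal reference distribution, appropriate when
# sigma is known or the sample is large.
z_stat, z_p = ztest(sample, value=10.0)
print(f"z-test p = {z_p:.3f}")
```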
I had to draft the validation and protocol for two machines a few months ago. The plan used is known as Acceptable Quality Limit (AQL) sampling. It is a "statistical method that is used to test the quality level that would be accepted by estimating a characteristic of the product population through a sample". Other tests used include the t-test, ANOVA, and more.
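The intuition behind an AQL plan can be sketched with a binomial model: for a given sample size n and acceptance number c (which would normally come from a standard such as ANSI/ASQ Z1.4; the values below are assumed), you can compute the chance of accepting a lot at various true defect rates:

```python
from scipy.stats import binom

# Assumed plan for illustration: inspect 125 units, accept the lot
# if 3 or fewer defects are found.
n, c = 125, 3
for p in (0.01, 0.025, 0.05):
    print(f"defect rate {p:.1%}: P(accept) = {binom.cdf(c, n, p):.2f}")
```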
I haven't done statistical tests in industry, but I can speak to academic tests I have done. In addition to paired t-tests and one-way ANOVAs, mixed ANOVAs and repeated-measures ANOVAs are good tests for determining significance between factors; they can take into account within-subject and between-subject factors. A 95% confidence level is usually used to determine significance.
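For anyone curious, a repeated-measures ANOVA can be run in Python with statsmodels' AnovaRM; the data below are fabricated, and note that AnovaRM handles within-subject factors only:

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical data: 6 subjects each measured under 3 conditions.
data = pd.DataFrame({
    "subject": [s for s in range(6) for _ in range(3)],
    "condition": ["A", "B", "C"] * 6,
    "score": [5.1, 5.9, 7.2, 4.8, 6.1, 6.9, 5.5, 5.7, 7.5,
              4.9, 6.3, 7.0, 5.2, 6.0, 7.1, 5.0, 5.8, 7.3],
})

# Repeated-measures ANOVA: 'condition' is the within-subject factor.
result = AnovaRM(data, depvar="score", subject="subject",
                 within=["condition"]).fit()
print(result)
```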
I have very little experience on this topic, so it was very good and informative to read your comments and experiences designing statistical parameters for testing controls. The only related experience I have is meeting manufacturer standards during validation of medical devices at a specific location in a laboratory. Although this was not for the design of a new device, we tested different paraffin-embedded breast tissue samples in order to validate the reproducibility of the device in delivering the correct recurrence score for breast cancer in each case. We were also allowed only a 5% deviation from the initial recurrence score.
From the comments above I can see that the sample size was determined by the parameters being tested, and it is surprising to see that a power as low as 80% was allowed in some cases. That seems low as a standard for approving these devices. Would this mean that 1 in every 5 will fail? If we scale up product use, would that mean 200 out of 1,000 people have a faulty device? I do not see how this can be allowed.
In a research lab setting, we commonly look at previous studies similar to our proposed study and calculate the effect size for that study (which is sometimes stated directly in the research paper). First, it is important to decide which statistical test you will use (t-test, ANOVA, etc.), and then find a similar previous study that also used that test. Based on the previous study's effect size and statistical test, we use G*Power to calculate the sample size we would need for at least 80% statistical power. If the effect size is unknown, one could assume a small effect size. Lastly, it is common to add a few extra samples in case some do not complete the study.
Link to download G*Power software (free): https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower.html
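Here is a small Python sketch of that workflow: compute Cohen's d from a prior study's reported means and SDs (the numbers are made up) and feed it into a power analysis, which is the same calculation G*Power performs:

```python
import math
from statsmodels.stats.power import TTestIndPower

# Hypothetical prior-study statistics (means, SDs, group sizes).
m1, sd1, n1 = 24.0, 5.0, 20
m2, sd2, n2 = 20.5, 6.0, 20

# Cohen's d using the pooled standard deviation.
pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
d = (m1 - m2) / pooled_sd

# Sample size per group for 80% power at alpha = 0.05.
n = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
print(f"Cohen's d = {d:.2f}; need ~{math.ceil(n)} subjects per group")
```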
When I was working as a biomedical engineering technician for an international company, one of my customers was using an ELISA device irresponsibly: they were not doing the daily cleaning and maintenance procedures, which caused the device to return false-positive results and eventually burned out the motors of some of its modules. Although the device functioned after my first repair attempt, it was still giving false-positive results. So I checked, repaired, or replaced every module that had exposure to patient fluids. After that, to make sure we no longer got false positives, I stayed for a month, arriving at the hospital every morning and leaving every night with the lab technicians, recording everything they did with the device and correcting their usage from time to time. During that time, I collected the results from all of the patients, plus my own test samples whose results I already knew (either positive or negative, used as my control group), and ran one-sample and two-sample t-tests to validate the results of the device's testing procedure. After making sure the device was functional, I returned to my home city and my office.
I have also used reliability and confidence to determine sample sizes for testing during validations. I have also used AQL (acceptable quality limit) sampling when doing additional testing to mitigate risks observed with processed nonconforming products, and we use AQL sampling on incoming lots of materials for processing. All our material lots are inspected and in some cases tested, so AQL is used to decide how many to test.