Investigation of cutting disc (in)efficiency via ANOVA

Hot-rolled steel bars must be cut to the lengths required by the customer. Different cutting disc types can be used for this process. By means of ANalysis Of VAriance (ANOVA), we are going to figure out which of three selected cutting discs has the best efficiency, or rather the biggest inefficiency. The inefficiency is measured as the product of price and wear and tear of the cutting discs.

The effect of the following independent variables / process parameters (x-values) on the inefficiency, i.e. the product of price and wear and tear (dependent y-value), of the cutting discs will be investigated:

  • Disc type

  • Steel bar temperature

  • Circumferential speed of cutting machine

  • Feed speed of cutting machine

  • Carbon content of the steel bars

  • Area of the steel bars

  • Diameter of the steel bars

Get data / Visualize inefficiency of cutting discs

# load packages
options(contrasts = c("contr.sum", "contr.poly"))
library(tidyverse)
library(scales)
library(reshape2)
library(car)
library(doParallel)
library(foreach)

# Load data set
data.cut <- readRDS("data.cut.rds") 

# Plot data to see what kind of data we have
data.cut %>% 
  ggplot(aes(disc.type, inefficiency, fill = disc.type)) + 
  geom_boxplot(outliers = F)  +
  theme_bw(base_size = 20) + 
  xlab("Disc type") +
  scale_y_continuous("price * wear and tear / inefficiency [€mm]", breaks = pretty_breaks(n = 5)) + 
  stat_summary(fun = mean, geom = "point", shape = 20, size = 5, color = "blue", position = position_dodge(0.75), show.legend = F) +
  coord_flip() +
  theme(
    legend.position = "none"
  )
Figure 1: Inefficiency of disc types

Disc B seems to have the biggest inefficiency. The question now is: Is this difference significant? As a first step, we will look at confidence intervals.

Confidence interval

In Figure 2 you can see the mean values of the product of price and wear and tear and the corresponding 95% confidence intervals for the three disc types. There are no overlaps between the confidence intervals of the inefficiencies (y-value) of the disc types; it seems that the disc types differ (significantly) from each other. We will verify this later via ANOVA!

# Calculate confidence interval
library(Rmisc)
data.cut.confidence <- 
  data.cut %>%
  group_by(disc.type) %>%
  dplyr::summarise(avg_inefficiency = mean(inefficiency), 
                   # Rmisc::CI() returns the upper limit, mean and lower limit (in that order)
                   uci_inefficiency = CI(inefficiency)[1], 
                   lci_inefficiency = CI(inefficiency)[3]) 

# Plot confidence interval
ggplot(data = data.cut.confidence, aes(x = disc.type, y = avg_inefficiency, color = disc.type)) +
  geom_point(size = 7) +
  geom_errorbar(aes(ymin=uci_inefficiency, ymax=lci_inefficiency), size = 1, width=.2, position=position_dodge(0.05)) +
  xlab("Disc type") + ylab("price * wear and tear / inefficiency [€mm]") +
  scale_y_continuous(breaks = pretty_breaks(n = 5)) +
  theme_bw(base_size = 20) + 
  coord_flip() +
  theme(
    legend.position = "none"
    )
Figure 2: Confidence interval (95%) of inefficiency for different disc types
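
For reference, the same 95% confidence intervals can also be computed without Rmisc via the t-distribution. The following is only a sketch, assuming the column names used above, and is not part of the original analysis:

# Alternative: 95% confidence intervals per disc type via the t-distribution
data.cut %>%
  group_by(disc.type) %>%
  dplyr::summarise(n   = n(),
                   avg = mean(inefficiency),
                   se  = sd(inefficiency) / sqrt(n),
                   lci_inefficiency = avg - qt(0.975, df = n - 1) * se,
                   uci_inefficiency = avg + qt(0.975, df = n - 1) * se)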

Convert continuous data to categorical data

ANOVA requires a continuous dependent variable and categorical independent variables. We will therefore divide all continuous independent variables into four categories (quartiles); a more compact, data-driven variant of this conversion is sketched after the code below.

# Calculate quartile for bar area 
q.bar.area <- quantile(data.cut$area)
# Create new column for the different (four) categories of bar area 
data.cut$area.factor <- cut(data.cut$area, c(-Inf, 1982.594, 4190.840, 8557.842, Inf),
                           labels = c("1.Quartil", "2.Quartil", "3.Quartil", "4.Quartil"),
                           right = F, ordered_result = T)

# Calculate quartile for bar temperature 
q.temp <- quantile(data.cut$temperature) 
# Create new column for the different (four) categories of bar temperature
data.cut$temp.factor <- cut(data.cut$temperature, c(-Inf, 233.0, 314.0, 368.0, Inf),
                          labels = c("1.Quartil", "2.Quartil", "3.Quartil", "4.Quartil"),
                          right = F, ordered_result = T)
# Calculate quartile for bar diameter
q.diameter.factor <- quantile(data.cut$diameter)
# Create new column for the different (four) categories of bar diameter
data.cut$diameter.factor <- cut(data.cut$diameter, c(-Inf, 60, 80, 95, Inf),
                         labels = c("1.Quartil", "2.Quartil", "3.Quartil", "4.Quartil"),
                         right = F, ordered_result = T)

# Calculate quartile for circumferential speed of cutting machine 
q.circum.factor <- quantile(data.cut$circumferential.speed)
# Create new column for the different (four) categories of circumferential speed
data.cut$circum.factor <- cut(data.cut$circumferential.speed, c(-Inf, 90, 96, 98, Inf),
                            labels = c("1.Quartil", "2.Quartil", "3.Quartil", "4.Quartil"),
                            right = F, ordered_result = T)

# Calculate quartile for feed speed of cutting machine 
q.feed.factor <- quantile(data.cut$feed.speed)
# Create new column for the different (four) categories of feed speed
data.cut$feed.factor <- cut(data.cut$feed.speed, c(-Inf, 13, 14, 15, Inf),
                              labels = c("1.Quartil", "2.Quartil", "3.Quartil", "4.Quartil"),
                              right = F, ordered_result = T)

# Calculate quartile for carbon content of steel bars  
q.carb.factor <- quantile(data.cut$carbon)
# Create new column for the different (four) categories of carbon content of steel bars
data.cut$carb.factor <- cut(data.cut$carbon, c(-Inf, 0.186, 0.365, 0.483, Inf),
                          labels = c("1.Quartil", "2.Quartil", "3.Quartil", "4.Quartil"),
                          right = F, ordered_result = T)
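
The breakpoints above were hard-coded from the output of quantile(). A more compact, data-driven alternative is to pass the computed quartiles directly to cut(); the following helper is only a sketch (assuming the same column names and labels as above), not part of the original post:

# Sketch: build the quartile factor directly from the computed quantiles
make.quartile.factor <- function(x) {
  cut(x, breaks = c(-Inf, quantile(x, probs = c(0.25, 0.5, 0.75)), Inf),
      labels = c("1.Quartil", "2.Quartil", "3.Quartil", "4.Quartil"),
      right = F, ordered_result = T)
}
# Example: equivalent to the manual breakpoints for the bar area (up to rounding)
data.cut$area.factor <- make.quartile.factor(data.cut$area)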

Levene test

We perform a Levene test to check the homogeneity of the variances (in the population) between the three groups (cutting discs).

Homogeneity of variances is a necessary condition for performing ANOVA. According to the Levene test (p < 0.05), the variances in the population are not equal (not homogeneous); therefore this essential requirement for ANOVA is not fulfilled. In this case there are alternatives such as the Welch test for one-way (single-factor) ANOVA, but for multi-factor ANOVA there are no such alternatives.

Nevertheless, ANOVA is a robust method for detecting significant differences in mean values between groups (factor levels) if the group sizes are equal, or at least larger than 10 observations per group, even when the population variances are not equal.
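
Whether this robustness condition plausibly holds here can be checked quickly via the group sizes (a quick check, not part of the original post):

# Quick check: number of observations per disc type
table(data.cut$disc.type)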

# Perform levene test
system.time({
  registerDoParallel(cores = (detectCores() - 1))
  getDoParWorkers()
  getDoParName()
data.levene <- leveneTest(inefficiency ~ disc.type * area.factor * temp.factor * diameter.factor *
                            circum.factor * feed.factor * carb.factor, data.cut)
  registerDoSEQ()
})
       user      system     elapsed 
      19.78        0.25       20.65 
# save levene test results to avoid long recalculation times
# saveRDS(data.levene, "data.levene.rds")

# load results of levene test
# data.levene <- readRDS("data.levene.rds")

# p < 0.05 --> variances are not homogeneous! 

ANOVA

According to the ANOVA, the disc type (one of our x-values) has a significant influence on the inefficiency (y-value, price * wear and tear). By means of a post-hoc test we can figure out which cutting discs differ significantly from each other.

# Perform ANOVA (takes a long time due to the many factors) - alternatively you can run it with only disc.type, area.factor and circum.factor!
#system.time({
#  registerDoParallel(cores = (detectCores() - 1))
#  getDoParWorkers()
#  getDoParName()
#  data.anova.res <- aov(inefficiency ~ disc.type * area.factor * temp.factor *
#                          diameter.factor * circum.factor * feed.factor * 
#                          carb.factor, data.cut)
#  registerDoSEQ()
#  })
 
# save ANOVA results to avoid long recalculation times
# saveRDS(data.anova.res, "data.anova.res.rds")
 
# load results of ANOVA
data.anova.res <- readRDS("data.anova.res.rds")

# summary of ANOVA
anova.res.summary <- summary(data.anova.res)

# show ANOVA results
# anova.res.summary
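
The full summary with seven factors and their interactions is very long. As a sketch (not shown in the original post), the significant terms can be extracted from the ANOVA table like this:

# Sketch: list the terms of the ANOVA table with p < 0.05
anova.table <- summary(data.anova.res)[[1]]
trimws(rownames(anova.table)[which(anova.table[["Pr(>F)"]] < 0.05)])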

Post-hoc test

Because the groups do not have the same variance in the population (according to the Levene test), we apply the Games-Howell post-hoc test. According to the Games-Howell test, the inefficiencies (y-value, price * wear and tear) of all three cutting discs differ significantly from each other!

# load needed library for the post-hoc test
library(rstatix, warn.conflicts = F)

# Perform post-hoc test
data.post.hoc <- games_howell_test(data.cut, inefficiency ~ disc.type)

# Or read the results of post-hoc test
# data.post.hoc <- readRDS("post.hoc.rds")

Visualize post-hoc results

In Figure 3, the x-axis shows the mean differences in inefficiency between the cutting discs together with their confidence intervals, and the y-axis shows which disc-type pairs they refer to.

For instance, the mean difference in inefficiency between cutting discs C and B is -8540 €mm (mean inefficiency of C minus mean inefficiency of B), and its confidence interval does not include zero (black dashed line on the x-axis).

None of the confidence intervals overlaps zero on the x-axis; this means the cutting discs differ significantly with regard to inefficiency.

As we already saw in the boxplot (Figure 1), cutting disc B is the most inefficient disc type! The most efficient cutting disc is C!

But which of our seven independent variables / parameters has the largest effect on the inefficiency of the cutting discs? We can answer this by means of the effect size eta-squared.

# Rename columns 4 to 7 of the Games-Howell output (estimate, conf.low,
# conf.high, p.adj) to mean, min, max and p for the plot
colnames(data.post.hoc)[4:7] <- c("mean",
                                  "min",
                                  "max",
                                  "p")


# Create data frame for graphical depiction
data.post.hoc <- data.frame(id = c("B - A", "C - A", "C - B"),
                mean = data.post.hoc$mean,
                min = data.post.hoc$min,
                max = data.post.hoc$max,
                idt = ifelse(data.post.hoc$p < 0.05,
                           "significant",
                           "not significant")
                )

# Plot post-hoc results - significant differences in red, otherwise green
ggplot(data.post.hoc, aes(id, color = idt)) +
  geom_point(aes(x = id, y = mean), size = 4, show.legend = F) +
  geom_errorbar(aes(ymin = min,
                    ymax = max),
                width = 0.5,
                size = 1.25) +
  labs(x = NULL, color = NULL) +
  scale_color_manual(values = c("significant" = "red", "not significant" = "green")) +
  scale_y_continuous("Delta inefficiency [€mm]", breaks = pretty_breaks(n = 5)) +
  coord_flip() +
  theme_bw(base_size = 20) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black", size = 1) +
  theme(
    legend.title = element_blank(),
    legend.position = c(.95, 0.95),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(3, 3, 3, 3)
    )
Figure 3: Post-Hoc results

Eta-squared (effect size)

According to eta-squared, the area of the cut bars has the biggest influence on the (in)efficiency of the cutting discs! But the disc type and the circumferential speed of the cutting machine also have an, albeit small, effect on the (in)efficiency.

Figure 4: Eta squared
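
The original post does not show the eta-squared computation itself. As a sketch, the effect sizes can be derived from the saved ANOVA model as each term's share of the total sum of squares:

# Sketch: eta-squared per term = SS(term) / SS(total)
anova.table <- summary(data.anova.res)[[1]]
eta.squared <- data.frame(term   = trimws(rownames(anova.table)),
                          eta.sq = anova.table[["Sum Sq"]] / sum(anova.table[["Sum Sq"]]))
# Sort descending; the "Residuals" row is the unexplained share of the variance
eta.squared[order(-eta.squared$eta.sq), ]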