데이터프레임과 티블

library(tidyverse)

- 길이가 50인 벡터생성

A = rnorm(10*5)
A ## 길이가 50인 벡터에 정규분포에 추출한 값
 [1]  1.17767702  2.73106894 -1.35304458 -0.03276446  0.53369832  0.49395437
 [7] -0.57098744  0.08639052  0.13038589 -0.83919513  0.63070688  0.79315491
[13] -0.88411918 -0.89302478 -1.73600236 -1.03698473 -0.01582456  0.96717754
[19] -0.22076719  1.01093743  0.95756614 -1.06746861 -0.80694017  0.40289417
[25]  0.03584857  0.07879276 -1.20817133 -0.67848560 -0.99625585  0.99035758
[31] -0.30129979  0.63455394  0.59275899 -0.02683564  0.36659406 -0.39990796
[37]  0.03857880 -0.94117668 -0.52508619  0.68735157 -0.11328093 -0.62775891
[43]  0.90407099  0.45493649 -0.32498564  0.11173280 -0.18886705 -0.30540003
[49]  1.84845946 -0.17473138

- 매트릭스로 변경

dim(A) = c(10,5) 
A
      [,1]        [,2]        [,3]        [,4]        [,5]      
 [1,]  1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809
 [2,]  2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589
 [3,] -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710
 [4,] -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365
 [5,]  0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856
 [6,]  0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328
 [7,] -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670
 [8,]  0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000
 [9,]  0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595
[10,] -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314

- 티블로 변경 (자동으로 V1,...,V5 열의 이름이 지정됨)

A = as_tibble(A) 
A
Warning message:
“The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
   V1          V2          V3          V4          V5        
1   1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809
2   2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710
4  -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365
5   0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856
6   0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328
7  -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000
9   0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595
10 -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314

- 컬럼의 이름 확인

colnames(A)
[1] "V1" "V2" "V3" "V4" "V5"

- 칼럼의 이름을 바꾸고 싶다면?

colnames(A) = c('X1','X2','X3','X4','X5') 
A
   X1          X2          X3          X4          X5        
1   1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809
2   2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710
4  -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365
5   0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856
6   0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328
7  -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000
9   0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595
10 -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314

select()

- 처음 3개의 칼럼과 마지막 1개의 칼럼을 선택하고 싶다면?

A %>% select(1:3,5)
   X1          X2          X3          X5        
1   1.17767702  0.63070688  0.95756614 -0.1132809
2   2.73106894  0.79315491 -1.06746861 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.9040710
4  -0.03276446 -0.89302478  0.40289417  0.4549365
5   0.53369832 -1.73600236  0.03584857 -0.3249856
6   0.49395437 -1.03698473  0.07879276  0.1117328
7  -0.57098744 -0.01582456 -1.20817133 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.3054000
9   0.13038589 -0.22076719 -0.99625585  1.8484595
10 -0.83919513  1.01093743  0.99035758 -0.1747314

- 열의 이름을 통하여서도 선택할 수 있음

A %>% select(X1:X3,X5)
   X1          X2          X3          X5        
1   1.17767702  0.63070688  0.95756614 -0.1132809
2   2.73106894  0.79315491 -1.06746861 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.9040710
4  -0.03276446 -0.89302478  0.40289417  0.4549365
5   0.53369832 -1.73600236  0.03584857 -0.3249856
6   0.49395437 -1.03698473  0.07879276  0.1117328
7  -0.57098744 -0.01582456 -1.20817133 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.3054000
9   0.13038589 -0.22076719 -0.99625585  1.8484595
10 -0.83919513  1.01093743  0.99035758 -0.1747314

- 열의 이름과 위치인덱스를 혼합하여 사용할 수 있음

A %>% select(X1:X3, 5)
   X1          X2          X3          X5        
1   1.17767702  0.63070688  0.95756614 -0.1132809
2   2.73106894  0.79315491 -1.06746861 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.9040710
4  -0.03276446 -0.89302478  0.40289417  0.4549365
5   0.53369832 -1.73600236  0.03584857 -0.3249856
6   0.49395437 -1.03698473  0.07879276  0.1117328
7  -0.57098744 -0.01582456 -1.20817133 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.3054000
9   0.13038589 -0.22076719 -0.99625585  1.8484595
10 -0.83919513  1.01093743  0.99035758 -0.1747314

- 열의 이름과 위치인덱스를 혼합하여 사용할 수 있음 (2)

A %>% select(1:3, X5)
   X1          X2          X3          X5        
1   1.17767702  0.63070688  0.95756614 -0.1132809
2   2.73106894  0.79315491 -1.06746861 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.9040710
4  -0.03276446 -0.89302478  0.40289417  0.4549365
5   0.53369832 -1.73600236  0.03584857 -0.3249856
6   0.49395437 -1.03698473  0.07879276  0.1117328
7  -0.57098744 -0.01582456 -1.20817133 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.3054000
9   0.13038589 -0.22076719 -0.99625585  1.8484595
10 -0.83919513  1.01093743  0.99035758 -0.1747314

- 선택한 결과를 다른 변수에 저장가능

A %>% select(1:3, X5) -> B 
B
   X1          X2          X3          X5        
1   1.17767702  0.63070688  0.95756614 -0.1132809
2   2.73106894  0.79315491 -1.06746861 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.9040710
4  -0.03276446 -0.89302478  0.40289417  0.4549365
5   0.53369832 -1.73600236  0.03584857 -0.3249856
6   0.49395437 -1.03698473  0.07879276  0.1117328
7  -0.57098744 -0.01582456 -1.20817133 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.3054000
9   0.13038589 -0.22076719 -0.99625585  1.8484595
10 -0.83919513  1.01093743  0.99035758 -0.1747314

select(-c())

- A라는 자료를 다시 살펴보자.

A
   X1          X2          X3          X4          X5        
1   1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809
2   2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710
4  -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365
5   0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856
6   0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328
7  -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000
9   0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595
10 -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314

- 특정열을 제외하고 선택

A %>% select(-(X1:X3))
   X4          X5        
1  -0.30129979 -0.1132809
2   0.63455394 -0.6277589
3   0.59275899  0.9040710
4  -0.02683564  0.4549365
5   0.36659406 -0.3249856
6  -0.39990796  0.1117328
7   0.03857880 -0.1888670
8  -0.94117668 -0.3054000
9  -0.52508619  1.8484595
10  0.68735157 -0.1747314
A %>% select(-(1:3))
   X4          X5        
1  -0.30129979 -0.1132809
2   0.63455394 -0.6277589
3   0.59275899  0.9040710
4  -0.02683564  0.4549365
5   0.36659406 -0.3249856
6  -0.39990796  0.1117328
7   0.03857880 -0.1888670
8  -0.94117668 -0.3054000
9  -0.52508619  1.8484595
10  0.68735157 -0.1747314

select(...,start_with())

- 데이터 확인

A
   X1          X2          X3          X4          X5        
1   1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809
2   2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710
4  -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365
5   0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856
6   0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328
7  -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000
9   0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595
10 -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314

- 열의 이름 확인 및 변경

colnames(A)
[1] "X1" "X2" "X3" "X4" "X5"
colnames(A)= c('XY1','XY2','XZ1','XZ2','X5')
A
   XY1         XY2         XZ1         XZ2         X5        
1   1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809
2   2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710
4  -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365
5   0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856
6   0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328
7  -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000
9   0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595
10 -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314

- XY로 시작하는 열을 선택

A %>% select(starts_with('XY'))
   XY1         XY2        
1   1.17767702  0.63070688
2   2.73106894  0.79315491
3  -1.35304458 -0.88411918
4  -0.03276446 -0.89302478
5   0.53369832 -1.73600236
6   0.49395437 -1.03698473
7  -0.57098744 -0.01582456
8   0.08639052  0.96717754
9   0.13038589 -0.22076719
10 -0.83919513  1.01093743

select(...,ends_with())

- 데이터확인

A
   XY1         XY2         XZ1         XZ2         X5        
1   1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809
2   2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710
4  -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365
5   0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856
6   0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328
7  -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000
9   0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595
10 -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314

- 2로 끝나는 열을 선택

A %>% select(ends_with('2'))
   XY2         XZ2        
1   0.63070688 -0.30129979
2   0.79315491  0.63455394
3  -0.88411918  0.59275899
4  -0.89302478 -0.02683564
5  -1.73600236  0.36659406
6  -1.03698473 -0.39990796
7  -0.01582456  0.03857880
8   0.96717754 -0.94117668
9  -0.22076719 -0.52508619
10  1.01093743  0.68735157

select(...,contains())

- 데이터확인

A
   XY1         XY2         XZ1         XZ2         X5        
1   1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809
2   2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710
4  -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365
5   0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856
6   0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328
7  -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000
9   0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595
10 -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314

- Z를 포함하는 열을 선택

A %>% select(contains("Z"))
   XZ1         XZ2        
1   0.95756614 -0.30129979
2  -1.06746861  0.63455394
3  -0.80694017  0.59275899
4   0.40289417 -0.02683564
5   0.03584857  0.36659406
6   0.07879276 -0.39990796
7  -1.20817133  0.03857880
8  -0.67848560 -0.94117668
9  -0.99625585 -0.52508619
10  0.99035758  0.68735157

select(...,everything())

- 데이터확인

A
   XY1         XY2         XZ1         XZ2         X5        
1   1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809
2   2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710
4  -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365
5   0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856
6   0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328
7  -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000
9   0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595
10 -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314

- Z를 포함하는 열을 앞으로 이동시킴

A %>% select(contains("Z"),everything())
   XZ1         XZ2         XY1         XY2         X5        
1   0.95756614 -0.30129979  1.17767702  0.63070688 -0.1132809
2  -1.06746861  0.63455394  2.73106894  0.79315491 -0.6277589
3  -0.80694017  0.59275899 -1.35304458 -0.88411918  0.9040710
4   0.40289417 -0.02683564 -0.03276446 -0.89302478  0.4549365
5   0.03584857  0.36659406  0.53369832 -1.73600236 -0.3249856
6   0.07879276 -0.39990796  0.49395437 -1.03698473  0.1117328
7  -1.20817133  0.03857880 -0.57098744 -0.01582456 -0.1888670
8  -0.67848560 -0.94117668  0.08639052  0.96717754 -0.3054000
9  -0.99625585 -0.52508619  0.13038589 -0.22076719  1.8484595
10  0.99035758  0.68735157 -0.83919513  1.01093743 -0.1747314

filter()

- 데이터확인

A
   XY1         XY2         XZ1         XZ2         X5        
1   1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809
2   2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710
4  -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365
5   0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856
6   0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328
7  -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000
9   0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595
10 -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314

- X5>0인 행들만 뽑고싶다.

A %>% filter(X5>0)
  XY1         XY2        XZ1         XZ2         X5       
1 -1.35304458 -0.8841192 -0.80694017  0.59275899 0.9040710
2 -0.03276446 -0.8930248  0.40289417 -0.02683564 0.4549365
3  0.49395437 -1.0369847  0.07879276 -0.39990796 0.1117328
4  0.13038589 -0.2207672 -0.99625585 -0.52508619 1.8484595

- X5>0 이고 XY2>0 인 행들만 뽑고싶다.

A %>% filter(X5>0 & XY2>0)
     XY1 XY2 XZ1 XZ2 X5

- 위의 결과에서 XZ를 포함하는 열을 뽑고싶다.

A %>% filter(X5>0 & XY2>0) %>% select(contains("XZ"))
     XZ1 XZ2

mutate()

- 데이터를 확인

A
   XY1         XY2         XZ1         XZ2         X5        
1   1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809
2   2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710
4  -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365
5   0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856
6   0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328
7  -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000
9   0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595
10 -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314

- 새로운 변수추가

A %>% mutate(X6=abs(X5))
   XY1         XY2         XZ1         XZ2         X5         X6       
1   1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809 0.1132809
2   2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589 0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710 0.9040710
4  -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365 0.4549365
5   0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856 0.3249856
6   0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328 0.1117328
7  -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670 0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000 0.3054000
9   0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595 1.8484595
10 -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314 0.1747314

- 여러개의 새로운 변수를 추가하는 기능도 있음

A %>% mutate(X6=abs(X5),Z7=X5**2)
   XY1         XY2         XZ1         XZ2         X5         X6       
1   1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809 0.1132809
2   2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589 0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710 0.9040710
4  -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365 0.4549365
5   0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856 0.3249856
6   0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328 0.1117328
7  -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670 0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000 0.3054000
9   0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595 1.8484595
10 -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314 0.1747314
   Z7        
1  0.01283257
2  0.39408125
3  0.81734435
4  0.20696721
5  0.10561567
6  0.01248422
7  0.03567076
8  0.09326918
9  3.41680236
10 0.03053106

- 표준화

A %>% mutate(X6=abs(X5),Z7=X5**2,Z8=(Z7-mean(Z7))/sd(Z7))
   XY1         XY2         XZ1         XZ2         X5         X6       
1   1.17767702  0.63070688  0.95756614 -0.30129979 -0.1132809 0.1132809
2   2.73106894  0.79315491 -1.06746861  0.63455394 -0.6277589 0.6277589
3  -1.35304458 -0.88411918 -0.80694017  0.59275899  0.9040710 0.9040710
4  -0.03276446 -0.89302478  0.40289417 -0.02683564  0.4549365 0.4549365
5   0.53369832 -1.73600236  0.03584857  0.36659406 -0.3249856 0.3249856
6   0.49395437 -1.03698473  0.07879276 -0.39990796  0.1117328 0.1117328
7  -0.57098744 -0.01582456 -1.20817133  0.03857880 -0.1888670 0.1888670
8   0.08639052  0.96717754 -0.67848560 -0.94117668 -0.3054000 0.3054000
9   0.13038589 -0.22076719 -0.99625585 -0.52508619  1.8484595 1.8484595
10 -0.83919513  1.01093743  0.99035758  0.68735157 -0.1747314 0.1747314
   Z7         Z8        
1  0.01283257 -0.4756090
2  0.39408125 -0.1127605
3  0.81734435  0.2900747
4  0.20696721 -0.2908439
5  0.10561567 -0.3873039
6  0.01248422 -0.4759406
7  0.03567076 -0.4538731
8  0.09326918 -0.3990545
9  3.41680236  2.7640755
10 0.03053106 -0.4587647

group_by()

sex = c("M","M","M","F","F","F","F") 
value = c(1,2,3,10,20,30,40)
df = tibble(sex,value)
df
  sex value
1 M    1   
2 M    2   
3 M    3   
4 F   10   
5 F   20   
6 F   30   
7 F   40   
df %>% group_by(sex) %>% summarise(mean_value=mean(value))
  sex mean_value
1 F   25        
2 M    2        
  • 그룹별로 묶은 다음에 평균을 계산함

- 신기한데? 다른것도 실습해보자.

df %>% group_by(sex) %>% summarise(sqrt_value=sqrt(value))
`summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.

  sex sqrt_value
1 F   3.162278  
2 F   4.472136  
3 F   5.477226  
4 F   6.324555  
5 M   1.000000  
6 M   1.414214  
7 M   1.732051  
  • 의도한 결과는 아님. 이렇게 쓰는건 아니에요!

- 아래를 수행한 꼴임

df %>% mutate(sqrt_value=sqrt(value))
  sex value sqrt_value
1 M    1    1.000000  
2 M    2    1.414214  
3 M    3    1.732051  
4 F   10    3.162278  
5 F   20    4.472136  
6 F   30    5.477226  
7 F   40    6.324555  

- summarise에 쓸수 있는 함수꼴은 평균과 같이 $n$개의 벡터가 입력으로 올때 하나의 스칼라출력을 주는 함수모양이어야함

- 예를들면 아래는 가능

df %>% group_by(sex) %>% summarise(mean_sqrt_value=mean(sqrt(value)))
  sex mean_sqrt_value
1 F   4.859049       
2 M   1.382088       
(sqrt(1)+sqrt(2)+sqrt(3))/3
[1] 1.382088
(sqrt(10)+sqrt(20)+sqrt(30)+sqrt(40))/4
[1] 4.859049

- group_by + mutate

df %>% group_by(sex) %>% mutate(value2= value-mean(value))
  sex value value2
1 M    1     -1   
2 M    2      0   
3 M    3      1   
4 F   10    -15   
5 F   20     -5   
6 F   30      5   
7 F   40     15   

- 위의 코드는 아래와 동일하다.

df %>% filter(sex=='M') %>% mutate(value2 = value - mean(value))
  sex value value2
1 M   1     -1    
2 M   2      0    
3 M   3      1    
df %>% filter(sex=='F') %>% mutate(value2 = value - mean(value))
  sex value value2
1 F   10    -15   
2 F   20     -5   
3 F   30      5   
4 F   40     15   
df %>% filter(sex=='M') %>% mutate(value2 = value - mean(value)) -> A1
df %>% filter(sex=='F') %>% mutate(value2 = value - mean(value)) -> A2
A1
  sex value value2
1 M   1     -1    
2 M   2      0    
3 M   3      1    
A2
  sex value value2
1 F   10    -15   
2 F   20     -5   
3 F   30      5   
4 F   40     15   

- 두 데이터프레임을 합치면 된다. (어떻게..??)

rbind(A1,A2) # 아 몰라.. 매트릭스처럼 하면 되지 않을까?
  sex value value2
1 M    1     -1   
2 M    2      0   
3 M    3      1   
4 F   10    -15   
5 F   20     -5   
6 F   30      5   
7 F   40     15   
  • 된다... (떡밥)

NA

- NA 포함되어도 개별연산은 괜찮아요

x = c(1,2,NA,4,5) 
x
[1]  1  2 NA  4  5
x+1
[1]  2  3 NA  5  6
x**2
[1]  1  4 NA 16 25

- 그런데 summarise에 쓸 연산은 망가져요

mean(x)
[1] NA
max(x)
[1] NA

- 수정을 위해서

mean(x,na.rm=TRUE)
[1] 3
max(x,na.rm=TRUE)
[1] 5

- 데이터프레임에서도 비슷합니다.

tibble(x) %>% mutate(z=x+1)
  x  z 
1  1  2
2  2  3
3 NA NA
4  4  5
5  5  6
tibble(x) %>% summarise(z=mean(x))
  z 
1 NA
tibble(x) %>% summarise(z=mean(x,na.rm=TRUE))
  z
1 3

- NA를 무조건 없애는것이 좋은가?

x<- c(1,2,NA,4) 
y<- c(1,NA,3,4)
z<- c(NA,2,3,4)
w<- c(1,2,3,NA)
tibble(x,y,z,w)
  x  y  z  w 
1  1  1 NA  1
2  2 NA  2  2
3 NA  3  3  3
4  4  4  4 NA
tibble(x,y,z,w) %>% mutate(xy=x+y)
  x  y  z  w  xy
1  1  1 NA  1  2
2  2 NA  2  2 NA
3 NA  3  3  3 NA
4  4  4  4 NA  8
tibble(x,y,z,w) %>% mutate(xy=x+y) %>% summarise(mean_xy = mean(xy,na.rm=TRUE))
  mean_xy
1 5      
  • 그래도 NA를 살려두면 좋지않을까? 데이터 하나하나가 소중하니까요!

- 결측치를 살릴까 죽일까?

x=c(1,2,3,4,NA,5,6,7)
y=c(-1,-2,-3,-4,77,-5,-6,-7)
tibble(x,y)
  x  y 
1  1 -1
2  2 -2
3  3 -3
4  4 -4
5 NA 77
6  5 -5
7  6 -6
8  7 -7
tibble(x,y) %>% filter(x>3,x<7)
  x y 
1 4 -4
2 5 -5
3 6 -6
  • ?? y=77 은 죄없이 사라짐. 사실 보류로 봐야하지않나? NA는 3보다 크지 않지만 작지도 않음, 또한 7보다 크지 않지만 작지도 않음

- 결측치를 살리고 싶다.

tibble(x,y) %>% filter(is.na(x))
  x  y 
1 NA 77
tibble(x,y) %>% filter(is.na(x) | (x>3 & x<7))
  x  y 
1  4 -4
2 NA 77
3  5 -5
4  6 -6
  • 데이터는 소중함

실습

library(nycflights13)
df=flights
df
       year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1      2013 1     1   517      515             2         830      819          
2      2013 1     1   533      529             4         850      830          
3      2013 1     1   542      540             2         923      850          
4      2013 1     1   544      545            -1        1004     1022          
5      2013 1     1   554      600            -6         812      837          
6      2013 1     1   554      558            -4         740      728          
7      2013 1     1   555      600            -5         913      854          
8      2013 1     1   557      600            -3         709      723          
9      2013 1     1   557      600            -3         838      846          
10     2013 1     1   558      600            -2         753      745          
       arr_delay carrier flight tailnum origin dest air_time distance hour
1       11       UA      1545   N14228  EWR    IAH  227      1400     5   
2       20       UA      1714   N24211  LGA    IAH  227      1416     5   
3       33       AA      1141   N619AA  JFK    MIA  160      1089     5   
4      -18       B6       725   N804JB  JFK    BQN  183      1576     5   
5      -25       DL       461   N668DN  LGA    ATL  116       762     6   
6       12       UA      1696   N39463  EWR    ORD  150       719     5   
7       19       B6       507   N516JB  EWR    FLL  158      1065     6   
8      -14       EV      5708   N829AS  LGA    IAD   53       229     6   
9       -8       B6        79   N593JB  JFK    MCO  140       944     6   
10       8       AA       301   N3ALAA  LGA    ORD  138       733     6   
       minute time_hour          
1      15     2013-01-01 05:00:00
2      29     2013-01-01 05:00:00
3      40     2013-01-01 05:00:00
4      45     2013-01-01 05:00:00
5       0     2013-01-01 06:00:00
6      58     2013-01-01 05:00:00
7       0     2013-01-01 06:00:00
8       0     2013-01-01 06:00:00
9       0     2013-01-01 06:00:00
10      0     2013-01-01 06:00:00
 [ reached getOption("max.print") -- omitted 51 rows ]

- 1월1일의 모든 항공편을 선택하고싶다.

df %>% filter(month==1 & day==1)
    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1   2013 1     1   517      515             2         830      819          
2   2013 1     1   533      529             4         850      830          
3   2013 1     1   542      540             2         923      850          
4   2013 1     1   544      545            -1        1004     1022          
5   2013 1     1   554      600            -6         812      837          
6   2013 1     1   554      558            -4         740      728          
7   2013 1     1   555      600            -5         913      854          
8   2013 1     1   557      600            -3         709      723          
9   2013 1     1   557      600            -3         838      846          
10  2013 1     1   558      600            -2         753      745          
    arr_delay carrier flight tailnum origin dest air_time distance hour minute
1    11       UA      1545   N14228  EWR    IAH  227      1400     5    15    
2    20       UA      1714   N24211  LGA    IAH  227      1416     5    29    
3    33       AA      1141   N619AA  JFK    MIA  160      1089     5    40    
4   -18       B6       725   N804JB  JFK    BQN  183      1576     5    45    
5   -25       DL       461   N668DN  LGA    ATL  116       762     6     0    
6    12       UA      1696   N39463  EWR    ORD  150       719     5    58    
7    19       B6       507   N516JB  EWR    FLL  158      1065     6     0    
8   -14       EV      5708   N829AS  LGA    IAD   53       229     6     0    
9    -8       B6        79   N593JB  JFK    MCO  140       944     6     0    
10    8       AA       301   N3ALAA  LGA    ORD  138       733     6     0    
    time_hour          
1   2013-01-01 05:00:00
2   2013-01-01 05:00:00
3   2013-01-01 05:00:00
4   2013-01-01 05:00:00
5   2013-01-01 06:00:00
6   2013-01-01 05:00:00
7   2013-01-01 06:00:00
8   2013-01-01 06:00:00
9   2013-01-01 06:00:00
10  2013-01-01 06:00:00
 [ reached getOption("max.print") -- omitted 51 rows ]
df %>% filter(month==1 , day==1)
    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1   2013 1     1   517      515             2         830      819          
2   2013 1     1   533      529             4         850      830          
3   2013 1     1   542      540             2         923      850          
4   2013 1     1   544      545            -1        1004     1022          
5   2013 1     1   554      600            -6         812      837          
6   2013 1     1   554      558            -4         740      728          
7   2013 1     1   555      600            -5         913      854          
8   2013 1     1   557      600            -3         709      723          
9   2013 1     1   557      600            -3         838      846          
10  2013 1     1   558      600            -2         753      745          
    arr_delay carrier flight tailnum origin dest air_time distance hour minute
1    11       UA      1545   N14228  EWR    IAH  227      1400     5    15    
2    20       UA      1714   N24211  LGA    IAH  227      1416     5    29    
3    33       AA      1141   N619AA  JFK    MIA  160      1089     5    40    
4   -18       B6       725   N804JB  JFK    BQN  183      1576     5    45    
5   -25       DL       461   N668DN  LGA    ATL  116       762     6     0    
6    12       UA      1696   N39463  EWR    ORD  150       719     5    58    
7    19       B6       507   N516JB  EWR    FLL  158      1065     6     0    
8   -14       EV      5708   N829AS  LGA    IAD   53       229     6     0    
9    -8       B6        79   N593JB  JFK    MCO  140       944     6     0    
10    8       AA       301   N3ALAA  LGA    ORD  138       733     6     0    
    time_hour          
1   2013-01-01 05:00:00
2   2013-01-01 05:00:00
3   2013-01-01 05:00:00
4   2013-01-01 05:00:00
5   2013-01-01 06:00:00
6   2013-01-01 05:00:00
7   2013-01-01 06:00:00
8   2013-01-01 06:00:00
9   2013-01-01 06:00:00
10  2013-01-01 06:00:00
 [ reached getOption("max.print") -- omitted 51 rows ]

- 11월에 출발하거나 12월에 출발한 항공

df %>% filter(month==11 | month==12)
      year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1     2013 11    1     5      2359             6       352       345          
2     2013 11    1    35      2250           105       123      2356          
3     2013 11    1   455       500            -5       641       651          
4     2013 11    1   539       545            -6       856       827          
5     2013 11    1   542       545            -3       831       855          
6     2013 11    1   549       600           -11       912       923          
7     2013 11    1   550       600           -10       705       659          
8     2013 11    1   554       600            -6       659       701          
9     2013 11    1   554       600            -6       826       827          
10    2013 11    1   554       600            -6       749       751          
      arr_delay carrier flight tailnum origin dest air_time distance hour
1       7       B6       745   N568JB  JFK    PSE  205      1617     23  
2      87       B6      1816   N353JB  JFK    SYR   36       209     22  
3     -10       US      1895   N192UW  EWR    CLT   88       529      5  
4      29       UA      1714   N38727  LGA    IAH  229      1416      5  
5     -24       AA      2243   N5CLAA  JFK    MIA  147      1089      5  
6     -11       UA       303   N595UA  JFK    SFO  359      2586      6  
7       6       US      2167   N748UW  LGA    DCA   57       214      6  
8      -2       US      2134   N742PS  LGA    BOS   40       184      6  
9      -1       DL       563   N912DE  LGA    ATL  126       762      6  
10     -2       DL       731   N315NB  LGA    DTW   93       502      6  
      minute time_hour          
1     59     2013-11-01 23:00:00
2     50     2013-11-01 22:00:00
3      0     2013-11-01 05:00:00
4     45     2013-11-01 05:00:00
5     45     2013-11-01 05:00:00
6      0     2013-11-01 06:00:00
7      0     2013-11-01 06:00:00
8      0     2013-11-01 06:00:00
9      0     2013-11-01 06:00:00
10     0     2013-11-01 06:00:00
 [ reached getOption("max.print") -- omitted 51 rows ]

- 아래와 동일한 코드

df %>% filter(month %in% c(11,12))
      year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1     2013 11    1     5      2359             6       352       345          
2     2013 11    1    35      2250           105       123      2356          
3     2013 11    1   455       500            -5       641       651          
4     2013 11    1   539       545            -6       856       827          
5     2013 11    1   542       545            -3       831       855          
6     2013 11    1   549       600           -11       912       923          
7     2013 11    1   550       600           -10       705       659          
8     2013 11    1   554       600            -6       659       701          
9     2013 11    1   554       600            -6       826       827          
10    2013 11    1   554       600            -6       749       751          
      arr_delay carrier flight tailnum origin dest air_time distance hour
1       7       B6       745   N568JB  JFK    PSE  205      1617     23  
2      87       B6      1816   N353JB  JFK    SYR   36       209     22  
3     -10       US      1895   N192UW  EWR    CLT   88       529      5  
4      29       UA      1714   N38727  LGA    IAH  229      1416      5  
5     -24       AA      2243   N5CLAA  JFK    MIA  147      1089      5  
6     -11       UA       303   N595UA  JFK    SFO  359      2586      6  
7       6       US      2167   N748UW  LGA    DCA   57       214      6  
8      -2       US      2134   N742PS  LGA    BOS   40       184      6  
9      -1       DL       563   N912DE  LGA    ATL  126       762      6  
10     -2       DL       731   N315NB  LGA    DTW   93       502      6  
      minute time_hour          
1     59     2013-11-01 23:00:00
2     50     2013-11-01 22:00:00
3      0     2013-11-01 05:00:00
4     45     2013-11-01 05:00:00
5     45     2013-11-01 05:00:00
6      0     2013-11-01 06:00:00
7      0     2013-11-01 06:00:00
8      0     2013-11-01 06:00:00
9      0     2013-11-01 06:00:00
10     0     2013-11-01 06:00:00
 [ reached getOption("max.print") -- omitted 51 rows ]

- arr_delay <=120 이고 dep_delay <=120 인 항공편을 찾고 싶다.

df %>% filter(arr_delay <= 120, dep_delay<=120)
       year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1      2013 1     1   517      515             2         830      819          
2      2013 1     1   533      529             4         850      830          
3      2013 1     1   542      540             2         923      850          
4      2013 1     1   544      545            -1        1004     1022          
5      2013 1     1   554      600            -6         812      837          
6      2013 1     1   554      558            -4         740      728          
7      2013 1     1   555      600            -5         913      854          
8      2013 1     1   557      600            -3         709      723          
9      2013 1     1   557      600            -3         838      846          
10     2013 1     1   558      600            -2         753      745          
       arr_delay carrier flight tailnum origin dest air_time distance hour
1       11       UA      1545   N14228  EWR    IAH  227      1400     5   
2       20       UA      1714   N24211  LGA    IAH  227      1416     5   
3       33       AA      1141   N619AA  JFK    MIA  160      1089     5   
4      -18       B6       725   N804JB  JFK    BQN  183      1576     5   
5      -25       DL       461   N668DN  LGA    ATL  116       762     6   
6       12       UA      1696   N39463  EWR    ORD  150       719     5   
7       19       B6       507   N516JB  EWR    FLL  158      1065     6   
8      -14       EV      5708   N829AS  LGA    IAD   53       229     6   
9       -8       B6        79   N593JB  JFK    MCO  140       944     6   
10       8       AA       301   N3ALAA  LGA    ORD  138       733     6   
       minute time_hour          
1      15     2013-01-01 05:00:00
2      29     2013-01-01 05:00:00
3      40     2013-01-01 05:00:00
4      45     2013-01-01 05:00:00
5       0     2013-01-01 06:00:00
6      58     2013-01-01 05:00:00
7       0     2013-01-01 06:00:00
8       0     2013-01-01 06:00:00
9       0     2013-01-01 06:00:00
10      0     2013-01-01 06:00:00
 [ reached getOption("max.print") -- omitted 51 rows ]
df %>% filter(arr_delay <= 120 & dep_delay<=120)
       year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1      2013 1     1   517      515             2         830      819          
2      2013 1     1   533      529             4         850      830          
3      2013 1     1   542      540             2         923      850          
4      2013 1     1   544      545            -1        1004     1022          
5      2013 1     1   554      600            -6         812      837          
6      2013 1     1   554      558            -4         740      728          
7      2013 1     1   555      600            -5         913      854          
8      2013 1     1   557      600            -3         709      723          
9      2013 1     1   557      600            -3         838      846          
10     2013 1     1   558      600            -2         753      745          
       arr_delay carrier flight tailnum origin dest air_time distance hour
1       11       UA      1545   N14228  EWR    IAH  227      1400     5   
2       20       UA      1714   N24211  LGA    IAH  227      1416     5   
3       33       AA      1141   N619AA  JFK    MIA  160      1089     5   
4      -18       B6       725   N804JB  JFK    BQN  183      1576     5   
5      -25       DL       461   N668DN  LGA    ATL  116       762     6   
6       12       UA      1696   N39463  EWR    ORD  150       719     5   
7       19       B6       507   N516JB  EWR    FLL  158      1065     6   
8      -14       EV      5708   N829AS  LGA    IAD   53       229     6   
9       -8       B6        79   N593JB  JFK    MCO  140       944     6   
10       8       AA       301   N3ALAA  LGA    ORD  138       733     6   
       minute time_hour          
1      15     2013-01-01 05:00:00
2      29     2013-01-01 05:00:00
3      40     2013-01-01 05:00:00
4      45     2013-01-01 05:00:00
5       0     2013-01-01 06:00:00
6      58     2013-01-01 05:00:00
7       0     2013-01-01 06:00:00
8       0     2013-01-01 06:00:00
9       0     2013-01-01 06:00:00
10      0     2013-01-01 06:00:00
 [ reached getOption("max.print") -- omitted 51 rows ]
df %>% filter(!(arr_delay > 120 | dep_delay> 120))
       year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1      2013 1     1   517      515             2         830      819          
2      2013 1     1   533      529             4         850      830          
3      2013 1     1   542      540             2         923      850          
4      2013 1     1   544      545            -1        1004     1022          
5      2013 1     1   554      600            -6         812      837          
6      2013 1     1   554      558            -4         740      728          
7      2013 1     1   555      600            -5         913      854          
8      2013 1     1   557      600            -3         709      723          
9      2013 1     1   557      600            -3         838      846          
10     2013 1     1   558      600            -2         753      745          
       arr_delay carrier flight tailnum origin dest air_time distance hour
1       11       UA      1545   N14228  EWR    IAH  227      1400     5   
2       20       UA      1714   N24211  LGA    IAH  227      1416     5   
3       33       AA      1141   N619AA  JFK    MIA  160      1089     5   
4      -18       B6       725   N804JB  JFK    BQN  183      1576     5   
5      -25       DL       461   N668DN  LGA    ATL  116       762     6   
6       12       UA      1696   N39463  EWR    ORD  150       719     5   
7       19       B6       507   N516JB  EWR    FLL  158      1065     6   
8      -14       EV      5708   N829AS  LGA    IAD   53       229     6   
9       -8       B6        79   N593JB  JFK    MCO  140       944     6   
10       8       AA       301   N3ALAA  LGA    ORD  138       733     6   
       minute time_hour          
1      15     2013-01-01 05:00:00
2      29     2013-01-01 05:00:00
3      40     2013-01-01 05:00:00
4      45     2013-01-01 05:00:00
5       0     2013-01-01 06:00:00
6      58     2013-01-01 05:00:00
7       0     2013-01-01 06:00:00
8       0     2013-01-01 06:00:00
9       0     2013-01-01 06:00:00
10      0     2013-01-01 06:00:00
 [ reached getOption("max.print") -- omitted 51 rows ]

- 변수이름을 확인하자.

df %>% colnames
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"     
df %>% names
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"     
  • 왜 방법이 두개있을까? (떡밥)

- 변수가 너무 많아서 귀찮음. 몇개만 추리자.

  • year~day 는 포함
  • delay로 끝나는 변수들
  • distance, air_time
df %>% select(year:day, ends_with('delay'), distance, air_time)
       year month day dep_delay arr_delay distance air_time
1      2013 1     1    2         11       1400     227     
2      2013 1     1    4         20       1416     227     
3      2013 1     1    2         33       1089     160     
4      2013 1     1   -1        -18       1576     183     
5      2013 1     1   -6        -25        762     116     
6      2013 1     1   -4         12        719     150     
7      2013 1     1   -5         19       1065     158     
8      2013 1     1   -3        -14        229      53     
9      2013 1     1   -3         -8        944     140     
10     2013 1     1   -2          8        733     138     
11     2013 1     1   -2         -2       1028     149     
12     2013 1     1   -2         -3       1005     158     
13     2013 1     1   -2          7       2475     345     
14     2013 1     1   -2        -14       2565     361     
15     2013 1     1   -1         31       1389     257     
16     2013 1     1    0         -4        187      44     
17     2013 1     1   -1         -8       2227     337     
18     2013 1     1    0         -7       1076     152     
19     2013 1     1    0         12        762     134     
20     2013 1     1    1         -6       1023     147     
21     2013 1     1   -8         -8       1020     170     
22     2013 1     1   -3         16        502     105     
23     2013 1     1   -4        -12       1085     152     
24     2013 1     1   -4         -8        760     128     
25     2013 1     1    0        -17       1085     157     
26     2013 1     1    8         32        719     139     
27     2013 1     1   11         14       2586     366     
28     2013 1     1    3          4       1074     175     
 [ reached getOption("max.print") -- omitted 33 rows ]

- 아래의 수식을 이용해서 gain, speed를 계산하자.

  • gain = dep_delay - arr_delay
  • speed = distance / air_time
df %>% 
select(year:day, ends_with('delay'), distance, air_time) %>% 
mutate(gain = dep_delay - arr_delay, speed = distance / air_time)
       year month day dep_delay arr_delay distance air_time gain speed   
1      2013 1     1    2         11       1400     227       -9  6.167401
2      2013 1     1    4         20       1416     227      -16  6.237885
3      2013 1     1    2         33       1089     160      -31  6.806250
4      2013 1     1   -1        -18       1576     183       17  8.612022
5      2013 1     1   -6        -25        762     116       19  6.568966
6      2013 1     1   -4         12        719     150      -16  4.793333
7      2013 1     1   -5         19       1065     158      -24  6.740506
8      2013 1     1   -3        -14        229      53       11  4.320755
9      2013 1     1   -3         -8        944     140        5  6.742857
10     2013 1     1   -2          8        733     138      -10  5.311594
11     2013 1     1   -2         -2       1028     149        0  6.899329
12     2013 1     1   -2         -3       1005     158        1  6.360759
13     2013 1     1   -2          7       2475     345       -9  7.173913
14     2013 1     1   -2        -14       2565     361       12  7.105263
15     2013 1     1   -1         31       1389     257      -32  5.404669
16     2013 1     1    0         -4        187      44        4  4.250000
17     2013 1     1   -1         -8       2227     337        7  6.608309
18     2013 1     1    0         -7       1076     152        7  7.078947
19     2013 1     1    0         12        762     134      -12  5.686567
20     2013 1     1    1         -6       1023     147        7  6.959184
21     2013 1     1   -8         -8       1020     170        0  6.000000
22     2013 1     1   -3         16        502     105      -19  4.780952
 [ reached getOption("max.print") -- omitted 39 rows ]

- year, month, day 로 그룹핑을하고 평균속도와 평균gain을 계산해보자.

df %>% 
select(year:day, ends_with('delay'), distance, air_time) %>% 
mutate(gain = dep_delay - arr_delay, speed = distance / air_time) %>% 
group_by(year,month,day) %>% 
summarise(gain_mean=mean(gain),speed_mean=mean(speed))
`summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.

    year month day gain_mean speed_mean
1   2013 1      1  NA        NA        
2   2013 1      2  NA        NA        
3   2013 1      3  NA        NA        
4   2013 1      4  NA        NA        
5   2013 1      5  NA        NA        
6   2013 1      6  NA        NA        
7   2013 1      7  NA        NA        
8   2013 1      8  NA        NA        
9   2013 1      9  NA        NA        
10  2013 1     10  NA        NA        
11  2013 1     11  NA        NA        
12  2013 1     12  NA        NA        
13  2013 1     13  NA        NA        
14  2013 1     14  NA        NA        
15  2013 1     15  NA        NA        
16  2013 1     16  NA        NA        
17  2013 1     17  NA        NA        
18  2013 1     18  NA        NA        
19  2013 1     19  NA        NA        
20  2013 1     20  NA        NA        
21  2013 1     21  NA        NA        
22  2013 1     22  NA        NA        
23  2013 1     23  NA        NA        
24  2013 1     24  NA        NA        
25  2013 1     25  NA        NA        
26  2013 1     26  NA        NA        
27  2013 1     27  NA        NA        
28  2013 1     28  NA        NA        
29  2013 1     29  NA        NA        
30  2013 1     30  NA        NA        
⋮   ⋮    ⋮     ⋮   ⋮         ⋮         
336 2013 12     2  NA        NA        
337 2013 12     3  NA        NA        
338 2013 12     4  NA        NA        
339 2013 12     5  NA        NA        
340 2013 12     6  NA        NA        
341 2013 12     7  NA        NA        
342 2013 12     8  NA        NA        
343 2013 12     9  NA        NA        
344 2013 12    10  NA        NA        
 [ reached getOption("max.print") -- omitted 21 rows ]
  • 다 NA가 나옴..
df %>% 
select(year:day, ends_with('delay'), distance, air_time) %>% 
mutate(gain = dep_delay - arr_delay, speed = distance / air_time) %>% 
group_by(year,month,day) %>% 
summarise(gain_mean=mean(gain,na.rm=TRUE),speed_mean=mean(speed,na.rm=TRUE))
`summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.

    year month day gain_mean   speed_mean
1   2013 1      1  -1.2154031  5.956879  
2   2013 1      2   0.9849138  6.085977  
3   2013 1      3   5.1744444  6.163061  
4   2013 1      4  10.8986784  6.389652  
5   2013 1      5   7.2580195  6.259174  
6   2013 1      6   2.9095296  6.146544  
7   2013 1      7  10.3645161  6.488469  
8   2013 1      8   5.7858744  6.215331  
9   2013 1      9   2.5655095  6.021767  
10  2013 1     10   8.7438105  6.375353  
11  2013 1     11   7.5954198  6.326487  
12  2013 1     12  14.6093979  6.613691  
13  2013 1     13   4.6877323  6.353392  
14  2013 1     14  -0.8418202  6.018617  
15  2013 1     15  -0.3019296  5.918271  
16  2013 1     16  -9.7924971  5.819344  
17  2013 1     17   1.1923497  5.923112  
18  2013 1     18   4.9263736  6.207338  
19  2013 1     19  11.8882265  6.340853  
20  2013 1     20   3.0640205  6.098927  
21  2013 1     21   1.5232816  5.911828  
22  2013 1     22   0.2225989  5.966793  
23  2013 1     23   3.6925676  6.081496  
24  2013 1     24   4.1156388  6.305144  
25  2013 1     25  -5.3469852  6.177677  
26  2013 1     26   6.4619970  6.342779  
27  2013 1     27   9.6972705  6.371808  
28  2013 1     28   5.7794632  6.288730  
29  2013 1     29   9.1047181  6.297918  
30  2013 1     30   2.0226131  6.072084  
⋮   ⋮    ⋮     ⋮   ⋮           ⋮         
336 2013 12     2    8.6172345 6.538165  
337 2013 12     3    3.7637306 6.385837  
338 2013 12     4    5.5354497 6.374694  
339 2013 12     5    0.7886279 6.315847  
340 2013 12     6   -4.9424779 6.088300  
341 2013 12     7   -0.5210210 6.083830  
342 2013 12     8  -16.0335404 6.015119  
343 2013 12     9   -7.9688889 5.927017  
344 2013 12    10  -18.0433018 5.851833  
 [ reached getOption("max.print") -- omitted 21 rows ]

숙제

강화유리와 유리를 구분할 수 있는 유리 장인이 있다.

이 유리장인은 80퍼센트의 확률로 강화유리를 고른다.

총 10명의 참가자가 있고 이 참가자들은 (유리,강화유리)의 조합으로 이루어진 징검다리를 5번연속으로 건너야 한다.

아래의 경우에 참가자들은 평균적으로 몇명이 살아남겠는가?

(1) 일반인1 - 일반인2 - .... - 일반인9 - 유리장인 || (강화유리, 유리)

(2) 유리장인 - 일반인1 - 일반인2 - ... - 일반인9 || (강화유리, 유리)

1000번 시뮬레이션을 하여 결과를 추정하라.

(단, 일반인은 50%의 확률로 강화유리를 고를수 있다고 하자)


[예시] (1)의 시뮬레이션 결과가 아래와 같다고 하자.

  • 첫번째 징검다리: 유리장인이 강화유리 선택
  • 두번째 징검다리: 유리장인이 강화유리 선택
  • 세번째 징검다리: 유리장인이 일반유리 선택 $\to$ 유리장인 탈락 & 일반인9는 당연히 강화유리를 선택
  • 네번째 징검다리: 일반인9가 일반유리 선택 $\to$ 일반인9 탈락 & 일반인8은 당연히 강화유리 선택
  • 다섯번째 징검다리: 일반인8이 강화유리 선택

이 경우는 일반인8,일반인7, $\dots$, 일반인1이 살아남으므로 8명이 살아남는다.

[예시] (2)의 시뮬레이션 결과가 아래와 같다고 하자.

  • 첫번째 징검다리: 일반인9 일반유리 선택 $\to$ 일반인9 탈락 & 일반인8은 강화유리 선택
  • 두번째 징검다리: 일반인8 일반유리 선택 $\to$ 일반인8 탈락 & 일반인7은 강화유리 선택
  • 세번째 징검다리: 일반인7 일반유리 선택 $\to$ 일반인7 탈락 & 일반인6은 강화유리 선택
  • 네번째 징검다리: 일반인6 일반유리 선택 $\to$ 일반인6 탈락 & 일반인5는 강화유리 선택
  • 다섯번째 징검다리: 일반인5 일반유리 선택 $\to$ 일반인5 탈락 & 일반인4는 강화유리 선택

이 경우는 일반인4,일반인3,일반인2,일반인1,유리장인 이 살아남는다. (따라서 5명)

- 즉 살아남을수 있는 최대인원수는 10명이며 최소인원수는 5명이다.

- 유리장인이 100%의 확률로 강화유리를 구분한다면 (1)의 경우 항상 10명이 살아남는다. (즉 평균도 10명)

(1)

  • 10명이 전부 살아남는 경우 = $0.5^5$

- 9명이 살아남는 경우

  • 일반인1이 1번째 탈락 = $0.5^0 \times 0.5 \times 0.5^4$ = $0.5^5$
  • 일반인1이 2번째 탈락 = $0.5^1 \times 0.5 \times 0.5^3$ = $0.5^5$
  • 일반인1이 3번째 탈락 = $0.5^2 \times 0.5 \times 0.5^2$ = $0.5^5$
  • 일반인1이 4번째 탈락 = $0.5^3 \times 0.5 \times 0.5^1$ = $0.5^5$
  • 일반인1이 5번째 탈락 = $0.5^4 \times 0.5 \times 0.5^0$ = $0.5^5$

확률은 동일하게 $0.5^5$이므로, 경우가 중요

- 8 명이 살아남는 결우

  • 일반인1이 1번째 탈락 = $choose(4,1)\times 0.5^5$
    • 일반인2가 2번째 탈락
    • 일반인2가 3번째 탈락
    • 일반인2가 4번째 탈락
    • 일반인2가 5번째 탈락
  • 일반인1이 2번째 탈락 = $choose(3,1)\times 0.5^5$
    • 일반인2가 3번째 탈락
    • 일반인2가 4번째 탈락
    • 일반인2가 5번째 탈락
  • 일반인1이 3번째 탈락 = $choose(2,1)\times 0.5^5$
    • 일반인2가 4번째 탈락
    • 일반인2가 5번째 탈락
  • 일반인1이 4번째 탈락 = $choose(1,1) \times 0.5^5$
    • 일반인2가 5번째 탈락

- 7명이 살아남을 확률?

  • 일반인1이 1번째 탈락 = $choose(4,2)\times 0.5^5$
    • 일반인이 2,3번째 탈락
    • 일반인이 2,4번째 탈락
    • 일반인이 2,5번째 탈락
    • 일반인이 3,4번째 탈락
    • 일반인이 3,5번째 탈락
    • 일반인이 4,5번째 탈락
  • 일반인1이 2번째 탈락 = $choose(3,2)\times 0.5^5$
    • 일반인이 3,4번째 탈락
    • 일반인이 3,5번째 탈락
    • 일반인이 4,5번째 탈락
  • 일반인1이 3번째 탈락 = $choose(2,2)\times 0.5^5$
    • 일반인이 4,5번째 탈락

- 6명이 살아남을 확률?

  • 일반인1이 1번째 탈락 = $choose(4,3)\times 0.5^5$
    • 일반인이 2,3,4번째 탈락
    • 일반인이 2,3,5번째 탈락
    • 일반인이 2,4,5번째 탈락
    • 일반인이 3,4,5번째 탈락
  • 일반인1이 2번째 탈락 = $1\times 0.5^3$
    • 일반인이 3,4,5번째 탈락

- 5명이 살아남을 확률? $0.5^5$

  • 일반인5명 연속탈락
surv10_prob1 = 0.5^5
surv9_prob1 = c()
for (i in 0:4) surv9_prob1[i+1] = 0.5^5
surv8_prob1 = c() 
for (i in 0:3) surv8_prob1[i+1] = choose(4-i,1)*0.5^5
surv7_prob1 = c() 
for (i in 0:2) surv7_prob1[i+1] = choose(4-i,2)*0.5^5
surv6_prob1 = c() 
for (i in 0:1) surv6_prob1[i+1] = choose(4-i,3)*0.5^5
surv5_prob1 = 0.5^5
surv10_prob1*10 +
sum(surv9_prob1)*9 + 
sum(surv8_prob1)*8 + 
sum(surv7_prob1)*7 + 
sum(surv6_prob1)*6 + 
surv5_prob1*5
7.5

(2)

  • 10명이 전부 살아남는 경우 = $0.8^5$

- 9명이 살아남는 경우

  • 장인이 1번째 탈락 = $0.8^0 \times 0.2 \times 0.5^4$
  • 장인이 2번째 탈락 = $0.8^1 \times 0.2 \times 0.5^3$
  • 장인이 3번째 탈락 = $0.8^2 \times 0.2 \times 0.5^2$
  • 장인이 4번째 탈락 = $0.8^3 \times 0.2 \times 0.5^1$
  • 장인이 5번째 탈락 = $0.8^4 \times 0.2 \times 0.5^0$

- 8 명이 살아남는 결우

  • 장인이 1번째 탈락 = $choose(4,1)\times 0.8^0 \times 0.2 \times 0.5^4$
    • 일반인1이 2번째 탈락
    • 일반인1이 3번째 탈락
    • 일반인1이 4번째 탈락
    • 일반인1이 5번째 탈락
  • 장인이 2번째 탈락 = $choose(3,1)\times 0.8^1 \times 0.2 \times 0.5^3$
    • 일반인1이 3번째 탈락
    • 일반인1이 4번째 탈락
    • 일반인1이 5번째 탈락
  • 장인이 3번째 탈락 = $choose(2,1)\times 0.8^2 \times 0.2 \times 0.5^3$
    • 일반인1이 4번째 탈락
    • 일반인1이 5번째 탈락
  • 장인이 4번째 탈락 = $choose(1,1) \times 0.8^3 \times 0.2 \times 0.5^2$
    • 일반인1이 5번째 탈락

- 7명이 살아남을 확률?

  • 장인이 1번째 탈락 = $choose(4,2)\times 0.8^0 \times 0.2 \times 0.5^4$
    • 일반인이 2,3번째 탈락
    • 일반인이 2,4번째 탈락
    • 일반인이 2,5번째 탈락
    • 일반인이 3,4번째 탈락
    • 일반인이 3,5번째 탈락
    • 일반인이 4,5번째 탈락
  • 장인이 2번째 탈락 = $choose(3,2)\times 0.8^1 \times 0.2 \times 0.5^3$
    • 일반인이 3,4번째 탈락
    • 일반인이 3,5번째 탈락
    • 일반인이 4,5번째 탈락
  • 일반인1이 3번째 탈락 = $choose(2,2)\times 0.8^3 \times 0.2 \times 0.5^2$
    • 일반인이 4,5번째 탈락

- 6명이 살아남을 확률?

  • 장인이 1번째 탈락 = $choose(4,3)\times 0.8^0 \times 0.2 \times 0.5^4$
    • 일반인이 2,3,4번째 탈락
    • 일반인이 2,3,5번째 탈락
    • 일반인이 2,4,5번째 탈락
    • 일반인이 3,4,5번째 탈락
  • 장인이 2번째 탈락 = $1\times 0.8^1 \times 0.2 \times 0.5^3$
    • 일반인이 3,4,5번째 탈락

- 5명이 살아남을 확률? 0.8^0 \times 0.2 \times 0.5^4$

  • 장인 + 일반인4명 연속탈락
surv10_prob2 = 0.8^5
surv9_prob2 = c()
for (i in 0:4) surv9_prob2[i+1] = 0.8^i*0.2*0.5^(4-i)
surv8_prob2 = c()
for (i in 0:3) surv8_prob2[i+1] = choose(4-i,1)*0.8^i*0.2*0.5^(4-i)
surv7_prob2 = c()
for (i in 0:2) surv7_prob2[i+1] = choose(4-i,2)*0.8^i*0.2*0.5^(4-i)
surv6_prob2 = c()
for (i in 0:1) surv6_prob2[i+1] = choose(4-i,3)*0.8^i*0.2*0.5^(4-i)
surv5_prob2 = 0.2*0.5^4
surv10_prob2*10 +
sum(surv9_prob2)*9 + 
sum(surv8_prob2)*8 + 
sum(surv7_prob2)*7 + 
sum(surv6_prob2)*6 + 
surv5_prob2*5
8.50848