    people_id  activity_id success totl_act success_rate cum_success cum_act cum_success_rate success_rate_trend
       (fctr)       (fctr)   (int)    (int)        (dbl)       (int)   (int)            (dbl)              (dbl)
1     ppl_100 act2_1734928       0        1            0           0       1                0                 NA
2     ppl_100 act2_2434093       0        1            0           0       2                0                  0
3     ppl_100 act2_3404049       0        1            0           0       3                0                  0
4     ppl_100 act2_3651215       0        1            0           0       4                0                  0
5     ppl_100 act2_4109017       0        1            0           0       5                0                  0
6     ppl_100  act2_898576       0        1            0           0       6                0                  0
7  ppl_100002 act2_1233489       1        1            1           1       1                1                  1
8  ppl_100002 act2_1623405       1        1            1           2       2                1                  0
9  ppl_100003 act2_1111598       1        1            1           1       1                1                  0
10 ppl_100003 act2_1177453       1        1            1           2       2                1                  0

I have this sample data frame. I want to create a variable success_rate_trend from the cum_success_rate variable. The challenge is that it should be computed for every activity_id except the first activity of each unique people_id, i.e. I want to capture the success trend within each people_id. I'm using the code below:

success_rate_trend <- vector(mode = "numeric", length = nrow(succ_rate_df) - 1)
for (i in 2:nrow(succ_rate_df)) {
  if (succ_rate_df[i, 1] != succ_rate_df[i - 1, 1]) {
    success_rate_trend[i] <- NA
  } else {
    success_rate_trend[i] <- succ_rate_df[i, 8] - succ_rate_df[i - 1, 8]
  }
}

It takes forever to run. I have close to a million rows in the succ_rate_df data frame. Can anyone suggest how to simplify the code and reduce the run time?

Question author Abhi | Source


Use vectorization:

success_rate_trend <- diff(succ_rate_df$cum_success_rate)
success_rate_trend[diff(as.integer(succ_rate_df$people_id)) != 0] <- NA_real_

Note:

  1. people_id is a factor variable (fctr). diff() needs numeric input, so use as.integer() or unclass() to get at the underlying integer codes of the factor.
  2. You do not have an ordinary data frame, but a tbl_df from dplyr. Matrix-like indexing such as succ_rate_df[i, 1] does not work the way it does on a plain data frame. Use succ_rate_df$people_id or succ_rate_df[["people_id"]] instead of succ_rate_df[, 1].
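
To see the vectorized approach end to end, here is a minimal sketch on a toy data frame. The cum_success_rate values are made up for illustration and are not the ones from the question. It also assumes that all rows of a given people_id are contiguous, as they are in the sample. Because diff() returns one element fewer than its input, the sketch prepends an NA so the result lines up with the rows and can be stored as a column.

```r
# Toy data: people_id is a factor, as in the question; the rates are hypothetical.
succ_rate_df <- data.frame(
  people_id = factor(c("ppl_100", "ppl_100", "ppl_100",
                       "ppl_100002", "ppl_100002", "ppl_100003")),
  cum_success_rate = c(0, 0.5, 0.25, 1, 1, 0.5)
)

# Row-to-row change in the cumulative rate (length is nrow - 1).
trend <- diff(succ_rate_df$cum_success_rate)

# TRUE wherever a row belongs to a different person than the row before it.
new_person <- diff(as.integer(succ_rate_df$people_id)) != 0

# A person's first activity has no previous value to compare with.
trend[new_person] <- NA_real_

# Prepend NA for the very first row so the vector matches nrow(succ_rate_df).
succ_rate_df$success_rate_trend <- c(NA_real_, trend)

print(succ_rate_df)
```

Both diff() calls operate on whole vectors at once, which is what avoids the per-row indexing cost of the original loop.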
