
The table below shows the login and logout times for users of a given software.

USER_NAME USER_ROLE GMT_LOGIN_TIME      GMT_LOGOUT_TIME       LOGIN_DURATION 
ABCD      SUP_235   2017-12-15 11:05:25 2017-12-15 11:09:01           216.0
ABCE      SUP_235   2017-12-15 11:23:14 2017-12-15 11:33:17           603.0
ABCF      USER_285  2017-12-15 11:44:12 2017-12-15 12:07:52          1420.0
ABCG      SUP_230   2017-12-15 18:43:17 2017-12-15 19:00:20          1023.0
ABCH      USER_245  2017-12-15 10:36:59 2017-12-15 11:42:00          3901.0

How can I identify simultaneous users? I'll have to do this in Python, but I'm not worried about the implementation in that language; I'd like to understand the logic itself.

In the example above, users ABCD, ABCE and ABCH are the only simultaneous ones.

srodriguex
  • How are you reading this table? What's your desired output? Do you have any code? – pault Dec 20 '17 at 14:50
  • Let's say I just want to put a "1" in a new column, e.g. "Simultaneous", to every user that is logged in the same interval with anyone else. – srodriguex Dec 20 '17 at 14:57

1 Answer

Here is a way to do it using pandas:

from io import StringIO  # on Python 2, use `from StringIO import StringIO`

import pandas as pd

# read data into dataframe
data = StringIO("""USER_NAME,USER_ROLE,GMT_LOGIN_TIME,GMT_LOGOUT_TIME,LOGIN_DURATION
ABCD,SUP_235,2017-12-15 11:05:25,2017-12-15 11:09:01,216.0
ABCE,SUP_235,2017-12-15 11:23:14,2017-12-15 11:33:17,603.0
ABCF,USER_285,2017-12-15 11:44:12,2017-12-15 12:07:52,1420.0
ABCG,SUP_230,2017-12-15 18:43:17,2017-12-15 19:00:20,1023.0
ABCH,USER_245,2017-12-15 10:36:59,2017-12-15 11:42:00,3901.0""")
df = pd.read_csv(data, sep=",")

# parse the timestamp columns; ISO-8601 strings happen to compare correctly
# as plain strings, but real datetimes are safer
for col in ['GMT_LOGIN_TIME', 'GMT_LOGOUT_TIME']:
    df[col] = pd.to_datetime(df[col])

# create a new column for simultaneous
df['simultaneous'] = 0

# loop through dataframe and check condition
for i in df.index:
    login, logout = df.loc[i,'GMT_LOGIN_TIME'], df.loc[i,'GMT_LOGOUT_TIME']
    this_index = df.index.isin([i])
    df.loc[i, 'simultaneous'] = int(any(
        (df[~this_index]['GMT_LOGIN_TIME'] <= logout) & (df[~this_index]['GMT_LOGOUT_TIME'] >= login)
    ))

Output:

  USER_NAME USER_ROLE       GMT_LOGIN_TIME      GMT_LOGOUT_TIME  \
0      ABCD   SUP_235  2017-12-15 11:05:25  2017-12-15 11:09:01   
1      ABCE   SUP_235  2017-12-15 11:23:14  2017-12-15 11:33:17   
2      ABCF  USER_285  2017-12-15 11:44:12  2017-12-15 12:07:52   
3      ABCG   SUP_230  2017-12-15 18:43:17  2017-12-15 19:00:20   
4      ABCH  USER_245  2017-12-15 10:36:59  2017-12-15 11:42:00   

   LOGIN_DURATION   simultaneous  
0            216.0             1  
1            603.0             1  
2           1420.0             0  
3           1023.0             0  
4           3901.0             1  

This is correct: ABCH's session spans both ABCD's and ABCE's, so those three users overlap, while ABCF and ABCG overlap with no one.

Explanation of the logic:

  1. We loop through each index (row) of the dataframe and grab the login and logout times for that user.

  2. Next we want to check all other rows for an overlap, so we create this_index as an indexer to point to the current row. Using the bitwise inversion operator (~), we can select the other rows with df[~this_index].

  3. For the other rows we check whether any of them meet the overlap condition, using the any function. Since any returns a boolean, we convert it to int. We could instead use sum(condition) to count how many other sessions overlap with this one.

  4. The result of int(any(condition)) is put in the 'simultaneous' column.

Refer to this page for the logic behind the overlapping time ranges: Determine Whether Two Date Ranges Overlap
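Stripped of the pandas machinery, the overlap test reduces to two comparisons: two intervals overlap iff each one starts before the other ends. A minimal standalone sketch (the helper name `overlaps` is mine, not from the question):

```python
def overlaps(start_a, end_a, start_b, end_b):
    """Two intervals overlap iff each one starts at or before the other's end."""
    return start_a <= end_b and start_b <= end_a

# ISO-8601 timestamp strings compare correctly as plain strings.
# ABCD vs ABCH from the table above: ABCH's session spans ABCD's entirely.
print(overlaps('2017-12-15 11:05:25', '2017-12-15 11:09:01',
               '2017-12-15 10:36:59', '2017-12-15 11:42:00'))  # True

# ABCF vs ABCG: ABCG logs in hours after ABCF logs out.
print(overlaps('2017-12-15 11:44:12', '2017-12-15 12:07:52',
               '2017-12-15 18:43:17', '2017-12-15 19:00:20'))  # False
```

Negating the condition gives the perhaps more intuitive non-overlap form: the intervals are disjoint iff one ends before the other starts.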

Edit: A more pandas way

Looping over rows in pandas is slow for large tables. Here is a way to achieve the same result using pandas.DataFrame.apply(), which should be faster.

df['simultaneous'] = df.apply(
    lambda x: int(
        any(
            (df[df['USER_NAME'] != x['USER_NAME']]['GMT_LOGIN_TIME'] <= x['GMT_LOGOUT_TIME']) &
            (df[df['USER_NAME'] != x['USER_NAME']]['GMT_LOGOUT_TIME'] >= x['GMT_LOGIN_TIME'])
        )
    ),
    axis=1
)
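Both versions above still compare every row against every other row, which is O(n²). For very large tables, one common alternative (my sketch, not part of the original answer) is a sort-based sweep: after sorting by login time, a session overlaps some other session iff an earlier-starting session is still open at its login (running maximum of previous logouts) or the very next session starts before it logs out.

```python
from io import StringIO

import pandas as pd

def mark_simultaneous(df):
    """Flag rows whose [login, logout] interval overlaps any other row's, in O(n log n)."""
    d = df.sort_values('GMT_LOGIN_TIME').reset_index()
    logins = pd.to_datetime(d['GMT_LOGIN_TIME'])
    logouts = pd.to_datetime(d['GMT_LOGOUT_TIME'])
    # latest logout among all earlier-starting sessions (NaT for the first row)
    prev_max_logout = logouts.cummax().shift(1)
    # login time of the next-starting session (NaT for the last row)
    next_login = logins.shift(-1)
    d['simultaneous'] = ((logins <= prev_max_logout) |
                         (next_login <= logouts)).astype(int)
    # map the flags back to the original row order via the saved index
    return df.assign(simultaneous=d.set_index('index')['simultaneous'])

data = StringIO("""USER_NAME,USER_ROLE,GMT_LOGIN_TIME,GMT_LOGOUT_TIME,LOGIN_DURATION
ABCD,SUP_235,2017-12-15 11:05:25,2017-12-15 11:09:01,216.0
ABCE,SUP_235,2017-12-15 11:23:14,2017-12-15 11:33:17,603.0
ABCF,USER_285,2017-12-15 11:44:12,2017-12-15 12:07:52,1420.0
ABCG,SUP_230,2017-12-15 18:43:17,2017-12-15 19:00:20,1023.0
ABCH,USER_245,2017-12-15 10:36:59,2017-12-15 11:42:00,3901.0""")
result = mark_simultaneous(pd.read_csv(data))
print(result[['USER_NAME', 'simultaneous']])
```

On the sample table this marks ABCD, ABCE and ABCH, matching the pairwise versions. Comparisons against NaT are False, so the first and last rows in sorted order are handled without special cases.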
pault
  • Your solution worked as the problem was stated, but in production the table has hundreds of thousands of records and it got very slow. I've updated it to write the 1 value to all simultaneous users at once and skip those users. – srodriguex Dec 20 '17 at 23:52
  • I added an alternative solution that should be faster. If you added some `pandas` tags to this question, you may find others have even faster methods. What was your method? – pault Dec 21 '17 at 15:01
  • My optimised solution basically uses your condition, modified to get an ID for each record that is simultaneous, and updates them all at once. At each iteration it checks whether the current record has already been identified and skips it if so. This is supposed to avoid running the condition again, but it doesn't save much time in the end. – srodriguex Dec 22 '17 at 23:36