
I am monitoring (short-term) the sizes of several UNIX files and can easily export the numbers to an Excel spreadsheet. The files will eventually be indexed by Splunk, and that team wants to know the daily volume to expect. I am currently sampling the file sizes every hour. My file sizes generally fall into two types: constantly growing or cyclical. My definition of cyclical is: the file grows, gets reset to size 0, grows again, gets reset again, and so on. A graph of the numbers looks like a sawtooth pattern.

Can Excel determine which category a file belongs to, either constantly growing or cyclical? The final numbers do not have to be exact, but they must give a good idea of the daily volume. (I could do this manually, but I want Excel to let me be lazy.)

The reasons I ask are:

  • If a file is reset to zero, which typically happens three times a day (but sometimes more often), I want to report the sum of the three largest file sizes. As a bonus, if the resets happen more often, perhaps Excel can detect that so I can multiply by the reset count.
  • If a file is constantly growing, it's a candidate for log file rotation, and I'll report the difference between early-day and late-day file sizes to get the daily volume. (And see about getting a rotation script running.)
  • If the file size never changes, the program generating the file needs to be checked. I can discover this by comparing min() and max() values. (A formula sketch for these checks follows this list.)
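
Something like the following might let Excel do the sorting for me (a sketch; it assumes each file's samples sit in one row, B2:J2 for the first file, and the ranges are placeholders for however many samples I keep). Counting how often a sample is smaller than the one before it:

=SUMPRODUCT(--(C2:J2<B2:I2))

should return zero for constant growth and, for cyclical growth, the number of resets I could multiply by. And comparing min() and max():

=IF(MIN(B2:J2)=MAX(B2:J2), "never changed", "changing")

should flag a file that never moves.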

Here are some sample data pulled directly from my spreadsheet. You can see the top and bottom rows grow through the 15:55 sample, then drop at the 5th column, 16:05. It is safe to assume the file size was reset to 0 at some point after the 15:55 sample. The middle row just keeps growing.

14:05       14:55       15:05       15:55       16:05       16:55       17:05       17:55       18:05
 3,307,183   3,692,602   3,872,188   4,301,423      61,010     467,733     554,456     981,851   1,068,804
36,348,105  36,888,709  37,003,076  37,541,343  37,650,979  38,182,932  38,292,337  38,838,988  38,947,009
 1,666,276   1,879,624   1,945,419   2,164,186      29,410     238,368     293,786     518,526     565,455

(I cheated on my hourly sample. I wanted to trap the time the resets actually occur. Now that I know, I'll drop back to hourly monitoring and recalculate the older times with the data I've retained. I'll monitor the files for about a week.)

From rows one and three, I'll report three times the maximum (the 15:55 column here). Actually, I'll probably use something like =LARGE(B1:ZZ1,1) + LARGE(B1:ZZ1,2) + LARGE(B1:ZZ1,3), but I first need to know the file's growth is cyclical.

From row two, I'll report the size at 23:55 today minus the size at 23:55 yesterday, but I first need to know the file's growth is constant.
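
For that case, something like this should work (again a sketch; it assumes the day's samples run left to right starting in B2, and the ranges are placeholders):

=LOOKUP(9.99E+307, B2:ZZ2) - B2

LOOKUP() with a number larger than any possible file size returns the last numeric value in the range, so this is the latest sample minus the earliest.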

Then I'll total all my sizes to get a grand total.

iAmJeff

2 Answers


This problem seems well suited to Excel.  But first, let me question your question.  Suppose your data look like this:

    60     120     180      45      90     135      35      70     105      60

[line graph of the above data]

You mentioned selecting the largest three values with LARGE().  The three largest values in the above are 180, 135, and 120.  Is that what you want, or do you want the peaks: 180, 135, and 105?  I’ll get back to that.

Set up a rectangular region the same size and shape as your data.  It can be below your data, to the right, or on a different sheet.  I’ll assume that it’s on the same sheet, below the data, in rows 12-14.  Put

=IF(A2>B2, A2, IF(A2=B2, "same", ""))

into cell A12, and drag/fill to the right and down.  This evaluates to the value in the corresponding data cell if it is greater than the value in the next cell in the same row; in your data, this is true for the local maxima, i.e., the peaks.  When two adjacent data values are equal, it evaluates to the text value same.  Otherwise, it is blank.  One caveat: stop the fill one column short of the last data column, because the last cell would compare its value against an empty cell (treated as 0) and always register as a peak.  Here I modified your row 3 data to contain a block of equal values in the middle:

[screenshot of the spreadsheet]

Now the COUNT() function will tell you how many numbers there are in a row, which is the number of times the data dropped between two consecutive cells.  For steadily increasing values, this will be zero; if the value is periodically reset, this will show how many times it happened.  And COUNTIF(…, "same") tells you how many times two consecutive cells were equal.  This may be better at detecting a dead process than simply comparing the min and max values.
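
Concretely, assuming the helper formulas for the first file sit in A12:H12 (a placeholder range; match it to however far you filled):

=COUNT(A12:H12)
=COUNTIF(A12:H12, "same")

The first counts the drops (the peaks left behind as numbers); the second counts the consecutive-equal pairs.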

You may find these maxima useful for your statistics.  Consider reducing them with LARGE(), AVERAGE(), or SUM().
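
For example, assuming the same placeholder helper row A12:H12, this sums the three biggest peaks in one cell (it returns #NUM! if fewer than three peaks exist):

=SUM(LARGE(A12:H12, {1,2,3}))

LARGE() ignores the text and blank cells, so only the peak values are considered.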

  • Yes, I want the peaks when the graph looks like that. The peaks represent the total size up to a point in time. Intermediate values on the uphill side are included in the peak. Your observation is something I hadn't considered. – iAmJeff Apr 30 '15 at 15:21

Actually, I'd do this in the script where you capture the data. In addition to capturing the size in one "data" file, save the current size; on the next run, compare the new size with the previous one. If the new size is smaller than the old, save TRUE against the file name in a separate CSV file; otherwise save FALSE. Make sure both files are indexed by filename so you can match them.

Using a scripting language such as Python or Node.js would make this a pretty simple task. Much easier than trying to do it all in some convoluted Excel formula.
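
For instance, a minimal sketch in Python (the watched paths and the prev_sizes.json / resets.csv names are placeholders, not anything your setup requires):

    #!/usr/bin/env python3
    # Minimal sketch: compare each monitored file's current size with the size
    # recorded on the previous run, and append a timestamped TRUE/FALSE
    # "was reset" flag, keyed by filename, to a CSV.
    import csv
    import json
    import os
    from datetime import datetime

    WATCHED = ["/var/log/app_a.log", "/var/log/app_b.log"]  # placeholder paths
    STATE = "prev_sizes.json"   # remembers sizes between runs (placeholder name)
    FLAGS = "resets.csv"        # gains one row per file per run (placeholder name)

    # Load the sizes recorded by the previous run, if any.
    previous = {}
    if os.path.exists(STATE):
        with open(STATE) as f:
            previous = json.load(f)

    now = datetime.now().isoformat(timespec="minutes")
    current = {path: os.path.getsize(path) for path in WATCHED}

    # Append one row per file: timestamp, filename, current size, reset flag.
    with open(FLAGS, "a", newline="") as f:
        writer = csv.writer(f)
        for path, size in current.items():
            was_reset = path in previous and size < previous[path]
            writer.writerow([now, path, size, "TRUE" if was_reset else "FALSE"])

    # Save the current sizes for the next run's comparison.
    with open(STATE, "w") as f:
        json.dump(current, f)

Run it from cron at whatever interval you sample; resets.csv then imports straight into Excel alongside your size data.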

Julian Knight
  • I could probably do it in ksh or even the Bourne shell, but I'm not clever enough to build a pretty graph to show a manager in case of questions. ;-) – iAmJeff Apr 30 '15 at 15:21
  • All I meant was to create the TRUE/FALSE logic at source; then you can easily combine the data in Excel and produce your nice graph. Doing the logic is easier at source, and the graph is certainly easier in Excel :) – Julian Knight Apr 30 '15 at 15:54