Help identifying an approach and a sanity check

Hello,

I am hoping some of you smart statisticians can help me with a problem I am trying to solve.

I am trying to build a model that will create a “Score” for sets or groups of devices that exist within a monitored environment. Consider this a "Health Score" or "Risk Factor" type of rating to help identify or triage which device groups need the most immediate attention, and report over time the groups that tend to display the highest risk scores (hopefully to analyze the underlying reasons to keep overall risk down).

For example, a CPU running at 99% could generate a “Critical” alarm, while a CPU running at 80% might generate only a “Warning”. These alarms are represented by a numeric value (5 being the highest and 0 meaning “no alarm”). These measures or monitors are applied to various attributes of a device (CPU, memory, disk usage, chassis fan status, interface utilization, etc.) and, depending on the thresholds defined, will generate an alert when a threshold is breached.

The device could remain in an alarm condition for any amount of time (from 0 hours, meaning the alarm just occurred, up to however long the condition stays true — the alarm remains valid as long as the condition holds).
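To make the alarm model above concrete, here is a minimal sketch (not from the original post — the function name, threshold pairs, and the numeric value assigned to “Warning” are all illustrative assumptions) of how a monitor might map a metric reading to a numeric alarm level:

```python
# Illustrative sketch of the threshold -> alarm-level mapping described above.
# The numeric levels for intermediate severities are assumptions; the post
# only states that 5 is the highest and 0 means "no alarm".

def alarm_level(value, thresholds):
    """Return the highest alarm level whose threshold is breached.

    thresholds: list of (threshold, level) pairs for one metric.
    """
    level = 0
    for threshold, lvl in sorted(thresholds):
        if value >= threshold:
            level = lvl
    return level

# From the post's CPU example: warning at 80%, critical at 99%.
# (Warning assumed to be level 1 here.)
CPU_THRESHOLDS = [(80, 1), (99, 5)]

print(alarm_level(99, CPU_THRESHOLDS))  # 5 (Critical)
print(alarm_level(85, CPU_THRESHOLDS))  # 1 (Warning)
print(alarm_level(50, CPU_THRESHOLDS))  # 0 (no alarm)
```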

Taking the above example, assume I have a group (GroupA) that contains 10 devices.

One device has 5 different “Critical” alarms (Level 5) that have been in the alarm state for 1 hour — meaning that 5 separate measures have each triggered a critical alarm, and the device has been in this condition for 1 hour.

Now compare that to another group (GroupB) of 10 devices where 5 of the devices have a single measure in a High condition (Level 4) for a total of 5 High alarms.

Which group is in worse condition? My logic says GroupB, since a larger proportion of its devices is impacted.

Add into the mix another group (GroupC) that contains only 3 devices, 2 of which have a Critical alarm.

Which group now is in worse condition?

Again, my own logic dictates that, from a group-risk standpoint, GroupC is worse off than GroupB, which is worse than GroupA… but maybe my logic is flawed.

Sample Data:

Group A
------------
             Crit  High  Low  Warn
             ----  ----  ---  ----
device 1        5     0    0     0
device 2        0     0    0     0
device 3        0     0    0     0
device 4        0     0    0     0
device 5        0     0    0     0
device 6        0     0    0     0
device 7        0     0    0     0
device 8        0     0    0     0
device 9        0     0    0     0
device 10       0     0    0     0


Group B
------------
             Crit  High  Low  Warn
             ----  ----  ---  ----
device 1        0     1    0     0
device 2        0     1    0     0
device 3        0     1    0     0
device 4        0     1    0     0
device 5        0     1    0     0
device 6        0     0    0     0
device 7        0     0    0     0
device 8        0     0    0     0
device 9        0     0    0     0
device 10       0     0    0     0

Group C
-----------
             Crit  High  Low  Warn
             ----  ----  ---  ----
device 1        1     0    0     0
device 2        1     0    0     0
device 3        0     0    0     0
----------------------------------------------------------------------

Another factor that I won't bother with now, is individual device weighting to the overall group. For argument's sake here, we will assume each device equally contributes to the greater group.
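To illustrate one way a score could reproduce the intuitive ranking above, here is a minimal sketch (not the original spreadsheet's logic — the severity weights and the choice to cap each device at its worst alarm are assumptions). The idea: a device's risk is bounded by its worst active alarm (a device can only be "down" once, no matter how many alarms it has), and the group score is the mean device risk, so breadth of impact dominates depth:

```python
# Sketch of a breadth-weighted group risk score. Severity weights for
# Low/Warn are assumptions; only Crit=5 and High=4 appear in the data.
SEVERITY = {"Crit": 5, "High": 4, "Low": 2, "Warn": 1}

def device_risk(alarms):
    """Risk of one device: the severity of its worst active alarm."""
    worst = 0
    for level, count in alarms.items():
        if count > 0:
            worst = max(worst, SEVERITY[level])
    return worst

def group_risk(devices):
    """Mean device risk across the group (devices with no alarms score 0)."""
    return sum(device_risk(d) for d in devices) / len(devices)

# The sample data from above (devices with all zeros written as {}):
group_a = [{"Crit": 5}] + [{}] * 9
group_b = [{"High": 1}] * 5 + [{}] * 5
group_c = [{"Crit": 1}] * 2 + [{}]

for name, grp in [("A", group_a), ("B", group_b), ("C", group_c)]:
    print(f"Group {name}: {group_risk(grp):.2f}")
# Group A: 0.50
# Group B: 2.00
# Group C: 3.33
```

Note that a plain sum of severity-weighted alarms divided by group size would rank GroupA (25/10 = 2.5) above GroupB (20/10 = 2.0); capping each device at its worst alarm is what makes the score favor breadth of impact and produce the C > B > A ordering.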

I have attached a spreadsheet that I have put together which does seem to make logical statements about which group would be considered the most at risk. My concern is that this is 100% based on my own logic and perception… which may be wildly inaccurate. Ideally, the score would leverage some known and accepted statistical approach that would stand up to scrutiny, or at least come from some commonly accepted approach to this type of thing.

In researching this, I have looked at a number of use cases for health rankings using logarithmic regression, as it seemed to have a close parallel to what I am trying to determine (population = groups, symptoms = alarms, how long symptoms have been present = alarm duration), but I can't seem to get my head around it.

Does the logarithmic regression idea seem like the right path? If so, could someone help me make the connection between the two approaches? It would be GREATLY appreciated. If I am going down the wrong road and missing something, please set me straight.

Any and all input is welcome.

Thank you