Skip to content

Very long-standing bug in TaskQueueDB priorities? #8650

Description

@sfayer

Hi,

For the last couple of days I've been tracking down what seemed like a bug in my parameterised SQL version of TaskQueueDB (I've not created the PR for this yet): No matter what I did I couldn't get it to generate priorities that match the expected one in the integration tests.

After looking into the original version of this file, I think there may be a problem calculating the TaskQueue priorities in the current code: An SQL escaping error in the TaskQueue DB means the re-weighting function never matches any queues here (the userGroup here is already escaped at this point so it gets double quotes):

selSQL = f"SELECT Owner, COUNT(Owner) FROM `tq_TaskQueues` WHERE OwnerGroup='{userGroup}' GROUP BY Owner"

Generated SQL from the integration tests (note the double quoting around myGroup):

TaskQueueDB DEBUG: _query: SELECT Owner, COUNT(Owner) FROM `tq_TaskQueues` WHERE OwnerGroup='"myGroup"' GROUP BY Owner

This means it doesn't continue with the rest of the function (as the select returns an empty result), so the TaskQueue Priorities all remain set to the default of 1. I looked on my production server and indeed all of the TaskQueues there are all Priority = 1 in the DB. I think there are other paths to recalculate the priorities but I don't know how those are activated (or if they potentially use this same faulty function).

I suspect this has implications for my escape_string patch: That probably breaks thing here too (as it'll have multiple single quotes in a row), but as it's already broken the tests still get the same answer so don't fail! (I've not double checked this, but I think it's highly likely).

The patch for this would be to remove the double escaping in two places and increased the expected priority to 1000.0, which is the value that it generates when it does do the calculation. Here is the full patch:

diff --git a/src/DIRAC/WorkloadManagementSystem/DB/TaskQueueDB.py b/src/DIRAC/WorkloadManagementSystem/DB/TaskQueueDB.py
index 98c2bc3bcc..a478111a66 100755
--- a/src/DIRAC/WorkloadManagementSystem/DB/TaskQueueDB.py
+++ b/src/DIRAC/WorkloadManagementSystem/DB/TaskQueueDB.py
@@ -1243,7 +1243,7 @@ WHERE j.JobId = %s AND t.TQId = j.TQId"
             # If group has JobSharing just set prio for that entry, user is irrelevant
             return self.__setPrioritiesForEntity(user, userGroup, share, connObj=connObj)
 
-        selSQL = f"SELECT Owner, COUNT(Owner) FROM `tq_TaskQueues` WHERE OwnerGroup='{userGroup}' GROUP BY Owner"
+        selSQL = f"SELECT Owner, COUNT(Owner) FROM `tq_TaskQueues` WHERE OwnerGroup={userGroup} GROUP BY Owner"
         result = self._query(selSQL, conn=connObj)
         if not result["OK"]:
             return result
@@ -1275,7 +1275,7 @@ WHERE j.JobId = %s AND t.TQId = j.TQId"
         Set the priority for a user/userGroup combo given a splitted share
         """
         self.log.info("Setting priorities", f"to {user}@{userGroup} TQs")
-        tqCond = [f"t.OwnerGroup='{userGroup}'"]
+        tqCond = [f"t.OwnerGroup={userGroup}"]
         allowBgTQs = gConfig.getValue(f"/Registry/Groups/{userGroup}/AllowBackgroundTQs", False)
         if Properties.JOB_SHARING not in Registry.getPropertiesForGroup(userGroup):
             res = self._escapeString(user)
diff --git a/tests/Integration/WorkloadManagementSystem/Test_TaskQueueDB.py b/tests/Integration/WorkloadManagementSystem/Test_TaskQueueDB.py
index a2892a192a..135daaad3f 100644
--- a/tests/Integration/WorkloadManagementSystem/Test_TaskQueueDB.py
+++ b/tests/Integration/WorkloadManagementSystem/Test_TaskQueueDB.py
@@ -1359,7 +1359,7 @@ def test_TQ(tq_cleanup):
         "Jobs": 1,
         "OwnerGroup": "myGroup",
         "CPUTime": 86400,
-        "Priority": 1.0,
+        "Priority": 1000.0,
     }
     result = tqDB.findOrphanJobs()
     assert result["OK"]

This doesn't seem like a massive change, but given how sensitive things could be to these priorities, particularly for instances running large number of jobs, I thought it was best to discuss this:

  • Do you agree with my analysis of this problem?
  • How would you like to go about fixing this? Apply the patch? Modify things to stick with the "broken" behaviour? Just apply the parameterised version as the fix and not worry about older versions? Something else?

Regards,
Simon

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions