Problem with zPermissions (Deadlock) #146


  • Defect
  • Replied
Open
Assigned to _ForgeUser1494830
  • _ForgeUser7164447 created this issue May 28, 2014

    What steps will reproduce the problem?
    1. Connect multiple server to the same DB
    2. Put all server in read-only mod, except the hub, where you will do all the changes
    3. On grou update, there will a "Deadlock", and boom, transaction rollbacked

    What is the expected output? What do you see instead?

    The changegroup to not be deadlocked

    What version of the product are you using?

    1.1.1

    Do you have an error log of what happened?

    Yes,  http://pastie.org/private/hljjchkcoyxopye22f0ew (I have the same for INSERT too)

    Please provide any additional information below.

    After all my test, i moved all my server on a new DB to be sure there will be ABSOLUTELY nothing except the hub writing on this DB.
    All other server are set in 'Read-only' mod, and after i moved all servers to connect on the new DB, bam, error when we change people of group on the hub appear (sometimes).

    So yes, this is absolutely not my fault, now i'v changed of DB i can say i'm 100% sure there's absolutely NOTHING writing on the zPermissions DB, only the hub.

    So yes, all other server can technichally write on this db... but they are set in 'read only' mod.

    Please do something...

  • _ForgeUser7164447 added the tags New Defect May 28, 2014
  • _ForgeUser1494830 posted a comment May 28, 2014

    Increasing txn-max-retries did nothing?

  • _ForgeUser7164447 posted a comment May 28, 2014

    <<reply 2203527>>

    Nope.

    But i'm pretty sur the problem comes from read-only server, i'll try to give to these server to READ only information on this db to see...

  • _ForgeUser7164447 posted a comment May 28, 2014

    Also if the setgroup fail, he will create a double entry for this user :

    http://i.imgur.com/XC5tfdP.png

    WTF ? The user is supposed to be in only one group since i setgroupped him, it happen only to player when it fails

  • _ForgeUser1494830 posted a comment May 28, 2014

    @Shooty_: Go

    Nope it did nothing, or nope you didn't try?

    Deadlock exception means the transaction should be retried (it even says so). My code is already set up to do this (your exception stack trace passes through my RetryingAvajeTransactionStrategy class). By default it will retry once. Depending on how busy the database is, more retries may be needed.

    But again, deadlocks only occur with multiple writers. So I have no idea how it applies to your set up.

    But yeah, if you can, give at least one of your read-only servers a read-only database account. If zPerms is writing when it shouldn't, you should see a stack trace.

  • _ForgeUser1494830 posted a comment May 28, 2014

    I'm assuming you're using InnoDB with MySQL.

    Since you can reproduce it easily, can you make it happen, then connect to the database and issue the command:

    show engine innodb status

    I want to see the "LATEST DETECTED DEADLOCK" info.

  • _ForgeUser1494830 removed a tag New May 28, 2014
  • _ForgeUser1494830 added a tag Waiting May 28, 2014
  • _ForgeUser7164447 posted a comment May 29, 2014

    <<reply 2203767>>

    http://pastie.org/private/k0fdgqhciro1d5w6omrxq

    And yes i tried txn-something and it did nothing


    Edited May 29, 2014
  • _ForgeUser7164447 removed a tag Waiting May 29, 2014
  • _ForgeUser7164447 added a tag Replied May 29, 2014
  • _ForgeUser1494830 posted a comment May 29, 2014

    @Shooty_: Go

    And what zPerms command did you issue to cause that deadlock?

    Right now I'm seeing two transactions, one from localhost and the other from SOMEIPADDRESS. Which server is running on the same host as MySQL? I'm assuming the hub?

    What's your auto-refresh-interval set to? Also, how long does zPerms take to load all your permissions? It says when it starts up.

  • _ForgeUser1494830 removed a tag Replied May 29, 2014
  • _ForgeUser1494830 added a tag Waiting May 29, 2014
  • _ForgeUser1494830 posted a comment May 29, 2014

    In addition to answering my previous questions, I've made a special build of 1.1.1: Download

    The only difference is that this special build adds logging to failed transactions (when retrying) and there is a delay.

    You must add

    txn-max-retries: 5
    

    to your config.yml because obviously the default of 1 isn't working for you.

    I want to know if you see the message like the following before each exception:

    [14:20:31 WARN]: Transaction failed, will retry immediately
    

    This will give me more info on whether or not the transactions are being retried.

    I've already spent a lot of time on this problem (already 3 hours today, and I'm at work). I can easily reproduce these transaction failures, but a simple retry always works for me. There are other people who run a similar configuration to you (multi-server with only one read-write) but this is the first I've heard of this problem.

    If the new information you provide doesn't give me any new insight, then I am sorry to say I am stumped and can't currently help you.

  • _ForgeUser1494830 removed a tag Replied May 29, 2014
  • _ForgeUser1494830 added a tag Waiting May 29, 2014
  • _ForgeUser7164447 posted a comment May 30, 2014

    @ZerothAngel: Go

    The command i use is "perm player setgroup NewGroup", i never use another command. The transaction are from localhost, since my hub is on the same server. The other IP is one of my other server i guess (maybe skyblock or something else). But those are all set in read-only.

    AutoRefresh is set & forced for all server to 5 mins.

    zPerms take 5 secs or 10 secs to load on remote DB, localy it take maybe 1 sec or 2.

    @ZerothAngel: Go

    Ok thanks, i do have to install this build everywhere or just in the hub ? I guess just in the hub since it's the only one that are suppose to do change.

    I'v already set txn-max-retries to 5 to my config. I'll add your new versions right now.

    I'v haven't set my "read-only" users for all my server that are just suppose to "read" data, it takes some times to do. But i'll do it and hopefully it will solve problems (since it can't be deadlocked anymore if REALLY no one can write on it except the hub).

    Also you didn't answer if it's was normal that when if fail it creat a double entry for the same user. Maybe it's for that that the plugin doesn't retry, he things the transaction worked and there's 2 people with the same username in the database with different group. (While i was using perm player playername setgroup Newgroup) it normaly change the group and not add a double an entry like this : http://i.imgur.com/XC5tfdP.png.

    So i think your plugin DOES retry and it things it worked but he sent the wrong sql command ? cause this http://i.imgur.com/XC5tfdP.png is absolutely anormal after a failed setgroup. (And i only have this occur when the transaction have an error)


    Edited May 30, 2014
  • _ForgeUser7164447 removed a tag Waiting May 30, 2014
  • _ForgeUser7164447 added a tag Replied May 30, 2014
  • _ForgeUser7164447 posted a comment May 30, 2014

    Updated previous message

  • _ForgeUser1494830 posted a comment May 30, 2014

    Thanks for the answers and trying things out.

    As for your question, I'm still trying to figure out how it can fail that way. When you do a setgroup, it will normally delete all groups (that are not the one group you specified) and then add, if needed, that one group. This is all done in a single transaction. So a failure should have left it in the previous state. Do you know what group IDs 1 & 3 are in your system? Is one of them the group you use for setgroup?

    Permissions are kept in memory and permissions are kept in the database. This is done for performance reasons. And if everything is going perfectly, they will be in sync. But once you see any sort of write failure, they will be out of sync. So if you try to fix it by issuing the command again, it will probably send the wrong SQL commands like you say.

    So my recommendation is to /perm refresh whenever you see any write failures, before trying to fix it with another command. I'll add this to the error message (the transaction failure message specifically does not say this, but others do) and look into the possibility of making it automatic.

    Lastly, with MySQL InnoDB, even reading a row will lock it for that transaction. So I believe even your read-only servers will contribute to the deadlocks.

  • _ForgeUser1494830 posted a comment May 30, 2014

    Anyway, I think the main issue stems from the fact that your remote/read-only instances take a relatively long time to load permissions.

    Without the delay, zPermissions runs out of retry attempts almost immediately.

    I think you might still run out of attempts even with txn-max-retries to 5, so you might try increasing it to 10 or higher if you still see problems.

    Another possible fix: since you only have one read-write instance, it should be safe to relax the database transaction isolation. In bukkit.yml (or config.yml, if that's where you put it), change the isolation to REPEATABLEREAD:

    database:
      ...
      isolation: REPEATABLEREAD
      ...
    

    Unlike SERIALIZABLE, SELECTs won't actually lock rows.

  • _ForgeUser7164447 posted a comment May 31, 2014

    @ZerothAngel: Go

    Do i can do that even if i have other plugin that need to write ? (Like TeleportSigns)

  • _ForgeUser1494830 posted a comment May 31, 2014

    @Shooty_: Go

    Not sure. But if you have another plugin that uses the bukkit.yml settings, just copy the "database" section from bukkit.yml to zPerms config.yml. zPerms will use that instead. Then modify the isolation there.

    I forgot to mention: if you do move the database config to config.yml, you can't use the "{NAME}" substitution in the url like in bukkit.yml (if you're using it). Instead, just substitute it manually.


    Edited May 31, 2014
  • _ForgeUser7164447 posted a comment May 31, 2014

    @ZerothAngel: Go

    You've done an amazing follow'up those last days, i'll come again in the next few day to say where all my test leads.

    Thanks again.

  • _ForgeUser7164447 posted a comment Jun 10, 2014

    Ok seems to be fixed with your txn-max-retries to 15. :)

  • _ForgeUser1494830 posted a comment Jun 11, 2014

    @Shooty_: Go

    But you're using the modified build?

    Were you able to test the "isolation: REPEATABLEREAD" change? Because I think that would be a cleaner fix rather than forcing write transactions to wait for remote servers to finish their loading.

    Anyway, thanks for reporting back.

  • _ForgeUser7164447 posted a comment Jun 11, 2014

    @ZerothAngel: Go

    Yup i used REPEATABLEREAD with the new build.

    But i haven't test without the modified build, so for now i use the modified build. And it works.


To post a comment, please login or register a new account.