0

I am encountering duplicate results in my sqlalchemy+postgres query. I have a model structured like below:

class Audio(db.Model):
    __tablename__ = "audio"

    id = db.Column(db.Integer, primary_key=True)
    file_location = db.Column(db.String(), unique=True)
    upload_time = db.Column(ArrowType, default=arrow.utcnow())
    keyword = db.Column(JSON)
    transcript = db.Column(JSON)
    diarization = db.Column(JSON)

    user_id = db.Column(db.Integer, db.ForeignKey('user.id', onupdate='cascade', ondelete='cascade'))
    company_id = db.Column(db.Integer, db.ForeignKey('company.id', onupdate='cascade', ondelete='cascade'))
    client_id = db.Column(db.Integer, db.ForeignKey('client.id', onupdate='cascade', ondelete='cascade'))

    def __init__(self, file_location, upload_time, keyword, transcript, diarization, user_id, company_id, client_id):
        self.file_location = file_location
        self.upload_time = upload_time
        self.keyword = keyword
        self.transcript = transcript
        self.diarization = diarization
        self.user_id = user_id
        self.company_id = company_id
        self.client_id = client_id

    def __repr__(self):
        return '<Audio ID %r>' % self.id

And under transcript, the input is a json object:

{"transcript": [
    {"p": 0, "s": 0, "e": 320, "c": 0.545, "w": "This"}, 
    {"p": 1, "s": 320, "e": 620, "c": 0.825, "w": "call"}, 
    {"p": 2, "s": 620, "e": 780, "c": 0.909, "w": "is"}, 
    {"p": 3, "s": 780, "e": 1010, "c": 0.853, "w": "being"}
    ...
    ]}

I am trying to filter entries based on the value from "w" and query its corresponding Audio.id and "p". I have tried the following:

  transcript_subquery = s.query(func.json_array_elements(Audio.transcript['transcript']).label('transcript')).subquery()
    temp = transcript_subquery.c.transcript.op('->>')('w').cast(String)
    temp1 = transcript_subquery.c.transcript.op('->>')('s').cast(Integer)
    query = s.query(temp1, Audio.id).filter(temp.ilike("all"))

The problem that I have is that right now I only have 5 data entries and one of the data entry's transcript field is blank. However, I got a result that contains the following:

(665610, 5), (736413, 5), (907230, 5), (942340, 5), (1020852, 5), (1023942, 5), (1037101, 5), (1078521, 5), (1105581, 5), (1117551, 5), (1372730, 5), (1501960, 5), (1508410, 5)

This is really odd to be given that the entry where Audio.id =5 doesn't even have transcript json object yet it produces such result. Moreover, all the entries, each has 112 results. And I suspect that 112 is the total number that the word appears and it somewhat is double counting everything for each Audio ID. And I don't really know how to fix my sqlalchemy query.

3
  • You're implicitly cross joining your subquery and the audio table in the code you've presented. In other words every row from your subquery will be joined to every row from audio, since you have no predicates that would make it an inner join. Commented Dec 9, 2016 at 9:02
  • @IljaEverilä Is there a way to resolve this problem? I saw this post saying there there's no better way to do this in sqlalchemy: stackoverflow.com/questions/14813516/… Commented Dec 9, 2016 at 18:59
  • 1
    That question has nothing to do with this. You want to avoid cross joins, not to explicitly use cross joins. Your problem is that you're doing SELECT (foo.transcript->>'s')::int, audio.id FROM (SELECT ... AS transcript FROM audio) foo, audio when it appears that you actually want to do SELECT (foo.transcript->>'s')::int, foo.id FROM (SELECT ... AS transcript, id FROM audio) foo instead. Commented Dec 9, 2016 at 22:18

0

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.