I am encountering duplicate results in my sqlalchemy+postgres query. I have a model structured like below:
class Audio(db.Model):
__tablename__ = "audio"
id = db.Column(db.Integer, primary_key=True)
file_location = db.Column(db.String(), unique=True)
upload_time = db.Column(ArrowType, default=arrow.utcnow())
keyword = db.Column(JSON)
transcript = db.Column(JSON)
diarization = db.Column(JSON)
user_id = db.Column(db.Integer, db.ForeignKey('user.id', onupdate='cascade', ondelete='cascade'))
company_id = db.Column(db.Integer, db.ForeignKey('company.id', onupdate='cascade', ondelete='cascade'))
client_id = db.Column(db.Integer, db.ForeignKey('client.id', onupdate='cascade', ondelete='cascade'))
def __init__(self, file_location, upload_time, keyword, transcript, diarization, user_id, company_id, client_id):
self.file_location = file_location
self.upload_time = upload_time
self.keyword = keyword
self.transcript = transcript
self.diarization = diarization
self.user_id = user_id
self.company_id = company_id
self.client_id = client_id
def __repr__(self):
return '<Audio ID %r>' % self.id
And under transcript, the input is a json object:
{"transcript": [
{"p": 0, "s": 0, "e": 320, "c": 0.545, "w": "This"},
{"p": 1, "s": 320, "e": 620, "c": 0.825, "w": "call"},
{"p": 2, "s": 620, "e": 780, "c": 0.909, "w": "is"},
{"p": 3, "s": 780, "e": 1010, "c": 0.853, "w": "being"}
...
]}
I am trying to filter entries based on the value from "w" and query its corresponding Audio.id and "p". I have tried the following:
transcript_subquery = s.query(func.json_array_elements(Audio.transcript['transcript']).label('transcript')).subquery()
temp = transcript_subquery.c.transcript.op('->>')('w').cast(String)
temp1 = transcript_subquery.c.transcript.op('->>')('s').cast(Integer)
query = s.query(temp1, Audio.id).filter(temp.ilike("all"))
The problem that I have is that right now I only have 5 data entries and one of the data entry's transcript field is blank. However, I got a result that contains the following:
(665610, 5), (736413, 5), (907230, 5), (942340, 5), (1020852, 5), (1023942, 5), (1037101, 5), (1078521, 5), (1105581, 5), (1117551, 5), (1372730, 5), (1501960, 5), (1508410, 5)
This is really odd to be given that the entry where Audio.id =5 doesn't even have transcript json object yet it produces such result. Moreover, all the entries, each has 112 results. And I suspect that 112 is the total number that the word appears and it somewhat is double counting everything for each Audio ID. And I don't really know how to fix my sqlalchemy query.
SELECT (foo.transcript->>'s')::int, audio.id FROM (SELECT ... AS transcript FROM audio) foo, audio
when it appears that you actually want to doSELECT (foo.transcript->>'s')::int, foo.id FROM (SELECT ... AS transcript, id FROM audio) foo
instead.