I am working on a machine learning algorithm with Python's scikit-learn, but this time the data is stored as MongoDB documents. I would like to pull my data into a pandas DataFrame. Here is an example of the documents:

{
    "_id" : ObjectId("552b9525359c6a09f061cb53"),
    "Interrupt" : true,
    "Url" : "Coco_mademoiselle.jpg",
    "Target" : {
        "FemaleInPercent" : 100,
        "MaleInPercent" : 0,
        "AgeProperties" : 6
    },
    "MaxDisplayTime" : 7,
    "MinDisplayTime" : 2,
    "MediaType" : 0,
    "IsLocked" : false,
    "FaceTagged" : [ 
        {
            "FaceId" : 36,
            "GenderConfidence" : -0.1731295609721586,
            "Age" : 23,
            "TotalAttention" : 14.92099999999997,
            "AttentionInsideThisContent" : 2.273999999999992,
            "Gender" : "Unknown",
            "AngleYaw" : [ 
                0
            ],
            "XPos" : [ 
                0.07704142996903575
            ],
            "YPos" : [ 
                0.7182761555157026
            ],
            "Distance" : [ 
                0.7223960254002589
            ]
        }, 
        {
            "FaceId" : 37,
            "GenderConfidence" : 0.3932732620245187,
            "Age" : 51,
            "TotalAttention" : 14.92099999999997,
            "AttentionInsideThisContent" : 2.273999999999992,
            "Gender" : "Female",
            "AngleYaw" : [ 
                0
            ],
            "XPos" : [ 
                0.9852976840852283
            ],
            "YPos" : [ 
                -0.9149562017596122
            ],
            "Distance" : [ 
                1.344602683844596
            ]
        }
    ],
    "PanelId" : "PANEL_1",
    "ScenarioId" : "Scenario-1",
    "StartTime" : ISODate("2015-04-13T10:06:22.622Z"),
    "EndTime" : ISODate("2015-04-13T10:06:29.640Z")
}

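I used this function to load my data into a pandas DataFrame, but I have issues with the embedded documents and the arrays of documents:

import pandas as pd
from pymongo import MongoClient

def read_mongo(db, collection, query=None, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and store into a DataFrame """

    # Connect to the server, then query the specific DB and collection (both given by name)
    client = MongoClient(host, port, username=username, password=password)
    cursor = client[db][collection].find(query or {})

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id column (it is absent when the query returns no documents)
    if no_id and '_id' in df.columns:
        del df['_id']

    return df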
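
For the embedded `Target` document, here is a minimal sketch of how the nested fields could be spread into their own columns. It assumes pandas 1.0 or later, where `pd.json_normalize` is available, and it uses two hand-written dicts in place of a real cursor; the second document and its values are made up for illustration.

```python
import pandas as pd

# Stand-ins for list(cursor): the first dict is a trimmed copy of the
# example document, the second is hypothetical.
docs = [
    {
        "Url": "Coco_mademoiselle.jpg",
        "Target": {"FemaleInPercent": 100, "MaleInPercent": 0, "AgeProperties": 6},
        "MaxDisplayTime": 7,
    },
    {
        "Url": "other_media.jpg",
        "Target": {"FemaleInPercent": 40, "MaleInPercent": 60, "AgeProperties": 3},
        "MaxDisplayTime": 5,
    },
]

# Nested dicts become dotted column names such as "Target.FemaleInPercent".
flat = pd.json_normalize(docs)
print(sorted(flat.columns))
```

Arrays such as `FaceTagged` are not expanded by this call; they stay as a single column of lists.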

In the end, I get a DataFrame in which a single column holds all of the FaceTagged information lumped together:

data.FaceTagged.to_frame()
                                           FaceTagged
0   [{u'Distance': [0.871754460354], u'XPos': [0.7...
1   [{u'Distance': [0.845591660012], u'XPos': [0.6...
2   [{u'Distance': [1.01813052012], u'XPos': [-0.7...

Each row holds the entire FaceTagged array of one document in a single cell, even though FaceTagged is an array of documents and each of those documents contains several fields.

Has anyone run into this?
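
To make the desired result concrete, here is a rough sketch of one possible way to get one row per face. It assumes pandas 1.0 or later (`pd.json_normalize` with `record_path` and `meta`), that every document has a `FaceTagged` key, and that the position fields are always one-element lists as in the example. The dict below is a trimmed copy of the example document with rounded values, not real query output.

```python
import pandas as pd

# Stand-in for list(cursor): a trimmed copy of the example document.
docs = [
    {
        "PanelId": "PANEL_1",
        "ScenarioId": "Scenario-1",
        "Url": "Coco_mademoiselle.jpg",
        "FaceTagged": [
            {"FaceId": 36, "Age": 23, "Gender": "Unknown",
             "XPos": [0.077], "YPos": [0.718]},
            {"FaceId": 37, "Age": 51, "Gender": "Female",
             "XPos": [0.985], "YPos": [-0.915]},
        ],
    }
]

# One row per element of FaceTagged; the meta fields of the parent
# document are repeated on each row.
faces = pd.json_normalize(
    docs,
    record_path="FaceTagged",
    meta=["PanelId", "ScenarioId", "Url"],
)

# Unwrap the one-element lists into plain scalars.
for col in ["XPos", "YPos"]:
    faces[col] = faces[col].map(lambda values: values[0])

print(faces)
```

Each face then has its own row with scalar columns, which is closer to the shape scikit-learn expects than a column of lists of dicts.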

"I have some issues" - can you be specific, please. What error messages are thrown, or what design problems are you suffering? – Michael Green Apr 24 '15 at 10:36
    
I just updated the text of my question. The problem is related to the embedded documents stored in the "FaceTagged" array. I hope it's clear now. – MabroukAljane Apr 24 '15 at 11:38
